Project Overview:
Introduction:
As part of my Natural Language Processing course at Tulane University, I developed a Retrieval-Augmented Generation (RAG) system to provide answers to Tulane-specific academic queries. The project focused on combining data scraping, preprocessing, and advanced retrieval techniques with a state-of-the-art language model to enhance response quality. This system showcases how Natural Language Processing (NLP) can address real-world challenges in academic advising.
Project Goals:
- Build a robust system for answering academic queries about Tulane University programs and courses.
- Integrate a retrieval system with a language model to generate accurate, context-aware responses.
- Develop a dynamic approach to document retrieval to maximize context window usage for large language models.
Project Workflow:
- Data Collection:
- Scraped over 400 program pages and 1,000+ course descriptions using Beautiful Soup.
- Used regex matching to parse course descriptions and filter out irrelevant entries, such as independent study courses.
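The regex-filtering step can be sketched as follows. This is a minimal illustration, not the project's actual code: the course-entry format and the `COURSE_RE`/`SKIP_RE` patterns are hypothetical stand-ins for the real catalog layout, and Beautiful Soup handles the HTML extraction upstream of this step.

```python
import re

# Hypothetical catalog line format: "CMPS 3240 Intro to Machine Learning".
COURSE_RE = re.compile(r"^(?P<dept>[A-Z]{3,4})\s+(?P<num>\d{4})\s+(?P<title>.+)$")

# Entries to drop as irrelevant, e.g. independent study courses.
SKIP_RE = re.compile(r"independent stud(y|ies)", re.IGNORECASE)

def filter_courses(raw_entries):
    """Keep entries matching the course pattern that are not excluded."""
    kept = []
    for entry in raw_entries:
        m = COURSE_RE.match(entry.strip())
        if m and not SKIP_RE.search(entry):
            kept.append(m.groupdict())
    return kept
```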
- Data Processing:
- Preprocessed text data with SpaCy, including tokenization, lemmatization, stop word removal, and Named Entity Recognition (NER).
- Chunked program data into 5,000-character segments with overlapping windows to improve retrieval effectiveness.
- Created embeddings using TF-IDF for efficient vectorized document representation.
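The overlapping-window chunking above can be sketched in plain Python. The 5,000-character segment size comes from the project; the 500-character overlap is an assumed value for illustration.

```python
def chunk_text(text, size=5000, overlap=500):
    """Split text into fixed-size character chunks with overlapping windows,
    so sentences cut at a boundary still appear whole in the next chunk."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # advance by the non-overlapping stride
    return chunks
```

Each chunk is then vectorized with TF-IDF so retrieval can score it independently of its neighbors.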
- Retrieval System:
- Combined cosine similarity with keyword-based prioritization to rank documents.
- Implemented a dynamic top-k retrieval system to maximize the LLM’s 16,000-token context window for query responses.
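The ranking and dynamic top-k selection can be sketched as below, under stated assumptions: TF-IDF vectors are represented as sparse term-weight dicts, and the keyword boost weight and per-document token counts are illustrative values, not the project's actual parameters.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def dynamic_top_k(query_vec, docs, keywords, budget_tokens=16000, boost=0.25):
    """Score docs by cosine similarity plus a keyword-match boost, then
    greedily take top-ranked documents until the context budget is full.
    Each doc is a dict: {"vec": {...}, "text": str, "tokens": int}."""
    scored = []
    for d in docs:
        score = cosine(query_vec, d["vec"])
        if any(k in d["text"].lower() for k in keywords):
            score += boost  # keyword-based prioritization
        scored.append((score, d))
    scored.sort(key=lambda p: p[0], reverse=True)
    selected, used = [], 0
    for _, d in scored:
        if used + d["tokens"] <= budget_tokens:
            selected.append(d)
            used += d["tokens"]
    return selected
```

Because k is not fixed, short documents let more context in while a few long ones still fit, which is the point of maximizing the 16,000-token window.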
- Integration with Language Model:
- Used GPT-3.5-Turbo to generate responses from retrieved documents.
- Optimized context assembly to improve the relevance of responses.
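Context assembly might look like the following sketch. The function name, prompt wording, and document separator are hypothetical; the returned message list is the shape passed to the GPT-3.5-Turbo chat-completions endpoint.

```python
def build_messages(question, retrieved_docs, mode="Program Expert"):
    """Assemble a chat prompt: system message carrying the retrieved
    context, followed by the user's question."""
    context = "\n\n---\n\n".join(retrieved_docs)
    system = (
        f"You are a Tulane University {mode}. Answer using only the "
        "context below; if the answer is not present, say so.\n\n"
        f"Context:\n{context}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
```

Putting the retrieved documents in the system message keeps the user turn clean and makes it easy to swap context between queries.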
- Web Application:
- Developed a Flask-based web app for answering user queries.
- Enabled users to switch seamlessly between 'Program Expert' and 'Class Expert' modes.
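A stripped-down sketch of the mode switching, assuming a JSON endpoint: the `/ask` route, payload fields, and stub expert functions are hypothetical stand-ins for the real retrieval-plus-generation backends.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stub backends standing in for the two retrieval/generation pipelines.
EXPERTS = {
    "program": lambda q: f"[program answer for: {q}]",
    "class": lambda q: f"[class answer for: {q}]",
}

@app.route("/ask", methods=["POST"])
def ask():
    """Answer a query with whichever expert the 'mode' field selects."""
    payload = request.get_json(force=True)
    expert = EXPERTS.get(payload.get("mode", "program"))
    if expert is None:
        return jsonify(error="unknown mode"), 400
    return jsonify(answer=expert(payload.get("query", "")))
```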
Project Review:
Challenges:
- Token Limitations: Managing varying document lengths to fit within the 16,000-token context window of the language model.
- Data Separation: Ensuring clear distinctions between program and course data to improve retrieval accuracy.
- Data Supplementation: Adding information from course syllabi, Canvas pages, and similar sources would provide more useful and robust context for query responses.
Future Work:
- Data Expansion: Incorporate informal sources such as Reddit and Rate My Professors for more student-centered advice.
- Enhanced Retrieval: Transition to LangChain or Faiss for more efficient indexing and retrieval.
- Deployment: Deploy the Flask web app to a cloud platform like Heroku for broader accessibility.
- Storage Optimization: Move from a directory of text files to a more robust data storage solution to handle larger datasets effectively.
Skills Gained:
- Web Scraping: Automated data collection with Beautiful Soup and regex.
- NLP Techniques: Text preprocessing, tokenization, lemmatization, and Named Entity Recognition.
- Retrieval Systems: Implementing vectorized search with TF-IDF and cosine similarity.
- Web Development: Building Flask-based applications for user interaction.
- Integration with LLMs: Optimizing retrieval and response generation using the GPT-3.5 API.
Reflection:
This project deepened my understanding of core NLP concepts and demonstrated the power of retrieval-augmented generation. It highlighted the potential of NLP for building human-like applications that address practical challenges, and it gave me hands-on insight into web-app development for better user interfaces and experiences. I hope to return to this project in the future and improve upon it further.
Gallery: