Project Overview:
Introduction:
As part of my Natural Language Processing course at Tulane University, I developed a Retrieval-Augmented Generation (RAG) system to provide answers to Tulane-specific academic queries. The project focused on combining data scraping, preprocessing, and advanced retrieval techniques with a state-of-the-art language model to enhance response quality. This system showcases how Natural Language Processing (NLP) can address real-world challenges in academic advising.
Project Goals:
- Build a robust system for answering academic queries about Tulane University programs and courses.
- Integrate a retrieval system with a language model to generate accurate, context-aware responses.
- Develop a dynamic approach to document retrieval to maximize context window usage for large language models.
Project Workflow:
- Data Collection:
- Scraped over 400 program pages and 1,000+ course descriptions using Beautiful Soup.
- Used regex matching to parse course descriptions and filter out irrelevant entries, such as independent study courses.
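The regex-filtering step can be sketched as follows. This is a minimal illustration, not the project's actual code: the course-entry format and the `COURSE_RE`/`SKIP_RE` patterns are hypothetical stand-ins for the real catalog layout, and Beautiful Soup handles the HTML extraction upstream of this step.

```python
import re

# Hypothetical catalog line format: "CMPS 3240 Intro to Machine Learning".
COURSE_RE = re.compile(r"^(?P<dept>[A-Z]{3,4})\s+(?P<num>\d{4})\s+(?P<title>.+)$")

# Entries to drop as irrelevant, e.g. independent study courses.
SKIP_RE = re.compile(r"independent stud(y|ies)", re.IGNORECASE)

def filter_courses(raw_entries):
    """Keep entries matching the course pattern that are not excluded."""
    kept = []
    for entry in raw_entries:
        m = COURSE_RE.match(entry.strip())
        if m and not SKIP_RE.search(entry):
            kept.append(m.groupdict())
    return kept
```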
- Data Processing:
- Preprocessed text data with SpaCy, including tokenization, lemmatization, stop word removal, and Named Entity Recognition (NER).
- Chunked program data into 5,000-character segments with overlapping windows to improve retrieval effectiveness.
- Created embeddings using TF-IDF for efficient vectorized document representation.
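The overlapping-window chunking above can be sketched in plain Python. The 5,000-character segment size comes from the project; the 500-character overlap is an assumed value for illustration.

```python
def chunk_text(text, size=5000, overlap=500):
    """Split text into fixed-size character chunks with overlapping windows,
    so sentences cut at a boundary still appear whole in the next chunk."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # advance by the non-overlapping stride
    return chunks
```

Each chunk is then vectorized with TF-IDF so retrieval can score it independently of its neighbors.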
- Retrieval System:
- Combined cosine similarity with keyword-based prioritization to rank documents.
- Implemented a dynamic top-k retrieval system to maximize the LLM’s 16,000-token context window for query responses.
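The ranking and dynamic top-k selection can be sketched as below, under stated assumptions: TF-IDF vectors are represented as sparse term-weight dicts, and the keyword boost weight and per-document token counts are illustrative values, not the project's actual parameters.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def dynamic_top_k(query_vec, docs, keywords, budget_tokens=16000, boost=0.25):
    """Score docs by cosine similarity plus a keyword-match boost, then
    greedily take top-ranked documents until the context budget is full.
    Each doc is a dict: {"vec": {...}, "text": str, "tokens": int}."""
    scored = []
    for d in docs:
        score = cosine(query_vec, d["vec"])
        if any(k in d["text"].lower() for k in keywords):
            score += boost  # keyword-based prioritization
        scored.append((score, d))
    scored.sort(key=lambda p: p[0], reverse=True)
    selected, used = [], 0
    for _, d in scored:
        if used + d["tokens"] <= budget_tokens:
            selected.append(d)
            used += d["tokens"]
    return selected
```

Because k is not fixed, short documents let more context in while a few long ones still fit, which is the point of maximizing the 16,000-token window.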
- Integration with Language Model:
- Used GPT-3.5-Turbo to generate responses from retrieved documents.
- Optimized context assembly to improve the relevance of responses.
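Context assembly might look like the following sketch. The function name, prompt wording, and document separator are hypothetical; the returned message list is the shape passed to the GPT-3.5-Turbo chat-completions endpoint.

```python
def build_messages(question, retrieved_docs, mode="Program Expert"):
    """Assemble a chat prompt: system message carrying the retrieved
    context, followed by the user's question."""
    context = "\n\n---\n\n".join(retrieved_docs)
    system = (
        f"You are a Tulane University {mode}. Answer using only the "
        "context below; if the answer is not present, say so.\n\n"
        f"Context:\n{context}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
```

Putting the retrieved documents in the system message keeps the user turn clean and makes it easy to swap context between queries.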
- Web Application:
- Developed a Flask-based web app for answering user queries.
- Enabled users to switch seamlessly between 'Program Expert' and 'Class Expert' modes.
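A stripped-down sketch of the mode switching, assuming a JSON endpoint: the `/ask` route, payload fields, and stub expert functions are hypothetical stand-ins for the real retrieval-plus-generation backends.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stub backends standing in for the two retrieval/generation pipelines.
EXPERTS = {
    "program": lambda q: f"[program answer for: {q}]",
    "class": lambda q: f"[class answer for: {q}]",
}

@app.route("/ask", methods=["POST"])
def ask():
    """Answer a query with whichever expert the 'mode' field selects."""
    payload = request.get_json(force=True)
    expert = EXPERTS.get(payload.get("mode", "program"))
    if expert is None:
        return jsonify(error="unknown mode"), 400
    return jsonify(answer=expert(payload.get("query", "")))
```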
Project Review:
Challenges:
- Token Limitations: Managing varying document lengths to fit within the 16,000-token context window of the language model.
- Data Separation: Ensuring clear distinctions between program and course data to improve retrieval accuracy.
- Data Supplementation: Adding information from course syllabi, Canvas pages, and similar sources would provide more useful and robust context for query responses.
Future Work:
- Data Expansion: Incorporate informal sources such as Reddit and Rate My Professors for more student-centered advice.
- Enhanced Retrieval: Transition to LangChain or Faiss for more efficient indexing and retrieval.
- Deployment: Deploy the Flask web app to a cloud platform like Heroku for broader accessibility.
- Storage Optimization: Move from a directory of text files to a more robust data storage solution to handle larger datasets effectively.
Skills Gained:
- Web Scraping: Automated data collection with Beautiful Soup and regex.
- NLP Techniques: Text preprocessing, tokenization, lemmatization, and Named Entity Recognition.
- Retrieval Systems: Implementing vectorized search with TF-IDF and cosine similarity.
- Web Development: Building Flask-based applications for user interaction.
- Integration with LLMs: Optimizing retrieval and response generation using the GPT-3.5 API.
Reflection:
This project deepened my understanding of core NLP concepts and demonstrated the power of retrieval-augmented generation. It highlighted the potential of NLP for building human-like applications that address practical challenges, and it gave me hands-on insight into web-app development for better user interfaces and experiences. I hope to return to this project in the future and improve upon it further.
Gallery: