🎓 Academic AIdvisor

Tags
Machine Learning
Web Dev
Flask
Natural Language Processing
Artificial Intelligence

Project Overview:

Introduction:

As part of my Natural Language Processing (NLP) course at Tulane University, I developed a Retrieval-Augmented Generation (RAG) system to answer Tulane-specific academic queries. The project combined data scraping, preprocessing, and retrieval techniques with a state-of-the-art language model to improve response quality, and it demonstrates how NLP can address real-world challenges in academic advising.

Project Goals:

  • Build a robust system for answering academic queries about Tulane University programs and courses.
  • Integrate a retrieval system with a language model to generate accurate, context-aware responses.
  • Develop a dynamic approach to document retrieval to maximize context window usage for large language models.

Project Workflow:

  1. Data Collection (see the scraping sketch after this list):
      • Scraped over 400 program pages and 1,000+ course descriptions using Beautiful Soup.
      • Used regex matching to parse course descriptions and filter out irrelevant entries, such as independent study courses.
  2. Data Processing (see the preprocessing sketch below):
      • Preprocessed text data with spaCy, including tokenization, lemmatization, stop word removal, and Named Entity Recognition (NER).
      • Chunked program data into 5,000-character segments with overlapping windows to improve retrieval effectiveness.
      • Created embeddings using TF-IDF for efficient vectorized document representation.
  3. Retrieval System (see the retrieval sketch below):
      • Combined cosine similarity with keyword-based prioritization to rank documents.
      • Implemented a dynamic top-k retrieval system to maximize use of the LLM's 16,000-token context window for query responses.
  4. Integration with Language Model (see the generation sketch below):
      • Used GPT-3.5-Turbo to generate responses from retrieved documents.
      • Optimized context assembly to improve the relevance of responses.
  5. Web Application (see the Flask sketch below):
      • Developed a Flask-based web app for answering user queries.
      • Enabled users to switch seamlessly between 'Program Expert' and 'Class Expert' modes.
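
A minimal sketch of the data-collection step. The catalog URL, the choice of HTML elements, and the course-code pattern below are illustrative assumptions, not the exact ones used in the project.

```python
# Hedged sketch: scrape course descriptions and drop irrelevant entries.
import re
import requests
from bs4 import BeautifulSoup

CATALOG_URL = "https://catalog.tulane.edu/courses/"  # hypothetical entry point

def scrape_course_descriptions(url: str) -> list[str]:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # Assume each course description sits in its own block-level element.
    blocks = [b.get_text(" ", strip=True) for b in soup.find_all("p")]
    # Keep entries that look like "DEPT 1234 Title ..." and drop
    # independent-study style listings.
    course_pattern = re.compile(r"^[A-Z]{3,4}\s\d{4}\b")
    return [
        text for text in blocks
        if course_pattern.match(text) and "independent study" not in text.lower()
    ]

if __name__ == "__main__":
    courses = scrape_course_descriptions(CATALOG_URL)
    print(f"Kept {len(courses)} course descriptions")
```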
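A sketch of the preprocessing, chunking, and TF-IDF indexing step. The 5,000-character chunk size follows the write-up; the overlap amount and the spaCy model name are assumptions.

```python
# Hedged sketch: spaCy preprocessing, overlapping character windows, TF-IDF index.
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

def preprocess(text: str) -> str:
    """Lemmatize and drop stop words, punctuation, and whitespace tokens."""
    doc = nlp(text)
    tokens = [t.lemma_.lower() for t in doc
              if not (t.is_stop or t.is_punct or t.is_space)]
    return " ".join(tokens)

def chunk(text: str, size: int = 5000, overlap: int = 500) -> list[str]:
    """Split program text into overlapping character windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def build_index(documents: list[str]):
    """Fit TF-IDF on preprocessed chunks; keep the raw chunks for LLM context."""
    raw_chunks = [c for doc in documents for c in chunk(doc)]
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(preprocess(c) for c in raw_chunks)
    return vectorizer, matrix, raw_chunks
```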
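A sketch of the retrieval step: cosine similarity over the TF-IDF matrix, a simple keyword boost for prioritization, and a dynamic top-k loop that keeps adding chunks until an approximate token budget for the context window is spent. The boost weight, budget, and characters-per-token heuristic are assumptions.

```python
# Hedged sketch: ranked retrieval with keyword prioritization and dynamic top-k.
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(query, vectorizer, matrix, chunks, token_budget=12000, keyword_boost=0.1):
    """`query` is assumed to be preprocessed with the same preprocess() step
    used when the TF-IDF matrix was fitted; `chunks` are the raw text windows."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, matrix).ravel()

    # Prioritize chunks that contain literal query keywords.
    keywords = set(query.lower().split())
    for i, chunk_text in enumerate(chunks):
        if any(kw in chunk_text.lower() for kw in keywords):
            scores[i] += keyword_boost

    # Dynamic top-k: greedily add the best chunks until the budget is spent.
    selected, used = [], 0
    for i in scores.argsort()[::-1]:
        approx_tokens = len(chunks[i]) // 4  # rough chars-to-tokens heuristic
        if used + approx_tokens > token_budget:
            break
        selected.append(chunks[i])
        used += approx_tokens
    return selected
```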
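A sketch of the generation step with GPT-3.5-Turbo: the retrieved chunks are assembled into one context block and passed to the chat API. The system prompt wording is an assumption, and the client reads OPENAI_API_KEY from the environment.

```python
# Hedged sketch: assemble retrieved context and query GPT-3.5-Turbo.
from openai import OpenAI

client = OpenAI()

def answer(query: str, retrieved_chunks: list[str]) -> str:
    context = "\n\n---\n\n".join(retrieved_chunks)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "You are an academic advisor for Tulane University. "
                        "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content
```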
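Finally, a sketch of the Flask front end. It assumes the functions from the sketches above are collected in a local module, and the endpoint path, request shape, and data directories are hypothetical.

```python
# Hedged sketch: one endpoint that routes a query to the chosen expert mode.
from pathlib import Path
from flask import Flask, request, jsonify
# build_index, preprocess, retrieve, and answer are the functions sketched
# above, assumed to live in a local module named rag_pipeline.
from rag_pipeline import build_index, preprocess, retrieve, answer

app = Flask(__name__)

def load_documents(directory: str) -> list[str]:
    """Read every scraped text file in a directory (one document per file)."""
    return [p.read_text(encoding="utf-8") for p in Path(directory).glob("*.txt")]

# One pre-built index per expert mode; the directory layout is an assumption.
INDEXES = {
    "program": build_index(load_documents("data/programs")),
    "class": build_index(load_documents("data/courses")),
}

@app.route("/ask", methods=["POST"])
def ask():
    payload = request.get_json(force=True)
    query = payload["query"]
    mode = payload.get("mode", "program")  # 'Program Expert' or 'Class Expert'
    vectorizer, matrix, chunks = INDEXES[mode]
    context = retrieve(preprocess(query), vectorizer, matrix, chunks)
    return jsonify({"answer": answer(query, context)})

if __name__ == "__main__":
    app.run(debug=True)
```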

Project Review:

Challenges:

  • Token Limitations: Managing varying document lengths to fit within the 16,000-token context window of the language model (see the token-budgeting sketch after this list).
  • Data Separation: Ensuring clear distinctions between program and course data to improve retrieval accuracy.
  • Data Supplementation: Adding information from course syllabi, Canvas pages, and similar sources would provide more useful and robust context for query responses.
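
One way to handle the token-limit challenge is to count tokens exactly rather than estimate them. The sketch below uses tiktoken, which is an assumption (any tokenizer matched to the model would work), and trims the lowest-ranked chunks until the assembled context fits the window with headroom left for the question and the answer.

```python
# Hedged sketch: exact token budgeting for the GPT-3.5-Turbo context window.
import tiktoken

ENCODING = tiktoken.encoding_for_model("gpt-3.5-turbo")

def fit_to_window(chunks: list[str], max_context_tokens: int = 12000) -> list[str]:
    """Keep the best-ranked chunks (input assumed sorted best-first) that fit."""
    kept, used = [], 0
    for chunk in chunks:
        n = len(ENCODING.encode(chunk))
        if used + n > max_context_tokens:
            break
        kept.append(chunk)
        used += n
    return kept
```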

Future Work:

  • Data Expansion: Incorporate informal sources such as Reddit and Rate My Professors for more student-centered advice.
  • Enhanced Retrieval: Transition to LangChain or Faiss for more efficient indexing and retrieval (see the sketch after this list).
  • Deployment: Deploy the Flask web app to a cloud platform like Heroku for broader accessibility.
  • Storage Optimization: Move from a directory of text files to a more robust data storage solution to handle larger datasets effectively.
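
A sketch of what the Faiss direction could look like: dense embeddings indexed for fast nearest-neighbour search. The sentence-transformers embedding model is an assumption, since the write-up does not commit to one.

```python
# Hedged sketch: dense-vector retrieval with Faiss as a possible future upgrade.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def build_faiss_index(chunks: list[str]):
    """Embed chunks and index them; inner product == cosine on normalized vectors."""
    embeddings = model.encode(chunks, normalize_embeddings=True)
    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(np.asarray(embeddings, dtype="float32"))
    return index

def search(index, query: str, k: int = 5):
    """Return the ids and similarity scores of the k nearest chunks."""
    query_vec = model.encode([query], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(query_vec, dtype="float32"), k)
    return ids[0], scores[0]
```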

Skills Gained:

  • Web Scraping: Automated data collection with Beautiful Soup and regex.
  • NLP Techniques: Text preprocessing, tokenization, lemmatization, and Named Entity Recognition.
  • Retrieval Systems: Implementing vectorized search with TF-IDF and cosine similarity.
  • Web Development: Building Flask-based applications for user interaction.
  • Integration with LLMs: Optimizing retrieval and response generation with the GPT-3.5 API.

Reflection:

This project deepened my understanding of NLP concepts and demonstrated the power of retrieval-augmented generation systems. It highlighted the potential of NLP in building human-like applications that address practical challenges, and it gave me insight into web app development for improved user interfaces and experiences. I hope to return to this project in the future and improve upon it further.
