AI System / RAG Pipeline

Berkeley Course Navigator

Production-level RAG system over UC Berkeley's STEM course catalog combining a Neo4j prerequisite knowledge graph with hybrid semantic retrieval, grounded generation, and a LangGraph multi-hop agent in development.

Role: Solo engineer

Timeline: 2026

Python, Neo4j, ChromaDB, Pinecone, OpenAI, Cohere, LangChain, LangGraph, LangSmith

Project gallery

Neo4j graph showing MATH 54 as a prerequisite for 45 courses across departments.

Full pipeline from query rewriting through hybrid retrieval, reranking, and grounded generation.

End-to-end query response with course code citations from retrieved context.

Overview

Berkeley Course Navigator is a production-level RAG system that answers complex student questions about UC Berkeley's STEM course catalog. It combines a Neo4j knowledge graph for prerequisite traversal with a hybrid semantic retrieval pipeline and grounded LLM generation with citations. A LangGraph multi-hop agent, currently in development, will route each query to the graph or the vector store based on query type.

The challenge

Prerequisite questions such as 'what do I need before CS 189?' are fundamentally a graph traversal problem, not a text retrieval problem. Storing prerequisites as text chunks would require one recursive LLM call per hop, and errors compound at each hop, making the system unreliable and expensive. The system also needed to handle two kinds of query with different retrieval strategies: structural queries (prerequisite chains) and semantic queries (course discovery).
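To make the traversal framing concrete, here is a minimal in-memory sketch, not the project's code. The course codes and relationships are invented for illustration, and the list-of-OR-groups layout is an assumed representation of AND/OR logic. Once prerequisites are edges, the full chain is a plain breadth-first walk with no model calls.

```python
from collections import deque

# Hypothetical toy data, not taken from the catalog.
# Each course maps to a list of OR-groups: every group must be
# satisfied (AND), and any one course in a group satisfies it (OR).
PREREQS = {
    "DEMO 189": [["DEMO 53"], ["DEMO 54", "DEMO 16A"], ["DEMO 70"]],
    "DEMO 70": [["DEMO 61A"]],
    "DEMO 54": [["DEMO 1B"]],
}


def all_prerequisites(course, prereqs=PREREQS):
    """Return every course reachable through prerequisite edges."""
    seen = set()
    queue = deque([course])
    while queue:
        current = queue.popleft()
        for group in prereqs.get(current, []):
            for option in group:
                if option not in seen:  # also guards against cycles
                    seen.add(option)
                    queue.append(option)
    return seen


def is_eligible(course, completed, prereqs=PREREQS):
    """Check direct prerequisites: each AND-group needs one completed option."""
    return all(
        any(option in completed for option in group)
        for group in prereqs.get(course, [])
    )
```

The walk visits each course at most once, so its cost depends on the size of the graph rather than on the number of hops in the question.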

What I built

Reverse-engineered Berkeley's internal Coursedog API, which has no public documentation, to ingest 946 courses and 799 prerequisite relationships across 7 STEM departments.

Architected a Neo4j knowledge graph encoding AND/OR prerequisite logic as typed edges, enabling deterministic multi-hop traversal with a single Cypher query.
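A single-query traversal of this kind might look like the sketch below. The `Course` label, the `code` property, and the `REQUIRES_AND` / `REQUIRES_OR` relationship types are assumptions made for illustration; the actual schema is not shown here. The function accepts any object with a `run(query, **params)` method, such as a session from the official Neo4j Python driver.

```python
# Hypothetical schema: (:Course {code}) nodes linked by
# REQUIRES_AND / REQUIRES_OR relationships. Names are assumptions.
PREREQ_CHAIN_QUERY = """
MATCH (c:Course {code: $code})-[:REQUIRES_AND|REQUIRES_OR*1..]->(p:Course)
RETURN DISTINCT p.code AS code
ORDER BY code
"""


def prerequisite_chain(session, course_code):
    """Return the codes of all direct and transitive prerequisites.

    `session` is expected to behave like a Neo4j driver session:
    run() returns an iterable of records indexable by column name.
    """
    result = session.run(PREREQ_CHAIN_QUERY, code=course_code)
    return [record["code"] for record in result]
```

The variable-length pattern `*1..` follows prerequisite edges to any depth, and passing the course code as a query parameter avoids building the query by string concatenation.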

Built a Chroma vector store with 918 course embeddings using OpenAI's text-embedding-3-small, with rich text combining description, department, level, units, and prerequisite text.
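As a rough sketch of how such a rich text could be assembled before embedding, the helper below joins the listed fields into one string and builds a matching metadata record for filtering. The dictionary keys and the output layout are hypothetical; the project's real field names and formatting may differ.

```python
def build_embedding_text(course):
    """Combine catalog fields into one text string to embed.

    `course` is a dict with hypothetical keys: code, title, department,
    level, units, description, and optionally prerequisites_text.
    """
    parts = [
        f"{course['code']}: {course['title']}",
        f"Department: {course['department']}",
        f"Level: {course['level']}",
        f"Units: {course['units']}",
        f"Description: {course['description']}",
    ]
    prereq_text = course.get("prerequisites_text")
    if prereq_text:  # skip the line when the catalog lists none
        parts.append(f"Prerequisites: {prereq_text}")
    return "\n".join(parts)


def build_metadata(course):
    """Fields kept alongside the embedding so queries can filter on them."""
    return {
        "department": course["department"],
        "level": course["level"],
        "units": course["units"],
    }
```

Keeping department, level, and units in both the text and the metadata lets the same record serve semantic search and exact filtering.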

Implemented hybrid retrieval combining BM25 keyword search and semantic vector search, merged via Reciprocal Rank Fusion and reranked with Cohere's cross-encoder, which scores each candidate against the query directly.
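The fusion step can be sketched in a few lines. This is a generic Reciprocal Rank Fusion implementation, not the project's code: each document scores the sum of 1 / (k + rank) over the rankings it appears in. The constant k = 60 is a commonly used default and is an assumption here, as are the document IDs in the test data.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    rankings: iterable of lists, each ordered best-first.
    k: smoothing constant; 60 is a common default (assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Because only ranks are used, BM25 scores and cosine similarities never need to be normalized onto a common scale before merging.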

Built a grounded generation layer using GPT-4o that answers strictly from retrieved context with mandatory course code citations, a design intended to keep hallucinated content out of answers.
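One way to approach this kind of grounding is sketched below, under stated assumptions: the prompt wording, the document shape (`code` and `text` keys), and the simplified course-code pattern are all hypothetical, and the model call itself is omitted. The second helper flags any cited code that was not in the retrieved context, which gives a cheap post-generation check.

```python
import re

# Simplified, assumed pattern for codes like "CS 189" or "EECS 16A".
COURSE_CODE = re.compile(r"\b[A-Z]{2,10} [A-Z]?\d{1,3}[A-Z]{0,2}\b")


def build_grounded_prompt(question, docs):
    """Assemble a prompt that restricts the model to the retrieved context."""
    context = "\n\n".join(f"[{doc['code']}]\n{doc['text']}" for doc in docs)
    return (
        "Answer using only the course context below. "
        "Cite the course code for every claim. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


def unsupported_citations(answer, docs):
    """Return course codes cited in the answer that are absent from the context."""
    allowed = {doc["code"] for doc in docs}
    cited = set(COURSE_CODE.findall(answer))
    return cited - allowed
```

An empty result from `unsupported_citations` does not prove the answer is correct, but a non-empty one is a clear signal to reject or regenerate it.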

Developed a full pytest test suite with mocking, fixtures, and integration tests following TDD practices throughout every pipeline stage.

Developing a LangGraph multi-hop agent that dynamically routes between the knowledge graph and vector store to answer complex course planning queries.

Outcomes

946 course nodes and 799 prerequisite edges loaded into Neo4j with correct AND/OR logic preserved.

918 course embeddings stored in Chroma with metadata filters for department, level, and units.

End-to-end pipeline answers student questions with grounded, cited responses in under 3 seconds.

Full test coverage across ingestion, graph loading, vector storage, retrieval, reranking, and generation.

Prerequisite chain queries that would require 10+ recursive LLM calls now resolve in a single Cypher traversal.