tum-search

When campus search mixes crawling, summaries, embeddings, graph structure, and live updates, which signal should users trust first?

Why this article exists

What should a campus search system trust when structure and intent point in different directions? This project explores how university knowledge can be crawled, summarized, embedded, connected, and updated without reducing ranking to keyword matching.

Problem

Campus knowledge search needs more than text lookup. It needs recursive crawling, concise page summaries, semantic retrieval, graph relationships, freshness signals, and visible progress when the index changes.
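As a rough illustration of recursive crawling with visible progress, here is a minimal sketch. It is not the repository's crawler: the in-memory `SITE` link graph, the `crawl` function, and the `on_progress` callback are hypothetical names chosen for this example, and no network access is involved. In a live system the callback is where a progress message could be pushed to connected clients.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple

# Hypothetical in-memory link graph standing in for real campus pages.
SITE: Dict[str, List[str]] = {
    "/": ["/study", "/research"],
    "/study": ["/study/programs", "/"],
    "/research": ["/research/labs"],
    "/study/programs": [],
    "/research/labs": ["/study"],
}


def crawl(
    start: str,
    max_depth: int,
    on_progress: Callable[[str, int, int], None],
) -> List[str]:
    """Breadth-first crawl from `start`, visiting each page once,
    following links at most `max_depth` hops away."""
    seen = {start}
    queue = deque([(start, 0)])
    visited: List[str] = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth < max_depth:
            for link in SITE.get(url, []):
                if link not in seen:  # skip pages already queued or visited
                    seen.add(link)
                    queue.append((link, depth + 1))
        # Report the page just processed, pages done, and pages still queued.
        on_progress(url, len(visited), len(queue))
    return visited


events: List[Tuple[str, int, int]] = []
order = crawl("/", max_depth=1,
              on_progress=lambda url, done, pending: events.append((url, done, pending)))
```

The depth limit and the `seen` set are what keep a recursive crawl from looping on pages that link back to each other, such as `/study` linking to `/` above.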

What shipped

The repository ships a crawler, Gemini-powered summaries, Qdrant/CLIP vector search, knowledge-graph ideas, WebSocket crawl progress, dependency checks, setup scripts, and admin utilities.
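To make the vector-search idea concrete, the sketch below ranks pages by cosine similarity between a query vector and stored page vectors. It deliberately swaps in a brute-force, pure-Python search over toy three-dimensional vectors; it does not call Qdrant or CLIP, and the `INDEX` contents, `cosine`, and `search` are assumptions made for illustration only.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy embeddings; real embeddings would have hundreds of dimensions.
INDEX: Dict[str, List[float]] = {
    "/study/programs": [0.9, 0.1, 0.0],
    "/research/labs": [0.1, 0.9, 0.2],
    "/campus/map": [0.0, 0.2, 0.9],
}


def search(query_vec: List[float], top_k: int = 2) -> List[Tuple[str, float]]:
    """Score every indexed page against the query and return the top_k matches."""
    scored = [(url, cosine(query_vec, vec)) for url, vec in INDEX.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


results = search([1.0, 0.2, 0.0])
```

A vector database replaces the linear scan with an approximate nearest-neighbor index, but the ranking principle of "closest vectors first" is the same.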

Evidence

The README documents the crawler, summarization, vector-search, knowledge-graph, WebSocket update, setup, environment, and admin-tool surfaces.

Inspect path

Inspect the README, `web_server.py`, dependency scripts, crawler/summarization paths, Qdrant configuration, WebSocket update path, and admin scripts for database clearing and summary regeneration.

Boundary

The public README describes a research prototype of a search system. It is not a production campus search service, a validated ranking benchmark, or an official university information product.

What changed

Search quality became a systems question: topology, semantics, generated summaries, and update feedback have to work together before any ranking claim is credible.

Next question

Which signal should be trusted first when graph structure, semantic similarity, freshness, and keyword match disagree?
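One way to make this question testable is to write the trust order down as explicit weights and watch how the ranking moves when they change. The sketch below is an assumption-laden illustration, not the project's ranking method: the `Candidate` type, the weight values, and the example scores are all hypothetical, and every signal is assumed to be normalized to the range 0 to 1.

```python
from dataclasses import dataclass
from typing import Dict, List

SIGNALS = ("graph", "semantic", "freshness", "keyword")


@dataclass
class Candidate:
    url: str
    scores: Dict[str, float]  # one value per signal, each assumed to lie in [0, 1]


def fused_score(c: Candidate, weights: Dict[str, float]) -> float:
    """Weighted average of the four signals."""
    total = sum(weights[s] for s in SIGNALS)
    return sum(weights[s] * c.scores[s] for s in SIGNALS) / total


def disagreement(c: Candidate) -> float:
    """Spread between the strongest and weakest signal for one candidate."""
    values = [c.scores[s] for s in SIGNALS]
    return max(values) - min(values)


def rank(cands: List[Candidate], weights: Dict[str, float]) -> List[str]:
    """URLs ordered from highest to lowest fused score."""
    ordered = sorted(cands, key=lambda c: fused_score(c, weights), reverse=True)
    return [c.url for c in ordered]


# Hypothetical candidates: a well-linked older page versus a fresh, on-topic one.
candidates = [
    Candidate("/study/programs",
              {"graph": 0.9, "semantic": 0.4, "freshness": 0.2, "keyword": 0.8}),
    Candidate("/news/new-program",
              {"graph": 0.1, "semantic": 0.8, "freshness": 1.0, "keyword": 0.3}),
]

# Hypothetical weights; doubling the freshness weight flips the order.
base_weights = {"graph": 1.0, "semantic": 2.0, "freshness": 0.5, "keyword": 1.0}
fresh_weights = {**base_weights, "freshness": 1.0}

base_order = rank(candidates, base_weights)
fresh_order = rank(candidates, fresh_weights)
```

With these made-up numbers the two weightings put different pages first, which is the disagreement in miniature. A high `disagreement` value flags results where the choice of weights, rather than the evidence, is deciding the ranking.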

Open public repository

https://github.com/89325516/tum-search