The constraint nobody talks about
When you're a solo builder deploying on a budget, your server has limits. TEBIBX runs on a Render instance with 512MB of RAM. That's not a complaint - it's cheap, it's fast to deploy, and for serving API responses it's more than enough.
The problem showed up the moment users started indexing real repositories. A mid-size production codebase - say, 400 files, 60,000 lines of Python - doesn't fit cleanly inside a 512MB process. Not when you're chunking it, embedding it, and holding those vectors in memory during ingestion. The first time I watched the server spike to 490MB and then silently die, I understood the problem clearly: the original architecture was going to fight this constraint every single day.
What I was originally building
The first version of TEBIBX followed the standard RAG playbook:
- User submits a GitHub repository URL
- Server clones or fetches the repo via GitHub API
- FastAPI pipeline chunks the files, generates embeddings, writes vectors to Qdrant
- User queries hit the retrieval pipeline - Qdrant for semantic search, SQLite for lexical matching, call graph traversal for structural context
Simple. Clean. Completely wrong for a 512MB deployment. The issue isn't query time - retrieval is fast and memory-stable. The issue is ingestion. Chunking and embedding a large codebase is a memory-hungry, time-consuming operation. Running it inside the same process that serves user queries means one large ingestion job can starve everything else.
Why the obvious fix was wrong
The obvious answer is to scale up. Upgrade the instance, get more RAM, solve it with money. I considered it for about ten minutes. Then I thought about what that actually buys.
Ingestion is a per-user, per-repository operation. Every developer who uses TEBIBX has their own set of repos they care about. Scaling the server vertically means I'm paying for peak ingestion load across all users - load that is unpredictable and bursty, on a server that is otherwise mostly idle. I'd be buying RAM to handle spikes that happen when someone first adds a repo, and that RAM sits unused 90% of the time.
Worse: vertical scaling doesn't actually solve the architecture problem. It just delays it. Add enough concurrent users and you're back to the same fight. The real question wasn't "how do I give the server more memory?" It was: does ingestion need to happen on the server at all?
The shift: make the client do the work
Ingestion and retrieval are separate operations. Retrieval needs the server - it needs the six-stage hybrid pipeline, the vector index, the LLM calls. But ingestion? Ingestion is just reading files, splitting them into chunks, and generating embeddings. Nothing about that requires the server to hold a whole repository in memory. The browser can drive it.
The solution I built moves the entire ingestion pipeline to the client:
- A Web Worker handles chunking and processing off the main thread, so the UI stays responsive during ingestion
- Embeddings are generated in small batches via lightweight API calls that don't spike server memory - the server embeds the chunks in each request one at a time, returns the vectors, and never holds the full corpus
- The resulting vectors are stored in IndexedDB - the browser's built-in structured storage, which can handle hundreds of megabytes without touching the server
When the user queries, the client reads vectors from IndexedDB and sends them to the server's retrieval pipeline. The server sees a stateless request - it doesn't know or care how the vectors were generated or where they're stored. From the server's perspective, memory usage during ingestion is now flat. It processes one chunk, returns one vector, forgets it. The 512MB limit became irrelevant for ingestion overnight.
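Here's a minimal sketch of that ingestion loop. It's written in Python for brevity; the real client runs in the browser, so read it as an outline of the control flow rather than the actual TEBIBX code. The names (`chunk_lines`, `batched`, `ingest`, `embed_batch`), the 40-line chunk size and the batch size of 8 are all assumptions for illustration, and a plain dict stands in for IndexedDB.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator

Vector = list[float]


def chunk_lines(path: str, text: str, max_lines: int = 40) -> Iterator[tuple[str, str]]:
    """Yield (chunk_id, chunk_text) pairs of at most max_lines lines each."""
    lines = text.splitlines()
    for start in range(0, len(lines), max_lines):
        chunk_id = f"{path}:{start + 1}"  # 1-based first line of the chunk
        yield chunk_id, "\n".join(lines[start:start + max_lines])


def batched(items: Iterable, size: int) -> Iterator[list]:
    """Lazily group any iterable into lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def ingest(
    files: dict[str, str],
    embed_batch: Callable[[list[str]], list[Vector]],
    store: dict[str, Vector],
    batch_size: int = 8,
) -> int:
    """Chunk every file, embed in small batches, persist vectors; return chunk count."""
    # Generator expression: chunks are produced on demand, never all at once.
    chunks = (c for path, text in files.items() for c in chunk_lines(path, text))
    total = 0
    for batch in batched(chunks, batch_size):
        ids = [chunk_id for chunk_id, _ in batch]
        vectors = embed_batch([text for _, text in batch])  # one small request
        store.update(zip(ids, vectors))  # hypothetical stand-in for an IndexedDB write
        total += len(batch)
    return total
```

The point of the shape is that `embed_batch` only ever sees `batch_size` chunks per call, so whatever sits behind it has a fixed memory ceiling no matter how large the repository is.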
How the retrieval pipeline works
Queries go through six stages:
- Query embedding - the user's natural language question is embedded server-side
- Qdrant semantic search - vector similarity against the repository corpus
- SQLite lexical search - BM25 keyword matching for exact identifiers (function names, class names, file paths)
- Call graph traversal - structural context, tracing how functions call each other
- Map phase - Gemini Flash processes each retrieved chunk in parallel, extracting relevant context
- Reduce phase - Groq Llama synthesizes the mapped results into a coherent answer
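The last two stages are a classic map-reduce, and the shape is easy to sketch. In the version below, `map_fn` and `reduce_fn` are hypothetical stand-ins for the Gemini Flash and Groq Llama calls; the function name, the thread pool and the "empty note means irrelevant" convention are assumptions for illustration, not details taken from TEBIBX.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def map_reduce(
    question: str,
    chunks: list[str],
    map_fn: Callable[[str, str], str],
    reduce_fn: Callable[[str, list[str]], str],
    max_workers: int = 8,
) -> str:
    """Run map_fn on every chunk in parallel, then combine the notes with reduce_fn."""
    if not chunks:
        return reduce_fn(question, [])
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so notes line up with chunks.
        notes = list(pool.map(lambda chunk: map_fn(question, chunk), chunks))
    # Assumed convention: an empty note means the chunk was judged irrelevant.
    relevant = [note for note in notes if note]
    return reduce_fn(question, relevant)
```

Threads fit here because each map call would spend its time waiting on a network response, so the work is I/O-bound rather than CPU-bound.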
The hybrid retrieval (semantic + lexical + structural) means TEBIBX catches results that are conceptually similar and results that are exact syntactic matches - which matters enormously when a developer asks about a specific function they know by name.
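How the three result lists get merged isn't specified above, so the sketch below swaps in reciprocal rank fusion, a common way to combine rankings whose scores aren't directly comparable. Treat it as one plausible approach, not as TEBIBX's method; the function name, the chunk ids and the constant `k = 60` are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk ids; each list contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for deterministic output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical result lists, best match first.
semantic = ["auth.py:10", "session.py:5", "utils.py:1"]
lexical = ["parser.py:42", "auth.py:10"]
structural = ["parser.py:42", "cli.py:7"]

fused = reciprocal_rank_fusion([semantic, lexical, structural])
```

In this example `parser.py:42` ends up first in `fused` even though semantic search never returned it, because both the lexical and structural lists rank it at the top.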
What I gave up
This architecture has real tradeoffs and I'm not going to pretend otherwise.
- Cross-device persistence is harder. A user who indexes a repo on their laptop doesn't automatically have it available on their desktop. The vectors live in that browser's IndexedDB, not on a central server. This is a meaningful UX cost for heavy users.
- First ingestion takes longer client-side. The browser is doing real work. On a large repo, that's noticeable.
- Browser storage quotas are real. IndexedDB has limits that vary by browser and available disk space. For most codebases this is fine. For very large monorepos, it becomes a constraint.
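To get a feel for where the quota starts to bite, here's a back-of-the-envelope estimate of index size. Every default is an assumption, not a TEBIBX figure: 40-line chunks, 768-dimensional float32 vectors, and about 40 single-byte characters per line. It also ignores IndexedDB's own per-record overhead, so treat the result as a lower bound.

```python
def estimate_index_bytes(
    total_lines: int,
    lines_per_chunk: int = 40,
    dimensions: int = 768,
    bytes_per_float: int = 4,
    avg_chars_per_line: int = 40,
) -> int:
    """Rough lower bound on bytes needed to store vectors plus chunk text."""
    chunks = -(-total_lines // lines_per_chunk)  # ceiling division
    vector_bytes = chunks * dimensions * bytes_per_float
    text_bytes = total_lines * avg_chars_per_line  # assumes ~1 byte per character
    return vector_bytes + text_bytes
```

Under those assumptions a 60,000-line codebase comes to roughly 7MB (1,500 chunks), comfortably inside typical quotas. The number grows linearly with line count, which is why only very large monorepos run into trouble.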
These are known tradeoffs, not surprises. And against them: the server stays stable regardless of how many users are indexing simultaneously. Costs stay predictable. The architecture scales horizontally because each user's ingestion is fully isolated in their own browser.
What I learned
The 512MB limit felt like a problem. It turned out to be a forcing function. The standard playbook for RAG systems assumes you control the compute - that you can scale the server when you need to. When you can't, you have to ask which parts of the pipeline actually need centralized infrastructure. For TEBIBX, the answer was: retrieval, synthesis, and the LLM calls. Not ingestion.
Pushing ingestion to the client made the architecture cleaner, not messier. The server does less. Each component has a clearer responsibility. The system is easier to reason about.
If I'd solved the problem with money - more RAM, bigger instance - I'd have a more expensive system with the same structural weakness. The constraint forced a better answer.