April 15, 2025

Chat with documents using LLMs — the cheap & private way

Why chat with documents at all?

LLMs are amazing at answering questions — but what if the answers you're looking for are buried inside your own PDFs, notes, or docs?

The simplest idea is to just paste the entire doc into your prompt and start chatting. Thanks to models with 100k+ token context windows, that actually works… up to a point.

But here's the problem: token limits are real, and token costs add up fast. Every chat turn re-sends the full document, so you pay for the whole thing again even if you're only asking about a single paragraph. If you're paying per token, that gets expensive quickly.


A smarter, cheaper way: retrieve first, then prompt

Instead of feeding the whole document, a better approach is to:

  • Split your doc into chunks
  • Convert those chunks into embeddings
  • Store them in a local, in-memory vector database
  • When a user asks a question, find the most relevant chunks
  • Send only those snippets to the LLM for context

This is far more cost-efficient and often more accurate. You avoid flooding the model with irrelevant info, and you only pay for what's actually useful.
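
Concretely, here's a minimal sketch of that flow in plain Python, with no framework at all. The embed() and llm() functions are placeholders for whatever embedding model and LLM you plug in (more on embeddings below); the "vector database" is just a list of chunk vectors ranked by cosine similarity.

    import numpy as np

    def chunk_text(text, size=500, overlap=50):
        # Naive character-based splitting with a little overlap between chunks.
        step = size - overlap
        return [text[i:i + size] for i in range(0, len(text), step)]

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def answer(question, doc_text, embed, llm, k=3):
        # 1. Split the doc and embed each chunk (this list is our "vector DB").
        chunks = chunk_text(doc_text)
        chunk_vecs = [embed(c) for c in chunks]
        # 2. Embed the question and keep only the k most similar chunks.
        q_vec = embed(question)
        ranked = sorted(zip(chunks, chunk_vecs), key=lambda cv: cosine(q_vec, cv[1]), reverse=True)
        context = "\n---\n".join(c for c, _ in ranked[:k])
        # 3. Send just those snippets, not the whole document, to the model.
        prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
        return llm(prompt)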


Use LangChain to wire it all together

LangChain makes this super easy. You can spin up a local in-memory vector store (like FAISS or Chroma), use a text splitter, create embeddings, and retrieve relevant content — all in a few lines of code.

There’s no need to set up a database server or cloud infra. Just run it locally, fast and free. It’s the perfect companion for a local-first GPT setup like yo-GPT.
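
In LangChain, that pipeline boils down to a handful of calls. Below is a minimal sketch: the import paths assume a recent LangChain release with the split langchain-community / langchain-text-splitters / langchain-openai packages (older versions expose the same classes under langchain.*), document.txt is a placeholder path, and OpenAIEmbeddings can be swapped for any other embeddings class (more on that next).

    from langchain_community.vectorstores import FAISS
    from langchain_openai import OpenAIEmbeddings
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # Split the document into overlapping chunks.
    text = open("document.txt").read()  # placeholder path
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = splitter.split_text(text)

    # Embed the chunks into an in-memory FAISS index (needs the faiss-cpu package).
    store = FAISS.from_texts(chunks, OpenAIEmbeddings())

    # At question time, retrieve only the most relevant chunks.
    question = "What does the document say about pricing?"
    hits = store.similarity_search(question, k=4)
    context = "\n---\n".join(doc.page_content for doc in hits)
    # `context` plus the question is all you send to the LLM.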


But wait — what are embeddings?

Embeddings are numerical representations of text that capture semantic meaning. You need them to measure similarity between the question and parts of the document. There are two good options here:

  • OpenAI Embeddings API – Simple to use, high quality, but requires payment and has rate limits.
  • Local FastText Embeddings (FREE) – Download pre-trained word vectors and use them locally. Load only the top 100,000 most common words to keep it memory-friendly.

For most document-chat use cases, FastText covers the vast majority of what you need. No API keys, no tracking, no cost.
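
Here's a sketch of that local option: load the pre-trained .vec file with gensim, cap the vocabulary at the top 100,000 words, and average word vectors to embed a chunk or question. The file name cc.en.300.vec is a placeholder for whichever FastText model you downloaded, and averaging word vectors is just one simple, common choice for turning a piece of text into a single vector.

    import numpy as np
    from gensim.models import KeyedVectors

    # FastText's .vec files use the word2vec text format, so gensim can read them.
    # limit= keeps only the 100,000 most frequent words to stay memory-friendly.
    vectors = KeyedVectors.load_word2vec_format("cc.en.300.vec", limit=100_000)

    def embed(text):
        # Average the vectors of known words: crude, but free and fully local.
        words = [w for w in text.lower().split() if w in vectors]
        if not words:
            return np.zeros(vectors.vector_size)
        return np.mean([vectors[w] for w in words], axis=0)

This embed() drops straight into the retrieval sketch above: no API key, and nothing leaves your machine.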


TL;DR — Local, cheap, and private

If you're building tools like yo-GPT and care about privacy, cost, and simplicity, here's the best way to chat with documents:

  • Don't send the whole doc — it’s expensive and inefficient
  • Use LangChain to split, embed, and retrieve relevant chunks
  • Run your vector DB in memory — no servers, no DB setup
  • Use FastText locally for free, no-API embedding generation

This approach is in line with my philosophy for yo-GPT: run GPT models locally, keep your data private, and stay in full control — all while keeping your stack simple and cost-efficient.

Want help integrating this into your workflow? Or need a ready-to-go code snippet? Feel free to check out the repo or reach out!