PDF to Markdown for AI: RAG, Claude, ChatGPT (2026)
Convert PDF to Markdown in your browser, prep clean context for RAG pipelines, Claude, ChatGPT, and Notion. Four methods compared with token-cost data.
Why PDF to Markdown matters for AI
If you have ever pasted a PDF into Claude or ChatGPT and watched the response stumble — repeated headers, broken paragraphs, table cells in the wrong order — the format is the problem. PDFs were designed for print. They encode visual layout, not document structure. Tokenizers see every page header, footer, and column break as content, which fattens the context window and drowns out the actual signal.
Markdown solves this by stripping the layout and preserving the structure. Headings stay headings. Lists stay lists. Tables stay tables. Headers, footers, and page numbers — the noise — get filtered out. The result is text that reads cleanly to a human and parses cleanly to an LLM.
The numbers back it up. On a 50-page government circular, the raw text extraction from PDF.js produced about 38,000 tokens (Claude tokenizer). The Markdown conversion, with headers and footers removed, came to 24,000 tokens — a 37% reduction with the same information density. At Claude Sonnet 4.6 input pricing ($3/M tokens), that is the difference between $0.114 and $0.072 per query, a saving of $0.042. At a thousand queries a day, that compounds to roughly $15,000 a year, on one document.
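To reproduce the comparison on your own documents, here is a rough sketch. It uses tiktoken's cl100k_base encoding as a stand-in, since Claude's tokenizer is not a pip-installable library (Anthropic exposes exact counts through a token-counting API endpoint); the file paths are placeholders, and the absolute counts will differ from the figures above:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # proxy tokenizer, not Claude's

def input_cost(text, usd_per_million=3.0):
    # Input-token cost of sending `text` once, at $/M-token pricing.
    return len(enc.encode(text)) / 1_000_000 * usd_per_million

raw_text = open("circular_raw.txt", encoding="utf-8").read()  # raw PDF.js dump
markdown = open("circular.md", encoding="utf-8").read()       # cleaned Markdown

saving = input_cost(raw_text) - input_cost(markdown)
print(f"saved per query: ${saving:.3f}")
print(f"per year at 1,000 queries/day: ${saving * 1_000 * 365:,.0f}")
```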
For retrieval-augmented generation (RAG), the case is even stronger. RAG quality depends on chunk boundaries that respect semantic structure. Markdown headings give you those boundaries for free — LangChain's MarkdownHeaderTextSplitter and LlamaIndex's MarkdownNodeParser both split on heading levels by default. Raw extracted PDF text gives you nothing structural; you end up chunking on character count, which slices through paragraphs and tanks retrieval precision.
4 methods compared
There are four serious options for PDF-to-Markdown conversion in 2026. Each has a different tradeoff profile.
1. Marker (datalab.to)
Marker is the highest-fidelity option. It uses a stack of vision-language models to recognize equations, complex tables, multi-column flow, and figures with captions. On academic papers, the output is nearly print-quality. The catch: it needs Python, PyTorch, and a GPU for the deep model — practical for batch ingest on a server, not for ad-hoc conversion. The hosted API at datalab.to is paid and uploads your PDF.
2. MarkItDown (Microsoft)
Open-sourced by Microsoft in late 2024 and stable through 2026. CPU-only, fast, handles office documents (PDF, DOCX, PPTX, XLSX) with a single CLI. Output quality is solid for business PDFs but weaker on equations and complex tables compared to Marker. Pure-Python install, no GPU required.
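A minimal Python invocation, following the project's documented API; the file name is a placeholder:

```python
from markitdown import MarkItDown

converter = MarkItDown()
result = converter.convert("quarterly_report.pdf")  # placeholder path

with open("quarterly_report.md", "w", encoding="utf-8") as f:
    f.write(result.text_content)  # the converted Markdown
```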
3. pymupdf4llm
The lightest pure-Python option. Built on PyMuPDF, optimized for RAG ingest, no model dependencies. Sub-second conversion on a typical business PDF. Output is opinionated for LLM consumption — strips visuals, preserves logical flow. Default choice if you are running a Python ingest pipeline and do not need vision-model fidelity.
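The whole ingest step is one documented call; the path below is a placeholder:

```python
import pathlib

import pymupdf4llm

md_text = pymupdf4llm.to_markdown("contract.pdf")  # returns a Markdown string
pathlib.Path("contract.md").write_text(md_text, encoding="utf-8")
```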
4. PDF Mavericks (browser-local)
The only browser-local option in this set. Drop the PDF, get Markdown, no server upload, no install, no API key. The tradeoff: text-layer-only conversion, so scanned PDFs need OCR first (our browser-local OCR tool handles this if your PDF is image-based). For text-based PDFs — the majority of contracts, reports, articles, and manuals — the output is interchangeable with pymupdf4llm. The privacy story is the differentiator: when the PDF is a salary slip, a bank statement, or an Aadhaar copy, server-upload tools are non-starters.
Quick decision rule. Academic papers with equations and figures, server-side batch: Marker. Office documents, server-side: MarkItDown. RAG ingest pipeline in Python: pymupdf4llm. One-off conversion in your browser, sensitive document: PDF Mavericks.
How to convert in PDF Mavericks
The flow is three steps:
- Open the PDF to Markdown tool. The page loads in under a second; nothing downloads beyond standard JavaScript.
- Drop your PDF on the drop zone, or click "Choose PDF" and pick a file. The tool reads the PDF directly from your local file system using the browser File API.
- Click "Convert". The conversion runs in WebAssembly. A 50-page PDF takes 6 to 12 seconds on a modern laptop. The output appears in a copy-able text area; click "Download .md" to save it.
That is it. No upload, no signup, no quota. The Markdown is yours; do whatever you want with it.
Plugging the output into a RAG pipeline
The Markdown output is structured for chunking by heading. Two minimal-effort splits work well in practice:
LangChain. Use MarkdownHeaderTextSplitter with H2 and H3 as split levels. Each chunk gets the heading path as metadata, which improves retrieval precision because the embedder sees the section context. For a 60-page report, this typically produces 80 to 140 chunks — a workable number for a vector store.
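A minimal sketch of that split, assuming the converted file is saved as report.md (a placeholder) and langchain-text-splitters is installed:

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

md_text = open("report.md", encoding="utf-8").read()  # placeholder path

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("##", "h2"), ("###", "h3")],  # split on H2 and H3
)
chunks = splitter.split_text(md_text)

# Each chunk is a Document whose metadata holds the heading path,
# e.g. {"h2": "Eligibility", "h3": "Income criteria"} (illustrative values).
print(len(chunks), chunks[0].metadata)
```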
LlamaIndex. Use MarkdownNodeParser with default settings. It produces hierarchical nodes that mirror the heading tree, which lets you do parent-child retrieval — fetch the small chunk for embedding match, return the parent section for context.
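The equivalent sketch with LlamaIndex, reusing the md_text string from the LangChain example:

```python
from llama_index.core import Document
from llama_index.core.node_parser import MarkdownNodeParser

parser = MarkdownNodeParser()  # default settings, as described above
nodes = parser.get_nodes_from_documents([Document(text=md_text)])

# Nodes follow the heading structure of the source Markdown.
print(len(nodes))
```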
For embeddings, a 768- or 1024-dim model is fine. Mistral Embed and Voyage 3 both index Markdown well; OpenAI's text-embedding-3-small is the cheap default. Skip embedding the headings themselves as separate chunks — they bias retrieval toward generic matches.
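Continuing from the LangChain sketch, a hedged example of embedding the chunks with text-embedding-3-small; the heading-only filter is a rough heuristic of ours, not a rule from the library:

```python
from openai import OpenAI

client = OpenAI()  # needs OPENAI_API_KEY in the environment

# Skip heading-only chunks, per the advice above; the 8-word floor is an
# arbitrary threshold, tune it for your corpus.
texts = [c.page_content for c in chunks if len(c.page_content.split()) > 8]

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=texts,
    dimensions=1024,  # the v3 models accept a reduced output dimension
)
vectors = [d.embedding for d in resp.data]
```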
3 pitfalls to avoid
1. Do not skip the OCR step on scanned PDFs. A "PDF" that is really a stack of images has no text layer. Conversion will produce empty Markdown. Run OCR first; we have a browser-local OCR tool that adds the text layer in-place. The whole chain stays local.
2. Do not include encrypted PDFs without unlocking. Bank statements and tax documents in India often arrive password-protected. Browser PDF readers cannot extract text from encrypted PDFs. Unlock first using our unlock-pdf tool (also browser-local), then convert. The unlock tool needs the password — it does not crack anything; it just removes the encryption layer once the password is supplied.
3. Do not feed the Markdown raw into a 200K context model expecting magic. Even with clean Markdown, a 200K-token context window has retrieval challenges — the "lost in the middle" problem is real, even on Claude. For documents above ~80K tokens, RAG outperforms long-context insertion in most benchmarks. Convert, chunk, embed, retrieve. The Markdown conversion is step one of a four-step pipeline, not the whole answer.
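To close the convert, chunk, embed, retrieve loop, here is a minimal retrieve step that continues from the embedding sketch above; a real pipeline would use a vector store, but the operation has this shape:

```python
import numpy as np

def top_k(query_vec, vectors, k=5):
    # Cosine similarity of the query against every chunk vector.
    mat = np.asarray(vectors)
    q = np.asarray(query_vec)
    sims = (mat @ q) / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q))
    return np.argsort(-sims)[:k]  # indices of the k most similar chunks

q_vec = client.embeddings.create(
    model="text-embedding-3-small",
    input=["What are the eligibility criteria?"],  # placeholder question
    dimensions=1024,
).data[0].embedding

for i in top_k(q_vec, vectors):
    print(texts[i][:80])
```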
When to convert and when to skip
Not every PDF needs to become Markdown. If you only need a one-shot summary and the PDF is under 30 pages, pasting the raw text into Claude or ChatGPT works fine — the model handles light noise. The conversion pays off when one of three things is true: the PDF is part of a recurring pipeline (RAG, Notion ingest, weekly report), the PDF is sensitive enough that server-upload tools are off the table, or the document is large enough that token cost per query starts mattering. For everything else, raw paste is the cheaper path.
For India-specific workflows — bank statements, tax documents, government circulars, salary slips — the privacy story drives the choice more than token economics. Even a small PDF gets converted in-browser when uploading it to a third party would be a compliance headache. The 6-second conversion is cheap insurance against a data exposure that does not need to happen.
Your files never leave your browser
PDF Mavericks processes everything locally using WebAssembly. No file is uploaded to any server.
Frequently asked questions
Why convert PDF to Markdown for AI instead of pasting the PDF?
PDFs are encoded for print layout, not for tokenizers. Headers, footers, page numbers, and column breaks become noise tokens that inflate context cost and hurt retrieval. Markdown strips the layout, keeps the structure (headings, lists, tables), and gives an LLM clean text to reason over. On a 50-page report, the same content can drop from roughly 38,000 tokens (raw extracted PDF) to 24,000 tokens (clean Markdown) — a 37% reduction with no information loss.
Does PDF Mavericks upload my PDF to a server?
No. The conversion runs entirely in your browser using PDF.js (Mozilla's open-source PDF renderer) and Turndown for HTML-to-Markdown. Your PDF never leaves your device. There is no server processing step, no temporary cache, and no upload at any point. This matters when the PDF contains salary slips, bank statements, Aadhaar copies, contracts, or anything you would not paste into a public chatbot.
How does the output compare to Marker, MarkItDown, or pymupdf4llm?
Marker (datalab.to) is the highest-fidelity option for academic papers — it handles equations, complex tables, and figures via vision models, but requires Python and a GPU for the deep model. MarkItDown (Microsoft) is fast and CPU-only, good for office documents. pymupdf4llm is the lightest pure-Python option, optimized for RAG ingest. PDF Mavericks is the only browser-local option in this set — you trade some equation/table fidelity for zero-install, zero-upload, zero-cost conversion. For most business and content PDFs, the output is interchangeable.
What does the conversion preserve and what does it drop?
Preserved: paragraph order, headings (detected via font-size heuristics), bulleted and numbered lists, basic tables, links, bold and italic emphasis. Dropped: pixel-perfect layout, multi-column flow as visual columns (it linearizes), background images, page numbers and headers (a feature, not a bug — these poison RAG retrieval), and embedded fonts. If your downstream task is RAG, Notion import, or LLM context, the dropped pieces are exactly what you want gone.
What is the file size limit?
Soft limit of 100 MB or 500 pages, whichever comes first. The conversion runs in WebAssembly inside your browser tab — beyond that the tab can run out of memory on older devices. For a 1,200-page contract bundle, split it first using our split tool and concatenate the resulting Markdown files. On modern laptops, a 200-page PDF converts in 8 to 14 seconds.
Can I feed the Markdown directly to Claude or ChatGPT?
Yes — the output is plain UTF-8 Markdown, ready to paste into any chat or to send via API. For Claude's 200K-token context, a typical 60-page Markdown document occupies roughly 30,000 tokens, leaving plenty of room for the question and response. For RAG, the output is already structured for chunk-by-heading splitting; LangChain's MarkdownHeaderTextSplitter and LlamaIndex's MarkdownNodeParser both work directly on it.
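For the API route, a minimal sketch using the Anthropic Python SDK; the model id, file path, and prompt are placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # needs ANTHROPIC_API_KEY in the environment
md_text = open("report.md", encoding="utf-8").read()  # placeholder path

message = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder; substitute the current model id
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"{md_text}\n\n---\n\nSummarize the key points.",
    }],
)
print(message.content[0].text)
```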
Does it handle scanned PDFs?
Not directly. Scanned PDFs are images with no text layer, so there is nothing to extract. Run them through OCR first — our browser-local OCR tool produces a searchable PDF with a text layer, which then converts to Markdown cleanly. The full chain (scan → OCR → Markdown) runs entirely in the browser; no document leaves your device at any stage.