pdf2md

pdf2md converts PDF files into clean, structured Markdown using a vision LLM. It uses dots.ocr for OCR and layout analysis, preserving headings, paragraphs, tables, and images.

Features:

*	Checkpoint/restart: resume from partially processed PDFs without losing progress.
*	Page range selection: convert only the pages you need.
*	Preserves document structure and can optionally filter out headers and footers.
*	Experimental MPS (Apple Silicon) and CPU backends for non-CUDA systems.
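
To make the first two features concrete, here is a minimal sketch of how page range selection and checkpoint/restart can fit together. It is not pdf2md's actual implementation: the range syntax ("1-3,7"), the JSON checkpoint file, and the names parse_page_range, load_done, convert, and ocr_page are all assumptions made for illustration.

```python
import json
from pathlib import Path


def parse_page_range(spec: str, total_pages: int) -> list[int]:
    """Expand a spec such as "1-3,7" into sorted 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-", 1)
            lo, hi = int(start), int(end)
        else:
            lo = hi = int(part)
        if lo < 1 or hi > total_pages or lo > hi:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(lo, hi + 1))
    return sorted(pages)


def load_done(checkpoint: Path) -> dict[int, str]:
    """Return {page number: markdown} for pages finished in an earlier run."""
    if not checkpoint.exists():
        return {}
    raw = json.loads(checkpoint.read_text(encoding="utf-8"))
    # JSON object keys are always strings, so convert them back to ints.
    return {int(k): v for k, v in raw.items()}


def convert(pages, checkpoint: Path, ocr_page) -> str:
    """Convert pages, skipping any that the checkpoint already holds.

    ocr_page is a hypothetical callable: page number -> Markdown text.
    """
    done = load_done(checkpoint)
    for page in pages:
        if page in done:
            continue  # finished before the interruption
        done[page] = ocr_page(page)
        # Persist after every page so a crash loses at most one page of work.
        checkpoint.write_text(json.dumps(done), encoding="utf-8")
    return "\n\n".join(done[p] for p in pages)
```

Writing the checkpoint after every page trades a little I/O for the guarantee that a restart repeats at most the page that was in flight, which matters when each page costs a full model inference.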

Quickstart

# 1. Install PyTorch (CUDA 12.8 build)
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128

# 2. Install pdf2md in editable mode
pip install -e .
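
After installing, it can help to check which PyTorch backend is visible on the machine. The snippet below is a small sketch using standard PyTorch calls; pick_device is a hypothetical helper, and how pdf2md itself chooses between CUDA, MPS, and CPU is not documented here.

```python
def pick_device():
    """Return "cuda", "mps", or "cpu", or None when PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is missing on older PyTorch builds, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print(pick_device())
```

If this prints "cpu" on a machine with an NVIDIA GPU, the installed wheel is probably not the CUDA build from step 1.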

Example

pdf2md mydoc.pdf -o mydoc.md

Notes

*	Requires a local or auto-downloaded dots.ocr model (see dots.ocr repo for details).
*	On first run, the model will be downloaded to ./weights/DotsOCR unless overridden with --model-dir.
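
For converting several files against a shared model directory, the command-line call can be scripted. The sketch below only uses the invocation shown in the example and the --model-dir flag mentioned above; it assumes that --model-dir takes a directory path as its value, and build_command and convert_folder are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def build_command(pdf: Path, model_dir=None) -> list[str]:
    """Build the pdf2md call for one PDF, writing a .md file next to it."""
    out = pdf.with_suffix(".md")
    cmd = ["pdf2md", str(pdf), "-o", str(out)]
    if model_dir is not None:
        # Assumption: --model-dir is followed by a directory path.
        cmd += ["--model-dir", str(model_dir)]
    return cmd


def convert_folder(folder: Path, model_dir=None) -> None:
    """Run pdf2md on every PDF in a folder, stopping at the first failure."""
    for pdf in sorted(folder.glob("*.pdf")):
        subprocess.run(build_command(pdf, model_dir), check=True)
```

Passing the same model_dir to every call keeps the weights in one place instead of relying on the default ./weights/DotsOCR, which is relative to the current working directory.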
