API reference
Convert documents programmatically
Upload, poll, download. The API is versioned at /api/v1/. Same credit balance as the web UI; no separate plan, no per-key surcharge.
Authentication
Generate a key from the dashboard. Pass it as a Bearer token:
Authorization: Bearer dpk_live_<your_key>
Each key is shown once at creation; only its hash is stored on the server. Rate limit: 60 requests/minute per key.
POST /api/v1/jobs
Upload a document for conversion. Charges credits automatically when the estimate is computed.
Request
POST /api/v1/jobs
Authorization: Bearer dpk_live_...
Content-Type: multipart/form-data
file: <binary> # the document
outputs[]: markdown # one or more (repeated field)
outputs[]: docx
Supported targets per source
| Source | Targets |
|---|---|
| .pdf | markdown, chunks, images, docx, xlsx, pptx |
| .docx | markdown, chunks, pdf |
| .pptx | markdown, chunks, pdf |
| .xlsx | markdown, chunks, csv, pdf |
| .csv, .txt, .md, .eml | markdown, chunks |
| .jpg, .jpeg, .png, .heic, .heif | jpg, png, pdf, image, markdown |
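A hypothetical client-side guard built from the table above (treating the first row's source as .pdf); the server performs its own validation, so this only saves a round trip:

```python
# Mirrors the "Supported targets per source" table; illustrative only.
IMAGE_TARGETS = {"jpg", "png", "pdf", "image", "markdown"}
TARGETS = {
    ".pdf": {"markdown", "chunks", "images", "docx", "xlsx", "pptx"},
    ".docx": {"markdown", "chunks", "pdf"},
    ".pptx": {"markdown", "chunks", "pdf"},
    ".xlsx": {"markdown", "chunks", "csv", "pdf"},
    ".csv": {"markdown", "chunks"},
    ".txt": {"markdown", "chunks"},
    ".md": {"markdown", "chunks"},
    ".eml": {"markdown", "chunks"},
    ".jpg": IMAGE_TARGETS,
    ".jpeg": IMAGE_TARGETS,
    ".png": IMAGE_TARGETS,
    ".heic": IMAGE_TARGETS,
    ".heif": IMAGE_TARGETS,
}

def unsupported_outputs(filename, outputs):
    """Return the requested outputs this source extension cannot produce."""
    ext = "." + filename.rsplit(".", 1)[-1].lower()
    return sorted(set(outputs) - TARGETS.get(ext, set()))
```

Call it before building the multipart request and fail fast on a non-empty result.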
Response (202)
{
"job_id": "6ec9b16d-c551-48fc-aebe-b151fe3d20d8",
"status": "PENDING_CLASSIFICATION",
"page_count": 3,
"requested_outputs": ["markdown", "docx"],
"credits_estimated": 6
}
GET /api/v1/jobs/:id
Poll job status. The job moves through this state machine:
PENDING_CLASSIFICATION → CLASSIFYING → AWAITING_CONFIRMATION
→ PROCESSING → DONE
→ ERROR
Response (200) — DONE
{
"job_id": "6ec9b16d-...",
"status": "DONE",
"page_count": 3,
"credits_charged": 6,
"duration_ms": 5661,
"download_url": "https://.../outputs/.../all.zip?...",
"error_code": null,
"error_message": null
}
The download URL is a presigned link valid for 7 days. The zip contains a subdirectory per requested output (e.g. markdown/sample.md, docx/sample.docx).
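The poll loop is the same in any client; a minimal sketch, where `fetch` is any callable returning the job JSON (for example, a GET on /api/v1/jobs/:id with your key):

```python
import time

TERMINAL = {"DONE", "ERROR"}

def poll_job(fetch, interval=2.0, timeout=300.0):
    """Call fetch() until the job reaches DONE or ERROR, then return the job dict."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = fetch()
        if job["status"] in TERMINAL:
            return job
        time.sleep(interval)
    raise TimeoutError(f"job still running after {timeout:.0f}s")
```

On a DONE result, follow `download_url`; on ERROR, read `error_code` and `error_message` from the same payload.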
Chunks (RAG)
Structure-aware chunks for RAG. Drop-in JSONL with breadcrumbs and heading paths included — not bolted on after the fact. Bring your own embedding model.
Each chunk carries its position in the document hierarchy (breadcrumb, heading_path), its type (text, table, etc.), and the strategy that produced it. Document type is detected automatically — no parameter tuning required.
Request
POST /api/v1/jobs
Authorization: Bearer dpk_live_...
Content-Type: multipart/form-data
file: <binary> # the document
outputs[]: chunks
Output: chunks/chunks.jsonl
First line is a manifest, subsequent lines are one chunk each.
{"_manifest": true, "schema_version": 1, "source_filename": "paper.pdf",
"chunk_count": 42, "strategy": "general", "created_at": "2026-04-30T13:36:00Z"}
{"id": "8f6c...", "text": "...", "chunk_index": 0, "token_count": 312,
"breadcrumb": "Chapter 2 > Section 2.1",
"heading_path": ["Chapter 2", "Section 2.1"],
"chunk_type": "text", "source_filename": "paper.pdf",
"strategy_used": "general", "metadata": {}}
Schema
| Field | Type | Notes |
|---|---|---|
| id | string | uuid4; stable within a job, regenerated on re-runs |
| text | string | Chunk content (markdown) |
| chunk_index | int | 0-based position in the document |
| token_count | int | Estimated tokens |
| breadcrumb | string | e.g. "Chapter 2 > Section 2.1" |
| heading_path | string[] | Ancestor headings, top to bottom |
| chunk_type | enum | text, table, whole_doc, cross_reference, formula_annotation |
| source_filename | string | Original upload filename |
| strategy_used | string | general, manual, table, one, spreadsheet_advanced |
| metadata | object | Strategy-specific extras (opaque) |
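Reading the file back is a two-step split: peel off the manifest line, then parse one chunk per remaining line. A minimal sketch (the helper name is illustrative, not part of the API):

```python
import json

def read_chunks(fp):
    """Return (manifest, chunks) from a chunks.jsonl stream."""
    first = json.loads(fp.readline())
    # The manifest is the first line; tolerate files that omit it.
    manifest, chunks = (first, []) if first.get("_manifest") else (None, [first])
    chunks += [json.loads(line) for line in fp if line.strip()]
    return manifest, chunks
```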
Embed and ingest (OpenAI + Chroma)
import json, openai, chromadb

client = openai.OpenAI()
col = chromadb.PersistentClient("./db").get_or_create_collection("docs")

for line in open("chunks/chunks.jsonl"):
    r = json.loads(line)
    if r.get("_manifest"):
        continue
    vec = client.embeddings.create(
        input=r["text"], model="text-embedding-3-small"
    ).data[0].embedding
    col.add(
        ids=[r["id"]],
        embeddings=[vec],
        documents=[r["text"]],
        metadatas=[{"breadcrumb": r["breadcrumb"], "source": r["source_filename"]}],
    )
Schema is v1 and additive-only — new fields may appear, existing fields will not be renamed or removed without a version bump. Chunk ids are not stable across re-runs of the same job; if you re-ingest, dedupe on (source_filename, chunk_index).
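Since ids change between runs, you can derive a deterministic key from (source_filename, chunk_index) so a re-ingest upserts instead of duplicating. A minimal sketch:

```python
import hashlib

def stable_chunk_id(chunk):
    """Deterministic id for dedup across re-runs of the same document."""
    key = f"{chunk['source_filename']}#{chunk['chunk_index']}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]
```

Use this as the vector-store id instead of the chunk's own `id` field when re-ingesting the same document.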
Error codes
| HTTP | Code | Meaning |
|---|---|---|
| 400 | MISSING_FILE | Request did not include a 'file' field |
| 400 | BAD_JSON | Body is malformed JSON |
| 401 | UNAUTHENTICATED | Missing, malformed, or revoked API key |
| 402 | INSUFFICIENT_CREDITS | Account balance below estimate (set on the job after classification) |
| 403 | FORBIDDEN | Job belongs to another account |
| 404 | NOT_FOUND | Job not found |
| 413 | TOO_LARGE | File exceeds 32 MB |
| 415 | BAD_EXTENSION | Extension not in allowlist |
| 415 | BAD_MAGIC | Detected MIME does not match an allowed type |
| 422 | PAGE_LIMIT | PDF exceeds 600 pages |
| 422 | ZIP_BOMB | Office archive ratio or uncompressed size exceeds limits |
| 429 | RATE_LIMITED | Per-key rate limit exceeded (60 req/min) |
Operational errors discovered during conversion land on the job record as status=ERROR with a descriptive error_code and error_message. Any credits already charged for the job are automatically refunded.
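Of the codes above, only 429 is worth retrying; a sketch of client-side jittered exponential backoff, assuming your HTTP layer raises a RateLimited exception (a hypothetical name) on that status:

```python
import random
import time

class RateLimited(Exception):
    """Hypothetical: raised by your HTTP layer on a 429 RATE_LIMITED response."""

def with_backoff(call, max_attempts=5, base=0.5, sleep=time.sleep):
    """Retry call() on RateLimited, waiting base*2^attempt seconds plus jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise
            sleep(base * 2 ** attempt + random.uniform(0, 0.25))
```

The 4xx validation errors (bad extension, page limit, insufficient credits) are deterministic; retrying them without changing the request just burns rate limit.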
Example (curl)
# upload + auto-charge + auto-process
curl -sS -X POST https://docparser.app/api/v1/jobs \
-H "Authorization: Bearer dpk_live_..." \
-F "file=@./contract.pdf" \
-F "outputs[]=markdown" \
-F "outputs[]=docx"
# poll until done
JOB_ID=...
while true; do
STATUS=$(curl -sS https://.../api/v1/jobs/$JOB_ID \
-H "Authorization: Bearer dpk_live_..." | jq -r .status)
[[ "$STATUS" == "DONE" || "$STATUS" == "ERROR" ]] && break
sleep 2
done
# fetch the result
curl -sS https://.../api/v1/jobs/$JOB_ID \
-H "Authorization: Bearer dpk_live_..." | jq -r .download_url | \
xargs curl -sS -o output.zip