Document types, formats, and the intermediate representation (DocumentIR)
Doclo processes documents through a standardized intermediate representation called DocumentIR. This decouples parsing from extraction, allowing you to use different providers for each step.
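To make the decoupling concrete, here is a minimal sketch of the idea. It does not use Doclo's actual API: `SimpleIR`, `Parser`, `Extractor`, `plainTextParser`, `keyValueExtractor`, and `runPipeline` are all hypothetical names invented for illustration, and `SimpleIR` is a deliberately reduced stand-in for DocumentIR.

```typescript
// Hypothetical, simplified IR for illustration; the real DocumentIR has more fields.
interface SimpleIR {
  pages: { lines: { text: string }[] }[];
}

// Step 1: a parser turns raw input into the shared representation.
type Parser = (raw: string) => SimpleIR;
// Step 2: an extractor reads only the shared representation.
type Extractor<T> = (ir: SimpleIR) => T;

// Toy parser: treats the input as a single page, one IR line per text line.
const plainTextParser: Parser = (raw) => ({
  pages: [{ lines: raw.split("\n").map((text) => ({ text })) }],
});

// Toy extractor: collects lines that look like "Key: value" pairs.
const keyValueExtractor: Extractor<Record<string, string>> = (ir) => {
  const fields: Record<string, string> = {};
  for (const page of ir.pages) {
    for (const line of page.lines) {
      const idx = line.text.indexOf(":");
      if (idx > 0) {
        fields[line.text.slice(0, idx).trim()] = line.text.slice(idx + 1).trim();
      }
    }
  }
  return fields;
};

// Any parser can be paired with any extractor because both agree on SimpleIR.
function runPipeline<T>(raw: string, parse: Parser, extract: Extractor<T>): T {
  return extract(parse(raw));
}

const pipelineResult = runPipeline("Invoice\nTotal: 10 USD", plainTextParser, keyValueExtractor);
console.log(pipelineResult); // { Total: '10 USD' }
```

Because the extractor depends only on the intermediate shape, swapping in a different parser would not require changing the extraction step.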
Doclo accepts documents in many formats. Provider support varies:
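| Format | Extension | Datalab | Mistral | Reducto | Unsiloed | OpenAI | Anthropic | Google | xAI |
|---|---|---|---|---|---|---|---|---|---|
| PDF | .pdf | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| JPEG | .jpg, .jpeg | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| PNG | .png | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| WebP | .webp | Yes | Yes | Yes | - | Yes | Yes | Yes | - |
| GIF | .gif | Yes | Yes | Yes | - | Yes | Yes | Yes | - |
| TIFF | .tiff, .tif | Yes | Yes | - | Yes | - | - | Yes | - |
| BMP | .bmp | - | Yes | Yes | - | - | - | Yes | - |
| HEIC | .heic, .heif | - | Yes | Yes | - | - | - | Yes | - |
| AVIF | .avif | - | Yes | - | - | - | - | - | - |
| PSD | .psd | - | - | Yes | - | - | - | - | - |
| DOCX | .docx | Yes | Yes | Yes | Yes | - | - | - | Yes |
| DOC | .doc | Yes | - | - | - | - | - | - | - |
| XLSX | .xlsx | - | - | Yes | Yes | - | - | - | - |
| XLS | .xls | - | - | - | - | - | - | - | - |
| PPTX | .pptx | - | Yes | Yes | Yes | - | - | - | - |
| ODT | .odt | Yes | Yes | - | - | - | - | - | - |
| ODS | .ods | Yes | - | - | - | - | - | - | - |
| ODP | .odp | Yes | - | - | - | - | - | - | - |
| HTML | .html, .htm | Yes | - | - | - | - | - | - | - |
| TXT | .txt | - | Yes | Yes | - | - | - | - | Yes |
| CSV | .csv | - | - | Yes | - | - | - | - | Yes |
| RTF | .rtf | - | Yes | Yes | - | - | - | - | - |
| EPUB | .epub | Yes | Yes | - | - | - | - | - | - |
| MD | .md | - | - | - | - | - | - | - | Yes |
| LaTeX | .tex | - | Yes | - | - | - | - | - | - |
| Jupyter | .ipynb | - | Yes | - | - | - | - | - | - |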
Vision-language model (VLM) providers accept images and PDFs directly, though image format coverage varies by provider (see table). xAI also accepts DOCX, TXT, CSV, and MD files natively. Mistral OCR has the widest format support, including LaTeX and Jupyter notebooks. For other Office documents and text formats, run the file through an OCR provider first.
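The routing rule above can be sketched as a small lookup. This is an illustrative helper, not part of Doclo: `SUPPORT`, `extensionOf`, and `planRoute` are hypothetical names, the map copies only a few rows from the table, and the split into OCR providers (Datalab, Mistral, Reducto, Unsiloed) and VLM providers (OpenAI, Anthropic, Google, xAI) is an assumption based on how the table and the paragraph above group them.

```typescript
type Provider =
  | "datalab" | "mistral" | "reducto" | "unsiloed"
  | "openai" | "anthropic" | "google" | "xai";

// Assumed grouping of the table columns (see lead-in).
const OCR_PROVIDERS: Provider[] = ["datalab", "mistral", "reducto", "unsiloed"];
const VLM_PROVIDERS: Provider[] = ["openai", "anthropic", "google", "xai"];

// A few rows copied from the support table; extend as needed.
const SUPPORT: Record<string, Provider[]> = {
  pdf: [...OCR_PROVIDERS, ...VLM_PROVIDERS],
  docx: ["datalab", "mistral", "reducto", "unsiloed", "xai"],
  xlsx: ["reducto", "unsiloed"],
  pptx: ["mistral", "reducto", "unsiloed"],
  csv: ["reducto", "xai"],
  html: ["datalab"],
  tex: ["mistral"],
};

// Lowercased extension without the dot, or "" when there is none.
function extensionOf(filename: string): string {
  const dot = filename.lastIndexOf(".");
  return dot === -1 ? "" : filename.slice(dot + 1).toLowerCase();
}

// Split the supporting providers into "send directly to a VLM" and "parse with OCR first".
function planRoute(filename: string): { vlm: Provider[]; ocr: Provider[] } {
  const supported = SUPPORT[extensionOf(filename)] ?? [];
  return {
    vlm: supported.filter((p) => VLM_PROVIDERS.includes(p)),
    ocr: supported.filter((p) => OCR_PROVIDERS.includes(p)),
  };
}
```

With this map, `planRoute("slides.pptx")` returns an empty `vlm` list and three OCR providers, which matches the advice to parse Office formats with an OCR provider before extraction.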
When using URL input, the URL must be publicly accessible. Doclo’s servers fetch the document directly, so URLs behind authentication or private networks will fail.
For sensitive documents, use signed URLs (pre-signed S3 URLs, GCS signed URLs, Azure SAS URLs) that provide temporary public access. Alternatively, use base64 encoding to pass the document content directly without exposing a URL.
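As a rough sketch of that choice, the helper below falls back to base64 when a URL's hostname is obviously private. It assumes a Node.js runtime (for `Buffer`). `DocInput`, `looksPrivate`, `toBase64`, and `chooseInput` are hypothetical names, and the input shape is an assumption rather than Doclo's documented request format. The hostname check is only a heuristic: it cannot detect URLs that are public but require authentication.

```typescript
// Hypothetical input shape; check the Doclo API reference for the real one.
type DocInput =
  | { kind: "url"; url: string }
  | { kind: "base64"; data: string };

// Heuristic: true for hostnames that a remote server clearly cannot reach.
// Throws if `url` is not a valid absolute URL.
function looksPrivate(url: string): boolean {
  const host = new URL(url).hostname;
  return (
    host === "localhost" ||
    host.endsWith(".local") ||
    host.endsWith(".internal") ||
    /^127\./.test(host) ||
    /^10\./.test(host) ||
    /^192\.168\./.test(host) ||
    /^172\.(1[6-9]|2\d|3[01])\./.test(host)
  );
}

// Encode raw document bytes as a base64 string (Node.js Buffer).
function toBase64(bytes: Uint8Array): string {
  return Buffer.from(bytes).toString("base64");
}

// Prefer the URL when it looks publicly reachable; otherwise send the content itself.
function chooseInput(url: string, fallbackBytes: Uint8Array): DocInput {
  return looksPrivate(url)
    ? { kind: "base64", data: toBase64(fallbackBytes) }
    : { kind: "url", url };
}
```

A signed URL on a public storage host passes this check unchanged, so it would still be sent as a URL.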
DocumentIR tracks metadata about the parsing process:
    // Access document-level metadata
    console.log(documentIR.extras?.pageCount); // Total pages in document
    console.log(documentIR.extras?.costUSD); // Processing cost

    // Use character offsets for citations
    const line = documentIR.pages[0].lines[0];
    console.log(`Characters ${line.startChar}-${line.endChar}: ${line.text}`);

    // For chunked documents
    console.log(documentIR.extras?.chunkIndex); // 0, 1, 2...
    console.log(documentIR.extras?.totalChunks); // Total chunk count
    console.log(documentIR.extras?.pageRange); // [1, 5] - pages in this chunk
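Building on those fields, the following sketch shows two small helpers one might write on top of DocumentIR. The interfaces are a hypothetical subset inferred only from the properties used in the snippet above, not the library's full type definitions. `findLine` and `totalCost` are illustrative names, and treating `startChar` as inclusive and `endChar` as exclusive is an assumption.

```typescript
// Hypothetical subset of DocumentIR, inferred from the fields used above.
interface IRLine { text: string; startChar: number; endChar: number }
interface IRPage { lines: IRLine[] }
interface IRExtras {
  pageCount?: number;
  costUSD?: number;
  chunkIndex?: number;
  totalChunks?: number;
  pageRange?: [number, number];
}
interface DocumentIR { pages: IRPage[]; extras?: IRExtras }

// Locate the line that contains a character offset, for building a citation.
// Assumes startChar is inclusive and endChar is exclusive.
function findLine(
  ir: DocumentIR,
  offset: number,
): { pageIndex: number; line: IRLine } | undefined {
  for (let pageIndex = 0; pageIndex < ir.pages.length; pageIndex++) {
    for (const line of ir.pages[pageIndex].lines) {
      if (offset >= line.startChar && offset < line.endChar) {
        return { pageIndex, line };
      }
    }
  }
  return undefined;
}

// Sum the processing cost over all chunks of one document; missing costs count as 0.
function totalCost(chunks: DocumentIR[]): number {
  return chunks.reduce((sum, chunk) => sum + (chunk.extras?.costUSD ?? 0), 0);
}
```

For a chunked document, calling `totalCost` on the array of chunk-level DocumentIR objects gives a single figure for the whole file, while `findLine` maps an offset reported by an extraction step back to its source line.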