Doclo processes documents through a standardized intermediate representation called DocumentIR. This decouples parsing from extraction, allowing you to use different providers for each step.
Doclo accepts documents in many formats. Provider support varies:
| Format | Extension | Datalab | Mistral | Reducto | Unsiloed | OpenAI | Anthropic | Google | xAI |
|---|---|---|---|---|---|---|---|---|---|
| PDF | .pdf | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| JPEG | .jpg, .jpeg | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| PNG | .png | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| WebP | .webp | Yes | Yes | Yes | - | Yes | Yes | Yes | - |
| GIF | .gif | Yes | Yes | Yes | - | Yes | Yes | Yes | - |
| TIFF | .tiff, .tif | Yes | Yes | - | Yes | - | - | Yes | - |
| BMP | .bmp | - | Yes | Yes | - | - | - | Yes | - |
| HEIC | .heic, .heif | - | Yes | Yes | - | - | - | Yes | - |
| AVIF | .avif | - | Yes | - | - | - | - | - | - |
| PSD | .psd | - | - | Yes | - | - | - | - | - |
| DOCX | .docx | Yes | Yes | Yes | Yes | - | - | - | Yes |
| DOC | .doc | Yes | - | - | - | - | - | - | - |
| XLSX | .xlsx | - | - | Yes | Yes | - | - | - | - |
| XLS | .xls | - | - | - | - | - | - | - | - |
| PPTX | .pptx | - | Yes | Yes | Yes | - | - | - | - |
| ODT | .odt | Yes | Yes | - | - | - | - | - | - |
| ODS | .ods | Yes | - | - | - | - | - | - | - |
| ODP | .odp | Yes | - | - | - | - | - | - | - |
| HTML | .html, .htm | Yes | - | - | - | - | - | - | - |
| TXT | .txt | - | Yes | Yes | - | - | - | - | Yes |
| CSV | .csv | - | - | Yes | - | - | - | - | Yes |
| RTF | .rtf | - | Yes | Yes | - | - | - | - | - |
| EPUB | .epub | Yes | Yes | - | - | - | - | - | - |
| MD | .md | - | - | - | - | - | - | - | Yes |
| LaTeX | .tex | - | Yes | - | - | - | - | - | - |
| Jupyter | .ipynb | - | Yes | - | - | - | - | - | - |
VLM providers support images and PDFs directly, with some variation by provider (see the table above). xAI also supports DOCX, TXT, CSV, and MD files natively. Mistral OCR has the widest format support, including LaTeX and Jupyter notebooks. For other Office documents and text formats, use an OCR provider first.
Pass documents to flows using any of these methods:
// Base64 data URL (most common)
await flow.run({
  base64: 'data:application/pdf;base64,JVBERi0xLjQK...'
});

// HTTP/HTTPS URL
await flow.run({
  url: 'https://example.com/invoice.pdf'
});

// Auto-detection (string input)
await flow.run('data:application/pdf;base64,...');
await flow.run('https://example.com/doc.pdf');
When using URL input, the URL must be publicly accessible. Doclo’s servers fetch the document directly, so URLs behind authentication or private networks will fail. For sensitive documents, use signed URLs (pre-signed S3 URLs, GCS signed URLs, Azure SAS URLs) that provide temporary public access. Alternatively, use base64 encoding to pass the document content directly without exposing a URL.
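For instance, a short-lived S3 pre-signed URL can be generated and passed as the url input. This is only a sketch; it assumes the AWS SDK v3 packages @aws-sdk/client-s3 and @aws-sdk/s3-request-presigner plus a bucket and key of your own:

```typescript
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';
import { getSignedUrl } from '@aws-sdk/s3-request-presigner';

// Create a URL that is publicly fetchable for 15 minutes only
const s3 = new S3Client({ region: 'us-east-1' });
const command = new GetObjectCommand({ Bucket: 'my-bucket', Key: 'invoices/invoice.pdf' });
const signedUrl = await getSignedUrl(s3, command, { expiresIn: 900 });

// Doclo fetches the document through the temporary URL
await flow.run({ url: signedUrl });
```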
Converting Files to Base64
Use the bufferToDataUri utility from @doclo/core:
import fs from 'fs';
import { bufferToDataUri, detectDocumentType } from '@doclo/core';
// Read file and convert to data URI
const buffer = fs.readFileSync('./invoice.pdf');
const mimeType = detectDocumentType('./invoice.pdf'); // 'application/pdf'
const dataUri = bufferToDataUri(buffer, mimeType);
await flow.run({ base64: dataUri });
Or use a simple helper for Node.js:
import fs from 'fs';

function fileToBase64(filePath: string): string {
  const buffer = fs.readFileSync(filePath);
  const base64 = buffer.toString('base64');

  const ext = filePath.split('.').pop()?.toLowerCase();
  const mimeTypes: Record<string, string> = {
    pdf: 'application/pdf',
    png: 'image/png',
    jpg: 'image/jpeg',
    jpeg: 'image/jpeg',
    webp: 'image/webp'
  };
  const mimeType = mimeTypes[ext || ''] || 'application/octet-stream';

  return `data:${mimeType};base64,${base64}`;
}
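Then pass the result straight to a flow (the file path is just an example):

```typescript
await flow.run({ base64: fileToBase64('./invoice.pdf') });
```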
DocumentIR is Doclo’s standard format for representing parsed documents. It preserves structure and layout, and it enables citation tracking.
Structure
DocumentIR uses a page-centric format:
type DocumentIR = {
  pages: IRPage[];
  extras?: DocumentIRExtras;
};

type IRPage = {
  pageNumber?: number;               // 1-indexed page number
  width: number;                     // Page width in pixels
  height: number;                    // Page height in pixels
  lines: IRLine[];                   // Text lines with positions
  markdown?: string;                 // Optional markdown representation
  html?: string;                     // Optional HTML representation
  extras?: Record<string, unknown>;
};

type IRLine = {
  text: string;                      // The text content
  bbox: BBox;                        // Bounding box { x, y, w, h }
  startChar?: number;                // Character offset (for citations)
  endChar?: number;
  lineId?: string;                   // Unique identifier (e.g., "p1_l5")
};

type BBox = {
  x: number;                         // Left position
  y: number;                         // Top position
  w: number;                         // Width
  h: number;                         // Height
};

type DocumentIRExtras = {
  pageCount?: number;                // Total pages in original document
  costUSD?: number;                  // Processing cost
  chunkIndex?: number;               // For chunked docs: which chunk (0-indexed)
  totalChunks?: number;              // For chunked docs: total chunks
  pageRange?: [number, number];      // Page range [start, end] (1-indexed)
  [key: string]: unknown;            // Arbitrary additional fields
};
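For example, a one-page invoice parsed by an OCR provider might yield an IR shaped like this (the values are illustrative only, not the output of any particular provider):

```typescript
const documentIR: DocumentIR = {
  pages: [
    {
      pageNumber: 1,
      width: 1240,
      height: 1754,
      lines: [
        {
          text: 'INVOICE #1042',
          bbox: { x: 80, y: 60, w: 320, h: 28 },
          startChar: 0,
          endChar: 13,
          lineId: 'p1_l1'
        },
        {
          text: 'Total due: $1,250.00',
          bbox: { x: 80, y: 1580, w: 280, h: 24 },
          startChar: 14,
          endChar: 34,
          lineId: 'p1_l2'
        }
      ],
      markdown: '# INVOICE #1042\n\nTotal due: $1,250.00'
    }
  ],
  extras: { pageCount: 1, costUSD: 0.003 }
};
```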
Content Formats
DocumentIR supports multiple output formats:
- Plain text: Line-by-line OCR output with spatial coordinates
- Markdown: Structured documents with tables, headers, lists preserved
- HTML: Rich formatting with tables and semantic structure
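For example, downstream code can prefer the richer formats when a provider supplies them and fall back to joining the positional lines otherwise (a small sketch against the IRPage type above):

```typescript
function pageToText(page: IRPage): string {
  // Prefer markdown, then HTML, then reconstruct plain text from OCR lines in reading order
  if (page.markdown) return page.markdown;
  if (page.html) return page.html;
  return page.lines.map((line) => line.text).join('\n');
}
```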
Layout Preservation
DocumentIR preserves document structure through:
- Bounding boxes: Every line has (x, y, width, height) coordinates
- Reading order: Lines are ordered as they should be read
- Table structure: Markdown/HTML capture tables and columns
- Semantic structure: Headings, lists, and formatting preserved
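As one way to put the coordinates to work, the sketch below collects lines from the top band of a page, e.g. to inspect a letterhead (the 15% cutoff is an arbitrary choice):

```typescript
function headerLines(page: IRPage): IRLine[] {
  // Keep lines whose top edge falls within the top 15% of the page height
  const cutoff = page.height * 0.15;
  return page.lines.filter((line) => line.bbox.y < cutoff);
}
```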
Provenance Tracking
DocumentIR tracks metadata about the parsing process:
// Access document-level metadata
console.log(documentIR.extras?.pageCount); // Total pages in document
console.log(documentIR.extras?.costUSD); // Processing cost
// Use character offsets for citations
const line = documentIR.pages[0].lines[0];
console.log(`Characters ${line.startChar}-${line.endChar}: ${line.text}`);
// For chunked documents
console.log(documentIR.extras?.chunkIndex); // 0, 1, 2...
console.log(documentIR.extras?.totalChunks); // Total chunk count
console.log(documentIR.extras?.pageRange); // [1, 5] - pages in this chunk
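If your pipeline hands you the chunks of a large document individually, the chunk metadata is enough to stitch them back together. A minimal sketch, assuming you have collected every chunk's DocumentIR in an array:

```typescript
function mergeChunks(chunks: DocumentIR[]): DocumentIR {
  // Order chunks by chunkIndex, then concatenate their pages
  const ordered = [...chunks].sort(
    (a, b) => (a.extras?.chunkIndex ?? 0) - (b.extras?.chunkIndex ?? 0)
  );
  return {
    pages: ordered.flatMap((chunk) => chunk.pages),
    extras: { pageCount: ordered[0]?.extras?.pageCount }
  };
}
```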
When to Use OCR vs VLM
Choose your parsing approach based on document characteristics:
Use VLM Direct (No DocumentIR)
VLMs excel when layout context matters:
- Handwritten forms - VLMs understand spatial relationships between fields and handwriting
- Photos of documents - Receipts, whiteboards, ID cards captured by phone
- Varied layouts - When documents come in many different formats/structures
- Charts and diagrams - Visual elements that need interpretation
- Quick prototyping - Fastest path to get something working
const flow = createFlow()
  .step('extract', extract({ provider: vlmProvider, schema }))
  .build();
Pros: Faster (one API call), handles visual complexity, simpler setup
Cons: Higher cost per document, may miss dense text blocks
Use OCR → LLM (With DocumentIR)
OCR shines for text-heavy, structured documents:
- Clean PDFs - Invoices, contracts, reports with consistent formatting
- Dense text - Multi-page documents where accuracy matters
- RAG pipelines - When you need to store and search document content
- Agentic loops - Repeated queries against the same document
- Citation tracking - When you need to trace extracted values back to source lines
const flow = createFlow()
  .step('parse', parse({ provider: ocrProvider }))
  .step('extract', extract({ provider: llmProvider, schema, inputMode: 'ir' }))
  .build();
Pros: Maximum text accuracy, lower cost at scale, enables citations, reusable DocumentIR
Cons: Slower (two API calls), requires OCR provider setup
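Because the DocumentIR is kept around, an extracted value can be traced back to the lines it came from. A naive sketch, assuming you search the IR text yourself (Doclo’s own citation output may expose this differently):

```typescript
function findSourceLines(ir: DocumentIR, value: string): IRLine[] {
  // Return every line whose text contains the extracted value
  return ir.pages.flatMap((page) =>
    page.lines.filter((line) => line.text.includes(value))
  );
}

// e.g. locate where an extracted total appears, with its lineId and bounding box
const hits = findSourceLines(documentIR, '$1,250.00');
hits.forEach((line) => console.log(line.lineId, line.bbox, line.text));
```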
Document Lifecycle
- Raw Document: PDF, image, or Office document input
- DocumentIR: Parsed text with layout (optional - skipped with VLM direct)
- Structured JSON: Extracted data matching your schema
Next Steps