Doclo processes documents through a standardized intermediate representation called DocumentIR. This decouples parsing from extraction, allowing you to use different providers for each step.
Doclo accepts documents in many formats. Provider support varies:
| Format | Extension | Datalab | Mistral | Reducto | Unsiloed | OpenAI | Anthropic | Google | xAI |
|---|---|---|---|---|---|---|---|---|---|
| PDF | .pdf | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| JPEG | .jpg, .jpeg | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| PNG | .png | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| WebP | .webp | Yes | Yes | Yes | - | Yes | Yes | Yes | - |
| GIF | .gif | Yes | Yes | Yes | - | Yes | Yes | Yes | - |
| TIFF | .tiff, .tif | Yes | Yes | - | Yes | - | - | Yes | - |
| BMP | .bmp | - | Yes | Yes | - | - | - | Yes | - |
| HEIC | .heic, .heif | - | Yes | Yes | - | - | - | Yes | - |
| AVIF | .avif | - | Yes | - | - | - | - | - | - |
| PSD | .psd | - | - | Yes | - | - | - | - | - |
| DOCX | .docx | Yes | Yes | Yes | Yes | - | - | - | Yes |
| DOC | .doc | Yes | - | - | - | - | - | - | - |
| XLSX | .xlsx | - | - | Yes | Yes | - | - | - | - |
| XLS | .xls | - | - | - | - | - | - | - | - |
| PPTX | .pptx | - | Yes | Yes | Yes | - | - | - | - |
| ODT | .odt | Yes | Yes | - | - | - | - | - | - |
| ODS | .ods | Yes | - | - | - | - | - | - | - |
| ODP | .odp | Yes | - | - | - | - | - | - | - |
| HTML | .html, .htm | Yes | - | - | - | - | - | - | - |
| TXT | .txt | - | Yes | Yes | - | - | - | - | Yes |
| CSV | .csv | - | - | Yes | - | - | - | - | Yes |
| RTF | .rtf | - | Yes | Yes | - | - | - | - | - |
| EPUB | .epub | Yes | Yes | - | - | - | - | - | - |
| MD | .md | - | - | - | - | - | - | - | Yes |
| LaTeX | .tex | - | Yes | - | - | - | - | - | - |
| Jupyter | .ipynb | - | Yes | - | - | - | - | - | - |
VLM providers support images and PDFs directly, with some variation by provider (see the table above). xAI also supports DOCX, TXT, CSV, and MD files natively. Mistral OCR has the widest format support, including LaTeX and Jupyter notebooks. For other Office documents and text formats, use an OCR provider first.
Pass documents to flows using any of these methods:
// Base64 data URL (most common)
await flow.run({
  base64: 'data:application/pdf;base64,JVBERi0xLjQK...'
});

// HTTP/HTTPS URL
await flow.run({
  url: 'https://example.com/invoice.pdf'
});

// Auto-detection (string input)
await flow.run('data:application/pdf;base64,...');
await flow.run('https://example.com/doc.pdf');
When using URL input, the URL must be publicly accessible. Doclo’s servers fetch the document directly, so URLs behind authentication or private networks will fail. For sensitive documents, use signed URLs (pre-signed S3 URLs, GCS signed URLs, Azure SAS URLs) that provide temporary public access. Alternatively, use base64 encoding to pass the document content directly without exposing a URL.
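For instance, a short-lived S3 pre-signed URL can be generated and passed as the url input. This is only a sketch; it assumes the AWS SDK v3 packages @aws-sdk/client-s3 and @aws-sdk/s3-request-presigner plus a bucket and key of your own:

```typescript
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';
import { getSignedUrl } from '@aws-sdk/s3-request-presigner';

// Create a URL that is publicly fetchable for 15 minutes only
const s3 = new S3Client({ region: 'us-east-1' });
const command = new GetObjectCommand({ Bucket: 'my-bucket', Key: 'invoices/invoice.pdf' });
const signedUrl = await getSignedUrl(s3, command, { expiresIn: 900 });

// Doclo fetches the document through the temporary URL
await flow.run({ url: signedUrl });
```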
Converting Files to Base64
Use the bufferToDataUri utility from @doclo/core:
import fs from 'fs';
import { bufferToDataUri, detectDocumentType } from '@doclo/core';
// Read file and convert to data URI
const buffer = fs.readFileSync('./invoice.pdf');
const mimeType = detectDocumentType('./invoice.pdf'); // 'application/pdf'
const dataUri = bufferToDataUri(buffer, mimeType);
await flow.run({ base64: dataUri });
Or use a simple helper for Node.js:
import fs from 'fs';

function fileToBase64(filePath: string): string {
  const buffer = fs.readFileSync(filePath);
  const base64 = buffer.toString('base64');

  const ext = filePath.split('.').pop()?.toLowerCase();
  const mimeTypes: Record<string, string> = {
    pdf: 'application/pdf',
    png: 'image/png',
    jpg: 'image/jpeg',
    jpeg: 'image/jpeg',
    webp: 'image/webp'
  };
  const mimeType = mimeTypes[ext || ''] || 'application/octet-stream';

  return `data:${mimeType};base64,${base64}`;
}
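Then pass the result straight to a flow (the file path is just an example):

```typescript
await flow.run({ base64: fileToBase64('./invoice.pdf') });
```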
DocumentIR is Doclo’s standard format for representing parsed documents. It preserves structure and layout, and it enables citation tracking.
Structure
DocumentIR uses a page-centric format:
type DocumentIR = {
  pages: IRPage[];
  extras?: DocumentIRExtras;
};

type IRPage = {
  pageNumber?: number;               // 1-indexed page number
  width: number;                     // Page width in pixels
  height: number;                    // Page height in pixels
  lines: IRLine[];                   // Text lines with positions
  markdown?: string;                 // Optional markdown representation
  html?: string;                     // Optional HTML representation
  extras?: Record<string, unknown>;
};

type IRLine = {
  text: string;                      // The text content
  bbox: BBox;                        // Bounding box { x, y, w, h }
  startChar?: number;                // Character offset (for citations)
  endChar?: number;
  lineId?: string;                   // Unique identifier (e.g., "p1_l5")
};

type BBox = {
  x: number;                         // Left position
  y: number;                         // Top position
  w: number;                         // Width
  h: number;                         // Height
};

type DocumentIRExtras = {
  pageCount?: number;                // Total pages in original document
  costUSD?: number;                  // Processing cost
  chunkIndex?: number;               // For chunked docs: which chunk (0-indexed)
  totalChunks?: number;              // For chunked docs: total chunks
  pageRange?: [number, number];      // Page range [start, end] (1-indexed)
  [key: string]: unknown;            // Arbitrary additional fields
};
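For example, a one-page invoice parsed by an OCR provider might yield an IR shaped like this (the values are illustrative only, not the output of any particular provider):

```typescript
const documentIR: DocumentIR = {
  pages: [
    {
      pageNumber: 1,
      width: 1240,
      height: 1754,
      lines: [
        {
          text: 'INVOICE #1042',
          bbox: { x: 80, y: 60, w: 320, h: 28 },
          startChar: 0,
          endChar: 13,
          lineId: 'p1_l1'
        },
        {
          text: 'Total due: $1,250.00',
          bbox: { x: 80, y: 1580, w: 280, h: 24 },
          startChar: 14,
          endChar: 34,
          lineId: 'p1_l2'
        }
      ],
      markdown: '# INVOICE #1042\n\nTotal due: $1,250.00'
    }
  ],
  extras: { pageCount: 1, costUSD: 0.003 }
};
```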
Content Formats
DocumentIR supports multiple output formats:
- Plain text: Line-by-line OCR output with spatial coordinates
- Markdown: Structured documents with tables, headers, lists preserved
- HTML: Rich formatting with tables and semantic structure
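For example, downstream code can prefer the richer formats when a provider supplies them and fall back to joining the positional lines otherwise (a small sketch against the IRPage type above):

```typescript
function pageToText(page: IRPage): string {
  // Prefer markdown, then HTML, then reconstruct plain text from OCR lines in reading order
  if (page.markdown) return page.markdown;
  if (page.html) return page.html;
  return page.lines.map((line) => line.text).join('\n');
}
```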
Layout Preservation
DocumentIR preserves document structure through:
- Bounding boxes: Every line has (x, y, width, height) coordinates
- Reading order: Lines are ordered as they should be read
- Table structure: Markdown/HTML capture tables and columns
- Semantic structure: Headings, lists, and formatting preserved
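As one way to put the coordinates to work, the sketch below collects lines from the top band of a page, e.g. to inspect a letterhead (the 15% cutoff is an arbitrary choice):

```typescript
function headerLines(page: IRPage): IRLine[] {
  // Keep lines whose top edge falls within the top 15% of the page height
  const cutoff = page.height * 0.15;
  return page.lines.filter((line) => line.bbox.y < cutoff);
}
```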
Provenance Tracking
DocumentIR tracks metadata about the parsing process:
// Access document-level metadata
console.log(documentIR.extras?.pageCount); // Total pages in document
console.log(documentIR.extras?.costUSD); // Processing cost
// Use character offsets for citations
const line = documentIR.pages[0].lines[0];
console.log(`Characters ${line.startChar}-${line.endChar}: ${line.text}`);
// For chunked documents
console.log(documentIR.extras?.chunkIndex); // 0, 1, 2...
console.log(documentIR.extras?.totalChunks); // Total chunk count
console.log(documentIR.extras?.pageRange); // [1, 5] - pages in this chunk
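If your pipeline hands you the chunks of a large document individually, the chunk metadata is enough to stitch them back together. A minimal sketch, assuming you have collected every chunk's DocumentIR in an array:

```typescript
function mergeChunks(chunks: DocumentIR[]): DocumentIR {
  // Order chunks by chunkIndex, then concatenate their pages
  const ordered = [...chunks].sort(
    (a, b) => (a.extras?.chunkIndex ?? 0) - (b.extras?.chunkIndex ?? 0)
  );
  return {
    pages: ordered.flatMap((chunk) => chunk.pages),
    extras: { pageCount: ordered[0]?.extras?.pageCount }
  };
}
```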
When to Use OCR vs VLM
Choose your parsing approach based on document characteristics:
Use VLM Direct (No DocumentIR)
VLMs excel when layout context matters:
- Handwritten forms - VLMs understand spatial relationships between fields and handwriting
- Photos of documents - Receipts, whiteboards, ID cards captured by phone
- Varied layouts - When documents come in many different formats/structures
- Charts and diagrams - Visual elements that need interpretation
- Quick prototyping - Fastest path to get something working
const flow = createFlow()
  .step('extract', extract({ provider: vlmProvider, schema }))
  .build();
Pros: Faster (one API call), handles visual complexity, simpler setup
Cons: Higher cost per document, may miss dense text blocks
Use OCR → LLM (With DocumentIR)
OCR shines for text-heavy, structured documents:
- Clean PDFs - Invoices, contracts, reports with consistent formatting
- Dense text - Multi-page documents where accuracy matters
- RAG pipelines - When you need to store and search document content
- Agentic loops - Repeated queries against the same document
- Citation tracking - When you need to trace extracted values back to source lines
const flow = createFlow()
  .step('parse', parse({ provider: ocrProvider }))
  .step('extract', extract({ provider: llmProvider, schema, inputMode: 'ir' }))
  .build();
Pros: Maximum text accuracy, lower cost at scale, enables citations, reusable DocumentIR
Cons: Slower (two API calls), requires OCR provider setup
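Because the DocumentIR is kept around, an extracted value can be traced back to the lines it came from. A naive sketch, assuming you search the IR text yourself (Doclo’s own citation output may expose this differently):

```typescript
function findSourceLines(ir: DocumentIR, value: string): IRLine[] {
  // Return every line whose text contains the extracted value
  return ir.pages.flatMap((page) =>
    page.lines.filter((line) => line.text.includes(value))
  );
}

// e.g. locate where an extracted total appears, with its lineId and bounding box
const hits = findSourceLines(documentIR, '$1,250.00');
hits.forEach((line) => console.log(line.lineId, line.bbox, line.text));
```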
Document Lifecycle
- Raw Document: PDF, image, or Office document input
- DocumentIR: Parsed text with layout (optional - skipped with VLM direct)
- Structured JSON: Extracted data matching your schema
Next Steps