The parse node converts raw documents (PDFs, images) into Doclo’s intermediate representation (DocumentIR) with text content and layout information.

Basic Usage

import { createFlow, parse } from '@docloai/flows';
import { suryaProvider } from '@docloai/providers-datalab';

const ocrProvider = suryaProvider({
  endpoint: 'https://www.datalab.to/api/v1/ocr',
  apiKey: process.env.DATALAB_API_KEY!
});

const flow = createFlow()
  .step('parse', parse({ provider: ocrProvider }))
  .build();

// pdfDataUrl: the PDF encoded as a base64 data URL, prepared by the caller
const result = await flow.run({ base64: pdfDataUrl });
// result.output is a DocumentIR
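
The snippet above passes a pdfDataUrl value without showing where it comes from. As a minimal sketch, assuming the base64 input accepts a standard data URL of the form data:<mime>;base64,<payload> (the source does not spell out the exact shape), one could be built from raw bytes like this. The helper name toDataUrl is hypothetical and not part of Doclo:

```typescript
// Hypothetical helper: encode raw file bytes as a base64 data URL.
// Assumption: Doclo's `base64` input accepts this standard data-URL shape.
function toDataUrl(bytes: Uint8Array, mimeType: string = 'application/pdf'): string {
  // Build a binary string one byte at a time, then base64-encode it.
  // In Node, Buffer.from(bytes).toString('base64') yields the same payload.
  let binary = '';
  bytes.forEach((b) => {
    binary += String.fromCharCode(b);
  });
  return `data:${mimeType};base64,${btoa(binary)}`;
}

// The four "%PDF" header bytes stand in for a real file's contents.
const pdfDataUrl = toDataUrl(new Uint8Array([0x25, 0x50, 0x44, 0x46]));
console.log(pdfDataUrl); // data:application/pdf;base64,JVBERg==
```

In a real flow the bytes would come from reading the PDF file, and the resulting string would be passed as the base64 field of flow.run.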

Configuration Options

parse({
  provider: ocrProvider,           // Required: OCR or VLM provider
  format: 'text',                  // Output format: 'text' | 'markdown' | 'html'
  describeFigures: false,          // Describe charts/diagrams (VLM only)
  includeImages: false,            // Extract images (Surya/Marker only)
  citations: { enabled: true },    // Enable citation tracking
  consensus: { runs: 3 },          // Multi-run consensus
  chunked: {                       // Process large PDFs in chunks
    maxPagesPerChunk: 10,
    parallel: true
  }
})

Options Reference

Option                  Type                            Default   Description
provider                OCRProvider | VLMProvider       Required  Provider for parsing
format                  'text' | 'markdown' | 'html'    'text'    Output format
describeFigures         boolean                         false     Describe charts/diagrams (VLM only)
includeImages           boolean                         false     Extract embedded images
citations               CitationConfig                  -         Citation tracking configuration
consensus               ConsensusConfig                 -         Multi-run voting for accuracy
chunked                 object                          -         Large document chunking
reasoning               object                          -         Extended reasoning (VLM only)
additionalInstructions  string                          -         Custom parsing guidance

Output Format

Text Format (Default)

Line-by-line output with position data:
parse({
  provider: ocrProvider,
  format: 'text'
})
Best for: Maximum precision, citation tracking, numeric data.

Markdown Format

Preserves document structure (tables, headers, lists):
parse({
  provider: vlmProvider,
  format: 'markdown'
})
Best for: Structured documents, reports, forms with tables.

HTML Format

Rich formatting with semantic structure:
parse({
  provider: vlmProvider,
  format: 'html'
})
Best for: Complex layouts, multi-column documents.

Provider Types

OCR Provider

Use for text-heavy documents requiring high accuracy:
import { suryaProvider } from '@docloai/providers-datalab';

const ocrProvider = suryaProvider({
  endpoint: 'https://www.datalab.to/api/v1/ocr',
  apiKey: process.env.DATALAB_API_KEY!
});

parse({ provider: ocrProvider })

VLM Provider

Use for visual documents or when you need structure detection:
import { createVLMProvider } from '@docloai/providers-llm';

const vlmProvider = createVLMProvider({
  provider: 'google',
  model: 'google/gemini-flash-2.5',
  apiKey: process.env.OPENROUTER_API_KEY!,
  via: 'openrouter'
});

parse({
  provider: vlmProvider,
  format: 'markdown',
  describeFigures: true
})

Large Document Handling

For PDFs with many pages, use chunking to avoid timeouts and memory issues:
parse({
  provider: ocrProvider,
  chunked: {
    maxPagesPerChunk: 10,  // Pages per chunk
    overlap: 0,            // Page overlap between chunks
    parallel: true         // Process chunks in parallel
  }
})
The output combines all chunks into a single DocumentIR.
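
To make the interaction between maxPagesPerChunk and overlap concrete, here is a sketch of the page-range arithmetic such a chunker might use. It illustrates the general technique only; it is not Doclo's implementation, and planChunks and PageRange are hypothetical names:

```typescript
// Hypothetical illustration of chunk planning; not Doclo's internal logic.
interface PageRange {
  start: number; // 1-based, inclusive
  end: number;   // 1-based, inclusive
}

function planChunks(totalPages: number, maxPagesPerChunk: number, overlap: number = 0): PageRange[] {
  if (maxPagesPerChunk < 1) {
    throw new Error('maxPagesPerChunk must be at least 1');
  }
  if (overlap < 0 || overlap >= maxPagesPerChunk) {
    throw new Error('overlap must be in the range [0, maxPagesPerChunk)');
  }
  const ranges: PageRange[] = [];
  // Each new chunk starts `stride` pages after the previous one,
  // so consecutive chunks share `overlap` pages.
  const stride = maxPagesPerChunk - overlap;
  for (let start = 1; start <= totalPages; start += stride) {
    const end = Math.min(start + maxPagesPerChunk - 1, totalPages);
    ranges.push({ start, end });
    if (end === totalPages) break; // last page reached, nothing left to chunk
  }
  return ranges;
}

const describe = (ranges: PageRange[]): string =>
  ranges.map((r) => `${r.start}-${r.end}`).join(', ');

console.log(describe(planChunks(25, 10)));    // 1-10, 11-20, 21-25
console.log(describe(planChunks(25, 10, 2))); // 1-10, 9-18, 17-25
```

With overlap set to 0, as in the configuration above, the chunks are disjoint; a non-zero overlap repeats pages at chunk boundaries, which can help when content spans a page break but means a merge step has to deduplicate the repeated pages.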

Citation Tracking

Enable line-level citations for source tracking:
parse({
  provider: ocrProvider,
  format: 'text',
  citations: {
    enabled: true
  }
})
Each line in the output includes a lineId (e.g., p1_l5 for page 1, line 5) that can be referenced during extraction.
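
A downstream step may need to map a lineId back to its page and line numbers. The sketch below assumes every lineId follows the p<page>_l<line> pattern of the single example given above; the source does not guarantee that format, and parseLineId is a hypothetical helper, not a Doclo export:

```typescript
// Hypothetical helper; assumes lineIds always look like "p<page>_l<line>".
interface LineRef {
  page: number;
  line: number;
}

function parseLineId(lineId: string): LineRef | null {
  const match = /^p(\d+)_l(\d+)$/.exec(lineId);
  if (!match) return null; // unknown format: let the caller decide what to do
  return { page: Number(match[1]), line: Number(match[2]) };
}

console.log(parseLineId('p1_l5'));   // { page: 1, line: 5 }
console.log(parseLineId('line-17')); // null
```

Returning null rather than throwing keeps the helper safe to use if a provider emits identifiers in a different shape.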

Extended Reasoning

For VLM providers that support it, enable extended reasoning:
parse({
  provider: vlmProvider,
  format: 'markdown',
  reasoning: {
    enabled: true,
    effort: 'medium'  // 'low' | 'medium' | 'high'
  }
})

Custom Instructions

Use additionalInstructions to pass custom parsing guidance to the provider:
parse({
  provider: vlmProvider,
  format: 'markdown',
  additionalInstructions: 'Pay special attention to preserving table structures and footnotes.'
})

Output: DocumentIR

The parse node outputs a DocumentIR object:
interface DocumentIR {
  pages: Array<{
    lines: Array<{
      text: string;
      bbox: { x: number; y: number; w: number; h: number };
      lineId?: string;  // For citations
    }>;
    width: number;
    height: number;
    markdown?: string;  // If format: 'markdown'
    html?: string;      // If format: 'html'
  }>;
  extras?: {
    providerType: 'ocr' | 'vlm';
  };
}
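
As a sketch of how this structure might be consumed, the example below redeclares the interface locally so it runs on its own, then reads a page's content and resolves a cited line. The helpers pageContent and findLine are hypothetical, and the fallback order (markdown, then html, then joined line text) is an assumption rather than documented behavior:

```typescript
// Local copy of the shape shown above, so the sketch is self-contained.
interface Line {
  text: string;
  bbox: { x: number; y: number; w: number; h: number };
  lineId?: string;
}
interface Page {
  lines: Line[];
  width: number;
  height: number;
  markdown?: string;
  html?: string;
}
interface DocumentIR {
  pages: Page[];
  extras?: { providerType: 'ocr' | 'vlm' };
}

// Hypothetical helper: prefer the structured rendering when present,
// otherwise fall back to the plain line text.
function pageContent(page: Page): string {
  return page.markdown ?? page.html ?? page.lines.map((l) => l.text).join('\n');
}

// Hypothetical helper: locate a cited line anywhere in the document.
function findLine(doc: DocumentIR, lineId: string): Line | undefined {
  for (const page of doc.pages) {
    const hit = page.lines.find((l) => l.lineId === lineId);
    if (hit) return hit;
  }
  return undefined;
}

// Hand-written sample data for illustration only.
const samplePage: Page = {
  width: 612,
  height: 792,
  lines: [
    { text: 'Invoice #1001', bbox: { x: 72, y: 72, w: 200, h: 14 }, lineId: 'p1_l1' },
    { text: 'Total: $42.00', bbox: { x: 72, y: 96, w: 180, h: 14 }, lineId: 'p1_l2' },
  ],
};
const sampleDoc: DocumentIR = { pages: [samplePage], extras: { providerType: 'ocr' } };

console.log(pageContent(samplePage));              // the two lines joined by a newline
console.log(findLine(sampleDoc, 'p1_l2')?.text);   // Total: $42.00
```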

Next Steps