Parse Node

The parse node converts raw documents (PDFs, images) into Doclo’s intermediate representation (DocumentIR) with text content and layout information.

Basic Usage

import { createFlow, parse } from '@doclo/flows';
import { suryaProvider } from '@doclo/providers-datalab';

const ocrProvider = suryaProvider({
  endpoint: 'https://www.datalab.to/api/v1/ocr',
  apiKey: process.env.DATALAB_API_KEY!
});

const flow = createFlow()
  .step('parse', parse({ provider: ocrProvider }))
  .build();

// pdfDataUrl: the source PDF encoded as a base64 data URL
const result = await flow.run({ base64: pdfDataUrl });
// result.output is DocumentIR

Configuration Options

parse({
  provider: ocrProvider,           // Required: OCR or VLM provider
  format: 'text',                  // Output format: 'text' | 'markdown' | 'html'
  describeFigures: false,          // Describe charts/diagrams (VLM only)
  includeImages: false,            // Extract images (Surya/Marker only)
  citations: { enabled: true },    // Enable citation tracking
  consensus: { runs: 3 },          // Multi-run consensus
  chunked: {                       // Process large PDFs in chunks
    maxPagesPerChunk: 10,
    parallel: true
  }
})

Options Reference

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `provider` | `OCRProvider \| VLMProvider` | Required | Provider for parsing |
| `format` | `'text' \| 'markdown' \| 'html'` | `'text'` | Output format |
| `describeFigures` | `boolean` | `false` | Describe charts/diagrams (VLM only) |
| `includeImages` | `boolean` | `false` | Extract embedded images |
| `citations` | `CitationConfig` | - | Citation tracking configuration |
| `consensus` | `ConsensusConfig` | - | Multi-run voting for accuracy |
| `chunked` | `object` | - | Large document chunking |
| `reasoning` | `object` | - | Extended reasoning (VLM only) |
| `additionalInstructions` | `string` | - | Custom parsing guidance |

Output Format

Text Format (Default)

Line-by-line output with position data:

parse({
  provider: ocrProvider,
  format: 'text'
})

Best for: Maximum precision, citation tracking, numeric data.

Markdown Format

Preserves document structure (tables, headers, lists):

parse({
  provider: vlmProvider,
  format: 'markdown'
})

Best for: Structured documents, reports, forms with tables.

HTML Format

Rich formatting with semantic structure:

parse({
  provider: vlmProvider,
  format: 'html'
})

Best for: Complex layouts, multi-column documents.

Provider Types

OCR Provider

Use for text-heavy documents requiring high accuracy:

import { suryaProvider } from '@doclo/providers-datalab';

const ocrProvider = suryaProvider({
  endpoint: 'https://www.datalab.to/api/v1/ocr',
  apiKey: process.env.DATALAB_API_KEY!
});

parse({ provider: ocrProvider })

VLM Provider

Use for visual documents or when you need structure detection:

import { createVLMProvider } from '@doclo/providers-llm';

const vlmProvider = createVLMProvider({
  provider: 'google',
  model: 'google/gemini-2.5-flash',
  apiKey: process.env.OPENROUTER_API_KEY!,
  via: 'openrouter'
});

parse({
  provider: vlmProvider,
  format: 'markdown',
  describeFigures: true
})

Large Document Handling

For PDFs with many pages, use chunking to avoid timeouts and memory issues:

parse({
  provider: ocrProvider,
  chunked: {
    maxPagesPerChunk: 10,  // Pages per chunk
    overlap: 0,            // Page overlap between chunks
    parallel: true         // Process chunks in parallel
  }
})

The output combines all chunks into a single DocumentIR.
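
To make the interaction between maxPagesPerChunk and overlap concrete, here is a minimal sketch of how page ranges could be planned. The planChunks helper and the PageRange type are hypothetical names introduced for illustration; this is not Doclo's internal implementation, and it assumes 1-based page numbers and an overlap smaller than the chunk size.

```typescript
// Hypothetical illustration only: not part of the Doclo SDK.
interface PageRange {
  start: number; // first page of the chunk (1-based, inclusive)
  end: number;   // last page of the chunk (1-based, inclusive)
}

function planChunks(
  totalPages: number,
  maxPagesPerChunk: number,
  overlap = 0
): PageRange[] {
  if (maxPagesPerChunk < 1) {
    throw new Error('maxPagesPerChunk must be at least 1');
  }
  if (overlap < 0 || overlap >= maxPagesPerChunk) {
    throw new Error('overlap must be between 0 and maxPagesPerChunk - 1');
  }

  const ranges: PageRange[] = [];
  // Each new chunk starts `overlap` pages before the previous chunk ended.
  const stride = maxPagesPerChunk - overlap;

  for (let start = 1; start <= totalPages; start += stride) {
    const end = Math.min(start + maxPagesPerChunk - 1, totalPages);
    ranges.push({ start, end });
    if (end === totalPages) break; // the last page is covered, stop here
  }
  return ranges;
}
```

Under these assumptions, a 25-page PDF with maxPagesPerChunk: 10 and overlap: 0 yields three ranges (1-10, 11-20, 21-25). With overlap: 2, adjacent ranges share two pages (1-10, 9-18, 17-25).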

Citation Tracking

Enable line-level citations for source tracking:

parse({
  provider: ocrProvider,
  format: 'text',
  citations: {
    enabled: true
  }
})

Each line in the output includes a lineId (e.g., p1_l5 for page 1, line 5) that can be referenced during extraction.
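
If you need the page and line numbers behind a lineId, a small parser is enough. The sketch below assumes every lineId follows the p<page>_l<line> pattern from the example above; parseLineId and LineRef are hypothetical names, not SDK exports.

```typescript
// Hypothetical helper: assumes lineIds look like "p1_l5" (page 1, line 5).
interface LineRef {
  page: number;
  line: number;
}

function parseLineId(lineId: string): LineRef | null {
  const match = /^p(\d+)_l(\d+)$/.exec(lineId);
  if (!match) return null; // not in the assumed format
  return { page: Number(match[1]), line: Number(match[2]) };
}
```

Returning null instead of throwing lets callers skip IDs that do not match the assumed pattern.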

Extended Reasoning

For VLM providers that support it, enable extended reasoning:

parse({
  provider: vlmProvider,
  format: 'markdown',
  reasoning: {
    enabled: true,
    effort: 'medium'  // 'low' | 'medium' | 'high'
  }
})

Custom Instructions

Use additionalInstructions to pass free-form parsing guidance to the provider:

parse({
  provider: vlmProvider,
  format: 'markdown',
  additionalInstructions: 'Pay special attention to preserving table structures and footnotes.'
})

Output: DocumentIR

The parse node outputs a DocumentIR object:

interface DocumentIR {
  pages: Array<{
    lines: Array<{
      text: string;
      bbox: { x: number; y: number; w: number; h: number };
      lineId?: string;  // For citations
    }>;
    width: number;
    height: number;
    markdown?: string;  // If format: 'markdown'
    html?: string;      // If format: 'html'
  }>;
  extras?: {
    providerType: 'ocr' | 'vlm';
  };
}
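
As a rough sketch of how this structure can be consumed, the helpers below flatten a DocumentIR to plain text and look up a line by its lineId. The interfaces are a local copy of the shape above so the snippet is self-contained. documentToText and findLine are hypothetical helpers, and preferring markdown over the raw lines when both are present is an assumption, not documented behavior.

```typescript
// Local copy of the DocumentIR shape shown above.
interface Line {
  text: string;
  bbox: { x: number; y: number; w: number; h: number };
  lineId?: string;
}

interface Page {
  lines: Line[];
  width: number;
  height: number;
  markdown?: string;
  html?: string;
}

interface DocumentIR {
  pages: Page[];
  extras?: { providerType: 'ocr' | 'vlm' };
}

// Hypothetical helper: use the page's markdown when it exists,
// otherwise join the text of its lines.
function pageToText(page: Page): string {
  return page.markdown ?? page.lines.map((line) => line.text).join('\n');
}

// Hypothetical helper: concatenate all pages, separated by a blank line.
function documentToText(doc: DocumentIR): string {
  return doc.pages.map(pageToText).join('\n\n');
}

// Hypothetical helper: return the first line whose lineId matches.
function findLine(doc: DocumentIR, lineId: string): Line | undefined {
  for (const page of doc.pages) {
    const hit = page.lines.find((line) => line.lineId === lineId);
    if (hit) return hit;
  }
  return undefined;
}
```

findLine only returns results when citations are enabled, since lineId is optional on each line.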

Next Steps

extract

OCR Providers
