Extract Node

The extract node uses AI to extract structured data from documents according to a JSON Schema. It works with both raw documents (via a vision-language model, VLM) and parsed DocumentIR (via a text-only LLM).

Basic Usage

import { createFlow, extract } from '@doclo/flows';
import { createVLMProvider } from '@doclo/providers-llm';

const vlmProvider = createVLMProvider({
  provider: 'google',
  model: 'google/gemini-2.5-flash',
  apiKey: process.env.OPENROUTER_API_KEY!,
  via: 'openrouter'
});

const invoiceSchema = {
  type: 'object',
  properties: {
    invoiceNumber: { type: 'string' },
    totalAmount: { type: 'number' },
    date: { type: 'string' }
  },
  required: ['invoiceNumber', 'totalAmount']
};

const flow = createFlow()
  .step('extract', extract({
    provider: vlmProvider,
    schema: invoiceSchema
  }))
  .build();

const result = await flow.run({ base64: pdfDataUrl });
// result.output matches invoiceSchema
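
The example above passes a `pdfDataUrl` value without showing where it comes from. A minimal sketch of building one in Node.js follows. The helper name `toDataUrl` is hypothetical, and it assumes, based only on the variable name, that `flow.run` accepts a base64 data URL in the `base64` field.

```typescript
import { Buffer } from 'node:buffer';

// Hypothetical helper: encode raw bytes as a base64 data URL,
// e.g. "data:application/pdf;base64,JVBERi0x...".
function toDataUrl(bytes: Uint8Array, mimeType = 'application/pdf'): string {
  const base64 = Buffer.from(bytes).toString('base64');
  return `data:${mimeType};base64,${base64}`;
}

// Usage in Node.js (assumes a local file named invoice.pdf):
//   import { readFile } from 'node:fs/promises';
//   const pdfDataUrl = toDataUrl(await readFile('invoice.pdf'));
//   const result = await flow.run({ base64: pdfDataUrl });
```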

Configuration Options

extract({
  provider: vlmProvider,           // Required: VLM or LLM provider
  schema: invoiceSchema,           // Required: JSON Schema for output
  citations: { enabled: true },    // Enable source tracking
  consensus: { runs: 3 },          // Multi-run voting for accuracy
  reasoning: { enabled: true },    // Extended thinking (supported providers)
  additionalInstructions: '...'    // Custom extraction guidance
})

Options Reference

Option	Type	Default	Description
`provider`	`VLMProvider | LLMProvider`	Required	AI provider for extraction
`schema`	`object | { ref: string }`	Required	JSON Schema or registry reference
`inputMode`	`'auto' | 'ir' | 'ir+source' | 'source'`	`'auto'`	Controls what inputs the node ingests
`preferVisual`	`boolean`	`true`	In auto mode, prefer multimodal extraction when both DocumentIR and source are available
`useOriginalSource`	`boolean`	`false`	Use original unsplit document in forEach contexts
`citations`	`CitationConfig`	-	Citation/source tracking
`consensus`	`ConsensusConfig`	-	Multi-run voting configuration
`reasoning`	`object`	-	Extended reasoning options
`additionalInstructions`	`string`	-	Custom extraction guidance
`promptRef`	`string`	-	Reference to prompt asset
`promptVariables`	`object`	-	Variables for prompt rendering
`maxTokens`	`number`	-	Maximum tokens for LLM response

Input Mode

The inputMode option controls what input the extract node uses for extraction. This is one of the most important configuration options for optimizing accuracy and cost.

Mode Options

Mode	Description	Provider	Best For
`auto`	Automatically detect and route (default)	VLM or LLM	Most use cases
`ir`	Text-only from parsed DocumentIR	LLM or VLM	Cost-effective, text-heavy docs
`ir+source`	Both parsed text AND source images	VLM only	Maximum accuracy, complex layouts
`source`	Direct from raw document	VLM only	Simple docs without prior parsing
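
The provider column of the table can be expressed as a small lookup, which is handy for catching a mismatched configuration before a flow runs. This is a sketch only: `SUPPORTED_PROVIDERS`, `isModeSupported`, and the `'vlm' | 'llm'` labels are hypothetical names and are not part of the Doclo API.

```typescript
type InputMode = 'auto' | 'ir' | 'ir+source' | 'source';
type ProviderKind = 'vlm' | 'llm'; // hypothetical labels for the two provider types

// Provider kinds each mode accepts, transcribed from the table above.
const SUPPORTED_PROVIDERS: Record<InputMode, ProviderKind[]> = {
  'auto': ['vlm', 'llm'],
  'ir': ['llm', 'vlm'],
  'ir+source': ['vlm'],
  'source': ['vlm'],
};

// True when the table lists `provider` as usable with `mode`.
function isModeSupported(mode: InputMode, provider: ProviderKind): boolean {
  return SUPPORTED_PROVIDERS[mode].indexOf(provider) !== -1;
}
```

For example, `isModeSupported('ir+source', 'llm')` is `false`, because the hybrid mode needs a provider that can read images.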

Auto Mode (Default)

Auto mode selects the extraction path from the inputs that are available and the type of provider:

extract({
  provider: vlmProvider,
  schema: invoiceSchema,
  inputMode: 'auto',     // Default - automatically determines best mode
  preferVisual: true     // When both IR and source available, use ir+source
})

Auto mode decision tree:

If DocumentIR + source available + VLM provider + preferVisual: true → ir+source
If only DocumentIR available → ir
If only FlowInput (raw document) + VLM provider → source
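
The three rules above can be sketched as a pure function. This illustrates the documented rules and is not Doclo's internal routing code: `resolveAutoMode`, `AutoModeContext`, and its field names are hypothetical, and the function returns `undefined` for any combination the list above does not cover.

```typescript
type ResolvedMode = 'ir' | 'ir+source' | 'source';

// Hypothetical model of what auto mode looks at (names are assumptions).
interface AutoModeContext {
  hasIR: boolean;              // an earlier step produced DocumentIR
  hasSource: boolean;          // the raw document (FlowInput) is available
  providerKind: 'vlm' | 'llm';
  preferVisual: boolean;       // mirrors the `preferVisual` option
}

// Returns the mode for the three documented cases, or undefined otherwise.
function resolveAutoMode(ctx: AutoModeContext): ResolvedMode | undefined {
  const isVLM = ctx.providerKind === 'vlm';
  if (ctx.hasIR && ctx.hasSource && isVLM && ctx.preferVisual) return 'ir+source';
  if (ctx.hasIR && !ctx.hasSource) return 'ir';
  if (!ctx.hasIR && ctx.hasSource && isVLM) return 'source';
  return undefined; // not covered by the documented decision tree
}
```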

IR Mode (Text-Only)

Use parsed text only, ignoring visual context:

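// Assumes `parse` is imported alongside `extract` (presumably from '@doclo/flows')
// and that `ocrProvider` and `llmProvider` are configured elsewhere
const flow = createFlow()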
  .step('parse', parse({ provider: ocrProvider }))
  .step('extract', extract({
    provider: llmProvider,  // LLM is sufficient for text-only
    schema: invoiceSchema,
    inputMode: 'ir'
  }))
  .build();

Best for:

Text-heavy documents (contracts, reports)
Cost optimization (text-only LLM calls are typically cheaper than VLM calls)
When OCR accuracy is sufficient

IR+Source Mode (Hybrid)

Combine parsed text with visual context for maximum accuracy:

const flow = createFlow()
  .step('parse', parse({ provider: ocrProvider }))
  .step('extract', extract({
    provider: vlmProvider,  // VLM required for multimodal
    schema: invoiceSchema,
    inputMode: 'ir+source'
  }))
  .build();

Best for:

Complex layouts (tables, forms with checkboxes)
Documents with visual elements (signatures, stamps)
When highest accuracy is required

Source Mode (Direct VLM)

Skip parsing entirely, extract directly from raw document:

const flow = createFlow()
  .step('extract', extract({
    provider: vlmProvider,  // VLM required
    schema: invoiceSchema,
    inputMode: 'source'
  }))
  .build();

Best for:

Simple, well-structured documents
When OCR adds no value (clean PDFs)
Fastest processing time

Using Original Source in forEach

When processing split documents, use useOriginalSource to reference the full document instead of individual segments:

.forEach('process', (doc) =>
  createFlow()
    .step('extract', extract({
      provider: vlmProvider,
      schema: doc.schema,
      inputMode: 'ir+source',
      useOriginalSource: true  // Use full document, not segment
    }))
)

Input Types

The extract node accepts different input types depending on the configured mode:

Raw Documents (VLM)

Direct extraction from PDFs or images:

// VLM provider processes the document directly
const flow = createFlow()
  .step('extract', extract({
    provider: vlmProvider,  // Must be VLM for raw input
    schema: invoiceSchema
  }))
  .build();

await flow.run({ base64: pdfDataUrl });

Parsed Documents (LLM)

Extract from previously parsed DocumentIR:

// Parse first, then extract with LLM
const flow = createFlow()
  .step('parse', parse({ provider: ocrProvider }))
  .step('extract', extract({
    provider: llmProvider,  // Can use LLM for text input
    schema: invoiceSchema,
    inputMode: 'ir'         // Explicitly use text-only mode
  }))
  .build();

Schema Definition

Basic Schema

const schema = {
  type: 'object',
  properties: {
    invoiceNumber: {
      type: 'string',
      description: 'Invoice number or reference ID'
    },
    vendor: {
      type: 'object',
      properties: {
        name: { type: 'string', description: 'Company name' },
        address: { type: 'string', description: 'Full address' }
      }
    },
    totalAmount: {
      type: 'number',
      description: 'Total invoice amount'
    },
    lineItems: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          description: { type: 'string' },
          quantity: { type: 'number' },
          amount: { type: 'number' }
        }
      }
    }
  },
  required: ['invoiceNumber', 'totalAmount']
};

Schema Registry Reference

Use registered schemas:

extract({
  provider: vlmProvider,
  schema: { ref: 'invoice@1.0.0' }
})

Enhanced Schema

Include examples and extraction guidance:

const enhancedSchema = {
  schema: invoiceSchema,
  contextPrompt: 'This is a maritime bunker delivery note',
  extractionRules: 'Focus on the delivery summary table',
  examples: [
    {
      description: 'Standard invoice',
      input: 'Invoice #: INV-001\nTotal: $1,250.00',
      output: { invoiceNumber: 'INV-001', totalAmount: 1250.00 }
    }
  ]
};

extract({
  provider: vlmProvider,
  schema: enhancedSchema
})

Citation Tracking

Track which parts of the source document contributed to each field:

extract({
  provider: vlmProvider,
  schema: invoiceSchema,
  citations: {
    enabled: true,
    detectInferred: true  // Flag calculated/inferred values
  }
})

Output includes citation metadata:

interface OutputWithCitations<T> {
  data: T;  // Extracted data
  citations: {
    [fieldPath: string]: {
      lineIds: string[];     // Source line IDs (e.g., 'p1_l5')
      confidence: number;    // 0-1 confidence score
      inferred?: boolean;    // True if value was calculated
      reasoning?: string;    // Explanation for inferred values
    };
  };
}
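
One way to use this metadata is to flag fields for manual review. The sketch below repeats the interface shape shown above so that it is self-contained. `fieldsNeedingReview` and its `0.8` default are hypothetical choices, not part of the Doclo API.

```typescript
interface FieldCitation {
  lineIds: string[];
  confidence: number;
  inferred?: boolean;
  reasoning?: string;
}

interface OutputWithCitations<T> {
  data: T;
  citations: { [fieldPath: string]: FieldCitation };
}

// Collect field paths that were inferred or scored below `minConfidence`.
function fieldsNeedingReview<T>(
  output: OutputWithCitations<T>,
  minConfidence = 0.8 // hypothetical threshold; tune for your documents
): string[] {
  return Object.keys(output.citations)
    .filter((path) => {
      const citation = output.citations[path];
      return citation.inferred === true || citation.confidence < minConfidence;
    })
    .sort();
}

// Illustrative sample data, not real extraction output.
const sampleOutput: OutputWithCitations<{ invoiceNumber: string; totalAmount: number }> = {
  data: { invoiceNumber: 'INV-001', totalAmount: 1250 },
  citations: {
    invoiceNumber: { lineIds: ['p1_l5'], confidence: 0.97 },
    totalAmount: {
      lineIds: ['p1_l12', 'p1_l13'],
      confidence: 0.9,
      inferred: true,
      reasoning: 'Sum of line items',
    },
  },
};

const toReview = fieldsNeedingReview(sampleOutput);
```

With the sample data, only `totalAmount` is flagged, because it is marked as inferred even though its confidence is high.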

Consensus Voting

Run extraction multiple times and vote on results:

extract({
  provider: vlmProvider,
  schema: invoiceSchema,
  consensus: {
    runs: 3,              // Number of extraction runs
    strategy: 'majority', // Voting strategy
    threshold: 0.6        // Minimum agreement threshold
  }
})

See Consensus Voting for strategies and configuration.
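
To make the idea concrete, here is a minimal sketch of per-field majority voting over several runs. It is not Doclo's implementation: `majorityVote`, `FieldVote`, and the way `threshold` is applied (as the minimum share of runs that must agree on a value) are assumptions made for illustration.

```typescript
type Extraction = Record<string, unknown>;

interface FieldVote {
  value: unknown;     // most frequent value across runs
  agreement: number;  // share of runs (0-1) that produced `value`
  accepted: boolean;  // true when agreement >= threshold
}

function majorityVote(runs: Extraction[], threshold = 0.6): Record<string, FieldVote> {
  if (runs.length === 0) throw new Error('majorityVote needs at least one run');

  // Collect every field name that appears in any run.
  const fields: string[] = [];
  for (const run of runs) {
    for (const key of Object.keys(run)) {
      if (fields.indexOf(key) === -1) fields.push(key);
    }
  }

  const votes: Record<string, FieldVote> = {};
  for (const field of fields) {
    // Tally values by their JSON serialization so identical values match.
    const tallies = new Map<string, { value: unknown; count: number }>();
    for (const run of runs) {
      const value = run[field];
      const key = JSON.stringify(value) ?? 'undefined';
      const tally = tallies.get(key) ?? { value, count: 0 };
      tally.count += 1;
      tallies.set(key, tally);
    }
    // Pick the most frequent value; ties go to the value seen first.
    let best: { value: unknown; count: number } = { value: undefined, count: 0 };
    tallies.forEach((tally) => {
      if (tally.count > best.count) best = tally;
    });
    const agreement = best.count / runs.length;
    votes[field] = { value: best.value, agreement, accepted: agreement >= threshold };
  }
  return votes;
}
```

With three runs where two return `invoiceNumber: 'INV-001'` and one returns `'INV-007'`, the agreement is 2/3, which passes a threshold of 0.6 but would fail a threshold of 0.7.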

Extended Reasoning

Enable chain-of-thought reasoning for complex extractions:

extract({
  provider: vlmProvider,
  schema: invoiceSchema,
  reasoning: {
    enabled: true,
    effort: 'high',    // 'low' | 'medium' | 'high'
    exclude: false     // false keeps the reasoning in the output
  }
})

Extended reasoning improves accuracy for complex documents but increases latency and cost.

Custom Instructions

Add extraction guidance:

extract({
  provider: vlmProvider,
  schema: invoiceSchema,
  additionalInstructions: `
    - Be strict with date formats. Use YYYY-MM-DD format only.
    - For amounts, preserve exact decimal precision.
    - If a field is partially visible, extract what's readable.
  `
})

Type-Safe Extraction

Use TypeScript generics for typed output:

interface Invoice {
  invoiceNumber: string;
  totalAmount: number;
  lineItems?: Array<{
    description: string;
    amount: number;
  }>;
}

const flow = createFlow()
  .step('extract', extract<Invoice>({
    provider: vlmProvider,
    schema: invoiceSchema
  }))
  .build();

const result = await flow.run({ base64: pdf });
// result.output is typed as Invoice
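
TypeScript generics are erased at compile time, so the `Invoice` type parameter does not check anything at runtime by itself. If you want an additional check in your own code, a hand-written type guard is one option. The sketch below is an illustration with a hypothetical `isInvoice` helper and says nothing about what validation the extract node performs internally.

```typescript
interface Invoice {
  invoiceNumber: string;
  totalAmount: number;
  lineItems?: Array<{
    description: string;
    amount: number;
  }>;
}

// Runtime check that an unknown value has the shape of `Invoice`.
function isInvoice(value: unknown): value is Invoice {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  if (typeof v.invoiceNumber !== 'string') return false;
  if (typeof v.totalAmount !== 'number' || Number.isNaN(v.totalAmount)) return false;
  if (v.lineItems === undefined) return true; // lineItems is optional
  if (!Array.isArray(v.lineItems)) return false;
  return v.lineItems.every((item: unknown) => {
    if (typeof item !== 'object' || item === null) return false;
    const li = item as Record<string, unknown>;
    return typeof li.description === 'string' && typeof li.amount === 'number';
  });
}
```

A caller could then write `if (isInvoice(result.output)) { ... }` and handle the failing branch explicitly.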

Error Handling

Extraction may fail if:

Document cannot be read
Schema cannot be satisfied
Provider returns invalid response

Handle errors:

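try {
  const result = await flow.run({ base64: pdf });
  console.log(result.output);
} catch (error) {
  // `error` is `unknown` in strict TypeScript, so narrow it before reading `code`
  const code = (error as { code?: string } | null)?.code;
  if (code === 'SCHEMA_VALIDATION_FAILED') {
    console.error('Extracted data does not match schema');
  } else {
    throw error; // Let unrecognized errors propagate
  }
}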

Next Steps

Schemas: Learn about schema definition
Consensus Voting: Improve accuracy with multi-run voting
Citations: Track extraction sources
Providers: Configure LLM/VLM providers
