The extract node uses AI to extract structured data from documents according to a JSON Schema. It works with both raw documents (via VLM) and parsed DocumentIR (via LLM).

Basic Usage

import { createFlow, extract } from '@doclo/flows';
import { createVLMProvider } from '@doclo/providers-llm';

const vlmProvider = createVLMProvider({
  provider: 'google',
  model: 'google/gemini-2.5-flash',
  apiKey: process.env.OPENROUTER_API_KEY!,
  via: 'openrouter'
});

const invoiceSchema = {
  type: 'object',
  properties: {
    invoiceNumber: { type: 'string' },
    totalAmount: { type: 'number' },
    date: { type: 'string' }
  },
  required: ['invoiceNumber', 'totalAmount']
};

const flow = createFlow()
  .step('extract', extract({
    provider: vlmProvider,
    schema: invoiceSchema
  }))
  .build();

const result = await flow.run({ base64: pdfDataUrl });
// result.output matches invoiceSchema

Configuration Options

extract({
  provider: vlmProvider,           // Required: VLM or LLM provider
  schema: invoiceSchema,           // Required: JSON Schema for output
  citations: { enabled: true },    // Enable source tracking
  consensus: { runs: 3 },          // Multi-run voting for accuracy
  reasoning: { enabled: true },    // Extended thinking (supported providers)
  additionalInstructions: '...'    // Custom extraction guidance
})

Options Reference

Option                  Type                                       Default   Description
provider                VLMProvider | LLMProvider                  Required  AI provider for extraction
schema                  object | { ref: string }                   Required  JSON Schema or registry reference
inputMode               'auto' | 'ir' | 'ir+source' | 'source'     'auto'    Controls what inputs the node ingests
preferVisual            boolean                                    true      In auto mode, prefer multimodal extraction
useOriginalSource       boolean                                    false     Use original unsplit document in forEach contexts
citations               CitationConfig                             -         Citation/source tracking
consensus               ConsensusConfig                            -         Multi-run voting configuration
reasoning               object                                     -         Extended reasoning options
additionalInstructions  string                                     -         Custom extraction guidance
promptRef               string                                     -         Reference to prompt asset
promptVariables         object                                     -         Variables for prompt rendering
maxTokens               number                                     -         Maximum tokens for LLM response

Input Mode

The inputMode option controls what input the extract node uses for extraction. This is one of the most important configuration options for optimizing accuracy and cost.

Mode Options

Mode       Description                               Provider    Best For
auto       Automatically detect and route (default)  VLM or LLM  Most use cases
ir         Text-only from parsed DocumentIR          LLM or VLM  Cost-effective, text-heavy docs
ir+source  Both parsed text AND source images        VLM only    Maximum accuracy, complex layouts
source     Direct from raw document                  VLM only    Simple docs without prior parsing
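
The provider constraints in the table can be checked before a flow is built. The following is a minimal sketch, not part of the library: `ProviderKind` and `isModeSupported` are hypothetical names introduced here for illustration.

```typescript
// Hypothetical helper: encodes the "Provider" column of the table above.
type InputMode = 'auto' | 'ir' | 'ir+source' | 'source';
type ProviderKind = 'vlm' | 'llm'; // assumed labels, not library types

// Modes that read the source images and therefore need a VLM.
const VLM_ONLY_MODES: ReadonlyArray<InputMode> = ['ir+source', 'source'];

function isModeSupported(mode: InputMode, provider: ProviderKind): boolean {
  // A VLM can serve every mode; an LLM can serve only the text-based ones.
  return provider === 'vlm' || VLM_ONLY_MODES.indexOf(mode) === -1;
}
```

Calling `isModeSupported('source', 'llm')` returns `false`, which flags a misconfiguration before any provider call is made.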

Auto Mode (Default)

Auto mode selects the extraction path from the inputs that are available and the type of provider:
extract({
  provider: vlmProvider,
  schema: invoiceSchema,
  inputMode: 'auto',     // Default - automatically determines best mode
  preferVisual: true     // When both IR and source available, use ir+source
})
Auto mode decision tree:
  1. If DocumentIR + source available + VLM provider + preferVisual: true → ir+source
  2. If only DocumentIR available → ir
  3. If only FlowInput (raw document) + VLM provider → source
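
The decision tree above can be sketched as a small function. This is an illustration under stated assumptions, not the library's internal logic: `AutoModeContext` and `resolveAutoMode` are hypothetical names, and two behaviors are guesses the rules do not cover (falling back to `ir` whenever a DocumentIR exists but rule 1 does not apply, and throwing when no rule matches).

```typescript
// Hypothetical types for illustration; not the library's actual internals.
type ResolvedMode = 'ir' | 'ir+source' | 'source';

interface AutoModeContext {
  hasIR: boolean;         // a parsed DocumentIR is available
  hasSource: boolean;     // the raw document (FlowInput) is available
  providerIsVLM: boolean; // the provider accepts images
  preferVisual: boolean;  // the extract() option, default true
}

function resolveAutoMode(ctx: AutoModeContext): ResolvedMode {
  // Rule 1: everything needed for hybrid extraction is present.
  if (ctx.hasIR && ctx.hasSource && ctx.providerIsVLM && ctx.preferVisual) {
    return 'ir+source';
  }
  // Rule 2 (assumed to also cover IR + source with an LLM or preferVisual: false).
  if (ctx.hasIR) {
    return 'ir';
  }
  // Rule 3: raw document only, and the provider can read it.
  if (ctx.hasSource && ctx.providerIsVLM) {
    return 'source';
  }
  // Assumption: no rule matched, e.g. raw document with a text-only LLM.
  throw new Error('No usable input for the configured provider');
}
```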

IR Mode (Text-Only)

Use parsed text only, ignoring visual context:
const flow = createFlow()
  .step('parse', parse({ provider: ocrProvider }))
  .step('extract', extract({
    provider: llmProvider,  // LLM is sufficient for text-only
    schema: invoiceSchema,
    inputMode: 'ir'
  }))
  .build();
Best for:
  • Text-heavy documents (contracts, reports)
  • Cost optimization (text-only LLM calls are typically cheaper than multimodal VLM calls)
  • When OCR accuracy is sufficient

IR+Source Mode (Hybrid)

Combine parsed text with visual context for maximum accuracy:
const flow = createFlow()
  .step('parse', parse({ provider: ocrProvider }))
  .step('extract', extract({
    provider: vlmProvider,  // VLM required for multimodal
    schema: invoiceSchema,
    inputMode: 'ir+source'
  }))
  .build();
Best for:
  • Complex layouts (tables, forms with checkboxes)
  • Documents with visual elements (signatures, stamps)
  • When highest accuracy is required

Source Mode (Direct VLM)

Skip parsing entirely, extract directly from raw document:
const flow = createFlow()
  .step('extract', extract({
    provider: vlmProvider,  // VLM required
    schema: invoiceSchema,
    inputMode: 'source'
  }))
  .build();
Best for:
  • Simple, well-structured documents
  • When OCR adds no value (clean PDFs)
  • Fastest processing time

Using Original Source in forEach

When processing split documents, use useOriginalSource to reference the full document instead of individual segments:
.forEach('process', (doc) =>
  createFlow()
    .step('extract', extract({
      provider: vlmProvider,
      schema: doc.schema,
      inputMode: 'ir+source',
      useOriginalSource: true  // Use full document, not segment
    }))
)

Input Types

The extract node accepts different input types depending on the configured mode:

Raw Documents (VLM)

Direct extraction from PDFs or images:
// VLM provider processes the document directly
const flow = createFlow()
  .step('extract', extract({
    provider: vlmProvider,  // Must be VLM for raw input
    schema: invoiceSchema
  }))
  .build();

await flow.run({ base64: pdfDataUrl });

Parsed Documents (LLM)

Extract from previously parsed DocumentIR:
// Parse first, then extract with LLM
const flow = createFlow()
  .step('parse', parse({ provider: ocrProvider }))
  .step('extract', extract({
    provider: llmProvider,  // Can use LLM for text input
    schema: invoiceSchema,
    inputMode: 'ir'         // Explicitly use text-only mode
  }))
  .build();

Schema Definition

Basic Schema

const schema = {
  type: 'object',
  properties: {
    invoiceNumber: {
      type: 'string',
      description: 'Invoice number or reference ID'
    },
    vendor: {
      type: 'object',
      properties: {
        name: { type: 'string', description: 'Company name' },
        address: { type: 'string', description: 'Full address' }
      }
    },
    totalAmount: {
      type: 'number',
      description: 'Total invoice amount'
    },
    lineItems: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          description: { type: 'string' },
          quantity: { type: 'number' },
          amount: { type: 'number' }
        }
      }
    }
  },
  required: ['invoiceNumber', 'totalAmount']
};
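
To make the role of `required` concrete, here is a minimal sketch that checks a candidate output for missing required keys. `ObjectSchema` and `missingRequired` are hypothetical names; the helper only inspects top-level `required` entries and is not a full JSON Schema validator.

```typescript
// Hypothetical, simplified schema shape used only for this illustration.
interface ObjectSchema {
  type: 'object';
  properties: Record<string, unknown>;
  required?: string[];
}

// Returns the required top-level keys that are absent or null in `data`.
function missingRequired(schema: ObjectSchema, data: Record<string, unknown>): string[] {
  return (schema.required ?? []).filter(
    (key) => data[key] === undefined || data[key] === null
  );
}

const exampleSchema: ObjectSchema = {
  type: 'object',
  properties: {
    invoiceNumber: { type: 'string' },
    totalAmount: { type: 'number' },
    lineItems: { type: 'array' }
  },
  required: ['invoiceNumber', 'totalAmount']
};

// An output that satisfies the schema: optional fields may be omitted.
const completeOutput = { invoiceNumber: 'INV-001', totalAmount: 1250 };
// An output that would fail validation: totalAmount is missing.
const partialOutput = { invoiceNumber: 'INV-001' };
```

Here `missingRequired(exampleSchema, partialOutput)` reports `totalAmount`, while the complete output yields an empty list.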

Schema Registry Reference

Use registered schemas:
extract({
  provider: vlmProvider,
  schema: { ref: 'invoice@1.0.0' }
})

Enhanced Schema

Include examples and extraction guidance:
const enhancedSchema = {
  schema: invoiceSchema,
  contextPrompt: 'This is a maritime bunker delivery note',
  extractionRules: 'Focus on the delivery summary table',
  examples: [
    {
      description: 'Standard invoice',
      input: 'Invoice #: INV-001\nTotal: $1,250.00',
      output: { invoiceNumber: 'INV-001', totalAmount: 1250.00 }
    }
  ]
};

extract({
  provider: vlmProvider,
  schema: enhancedSchema
})

Citation Tracking

Track which parts of the source document contributed to each field:
extract({
  provider: vlmProvider,
  schema: invoiceSchema,
  citations: {
    enabled: true,
    detectInferred: true  // Flag calculated/inferred values
  }
})
Output includes citation metadata:
interface OutputWithCitations<T> {
  data: T;  // Extracted data
  citations: {
    [fieldPath: string]: {
      lineIds: string[];     // Source line IDs (e.g., 'p1_l5')
      confidence: number;    // 0-1 confidence score
      inferred?: boolean;    // True if value was calculated
      reasoning?: string;    // Explanation for inferred values
    };
  };
}
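
One way to use this metadata is to route uncertain fields to manual review. The sketch below assumes the output shape shown above; `fieldsNeedingReview` and its `minConfidence` default of 0.8 are hypothetical choices, not library features.

```typescript
interface FieldCitation {
  lineIds: string[];
  confidence: number;
  inferred?: boolean;
  reasoning?: string;
}

interface OutputWithCitations<T> {
  data: T;
  citations: { [fieldPath: string]: FieldCitation };
}

// Hypothetical helper: lists field paths that were inferred or fall
// below the confidence cutoff, sorted alphabetically.
function fieldsNeedingReview<T>(
  output: OutputWithCitations<T>,
  minConfidence = 0.8
): string[] {
  return Object.entries(output.citations)
    .filter(([, c]) => c.inferred === true || c.confidence < minConfidence)
    .map(([path]) => path)
    .sort();
}

// Illustrative data only; not real extraction output.
const sample: OutputWithCitations<{ invoiceNumber: string; totalAmount: number }> = {
  data: { invoiceNumber: 'INV-001', totalAmount: 1250 },
  citations: {
    invoiceNumber: { lineIds: ['p1_l5'], confidence: 0.97 },
    totalAmount: {
      lineIds: ['p1_l12', 'p1_l13'],
      confidence: 0.9,
      inferred: true,
      reasoning: 'Sum of line items'
    }
  }
};
```

With the sample above, only `totalAmount` is flagged, because it is marked as inferred even though its confidence is high.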

Consensus Voting

Run extraction multiple times and vote on results:
extract({
  provider: vlmProvider,
  schema: invoiceSchema,
  consensus: {
    runs: 3,              // Number of extraction runs
    strategy: 'majority', // Voting strategy
    threshold: 0.6        // Minimum agreement threshold
  }
})
See Consensus Voting for strategies and configuration.
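
To illustrate how `runs` and `threshold` interact, here is a minimal sketch of majority voting on a single scalar field. It is an assumption about the general technique, not the library's implementation; `majorityVote` and `VoteResult` are hypothetical names.

```typescript
// Hypothetical sketch of per-field majority voting.
type Scalar = string | number | boolean | null;

interface VoteResult {
  value: Scalar | undefined; // undefined when no value reaches the threshold
  agreement: number;         // share of runs that produced the winning value
}

function majorityVote(values: Scalar[], threshold = 0.6): VoteResult {
  if (values.length === 0) {
    return { value: undefined, agreement: 0 };
  }
  // Count occurrences, keyed by JSON so that 1 and '1' stay distinct.
  const counts = new Map<string, { value: Scalar; count: number }>();
  for (const v of values) {
    const key = JSON.stringify(v);
    const entry = counts.get(key);
    if (entry) {
      entry.count += 1;
    } else {
      counts.set(key, { value: v, count: 1 });
    }
  }
  // Pick the most frequent value; ties keep the value seen first.
  let best: { value: Scalar; count: number } = { value: null, count: 0 };
  counts.forEach((entry) => {
    if (entry.count > best.count) {
      best = entry;
    }
  });
  const agreement = best.count / values.length;
  return { value: agreement >= threshold ? best.value : undefined, agreement };
}
```

With three runs and a threshold of 0.6, two matching results (agreement of about 0.67) are enough to accept a value, while three different results (agreement of about 0.33) are rejected.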

Extended Reasoning

Enable chain-of-thought reasoning for complex extractions:
extract({
  provider: vlmProvider,
  schema: invoiceSchema,
  reasoning: {
    enabled: true,
    effort: 'high',    // 'low' | 'medium' | 'high'
    exclude: false     // Include reasoning in output
  }
})
Extended reasoning improves accuracy for complex documents but increases latency and cost.

Custom Instructions

Add extraction guidance:
extract({
  provider: vlmProvider,
  schema: invoiceSchema,
  additionalInstructions: `
    - Be strict with date formats. Use YYYY-MM-DD format only.
    - For amounts, preserve exact decimal precision.
    - If a field is partially visible, extract what's readable.
  `
})

Type-Safe Extraction

Use TypeScript generics for typed output:
interface Invoice {
  invoiceNumber: string;
  totalAmount: number;
  lineItems?: Array<{
    description: string;
    amount: number;
  }>;
}

const flow = createFlow()
  .step('extract', extract<Invoice>({
    provider: vlmProvider,
    schema: invoiceSchema
  }))
  .build();

const result = await flow.run({ base64: pdf });
// result.output is typed as Invoice

Error Handling

Extraction may fail if:
  • Document cannot be read
  • Schema cannot be satisfied
  • Provider returns invalid response
Handle errors:
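try {
  const result = await flow.run({ base64: pdf });
} catch (error) {
  // In strict TypeScript the caught value is `unknown`, so narrow before reading `code`
  const code = (error as { code?: string } | null)?.code;
  if (code === 'SCHEMA_VALIDATION_FAILED') {
    console.error('Extracted data does not match schema');
  } else {
    throw error; // Do not silently swallow unrelated failures
  }
}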
