The extract node uses an AI provider to extract structured data from documents according to a JSON Schema. It accepts raw documents (PDFs or images, via a VLM) and parsed DocumentIR (via an LLM).

Basic Usage

import { createFlow, extract } from '@docloai/flows';
import { createVLMProvider } from '@docloai/providers-llm';

const vlmProvider = createVLMProvider({
  provider: 'google',
  model: 'google/gemini-flash-2.5',
  apiKey: process.env.OPENROUTER_API_KEY!,
  via: 'openrouter'
});

const invoiceSchema = {
  type: 'object',
  properties: {
    invoiceNumber: { type: 'string' },
    totalAmount: { type: 'number' },
    date: { type: 'string' }
  },
  required: ['invoiceNumber', 'totalAmount']
};

const flow = createFlow()
  .step('extract', extract({
    provider: vlmProvider,
    schema: invoiceSchema
  }))
  .build();

const result = await flow.run({ base64: pdfDataUrl });
// result.output matches invoiceSchema
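
The example above passes a `pdfDataUrl` without showing how it is built. The following is a minimal sketch, assuming the `base64` field accepts a standard `data:<mime>;base64,<payload>` URL (the variable name suggests this, but it is not confirmed here). `toDataUrl` is a hypothetical helper, not part of `@docloai/flows`.

```typescript
// Hypothetical helper: builds a data URL from raw file bytes.
// Assumption: flow.run({ base64 }) accepts a standard data URL.
function toDataUrl(bytes: Uint8Array, mimeType: string): string {
  // btoa expects a "binary string" with one character per byte.
  let binary = '';
  bytes.forEach((byte) => {
    binary += String.fromCharCode(byte);
  });
  return `data:${mimeType};base64,${btoa(binary)}`;
}

// Example with three placeholder bytes standing in for a real PDF.
const exampleDataUrl = toDataUrl(new Uint8Array([1, 2, 3]), 'application/pdf');
// exampleDataUrl === 'data:application/pdf;base64,AQID'
```

In Node, the bytes would typically come from reading the file into a `Uint8Array` before calling the helper.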

Configuration Options

extract({
  provider: vlmProvider,           // Required: VLM or LLM provider
  schema: invoiceSchema,           // Required: JSON Schema for output
  citations: { enabled: true },    // Enable source tracking
  consensus: { runs: 3 },          // Multi-run voting for accuracy
  reasoning: { enabled: true },    // Extended thinking (supported providers)
  additionalInstructions: '...'    // Custom extraction guidance
})

Options Reference

Option                  Type                        Description
provider                VLMProvider | LLMProvider   Required. AI provider for extraction
schema                  object | { ref: string }    Required. JSON Schema or registry reference
citations               CitationConfig              Citation/source tracking
consensus               ConsensusConfig             Multi-run voting configuration
reasoning               object                      Extended reasoning options
additionalInstructions  string                      Custom extraction guidance
promptRef               string                      Reference to prompt asset
promptVariables         object                      Variables for prompt rendering

Input Types

The extract node accepts two input types:

Raw Documents (VLM)

Direct extraction from PDFs or images:
// VLM provider processes the document directly
const flow = createFlow()
  .step('extract', extract({
    provider: vlmProvider,  // Must be VLM for raw input
    schema: invoiceSchema
  }))
  .build();

await flow.run({ base64: pdfDataUrl });

Parsed Documents (LLM)

Extract from previously parsed DocumentIR:
// Parse first, then extract with LLM
// (parse, ocrProvider, and llmProvider must be imported or defined elsewhere)
const flow = createFlow()
  .step('parse', parse({ provider: ocrProvider }))
  .step('extract', extract({
    provider: llmProvider,  // Can use LLM for text input
    schema: invoiceSchema
  }))
  .build();

Schema Definition

Basic Schema

const schema = {
  type: 'object',
  properties: {
    invoiceNumber: {
      type: 'string',
      description: 'Invoice number or reference ID'
    },
    vendor: {
      type: 'object',
      properties: {
        name: { type: 'string', description: 'Company name' },
        address: { type: 'string', description: 'Full address' }
      }
    },
    totalAmount: {
      type: 'number',
      description: 'Total invoice amount'
    },
    lineItems: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          description: { type: 'string' },
          quantity: { type: 'number' },
          amount: { type: 'number' }
        }
      }
    }
  },
  required: ['invoiceNumber', 'totalAmount']
};
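
To illustrate what the `required` list means for an extracted result, here is a minimal sketch that reports which required top-level fields are absent. It is not how the library validates output; `ObjectSchema` and `missingRequired` are hypothetical names introduced only for this example, and nested schemas are deliberately ignored.

```typescript
// Hypothetical, simplified shape of a top-level object schema.
interface ObjectSchema {
  type: 'object';
  properties: Record<string, unknown>;
  required?: string[];
}

// Returns the required keys that are missing (undefined or null) in the output.
function missingRequired(
  schema: ObjectSchema,
  output: Record<string, unknown>
): string[] {
  return (schema.required ?? []).filter(
    (key) => output[key] === undefined || output[key] === null
  );
}

const exampleSchema: ObjectSchema = {
  type: 'object',
  properties: {
    invoiceNumber: { type: 'string' },
    totalAmount: { type: 'number' }
  },
  required: ['invoiceNumber', 'totalAmount']
};

const missing = missingRequired(exampleSchema, { invoiceNumber: 'INV-001' });
// missing is ['totalAmount']
```

Fields left out of `required`, such as `vendor` or `lineItems` in the schema above, may be absent without this check flagging them.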

Schema Registry Reference

Use registered schemas:
extract({
  provider: vlmProvider,
  schema: { ref: 'invoice@1.0.0' }
})
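
The reference in the example looks like a `name@version` pair. The sketch below splits such a string, assuming that format and a three-part numeric version; both are inferred from the single example above rather than from a documented grammar, and `parseSchemaRef` is a hypothetical helper.

```typescript
// Hypothetical parsed form of a registry reference.
interface SchemaRef {
  name: string;
  version: string;
}

// Splits "name@major.minor.patch"; returns undefined for anything else.
// Assumption: the version is always three dot-separated integers.
function parseSchemaRef(ref: string): SchemaRef | undefined {
  const at = ref.lastIndexOf('@');
  if (at <= 0 || at === ref.length - 1) {
    return undefined; // no name, or no version after the '@'
  }
  const name = ref.slice(0, at);
  const version = ref.slice(at + 1);
  if (!/^\d+\.\d+\.\d+$/.test(version)) {
    return undefined;
  }
  return { name, version };
}

const parsedRef = parseSchemaRef('invoice@1.0.0');
// parsedRef is { name: 'invoice', version: '1.0.0' }
```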

Enhanced Schema

Include examples and extraction guidance:
const enhancedSchema = {
  schema: invoiceSchema,
  contextPrompt: 'This is a maritime bunker delivery note',
  extractionRules: 'Focus on the delivery summary table',
  examples: [
    {
      description: 'Standard invoice',
      input: 'Invoice #: INV-001\nTotal: $1,250.00',
      output: { invoiceNumber: 'INV-001', totalAmount: 1250.00 }
    }
  ]
};

extract({
  provider: vlmProvider,
  schema: enhancedSchema
})

Citation Tracking

Track which parts of the source document contributed to each field:
extract({
  provider: vlmProvider,
  schema: invoiceSchema,
  citations: {
    enabled: true,
    detectInferred: true  // Flag calculated/inferred values
  }
})
Output includes citation metadata:
interface OutputWithCitations<T> {
  data: T;  // Extracted data
  citations: {
    [fieldPath: string]: {
      lineIds: string[];     // Source line IDs (e.g., 'p1_l5')
      confidence: number;    // 0-1 confidence score
      inferred?: boolean;    // True if value was calculated
      reasoning?: string;    // Explanation for inferred values
    };
  };
}
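
One way to use this metadata is to flag fields for human review. The sketch below works against the `OutputWithCitations` shape shown above; `fieldsNeedingReview` and the 0.8 default cutoff are hypothetical choices, and the dotted `vendor.name` key is an assumption about how nested field paths are written.

```typescript
interface Citation {
  lineIds: string[];
  confidence: number;
  inferred?: boolean;
  reasoning?: string;
}

interface OutputWithCitations<T> {
  data: T;
  citations: { [fieldPath: string]: Citation };
}

// Returns field paths that were inferred or fall below the confidence cutoff.
function fieldsNeedingReview<T>(
  result: OutputWithCitations<T>,
  minConfidence = 0.8 // arbitrary cutoff chosen for illustration
): string[] {
  return Object.keys(result.citations)
    .filter((path) => {
      const citation = result.citations[path];
      return (
        citation !== undefined &&
        (citation.inferred === true || citation.confidence < minConfidence)
      );
    })
    .sort();
}

// Made-up sample data, only to exercise the helper.
const sample: OutputWithCitations<Record<string, unknown>> = {
  data: { invoiceNumber: 'INV-001', totalAmount: 1250, vendor: { name: 'Acme' } },
  citations: {
    invoiceNumber: { lineIds: ['p1_l5'], confidence: 0.97 },
    totalAmount: { lineIds: ['p1_l20'], confidence: 0.55 },
    'vendor.name': { lineIds: [], confidence: 0.9, inferred: true }
  }
};

const toReview = fieldsNeedingReview(sample);
// toReview is ['totalAmount', 'vendor.name']
```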

Consensus Voting

Run extraction multiple times and vote on results:
extract({
  provider: vlmProvider,
  schema: invoiceSchema,
  consensus: {
    runs: 3,              // Number of extraction runs
    strategy: 'majority', // Voting strategy
    threshold: 0.6        // Minimum agreement threshold
  }
})
See Consensus Voting for strategies and configuration.
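
The idea behind majority voting with a threshold can be sketched as follows. This is an illustration of the general technique, not the library's implementation: `majorityVote` is a hypothetical function, values are compared by their JSON serialization (an assumption that suits primitive field values), and ties go to the value that reached the top count first.

```typescript
interface Tally<T> {
  value: T;
  count: number;
}

// Returns the most frequent value if its share of the runs meets the
// threshold; otherwise returns undefined (no consensus).
function majorityVote<T>(values: T[], threshold: number): T | undefined {
  const counts = new Map<string, Tally<T>>();
  let best: Tally<T> | undefined;
  for (const value of values) {
    const key = JSON.stringify(value);
    let entry = counts.get(key);
    if (entry === undefined) {
      entry = { value, count: 0 };
      counts.set(key, entry);
    }
    entry.count += 1;
    if (best === undefined || entry.count > best.count) {
      best = entry;
    }
  }
  if (best === undefined || best.count / values.length < threshold) {
    return undefined;
  }
  return best.value;
}

// Three made-up runs that disagree on one field.
const runs = [
  { invoiceNumber: 'INV-001', totalAmount: 1250 },
  { invoiceNumber: 'INV-001', totalAmount: 1250 },
  { invoiceNumber: 'INV-001', totalAmount: 1260 }
];

const votedNumber = majorityVote(runs.map((r) => r.invoiceNumber), 0.6);
const votedTotal = majorityVote(runs.map((r) => r.totalAmount), 0.6);
// votedNumber is 'INV-001' (3 of 3 runs agree)
// votedTotal is 1250 (2 of 3 runs agree, about 0.67, which clears 0.6)
```

With three runs and a 0.6 threshold, two matching runs are enough; three different answers produce no consensus for that field.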

Extended Reasoning

Enable chain-of-thought reasoning for complex extractions:
extract({
  provider: vlmProvider,
  schema: invoiceSchema,
  reasoning: {
    enabled: true,
    effort: 'high',    // 'low' | 'medium' | 'high'
    exclude: false     // false keeps the reasoning in the output
  }
})
Extended reasoning improves accuracy for complex documents but increases latency and cost.

Custom Instructions

Add extraction guidance:
extract({
  provider: vlmProvider,
  schema: invoiceSchema,
  additionalInstructions: `
    - Be strict with date formats. Use YYYY-MM-DD format only.
    - For amounts, preserve exact decimal precision.
    - If a field is partially visible, extract what's readable.
  `
})

Type-Safe Extraction

Use TypeScript generics for typed output:
interface Invoice {
  invoiceNumber: string;
  totalAmount: number;
  lineItems?: Array<{
    description: string;
    amount: number;
  }>;
}

const flow = createFlow()
  .step('extract', extract<Invoice>({
    provider: vlmProvider,
    schema: invoiceSchema
  }))
  .build();

const result = await flow.run({ base64: pdf });
// result.output is typed as Invoice
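
TypeScript generics are erased at compile time, so the `Invoice` annotation by itself performs no runtime check. Where the output crosses a trust boundary, a type guard can confirm the shape before use. The sketch below is one way to do that; `isInvoice` and `isLineItem` are hypothetical helpers, and whether the library already validates output against the schema is not covered here.

```typescript
interface Invoice {
  invoiceNumber: string;
  totalAmount: number;
  lineItems?: Array<{
    description: string;
    amount: number;
  }>;
}

// Checks a single line item for the two fields Invoice declares.
function isLineItem(value: unknown): boolean {
  if (typeof value !== 'object' || value === null) {
    return false;
  }
  const item = value as Record<string, unknown>;
  return typeof item['description'] === 'string' && typeof item['amount'] === 'number';
}

// Runtime type guard matching the Invoice interface.
function isInvoice(value: unknown): value is Invoice {
  if (typeof value !== 'object' || value === null) {
    return false;
  }
  const candidate = value as Record<string, unknown>;
  if (typeof candidate['invoiceNumber'] !== 'string') {
    return false;
  }
  if (typeof candidate['totalAmount'] !== 'number') {
    return false;
  }
  const items = candidate['lineItems'];
  if (items === undefined) {
    return true; // lineItems is optional
  }
  return Array.isArray(items) && items.every(isLineItem);
}
```

A caller could then write `if (isInvoice(result.output)) { ... }` and handle the failing branch explicitly.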

Error Handling

Extraction may fail if:
  • The document cannot be read
  • The schema cannot be satisfied
  • The provider returns an invalid response
Wrap flow.run in try/catch to handle these failures:
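try {
  const result = await flow.run({ base64: pdf });
  // Use result.output here
} catch (error) {
  // `error` is `unknown` under strict TypeScript, so narrow before reading `code`
  const code = (error as { code?: string } | null)?.code;
  if (code === 'SCHEMA_VALIDATION_FAILED') {
    console.error('Extracted data does not match schema');
  } else {
    throw error; // Do not silently swallow unrecognized failures
  }
}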
