The extract node uses an AI model to pull structured data out of a document, with the output shape defined by a JSON Schema. It accepts both raw documents (PDFs or images, read by a vision-language model, VLM) and parsed DocumentIR (text, read by an LLM).
Basic Usage
import { createFlow, extract } from '@docloai/flows';
import { createVLMProvider } from '@docloai/providers-llm';
const vlmProvider = createVLMProvider({
provider: 'google',
model: 'google/gemini-2.5-flash',
apiKey: process.env.OPENROUTER_API_KEY!,
via: 'openrouter'
});
const invoiceSchema = {
type: 'object',
properties: {
invoiceNumber: { type: 'string' },
totalAmount: { type: 'number' },
date: { type: 'string' }
},
required: ['invoiceNumber', 'totalAmount']
};
const flow = createFlow()
.step('extract', extract({
provider: vlmProvider,
schema: invoiceSchema
}))
.build();
const result = await flow.run({ base64: pdfDataUrl });
// result.output matches invoiceSchema
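The `pdfDataUrl` value passed to `flow.run` is not defined in the snippet above. As a minimal sketch, assuming the flow accepts a standard base64 `data:` URL (the variable name suggests this, but it is an assumption), raw file bytes could be encoded as follows. The helper `toDataUrl` is hypothetical and not part of `@docloai/flows`.

```typescript
// Hypothetical helper: encode raw file bytes as a base64 data URL.
function toDataUrl(bytes: Uint8Array, mimeType: string = 'application/pdf'): string {
  // Build a binary string one byte at a time so large files
  // do not overflow the argument limit of String.fromCharCode.
  let binary = '';
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  // btoa is a global in browsers and in Node.js 16+.
  return 'data:' + mimeType + ';base64,' + btoa(binary);
}
```

In Node.js, the bytes would typically come from reading the PDF file into a buffer before calling the helper.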
Configuration Options
extract({
provider: vlmProvider, // Required: VLM or LLM provider
schema: invoiceSchema, // Required: JSON Schema for output
citations: { enabled: true }, // Enable source tracking
consensus: { runs: 3 }, // Multi-run voting for accuracy
reasoning: { enabled: true }, // Extended thinking (supported providers)
additionalInstructions: '...' // Custom extraction guidance
})
Options Reference
| Option | Type | Description |
|---|---|---|
| provider | `VLMProvider \| LLMProvider` | Required. AI provider for extraction |
| schema | `object \| { ref: string }` | Required. JSON Schema or registry reference |
| citations | `CitationConfig` | Citation/source tracking |
| consensus | `ConsensusConfig` | Multi-run voting configuration |
| reasoning | `object` | Extended reasoning options |
| additionalInstructions | `string` | Custom extraction guidance |
| promptRef | `string` | Reference to prompt asset |
| promptVariables | `object` | Variables for prompt rendering |
Input Types
The extract node accepts two input types:
Raw Documents (VLM)
Direct extraction from PDFs or images:
// VLM provider processes the document directly
const flow = createFlow()
.step('extract', extract({
provider: vlmProvider, // Must be VLM for raw input
schema: invoiceSchema
}))
.build();
await flow.run({ base64: pdfDataUrl });
Parsed Documents (LLM)
Extract from previously parsed DocumentIR:
// Parse first, then extract with an LLM.
// `parse`, `ocrProvider`, and `llmProvider` are assumed to be imported or created elsewhere.
const flow = createFlow()
.step('parse', parse({ provider: ocrProvider }))
.step('extract', extract({
provider: llmProvider, // Can use LLM for text input
schema: invoiceSchema
}))
.build();
Schema Definition
Basic Schema
const schema = {
type: 'object',
properties: {
invoiceNumber: {
type: 'string',
description: 'Invoice number or reference ID'
},
vendor: {
type: 'object',
properties: {
name: { type: 'string', description: 'Company name' },
address: { type: 'string', description: 'Full address' }
}
},
totalAmount: {
type: 'number',
description: 'Total invoice amount'
},
lineItems: {
type: 'array',
items: {
type: 'object',
properties: {
description: { type: 'string' },
quantity: { type: 'number' },
amount: { type: 'number' }
}
}
}
},
required: ['invoiceNumber', 'totalAmount']
};
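To make the effect of `type` and `required` concrete, here is a sketch of a tiny checker that understands only the keywords used in the schema above (`type`, `properties`, `items`, `required`). It is an illustration, not how the extract node validates output; `MiniSchema` and `checkAgainstSchema` are hypothetical names, and a real project would more likely use a full JSON Schema validator such as Ajv.

```typescript
// Only the JSON Schema keywords used in the example above are modeled here.
interface MiniSchema {
  type: 'object' | 'array' | 'string' | 'number';
  description?: string;
  properties?: { [key: string]: MiniSchema };
  items?: MiniSchema;
  required?: string[];
}

// Returns human-readable problems; an empty list means the value conforms.
function checkAgainstSchema(value: unknown, schema: MiniSchema, path: string = '$'): string[] {
  if (schema.type === 'string' || schema.type === 'number') {
    return typeof value === schema.type ? [] : [path + ': expected ' + schema.type];
  }
  if (schema.type === 'array') {
    if (!Array.isArray(value)) return [path + ': expected array'];
    const itemSchema = schema.items;
    if (!itemSchema) return [];
    const itemProblems: string[] = [];
    value.forEach((item, i) => {
      itemProblems.push(...checkAgainstSchema(item, itemSchema, path + '[' + i + ']'));
    });
    return itemProblems;
  }
  // Remaining case: schema.type === 'object'
  if (typeof value !== 'object' || value === null || Array.isArray(value)) {
    return [path + ': expected object'];
  }
  const record = value as { [key: string]: unknown };
  const properties = schema.properties ?? {};
  const problems: string[] = [];
  // Required fields must be present.
  for (const key of schema.required ?? []) {
    if (!(key in record)) problems.push(path + '.' + key + ': missing required field');
  }
  // Fields that are present must match their declared type.
  for (const key of Object.keys(properties)) {
    if (key in record) {
      problems.push(...checkAgainstSchema(record[key], properties[key], path + '.' + key));
    }
  }
  return problems;
}
```

Optional fields such as `vendor` or `lineItems` produce no problem when absent; only the names listed in `required` are enforced.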
Schema Registry Reference
Use registered schemas:
extract({
provider: vlmProvider,
schema: { ref: 'invoice@1.0.0' }
})
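The reference string in the example looks like `name@version`. As a sketch, assuming that format (inferred only from the single example `'invoice@1.0.0'`; the registry's actual rules are not described here), a reference could be checked before use. `SchemaRef` and `parseSchemaRef` are hypothetical names.

```typescript
interface SchemaRef {
  name: string;
  version: string;
}

// Hypothetical helper: split "name@MAJOR.MINOR.PATCH" into its parts.
function parseSchemaRef(ref: string): SchemaRef {
  const at = ref.lastIndexOf('@');
  // Reject a missing "@", an empty name, or an empty version.
  if (at <= 0 || at === ref.length - 1) {
    throw new Error('Expected "name@version", got "' + ref + '"');
  }
  const name = ref.slice(0, at);
  const version = ref.slice(at + 1);
  // Assumption: versions are plain three-part numbers, as in the example.
  if (!/^\d+\.\d+\.\d+$/.test(version)) {
    throw new Error('Expected a MAJOR.MINOR.PATCH version, got "' + version + '"');
  }
  return { name, version };
}
```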
Enhanced Schema
Include examples and extraction guidance:
const enhancedSchema = {
schema: invoiceSchema,
contextPrompt: 'This is a maritime bunker delivery note',
extractionRules: 'Focus on the delivery summary table',
examples: [
{
description: 'Standard invoice',
input: 'Invoice #: INV-001\nTotal: $1,250.00',
output: { invoiceNumber: 'INV-001', totalAmount: 1250.00 }
}
]
};
extract({
provider: vlmProvider,
schema: enhancedSchema
})
Citation Tracking
Track which parts of the source document contributed to each field:
extract({
provider: vlmProvider,
schema: invoiceSchema,
citations: {
enabled: true,
detectInferred: true // Flag calculated/inferred values
}
})
Output includes citation metadata:
interface OutputWithCitations<T> {
data: T; // Extracted data
citations: {
[fieldPath: string]: {
lineIds: string[]; // Source line IDs (e.g., 'p1_l5')
confidence: number; // 0-1 confidence score
inferred?: boolean; // True if value was calculated
reasoning?: string; // Explanation for inferred values
};
};
}
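One way to use this metadata is to flag fields for human review. The sketch below repeats the interface above so it is self-contained, then lists every field that was inferred, has no source lines, or falls below a confidence cutoff. The helper name `fieldsNeedingReview` and the default cutoff of 0.8 are illustrative assumptions, not library behavior.

```typescript
// Shape copied from the OutputWithCitations interface above.
interface FieldCitation {
  lineIds: string[];
  confidence: number;
  inferred?: boolean;
  reasoning?: string;
}

interface OutputWithCitations<T> {
  data: T;
  citations: { [fieldPath: string]: FieldCitation };
}

// Hypothetical helper: return the field paths a person should double-check.
function fieldsNeedingReview<T>(
  output: OutputWithCitations<T>,
  minConfidence: number = 0.8 // arbitrary cutoff chosen for this sketch
): string[] {
  return Object.keys(output.citations).filter((fieldPath) => {
    const citation = output.citations[fieldPath];
    return (
      citation.inferred === true ||          // value was calculated, not read
      citation.confidence < minConfidence || // model was unsure
      citation.lineIds.length === 0          // nothing in the source backs it
    );
  });
}
```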
Consensus Voting
Run extraction multiple times and vote on results:
extract({
provider: vlmProvider,
schema: invoiceSchema,
consensus: {
runs: 3, // Number of extraction runs
strategy: 'majority', // Voting strategy
threshold: 0.6 // Minimum agreement threshold
}
})
See Consensus Voting for strategies and configuration.
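To illustrate the idea behind `runs` and `threshold`, here is a sketch of majority voting on a single field. It assumes that `threshold` is the minimum share of runs that must agree on a value; the library's actual strategies may differ, and `majorityVote` is a hypothetical name.

```typescript
interface VoteResult<T> {
  value: T | undefined; // undefined when no candidate reaches the threshold
  agreement: number;    // share of runs that returned the winning value (0-1)
}

// Hypothetical helper: pick the most common value across extraction runs.
function majorityVote<T>(values: T[], threshold: number): VoteResult<T> {
  // Compare values by their JSON form so equal objects count as the same vote.
  const counts: { [key: string]: { value: T; count: number } } = {};
  for (const v of values) {
    const key = JSON.stringify(v);
    if (counts[key]) counts[key].count += 1;
    else counts[key] = { value: v, count: 1 };
  }
  let winner: { value: T; count: number } | undefined;
  for (const key of Object.keys(counts)) {
    if (winner === undefined || counts[key].count > winner.count) winner = counts[key];
  }
  if (winner === undefined) return { value: undefined, agreement: 0 };
  const agreement = winner.count / values.length;
  return { value: agreement >= threshold ? winner.value : undefined, agreement };
}
```

With three runs and a threshold of 0.6, two matching results (agreement of about 0.67) are enough to accept a value, while three different results are not.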
Extended Reasoning
Enable chain-of-thought reasoning for complex extractions:
extract({
provider: vlmProvider,
schema: invoiceSchema,
reasoning: {
enabled: true,
effort: 'high', // 'low' | 'medium' | 'high'
exclude: false // false keeps the reasoning in the output; true omits it
}
})
Extended reasoning improves accuracy for complex documents but increases latency and cost.
Custom Instructions
Add extraction guidance:
extract({
provider: vlmProvider,
schema: invoiceSchema,
additionalInstructions: `
- Be strict with date formats. Use YYYY-MM-DD format only.
- For amounts, preserve exact decimal precision.
- If a field is partially visible, extract what's readable.
`
})
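Instructions guide the model but do not guarantee the result, so it can be worth checking the extracted values afterwards. As a sketch for the date rule above, the hypothetical helper `isIsoDate` accepts only real calendar dates written as YYYY-MM-DD.

```typescript
// Hypothetical post-extraction check for the YYYY-MM-DD rule.
function isIsoDate(value: string): boolean {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(value);
  if (!match) return false;
  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);
  // Date.UTC rolls invalid dates over (Feb 30 becomes Mar 1),
  // so a round trip that changes any component means the date is not real.
  const date = new Date(Date.UTC(year, month - 1, day));
  return (
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day
  );
}
```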
Type Safety
Use TypeScript generics for typed output:
interface Invoice {
invoiceNumber: string;
totalAmount: number;
lineItems?: Array<{
description: string;
amount: number;
}>;
}
const flow = createFlow()
.step('extract', extract<Invoice>({
provider: vlmProvider,
schema: invoiceSchema
}))
.build();
const result = await flow.run({ base64: pdf });
// result.output is typed as Invoice
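The generic parameter in `extract<Invoice>` is a compile-time annotation, and nothing in the snippet above ties the `Invoice` interface to `invoiceSchema`. If the two drift apart, TypeScript will not notice. A runtime type guard is one way to confirm the shape before relying on it; the sketch below is an illustration and `isInvoice` is a hypothetical name.

```typescript
interface Invoice {
  invoiceNumber: string;
  totalAmount: number;
  lineItems?: Array<{
    description: string;
    amount: number;
  }>;
}

// Hypothetical guard: narrows an unknown value to Invoice at runtime.
function isInvoice(value: unknown): value is Invoice {
  if (typeof value !== 'object' || value === null) return false;
  const record = value as { [key: string]: unknown };
  if (typeof record.invoiceNumber !== 'string') return false;
  if (typeof record.totalAmount !== 'number') return false;
  // lineItems is optional, but when present every entry must be well formed.
  const items = record.lineItems;
  if (items === undefined) return true;
  return (
    Array.isArray(items) &&
    items.every(
      (item) =>
        typeof item === 'object' &&
        item !== null &&
        typeof item.description === 'string' &&
        typeof item.amount === 'number'
    )
  );
}
```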
Error Handling
Extraction can fail when:
- The document cannot be read
- The extracted data cannot satisfy the schema
- The provider returns an invalid response
Catch these errors around `flow.run`:
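try {
  const result = await flow.run({ base64: pdf });
  console.log(result.output);
} catch (error) {
  // In strict TypeScript `error` is `unknown`, so narrow it before reading `code`
  const code = (error as { code?: unknown } | null)?.code;
  if (code === 'SCHEMA_VALIDATION_FAILED') {
    console.error('Extracted data does not match schema');
  } else {
    throw error; // Rethrow anything this handler does not recognize
  }
}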
Next Steps