Extract Invoice Data

This guide walks through building a complete invoice extraction pipeline, from defining a schema to processing documents and handling the extracted data.

Prerequisites

Node.js 18+
A Doclo API key or a VLM provider API key (the local examples read OPENROUTER_API_KEY, plus DATALAB_API_KEY for the OCR variants)
An invoice document (PDF or image)

What You’ll Build

A flow that extracts:

Invoice number and date
Vendor information (name, address, tax ID)
Line items with descriptions, quantities, and amounts
Totals (subtotal, tax, total)

Step 1: Define the Schema

The schema tells the AI what data to extract and in what structure. Use JSON Schema, and add a description to any field whose meaning or format is not obvious from its name; descriptions improve extraction accuracy.

const invoiceSchema = {
  type: 'object',
  properties: {
    invoiceNumber: {
      type: 'string',
      description: 'Invoice number or reference ID (e.g., INV-2024-001)'
    },
    invoiceDate: {
      type: 'string',
      description: 'Invoice date in YYYY-MM-DD format'
    },
    dueDate: {
      type: 'string',
      description: 'Payment due date in YYYY-MM-DD format'
    },
    vendor: {
      type: 'object',
      description: 'Seller/vendor information',
      properties: {
        name: { type: 'string', description: 'Company or business name' },
        address: { type: 'string', description: 'Full street address' },
        city: { type: 'string' },
        country: { type: 'string' },
        taxId: { type: 'string', description: 'VAT number or tax ID' },
        email: { type: 'string' },
        phone: { type: 'string' }
      },
      required: ['name']
    },
    customer: {
      type: 'object',
      description: 'Buyer/customer information',
      properties: {
        name: { type: 'string' },
        address: { type: 'string' },
        taxId: { type: 'string' }
      }
    },
    lineItems: {
      type: 'array',
      description: 'Individual items or services on the invoice',
      items: {
        type: 'object',
        properties: {
          description: { type: 'string', description: 'Item or service description' },
          quantity: { type: 'number', description: 'Number of units' },
          unitPrice: { type: 'number', description: 'Price per unit' },
          amount: { type: 'number', description: 'Line total (quantity × unitPrice)' }
        },
        required: ['description', 'amount']
      }
    },
    subtotal: {
      type: 'number',
      description: 'Sum of line items before tax'
    },
    taxRate: {
      type: 'number',
      description: 'Tax rate as percentage (e.g., 20 for 20%)'
    },
    taxAmount: {
      type: 'number',
      description: 'Total tax amount'
    },
    total: {
      type: 'number',
      description: 'Final amount due including tax'
    },
    currency: {
      type: 'string',
      description: 'Currency code (e.g., USD, EUR, GBP)'
    }
  },
  required: ['invoiceNumber', 'total']
};

Add description fields to guide the AI. Descriptions like “in YYYY-MM-DD format” help ensure consistent output.
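A description asks for a format but does not enforce it, so it can be worth checking date fields after extraction. The following is a minimal sketch, not part of Doclo; `isIsoDate` and `findBadDates` are hypothetical helper names, and the field names are taken from the schema above.

```typescript
// Hypothetical helper: true only for a real calendar date written as YYYY-MM-DD.
function isIsoDate(value: string): boolean {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(value);
  if (!match) return false;
  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);
  // Date.UTC rolls impossible dates over (Feb 30 becomes Mar 1),
  // so a date is valid only if it survives the round trip unchanged.
  const date = new Date(Date.UTC(year, month - 1, day));
  return (
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day
  );
}

// Returns the names of date fields that are present but not valid YYYY-MM-DD.
function findBadDates(invoice: { invoiceDate?: string; dueDate?: string }): string[] {
  const bad: string[] = [];
  if (invoice.invoiceDate !== undefined && !isIsoDate(invoice.invoiceDate)) {
    bad.push('invoiceDate');
  }
  if (invoice.dueDate !== undefined && !isIsoDate(invoice.dueDate)) {
    bad.push('dueDate');
  }
  return bad;
}
```

Calling `findBadDates(result.output)` after a run gives a list of fields to re-extract or send for manual review.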

Step 2: Set Up the Provider

Choose a vision-language model (VLM) provider for extraction. A VLM works best for invoices with tables and complex layouts because it reads the page visually instead of relying on extracted text alone.

import { createVLMProvider } from '@doclo/providers-llm';

const vlmProvider = createVLMProvider({
  provider: 'google',
  model: 'google/gemini-2.5-flash',
  apiKey: process.env.OPENROUTER_API_KEY!,
  via: 'openrouter'
});

For higher accuracy on complex invoices, switch to a larger model such as Claude or GPT-4. For example, with Claude:

const vlmProvider = createVLMProvider({
  provider: 'anthropic',
  model: 'anthropic/claude-sonnet-4.5',
  apiKey: process.env.OPENROUTER_API_KEY!,
  via: 'openrouter'
});

Step 3: Build the Flow

Basic Extraction Flow

The simplest approach is direct VLM extraction:

import { createFlow, extract } from '@doclo/flows';

const invoiceFlow = createFlow()
  .step('extract', extract({
    provider: vlmProvider,
    schema: invoiceSchema
  }))
  .build();

With OCR Pre-Processing

For text-heavy invoices, or when you need lower costs, parse the document first and then extract from the parsed text:

import { createFlow, parse, extract } from '@doclo/flows';
import { createOCRProvider } from '@doclo/providers-datalab';

const ocrProvider = createOCRProvider({
  endpoint: 'https://www.datalab.to/api/v1/marker',
  apiKey: process.env.DATALAB_API_KEY!
});

const invoiceFlow = createFlow()
  .step('parse', parse({ provider: ocrProvider }))
  .step('extract', extract({
    provider: vlmProvider,
    schema: invoiceSchema,
    inputMode: 'ir'  // Use parsed text only
  }))
  .build();

With Hybrid Input Mode

For maximum accuracy on invoices with visual elements (stamps, signatures, checkboxes), use both parsed text and visual context:

const invoiceFlow = createFlow()
  .step('parse', parse({ provider: ocrProvider }))
  .step('extract', extract({
    provider: vlmProvider,
    schema: invoiceSchema,
    inputMode: 'ir+source'  // Both text and images
  }))
  .build();

Step 4: Run the Flow

Prepare the Document

Read the file and encode it as a base64 data URL:

import { readFileSync } from 'fs';

// Read and encode the file
const fileBuffer = readFileSync('./invoice.pdf');
const base64 = `data:application/pdf;base64,${fileBuffer.toString('base64')}`;

// Or for images
const imageBuffer = readFileSync('./invoice.png');
const imageBase64 = `data:image/png;base64,${imageBuffer.toString('base64')}`;
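The snippet above hard-codes the MIME type for each file. If you accept several file types, a small lookup keeps the data URL prefix correct. This is a sketch under assumptions: `toDataUrl` and `MIME_BY_EXTENSION` are hypothetical names, and this guide only shows PDF and PNG input, so check which formats your provider accepts before adding others.

```typescript
// Hypothetical extension-to-MIME map. Only PDF and PNG appear in this guide;
// the JPEG entries are an assumption about what a provider may accept.
const MIME_BY_EXTENSION: Record<string, string> = {
  pdf: 'application/pdf',
  png: 'image/png',
  jpg: 'image/jpeg',
  jpeg: 'image/jpeg'
};

// Builds a data URL from a file name and an already base64-encoded payload.
function toDataUrl(fileName: string, base64Payload: string): string {
  const extension = fileName.split('.').pop()?.toLowerCase() ?? '';
  const mimeType = MIME_BY_EXTENSION[extension];
  if (!mimeType) {
    // Fail loudly instead of guessing a MIME type for an unknown extension.
    throw new Error(`Unsupported file type: ${fileName}`);
  }
  return `data:${mimeType};base64,${base64Payload}`;
}
```

With the file buffer from above, this would be called as `toDataUrl('./invoice.pdf', fileBuffer.toString('base64'))`.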

Execute the Flow

const result = await invoiceFlow.run({ base64 });

console.log('Invoice Number:', result.output.invoiceNumber);
console.log('Total:', result.output.currency, result.output.total);
console.log('Line Items:', result.output.lineItems?.length);
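Extraction models can misread digits, so it may help to check that the extracted numbers agree with each other before using them. The sketch below assumes the field names from the schema in Step 1; `checkInvoiceTotals` and the half-cent tolerance are illustrative choices, not part of Doclo.

```typescript
// Only the fields needed for the arithmetic check, named as in the schema above.
interface InvoiceTotals {
  lineItems?: Array<{ amount: number }>;
  subtotal?: number;
  taxAmount?: number;
  total: number;
}

// Returns human-readable problems; an empty array means the totals are consistent.
// The tolerance absorbs floating-point noise while still catching a one-cent gap.
function checkInvoiceTotals(invoice: InvoiceTotals, tolerance = 0.005): string[] {
  const problems: string[] = [];
  const differs = (a: number, b: number): boolean => Math.abs(a - b) > tolerance;

  // Check 1: line item amounts should add up to the subtotal.
  if (invoice.lineItems && invoice.lineItems.length > 0 && invoice.subtotal !== undefined) {
    const lineSum = invoice.lineItems.reduce((sum, item) => sum + item.amount, 0);
    if (differs(lineSum, invoice.subtotal)) {
      problems.push(
        `Line items sum to ${lineSum.toFixed(2)} but subtotal is ${invoice.subtotal.toFixed(2)}`
      );
    }
  }

  // Check 2: subtotal plus tax should equal the total.
  if (invoice.subtotal !== undefined && invoice.taxAmount !== undefined) {
    const expectedTotal = invoice.subtotal + invoice.taxAmount;
    if (differs(expectedTotal, invoice.total)) {
      problems.push(
        `Subtotal plus tax is ${expectedTotal.toFixed(2)} but total is ${invoice.total.toFixed(2)}`
      );
    }
  }

  return problems;
}
```

Checks are skipped when the fields they need are missing, since most fields in the schema are optional. Invoices with discounts or shipping charges would need extra terms in the second check.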

Step 5: Add Type Safety

Use TypeScript interfaces for type-safe extraction:

interface Invoice {
  invoiceNumber: string;
  invoiceDate?: string;
  dueDate?: string;
  vendor: {
    name: string;
    address?: string;
    city?: string;
    country?: string;
    taxId?: string;
    email?: string;
    phone?: string;
  };
  customer?: {
    name?: string;
    address?: string;
    taxId?: string;
  };
  lineItems?: Array<{
    description: string;
    quantity?: number;
    unitPrice?: number;
    amount: number;
  }>;
  subtotal?: number;
  taxRate?: number;
  taxAmount?: number;
  total: number;
  currency?: string;
}

const invoiceFlow = createFlow()
  .step('extract', extract<Invoice>({
    provider: vlmProvider,
    schema: invoiceSchema
  }))
  .build();

const result = await invoiceFlow.run({ base64 });
// result.output is typed as Invoice

Step 6: Enable Citation Tracking

Track which parts of the document each field came from:

const invoiceFlow = createFlow()
  .step('extract', extract<Invoice>({
    provider: vlmProvider,
    schema: invoiceSchema,
    citations: {
      enabled: true,
      detectInferred: true  // Flag calculated values
    }
  }))
  .build();

const result = await invoiceFlow.run({ base64 });

// Access citation data
const output = result.output as any;
if (output.citations) {
  console.log('Total came from:', output.citations['total']);
  // { lineIds: ['p1_l42'], confidence: 0.95 }

  // Check if a value was inferred/calculated
  if (output.citations['subtotal']?.inferred) {
    console.log('Subtotal was calculated, not directly extracted');
  }
}
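Citation data can also drive a review queue. The sketch below lists fields that were inferred or fall under a confidence threshold. The `FieldCitation` shape is an assumption based only on the properties shown in the example above (`lineIds`, `confidence`, `inferred`); `fieldsNeedingReview` and the 0.8 default are hypothetical.

```typescript
// Assumed citation shape, inferred from the example output in this guide.
interface FieldCitation {
  lineIds?: string[];
  confidence?: number;
  inferred?: boolean;
}

// Returns the names of fields that were inferred or have low (or missing) confidence.
function fieldsNeedingReview(
  citations: Record<string, FieldCitation>,
  minConfidence = 0.8
): string[] {
  return Object.entries(citations)
    .filter(([, citation]) =>
      citation.inferred === true || (citation.confidence ?? 0) < minConfidence
    )
    .map(([field]) => field);
}
```

A field with no confidence value is treated as needing review, which errs on the side of caution. After a run, this would be called as `fieldsNeedingReview(output.citations)`.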

Step 7: Add Custom Instructions

Guide the extraction with additional instructions:

const invoiceFlow = createFlow()
  .step('extract', extract<Invoice>({
    provider: vlmProvider,
    schema: invoiceSchema,
    additionalInstructions: `
      - Extract dates in YYYY-MM-DD format regardless of how they appear
      - For amounts, return plain numbers without currency symbols or thousands separators (e.g., 1250.5, not $1,250.50)
      - If tax rate is not explicitly shown, calculate it as a percentage: taxAmount / subtotal × 100
      - Currency should be a 3-letter ISO code (USD, EUR, GBP, etc.)
      - For line items, include all rows from the items table
    `
  }))
  .build();
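Instructions that ask the model to do arithmetic, such as deriving the tax rate, can also be backed by a recomputation in code. This is a minimal sketch with a hypothetical helper, `deriveTaxRate`; it follows the schema's convention that `taxRate` is a percentage (20 for 20%).

```typescript
// Hypothetical fallback: derive the tax rate as a percentage when the invoice omits it.
// Returns undefined when the inputs are missing or the subtotal is zero.
function deriveTaxRate(subtotal?: number, taxAmount?: number): number | undefined {
  if (subtotal === undefined || taxAmount === undefined || subtotal === 0) {
    return undefined;
  }
  // Scale to a percentage and round to two decimal places.
  return Math.round((taxAmount / subtotal) * 10000) / 100;
}
```

One way to use it is as a fallback: `const taxRate = invoice.taxRate ?? deriveTaxRate(invoice.subtotal, invoice.taxAmount);`.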

Complete Example

import { createFlow, parse, extract } from '@doclo/flows';
import { createVLMProvider } from '@doclo/providers-llm';
import { createOCRProvider } from '@doclo/providers-datalab';
import { readFileSync } from 'fs';

// Types
interface Invoice {
  invoiceNumber: string;
  invoiceDate?: string;
  dueDate?: string;
  vendor: {
    name: string;
    address?: string;
    taxId?: string;
  };
  lineItems?: Array<{
    description: string;
    quantity?: number;
    unitPrice?: number;
    amount: number;
  }>;
  subtotal?: number;
  taxAmount?: number;
  total: number;
  currency?: string;
}

// Schema
const invoiceSchema = {
  type: 'object',
  properties: {
    invoiceNumber: { type: 'string', description: 'Invoice number' },
    invoiceDate: { type: 'string', description: 'Date in YYYY-MM-DD' },
    dueDate: { type: 'string', description: 'Due date in YYYY-MM-DD' },
    vendor: {
      type: 'object',
      properties: {
        name: { type: 'string' },
        address: { type: 'string' },
        taxId: { type: 'string' }
      },
      required: ['name']
    },
    lineItems: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          description: { type: 'string' },
          quantity: { type: 'number' },
          unitPrice: { type: 'number' },
          amount: { type: 'number' }
        },
        required: ['description', 'amount']
      }
    },
    subtotal: { type: 'number' },
    taxAmount: { type: 'number' },
    total: { type: 'number' },
    currency: { type: 'string' }
  },
  required: ['invoiceNumber', 'total']
};

// Providers
const vlmProvider = createVLMProvider({
  provider: 'google',
  model: 'google/gemini-2.5-flash',
  apiKey: process.env.OPENROUTER_API_KEY!,
  via: 'openrouter'
});

const ocrProvider = createOCRProvider({
  endpoint: 'https://www.datalab.to/api/v1/marker',
  apiKey: process.env.DATALAB_API_KEY!
});

// Build flow with observability
const invoiceFlow = createFlow({
  observability: {
    onStepEnd: (ctx) => {
      // Fall back to 'n/a' when a step reports no cost
      console.log(`${ctx.stepId}: ${ctx.duration}ms, $${ctx.cost?.toFixed(4) ?? 'n/a'}`);
    }
  }
})
  .step('parse', parse({ provider: ocrProvider }))
  .step('extract', extract<Invoice>({
    provider: vlmProvider,
    schema: invoiceSchema,
    inputMode: 'ir+source',
    citations: { enabled: true }
  }))
  .build();

// Process invoice
async function processInvoice(filePath: string) {
  const fileBuffer = readFileSync(filePath);
  // Treats every non-PDF file as a PNG; extend this check for other image types
  const mimeType = filePath.endsWith('.pdf') ? 'application/pdf' : 'image/png';
  const base64 = `data:${mimeType};base64,${fileBuffer.toString('base64')}`;

  const result = await invoiceFlow.run({ base64 });

  console.log('\n--- Extracted Invoice ---');
  console.log('Invoice #:', result.output.invoiceNumber);
  console.log('Date:', result.output.invoiceDate);
  console.log('Vendor:', result.output.vendor?.name);
  console.log('Line Items:', result.output.lineItems?.length ?? 0);
  console.log('Total:', result.output.currency, result.output.total);
  console.log('\n--- Metrics ---');
  console.log('Duration:', result.aggregated.totalDurationMs, 'ms');
  console.log(`Cost: $${result.aggregated.totalCostUSD.toFixed(4)}`);

  return result.output;
}

// Run
processInvoice('./invoice.pdf')
  .then(invoice => console.log('\nDone:', invoice.invoiceNumber))
  .catch(console.error);

Using Doclo Cloud

To process invoices via Doclo Cloud instead of running locally:

import { DocloClient } from '@doclo/client';
import { readFileSync } from 'fs';

const client = new DocloClient({
  apiKey: process.env.DOCLO_API_KEY!
});

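const fileBuffer = readFileSync('./invoice.pdf');
// Raw base64 here (no data: URL prefix); the MIME type is passed separately below
const base64 = fileBuffer.toString('base64');

// Sync execution (wait for result). Invoice is the interface defined in Step 5.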
const result = await client.flows.run<Invoice>('your-invoice-flow-id', {
  input: {
    document: {
      base64,
      filename: 'invoice.pdf',
      mimeType: 'application/pdf'
    }
  },
  wait: true,
  timeout: 60000
});

console.log('Invoice:', result.output);

Next Steps

Multi-Provider Extraction: Improve accuracy with consensus voting
Process Large Documents: Handle multi-page invoices
Schemas: Learn schema definition patterns
Citations: Track extraction sources
