The split node uses a vision-language model (VLM) to find document boundaries in a multi-document PDF and to categorize each document by type.

Basic Usage

import { createFlow, split, extract, combine } from '@docloai/flows';
import { createVLMProvider } from '@docloai/providers-llm';

const vlmProvider = createVLMProvider({
  provider: 'google',
  model: 'google/gemini-flash-2.5',
  apiKey: process.env.OPENROUTER_API_KEY!,
  via: 'openrouter'
});

const flow = createFlow()
  .step('split', split({
    provider: vlmProvider,
    schemas: {
      invoice: invoiceSchema,
      receipt: receiptSchema
    }
  }))
  .forEach('process', (doc) =>
    createFlow()
      .step('extract', extract({
        provider: vlmProvider,
        schema: doc.schema
      }))
  )
  .step('combine', combine())
  .build();

const result = await flow.run({ base64: multiDocPdf });
// result.output is an array of extracted documents
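
The example above references invoiceSchema and receiptSchema without defining them. The sketch below shows one way they might look, assuming the library accepts JSON Schema style objects (the Options Reference only states that each value is an object). The field names are hypothetical.

```typescript
// Hypothetical schemas: the field names and the JSON Schema shape are
// assumptions for illustration, not taken from the library.
const invoiceSchema = {
  type: 'object',
  properties: {
    invoiceNumber: { type: 'string' },
    total: { type: 'number' }
  },
  required: ['invoiceNumber', 'total']
} as const;

const receiptSchema = {
  type: 'object',
  properties: {
    merchant: { type: 'string' },
    total: { type: 'number' }
  },
  required: ['merchant', 'total']
} as const;

// The keys of this map are the document type names passed to split().
const schemas: Record<string, object> = {
  invoice: invoiceSchema,
  receipt: receiptSchema
};
```

The map can then be passed as the schemas option in place of the inline object.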

Configuration Options

split({
  provider: vlmProvider,           // Required: VLM provider
  schemas: {                       // Required: Document type schemas
    invoice: invoiceSchema,
    receipt: receiptSchema,
    contract: contractSchema
  },
  includeOther: true,              // Include uncategorized documents
  consensus: { runs: 3 },          // Multi-run voting for accuracy
  reasoning: { enabled: true }     // Extended thinking
})
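
To give an intuition for what multi-run voting means, here is a minimal sketch of a majority vote over the labels returned by several runs. It is an illustration only: majorityVote is a hypothetical helper, and the library's actual consensus logic is not documented on this page and may differ.

```typescript
// Illustrative only: a plain majority vote over the labels returned by
// several runs. This is NOT the library's consensus implementation.
function majorityVote(labels: string[]): string | undefined {
  const counts = new Map<string, number>();
  let winner: string | undefined;
  let best = 0;
  for (const label of labels) {
    const n = (counts.get(label) ?? 0) + 1;
    counts.set(label, n);
    // Strict ">" keeps the first label to reach the top count on ties.
    if (n > best) {
      best = n;
      winner = label;
    }
  }
  return winner;
}

// Three runs disagree on one document; the majority label wins.
const voted = majorityVote(['invoice', 'receipt', 'invoice']);
```

With runs: 3, a single outlier classification would be outvoted under this scheme, which is the usual motivation for an odd number of runs.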

Options Reference

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| provider | VLMProvider | Required | VLM provider for document analysis |
| schemas | Record<string, object> | Required | Map of document types to schemas |
| schemaRef | string | - | Reference to schema registry |
| includeOther | boolean | true | Include documents that don’t match any type |
| consensus | ConsensusConfig | - | Multi-run voting |
| reasoning | object | - | Extended reasoning options |

Output: SplitDocument[]

The split node outputs an array of SplitDocument objects:
interface SplitDocument {
  type: string;           // Document type ('invoice', 'receipt', 'other')
  schema: object;         // Schema for this document type
  pages: number[];        // Page numbers in original PDF
  input: FlowInput;       // Input for processing this document
}
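
As a quick way to inspect a split result, the sketch below groups page numbers by document type. It redeclares the interface locally so it runs on its own. FlowInput is treated as opaque because its shape is not given here, and pagesByType is a hypothetical helper, not part of the library.

```typescript
// Local stand-ins for the types shown above. FlowInput's real shape is not
// given on this page, so it is treated as opaque here (assumption).
type FlowInput = unknown;

interface SplitDocument {
  type: string;
  schema: object;
  pages: number[];
  input: FlowInput;
}

// Hypothetical helper: collect the page numbers found for each document type.
function pagesByType(docs: SplitDocument[]): Map<string, number[]> {
  const grouped = new Map<string, number[]>();
  for (const doc of docs) {
    const pages = grouped.get(doc.type) ?? [];
    grouped.set(doc.type, pages.concat(doc.pages));
  }
  return grouped;
}

// Hand-written sample data, not real output from the split node.
const sample: SplitDocument[] = [
  { type: 'invoice', schema: {}, pages: [1, 2], input: null },
  { type: 'receipt', schema: {}, pages: [3], input: null },
  { type: 'invoice', schema: {}, pages: [4], input: null }
];

const grouped = pagesByType(sample);
// grouped.get('invoice') is [1, 2, 4]; grouped.get('receipt') is [3]
```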

Processing Split Documents

With forEach

Process each split document with its corresponding schema:
const flow = createFlow()
  .step('split', split({
    provider: vlmProvider,
    schemas: { invoice: invoiceSchema, receipt: receiptSchema }
  }))
  .forEach('process', (doc) =>
    createFlow()
      .step('extract', extract({
        provider: vlmProvider,
        schema: doc.schema  // Uses the matched schema
      }))
  )
  .step('combine', combine())
  .build();

With Conditional Routing

Route each document to a different extraction step based on its type:
const flow = createFlow()
  .step('split', split({
    provider: vlmProvider,
    schemas: { invoice: invoiceSchema, receipt: receiptSchema }
  }))
  .forEach('process', (doc) =>
    createFlow()
      .conditional('extract', (d) => {
        if (d.type === 'invoice') {
          return extract({ provider: vlmProvider, schema: invoiceSchema });
        } else if (d.type === 'receipt') {
          return extract({ provider: vlmProvider, schema: receiptSchema });
        }
        return extract({ provider: vlmProvider, schema: genericSchema });
      })
  )
  .step('combine', combine())
  .build();
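
When the number of document types grows, the if/else chain can be replaced by a lookup with a fallback. The sketch below shows only the schema selection, using placeholder schemas. schemaFor, routeSchemas, and fallbackSchema are hypothetical names, and wiring the result into extract() is left as in the example above.

```typescript
// Placeholder schemas (assumptions); substitute your own definitions.
const routeSchemas = new Map<string, object>([
  ['invoice', { title: 'invoice' }],
  ['receipt', { title: 'receipt' }]
]);
const fallbackSchema: object = { title: 'generic' };

// Pick the schema for a document type, falling back to the generic one.
function schemaFor(type: string): object {
  return routeSchemas.get(type) ?? fallbackSchema;
}
```

Inside the conditional callback this would reduce the body to a single extract call that uses schemaFor(d.type) as its schema.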

Schema Registry

Reference schemas from the registry instead of defining them inline:
split({
  provider: vlmProvider,
  schemaRef: 'document-types@1.0.0'
})
The registry entry's schema field should contain a schemas property that maps document types to schemas:
// Registry entry structure
{
  id: 'document-types',
  version: '1.0.0',
  schema: {
    schemas: {
      invoice: { /* schema */ },
      receipt: { /* schema */ }
    }
  }
}
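
Judging from the example, a schemaRef combines the entry's id and version as id@version. The sketch below splits such a reference into its parts. parseSchemaRef is a hypothetical helper for illustration; how the library resolves references internally is not shown on this page.

```typescript
// Hypothetical helper: split a reference such as 'document-types@1.0.0'
// into its id and version. The real registry lookup is not shown here.
interface SchemaRefParts {
  id: string;
  version: string;
}

function parseSchemaRef(ref: string): SchemaRefParts {
  const at = ref.lastIndexOf('@');
  // Reject references with no '@', an empty id, or an empty version.
  if (at <= 0 || at === ref.length - 1) {
    throw new Error(`Expected "<id>@<version>", got "${ref}"`);
  }
  return { id: ref.slice(0, at), version: ref.slice(at + 1) };
}

const parts = parseSchemaRef('document-types@1.0.0');
// parts.id is 'document-types'; parts.version is '1.0.0'
```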

Handling “Other” Documents

By default, documents that don’t match any defined type are categorized as “other”:
split({
  provider: vlmProvider,
  schemas: { invoice: invoiceSchema },
  includeOther: true  // Default: true
})
Set includeOther: false to exclude unrecognized documents:
split({
  provider: vlmProvider,
  schemas: { invoice: invoiceSchema },
  includeOther: false  // Skip unrecognized documents
})
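
If you keep includeOther: true but want to handle unrecognized documents separately (for example, to send them to manual review), you can partition the results yourself. This is a minimal sketch: partitionOther is a hypothetical helper, and it assumes only that each document carries the type field described above.

```typescript
// Hypothetical post-filter: separates documents typed 'other' from the rest.
// It works on any objects with a string `type` field.
function partitionOther<T extends { type: string }>(
  docs: T[]
): { recognized: T[]; other: T[] } {
  const recognized: T[] = [];
  const other: T[] = [];
  for (const doc of docs) {
    (doc.type === 'other' ? other : recognized).push(doc);
  }
  return { recognized, other };
}

// Hand-written sample data, not real output from the split node.
const { recognized, other } = partitionOther([
  { type: 'invoice', pages: [1] },
  { type: 'other', pages: [2] }
]);
```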

Extended Reasoning

Enable extended reasoning for complex document analysis:
split({
  provider: vlmProvider,
  schemas: { invoice: invoiceSchema, receipt: receiptSchema },
  reasoning: {
    enabled: true,
    effort: 'high'
  }
})

Example: Insurance Document Bundle

const insuranceFlow = createFlow()
  .step('split', split({
    provider: vlmProvider,
    schemas: {
      claim_form: claimSchema,
      medical_report: medicalSchema,
      receipt: receiptSchema,
      id_document: idSchema
    }
  }))
  .forEach('extract', (doc) =>
    createFlow()
      .step('extract', extract({
        provider: vlmProvider,
        schema: doc.schema
      }))
  )
  .step('combine', combine({ strategy: 'merge' }))
  .build();

const result = await insuranceFlow.run({ base64: documentBundle });

// result.output contains all extracted documents
// [
//   { type: 'claim_form', data: { ... } },
//   { type: 'medical_report', data: { ... } },
//   { type: 'receipt', data: { ... } }
// ]
