The chunk node splits a parsed DocumentIR into smaller pieces, useful for RAG pipelines, embedding generation, and processing documents that exceed context limits.

Basic Usage

import { createFlow, parse, chunk } from '@docloai/flows';

const flow = createFlow()
  .step('parse', parse({ provider: ocrProvider }))
  .step('chunk', chunk({
    strategy: 'recursive',
    maxSize: 1000,
    overlap: 100
  }))
  .build();

const result = await flow.run({ base64: pdfDataUrl });
// result.output is ChunkOutput

Configuration Options

chunk({
  strategy: 'recursive',     // Chunking strategy
  maxSize: 1000,             // Max characters per chunk
  minSize: 100,              // Min characters per chunk
  overlap: 100,              // Character overlap between chunks
  separators: ['\n\n', '\n', '. ', ' ']  // Hierarchical separators
})

Chunking Strategies

Recursive (Default)

Splits by hierarchical separators, respecting natural boundaries:
chunk({
  strategy: 'recursive',
  maxSize: 1000,
  minSize: 100,
  overlap: 100,
  separators: ['\n\n', '\n', '. ', ' ']
})
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `maxSize` | `number` | `1000` | Maximum characters per chunk |
| `minSize` | `number` | `100` | Minimum characters per chunk |
| `overlap` | `number` | `0` | Character overlap between chunks |
| `separators` | `string[]` | `['\n\n', '\n', '. ', ' ']` | Separator hierarchy |
Best for: General documents, articles, reports.
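The library's internal implementation isn't shown here, but conceptually the recursive strategy works roughly like this (a minimal sketch; `recursiveSplit` is a hypothetical helper, not part of the package):

```typescript
// Sketch of recursive splitting: try the coarsest separator first and
// fall back to finer separators for any piece that is still too large.
function recursiveSplit(text: string, separators: string[], maxSize: number): string[] {
  if (text.length <= maxSize) return [text];
  if (separators.length === 0) {
    // No separators left: hard-cut at maxSize.
    const cuts: string[] = [];
    for (let i = 0; i < text.length; i += maxSize) cuts.push(text.slice(i, i + maxSize));
    return cuts;
  }
  const [sep, ...finer] = separators;
  const chunks: string[] = [];
  let current = '';
  for (const piece of text.split(sep)) {
    const candidate = current ? current + sep + piece : piece;
    if (candidate.length <= maxSize) {
      current = candidate;            // still fits: keep accumulating
    } else {
      if (current) chunks.push(current);
      if (piece.length > maxSize) {   // piece alone is too big: recurse
        chunks.push(...recursiveSplit(piece, finer, maxSize));
        current = '';
      } else {
        current = piece;
      }
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

The real node also applies `minSize` and `overlap`; this sketch only illustrates the separator hierarchy.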

Section

Splits by document sections (headers, chapters):
chunk({
  strategy: 'section',
  maxSize: 2000,
  minSize: 100
})
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `maxSize` | `number` | `2000` | Maximum characters per section chunk |
| `minSize` | `number` | `100` | Minimum characters per section chunk |
Best for: Structured documents with clear sections.
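The node splits on the header structure detected in the parsed DocumentIR. As a rough analogue, here is how section splitting might look over plain markdown text (`splitSections` is illustrative only, not the library's implementation):

```typescript
// Illustrative analogue: split markdown text into sections at header lines.
// The real node works on headers detected in the parsed DocumentIR.
function splitSections(text: string): { section: string; content: string }[] {
  return text
    .split(/^(?=#{1,6} )/m)               // split before each header line
    .filter(part => part.trim().length > 0)
    .map(part => {
      const firstLine = part.split('\n')[0];
      const isHeader = /^#{1,6} /.test(firstLine);
      return {
        section: isHeader ? firstLine.replace(/^#{1,6} /, '') : '',
        content: part.trim(),
      };
    });
}
```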

Page

Splits by page boundaries:
chunk({
  strategy: 'page',
  pagesPerChunk: 1,
  combineShortPages: true,
  minPageContent: 100
})
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `pagesPerChunk` | `number` | `1` | Pages per chunk |
| `combineShortPages` | `boolean` | `true` | Combine short pages together |
| `minPageContent` | `number` | `100` | Minimum content (characters) to keep a page |
Best for: Maintaining page-level context.
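The core grouping step is simple to picture; a minimal sketch (`groupPages` is a hypothetical helper, and the real node additionally applies `combineShortPages` and `minPageContent`):

```typescript
// Group an array of page texts into chunks of `pagesPerChunk` pages each.
function groupPages(pages: string[], pagesPerChunk: number): string[][] {
  const groups: string[][] = [];
  for (let i = 0; i < pages.length; i += pagesPerChunk) {
    groups.push(pages.slice(i, i + pagesPerChunk));
  }
  return groups;
}
```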

Fixed

Splits at fixed intervals:
chunk({
  strategy: 'fixed',
  size: 512,
  unit: 'characters',
  overlap: 50
})
| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `size` | `number` | `512` | Fixed size per chunk |
| `unit` | `'characters' \| 'tokens'` | `'characters'` | Size unit |
| `overlap` | `number` | `0` | Overlap between chunks |
Best for: Uniform chunk sizes for embeddings.
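The fixed strategy amounts to a sliding window of `size` characters that advances by `size - overlap` each step. A sketch of the character-unit case (token counting would depend on the tokenizer; `fixedChunks` is a hypothetical helper):

```typescript
// Fixed-size chunking in characters with a sliding-window overlap.
function fixedChunks(text: string, size: number, overlap: number): string[] {
  if (overlap >= size) throw new Error('overlap must be smaller than size');
  const step = size - overlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last window reached the end
  }
  return chunks;
}
```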

Output: ChunkOutput

interface ChunkOutput {
  chunks: ChunkMetadata[];
  totalChunks: number;
  averageChunkSize: number;
  sourceDocument?: DocumentIR;  // Original for citation mapping
}

interface ChunkMetadata {
  content: string;         // Chunk text content
  id: string;              // Unique chunk identifier
  index: number;           // Position in sequence
  startChar: number;       // Start position in original
  endChar: number;         // End position in original
  pageNumbers: number[];   // Pages this chunk spans
  section?: string;        // Section title if detected
  headers?: string[];      // Header hierarchy
  strategy: string;        // Which strategy created this
  wordCount: number;
  charCount: number;
}
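The summary fields can be derived from the chunk list itself. For example, `averageChunkSize` is presumably the mean of the chunks' `charCount` values (whether the library rounds is an assumption in this sketch):

```typescript
// Derive the ChunkOutput summary fields from a list of chunks.
function summarize(chunks: { charCount: number }[]) {
  const totalChunks = chunks.length;
  const averageChunkSize =
    totalChunks === 0
      ? 0
      : Math.round(chunks.reduce((sum, c) => sum + c.charCount, 0) / totalChunks);
  return { totalChunks, averageChunkSize };
}
```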

Use Cases

RAG Pipeline

Chunk for retrieval-augmented generation:
const ragFlow = createFlow()
  .step('parse', parse({ provider: ocrProvider }))
  .step('chunk', chunk({
    strategy: 'recursive',
    maxSize: 500,
    overlap: 50
  }))
  .build();

const result = await ragFlow.run({ base64: pdf });

// Generate embeddings for each chunk
for (const chunkData of result.output.chunks) {
  const embedding = await generateEmbedding(chunkData.content);
  await saveToVectorStore(chunkData.id, embedding, chunkData);
}
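`generateEmbedding` and `saveToVectorStore` above are placeholders for your own embedding model and vector store. At query time, retrieval typically ranks stored chunks by similarity to the query embedding; a minimal in-memory sketch using cosine similarity (not part of the package):

```typescript
// Cosine similarity between two equal-length embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the ids of the k stored chunks most similar to the query.
function topK(query: number[], store: { id: string; embedding: number[] }[], k: number): string[] {
  return [...store]
    .sort((x, y) => cosine(query, y.embedding) - cosine(query, x.embedding))
    .slice(0, k)
    .map(entry => entry.id);
}
```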

Large Document Processing

Process documents exceeding context limits:
import { createFlow, parse, chunk, extract, combine } from '@docloai/flows';

const flow = createFlow()
  .step('parse', parse({ provider: ocrProvider }))
  .step('chunk', chunk({
    strategy: 'page',
    pagesPerChunk: 5
  }))
  .forEach('extract', (chunkOutput) =>
    // Process each page group
    createFlow()
      .step('extract', extract({
        provider: llmProvider,
        schema: schema
      }))
  )
  .step('combine', combine({ strategy: 'merge' }))
  .build();

Overlap for Context

Use overlap to maintain context across chunk boundaries:
chunk({
  strategy: 'recursive',
  maxSize: 1000,
  overlap: 200  // 200 char overlap
})
This makes it likely that a sentence split at a chunk boundary appears in full in at least one chunk, provided the sentence is shorter than the overlap.
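For the fixed strategy, where overlap is an exact character window, you can sanity-check this property directly; a hedged sketch (recursive-strategy overlap may be looser, since it respects separator boundaries):

```typescript
// Check that each chunk begins with the tail of the previous chunk.
// Holds for fixed-strategy overlap; recursive overlap may differ.
function overlapsConsistent(chunks: string[], overlap: number): boolean {
  for (let i = 1; i < chunks.length; i++) {
    if (!chunks[i].startsWith(chunks[i - 1].slice(-overlap))) return false;
  }
  return true;
}
```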

Preserving Source Information

Chunk metadata includes page numbers and positions for citation mapping:
const result = await flow.run({ base64: pdf });

for (const chunkData of result.output.chunks) {
  console.log(`Chunk ${chunkData.index}:`);
  console.log(`  Pages: ${chunkData.pageNumbers.join(', ')}`);
  console.log(`  Position: ${chunkData.startChar}-${chunkData.endChar}`);
  console.log(`  Content: ${chunkData.content.substring(0, 100)}...`);
}

Next Steps