The chunk node splits a parsed DocumentIR into smaller pieces, useful for RAG pipelines, embedding generation, and processing documents that exceed context limits.
Basic Usage
import { createFlow, parse, chunk } from '@doclo/flows';

const flow = createFlow()
  .step('parse', parse({ provider: ocrProvider }))
  .step('chunk', chunk({
    strategy: 'recursive',
    maxSize: 1000,
    overlap: 100
  }))
  .build();

const result = await flow.run({ base64: pdfDataUrl });
// result.output is ChunkOutput
Configuration Options
chunk({
  strategy: 'recursive',                 // Chunking strategy
  maxSize: 1000,                         // Max characters per chunk
  minSize: 100,                          // Min characters per chunk
  overlap: 100,                          // Character overlap between chunks
  separators: ['\n\n', '\n', '. ', ' ']  // Hierarchical separators
})
Chunking Strategies
Recursive (Default)
Splits by hierarchical separators, respecting natural boundaries:
chunk({
  strategy: 'recursive',
  maxSize: 1000,
  minSize: 100,
  overlap: 100,
  separators: ['\n\n', '\n', '. ', ' ']
})
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| maxSize | number | 1000 | Maximum characters per chunk |
| minSize | number | 100 | Minimum characters per chunk |
| overlap | number | 0 | Character overlap between chunks |
| separators | string[] | ['\n\n', '\n', '. ', ' '] | Separator hierarchy |
Best for: General documents, articles, reports.
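If you want an intuition for what the recursive strategy does, here is an illustrative sketch (not the library's implementation; it omits overlap and minSize handling): try the first separator, and re-split any piece that is still over maxSize with the next separator in the hierarchy.

// Illustrative only: recursively split text using a separator hierarchy.
function recursiveSplit(text: string, separators: string[], maxSize: number): string[] {
  if (text.length <= maxSize) return [text];
  const [sep, ...rest] = separators;
  if (sep === undefined) {
    // No separators left: fall back to hard cuts at maxSize.
    const parts: string[] = [];
    for (let i = 0; i < text.length; i += maxSize) parts.push(text.slice(i, i + maxSize));
    return parts;
  }
  // Split on the current separator, then re-split pieces that are still too large.
  return text
    .split(sep)
    .filter((piece) => piece.length > 0)
    .flatMap((piece) => recursiveSplit(piece, rest, maxSize));
}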
Section
Splits by document sections (headers, chapters):
chunk({
  strategy: 'section',
  maxSize: 2000,
  minSize: 100
})
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| maxSize | number | 2000 | Maximum characters per section chunk |
| minSize | number | 100 | Minimum characters per section chunk |
Best for: Structured documents with clear sections.
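As a rough mental model only (the real node works on DocumentIR structure rather than raw text), section chunking cuts at detected headers and folds sections smaller than minSize into a neighbor. A hypothetical sketch over plain markdown text:

// Illustrative only: cut markdown text at header lines, merging tiny sections.
function splitBySections(markdown: string, minSize: number): string[] {
  const sections = markdown.split(/^(?=#{1,3} )/m); // cut before each header line
  const merged: string[] = [];
  for (const section of sections) {
    // Fold sections shorter than minSize into the previous chunk.
    if (merged.length > 0 && section.length < minSize) {
      merged[merged.length - 1] += section;
    } else {
      merged.push(section);
    }
  }
  return merged;
}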
Page
Splits by page boundaries:
chunk({
  strategy: 'page',
  pagesPerChunk: 1,
  combineShortPages: true,
  minPageContent: 100
})
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| pagesPerChunk | number | 1 | Pages per chunk |
| combineShortPages | boolean | true | Combine short pages together |
| minPageContent | number | 100 | Minimum content to keep a page |
Best for: Maintaining page-level context.
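One way to picture combineShortPages, as a hypothetical sketch over an array of page strings (the actual grouping logic may differ): pages with less than minPageContent characters ride along with the current group instead of standing alone.

// Illustrative only: group pages, folding short pages into the current group.
function groupPages(pages: string[], pagesPerChunk: number, minPageContent: number): string[] {
  const groups: string[] = [];
  let current: string[] = [];
  let fullPages = 0;
  for (const page of pages) {
    current.push(page);
    if (page.length >= minPageContent) fullPages++;
    // Close the group once it holds pagesPerChunk substantial pages.
    if (fullPages >= pagesPerChunk) {
      groups.push(current.join('\n'));
      current = [];
      fullPages = 0;
    }
  }
  if (current.length > 0) groups.push(current.join('\n'));
  return groups;
}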
Fixed
Splits at fixed intervals:
chunk({
  strategy: 'fixed',
  size: 512,
  unit: 'characters',
  overlap: 50
})
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| size | number | 512 | Fixed size per chunk |
| unit | 'characters' \| 'tokens' | 'characters' | Size unit |
| overlap | number | 0 | Overlap between chunks |
Best for: Uniform chunk sizes for embeddings.
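Fixed chunking advances by a stride of size minus overlap, so consecutive chunks share characters at their edges. A minimal sketch for unit: 'characters' (the tokens unit would count tokenizer tokens instead):

// Illustrative only: fixed-size character chunks with overlap.
function fixedSplit(text: string, size: number, overlap: number): string[] {
  const stride = size - overlap;
  if (stride <= 0) throw new Error('overlap must be smaller than size');
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += stride) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached the end
  }
  return chunks;
}

// With size 512 and overlap 50, chunks start at characters 0, 462, 924, ...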
Output: ChunkOutput
interface ChunkOutput {
  chunks: ChunkMetadata[];
  totalChunks: number;
  averageChunkSize: number;
  sourceDocument?: DocumentIR;  // Original for citation mapping
}

interface ChunkMetadata {
  content: string;        // Chunk text content
  id: string;             // Unique chunk identifier
  index: number;          // Position in sequence
  startChar: number;      // Start position in original
  endChar: number;        // End position in original
  pageNumbers: number[];  // Pages this chunk spans
  section?: string;       // Section title if detected
  headers?: string[];     // Header hierarchy
  strategy: string;       // Which strategy created this
  wordCount: number;
  charCount: number;
}
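Because every chunk records the pages it spans, you can build a page-to-chunks index for later lookups using only the fields above:

// Build a page -> chunks index from the documented metadata fields.
function indexChunksByPage(output: ChunkOutput): Map<number, ChunkMetadata[]> {
  const byPage = new Map<number, ChunkMetadata[]>();
  for (const chunk of output.chunks) {
    for (const page of chunk.pageNumbers) {
      const list = byPage.get(page) ?? [];
      list.push(chunk);
      byPage.set(page, list);
    }
  }
  return byPage;
}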
Use Cases
RAG Pipeline
Chunk for retrieval-augmented generation:
const ragFlow = createFlow()
  .step('parse', parse({ provider: ocrProvider }))
  .step('chunk', chunk({
    strategy: 'recursive',
    maxSize: 500,
    overlap: 50
  }))
  .build();

const result = await ragFlow.run({ base64: pdf });

// Generate embeddings for each chunk
for (const chunkData of result.output.chunks) {
  const embedding = await generateEmbedding(chunkData.content);
  await saveToVectorStore(chunkData.id, embedding, chunkData);
}
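generateEmbedding and saveToVectorStore above are placeholders for your own infrastructure. As one possible generateEmbedding, assuming the official openai SDK and an OpenAI embedding model (any embedding provider works the same way):

import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Assumption: OpenAI embeddings; swap in your provider of choice.
async function generateEmbedding(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
  });
  return response.data[0].embedding;
}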
Large Document Processing
Process documents exceeding context limits:
const flow = createFlow()
  .step('parse', parse({ provider: ocrProvider }))
  .step('chunk', chunk({
    strategy: 'page',
    pagesPerChunk: 5
  }))
  .forEach('extract', (chunkOutput) =>
    // Process each page group
    createFlow()
      .step('extract', extract({
        provider: llmProvider,
        schema: schema
      }))
  )
  .step('combine', combine({ strategy: 'merge' }))
  .build();
Overlap for Context
Use overlap to maintain context across chunk boundaries:
chunk({
  strategy: 'recursive',
  maxSize: 1000,
  overlap: 200  // 200 char overlap
})
This keeps sentences that straddle a chunk boundary fully contained in at least one chunk, as long as they are shorter than the overlap.
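Since each chunk records startChar and endChar, you can verify the effective overlap between consecutive chunks directly from the metadata:

// Check the character overlap between consecutive chunks.
for (let i = 1; i < result.output.chunks.length; i++) {
  const prev = result.output.chunks[i - 1];
  const curr = result.output.chunks[i];
  console.log(`Chunks ${prev.index}/${curr.index} overlap: ${prev.endChar - curr.startChar} chars`);
}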
Citation Mapping
Chunk metadata includes page numbers and character positions for citation mapping:
const result = await flow.run({ base64: pdf });

for (const chunkData of result.output.chunks) {
  console.log(`Chunk ${chunkData.index}:`);
  console.log(`  Pages: ${chunkData.pageNumbers.join(', ')}`);
  console.log(`  Position: ${chunkData.startChar}-${chunkData.endChar}`);
  console.log(`  Content: ${chunkData.content.substring(0, 100)}...`);
}
Next Steps
parse: Parse documents before chunking
combine: Merge results from chunked processing