Prerequisites
- Node.js 18+
- A Doclo API key or provider API keys
- A multi-page document (PDF)
When to Use Chunking
Consider chunking when:
- Documents exceed 20 pages
- Extraction quality degrades on long documents
- You need page-level citations or references
- Processing time exceeds acceptable limits
Strategy Overview
Basic Page-Based Chunking
Split a document by pages and process each chunk.
Chunking Strategies
Page Strategy
Process documents in page groups. Best for:
- Documents where page boundaries matter (contracts, reports)
- Maintaining page-level references
- Consistent chunk sizes
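To make the page grouping concrete, here is a minimal sketch in plain TypeScript. It does not use the Doclo API: the `PageChunk` type and the `chunkByPages` function are hypothetical names introduced for illustration, and only `pagesPerChunk` comes from this guide.

```typescript
// Hypothetical shape for a group of pages; not a Doclo type.
interface PageChunk {
  index: number;     // zero-based position of the chunk
  startPage: number; // first page in the chunk (1-based, inclusive)
  endPage: number;   // last page in the chunk (1-based, inclusive)
}

// Group a document of `totalPages` pages into consecutive page ranges.
function chunkByPages(totalPages: number, pagesPerChunk: number): PageChunk[] {
  if (!Number.isInteger(totalPages) || totalPages < 0) {
    throw new RangeError("totalPages must be a non-negative integer");
  }
  if (!Number.isInteger(pagesPerChunk) || pagesPerChunk < 1) {
    throw new RangeError("pagesPerChunk must be a positive integer");
  }
  const chunks: PageChunk[] = [];
  for (let start = 1; start <= totalPages; start += pagesPerChunk) {
    chunks.push({
      index: chunks.length,
      startPage: start,
      // The last chunk may be shorter than pagesPerChunk.
      endPage: Math.min(start + pagesPerChunk - 1, totalPages),
    });
  }
  return chunks;
}

// A 23-page contract in groups of 10 yields ranges 1-10, 11-20, 21-23.
const pageChunks = chunkByPages(23, 10);
console.log(pageChunks.map((c) => `${c.startPage}-${c.endPage}`).join(", "));
```

Because each chunk records its own page range, any value extracted from it can be traced back to specific pages.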
Section Strategy
Split by document sections (headers, chapters). Best for:
- Structured documents with clear headings
- Technical documentation
- Reports with distinct sections
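As a rough illustration of section-based splitting, the sketch below cuts text at heading lines. It assumes Markdown-style headings (`#` through `######`), which is only one possible way to detect sections and may differ from how Doclo identifies them; `Section` and `splitBySections` are hypothetical names.

```typescript
// Hypothetical section shape; not a Doclo type.
interface Section {
  title: string; // heading text, or "" for content before the first heading
  body: string;  // text under the heading, trimmed
}

// Split text into sections at Markdown-style heading lines (an assumption).
function splitBySections(text: string): Section[] {
  const sections: Section[] = [];
  let current: { title: string; lines: string[] } = { title: "", lines: [] };

  // Store the section collected so far, skipping a completely empty preamble.
  const flush = (): void => {
    const body = current.lines.join("\n").trim();
    if (current.title !== "" || body !== "") {
      sections.push({ title: current.title, body });
    }
  };

  for (const line of text.split(/\r?\n/)) {
    const heading = /^#{1,6}\s+(.*)$/.exec(line);
    if (heading) {
      flush();
      current = { title: heading[1].trim(), lines: [] };
    } else {
      current.lines.push(line);
    }
  }
  flush();
  return sections;
}

const sections = splitBySections("Preface\n# Scope\nApplies to all sites.\n## Terms\nDefined below.");
console.log(sections.map((s) => s.title || "(preamble)"));
```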
Recursive Strategy
Split by natural text boundaries. Best for:
- Unstructured text
- Articles and narratives
- RAG pipelines
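One common way to split on natural boundaries is to try coarse separators first and fall back to finer ones only when a piece is still too long. The sketch below follows that idea; the separator list, the character-based size limit, and the name `splitRecursive` are assumptions for illustration, not Doclo's implementation.

```typescript
// Separators from coarsest to finest: paragraphs, lines, sentences, words.
const SEPARATORS = ["\n\n", "\n", ". ", " "];

// Split `text` into chunks of at most `maxChars` characters, preferring the
// coarsest boundary that works. Separators at chunk boundaries are dropped.
function splitRecursive(text: string, maxChars: number, level = 0): string[] {
  if (!Number.isInteger(maxChars) || maxChars < 1) {
    throw new RangeError("maxChars must be a positive integer");
  }
  if (text.length <= maxChars) return text.length > 0 ? [text] : [];

  if (level >= SEPARATORS.length) {
    // No separators left: fall back to hard cuts.
    const cuts: string[] = [];
    for (let i = 0; i < text.length; i += maxChars) {
      cuts.push(text.slice(i, i + maxChars));
    }
    return cuts;
  }

  const sep = SEPARATORS[level];
  const chunks: string[] = [];
  let buffer = "";
  for (const piece of text.split(sep)) {
    // Greedily pack pieces into the current buffer while they fit.
    const candidate = buffer === "" ? piece : buffer + sep + piece;
    if (candidate.length <= maxChars) {
      buffer = candidate;
      continue;
    }
    if (buffer !== "") chunks.push(buffer);
    if (piece.length <= maxChars) {
      buffer = piece;
    } else {
      // The piece alone is too long: retry it with the next, finer separator.
      chunks.push(...splitRecursive(piece, maxChars, level + 1));
      buffer = "";
    }
  }
  if (buffer !== "") chunks.push(buffer);
  return chunks;
}

const article =
  "First paragraph here.\n\nSecond paragraph is a bit longer than the first one.\n\nThird.";
console.log(splitRecursive(article, 30));
```

Short paragraphs stay whole, and only the paragraph that exceeds the limit is broken further, in this case at word boundaries.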
Fixed Strategy
Split at exact intervals. Best for:
- Embedding generation
- Uniform processing requirements
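For comparison, fixed splitting ignores content entirely and cuts at a constant interval. A minimal sketch, assuming the interval is measured in characters (a real pipeline might count tokens instead); `splitFixed` is a hypothetical helper name.

```typescript
// Cut text into pieces of exactly `size` characters; the last piece may be shorter.
// Note: this counts UTF-16 code units, so a cut can land inside a surrogate pair.
function splitFixed(text: string, size: number): string[] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}

console.log(splitFixed("abcdefghij", 4)); // [ 'abcd', 'efgh', 'ij' ]
```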
Overlap for Context Continuity
When content spans chunk boundaries, use overlap so that adjacent chunks share some content and nothing at a boundary is extracted only in part.
Processing Multi-Document Files
For files containing multiple logical documents (e.g., a PDF with several invoices), use split instead of chunk.
Parallel Processing with forEach
The forEach step processes chunks in parallel.
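The general pattern behind processing chunks in parallel can be sketched in plain TypeScript. This is not the forEach step itself: `mapChunksParallel`, `worker`, and the `concurrency` parameter are hypothetical names, and the concurrency cap is an assumption about what a caller might want.

```typescript
// Run `worker` over every chunk with at most `concurrency` calls in flight.
// Results are returned in the same order as the input chunks.
async function mapChunksParallel<T, R>(
  chunks: readonly T[],
  worker: (chunk: T, index: number) => Promise<R>,
  concurrency = 4,
): Promise<R[]> {
  const results = new Array<R>(chunks.length);
  let next = 0;

  // Each lane repeatedly claims the next unprocessed index until none remain.
  async function lane(): Promise<void> {
    while (next < chunks.length) {
      const i = next++;
      results[i] = await worker(chunks[i], i);
    }
  }

  const laneCount = Math.max(1, Math.min(concurrency, chunks.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}

// Usage: simulate extraction on three chunks, two at a time.
mapChunksParallel(["pages 1-10", "pages 11-20", "pages 21-23"], async (chunk, i) => {
  return `chunk ${i}: ${chunk}`;
}, 2).then((out) => console.log(out));
```

If any worker call rejects, the returned promise rejects with that error; chunks already in flight keep running, but their results are discarded.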
Accessing Chunk Metadata
Within forEach, you have access to chunk information.
Combining Results
Choose a combine strategy based on your data.
Merge (Default)
Merges results intelligently based on their type.
Concatenate
Keep all results as an array.
First/Last
Return the first or last non-null result.
Complete Example: Contract Processing
Extract clauses from a multi-page contract.
Progress Tracking
Monitor chunked processing with observability hooks.
Memory Considerations
For very large documents:
- Stream processing: Process chunks sequentially if memory is constrained
- Reduce pagesPerChunk: Smaller chunks use less memory per operation
- Use IR-only mode: Skip visual processing to reduce memory usage
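The first two suggestions can be illustrated with a small sketch that handles one page range at a time, so only a single chunk's data needs to be held in memory. The names `pageRanges`, `processSequentially`, and `handle` are hypothetical; only `pagesPerChunk` comes from this guide.

```typescript
// Lazily yield [startPage, endPage] ranges instead of building them all up front.
function* pageRanges(totalPages: number, pagesPerChunk: number): Generator<[number, number]> {
  if (!Number.isInteger(pagesPerChunk) || pagesPerChunk < 1) {
    throw new RangeError("pagesPerChunk must be a positive integer");
  }
  for (let start = 1; start <= totalPages; start += pagesPerChunk) {
    yield [start, Math.min(start + pagesPerChunk - 1, totalPages)];
  }
}

// Process page ranges one after another rather than in parallel.
async function processSequentially<R>(
  totalPages: number,
  pagesPerChunk: number,
  handle: (startPage: number, endPage: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (const [start, end] of pageRanges(totalPages, pagesPerChunk)) {
    // Awaiting inside the loop means the next range starts only after this one finishes.
    results.push(await handle(start, end));
  }
  return results;
}

// Usage: a smaller pagesPerChunk lowers the peak memory of each handle() call.
processSequentially(5, 2, async (s, e) => `${s}-${e}`).then((out) => console.log(out));
```

The trade-off is throughput: sequential processing gives up the speed of parallel execution in exchange for a bounded memory footprint.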
Error Handling in Chunked Flows
Handle failures gracefully.
Using Doclo Cloud
Process large documents via Doclo Cloud with async execution.
Next Steps
- Chunk Node: Chunking configuration reference
- Combine Node: Result merging strategies
- Observability: Monitor chunked processing
- Error Recovery: Handle failures gracefully