Marker is a document conversion provider from Datalab that converts PDFs and images to clean Markdown with optional image extraction.
Installation
npm install @doclo/providers-datalab
Basic Setup
import { markerOCRProvider } from '@doclo/providers-datalab';
const ocrProvider = markerOCRProvider({
apiKey: process.env.DATALAB_API_KEY!
});
Configuration Options
markerOCRProvider({
apiKey: string, // Required: API key
endpoint?: string, // Custom endpoint (default: datalab API)
// Processing mode
mode?: 'fast' | 'balanced' | 'high_accuracy',
// Page selection
maxPages?: number, // Process first N pages only
pageRange?: string, // Specific pages, e.g., "0,2-4,10"
// Output options
force_ocr?: boolean, // Force OCR (default: true)
extractImages?: boolean, // Extract images (default: true)
paginate?: boolean, // Add page delimiters
formatLines?: boolean, // Format lines in output
// Language
langs?: string[], // OCR languages, e.g., ['en', 'de']
// Processing
stripExistingOCR?: boolean, // Redo OCR from scratch
polling?: {
maxAttempts?: number, // Max polling attempts (default: 60)
pollingInterval?: number // Polling interval ms (default: 2000)
}
})
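The polling defaults imply an upper bound on how long a single document can be waited on. The sketch below works that out, assuming the provider polls at a fixed interval with no backoff (the source does not say either way). `maxPollingWaitMs` is a hypothetical helper for illustration, not part of `@doclo/providers-datalab`.

```typescript
// Hypothetical helper, not part of @doclo/providers-datalab.
interface PollingOptions {
  maxAttempts?: number;     // documented default: 60
  pollingInterval?: number; // documented default: 2000 (ms)
}

// Upper bound on total polling time, ASSUMING a fixed interval with no backoff.
function maxPollingWaitMs(polling: PollingOptions = {}): number {
  const maxAttempts = polling.maxAttempts ?? 60;
  const pollingInterval = polling.pollingInterval ?? 2000;
  return maxAttempts * pollingInterval;
}

console.log(maxPollingWaitMs());                     // 120000 (2 minutes with defaults)
console.log(maxPollingWaitMs({ maxAttempts: 150 })); // 300000 (5 minutes)
```

If large documents in `high_accuracy` mode tend to exceed the default window, raising `maxAttempts` is probably simpler than shortening `pollingInterval`.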
Processing Modes
| Mode | Cost/Page | Best For |
|---|---|---|
| fast | $0.002 | Quick previews, simple documents |
| balanced | $0.004 | General use (default) |
| high_accuracy | $0.006 | Complex layouts, critical accuracy |
const provider = markerOCRProvider({
apiKey: process.env.DATALAB_API_KEY!,
mode: 'high_accuracy'
});
Usage with Flows
import { createFlow, parse, extract } from '@doclo/flows';
import { markerOCRProvider } from '@doclo/providers-datalab';
const ocrProvider = markerOCRProvider({
apiKey: process.env.DATALAB_API_KEY!,
mode: 'balanced'
});
const flow = createFlow()
.step('parse', parse({ provider: ocrProvider }))
.step('extract', extract({
provider: vlmProvider, // a VLM provider configured elsewhere
schema: documentSchema // your extraction schema, defined elsewhere
}))
.build();
Output: DocumentIR with Markdown
Marker returns DocumentIR with markdown content:
interface DocumentIR {
pages: {
width: number;
height: number;
markdown?: string; // Page markdown content
lines: {
text: string;
bbox?: object;
}[];
}[];
extras?: {
raw: object;
costUSD: number;
pageCount: number;
status: string;
success: boolean;
images?: ExtractedImage[]; // Extracted figures/tables
};
}
Image Extraction
Marker can extract figures, tables, and charts as images:
const provider = markerOCRProvider({
apiKey: process.env.DATALAB_API_KEY!,
extractImages: true // Default
});
const result = await flow.run({ base64: documentData });
// Access extracted images
const images = result.artifacts.parse?.extras?.images;
for (const image of images || []) {
console.log(`Image ${image.id}:`);
console.log(` Page: ${image.pageNumber}`);
console.log(` Type: ${image.caption}`); // 'Figure', 'Table', etc.
console.log(` Data: ${image.base64.substring(0, 50)}...`);
}
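To do something with an extracted image, such as writing it to disk, the `base64` string has to be decoded into bytes first. The sketch below is a hypothetical helper, not part of `@doclo/providers-datalab`. The source does not say whether `image.base64` is raw base64 or a full data URL, so the helper accepts both.

```typescript
// Hypothetical helper, not part of @doclo/providers-datalab.
// ASSUMPTION: the payload is either raw base64 or a data URL such as
// "data:image/png;base64,....". Both forms are handled.
function decodeImageBase64(payload: string): Buffer {
  const commaIndex = payload.indexOf(',');
  const raw =
    payload.startsWith('data:') && commaIndex !== -1
      ? payload.slice(commaIndex + 1) // drop the "data:...;base64," prefix
      : payload;
  return Buffer.from(raw, 'base64');
}

// "iVBORw0KGgo=" is the base64 encoding of the 8-byte PNG file signature.
const bytes = decodeImageBase64('data:image/png;base64,iVBORw0KGgo=');
console.log(bytes.length); // 8
```

The resulting `Buffer` can be passed to `fs.writeFile` or uploaded to object storage.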
To disable image extraction:
const provider = markerOCRProvider({
apiKey: process.env.DATALAB_API_KEY!,
extractImages: false
});
Page Selection
Process specific pages to reduce cost and time:
// First 5 pages only
const firstPagesProvider = markerOCRProvider({
apiKey: process.env.DATALAB_API_KEY!,
maxPages: 5
});
// Specific page range (0-indexed)
const pageRangeProvider = markerOCRProvider({
apiKey: process.env.DATALAB_API_KEY!,
pageRange: '0,2-4,10' // Pages 1, 3-5, and 11
});
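Because `pageRange` is 0-indexed, it is easy to be off by one when translating from the page numbers a reader sees. The sketch below expands a range string into explicit indices so it can be checked before submission. `expandPageRange` is a hypothetical helper, not part of `@doclo/providers-datalab`, and it assumes the comma-and-hyphen syntax shown in the example above.

```typescript
// Hypothetical helper, not part of @doclo/providers-datalab.
// Expands a 0-indexed range string such as "0,2-4,10" into sorted page indices.
function expandPageRange(pageRange: string): number[] {
  const pages = new Set<number>();
  for (const part of pageRange.split(',')) {
    // Each segment is either a single index ("10") or an inclusive span ("2-4").
    const match = /^(\d+)(?:-(\d+))?$/.exec(part.trim());
    if (!match) {
      throw new Error(`Invalid page range segment: "${part}"`);
    }
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    if (end < start) {
      throw new Error(`Descending page range segment: "${part}"`);
    }
    for (let page = start; page <= end; page++) {
      pages.add(page);
    }
  }
  return Array.from(pages).sort((a, b) => a - b);
}

const indices = expandPageRange('0,2-4,10');
console.log(indices.join(', '));                  // 0, 2, 3, 4, 10
console.log(indices.map(i => i + 1).join(', '));  // 1, 3, 4, 5, 11 (1-based page numbers)
```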
Language Support
Specify OCR languages for better accuracy:
const provider = markerOCRProvider({
apiKey: process.env.DATALAB_API_KEY!,
langs: ['en', 'de', 'fr'] // English, German, French
});
Supported Formats
| Format | MIME Type |
|---|---|
| PDF | application/pdf |
| PNG | image/png |
| JPEG | image/jpeg |
| WebP | image/webp |
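A pre-flight check against the formats in the table above can reject unsupported files before any API call is made. The sketch below maps file extensions to those MIME types. `supportedMimeType` is a hypothetical helper, not part of `@doclo/providers-datalab`, and treating `.jpg` as JPEG is the only mapping added beyond the table.

```typescript
// Hypothetical helper, not part of @doclo/providers-datalab.
// Extension -> MIME type, taken from the supported formats table.
const SUPPORTED_MIME_TYPES = new Map<string, string>([
  ['pdf', 'application/pdf'],
  ['png', 'image/png'],
  ['jpg', 'image/jpeg'],
  ['jpeg', 'image/jpeg'],
  ['webp', 'image/webp'],
]);

// Returns the MIME type for a supported file name, or undefined otherwise.
function supportedMimeType(filename: string): string | undefined {
  const dot = filename.lastIndexOf('.');
  if (dot === -1) return undefined;
  return SUPPORTED_MIME_TYPES.get(filename.slice(dot + 1).toLowerCase());
}

console.log(supportedMimeType('invoice.PDF')); // application/pdf
console.log(supportedMimeType('notes.docx'));  // undefined
```

Extension checks are only a heuristic. Inspecting the file's leading bytes is more reliable when uploads come from untrusted sources.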
Pricing
| Mode | Cost per Page |
|---|---|
| fast | $0.002 |
| balanced | $0.004 |
| high_accuracy | $0.006 |
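The per-page prices above make it straightforward to estimate a job's cost up front. The sketch below multiplies the page count by the listed rate. `estimateCostUSD` is a hypothetical helper, not part of `@doclo/providers-datalab`, and it assumes billing is strictly per processed page with no minimums or volume discounts.

```typescript
type MarkerMode = 'fast' | 'balanced' | 'high_accuracy';

// Per-page prices in USD, copied from the pricing table.
const COST_PER_PAGE_USD: Record<MarkerMode, number> = {
  fast: 0.002,
  balanced: 0.004,
  high_accuracy: 0.006,
};

// Hypothetical helper: estimated cost, ASSUMING simple per-page billing.
function estimateCostUSD(pageCount: number, mode: MarkerMode = 'balanced'): number {
  if (!Number.isInteger(pageCount) || pageCount < 0) {
    throw new RangeError('pageCount must be a non-negative integer');
  }
  // Round to tenths of a cent to hide floating-point noise.
  return Math.round(pageCount * COST_PER_PAGE_USD[mode] * 1000) / 1000;
}

console.log(estimateCostUSD(250, 'fast'));          // 0.5
console.log(estimateCostUSD(250));                  // 1
console.log(estimateCostUSD(250, 'high_accuracy')); // 1.5
```

After a run, the estimate can be compared against the `costUSD` value reported in `extras`.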
Marker vs Surya
| Feature | Marker | Surya |
|---|---|---|
| Output | Markdown | Text + bbox |
| Image extraction | Yes | No |
| Processing modes | 3 | 1 |
| Cost | $0.002-0.006/page | $0.01/page |
| Best for | Markdown conversion | Position data |
Choose Marker when:
- You need clean Markdown output
- You want to extract figures and tables
- You want to choose the quality/cost tradeoff per job
Choose Surya when:
- You need precise bounding boxes
- You are building citation systems
- You are doing RAG with positional data
Example: Document to Markdown
import { createFlow, parse } from '@doclo/flows';
import { markerOCRProvider } from '@doclo/providers-datalab';
const ocrProvider = markerOCRProvider({
apiKey: process.env.DATALAB_API_KEY!,
mode: 'high_accuracy',
extractImages: true
});
const flow = createFlow()
.step('parse', parse({ provider: ocrProvider }))
.build();
const result = await flow.run({
base64: 'data:application/pdf;base64,...'
});
// Get full markdown
const fullMarkdown = result.output.pages
.map(p => p.markdown || p.lines.map(l => l.text).join('\n'))
.join('\n\n---\n\n');
console.log('Document markdown:');
console.log(fullMarkdown);
// Get extracted images
const images = result.output.extras?.images || [];
console.log(`Extracted ${images.length} images`);
Next Steps
- Surya OCR: Text with bounding boxes
- Reducto: Chunking and splitting