Marker is a document conversion provider from Datalab that converts PDFs and images to clean Markdown with optional image extraction.

Installation

npm install @docloai/providers-datalab

Basic Setup

import { markerOCRProvider } from '@docloai/providers-datalab';

const ocrProvider = markerOCRProvider({
  apiKey: process.env.DATALAB_API_KEY!
});

Configuration Options

markerOCRProvider({
  apiKey: string,              // Required: API key
  endpoint?: string,           // Custom endpoint (default: datalab API)

  // Processing mode
  mode?: 'fast' | 'balanced' | 'high_accuracy',

  // Page selection
  maxPages?: number,           // Process first N pages only
  pageRange?: string,          // Specific pages, e.g., "0,2-4,10"

  // Output options
  force_ocr?: boolean,         // Force OCR (default: true)
  extractImages?: boolean,     // Extract images (default: true)
  paginate?: boolean,          // Add page delimiters
  formatLines?: boolean,       // Format lines in output

  // Language
  langs?: string[],            // OCR languages, e.g., ['en', 'de']

  // Processing
  stripExistingOCR?: boolean,  // Redo OCR from scratch
  polling?: {
    maxAttempts?: number,      // Max polling attempts (default: 60)
    pollingInterval?: number   // Polling interval ms (default: 2000)
  }
})
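The polling defaults above imply a ceiling on how long a conversion can be waited on. The sketch below computes that ceiling. It assumes the provider sleeps for one `pollingInterval` per attempt, which the option comments suggest but do not confirm. `maxPollingWaitMs` is a hypothetical helper, not part of `@docloai/providers-datalab`.

```typescript
// Hypothetical helper, not part of @docloai/providers-datalab.
interface PollingOptions {
  maxAttempts?: number;     // default 60, per the option comments above
  pollingInterval?: number; // milliseconds, default 2000
}

// Upper bound on the time spent waiting between polls, ASSUMING one
// pollingInterval delay per attempt (request latency is not included).
function maxPollingWaitMs(polling: PollingOptions = {}): number {
  const { maxAttempts = 60, pollingInterval = 2000 } = polling;
  return maxAttempts * pollingInterval;
}

console.log(maxPollingWaitMs());                     // 120000 (2 minutes)
console.log(maxPollingWaitMs({ maxAttempts: 150 })); // 300000 (5 minutes)
```

Under that assumption, long documents in `high_accuracy` mode may need a higher `maxAttempts` than the default two-minute window allows.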

Processing Modes

| Mode | Cost/Page | Best For |
| --- | --- | --- |
| fast | $0.002 | Quick previews, simple documents |
| balanced | $0.004 | General use (default) |
| high_accuracy | $0.006 | Complex layouts, critical accuracy |

const provider = markerOCRProvider({
  apiKey: process.env.DATALAB_API_KEY!,
  mode: 'high_accuracy'
});

Usage with Flows

import { createFlow, parse, extract } from '@docloai/flows';
import { markerOCRProvider } from '@docloai/providers-datalab';

const ocrProvider = markerOCRProvider({
  apiKey: process.env.DATALAB_API_KEY!,
  mode: 'balanced'
});

// vlmProvider and documentSchema are assumed to be defined elsewhere in your application
const flow = createFlow()
  .step('parse', parse({ provider: ocrProvider }))
  .step('extract', extract({
    provider: vlmProvider,
    schema: documentSchema
  }))
  .build();

Output: DocumentIR with Markdown

Marker returns a DocumentIR with Markdown content for each page:
interface DocumentIR {
  pages: {
    width: number;
    height: number;
    markdown?: string;      // Page markdown content
    lines: {
      text: string;
      bbox?: object;
    }[];
  }[];
  extras?: {
    raw: object;
    costUSD: number;
    pageCount: number;
    status: string;
    success: boolean;
    images?: ExtractedImage[];  // Extracted figures/tables
  };
}

Extracted Images

Marker can extract figures, tables, and charts as images:
const provider = markerOCRProvider({
  apiKey: process.env.DATALAB_API_KEY!,
  extractImages: true  // Default
});

const result = await flow.run({ base64: documentData });

// Access extracted images
const images = result.artifacts.parse?.extras?.images;
for (const image of images || []) {
  console.log(`Image ${image.id}:`);
  console.log(`  Page: ${image.pageNumber}`);
  console.log(`  Type: ${image.caption}`);  // 'Figure', 'Table', etc.
  console.log(`  Data: ${image.base64.substring(0, 50)}...`);
}
To disable image extraction:
const provider = markerOCRProvider({
  apiKey: process.env.DATALAB_API_KEY!,
  extractImages: false
});
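Once images are extracted, it is often convenient to look them up by page. The sketch below groups them that way. `ExtractedImageLike` is an assumed shape based only on the fields logged in the example above (`id`, `pageNumber`, `caption`, `base64`), and `groupImagesByPage` is a hypothetical helper, not a library function. The sample data is invented for illustration.

```typescript
// Assumed shape: only the fields used in the logging example above.
// The field types (e.g. id as string) are guesses, not taken from the library.
interface ExtractedImageLike {
  id: string;
  pageNumber: number;
  caption?: string;
  base64: string;
}

// Hypothetical helper: buckets images by the page they were found on,
// preserving the original order within each page.
function groupImagesByPage(
  images: ExtractedImageLike[]
): Map<number, ExtractedImageLike[]> {
  const byPage = new Map<number, ExtractedImageLike[]>();
  for (const image of images) {
    const bucket = byPage.get(image.pageNumber) ?? [];
    bucket.push(image);
    byPage.set(image.pageNumber, bucket);
  }
  return byPage;
}

// Invented sample data, for illustration only.
const sampleImages: ExtractedImageLike[] = [
  { id: 'img-1', pageNumber: 0, caption: 'Figure', base64: 'AAAA' },
  { id: 'img-2', pageNumber: 2, caption: 'Table', base64: 'BBBB' },
  { id: 'img-3', pageNumber: 0, caption: 'Figure', base64: 'CCCC' },
];

const grouped = groupImagesByPage(sampleImages);
console.log(grouped.get(0)?.map(image => image.id)); // ids of the two images on page 0
```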

Page Selection

Process specific pages to reduce cost and time:
// First 5 pages only
const firstPagesProvider = markerOCRProvider({
  apiKey: process.env.DATALAB_API_KEY!,
  maxPages: 5
});

// Specific page range (0-indexed)
const pageRangeProvider = markerOCRProvider({
  apiKey: process.env.DATALAB_API_KEY!,
  pageRange: '0,2-4,10'  // Pages 1, 3-5, and 11
});
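Because `pageRange` is 0-indexed, it is easy to be off by one. The sketch below expands a range string into explicit page numbers so it can be checked before a request is sent. It assumes the format is a comma-separated list of single pages and inclusive `start-end` ranges, as the example above suggests. `parsePageRange` is a hypothetical helper, not part of the provider.

```typescript
// Hypothetical helper: expands a pageRange string such as "0,2-4,10"
// into a sorted list of unique 0-indexed page numbers.
// ASSUMES comma-separated items and inclusive "start-end" ranges.
function parsePageRange(range: string): number[] {
  const pages = new Set<number>();
  for (const part of range.split(',')) {
    const token = part.trim();
    if (token === '') continue; // tolerate stray commas
    const match = /^(\d+)(?:-(\d+))?$/.exec(token);
    if (!match) throw new Error(`Invalid page range token: "${token}"`);
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    if (end < start) throw new Error(`Descending range: "${token}"`);
    for (let page = start; page <= end; page++) pages.add(page);
  }
  return Array.from(pages).sort((a, b) => a - b);
}

const zeroIndexed = parsePageRange('0,2-4,10');
console.log(zeroIndexed);                        // [0, 2, 3, 4, 10]
console.log(zeroIndexed.map(page => page + 1));  // [1, 3, 4, 5, 11] as human page numbers
```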

Language Support

Specify OCR languages for better accuracy:
const provider = markerOCRProvider({
  apiKey: process.env.DATALAB_API_KEY!,
  langs: ['en', 'de', 'fr']  // English, German, French
});

Supported Formats

| Format | MIME Type |
| --- | --- |
| PDF | application/pdf |
| PNG | image/png |
| JPEG | image/jpeg |
| WebP | image/webp |
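
The supported formats above can be checked on the client before a document is submitted. The sketch below does this for inputs passed as data URLs (the form used in the example on this page, `data:application/pdf;base64,...`). Both functions are hypothetical helpers, and whether the provider performs its own validation is not stated here.

```typescript
// MIME types taken from the Supported Formats table.
const SUPPORTED_MIME_TYPES: readonly string[] = [
  'application/pdf',
  'image/png',
  'image/jpeg',
  'image/webp',
];

// Hypothetical helper: extracts the MIME type from a data URL,
// or returns null if the string is not a data URL.
function mimeTypeFromDataUrl(dataUrl: string): string | null {
  const mime = /^data:([^;,]+)[;,]/.exec(dataUrl)?.[1];
  return mime ? mime.toLowerCase() : null;
}

// Hypothetical helper: true only for data URLs in a supported format.
function isSupportedInput(dataUrl: string): boolean {
  const mime = mimeTypeFromDataUrl(dataUrl);
  return mime !== null && SUPPORTED_MIME_TYPES.indexOf(mime) !== -1;
}

console.log(isSupportedInput('data:application/pdf;base64,AAAA')); // true
console.log(isSupportedInput('data:image/gif;base64,AAAA'));       // false
```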

Pricing

| Mode | Cost per Page |
| --- | --- |
| fast | $0.002 |
| balanced | $0.004 |
| high_accuracy | $0.006 |
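
As a rough budgeting aid, the per-page prices listed above can be turned into an estimate. This is a minimal sketch that assumes cost scales linearly with page count. `estimateCostUSD` is a hypothetical helper, and actual billing may differ (the provider reports the real figure in `extras.costUSD`).

```typescript
type MarkerMode = 'fast' | 'balanced' | 'high_accuracy';

// Per-page prices copied from the pricing table.
const COST_PER_PAGE_USD: Record<MarkerMode, number> = {
  fast: 0.002,
  balanced: 0.004,
  high_accuracy: 0.006,
};

// Hypothetical helper: linear estimate, ASSUMING no minimums or volume discounts.
function estimateCostUSD(mode: MarkerMode, pageCount: number): number {
  if (!Number.isInteger(pageCount) || pageCount < 0) {
    throw new RangeError(`pageCount must be a non-negative integer, got ${pageCount}`);
  }
  return COST_PER_PAGE_USD[mode] * pageCount;
}

console.log(estimateCostUSD('fast', 250).toFixed(2));          // "0.50"
console.log(estimateCostUSD('high_accuracy', 250).toFixed(2)); // "1.50"
```

Combining this with `maxPages` or `pageRange` gives a quick way to compare the cost of processing a whole document against only the pages that matter.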

Marker vs Surya

| Feature | Marker | Surya |
| --- | --- | --- |
| Output | Markdown | Text + bbox |
| Image extraction | Yes | No |
| Processing modes | 3 | 1 |
| Cost | $0.002-0.006/page | $0.01/page |
| Best for | Markdown conversion | Position data |

Choose Marker when:
  • You need clean Markdown output
  • You want to extract figures and tables
  • A variable quality/cost tradeoff is useful
Choose Surya when:
  • You need precise bounding boxes
  • You are building citation systems
  • You are doing RAG with positional data

Example: Document to Markdown

import { createFlow, parse } from '@docloai/flows';
import { markerOCRProvider } from '@docloai/providers-datalab';

const ocrProvider = markerOCRProvider({
  apiKey: process.env.DATALAB_API_KEY!,
  mode: 'high_accuracy',
  extractImages: true
});

const flow = createFlow()
  .step('parse', parse({ provider: ocrProvider }))
  .build();

const result = await flow.run({
  base64: 'data:application/pdf;base64,...'
});

// Get full markdown
const fullMarkdown = result.output.pages
  .map(p => p.markdown || p.lines.map(l => l.text).join('\n'))
  .join('\n\n---\n\n');

console.log('Document markdown:');
console.log(fullMarkdown);

// Get extracted images
const images = result.output.extras?.images || [];
console.log(`Extracted ${images.length} images`);
