Schemas define the structure of data you want to extract from documents. They use JSON Schema format and tell the AI exactly what fields to look for and how to format them.
What are Schemas?
Schemas serve multiple purposes:
- Define structure: Specify fields, types, and nesting
- Guide extraction: Help the AI understand what to look for
- Validate output: Ensure extracted data matches expected format
- Enable type safety: Generate TypeScript types from schemas
Schema Structure
Schemas follow the JSON Schema specification (draft-07):
const invoiceSchema = {
  type: 'object',
  required: ['invoiceNumber', 'totalAmount'],
  properties: {
    invoiceNumber: {
      type: 'string',
      description: 'Invoice number or ID'
    },
    vendor: {
      type: 'object',
      properties: {
        name: { type: 'string', description: 'Company name' },
        address: { type: 'string', description: 'Full address' }
      }
    },
    date: {
      type: 'string',
      format: 'date',
      description: 'Invoice date in ISO format (YYYY-MM-DD)'
    },
    totalAmount: {
      type: 'number',
      description: 'Total invoice amount'
    },
    lineItems: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          description: { type: 'string' },
          quantity: { type: 'number' },
          unitPrice: { type: 'number' },
          amount: { type: 'number' }
        }
      }
    }
  }
};
Key Schema Properties
description
The most important property for extraction quality. Descriptions tell the AI what to look for and where to find it:
{
  invoiceNumber: {
    type: 'string',
    description: 'Invoice number, invoice ID, or reference number. Usually at the top of the document. Look for labels like "Invoice #", "Inv No", or "Reference".'
  }
}
Good descriptions significantly improve extraction accuracy. Be specific about where to find the data, what labels to look for, and what format to expect.
required
Fields that must be extracted. If a required field cannot be found, the extraction may fail or return null:
{
  type: 'object',
  required: ['invoiceNumber', 'totalAmount'],
  properties: { ... }
}
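To make the effect of required concrete, the following minimal sketch reports which required fields are absent from an extracted object. The helper findMissingRequired and the ObjectSchema type are hypothetical and not part of the SDK. Treating null as missing is an assumption made here because the text says extraction may return null; plain JSON Schema only checks that the key is present.

```typescript
// Hypothetical helper, not an SDK function: list required fields that are
// absent (undefined) or null in an extracted object.
type ObjectSchema = {
  type: 'object';
  required?: string[];
  properties: Record<string, unknown>;
};

function findMissingRequired(
  schema: ObjectSchema,
  data: Record<string, unknown>
): string[] {
  const required = schema.required ?? [];
  // Assumption: null counts as "not found", mirroring the behaviour described above.
  return required.filter((key) => data[key] === undefined || data[key] === null);
}

const requiredDemoSchema: ObjectSchema = {
  type: 'object',
  required: ['invoiceNumber', 'totalAmount'],
  properties: {
    invoiceNumber: { type: 'string' },
    totalAmount: { type: 'number' }
  }
};

const missingFields = findMissingRequired(requiredDemoSchema, {
  invoiceNumber: 'INV-001',
  totalAmount: null
});
console.log(missingFields); // ['totalAmount']
```

A check like this can run after extraction to decide whether to retry or flag the document for review.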
type
Data type for the field:
| Type | Description | Example |
|---|---|---|
| string | Text values | "INV-001" |
| number | Numeric values (integers or decimals) | 1234.56 |
| integer | Whole numbers only | 42 |
| boolean | True/false values | true |
| array | Lists of items | [1, 2, 3] |
| object | Nested structures | { name: "..." } |
| null | Explicit null value | null |
format
String format hints help the AI understand expected patterns:
{
  date: {
    type: 'string',
    format: 'date',
    description: 'Invoice date'
  },
  email: {
    type: 'string',
    format: 'email',
    description: 'Contact email address'
  }
}
Common formats:
- date: ISO date (YYYY-MM-DD)
- time: Time (HH:MM or HH:MM:SS)
- date-time: ISO 8601 datetime
- email: Email address
- uri: URL/URI
enum
Restrict values to a specific set:
{
  status: {
    type: 'string',
    enum: ['paid', 'pending', 'overdue'],
    description: 'Payment status'
  },
  temperatureUnit: {
    type: 'string',
    enum: ['Celsius', 'Fahrenheit'],
    description: 'Unit of temperature measurement'
  }
}
nullable
Allow a field to be null when data is not present. Note that nullable is an OpenAPI-style keyword rather than part of JSON Schema draft-07:
{
  taxAmount: {
    type: 'number',
    nullable: true,
    description: 'Tax amount if shown, null if not applicable'
  }
}
Different providers handle nullable differently. The SDK automatically translates nullable: true to the appropriate format for each provider (e.g., anyOf with null type for OpenAI).
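The translation described above can be pictured with a small sketch. The function nullableToAnyOf and the SchemaNode type below are hypothetical and only illustrate the idea of rewriting nullable: true as anyOf with a null branch; the SDK's actual logic is not shown in this document and may differ.

```typescript
// Illustrative only: a simplified schema node and a recursive rewrite of
// `nullable: true` into `anyOf: [<original>, { type: 'null' }]`.
type SchemaNode = {
  type?: string;
  nullable?: boolean;
  description?: string;
  properties?: Record<string, SchemaNode>;
  items?: SchemaNode;
  anyOf?: SchemaNode[];
};

function nullableToAnyOf(node: SchemaNode): SchemaNode {
  const { nullable, description, ...rest } = node;
  const inner: SchemaNode = { ...rest };

  // Recurse into object properties and array items.
  if (rest.properties) {
    const converted: Record<string, SchemaNode> = {};
    Object.entries(rest.properties).forEach(([key, child]) => {
      converted[key] = nullableToAnyOf(child);
    });
    inner.properties = converted;
  }
  if (rest.items) {
    inner.items = nullableToAnyOf(rest.items);
  }

  if (nullable === true) {
    // Keep the description on the outer node so the model still sees it.
    const wrapped: SchemaNode = { anyOf: [inner, { type: 'null' }] };
    if (description !== undefined) wrapped.description = description;
    return wrapped;
  }
  if (description !== undefined) inner.description = description;
  return inner;
}

const taxField = nullableToAnyOf({
  type: 'number',
  nullable: true,
  description: 'Tax amount if shown, null if not applicable'
});
console.log(JSON.stringify(taxField, null, 2));
```

Fields without nullable: true pass through unchanged, so the same schema definition can be reused across providers.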
pattern
Regular expression pattern for string validation:
{
  bunkeringTime: {
    type: 'string',
    pattern: '^([0-1]?[0-9]|2[0-3]):[0-5][0-9]$',
    description: 'Time in 24-hour HH:MM format; a leading zero on the hour is optional. Hours 0-23, minutes 00-59.'
  }
}
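It can help to try a pattern against sample values before putting it in a schema. The sketch below runs the time pattern from the example with the standard RegExp class; the sample values are made up for illustration. Note that this pattern also accepts single-digit hours such as 7:05.

```typescript
// The same pattern string as in the schema example above.
const bunkeringTimePattern = new RegExp('^([0-1]?[0-9]|2[0-3]):[0-5][0-9]$');

// Invented sample inputs: three valid times and two invalid ones.
const timeSamples = ['08:30', '23:59', '7:05', '24:00', '12:60'];
const timeResults = timeSamples.map((value) => bunkeringTimePattern.test(value));

console.log(timeResults); // [true, true, true, false, false]
```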
Using Schemas
Inline Schema
Pass the schema object directly to the extract node:
import { createFlow, extract } from '@doclo/flows';

const flow = createFlow()
  .step('extract', extract({
    provider: vlmProvider,
    schema: invoiceSchema
  }))
  .build();
Schema Reference
Reference schemas stored in the registry by ID and version:
const flow = createFlow()
  .step('extract', extract({
    provider: vlmProvider,
    schema: { ref: 'invoice@2.1.0' }
  }))
  .build();
The reference format is id@version (e.g., bdn@1.0.0, invoice@2.1.0).
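As a rough illustration of the id@version format, a reference string can be split into its two parts as follows. The parseSchemaRef helper is hypothetical, not an SDK function, and it assumes versions are plain MAJOR.MINOR.PATCH numbers; pre-release tags are not handled.

```typescript
// Hypothetical helper: split "id@version" into its parts.
type SchemaRef = { id: string; version: string };

function parseSchemaRef(ref: string): SchemaRef {
  const at = ref.lastIndexOf('@');
  const id = ref.slice(0, at);
  const version = ref.slice(at + 1);
  // Assumption: versions look like 2.1.0 (three dot-separated numbers).
  if (at <= 0 || !/^\d+\.\d+\.\d+$/.test(version)) {
    throw new Error(`Invalid schema reference "${ref}" (expected id@version)`);
  }
  return { id, version };
}

const parsedRef = parseSchemaRef('invoice@2.1.0');
console.log(parsedRef); // { id: 'invoice', version: '2.1.0' }
```

Rejecting malformed references early gives a clearer error than a failed registry lookup later in the flow.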
Enhanced Schema
Add examples and extraction guidance alongside the schema:
const flow = createFlow()
  .step('extract', extract({
    provider: vlmProvider,
    schema: {
      schema: invoiceSchema,
      examples: [
        {
          description: 'Standard invoice',
          input: 'Invoice #: INV-001\nTotal: $500.00',
          output: { invoiceNumber: 'INV-001', totalAmount: 500.00 }
        }
      ],
      extractionRules: 'Focus on the header section for invoice metadata',
      contextPrompt: 'This is a commercial invoice document',
      hints: ['Amounts may include currency symbols', 'Dates vary in format']
    }
  }))
  .build();
Schema Registry
The SDK includes a schema registry for storing and retrieving versioned schemas.
Local Registry
Register and retrieve schemas in memory:
import {
  SCHEMA_REGISTRY,
  registerSchema,
  getSchema,
  getLatestSchema
} from '@doclo/schemas';

// Register a schema
registerSchema({
  id: 'invoice',
  version: '2.1.0',
  schema: invoiceSchema,
  description: 'Invoice extraction schema',
  tags: ['finance', 'accounting'],
  createdAt: new Date().toISOString(),
  updatedAt: new Date().toISOString()
});

// Get a specific version
const schema = getSchema('invoice', '2.1.0');

// Get the latest version
const latest = getLatestSchema('invoice');

// List all versions
const versions = SCHEMA_REGISTRY.listVersions('invoice');
// ['2.1.0', '2.0.0', '1.0.0'] (sorted descending)

// Check if a schema exists
const exists = SCHEMA_REGISTRY.has('invoice', '2.1.0');

// Get by reference string
const schemaByRef = SCHEMA_REGISTRY.getByRef('invoice@2.1.0');
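The version list above is described as sorted descending. How the registry compares versions is not shown here; the sketch below is one plausible numeric comparison, included to illustrate why a plain string sort is not enough (it would rank 2.9.3 above 2.10.0). The function name is hypothetical.

```typescript
// Hypothetical comparator: order MAJOR.MINOR.PATCH strings newest first
// by comparing each numeric part in turn.
function compareVersionsDescending(a: string, b: string): number {
  const partsA = a.split('.').map(Number);
  const partsB = b.split('.').map(Number);
  for (let i = 0; i < 3; i++) {
    const diff = (partsB[i] ?? 0) - (partsA[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

const sortedVersions = ['1.0.0', '2.10.0', '2.1.0', '2.9.3'].sort(compareVersionsDescending);
console.log(sortedVersions); // ['2.10.0', '2.9.3', '2.1.0', '1.0.0']
```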
Remote Registry (Cloud)
Fetch schemas from Doclo Cloud with automatic caching:
import { DocloClient, RemoteSchemaRegistry } from '@doclo/client';

const client = new DocloClient({ apiKey: process.env.DOCLO_API_KEY });
const schemas = new RemoteSchemaRegistry(client, {
  ttlMs: 600000,          // Cache for 10 minutes
  autoRegisterLocal: true // Also register in local SCHEMA_REGISTRY
});

// Fetch a specific version
const schema = await schemas.get('invoice', '2.1.0');

// Fetch the latest version
const latest = await schemas.getLatest('invoice');

// Preload multiple schemas
await schemas.preload([
  'invoice@2.1.0',
  'receipt@1.0.0',
  'bdn@1.0.0'
]);
You can also use the client directly:
// Get a specific version
const schemaAsset = await client.schemas.get('invoice', '2.1.0');
// Get the latest version
const latest = await client.schemas.getLatest('invoice');
// List all versions
const { versions } = await client.schemas.listVersions('invoice');
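To give a feel for what a ttlMs setting implies, here is a minimal time-to-live cache sketch. It is not the RemoteSchemaRegistry implementation; the TtlCache class, its getOrLoad method, and the injectable clock are all assumptions made for illustration, and the loader is synchronous to keep the example short.

```typescript
// Illustrative TTL cache: entries are reused until they are older than ttlMs.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // `now` is injectable so the expiry logic can be demonstrated without waiting.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  getOrLoad(key: string, load: () => V): V {
    const hit = this.entries.get(key);
    if (hit !== undefined && hit.expiresAt > this.now()) {
      return hit.value; // still fresh
    }
    const value = load();
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    return value;
  }
}

let fakeClockMs = 0;
let loadCount = 0;
const schemaCache = new TtlCache<string>(600000, () => fakeClockMs);
const loadSchema = (): string => {
  loadCount += 1;
  return 'schema-body';
};

schemaCache.getOrLoad('invoice@2.1.0', loadSchema); // miss: loads
schemaCache.getOrLoad('invoice@2.1.0', loadSchema); // hit: served from cache
fakeClockMs = 600001;                               // just past the 10 minute TTL
schemaCache.getOrLoad('invoice@2.1.0', loadSchema); // expired: loads again
console.log(loadCount); // 2
```

A longer TTL reduces network calls at the cost of picking up newly published schema versions later.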
Schema Asset Structure
A schema asset includes metadata alongside the schema definition:
type SchemaAsset = {
  // Identity
  id: string;               // "invoice"
  version: string;          // "2.1.0" (semver)

  // Content
  schema: JSONSchemaObject; // The JSON Schema

  // Metadata
  description?: string;     // Human-readable description
  tags?: string[];          // Categorization tags
  changelog?: string;       // Version change notes

  // Timestamps
  createdAt: string;        // ISO date
  updatedAt: string;        // ISO date
};
Built-in Schemas
The SDK includes pre-built schemas for common document types:
| Schema | Description |
|---|---|
| bdn@1.0.0 | Bunker Delivery Note (maritime fuel delivery) |
Access via the schema registry:
import { SCHEMA_REGISTRY, bdnSchema } from '@doclo/schemas';
// Get the schema asset (with metadata)
const bdnAsset = SCHEMA_REGISTRY.get('bdn', '1.0.0');
// Or use the raw schema directly
console.log(bdnSchema); // The JSON Schema object
Schema Best Practices
Write Detailed Descriptions
The description is the most important property for extraction accuracy:
// Less effective
{ description: 'The total' }

// More effective
{
  description: 'Total amount due including tax. Look for labels like "Total", "Amount Due", "Grand Total", or "Balance Due". Usually appears at the bottom of the document near the subtotal. May include currency symbols.'
}
Specify Where to Find Data
Help the AI locate fields in the document:
{
  refNumber: {
    type: 'string',
    description: 'The primary unique identifier for the document. Look for labels such as "Reference", "Ref No", "Document No.", or "ID". This is often located in the header or top section. Examples: "REF-2023-10543", "DOC/23/554".'
  }
}
Handle Optional and Missing Data
Use nullable types for fields that may not exist:
{
  taxAmount: {
    type: 'number',
    nullable: true,
    description: 'Tax amount if shown. Set to null if not applicable or not present on the document.'
  }
}
Or use union types:
{
taxAmount: {
type: ['number', 'null'],
description: 'Tax amount if shown, null if not applicable'
}
}
Structure Nested Data Logically
Group related fields into objects:
{
  vendor: {
    type: 'object',
    description: 'Details of the supplier or vendor',
    properties: {
      name: {
        type: 'string',
        description: 'Company or trading name'
      },
      legalName: {
        type: 'string',
        nullable: true,
        description: 'Full registered legal name with corporate designator (Ltd, Inc, etc.)'
      },
      address: {
        type: 'string',
        description: 'Street address, city, and postal code'
      },
      phone: {
        type: 'string',
        nullable: true,
        description: 'Primary phone number, digits only without country code'
      }
    },
    required: ['name']
  }
}
Define Array Item Schemas
Always specify the structure of array items:
{
  lineItems: {
    type: 'array',
    description: 'Individual line items or products on the invoice',
    items: {
      type: 'object',
      properties: {
        description: {
          type: 'string',
          description: 'Product or service description'
        },
        quantity: {
          type: 'number',
          description: 'Quantity ordered or delivered'
        },
        unitPrice: {
          type: 'number',
          description: 'Price per unit'
        },
        amount: {
          type: 'number',
          description: 'Line total (quantity x unitPrice)'
        }
      },
      required: ['description', 'amount']
    }
  }
}
Handle Format Variations
Provide guidance for common format variations:
{
  date: {
    type: 'string',
    format: 'date',
    description: 'Document date. Convert from formats like DD.MM.YY, MM/DD/YYYY, or "January 15, 2024" to the standard YYYY-MM-DD format.'
  },
  phone: {
    type: 'string',
    description: 'Phone number. Extract only numeric digits, excluding country code and formatting characters like dashes or parentheses. For "(123) 456-7890", extract "1234567890".'
  }
}
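Descriptions ask the model to normalize values, but a small post-processing step can act as a safety net. The helpers below are hypothetical and deliberately narrow: digitsOnly strips formatting characters but does not remove a country code, and usDateToIso handles only the MM/DD/YYYY form without checking that the month and day are in range.

```typescript
// Hypothetical post-processing helpers, not part of the SDK.

// Remove every non-digit character from a phone number.
function digitsOnly(raw: string): string {
  return raw.replace(/\D/g, '');
}

// Convert MM/DD/YYYY to YYYY-MM-DD; return null for any other shape.
function usDateToIso(raw: string): string | null {
  const match = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/.exec(raw.trim());
  if (match === null) return null;
  const month = (match[1] ?? '').padStart(2, '0');
  const day = (match[2] ?? '').padStart(2, '0');
  const year = match[3] ?? '';
  return `${year}-${month}-${day}`;
}

console.log(digitsOnly('(123) 456-7890')); // '1234567890'
console.log(usDateToIso('1/15/2024'));     // '2024-01-15'
console.log(usDateToIso('15.01.24'));      // null
```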
Use Enums for Known Value Sets
Constrain values to valid options:
{
  documentType: {
    type: 'string',
    enum: ['invoice', 'receipt', 'credit_note', 'quote'],
    description: 'Type of financial document'
  },
  paymentStatus: {
    type: 'string',
    enum: ['paid', 'pending', 'overdue', 'cancelled'],
    description: 'Current payment status'
  }
}
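When the enum values live in a single as const array, the same list can drive both the schema and a TypeScript union type. This is a sketch of that pattern; the names PAYMENT_STATUSES, PaymentStatus, and isPaymentStatus are made up for illustration, and whether the SDK accepts a readonly array for enum is an assumption.

```typescript
// One source of truth for the allowed values.
const PAYMENT_STATUSES = ['paid', 'pending', 'overdue', 'cancelled'] as const;

// 'paid' | 'pending' | 'overdue' | 'cancelled'
type PaymentStatus = (typeof PAYMENT_STATUSES)[number];

// Schema fragment reusing the same list (assumes readonly arrays are accepted).
const paymentStatusField = {
  type: 'string',
  enum: PAYMENT_STATUSES,
  description: 'Current payment status'
} as const;

// Runtime guard that narrows an unknown value to the union type.
function isPaymentStatus(value: unknown): value is PaymentStatus {
  return typeof value === 'string' &&
    (PAYMENT_STATUSES as readonly string[]).includes(value);
}

console.log(isPaymentStatus('paid'));     // true
console.log(isPaymentStatus('refunded')); // false
console.log(paymentStatusField.enum.length); // 4
```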
Add Validation Constraints
Use validation keywords for data quality:
{
  quantity: {
    type: 'number',
    minimum: 0,
    description: 'Quantity; must be zero or greater'
  },
  email: {
    type: 'string',
    format: 'email',
    description: 'Contact email address'
  },
  postalCode: {
    type: 'string',
    pattern: '^[0-9]{5}(-[0-9]{4})?$',
    description: 'US postal code (ZIP or ZIP+4)'
  }
}
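The constraints above can be checked by hand to confirm they accept and reject what you expect. The sketch below exercises the postal code pattern and the minimum: 0 rule with invented sample values; satisfiesMinimumZero is a hypothetical stand-in for what a validator does with minimum. Note that minimum: 0 allows zero itself.

```typescript
// Same pattern string as the postalCode field above.
const zipPattern = new RegExp('^[0-9]{5}(-[0-9]{4})?$');
const zipChecks = ['94103', '94103-1234', '9410', '94103-12'].map((zip) =>
  zipPattern.test(zip)
);
console.log(zipChecks); // [true, true, false, false]

// `minimum: 0` is inclusive: 0 passes, negative numbers fail.
const satisfiesMinimumZero = (value: number): boolean => value >= 0;
const quantityChecks = [0, 2.5, -1].map(satisfiesMinimumZero);
console.log(quantityChecks); // [true, true, false]
```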
TypeScript Integration
Infer Types from Schemas
Declare the schema with TypeScript's as const so its literal types are preserved, then define a matching type for the extracted output:
const invoiceSchema = {
  type: 'object',
  required: ['invoiceNumber', 'totalAmount'],
  properties: {
    invoiceNumber: { type: 'string' },
    totalAmount: { type: 'number' },
    vendor: {
      type: 'object',
      properties: {
        name: { type: 'string' }
      }
    }
  }
} as const;

// Define a matching type
type Invoice = {
  invoiceNumber: string;
  totalAmount: number;
  vendor?: {
    name?: string;
  };
};

// Use with extract node
const flow = createFlow()
  .step('extract', extract<Invoice>({
    provider: vlmProvider,
    schema: invoiceSchema
  }))
  .build();

const result = await flow.run({ base64: pdf });
// result.output is typed as Invoice
Use Zod Schemas
The SDK automatically converts Zod schemas to JSON Schema:
import { z } from 'zod';

const invoiceSchema = z.object({
  invoiceNumber: z.string().describe('Invoice number or ID'),
  totalAmount: z.number().describe('Total invoice amount'),
  date: z.string().optional().describe('Invoice date in YYYY-MM-DD format'),
  lineItems: z.array(z.object({
    description: z.string(),
    amount: z.number()
  })).optional()
});

type Invoice = z.infer<typeof invoiceSchema>;

const flow = createFlow()
  .step('extract', extract<Invoice>({
    provider: vlmProvider,
    schema: invoiceSchema // Zod schema converted automatically
  }))
  .build();
Provider Considerations
Different providers have varying schema support. The SDK handles translation automatically:
| Provider | Native Format | Nullable Support | Notes |
|---|---|---|---|
| OpenAI | JSON Schema (strict mode) | anyOf with null | SDK auto-adds additionalProperties: false |
| Anthropic | Tool input schema | nullable: true | Direct nullable support |
| Google Gemini | OpenAPI 3.0 subset | nullable: true | SDK adds propertyOrdering |
| OpenRouter | Varies by model | anyOf with null | Uses OpenAI-style for Claude |
The SchemaTranslator class handles these conversions:
import { SchemaTranslator } from '@doclo/providers-llm';
const translator = new SchemaTranslator();
// OpenAI format (converts nullable to anyOf)
const openaiSchema = translator.toOpenAISchema(schema);
// Claude tool format (wraps in tool definition)
const claudeSchema = translator.toClaudeToolSchema(schema);
// Gemini format (adds propertyOrdering)
const geminiSchema = translator.toGeminiSchema(schema);
Validation
The SDK includes a lightweight JSON Schema validator that works in all environments including Edge Runtime:
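import { validateJson } from '@doclo/core';

const data = { invoiceNumber: 'INV-001', totalAmount: 500 };

try {
  const validated = validateJson<Invoice>(data, invoiceSchema);
  // validated is typed as Invoice
} catch (error) {
  // Under strict TypeScript the caught value is `unknown`, so narrow it before reading .message
  const message = error instanceof Error ? error.message : String(error);
  console.error('Validation failed:', message);
}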