JSON Schema definitions for structured data extraction
Schemas define the structure of data you want to extract from documents. They use JSON Schema format and tell the AI exactly what fields to look for and how to format them.
The description property matters most for extraction quality. Descriptions tell the AI what to look for and where to find it:
{
  invoiceNumber: {
    type: 'string',
    description: 'Invoice number, invoice ID, or reference number. Usually at the top of the document. Look for labels like "Invoice #", "Inv No", or "Reference".'
  }
}
Good descriptions significantly improve extraction accuracy. Be specific about where to find the data, what labels to look for, and what format to expect.
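To show how field definitions like the one above fit into a whole schema, here is a minimal sketch. It assumes the SDK accepts a standard top-level JSON Schema object (type, properties, required). The FieldSchema and ObjectSchema types, the extra fields, and the fieldsNeedingDescriptions helper are hypothetical and not part of the SDK:

```typescript
// Illustrative types only; the SDK's own schema types may differ.
interface FieldSchema {
  type: string;
  description?: string;
  format?: string;
}

interface ObjectSchema {
  type: 'object';
  properties: Record<string, FieldSchema>;
  required: string[];
}

// A complete object schema in plain JSON Schema form (hypothetical fields).
const exampleInvoiceSchema: ObjectSchema = {
  type: 'object',
  properties: {
    invoiceNumber: {
      type: 'string',
      description: 'Invoice number, invoice ID, or reference number. Usually at the top of the document. Look for labels like "Invoice #", "Inv No", or "Reference".'
    },
    invoiceDate: {
      type: 'string',
      format: 'date',
      description: 'Date the invoice was issued, in YYYY-MM-DD format. Usually near the invoice number.'
    },
    currency: { type: 'string' } // No description: flagged by the check below
  },
  required: ['invoiceNumber']
};

// Hypothetical helper: list fields whose description is missing or too short
// to give the model useful guidance.
function fieldsNeedingDescriptions(schema: ObjectSchema, minLength = 20): string[] {
  return Object.entries(schema.properties)
    .filter(([, field]) => (field.description ?? '').trim().length < minLength)
    .map(([name]) => name);
}

console.log(fieldsNeedingDescriptions(exampleInvoiceSchema)); // lists 'currency'
```

A check like this can run in a unit test so that new fields do not ship without a description. The 20-character threshold is an arbitrary choice for illustration.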
Allow a field to be null when data is not present:
{
  taxAmount: {
    type: 'number',
    nullable: true,
    description: 'Tax amount if shown, null if not applicable'
  }
}
Different providers handle nullable differently. The SDK automatically translates nullable: true to the appropriate format for each provider (e.g., anyOf with null type for OpenAI).
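To make the nullable translation concrete, the following is a rough sketch of how nullable: true could be rewritten into an anyOf with a null type. It is not the SDK's implementation. The nullableToAnyOf function, the JsonSchema type, and the choice to keep the description on the outer wrapper are all assumptions made for illustration:

```typescript
// Loose JSON Schema type for illustration; the SDK's types may differ.
type JsonSchema = { [key: string]: unknown };

function isRecord(value: unknown): value is JsonSchema {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Hypothetical translation: drop `nullable` and, where it was true,
// wrap the schema in anyOf: [<schema>, { type: 'null' }].
function nullableToAnyOf(schema: JsonSchema): JsonSchema {
  const out: JsonSchema = {};
  for (const [key, value] of Object.entries(schema)) {
    if (key === 'nullable') continue;
    if (key === 'properties' && isRecord(value)) {
      // Recurse into each property of an object schema.
      const props: JsonSchema = {};
      for (const [name, child] of Object.entries(value)) {
        props[name] = isRecord(child) ? nullableToAnyOf(child) : child;
      }
      out[key] = props;
    } else if (key === 'items' && isRecord(value)) {
      // Recurse into array item schemas.
      out[key] = nullableToAnyOf(value);
    } else {
      out[key] = value;
    }
  }
  if (schema['nullable'] !== true) return out;

  // Assumption: keep the description on the wrapper so it still applies
  // to the field as a whole.
  const description = out['description'];
  delete out['description'];
  const wrapped: JsonSchema = { anyOf: [out, { type: 'null' }] };
  if (description !== undefined) wrapped['description'] = description;
  return wrapped;
}

const translatedExample = nullableToAnyOf({
  type: 'object',
  properties: {
    taxAmount: { type: 'number', nullable: true, description: 'Tax amount if shown, null if not applicable' }
  }
});
console.log(JSON.stringify(translatedExample, null, 2));
```

In practice you should not need to write this yourself, since the SDK performs the translation. The sketch only shows the shape of the output to expect when inspecting provider requests.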
import {
  SCHEMA_REGISTRY,
  registerSchema,
  getSchema,
  getLatestSchema
} from '@doclo/schemas';

// Register a schema
registerSchema({
  id: 'invoice',
  version: '2.1.0',
  schema: invoiceSchema,
  description: 'Invoice extraction schema',
  tags: ['finance', 'accounting'],
  createdAt: new Date().toISOString(),
  updatedAt: new Date().toISOString()
});

// Get a specific version
const schema = getSchema('invoice', '2.1.0');

// Get the latest version
const latest = getLatestSchema('invoice');

// List all versions
const versions = SCHEMA_REGISTRY.listVersions('invoice');
// ['2.1.0', '2.0.0', '1.0.0'] (sorted descending)

// Check if a schema exists
const exists = SCHEMA_REGISTRY.has('invoice', '2.1.0');

// Get by reference string
const schemaByRef = SCHEMA_REGISTRY.getByRef('invoice@2.1.0');
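The registry calls above rely on two conventions: reference strings of the form id@version, and version lists sorted in descending order. The sketch below shows one way these could work, assuming plain MAJOR.MINOR.PATCH versions without pre-release tags. parseSchemaRef and compareVersionsDesc are hypothetical helpers, not SDK exports, and the SDK's own parsing and ordering rules may differ:

```typescript
interface SchemaRef {
  id: string;
  version: string;
}

// Hypothetical: split 'invoice@2.1.0' into its id and version parts.
function parseSchemaRef(ref: string): SchemaRef {
  const at = ref.lastIndexOf('@');
  if (at <= 0 || at === ref.length - 1) {
    throw new Error(`Invalid schema reference: ${ref}`);
  }
  return { id: ref.slice(0, at), version: ref.slice(at + 1) };
}

// Hypothetical: compare MAJOR.MINOR.PATCH strings numerically, newest first.
// A plain string sort would wrongly place '2.10.0' before '2.9.0' in ascending order.
function compareVersionsDesc(a: string, b: string): number {
  const partsA = a.split('.').map(Number);
  const partsB = b.split('.').map(Number);
  for (let i = 0; i < 3; i++) {
    const diff = (partsB[i] ?? 0) - (partsA[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

const sortedVersions = ['1.0.0', '2.1.0', '2.0.0', '2.10.0'].sort(compareVersionsDesc);
console.log(sortedVersions); // ['2.10.0', '2.1.0', '2.0.0', '1.0.0']
console.log(parseSchemaRef('invoice@2.1.0')); // { id: 'invoice', version: '2.1.0' }
```

Pinning a full reference such as invoice@2.1.0 in production code, rather than always asking for the latest version, keeps extraction output stable when a schema is updated.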
Fetch schemas from Doclo Cloud with automatic caching:
import { DocloClient, RemoteSchemaRegistry } from '@doclo/client';

const client = new DocloClient({ apiKey: process.env.DOCLO_API_KEY });

const schemas = new RemoteSchemaRegistry(client, {
  ttlMs: 600000,           // Cache for 10 minutes
  autoRegisterLocal: true  // Also register in local SCHEMA_REGISTRY
});

// Fetch a specific version
const schema = await schemas.get('invoice', '2.1.0');

// Fetch the latest version
const latest = await schemas.getLatest('invoice');

// Preload multiple schemas
await schemas.preload([
  'invoice@2.1.0',
  'receipt@1.0.0',
  'bdn@1.0.0'
]);
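To illustrate what the ttlMs option implies, here is a minimal sketch of a time-to-live cache. It is not how RemoteSchemaRegistry is implemented. The TtlCache class and its injectable clock are assumptions made so the behavior is easy to follow and test:

```typescript
// Hypothetical TTL cache: an entry is reused until ttlMs has elapsed,
// after which the loader runs again.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  getOrLoad(key: string, load: () => V): V {
    const hit = this.entries.get(key);
    if (hit !== undefined && hit.expiresAt > this.now()) {
      return hit.value; // Still fresh: skip the loader
    }
    const value = load();
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    return value;
  }
}

// Usage with a fake clock so expiry can be shown without waiting.
let fakeTime = 0;
let fetchCount = 0;
const demoCache = new TtlCache<string>(600000, () => fakeTime);
const fetchSchema = () => `schema fetched ${++fetchCount} time(s)`;

demoCache.getOrLoad('invoice@2.1.0', fetchSchema); // Loads
demoCache.getOrLoad('invoice@2.1.0', fetchSchema); // Served from cache
fakeTime = 600000;                                 // 10 minutes later
demoCache.getOrLoad('invoice@2.1.0', fetchSchema); // Expired: loads again
console.log(fetchCount); // 2
```

If V is a Promise, the same structure also shares one in-flight request between concurrent callers. A longer ttlMs means fewer network calls but a longer delay before newly published schema versions are picked up.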
You can also use the client directly:
// Get a specific version
const schemaAsset = await client.schemas.get('invoice', '2.1.0');

// Get the latest version
const latest = await client.schemas.getLatest('invoice');

// List all versions
const { versions } = await client.schemas.listVersions('invoice');
The SDK includes pre-built schemas for common document types:
| Schema | Description |
| --- | --- |
| bdn@1.0.0 | Bunker Delivery Note (maritime fuel delivery) |
Access via the schema registry:
import { SCHEMA_REGISTRY, bdnSchema } from '@doclo/schemas';

// Get the schema asset (with metadata)
const bdnAsset = SCHEMA_REGISTRY.get('bdn', '1.0.0');

// Or use the raw schema directly
console.log(bdnSchema); // The JSON Schema object
The description is the most important property for extraction accuracy:
// Less effective
{ description: 'The total' }

// More effective
{
  description: 'Total amount due including tax. Look for labels like "Total", "Amount Due", "Grand Total", or "Balance Due". Usually appears at the bottom of the document near the subtotal. May include currency symbols.'
}
{
  refNumber: {
    type: 'string',
    description: 'The primary unique identifier for the document. Look for labels such as "Reference", "Ref No", "Document No.", or "ID". This is often located in the header or top section. Examples: "REF-2023-10543", "DOC/23/554".'
  }
}
{
  date: {
    type: 'string',
    format: 'date',
    description: 'Document date. Convert from formats like DD.MM.YY, MM/DD/YYYY, or "January 15, 2024" to the standard YYYY-MM-DD format.'
  },
  phone: {
    type: 'string',
    description: 'Phone number. Extract only numeric digits, excluding country code and formatting characters like dashes or parentheses. For "(123) 456-7890", extract "1234567890".'
  }
}
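Format instructions in a description guide the model but do not guarantee the output, so it can help to check extracted values afterwards. The following sketch assumes you validate results in your own code. isIsoDate and digitsOnly are hypothetical helpers, not part of the SDK:

```typescript
// Hypothetical check: true only for a real calendar date in YYYY-MM-DD form.
function isIsoDate(value: string): boolean {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(value);
  if (match === null) return false;
  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);
  // Build the date in UTC and confirm it did not roll over (e.g. Feb 30 -> Mar 1).
  const date = new Date(Date.UTC(year, month - 1, day));
  return (
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day
  );
}

// Hypothetical cleanup: strip every non-digit character.
// Note: this removes formatting only; it does not detect or drop a country code.
function digitsOnly(value: string): string {
  return value.replace(/\D/g, '');
}

console.log(isIsoDate('2024-01-15'));      // true
console.log(isIsoDate('15.01.24'));        // false
console.log(isIsoDate('2024-02-30'));      // false
console.log(digitsOnly('(123) 456-7890')); // '1234567890'
```

Values that fail a check like this can be flagged for review or re-extracted, rather than silently passed downstream.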
Different providers have varying schema support. The SDK handles translation automatically:
| Provider | Native Format | Nullable Support | Notes |
| --- | --- | --- | --- |
| OpenAI | JSON Schema (strict mode) | anyOf with null | SDK auto-adds additionalProperties: false |
| Anthropic | Tool input schema | nullable: true | Direct nullable support |
| Google Gemini | OpenAPI 3.0 subset | nullable: true | SDK adds propertyOrdering |
| OpenRouter | Varies by model | anyOf with null | Uses OpenAI-style for Claude |
The SchemaTranslator class handles these conversions:
import { SchemaTranslator } from '@doclo/providers-llm';

const translator = new SchemaTranslator();

// OpenAI format (converts nullable to anyOf)
const openaiSchema = translator.toOpenAISchema(schema);

// Claude tool format (wraps in tool definition)
const claudeSchema = translator.toClaudeToolSchema(schema);

// Gemini format (adds propertyOrdering)
const geminiSchema = translator.toGeminiSchema(schema);
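The provider notes mention two additions the SDK makes: additionalProperties: false for OpenAI and propertyOrdering for Gemini. The sketch below shows roughly what those additions could look like on a nested schema. It is an illustration under assumptions, not the SchemaTranslator source. mapObjectSchemas, addAdditionalPropertiesFalse, and addPropertyOrdering are hypothetical names, and the real translator may apply further provider-specific changes:

```typescript
// Loose JSON Schema type for illustration; the SDK's types may differ.
type JsonSchema = { [key: string]: unknown };

function isRecord(value: unknown): value is JsonSchema {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Hypothetical walker: copy the schema and call `decorate` on every node
// that declares `properties`, recursing through properties and array items.
function mapObjectSchemas(
  schema: JsonSchema,
  decorate: (node: JsonSchema, propertyNames: string[]) => void
): JsonSchema {
  const out: JsonSchema = { ...schema };
  const properties = out['properties'];
  if (isRecord(properties)) {
    const mapped: JsonSchema = {};
    for (const [name, child] of Object.entries(properties)) {
      mapped[name] = isRecord(child) ? mapObjectSchemas(child, decorate) : child;
    }
    out['properties'] = mapped;
    decorate(out, Object.keys(mapped));
  }
  const items = out['items'];
  if (isRecord(items)) {
    out['items'] = mapObjectSchemas(items, decorate);
  }
  return out;
}

// OpenAI-style: forbid keys that the schema does not declare.
function addAdditionalPropertiesFalse(schema: JsonSchema): JsonSchema {
  return mapObjectSchemas(schema, (node) => {
    node['additionalProperties'] = false;
  });
}

// Gemini-style: record the declared property order explicitly.
function addPropertyOrdering(schema: JsonSchema): JsonSchema {
  return mapObjectSchemas(schema, (node, names) => {
    node['propertyOrdering'] = names;
  });
}

const nestedExample: JsonSchema = {
  type: 'object',
  properties: {
    invoiceNumber: { type: 'string' },
    lineItems: {
      type: 'array',
      items: {
        type: 'object',
        properties: { sku: { type: 'string' }, quantity: { type: 'number' } }
      }
    }
  }
};

console.log(JSON.stringify(addAdditionalPropertiesFalse(nestedExample), null, 2));
console.log(JSON.stringify(addPropertyOrdering(nestedExample), null, 2));
```

Both helpers return a modified copy and leave the input untouched, so one source schema can be translated for several providers.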