Schemas define the structure of data you want to extract from documents. They use JSON Schema format and tell the AI exactly what fields to look for and how to format them.

What are Schemas?

Schemas serve multiple purposes:
  • Define structure: Specify fields, types, and nesting
  • Guide extraction: Help the AI understand what to look for
  • Validate output: Ensure extracted data matches expected format
  • Enable type safety: Generate TypeScript types from schemas

Schema Structure

Schemas follow the JSON Schema specification (draft-07):
const invoiceSchema = {
  type: 'object',
  required: ['invoiceNumber', 'totalAmount'],
  properties: {
    invoiceNumber: {
      type: 'string',
      description: 'Invoice number or ID'
    },
    vendor: {
      type: 'object',
      properties: {
        name: { type: 'string', description: 'Company name' },
        address: { type: 'string', description: 'Full address' }
      }
    },
    date: {
      type: 'string',
      format: 'date',
      description: 'Invoice date in ISO format (YYYY-MM-DD)'
    },
    totalAmount: {
      type: 'number',
      description: 'Total invoice amount'
    },
    lineItems: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          description: { type: 'string' },
          quantity: { type: 'number' },
          unitPrice: { type: 'number' },
          amount: { type: 'number' }
        }
      }
    }
  }
};
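To make the shape concrete, here is a hand-written value that conforms to the invoice schema above. The Invoice type and every sample value are illustrative assumptions, not output from a real extraction.

```typescript
// Hand-written type mirroring the invoice schema above (illustrative).
type Invoice = {
  invoiceNumber: string; // listed in `required`
  totalAmount: number;   // listed in `required`
  vendor?: { name?: string; address?: string };
  date?: string;         // YYYY-MM-DD
  lineItems?: Array<{
    description?: string;
    quantity?: number;
    unitPrice?: number;
    amount?: number;
  }>;
};

// Made-up sample data; a real result depends on the document being processed.
const sampleInvoice: Invoice = {
  invoiceNumber: 'INV-001',
  vendor: { name: 'Acme Ltd', address: '1 Main Street, Springfield' },
  date: '2024-01-15',
  totalAmount: 500,
  lineItems: [
    { description: 'Consulting', quantity: 5, unitPrice: 100, amount: 500 }
  ]
};
```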

Key Schema Properties

description

The most important property for extraction quality. Descriptions tell the AI what to look for and where to find it:
{
  invoiceNumber: {
    type: 'string',
    description: 'Invoice number, invoice ID, or reference number. Usually at the top of the document. Look for labels like "Invoice #", "Inv No", or "Reference".'
  }
}
Good descriptions significantly improve extraction accuracy. Be specific about where to find the data, what labels to look for, and what format to expect.

required

Fields that must be extracted. If a required field cannot be found, the extraction may fail or return null:
{
  type: 'object',
  required: ['invoiceNumber', 'totalAmount'],
  properties: { ... }
}
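Because a missing required field may surface as null rather than as an error, it can help to check results explicitly. The sketch below assumes a hypothetical helper named missingRequired; it is not part of the SDK.

```typescript
// Minimal shape of the schema fields this sketch needs.
type ObjectSchema = { required?: string[]; properties: Record<string, unknown> };

// Hypothetical helper: list required fields that are absent or null in the data.
function missingRequired(schema: ObjectSchema, data: Record<string, unknown>): string[] {
  return (schema.required || []).filter(
    (field) => data[field] === undefined || data[field] === null
  );
}

const requiredDemoSchema: ObjectSchema = {
  required: ['invoiceNumber', 'totalAmount'],
  properties: { invoiceNumber: {}, totalAmount: {}, date: {} }
};

// totalAmount came back as null, so it is reported as missing.
const missingFields = missingRequired(requiredDemoSchema, {
  invoiceNumber: 'INV-001',
  totalAmount: null
});
```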

type

Data type for the field:
| Type | Description | Example |
|---|---|---|
| string | Text values | "INV-001" |
| number | Numeric values (integers or decimals) | 1234.56 |
| integer | Whole numbers only | 42 |
| boolean | True/false values | true |
| array | Lists of items | [1, 2, 3] |
| object | Nested structures | { name: "..." } |
| null | Explicit null value | null |

format

String format hints that help the AI understand expected patterns:
{
  date: {
    type: 'string',
    format: 'date',
    description: 'Invoice date'
  },
  email: {
    type: 'string',
    format: 'email',
    description: 'Contact email address'
  }
}
Common formats:
  • date: ISO date (YYYY-MM-DD)
  • time: Time (HH:MM or HH:MM:SS)
  • date-time: ISO 8601 datetime
  • email: Email address
  • uri: URL/URI

enum

Restrict values to a specific set:
{
  status: {
    type: 'string',
    enum: ['paid', 'pending', 'overdue'],
    description: 'Payment status'
  },
  temperatureUnit: {
    type: 'string',
    enum: ['Celsius', 'Fahrenheit'],
    description: 'Unit of temperature measurement'
  }
}
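On the TypeScript side, an enum list can double as a union type. This sketch assumes you keep the allowed values in a shared constant; the names PAYMENT_STATUSES, PaymentStatus, and isPaymentStatus are illustrative.

```typescript
// Single source of truth for the allowed values.
const PAYMENT_STATUSES = ['paid', 'pending', 'overdue'] as const;

// Union type derived from the list: 'paid' | 'pending' | 'overdue'.
type PaymentStatus = (typeof PAYMENT_STATUSES)[number];

// Runtime guard that narrows an unknown value to PaymentStatus.
function isPaymentStatus(value: unknown): value is PaymentStatus {
  return (
    typeof value === 'string' &&
    (PAYMENT_STATUSES as readonly string[]).indexOf(value) !== -1
  );
}

// The same list feeds the schema's enum keyword.
const statusProperty = {
  type: 'string',
  enum: PAYMENT_STATUSES,
  description: 'Payment status'
};
```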

nullable

Allow a field to be null when data is not present:
{
  taxAmount: {
    type: 'number',
    nullable: true,
    description: 'Tax amount if shown, null if not applicable'
  }
}
Providers handle nullable differently. nullable is an OpenAPI 3.0 keyword rather than part of JSON Schema draft-07, so the SDK translates nullable: true to the appropriate form for each provider (e.g., anyOf with a null type for OpenAI).
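The nullable-to-anyOf translation mentioned above can be pictured roughly as follows. This is a sketch under the assumption that only properties and items need to be traversed; it is not the SDK's actual implementation, and nullableToAnyOf is a made-up name.

```typescript
type JsonSchema = { [key: string]: unknown };

function isSchemaObject(value: unknown): value is JsonSchema {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Sketch only: rewrite `nullable: true` as `anyOf: [<schema>, { type: 'null' }]`,
// recursing into `properties` and `items`.
function nullableToAnyOf(schema: JsonSchema): JsonSchema {
  const { nullable, ...rest } = schema;
  const out: JsonSchema = { ...rest };

  const properties = rest.properties;
  if (isSchemaObject(properties)) {
    const converted: JsonSchema = {};
    for (const key of Object.keys(properties)) {
      const child = properties[key];
      converted[key] = isSchemaObject(child) ? nullableToAnyOf(child) : child;
    }
    out.properties = converted;
  }

  const items = rest.items;
  if (isSchemaObject(items)) out.items = nullableToAnyOf(items);

  return nullable === true ? { anyOf: [out, { type: 'null' }] } : out;
}

const translated = nullableToAnyOf({
  type: 'number',
  nullable: true,
  description: 'Tax amount if shown, null if not applicable'
});
// translated: { anyOf: [{ type: 'number', description: '...' }, { type: 'null' }] }
```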

pattern

Regular expression pattern for string validation:
{
  bunkeringTime: {
    type: 'string',
    pattern: '^([0-1]?[0-9]|2[0-3]):[0-5][0-9]$',
    description: 'Time in 24-hour HH:MM format; the leading zero on the hour is optional. Hours 00-23, minutes 00-59.'
  }
}
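It is worth trying a pattern against a few sample strings before relying on it. The snippet below exercises the exact pattern from the schema above using only the built-in RegExp.

```typescript
// Same pattern string as in the bunkeringTime schema.
const timePattern = new RegExp('^([0-1]?[0-9]|2[0-3]):[0-5][0-9]$');

console.log(timePattern.test('09:30')); // true
console.log(timePattern.test('23:59')); // true
console.log(timePattern.test('9:30'));  // true (a single-digit hour also matches)
console.log(timePattern.test('24:00')); // false
console.log(timePattern.test('12:60')); // false
```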

Using Schemas

Inline Schema

Pass the schema object directly to the extract node:
import { createFlow, extract } from '@doclo/flows';

const flow = createFlow()
  .step('extract', extract({
    provider: vlmProvider,
    schema: invoiceSchema
  }))
  .build();

Schema Reference

Reference schemas stored in the registry by ID and version:
const flow = createFlow()
  .step('extract', extract({
    provider: vlmProvider,
    schema: { ref: 'invoice@2.1.0' }
  }))
  .build();
The reference format is id@version (e.g., bdn@1.0.0, invoice@2.1.0).
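If you need to split a reference string yourself, a small parser is enough. parseSchemaRef below is a hypothetical helper written for illustration; the SDK's own parsing may differ.

```typescript
// Hypothetical helper: split "id@version" into its two parts.
function parseSchemaRef(ref: string): { id: string; version: string } {
  const at = ref.lastIndexOf('@');
  // Reject a missing "@", an empty id, or an empty version.
  if (at <= 0 || at === ref.length - 1) {
    throw new Error(`Invalid schema reference: "${ref}" (expected id@version)`);
  }
  return { id: ref.slice(0, at), version: ref.slice(at + 1) };
}

const parsedRef = parseSchemaRef('invoice@2.1.0');
// parsedRef: { id: 'invoice', version: '2.1.0' }
```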

Enhanced Schema

Add examples and extraction guidance alongside the schema:
const flow = createFlow()
  .step('extract', extract({
    provider: vlmProvider,
    schema: {
      schema: invoiceSchema,
      examples: [
        {
          description: 'Standard invoice',
          input: 'Invoice #: INV-001\nTotal: $500.00',
          output: { invoiceNumber: 'INV-001', totalAmount: 500.00 }
        }
      ],
      extractionRules: 'Focus on the header section for invoice metadata',
      contextPrompt: 'This is a commercial invoice document',
      hints: ['Amounts may include currency symbols', 'Dates vary in format']
    }
  }))
  .build();

Schema Registry

The SDK includes a schema registry for storing and retrieving versioned schemas.

Local Registry

Register and retrieve schemas in memory:
import {
  SCHEMA_REGISTRY,
  registerSchema,
  getSchema,
  getLatestSchema
} from '@doclo/schemas';

// Register a schema
registerSchema({
  id: 'invoice',
  version: '2.1.0',
  schema: invoiceSchema,
  description: 'Invoice extraction schema',
  tags: ['finance', 'accounting'],
  createdAt: new Date().toISOString(),
  updatedAt: new Date().toISOString()
});

// Get a specific version
const schema = getSchema('invoice', '2.1.0');

// Get the latest version
const latest = getLatestSchema('invoice');

// List all versions
const versions = SCHEMA_REGISTRY.listVersions('invoice');
// ['2.1.0', '2.0.0', '1.0.0'] (sorted descending)

// Check if a schema exists
const exists = SCHEMA_REGISTRY.has('invoice', '2.1.0');

// Get by reference string
const schemaByRef = SCHEMA_REGISTRY.getByRef('invoice@2.1.0');
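
Sorting versions in descending order requires a numeric comparison, because a plain string sort would place 2.9.0 above 2.10.0. The comparator below is a sketch of that idea, assuming plain major.minor.patch versions; it is not the registry's code.

```typescript
// Compare two "major.minor.patch" strings so that higher versions sort first.
// Pre-release and build suffixes are not handled in this sketch.
function compareVersionsDesc(a: string, b: string): number {
  const partsA = a.split('.').map(Number);
  const partsB = b.split('.').map(Number);
  for (let i = 0; i < 3; i++) {
    const diff = (partsB[i] || 0) - (partsA[i] || 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

const sortedVersions = ['1.0.0', '2.1.0', '2.0.0', '2.10.0'].sort(compareVersionsDesc);
console.log(sortedVersions); // ['2.10.0', '2.1.0', '2.0.0', '1.0.0']
```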

Remote Registry (Cloud)

Fetch schemas from Doclo Cloud with automatic caching:
import { DocloClient, RemoteSchemaRegistry } from '@doclo/client';

const client = new DocloClient({ apiKey: process.env.DOCLO_API_KEY });
const schemas = new RemoteSchemaRegistry(client, {
  ttlMs: 600000,        // Cache for 10 minutes
  autoRegisterLocal: true  // Also register in local SCHEMA_REGISTRY
});

// Fetch a specific version
const schema = await schemas.get('invoice', '2.1.0');

// Fetch the latest version
const latest = await schemas.getLatest('invoice');

// Preload multiple schemas
await schemas.preload([
  'invoice@2.1.0',
  'receipt@1.0.0',
  'bdn@1.0.0'
]);
You can also use the client directly:
// Get a specific version
const schemaAsset = await client.schemas.get('invoice', '2.1.0');

// Get the latest version
const latest = await client.schemas.getLatest('invoice');

// List all versions
const { versions } = await client.schemas.listVersions('invoice');
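
To illustrate what the ttlMs option implies, here is a minimal time-to-live cache. It is a sketch, not the RemoteSchemaRegistry implementation: TtlCache and getOrLoad are invented names, and the clock is injectable so the behaviour can be tested without waiting.

```typescript
// Sketch of a TTL cache keyed by a string such as "invoice@2.1.0".
class TtlCache<T> {
  private entries = new Map<string, { value: T; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Return the cached value while it is fresh; otherwise call `load` and store the result.
  // T may itself be a Promise if the loader is asynchronous.
  getOrLoad(key: string, load: () => T): T {
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > this.now()) return hit.value;
    const value = load();
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    return value;
  }
}

// Usage with a fake clock: the loader runs again only after the TTL has passed.
let fakeNow = 0;
let loadCount = 0;
const schemaCache = new TtlCache<string>(600000, () => fakeNow);
const loadSchema = () => { loadCount++; return 'schema-body'; };

schemaCache.getOrLoad('invoice@2.1.0', loadSchema); // miss, loads
schemaCache.getOrLoad('invoice@2.1.0', loadSchema); // hit, served from cache
fakeNow = 600001;
schemaCache.getOrLoad('invoice@2.1.0', loadSchema); // expired, loads again
```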

Schema Asset Structure

A schema asset includes metadata alongside the schema definition:
type SchemaAsset = {
  // Identity
  id: string;              // "invoice"
  version: string;         // "2.1.0" (semver)

  // Content
  schema: JSONSchemaObject;  // The JSON Schema

  // Metadata
  description?: string;    // Human-readable description
  tags?: string[];         // Categorization tags
  changelog?: string;      // Version change notes

  // Timestamps
  createdAt: string;       // ISO date
  updatedAt: string;       // ISO date
};

Built-in Schemas

The SDK includes pre-built schemas for common document types:
| Schema | Description |
|---|---|
| bdn@1.0.0 | Bunker Delivery Note (maritime fuel delivery) |
Access via the schema registry:
import { SCHEMA_REGISTRY, bdnSchema } from '@doclo/schemas';

// Get the schema asset (with metadata)
const bdnAsset = SCHEMA_REGISTRY.get('bdn', '1.0.0');

// Or use the raw schema directly
console.log(bdnSchema); // The JSON Schema object

Schema Best Practices

Write Detailed Descriptions

The description is the most important property for extraction accuracy:
// Less effective
{ description: 'The total' }

// More effective
{
  description: 'Total amount due including tax. Look for labels like "Total", "Amount Due", "Grand Total", or "Balance Due". Usually appears at the bottom of the document near the subtotal. May include currency symbols.'
}

Specify Where to Find Data

Help the AI locate fields in the document:
{
  refNumber: {
    type: 'string',
    description: 'The primary unique identifier for the document. Look for labels such as "Reference", "Ref No", "Document No.", or "ID". This is often located in the header or top section. Examples: "REF-2023-10543", "DOC/23/554".'
  }
}

Handle Optional and Missing Data

Use nullable types for fields that may not exist:
{
  taxAmount: {
    type: 'number',
    nullable: true,
    description: 'Tax amount if shown. Set to null if not applicable or not present on the document.'
  }
}
Or use a type union, which is the standard JSON Schema way to allow null:
{
  taxAmount: {
    type: ['number', 'null'],
    description: 'Tax amount if shown, null if not applicable'
  }
}

Structure Nested Data Logically

Group related fields into objects:
{
  vendor: {
    type: 'object',
    description: 'Details of the supplier or vendor',
    properties: {
      name: {
        type: 'string',
        description: 'Company or trading name'
      },
      legalName: {
        type: 'string',
        nullable: true,
        description: 'Full registered legal name with corporate designator (Ltd, Inc, etc.)'
      },
      address: {
        type: 'string',
        description: 'Street address, city, and postal code'
      },
      phone: {
        type: 'string',
        nullable: true,
        description: 'Primary phone number, digits only without country code'
      }
    },
    required: ['name']
  }
}

Define Array Item Schemas

Always specify the structure of array items:
{
  lineItems: {
    type: 'array',
    description: 'Individual line items or products on the invoice',
    items: {
      type: 'object',
      properties: {
        description: {
          type: 'string',
          description: 'Product or service description'
        },
        quantity: {
          type: 'number',
          description: 'Quantity ordered or delivered'
        },
        unitPrice: {
          type: 'number',
          description: 'Price per unit'
        },
        amount: {
          type: 'number',
          description: 'Line total (quantity x unitPrice)'
        }
      },
      required: ['description', 'amount']
    }
  }
}

Handle Data Format Variations

Provide guidance for common format variations:
{
  date: {
    type: 'string',
    format: 'date',
    description: 'Document date. Convert from formats like DD.MM.YY, MM/DD/YYYY, or "January 15, 2024" to the standard YYYY-MM-DD format.'
  },
  phone: {
    type: 'string',
    description: 'Phone number. Extract only numeric digits, excluding country code and formatting characters like dashes or parentheses. For "(123) 456-7890", extract "1234567890".'
  }
}
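Descriptions steer the model toward a target format, and a deterministic post-processing step can back them up. The helpers below are hypothetical and deliberately narrow: normalizePhone does not strip country codes, and usDateToIso handles only MM/DD/YYYY without checking that the month and day are in range.

```typescript
// Keep digits only: "(123) 456-7890" becomes "1234567890".
function normalizePhone(raw: string): string {
  return raw.replace(/\D/g, '');
}

// Convert MM/DD/YYYY to YYYY-MM-DD; return null for any other layout.
function usDateToIso(raw: string): string | null {
  const match = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/.exec(raw.trim());
  if (!match) return null;
  const [, month, day, year] = match;
  // Left-pad single-digit month and day with a zero.
  return `${year}-${('0' + month).slice(-2)}-${('0' + day).slice(-2)}`;
}

console.log(normalizePhone('(123) 456-7890')); // "1234567890"
console.log(usDateToIso('1/5/2024'));          // "2024-01-05"
```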

Use Enums for Known Value Sets

Constrain values to valid options:
{
  documentType: {
    type: 'string',
    enum: ['invoice', 'receipt', 'credit_note', 'quote'],
    description: 'Type of financial document'
  },
  paymentStatus: {
    type: 'string',
    enum: ['paid', 'pending', 'overdue', 'cancelled'],
    description: 'Current payment status'
  }
}

Add Validation Constraints

Use validation keywords for data quality:
{
  quantity: {
    type: 'number',
    minimum: 0,
    description: 'Quantity; must be zero or greater'
  },
  email: {
    type: 'string',
    format: 'email',
    description: 'Contact email address'
  },
  postalCode: {
    type: 'string',
    pattern: '^[0-9]{5}(-[0-9]{4})?$',
    description: 'US postal code (ZIP or ZIP+4)'
  }
}

TypeScript Integration

Infer Types from Schemas

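Declare the schema with TypeScript’s as const so its literal types are preserved, then define a matching type by hand:
import { createFlow, extract } from '@doclo/flows';

const invoiceSchema = {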
  type: 'object',
  required: ['invoiceNumber', 'totalAmount'],
  properties: {
    invoiceNumber: { type: 'string' },
    totalAmount: { type: 'number' },
    vendor: {
      type: 'object',
      properties: {
        name: { type: 'string' }
      }
    }
  }
} as const;

// Define a matching type
type Invoice = {
  invoiceNumber: string;
  totalAmount: number;
  vendor?: {
    name?: string;
  };
};

// Use with extract node
const flow = createFlow()
  .step('extract', extract<Invoice>({
    provider: vlmProvider,
    schema: invoiceSchema
  }))
  .build();

const result = await flow.run({ base64: pdf });
// result.output is typed as Invoice
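
The hand-written type can in principle be derived from an as const schema with a conditional type. The following is a simplified sketch, not something the SDK is known to provide: InferSchema is an invented name, every property is treated as optional, and keywords such as required, enum, and nullable are ignored.

```typescript
// Simplified schema-to-type mapping for a handful of JSON Schema types.
type InferSchema<S> =
  S extends { type: 'string' } ? string :
  S extends { type: 'number' | 'integer' } ? number :
  S extends { type: 'boolean' } ? boolean :
  S extends { type: 'array'; items: infer I } ? InferSchema<I>[] :
  S extends { type: 'object'; properties: infer P }
    ? { -readonly [K in keyof P]?: InferSchema<P[K]> }
    : unknown;

const miniInvoiceSchema = {
  type: 'object',
  properties: {
    invoiceNumber: { type: 'string' },
    totalAmount: { type: 'number' },
    vendor: { type: 'object', properties: { name: { type: 'string' } } }
  }
} as const;

// Resolves to { invoiceNumber?: string; totalAmount?: number; vendor?: { name?: string } }.
type MiniInvoice = InferSchema<typeof miniInvoiceSchema>;

const draftInvoice: MiniInvoice = {
  invoiceNumber: 'INV-001',
  vendor: { name: 'Acme Ltd' }
};
```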

Use Zod Schemas

The SDK automatically converts Zod schemas to JSON Schema:
import { z } from 'zod';
import { createFlow, extract } from '@doclo/flows';

const invoiceSchema = z.object({
  invoiceNumber: z.string().describe('Invoice number or ID'),
  totalAmount: z.number().describe('Total invoice amount'),
  date: z.string().optional().describe('Invoice date in YYYY-MM-DD format'),
  lineItems: z.array(z.object({
    description: z.string(),
    amount: z.number()
  })).optional()
});

type Invoice = z.infer<typeof invoiceSchema>;

const flow = createFlow()
  .step('extract', extract<Invoice>({
    provider: vlmProvider,
    schema: invoiceSchema  // Zod schema converted automatically
  }))
  .build();

Provider Considerations

Different providers have varying schema support. The SDK handles translation automatically:
| Provider | Native Format | Nullable Support | Notes |
|---|---|---|---|
| OpenAI | JSON Schema (strict mode) | anyOf with null | SDK auto-adds additionalProperties: false |
| Anthropic | Tool input schema | nullable: true | Direct nullable support |
| Google Gemini | OpenAPI 3.0 subset | nullable: true | SDK adds propertyOrdering |
| OpenRouter | Varies by model | anyOf with null | Uses OpenAI-style for Claude |
The SchemaTranslator class handles these conversions:
import { SchemaTranslator } from '@doclo/providers-llm';

const translator = new SchemaTranslator();

// OpenAI format (converts nullable to anyOf)
const openaiSchema = translator.toOpenAISchema(schema);

// Claude tool format (wraps in tool definition)
const claudeSchema = translator.toClaudeToolSchema(schema);

// Gemini format (adds propertyOrdering)
const geminiSchema = translator.toGeminiSchema(schema);

Validation

The SDK includes a lightweight JSON Schema validator that works in all environments including Edge Runtime:
import { validateJson } from '@doclo/core';

const data = { invoiceNumber: 'INV-001', totalAmount: 500 };

try {
  const validated = validateJson<Invoice>(data, invoiceSchema);
  // validated is typed as Invoice
} catch (error) {
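  // `error` is `unknown` under strict TypeScript settings, so narrow it before reading .message
  console.error('Validation failed:', error instanceof Error ? error.message : error);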
}

Next Steps