Schemas - Doclo

Schemas define the structure of data you want to extract from documents. They use JSON Schema format and tell the AI exactly what fields to look for and how to format them.

What are Schemas?

Schemas serve multiple purposes:

Define structure: Specify fields, types, and nesting
Guide extraction: Help the AI understand what to look for
Validate output: Ensure extracted data matches expected format
Enable type safety: Generate TypeScript types from schemas

Schema Structure

Schemas follow the JSON Schema specification (draft-07):

const invoiceSchema = {
  type: 'object',
  required: ['invoiceNumber', 'totalAmount'],
  properties: {
    invoiceNumber: {
      type: 'string',
      description: 'Invoice number or ID'
    },
    vendor: {
      type: 'object',
      properties: {
        name: { type: 'string', description: 'Company name' },
        address: { type: 'string', description: 'Full address' }
      }
    },
    date: {
      type: 'string',
      format: 'date',
      description: 'Invoice date in ISO format (YYYY-MM-DD)'
    },
    totalAmount: {
      type: 'number',
      description: 'Total invoice amount'
    },
    lineItems: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          description: { type: 'string' },
          quantity: { type: 'number' },
          unitPrice: { type: 'number' },
          amount: { type: 'number' }
        }
      }
    }
  }
};

Key Schema Properties

`description`

The most important property for extraction quality. Descriptions tell the AI what to look for and where to find it:

{
  invoiceNumber: {
    type: 'string',
    description: 'Invoice number, invoice ID, or reference number. Usually at the top of the document. Look for labels like "Invoice #", "Inv No", or "Reference".'
  }
}

Good descriptions significantly improve extraction accuracy. Be specific about where to find the data, what labels to look for, and what format to expect.

`required`

Fields that must be extracted. If a required field cannot be found, the extraction may fail or return null:

{
  type: 'object',
  required: ['invoiceNumber', 'totalAmount'],
  properties: { ... }
}
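To make the effect of `required` concrete, the sketch below checks an extracted object for required keys that are absent. `findMissingRequired` and `ObjectSchema` are hypothetical names used only for illustration, not part of the SDK, and treating `null` as missing is an assumption of this example.

```typescript
// Hypothetical helper (not part of the Doclo SDK): list required fields
// that are absent or null in an extracted result.
type ObjectSchema = {
  type: 'object';
  required?: string[];
  properties: Record<string, unknown>;
};

function findMissingRequired(
  schema: ObjectSchema,
  data: Record<string, unknown>
): string[] {
  return (schema.required ?? []).filter(
    (key) => data[key] === undefined || data[key] === null
  );
}

const exampleSchema: ObjectSchema = {
  type: 'object',
  required: ['invoiceNumber', 'totalAmount'],
  properties: {
    invoiceNumber: { type: 'string' },
    totalAmount: { type: 'number' }
  }
};

// totalAmount was not found in the document
const missingFields = findMissingRequired(exampleSchema, { invoiceNumber: 'INV-001' });
console.log(missingFields); // ['totalAmount']
```

A check like this can run after extraction to decide whether to retry or flag the document for review.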

`type`

Data type for the field:

| Type | Description | Example |
| --- | --- | --- |
| `string` | Text values | `"INV-001"` |
| `number` | Numeric values (integers or decimals) | `1234.56` |
| `integer` | Whole numbers only | `42` |
| `boolean` | True/false values | `true` |
| `array` | Lists of items | `[1, 2, 3]` |
| `object` | Nested structures | `{ name: "..." }` |
| `null` | Explicit null value | `null` |

`format`

String format hints that help the AI understand expected patterns:

{
  date: {
    type: 'string',
    format: 'date',
    description: 'Invoice date'
  },
  email: {
    type: 'string',
    format: 'email',
    description: 'Contact email address'
  }
}

Common formats:

date: ISO date (YYYY-MM-DD)
time: Time (HH:MM or HH:MM:SS)
date-time: ISO 8601 datetime
email: Email address
uri: URL/URI

`enum`

Restrict values to a specific set:

{
  status: {
    type: 'string',
    enum: ['paid', 'pending', 'overdue'],
    description: 'Payment status'
  },
  temperatureUnit: {
    type: 'string',
    enum: ['Celsius', 'Fahrenheit'],
    description: 'Unit of temperature measurement'
  }
}
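When the allowed values are known up front, the same list can drive both the schema and a TypeScript union type. This is a minimal sketch in plain TypeScript; `PAYMENT_STATUSES`, `PaymentStatus`, and `isPaymentStatus` are illustrative names, not SDK exports.

```typescript
// Declare the allowed values once, as a readonly tuple.
const PAYMENT_STATUSES = ['paid', 'pending', 'overdue'] as const;

// Union type derived from the tuple: 'paid' | 'pending' | 'overdue'
type PaymentStatus = (typeof PAYMENT_STATUSES)[number];

// The schema fragment reuses the same values.
const statusProperty = {
  type: 'string',
  enum: PAYMENT_STATUSES,
  description: 'Payment status'
} as const;

// Runtime guard for values coming back from an extraction.
function isPaymentStatus(value: unknown): value is PaymentStatus {
  return typeof value === 'string' &&
    PAYMENT_STATUSES.some((allowed) => allowed === value);
}

console.log(isPaymentStatus('paid'));     // true
console.log(isPaymentStatus('refunded')); // false
```

Keeping the values in one place means the schema and the type cannot drift apart.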

`nullable`

Allow a field to be null when data is not present:

{
  taxAmount: {
    type: 'number',
    nullable: true,
    description: 'Tax amount if shown, null if not applicable'
  }
}

`nullable` is an OpenAPI 3.0 keyword rather than part of JSON Schema draft-07, and providers handle it differently. The SDK automatically translates `nullable: true` to the appropriate format for each provider (e.g., `anyOf` with a `null` type for OpenAI).
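To illustrate what such a translation might look like, here is a minimal sketch that rewrites `nullable: true` into an `anyOf` with a `null` branch. `expandNullable` and `SchemaNode` are hypothetical names; the SDK's real translation logic is not shown in this page and may differ.

```typescript
// Hypothetical sketch of a nullable-to-anyOf rewrite; the SDK's actual
// translation may differ.
type SchemaNode = { [key: string]: unknown };

function expandNullable(node: SchemaNode): SchemaNode {
  const { nullable, ...rest } = node;
  const result: SchemaNode = { ...rest };

  // Recurse into object properties.
  if (rest.properties && typeof rest.properties === 'object') {
    const props = rest.properties as Record<string, SchemaNode>;
    const translated: Record<string, SchemaNode> = {};
    for (const key of Object.keys(props)) {
      translated[key] = expandNullable(props[key]);
    }
    result.properties = translated;
  }

  // Recurse into array items.
  if (rest.items && typeof rest.items === 'object') {
    result.items = expandNullable(rest.items as SchemaNode);
  }

  // Wrap the node when it was marked nullable.
  return nullable === true
    ? { anyOf: [result, { type: 'null' }] }
    : result;
}

const translatedTax = expandNullable({
  type: 'number',
  nullable: true,
  description: 'Tax amount if shown, null if not applicable'
});
console.log(JSON.stringify(translatedTax, null, 2));
```

Nodes without `nullable` pass through unchanged, so the rewrite is safe to apply to a whole schema.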

`pattern`

Regular expression pattern for string validation:

{
  bunkeringTime: {
    type: 'string',
    pattern: '^([0-1]?[0-9]|2[0-3]):[0-5][0-9]$',
    description: 'Time in HH:MM format. Hours 00-23 (a single-digit hour such as 9:30 is also accepted), minutes 00-59.'
  }
}
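It can help to test a pattern against sample values before relying on it. The snippet below runs the `bunkeringTime` pattern with the standard `RegExp` API; the variable name `timePattern` is only illustrative.

```typescript
// Same pattern string as in the schema above.
const timePattern = new RegExp('^([0-1]?[0-9]|2[0-3]):[0-5][0-9]$');

console.log(timePattern.test('09:30')); // true
console.log(timePattern.test('23:59')); // true
console.log(timePattern.test('9:30'));  // true (the leading zero is optional)
console.log(timePattern.test('24:00')); // false (hour out of range)
console.log(timePattern.test('12:60')); // false (minute out of range)
```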

Using Schemas

Inline Schema

Pass the schema object directly to the extract node:

import { createFlow, extract } from '@doclo/flows';

const flow = createFlow()
  .step('extract', extract({
    provider: vlmProvider,
    schema: invoiceSchema
  }))
  .build();

Schema Reference

Reference schemas stored in the registry by ID and version:

const flow = createFlow()
  .step('extract', extract({
    provider: vlmProvider,
    schema: { ref: 'invoice@2.1.0' }
  }))
  .build();

The reference format is id@version (e.g., bdn@1.0.0, invoice@2.1.0).
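As a rough illustration of the `id@version` format, a reference string can be split at the `@` separator. `parseSchemaRef` and `SchemaRef` are hypothetical names for this sketch; the SDK resolves references itself, and its own parsing rules are not documented here.

```typescript
// Hypothetical parser for "id@version" references; not the SDK's own code.
type SchemaRef = { id: string; version: string };

function parseSchemaRef(ref: string): SchemaRef {
  const at = ref.lastIndexOf('@');
  // Reject strings with no '@', an empty id, or an empty version.
  if (at <= 0 || at === ref.length - 1) {
    throw new Error(`Invalid schema reference: "${ref}" (expected id@version)`);
  }
  return { id: ref.slice(0, at), version: ref.slice(at + 1) };
}

console.log(parseSchemaRef('invoice@2.1.0')); // { id: 'invoice', version: '2.1.0' }
```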

Enhanced Schema

Add examples and extraction guidance alongside the schema:

const flow = createFlow()
  .step('extract', extract({
    provider: vlmProvider,
    schema: {
      schema: invoiceSchema,
      examples: [
        {
          description: 'Standard invoice',
          input: 'Invoice #: INV-001\nTotal: $500.00',
          output: { invoiceNumber: 'INV-001', totalAmount: 500.00 }
        }
      ],
      extractionRules: 'Focus on the header section for invoice metadata',
      contextPrompt: 'This is a commercial invoice document',
      hints: ['Amounts may include currency symbols', 'Dates vary in format']
    }
  }))
  .build();

Schema Registry

The SDK includes a schema registry for storing and retrieving versioned schemas.

Local Registry

import {
  SCHEMA_REGISTRY,
  registerSchema,
  getSchema,
  getLatestSchema
} from '@doclo/schemas';

// Register a schema
registerSchema({
  id: 'invoice',
  version: '2.1.0',
  schema: invoiceSchema,
  description: 'Invoice extraction schema',
  tags: ['finance', 'accounting'],
  createdAt: new Date().toISOString(),
  updatedAt: new Date().toISOString()
});

// Get a specific version
const schema = getSchema('invoice', '2.1.0');

// Get the latest version
const latest = getLatestSchema('invoice');

// List all versions
const versions = SCHEMA_REGISTRY.listVersions('invoice');
// ['2.1.0', '2.0.0', '1.0.0'] (sorted descending)

// Check if a schema exists
const exists = SCHEMA_REGISTRY.has('invoice', '2.1.0');

// Get by reference string
const schemaByRef = SCHEMA_REGISTRY.getByRef('invoice@2.1.0');
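The comment above notes that versions come back sorted in descending order. A plain string sort would misorder versions such as `1.10.0` and `1.9.0`, so a numeric comparison is needed. The comparator below is a hypothetical sketch of one way to do that, not the registry's actual implementation, and it ignores pre-release and build suffixes.

```typescript
// Hypothetical comparator: orders "major.minor.patch" strings newest first.
// Pre-release and build suffixes are not handled in this sketch.
function compareVersionsDesc(a: string, b: string): number {
  const pa = a.split('.').map(Number);
  const pb = b.split('.').map(Number);
  for (let i = 0; i < 3; i++) {
    const diff = (pb[i] || 0) - (pa[i] || 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

const sortedVersions = ['1.0.0', '2.1.0', '2.0.0', '1.10.0'].sort(compareVersionsDesc);
console.log(sortedVersions); // ['2.1.0', '2.0.0', '1.10.0', '1.0.0']
```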

Remote Registry (Cloud)

Fetch schemas from Doclo Cloud with automatic caching:

import { DocloClient, RemoteSchemaRegistry } from '@doclo/client';

const client = new DocloClient({ apiKey: process.env.DOCLO_API_KEY });
const schemas = new RemoteSchemaRegistry(client, {
  ttlMs: 600000,        // Cache for 10 minutes
  autoRegisterLocal: true  // Also register in local SCHEMA_REGISTRY
});

// Fetch a specific version
const schema = await schemas.get('invoice', '2.1.0');

// Fetch the latest version
const latest = await schemas.getLatest('invoice');

// Preload multiple schemas
await schemas.preload([
  'invoice@2.1.0',
  'receipt@1.0.0',
  'bdn@1.0.0'
]);
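The `ttlMs` option suggests time-based cache expiry. The sketch below shows the general idea with a small in-memory cache; `TtlCache` is a hypothetical class written for this example and says nothing about how `RemoteSchemaRegistry` is implemented internally. The clock is injectable so expiry can be shown without waiting.

```typescript
// Hypothetical TTL cache illustrating the idea behind `ttlMs`.
class TtlCache<T> {
  private entries: { [key: string]: { value: T; expiresAt: number } } = {};
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  set(key: string, value: T): void {
    this.entries[key] = { value, expiresAt: this.now() + this.ttlMs };
  }

  get(key: string): T | undefined {
    const entry = this.entries[key];
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      delete this.entries[key]; // expired: drop it so the caller refetches
      return undefined;
    }
    return entry.value;
  }
}

// Fake clock, in milliseconds.
let fakeTime = 0;
const schemaCache = new TtlCache<string>(600000, () => fakeTime);
schemaCache.set('invoice@2.1.0', 'cached schema');

fakeTime = 599999;
console.log(schemaCache.get('invoice@2.1.0')); // 'cached schema'

fakeTime = 600000;
console.log(schemaCache.get('invoice@2.1.0')); // undefined (expired)
```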

You can also use the client directly:

// Get a specific version
const schemaAsset = await client.schemas.get('invoice', '2.1.0');

// Get the latest version
const latest = await client.schemas.getLatest('invoice');

// List all versions
const { versions } = await client.schemas.listVersions('invoice');

Schema Asset Structure

A schema asset includes metadata alongside the schema definition:

type SchemaAsset = {
  // Identity
  id: string;              // "invoice"
  version: string;         // "2.1.0" (semver)

  // Content
  schema: JSONSchemaObject;  // The JSON Schema

  // Metadata
  description?: string;    // Human-readable description
  tags?: string[];         // Categorization tags
  changelog?: string;      // Version change notes

  // Timestamps
  createdAt: string;       // ISO date
  updatedAt: string;       // ISO date
};

Built-in Schemas

The SDK includes pre-built schemas for common document types:

| Schema | Description |
| --- | --- |
| `bdn@1.0.0` | Bunker Delivery Note (maritime fuel delivery) |

Access via the schema registry:

import { SCHEMA_REGISTRY, bdnSchema } from '@doclo/schemas';

// Get the schema asset (with metadata)
const bdnAsset = SCHEMA_REGISTRY.get('bdn', '1.0.0');

// Or use the raw schema directly
console.log(bdnSchema); // The JSON Schema object

Schema Best Practices

Write Detailed Descriptions

The description is the most important property for extraction accuracy:

// Less effective
{ description: 'The total' }

// More effective
{
  description: 'Total amount due including tax. Look for labels like "Total", "Amount Due", "Grand Total", or "Balance Due". Usually appears at the bottom of the document near the subtotal. May include currency symbols.'
}

Specify Where to Find Data

Help the AI locate fields in the document:

{
  refNumber: {
    type: 'string',
    description: 'The primary unique identifier for the document. Look for labels such as "Reference", "Ref No", "Document No.", or "ID". This is often located in the header or top section. Examples: "REF-2023-10543", "DOC/23/554".'
  }
}

Handle Optional and Missing Data

Use nullable types for fields that may not exist:

{
  taxAmount: {
    type: 'number',
    nullable: true,
    description: 'Tax amount if shown. Set to null if not applicable or not present on the document.'
  }
}

Or use union types:

{
  taxAmount: {
    type: ['number', 'null'],
    description: 'Tax amount if shown, null if not applicable'
  }
}

Structure Nested Data Logically

Group related fields into objects:

{
  vendor: {
    type: 'object',
    description: 'Details of the supplier or vendor',
    properties: {
      name: {
        type: 'string',
        description: 'Company or trading name'
      },
      legalName: {
        type: 'string',
        nullable: true,
        description: 'Full registered legal name with corporate designator (Ltd, Inc, etc.)'
      },
      address: {
        type: 'string',
        description: 'Street address, city, and postal code'
      },
      phone: {
        type: 'string',
        nullable: true,
        description: 'Primary phone number, digits only without country code'
      }
    },
    required: ['name']
  }
}

Define Array Item Schemas

Always specify the structure of array items:

{
  lineItems: {
    type: 'array',
    description: 'Individual line items or products on the invoice',
    items: {
      type: 'object',
      properties: {
        description: {
          type: 'string',
          description: 'Product or service description'
        },
        quantity: {
          type: 'number',
          description: 'Quantity ordered or delivered'
        },
        unitPrice: {
          type: 'number',
          description: 'Price per unit'
        },
        amount: {
          type: 'number',
          description: 'Line total (quantity x unitPrice)'
        }
      },
      required: ['description', 'amount']
    }
  }
}

Handle Data Format Variations

Provide guidance for common format variations:

{
  date: {
    type: 'string',
    format: 'date',
    description: 'Document date. Convert from formats like DD.MM.YY, MM/DD/YYYY, or "January 15, 2024" to the standard YYYY-MM-DD format.'
  },
  phone: {
    type: 'string',
    description: 'Phone number. Extract only numeric digits, excluding country code and formatting characters like dashes or parentheses. For "(123) 456-7890", extract "1234567890".'
  }
}
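Descriptions guide the model, but a small post-processing step can act as a safety net when the output still varies. The helpers below are hypothetical and not part of the SDK. `dottedDateToIso` handles only the four-digit-year form `DD.MM.YYYY`, since a two-digit year would require a century assumption, and `normalizePhone` does not attempt to strip a country code.

```typescript
// Hypothetical post-processing helpers; not part of the SDK.

// Keep digits only: "(123) 456-7890" -> "1234567890"
function normalizePhone(raw: string): string {
  return raw.replace(/\D/g, '');
}

// Convert DD.MM.YYYY to YYYY-MM-DD; return null for anything else.
function dottedDateToIso(raw: string): string | null {
  const match = /^(\d{2})\.(\d{2})\.(\d{4})$/.exec(raw.trim());
  if (!match) return null;
  const [, day, month, year] = match;
  return `${year}-${month}-${day}`;
}

console.log(normalizePhone('(123) 456-7890')); // '1234567890'
console.log(dottedDateToIso('15.01.2024'));    // '2024-01-15'
console.log(dottedDateToIso('01/15/2024'));    // null
```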

Use Enums for Known Value Sets

Constrain values to valid options:

{
  documentType: {
    type: 'string',
    enum: ['invoice', 'receipt', 'credit_note', 'quote'],
    description: 'Type of financial document'
  },
  paymentStatus: {
    type: 'string',
    enum: ['paid', 'pending', 'overdue', 'cancelled'],
    description: 'Current payment status'
  }
}

Add Validation Constraints

Use validation keywords for data quality:

{
  quantity: {
    type: 'number',
    minimum: 0,
    description: 'Quantity ordered. Must be zero or greater.'
  },
  email: {
    type: 'string',
    format: 'email',
    description: 'Contact email address'
  },
  postalCode: {
    type: 'string',
    pattern: '^[0-9]{5}(-[0-9]{4})?$',
    description: 'US postal code (ZIP or ZIP+4)'
  }
}

TypeScript Integration

Infer Types from Schemas

Declare the schema with TypeScript's `as const` so its literal types are preserved, then define a matching type and pass it to `extract`:

const invoiceSchema = {
  type: 'object',
  required: ['invoiceNumber', 'totalAmount'],
  properties: {
    invoiceNumber: { type: 'string' },
    totalAmount: { type: 'number' },
    vendor: {
      type: 'object',
      properties: {
        name: { type: 'string' }
      }
    }
  }
} as const;

// Define a matching type
type Invoice = {
  invoiceNumber: string;
  totalAmount: number;
  vendor?: {
    name?: string;
  };
};

// Use with extract node
const flow = createFlow()
  .step('extract', extract<Invoice>({
    provider: vlmProvider,
    schema: invoiceSchema
  }))
  .build();

const result = await flow.run({ base64: pdf });
// result.output is typed as Invoice
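The example above writes the `Invoice` type by hand. As an alternative, a conditional mapped type can derive a similar shape directly from an `as const` schema. The `InferSchema` type below is a hypothetical, deliberately minimal sketch and not an SDK export: it treats every property as optional and ignores `required`, `enum`, and `nullable`.

```typescript
// Hypothetical type-level sketch: derive a TypeScript shape from an
// `as const` schema. All properties come out optional.
type InferSchema<S> =
  S extends { readonly type: 'string' } ? string
  : S extends { readonly type: 'number' | 'integer' } ? number
  : S extends { readonly type: 'boolean' } ? boolean
  : S extends { readonly type: 'array'; readonly items: infer I } ? InferSchema<I>[]
  : S extends { readonly type: 'object'; readonly properties: infer P }
    ? { -readonly [K in keyof P]?: InferSchema<P[K]> }
    : unknown;

const sketchSchema = {
  type: 'object',
  properties: {
    invoiceNumber: { type: 'string' },
    totalAmount: { type: 'number' },
    lineItems: {
      type: 'array',
      items: {
        type: 'object',
        properties: { amount: { type: 'number' } }
      }
    }
  }
} as const;

// { invoiceNumber?: string; totalAmount?: number; lineItems?: { amount?: number }[] }
type SketchInvoice = InferSchema<typeof sketchSchema>;

// The compiler checks this literal against the derived type.
const sampleInvoice: SketchInvoice = {
  invoiceNumber: 'INV-001',
  totalAmount: 500,
  lineItems: [{ amount: 500 }]
};
console.log(sampleInvoice.totalAmount);
```

Without `as const`, `type` would widen to `string` and none of the branches would match, which is why the assertion matters.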

Use Zod Schemas

The SDK automatically converts Zod schemas to JSON Schema:

import { z } from 'zod';

const invoiceSchema = z.object({
  invoiceNumber: z.string().describe('Invoice number or ID'),
  totalAmount: z.number().describe('Total invoice amount'),
  date: z.string().optional().describe('Invoice date in YYYY-MM-DD format'),
  lineItems: z.array(z.object({
    description: z.string(),
    amount: z.number()
  })).optional()
});

type Invoice = z.infer<typeof invoiceSchema>;

const flow = createFlow()
  .step('extract', extract<Invoice>({
    provider: vlmProvider,
    schema: invoiceSchema  // Zod schema converted automatically
  }))
  .build();

Provider Considerations

Different providers have varying schema support. The SDK handles translation automatically:

| Provider | Native Format | Nullable Support | Notes |
| --- | --- | --- | --- |
| OpenAI | JSON Schema (strict mode) | `anyOf` with `null` | SDK auto-adds `additionalProperties: false` |
| Anthropic | Tool input schema | `nullable: true` | Direct nullable support |
| Google Gemini | OpenAPI 3.0 subset | `nullable: true` | SDK adds `propertyOrdering` |
| OpenRouter | Varies by model | `anyOf` with `null` | Uses OpenAI-style for Claude |

The SchemaTranslator class handles these conversions:

import { SchemaTranslator } from '@doclo/providers-llm';

const translator = new SchemaTranslator();

// OpenAI format (converts nullable to anyOf)
const openaiSchema = translator.toOpenAISchema(schema);

// Claude tool format (wraps in tool definition)
const claudeSchema = translator.toClaudeToolSchema(schema);

// Gemini format (adds propertyOrdering)
const geminiSchema = translator.toGeminiSchema(schema);

Validation

The SDK includes a lightweight JSON Schema validator that works in all environments including Edge Runtime:

import { validateJson } from '@doclo/core';

const data = { invoiceNumber: 'INV-001', totalAmount: 500 };

try {
  const validated = validateJson<Invoice>(data, invoiceSchema);
  // validated is typed as Invoice
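} catch (error) {
  // `error` is `unknown` under strict TypeScript settings, so narrow it first
  const message = error instanceof Error ? error.message : String(error);
  console.error('Validation failed:', message);
}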

Next Steps

Extract Node

Flows
