Information extraction from documents isn't primarily an OCR problem anymore.

OCR systems have improved, but they remain sensitive to image quality, contrast, overlapping stamps, and similar noise. Moreover, suppliers change layouts, fields move between pages, and assumptions that worked for one vendor often fail for the next.

Large multimodal models offer another option: treating extraction as a document understanding task rather than a template matching problem.

Recently, I built an event-driven invoice processing pipeline using Gemini on Vertex AI. Rather than relying on document-specific templates, the pipeline treats invoice extraction as a document-understanding problem and combines AI-based extraction with deterministic validation.

This article walks through the architecture, the tradeoffs behind its design, and some of the limitations that became apparent during implementation.

The Constraints That Shaped This Design

Before discussing the architecture, it's worth explaining the constraints that influenced it:

PDF uploads can be large and unpredictable.
AI inference latency is variable.
Invoice formats change frequently.
Duplicate processing is preferable to dropped documents.
The application should remain simple to operate.
Security is paramount.

Those constraints pushed the design toward an asynchronous, event-driven architecture.

Architecture Overview

This design uses an asynchronous ingestion pattern to avoid the resource constraints and timeout risks associated with handling heavy file uploads over synchronous connections.

Invoices are a good fit for asynchronous processing because they don’t require immediate responses. That opens the door to separating ingestion from processing.

The application layer remains stateless, delegating ingestion, buffering, and persistence to managed cloud infrastructure.

In this design, file uploads are decoupled from extraction. Once a document arrives, it enters a pipeline that processes it independently.

 [ Cloud Storage Bucket ] 
            |
    (Object Created)
            ↓
[ Pub/Sub Ingestion Topic ]
            ↓
  [ Cloud Run Worker ]
            ↓
        [ Gemini ]
            ↓
    [ Processed Topic ]
            ↓
       [ BigQuery ]

Each component has a single responsibility:

Cloud Storage handles file ingestion.
Pub/Sub handles buffering and retries.
Cloud Run performs document processing.
Gemini extracts structured invoice data.
BigQuery stores the results for analysis.

The nice side effect is that every component can scale independently.

If invoices arrive faster than Gemini can process them, messages accumulate in Pub/Sub rather than being dropped. Once capacity becomes available again, workers simply continue consuming messages.
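As a rough illustration of the worker's side of this flow, here is a minimal sketch of decoding a Pub/Sub message that wraps a Cloud Storage notification. It assumes a push subscription delivering to Cloud Run and a JSON notification payload; the `InvoiceEvent` type and the `parse_push_envelope` and `handle` helpers are hypothetical names, and the choice to acknowledge malformed messages is an assumption rather than something the pipeline necessarily does.

```python
import base64
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class InvoiceEvent:
    """Location of an uploaded invoice, taken from a storage notification."""
    bucket: str
    name: str
    generation: str
    message_id: str


def parse_push_envelope(envelope: dict) -> InvoiceEvent:
    """Decode a Pub/Sub push request body that wraps a Cloud Storage notification."""
    message = envelope["message"]
    # Push delivery carries the payload base64-encoded in the "data" field.
    payload = json.loads(base64.b64decode(message["data"]).decode("utf-8"))
    return InvoiceEvent(
        bucket=payload["bucket"],
        name=payload["name"],
        generation=str(payload.get("generation", "")),
        message_id=message.get("messageId", ""),
    )


def handle(envelope: dict, process) -> int:
    """Return an HTTP status: a 2xx acknowledges, anything else asks for redelivery."""
    try:
        event = parse_push_envelope(envelope)
    except (KeyError, ValueError, TypeError):
        # Assumption: a message that cannot be parsed will never succeed,
        # so it is acknowledged instead of being retried indefinitely.
        return 204
    try:
        process(event)
    except Exception:
        # Processing failed (for example, a slow or failing model call):
        # a non-2xx status leaves the message in Pub/Sub for a later retry.
        return 500
    return 204
```

Returning an error status on failure is what lets Pub/Sub act as the buffer: the worker holds no state of its own, and an unprocessed invoice simply stays in the subscription.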

Letting Gemini Handle Document Understanding

One thing worth mentioning is that this pipeline doesn't use a traditional OCR stage.

Instead, the PDF is converted to Markdown and then passed to Gemini through Vertex AI. LLMs are trained on large corpora that include Markdown documents, and they tend to be very good at picking up patterns in that format.

The model receives both the document and a predefined output schema describing the invoice fields I want returned.

This shifts the problem away from template-based extraction and toward output consistency. The main challenge is no longer parsing text, but ensuring the model produces reliable structured data across different invoice formats.

Why Store JSON Instead of a Strict Schema?

Instead of forcing extracted data into a strict relational schema, the pipeline stores raw structured output in a JSON column.

The reason is flexibility. Invoice formats vary, and different vendors expose different fields. Enforcing a rigid schema at the ingestion boundary would tightly couple the extraction layer to downstream analytics.

By storing JSON directly, the system lets downstream consumers decide how to interpret and normalize the data, and how much structure to impose, through SQL views, transformation jobs, or tools like dbt.

The tradeoff is that consumers take on more responsibility for shaping data.
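To make the split of responsibilities concrete, here is a small sketch of a raw row and one consumer's projection of it. The column names (`source_uri`, `extracted`) and the invoice field names are hypothetical, and in practice this shaping might live in a SQL view or a dbt model; Python is used here only for illustration.

```python
import json


def build_raw_row(source_uri: str, extracted: dict) -> dict:
    """Row for the raw table: provenance plus the untouched model output as JSON text."""
    return {"source_uri": source_uri, "extracted": json.dumps(extracted, sort_keys=True)}


def to_summary_row(raw_row: dict) -> dict:
    """One consumer's view: pick the fields it cares about and ignore the rest."""
    doc = json.loads(raw_row["extracted"])
    return {
        "source_uri": raw_row["source_uri"],
        "invoice_number": doc.get("invoice_number"),
        "vendor_name": doc.get("vendor_name"),
        "total": doc.get("total"),
        # Missing or null line items count as zero instead of raising.
        "line_item_count": len(doc.get("line_items") or []),
    }
```

A vendor-specific field such as a purchase order reference survives in the raw row even though this particular consumer never reads it, so another consumer can pick it up later without any change to the extraction service.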

Idempotency and Duplicate Events

Event-driven systems naturally introduce a subtle issue: duplicate messages.

Pub/Sub guarantees at-least-once delivery, which means the same invoice may occasionally be processed more than once due to retries or worker restarts.

A common solution is to introduce deduplication using external state, such as Redis or a tracking database.

In this system, I chose not to do that.

Instead, the Cloud Run service remains completely stateless. Every event is processed independently, and duplicates are handled downstream in BigQuery, where the full extraction result is available and they can be analyzed or filtered during transformation.

This approach has a few advantages:

The application layer remains simple to operate.
No additional cache or database is required.
Deduplication logic can evolve without redeploying the processing service.
Analysts can inspect duplicate records rather than having them silently discarded.

The tradeoff is that duplicate records may temporarily exist in the raw dataset until downstream processes identify and consolidate them.
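The consolidation step can be sketched as follows. This assumes each raw record carries the object's bucket, name, and generation plus a processing timestamp, and that the latest record per object should win; those field names and that rule are illustrative assumptions. In BigQuery the same logic would more likely be a window function in a view; Python is used here only to show the idea.

```python
from collections import Counter


def _key(row: dict) -> tuple:
    """Identity of an uploaded object; reprocessing the same upload yields the same key."""
    return (row["bucket"], row["name"], row["generation"])


def deduplicate(rows: list[dict]) -> list[dict]:
    """Keep the most recently processed record for each uploaded object."""
    latest: dict[tuple, dict] = {}
    for row in rows:
        current = latest.get(_key(row))
        # ISO 8601 timestamps in the same timezone compare correctly as strings.
        if current is None or row["processed_at"] > current["processed_at"]:
            latest[_key(row)] = row
    return list(latest.values())


def duplicate_keys(rows: list[dict]) -> list[tuple]:
    """List the objects that were processed more than once, for inspection."""
    counts = Counter(_key(row) for row in rows)
    return [key for key, count in counts.items() if count > 1]
```

Because the raw rows are never modified, the rule for picking a winner can change later without touching the processing service.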

Validation Still Matters

A well-formed output is not the same as a correct one. Gemini can be instructed to return a response that conforms to a predefined schema, but a valid JSON response doesn't necessarily mean the extracted values are correct.

For that reason, I treat extraction and validation as separate concerns. While a thorough validation is performed in BigQuery, basic consistency checks are performed in the application itself, flagging invoices that do not pass.

The pipeline uses a simple two-tier validation approach:

        Raw PDF
            ↓
        Gemini Extraction
            ↓
        Structured JSON
            ↓
        Validation
            ↓
        Valid / Invalid

Tier 1: Structural Validation

A Pydantic schema is passed to Vertex AI as part of the response configuration. The schema constrains the model to a specific JSON structure and provides type validation, which reduces malformed outputs.
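A minimal sketch of what such a schema could look like is shown below. The field names are illustrative assumptions, not the pipeline's actual schema, and `parse_invoice` is a hypothetical helper that re-checks the model's response text on the application side. How the class is handed to Vertex AI depends on the SDK in use, so that call is left out.

```python
import json

from pydantic import BaseModel, ValidationError


class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float
    amount: float


class Invoice(BaseModel):
    invoice_number: str
    vendor_name: str
    currency: str
    subtotal: float
    tax: float
    total: float
    line_items: list[LineItem]


def parse_invoice(response_text: str):
    """Return (invoice, None) on success or (None, error message) on failure."""
    try:
        # Parsing and constructing the model checks field presence and types.
        return Invoice(**json.loads(response_text)), None
    except (ValidationError, ValueError, TypeError) as exc:
        return None, str(exc)
```

Checking the response again in the application is cheap, and it catches truncated or otherwise malformed responses before they reach the business checks.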

Tier 2: Business Validation

After extraction, deterministic checks verify internal consistency:

line items match totals
taxes are calculated correctly
required fields exist

The result is not perfect correctness, but a flag attached to the output that helps identify obvious inconsistencies before data moves downstream.
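The checks above can be sketched as a small function that returns a list of problems. This is an illustration under assumptions: the field names, the optional `tax_rate` field, and the one-cent tolerance are hypothetical, and the input is assumed to have already passed structural validation.

```python
from decimal import Decimal

TOLERANCE = Decimal("0.01")  # assumed allowance for rounding differences
REQUIRED_FIELDS = ("invoice_number", "vendor_name", "subtotal", "tax", "total", "line_items")


def _money(value) -> Decimal:
    """Convert via str so floats like 0.1 do not carry binary noise into Decimal."""
    return Decimal(str(value))


def validate_invoice(invoice: dict) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS
        if invoice.get(name) in (None, "", [])
    ]
    if problems:
        # The arithmetic checks below need every required field.
        return problems

    line_sum = sum((_money(item["amount"]) for item in invoice["line_items"]), Decimal("0"))
    subtotal = _money(invoice["subtotal"])
    tax = _money(invoice["tax"])
    total = _money(invoice["total"])

    if abs(line_sum - subtotal) > TOLERANCE:
        problems.append("line items do not sum to subtotal")
    if abs(subtotal + tax - total) > TOLERANCE:
        problems.append("total does not equal subtotal plus tax")

    rate = invoice.get("tax_rate")
    if rate is not None and abs(subtotal * _money(rate) - tax) > TOLERANCE:
        problems.append("tax does not match subtotal times tax rate")
    return problems


def with_flag(invoice: dict) -> dict:
    """Attach the validation outcome to the record without dropping anything."""
    problems = validate_invoice(invoice)
    return {**invoice, "is_valid": not problems, "validation_problems": problems}
```

Flagging instead of rejecting keeps invalid invoices in the dataset, so they can still be inspected downstream.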

Lessons Learned

The biggest takeaway from building this pipeline is that AI works best when it’s constrained to a well-defined role inside a system, not when it’s expected to handle everything end-to-end.

A few practical observations:

AI should perform a specific, bounded task
Deterministic logic should remain outside of the model
Complexity doesn’t disappear with AI; it moves into validation, monitoring, and edge-case handling
Clear boundaries make LLMs easier to integrate without destabilizing the architecture

In this setup, Gemini is just one component performing a single, clearly defined action. Keeping that boundary clear is what makes the architecture manageable.

Closing Thoughts

This system doesn’t make invoice extraction “solved”. It aims for a middle ground between complexity and functionality.

Instead of focusing on templates and parsing rules, most of the effort moves into system design: how data flows, where validation happens, and how to handle uncertainty without breaking everything downstream.

LLMs don't magically solve problems, but they can be effective when used strategically. Everything around the model (queues, retries, validation, and storage) is still classic engineering.

That separation is what keeps the architecture manageable. The model can change without forcing a redesign of the pipeline.

The main lesson is simple: AI works best when its role is clearly bounded. The more focused and well defined that role is, the easier it is to build systems around it that stay reliable.