2026 · Solo developer

Report.AI — Fund-Document Intelligent Extraction Pipeline

An LLM-powered extractor for fund annual reports and prospectuses. Reads the PDFs, processes the data, fills the database.

Overview

Report.AI ingests two types of fund documents (annual reports and prospectuses) and produces a structured Excel workbook tailored to the client’s needs: fund metadata, fees, NAV figures, share-class details. It churns through the tedious and difficult part of fund analysis while you do something else.

Architecture

A multistep conveyor-belt architecture: the output of every station is validated before the next one runs.

  • Text extraction. Three backends (Docling, PyMuPDF, AWS Textract), selected per document. PyMuPDF handles the bulk of the pages, while AWS Textract serves as a fallback for handwritten pages and bad scans.
  • Document mapping. A fast first LLM pass identifies the sections of both the annual report and the prospectus and pairs them. The resulting map is a detailed skeleton of where each piece of information is found.
  • Field extraction. 17+ specialized prompt templates with explicit rules: value conversions, multilingual preservation, and area-specific legislation. Each template knows what it’s looking for and where to find it.
  • Export. Templated Excel: computed fields, formulas, and data validation. The workbook exists so the data can be verified manually before anything is persisted to the database.
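
The station-by-station validation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual code: the Station class, run_pipeline, the context keys, and the lambda backends are all hypothetical stand-ins for the real PDF backends and LLM calls.

```python
from dataclasses import dataclass
from typing import Callable

Context = dict  # shared state passed from station to station


class StationError(Exception):
    """Raised when a station's output fails its validation check."""


@dataclass
class Station:
    name: str
    run: Callable[[Context], Context]    # returns the keys this station adds
    validate: Callable[[Context], bool]  # inspects the context after run()


def run_pipeline(stations: list[Station], context: Context) -> Context:
    """Run stations in order, stopping at the first invalid output."""
    for station in stations:
        context = {**context, **station.run(context)}
        if not station.validate(context):
            raise StationError(f"{station.name} produced invalid output")
    return context


# Hypothetical stand-ins; the real stations call PDF backends and LLMs.
def extract_text(ctx: Context) -> Context:
    # Use the primary backend, falling back when it returns no text.
    text = ctx["primary_backend"](ctx["pdf"]) or ctx["fallback_backend"](ctx["pdf"])
    return {"text": text}


def map_sections(ctx: Context) -> Context:
    return {"sections": {"fees": ctx["text"]}}


def extract_fields(ctx: Context) -> Context:
    return {"fields": {"management_fee": ctx["sections"]["fees"]}}


STATIONS = [
    Station("text_extraction", extract_text, lambda c: bool(c["text"].strip())),
    Station("document_mapping", map_sections, lambda c: bool(c["sections"])),
    Station("field_extraction", extract_fields, lambda c: "management_fee" in c["fields"]),
]

if __name__ == "__main__":
    result = run_pipeline(STATIONS, {
        "pdf": "report.pdf",
        "primary_backend": lambda path: "",  # simulates a bad scan
        "fallback_backend": lambda path: "Management fee: 0.75%",
    })
    print(result["fields"])
```

Because a failing station raises before the next one runs, a bad extraction never reaches the export step in this sketch.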

Highlights

  • Cost-aware model routing. Gemini Flash / Lite / Pro and Claude Haiku / Sonnet / Opus are hand-picked per task, with dynamic cost tracking. This keeps the cost per document low.
  • Multi-document RAG. Cross-validating the annual report against the prospectus provides a second source of truth. A ChromaDB-backed RAG system combines information from both documents efficiently, without increasing API usage.
  • Crash-resilient. Pipeline state is persisted across runs, so an interrupted run does not waste API tokens.
  • Speed through parallelism. Documents are easily processed in parallel, for a turnaround of only a few minutes per document.
  • Scales to large documents. Skeleton mapping is fast, efficient, and low-cost, allowing documents of up to 1,500 pages to be completed within minutes.
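
As a rough illustration of how per-task routing, cost tracking, and persisted state might fit together, here is a minimal Python sketch. Everything in it is an assumption made for the example: the CheckpointedRunner class, the "small" and "large" tier names, the routing table, and the prices are hypothetical, and the real pipeline may store its state quite differently.

```python
import json
from pathlib import Path
from typing import Callable

# Hypothetical prices in USD per 1,000 tokens; real provider rates differ.
PRICE_PER_1K_TOKENS = {"small": 0.0005, "large": 0.01}
# Hypothetical task-to-tier routing table.
ROUTES = {"document_mapping": "small", "field_extraction": "large"}


class CheckpointedRunner:
    """Caches each step's result on disk so a rerun skips finished steps."""

    def __init__(self, state_path: Path):
        self.state_path = state_path
        self.state = json.loads(state_path.read_text()) if state_path.exists() else {}

    def run(self, step_id: str, task: str, call: Callable[[str], tuple[str, int]]) -> str:
        if step_id in self.state:  # finished in an earlier run, no API call
            return self.state[step_id]["output"]
        model = ROUTES[task]
        output, tokens = call(model)  # call returns (text, tokens used)
        cost = tokens / 1000 * PRICE_PER_1K_TOKENS[model]
        self.state[step_id] = {"output": output, "model": model, "cost": cost}
        # Write to a temp file, then swap it in, so a crash mid-write
        # cannot leave a half-written state file behind.
        tmp = self.state_path.with_suffix(".tmp")
        tmp.write_text(json.dumps(self.state))
        tmp.replace(self.state_path)
        return output

    def total_cost(self) -> float:
        return sum(step["cost"] for step in self.state.values())
```

In this sketch a restarted process builds a new CheckpointedRunner on the same state file, and any step whose id is already recorded returns its stored output instead of calling the model again.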

Status

In active use by the client. Deployed across Q1 2026.