Building a Searchable Archive: How AI Processes 8,000+ Legal Documents
A technical deep-dive into OCR technology, entity extraction (12,243 people, 5,709 organizations, 3,211 locations), AI deduplication systems, and the open-source Python pipeline that made it possible.

📌 Disclaimer & Context:
This article examines the technical implementation of an open-source document processing system as a case study in AI-powered archival technology. The focus is on the engineering challenges of OCR, entity extraction, and deduplication at scale. This is a conceptual analysis of publicly available technology that has implications for legal transparency, investigative journalism, and public access to court records. The techniques discussed are applicable to any large-scale document digitization project.
Executive Summary
When thousands of legal documents are released to the public, they often arrive as scanned images: unstructured, unsearchable, and locked away from meaningful analysis. This article examines an open-source project that transformed 8,175 document images into a fully searchable, entity-indexed, AI-analyzed archive. Using a Python-based OCR pipeline, computer vision models, and advanced entity deduplication algorithms, the system extracted text from both printed and handwritten sources, identified 12,243 people, 5,709 organizations, and 3,211 locations, and generated AI-powered summaries for every document. This is not just a technical achievement; it's a blueprint for democratizing access to legal information in the age of AI.
8,175
Documents Processed
12,243
People Identified
5,709
Organizations
3,211
Locations Mapped
1. The Problem: Locked Knowledge in Scanned Images
Court documents, depositions, financial records, and legal filings are the raw material of transparency in democratic societies. When high-profile legal cases conclude or when freedom of information requests are fulfilled, thousands of pages are often released to the public. But there's a problem: these documents are almost always scanned images, not searchable text.
A scanned PDF is, to a computer, just a collection of pixels: a photograph of a page. You cannot search for a person's name. You cannot extract dates. You cannot cross-reference mentions of a company across hundreds of documents. For journalists, researchers, and concerned citizens, this creates an insurmountable barrier. Reading 8,000 pages manually is impossible for a single person and impractical even for a team.
The Scale of the Challenge
- • 8,175 document images requiring OCR processing
- • Mixed content: printed text, handwritten notes, stamps, signatures
- • Varying quality: some pages clear, others degraded or redacted
- • No standardized naming: document numbers formatted inconsistently
- • Multi-page documents split into individual scans needing reconstruction
The answer lies in Optical Character Recognition (OCR) powered by modern AI vision models. But OCR alone is not enough. To make a document archive truly useful, you need entity extraction, deduplication, document reconstruction, and semantic search. This is the story of how one open-source project solved all of these problems.
2. The OCR Pipeline: From Pixels to Structured Data
Traditional OCR tools like Tesseract work well for clean, printed text. But legal documents are rarely clean. They contain:
- • Handwritten annotations in margins and on forms
- • Stamps and signatures overlaid on text
- • Multi-column layouts that confuse reading order
- • Redaction boxes covering sensitive information
- • Low-resolution scans from older fax or photocopy systems
The solution: AI-powered computer vision models from OpenAI, Google, or similar providers. These models, originally designed for image understanding, excel at extracting text from complex documents. The project uses a Python script (process_images.py) to orchestrate the workflow described below.
2.1 How the OCR Script Works
OCR Processing Workflow
- 1. Image Submission: Each scanned page image is sent to an AI vision API endpoint (OpenAI-compatible)
- 2. Structured Extraction: The AI model returns a JSON object containing:
  - • Full text in reading order
  - • Document metadata (page/document numbers, dates)
  - • Named entities (people, organizations, locations)
  - • Text type annotations (printed, handwritten, stamps)
- 3. Auto-Fix Broken JSON: If the AI returns invalid JSON, the script automatically sends the error message back to the AI along with the original image, asking for a corrected response
- 4. Save Results: Each processed page is saved as a JSON file at ./results/{folder}/{imagename}.json
- 5. Progress Tracking: The script maintains processing_index.json to track which files have been processed (resume-friendly)
The script supports parallel processing with configurable workers (default: 5 concurrent requests) and can be limited to process only the first N images for testing. Failed files are logged for later retry using a cleanup script.
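The worker-pool and resume logic can be sketched in a few lines of Python. This is a simplified illustration, not the project's actual code: `ocr_page` is a hypothetical stand-in for the real vision-API call, and the resume index is reduced to a flat list of filenames.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def ocr_page(image_path):
    """Hypothetical stand-in for the vision-API call; the real script returns
    the structured JSON described above (text, metadata, entities, text types)."""
    return {"image": image_path.name, "text": "", "entities": {"people": []}}

def process_images(image_paths, workers=5, index_file="processing_index.json"):
    """Run OCR over a worker pool, skipping files recorded in the resume index."""
    index_path = Path(index_file)
    done = set(json.loads(index_path.read_text())) if index_path.exists() else set()
    todo = [p for p in image_paths if p.name not in done]
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results pair up with paths.
        for path, result in zip(todo, pool.map(ocr_page, todo)):
            results[path.name] = result
            done.add(path.name)
    index_path.write_text(json.dumps(sorted(done)))
    return results
```

On a second run, every file already listed in the index is filtered out before any API call is made, which is what makes interruption and resumption cheap.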
Key Innovation: Self-Healing JSON
One of the most elegant features is the auto-fix broken JSON mechanism. Large language models occasionally return malformed JSON (missing commas, unclosed brackets). Rather than failing the job, the script catches the error, sends it back to the AI with instructions to fix it, and retries. This dramatically reduces manual intervention.
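The retry loop behind this mechanism can be sketched as follows. This is a minimal illustration, assuming a `call_model` callable that wraps the actual API request (the real script also re-sends the original image with each retry).

```python
import json

def parse_with_retry(call_model, prompt, max_retries=3):
    """Ask the model for JSON; if parsing fails, feed the error message
    back to the model and ask for a corrected response."""
    reply = call_model(prompt)
    for _ in range(max_retries):
        try:
            return json.loads(reply)
        except json.JSONDecodeError as err:
            # Include the parse error so the model can fix exactly what broke.
            reply = call_model(
                f"{prompt}\n\nYour previous reply was invalid JSON "
                f"({err}). Return corrected JSON only."
            )
    raise ValueError("model never returned valid JSON")
```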
3. Entity Extraction: Finding 12,243 People, 5,709 Organizations, 3,211 Locations
OCR gives you the text. But to make it searchable by entity, you need Named Entity Recognition (NER). The AI vision model extracts entities directly from the document images during OCR processing:
People
Names of individuals mentioned in any context: witnesses, defendants, attorneys, flight passengers, etc.
12,243
Organizations
Companies, law firms, government agencies, foundations, and other entities mentioned.
5,709
Locations
Cities, addresses, properties, airports, and geographic references extracted from documents.
3,211
These entities are stored alongside the OCR text in each document's JSON file. The vision model also extracts dates (11,020 temporal references) and document types (41 categories like "Deposition", "Flight Log", "Email", etc.).
But there's a critical problem that emerges at scale: inconsistent naming.
4. The AI Deduplication System: Merging "Epstein" and "Jeffrey Epstein"
When processing thousands of pages, the AI vision model does not maintain consistency across documents. One page might extract "Epstein", another "Jeffrey Epstein", and a third "J. Epstein". Without deduplication, these would appear as three separate people in the index.
The solution is a second AI pass dedicated to entity resolution. The script deduplicate.py collects all extracted entities, groups them in batches (default: 50 entities per batch), and sends them to the AI with instructions to identify duplicates and provide a canonical name for each cluster.
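The batching step is simple list slicing; a sketch matching the described default of 50 entities per batch:

```python
def batch(entities, size=50):
    """Split the collected entity list into fixed-size batches so each
    request stays comfortably inside the model's context window."""
    return [entities[i:i + size] for i in range(0, len(entities), size)]
```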
4.1 How Deduplication Works
- 1. Collect Entities: Scan all JSON files in ./results/ and extract all people, organizations, and locations
- 2. Batch Processing: Group entities into batches of 50 (configurable) to avoid overwhelming the AI context window
- 3. AI Resolution: Send each batch to the AI with a prompt: "Identify which of these names refer to the same person/organization/location and provide the canonical form"
- 4. Generate Mapping: Create dedupe.json with mappings like:
  {
    "people": {
      "Epstein": "Jeffrey Epstein",
      "J. Epstein": "Jeffrey Epstein",
      "Jeffrey Epstein": "Jeffrey Epstein"
    }
  }
- 5. Apply at Build Time: The static site generator automatically applies this mapping when building entity index pages
The same process runs for document types (deduplicate_types.py), merging variants like "deposition", "Deposition", and "DEPOSITION TRANSCRIPT" into a single canonical form. The result: 41 document type categories instead of potentially hundreds of variations.
Why This Matters
Without deduplication, the entity index would be fragmented and nearly useless. A researcher searching for "Jeffrey Epstein" would miss documents where he's referenced as just "Epstein". This AI-powered resolution step is what makes the archive actually searchable.
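Applying the dedupe.json mapping at build time amounts to a dictionary lookup with an identity fallback. A minimal sketch, using the mapping shape shown in step 4 above:

```python
def apply_dedupe(names, mapping):
    """Group raw entity names under their canonical form.
    Names absent from the mapping are their own canonical form."""
    canonical = {}
    for name in names:
        key = mapping.get(name, name)
        canonical.setdefault(key, []).append(name)
    return canonical
```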
5. Document Reconstruction: Grouping 2,000 Pages into Coherent Documents
Legal document releases typically arrive as a folder of individual page scans: page_001.jpg, page_002.jpg, and so on. But each document spans multiple pages. A deposition might be 50 pages. A financial record might be 200.
During OCR processing, the AI extracts document numbers and page numbers from each page (if present). The build system uses these metadata fields to:
- • Group pages by document number
- • Sort pages within each document by page number
- • Reconstruct complete multi-page documents for display
This is non-trivial because document numbers are not consistently formatted. Some pages show "Doc 123", others show "Document #123", and still others show "EPSTEIN-000123". The AI model is instructed to normalize these into a consistent format during extraction.
Result: 8,175 Pages → ~400 Multi-Page Documents
The system successfully reconstructed approximately 400 complete documents from over 8,000 individual page scans, handling inconsistencies in numbering and correctly ordering pages even when metadata was incomplete.
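The grouping and ordering logic can be sketched as below. The normalization here (reducing every variant to its digit run) is an assumption for illustration; the project's actual normalization happens in the AI extraction prompt.

```python
import re
from collections import defaultdict

def group_pages(pages):
    """Group OCR'd pages into documents by normalized document number,
    then sort each document's pages by page number.
    Each page is a dict like {"doc": "EPSTEIN-000123", "page": 2, ...}."""
    def normalize(doc):
        # "Doc 123", "Document #123", "EPSTEIN-000123" all reduce to "123".
        m = re.search(r"(\d+)", doc or "")
        return m.group(1).lstrip("0") if m else None
    docs = defaultdict(list)
    for page in pages:
        key = normalize(page.get("doc"))
        if key:
            docs[key].append(page)
    for key in docs:
        # Pages with missing page numbers sort first rather than crashing.
        docs[key].sort(key=lambda p: p.get("page") or 0)
    return dict(docs)
```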
6. The AI Analysis Layer: Summaries, Key Topics, and Significance
Having searchable text and entity indexes is powerful. But for a researcher trying to understand which documents are important, reading through hundreds of depositions is still overwhelming. The final layer of the pipeline adds AI-generated document analysis.
The script analyze_documents.py groups pages into complete documents (using the reconstruction logic) and sends the full text of each document to an AI model with instructions to generate:
Executive Summary
2-3 paragraph overview of the document's contents, written in plain language
Key Topics
Array of main subjects discussed (e.g., "Flight logs", "Financial transactions", "Witness testimony")
Key People
List of the most significant individuals mentioned, with brief role descriptions
Significance
Assessment of the document's importance and what makes it noteworthy
All analyses are saved to analyses.json and automatically incorporated into the searchable website. The script is resume-friendly: it skips documents already analyzed unless forced to re-run.
Example Analysis Output
{
"document_type": "Deposition",
"key_topics": [
"Flight logs",
"Private aircraft",
"Passenger manifests"
],
"key_people": [
{
"name": "Jeffrey Epstein",
"role": "Aircraft owner"
}
],
"significance": "Documents flight records showing...",
"summary": "This deposition contains testimony regarding..."
}
7. The Open-Source Technology Stack
The entire system is built on open-source tools and is itself open-source. Here's the technology breakdown:
Python Processing Layer
- ▸Python 3.x for all OCR and processing scripts
- ▸OpenAI-compatible API (OpenAI, OpenRouter, local models via LM Studio)
- ▸Concurrent processing with configurable worker pools
- ▸JSON-based data storage for portability and simplicity
Static Website Layer
- ▸Eleventy (11ty) static site generator
- ▸Nunjucks templates for dynamic page generation
- ▸GitHub Pages hosting (free, fast CDN)
- ▸GitHub Actions for automated builds on every commit
Fully Open Source
The entire codebase is on GitHub under the MIT license. Anyone can fork it, modify the prompts, use a different AI provider, or apply it to completely different document sets (corporate records, medical research, historical archives).
Repository: github.com/epstein-docs/epstein-docs.github.io
8. Performance at Scale
Processing 8,000+ documents with AI is not instant. Here's what the performance profile looks like:
OCR Processing
~2-5 seconds per page with 5 concurrent workers. Total: several hours for the full corpus.
Deduplication
Batches of 50 entities. ~10-30 seconds per batch. Runs once after OCR completes.
AI Analysis
~10-20 seconds per multi-page document. Total: a few hours for ~400 documents.
The system is resume-friendly. If processing is interrupted (network failure, rate limit, manual stop), the scripts automatically skip already-processed files on the next run. This is critical for large corpora where processing might take days.
Cost Considerations
Using cloud AI APIs (OpenAI GPT-4 Vision, Google Gemini) incurs per-token costs. For 8,000 images, expect costs in the range of $50-$200 depending on the model and provider. Using open-source local models (via LM Studio or Ollama) eliminates API costs but requires significant GPU resources.
9. Real-World Impact: Democratizing Access to Legal Information
This project is a case study in how AI can democratize access to information that was previously locked away. Before this archive existed, these documents were technically "public" but practically inaccessible. A journalist could spend weeks reading through scans with no ability to search across documents. A researcher trying to map networks of individuals would have to manually cross-reference hundreds of pages.
Now, anyone with a web browser can:
- • Search for any person, organization, or location across all documents
- • Read AI-generated summaries to identify relevant documents quickly
- • Browse by date, document type, or entity
- • Copy and quote exact text passages for analysis or reporting
Applications Beyond This Case Study
The techniques demonstrated here are applicable to any large-scale document digitization project:
- Historical archives: Digitizing museum collections, old newspapers, handwritten letters
- Corporate records: Making decades of internal documentation searchable
- Medical research: Extracting data from patient records or clinical trial documents
- Government transparency: FOIA releases, legislative records, meeting minutes
- Investigative journalism: Analyzing leaked documents, financial records, correspondence
10. Future Enhancement: Relationship Graphs
The project maintainers have outlined an ambitious next phase: relationship graphs. Instead of just listing entities, the system could visualize connections between people, organizations, and locations based on co-occurrence in documents.
Proposed Graph Types
- Co-occurrence network: People who appear together in the same documents. Edge weight = number of shared documents.
- Timeline view: Documents plotted by date with entities connected to each point in time.
- Organization membership: People connected to organizations they're associated with.
- Location network: People and organizations connected by geographic references.
Implementation would use client-side JavaScript graph libraries like D3.js or Cytoscape.js to render interactive visualizations. The data would be pre-generated during the build process and served as static JSON files, maintaining the project's philosophy of simple, fast, host-anywhere static sites.
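The co-occurrence network described above could be pre-generated at build time with a few lines of Python. This is a sketch of the proposed feature, not published project code:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(docs_people):
    """Build weighted edges from per-document entity lists: each pair of
    people sharing a document gains 1 edge weight per shared document."""
    edges = Counter()
    for people in docs_people:
        # sorted() gives each pair a stable orientation; set() drops repeats.
        for a, b in combinations(sorted(set(people)), 2):
            edges[(a, b)] += 1
    return dict(edges)
```

The resulting edge list serializes naturally to the static JSON files that D3.js or Cytoscape.js would consume client-side.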
The deduplication step becomes even more critical here: without it, "Epstein" and "Jeffrey Epstein" would appear as separate, disconnected nodes in the graph.
11. Conceptual Theory: AI as a Tool for Transparency
This project raises important questions about the role of AI in public interest technology. On one hand, AI-powered OCR and entity extraction dramatically reduce the barrier to accessing public records. On the other hand, there are legitimate concerns:
⚠️ OCR Accuracy
AI vision models are not perfect. They can misread handwriting, misinterpret symbols, or fail to extract text from heavily redacted pages. Users must verify critical information against original images.
⚠️ Entity Extraction Bias
NER models are trained on datasets that may have cultural or linguistic biases. Less common names, non-Western names, or unconventional spellings may be missed or misclassified.
⚠️ AI-Generated Summaries
Summaries are interpretations, not facts. The AI may emphasize certain aspects while omitting others. They are aids to understanding, not substitutes for reading source documents.
That said, the transparency of the process is a key strength. The code is open source. The prompts are visible. The raw OCR output (full JSON files) can be examined. And most importantly, the original document images are always linked; users can verify any claim against the source.
The Broader Principle
Public documents should be publicly accessible: not just legally available, but practically usable. AI is a tool to bridge the gap between "technically public" and "actually transparent". This project demonstrates that modern AI makes it possible for a small team or even a single developer to process document corpora that would have required institutional resources a decade ago.
12. Conclusion: A Blueprint for Open Legal Archives
This project is more than a technical demonstration. It's a proof of concept for a new model of legal transparency. In an era where courts are increasingly releasing documents as scanned PDFs, the ability to automatically process, index, and analyze these documents at scale is transformative.
The tools are accessible: Python, OpenAI APIs, static site generators. The cost is manageable: hundreds of dollars, not tens of thousands. The results are profound: thousands of documents that would have remained practically inaccessible are now searchable by anyone with an internet connection.
For developers, this is a template. The code is open source. You can fork it, adapt it, and apply it to any document corpus you care about. For journalists and researchers, it's a reminder that the tools to hold power accountable are getting better, faster, and more accessible.
From 8,000 Unreadable Images to a Living Archive
The transformation from a folder of scanned JPEGs to a fully searchable, entity-indexed, AI-analyzed archive represents what's possible when open-source tools, AI capabilities, and a commitment to transparency converge. This is the future of public records, and the code to build it is already available on GitHub.
References & Resources
- Epstein Files Archive – Live searchable archive at epstein-docs.github.io
- GitHub Repository – Full source code, processing scripts, and documentation at github.com/epstein-docs/epstein-docs.github.io
- OpenAI Vision API – Documentation for GPT-4 Vision model used for OCR: platform.openai.com/docs/guides/vision
- Eleventy (11ty) Static Site Generator – Official documentation: 11ty.dev
- D3.js Network Visualization – Force-directed graph examples: d3js.org
- Cytoscape.js – Graph theory library for network analysis: js.cytoscape.org
- GitHub Actions – CI/CD automation for static site deployment: docs.github.com/en/actions
- Named Entity Recognition (NER) – Overview of NER techniques in NLP: Jurafsky & Martin, Speech and Language Processing, Chapter 8
- Legal Technology and AI – Stanford CodeX report on AI applications in legal transparency (2024)
- Open Source Intelligence (OSINT) – Techniques for investigative research using public records and documents