Should I use PyPDF2 or pypdf for grant PDF text extraction?

Use pypdf. PyPDF2 is deprecated and unmaintained; its functionality was merged back into pypdf, which is the actively developed package. Install it with pip install pypdf and import PdfReader via 'from pypdf import PdfReader'. In this pipeline pypdf is used only to verify the embedded text layer before Camelot performs coordinate-based table extraction.

Is Camelot accuracy a parameter I pass to read_pdf?

No. Accuracy is an output, not an input. Camelot reports it as a property of each extracted table via table.parsing_report['accuracy'], a 0 to 100 confidence score available only after extraction. You read that value back to decide whether to trust the lattice result or fall back to the stream flavor; you never pass an accuracy argument to camelot.read_pdf.

When should I use Camelot lattice versus stream flavor for grant budgets?

Use lattice for budget tables with visible ruled gridlines, which is most structured federal and foundation templates, because it keys off those lines for precise cell boundaries. Fall back to stream when lattice returns no tables or a low accuracy score, since stream infers structure from whitespace alignment and tolerances like edge_tol and row_tol. Stream output should be flagged for manual review because it carries a higher column-misalignment risk.

Extracting Tables from Grant PDFs Using pypdf and Camelot

Deterministically extract budget tables from federal and foundation grant PDFs in Python: pypdf text-layer verification, Camelot lattice/stream routing, accuracy thresholds, SHA-256 audit manifests, schema-drift quarantine, and 2 CFR §200.302 alignment.

This guide is part of the PDF Grant Application Parsing section within the broader Data Ingestion & Grant Parsing Workflows framework, and it solves one narrow problem: how do you pull a budget table out of a federal or foundation grant PDF deterministically — same input, same DataFrame, every run — with an audit manifest an auditor can replay?

The naive answer — point camelot.read_pdf at the file and take tables[0].df — passes on the demo PDF and fails in production. It silently returns an empty list when the page has no vector gridlines, it cannot tell a digitally-born PDF from a scanned image, and it leaves no record of which extraction flavor produced which numbers. This guide builds an extractor that verifies the text layer first, routes between lattice and stream on a measured accuracy threshold, hashes the source, and quarantines anything that drifts from the registered budget schema.

When to Use This Approach

Reach for this extractor when all three conditions hold:

The PDF is digitally born, not scanned. Camelot reads the embedded text layer and its coordinate map; it cannot read pixels. A scanned application has to be rasterized and OCR’d first by the parent PDF Grant Application Parsing stage before it reaches this code. The text-layer check in Step 2 is the gate that enforces that precondition.
The output feeds a regulated artifact. Extracted budget figures eventually reconcile against 2 CFR §200.302 financial-management records, so a misaligned column or a silently dropped row is a compliance event, not a cosmetic glitch. Every extraction here emits an immutable manifest carrying the source hash, flavor, and accuracy score.
Table structure varies by funder. An NIH modular budget, an NSF cumulative budget, and a private-foundation line-item sheet have nothing in common structurally. Hardcoded column indices break on the second funder; the schema-drift gate in Step 5 validates by canonical name and quarantines the rest.

Network retrieval, OCR rasterization, currency normalization, and indirect-cost allocation are explicitly out of scope. Source retrieval cadence belongs to API Polling & Rate Limiting; concurrent dispatch belongs to Async Batch Processing Pipelines; canonical field translation and dtype casting belong to Field Mapping & Normalization; funder-specific export reconciliation belongs to Excel Budget Template Sync. This stage runs synchronously, one document at a time, so coordinate mapping and audit ordering stay deterministic.

Step-by-Step Implementation

The reference implementation targets Python 3.10+. Use pypdf for the text-layer probe — not the deprecated PyPDF2, which is unmaintained — and install Camelot with the cv extra so lattice mode has its OpenCV dependency. Camelot’s lattice flavor also shells out to Ghostscript for raster conversion, so install that at the system level:

bash

pip install "pypdf==4.2.0" "camelot-py[cv]==0.11.0" "pandas==2.2.2"
# lattice mode requires Ghostscript:
#   Debian/Ubuntu:  apt-get install ghostscript
#   macOS:          brew install ghostscript

One contract to internalize before writing any code: accuracy is an output, not an input. Camelot reports it as a property of each extracted table via table.parsing_report["accuracy"]; it is never a parameter you pass to camelot.read_pdf. The routing logic below reads that property to decide whether a result is trustworthy.
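A minimal sketch of that read-back contract, assuming only that the table object exposes a parsing_report dict the way a Camelot table does. The helper name read_accuracy and the stand-in table are illustrative and not part of Camelot:

```python
from types import SimpleNamespace
from typing import Any


def read_accuracy(table: Any) -> float:
    """Return the accuracy reported for an already-extracted table.

    Falls back to 0.0 when the report or the key is missing, so a caller
    comparing against a threshold treats "unknown" as "not trustworthy".
    """
    report = getattr(table, "parsing_report", None) or {}
    return float(report.get("accuracy", 0.0))


# Stand-in for a Camelot table; real code would pass tables[0] from read_pdf.
fake_table = SimpleNamespace(parsing_report={"accuracy": 91.5, "page": 1})
score = read_accuracy(fake_table)
```

Defaulting a missing score to 0.0 is a design choice: it pushes ambiguous results toward the fallback path instead of letting them pass a threshold check by accident.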

Step 1: Configure the audit logger and the extractor contract

Every extraction must be replayable from its logs, so configure a standard-library logger (never print) and a small class that fixes the input/output contract: in goes a validated PDF path, out comes a pandas.DataFrame plus an audit manifest dictionary.

python

import logging
import hashlib
from pathlib import Path
from typing import Dict, Optional, Tuple
import pandas as pd
import camelot
from pypdf import PdfReader
from camelot.core import TableList

# Configure structured audit logger
AUDIT_LOGGER = logging.getLogger("grant_extraction.audit")
AUDIT_LOGGER.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s | %(levelname)s | %(message)s"))
AUDIT_LOGGER.addHandler(handler)


class GrantBudgetExtractor:
    """Deterministic PDF budget table extractor with compliance audit logging."""

    def __init__(self, pdf_path: Path, accuracy_threshold: float = 65.0) -> None:
        self.pdf_path = pdf_path.resolve()
        self.accuracy_threshold = accuracy_threshold
        self.audit_manifest: Dict[str, object] = {}

accuracy_threshold defaults to 65.0 — Camelot’s reported accuracy is a 0–100 confidence score, and a lattice result below ~65 almost always means missing or low-DPI gridlines that warrant a stream retry. Tune it per funder corpus; raise it for clean federal templates, lower it for noisy scanned-then-OCR’d inputs.
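Per-funder tuning could be kept in one lookup rather than scattered across call sites. The sketch below is an assumption about how that might look: resolve_threshold, the corpus labels, and every override value are hypothetical placeholders, to be replaced with numbers measured on your own documents.

```python
from typing import Mapping, Optional

DEFAULT_THRESHOLD = 65.0

# Hypothetical overrides keyed by corpus label; the values are placeholders,
# not measured recommendations.
THRESHOLD_OVERRIDES: Mapping[str, float] = {
    "clean_federal_template": 80.0,
    "noisy_ocr": 50.0,
}


def resolve_threshold(
    corpus: Optional[str],
    overrides: Mapping[str, float] = THRESHOLD_OVERRIDES,
) -> float:
    """Pick the accuracy threshold for a corpus, defaulting to 65.0."""
    key = (corpus or "").strip().lower()
    value = overrides.get(key, DEFAULT_THRESHOLD)
    # The reported accuracy is a 0-100 score, so reject anything outside it.
    if not 0.0 <= value <= 100.0:
        raise ValueError(f"Threshold for {key!r} must be within 0-100, got {value}")
    return value
```

The resolved value would then be passed as accuracy_threshold when constructing GrantBudgetExtractor.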

Step 2: Hash the source and verify the text layer

2 CFR §200.302 requires financial records to be traceable, so compute a SHA-256 digest of the raw file before touching its contents — that digest is the immutable document identifier every downstream stage references. Then probe the first few pages with pypdf: if there is essentially no embedded text, Camelot has nothing to read and the document must be routed to OCR rather than silently returning an empty table.

python

    def _compute_pdf_hash(self) -> str:
        """Generate SHA-256 hash for audit trail reproducibility."""
        with open(self.pdf_path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def _verify_text_layer(self) -> bool:
        """Validate embedded text objects before coordinate mapping."""
        try:
            reader = PdfReader(str(self.pdf_path))
            sample_text = "".join([page.extract_text() or "" for page in reader.pages[:3]])
            return len(sample_text.strip()) > 50
        except Exception as exc:
            AUDIT_LOGGER.error(f"Text layer verification failed: {exc}")
            return False

The > 50 character floor is deliberately conservative: a born-digital application yields hundreds of characters across its first three pages, while a pure scan yields zero. Anything in between is suspicious and should be treated as OCR-bound rather than parsed for coordinates.
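The three outcomes described here (zero text, a suspicious trickle, a real text layer) can be made explicit so the audit log records why a document was routed to OCR. This is a sketch with hypothetical names, classify_text_layer and is_ocr_bound, using the same 50-character floor as _verify_text_layer:

```python
def classify_text_layer(sample_text: str, min_chars: int = 50) -> str:
    """Label the sampled text as 'scan', 'suspicious', or 'born_digital'."""
    char_count = len(sample_text.strip())
    if char_count == 0:
        return "scan"            # no embedded text at all
    if char_count <= min_chars:
        return "suspicious"      # some text, but too little to trust
    return "born_digital"


def is_ocr_bound(sample_text: str, min_chars: int = 50) -> bool:
    """Both 'scan' and 'suspicious' documents go to OCR, not to Camelot."""
    return classify_text_layer(sample_text, min_chars) != "born_digital"
```

Only the "born_digital" label corresponds to _verify_text_layer returning True; the other two differ only in what gets written to the log.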

Step 3: Extract with lattice and capture the parsing report

Run the primary extraction in lattice flavor, which keys off the ruled gridlines that structured budget tables almost always carry. Critically, read accuracy back from tables[0].parsing_report and fold it, along with the flavor, row count, and detected columns, into the audit manifest. An empty TableList is an explicit signal to fall back, not an exception to swallow: the method logs it and raises ValueError so the caller can route the document to the stream fallback in Step 4.

python

    def extract_budget_table(self, page_range: str = "all") -> Tuple[pd.DataFrame, Dict[str, object]]:
        """Execute extraction with explicit error handling and audit logging."""
        self.audit_manifest["pdf_hash"] = self._compute_pdf_hash()
        self.audit_manifest["pdf_path"] = str(self.pdf_path)

        if not self._verify_text_layer():
            raise RuntimeError("PDF lacks embedded text layer. Route to OCR preprocessing stage.")

        try:
            tables: TableList = camelot.read_pdf(
                filepath=str(self.pdf_path),
                flavor="lattice",
                pages=page_range,
                process_background=True,
            )
            if len(tables) == 0:
                raise ValueError("No tables detected. Triggering fallback routing.")

            df: pd.DataFrame = tables[0].df
            # accuracy is a property of the Table object, not an input parameter
            parsing_report: Dict = tables[0].parsing_report
            accuracy: float = parsing_report.get("accuracy", 0.0)

            self.audit_manifest.update({
                "extraction_flavor": "lattice",
                "accuracy_score": accuracy,
                "rows_extracted": len(df),
                "columns_detected": list(df.columns)
            })

            AUDIT_LOGGER.info(
                f"Extraction complete | Hash: {str(self.audit_manifest['pdf_hash'])[:10]}... | "
                f"Accuracy: {accuracy:.2f}% | Rows: {len(df)}"
            )
            return df, self.audit_manifest

        except Exception as exc:
            AUDIT_LOGGER.error(f"Extraction failed: {exc}")
            raise

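process_background=True tells Camelot's lattice parser to also detect ruling lines drawn in the page background (the flag defaults to False). Funder budget templates that tint subtotal rows often draw their borders that way, and without the flag lattice can miss those cell boundaries.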
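The manifest assembled in this step is a plain dict, which any later stage can mutate. One way to make a finished entry read-only in-process is a frozen dataclass. This is a sketch, not part of the extractor above: ManifestEntry and freeze_manifest are hypothetical names, and only four of the manifest keys are carried over.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ManifestEntry:
    """Read-only snapshot of the fields an auditor needs to replay a run."""
    pdf_hash: str
    extraction_flavor: str
    accuracy_score: Optional[float]
    rows_extracted: int


def freeze_manifest(manifest: Dict[str, object]) -> ManifestEntry:
    """Copy the mutable manifest dict into an immutable entry."""
    accuracy = manifest.get("accuracy_score")
    return ManifestEntry(
        pdf_hash=str(manifest["pdf_hash"]),
        extraction_flavor=str(manifest["extraction_flavor"]),
        accuracy_score=None if accuracy is None else float(accuracy),
        rows_extracted=int(manifest["rows_extracted"]),
    )


entry = freeze_manifest({
    "pdf_hash": "ab" * 32,
    "extraction_flavor": "lattice",
    "accuracy_score": 88.0,
    "rows_extracted": 14,
})
```

Assigning to any field of entry raises FrozenInstanceError. That guards against accidental mutation inside the process; durable immutability still depends on where the logs are stored.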

Step 4: Route to stream fallback below the accuracy threshold

Lattice fails when gridlines are absent or rasterized at low DPI. The stream flavor recovers structure from whitespace alignment instead, at the cost of column-misalignment risk, so a stream result is downgraded to “manual review” rather than trusted outright. Camelot does populate parsing_report for stream tables as well, but that score describes how well text fit the inferred cells rather than agreement with ruled gridlines, so this pipeline does not gate on it: the manifest records None for accuracy_score and sets an explicit fallback_triggered flag for the audit trail retained under 2 CFR §200.334.

python

def route_extraction_fallback(
    extractor: GrantBudgetExtractor,
    initial_df: Optional[pd.DataFrame] = None,
    initial_accuracy: float = 0.0
) -> Tuple[pd.DataFrame, Dict[str, object]]:
    """Route to stream extraction if lattice accuracy falls below compliance threshold."""
    if initial_df is not None and initial_accuracy >= extractor.accuracy_threshold:
        return initial_df, extractor.audit_manifest

    AUDIT_LOGGER.warning("Accuracy threshold breached. Initiating stream fallback.")
    try:
        tables: TableList = camelot.read_pdf(
            filepath=str(extractor.pdf_path),
            flavor="stream",
            pages="all",
            edge_tol=50,
            row_tol=10
        )
        if len(tables) == 0:
            raise ValueError("Stream extraction yielded zero tables. Quarantine document.")

        df: pd.DataFrame = tables[0].df
        extractor.audit_manifest.update({
            "extraction_flavor": "stream",
            "accuracy_score": None,  # Deliberately not recorded for stream; result goes to manual review
            "fallback_triggered": True,
            "rows_extracted": len(df)
        })
        AUDIT_LOGGER.info("Stream fallback successful. Document routed for manual review.")
        return df, extractor.audit_manifest

    except Exception as exc:
        extractor.audit_manifest["extraction_status"] = "FAILED"
        extractor.audit_manifest["error_trace"] = str(exc)
        AUDIT_LOGGER.critical(f"Fallback failed: {exc}")
        raise

edge_tol=50 pins Camelot's default tolerance for extending text edges vertically when it detects table regions, and row_tol=10 (up from the default of 2) controls how far apart vertically text may sit and still be merged into one row. Both are the levers to tune when stream output splits or merges budget lines incorrectly.
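The routing rule itself is small enough to isolate as a pure function, which makes it testable without a PDF. This is a sketch under the assumptions already stated in this step: an empty lattice result or a score below the threshold triggers stream. The name needs_stream_fallback is hypothetical.

```python
from typing import Optional


def needs_stream_fallback(
    table_count: int,
    accuracy: Optional[float],
    threshold: float,
) -> bool:
    """Decide whether a lattice result should be retried with stream."""
    if table_count == 0:
        return True      # lattice found no ruled tables at all
    if accuracy is None:
        return True      # no score to trust
    return accuracy < threshold
```

A score exactly at the threshold is accepted, matching the >= comparison in route_extraction_fallback.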

Step 5: Validate against the canonical budget schema

Grant budget tables exhibit high structural variance, so never trust positional indices. Normalize the detected column names, compare them against the required canonical set, and quarantine the table on any missing column rather than letting downstream stages propagate NaN. A drift quarantine returns an empty DataFrame plus a manifest flag, which is the handoff contract into Error Categorization & Logging.

python

from typing import Set

REQUIRED_BUDGET_COLUMNS: Set[str] = {"line item", "fy2025", "fy2026", "total"}


def validate_schema_drift(
    df: pd.DataFrame,
    manifest: Dict[str, object],
    required_cols: Set[str] = REQUIRED_BUDGET_COLUMNS
) -> Tuple[pd.DataFrame, Dict[str, object]]:
    """Validate column presence and enforce compliance quarantine on drift."""
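    # Camelot labels columns 0..n-1 and leaves the header text in the first row,
    # so promote that row to the header before comparing names.
    if not df.empty and pd.api.types.is_integer_dtype(df.columns):
        df = df.iloc[1:].set_axis(df.iloc[0].astype(str).tolist(), axis=1).reset_index(drop=True)
    detected_cols = {str(col).strip().lower() for col in df.columns}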

    missing = required_cols - detected_cols
    if missing:
        manifest["validation_status"] = "QUARANTINED"
        manifest["missing_columns"] = list(missing)
        AUDIT_LOGGER.warning(
            f"Schema drift detected | Missing: {missing} | "
            f"Routing to manual reconciliation queue."
        )
        return pd.DataFrame(), manifest

    manifest["validation_status"] = "PASSED"
    AUDIT_LOGGER.info("Schema validation passed. Ready for Field Mapping & Normalization.")
    return df, manifest

A PASSED manifest is the only thing Field Mapping & Normalization will accept; everything else is explicit quarantine, never a silent partial table.
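Header cells extracted from PDFs often contain line breaks or doubled spaces, which would make a simple strip-and-lowercase comparison report false drift. The sketch below shows one way to normalize labels before the set comparison. normalize_header and missing_columns are hypothetical helpers; the required set mirrors REQUIRED_BUDGET_COLUMNS.

```python
import re
from typing import Iterable, Set

REQUIRED: Set[str] = {"line item", "fy2025", "fy2026", "total"}


def normalize_header(label: object) -> str:
    """Collapse internal whitespace (including newlines), trim, lowercase."""
    return re.sub(r"\s+", " ", str(label)).strip().lower()


def missing_columns(headers: Iterable[object], required: Set[str] = REQUIRED) -> Set[str]:
    """Return the canonical names that are absent from the detected headers."""
    detected = {normalize_header(header) for header in headers}
    return set(required) - detected
```

An empty result means the table can proceed; any non-empty result is the list that belongs in the manifest's missing_columns field.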

Step 6: Chunk large applications to hold memory flat

A 50-plus-page consolidated application will OOM a constrained CI runner because Camelot holds the full coordinate map per call. Iterate the PDF in fixed page windows, pre-filter each window by budget keyword so non-budget pages are skipped, and force garbage collection between chunks. This is what lets Async Batch Processing Pipelines scale workers horizontally without memory fragmentation.

python

import gc
from typing import Iterator, List


def chunked_budget_extraction(
    pdf_path: Path,
    budget_keywords: List[str],
    chunk_size: int = 5
) -> Iterator[Tuple[pd.DataFrame, Dict[str, object]]]:
    """Process large grant PDFs via page-level chunking with explicit memory isolation."""
    reader = PdfReader(str(pdf_path))
    total_pages = len(reader.pages)

    for start_idx in range(0, total_pages, chunk_size):
        end_idx = min(start_idx + chunk_size, total_pages)
        page_range = f"{start_idx + 1}-{end_idx}"

        # Keyword pre-filter to skip non-budget pages
        page_text = "".join(
            [reader.pages[i].extract_text() or "" for i in range(start_idx, end_idx)]
        )
        if not any(kw.lower() in page_text.lower() for kw in budget_keywords):
            continue

        extractor = GrantBudgetExtractor(pdf_path, accuracy_threshold=65.0)
        try:
            df, manifest = extractor.extract_budget_table(page_range=page_range)
            yield df, manifest
        except Exception as exc:
            # Keep the page window and the reason so the skip is auditable
            yield pd.DataFrame(), {"status": "SKIPPED", "pages": page_range, "error": str(exc)}
        finally:
            del extractor
            gc.collect()

Each chunk yields an independent DataFrame, which prevents cross-page coordinate bleed and keeps pagination deterministic regardless of document length.
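The windowing arithmetic is the part most prone to off-by-one errors, because the loop works with 0-based indices while the page range string is 1-based and inclusive. Isolated as a generator, it can be checked on its own. page_windows is a hypothetical helper that mirrors the loop in chunked_budget_extraction:

```python
from typing import Iterator


def page_windows(total_pages: int, chunk_size: int = 5) -> Iterator[str]:
    """Yield 1-based, inclusive page ranges such as '1-5', '6-10', '11-12'."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start_idx in range(0, total_pages, chunk_size):
        end_idx = min(start_idx + chunk_size, total_pages)
        # start_idx is 0-based, so +1 gives the first page of the window;
        # end_idx is already the 1-based number of the last page.
        yield f"{start_idx + 1}-{end_idx}"


windows = list(page_windows(12, chunk_size=5))
```

For a 12-page document with a window of 5, this produces three ranges, the last one shorter than the rest.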

Verification

Confirm the extractor behaves deterministically with four checks:

The text-layer gate routes scans away. Feed a scanned-image PDF (no embedded text) and assert extract_budget_table raises RuntimeError referencing OCR, and that no DataFrame is produced. This proves the pypdf probe runs before any coordinate mapping.
The accuracy threshold triggers fallback. Run a borderline lattice document and assert that when initial_accuracy is below accuracy_threshold, the manifest ends with extraction_flavor == "stream", fallback_triggered is True, and accuracy_score is None.
Schema drift quarantines instead of dropping. Pass a DataFrame missing the total column to validate_schema_drift and assert it returns an empty DataFrame, validation_status == "QUARANTINED", and a populated missing_columns list.
The hash ties the manifest to the source. Assert audit_manifest["pdf_hash"] is present and that re-hashing the same file reproduces the digest byte-for-byte — the proof an auditor uses to bind a DataFrame to exactly one input.
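The hash check in the last item can be sketched with the standard library alone. The version below reads the file in blocks instead of all at once, which is an optional variation on _compute_pdf_hash for very large PDFs; sha256_file and the 1 MiB block size are assumptions, not part of the extractor above.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size blocks so memory use stays flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def digest_is_reproducible(path: Path) -> bool:
    """Re-hash the same file and confirm both digests match."""
    return sha256_file(path) == sha256_file(path)
```

Block-wise hashing produces the same digest as hashing the whole byte string in one call, so manifests written by either approach remain comparable.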

A compliant run emits one structured INFO audit line carrying the truncated hash, accuracy, and row count; a fallback emits a WARNING; a quarantine emits a WARNING with the missing columns. Ship those logs to a write-once tier so the trail satisfies the three-year retention period under 2 CFR §200.334.
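If the manifest itself is shipped alongside those log lines, serializing it canonically keeps two runs over the same input byte-identical. This is a sketch assuming JSON is an acceptable log payload; manifest_to_log_line is a hypothetical helper.

```python
import json
from typing import Dict


def manifest_to_log_line(manifest: Dict[str, object]) -> str:
    """Serialize a manifest with sorted keys and fixed separators.

    default=str stringifies values json cannot encode (for example Path),
    so the call does not fail on them.
    """
    return json.dumps(manifest, sort_keys=True, separators=(",", ":"), default=str)


line = manifest_to_log_line({"extraction_flavor": "stream", "accuracy_score": None})
```

Sorting keys removes dict insertion order as a source of spurious differences when two manifests are compared line by line.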

python

extractor = GrantBudgetExtractor(Path("NIH_R01_application.pdf"), accuracy_threshold=65.0)
try:
    df, manifest = extractor.extract_budget_table(page_range="all")
except ValueError:
    # Lattice found no tables: hand the fallback router an empty result
    df, manifest = None, extractor.audit_manifest
df, manifest = route_extraction_fallback(extractor, df, manifest.get("accuracy_score") or 0.0)
df, manifest = validate_schema_drift(df, manifest)
assert manifest["validation_status"] in {"PASSED", "QUARANTINED"}
assert len(manifest["pdf_hash"]) == 64

Common Errors & Fixes

Error	Cause	Fix
`camelot.read_pdf` returns an empty `TableList`	Page has no ruled gridlines, so `lattice` finds no table boundaries	Catch the zero-length list and route to `stream` via `route_extraction_fallback`; tune `edge_tol`/`row_tol` for the funder.
`RuntimeError: PDF lacks embedded text layer`	The application is a scan, not born-digital	Send it to OCR rasterization in PDF Grant Application Parsing before this stage; never pass pixels to Camelot.
`TypeError`/`KeyError` passing `accuracy=` to `read_pdf`	Treating accuracy as an input parameter	Accuracy is an output — read it from `tables[0].parsing_report["accuracy"]` after extraction.
`GhostscriptNotFound` / lattice import error	Ghostscript missing or Camelot installed without the `cv` extra	Install `camelot-py[cv]` and the system `ghostscript` package; verify with `gs --version`.
OOM kill on a 50+ page application	Whole-document coordinate map held in memory	Use `chunked_budget_extraction` to window pages, pre-filter by keyword, and `gc.collect()` per chunk; dispatch via Async Batch Processing Pipelines.
Columns silently misaligned downstream	Positional indices assumed across funders with different layouts	Validate by canonical name with `validate_schema_drift` and quarantine on missing columns instead of trusting position.
`ModuleNotFoundError: PyPDF2`	Depending on the deprecated, unmaintained package	Install and import `pypdf`; the API is `from pypdf import PdfReader`.
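Several rows in this table come down to a missing dependency, which a preflight check can surface before any PDF is opened. The sketch below uses only the standard library. missing_dependencies is a hypothetical helper, and the default names are assumptions: cv2 is the import name for OpenCV, and gs is the usual Ghostscript executable on Linux and macOS (Windows installs use a different name).

```python
import importlib.util
import shutil
from typing import Callable, List, Optional, Sequence


def missing_dependencies(
    modules: Sequence[str] = ("pypdf", "camelot", "cv2"),
    binaries: Sequence[str] = ("gs",),
    find_module: Callable[[str], Optional[object]] = importlib.util.find_spec,
    find_binary: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """List importable modules and executables that cannot be found.

    The lookup functions are injectable so the check can be unit-tested
    without installing or uninstalling anything.
    """
    missing = [f"module:{name}" for name in modules if find_module(name) is None]
    missing += [f"binary:{name}" for name in binaries if find_binary(name) is None]
    return missing
```

Calling missing_dependencies() at worker startup and refusing to start on a non-empty result turns a mid-batch import failure into an immediate configuration error.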

Related

Parent section: PDF Grant Application Parsing
Where extracted columns get canonical names and dtypes: Field Mapping & Normalization
Where quarantine manifests are triaged: Error Categorization & Logging
The spreadsheet counterpart to this PDF path: Automating Excel to CSV Conversion for Budget Tracking
When volume bursts past one document at a time: Building Async Batch Processors for Grant Submissions
