PDF Grant Application Parsing

Deterministically extract narrative text, metadata, and budget tables from funder PDF grant submissions in Python: python-magic MIME gating, SHA-256 anchoring, pdfplumber/Camelot extraction, pydantic v2 schema contracts, pytest verification, and 2 CFR 200 audit alignment.

This guide is part of the Data Ingestion & Grant Parsing Workflows reference. PDF Grant Application Parsing is the discrete ingestion gate that accepts staged binary PDF grant submissions, proves their integrity, extracts narrative text, metadata, and budget tables deterministically, and emits either a validated canonical payload or a deterministic quarantine directive.

The scope is strictly confined to the intake, extraction, and structural validation of PDF artifacts. This stage does not perform network acquisition, currency conversion, fund reconciliation, general-ledger posting, or jurisdictional rule adjudication. Portal-sourced files arrive only after API Polling & Rate Limiting has materialized them locally; spreadsheet budgets belong to Excel Budget Template Sync; canonical field translation and aliasing belong to Field Mapping & Normalization; rule evaluation against funder and federal obligations belongs to the Core Architecture & Compliance Mapping reference, which consumes the payloads this gate produces. Every code path here terminates at exactly one of three outcomes: a validated schema emission, a deterministic quarantine routing that preserves the original artifact, or a flag for manual review.

Prerequisites

This gate targets Python 3.11+ and the pydantic v2 API. Pin dependencies so extraction is reproducible across grantor templates and across audit re-runs: an unpinned pdfplumber bump can silently change table geometry and break determinism.

text

# requirements.txt — pinned for reproducible extraction
pdfplumber==0.11.4
camelot-py[cv]==0.11.0
pypdf==4.3.1
python-magic==0.4.27
pydantic==2.9.2

System-level dependencies: libmagic1 (for python-magic MIME sniffing) and ghostscript (Camelot’s lattice flavor renders pages through it). Environment variables consumed by this stage:

Variable	Purpose	Example
`GRANT_PDF_STAGING_PATH`	Read-only directory of staged binaries awaiting extraction	`/var/grant/staging`
`GRANT_PDF_QUARANTINE_PATH`	Write target for rejected artifacts + rejection manifests	`/var/grant/quarantine`
`GRANT_PDF_MAX_BYTES`	Hard ceiling rejecting oversized payloads before render	`52428800`
`GRANT_CONFIDENCE_FLOOR`	Confidence score below which a page routes to manual review	`0.65`

Upstream stage dependency: a file is only eligible for parsing once it has been securely staged in GRANT_PDF_STAGING_PATH with an attached acquisition context (request ID, timestamp, source channel) produced by API Polling & Rate Limiting. This gate never reaches out to the network; coupling extraction latency to acquisition latency is an explicit anti-goal.
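The environment variables in the table above can be read once at start-up and validated before any file is touched. The following is a minimal sketch, not the pipeline's actual loader: the `GateSettings` dataclass and the `load_settings` and `exceeds_size_ceiling` helpers are hypothetical names introduced here for illustration.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class GateSettings:
    """Hypothetical container for the four variables this stage consumes."""
    staging_path: Path
    quarantine_path: Path
    max_bytes: int
    confidence_floor: float


def load_settings(env: Mapping[str, str] = os.environ) -> GateSettings:
    """Read the variables and fail fast on missing or malformed values."""
    settings = GateSettings(
        staging_path=Path(env["GRANT_PDF_STAGING_PATH"]),
        quarantine_path=Path(env["GRANT_PDF_QUARANTINE_PATH"]),
        max_bytes=int(env["GRANT_PDF_MAX_BYTES"]),
        confidence_floor=float(env["GRANT_CONFIDENCE_FLOOR"]),
    )
    if settings.max_bytes <= 0:
        raise ValueError("GRANT_PDF_MAX_BYTES must be a positive integer")
    if not 0.0 <= settings.confidence_floor <= 1.0:
        raise ValueError("GRANT_CONFIDENCE_FLOOR must be within [0.0, 1.0]")
    return settings


def exceeds_size_ceiling(file_path: Path, max_bytes: int) -> bool:
    """Compare the size from file metadata, before any bytes are read."""
    return file_path.stat().st_size > max_bytes
```

Checking the size through `stat()` keeps an oversized payload from ever being loaded into memory, which is the point of a ceiling applied before render.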

Document Intake & Pre-Flight Validation

The parsing pipeline begins only after a PDF has been staged. Before any text or table extraction, explicit pre-flight validation enforces integrity and security boundaries so that no malformed or hostile artifact reaches the extraction engines.

MIME & extension verification. Validate the application/pdf magic bytes using python-magic rather than trusting the file extension. Reject mismatches with a structured error.
Cryptographic hashing. Generate a SHA-256 digest of the raw binary stream per NIST SP 800-107 Rev. 1. Store the digest as the immutable document_id for every downstream audit record.
Specification compliance. Parse the %PDF-1.x / %PDF-2.x header and reject encrypted, password-protected, or non-compliant artifacts that violate the gate’s security posture.
Upstream synchronization. Attach the original acquisition request ID and timestamp to the parsing context to preserve traceability without coupling ingestion latency to extraction logic.

python

import hashlib
import logging
from pathlib import Path
from typing import Any, Dict, Optional
from pydantic import BaseModel

logger = logging.getLogger("grant_parser.preflight")


class PreFlightResult(BaseModel):
    document_id: str
    mime_type: str
    pdf_version: Optional[str]
    is_encrypted: bool
    upstream_context: Dict[str, Any]
    validation_passed: bool


class PreFlightRejection(Exception):
    """Structured pre-flight failure carrying a machine-readable reason code."""

    def __init__(self, code: str, detail: str) -> None:
        self.code = code
        self.detail = detail
        super().__init__(f"{code}: {detail}")


def audit_log(event: str, doc_id: str, metadata: Dict[str, Any]) -> None:
    """Append-only structured audit hook for compliance traceability."""
    logger.info(
        "%s | doc_id=%s | framework=NIST_SP_800-53_AU-2 | metadata=%s",
        event, doc_id, metadata,
    )


def run_preflight_validation(
    file_path: Path, upstream_context: Dict[str, Any]
) -> PreFlightResult:
    import magic  # python-magic; libmagic1 must be present

    raw_bytes = file_path.read_bytes()
    doc_id = hashlib.sha256(raw_bytes).hexdigest()
    audit_log("preflight_initiated", doc_id, upstream_context)

    detected_mime = magic.from_buffer(raw_bytes, mime=True)
    if detected_mime != "application/pdf":
        audit_log("preflight_failed", doc_id, {"reason": "ERR_MIME_MISMATCH", "detected": detected_mime})
        raise PreFlightRejection("ERR_MIME_MISMATCH", f"expected application/pdf, got {detected_mime}")

    header_line = raw_bytes[:10].decode("ascii", errors="ignore")
    if not (header_line.startswith("%PDF-1.") or header_line.startswith("%PDF-2.")):
        audit_log("preflight_failed", doc_id, {"reason": "ERR_INVALID_HEADER"})
        raise PreFlightRejection("ERR_INVALID_HEADER", "non-compliant PDF header")

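    # Encryption check via keyword scan. The /Encrypt entry lives in the trailer,
    # which normally sits at the end of the file, so scan the whole buffer rather
    # than only the first 4 KB. This is a conservative heuristic: it can also
    # match the literal bytes inside an uncompressed stream.
    if b"/Encrypt" in raw_bytes:
        audit_log("preflight_failed", doc_id, {"reason": "ERR_ENCRYPTED"})
        raise PreFlightRejection("ERR_ENCRYPTED", "encrypted PDFs violate pipeline security posture")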

    pdf_version = header_line.split("-")[-1].split()[0] if "-" in header_line else None
    audit_log("preflight_passed", doc_id, {"mime": detected_mime, "header": header_line})
    return PreFlightResult(
        document_id=doc_id,
        mime_type=detected_mime,
        pdf_version=pdf_version,
        is_encrypted=False,
        upstream_context=upstream_context,
        validation_passed=True,
    )

Deterministic fallback: when pre-flight raises a PreFlightRejection, route the file to GRANT_PDF_QUARANTINE_PATH with a machine-readable rejection manifest carrying the reason code. Never attempt partial parsing of a rejected artifact — the classification and routing of that directive is owned by Error Categorization & Logging.
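The quarantine routing described above can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: the function name `quarantine_artifact`, the per-document subdirectory layout, and the manifest file name `rejection_manifest.json` are hypothetical, and the caller is assumed to pass the SHA-256 digest as `document_id`.

```python
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict


def quarantine_artifact(
    file_path: Path,
    quarantine_dir: Path,
    document_id: str,
    reason_code: str,
    detail: str,
    upstream_context: Dict[str, Any],
) -> Path:
    """Copy the untouched binary into quarantine and write a JSON manifest beside it."""
    # Assumed layout: one directory per document digest.
    target_dir = quarantine_dir / document_id
    target_dir.mkdir(parents=True, exist_ok=True)

    # Copy rather than move: the staging directory is read-only for this gate,
    # and the original bytes must survive unchanged.
    shutil.copy2(file_path, target_dir / file_path.name)

    manifest = {
        "document_id": document_id,
        "reason_code": reason_code,
        "detail": detail,
        "original_filename": file_path.name,
        "upstream_context": upstream_context,
        "quarantined_at": datetime.now(timezone.utc).isoformat(),
    }
    manifest_path = target_dir / "rejection_manifest.json"
    # default=str keeps non-JSON values in the context (e.g. datetimes) serializable.
    manifest_path.write_text(
        json.dumps(manifest, sort_keys=True, indent=2, default=str),
        encoding="utf-8",
    )
    return manifest_path
```

A caller would catch `PreFlightRejection` and pass its `code` and `detail` attributes straight through, so the manifest carries the same reason code that was written to the audit log.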

Core Implementation: Canonical Extraction

This stage isolates the mechanical extraction of narrative text, embedded metadata, and tabular budget data. Extraction follows a strict priority chain so output is reproducible across funder templates: pdfplumber for the text layer and coordinate-aware table detection, with Camelot reserved for merged-cell and multi-column budget matrices.

Text layer extraction. Extract raw text streams with pdfplumber, normalize whitespace, strip zero-width characters, and segment by page boundary.
Table boundary detection. Apply coordinate-based parsing to isolate budget matrices, personnel allocations, and indirect-cost schedules. For complex merged cells and multi-column layouts, defer to the methodology in Extracting tables from grant PDFs using PyPDF2 and Camelot.
Metadata isolation. Extract embedded XMP metadata and document info — author, creation timestamp — and cross-reference against narrative headers to identify grant title, deadline, and funding agency.
Confidence scoring. Assign per-page and per-table confidence metrics. Any segment scoring below GRANT_CONFIDENCE_FLOOR routes to manual review rather than silently emitting low-trust data.

python

import logging
import re
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List

import pdfplumber

logger = logging.getLogger("grant_parser.extract")

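# Zero-width space (U+200B), ZWNJ (U+200C), ZWJ (U+200D), BOM (U+FEFF).
# Written as escapes so the invisible characters survive copy and paste.
_ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\ufeff]")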
_WHITESPACE = re.compile(r"\s+")


@dataclass
class ExtractionPayload:
    document_id: str
    full_text: str
    tables: List[Dict[str, Any]]
    metadata: Dict[str, Any]
    confidence_scores: Dict[str, float]
    audit_trail: List[Dict[str, Any]] = field(default_factory=list)


def normalize_text(raw: str) -> str:
    """Strip zero-width characters, collapse whitespace runs to one space, and trim."""
    cleaned = _ZERO_WIDTH.sub("", raw)
    return _WHITESPACE.sub(" ", cleaned).strip()


def extract_structured_content(file_path: Path, doc_id: str) -> ExtractionPayload:
    audit_trail: List[Dict[str, Any]] = []
    tables: List[Dict[str, Any]] = []
    confidence_scores: Dict[str, float] = {}
    text_segments: List[str] = []
    page_count = 0

    with pdfplumber.open(file_path) as pdf:
        page_count = len(pdf.pages)
        for page_num, page in enumerate(pdf.pages, start=1):
            normalized = normalize_text(page.extract_text() or "")
            text_segments.append(normalized)
            confidence_scores[f"page_{page_num}_text"] = 1.0 if len(normalized) > 50 else 0.3

            for t_idx, table in enumerate(page.extract_tables()):
                if not table or len(table) < 2:
                    continue
                is_budget = any(
                    row and ("budget" in str(row[0]).lower() or "cost" in str(row[0]).lower())
                    for row in table
                )
                tables.append({
                    "page": page_num,
                    "table_index": t_idx,
                    "headers": table[0],
                    "rows": table[1:],
                    "is_budget_matrix": is_budget,
                })
                confidence_scores[f"page_{page_num}_table_{t_idx}"] = 0.9 if is_budget else 0.6

            audit_trail.append({
                "event": "page_parsed",
                "page": page_num,
                "text_length": len(normalized),
            })

    logger.info("extraction_complete | doc_id=%s | pages=%d | tables=%d", doc_id, page_count, len(tables))
    return ExtractionPayload(
        document_id=doc_id,
        full_text="\n".join(text_segments),
        tables=tables,
        metadata={"page_count": page_count, "extraction_engine": "pdfplumber==0.11.4"},
        confidence_scores=confidence_scores,
        audit_trail=audit_trail,
    )
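The extraction function records confidence scores but does not act on them. A minimal sketch of the routing step follows, assuming the score mapping has the same key shape as `confidence_scores` above; the helper name `partition_by_confidence` is hypothetical.

```python
from typing import Dict, List, Tuple


def partition_by_confidence(
    confidence_scores: Dict[str, float], floor: float
) -> Tuple[List[str], List[str]]:
    """Split segment keys into (trusted, needs_manual_review) around the floor."""
    trusted: List[str] = []
    review: List[str] = []
    # Sorted iteration keeps the output order deterministic across runs.
    for key in sorted(confidence_scores):
        if confidence_scores[key] >= floor:
            trusted.append(key)
        else:
            review.append(key)
    return trusted, review
```

With the heuristic scores assigned above and the example floor of 0.65, every table not recognized as a budget matrix (scored 0.6) and every page with 50 or fewer characters of text (scored 0.3) lands in the review list.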

All extraction operations are stateless and read-only against the staged artifact. No downstream module may mutate the raw ExtractionPayload; transformations occur exclusively in the isolated normalization layer.
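Metadata isolation is listed as an extraction step but is not shown in the code above. pdfplumber exposes the document info dictionary as `pdf.metadata`, and its date entries (for example `CreationDate`) usually arrive as PDF date strings such as `D:20240315093000-05'00'`. The helper below is a sketch of parsing that format into a `datetime`; the name `parse_pdf_date` is hypothetical, and malformed values return `None` rather than raising.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# PDF date string: D:YYYYMMDDHHmmSS followed by Z or a +HH'mm' / -HH'mm' offset.
# Every component after the year is optional.
_PDF_DATE = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?P<tz>[Zz](?:\d{2}'?\d{2}'?)?|[+-]\d{2}(?:'?\d{2})?'?)?$"
)


def parse_pdf_date(value: Optional[str]) -> Optional[datetime]:
    """Parse a PDF date string; return None when it is absent or malformed."""
    if not value:
        return None
    match = _PDF_DATE.match(value.strip())
    if match is None:
        return None
    parts = match.groupdict()
    try:
        tz_raw = parts["tz"]
        if tz_raw is None:
            tzinfo = None  # no offset given: keep the datetime naive
        elif tz_raw[0] in "Zz":
            tzinfo = timezone.utc
        else:
            digits = tz_raw.replace("'", "")
            offset = timedelta(hours=int(digits[1:3]), minutes=int(digits[3:5] or 0))
            tzinfo = timezone(offset if digits[0] == "+" else -offset)
        return datetime(
            int(parts["year"]),
            int(parts["month"] or 1),
            int(parts["day"] or 1),
            int(parts["hour"] or 0),
            int(parts["minute"] or 0),
            int(parts["second"] or 0),
            tzinfo=tzinfo,
        )
    except ValueError:
        # Out-of-range components (month 13, offset of 99 hours, ...).
        return None
```

Returning `None` instead of raising lets the caller record a missing creation timestamp as a low-confidence field rather than aborting extraction of an otherwise readable document.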

Field Mapping & Schema Contract

Extracted strings are raw funder vocabulary — they must be reconciled against a canonical schema before they can be trusted. This gate emits raw values keyed to canonical names; organization-specific synonym dictionaries and geocoding are explicitly out of scope and belong to Field Mapping & Normalization. The alias table below resolves the most common header variants seen across SF-424 and foundation templates.

Canonical field	Common funder aliases	Type coercion rule
`grant_title`	“Project Title”, “Program Name”, “Title of Proposal”	trimmed `str`, collapse internal whitespace
`applicant_name`	“Organization”, “Applicant/Recipient”, “Legal Name”	trimmed `str`
`total_budget`	“Total Request”, “Amount Requested”, “Total Project Cost”	strip `$`/`,`, parse to `Decimal`, round 2dp
`indirect_cost_rate`	“Indirect Rate”, “F&A Rate”, “Overhead %”	percent → `float` in `[0.0, 1.0]`
`period_of_performance`	“Project Period”, “Grant Period”, “Duration”	ISO-8601 date range
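The alias and coercion rules in the table can be sketched as a lookup plus two small parsers. This is an illustration only: the `ALIASES` mapping simply transcribes the table, and the helper names `canonical_field`, `coerce_total_budget`, and `coerce_indirect_rate` are hypothetical.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Dict, Optional

# Lower-cased funder header -> canonical field name, transcribed from the table.
ALIASES: Dict[str, str] = {
    "project title": "grant_title",
    "program name": "grant_title",
    "title of proposal": "grant_title",
    "organization": "applicant_name",
    "applicant/recipient": "applicant_name",
    "legal name": "applicant_name",
    "total request": "total_budget",
    "amount requested": "total_budget",
    "total project cost": "total_budget",
    "indirect rate": "indirect_cost_rate",
    "f&a rate": "indirect_cost_rate",
    "overhead %": "indirect_cost_rate",
    "project period": "period_of_performance",
    "grant period": "period_of_performance",
    "duration": "period_of_performance",
}


def canonical_field(header: str) -> Optional[str]:
    """Resolve a raw header to its canonical name, or None if unrecognized."""
    key = re.sub(r"\s+", " ", header).strip().rstrip(":").strip().lower()
    return ALIASES.get(key)


def coerce_total_budget(raw: str) -> Decimal:
    """Strip '$' and ',', parse to Decimal, round to two decimal places."""
    cleaned = raw.replace("$", "").replace(",", "").strip()
    try:
        amount = Decimal(cleaned)
    except InvalidOperation as exc:
        raise ValueError(f"unparseable budget amount: {raw!r}") from exc
    if not amount.is_finite():
        raise ValueError(f"non-finite budget amount: {raw!r}")
    return amount.quantize(Decimal("0.01"))


def coerce_indirect_rate(raw: str) -> float:
    """Convert '15%' to 0.15. A bare number is passed through unchanged,
    so '15' becomes 15.0 and is left for schema validation to reject."""
    text = raw.strip()
    if text.endswith("%"):
        return float(text[:-1].strip()) / 100.0
    return float(text)
```

Unrecognized headers resolve to `None` on purpose: a new funder label should surface as a missing required field downstream rather than being guessed at here.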

The contract is enforced with a pydantic v2 model so that coercion and validation happen at one boundary and emit structured errors rather than silent coercion:

python

import logging
from decimal import Decimal
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError, field_validator

logger = logging.getLogger("grant_parser.schema")


class GrantSchema(BaseModel):
    document_id: str
    grant_title: str
    applicant_name: str
    # Decimal, matching the coercion rule above: currency must not pick up
    # binary floating-point rounding error.
    total_budget: Decimal
    indirect_cost_rate: Optional[float] = None
    compliance_flags: List[str] = Field(default_factory=list)

    @field_validator("total_budget")
    @classmethod
    def validate_budget(cls, v: Decimal) -> Decimal:
        if v <= 0:
            raise ValueError("total_budget must be positive")
        return v.quantize(Decimal("0.01"))

    @field_validator("indirect_cost_rate")
    @classmethod
    def validate_indirect_rate(cls, v: Optional[float]) -> Optional[float]:
        if v is not None and not (0.0 <= v <= 1.0):
            raise ValueError("indirect_cost_rate must be within [0.0, 1.0]")
        return v


def enforce_compliance_schema(mapped_data: dict) -> dict:
    """Validate mapped fields against the canonical contract; emit a deterministic manifest."""
    try:
        validated = GrantSchema(**mapped_data)
    except ValidationError as exc:
        failed = [str(err["loc"][0]) for err in exc.errors()]
        logger.warning("schema_violation | doc_id=%s | failed=%s", mapped_data.get("document_id"), failed)
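        return {
            "status": "non_compliant",
            # Drop the docs URL and the raw exception context so the manifest
            # is JSON-serializable and stable across pydantic releases.
            "errors": exc.errors(include_url=False, include_context=False),
            "audit_manifest": {"framework": "2_CFR_200_OMB_Uniform_Guidance", "checks_failed": failed},
        }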
    return {
        "status": "compliant",
        "payload": validated.model_dump(),
        "audit_manifest": {
            "framework": "2_CFR_200_OMB_Uniform_Guidance",
            "checks_passed": ["budget_positive", "indirect_rate_capped", "required_fields_present"],
        },
    }

Validation & Testing

Determinism is only credible if it is asserted. The following pytest fixtures pin the two terminal outcomes — a clean compliant emission and a schema violation — and verify that the audit trail records what happened. A golden-file fixture per funder template guards against silent extraction drift when a dependency is bumped.

python

import pytest

from grant_parser.schema import enforce_compliance_schema


@pytest.fixture
def clean_payload() -> dict:
    return {
        "document_id": "a" * 64,
        "grant_title": "Community Resilience Initiative",
        "applicant_name": "Riverside Community Trust",
        "total_budget": 150000.00,
        "indirect_cost_rate": 0.15,
    }


def test_compliant_payload_passes(clean_payload: dict) -> None:
    result = enforce_compliance_schema(clean_payload)
    assert result["status"] == "compliant"
    assert "budget_positive" in result["audit_manifest"]["checks_passed"]


def test_negative_budget_quarantines(clean_payload: dict) -> None:
    clean_payload["total_budget"] = -500.0
    result = enforce_compliance_schema(clean_payload)
    assert result["status"] == "non_compliant"
    assert "total_budget" in result["audit_manifest"]["checks_failed"]


def test_indirect_rate_cap_enforced(clean_payload: dict) -> None:
    clean_payload["indirect_cost_rate"] = 1.4  # 140% — impossible
    result = enforce_compliance_schema(clean_payload)
    assert result["status"] == "non_compliant"
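The golden-file guard mentioned above is not shown in the tests. One way to sketch it is to reduce each extraction result to a canonical JSON fingerprint and compare it with a stored value. The helper names `payload_fingerprint` and `assert_matches_golden`, and the convention of one fingerprint file per funder template, are assumptions made for this example.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Dict


def payload_fingerprint(payload: Dict[str, Any]) -> str:
    """Hash a canonical JSON rendering so key order cannot change the result."""
    canonical = json.dumps(
        payload,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
        default=str,  # render Decimal, Path, datetime and similar as text
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_matches_golden(payload: Dict[str, Any], golden_path: Path) -> None:
    """Fail with a readable message when extraction output has drifted."""
    expected = golden_path.read_text(encoding="utf-8").strip()
    actual = payload_fingerprint(payload)
    assert actual == expected, (
        f"extraction drift for {golden_path.name}: {actual} != {expected}"
    )
```

Storing a fingerprint rather than the full payload keeps fixtures small, at the cost of a less informative diff; keeping the full JSON alongside it is a reasonable alternative when drift needs to be diagnosed.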

For text normalization, property-based tests with hypothesis confirm that normalize_text is idempotent and never emits a zero-width character regardless of input — a stronger guarantee than any handful of example strings:

python

from hypothesis import given, strategies as st

from grant_parser.extract import normalize_text


@given(st.text())
def test_normalize_is_idempotent(raw: str) -> None:
    once = normalize_text(raw)
    assert normalize_text(once) == once
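    # The empty string is a substring of every string, so test each
    # zero-width code point explicitly instead.
    assert not any(ch in once for ch in "\u200b\u200c\u200d\ufeff")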

Performance & Scale Considerations

Nonprofit-scale workloads are bursty — grant deadlines cluster submissions into spikes — but rarely high-throughput. The dominant cost is PDF rendering memory, not CPU. Tune for predictability under a deadline spike rather than raw throughput.

Memory ceiling. pdfplumber holds page geometry in memory; a 200-page scanned PDF can exceed 1 GB. Enforce GRANT_PDF_MAX_BYTES at pre-flight, before the file is read, and process one page at a time. The page loop above visits pages sequentially, but pdfplumber caches each page’s parsed layout, so release that cache once a page has been processed to keep the resident set bounded.
Concurrency limits. Render-heavy extraction must not block synchronous API responses. Dispatch jobs onto bounded worker pools governed by Async Batch Processing Pipelines; a concurrency of min(4, cpu_count) is a safe default for the memory profile above.
Batch sizing. Group staged files into batches of 25–50 per worker cycle so the audit log flushes in coherent units and a single corrupt artifact never poisons an entire run.
Camelot cost. Camelot’s lattice flavor is markedly slower than pdfplumber because it renders each page to an image through Ghostscript before detecting table lines. Reserve it for tables that score below GRANT_CONFIDENCE_FLOOR or fail with ERR_TABLE_GEOMETRY, not as the default path.
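The concurrency and batch-sizing guidance above can be sketched with the standard library. This is an illustration under assumptions: the helper names are hypothetical, and a thread pool stands in for whatever worker pool Async Batch Processing Pipelines actually provides (a process pool is the more natural fit for render-heavy work).

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def default_worker_count() -> int:
    """min(4, cpu_count), falling back to 1 when the CPU count is unknown."""
    return min(4, os.cpu_count() or 1)


def batched(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive batches of at most `size` items."""
    if size <= 0:
        raise ValueError("batch size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def run_batch(
    batch: Sequence[T], handler: Callable[[T], str], workers: int
) -> Dict[str, List[T]]:
    """Run `handler` on each item and group items by the outcome it returns.

    An exception in one item is recorded as "failed" instead of aborting
    the batch, so a single corrupt artifact cannot poison the run.
    """
    def safe(item: T) -> str:
        try:
            return handler(item)
        except Exception:
            return "failed"

    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = list(pool.map(safe, batch))  # map preserves input order

    grouped: Dict[str, List[T]] = {}
    for item, outcome in zip(batch, outcomes):
        grouped.setdefault(outcome, []).append(item)
    return grouped
```

A driver would iterate `batched(staged_files, 25)` and flush the audit log after each `run_batch` call, which gives the coherent per-batch units described above.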

Failure Modes & Troubleshooting

Every failure resolves to a machine-readable reason code and a quarantine artifact — never a silent drop. A sustained shift in one reason code is the primary operational signal; a surge of ERR_SCHEMA_VIOLATION almost always means a funder published a new template version, which should be fixed in the alias contract upstream rather than by loosening validation here.

Error code	Root cause	Remediation
`ERR_MIME_MISMATCH`	File renamed `.pdf` but is actually `.docx`/HTML/image	Reject at pre-flight; re-request from source via API Polling & Rate Limiting
`ERR_ENCRYPTED`	Password-protected or DRM-restricted PDF	Quarantine; request an unprotected copy from the funder; never store decryption keys in the pipeline
`ERR_INVALID_HEADER`	Truncated download or non-PDF binary	Verify upstream transfer integrity against the staged SHA-256
`ERR_EMPTY_TEXT_LAYER`	Scanned image PDF with no embedded text	Route to manual review; OCR is out of scope for this gate
`ERR_TABLE_GEOMETRY`	Merged cells / rotated pages defeat `pdfplumber`	Fall back to Camelot per the tables extraction guide
`ERR_SCHEMA_VIOLATION`	Required field missing or budget arithmetic inconsistent	Fix the alias contract upstream; quarantine the artifact for audit
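Spotting a sustained shift in one reason code can be sketched by comparing each code's share of rejections between a baseline window and the current window. The function names and the 0.2 default threshold below are assumptions for illustration, not values taken from this pipeline.

```python
from collections import Counter
from typing import Dict, Iterable, List


def reason_code_shares(codes: Iterable[str]) -> Dict[str, float]:
    """Return each reason code's fraction of all rejections in a window."""
    counts = Counter(codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: n / total for code, n in counts.items()}


def shifted_codes(
    baseline: Iterable[str], current: Iterable[str], min_increase: float = 0.2
) -> List[str]:
    """List codes whose share rose by at least `min_increase` over baseline."""
    base = reason_code_shares(baseline)
    now = reason_code_shares(current)
    return sorted(
        code for code, share in now.items()
        if share - base.get(code, 0.0) >= min_increase
    )
```

Comparing shares rather than raw counts keeps a deadline-week spike in overall volume from being mistaken for a change in failure mix.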

Transient storage faults (a flaky mount, a timed-out read) are not this stage’s concern — bounded retries and backoff belong to Pipeline Fallback & Retry Logic.

Compliance Alignment

This gate satisfies specific federal cost-principle, internal-control, and audit-event obligations rather than generic “compliance.” Every validated payload and every quarantine manifest is auditable evidence, cryptographically anchored to the original document digest.

Validation gate	Regulatory mapping	Audit artifact
Pre-flight MIME & header check	NIST SP 800-53 AU-2 (audit events)	Structured JSON log with framework tag and reason code
SHA-256 digest anchoring	NIST SP 800-107 Rev. 1 (hashing); 2 CFR §200.334 (record retention)	Immutable `document_id` linking every downstream record
Budget arithmetic verification	2 CFR §200.302 (financial management); 2 CFR §200.403 (consistency of cost treatment)	`Decimal` summation log, declared-total reconciliation
Indirect-cost ceiling	2 CFR §200.414 (indirect F&A cost rate); funder negotiated agreement	Indirect-rate cap check on the `indirect_cost_rate` field
Schema enforcement	2 CFR §200.303 (internal controls); SF-424 field requirements	`pydantic` validation with explicit `compliance_flags`
Handoff serialization	ISO/IEC 27001 A.12.4 (logging & monitoring)	Append-only payload emission; no cross-module state mutation
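The budget arithmetic row refers to a `Decimal` summation and declared-total reconciliation that the code in this guide does not show. A minimal sketch follows, assuming line items have already been coerced to `Decimal`; the function name `reconcile_budget` and the shape of the returned record are hypothetical.

```python
from decimal import Decimal
from typing import Dict, Iterable, Union


def reconcile_budget(
    line_items: Iterable[Decimal],
    declared_total: Decimal,
    tolerance: Decimal = Decimal("0.00"),
) -> Dict[str, Union[str, int, bool]]:
    """Sum line items exactly and compare against the declared total."""
    items = list(line_items)
    # Start from Decimal("0") so the sum never passes through a float.
    computed = sum(items, Decimal("0"))
    difference = computed - declared_total
    return {
        "computed_total": str(computed),
        "declared_total": str(declared_total),
        "difference": str(difference),
        "line_item_count": len(items),
        "reconciled": abs(difference) <= tolerance,
    }
```

Amounts are rendered as strings in the record so that the log entry preserves the exact decimal digits when serialized to JSON.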

Compliance officers can reconstruct the exact extraction state at ingestion time by querying the append-only audit logs by document_id. Quarantined files retain their original binary state alongside a JSON rejection manifest, so the integrity of each rejected artifact can be demonstrated during federal audits. Under the 2 CFR §200.334 three-year minimum, the quarantine artifacts at GRANT_PDF_QUARANTINE_PATH are retention records and must be preserved (longer where a funder or state statute extends it). Audit artifacts emitted here conform to the conventions defined in Compliance Metadata Standards, and indirect-cost ceilings reconcile against the structures in IRS 990 Data Schema Mapping.
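Because document_id is the SHA-256 digest of the raw binary, an auditor can confirm that a retained artifact is unchanged by recomputing the digest. The sketch below streams the file in chunks so large PDFs are not loaded whole; the helper names `sha256_of_file` and `verify_document_id` are hypothetical.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading 1 MiB at a time."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_document_id(path: Path, document_id: str) -> bool:
    """Return True when the file on disk still hashes to its recorded ID."""
    return hmac.compare_digest(sha256_of_file(path), document_id.lower())
```

The streamed digest is identical to `hashlib.sha256(path.read_bytes()).hexdigest()`, so it is interchangeable with the value pre-flight computes.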

Related

Parent: Data Ingestion & Grant Parsing Workflows
Extracting tables from grant PDFs using PyPDF2 and Camelot — the budget-matrix extraction detail this gate defers to
Field Mapping & Normalization — owns the canonical schema and alias resolution this stage hands off to
Excel Budget Template Sync — the parallel ingestion gate for spreadsheet budgets
Async Batch Processing Pipelines — concurrent dispatch that governs PDF render workers
Compliance Metadata Standards — cross-domain audit-artifact field conventions
