Field Mapping & Normalization for Grant Data Pipelines

A deterministic, stateless transformation stage that maps heterogeneous portal-specific grant payloads onto a canonical, type-safe schema with compliance gating and audit-trail emission for Python ingestion pipelines.

This stage is part of the Data Ingestion & Grant Parsing Workflows reference, and it owns one job: the deterministic translation of heterogeneous, portal-specific grant payloads into a single canonical, compliance-ready schema. Everything below assumes a payload has already been extracted and staged; nothing below performs extraction, fund reconciliation, general-ledger posting, or regulatory reporting.

Field Mapping & Normalization is a stateless transformation layer. Its boundaries are enforced deliberately so that mapping logic stays auditable, version-controlled, and decoupled from transport and extraction concerns:

Ingress boundary. Processing begins the moment a payload is delivered from upstream extraction. Raw outputs from PDF Grant Application Parsing and Excel Budget Template Sync terminate at the payload staging queue and are picked up here.
Egress boundary. Normalization concludes on successful schema validation, compliance flagging, and structured audit-trail emission. Only validated canonical records advance to the Core Architecture & Compliance Mapping reference that consumes them.
Execution context. This layer runs independently of network constraints. Connection pooling and the back-off behaviour governed by API Polling & Rate Limiting are irrelevant here, because payloads are already materialized in local or distributed staging storage.

Out of scope, explicitly: raw data extraction, financial reconciliation, obligation tracking, downstream rule adjudication, and any regulatory submission. Those responsibilities belong to later stages and must not bleed into mapping code.

Prerequisites

Pin the toolchain so that normalization is byte-for-byte reproducible across environments. Unpinned dependency versions undermine the traceable, reproducible records that 2 CFR §200.302 expects, and they can surface as an audit finding in their own right.

Requirement	Version / value	Notes
Python	`3.11+`	Required for `datetime.UTC` and faster `decimal` paths.
`pydantic`	`==2.7.1`	Strict runtime contract enforcement and JSON serialization.
`polars`	`==0.20.31`	Vectorized batch mapping; avoids row-by-row iteration.
`pyyaml`	`==6.0.1`	Loads the version-controlled mapping registry.

bash

python -m pip install "pydantic==2.7.1" "polars==0.20.31" "pyyaml==6.0.1"

Environment variables consumed by this stage:

Variable	Purpose
`GRANT_MAPPING_REGISTRY_PATH`	Absolute path to the Git-tracked mapping YAML.
`GRANT_SCHEMA_VERSION`	Canonical schema tag stamped onto every emitted record (e.g. `v2.4.1`).
`GRANT_AUDIT_SINK_URI`	Immutable storage tier for compliance audit logs.
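
A minimal sketch of how a worker might read these variables at start-up and fail fast when one is absent. StageConfig and load_stage_config are hypothetical names rather than part of the stage's published interface, and the absolute-path check is an assumption drawn from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED_VARS = (
    "GRANT_MAPPING_REGISTRY_PATH",
    "GRANT_SCHEMA_VERSION",
    "GRANT_AUDIT_SINK_URI",
)


@dataclass(frozen=True)
class StageConfig:  # hypothetical container for the three settings
    registry_path: str
    schema_version: str
    audit_sink_uri: str


def load_stage_config(env: Mapping[str, str] = os.environ) -> StageConfig:
    """Read the stage settings, raising before any batch is touched."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {missing}")
    registry_path = env["GRANT_MAPPING_REGISTRY_PATH"]
    if not os.path.isabs(registry_path):  # assumption: relative paths are refused
        raise RuntimeError("GRANT_MAPPING_REGISTRY_PATH must be an absolute path")
    return StageConfig(
        registry_path=registry_path,
        schema_version=env["GRANT_SCHEMA_VERSION"],
        audit_sink_uri=env["GRANT_AUDIT_SINK_URI"],
    )
```

Passing the environment in as a mapping keeps the loader testable without mutating os.environ.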

Upstream dependency: a staged payload with a recorded SHA-256 digest. The digest is produced during extraction and carried forward so that the audit trail can prove which raw bytes a canonical record derived from. If the digest is missing, reject the payload before mapping — there is no way to reconstruct lineage afterwards.
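
The digest gate could look roughly like the sketch below. The source only requires rejecting a payload whose digest is missing; recomputing the hash and comparing it is an added assumption, and verify_payload_digest is a hypothetical helper name.

```python
import hashlib
from typing import Optional


def verify_payload_digest(raw_bytes: bytes, recorded_digest: Optional[str]) -> str:
    """Return the verified SHA-256 hex digest, or raise before mapping starts."""
    if not recorded_digest:
        # Lineage cannot be reconstructed later, so refuse the payload now.
        raise ValueError("payload rejected: no recorded SHA-256 digest")
    actual = hashlib.sha256(raw_bytes).hexdigest()
    # Assumption: the staged bytes are also re-hashed and compared.
    if actual != recorded_digest.strip().lower():
        raise ValueError("payload rejected: SHA-256 digest mismatch")
    return actual
```

The returned digest is the value that would be carried into the audit entry for every record derived from that payload.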

Core Implementation

Before any transformation runs, instantiate a centralized field registry. This registry is the authoritative data contract for all grant records: it enforces strict types, allowed enumerations, and compliance metadata. The canonical schema is defined with pydantic v2 for native strict typing, custom validators, and explicit error serialization.

python

from pydantic import BaseModel, Field, field_validator, ConfigDict
from decimal import Decimal
from datetime import datetime, timezone
from typing import Literal
import uuid

class GrantCanonicalRecord(BaseModel):
    model_config = ConfigDict(strict=True, extra="forbid")

    grant_id: uuid.UUID
    funding_stream_code: str = Field(pattern=r"^[A-Z]{3}-\d{4}$")
    award_period_start: datetime
    award_period_end: datetime
    budget_category_id: Literal["DIRECT", "INDIRECT", "MATCHING", "ADMIN"]
    award_amount_usd: Decimal = Field(ge=0, decimal_places=2)
    compliance_framework: str = Field(default="2_CFR_200")
    normalization_timestamp: datetime = Field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    @field_validator("award_period_end")
    @classmethod
    def validate_date_sequence(cls, v: datetime, info) -> datetime:
        start = info.data.get("award_period_start")
        if start is not None and v <= start:
            raise ValueError(
                "award_period_end must be strictly greater than award_period_start"
            )
        return v

Mapping operations must be stateless, idempotent, and fully traceable. Row-by-row iteration is prohibited for production workloads; vectorized execution via polars is what keeps batches in the 100k-record range practical. The function below fails fast on missing source columns, because silent field drops destroy audit integrity, and it logs through the standard library logger rather than print.

python

import polars as pl
from decimal import Decimal, ROUND_HALF_UP
from dataclasses import dataclass
from typing import Dict
import logging

logger = logging.getLogger("grant_normalization")

@dataclass
class MappingResult:
    frame: pl.DataFrame
    rows_in: int
    rows_mapped: int
    status: str  # "OK" | "ERROR"
    error: str | None = None

def execute_vectorized_mapping(
    raw_frame: pl.DataFrame,
    mapping_config: Dict[str, str],
) -> MappingResult:
    """Apply deterministic field mapping and type coercion using Polars.

    Returns a structured result instead of raising, so the caller can route
    failed batches to the exception queue without swallowing the cause.
    """
    missing_keys = [src for src in mapping_config if src not in raw_frame.columns]
    if missing_keys:
        logger.error("mapping_missing_source_fields", extra={"missing": missing_keys})
        return MappingResult(
            frame=raw_frame, rows_in=raw_frame.height, rows_mapped=0,
            status="ERROR", error=f"missing source fields: {missing_keys}",
        )

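    mapped = raw_frame.rename(mapping_config)

    # The coercion step addresses these canonical columns by name, so confirm
    # they exist and return a structured error instead of letting Polars raise.
    required = ("award_amount_usd", "budget_category_id", "funding_stream_code")
    absent = [col for col in required if col not in mapped.columns]
    if absent:
        logger.error("mapping_missing_canonical_fields", extra={"missing": absent})
        return MappingResult(
            frame=raw_frame, rows_in=raw_frame.height, rows_mapped=0,
            status="ERROR", error=f"missing canonical fields: {absent}",
        )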

    # Monetary normalization uses Python's decimal module for exact precision.
    mapped = mapped.with_columns([
        pl.col("award_amount_usd").map_elements(
            lambda x: str(Decimal(str(x)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)),
            return_dtype=pl.Utf8,
        ),
        pl.col("budget_category_id").str.strip_chars().str.to_uppercase(),
        pl.col("funding_stream_code").str.strip_chars().str.to_uppercase(),
    ])

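    for col in ("award_period_start", "award_period_end"):
        # Parse only string columns. String expressions fail on a column that is
        # already Datetime, so skipping it keeps re-runs idempotent.
        if col in mapped.columns and mapped.schema[col] == pl.Utf8:
            mapped = mapped.with_columns(
                pl.col(col)
                .str.to_datetime(format="%Y-%m-%dT%H:%M:%S", strict=False)
                .dt.replace_time_zone("UTC")
            )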

    logger.info(
        "mapping_complete",
        extra={"rows_in": raw_frame.height, "rows_out": mapped.height},
    )
    return MappingResult(
        frame=mapped, rows_in=raw_frame.height,
        rows_mapped=mapped.height, status="OK",
    )

The execution protocol is four ordered passes:

1. Key resolution. Apply the mapping dictionary via explicit column renaming under a strict error policy.
2. Type coercion. Quantize monetary values with decimal.Decimal using ROUND_HALF_UP at two decimal places; normalize temporal fields to ISO 8601 timestamps in UTC.
3. String sanitization. Strip whitespace, normalize enumeration casing, and replace non-UTF-8 sequences with deterministic fallbacks.
4. Cross-field alignment. Validate logical dependencies (budget-period boundaries and funding-stream prefixes) once coercion has completed.
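
The type-coercion pass depends on two details that are easy to get wrong: building the Decimal from a string rather than a float, and naming the rounding mode explicitly. The sketch below uses only the standard library; to_usd is a hypothetical helper that repeats the expression used inside the mapping function.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

CENT = Decimal("0.01")


def to_usd(value) -> Decimal:
    """Coerce through str() so binary-float noise never reaches Decimal."""
    return Decimal(str(value)).quantize(CENT, rounding=ROUND_HALF_UP)


print(to_usd("100000.005"))  # 100000.01
print(to_usd(2.675))         # 2.68, because str(2.675) is "2.675"

# Building the Decimal straight from the float exposes its binary expansion
# (2.67499999...), so the same rounding mode gives a different answer.
print(Decimal(2.675).quantize(CENT, rounding=ROUND_HALF_UP))  # 2.67

# Banker's rounding sends ties to the even digit; ROUND_HALF_UP does not.
print(Decimal("0.125").quantize(CENT, rounding=ROUND_HALF_EVEN))  # 0.12
print(Decimal("0.125").quantize(CENT, rounding=ROUND_HALF_UP))    # 0.13
```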

Field Mapping & Schema Contract

A bidirectional mapping registry is loaded from version control (Git-tracked YAML or a configuration service). Each entry binds a source key to a canonical target, a transformation rule, and a compliance lineage tag. Resolve portal aliasing drift before the engine engages by running the Standardizing Grant Field Names Across Multiple Portals workflow first; that step guarantees deterministic key resolution so grant_amt, award_total_usd, and funding_amount collapse to one canonical field.

Alias Resolution Table

Canonical field	Known portal aliases	Type	Coercion rule
`grant_id`	`id`, `record_uuid`, `application_id`	`uuid.UUID`	Parse to UUID; reject non-parseable.
`funding_stream_code`	`stream`, `program_code`, `cfda`	`str`	Strip, uppercase, match `^[A-Z]{3}-\d{4}$`.
`award_amount_usd`	`grant_amt`, `award_total_usd`, `funding_amount`	`Decimal`	`Decimal(str(x))`, `ROUND_HALF_UP`, 2 places.
`award_period_start`	`start`, `period_from`, `budget_start`	`datetime`	Parse ISO 8601, set UTC.
`award_period_end`	`end`, `period_to`, `budget_end`	`datetime`	Parse ISO 8601, set UTC; must exceed start.
`budget_category_id`	`cost_class`, `category`, `line_type`	`Literal`	Uppercase; constrain to enum.

The mapping registry itself is plain, reviewable configuration:

yaml

# mapping_registry.yaml — version-controlled, reviewed via pull request
grant_amt:        award_amount_usd
award_total_usd:  award_amount_usd
funding_amount:   award_amount_usd
program_code:     funding_stream_code
period_from:      award_period_start
period_to:        award_period_end
cost_class:       budget_category_id
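
The registry lists every known alias, but a single portal payload normally carries only one alias per canonical field, and execute_vectorized_mapping treats every key in mapping_config as required. One way to bridge the two, sketched here under assumptions, is to narrow the registry to the columns actually present before calling the mapper. resolve_batch_mapping is a hypothetical helper, and the registry is written as a dict literal so the example does not depend on reading the YAML file.

```python
from typing import Dict, Iterable

REGISTRY: Dict[str, str] = {
    "grant_amt": "award_amount_usd",
    "award_total_usd": "award_amount_usd",
    "funding_amount": "award_amount_usd",
    "program_code": "funding_stream_code",
    "period_from": "award_period_start",
    "period_to": "award_period_end",
    "cost_class": "budget_category_id",
}


def resolve_batch_mapping(columns: Iterable[str], registry: Dict[str, str]) -> Dict[str, str]:
    """Keep only the aliases present in this batch; refuse ambiguous targets."""
    present = set(columns)
    resolved: Dict[str, str] = {}
    claimed: Dict[str, str] = {}  # canonical target -> alias that claimed it
    for source, target in registry.items():
        if source not in present:
            continue
        if target in present:
            raise ValueError(f"{source!r} would collide with existing column {target!r}")
        if target in claimed:
            raise ValueError(
                f"aliases {claimed[target]!r} and {source!r} both map to {target!r}"
            )
        claimed[target] = source
        resolved[source] = target
    return resolved


print(resolve_batch_mapping(["grant_id", "grant_amt", "program_code"], REGISTRY))
# {'grant_amt': 'award_amount_usd', 'program_code': 'funding_stream_code'}
```

Raising on a collision, rather than picking a winner, keeps alias precedence an explicit, reviewable decision.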

Validation & Testing

Validation is decoupled from transformation: it runs as a post-mapping gate that tags records with compliance metadata rather than halting the batch. The gate emits aggregate counts alongside per-record verdicts so reviewers can triage at the batch level.

python

from dataclasses import dataclass
from typing import List, Literal
import logging

import polars as pl

@dataclass
class ComplianceAudit:
    record_id: str
    status: Literal["PASS", "FLAGGED", "REJECTED"]
    compliance_tags: List[str]
    error_details: str | None = None

def run_compliance_gates(frame: pl.DataFrame) -> List[ComplianceAudit]:
    log = logging.getLogger("compliance.gates")
    date_violations = frame.filter(
        pl.col("award_period_end") <= pl.col("award_period_start")
    ).height
    log.info("gate_summary", extra={
        "date_violations": date_violations, "total_records": frame.height,
    })

    audits: List[ComplianceAudit] = []
    for row in frame.iter_rows(named=True):
        tags: List[str] = []
        status: Literal["PASS", "FLAGGED", "REJECTED"] = "PASS"
        details: str | None = None

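        start, end = row["award_period_start"], row["award_period_end"]
        if start is None or end is None:
            # Coercion with strict=False nulls unparseable timestamps; flag the
            # row instead of crashing on a None comparison. Tag name is illustrative.
            tags.append("NULL_AWARD_PERIOD")
            status, details = "FLAGGED", "Award period timestamp is null."
        elif end <= start:
            tags.append("DATE_SEQUENCE_VIOLATION")
            status, details = "FLAGGED", "Period end is not after period start."

        stream = row["funding_stream_code"] or ""
        if not stream.startswith(("NSF", "NIH", "DOE")):
            tags.append("STREAM_FORMAT_MISMATCH")
            # REJECTED outranks FLAGGED, so it is assigned last.
            status, details = "REJECTED", "Unrecognized funding-stream prefix."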

        audits.append(ComplianceAudit(str(row["grant_id"]), status, tags, details))
    return audits

A pytest suite pins the contract with one passing payload and one deliberately invalid one, then asserts on the structured audit output:

python

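import polars as pl
import pytest

# execute_vectorized_mapping and run_compliance_gates are the functions shown
# above; import them from wherever those definitions live in your package.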

@pytest.fixture
def valid_frame() -> pl.DataFrame:
    return pl.DataFrame({
        "grant_id": ["1c4e...uuid"],
        "funding_stream_code": ["NSF-2024"],
        "award_period_start": ["2024-01-01T00:00:00"],
        "award_period_end": ["2024-12-31T00:00:00"],
        "budget_category_id": ["DIRECT"],
        "award_amount_usd": ["100000.005"],
    })

def test_money_rounds_half_up(valid_frame):
    result = execute_vectorized_mapping(valid_frame, {})
    assert result.status == "OK"
    assert result.frame["award_amount_usd"][0] == "100000.01"

def test_missing_field_returns_structured_error():
    frame = pl.DataFrame({"grant_id": ["x"]})
    result = execute_vectorized_mapping(frame, {"absent_src": "award_amount_usd"})
    assert result.status == "ERROR"
    assert "absent_src" in result.error

def test_reversed_dates_are_flagged():
    frame = pl.DataFrame({
        "grant_id": ["abc"], "funding_stream_code": ["NSF-2024"],
        "award_period_start": ["2024-12-31T00:00:00"],
        "award_period_end": ["2024-01-01T00:00:00"],
        "budget_category_id": ["DIRECT"], "award_amount_usd": ["1.00"],
    }).with_columns([
        pl.col(c).str.to_datetime().dt.replace_time_zone("UTC")
        for c in ("award_period_start", "award_period_end")
    ])
    audit = run_compliance_gates(frame)[0]
    assert audit.status == "FLAGGED"
    assert "DATE_SEQUENCE_VIOLATION" in audit.compliance_tags

Mapping is meant to be idempotent, so a property-based check with hypothesis is worth adding: feeding any already-normalized frame back through execute_vectorized_mapping with an empty config must leave the canonical money and date columns unchanged.
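
The shape of that property can be sketched without extra dependencies. The example below is a stand-in, not the hypothesis test itself: it uses a seeded random generator instead of hypothesis strategies, and normalize_money and normalize_category are hypothetical scalar equivalents of the money and category expressions in the mapping function.

```python
import random
from decimal import Decimal, ROUND_HALF_UP
from typing import Callable, Iterable


def normalize_money(value: str) -> str:
    # Same expression as the map_elements lambda in the mapping function.
    return str(Decimal(str(value)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))


def normalize_category(value: str) -> str:
    # Scalar equivalent of strip_chars() followed by to_uppercase().
    return value.strip().upper()


def check_idempotent(fn: Callable[[str], str], samples: Iterable[str]) -> None:
    """Assert that applying fn twice gives the same result as applying it once."""
    for raw in samples:
        once = fn(raw)
        assert fn(once) == once, f"{fn.__name__} is not idempotent for {raw!r}"


rng = random.Random(2024)  # fixed seed keeps the check reproducible
money_samples = [
    f"{rng.randint(0, 10**9)}.{rng.randint(0, 999_999):06d}" for _ in range(500)
]
category_samples = ["  direct ", "Indirect", "MATCHING", "admin\t"]

check_idempotent(normalize_money, money_samples)
check_idempotent(normalize_category, category_samples)
```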

Performance & Scale Considerations

Nonprofit-scale batches are typically thousands to low hundreds of thousands of records — small enough to fit in memory, large enough that row iteration hurts. Guidance:

Stay vectorized. Reserve iter_rows() for the compliance gate’s per-record verdict; do all coercion in columnar with_columns expressions. The map_elements Decimal pass is the one remaining Python-level loop; if it dominates a profile, pre-cast to a fixed-scale integer of cents and divide on emission.
Bound batch size. Cap a single mapping call at ~50k records (roughly 150–250 MB resident for a wide grant schema). Larger payloads should be chunked upstream by Async Batch Processing Pipelines, which owns concurrency and back-pressure.
Keep the stage stateless. No shared mutable state means batches parallelize cleanly across workers; the only shared input is the immutable mapping registry, loaded once per process.
Memory ceiling. Set a hard per-worker ceiling (e.g. 512 MB) and fail a batch over the limit into the exception queue rather than risk an OOM kill that leaves no audit record.
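
The batch cap can be sketched as a chunker plus a guard. This is illustrative only: chunking is owned upstream by Async Batch Processing Pipelines, the names MAX_BATCH_ROWS, chunked, and assert_within_cap are hypothetical, and 50,000 is the approximate cap stated above.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
MAX_BATCH_ROWS = 50_000  # approximate per-call cap from the guidance above


def chunked(records: Iterable[T], size: int = MAX_BATCH_ROWS) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` records, preserving order."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(records)
    while batch := list(islice(iterator, size)):
        yield batch


def assert_within_cap(batch: Sequence[T], cap: int = MAX_BATCH_ROWS) -> None:
    """Guard for the mapping stage: refuse a batch that exceeds the cap."""
    if len(batch) > cap:
        raise ValueError(f"batch of {len(batch)} rows exceeds cap of {cap}")


sizes = [len(batch) for batch in chunked(range(120_000))]
print(sizes)  # [50000, 50000, 20000]
```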

Failure Modes & Troubleshooting

Error category	Root cause	Remediation
Missing source column on rename	Source field absent from payload; portal renamed a column	The pre-rename check returns a structured `MappingResult(status="ERROR")` before `rename` can raise; route the batch to the exception queue and update the mapping registry via pull request.
`decimal.InvalidOperation`	Money field carries currency symbols or thousands separators	Pre-sanitize with a strip step before `Decimal(str(x))`; flag `PRECISION_OVERFLOW`.
Null datetime after coercion	Timestamp format deviates from `%Y-%m-%dT%H:%M:%S`	Use `strict=False` to null rather than crash, then flag the null and quarantine the row.
`STREAM_FORMAT_MISMATCH`	Unrecognized funding-stream prefix	Reject record, emit portal-sync alert; verify the agency NOFO code list is current.
Silent field drop	Mapping config maps two aliases to one canonical without precedence	Enforce one canonical target per batch; assert no duplicate target columns after rename.

Rejected and flagged records do not stop here. Route PASS records onward; route FLAGGED and REJECTED records to the exception queue, where they are consumed by Error Categorization & Logging for automated triage and manual review.
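
The routing rule can be sketched as a simple partition. The destination names downstream and exception_queue are placeholders for whatever transports the pipeline uses, route_audits is a hypothetical helper, and ComplianceAudit is repeated here only so the example is self-contained.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Literal, Optional

Status = Literal["PASS", "FLAGGED", "REJECTED"]


@dataclass
class ComplianceAudit:  # mirrors the dataclass used by the compliance gate
    record_id: str
    status: Status
    compliance_tags: List[str]
    error_details: Optional[str] = None


def route_audits(audits: Iterable[ComplianceAudit]) -> Dict[str, List[ComplianceAudit]]:
    """Send PASS onward; send FLAGGED and REJECTED to the exception queue."""
    routes: Dict[str, List[ComplianceAudit]] = {"downstream": [], "exception_queue": []}
    for audit in audits:
        key = "downstream" if audit.status == "PASS" else "exception_queue"
        routes[key].append(audit)
    return routes
```

Every audit lands in exactly one destination, so the two list lengths always sum to the batch size, which gives a cheap invariant to assert before handoff.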

Compliance Alignment

This stage exists so that the figures downstream consumers depend on are present, typed, and internally consistent. The gates map to specific authorities:

2 CFR §200.302 (Financial Management). Strict typing at the boundary and a recorded source-payload hash give the reproducible, traceable records the regulation requires; untyped dictionaries must never propagate past this layer.
2 CFR §200.308 (Revision of Budget and Program Plans). The DATE_SEQUENCE_VIOLATION gate enforces award-period integrity so budget-period revisions remain internally consistent.
2 CFR §200.400 (Policy Guide / cost principles). ROUND_HALF_UP at two decimal places guarantees deterministic monetary precision, preventing rounding drift across reporting cycles.
OMB Uniform Guidance cost principles. The budget_category_id enumeration constrains records to recognized cost classes (DIRECT, INDIRECT, MATCHING, ADMIN) before any downstream classification.

Audit artifacts emitted on handoff conform to the Compliance Metadata Standards contract: each log entry carries the original payload hash, transformation timestamp, validation results, and compliance tags, written to an immutable tier and stamped with GRANT_SCHEMA_VERSION. From there the canonical record feeds the IRS 990 Data Schema Mapping and rule-evaluation layers, which assume — and never re-derive — the guarantees made here.
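
An audit entry carrying those four elements might be serialized as below. The Compliance Metadata Standards contract is not reproduced in this section, so every key name here is an assumption, as is the helper build_audit_entry; only the list of required contents comes from the paragraph above.

```python
import json
from datetime import datetime, timezone
from typing import List, Optional


def build_audit_entry(
    payload_sha256: str,
    record_id: str,
    status: str,
    compliance_tags: List[str],
    schema_version: str,
    now: Optional[datetime] = None,
) -> str:
    """Serialize one audit entry as a single deterministic JSON line."""
    stamp = now or datetime.now(timezone.utc)
    entry = {
        "payload_sha256": payload_sha256,            # original payload hash
        "record_id": record_id,
        "transformation_timestamp": stamp.isoformat(),
        "validation_status": status,                 # PASS | FLAGGED | REJECTED
        "compliance_tags": sorted(compliance_tags),
        "schema_version": schema_version,            # from GRANT_SCHEMA_VERSION
    }
    # Sorted keys and fixed separators keep identical inputs byte-identical.
    return json.dumps(entry, sort_keys=True, separators=(",", ":"))
```

Accepting the timestamp as an optional argument lets tests pin the clock while production callers use the current UTC time.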

Related

Data Ingestion & Grant Parsing Workflows: parent reference and the ingestion-boundary contract this stage sits inside.
Standardizing Grant Field Names Across Multiple Portals: the alias-resolution prerequisite that feeds this registry.
Excel Budget Template Sync: upstream extraction stage whose payloads land in the staging queue.
Async Batch Processing Pipelines: owns batch chunking and concurrency for the records this stage maps.
Error Categorization & Logging: consumes the flagged and rejected records this stage routes to the exception queue.
Compliance Metadata Standards: the cross-pillar audit-metadata contract the emitted artifacts conform to.
