IRS 990 Data Schema Mapping

Architectural Position & Scope

This module functions as a discrete transformation layer within the broader Core Architecture & Compliance Mapping framework. Its operational mandate is strictly bounded: ingest raw IRS Form 990 filings, normalize heterogeneous field representations into a canonical JSON structure, enforce deterministic validation, and emit audit-ready payloads. It does not execute grant disbursement workflows, synchronize donor CRM records, or render regulatory reports. By enforcing strict separation of concerns, this subsystem guarantees structural integrity, type fidelity, and immutable lineage for every 990 payload entering the grant management ecosystem.

The implementation targets three primary stakeholders:

  • Nonprofit Operations & Grant Managers: Require predictable, standardized financial disclosures for portfolio tracking.
  • Python Automation Developers: Require deterministic parsing contracts, explicit type coercion, and structured error payloads.
  • Compliance Officers: Require auditable validation trails, mathematical reconciliation proofs, and explicit compliance tagging.

All downstream logic—including jurisdictional registration checks, funder-specific eligibility rules, and access control enforcement—resides outside this boundary. This module terminates at the validation gate.


Stage 1: Ingestion & Canonical Schema Normalization

The ingestion boundary accepts IRS 990 submissions across three primary transport formats: native XML (IRS eFiling), structured CSV (third-party aggregators), and OCR-extracted text (legacy archives). Each format routes through a dedicated parser adapter that outputs a flat or lightly nested dictionary. These dictionaries converge into a single normalization function responsible for alias resolution, type coercion, and structural alignment.
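The adapter routing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's actual parsers: the function names, the `ADAPTERS` registry, and the simplified input layouts are all hypothetical, and real IRS e-file XML is namespaced and far more deeply nested than the flat example handled here.

```python
import csv
import io
import xml.etree.ElementTree as ET
from typing import Any, Callable, Dict


def parse_xml(payload: str) -> Dict[str, Any]:
    """Flatten the leaf elements of a simplified, un-namespaced XML filing."""
    root = ET.fromstring(payload)
    return {el.tag: (el.text or "").strip() for el in root.iter() if len(el) == 0}


def parse_csv(payload: str) -> Dict[str, Any]:
    """Read the first data row of a header-plus-row CSV export."""
    rows = list(csv.DictReader(io.StringIO(payload)))
    if not rows:
        raise ValueError("CSV payload contains no data rows")
    return dict(rows[0])


def parse_ocr_text(payload: str) -> Dict[str, Any]:
    """Split 'key: value' lines from OCR output; lines without a colon are skipped."""
    result: Dict[str, Any] = {}
    for line in payload.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            result[key.strip()] = value.strip()
    return result


# Hypothetical registry: one adapter per transport format named in the text.
ADAPTERS: Dict[str, Callable[[str], Dict[str, Any]]] = {
    "xml": parse_xml,
    "csv": parse_csv,
    "ocr": parse_ocr_text,
}


def ingest(payload: str, transport: str) -> Dict[str, Any]:
    """Route a raw payload to its adapter and return a flat dictionary."""
    try:
        adapter = ADAPTERS[transport]
    except KeyError:
        raise ValueError(f"Unsupported transport format: {transport!r}") from None
    return adapter(payload)
```

Each adapter returns the same shape (a flat dictionary of raw keys to raw string values), so the normalization function never needs to know which transport produced its input.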

Deterministic Alias Resolution & Type Coercion

Field aliases (EIN, TaxID, EmployerIdentificationNumber, 990_EIN) collapse into a single canonical ein key via a deterministic lookup table. Numeric strings containing currency symbols, thousands separators, or trailing whitespace are sanitized and cast to decimal.Decimal to eliminate floating-point drift. Boolean indicators (X, Yes, 1, true, checked) normalize to strict Python bool types.

For complex compensation and officer disclosures, array ordering and nested object validation must align precisely with the structural conventions documented in How to map IRS 990 Part VII to JSON schema. Deviations from this contract trigger immediate normalization rejection.
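As a rough illustration of the ordering and nested-object contract, the check below verifies that every officer entry is complete and that the array is sorted by compensation in descending order. The field names (`name`, `title`, `compensation`) and the `check_officers` helper are assumptions made for this sketch; the authoritative structure is the one defined in the referenced Part VII mapping.

```python
from decimal import Decimal
from typing import Any, Dict, List

# Hypothetical required keys for each officer object.
REQUIRED_OFFICER_KEYS = ("name", "title", "compensation")


def check_officers(officers: List[Dict[str, Any]]) -> List[str]:
    """Return a list of problems; an empty list means the array is acceptable."""
    problems: List[str] = []
    for idx, officer in enumerate(officers):
        for key in REQUIRED_OFFICER_KEYS:
            if officer.get(key) in (None, ""):
                problems.append(f"officers_compensation[{idx}].{key}: missing")
    if problems:
        # Ordering is only meaningful once every entry is complete.
        return problems

    amounts = [Decimal(str(o["compensation"])) for o in officers]
    for idx in range(1, len(amounts)):
        if amounts[idx] > amounts[idx - 1]:
            problems.append(
                f"officers_compensation[{idx}]: out of order "
                f"({amounts[idx]} follows {amounts[idx - 1]})"
            )
    return problems
```

A non-empty result would correspond to the normalization rejection described above.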

python
import logging
import decimal
from typing import Any, Dict, List
from datetime import datetime, timezone

logger = logging.getLogger("irs990.normalizer")

# Canonical alias mapping (subset for demonstration).
# Tuples rather than sets: iteration order is fixed, so alias precedence is
# deterministic when a filing carries more than one alias for the same field.
FIELD_ALIASES = {
    "ein": ("ein", "tax_id", "employer_identification_number", "990_ein"),
    "organization_name": ("name", "org_name", "legal_name"),
    "total_revenue": ("part_i_line_12", "total_revenue", "gross_receipts"),
    "is_501c3": ("is_501c3", "exempt_status", "section_501c3", "is_charitable"),
}

BOOLEAN_TRUTHY = {"x", "yes", "1", "true", "checked", "y"}

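def _sanitize_numeric(value: Any) -> decimal.Decimal:
    """Strip currency formatting and return a Decimal. Raises ValueError on failure."""
    if isinstance(value, (int, float, decimal.Decimal)):
        return decimal.Decimal(str(value))
    cleaned = str(value).replace(",", "").replace("$", "").strip()
    # Blank and placeholder values are treated as zero.
    if not cleaned or cleaned.lower() in ("n/a", "none", "null"):
        return decimal.Decimal("0")
    try:
        return decimal.Decimal(cleaned)
    except decimal.InvalidOperation as exc:
        # Decimal signals bad input with InvalidOperation, which is not a ValueError.
        raise ValueError(f"Cannot parse numeric value: {value!r}") from exc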

def normalize_payload(raw: Dict[str, Any], trace_id: str) -> Dict[str, Any]:
    """Normalize raw parsed 990 dictionary into canonical structure."""
    canonical: Dict[str, Any] = {}
    audit_log: List[str] = []

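    # Match aliases case-insensitively so "EIN" and "ein" resolve to the same key.
    raw_lower = {str(k).strip().lower(): v for k, v in raw.items()}

    for canonical_key, aliases in FIELD_ALIASES.items():
        matched = False
        for alias in aliases:
            if alias in raw_lower:
                val = raw_lower[alias]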
                if canonical_key == "ein":
                    canonical["ein"] = str(val).replace("-", "").strip()
                elif canonical_key == "total_revenue":
                    canonical["total_revenue"] = _sanitize_numeric(val)
                elif canonical_key == "is_501c3":
                    canonical["is_501c3"] = str(val).strip().lower() in BOOLEAN_TRUTHY
                else:
                    canonical[canonical_key] = str(val).strip()
                matched = True
                break
        if not matched:
            audit_log.append(f"MISSING_ALIAS:{canonical_key}")

    # Attach immutable audit metadata
    canonical["_meta"] = {
        "trace_id": trace_id,
        "normalized_at": datetime.now(timezone.utc).isoformat(),
        "normalization_warnings": audit_log,
        "schema_version": "irs990_v2.1"
    }

    logger.info(
        "Normalization complete",
        extra={"trace_id": trace_id, "warnings": len(audit_log)}
    )
    return canonical

Normalization terminates at the schema validation gate. No reconciliation, rule evaluation, or compliance tagging occurs within this boundary.


Stage 2: Field-Level Validation & Cross-Record Reconciliation

Normalized payloads enter the validation layer, which executes explicit structural and semantic checks. The implementation relies on pydantic v2 models backed by strict jsonschema validators. Required fields trigger immediate rejection if absent or null. Type mismatches generate structured error payloads containing the JSON path, expected type, and actual value.

Mathematical Reconciliation & Tolerance Thresholds

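Cross-record reconciliation enforces mathematical consistency across IRS 990 sections. For example, Part I Line 12 (Total Revenue) must equal the sum of its constituent lines (1h, 2–11). The constituent line numbers used here follow the Statement of Revenue in Part VIII; on the form itself, Part I Line 12 is the sum of Part I Lines 8 through 11 and must equal Part VIII, column (A), Line 12. Validation logic computes the delta and flags any discrepancy that exceeds a configurable tolerance threshold (default: $0.00, meaning an exact match is required). Reconciliation failures do not halt the pipeline; instead, they emit structured exception payloads for downstream triage.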
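The tolerance check can be sketched in isolation as a pure function that returns a structured payload instead of raising, which matches the non-halting behavior described above. The function name `reconcile_total` and the payload keys are illustrative assumptions; only the `RECONCILIATION_WARNING` status and the $0.00 default come from this document.

```python
from decimal import Decimal
from typing import Any, Dict, Iterable, Optional


def reconcile_total(
    reported_total: Decimal,
    constituents: Iterable[Decimal],
    tolerance: Decimal = Decimal("0.00"),
) -> Optional[Dict[str, Any]]:
    """Return None when the total reconciles, else a structured exception payload."""
    # Start the sum from a Decimal so an empty iterable still yields a Decimal.
    computed = sum(constituents, Decimal("0"))
    delta = abs(reported_total - computed)
    if delta <= tolerance:
        return None
    return {
        "code": "RECONCILIATION_WARNING",
        "reported_total": str(reported_total),
        "computed_total": str(computed),
        "delta": str(delta),
        "tolerance": str(tolerance),
    }
```

Amounts are serialized as strings so the payload survives JSON encoding without losing decimal precision.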

python
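import decimal
import logging
from datetime import datetime, timezone
from typing import Any, Dict

from pydantic import BaseModel, Field, ValidationError, field_validator

logger = logging.getLogger("irs990.validator")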

class IRS990PartI(BaseModel):
    line_1h: decimal.Decimal = Field(default=decimal.Decimal("0"))
    line_2: decimal.Decimal = Field(default=decimal.Decimal("0"))
    line_3: decimal.Decimal = Field(default=decimal.Decimal("0"))
    line_4: decimal.Decimal = Field(default=decimal.Decimal("0"))
    line_5: decimal.Decimal = Field(default=decimal.Decimal("0"))
    line_6: decimal.Decimal = Field(default=decimal.Decimal("0"))
    line_7: decimal.Decimal = Field(default=decimal.Decimal("0"))
    line_8: decimal.Decimal = Field(default=decimal.Decimal("0"))
    line_9: decimal.Decimal = Field(default=decimal.Decimal("0"))
    line_10: decimal.Decimal = Field(default=decimal.Decimal("0"))
    line_11: decimal.Decimal = Field(default=decimal.Decimal("0"))
    line_12_total_revenue: decimal.Decimal

    @field_validator("line_12_total_revenue")
    @classmethod
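    def validate_revenue_sum(cls, v: decimal.Decimal, info) -> decimal.Decimal:
        # info.data is a dict of the fields validated so far, not an object,
        # so look values up by key; fields that failed validation are absent.
        line_names = ["line_1h"] + [f"line_{n}" for n in range(2, 12)]
        constituent_sum = sum(
            (info.data.get(name, decimal.Decimal("0")) for name in line_names),
            decimal.Decimal("0"),
        )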
        delta = abs(v - constituent_sum)
        if delta > decimal.Decimal("0.00"):
            raise ValueError(
                f"Revenue reconciliation failed: Line 12 ({v}) != Sum of Lines ({constituent_sum}). Delta: {delta}"
            )
        return v

def validate_and_reconcile(normalized: Dict[str, Any], trace_id: str) -> Dict[str, Any]:
    """Execute Pydantic validation and reconciliation. Returns validated payload or structured error."""
    try:
        part_i = IRS990PartI(**normalized.get("part_i", {}))
        normalized["part_i"] = part_i.model_dump(mode="json")
        # setdefault avoids a KeyError when the payload arrives without _meta.
        normalized.setdefault("_meta", {})["validation_status"] = "PASSED"
        logger.info("Validation passed", extra={"trace_id": trace_id})
        return normalized
    except ValidationError as e:
        error_payload = {
            "trace_id": trace_id,
            "validation_status": "FAILED",
            "errors": [
                {
                    # Join the full location tuple so nested fields yield a path.
                    "field": ".".join(str(part) for part in err["loc"]),
                    "expected_type": "decimal.Decimal",
                    "actual_value": str(err["input"]),
                    "message": err["msg"]
                }
                for err in e.errors()
            ],
            "timestamp": datetime.now(timezone.utc).isoformat()
        }
        logger.warning("Validation failed", extra={"trace_id": trace_id, "error_count": len(error_payload["errors"])})
        return error_payload

Reconciliation logic remains strictly mathematical. Business rule evaluation, jurisdictional compliance checks, and funder-specific constraints are explicitly deferred to downstream consumers.


Stage 3: Compliance Handoff & Lineage Emission

Once payloads pass normalization and validation, the module emits a finalized canonical JSON object with attached compliance metadata. This emission point serves as the sole integration boundary for downstream systems.

The validated payload routes to the State Charity Registration Compliance subsystem for jurisdictional filing verification, multi-state nexus detection, and annual reporting deadline tracking. Simultaneously, the payload enters the Grantor-Specific Rule Taxonomies engine, where funder-defined eligibility matrices, programmatic restrictions, and matching requirements are evaluated.

This module does not participate in those evaluations. It guarantees only that:

  1. All required fields are present and type-safe.
  2. Mathematical cross-sections reconcile within tolerance.
  3. Audit lineage (trace ID, normalization timestamp, validation status, schema version) is immutable and queryable.
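One way to model the third guarantee in process is a frozen dataclass carrying the four lineage fields. This is a sketch under assumptions: the `Lineage` class and `build_lineage` helper are hypothetical names, and a frozen dataclass only blocks attribute reassignment in memory. Durable immutability and queryability still depend on the storage layer.

```python
from dataclasses import FrozenInstanceError, asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Lineage:
    """Audit lineage fields; instances cannot be modified after creation."""
    trace_id: str
    normalized_at: str
    validation_status: str
    schema_version: str


def build_lineage(
    trace_id: str,
    validation_status: str,
    schema_version: str = "irs990_v2.1",
) -> Lineage:
    """Stamp a lineage record with the current UTC time."""
    return Lineage(
        trace_id=trace_id,
        normalized_at=datetime.now(timezone.utc).isoformat(),
        validation_status=validation_status,
        schema_version=schema_version,
    )
```

Assigning to any field of a `Lineage` instance raises `FrozenInstanceError`, and `asdict(lineage)` produces a plain dictionary suitable for attaching to the emitted payload.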

Access boundaries, encryption at rest, and pipeline fallback/retry logic are enforced by the orchestration layer. This subsystem emits payloads synchronously or via message queue, depending on deployment topology, but never retains state beyond the request lifecycle.


Compliance Mapping Matrix

The following matrix explicitly maps IRS 990 sections to canonical JSON keys, validation rules, and compliance artifacts. This mapping ensures deterministic translation across all ingestion formats.

IRS 990 Section       | Canonical JSON Key           | Validation Rule                               | Compliance Artifact
----------------------|------------------------------|-----------------------------------------------|----------------------------------------
Header / EIN          | ein                          | 9-digit numeric, stripped of hyphens          | Tax-exempt status verification anchor
Part I, Line 12       | part_i.line_12_total_revenue | Must equal sum of Lines 1h–11 (Δ ≤ $0.00)     | Financial threshold eligibility proof
Part VII, Section A   | officers_compensation        | Array of objects, sorted by compensation desc | Governance transparency audit trail
Part VIII, Line 1a    | contributions_grants         | Non-negative Decimal, currency-stripped       | Donor restriction classification base
Schedule O, Narrative | supplemental_narratives      | UTF-8 string, max 100KB                       | Regulatory disclosure completeness flag
Signature Block       | authorized_officer           | Required string, non-null                     | Legal attestation compliance marker

All fields marked as required trigger immediate rejection if absent. Optional fields default to null or 0 per canonical schema defaults. The mapping aligns with the official Instructions for Form 990 published by the Internal Revenue Service, and monetary values use Python's decimal module to avoid binary floating-point rounding.
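Several of the matrix rules translate directly into small checks. The sketch below covers the EIN, contributions, narrative, and signature rows; the `check_matrix_rules` helper is hypothetical, "100KB" is assumed to mean 100 × 1024 bytes, and `contributions_grants` and `supplemental_narratives` are assumed to be optional because the matrix does not mark them as required.

```python
import re
from decimal import Decimal
from typing import Any, Dict, List

EIN_PATTERN = re.compile(r"[0-9]{9}")
MAX_NARRATIVE_BYTES = 100 * 1024  # assumption: "100KB" read as 100 KiB


def check_matrix_rules(payload: Dict[str, Any]) -> List[str]:
    """Apply a subset of the matrix validation rules and collect violations."""
    errors: List[str] = []

    # Header / EIN: exactly nine ASCII digits once hyphens are removed.
    ein = str(payload.get("ein", "")).replace("-", "").strip()
    if not EIN_PATTERN.fullmatch(ein):
        errors.append("ein: expected 9 digits after hyphen removal")

    # Part VIII, Line 1a: non-negative when present.
    contributions = payload.get("contributions_grants")
    if contributions is not None and Decimal(str(contributions)) < 0:
        errors.append("contributions_grants: must be non-negative")

    # Schedule O: size is measured on the UTF-8 encoded bytes, not characters.
    narrative = payload.get("supplemental_narratives")
    if narrative is not None and len(str(narrative).encode("utf-8")) > MAX_NARRATIVE_BYTES:
        errors.append("supplemental_narratives: exceeds 100KB when UTF-8 encoded")

    # Signature block: required, non-null, non-blank string.
    officer = payload.get("authorized_officer")
    if not isinstance(officer, str) or not officer.strip():
        errors.append("authorized_officer: required non-empty string")

    return errors
```

Collecting every violation in one pass, rather than stopping at the first, gives downstream triage a complete picture of a rejected filing.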


Audit & Observability Standards

Every transformation step emits structured logs containing:

  • trace_id: UUID correlating ingestion, normalization, validation, and handoff.
  • schema_version: Immutable reference to the active canonical schema.
  • validation_status: PASSED, FAILED, or RECONCILIATION_WARNING.
  • compliance_tags: Array of jurisdiction-agnostic and funder-agnostic compliance markers.
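A structured log record carrying the four fields above might be built as shown here. The logger name `irs990.audit` and the helper names are assumptions for illustration; the sketch uses only the standard library and serializes each record as a single JSON line.

```python
import json
import logging
from typing import List

logger = logging.getLogger("irs990.audit")  # hypothetical logger name

ALLOWED_STATUSES = {"PASSED", "FAILED", "RECONCILIATION_WARNING"}


def audit_record(
    trace_id: str,
    schema_version: str,
    validation_status: str,
    compliance_tags: List[str],
) -> str:
    """Serialize one audit record as a JSON line with stable key order."""
    if validation_status not in ALLOWED_STATUSES:
        raise ValueError(f"Unknown validation_status: {validation_status!r}")
    record = {
        "trace_id": trace_id,
        "schema_version": schema_version,
        "validation_status": validation_status,
        "compliance_tags": list(compliance_tags),
    }
    return json.dumps(record, sort_keys=True)


def emit_audit(
    trace_id: str,
    schema_version: str,
    validation_status: str,
    compliance_tags: List[str],
) -> str:
    """Log the record at INFO level and return the serialized line."""
    line = audit_record(trace_id, schema_version, validation_status, compliance_tags)
    logger.info(line)
    return line
```

Rejecting unknown status values at the point of emission keeps the log vocabulary limited to the three states listed above.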

Audit logs must be written to an immutable, write-once storage layer. No payload mutation occurs post-validation. Pipeline retry logic handles transient ingestion failures; this module does not implement backoff strategies or circuit breakers. Observability metrics (normalization latency, validation error rates, reconciliation deltas) feed into the central compliance dashboard, enabling real-time anomaly detection and regulatory reporting.

By maintaining strict separation of concerns, deterministic type coercion, and explicit compliance mapping, this subsystem ensures that every 990 payload entering the grant automation pipeline is structurally sound, mathematically consistent, and fully auditable.