Architectural Position & Scope
This module functions as a discrete transformation layer within the broader Core Architecture & Compliance Mapping framework. Its operational mandate is strictly bounded: ingest raw IRS Form 990 filings, normalize heterogeneous field representations into a canonical JSON structure, enforce deterministic validation, and emit audit-ready payloads. It does not execute grant disbursement workflows, synchronize donor CRM records, or render regulatory reports. By enforcing strict separation of concerns, this subsystem guarantees structural integrity, type fidelity, and immutable lineage for every 990 payload entering the grant management ecosystem.
The implementation targets four primary stakeholders:
- Nonprofit Operations & Grant Managers: Require predictable, standardized financial disclosures for portfolio tracking.
- Python Automation Developers: Require deterministic parsing contracts, explicit type coercion, and structured error payloads.
- Compliance Officers: Require auditable validation trails, mathematical reconciliation proofs, and explicit compliance tagging.
All downstream logic—including jurisdictional registration checks, funder-specific eligibility rules, and access control enforcement—resides outside this boundary. This module terminates at the validation gate.
Stage 1: Ingestion & Canonical Schema Normalization
The ingestion boundary accepts IRS 990 submissions across three primary transport formats: native XML (IRS eFiling), structured CSV (third-party aggregators), and OCR-extracted text (legacy archives). Each format routes through a dedicated parser adapter that outputs a flat or lightly nested dictionary. These dictionaries converge into a single normalization function responsible for alias resolution, type coercion, and structural alignment.
Deterministic Alias Resolution & Type Coercion
Field aliases (EIN, TaxID, EmployerIdentificationNumber, 990_EIN) collapse into a single canonical ein key via a deterministic lookup table. Numeric strings containing currency symbols, thousands separators, or trailing whitespace are sanitized and cast to decimal.Decimal to eliminate floating-point drift. Boolean indicators (X, Yes, 1, true, checked) normalize to strict Python bool types.
For complex compensation and officer disclosures, array ordering and nested object validation must align precisely with the structural conventions documented in How to map IRS 990 Part VII to JSON schema. Deviations from this contract trigger immediate normalization rejection.
import logging
import decimal
from typing import Any, Dict, List
from datetime import datetime, timezone
logger = logging.getLogger("irs990.normalizer")
# Canonical alias mapping (subset for demonstration)
FIELD_ALIASES = {
"ein": {"ein", "tax_id", "employer_identification_number", "990_ein"},
"organization_name": {"name", "org_name", "legal_name"},
"total_revenue": {"part_i_line_12", "total_revenue", "gross_receipts"},
"is_501c3": {"is_501c3", "exempt_status", "section_501c3", "is_charitable"}
}
BOOLEAN_TRUTHY = {"x", "yes", "1", "true", "checked", "y"}
def _sanitize_numeric(value: Any) -> decimal.Decimal:
"""Strip non-numeric characters and return Decimal. Raises ValueError on failure."""
if isinstance(value, (int, float, decimal.Decimal)):
return decimal.Decimal(str(value))
cleaned = str(value).replace(",", "").replace("$", "").strip()
if not cleaned or cleaned.lower() in ("n/a", "none", "null"):
return decimal.Decimal("0")
return decimal.Decimal(cleaned)
def normalize_payload(raw: Dict[str, Any], trace_id: str) -> Dict[str, Any]:
"""Normalize raw parsed 990 dictionary into canonical structure."""
canonical: Dict[str, Any] = {}
audit_log: List[str] = []
for canonical_key, aliases in FIELD_ALIASES.items():
matched = False
for alias in aliases:
if alias in raw:
val = raw[alias]
if canonical_key == "ein":
canonical["ein"] = str(val).replace("-", "").strip()
elif canonical_key == "total_revenue":
canonical["total_revenue"] = _sanitize_numeric(val)
elif canonical_key == "is_501c3":
canonical["is_501c3"] = str(val).strip().lower() in BOOLEAN_TRUTHY
else:
canonical[canonical_key] = str(val).strip()
matched = True
break
if not matched:
audit_log.append(f"MISSING_ALIAS:{canonical_key}")
# Attach immutable audit metadata
canonical["_meta"] = {
"trace_id": trace_id,
"normalized_at": datetime.now(timezone.utc).isoformat(),
"normalization_warnings": audit_log,
"schema_version": "irs990_v2.1"
}
logger.info(
"Normalization complete",
extra={"trace_id": trace_id, "warnings": len(audit_log)}
)
return canonical
Normalization terminates at the schema validation gate. No reconciliation, rule evaluation, or compliance tagging occurs within this boundary.
Stage 2: Field-Level Validation & Cross-Record Reconciliation
Validated payloads enter the validation layer, which executes explicit structural and semantic checks. The implementation relies on pydantic v2 models backed by strict jsonschema validators. Required fields trigger immediate rejection if absent or null. Type mismatches generate structured error payloads containing the JSON path, expected type, and actual value.
Mathematical Reconciliation & Tolerance Thresholds
Cross-record reconciliation enforces mathematical consistency across IRS 990 sections. For example, Part I Line 12 (Total Revenue) must equal the sum of constituent lines (1h, 2–11). Validation logic computes the delta and flags discrepancies exceeding a configurable tolerance threshold (default: $0.00). Reconciliation failures do not halt the pipeline; instead, they emit structured exception payloads for downstream triage.
from pydantic import BaseModel, Field, ValidationError, field_validator
from typing import List, Dict
import decimal
class IRS990PartI(BaseModel):
line_1h: decimal.Decimal = Field(default=decimal.Decimal("0"))
line_2: decimal.Decimal = Field(default=decimal.Decimal("0"))
line_3: decimal.Decimal = Field(default=decimal.Decimal("0"))
line_4: decimal.Decimal = Field(default=decimal.Decimal("0"))
line_5: decimal.Decimal = Field(default=decimal.Decimal("0"))
line_6: decimal.Decimal = Field(default=decimal.Decimal("0"))
line_7: decimal.Decimal = Field(default=decimal.Decimal("0"))
line_8: decimal.Decimal = Field(default=decimal.Decimal("0"))
line_9: decimal.Decimal = Field(default=decimal.Decimal("0"))
line_10: decimal.Decimal = Field(default=decimal.Decimal("0"))
line_11: decimal.Decimal = Field(default=decimal.Decimal("0"))
line_12_total_revenue: decimal.Decimal
@field_validator("line_12_total_revenue")
@classmethod
def validate_revenue_sum(cls, v: decimal.Decimal, info) -> decimal.Decimal:
constituent_sum = sum(getattr(info.data, f"line_{i}") for i in ["1h"] + [str(n) for n in range(2, 12)])
delta = abs(v - constituent_sum)
if delta > decimal.Decimal("0.00"):
raise ValueError(
f"Revenue reconciliation failed: Line 12 ({v}) != Sum of Lines ({constituent_sum}). Delta: {delta}"
)
return v
def validate_and_reconcile(normalized: Dict[str, Any], trace_id: str) -> Dict[str, Any]:
"""Execute Pydantic validation and reconciliation. Returns validated payload or structured error."""
try:
part_i = IRS990PartI(**normalized.get("part_i", {}))
normalized["part_i"] = part_i.model_dump(mode="json")
normalized["_meta"]["validation_status"] = "PASSED"
logger.info("Validation passed", extra={"trace_id": trace_id})
return normalized
except ValidationError as e:
error_payload = {
"trace_id": trace_id,
"validation_status": "FAILED",
"errors": [
{
"field": err["loc"][0],
"expected_type": "decimal.Decimal",
"actual_value": str(err["input"]),
"message": err["msg"]
}
for err in e.errors()
],
"timestamp": datetime.now(timezone.utc).isoformat()
}
logger.warning("Validation failed", extra={"trace_id": trace_id, "error_count": len(error_payload["errors"])})
return error_payload
Reconciliation logic remains strictly mathematical. Business rule evaluation, jurisdictional compliance checks, and funder-specific constraints are explicitly deferred to downstream consumers.
Stage 3: Compliance Handoff & Lineage Emission
Once payloads pass normalization and validation, the module emits a finalized canonical JSON object with attached compliance metadata. This emission point serves as the sole integration boundary for downstream systems.
The validated payload routes to the State Charity Registration Compliance subsystem for jurisdictional filing verification, multi-state nexus detection, and annual reporting deadline tracking. Simultaneously, the payload enters the Grantor-Specific Rule Taxonomies engine, where funder-defined eligibility matrices, programmatic restrictions, and matching requirements are evaluated.
This module does not participate in those evaluations. It guarantees only that:
- All required fields are present and type-safe.
- Mathematical cross-sections reconcile within tolerance.
- Audit lineage (trace ID, normalization timestamp, validation status, schema version) is immutable and queryable.
Access boundaries, encryption at rest, and pipeline fallback/retry logic are enforced by the orchestration layer. This subsystem emits payloads synchronously or via message queue, depending on deployment topology, but never retains state beyond the request lifecycle.
Compliance Mapping Matrix
The following matrix explicitly maps IRS 990 sections to canonical JSON keys, validation rules, and compliance artifacts. This mapping ensures deterministic translation across all ingestion formats.
| IRS 990 Section | Canonical JSON Key | Validation Rule | Compliance Artifact |
|---|---|---|---|
| Header / EIN | ein |
9-digit numeric, stripped of hyphens | Tax-exempt status verification anchor |
| Part I, Line 12 | part_i.line_12_total_revenue |
Must equal sum of Lines 1h–11 (Δ ≤ $0.00) | Financial threshold eligibility proof |
| Part VII, Section A | officers_compensation |
Array of objects, sorted by compensation desc | Governance transparency audit trail |
| Part VIII, Line 1a | contributions_grants |
Non-negative Decimal, currency-stripped | Donor restriction classification base |
| Schedule O, Narrative | supplemental_narratives |
UTF-8 string, max 100KB | Regulatory disclosure completeness flag |
| Signature Block | authorized_officer |
Required string, non-null | Legal attestation compliance marker |
All fields marked as required trigger immediate rejection if absent. Optional fields default to null or 0 per canonical schema defaults. The mapping aligns with the official Instructions for Form 990 published by the Internal Revenue Service and adheres to Python’s decimal arithmetic standards for financial precision.
Audit & Observability Standards
Every transformation step emits structured logs containing:
trace_id: UUID correlating ingestion, normalization, validation, and handoff.schema_version: Immutable reference to the active canonical schema.validation_status:PASSED,FAILED, orRECONCILIATION_WARNING.compliance_tags: Array of jurisdictional and funder-agnostic compliance markers.
Audit logs must be written to an immutable, write-once storage layer. No payload mutation occurs post-validation. Pipeline retry logic handles transient ingestion failures; this module does not implement backoff strategies or circuit breakers. Observability metrics (normalization latency, validation error rates, reconciliation deltas) feed into the central compliance dashboard, enabling real-time anomaly detection and regulatory reporting.
By maintaining strict separation of concerns, deterministic type coercion, and explicit compliance mapping, this subsystem ensures that every 990 payload entering the grant automation pipeline is structurally sound, mathematically consistent, and fully auditable.