Compliance Metadata Standards

This guide is part of the Core Architecture & Compliance Mapping framework and defines the discrete control layer that governs how compliance metadata is structured, versioned, validated, and sealed as it moves through a grant automation pipeline. The scope is deliberately narrow: it specifies the canonical metadata envelope, the cryptographic provenance fields every payload must carry, and the immutable audit hooks attached at each stage boundary. It is written for nonprofit operations leads, grant program managers, Python automation developers, and compliance officers who need a deterministic contract for audit-ready data.

This standard does not perform document extraction, run grantor adjudication, or disburse funds. Those responsibilities belong to adjacent modules: structural normalization is handled upstream by Field Mapping & Normalization, funder constraint logic lives in Grantor-Specific Rule Taxonomies, and access enforcement is owned by Data Security & Access Boundaries. This module terminates at the point where a metadata envelope is validated, enriched, and serialized for submission. Everything in scope is the metadata about the data, never the financial payload itself.

Prerequisites

The reference implementation targets a reproducible, pinned environment. Floating dependency ranges are prohibited because validation semantics (especially Pydantic strict mode) can change between minor releases, which would silently alter rejection behaviour.

Python: 3.11 or later (from 3.11, datetime.fromisoformat parses most ISO-8601 forms, including a trailing Z; tomllib is also new in 3.11).
Pinned packages:

bash

pip install "pydantic==2.6.4" "orjson==3.10.0" "structlog==24.1.0"

Environment variables:

Variable	Purpose	Example
`METADATA_SCHEMA_VERSION`	Pins the active envelope contract version	`1.0`
`COMPLIANCE_REGISTRY_URL`	Endpoint for jurisdictional reconciliation lookups	`https://registry.internal/charity`
`AUDIT_LOG_SINK`	Append-only destination for signed audit records	`s3://audit-ledger/grants/`
`RETENTION_FLOOR_YEARS`	Minimum record retention per 2 CFR §200.334	`3`
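
One way to surface these variables to the pipeline is a small frozen settings object loaded once at startup. The sketch below is illustrative only: the `PipelineSettings` class, the `load_settings` helper, and the fail-fast behaviour on a missing or too-low value are assumptions, not part of the standard. Only the variable names and the three-year floor come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping

_REQUIRED = (
    "METADATA_SCHEMA_VERSION",
    "COMPLIANCE_REGISTRY_URL",
    "AUDIT_LOG_SINK",
    "RETENTION_FLOOR_YEARS",
)


@dataclass(frozen=True)
class PipelineSettings:
    """Hypothetical container for the four variables in the table above."""

    schema_version: str
    registry_url: str
    audit_log_sink: str
    retention_floor_years: int


def load_settings(env: Mapping[str, str] = os.environ) -> PipelineSettings:
    """Read every required variable; raise rather than fall back to a silent default."""
    missing = [name for name in _REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {missing}")

    years = int(env["RETENTION_FLOOR_YEARS"])
    if years < 3:
        # Assumption: a value below the three-year floor is a configuration error.
        raise ValueError("RETENTION_FLOOR_YEARS must be at least 3")

    return PipelineSettings(
        schema_version=env["METADATA_SCHEMA_VERSION"],
        registry_url=env["COMPLIANCE_REGISTRY_URL"],
        audit_log_sink=env["AUDIT_LOG_SINK"],
        retention_floor_years=years,
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to exercise in tests.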

Upstream stage dependencies: payloads must already be type-normalized by the ingestion layer. Where a grant filing is sourced from IRS data, field semantics must first resolve against the IRS 990 Data Schema Mapping contract so that monetary fields and line-item identifiers are canonical before metadata sealing begins.

Core Implementation: The Metadata Envelope

Every payload entering the pipeline is wrapped in a versioned metadata envelope. The envelope is a strict contract: implicit type coercion is disabled, mandatory provenance fields are enforced, and a SHA-256 digest binds the metadata to the exact bytes it describes. Validation failures are routed as structured errors, never swallowed.

python

import hashlib
import json
import logging
from datetime import datetime, timezone
from typing import Any

from pydantic import (
    BaseModel,
    ConfigDict,
    StrictInt,
    StrictStr,
    ValidationError,
    field_validator,
)

# Structured audit logger for compliance traceability; never use print()
audit_logger = logging.getLogger("compliance.audit")
audit_logger.setLevel(logging.INFO)


class IngestionMetadata(BaseModel):
    # Reject unknown keys so unexpected fields surface as validation errors
    # instead of being silently ignored (Pydantic's default is extra="ignore").
    model_config = ConfigDict(extra="forbid")

    grant_id: StrictStr
    fiscal_year: StrictInt
    compliance_jurisdiction: StrictStr
    data_origin_hash: StrictStr
    ingestion_timestamp: StrictStr

    @field_validator("data_origin_hash")
    @classmethod
    def validate_sha256(cls, v: str) -> str:
        if len(v) != 64 or not all(c in "0123456789abcdef" for c in v):
            raise ValueError("Invalid SHA-256 hex digest format")
        return v

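    @field_validator("ingestion_timestamp")
    @classmethod
    def validate_iso8601(cls, v: str) -> str:
        try:
            parsed = datetime.fromisoformat(v.replace("Z", "+00:00"))
        except ValueError as exc:
            raise ValueError("Timestamp must be valid ISO-8601 UTC") from exc
        # A naive or non-UTC timestamp parses cleanly but breaks the UTC contract.
        offset = parsed.utcoffset()
        if offset is None or offset.total_seconds() != 0:
            raise ValueError("Timestamp must be valid ISO-8601 UTC")
        return v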


def compute_payload_hash(raw_bytes: bytes) -> str:
    """Generate a deterministic SHA-256 digest for the raw payload at the network edge."""
    return hashlib.sha256(raw_bytes).hexdigest()


def validate_ingestion_envelope(raw_bytes: bytes, metadata_json: str) -> dict[str, Any]:
    """Validate metadata, verify integrity, attach an audit record, and return the envelope."""
    try:
        metadata = IngestionMetadata.model_validate_json(metadata_json)
    except ValidationError as exc:
        # Explicit structured error routing — field-level detail preserved for the caller.
        audit_logger.error("Ingestion validation failed", extra={"errors": exc.errors()})
        return {"status": "REJECTED", "stage": "INGESTION", "errors": exc.errors()}

    expected_hash = compute_payload_hash(raw_bytes)
    if metadata.data_origin_hash != expected_hash:
        audit_logger.critical("Hash mismatch detected", extra={"grant_id": metadata.grant_id})
        return {
            "status": "REJECTED",
            "stage": "INGESTION",
            "errors": [{"type": "integrity", "msg": "Payload integrity verification failed"}],
        }

    audit_record = {
        "stage": "INGESTION",
        "grant_id": metadata.grant_id,
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "hash_verified": True,
        "schema_version": "1.0",
    }
    audit_logger.info("Ingestion envelope validated", extra=audit_record)
    return {"status": "ACCEPTED", "metadata": metadata.model_dump(), "audit_trail": [audit_record]}

Two design choices are load-bearing. First, StrictStr and StrictInt disable Pydantic’s implicit coercion, so a JSON "2024" will be rejected rather than silently accepted as the integer 2024 — this prevents a class of fiscal-year drift that otherwise corrupts downstream rule evaluation. Second, the function returns a structured rejection object instead of raising, so the caller can route the failure into Error Categorization & Logging without unwinding the stack and losing field-level context.
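
The second choice is easiest to see from the caller's side. The sketch below assumes only the result shape returned by validate_ingestion_envelope (a status key plus either errors or metadata). The `route_result` function and the two in-memory lists standing in for queues are hypothetical.

```python
from typing import Any


def route_result(
    result: dict[str, Any],
    accepted: list[dict[str, Any]],
    rejected: list[dict[str, Any]],
) -> None:
    """Send a validation result to the accepted or rejected sink without raising."""
    if result.get("status") == "ACCEPTED":
        accepted.append(result)
        return

    errors = result.get("errors", [])
    # Keep the stage and every field-level error so the failure can be categorized later.
    rejected.append({
        "stage": result.get("stage", "UNKNOWN"),
        "failed_fields": [
            ".".join(str(part) for part in err["loc"]) for err in errors if err.get("loc")
        ],
        "errors": errors,
    })
```

Because nothing is raised, a batch loop can keep processing the remaining envelopes while rejected ones accumulate with their field names intact.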

Field Mapping & Schema Contract

Source systems describe the same compliance concept under inconsistent keys. The metadata standard pins one canonical name per concept and resolves all known aliases during envelope construction. Alias resolution is deterministic and case-folded; an unmapped key is preserved untouched so that unexpected fields surface in validation rather than being dropped.

Canonical field	Accepted aliases	Type	Coercion rule
`grant_id`	`grantId`, `award_no`, `grant_number`	`str`	None — must arrive as a string
`fiscal_year`	`fy`, `fiscalYear`, `budget_year`	`int`	Reject string-numerics; require true integer
`compliance_jurisdiction`	`jurisdiction`, `state_code`, `reg_state`	`str`	Upper-cased ISO 3166-2 subdivision code
`data_origin_hash`	`payload_hash`, `sha256`	`str`	64-char lowercase hex; reject otherwise
`ingestion_timestamp`	`received_at`, `ts`	`str`	Normalize trailing `Z` to `+00:00` UTC offset

python

_ALIAS_MAP: dict[str, str] = {
    "grantid": "grant_id", "award_no": "grant_id", "grant_number": "grant_id",
    "fy": "fiscal_year", "fiscalyear": "fiscal_year", "budget_year": "fiscal_year",
    "jurisdiction": "compliance_jurisdiction", "state_code": "compliance_jurisdiction",
    "reg_state": "compliance_jurisdiction",
    "payload_hash": "data_origin_hash", "sha256": "data_origin_hash",
    "received_at": "ingestion_timestamp", "ts": "ingestion_timestamp",
}


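_CANONICAL_FIELDS = frozenset(_ALIAS_MAP.values())


def resolve_aliases(raw: dict[str, Any]) -> dict[str, Any]:
    """Fold known aliases to canonical field names; preserve unknown keys for strict validation."""
    resolved: dict[str, Any] = {}
    for key, value in raw.items():
        folded = key.lower()
        # Aliases and case variants of canonical names fold; unknown keys pass through untouched.
        canonical = _ALIAS_MAP.get(folded, folded if folded in _CANONICAL_FIELDS else key)
        if canonical in resolved:
            raise ValueError(f"Conflicting aliases resolved to '{canonical}'")
        resolved[canonical] = value
    return resolved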

Collisions — two different source keys mapping to the same canonical field — raise immediately, because silently choosing one value would make the resulting envelope non-reproducible and break the integrity guarantee.

Metadata Lifecycle Across Stage Boundaries

The envelope is enriched, never mutated in place, as it crosses stage boundaries. Each handoff appends exactly one signed audit record so the chain length equals the number of stages the payload has cleared. This append-only discipline is what makes the trail reconstructable during an audit.
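
The text calls each audit record "signed" without fixing a signature scheme. As one possible reading, the sketch below tags every record with HMAC-SHA256 and links it to its predecessor's tag, so removing or editing an earlier record invalidates the chain. The key handling, the `prev_signature` and `signature` field names, and both helper functions are assumptions made for illustration.

```python
import hashlib
import hmac
import json
from typing import Any


def _canonical_bytes(record: dict[str, Any]) -> bytes:
    """Serialize with sorted keys and fixed separators so the tag is reproducible."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def append_signed_record(
    trail: list[dict[str, Any]], record: dict[str, Any], key: bytes
) -> list[dict[str, Any]]:
    """Return a new trail with one more record, tagged and chained to the previous one."""
    prev = trail[-1]["signature"] if trail else ""
    body = {**record, "prev_signature": prev}
    tag = hmac.new(key, _canonical_bytes(body), hashlib.sha256).hexdigest()
    return [*trail, {**body, "signature": tag}]


def verify_trail(trail: list[dict[str, Any]], key: bytes) -> bool:
    """Recompute every tag and check that each record points at its predecessor."""
    prev = ""
    for entry in trail:
        body = {k: v for k, v in entry.items() if k != "signature"}
        expected = hmac.new(key, _canonical_bytes(body), hashlib.sha256).hexdigest()
        if body.get("prev_signature") != prev:
            return False
        if not hmac.compare_digest(expected, entry.get("signature", "")):
            return False
        prev = entry["signature"]
    return True
```

Returning a new list rather than appending in place keeps the helper consistent with the append-only discipline described above.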

When a validated envelope leaves ingestion it enters reconciliation, where jurisdictional tagging is checked against State Charity Registration Compliance before PII is masked at the boundary.

python

def reconcile_jurisdictional_status(metadata: dict[str, Any], registry_lookup_fn) -> dict[str, Any]:
    """Enrich metadata with jurisdictional compliance status (idempotent)."""
    grant_id = metadata["grant_id"]
    jurisdiction = metadata["compliance_jurisdiction"]

    if metadata.get("reconciliation_status") == "COMPLETED":
        return metadata  # idempotent: re-processing yields the same state, no duplicate audit entry

    registry_data = registry_lookup_fn(grant_id, jurisdiction)
    if not registry_data:
        audit_logger.error("Registry lookup empty", extra={"grant_id": grant_id})
        return {**metadata, "reconciliation_status": "FAILED"}

    # Build a new dict and a new audit-trail list so the caller's envelope is never mutated.
    now_utc = datetime.now(timezone.utc).isoformat()
    enriched = {
        **metadata,
        "registration_status": registry_data.get("status", "UNKNOWN"),
        "registry_match_score": registry_data.get("match_score", 0.0),
        "last_verified_date": now_utc,
        "reconciliation_status": "COMPLETED",
    }
    enriched["audit_trail"] = [
        *metadata.get("audit_trail", []),
        {
            "stage": "RECONCILIATION",
            "grant_id": grant_id,
            "action": "JURISDICTIONAL_ENRICHMENT",
            "timestamp_utc": now_utc,
        },
    ]
    return enriched

Reconciled metadata is then evaluated against funder constraints defined in Grantor-Specific Rule Taxonomies. The rule engine is a stateless reader: it emits a verdict and a violation list, and appends a verdict record, but never edits the upstream fields.

python

from enum import Enum


class ComplianceVerdict(str, Enum):
    PASS = "PASS"
    CONDITIONAL = "CONDITIONAL"
    FAIL = "FAIL"


def evaluate_grantor_rules(
    metadata: dict[str, Any], rule_matrix: dict[str, Any]
) -> tuple[ComplianceVerdict, list[str]]:
    """Evaluate metadata against an immutable rule taxonomy; return verdict + violations."""
    violations: list[str] = []
    if metadata.get("fiscal_year") not in rule_matrix.get("allowed_fiscal_years", []):
        violations.append("FISCAL_YEAR_OUTSIDE_GRANT_WINDOW")
    if metadata.get("compliance_jurisdiction") not in rule_matrix.get("permitted_jurisdictions", []):
        violations.append("JURISDICTION_NOT_ELIGIBLE")
    if metadata.get("registration_status") != "ACTIVE":
        violations.append("REGISTRATION_STATUS_INVALID")

    verdict = ComplianceVerdict.FAIL if violations else ComplianceVerdict.PASS
    metadata.setdefault("audit_trail", []).append({
        "stage": "RULE_ENGINE",
        "grant_id": metadata["grant_id"],
        "verdict": verdict.value,
        "rules_evaluated": len(rule_matrix),
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
    })
    return verdict, violations

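A FAIL verdict quarantines the payload; a CONDITIONAL verdict triggers manual review (the reference evaluator above emits only PASS or FAIL). Only a PASS reaches the reporting stage, where the envelope is mapped to the regulatory submission schema and frozen. Serialization uses sorted keys and orjson's fixed, compact separators, so a given frozen envelope always serializes to the same bytes, which is a precondition for the final checksum that seals the audit chain. Because the function below stamps a fresh REPORTING record with the current time, call it once per envelope and checksum the bytes it returns; a second call appends another record and yields different output.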

python

import orjson


def serialize_compliance_report(metadata: dict[str, Any]) -> bytes:
    """Finalize the envelope and emit a deterministic, submission-ready artifact."""
    metadata.setdefault("audit_trail", []).append({
        "stage": "REPORTING",
        "grant_id": metadata["grant_id"],
        "action": "FINAL_SERIALIZATION",
        "audit_chain_length": len(metadata.get("audit_trail", [])) + 1,
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
    })
    report_payload = {
        "schema_version": "1.0",
        "metadata": {k: v for k, v in metadata.items() if k != "audit_trail"},
        "audit_trail": metadata["audit_trail"],
    }
    # orjson with OPT_SORT_KEYS gives deterministic, byte-stable output for checksum sealing.
    return orjson.dumps(report_payload, option=orjson.OPT_SORT_KEYS)
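
The final checksum itself is not shown in the reference code. A minimal sketch, assuming the seal is a plain SHA-256 digest over the serialized bytes: the `canonical_bytes`, `seal_artifact`, and `verify_seal` names are hypothetical, and stdlib json with sorted keys and fixed separators stands in for orjson so the example runs without third-party packages.

```python
import hashlib
import json
from typing import Any


def canonical_bytes(report: dict[str, Any]) -> bytes:
    """Stdlib stand-in for the sorted-key serialization step."""
    return json.dumps(report, sort_keys=True, separators=(",", ":")).encode("utf-8")


def seal_artifact(artifact: bytes) -> str:
    """Return the hex SHA-256 digest that would be stored alongside the artifact."""
    return hashlib.sha256(artifact).hexdigest()


def verify_seal(artifact: bytes, seal: str) -> bool:
    """Recompute the digest over the stored bytes and compare it with the recorded seal."""
    return seal_artifact(artifact) == seal


# Key order in the source dict does not affect the serialized bytes or the seal.
first = canonical_bytes({"schema_version": "1.0", "metadata": {"grant_id": "G-1", "fiscal_year": 2024}})
second = canonical_bytes({"metadata": {"fiscal_year": 2024, "grant_id": "G-1"}, "schema_version": "1.0"})
same_seal = seal_artifact(first) == seal_artifact(second)
```

The seal is computed over the bytes that were actually emitted, not over a re-serialization, so any later change to the stored artifact is detectable.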

Validation & Testing

Metadata contracts are only trustworthy if their rejection behaviour is pinned by tests. The suite asserts both the happy path and the precise structured error returned on failure, including the audit record that must accompany every accepted envelope.

python

import json

# Assumes the envelope code above is saved as envelope.py on the import path.
from envelope import compute_payload_hash, validate_ingestion_envelope


def _envelope(raw: bytes, **overrides) -> str:
    base = {
        "grant_id": "G-2024-0001",
        "fiscal_year": 2024,
        "compliance_jurisdiction": "US-CA",
        "data_origin_hash": compute_payload_hash(raw),
        "ingestion_timestamp": "2024-06-27T12:00:00Z",
    }
    base.update(overrides)
    return json.dumps(base)


def test_accepts_valid_envelope() -> None:
    raw = b'{"amount": 50000}'
    result = validate_ingestion_envelope(raw, _envelope(raw))
    assert result["status"] == "ACCEPTED"
    assert result["audit_trail"][0]["hash_verified"] is True


def test_rejects_string_fiscal_year() -> None:
    raw = b'{"amount": 50000}'
    bad = _envelope(raw, fiscal_year="2024")  # strict mode must reject the string
    result = validate_ingestion_envelope(raw, bad)
    assert result["status"] == "REJECTED"
    assert any(e["loc"] == ("fiscal_year",) for e in result["errors"])


def test_rejects_hash_mismatch() -> None:
    raw = b'{"amount": 50000}'
    tampered = _envelope(b'{"amount": 99999}')  # hash computed over different bytes
    result = validate_ingestion_envelope(raw, tampered)
    assert result["status"] == "REJECTED"
    assert result["errors"][0]["type"] == "integrity"

For broader coverage, drive the alias resolver with hypothesis to confirm that arbitrary key casings fold to the same canonical field and that no input silently drops a mandatory field.
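
The casing property can be approximated without third-party packages by enumerating every upper/lower variant of each alias. The sketch below carries a trimmed copy of the alias map and resolver so it runs on its own. In a real suite the resolver would be imported from the module under test and the loop replaced by a hypothesis strategy. The `all_casings` and `check_casing_property` helpers are hypothetical.

```python
from itertools import product
from typing import Any, Iterator

# Trimmed copy of the alias map, for illustration only.
_ALIAS_MAP: dict[str, str] = {
    "grantid": "grant_id",
    "award_no": "grant_id",
    "fy": "fiscal_year",
    "ts": "ingestion_timestamp",
}


def resolve_aliases(raw: dict[str, Any]) -> dict[str, Any]:
    """Fold known aliases to canonical names; raise on a collision."""
    resolved: dict[str, Any] = {}
    for key, value in raw.items():
        canonical = _ALIAS_MAP.get(key.lower(), key)
        if canonical in resolved:
            raise ValueError(f"Conflicting aliases resolved to '{canonical}'")
        resolved[canonical] = value
    return resolved


def all_casings(key: str) -> Iterator[str]:
    """Yield every upper/lower-case variant of key (2**n variants for n letters)."""
    options = [(c.lower(), c.upper()) if c.isalpha() else (c,) for c in key]
    for combo in product(*options):
        yield "".join(combo)


def check_casing_property() -> int:
    """Assert that every casing of every alias folds to its canonical field; return the count."""
    checked = 0
    for alias, canonical in _ALIAS_MAP.items():
        for variant in all_casings(alias):
            assert resolve_aliases({variant: "x"}) == {canonical: "x"}
            checked += 1
    return checked
```

Exhaustive enumeration is only practical because the aliases are short; a generated strategy scales better once longer keys are added.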

Performance & Scale Considerations

Nonprofit-scale workloads are bursty (grant-cycle deadlines) rather than continuously high-volume, so the standard optimizes for predictable memory and bounded concurrency rather than raw throughput.

Batch sizing: validate in batches of 250–500 envelopes. Envelope objects are small (well under 4 KB each), so a 500-envelope batch holds comfortably under a 50 MB working-set ceiling even with audit trails attached.
Concurrency limits: cap registry-bound reconciliation at 8–16 concurrent lookups. The bottleneck is the external charity registry, not local CPU; oversaturating it produces the timeout failures categorized below.
Serialization: prefer orjson over the stdlib json for the reporting stage — it is roughly an order of magnitude faster on sorted-key output and matters when sealing thousands of artifacts at cycle close.
Audit trail growth: the trail grows by one record per stage, so a fully processed envelope carries four records. Do not inline large diffs into audit records; store a hash reference and keep the record fixed-size.
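
The batch and concurrency guidance above can be sketched as follows. At under 4 KB each, 500 envelopes amount to roughly 2 MB of raw envelope data, well inside the stated ceiling. The `batched` and `reconcile_batch` helpers, the async `lookup` callable, and the default cap of 8 are assumptions chosen from within the ranges given above, not part of the reference implementation.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable, Iterator

Envelope = dict[str, Any]


def batched(items: Iterable[Envelope], size: int = 500) -> Iterator[list[Envelope]]:
    """Yield fixed-size batches so only one batch is held in memory at a time."""
    batch: list[Envelope] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


async def reconcile_batch(
    batch: list[Envelope],
    lookup: Callable[[Envelope], Awaitable[Envelope]],
    max_concurrency: int = 8,
) -> list[Envelope]:
    """Run lookups for one batch with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(envelope: Envelope) -> Envelope:
        async with semaphore:
            return await lookup(envelope)

    # gather preserves input order, so results line up with the batch.
    return await asyncio.gather(*(guarded(e) for e in batch))
```

The semaphore bounds pressure on the registry regardless of batch size, which is the property that matters when the external service is the bottleneck.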

Where envelopes are produced by concurrent extraction workers, governance of that fan-out belongs to Async Batch Processing Pipelines, which feeds normalized payloads into this validation boundary.

Failure Modes & Troubleshooting

Error category	Root cause	Remediation
`STRICT_TYPE_REJECTION`	Source sent a string where an int is required (e.g. `"2024"`)	Coerce types at the normalization layer before sealing; never relax strict mode
`INTEGRITY_HASH_MISMATCH`	Payload bytes mutated after the hash was computed	Recompute `data_origin_hash` at the true edge; reject any in-flight rewrite
`ALIAS_COLLISION`	Two source keys map to one canonical field	Inspect the source schema; pin a single accepted alias and drop the duplicate upstream
`REGISTRY_LOOKUP_TIMEOUT`	Registry saturated by excess concurrency	Lower the concurrency cap; route exhausted lookups to the retry queue
`VERDICT_QUARANTINE`	Rule engine returned `FAIL`	Resolve the listed violation codes; re-run reconciliation, do not edit the sealed envelope

Transient failures (timeouts, empty registry responses) must not be retried in place. Hand them to Pipeline Fallback & Retry Logic, which applies bounded exponential backoff and preserves the partial metadata state in a dead-letter queue so no audit context is lost.
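
A sketch of that handoff, under stated assumptions: the `DeadLetterEntry` shape, the `hand_off` function, and the backoff parameters (1 s base, 60 s cap, 5 attempts) are illustrative and are not defined by Pipeline Fallback & Retry Logic. It shows the two properties the paragraph asks for: the delay schedule is bounded, and the partial metadata travels with the failure.

```python
import copy
from dataclasses import dataclass, field
from typing import Any


def backoff_delays(base: float = 1.0, cap: float = 60.0, attempts: int = 5) -> list[float]:
    """Exponential delay schedule, capped per attempt and limited in length."""
    return [min(cap, base * 2 ** n) for n in range(attempts)]


@dataclass
class DeadLetterEntry:
    """Hypothetical dead-letter record: what failed, and the state at failure time."""

    error_category: str
    partial_metadata: dict[str, Any]
    retry_delays: list[float] = field(default_factory=backoff_delays)


def hand_off(
    metadata: dict[str, Any], error_category: str, queue: list[DeadLetterEntry]
) -> None:
    """Queue the failure for the retry module instead of retrying in place."""
    # Deep copy so later pipeline activity cannot alter the captured audit context.
    queue.append(DeadLetterEntry(error_category, copy.deepcopy(metadata)))
```

The list used as a queue here is a placeholder; a durable queue would take its place in a deployed pipeline.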

Compliance Alignment

These standards exist to satisfy concrete regulatory obligations, not abstract best practice:

2 CFR §200.302 (financial management): the immutable, hash-verified envelope provides the records that “identify the source and application” of federal award funds and support reconciliation.
2 CFR §200.303(e) (internal controls): field-level masking at the reconciliation boundary safeguards personally identifiable information, with enforcement delegated to Data Security & Access Boundaries.
2 CFR §200.334 (record retention): audit trails and serialized artifacts are retained for a minimum of three years from final report submission; the RETENTION_FLOOR_YEARS floor encodes this.
IRS Form 990: canonical mapping of monetary metadata aligns with Part VIII (Statement of Revenue) and Part IX (Statement of Functional Expenses) line items so derived disclosures are reconstructable from sealed envelopes.
State charity thresholds: jurisdictional tagging supports state filings such as California’s RRF-1 and New York’s CHAR500, where solicitation registration status gates eligibility.
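
The retention floor reduces to simple date arithmetic. The sketch below assumes the clock starts on the final report submission date, as stated above, and that the floor is a minimum only. Circumstances that extend retention are out of scope, and the `earliest_disposal_date` helper is hypothetical.

```python
from datetime import date


def earliest_disposal_date(final_report_submitted: date, retention_floor_years: int = 3) -> date:
    """Return the first date on which the retention floor has elapsed."""
    target_year = final_report_submitted.year + retention_floor_years
    try:
        return final_report_submitted.replace(year=target_year)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; roll forward to
        # 1 March so the retention period is never shortened.
        return date(target_year, 3, 1)
```

A record submitted on 27 June 2024 would therefore be held until at least 27 June 2027 under the default floor.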

Related

Core Architecture & Compliance Mapping: parent framework and system topology
IRS 990 Data Schema Mapping: canonical field semantics that feed the envelope
Grantor-Specific Rule Taxonomies: constraint evaluation downstream of reconciliation
Data Security & Access Boundaries: PII masking and least-privilege enforcement at stage handoffs
Pipeline Fallback & Retry Logic: deterministic recovery for transient validation and registry failures
Async Batch Processing Pipelines: concurrent producers that feed normalized payloads into this validation boundary
