Data Security & Access Boundaries

This module is part of the Core Architecture & Compliance Mapping framework, which establishes the deterministic, compliance-first architecture for the nonprofit grant lifecycle. Here the focus narrows to one concern: enforcing strict data perimeter controls so that credentials, personally identifiable information (PII), and restricted-fund ledgers never cross a stage they were not scoped for. Access boundaries are implemented as immutable, stage-isolated checkpoints — each workflow stage maintains independent credential scopes, cryptographic validation gates, and deterministic fallback pathways, and cross-stage data leakage is architecturally prohibited.

In scope: token scope validation, schema-gated ingestion, PII tokenization, role-based access control (RBAC) on ledger reads, processing-context isolation, and cryptographically sealed egress. Out of scope: the canonical field normalization handled at the IRS 990 Data Schema Mapping layer, jurisdictional registration logic owned by State Charity Registration Compliance, and grantor allowable-cost evaluation owned by Grantor-Specific Rule Taxonomies. This module enforces the perimeter; it does not score applications or allocate funds.

Prerequisites

The reference implementation targets Python 3.11+ (required for datetime.UTC ergonomics and improved typing support). Pin the following packages so signature-verification and hashing behavior stay reproducible across audit periods:

text

# requirements.txt
pydantic==2.7.1
pyjwt==2.8.0
cryptography==42.0.5
structlog==24.1.0
pytest==8.2.0
hypothesis==6.100.1

Required environment variables (load from a secrets manager, never from a committed .env):

| Variable | Purpose | Example source |
| --- | --- | --- |
| `JWT_PUBLIC_KEY_PEM` | RS256 public key used to verify inbound token signatures | KMS / Vault transit |
| `INGEST_SALT` | Stage-bound, 32-byte salt for deterministic PII tokenization | Secrets manager, rotated quarterly |
| `IDEMPOTENCY_TTL_HOURS` | Retention window for duplicate-submission rejection | Config (default `24`) |
| `HSM_KEY_ID` | Identifier for the HSM/KMS key that signs sealed audit roots | Cloud HSM |
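
The variables above can be read and validated once at startup so that a misconfigured worker fails closed before it accepts traffic. The following is a minimal sketch, not part of the reference implementation: `BoundaryConfig` and `load_boundary_config` are hypothetical names, and the hex encoding of `INGEST_SALT` is an assumption, since the storage encoding is not specified here.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class BoundaryConfig:
    jwt_public_key_pem: bytes
    ingest_salt: bytes
    idempotency_ttl_hours: int
    hsm_key_id: str


def load_boundary_config(env: Mapping[str, str] = os.environ) -> BoundaryConfig:
    """Read and validate the four variables; raise if any is unusable."""
    required = ("JWT_PUBLIC_KEY_PEM", "INGEST_SALT", "HSM_KEY_ID")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {missing}")

    # Assumption: the salt is stored hex-encoded (64 hex chars = 32 bytes).
    try:
        salt = bytes.fromhex(env["INGEST_SALT"])
    except ValueError as exc:
        raise RuntimeError("INGEST_SALT is not valid hex") from exc
    if len(salt) != 32:
        raise RuntimeError("INGEST_SALT must decode to exactly 32 bytes")

    # Falls back to the documented default of 24 hours.
    ttl_hours = int(env.get("IDEMPOTENCY_TTL_HOURS", "24"))
    if ttl_hours <= 0:
        raise RuntimeError("IDEMPOTENCY_TTL_HOURS must be positive")

    return BoundaryConfig(
        jwt_public_key_pem=env["JWT_PUBLIC_KEY_PEM"].encode(),
        ingest_salt=salt,
        idempotency_ttl_hours=ttl_hours,
        hsm_key_id=env["HSM_KEY_ID"],
    )
```

Passing a plain dict instead of `os.environ` keeps the loader testable without touching the process environment.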

Upstream dependency: this boundary stack sits at the very front of the pipeline. It receives raw payloads directly from external grant portals, CSV/XML uploads, and API webhooks — before any normalization occurs. Sanitized payloads that clear all four boundaries are handed off to the compliance mapping stage for rule evaluation, and every boundary emits audit records that conform to the Compliance Metadata Standards contract.

Core Implementation: Stage-Isolated Boundary Enforcement

The perimeter is composed of four sequential boundaries. Each is a self-contained, type-hinted class that performs exactly one class of enforcement, logs structured audit entries through structlog, and routes failures through explicit exceptions rather than swallowing them. No boundary shares mutable state with another; the only thing that crosses a boundary is an immutable, validated payload.

1. Ingestion boundary

Ingestion governs initial data entry. It operates exclusively on credential verification, schema validation, and PII isolation. Business logic, scoring, and fund allocation are deferred to downstream modules.

Token scope validation: every inbound request must present an RS256-signed JWT with an explicit ingest:read scope. Signature, issuer, and expiry are checked synchronously, and a request that fails any check is rejected without payload inspection: 401 for an invalid or expired token, 403 for a valid token that lacks the scope.
Explicit schema enforcement: payloads are parsed through a pydantic model configured with extra="forbid". Any deviation is rejected deterministically, the raw payload is preserved in an encrypted quarantine bucket, and a structured error object is returned.
PII boundary isolation: fields containing PII are tokenized with deterministic SHA-256 hashing over a stage-bound salt before entering any queue. Raw PII never persists in transit logs.
Audit trail generation: each attempt emits an immutable, append-only log entry with WORM (Write Once, Read Many) flags.

python

from __future__ import annotations

import hashlib
from typing import Any

import jwt
import structlog
from cryptography.hazmat.primitives import serialization
from pydantic import BaseModel, ConfigDict, ValidationError

logger = structlog.get_logger()


class GrantSubmissionSchema(BaseModel):
    # Pydantic v2: configure via ConfigDict, not an inner class.
    # strict=True disables lax coercion (e.g. "12" -> 12), as the contract requires.
    model_config = ConfigDict(extra="forbid", frozen=True, strict=True)

    grant_id: str
    applicant_org: str
    submission_date: str
    # Additional fields defined per grantor taxonomy.


class IngestionBoundary:
    def __init__(
        self, public_key_pem: bytes, salt: bytes, issuer: str | None = None
    ) -> None:
        self.public_key = serialization.load_pem_public_key(public_key_pem)
        self.salt = salt
        # Expected `iss` value (optional parameter added for illustration).
        # When None, PyJWT only checks that the claim is present.
        self.issuer = issuer
        self.logger = logger.bind(stage="ingestion_boundary")

    def validate_scope(self, token: str) -> dict[str, Any]:
        try:
            payload: dict[str, Any] = jwt.decode(
                token,
                self.public_key,
                algorithms=["RS256"],
                issuer=self.issuer,
                options={"require": ["exp", "iss", "scope"]},
            )
        except jwt.ExpiredSignatureError as exc:
            self.logger.warning("token_expired", error=str(exc))
            raise PermissionError("Token expired") from exc
        except jwt.InvalidTokenError as exc:
            self.logger.warning("invalid_token", error=str(exc))
            raise PermissionError("Invalid credential scope") from exc

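        # The claim may be a list or an OAuth-style space-delimited string;
        # normalize first so a substring such as "ingest:readonly" cannot match.
        scope = payload.get("scope", [])
        scopes = scope.split() if isinstance(scope, str) else list(scope)
        if "ingest:read" not in scopes:
            self.logger.warning("missing_scope", required="ingest:read")
            raise PermissionError("Missing ingest:read scope")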
        return payload

    def enforce_schema(self, payload: dict[str, Any]) -> GrantSubmissionSchema:
        try:
            return GrantSubmissionSchema(**payload)
        except ValidationError as exc:
            self.logger.error("schema_violation", errors=exc.errors())
            raise ValueError("Schema rejection: payload quarantined") from exc

    def tokenize_pii(self, raw_value: str) -> str:
        # Deterministic so the same person resolves to a stable token,
        # but the raw value is never recoverable from the digest.
        return hashlib.sha256(f"{raw_value}{self.salt.hex()}".encode()).hexdigest()

    def process(self, token: str, payload: dict[str, Any]) -> dict[str, Any]:
        self.validate_scope(token)
        validated = self.enforce_schema(payload)

        sanitized = validated.model_dump()
        sanitized["applicant_org_hash"] = self.tokenize_pii(
            sanitized.pop("applicant_org")
        )

        self.logger.info(
            "ingestion_success",
            # Sort items so the fingerprint does not depend on key order.
            payload_fingerprint=hashlib.sha256(
                repr(sorted(payload.items())).encode()
            ).hexdigest(),
            scope_valid=True,
            schema_compliant=True,
            pii_tokenized=True,
        )
        return sanitized
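
The tokenize_pii method above keys the hash by appending the salt to the value. A hedged alternative, shown here only as a sketch, is HMAC-SHA256 with the stage-bound salt as the key, which is the standard construction for keying a hash and keeps key and value separate. `tokenize_pii_hmac` is a hypothetical helper, not part of the reference classes, and its tokens are not interchangeable with those produced by tokenize_pii.

```python
import hashlib
import hmac


def tokenize_pii_hmac(raw_value: str, salt: bytes) -> str:
    """Return a deterministic, keyed token for a PII value.

    The same (value, salt) pair always yields the same 64-character hex
    token; a different salt yields an unrelated token.
    """
    return hmac.new(salt, raw_value.encode("utf-8"), hashlib.sha256).hexdigest()
```

Switching constructions changes every existing token, so a migration would need to re-tokenize stored values under the same rotation procedure used for the salt.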

2. Reconciliation & ledger access control

Reconciliation validates consistency between ingested records and the authoritative grant ledger while enforcing least-privilege RBAC. It prevents unauthorized ledger mutations, duplicate submissions, and scope escalation. Credentials presenting any *:write claim against the ledger are rejected outright — this stage is read-only by design.

python

from __future__ import annotations

from datetime import datetime, timedelta, timezone
from typing import Any


class ReconciliationBoundary:
    def __init__(self, idempotency_ttl: timedelta = timedelta(hours=24)) -> None:
        self.idempotency_store: dict[str, datetime] = {}
        self.ttl = idempotency_ttl
        self.logger = logger.bind(stage="reconciliation_boundary")

    def enforce_idempotency(self, key: str) -> bool:
        now = datetime.now(timezone.utc)
        existing = self.idempotency_store.get(key)
        if existing is not None and existing + self.ttl > now:
            self.logger.warning("duplicate_request", key=key)
            return False
        self.idempotency_store[key] = now
        return True

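    def validate_rbac(self, token_payload: dict[str, Any]) -> None:
        allowed = {"reconcile:read", "audit:verify"}
        raw = token_payload.get("scope", [])
        # Accept a list claim or the space-delimited string form.
        granted = set(raw.split() if isinstance(raw, str) else raw)
        if not granted:
            # An empty set is a subset of anything, so reject it explicitly.
            self.logger.error("rbac_missing_scope")
            raise PermissionError("No reconciliation scope granted")
        if not granted.issubset(allowed):
            self.logger.error("rbac_escalation_attempt", scopes=sorted(granted))
            raise PermissionError("Unauthorized scope escalation detected")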

    def reconcile(
        self,
        token_payload: dict[str, Any],
        payload: dict[str, Any],
        idempotency_key: str,
    ) -> bool:
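        # Check RBAC first so an unauthorized caller cannot consume the key
        # and cause the legitimate retry to be rejected as a duplicate.
        self.validate_rbac(token_payload)

        if not self.enforce_idempotency(idempotency_key):
            raise ValueError("Idempotency violation: duplicate submission")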

        self.logger.info(
            "ledger_read_only_verified",
            grant_id=payload.get("grant_id"),
            fiscal_year=str(payload.get("submission_date", ""))[:4],
        )
        self.logger.info(
            "reconciliation_success",
            idempotency_key=idempotency_key,
            rbac_compliant=True,
        )
        return True
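
The in-process idempotency_store in ReconciliationBoundary records keys but never evicts them. The sketch below shows one way to add TTL eviction while staying in memory; it is an illustration under that assumption, not a substitute for a shared store in a multi-worker deployment. `ExpiringIdempotencyStore` and its `claim` method are hypothetical names.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


class ExpiringIdempotencyStore:
    """In-memory key store that forgets keys once their TTL has elapsed."""

    def __init__(self, ttl: timedelta = timedelta(hours=24)) -> None:
        self.ttl = ttl
        self._seen: dict[str, datetime] = {}

    def _purge(self, now: datetime) -> None:
        # Collect first, then delete, so the dict is not mutated mid-iteration.
        expired = [key for key, ts in self._seen.items() if ts + self.ttl <= now]
        for key in expired:
            del self._seen[key]

    def claim(self, key: str, now: datetime | None = None) -> bool:
        """Return True the first time a key is seen inside the TTL window."""
        now = now or datetime.now(timezone.utc)
        self._purge(now)
        if key in self._seen:
            return False
        self._seen[key] = now
        return True

    def __len__(self) -> int:
        return len(self._seen)
```

The purge is a linear scan on every call, which is acceptable for a few thousand keys; the `now` parameter exists so tests can control time without patching the clock.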

3. Processing & transformation isolation

Processing boundaries enforce deterministic execution environments, isolate transformation logic from credential contexts, and provide deterministic fallback pathways so that a transient dependency outage never produces a partial state mutation or a credential leak. A garbage-collection pass is forced after each execution, which releases unreachable objects but does not zero memory, and any temporary files are written to encrypted, ephemeral volumes. When primary transformation dependencies exceed a latency threshold or raise, the payload routes to a version-locked fallback, the same pattern formalized in Pipeline Fallback & Retry Logic.

python

from __future__ import annotations

import gc
import hashlib
import time
from contextlib import contextmanager
from datetime import datetime, timezone
from typing import Any, Callable, Iterator


class ProcessingBoundary:
    def __init__(self, max_latency_ms: int = 2000) -> None:
        self.max_latency_ms = max_latency_ms
        self.logger = logger.bind(stage="processing_boundary")

    @contextmanager
    def isolated_context(self) -> Iterator[None]:
        try:
            yield
        finally:
            gc.collect()  # best effort: frees unreachable objects, does not zero memory

    def execute_with_fallback(
        self,
        primary_fn: Callable[[dict[str, Any]], dict[str, Any]],
        fallback_fn: Callable[[dict[str, Any]], dict[str, Any]],
        payload: dict[str, Any],
    ) -> dict[str, Any]:
        start = time.monotonic()
        try:
            with self.isolated_context():
                result = primary_fn(payload)
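                latency = (time.monotonic() - start) * 1000
                # Post-hoc budget check: the primary call is not interrupted;
                # a result that arrives over budget is discarded for the fallback.
                if latency > self.max_latency_ms:
                    raise TimeoutError("Primary transformation exceeded latency budget")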
                self.logger.info("primary_execution_success", latency_ms=latency)
                return result
        except Exception as exc:  # noqa: BLE001 - intentionally broad; routed, not swallowed
            self.logger.warning("fallback_triggered", error=str(exc))
            with self.isolated_context():
                result = fallback_fn(payload)
                self.logger.info("fallback_execution_success")
                return result

    def apply_compliance_metadata(
        self, payload: dict[str, Any], rule_version: str
    ) -> dict[str, Any]:
        payload["_compliance_meta"] = {
            "rule_version_hash": hashlib.sha256(rule_version.encode()).hexdigest(),
            "boundary_transition": "processing_to_egress",
            "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        }
        return payload
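
The guarantee that a failed primary never leaves a partial state mutation depends on the primary not sharing the caller's payload object. One way to get that property, sketched here as an assumption rather than as the module's actual mechanism, is to hand each transformation its own deep copy. `run_isolated` is a hypothetical helper.

```python
from __future__ import annotations

import copy
from typing import Any, Callable

Transform = Callable[[dict[str, Any]], dict[str, Any]]


def run_isolated(
    primary_fn: Transform, fallback_fn: Transform, payload: dict[str, Any]
) -> dict[str, Any]:
    """Run primary, then fallback on failure, each on a private copy.

    Because neither function receives the caller's object, a primary that
    mutates its input and then raises cannot leak half-applied changes
    into the fallback or back to the caller.
    """
    try:
        return primary_fn(copy.deepcopy(payload))
    except Exception:  # intentionally broad: any failure routes to fallback
        return fallback_fn(copy.deepcopy(payload))
```

Deep copies cost memory proportional to payload size, so this fits the per-record payloads described here better than large batch objects.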

4. Egress & reporting boundary

Egress governs final data export, report generation, and audit-log finalization. It strips internal markers, seals the workflow’s audit entries into a signed Merkle root committed to WORM storage, and delivers only to pre-authorized endpoints. Retention policy is applied at the boundary, and records past their statutory window are cryptographically shredded with a logged deletion proof.

python

from __future__ import annotations

import hashlib
import json
from typing import Any, Callable


class EgressBoundary:
    def __init__(self, hsm_signer: Callable[[bytes], bytes]) -> None:
        self.hsm_signer = hsm_signer
        self.logger = logger.bind(stage="egress_boundary")

    def sanitize_output(self, payload: dict[str, Any]) -> dict[str, Any]:
        internal_keys = {"_compliance_meta", "applicant_org_hash", "payload_fingerprint"}
        sanitized = {k: v for k, v in payload.items() if k not in internal_keys}
        # Log how many keys were actually stripped, not the size of the denylist.
        self.logger.info("output_sanitized", keys_removed=len(payload) - len(sanitized))
        return sanitized

    def seal_audit_log(self, log_entries: list[dict[str, Any]]) -> str:
        audit_blob = json.dumps(log_entries, sort_keys=True).encode()
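        # A single SHA-256 over the canonical JSON batch stands in for the
        # Merkle root here; no hash tree is built in this reference code.
        merkle_root = hashlib.sha256(audit_blob).hexdigest()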
        signature = self.hsm_signer(merkle_root.encode())
        self.logger.info(
            "audit_sealed",
            merkle_root=merkle_root,
            signature_hex=signature.hex(),
            worm_compliant=True,
        )
        return merkle_root

    def finalize(
        self, payload: dict[str, Any], audit_entries: list[dict[str, Any]]
    ) -> dict[str, Any]:
        sanitized = self.sanitize_output(payload)
        audit_hash = self.seal_audit_log(audit_entries)
        self.logger.info(
            "egress_finalized",
            delivery_authorized=True,
            retention_policy_applied=True,
        )
        return {"data": sanitized, "audit_seal": audit_hash}
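
seal_audit_log above hashes the whole batch in one pass and labels the digest a Merkle root. For readers who want a root over a real hash tree, the following is a minimal sketch. The `merkle_root` function, the 0x00/0x01 domain-separation prefixes (borrowed in style from RFC 6962), and the rule of promoting an unpaired node unchanged are all illustrative choices, not the module's specified format.

```python
from __future__ import annotations

import hashlib
import json
from typing import Any


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(log_entries: list[dict[str, Any]]) -> str:
    """Return the hex Merkle root over canonical-JSON audit entries."""
    if not log_entries:
        raise ValueError("Cannot seal an empty audit batch")

    # Leaves and interior nodes use different prefixes so a leaf can never
    # be reinterpreted as an interior node.
    level = [
        _sha256(b"\x00" + json.dumps(entry, sort_keys=True).encode())
        for entry in log_entries
    ]
    while len(level) > 1:
        paired = [
            _sha256(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2 == 1:
            paired.append(level[-1])  # unpaired node is promoted unchanged
        level = paired
    return level[0].hex()
```

The returned hex string can be passed to the HSM signer in place of the flat digest; a tree additionally allows per-entry inclusion proofs, which this sketch does not implement.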

Field Mapping / Schema Contract

Boundary enforcement depends on knowing which fields carry PII and which credential scope is permitted to read them. The table below is the canonical contract: it maps incoming aliases to canonical field names, the coercion rule applied, the data classification, and the boundary action triggered. The same canonical names are consumed verbatim by the downstream field mapping and normalization stage, so divergence here causes silent drift there.

| Canonical field | Accepted aliases | Type / coercion | Classification | Boundary action |
| --- | --- | --- | --- | --- |
| `grant_id` | `grantId`, `award_no`, `GrantID` | `str`, trimmed, uppercased | Public | Pass through |
| `applicant_org` | `org`, `organization`, `grantee_name` | `str`, normalized whitespace | PII (restricted) | Tokenize → `applicant_org_hash` |
| `ein` | `tax_id`, `EIN`, `employer_id` | `str`, `^\d{2}-?\d{7}$` | Sensitive | Tokenize + format-validate |
| `submission_date` | `subDate`, `received` | ISO-8601 `date`, no implicit cast | Public | Pass through |
| `contact_email` | `email`, `poc_email` | `str`, RFC-5322 lower-cased | PII (restricted) | Tokenize, drop from logs |

Coercion is meant to be strict: the model must reject rather than silently cast (a string "12" is not accepted where an integer is required), which prevents malformed identifiers from passing the gate. In pydantic v2 this behavior requires strict=True in the model's ConfigDict; the default lax mode would coerce "12" to 12. Fields not present in this contract are rejected by extra="forbid": an unknown alias is treated as a schema violation, not silently dropped.
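
The alias and coercion rules in the contract table can be sketched in plain Python as follows. This is an illustration of the contract, not the module's actual mapping code: `normalize_record`, `ALIASES`, and `EIN_PATTERN` are hypothetical names, and date and email validation are omitted for brevity.

```python
from __future__ import annotations

import re
from typing import Any

# Canonical name -> accepted aliases, copied from the contract table.
_CONTRACT: dict[str, tuple[str, ...]] = {
    "grant_id": ("grantId", "award_no", "GrantID"),
    "applicant_org": ("org", "organization", "grantee_name"),
    "ein": ("tax_id", "EIN", "employer_id"),
    "submission_date": ("subDate", "received"),
    "contact_email": ("email", "poc_email"),
}
ALIASES: dict[str, str] = {
    name: canonical
    for canonical, aliases in _CONTRACT.items()
    for name in (canonical, *aliases)
}
EIN_PATTERN = re.compile(r"\d{2}-?\d{7}")


def normalize_record(raw: dict[str, Any]) -> dict[str, Any]:
    """Map aliases to canonical names and apply the table's coercion rules."""
    out: dict[str, Any] = {}
    for key, value in raw.items():
        canonical = ALIASES.get(key)
        if canonical is None:
            raise ValueError(f"Unknown field: {key!r}")  # mirrors extra="forbid"
        if canonical in out:
            raise ValueError(f"Field supplied twice: {canonical!r}")
        if not isinstance(value, str):
            raise ValueError(f"{canonical!r} must be a string")  # no implicit cast
        out[canonical] = value

    if "grant_id" in out:
        out["grant_id"] = out["grant_id"].strip().upper()
    if "applicant_org" in out:
        out["applicant_org"] = " ".join(out["applicant_org"].split())
    if "ein" in out and not EIN_PATTERN.fullmatch(out["ein"]):
        # The offending value is deliberately left out of the message.
        raise ValueError("ein does not match NN-NNNNNNN")
    if "contact_email" in out:
        out["contact_email"] = out["contact_email"].lower()
    return out
```

Rejecting a record that supplies the same canonical field under two aliases avoids a silent last-writer-wins merge, which would otherwise depend on key order.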

Validation & Testing

Boundary code is security-critical, so it is covered by deterministic pytest assertions plus hypothesis property tests. The key invariants: PII tokenization is deterministic, an over-scoped credential is always rejected, and every successful pass emits the expected audit fields.

python

from __future__ import annotations

import pytest
from hypothesis import given, strategies as st


def make_boundary() -> IngestionBoundary:
    # Deterministic salt + a test public key fixture in conftest.py.
    return IngestionBoundary(public_key_pem=TEST_PUBLIC_KEY, salt=b"unit-test-salt")


@given(value=st.text(min_size=1, max_size=128))
def test_pii_tokenization_is_deterministic(value: str) -> None:
    boundary = make_boundary()
    token = boundary.tokenize_pii(value)
    assert token == boundary.tokenize_pii(value)
    # The token is always a 64-character lowercase hex digest. (Asserting that
    # the raw value is absent from it would fail for short inputs such as "a".)
    assert len(token) == 64
    assert set(token) <= set("0123456789abcdef")


def test_rbac_blocks_scope_escalation() -> None:
    recon = ReconciliationBoundary()
    with pytest.raises(PermissionError, match="escalation"):
        recon.validate_rbac({"scope": ["reconcile:read", "reconcile:write"]})


def test_schema_rejects_unknown_field() -> None:
    boundary = make_boundary()
    with pytest.raises(ValueError, match="quarantined"):
        boundary.enforce_schema(
            {
                "grant_id": "G-1",
                "applicant_org": "Org",
                "submission_date": "2026-01-01",
                "unexpected": "x",  # extra="forbid" must reject this
            }
        )


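def test_audit_entry_emitted(monkeypatch: pytest.MonkeyPatch) -> None:
    from structlog.testing import capture_logs  # collects events as dicts

    boundary = make_boundary()
    # Stub the credential check so the test exercises only audit emission.
    monkeypatch.setattr(boundary, "validate_scope", lambda token: {"scope": ["ingest:read"]})
    body = {"grant_id": "G-1", "applicant_org": "Org", "submission_date": "2026-01-01"}
    with capture_logs() as events:
        boundary.process("stub-token", body)
    assert any(e["event"] == "ingestion_success" and e.get("pii_tokenized") for e in events)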

Expected pass payload: a JWT carrying ingest:read plus a body matching the schema contract returns a sanitized dict containing applicant_org_hash and no raw applicant_org. Expected fail payloads: a missing scope raises PermissionError, an unknown field raises ValueError, and a replayed idempotency key returns False from enforce_idempotency.

Performance & Scale Considerations

Nonprofit-scale workloads are bursty rather than high-throughput — grant deadlines drive spikes of a few thousand submissions, not millions. Tune for predictable latency and bounded memory rather than raw QPS:

Batch sizing: process ingestion in batches of 250–500 payloads. Larger batches inflate the encrypted quarantine bucket’s transaction size and slow rejection feedback; smaller batches waste signature-verification setup cost.
Concurrency limits: cap concurrent boundary workers at 2 × CPU cores. RS256 verification and SHA-256 tokenization are CPU-bound, so unbounded async fan-out only increases GC pressure. For high-volume webhook bursts, queue behind the Async Batch Processing Pipelines stage rather than scaling boundary workers directly.
Memory ceilings: the gc.collect() call in isolated_context after each transformation helps keep the resident set flat, but set a per-worker ceiling (~512 MB) so a malformed multi-megabyte payload cannot exhaust the host. The idempotency_store must be backed by Redis with TTL eviction in production; the in-process dict shown above is illustrative and never evicts keys, so it grows without bound.
Hashing cost: deterministic tokenization is cheap (microseconds), but HSM-backed audit sealing in the egress boundary adds network round-trips. Seal once per workflow batch, not per record.
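
The batch-size and worker-cap guidance above can be expressed as two small helpers. This is a sketch under the stated numbers (250 per batch, 2 × CPU cores); `iter_batches` and `worker_cap` are hypothetical names.

```python
from __future__ import annotations

import os
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], size: int = 250) -> Iterator[list[T]]:
    """Yield consecutive lists of at most `size` items, lazily."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def worker_cap(cpu_count: int | None = None) -> int:
    """Return the concurrency ceiling: twice the available CPU cores."""
    cores = cpu_count or os.cpu_count() or 1
    return 2 * cores
```

Because `iter_batches` pulls from an iterator, a large upload is never materialized in full, which keeps memory bounded by the batch size rather than the file size.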

Failure Modes & Troubleshooting

| Error category | Root cause | Remediation |
| --- | --- | --- |
| `PermissionError: Token expired` | Clock skew or a stale cached JWT | Sync NTP on the gateway; allow a small verification leeway that covers the expected skew; re-issue from the IdP |
| `PermissionError: scope escalation` | Caller requested `reconcile:write` against a read-only boundary | Strip write scopes at the gateway; confirm the IdP role mapping grants only `reconcile:read` / `audit:verify` |
| `ValueError: payload quarantined` | Unknown alias or strict-type mismatch (e.g. string where ISO date expected) | Add the alias to the schema contract table above, or fix the producer; inspect the encrypted quarantine bucket for the raw payload |
| `ValueError: Idempotency violation` | Webhook retried within the TTL window | Confirm the client sends a stable idempotency key per logical submission, not per HTTP attempt |
| Silent fallback every run | Primary transformation exceeds `max_latency_ms` under load | Raise the latency budget or move the heavy dependency behind a queue; check `fallback_triggered` log volume |
| Missing `audit_sealed` entry | HSM signer unreachable or `HSM_KEY_ID` unset | Fail closed: block egress until signing succeeds; never deliver an unsealed export |

Because every boundary logs structured events, the fastest triage path is to filter structlog output by stage and the event name (schema_violation, rbac_escalation_attempt, fallback_triggered). Detailed PII-specific remediation patterns are covered in Securing PII in Nonprofit Grant Databases, and boundary error taxonomy aligns with the conventions in Error Categorization & Logging.
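
The stage-and-event filter described above can be sketched as a small counter over log lines. It assumes structlog is configured to render one JSON object per line (for example with its JSON renderer) and that the bound `stage` key and the default `event` key are present; `count_events` is a hypothetical helper.

```python
from __future__ import annotations

import json
from collections import Counter
from typing import Iterable


def count_events(
    lines: Iterable[str], stage: str | None = None
) -> Counter[tuple[str, str]]:
    """Count (stage, event) pairs, optionally restricted to one stage."""
    counts: Counter[tuple[str, str]] = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise such as tracebacks
        if not isinstance(record, dict):
            continue
        record_stage = record.get("stage", "unknown")
        if stage is not None and record_stage != stage:
            continue
        counts[(record_stage, record.get("event", "unknown"))] += 1
    return counts
```

Calling `count_events(open_log_file).most_common(5)` on a day's output gives a quick view of which boundary and event dominate, for instance whether `fallback_triggered` is firing on every run.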

Compliance Alignment

These boundaries are not abstract security hygiene — each maps to a specific control a grant auditor will test:

2 CFR §200.303 (Internal controls): the read-only reconciliation boundary and least-privilege RBAC satisfy the Uniform Guidance requirement that recipients establish internal controls providing reasonable assurance of compliance and safeguarding of assets.
2 CFR §200.302 (Financial management): immutable, fingerprinted audit trails give the records-identification and source-documentation the financial management standard requires.
2 CFR §200.334–§200.337 (Record retention & access): the egress boundary’s retention enforcement and logged deletion proofs implement the three-year retention floor and the federal awarding agency’s right of access to records.
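IRS Form 990, Schedule B & Part VII: PII tokenization keeps contributor identities (Schedule B) and listed officer compensation data (Part VII, Section A) out of transit logs and downstream exports. For Schedule B this supports the redaction of contributor names and addresses from public-inspection copies, which applies to most public charities; Part VII data is itself publicly disclosed, so the control there limits incidental exposure rather than meeting a redaction requirement.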
SOC 2 Type II: CC6.1 (logical access) maps to ingestion scope validation, CC6.3 (segregation of duties) to the RBAC escalation guard, and CC7.2 (monitoring) to sealed audit logs.
NIST SP 800-53 Rev. 5: AC-3 (access enforcement) is enforced by the ingestion and reconciliation boundaries, SI-12 (information management and retention) by retention handling at the egress boundary, and AU-9 (protection of audit information) by the sealed audit log.
State AG thresholds: multi-state solicitation registration limits (e.g. the registration triggers tracked by State Charity Registration Compliance) determine which jurisdictions a sanitized export may be delivered to at the egress boundary.

All four boundaries are cryptographically verifiable, audit trails are immutable, and credential scopes stay isolated — nonprofit operations teams can rely on the perimeter for automated compliance reporting, while Python developers can drop the boundary classes in as a security layer without touching core grant-processing logic.

Related

Core Architecture & Compliance Mapping — parent framework and the compliance-first architecture these boundaries plug into
IRS 990 Data Schema Mapping — canonical normalization consumed after the ingestion boundary clears
Compliance Metadata Standards — the audit-record contract every boundary emits against
Pipeline Fallback & Retry Logic — deterministic fallback patterns used by the processing boundary
Securing PII in Nonprofit Grant Databases — field-level PII handling within this module
Async Batch Processing Pipelines — queueing in the data-ingestion workflows for high-volume webhook bursts
