# PIPELINE VALIDATION GUIDE

## Purpose

This file defines a repeatable validation process for the NORMALIZATION PIPELINE.

## Core Reproducibility Contract

1. Source-of-truth inputs are the Excel source files discovered by the pipeline.
2. Transformations are defined by the pipeline scripts and normalization rules.
3. Any implementation (Python or another language) is considered equivalent if it produces the same required outputs from the same source inputs.
4. Strict reproducibility is guaranteed only under the documented runtime and dependency setup.
5. Recursive discovery eliminates the need to hard-code source file counts.

## Preconditions

1. A Python virtual environment exists at .\.venv.
2. Dependencies are installed from requirements.txt.
3. Source Excel files are discoverable within the workspace.
4. Pipeline scripts are present and executable.
5. Run commands from workspace root.
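
The preconditions above can be checked before a run with a small helper. This is a minimal sketch, not part of the pipeline: the function name `missing_preconditions` and the idea of passing the required paths as a list are assumptions made for illustration.

```python
from pathlib import Path


def missing_preconditions(root, required):
    """Return the entries of `required` that do not exist under `root`.

    `required` holds paths relative to the workspace root. The caller
    supplies them; nothing here is read from the pipeline itself.
    """
    root = Path(root)
    return [rel for rel in required if not (root / rel).exists()]
```

Called from the workspace root, for example with `["requirements.txt", "verify_repro.ps1", "hash_registry.ps1"]`, an empty result means every listed file was found. The check only covers existence; it does not confirm that dependencies are actually installed.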

## Input Discovery

The pipeline recursively discovers all files matching:

```
*_Ornamentation*.xlsx
```

Currently supported recitations include:

* Hafs
* Warsh
* Qaloon
* Soosi
* Doori
* Shouba
* Bazzi
* Qumball
* Hishem
* Ibn_Dhakwan
* Khallad
* Khalaf_Hamza
* Kisai_Duri


Currently supported verse systems include:

* VERSE 0
* Basra
* Damascus
* Himsi
* Kufa
* Mecca
* Medina I
* Medina II

No source file count is hard-coded.
Any newly added matching Excel file is automatically included in processing.
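
For readers who want to confirm which files discovery will pick up, the recursive match can be approximated as follows. This is a sketch under assumptions: the pattern is the one stated in this guide, but `discover_sources` is a hypothetical name, and the real pipeline scripts may apply extra filters.

```python
from pathlib import Path

# Pattern as documented in this guide.
PATTERN = "*_Ornamentation*.xlsx"


def discover_sources(root):
    """Return every matching Excel file under `root`, at any depth.

    The result is sorted so that processing order does not depend on
    the order in which the filesystem happens to list entries.
    """
    root = Path(root)
    return sorted(p for p in root.rglob(PATTERN) if p.is_file())
```

Because the list is built at run time, `len(discover_sources("."))` reflects whatever is currently in the workspace, which is why no file count needs to be hard-coded.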

## Validation Steps

1. Full reproducibility verification:

   PowerShell -ExecutionPolicy Bypass -File .\verify_repro.ps1

2. Generate hash registry:

   PowerShell -ExecutionPolicy Bypass -File .\hash_registry.ps1 -Mode generate

3. Validate against registry:

   PowerShell -ExecutionPolicy Bypass -File .\hash_registry.ps1 -Mode validate

4. Optional artifact-level validation:

   PowerShell -ExecutionPolicy Bypass -File .\hash_registry.ps1 -Mode generate -IncludeGeneratedOutputs

   PowerShell -ExecutionPolicy Bypass -File .\hash_registry.ps1 -Mode validate -IncludeGeneratedOutputs

## Expected Results

1. verify_repro.ps1 reports:

   [OK] Reproducibility verification passed.
   [OK] Fresh rebuild succeeded, naming contract holds, and text outputs are deterministic.

2. hash_registry.ps1 -Mode generate reports:

   [OK] Hash registry generated.
   [OK] Files tracked.

3. hash_registry.ps1 -Mode validate reports:

   [OK] Registry validation passed.

## Scope Notes

1. verify_repro.ps1 validates deterministic text outputs across reruns.
2. hash_registry.ps1 validates file integrity through cryptographic hashes.
3. Generated outputs are excluded from registry scope unless explicitly requested with -IncludeGeneratedOutputs.
4. Runtime or dependency changes should be followed by a complete validation run.
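
The generate and validate modes can be pictured with the sketch below. It is illustrative only: this guide does not state which hash algorithm or registry layout hash_registry.ps1 uses, so SHA-256 and a flat path-to-digest JSON mapping are assumptions, as are all function names.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large workbooks are not loaded whole."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def generate_registry(root, files):
    """Map each file's path (relative to root, POSIX style) to its digest."""
    root = Path(root)
    return {p.relative_to(root).as_posix(): sha256_of(p) for p in sorted(files)}


def write_registry(registry, target):
    """Persist the registry as JSON with sorted keys for stable diffs."""
    Path(target).write_text(
        json.dumps(registry, indent=2, sort_keys=True), encoding="utf-8"
    )


def validate_registry(root, registry):
    """Return a list of problems; an empty list means validation passed."""
    root = Path(root)
    problems = []
    for rel, expected in sorted(registry.items()):
        target = root / rel
        if not target.is_file():
            problems.append(f"missing: {rel}")
        elif sha256_of(target) != expected:
            problems.append(f"changed: {rel}")
    return problems
```

Whether generated outputs are tracked is decided by which files are passed to `generate_registry`, which mirrors the opt-in behaviour of -IncludeGeneratedOutputs.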

## Pass Criteria (Acceptance Checklist)

A validation run is considered PASS only if all conditions below are true:

1. Input readiness

   * All discoverable *_Ornamentation*.xlsx files are accessible.

2. Reproducibility script

   * verify_repro.ps1 completes without error.
   * Reproducibility verification passes.

3. Naming contract

   * No generated filename contains:
     __normalized
     __unicode_report

4. Determinism contract

   * Repeated execution produces identical tracked text outputs.

5. Registry integrity

   * hash_registry.ps1 -Mode generate succeeds.
   * hash_registry.ps1 -Mode validate succeeds.

6. Optional strict artifact coverage

   * If artifact-level tracking is enabled, validation succeeds with:
     -IncludeGeneratedOutputs

If any condition fails, the overall validation result is FAIL.
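
The naming contract in item 3 is straightforward to spot-check by hand. The following sketch assumes the check is a plain substring test on each filename; `naming_violations` is a hypothetical helper, and the directory to scan is supplied by the caller.

```python
from pathlib import Path

# Substrings that must not appear in any generated filename (from item 3).
PROHIBITED = ("__normalized", "__unicode_report")


def naming_violations(output_dir):
    """Return generated files whose names contain a prohibited substring."""
    return sorted(
        p
        for p in Path(output_dir).rglob("*")
        if p.is_file() and any(token in p.name for token in PROHIBITED)
    )
```

An empty list corresponds to the naming contract holding for that directory. Only the filename is inspected, not the directory names above it.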

## Failure Interpretation

1. Missing inputs

   * Verify all intended *_Ornamentation*.xlsx files are present.

2. Naming violations

   * Check generated filenames for prohibited naming patterns.

3. Hash mismatch

   * One or more tracked files changed since registry generation.
   * If intentional, regenerate the registry.

4. Runtime mismatch

   * Align Python version and dependencies.
   * Re-run validation.
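
When a runtime mismatch is suspected, comparing installed package versions against requirements.txt narrows it down quickly. The sketch below assumes requirements.txt uses exact `name==version` pins, which this guide does not confirm; lines in any other form are skipped, and extras such as `name[extra]==1.0` are not handled. Both function names are hypothetical.

```python
from importlib import metadata


def parse_pins(text):
    """Extract exact pins (name==version) from requirements-style text."""
    pins = {}
    for raw in text.splitlines():
        # Drop trailing comments and environment markers before parsing.
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def pin_mismatches(pins):
    """Compare pins with the active environment; return readable problems."""
    problems = []
    for name, wanted in sorted(pins.items()):
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (want {wanted})")
            continue
        if found != wanted:
            problems.append(f"{name}: installed {found}, want {wanted}")
    return problems
```

Run inside the virtual environment, an empty result from `pin_mismatches(parse_pins(...))` suggests the dependencies match the pins. The Python version itself can be compared separately through `sys.version_info`.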

## Current Validation Status

Latest validation result:

[OK] Reproducibility verification passed.
[OK] Fresh rebuild succeeded.
[OK] Naming contract holds.
[OK] Text outputs are deterministic.

System Status: PASS

## Artifacts

1. Hash registry:

   workspace_hash_registry.json

2. Generated outputs:

   unicode_reports\
   normalized_outputs_all_with_json\

3. Validation records:

   PIPELINE_VALIDATION.txt

## Deterministic Reproducibility Statement

A successful validation run demonstrates that the normalization pipeline can be rebuilt from source inputs and reproduce identical tracked outputs under the documented environment.

Determinism is established when repeated executions produce no changes in tracked output artifacts.
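
One way to picture the "no changes" criterion is to compare the outputs of two separate runs file by file. The sketch below is an assumption about how such a comparison could be made, not a description of verify_repro.ps1; it treats two directories as snapshots of successive runs, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def tree_digest(root):
    """Map each file under root (relative POSIX path) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_runs(first, second):
    """Return paths that differ, or exist in only one of the two runs."""
    a, b = tree_digest(first), tree_digest(second)
    return sorted(rel for rel in a.keys() | b.keys() if a.get(rel) != b.get(rel))
```

An empty list from `diff_runs` means the two runs produced byte-identical files with identical names, which is the condition described above.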
