Appendix A
Judgement, Coherence, and Deterministic Evaluation
A Formal Note on Epistemic Structure in the RI Safety Layer (v1.0)
A.1 Purpose of this appendix
This appendix provides a formal clarification of how the RI Safety Layer defines and implements:
- judgement
- coherence
- behavioural evaluation
It addresses a central question in AI evaluation:
How are evaluation criteria defined, and why should they be considered meaningful?
The aim is not to assert a universal definition of “correct” behaviour, but to describe:
- how criteria are constructed
- how they are applied
- how their influence on outcomes can be inspected, reproduced, and challenged
A.2 The epistemic problem in AI evaluation
All AI evaluation systems depend on prior assumptions.
These assumptions determine:
- what behaviours are considered desirable
- what constitutes failure
- how trade-offs are resolved
In most existing evaluation frameworks, these assumptions are:
- embedded in benchmark design
- implicit in scoring systems
- difficult to isolate from results
This creates three core limitations:
- Opacity — the basis of judgement is not fully visible
- Non-reproducibility of interpretation — results cannot be cleanly separated from evaluative assumptions
- Limited contestability — disagreement is difficult to localise (data vs method vs judgement)
A.3 System position
The RI Safety Layer treats judgement as an explicit and separable component of evaluation.
It does not attempt to:
- remove human-defined criteria
- claim objective or universal correctness
Instead, it enforces:
explicit, structured, and inspectable judgement
The system does not eliminate subjectivity.
It constrains and exposes it.
A.3.1 System Architecture Overview
The system can be understood structurally as a layered architecture, shown below.
[Figure: System Stack Diagram]
A.4 Architectural separation
The system is built on a strict separation between three layers:
A.4.1 Behaviour (Observed Interaction)
- raw model outputs
- full conversational traces
- input–output sequences
This layer is:
- recorded in full
- immutable once captured
A.4.2 Measurement (Deterministic Evaluation)
- application of defined evaluation procedures
- generation of structured metrics
- production of reproducible summaries
This layer is:
- deterministic
- repeatable under identical conditions
- independent of downstream judgement
A.4.3 Judgement (Interpretive Layer)
- definition of evaluation criteria
- thresholds, classifications, and containment rules
- decisions regarding inclusion in aggregate metrics
This layer is:
- explicitly defined
- version-controlled
- auditable
A.4.4 Key property
Measurement does not depend on judgement.
Judgement operates on the outputs of measurement.
This prevents:
- retrospective reinterpretation of behaviour
- hidden modification of evaluation outcomes
[Figure: Behaviour / Measurement / Judgement Separation]
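The three-layer separation can be sketched in code. This is a minimal illustration, not the system's actual interfaces: the names (`Interaction`, `measure`, `judge`) and the placeholder metrics are assumptions introduced here to show the structural property that judgement consumes only measurement outputs.

```python
from dataclasses import dataclass
from typing import Mapping

@dataclass(frozen=True)              # behaviour layer: immutable once captured
class Interaction:
    trace: tuple[str, ...]           # full input-output sequence

def measure(interaction: Interaction) -> dict[str, float]:
    # Measurement layer: deterministic, depends only on the recorded trace.
    # The metrics here are illustrative placeholders.
    turns = len(interaction.trace)
    return {"turns": float(turns),
            "mean_chars": sum(map(len, interaction.trace)) / turns}

def judge(metrics: Mapping[str, float],
          thresholds: Mapping[str, float]) -> bool:
    # Judgement layer: sees only measurement outputs, never the raw
    # behaviour, so measurement cannot depend on judgement.
    return all(metrics[name] <= limit for name, limit in thresholds.items())
```

Because `judge` takes metrics rather than the interaction itself, changing the criteria cannot retroactively alter what was measured.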
A.5 Determinism in behavioural measurement
Within the RI Safety Layer, determinism refers to:
the property that identical inputs, evaluation procedures, and system configurations produce identical outputs
This applies to:
- metric computation
- summary generation
- evidence bundle construction
Determinism ensures:
- reproducibility across runs
- stability of evaluation outputs
- independence from runtime variability
This is distinct from model determinism.
The system does not require the underlying model to be deterministic.
Instead, it ensures that:
given a recorded interaction, its evaluation is deterministic
[Figure: Deterministic Evaluation Path]
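The deterministic property can be illustrated with a sketch in which evaluation is a pure function of the recorded interaction and the configuration. The function name, the metric, and the `max_chars` parameter are assumptions made for the example, not the system's real evaluation procedure.

```python
import hashlib
import json

def evaluate(recorded: dict, config: dict) -> dict:
    # No clock, no randomness, no network access: the output is a
    # pure function of the recorded interaction and the configuration.
    text = " ".join(recorded["outputs"])
    metrics = {"output_chars": len(text),
               "over_limit": len(text) > config["max_chars"]}
    # A digest over (inputs, config, metrics) lets a third party
    # confirm byte-identical reproduction of the evaluation.
    payload = json.dumps([recorded, config, metrics], sort_keys=True)
    return {"metrics": metrics,
            "digest": hashlib.sha256(payload.encode()).hexdigest()}

# Two runs over the same recorded interaction are identical:
run_a = evaluate({"outputs": ["a", "bb"]}, {"max_chars": 10})
run_b = evaluate({"outputs": ["a", "bb"]}, {"max_chars": 10})
assert run_a == run_b
```

Note that the underlying model may be stochastic; determinism applies downstream of the recorded trace, which is exactly the distinction drawn above.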
A.6 Operational definition of coherence
Coherence is defined operationally as:
the degree of alignment between observed behaviour and a declared set of evaluation criteria
Formally, for a given interaction I and criteria set C:

Coherence(I, C) = ƒ(Measurement(I), C)

Where:
- Measurement(I) produces structured behavioural descriptors
- C defines evaluative expectations
- ƒ applies criteria to measured behaviour
[Figure: Coherence evaluated within a defined band]
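One possible realisation of ƒ is sketched below: coherence as the fraction of declared criteria whose band contains the measured value. The band representation and the example criteria names are assumptions for illustration; the system's actual mapping may differ.

```python
def coherence(metrics: dict[str, float],
              criteria: dict[str, tuple[float, float]]) -> float:
    # f(Measurement(I), C): the fraction of declared criteria whose
    # band contains the measured value. Criteria-relative by design:
    # changing C changes both the score and its upper bound.
    in_band = sum(low <= metrics[name] <= high
                  for name, (low, high) in criteria.items())
    return in_band / len(criteria)

# Illustrative criteria set C with per-dimension bands:
C = {"consistency": (0.8, 1.0), "drift": (0.0, 0.2)}
score = coherence({"consistency": 0.9, "drift": 0.35}, C)  # one of two bands met -> 0.5
```

Because both `C` and the behaviour of `coherence` are explicit, a challenge can target the measured values, the chosen bands, or the mapping itself.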
A.6.1 Key properties
- coherence is criteria-relative
- coherence is context-dependent
- coherence is computable and reproducible
- coherence is open to inspection and challenge
- coherence is not claimed as universal truth
A.6.2 Upper-bound interpretation
There is no absolute “highest coherence” independent of context.
Within this system, the upper bound of coherence is defined as:
maximal alignment with the currently declared criteria set C
Changing the criteria set may change the location or meaning of that upper bound.
A.6.3 Practical meaning
The system does not claim to discover coherence as a universal property.
It provides a transparent mechanism for defining, applying, and testing coherence within a specified evaluative frame.
This allows disagreement to focus on:
- the measured behaviour
- the criteria chosen
- the procedure ƒ used to map behaviour to alignment outcomes
rather than on hidden or implicit judgement.
A.7 Construction of evaluation criteria
Evaluation criteria within the system are constructed through a combination of:
A.7.1 Empirical evaluation design
- repeated testing across varied prompts
- identification of stable behavioural patterns
- measurement of response consistency
A.7.2 Behavioural analysis
- examination of model outputs under perturbation
- detection of instability, contradiction, or drift
- analysis of alignment between stated intent and response
A.7.3 Theoretical modelling
Internally developed models of coherence and system alignment inform:
- the structure of criteria
- the selection of behavioural dimensions
- the interpretation of observed patterns
These models aim to describe:
- consistency under variation
- structural integrity of responses
- relationship between prompt conditions and outputs
A.7.4 Iterative refinement
Criteria are not fixed.
They are:
- updated as new behaviours are observed
- stress-tested against edge cases
- evaluated for unintended consequences
All changes are:
- versioned
- documented
- applied prospectively (not retroactively)
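Prospective application can be sketched as a version-selection rule: each session is judged by the newest criteria already in force when it was recorded. The `CriteriaVersion` structure and dates here are hypothetical, introduced only to illustrate the rule.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class CriteriaVersion:
    version: str
    effective_from: datetime   # versioned, documented, prospective
    thresholds: dict

def criteria_for(session_time: datetime,
                 history: list[CriteriaVersion]) -> CriteriaVersion:
    # Prospective application: select the newest version already in
    # force when the session was recorded, so later revisions never
    # reinterpret past behaviour.
    in_force = [c for c in history if c.effective_from <= session_time]
    return max(in_force, key=lambda c: c.effective_from)

history = [
    CriteriaVersion("1.0", datetime(2025, 1, 1, tzinfo=timezone.utc), {"drift": 0.3}),
    CriteriaVersion("1.1", datetime(2025, 6, 1, tzinfo=timezone.utc), {"drift": 0.2}),
]
chosen = criteria_for(datetime(2025, 3, 1, tzinfo=timezone.utc), history)  # -> version "1.0"
```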
A.7.5 Role of the Coherence Model
The construction of evaluation criteria is informed by an underlying model of behavioural coherence.
This model does not function as an absolute authority.
It provides a structured, explicitly defined interpretive lens through which behavioural patterns are understood.
Specifically, it:
- identifies which behavioural dimensions are meaningful for evaluation
- informs the construction of criteria applied to those dimensions
- defines the contextual band within which behaviour is assessed
Evaluation therefore does not occur against an absolute standard.
It occurs relative to criteria that are:
- explicitly defined
- informed by the coherence model
- applied consistently within a given context
The coherence model constrains interpretation — it does not replace it.
[Figure: Coherence Model as Interpretive Lens]
The validity of this approach is not asserted.
It is tested through:
- reproducibility of results
- stability of measurement
- the ability of independent observers to inspect and challenge outcomes
A.8 Governance and containment
Judgement is operationalised through governance rules that determine:
- whether a session contributes to aggregate metrics
- whether results are surfaced in dashboards
- whether outputs are flagged, held, or excluded
These decisions are:
- derived from explicit criteria
- recorded as signed governance outcomes
- traceable to both measurement and criteria
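A governance outcome of this kind can be sketched as a record that is derived mechanically from explicit criteria and then signed. The HMAC signing, the key, and the field names are assumptions for illustration; the system's actual signing scheme is not specified here.

```python
import hashlib
import hmac
import json

def governance_outcome(metrics: dict, criteria: dict, key: bytes) -> dict:
    # The inclusion decision is derived mechanically from the explicit
    # criteria, then signed so the outcome is traceable and tamper-evident.
    include = all(metrics[k] <= limit for k, limit in criteria.items())
    record = {"include_in_aggregates": include,
              "criteria": criteria,
              "metrics": metrics}
    payload = json.dumps(record, sort_keys=True).encode()
    record["signature"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return record

outcome = governance_outcome({"drift": 0.1}, {"drift": 0.2}, key=b"governance-key")
```

Because the record embeds both the criteria and the metrics, the decision remains traceable to both layers, as described above.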
A.9 Contestability and inspection
A central design objective is to enable structured disagreement.
The system allows independent observers to:
- inspect the raw interaction
- reproduce the measurement outputs
- examine the applied criteria
- evaluate the resulting judgement
Disagreement can therefore be localised to:
- the behaviour itself
- the measurement process
- the criteria definition
This improves:
- clarity of critique
- speed of refinement
- overall system robustness
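The first step of such structured disagreement can be sketched as a reproduction check: if recomputing the measurement from the raw trace does not match the published metrics, the dispute lies in the measurement process; if it does, critique shifts to the criteria or the behaviour itself. The function and return strings are illustrative assumptions.

```python
def localise_disagreement(raw_trace, published_metrics, measure_fn) -> str:
    # Re-run the deterministic measurement on the raw trace and compare
    # against the published metrics to localise where disagreement lies.
    recomputed = measure_fn(raw_trace)
    if recomputed != published_metrics:
        return "contest the measurement process"
    return "measurement reproduced; contest criteria or behaviour"

# Illustrative measurement function: count conversational turns.
measure_fn = lambda trace: {"turns": len(trace)}
status = localise_disagreement(("q", "a"), {"turns": 2}, measure_fn)
```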
A.10 Limitations
The system does not:
- establish a universal definition of safe or correct behaviour
- eliminate the need for human judgement
- guarantee that chosen criteria are optimal
Instead, it provides:
- a framework in which judgement is explicit
- a process in which evaluation is reproducible
- a structure in which assumptions can be tested and revised
A.11 Implications for AI safety evaluation
By separating behaviour, measurement, and judgement, the system enables:
- reproducible evaluation independent of interpretation
- governance of metrics prior to publication
- transparent linkage between assumptions and outcomes
This supports:
- regulatory review
- third-party audit
- iterative improvement of evaluation frameworks
A.12 Summary
The RI Safety Layer introduces a structured approach to AI evaluation in which:
- behaviour is fully recorded
- measurement is deterministic
- judgement is explicit and versioned
Coherence is not assumed.
It is defined, applied, and open to challenge.
The system therefore replaces implicit trust with:
inspectable process, reproducible results, and explicit assumptions