Appendix A

Judgement, Coherence, and Deterministic Evaluation

A Formal Note on Epistemic Structure in the RI Safety Layer (v1.0)

A.1 Purpose of this appendix

This appendix provides a formal clarification of how the RI Safety Layer defines and implements:

  • judgement
  • coherence
  • behavioural evaluation

It addresses a central question in AI evaluation:

How are evaluation criteria defined, and why should they be considered meaningful?

The aim is not to assert a universal definition of “correct” behaviour, but to describe:

  • how criteria are constructed
  • how they are applied
  • how their influence on outcomes can be inspected, reproduced, and challenged

A.2 The epistemic problem in AI evaluation

All AI evaluation systems depend on prior assumptions.

These assumptions determine:

  • what behaviours are considered desirable
  • what constitutes failure
  • how trade-offs are resolved

In most existing evaluation frameworks, these assumptions are:

  • embedded in benchmark design
  • implicit in scoring systems
  • difficult to isolate from results

This creates three core limitations:

  • Opacity — the basis of judgement is not fully visible
  • Non-reproducibility of interpretation — results cannot be cleanly separated from evaluative assumptions
  • Limited contestability — disagreement is difficult to localise (data vs method vs judgement)

A.3 System position

The RI Safety Layer treats judgement as an explicit and separable component of evaluation.

It does not attempt to:

  • remove human-defined criteria
  • claim objective or universal correctness

Instead, it enforces:

explicit, structured, and inspectable judgement

The system does not eliminate subjectivity.

It constrains and exposes it.

A.3.1 System Architecture Overview

The system can be understood structurally as a layered architecture, shown below.

System Stack Diagram

Layered stack from foundational theory through coherence model, behavioural layer, measurement layer, governance layer, and system layer.
Figure A0. System architecture of the RI Safety Layer. The system is structured as a layered stack progressing from foundational theory through coherence modelling, behavioural definition, deterministic measurement, and governance, to external system interfaces. Each layer operates with defined responsibilities and constraints, preventing the conflation of observation, evaluation, and judgement.

A.4 Architectural separation

The system is built on a strict separation of three layers:

A.4.1 Behaviour (Observed Interaction)

  • raw model outputs
  • full conversational traces
  • input–output sequences

This layer is:

  • recorded in full
  • immutable once captured

A.4.2 Measurement (Deterministic Evaluation)

  • application of defined evaluation procedures
  • generation of structured metrics
  • production of reproducible summaries

This layer is:

  • deterministic
  • repeatable under identical conditions
  • independent of downstream judgement

A.4.3 Judgement (Interpretive Layer)

  • definition of evaluation criteria
  • thresholds, classifications, and containment rules
  • decisions regarding inclusion in aggregate metrics

This layer is:

  • explicitly defined
  • version-controlled
  • auditable

A.4.4 Key property

Measurement does not depend on judgement.

Judgement operates on the outputs of measurement.

This prevents:

  • retrospective reinterpretation of behaviour
  • hidden modification of evaluation outcomes
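
As an illustrative sketch, the separation can be expressed directly in code. The structures and names below are assumptions made for exposition, not the system's actual interfaces:

    from dataclasses import dataclass
    from typing import Mapping

    @dataclass(frozen=True)              # frozen: behaviour is immutable once captured
    class RecordedInteraction:
        trace_id: str
        turns: tuple[str, ...]           # full input-output sequence, recorded in full

    @dataclass(frozen=True)
    class MeasurementOutput:
        trace_id: str
        metrics: Mapping[str, float]     # structured, reproducible descriptors

    def measure(interaction: RecordedInteraction) -> MeasurementOutput:
        # Deterministic: depends only on the recorded interaction.
        return MeasurementOutput(
            trace_id=interaction.trace_id,
            metrics={"turn_count": float(len(interaction.turns))},
        )

    def judge(measurement: MeasurementOutput, criteria: Mapping[str, float]) -> bool:
        # Operates only on measurement outputs; never reads raw behaviour.
        return all(measurement.metrics.get(k, 0.0) >= v for k, v in criteria.items())

Because judge receives only measurement outputs, changing an evaluation outcome requires changing the declared criteria, which is a versioned and visible act.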

Behaviour / Measurement / Judgement Separation

Behaviour flows to measurement, and measurement outputs flow to judgement.
Figure A1. Separation of behaviour, measurement, and judgement. Behaviour is recorded as immutable evidence. Measurement produces deterministic outputs from that evidence. Judgement operates only on measurement outputs and does not alter them.

A.5 Determinism in behavioural measurement

Within the RI Safety Layer, determinism refers to:

the property that identical inputs, evaluation procedures, and system configurations produce identical outputs

This applies to:

  • metric computation
  • summary generation
  • evidence bundle construction

Determinism ensures:

  • reproducibility across runs
  • stability of evaluation outputs
  • independence from runtime variability

This is distinct from model determinism.

The system does not require the underlying model to be deterministic.

Instead, it ensures that:

given a recorded interaction, its evaluation is deterministic
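
One way to make this property mechanically checkable is to hash the evaluation inputs, configuration, and outputs under a canonical serialisation. The sketch below is illustrative: evaluate stands in for the deterministic measurement step, and the payload format is an assumption rather than the system's evidence-bundle schema:

    import hashlib
    import json

    def evaluate(interaction: dict, config: dict) -> dict:
        # Stand-in for the deterministic measurement step (illustrative only).
        return {"turn_count": len(interaction["turns"])}

    def evaluation_digest(interaction: dict, config: dict) -> str:
        # Canonical serialisation: identical inputs and configuration
        # must yield an identical digest on every run.
        outputs = evaluate(interaction, config)
        payload = json.dumps(
            {"interaction": interaction, "config": config, "outputs": outputs},
            sort_keys=True,
            separators=(",", ":"),
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    interaction = {"trace_id": "t-001", "turns": ["prompt", "response"]}
    config = {"criteria_version": "1.0"}
    assert evaluation_digest(interaction, config) == evaluation_digest(interaction, config)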

Deterministic Evaluation Path

Recorded interaction flows to deterministic measurement, producing stable outputs.
Figure A2. Deterministic evaluation. Given a fixed recorded interaction and evaluation configuration, the system produces identical outputs across runs. Determinism applies to the evaluation layer, not to the underlying model.

A.6 Operational definition of coherence

Coherence is defined operationally as:

the degree of alignment between observed behaviour and a declared set of evaluation criteria

Formally, for a given interaction I and criteria set C:

Coherence(I, C) = ƒ(Measurement(I), C)

Where:

  • Measurement(I) produces structured behavioural descriptors
  • C defines evaluative expectations
  • ƒ applies criteria to measured behaviour
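
A minimal sketch of this composition follows. The dimension names, targets, and weights are illustrative assumptions, not the system's actual criteria:

    def coherence(measurement: dict, criteria: dict) -> float:
        # Implements f: applies criteria set C to Measurement(I),
        # yielding a score in [0, 1]. criteria maps each behavioural
        # dimension to a (target, weight) pair.
        total_weight = sum(weight for _, weight in criteria.values())
        aligned = 0.0
        for dimension, (target, weight) in criteria.items():
            observed = measurement.get(dimension, 0.0)
            # Per-dimension alignment, capped at 1.0 once the target is met.
            aligned += weight * min(observed / target, 1.0)
        return aligned / total_weight

    measurement = {"consistency": 0.92, "stability": 0.75}
    criteria = {"consistency": (0.90, 2.0), "stability": (0.80, 1.0)}
    print(round(coherence(measurement, criteria), 3))  # ~0.979 under these criteria

Changing C changes the score without touching the measurement, which is what makes coherence criteria-relative rather than absolute.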

Coherence Evaluated Within a Defined Band

Coherence is evaluated along a bounded region from incoherent to coherent, with an acceptable band defined by criteria set C.
Figure A3. For a given criteria set C, behaviour is evaluated within a bounded region of acceptable alignment. The position, width, and threshold of this band are determined by explicit criteria and may vary across contexts.

A.6.1 Key properties

  • coherence is criteria-relative
  • coherence is context-dependent
  • coherence is computable and reproducible
  • coherence is open to inspection and challenge
  • coherence is not claimed as universal truth

A.6.2 Upper-bound interpretation

There is no absolute “highest coherence” independent of context.

Within this system, the upper bound of coherence is defined as:

maximal alignment with the currently declared criteria set C

Changing the criteria set may change the location or meaning of that upper bound.

A.6.3 Practical meaning

The system does not claim to discover coherence as a universal property.

It provides a transparent mechanism for defining, applying, and testing coherence within a specified evaluative frame.

This allows disagreement to focus on:

  • the measured behaviour
  • the criteria chosen
  • the procedure ƒ used to map behaviour to alignment outcomes

rather than on hidden or implicit judgement.

A.7 Construction of evaluation criteria

Evaluation criteria within the system are constructed through a combination of:

A.7.1 Empirical evaluation design

  • repeated testing across varied prompts
  • identification of stable behavioural patterns
  • measurement of response consistency
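
A hedged sketch of one such consistency measure appears below; the token-overlap metric is a stand-in for whatever measure the system actually applies:

    from itertools import combinations

    def normalise(response: str) -> set[str]:
        # Crude token-set normalisation; a real pipeline would use a stronger measure.
        return set(response.lower().split())

    def consistency(responses: list[str]) -> float:
        # Mean pairwise Jaccard overlap across responses to varied prompts.
        pairs = list(combinations(responses, 2))
        if not pairs:
            return 1.0
        overlaps = []
        for a, b in pairs:
            ta, tb = normalise(a), normalise(b)
            overlaps.append(len(ta & tb) / len(ta | tb) if ta | tb else 1.0)
        return sum(overlaps) / len(overlaps)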

A.7.2 Behavioural analysis

  • examination of model outputs under perturbation
  • detection of instability, contradiction, or drift
  • analysis of alignment between stated intent and response
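
Drift detection can be sketched in the same style, reusing the normalise helper above; the similarity threshold is an illustrative assumption:

    def drift_under_perturbation(baseline: str, perturbed: list[str],
                                 threshold: float = 0.6) -> list[int]:
        # Returns indices of perturbed responses that drift from the baseline.
        base = normalise(baseline)
        drifted = []
        for i, response in enumerate(perturbed):
            tokens = normalise(response)
            union = base | tokens
            similarity = len(base & tokens) / len(union) if union else 1.0
            if similarity < threshold:
                drifted.append(i)
        return drifted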

A.7.3 Theoretical modelling

Internally developed models of coherence and system alignment inform:

  • the structure of criteria
  • the selection of behavioural dimensions
  • the interpretation of observed patterns

These models aim to describe:

  • consistency under variation
  • structural integrity of responses
  • relationship between prompt conditions and outputs

A.7.4 Iterative refinement

Criteria are not fixed.

They are:

  • updated as new behaviours are observed
  • stress-tested against edge cases
  • evaluated for unintended consequences

All changes are:

  • versioned
  • documented
  • applied prospectively (not retroactively)
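
Prospective application can be sketched as version selection keyed to the time the behaviour was recorded. The structure, version labels, and dates below are illustrative:

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass(frozen=True)
    class CriteriaVersion:
        version: str
        effective_from: datetime   # criteria apply prospectively from this instant
        criteria: dict

    def applicable_version(versions: list[CriteriaVersion],
                           recorded_at: datetime) -> CriteriaVersion:
        # Select the criteria version in force when the behaviour was recorded.
        # Later versions are never applied retroactively to earlier evidence.
        in_force = [v for v in versions if v.effective_from <= recorded_at]
        return max(in_force, key=lambda v: v.effective_from)

    versions = [
        CriteriaVersion("1.0", datetime(2024, 1, 1), {"consistency": 0.90}),
        CriteriaVersion("1.1", datetime(2024, 6, 1), {"consistency": 0.95}),
    ]
    # A session recorded in March is judged under v1.0, even after v1.1 ships.
    assert applicable_version(versions, datetime(2024, 3, 15)).version == "1.0"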

A.7.5 Role of the Coherence Model

The construction of evaluation criteria is informed by an underlying model of behavioural coherence.

This model does not function as an absolute authority.

It provides a structured, explicitly defined interpretive lens through which behavioural patterns are understood.

Specifically, it:

  • identifies which behavioural dimensions are meaningful for evaluation
  • informs the construction of criteria applied to those dimensions
  • defines the contextual band within which behaviour is assessed

Evaluation therefore does not occur against an absolute standard.

It occurs relative to criteria that are:

  • explicitly defined
  • informed by the coherence model
  • applied consistently within a given context

The coherence model constrains interpretation — it does not replace it.

Coherence Model as Interpretive Lens

Observed behaviour passes through the coherence model, interpretation, criteria application, and measured outcome, with a test and challenge loop.
Figure A4. The coherence model as an interpretive lens. Observed behaviour is not evaluated directly against an absolute standard. It is interpreted through a structured model, which informs criteria selection and evaluation. Outputs are then tested through reproducibility and open inspection.

The validity of this approach is not asserted.

It is tested through:

  • reproducibility of results
  • stability of measurement
  • the ability of independent observers to inspect and challenge outcomes

A.8 Governance and containment

Judgement is operationalised through governance rules that determine:

  • whether a session contributes to aggregate metrics
  • whether results are surfaced in dashboards
  • whether outputs are flagged, held, or excluded

These decisions are:

  • derived from explicit criteria
  • recorded as signed governance outcomes
  • traceable to both measurement and criteria
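
A signed governance outcome can be sketched as follows. HMAC-SHA256 stands in for the system's actual signing scheme, and all field names are assumptions:

    import hashlib
    import hmac
    import json

    def sign_governance_outcome(session_id: str, decision: str,
                                criteria_version: str, measurement_digest: str,
                                key: bytes) -> dict:
        record = {
            "session_id": session_id,
            "decision": decision,                     # e.g. "include", "hold", "exclude"
            "criteria_version": criteria_version,     # traceable to the criteria
            "measurement_digest": measurement_digest, # traceable to the measurement
        }
        # Sign the unsigned record so the outcome is bound to its inputs.
        payload = json.dumps(record, sort_keys=True).encode("utf-8")
        record["signature"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
        return record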

A.9 Contestability and inspection

A central design objective is to enable structured disagreement.

The system allows independent observers to:

  • inspect the raw interaction
  • reproduce the measurement outputs
  • examine the applied criteria
  • evaluate the resulting judgement

Disagreement can therefore be localised to:

  • the behaviour itself
  • the measurement process
  • the criteria definition

This improves:

  • clarity of critique
  • speed of refinement
  • overall system robustness
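
Using the measure and judge sketches from A.4, localisation can be expressed as an elimination procedure. This is an illustration of the idea, not the system's actual dispute process:

    def localise_disagreement(interaction, criteria, reported_metrics, reported_verdict):
        # Re-derive each layer independently and attribute the first mismatch.
        reproduced = measure(interaction)
        if dict(reproduced.metrics) != dict(reported_metrics):
            return "measurement"            # deterministic re-run disagrees with the report
        if judge(reproduced, criteria) != reported_verdict:
            return "judgement"              # same metrics, different application of criteria
        return "behaviour or criteria"      # dispute lies in the evidence or in C itself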

A.10 Limitations

The system does not:

  • establish a universal definition of safe or correct behaviour
  • eliminate the need for human judgement
  • guarantee that chosen criteria are optimal

Instead, it provides:

  • a framework in which judgement is explicit
  • a process in which evaluation is reproducible
  • a structure in which assumptions can be tested and revised

A.11 Implications for AI safety evaluation

By separating behaviour, measurement, and judgement, the system enables:

  • reproducible evaluation independent of interpretation
  • governance of metrics prior to publication
  • transparent linkage between assumptions and outcomes

This supports:

  • regulatory review
  • third-party audit
  • iterative improvement of evaluation frameworks

A.12 Summary

The RI Safety Layer introduces a structured approach to AI evaluation in which:

  • behaviour is fully recorded
  • measurement is deterministic
  • judgement is explicit and versioned

Coherence is not assumed.

It is defined, applied, and open to challenge.

The system therefore replaces implicit trust with:

inspectable process, reproducible results, and explicit assumptions