Appendix A
Judgement, Coherence, and Deterministic Evaluation
A Formal Note on Epistemic Structure in the RI Safety Layer (v1.0)
A.1 Purpose of this appendix
This appendix provides a formal clarification of how the RI Safety Layer defines and implements:
- judgement
- coherence
- behavioural evaluation
It addresses a central question in AI evaluation:
How are evaluation criteria defined, and why should they be considered meaningful?
The aim is not to assert a universal definition of “correct” behaviour, but to describe:
- how criteria are constructed
- how they are applied
- how their influence on outcomes can be inspected, reproduced, and challenged
A.2 The epistemic problem in AI evaluation
All AI evaluation systems depend on prior assumptions.
These assumptions determine:
- what behaviours are considered desirable
- what constitutes failure
- how trade-offs are resolved
In most existing evaluation frameworks, these assumptions are:
- embedded in benchmark design
- implicit in scoring systems
- difficult to isolate from results
This creates three core limitations:
- Opacity — the basis of judgement is not fully visible
- Non-reproducibility of interpretation — results cannot be cleanly separated from evaluative assumptions
- Limited contestability — disagreement is difficult to localise (data vs method vs judgement)
A.3 System position
The RI Safety Layer treats judgement as an explicit and separable component of evaluation.
It does not attempt to:
- remove human-defined criteria
- claim objective or universal correctness
Instead, it enforces:
explicit, structured, and inspectable judgement
The system does not eliminate subjectivity.
It constrains and exposes it.
A.3.1 System Architecture Overview
The system can be understood structurally as a layered architecture, shown below.
[Figure: System Stack Diagram]
A.4 Architectural separation
The system is built on a strict separation between three layers:
A.4.1 Behaviour (Observed Interaction)
- raw model outputs
- full conversational traces
- input–output sequences
This layer is:
- recorded in full
- immutable once captured
A.4.2 Measurement (Deterministic Evaluation)
- application of defined evaluation procedures
- generation of structured metrics
- production of reproducible summaries
This layer is:
- deterministic
- repeatable under identical conditions
- independent of downstream judgement
A.4.3 Judgement (Interpretive Layer)
- definition of evaluation criteria
- thresholds, classifications, and containment rules
- decisions regarding inclusion in aggregate metrics
This layer is:
- explicitly defined
- version-controlled
- auditable
A.4.4 Key property
Measurement does not depend on judgement.
Judgement operates on the outputs of measurement.
This prevents:
- retrospective reinterpretation of behaviour
- hidden modification of evaluation outcomes
[Figure: Behaviour / Measurement / Judgement Separation]
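The three-layer separation can be sketched in code. This is a minimal illustration, not the system's actual interfaces: the names (`Interaction`, `measure`, `judge`) and the placeholder metrics are assumptions introduced here to show the structural property that judgement consumes only measurement outputs.

```python
from dataclasses import dataclass
from typing import Mapping

@dataclass(frozen=True)              # behaviour layer: immutable once captured
class Interaction:
    trace: tuple[str, ...]           # full input-output sequence

def measure(interaction: Interaction) -> dict[str, float]:
    # Measurement layer: deterministic, depends only on the recorded trace.
    # The metrics here are illustrative placeholders.
    turns = len(interaction.trace)
    return {"turns": float(turns),
            "mean_chars": sum(map(len, interaction.trace)) / turns}

def judge(metrics: Mapping[str, float],
          thresholds: Mapping[str, float]) -> bool:
    # Judgement layer: sees only measurement outputs, never the raw
    # behaviour, so measurement cannot depend on judgement.
    return all(metrics[name] <= limit for name, limit in thresholds.items())
```

Because `judge` takes metrics rather than the interaction itself, changing the criteria cannot retroactively alter what was measured.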
A.5 Determinism in behavioural measurement
Within the RI Safety Layer, determinism refers to:
the property that identical inputs, evaluation procedures, and system configurations produce identical outputs
This applies to:
- metric computation
- summary generation
- evidence bundle construction
Determinism ensures:
- reproducibility across runs
- stability of evaluation outputs
- independence from runtime variability
This is distinct from model determinism.
The system does not require the underlying model to be deterministic.
Instead, it ensures that:
given a recorded interaction, its evaluation is deterministic
[Figure: Deterministic Evaluation Path]
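The deterministic property can be illustrated with a sketch in which evaluation is a pure function of the recorded interaction and the configuration. The function name, the metric, and the `max_chars` parameter are assumptions made for the example, not the system's real evaluation procedure.

```python
import hashlib
import json

def evaluate(recorded: dict, config: dict) -> dict:
    # No clock, no randomness, no network access: the output is a
    # pure function of the recorded interaction and the configuration.
    text = " ".join(recorded["outputs"])
    metrics = {"output_chars": len(text),
               "over_limit": len(text) > config["max_chars"]}
    # A digest over (inputs, config, metrics) lets a third party
    # confirm byte-identical reproduction of the evaluation.
    payload = json.dumps([recorded, config, metrics], sort_keys=True)
    return {"metrics": metrics,
            "digest": hashlib.sha256(payload.encode()).hexdigest()}

# Two runs over the same recorded interaction are identical:
run_a = evaluate({"outputs": ["a", "bb"]}, {"max_chars": 10})
run_b = evaluate({"outputs": ["a", "bb"]}, {"max_chars": 10})
assert run_a == run_b
```

Note that the underlying model may be stochastic; determinism applies downstream of the recorded trace, which is exactly the distinction drawn above.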
A.6 Operational definition of coherence
Coherence is defined operationally as:
the degree of alignment between observed behaviour and a declared set of evaluation criteria
Formally, for a given interaction I and criteria set C:

Coherence(I, C) = ƒ(Measurement(I), C)

Where:
- Measurement(I) produces structured behavioural descriptors
- C defines evaluative expectations
- ƒ applies criteria to measured behaviour
[Figure: Coherence evaluated within a defined band]
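One possible realisation of ƒ is sketched below: coherence as the fraction of declared criteria whose band contains the measured value. The band representation and the example criteria names are assumptions for illustration; the system's actual mapping may differ.

```python
def coherence(metrics: dict[str, float],
              criteria: dict[str, tuple[float, float]]) -> float:
    # f(Measurement(I), C): the fraction of declared criteria whose
    # band contains the measured value. Criteria-relative by design:
    # changing C changes both the score and its upper bound.
    in_band = sum(low <= metrics[name] <= high
                  for name, (low, high) in criteria.items())
    return in_band / len(criteria)

# Illustrative criteria set C with per-dimension bands:
C = {"consistency": (0.8, 1.0), "drift": (0.0, 0.2)}
score = coherence({"consistency": 0.9, "drift": 0.35}, C)  # one of two bands met -> 0.5
```

Because both `C` and the behaviour of `coherence` are explicit, a challenge can target the measured values, the chosen bands, or the mapping itself.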
A.6.1 Key properties
- coherence is criteria-relative
- coherence is context-dependent
- coherence is computable and reproducible
- coherence is open to inspection and challenge
- coherence is not claimed as universal truth
A.6.2 Upper-bound interpretation
There is no absolute “highest coherence” independent of context.
Within this system, the upper bound of coherence is defined as:
maximal alignment with the currently declared criteria set C
Changing the criteria set may change the location or meaning of that upper bound.
A.6.3 Practical meaning
The system does not claim to discover coherence as a universal property.
It provides a transparent mechanism for defining, applying, and testing coherence within a specified evaluative frame.
This allows disagreement to focus on:
- the measured behaviour
- the criteria chosen
- the procedure ƒ used to map behaviour to alignment outcomes
rather than on hidden or implicit judgement.
A.7 Construction of evaluation criteria
Evaluation criteria within the system are constructed through a combination of:
A.7.1 Empirical evaluation design
- repeated testing across varied prompts
- identification of stable behavioural patterns
- measurement of response consistency
A.7.2 Behavioural analysis
- examination of model outputs under perturbation
- detection of instability, contradiction, or drift
- analysis of alignment between stated intent and response
A.7.3 Theoretical modelling
Internally developed models of coherence and system alignment inform:
- the structure of criteria
- the selection of behavioural dimensions
- the interpretation of observed patterns
These models aim to describe:
- consistency under variation
- structural integrity of responses
- relationship between prompt conditions and outputs
A.7.4 Iterative refinement
Criteria are not fixed.
They are:
- updated as new behaviours are observed
- stress-tested against edge cases
- evaluated for unintended consequences
All changes are:
- versioned
- documented
- applied prospectively (not retroactively)
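Prospective application can be sketched as a version-selection rule: each session is judged by the newest criteria already in force when it was recorded. The `CriteriaVersion` structure and dates here are hypothetical, introduced only to illustrate the rule.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class CriteriaVersion:
    version: str
    effective_from: datetime   # versioned, documented, prospective
    thresholds: dict

def criteria_for(session_time: datetime,
                 history: list[CriteriaVersion]) -> CriteriaVersion:
    # Prospective application: select the newest version already in
    # force when the session was recorded, so later revisions never
    # reinterpret past behaviour.
    in_force = [c for c in history if c.effective_from <= session_time]
    return max(in_force, key=lambda c: c.effective_from)

history = [
    CriteriaVersion("1.0", datetime(2025, 1, 1, tzinfo=timezone.utc), {"drift": 0.3}),
    CriteriaVersion("1.1", datetime(2025, 6, 1, tzinfo=timezone.utc), {"drift": 0.2}),
]
chosen = criteria_for(datetime(2025, 3, 1, tzinfo=timezone.utc), history)  # -> version "1.0"
```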
A.7.5 Role of the Coherence Model
The construction of evaluation criteria is informed by an underlying model of behavioural coherence.
This model does not function as an absolute authority.
It provides a structured, explicitly defined interpretive lens through which behavioural patterns are understood.
Specifically, it:
- identifies which behavioural dimensions are meaningful for evaluation
- informs the construction of criteria applied to those dimensions
- defines the contextual band within which behaviour is assessed
Evaluation therefore does not occur against an absolute standard.
It occurs relative to criteria that are:
- explicitly defined
- informed by the coherence model
- applied consistently within a given context
The coherence model constrains interpretation — it does not replace it.
[Figure: Coherence Model as Interpretive Lens]
The validity of this approach is not asserted.
It is tested through:
- reproducibility of results
- stability of measurement
- the ability of independent observers to inspect and challenge outcomes
A.8 Governance and containment
Judgement is operationalised through governance rules that determine:
- whether a session contributes to aggregate metrics
- whether results are surfaced in dashboards
- whether outputs are flagged, held, or excluded
These decisions are:
- derived from explicit criteria
- recorded as signed governance outcomes
- traceable to both measurement and criteria
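A governance outcome of this kind can be sketched as a record that is derived mechanically from explicit criteria and then signed. The HMAC signing, the key, and the field names are assumptions for illustration; the system's actual signing scheme is not specified here.

```python
import hashlib
import hmac
import json

def governance_outcome(metrics: dict, criteria: dict, key: bytes) -> dict:
    # The inclusion decision is derived mechanically from the explicit
    # criteria, then signed so the outcome is traceable and tamper-evident.
    include = all(metrics[k] <= limit for k, limit in criteria.items())
    record = {"include_in_aggregates": include,
              "criteria": criteria,
              "metrics": metrics}
    payload = json.dumps(record, sort_keys=True).encode()
    record["signature"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return record

outcome = governance_outcome({"drift": 0.1}, {"drift": 0.2}, key=b"governance-key")
```

Because the record embeds both the criteria and the metrics, the decision remains traceable to both layers, as described above.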
A.9 Contestability and inspection
A central design objective is to enable structured disagreement.
The system allows independent observers to:
- inspect the raw interaction
- reproduce the measurement outputs
- examine the applied criteria
- evaluate the resulting judgement
Disagreement can therefore be localised to:
- the behaviour itself
- the measurement process
- the criteria definition
This improves:
- clarity of critique
- speed of refinement
- overall system robustness
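The first step of such structured disagreement can be sketched as a reproduction check: if recomputing the measurement from the raw trace does not match the published metrics, the dispute lies in the measurement process; if it does, critique shifts to the criteria or the behaviour itself. The function and return strings are illustrative assumptions.

```python
def localise_disagreement(raw_trace, published_metrics, measure_fn) -> str:
    # Re-run the deterministic measurement on the raw trace and compare
    # against the published metrics to localise where disagreement lies.
    recomputed = measure_fn(raw_trace)
    if recomputed != published_metrics:
        return "contest the measurement process"
    return "measurement reproduced; contest criteria or behaviour"

# Illustrative measurement function: count conversational turns.
measure_fn = lambda trace: {"turns": len(trace)}
status = localise_disagreement(("q", "a"), {"turns": 2}, measure_fn)
```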
A.10 Limitations
The system does not:
- establish a universal definition of safe or correct behaviour
- eliminate the need for human judgement
- guarantee that chosen criteria are optimal
Instead, it provides:
- a framework in which judgement is explicit
- a process in which evaluation is reproducible
- a structure in which assumptions can be tested and revised
A.11 Implications for AI safety evaluation
By separating behaviour, measurement, and judgement, the system enables:
- reproducible evaluation independent of interpretation
- governance of metrics prior to publication
- transparent linkage between assumptions and outcomes
This supports:
- regulatory review
- third-party audit
- iterative improvement of evaluation frameworks
A.12 Summary
The RI Safety Layer introduces a structured approach to AI evaluation in which:
- behaviour is fully recorded
- measurement is deterministic
- judgement is explicit and versioned
Coherence is not assumed.
It is defined, applied, and open to challenge.
The system therefore replaces implicit trust with:
inspectable process, reproducible results, and explicit assumptions