LLM Psychological Evaluation Benchmark

1. Dataset and Notation

  • Rows are individual “shots” of a question under a given model/language/configuration/temperature.
  • Columns:
    • model ∈ {anthropic/claude-3.7-sonnet, google/gemini-2.0-flash-001, google/gemma-3-27b-it, meta-llama/llama-3.3-70b-instruct, openai/gpt-4o-2024-11-20, x‑ai/grok-2-1212}
    • language ∈ {english, romanian}
    • configuration ∈ {closed_form, closed_form_with_explanation, open_form}
    • temperature ∈ {0.65, 1.0, 1.35}
    • question_id ∈ {1…100}
    • shot ∈ {1…5}
    • final_extracted_closed_answer ∈ {A…H}
    • correct_answer ∈ {A…H}

For each row \( i \) we define a binary outcome \( \mathrm{correct}_i = 1 \) if final_extracted_closed_answer == correct_answer, and \( \mathrm{correct}_i = 0 \) otherwise.
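As a minimal sketch, with two invented sample rows following the schema above, the binary outcome is derived per row:

```python
# Label each shot as correct (1) or incorrect (0) by comparing the
# extracted closed-form answer against the gold answer.
rows = [
    {"model": "openai/gpt-4o-2024-11-20", "question_id": 1, "shot": 1,
     "final_extracted_closed_answer": "C", "correct_answer": "C"},
    {"model": "openai/gpt-4o-2024-11-20", "question_id": 1, "shot": 2,
     "final_extracted_closed_answer": "B", "correct_answer": "C"},
]

for row in rows:
    row["correct"] = int(row["final_extracted_closed_answer"] == row["correct_answer"])

print([row["correct"] for row in rows])  # [1, 0]
```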


2. Exploratory & Baseline Metrics

  1. Raw Agreement Fraction
    For any grouping \( g \),
    \[ \mathrm{agreement\_frac}_g = \max_{c \in \{\mathrm{A},\ldots,\mathrm{H}\}} \frac{\#\{\text{shots in } g \text{ with answer } c\}}{\#\{\text{shots in } g\}}. \]
  2. Raw Correctness Ratio
    \[ \mathrm{correctness\_frac}_g = \frac{\sum_{i \in g} \mathrm{correct}_i}{|g|}. \]
  3. One‑Point “ROC”
    • Thresholding raw agreement or raw correctness at a single threshold \( t \) yields a single (TPR, FPR) point, not a curve.
    • We need a continuous score to sweep thresholds and build a full ROC.
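Both raw metrics can be computed per group without any dependencies; the dict-shaped shots below (and their field names) are illustrative:

```python
from collections import Counter, defaultdict

def raw_metrics(shots, key):
    """Compute (agreement_frac, correctness_frac) for each group,
    where `key` maps a shot dict to its group identifier."""
    groups = defaultdict(list)
    for s in shots:
        groups[key(s)].append(s)
    out = {}
    for g, members in groups.items():
        votes = Counter(s["answer"] for s in members)
        agreement_frac = votes.most_common(1)[0][1] / len(members)
        correctness_frac = sum(s["correct"] for s in members) / len(members)
        out[g] = (agreement_frac, correctness_frac)
    return out

shots = [
    {"question_id": 1, "answer": "C", "correct": 1},
    {"question_id": 1, "answer": "C", "correct": 1},
    {"question_id": 1, "answer": "B", "correct": 0},
]
metrics = raw_metrics(shots, key=lambda s: s["question_id"])
# Both fractions are 2/3 for question 1: the majority answer "C"
# covers two of three shots, and two of three shots are correct.
```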

3. Hierarchical Logistic GLMM (“Smoothed” Correctness)

We fit, at the shot level, a logistic mixed‐effects model:
\[ \mathrm{logit}\bigl(P(\mathrm{correct}_i)\bigr) = \beta_0 + u_{\mathrm{model}[i]} + u_{\mathrm{language}[i]} + u_{\mathrm{configuration}[i]} + u_{\mathrm{temperature}[i]} + u_{\mathrm{question\_id}[i]}, \]

  • Random intercepts: one per level of model, language, configuration, temperature, and question_id, with each \( u_{\cdot} \sim N(0,\sigma^2_{\cdot}) \).
  • Smoothing via partial pooling:
    Raw group ratios, especially in small cells, are extreme (0 or 1). The GLMM shrinks these extremes toward the global mean, borrowing strength across all groups. This stabilizes low‐data cells while preserving large‐cell signal.
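The shrinkage behaviour can be illustrated without any mixed-model library. The sketch below is a simplified empirical-Bayes stand-in for the GLMM's random-intercept pooling (not the full model); the pseudo-count prior strength `k` is an assumed hyperparameter:

```python
def partial_pool(group_counts, k=5.0):
    """Shrink raw per-group correctness ratios toward the global mean.

    group_counts: dict mapping group -> (n_correct, n_total).
    k: prior strength in pseudo-shots; larger k pools harder.
    Small groups (few shots) move strongly toward the global mean,
    large groups keep most of their raw signal -- the same qualitative
    behaviour as the GLMM's random-intercept shrinkage.
    """
    total_correct = sum(c for c, _ in group_counts.values())
    total_shots = sum(n for _, n in group_counts.values())
    p_global = total_correct / total_shots
    return {g: (c + k * p_global) / (n + k)
            for g, (c, n) in group_counts.items()}

pooled = partial_pool({"small_cell": (1, 1), "big_cell": (60, 100)}, k=5.0)
# small_cell: raw ratio 1.0 shrinks heavily toward the global mean (~0.60);
# big_cell: raw ratio 0.60 barely moves.
```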

4. Extracting Continuous Scores

  1. Shot‑level fitted probability
    \[ \hat p_i = P(\mathrm{correct}_i \mid \text{GLMM}) \in [0,1]. \]
  2. Group‑level summaries (for any grouping \( G \)):
    • \( \mathrm{pred\_prob\_mean}_G = \frac{1}{|G|}\sum_{i \in G}\hat p_i \)
    • \( \mathrm{agreement\_frac}_G \) as above
    • Consensus correctness:
      \( \mathrm{consensus\_correct}_G = 1 \) if the majority answer in \( G \) matches gold, else 0.
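The three group-level summaries can be computed together; the `p_hat` values below stand in for GLMM fitted probabilities and are invented for illustration:

```python
from collections import Counter

def group_summary(shots, gold):
    """Summarise one group of shots.

    shots: list of dicts with keys "answer" (A..H) and "p_hat"
           (the fitted probability for that shot).
    gold:  the gold answer for the group's question.
    """
    votes = Counter(s["answer"] for s in shots)
    majority_answer, majority_count = votes.most_common(1)[0]
    return {
        "agreement_frac": majority_count / len(shots),
        "pred_prob_mean": sum(s["p_hat"] for s in shots) / len(shots),
        "consensus_correct": int(majority_answer == gold),
    }

summary = group_summary(
    [{"answer": "C", "p_hat": 0.8},
     {"answer": "C", "p_hat": 0.7},
     {"answer": "B", "p_hat": 0.4}],
    gold="C",
)
# agreement_frac = 2/3, pred_prob_mean = (0.8+0.7+0.4)/3,
# consensus_correct = 1 (the majority answer "C" matches gold).
```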

5. Multi‑Level ROC Curves

Loop over all non‑empty subsets of \( \{\text{model}, \text{language}, \text{configuration}, \text{temperature}, \text{question\_id}\} \). For each combination \( C \):

  1. Group by \( C \).
  2. Compute per group \( G \):
    • Score 1: \( s^{(1)}_G = \mathrm{agreement\_frac}_G \)
    • Score 2: \( s^{(2)}_G = \mathrm{pred\_prob\_mean}_G \)
    • Label: \( y_G = \mathrm{consensus\_correct}_G \)
  3. ROC & AUC
    • \( \mathrm{ROC}_1 \) from \( \{(s^{(1)}_G, y_G)\} \), with area \( AUC_{\mathrm{agree}}(C) \).
    • \( \mathrm{ROC}_2 \) from \( \{(s^{(2)}_G, y_G)\} \), with area \( AUC_{\mathrm{GLMM}}(C) \).
  4. ΔAUC
    \[ \Delta AUC(C) = AUC_{\mathrm{GLMM}}(C) - AUC_{\mathrm{agree}}(C). \]
    This difference quantifies how much extra discrimination the GLMM’s smoothing provides over raw vote strength.
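The AUCs above can be computed via the rank-based (Mann-Whitney) formulation, which needs no explicit threshold sweep; the per-group scores and labels below are invented for illustration:

```python
def auc(scores, labels):
    """Rank-based (Mann-Whitney) AUC: the probability that a randomly
    chosen positive group outscores a randomly chosen negative one,
    counting ties as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy per-group data: s1 = agreement_frac, s2 = pred_prob_mean,
# y = consensus_correct.
s1 = [1.0, 0.6, 0.6, 0.4]
s2 = [0.9, 0.7, 0.5, 0.3]
y  = [1,   1,   0,   0]

delta_auc = auc(s2, y) - auc(s1, y)
# auc(s1, y) = 0.875 (one tie), auc(s2, y) = 1.0, so delta_auc = 0.125:
# the smoothed score separates the groups slightly better here.
```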

6. Statistical Testing: DeLong’s Test

  • Objective: test whether \( \Delta AUC(C) > 0 \) beyond chance.
  • DeLong’s method for paired ROC curves:
    1. Treat each group \( G \) as a paired observation \( (s^{(1)}_G, s^{(2)}_G) \).
    2. Estimate
      \( \mathrm{Var}(\widehat{AUC}_1) \),
      \( \mathrm{Var}(\widehat{AUC}_2) \),
      \( \mathrm{Cov}(\widehat{AUC}_1,\widehat{AUC}_2) \)
      via U‑statistics.
    3. Compute
      \[ Z = \frac{\Delta AUC}{\sqrt{\mathrm{Var}(AUC_1)+\mathrm{Var}(AUC_2)-2\,\mathrm{Cov}(AUC_1,AUC_2)}} \]
      and derive a p‑value.
  • Interpretation: A low p‑value indicates that GLMM smoothing yields a statistically significant increase in discriminative power over raw agreement.
  • Meta‑analysis view: each factor's AUC comparison is akin to an individual study comparing two classifiers; applying DeLong's test across the paired curves provides a higher‑order inference on whether the GLMM consistently outperforms raw voting across dimensions.
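A self-contained sketch of DeLong's paired test using the U-statistic structural components follows; the scores and labels are invented for illustration, and a production analysis would use a vetted statistical package rather than this hand-rolled version:

```python
import math

def delong_paired(s1, s2, labels):
    """DeLong's test for two paired ROC curves (pure-Python sketch).

    s1, s2: paired scores for the same groups; labels: 0/1 outcomes.
    Returns (auc1, auc2, z), where z is the statistic for auc2 - auc1.
    """
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    m, n = len(pos_idx), len(neg_idx)

    def psi(x, y):
        return 1.0 if x > y else 0.5 if x == y else 0.0

    def components(scores):
        # U-statistic structural components: one value per positive
        # (V10) and one per negative (V01).
        v10 = [sum(psi(scores[i], scores[j]) for j in neg_idx) / n
               for i in pos_idx]
        v01 = [sum(psi(scores[i], scores[j]) for i in pos_idx) / m
               for j in neg_idx]
        return v10, v01

    def cov(a, b):
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

    v10_1, v01_1 = components(s1)
    v10_2, v01_2 = components(s2)
    auc1, auc2 = sum(v10_1) / m, sum(v10_2) / m

    # Var(AUC1) + Var(AUC2) - 2 Cov(AUC1, AUC2), combining the
    # positive-side (S10/m) and negative-side (S01/n) pieces.
    var_diff = ((cov(v10_1, v10_1) + cov(v10_2, v10_2)
                 - 2 * cov(v10_1, v10_2)) / m
                + (cov(v01_1, v01_1) + cov(v01_2, v01_2)
                   - 2 * cov(v01_1, v01_2)) / n)
    z = (auc2 - auc1) / math.sqrt(var_diff) if var_diff > 0 else 0.0
    return auc1, auc2, z

# Toy paired scores over six groups.
labels = [1, 1, 1, 0, 0, 0]
s1 = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]   # e.g. agreement_frac
s2 = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]   # e.g. pred_prob_mean
auc1, auc2, z = delong_paired(s1, s2, labels)
# z > 0 here: the second score ranks the groups strictly better.
```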

7. Ranking & Visualization

  1. Combination heatmap
    Bar plots or heatmaps of \( \Delta AUC(C) \) across all group combinations to identify interactions.

  2. Single‐factor summary table

| Factor | \( AUC_{\text{agree}} \) | \( AUC_{\text{GLMM}} \) | ΔAUC | p‑value |
|---|---|---|---|---|
| model | 0.78 | 0.83 | +0.05 | 0.04 |
| language | 0.64 | 0.81 | +0.17 | <0.001 |
| configuration | | | | |
| temperature | | | | |
| question_id | | | | |

8. Edge‐Case: configuration == "open_form"

  • For open_form, a semantic majority vote across shots collapses each group to a single answer, so:
    • \( \mathrm{agreement\_frac}_G = 1 \) for every group.
    • Raw correctness is 0 or 1 only; there is no curve to trace.
  • Treatment:
    1. Exclude open_form from any ROC needing per‑shot variability.
    2. Include it in the GLMM: each singleton shot still yields \( \hat p_i \).
    3. In combinations including open_form, \( \Delta AUC = AUC_{\mathrm{GLMM}} - 0.5 \), since the constant agreement score gives \( AUC_{\mathrm{agree}} = 0.5 \).
    4. Document this degeneracy explicitly.
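A small guard makes the degeneracy explicit in code: when the agreement score is constant across groups, as under open_form, the chance value 0.5 is returned instead of an undefined ranking:

```python
def safe_agreement_auc(scores, labels):
    """Rank-based AUC of the agreement score, defaulting to the chance
    value 0.5 when the score is constant (as for open_form, where every
    group's agreement_frac collapses to 1)."""
    if len(set(scores)) == 1:
        return 0.5  # degenerate: no ranking information in the score
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(safe_agreement_auc([1.0, 1.0, 1.0], [1, 0, 1]))  # 0.5
```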

9. Putting It All Together

  1. Prepare data → label shots → compute raw metrics.
  2. Fit GLMM → extract \( \hat p_i \).
  3. Auto‐group over factor combinations → compute scores & labels.
  4. Compute ROCs → derive \( AUC_{\mathrm{agree}} \), \( AUC_{\mathrm{GLMM}} \), and \( \Delta AUC \).
  5. Run DeLong → test significance.
  6. Visualize & rank → spot high‐impact factors.
  7. Handle open_form edge case.

This pipeline yields descriptive (where performance dips) and inferential (which gains are significant) insights, mapping exactly which dimensions drive noise and how hierarchical pooling mitigates it.


10. Detailed Interpretation & Statistical Connections

The DeLong test offers nonparametric, U‑statistic‑based inference for paired ROC curves. ΔAUC isolates the discrimination gained from modeling grouping heterogeneity, distinct from model‑intrinsic uncertainty. Its rank‑based nature avoids prevalence and threshold biases and sidesteps the Simpson's‑paradox issues seen with IDI, making ΔAUC the most robust global discrimination metric in our pipeline.


11. Comparative Insights from Metric Studies

Drawing on Pencina & Demler (2012) and the PMC3341978 / PubMed29160558 comparisons:

  • Under LDA, ΔAUC, IDI, and NRI all scale with squared Mahalanobis distance—ΔAUC ∈ [0,0.5], IDI ∈ [–1,1], NRI ∈ [–1,1].
  • ΔAUC & IDI depend on baseline discrimination & prevalence; NRI does not.
  • Simpson’s paradox can plague IDI (necessitating weighted variants); ΔAUC’s rank‐based global summary avoids it.
  • ΔAUC remains the most stable, threshold‐agnostic measure of discrimination improvement for our multi‐level analysis.
