LLM Psychological Evaluation Benchmark

1. Dataset and Notation

  • Rows are individual “shots” of a question under a given model/language/configuration/temperature.
  • Columns:
    • model ∈ {anthropic/claude-3.7-sonnet, google/gemini-2.0-flash-001, google/gemma-3-27b-it, meta-llama/llama-3.3-70b-instruct, openai/gpt-4o-2024-11-20, x‑ai/grok-2-1212}
    • language ∈ {english, romanian}
    • configuration ∈ {closed_form, closed_form_with_explanation, open_form}
    • temperature ∈ {0.65, 1.0, 1.35}
    • question_id ∈ {1…100}
    • shot ∈ {1…5}
    • final_extracted_closed_answer ∈ {A…H}
    • correct_answer ∈ {A…H}

For each row \( i \) we define a binary outcome \( \mathrm{correct}_i = 1 \) if final_extracted_closed_answer == correct_answer, and \( \mathrm{correct}_i = 0 \) otherwise.
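As a minimal sketch, with two invented sample rows following the schema above, the binary outcome is derived per row:

```python
# Label each shot as correct (1) or incorrect (0) by comparing the
# extracted closed-form answer against the gold answer.
rows = [
    {"model": "openai/gpt-4o-2024-11-20", "question_id": 1, "shot": 1,
     "final_extracted_closed_answer": "C", "correct_answer": "C"},
    {"model": "openai/gpt-4o-2024-11-20", "question_id": 1, "shot": 2,
     "final_extracted_closed_answer": "B", "correct_answer": "C"},
]

for row in rows:
    row["correct"] = int(row["final_extracted_closed_answer"] == row["correct_answer"])

print([row["correct"] for row in rows])  # [1, 0]
```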


2. Exploratory & Baseline Metrics

  1. Raw Agreement Fraction
    For any grouping \( g \),
    \[ \mathrm{agreement\_frac}_g = \max_{c \in \{\mathrm{A},\ldots,\mathrm{H}\}} \frac{\#\{\text{shots in } g \text{ with answer } c\}}{\#\{\text{shots in } g\}}. \]
  2. Raw Correctness Ratio
    \[ \mathrm{correctness\_frac}_g = \frac{\sum_{i \in g} \mathrm{correct}_i}{|g|}. \]
  3. One‑Point “ROC”
    • Thresholding raw agreement or raw correctness at a single threshold \( t \) yields a single (TPR, FPR) point, not a curve.
    • We need a continuous score to sweep thresholds and build a full ROC.
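Both raw metrics can be computed per group without any dependencies; the dict-shaped shots below (and their field names) are illustrative:

```python
from collections import Counter, defaultdict

def raw_metrics(shots, key):
    """Compute (agreement_frac, correctness_frac) for each group,
    where `key` maps a shot dict to its group identifier."""
    groups = defaultdict(list)
    for s in shots:
        groups[key(s)].append(s)
    out = {}
    for g, members in groups.items():
        votes = Counter(s["answer"] for s in members)
        agreement_frac = votes.most_common(1)[0][1] / len(members)
        correctness_frac = sum(s["correct"] for s in members) / len(members)
        out[g] = (agreement_frac, correctness_frac)
    return out

shots = [
    {"question_id": 1, "answer": "C", "correct": 1},
    {"question_id": 1, "answer": "C", "correct": 1},
    {"question_id": 1, "answer": "B", "correct": 0},
]
metrics = raw_metrics(shots, key=lambda s: s["question_id"])
# Both fractions are 2/3 for question 1: the majority answer "C"
# covers two of three shots, and two of three shots are correct.
```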

3. Hierarchical Logistic GLMM (“Smoothed” Correctness)

We fit, at the shot level, a logistic mixed‐effects model:
\[ \mathrm{logit}\bigl(P(\mathrm{correct}_i)\bigr) = \beta_0 + u_{\mathrm{model}[i]} + u_{\mathrm{language}[i]} + u_{\mathrm{configuration}[i]} + u_{\mathrm{temperature}[i]} + u_{\mathrm{question\_id}[i]}, \]

  • Random intercepts: one per level of model, language, configuration, temperature, and question_id, with each \( u_{\cdot} \sim N(0,\sigma^2_{\cdot}) \).
  • Smoothing via partial pooling:
    Raw group ratios, especially in small cells, are extreme (0 or 1). The GLMM shrinks these extremes toward the global mean, borrowing strength across all groups. This stabilizes low‐data cells while preserving large‐cell signal.
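The shrinkage behaviour can be illustrated without any mixed-model library. The sketch below is a simplified empirical-Bayes stand-in for the GLMM's random-intercept pooling (not the full model); the pseudo-count prior strength `k` is an assumed hyperparameter:

```python
def partial_pool(group_counts, k=5.0):
    """Shrink raw per-group correctness ratios toward the global mean.

    group_counts: dict mapping group -> (n_correct, n_total).
    k: prior strength in pseudo-shots; larger k pools harder.
    Small groups (few shots) move strongly toward the global mean,
    large groups keep most of their raw signal -- the same qualitative
    behaviour as the GLMM's random-intercept shrinkage.
    """
    total_correct = sum(c for c, _ in group_counts.values())
    total_shots = sum(n for _, n in group_counts.values())
    p_global = total_correct / total_shots
    return {g: (c + k * p_global) / (n + k)
            for g, (c, n) in group_counts.items()}

pooled = partial_pool({"small_cell": (1, 1), "big_cell": (60, 100)}, k=5.0)
# small_cell: raw ratio 1.0 shrinks heavily toward the global mean (~0.60);
# big_cell: raw ratio 0.60 barely moves.
```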

4. Extracting Continuous Scores

  1. Shot‑level fitted probability
    \[ \hat p_i = P(\mathrm{correct}_i \mid \text{GLMM}) \in [0,1]. \]
  2. Group‑level summaries (for any grouping \( G \)):
    • \( \mathrm{pred\_prob\_mean}_G = \frac{1}{|G|}\sum_{i \in G}\hat p_i \)
    • \( \mathrm{agreement\_frac}_G \) as above
    • Consensus correctness:
      \( \mathrm{consensus\_correct}_G = 1 \) if the majority answer in \( G \) matches gold, else 0.
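The three group-level summaries can be computed together; the `p_hat` values below stand in for GLMM fitted probabilities and are invented for illustration:

```python
from collections import Counter

def group_summary(shots, gold):
    """Summarise one group of shots.

    shots: list of dicts with keys "answer" (A..H) and "p_hat"
           (the fitted probability for that shot).
    gold:  the gold answer for the group's question.
    """
    votes = Counter(s["answer"] for s in shots)
    majority_answer, majority_count = votes.most_common(1)[0]
    return {
        "agreement_frac": majority_count / len(shots),
        "pred_prob_mean": sum(s["p_hat"] for s in shots) / len(shots),
        "consensus_correct": int(majority_answer == gold),
    }

summary = group_summary(
    [{"answer": "C", "p_hat": 0.8},
     {"answer": "C", "p_hat": 0.7},
     {"answer": "B", "p_hat": 0.4}],
    gold="C",
)
# agreement_frac = 2/3, pred_prob_mean = (0.8+0.7+0.4)/3,
# consensus_correct = 1 (the majority answer "C" matches gold).
```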

5. Multi‑Level ROC Curves

Loop over all non‑empty subsets of \( \{\text{model}, \text{language}, \text{configuration}, \text{temperature}, \text{question\_id}\} \). For each combination \( C \):

  1. Group by \( C \).
  2. Compute per group \( G \):
    • Score 1: \( s^{(1)}_G = \mathrm{agreement\_frac}_G \)
    • Score 2: \( s^{(2)}_G = \mathrm{pred\_prob\_mean}_G \)
    • Label: \( y_G = \mathrm{consensus\_correct}_G \)
  3. ROC & AUC
    • \( \mathrm{ROC}_1 \) from \( \{(s^{(1)}_G, y_G)\} \), with area \( AUC_{\mathrm{agree}}(C) \).
    • \( \mathrm{ROC}_2 \) from \( \{(s^{(2)}_G, y_G)\} \), with area \( AUC_{\mathrm{GLMM}}(C) \).
  4. ΔAUC
    \[ \Delta AUC(C) = AUC_{\mathrm{GLMM}}(C) - AUC_{\mathrm{agree}}(C). \]
    This difference quantifies how much extra discrimination the GLMM’s smoothing provides over raw vote strength.
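The AUCs above can be computed via the rank-based (Mann-Whitney) formulation, which needs no explicit threshold sweep; the per-group scores and labels below are invented for illustration:

```python
def auc(scores, labels):
    """Rank-based (Mann-Whitney) AUC: the probability that a randomly
    chosen positive group outscores a randomly chosen negative one,
    counting ties as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy per-group data: s1 = agreement_frac, s2 = pred_prob_mean,
# y = consensus_correct.
s1 = [1.0, 0.6, 0.6, 0.4]
s2 = [0.9, 0.7, 0.5, 0.3]
y  = [1,   1,   0,   0]

delta_auc = auc(s2, y) - auc(s1, y)
# auc(s1, y) = 0.875 (one tie), auc(s2, y) = 1.0, so delta_auc = 0.125:
# the smoothed score separates the groups slightly better here.
```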

6. Statistical Testing: DeLong’s Test

  • Objective: test whether \( \Delta AUC(C) > 0 \) beyond chance.
  • DeLong’s method for paired ROC curves:
    1. Treat each group \( G \) as a paired observation \( (s^{(1)}_G, s^{(2)}_G) \).
    2. Estimate
      \( \mathrm{Var}(\widehat{AUC}_1) \),
      \( \mathrm{Var}(\widehat{AUC}_2) \),
      \( \mathrm{Cov}(\widehat{AUC}_1,\widehat{AUC}_2) \)
      via U‑statistics.
    3. Compute
      \[ Z = \frac{\Delta AUC}{\sqrt{\mathrm{Var}(AUC_1)+\mathrm{Var}(AUC_2)-2\,\mathrm{Cov}(AUC_1,AUC_2)}} \]
      and derive a p‑value.
  • Interpretation: A low p‑value indicates that GLMM smoothing yields a statistically significant increase in discriminative power over raw agreement.
  • Meta‑analysis view: each factor's AUC comparison is akin to an individual study comparing two classifiers; applying DeLong's test across the paired curves provides a higher‑order inference on whether the GLMM consistently outperforms raw voting across dimensions.
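A self-contained sketch of DeLong's paired test using the U-statistic structural components follows; the scores and labels are invented for illustration, and a production analysis would use a vetted statistical package rather than this hand-rolled version:

```python
import math

def delong_paired(s1, s2, labels):
    """DeLong's test for two paired ROC curves (pure-Python sketch).

    s1, s2: paired scores for the same groups; labels: 0/1 outcomes.
    Returns (auc1, auc2, z), where z is the statistic for auc2 - auc1.
    """
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    m, n = len(pos_idx), len(neg_idx)

    def psi(x, y):
        return 1.0 if x > y else 0.5 if x == y else 0.0

    def components(scores):
        # U-statistic structural components: one value per positive
        # (V10) and one per negative (V01).
        v10 = [sum(psi(scores[i], scores[j]) for j in neg_idx) / n
               for i in pos_idx]
        v01 = [sum(psi(scores[i], scores[j]) for i in pos_idx) / m
               for j in neg_idx]
        return v10, v01

    def cov(a, b):
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

    v10_1, v01_1 = components(s1)
    v10_2, v01_2 = components(s2)
    auc1, auc2 = sum(v10_1) / m, sum(v10_2) / m

    # Var(AUC1) + Var(AUC2) - 2 Cov(AUC1, AUC2), combining the
    # positive-side (S10/m) and negative-side (S01/n) pieces.
    var_diff = ((cov(v10_1, v10_1) + cov(v10_2, v10_2)
                 - 2 * cov(v10_1, v10_2)) / m
                + (cov(v01_1, v01_1) + cov(v01_2, v01_2)
                   - 2 * cov(v01_1, v01_2)) / n)
    z = (auc2 - auc1) / math.sqrt(var_diff) if var_diff > 0 else 0.0
    return auc1, auc2, z

# Toy paired scores over six groups.
labels = [1, 1, 1, 0, 0, 0]
s1 = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]   # e.g. agreement_frac
s2 = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]   # e.g. pred_prob_mean
auc1, auc2, z = delong_paired(s1, s2, labels)
# z > 0 here: the second score ranks the groups strictly better.
```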

7. Ranking & Visualization

  1. Combination heatmap
    Bar plots or heatmaps of \( \Delta AUC(C) \) across all group combinations to identify interactions.

  2. Single‐factor summary table

| Factor | \( AUC_{\text{agree}} \) | \( AUC_{\text{GLMM}} \) | ΔAUC | p‑value |
|---|---|---|---|---|
| model | 0.78 | 0.83 | +0.05 | 0.04 |
| language | 0.64 | 0.81 | +0.17 | <0.001 |
| configuration | | | | |
| temperature | | | | |
| question_id | | | | |

8. Edge‐Case: configuration == "open_form"

  • For open_form, a semantic majority vote across shots collapses each group to a single answer, so:
    • \( \mathrm{agreement\_frac}_G = 1 \) for every group.
    • Raw correctness is 0 or 1 only; there is no curve to trace.
  • Treatment:
    1. Exclude open_form from any ROC needing per‑shot variability.
    2. Include it in the GLMM: each singleton shot still yields \( \hat p_i \).
    3. In combinations including open_form, \( \Delta AUC = AUC_{\mathrm{GLMM}} - 0.5 \), since the constant agreement score gives \( AUC_{\mathrm{agree}} = 0.5 \).
    4. Document this degeneracy explicitly.
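A small guard makes the degeneracy explicit in code: when the agreement score is constant across groups, as under open_form, the chance value 0.5 is returned instead of an undefined ranking:

```python
def safe_agreement_auc(scores, labels):
    """Rank-based AUC of the agreement score, defaulting to the chance
    value 0.5 when the score is constant (as for open_form, where every
    group's agreement_frac collapses to 1)."""
    if len(set(scores)) == 1:
        return 0.5  # degenerate: no ranking information in the score
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(safe_agreement_auc([1.0, 1.0, 1.0], [1, 0, 1]))  # 0.5
```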

9. Putting It All Together

  1. Prepare data → label shots → compute raw metrics.
  2. Fit GLMM → extract \( \hat p_i \).
  3. Auto‐group over factor combinations → compute scores & labels.
  4. Compute ROCs → derive \( AUC_{\mathrm{agree}} \), \( AUC_{\mathrm{GLMM}} \), and \( \Delta AUC \).
  5. Run DeLong → test significance.
  6. Visualize & rank → spot high‐impact factors.
  7. Handle open_form edge case.

This pipeline yields descriptive (where performance dips) and inferential (which gains are significant) insights, mapping exactly which dimensions drive noise and how hierarchical pooling mitigates it.


10. Detailed Interpretation & Statistical Connections

The DeLong test offers nonparametric, U‑statistic‑based inference for paired ROC curves. ΔAUC isolates the discrimination gained from modeling grouping heterogeneity, distinct from model‑intrinsic uncertainty. Its rank‑based nature avoids prevalence and threshold biases and sidesteps the Simpson's‑paradox issues seen with IDI, making ΔAUC the most robust global discrimination metric in our pipeline.


11. Comparative Insights from Metric Studies

Drawing on Pencina & Demler (2012) and the PMC3341978 / PubMed29160558 comparisons:

  • Under LDA, ΔAUC, IDI, and NRI all scale with squared Mahalanobis distance—ΔAUC ∈ [0,0.5], IDI ∈ [–1,1], NRI ∈ [–1,1].
  • ΔAUC & IDI depend on baseline discrimination & prevalence; NRI does not.
  • Simpson’s paradox can plague IDI (necessitating weighted variants); ΔAUC’s rank‐based global summary avoids it.
  • ΔAUC remains the most stable, threshold‐agnostic measure of discrimination improvement for our multi‐level analysis.
