LLM Psychological Evaluation Benchmark
1. Dataset and Notation

* Rows are individual “shots” of a question under a given model/language/configuration/temperature.
* Columns:
  * model ∈ {anthropic/claude-3.7-sonnet, google/gemini-2.0-flash-001, google/gemma-3-27b-it, meta-llama/llama-3.3-70b-instruct, openai/gpt-4o-2024-11-20, x-ai/grok-2-1212}
  * language ∈ {english, romanian}
  * configuration ∈ {closed_form, closed_form_with_explanation, open_form}
  * temperature ∈ {0.