# RL for Psychometrics: Methodology

## 1 Tasks and Goals
Problem. Given logs of many participants’ interaction trajectories (state–action–next state), we seek to:
- Obtain an interpretable individual score \(\beta_j>0\) (participant \(j\)’s “decisiveness/ability”).
- Learn a generalizable task value function \(Q_\theta(s,a)\) and, optionally, a reward function \(r_w(s,a,s')\) (learning \(r\) yields an inverse RL setting).
- Ensure the model both explains observed behavior (predicts next-action distributions) and respects task dynamics (values do not collapse into arbitrary, overfit functions).
Core idea. Use an MDP to represent the task (“objective value”) and a maximum-entropy / softmax policy to capture decision making (“how people act given value”). Individual differences are captured by a single, interpretable temperature \(\beta_j\). All parameters are estimated jointly via penalized maximum likelihood / generalized EM (GEM).
## 2 Notation
- MDP: \((\mathcal S,\mathcal A,T,R,\gamma)\).
- Soft Bellman temperature \(\tau>0\) (backup only; fixes scale; typically \(\tau=1\)): $$ V_\theta(s)=\tau\log\sum_{a}\exp\Big(\tfrac{1}{\tau}Q_\theta(s,a)\Big). $$
- Soft Bellman residual: $$ \delta(s,a,s')=Q_\theta(s,a)-\big(r_w(s,a,s')+\gamma\,V_\theta(s')\big). $$
- Policy: $$ \pi_\theta(a\mid s,\beta_j)=\frac{\exp\big(\beta_j\,Q_\theta(s,a)\big)}{\sum_{a'}\exp\big(\beta_j\,Q_\theta(s,a')\big)}. $$
- Ability prior: \(\log\beta_j\sim\mathcal N(\mu,\sigma^2)\).
- Reward head bounded: \(r_w(\cdot)\in[-1,1]\) (e.g., via \(\tanh\) or an L2 magnitude constraint).
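The notation maps directly onto code. Below is a minimal PyTorch sketch; the value of \(\gamma\), and the assumption that actions lie along the last tensor dimension, are illustrative choices rather than part of the method.

```python
import torch

TAU = 1.0  # soft-backup temperature; fixes the value scale

def soft_value(q_s: torch.Tensor, tau: float = TAU) -> torch.Tensor:
    """V(s) = tau * logsumexp_a( Q(s, a) / tau ), over the last (action) dimension."""
    return tau * torch.logsumexp(q_s / tau, dim=-1)

def soft_bellman_residual(q_sa, r_sas, q_s_next, gamma=0.95, tau=TAU):
    """delta(s, a, s') = Q(s, a) - ( r(s, a, s') + gamma * V(s') )."""
    return q_sa - (r_sas + gamma * soft_value(q_s_next, tau))

def softmax_policy(q_s: torch.Tensor, beta: float) -> torch.Tensor:
    """pi(a | s, beta_j) = softmax(beta_j * Q(s, a)); beta_j is the per-participant temperature."""
    return torch.softmax(beta * q_s, dim=-1)

def bounded_reward(raw: torch.Tensor) -> torch.Tensor:
    """Reward head squashed into [-1, 1], here via tanh (one of the two options above)."""
    return torch.tanh(raw)
```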
## 3 Objective: Penalized Observed Log-Likelihood
Intuition:
- \(\mathcal L_{\text{beh}}\), the negative log-likelihood of observed actions under the softmax policy, pulls \(Q\) toward shapes that explain choices.
- \(\mathcal L_{\text{bell}}\), a penalty on the soft Bellman residual (optionally strengthened by a conservative term \(\mathcal L_{\text{cql}}\)), pulls \(Q\) back toward dynamics-consistent solutions.
- The log-normal prior on \(\beta_j\) shrinks the individual parameters for stability and provides empirical-Bayes regularization.

Balanced this way, \(Q\) neither drifts from the dynamics nor overfits behavioral noise, and \(\beta_j\) remains dedicated to individual decisiveness/randomness.
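Written out schematically (this mirrors the M-step objective in 5.2; the exact form of each penalty is left to the implementation), the shared parameters are fit by minimizing $$ \mathcal L(\theta,w)=\mathcal L_{\text{beh}} +\lambda_{\text{bell}}\,\mathcal L_{\text{bell}} +\lambda_{\text{cql}}\,\mathcal L_{\text{cql}} +\mathcal R_r, $$ while the prior on \(\log\beta_j\) enters through the per-participant updates in 5.1.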
## 4 Identification Issues

### Problem
Under a softmax policy, $$ \pi_\theta(a\mid s,\beta_j)=\frac{\exp(\beta_j Q_\theta(s,a))}{\sum_{a'}\exp(\beta_j Q_\theta(s,a'))}, $$ the behavior likelihood is invariant to two transformations:
- β–Q co-scaling: \((Q,\beta)\mapsto(cQ,\beta/c)\) for any \(c>0\).
- Per-state translation: \(Q(s,a)\mapsto Q(s,a)+b(s)\).
These symmetries imply that from behavioral data alone, the absolute scale and offset of \(Q\) are not uniquely identified.
Without additional constraints, \(\beta_j\) can arbitrarily compensate for re-scalings of \(Q\), leading to spurious interpretations of individual “temperature” parameters.
### Consequence
A model may achieve low held-out negative log-likelihood even though its \(Q_\theta\) values are only determined up to an arbitrary scale and per-state offset.
This breaks the interpretability of \(\beta_j\) and undermines out-of-distribution generalization—for example, when extending to larger state spaces or tasks with different reward magnitudes.
### Solution
We explicitly remove the two sources of indeterminacy by construction:
- Remove per-state translation (advantage centering). Define an unnormalized advantage function: $$ \tilde A_\theta(s,a)=Q_\theta(s,a)-\frac{1}{|\mathcal A|}\sum_{a'}Q_\theta(s,a'), \quad\text{so that}\quad \sum_a \tilde A_\theta(s,a)=0. $$
- Fix global scale (unit RMS normalization). Normalize \(\tilde A_\theta\) to have unit root-mean-square (RMS) per state: $$ A_\theta(s,a)=\frac{\tilde A_\theta(s,a)} {\sqrt{\frac{1}{|\mathcal A|}\sum_a \tilde A_\theta(s,a)^2+\varepsilon}}, \quad\text{so that}\quad \frac{1}{|\mathcal A|}\sum_a A_\theta(s,a)^2=1. $$
The final policy is then defined over \(A_\theta\): $$ \pi_\theta(a\mid s,\beta_j)=\mathrm{softmax}\big(\beta_j A_\theta(s,a)\big). $$
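A minimal PyTorch sketch of the construction (shapes, \(\varepsilon\), and the toy values are illustrative): centering removes \(b(s)\), the per-state RMS removes the scale, and rescaling \(Q\) afterwards leaves the policy unchanged.

```python
import torch

def normalized_advantage(q_values: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Zero-centered, unit-RMS advantage A_theta(s, a) over the last (action) dimension."""
    a_tilde = q_values - q_values.mean(dim=-1, keepdim=True)           # remove per-state offset b(s)
    rms = torch.sqrt((a_tilde ** 2).mean(dim=-1, keepdim=True) + eps)  # per-state RMS
    return a_tilde / rms

def identified_policy(q_values: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    """pi(a | s, beta_j) = softmax(beta_j * A_theta(s, a))."""
    return torch.softmax(beta.unsqueeze(-1) * normalized_advantage(q_values), dim=-1)

# Rescaling Q no longer changes the policy, so beta alone controls its sharpness.
q = torch.tensor([[1.0, 2.0, 3.0], [0.5, 0.5, 2.0]])
beta = torch.tensor([1.7, 0.4])
assert torch.allclose(identified_policy(q, beta), identified_policy(10.0 * q, beta))
```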
### Result
- The translation freedom is eliminated by zero-centering.
- The scaling freedom is eliminated by RMS normalization.
- \(\beta_j\) becomes the sole global temperature controlling policy sharpness.
Consequently, the \((cQ,\beta/c)\) equivalence class collapses to a single representation: the softmax policy parameterization is identified by construction.
## 5 Generalized EM: Alternating Optimization
Alternate between per-participant updates of \(\beta_j\) (E-step) and updates of the shared \(Q_\theta, r_w\) (M-step).
### 5.1 E-step: Update \(\beta_j\) (per participant)
Let \(z_j=\log\beta_j\). Given current \(Q,r\), maximize $$ \ell_j(z)=\sum_{(s,a)\in\mathcal D_j}\Big[e^z Q(s,a)-\log\sum_{a'}e^{e^z Q(s,a')}\Big]-\frac{(z-\mu)^2}{2\sigma^2}. $$
Compute the gradient \(g_j=\ell_j'(z)\) and Hessian \(H_j=\ell_j''(z)\); the behavioral log-likelihood is concave in \(\beta=e^z\), the problem is one-dimensional (~5–10 Newton steps suffice), and participants are independent, so the updates run in parallel. Take the Newton step \(z\leftarrow z-g_j/H_j\), then set \(\hat\beta_j=e^z\).
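A minimal NumPy sketch of the per-participant update; the gradient and Hessian expressions follow from differentiating \(\ell_j(z)\) above with respect to \(z\), while the Hessian safeguard and the iteration count are implementation choices, not part of the method.

```python
import numpy as np

def e_step_beta(q_rows, a_idx, mu=0.0, sigma=1.0, n_newton=10):
    """MAP update of z = log(beta_j) for one participant by safeguarded 1-D Newton.

    q_rows: (T, |A|) array of Q (or normalized advantage) at the visited states.
    a_idx:  (T,) array of chosen-action indices.
    Returns (beta_hat, H) so the Hessian can be reused for the CI of Section 6.2.
    """
    z, H = mu, -1.0 / sigma**2                                   # start at the prior mean
    for _ in range(n_newton):
        beta = np.exp(z)
        logits = beta * q_rows
        logits -= logits.max(axis=1, keepdims=True)              # numerically stable softmax
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)                        # pi(a | s, beta)
        q_chosen = q_rows[np.arange(len(a_idx)), a_idx]
        q_mean = (p * q_rows).sum(axis=1)                        # E_pi[Q(s, .)]
        q_var = (p * (q_rows - q_mean[:, None]) ** 2).sum(axis=1)  # Var_pi[Q(s, .)]
        g = beta * np.sum(q_chosen - q_mean) - (z - mu) / sigma**2                       # dl/dz
        H = beta * np.sum(q_chosen - q_mean) - beta**2 * np.sum(q_var) - 1.0 / sigma**2  # d2l/dz2
        H = min(H, -1e-8)                                        # keep the quadratic model concave
        z = z - g / H                                            # Newton step
    return np.exp(z), H
```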
### 5.2 M-step: Update \(Q_\theta, r_w\) (shared)
Fix \(\hat\beta\); run a few small-step SGD iterations to minimize $$ \mathcal L_{\text{beh}}\ \text{(or }\mathcal L_{\text{beh-mix}}\text{)} +\lambda_{\text{bell}}\mathcal L_{\text{bell}} +\lambda_{\text{cql}}\mathcal L_{\text{cql}} +\mathcal R_r. $$
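A sketch of one M-step inner loop, under assumptions the text leaves open: tabular \(Q_\theta\) and \(r_w\), a mean-squared soft Bellman residual for \(\mathcal L_{\text{bell}}\), a CQL-style penalty (logsumexp of \(Q\) minus the chosen action's \(Q\)) for \(\mathcal L_{\text{cql}}\), and the tanh squash from the notation section bounding the reward in place of an explicit \(\mathcal R_r\). The \(\lambda\) values, sizes, and synthetic batch are placeholders.

```python
import torch

def m_step_losses(q_table, r_table, batch, beta_hat, gamma=0.95, tau=1.0):
    """Shared-parameter losses on a batch of transitions, with beta_hat held fixed."""
    s, a, s_next, j = batch                         # state, action, next-state, participant indices
    q_s = q_table[s]                                # (N, |A|)
    q_sa = q_table[s, a]                            # Q(s, a) for the taken actions
    # L_beh: negative log-likelihood of the chosen actions under softmax(beta_j * Q).
    # (In the identified parameterization of Section 4, q_s would be the normalized advantage.)
    logits = beta_hat[j].unsqueeze(1) * q_s
    l_beh = torch.nn.functional.cross_entropy(logits, a)
    # L_bell: mean-squared soft Bellman residual with the bounded reward head.
    r = torch.tanh(r_table[s, a])                   # r_w in [-1, 1]
    v_next = tau * torch.logsumexp(q_table[s_next] / tau, dim=-1)
    l_bell = (q_sa - (r + gamma * v_next)).pow(2).mean()
    # L_cql: conservative penalty discouraging overvaluation of unobserved actions.
    l_cql = (torch.logsumexp(q_s, dim=-1) - q_sa).mean()
    return l_beh, l_bell, l_cql

# A few small SGD steps with beta_hat held fixed.
S, A, N, J = 20, 4, 64, 5
q_table = torch.zeros(S, A, requires_grad=True)
r_table = torch.zeros(S, A, requires_grad=True)
beta_hat = torch.ones(J)                            # cached from the E-step
batch = (torch.randint(S, (N,)), torch.randint(A, (N,)),
         torch.randint(S, (N,)), torch.randint(J, (N,)))
opt = torch.optim.SGD([q_table, r_table], lr=1e-2)
for _ in range(10):
    l_beh, l_bell, l_cql = m_step_losses(q_table, r_table, batch, beta_hat)
    loss = l_beh + 1.0 * l_bell + 0.1 * l_cql       # lambda_bell = 1.0, lambda_cql = 0.1 (placeholders)
    opt.zero_grad(); loss.backward(); opt.step()
```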
### 5.3 Mini-batch / Online EM
Per outer iteration:
- Sample a small set of participants \(U\); run parallel E-steps to update/cache \(\hat\beta_j\).
- Use their transitions for a few M-step SGD updates.
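A structural sketch of the outer loop; `e_step` and `m_step` stand in for the updates of 5.1 and 5.2, and their signatures (as well as the subset size and iteration count) are assumptions about how one might organize the code.

```python
import random
from typing import Callable, Dict, List, Sequence

def online_gem(participants: Sequence[int],
               transitions: Dict[int, List[tuple]],
               e_step: Callable[[List[tuple]], float],
               m_step: Callable[[List[tuple], Dict[int, float]], None],
               n_outer: int = 100,
               participants_per_iter: int = 8) -> Dict[int, float]:
    """Mini-batch GEM: cache per-participant beta_hat, then take shared M-step updates."""
    beta_hat: Dict[int, float] = {}
    for _ in range(n_outer):
        subset = random.sample(list(participants), k=min(participants_per_iter, len(participants)))
        for j in subset:                 # E-steps are independent; run them in parallel in practice
            beta_hat[j] = e_step(transitions[j])
        pooled = [t for j in subset for t in transitions[j]]
        m_step(pooled, beta_hat)         # a few SGD steps on the shared Q_theta, r_w
    return beta_hat
```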
## 6 Evaluation and Uncertainty
### 6.1 External Validity
Treat \(\hat\beta_j\) as a decisiveness/ability score. Given an external criterion \(Y_j\) (post-test score, task performance, ratings), compute:
- Pearson: $$ r=\frac{\sum_j (\hat\beta_j-\bar\beta)(Y_j-\bar Y)}{\sqrt{\sum_j (\hat\beta_j-\bar\beta)^2}\sqrt{\sum_j (Y_j-\bar Y)^2}}. $$
- Spearman: rank-transform \(\hat\beta,Y\) then compute Pearson.
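Both coefficients are available in `scipy.stats`; the arrays below are illustrative placeholders, not data.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

beta_hat = np.array([0.8, 1.2, 2.5, 0.4, 1.9, 3.1])         # estimated decisiveness (placeholder)
criterion = np.array([55.0, 62.0, 78.0, 49.0, 70.0, 85.0])  # external criterion Y_j (placeholder)

r, p_r = pearsonr(beta_hat, criterion)       # linear association
rho, p_rho = spearmanr(beta_hat, criterion)  # rank-based association
```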
### 6.2 Confidence Intervals
With \(Q\) fixed, define the participant-level log posterior $$ \ell(\beta)=\sum_t\Big[\beta Q(s_t,a_t)-\log\sum_a e^{\beta Q(s_t,a)}\Big]+\log p(\beta). $$ At \(\hat\beta\), use a second-order Taylor expansion. Observed information \(J(\hat\beta)=-\ell''(\hat\beta)\) gives $$ \text{SE}(\hat\beta)\approx \frac{1}{\sqrt{J(\hat\beta)}}\quad\text{and}\quad 95\%\ \text{CI}:\ \hat\beta\pm1.96\,\text{SE}. $$
Prefer the log scale \(z=\log\beta\) (it enforces \(\beta>0\) and the quadratic approximation is typically better behaved there):
1. Run Newton's method to obtain \(\hat z\) and the Hessian \(H\).
2. \(\text{SE}_z\approx\sqrt{(-H)^{-1}}\).
3. Form the log-scale CI \(\hat z\pm1.96\,\text{SE}_z\), then exponentiate: \(\big[e^{\hat z-1.96\,\text{SE}_z},\,e^{\hat z+1.96\,\text{SE}_z}\big]\).
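A small sketch of these three steps, assuming the Newton routine (as in the 5.1 sketch) already returns \(\hat z\) and the Hessian \(H\); the numeric inputs are placeholders.

```python
import numpy as np

def beta_ci(z_hat: float, hessian: float, z_crit: float = 1.96):
    """Wald interval for beta built on the log scale and then exponentiated."""
    se_z = np.sqrt(1.0 / -hessian)       # observed information on the z scale is -H
    lo, hi = z_hat - z_crit * se_z, z_hat + z_crit * se_z
    return np.exp(z_hat), (np.exp(lo), np.exp(hi))

beta_hat, (ci_lo, ci_hi) = beta_ci(z_hat=0.35, hessian=-42.0)   # placeholder values
```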