Confidence ellipsoids in multinomial logistic regression

This vignette accompanies the paper (Tan and Bellec 2023), which develops confidence ellipsoids in high-dimensional multinomial logistic regression.

Setup and notation

Consider a multinomial model with \(K+1\) classes, where the response of observation \(i\) is one-hot encoded as \(\boldsymbol y_i\in \{0,1\}^{K+1}\) and follows the multinomial logistic model \[\begin{equation} P(y_{ik} = 1 | \boldsymbol x_i) = \frac{\exp(\boldsymbol x_i^T \boldsymbol{B}^* \boldsymbol e_k)}{\sum_{k'=1}^{K+1} \exp(\boldsymbol x_i^T \boldsymbol{B}^* \boldsymbol e_{k'})}, \quad k\in\{1,2, \ldots, K+1\}, \end{equation}\] where \(\boldsymbol{B}^*\in \mathbb R^{p\times (K+1)}\) is an unknown parameter matrix and \(\boldsymbol e_k,\boldsymbol e_{k'}\in\mathbb R^{K+1}\) are the \(k\)-th and \(k'\)-th canonical basis vectors. The Maximum Likelihood Estimate (MLE) of \(\boldsymbol{B}^*\) is any minimizer of the cross-entropy loss, \[\begin{equation}\label{barB} \textstyle \boldsymbol{\hat{B}} \in \operatorname{argmin}_{ \boldsymbol B\in \mathbb R^{p\times (K+1)}} \sum_{i=1}^n L_i ( \boldsymbol B^T \boldsymbol x_i), \end{equation}\] where the per-observation cross-entropy loss \(L_i: \mathbb R^{K+1} \to \mathbb R\) is defined as \[L_i ( \boldsymbol u) = - \sum_{k=1}^{K+1} y_{ik} u_k + \log \sum_{k'=1}^{K+1} \exp( u_{k'}), \qquad \boldsymbol u\in \mathbb R^{K+1}.\]

Define for each observation \(i \in [n]\) the vector of predicted probabilities \(\boldsymbol{\hat p}_i = (\hat p_{i1}, ..., \hat p_{i(K+1)})^T\), where \(\hat p_{ik}\) is the fitted probability that observation \(i\) belongs to class \(k\): \[\begin{equation} \label{p_i} \hat p_{ik} = \frac{\exp(\boldsymbol x_i^T \boldsymbol{\hat{B}} \boldsymbol e_k)}{\sum_{k'=1}^{K+1}\exp(\boldsymbol x_i^T \boldsymbol{\hat{B}} \boldsymbol e_{k'})} \qquad \text{ for each }k\in\{1,...,K+1\}. \end{equation}\]

Our results rely on the gradient and Hessian of \(L_i\) evaluated at \(\boldsymbol{\hat{B}}^T \boldsymbol x_i\), denoted by \[\begin{align} \boldsymbol g_i &= \nabla L_i(\boldsymbol{\hat{B}}^T \boldsymbol x_i) = - \boldsymbol y_i + \boldsymbol{\hat p}_i && \in \mathbb R^{K+1}, \\ \boldsymbol H_i &= \nabla^2 L_i(\boldsymbol{\hat{B}}^T \boldsymbol x_i) = \operatorname{diag}(\boldsymbol{\hat p}_i) - \boldsymbol{\hat p}_i \boldsymbol{\hat p}_i^T && \in \mathbb R^{(K+1)\times (K+1)}. \end{align}\] The quantities \((\boldsymbol{\hat{B}}, \boldsymbol{\hat p}_i, \boldsymbol{g}_i, \boldsymbol{H}_i)\) can be readily computed from the data \(\{(\boldsymbol x_i, \boldsymbol y_i)\}_{i=1}^n\).
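To make these definitions concrete, here is a minimal Python sketch (not the paper's code; the sizes and the simulated data are illustrative only) that computes one minimizer \(\boldsymbol{\hat{B}}\) of the cross-entropy loss with scipy and then forms \(\boldsymbol{\hat p}_i\), \(\boldsymbol g_i\) and \(\boldsymbol H_i\) as defined above.

```python
import numpy as np
from scipy.optimize import minimize

# Minimal sketch (illustrative sizes, simulated data): fit the multinomial logistic
# MLE by minimizing the cross-entropy loss, then form p_hat_i, g_i and H_i.
rng = np.random.default_rng(0)
n, p, K = 200, 5, 2                              # K + 1 = 3 classes
X = rng.standard_normal((n, p))
B_true = 0.5 * rng.standard_normal((p, K + 1))

def softmax(U):
    E = np.exp(U - U.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

Y = np.array([rng.multinomial(1, q) for q in softmax(X @ B_true)])   # one-hot labels

def loss_and_grad(b_flat):
    """sum_i L_i(B^T x_i) and its gradient with respect to vec(B)."""
    B = b_flat.reshape(p, K + 1)
    U = X @ B
    m = U.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(U - m).sum(axis=1))                # stable log-sum-exp
    P = softmax(U)
    return np.sum(lse - np.sum(Y * U, axis=1)), (X.T @ (P - Y)).ravel()

res = minimize(loss_and_grad, np.zeros(p * (K + 1)), jac=True, method="L-BFGS-B")
B_hat = res.x.reshape(p, K + 1)                  # one minimizer of the (convex) loss

P_hat = softmax(X @ B_hat)                       # rows are the predicted probabilities p_hat_i
G = P_hat - Y                                    # rows are the gradients g_i = -y_i + p_hat_i
H = np.stack([np.diag(q) - np.outer(q, q) for q in P_hat])  # Hessians H_i, shape (n, K+1, K+1)
```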

Confidence ellipsoid

Our goal is to develop a methodology to test the significance of the \(j\)-th feature, in the model where the features are normally distributed, \[ \boldsymbol x_i\sim N(\boldsymbol 0_p, \boldsymbol \Sigma). \] Specifically, for a desired confidence level \((1-\alpha)\in(0,1)\) (say, \(1-\alpha=0.95\)) and a given feature \(j\in [p]\) of interest, our goal is to test \[\begin{equation} \label{H0} H_0: \boldsymbol y_i \text{ is conditionally independent of } x_{ij} \text{ given } (x_{ij'})_{j'\in[p]\setminus \{j\}} . \end{equation}\] Namely, we want to test whether the \(j\)-th variable is independent of the response given all other explanatory variables \((x_{ij'}, j'\in[p]\setminus\{j\})\).

Given the multinomial logistic estimator \(\boldsymbol{\hat{B}}\), a natural starting point is to reject \(H_0\) when \(\boldsymbol e_j^T \boldsymbol{\hat{B}}\) is far from \(\boldsymbol 0_{K+1}\), where \(\boldsymbol e_j\in \mathbb R^p\) is the \(j\)-th canonical basis vector. The important question, then, is to make the informal “far from \(\boldsymbol 0_{K+1}\)” quantitative. The paper (Tan and Bellec 2023) proves that, if the dimension-to-sample-size ratio \(p/n\) is held fixed at a value for which the MLE exists, the convergence in distribution \[\begin{equation} \frac{n}{\boldsymbol e_j^T\boldsymbol \Sigma^{-1}\boldsymbol e_j} \Bigl\| \Bigl( \Bigl( \frac1n\sum_{i=1}^n \boldsymbol g_i \boldsymbol g_i^T \Bigr)^{1/2} \Bigr)^\dagger \Bigl(\frac1n\sum_{i=1}^n \boldsymbol V_i\Bigr) \boldsymbol{\hat{B}}^T \boldsymbol e_j \Bigr\|^2 ~\to^d~ \chi_K^2 \end{equation}\] holds, where \(\boldsymbol g_i = -\boldsymbol y_i + \boldsymbol{\hat p}_i\in \mathbb R^{K+1}\) is the \(i\)-th gradient and \[\boldsymbol V_i= \boldsymbol H_i - (\boldsymbol H_i \otimes \boldsymbol x_i^T)\Bigl[\sum_{l=1}^n \boldsymbol H_l\otimes (\boldsymbol x_l\boldsymbol x_l^T)\Bigr]^{\dagger}(\boldsymbol H_i\otimes \boldsymbol x_i)\] with \(\boldsymbol H_i\in\mathbb R^{(K+1)\times(K+1)}\) the \(i\)-th Hessian. Above, \(^\dagger\) denotes the pseudo-inverse of a matrix, and \(^{1/2}\) denotes the unique positive semi-definite square root of a positive semi-definite matrix.
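Continuing the sketch from the previous section (reusing `X`, `Y`, `B_hat`, `G`, `H`), the left-hand side statistic can be assembled directly with numpy/scipy. The code below is a naive transcription of the formula intended only for small \((n, p, K)\): it forms the \((K+1)p\times(K+1)p\) Kronecker matrix explicitly, which would be infeasible at the scale of the experiments. Here \(\boldsymbol\Sigma\) is taken to be the identity, matching the simulated design of the previous sketch.

```python
import numpy as np
from scipy.linalg import sqrtm
from scipy.stats import chi2

# Continuing from the previous sketch: chi-square statistic for feature j
# (here the last feature, 0-based index p - 1). Sigma is the feature covariance,
# the identity in the simulated design above.
Sigma = np.eye(p)
j = p - 1

# (1/n) sum_i g_i g_i^T and the pseudo-inverse of its square root
S = G.T @ G / n
S_half_pinv = np.linalg.pinv(np.real(sqrtm(S)))

# A = sum_l H_l kron (x_l x_l^T), a (K+1)p x (K+1)p matrix (small scale only!)
A = sum(np.kron(H[l], np.outer(X[l], X[l])) for l in range(n))
A_pinv = np.linalg.pinv(A)

# (1/n) sum_i V_i with V_i = H_i - (H_i kron x_i^T) A^dagger (H_i kron x_i)
V_bar = np.zeros((K + 1, K + 1))
for i in range(n):
    Hx = np.kron(H[i], X[i][None, :])            # (K+1) x (K+1)p
    V_bar += H[i] - Hx @ A_pinv @ Hx.T
V_bar /= n

# Test statistic and p-value: stat converges in distribution to chi^2_K under H_0
omega_j = np.linalg.inv(Sigma)[j, j]             # e_j^T Sigma^{-1} e_j
stat = n / omega_j * np.sum((S_half_pinv @ V_bar @ B_hat[j]) ** 2)
p_value = chi2.sf(stat, df=K)
```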

Illustrations for \(K+1=3\) classes

For a given dataset \((\boldsymbol X, \boldsymbol y)\), a given realization of the multinomial logistic MLE \(\boldsymbol{\hat{B}}\in \mathbb R^{p\times (K+1)}\) and a given feature of interest \(j\in [p]\), we may construct p-values and \((1-\alpha)\)-confidence ellipsoids from the above convergence in distribution to the chi-square distribution with \(K\) degrees of freedom. The particular shape of the confidence ellipsoid is random, as it depends on the gradients \(\boldsymbol g_i\) and Hessians \(\boldsymbol H_i\) of this particular realization of the dataset. Repeating the generation of the dataset 1000 times, we see in the illustrations below that the desired coverage \(1-\alpha=0.95\) holds approximately for the red ellipsoids (“modern” in the figures) based on the above convergence in distribution. These red ellipsoids vary slightly over the 1000 repetitions. The blue ellipsoids, based on the classical theory of the Fisher information, suffer from under-coverage.
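As a sanity check on coverage, one can wrap the data generation and the statistic computation above in a helper, here a hypothetical `one_replication()` returning the chi-square statistic of the null coordinate for a freshly generated dataset, and verify that the empirical non-rejection frequency is close to \(1-\alpha\):

```python
from scipy.stats import chi2

# Hypothetical helper: `one_replication()` is assumed to regenerate (X, Y), refit
# B_hat and return the chi-square statistic of the null coordinate j, exactly as
# in the two sketches above.
alpha, R = 0.05, 200                              # the figures use 1000 repetitions
threshold = chi2.ppf(1 - alpha, df=K)
stats = np.array([one_replication() for _ in range(R)])
print("empirical coverage:", np.mean(stats <= threshold))   # should be close to 1 - alpha
pvals = chi2.sf(stats, df=K)                      # roughly Uniform(0, 1) under H_0
```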

Scatter plot of the pairs \((\sqrt{n}\boldsymbol{\hat{B}}_{j1}, \sqrt{n}\boldsymbol{\hat{B}}_{j2})\) with \(K=2\) over 1000 repetitions. The blue ellipsoid is the boundary of the \(95\%\) confidence set for \(\sqrt{n} \boldsymbol{\hat{B}}^T \boldsymbol e_j\) under \(H_0\) from the classical theory based on the Fisher information; the dashed red ellipsoids are the boundaries of the \(95\%\) confidence sets for \(\sqrt{n} \boldsymbol{\hat{B}}^T \boldsymbol e_j\) under \(H_0\) from this paper. Each of the 1000 repetitions gives a slightly different dashed ellipsoid, and the solid red ellipsoid is the average of these 1000 dashed ellipsoids. Each row \(\boldsymbol x_i\) of \(\boldsymbol X\) is sampled iid from \(N(\boldsymbol{0}_p, \boldsymbol\Sigma)\) with \(\boldsymbol\Sigma=(0.5^{|a-b|})_{a,b\in[p]}\). The first \(\lceil p/4\rceil\) rows of \(\boldsymbol{B}^*\) are sampled iid from \(N(\boldsymbol{0},\boldsymbol I_K)\) while the remaining rows are set to zero. We further normalize the ground truth \(\boldsymbol{B}^*\) so that \(\boldsymbol{B}^*{}^T\boldsymbol\Sigma \boldsymbol{B}^* = [[\boldsymbol I_K, \boldsymbol 0_K];[\boldsymbol 0_K^T, 0]].\) The last coordinate \(j=p\) is used as the null coordinate.
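A sketch of this simulation design could look as follows (the sizes are illustrative, and the last column of \(\boldsymbol B^*\) is taken as a zero reference column, an assumption not spelled out in the caption):

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(1)
n, p, K = 1000, 250, 2                    # illustrative sizes; the figures use larger (n, p)

# AR-type covariance Sigma_{ab} = 0.5^{|a-b|} and Gaussian design X
idx = np.arange(p)
Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
X = rng.standard_normal((n, p)) @ np.linalg.cholesky(Sigma).T

# B*: first ceil(p/4) rows iid N(0, I_K) in the first K columns; the last column is
# taken as a zero reference column (an assumption); remaining rows are zero.
# The last feature j = p is the null coordinate.
B_star = np.zeros((p, K + 1))
s = int(np.ceil(p / 4))
B_star[:s, :K] = rng.standard_normal((s, K))

# Normalize so that B*^T Sigma B* = [[I_K, 0], [0, 0]]
M = B_star[:, :K].T @ Sigma @ B_star[:, :K]
B_star[:, :K] = B_star[:, :K] @ np.linalg.inv(np.real(sqrtm(M)))

# One-hot responses from the multinomial logistic model
U = X @ B_star
P = np.exp(U - U.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)
Y = np.array([rng.multinomial(1, q) for q in P])
```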

Histogram of p-values of the classical test based on the Fisher information (in orange) and of the proposed high-dimensional test (in blue) for a null covariate (i.e., under \(H_0\)), on simulated data with various \((n,K)\) and \(p=1000\).
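To reproduce this kind of diagnostic, one can collect the p-values `chi2.sf(stat, df=K)` of the null coordinate over the repetitions (in an array `pvals`, assumed computed as in the coverage-check sketch above) and plot their histogram; under \(H_0\) the proposed p-values should look approximately Uniform(0, 1):

```python
import matplotlib.pyplot as plt

# `pvals` is assumed to hold the p-values of the null coordinate over the repetitions,
# e.g. as collected in the coverage-check sketch above.
plt.hist(pvals, bins=20, range=(0, 1), density=True, alpha=0.7, label="proposed test")
plt.axhline(1.0, linestyle="--", color="black", label="Uniform(0, 1) density")
plt.xlabel("p-value under $H_0$")
plt.ylabel("density")
plt.legend()
plt.show()
```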

Code

The code to reproduce the figures is available in the supplementary material of the paper: https://proceedings.neurips.cc/paper_files/paper/2023/file/e0ac27bf3327c9cb99cc5f548db4f73a-Supplemental-Conference.zip.

References

Tan, Kai, and Pierre C. Bellec. 2023. “Multinomial Logistic Regression: Asymptotic Normality on Null Covariates in High-Dimensions.” In Advances in Neural Information Processing Systems (NeurIPS 2023). https://proceedings.neurips.cc/paper_files/paper/2023/hash/e0ac27bf3327c9cb99cc5f548db4f73a-Abstract-Conference.html.