Estimating ground-truth correlation in L1 regularized logistic regression

Consider an L1 regularized estimator of the form \[\begin{equation} \boldsymbol{\hat{\beta}} = \operatorname{argmin}_{\boldsymbol{b}\in\mathbb R^p} \frac1n \sum_{i=1}^n \mathcal L_{y_i}(\boldsymbol x_i^T \boldsymbol b) + \frac{\lambda}{\sqrt n}\|\boldsymbol b\|_1 \end{equation}\] for some data-fitting loss function \(\mathcal L_{y_i}(\cdot)\). We are interested in estimating the correlation of \(\boldsymbol{\hat{\beta}}\) with a ground-truth \(\boldsymbol\beta^*\).

Let us look at a dataset \((\boldsymbol x_i, y_i)\) where the response \(y_i\) is valued in \(\{0,1,\dots,q\}\) for some integer \(q\) and satisfies a binomial model with isotropic Gaussian feature vectors: \[ \boldsymbol x_i \sim N(\boldsymbol 0_p, \boldsymbol I_p), \qquad y_i \mid \boldsymbol x_i \sim \operatorname{Binomial}(q, \rho(\boldsymbol x_i^T\boldsymbol\beta^*)) \] where \(\boldsymbol \beta^*\) is the unknown ground truth and \(\rho(t)=1/(1+e^{-t})\) is the sigmoid function. This can be seen as a logistic regression model with \(q\) repeated measurements for each observed feature vector \(\boldsymbol x_i\). Using the negative log-likelihood as the loss, the above estimator becomes \[\begin{equation} \boldsymbol{\hat\beta} = \operatorname{argmin}_{\boldsymbol b\in\mathbb R^p} \frac1n \sum_{i=1}^n \Bigl[ q \log(1+e^{\boldsymbol x_i^T\boldsymbol b}) - y_i \boldsymbol x_i^T\boldsymbol b \Bigr] + \frac{\lambda}{\sqrt n}\|\boldsymbol b\|_1. \end{equation}\] Denote by \(\hat S\) the subset of \([p]\) containing the nonzero coefficients of \(\boldsymbol{\hat\beta}\), i.e., \[ \hat S = \{j\in[p]: \hat\beta_j \ne 0\}. \]
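
The bracketed term is indeed the negative log-likelihood of a single observation, up to an additive constant that does not depend on \(\boldsymbol b\): writing \(t = \boldsymbol x_i^T\boldsymbol b\) and using \(\log\rho(t) = t - \log(1+e^{t})\) and \(\log(1-\rho(t)) = -\log(1+e^{t})\), we get \[ -\log\Bigl[\tbinom{q}{y_i}\,\rho(t)^{y_i}\,(1-\rho(t))^{q-y_i}\Bigr] = q\log(1+e^{t}) - y_i t - \log\tbinom{q}{y_i}. \]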

Estimating the correlation with the ground truth

In this case of L1 regularization, the estimator \(\hat a\) proposed in (Bellec 2022) of the correlation of \(\boldsymbol{\hat\beta}\) with the ground truth \(\boldsymbol\beta^*\) is defined as \[\begin{equation} \hat a^2 = \frac{ \big( \tfrac{\hat v}{n}\|\boldsymbol{X} \boldsymbol{\hat\beta} - \hat\gamma \boldsymbol{\hat\psi}\|^2 + \tfrac{1}{n}\boldsymbol{\hat\psi}^\top\boldsymbol X\boldsymbol{\hat\beta} - \tfrac{\hat\gamma}{n} \|\boldsymbol{\hat\psi}\|^2 \big)^2 }{ \tfrac{1}{n^2}\|\boldsymbol{X}^T\boldsymbol{\hat\psi}\|^2 + \tfrac{2 \hat v}{n} \boldsymbol{\hat\psi}^T\boldsymbol X\boldsymbol {\hat\beta} + \tfrac{\hat v^2}{n} \|\boldsymbol{X} \boldsymbol{\hat\beta} - \hat\gamma \boldsymbol{\hat\psi}\|^2 - \tfrac{p}{n^2}\|\boldsymbol{\hat\psi}\|^2 } \label{asquare} \end{equation}\] where \(\boldsymbol{\hat\psi} = (\hat \psi_i)_{i\in[n]}\) is the vector defined by \(\hat\psi_i= - \mathcal L_{y_i}'(\boldsymbol x_i^T\boldsymbol{\hat\beta})\), and the scalars \(\hat v\) and \(\hat \gamma\) are defined by \(\hat \gamma = |\hat S|/(n\hat v)\) and \[ \hat v = \frac 1 n \operatorname{trace}\Bigl[ \boldsymbol D - \boldsymbol D \boldsymbol X_{\hat S} (\boldsymbol X_{\hat S}^T \boldsymbol D \boldsymbol X_{\hat S})^{-1} \boldsymbol X_{\hat S}^T \boldsymbol D \Bigr] \] with \(\boldsymbol D = \operatorname{diag}((\mathcal L_{y_i}''(\boldsymbol x_i^T\boldsymbol{\hat\beta}))_{i\in[n]})\). Above, \(\boldsymbol X_{\hat S}\) is the submatrix of \(\boldsymbol X\) made of the columns indexed in \(\hat S\). The paper (Bellec 2022) proves that \(\hat a^2 \approx (\boldsymbol{\beta^*}^T\boldsymbol{\hat\beta}/\|\boldsymbol\beta^*\|)^2\) under suitable conditions. We could not think of any intuition suggesting the validity of this expression, apart from the algebra and the probabilistic inequalities used in the proof.

Simulations: Data-generating process

We verify the validity of this estimate with simulations under the following setting:

  • \(n=1000\).
  • \(p\in\{500, 1000, 2000, 2500\}\).
  • For each \(q\in\{1,2,4,6,8\}\), the response \(y_i\) is generated from the model \[y_i\mid\boldsymbol x_i \sim \text{Binomial}(q, \rho(\boldsymbol x_i^T\boldsymbol\beta^*))\] where \(\rho(t)=1/(1+e^{-t})\) is the sigmoid.
  • The unknown vector \(\boldsymbol\beta^*\) has \(\frac{p}{20}\) nonzero coefficients and is generated deterministically as follows: first, generate \(\boldsymbol\beta^0\) with its first \(\frac{p}{20}\) coefficients equispaced in \([0.5,4]\) and its remaining coefficients equal to 0; second, set \(\boldsymbol\beta^* = 1.1\, \boldsymbol\beta^0 / \|\boldsymbol\beta^0\|\).
import numpy as np
from scipy.special import expit
from numpy.linalg import norm

def generate_dataset(p, s, n, q, c, seed=None):
    # ground truth: s nonzero coefficients equispaced in [0.5, 4], rescaled to norm c
    beta = np.hstack([np.linspace(0.5, 4, num=s), np.zeros(p-s)])
    beta = c*beta/norm(beta)
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))              # isotropic Gaussian features
    y = rng.binomial(n=q, p=expit(X@beta))   # y_i | x_i ~ Binomial(q, sigmoid(x_i^T beta*))
    return X, y
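
As a quick sanity check (with arbitrary, illustrative parameters that are not those used in the simulations below), the generator returns an \(n\times p\) design matrix and integer responses bounded by \(q\):

X, y = generate_dataset(p=200, s=10, n=50, q=3, c=1.1, seed=0)
assert X.shape == (50, 200) and 0 <= y.min() <= y.max() <= 3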

Computation of the regularized M-estimator

For each dataset, we compute the regularized M-estimator \[\begin{equation} \boldsymbol{\hat\beta} = \operatorname{argmin}_{\boldsymbol b\in\mathbb R^p} \frac1n \sum_{i=1}^n \Bigl[ q \log(1+e^{\boldsymbol x_i^T\boldsymbol b}) - y_i \boldsymbol x_i^T\boldsymbol b \Bigr] + \frac{\lambda}{\sqrt n}\|\boldsymbol b\|_1 \end{equation}\] for each tuning parameter \(\lambda\) in a logarithmic grid of cardinality 20 from \(0.3\) to \(3.0\). The function below returns both the estimator and the predicted probabilities \(\rho(\boldsymbol x_i^T\boldsymbol{\hat\beta})\) for each \(i\in[n]\). It is implemented by repeating the feature vector \(\boldsymbol x_i\) exactly \(q\) times, with the corresponding \(q\) binary responses equal to \(1\) (\(y_i\) times) and to \(0\) (\(q-y_i\) times), so that scikit-learn's binary LogisticRegression can be used directly.
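
The regularization strength passed to scikit-learn deserves a word of explanation. LogisticRegression with penalty='l1' and fit_intercept=False minimizes \(\|\boldsymbol b\|_1 + C\sum_k \ell_k(\boldsymbol b)\), where the sum runs over the binary observations and \(\ell_k\) is the logistic loss. Summing the logistic loss over the \(q\) repeated copies of \(\boldsymbol x_i\) gives exactly \(q\log(1+e^{\boldsymbol x_i^T\boldsymbol b}) - y_i \boldsymbol x_i^T\boldsymbol b\), and multiplying our objective by \(\sqrt n/\lambda\) turns it into \[ \frac{1}{\lambda\sqrt n}\sum_{i=1}^n \Bigl[ q \log(1+e^{\boldsymbol x_i^T\boldsymbol b}) - y_i \boldsymbol x_i^T\boldsymbol b \Bigr] + \|\boldsymbol b\|_1, \] so the two problems coincide for \(C = 1/(\lambda\sqrt n)\), which is the value passed to LogisticRegression in the code below.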

from sklearn.linear_model import LogisticRegression


def b_hat(X, y, lam):
    n, p = X.shape
    q = np.max(y)   # recover q from the data (assumes at least one y_i equals q)
    # expand each observation into q binary responses: y_i ones and q - y_i zeros
    yy = np.hstack(np.stack([np.hstack([np.ones(yi), np.zeros(q-yi)]) for yi in y]))
    XX = np.repeat(X, q, axis=0)
    assert np.allclose(XX[0, :], XX[q-1, :])
    regr = LogisticRegression(penalty='l1',
                              C=1/(lam*np.sqrt(n)),
                              fit_intercept=False,
                              solver='saga')
    regr.fit(XX, yy)
    hbeta = regr.coef_.squeeze()
    assert np.allclose(regr.predict_proba(XX)[:, 1],
                       expit(XX @ hbeta), rtol=0.01)
    # the q repeated rows of x_i share the same predicted probability; average them back to length n
    probas = expit(XX @ hbeta).reshape((n, q)).mean(axis=1)
    assert np.allclose(np.repeat(probas, q), expit(XX @ hbeta))
    return probas, hbeta

b_hat(*generate_dataset(100, 10, 80, 3, 1.2), lam=1.3)[0][:10]
array([0.56274953, 0.35626422, 0.58911923, 0.42489169, 0.56935492,
       0.18512148, 0.25751163, 0.53012393, 0.36587244, 0.68674939])

Computing the estimator of the correlation with the ground truth

The computation of the estimate \[\begin{equation} \hat a^2 = \frac{ \big( \tfrac{\hat v}{n}\|\boldsymbol{X} \boldsymbol{\hat\beta} - \hat\gamma \boldsymbol{\hat\psi}\|^2 + \tfrac{1}{n}\boldsymbol{\hat\psi}^\top\boldsymbol X\boldsymbol{\hat\beta} - \tfrac{\hat\gamma}{n} \|\boldsymbol{\hat\psi}\|^2 \big)^2 }{ \tfrac{1}{n^2}\|\boldsymbol{X}^T\boldsymbol{\hat\psi}\|^2 + \tfrac{2 \hat v}{n} \boldsymbol{\hat\psi}^T\boldsymbol X\boldsymbol {\hat\beta} + \tfrac{\hat v^2}{n} \|\boldsymbol{X} \boldsymbol{\hat\beta} - \hat\gamma \boldsymbol{\hat\psi}\|^2 - \tfrac{p}{n^2}\|\boldsymbol{\hat\psi}\|^2 } \end{equation}\] can be performed as follows. In our case, \(\hat\psi_i = - q \rho(\boldsymbol x_i^T\boldsymbol{\hat\beta}) + y_i\) and the second derivative of the loss is \(D_i = q \rho'(\boldsymbol x_i^T\boldsymbol{\hat\beta})\) where \(\rho(t)=1/(1+\exp(-t))\) is the sigmoid and \(\rho'(t)=\rho(t)(1-\rho(t))\) its derivative.
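
For the trace appearing in \(\hat v\), the cyclic property of the trace gives \[ \operatorname{trace}\Bigl[ \boldsymbol D - \boldsymbol D \boldsymbol X_{\hat S} (\boldsymbol X_{\hat S}^T \boldsymbol D \boldsymbol X_{\hat S})^{-1} \boldsymbol X_{\hat S}^T \boldsymbol D \Bigr] = \operatorname{trace}[\boldsymbol D] - \operatorname{trace}\Bigl[ (\boldsymbol X_{\hat S}^T \boldsymbol D \boldsymbol X_{\hat S})^{-1} (\boldsymbol D\boldsymbol X_{\hat S})^T (\boldsymbol D \boldsymbol X_{\hat S}) \Bigr], \] which is the form computed below with a single call to np.linalg.solve instead of an explicit matrix inverse.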

def compute(p, s, n, q, c, lam=1.0, seed=None):
    X, y = generate_dataset(p, s, n, q, c, seed=seed)
    probas, hbeta = b_hat(X, y, lam=lam)

    psi = y - q * probas             # psi_i = -L'_{y_i}(x_i^T hbeta) = y_i - q*rho(x_i^T hbeta)
    D = q * probas * (1 - probas)    # D_i = L''_{y_i}(x_i^T hbeta) = q*rho'(x_i^T hbeta)
    S_hat = np.nonzero(hbeta)[0]
    X_S = X[:, S_hat]
    DX_S = D[:, np.newaxis] * X_S
    # hat v = (1/n) * (trace(D) - trace((X_S^T D X_S)^{-1} (D X_S)^T (D X_S)))
    v = (D.sum() - np.trace(np.linalg.solve(X_S.T @ DX_S, DX_S.T @ DX_S))) / n
    gamma_hat = len(S_hat) / (n * v)

    a_square_denominator = \
             n**-2 * norm(X.T @ psi)**2 \
           + (2*v/n) * psi.dot(X @ hbeta) \
           + (v**2/n) * norm(X @ hbeta - gamma_hat * psi)**2 \
           - (p/n**2) * norm(psi)**2

    a_numerator = \
             (v/n) * norm(X @ hbeta - gamma_hat * psi)**2 \
           + (1/n) * psi.dot(X @ hbeta) \
           - (gamma_hat/n) * norm(psi)**2

    hat_a = np.sqrt(a_numerator**2 / a_square_denominator)

    # ground truth (same construction as in generate_dataset)
    beta = np.hstack([np.linspace(0.5, 4, num=s), np.zeros(p-s)])
    beta = c*beta/norm(beta)
    true_correlation = hbeta.dot(beta/norm(beta))
    return {"p": p, "n": n, "s": s, "q": q, "c": c, "lam": lam,
            r"estimate $\hat a$": hat_a,
            "true correlation": true_correlation,
            "true normalized correlation": true_correlation / (0.01+norm(hbeta)**2)**0.5,
            "estimate normalized correlation": hat_a / (0.01+norm(hbeta)**2)**0.5}

Simulations: Accuracy of the estimator of the correlation

lams = 0.3 * np.logspace(0, 1, num=20)
n = 1000
params = [dict(n=n, s=p//20, p=p, q=q, lam=lam, c=c, seed=seed)
            for lam in lams
            for p in [n//2, n, 2*n, (5*n)//2]
            for q in [1, 2, 4, 6, 8]
            for seed in range(50)
            for c in [1.1]
         ]
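
This grid contains \(20 \times 4 \times 5 \times 50 = 20000\) configurations (20 values of \(\lambda\), 4 dimensions \(p\), 5 binomial parameters \(q\) and 50 seeds), which matches the task count reported below.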

# Heavy computation
from joblib import Parallel, delayed

f"Will now launch {len(params)} parallel tasks"
#data = Parallel(n_jobs=100, verbose=12)(delayed(compute)(**d) for d in params)
'Will now launch 20000 parallel tasks'
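
Once the parallel run has completed and data holds the list of dictionaries returned by compute, the averages and 1.96-standard-error bands over the 50 seeds can be obtained with pandas and seaborn, for instance along the lines of the minimal sketch below; the exact layout and styling used for the figures in the next section may differ.

import pandas as pd
import seaborn as sns


def plot_results(data, normalized=False):
    # data: list of dictionaries returned by compute()
    df = pd.DataFrame(data)
    if normalized:
        cols = ["estimate normalized correlation", "true normalized correlation"]
    else:
        cols = [r"estimate $\hat a$", "true correlation"]
    # long format: two rows (estimate and truth) per simulation run
    long = df.melt(id_vars=["p", "q", "lam"], value_vars=cols,
                   var_name="quantity", value_name="value")
    # lineplot aggregates over seeds: mean curves with 1.96 * standard-error bands
    return sns.relplot(data=long, x="lam", y="value", hue="q", style="quantity",
                       col="p", kind="line", errorbar=("se", 1.96))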

Illustration: Figure 1

Using the above code, the figure below illustrates averages (with error bands depicting 1.96 times the standard error), over 50 independent copies of the dataset \((\boldsymbol X, \boldsymbol y)\), of \(\sqrt{\hat a^2}\) with \(\hat a^2\) as in (\(\ref{asquare}\)), versus its target \(a_* = \|\boldsymbol\beta^*\|^{-1}\boldsymbol\beta^*{}^T\boldsymbol{\hat\beta}\) in the first plot.

Estimated vs true correlation

The second plot shows the normalized correlation \(\sqrt{\hat a^2}/(0.01+\|\boldsymbol{\hat\beta}\|^2)^{1/2}\) versus \(a_*/(0.01+\|\boldsymbol{\hat\beta}\|^2)^{1/2}\). We add 0.01 to the denominator of the normalized correlation to avoid numerical instability for large tuning parameters for which \(\boldsymbol{\hat\beta}=\boldsymbol 0_p\) holds in some repetitions.

Estimated vs normalized correlation

The figures suggest that the estimate \(\hat a\) is accurate across the different dimensions \(p\), tuning parameters \(\lambda\) and binomial parameters \(q\). Estimation inaccuracies start to appear for small tuning parameters \(\lambda\) and for the largest values of the dimension \(p\).

For more details, see Section 6.4 of Bellec (2022).

References

Bellec, Pierre C. 2022. “Observable Adjustments in Single-Index Models for Regularized m-Estimators.” arXiv Preprint arXiv:2204.06990. https://arxiv.org/pdf/2204.06990.