Perfect recovery with least absolute deviation in robust regression
Least absolute deviation
Least Absolute Deviation refers to the M-estimator \[
\hat\beta =
\operatorname{argmin}_{b\in \mathbb R^p} \sum_{i=1}^n |y_i - x_i^Tb|,
\] that is, minimizing the sum of the absolute values of the residuals. The minimization problem can be reduced to a linear program (a sketch of this reduction is given after the code below). In Python, this estimator can be computed as follows.
import numpy as np
from sklearn.linear_model import QuantileRegressor

QuantileRegressor(quantile=0.5, alpha=0.0, fit_intercept=False, solver='highs')
QuantileRegressor(alpha=0.0, fit_intercept=False)
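As an aside, here is a minimal sketch (not from the original analysis) of the linear-program reduction mentioned above, using scipy.optimize.linprog; the helper name lad_via_lp is illustrative. The variables are \((b, u)\) where each \(u_i\) upper-bounds \(|y_i - x_i^Tb|\).

from scipy.optimize import linprog

def lad_via_lp(X, y):
    # Minimize sum_i u_i over (b, u) subject to u_i >= y_i - x_i^T b and u_i >= x_i^T b - y_i,
    # which is equivalent to minimizing sum_i |y_i - x_i^T b|.
    n, p = X.shape
    c = np.concatenate([np.zeros(p), np.ones(n)])
    A_ub = np.block([[X, -np.eye(n)], [-X, -np.eye(n)]])   # X b - u <= y  and  -X b - u <= -y
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * p + [(0, None)] * n          # b is free, u is nonnegative
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method='highs')
    return res.x[:p]

On the same data, this should return the same coefficients as the QuantileRegressor call above, up to the solver's tolerance.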
Experiment
In a simple regression model with \(y_i = x_i^T\beta^* + \varepsilon_i\) for some ground truth \(\beta^*\in\mathbb R^p\), normal feature vectors \(x_i\sim N(0_p, I_p)\) and noise \(\varepsilon_i\) independent of \(x_i\), we look at the probability of perfect recovery, \[
P(\hat\beta = \beta^*),
\] as a function of \(P(\varepsilon_i=0)\) (the proportion of noise variables equal to zero) and of the dimension/sample size ratio \(p/n\).
def compute(n, p, o, seed=None):
    # Least absolute deviation fit (quantile=0.5, no regularization, no intercept)
    regr = QuantileRegressor(quantile=0.5, alpha=0.0, fit_intercept=False, solver='highs')
    rng = np.random.default_rng(seed=seed)
    X = rng.standard_normal(size=(n, p))
    beta_star = rng.standard_normal(size=(p, ))
    # The first o observations get heavy-tailed Cauchy noise, the remaining n-o are noiseless
    y = X @ beta_star + rng.standard_cauchy(size=(n, )) * np.hstack([np.ones(o), np.zeros(n-o)])
    regr.fit(X, y)
    return {'n': n, 'p': p, 'o': o, 'seed': seed,
            'perfect recovery': np.allclose(regr.coef_, beta_star)}

compute(100, 10, 30, seed=6)
This experiment is repeated across values of \(p\) (the dimension) and \(o\) (the number of nonzero values among \(\varepsilon_1,\dots,\varepsilon_n\)). For each pair \(o\in\{0,\dots,n\}\), \(p\in[n]\), the experiment is repeated 10 times.
n = 100
parameters = [
    dict(n=n, p=p, o=o, seed=seed)
    for o in range(0, n+1)
    for p in range(1, n+1)
    for seed in range(10)
]
f"Will now launch {len(parameters)} parallel tasks"
'Will now launch 101000 parallel tasks'
from joblib import Parallel, delayed

# Heavy computation
data = Parallel(n_jobs=18, verbose=2)(
    delayed(compute)(**d) for d in parameters
)
Heatmap: empirical probabilities of perfect recovery
#%config InlineBackend.figure_formats = ['svg']
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl

mpl.rcParams['figure.dpi'] = 150

# Empirical probability of perfect recovery for each (o, p) pair, averaged over seeds
piv = pd.pivot_table(pd.DataFrame(data), values='perfect recovery', index='p', columns='o')
ax = sns.heatmap(piv)
plt.ylabel('dimension p')                # rows of the pivot table are values of p
plt.xlabel('nonzero noise values')       # columns are values of o
ax.invert_yaxis()
Theoretical phase transition
As discussed in (Bellec and Koriyama 2023, sec. 2), the phase transition can be predicted using a simple computation based on the results of Amelunxen et al. (2014). As in the Donoho-Tanner phase transition, the predicted theoretical curve that separates perfect recovery from imperfect recovery involves the standard normal density and cumulative distribution function.
from scipy.stats import norm

f = norm.pdf
Phi = norm.cdf
Assume that \(p,n\to+\infty\) simultaneously with \(\lim \frac n p = \delta\). The simple computation in (Bellec and Koriyama 2023, sec. 2) explains that perfect recovery occurs if \[
\frac{1}{\delta}
> 1 - \min_{t\in \mathbb R}
\mathbb E[\text{dist}(G, t \partial |\varepsilon_i|)^2]
\] where \(G\sim N(0,1)\) is independent of \(\varepsilon_i\), and imperfect recovery occurs if the strict inequality is reversed. Above, \(\partial |\varepsilon_i| = \{\text{sign}(\varepsilon_i)\}\) if \(\varepsilon_i\ne 0\) and \(\partial |\varepsilon_i| = [-1,1]\) if \(\varepsilon_i = 0\); this is the subdifferential of the absolute value function at \(\varepsilon_i\).
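Conditioning on whether \(\varepsilon_i\) is zero gives an explicit expression for this expectation (a short side computation; the same expression drives the minimization below): \[
\mathbb E[\text{dist}(G, t \partial |\varepsilon_i|)^2]
= P(\varepsilon_i\ne 0)\,\mathbb E[(G - t\,\text{sign}(\varepsilon_i))^2]
+ P(\varepsilon_i = 0)\,\mathbb E[\text{dist}(G, [-t,t])^2]
= P(\varepsilon_i\ne 0)(1+t^2) + P(\varepsilon_i = 0)\,\mathbb E[(|G|-t)_+^2],
\] using that \(G\) is independent of \(\text{sign}(\varepsilon_i)\) and that the distance from \(G\) to the interval \([-t,t]\) is \((|G|-t)_+\).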
Let \(q= P(\varepsilon_i\ne 0)\) be the probability that a noise random variable is nonzero. A parametric curve \(t\mapsto (\delta(t),q(t))\) that separates the \((\delta,q)\) plane in two regions (a perfect recovery region and an imperfect recovery region), can be computed with \[
\delta(t) = \frac{1}{1- \mathbb E[\text{dist}(G, t \partial |\varepsilon_i|)^2]}
\] and \(q(t)\) solving the optimality condition for the minimization over \(t\) of the function \(t\mapsto
\mathbb E[\text{dist}(G, t \partial |\varepsilon_i|)^2]
=
P(\varepsilon_i\ne 0)(1+t^2) + P(\varepsilon_i = 0)\mathbb E[(|G|-t)_+^2]\). Setting the derivative of this function with respect to \(t\) to zero gives the equation characterizing \(q(t)\) on the phase transition, namely, \[
0 =
2\, t\, q(t)
- 2\,(1-q(t))\, \mathbb E[(|G|-t)_+],
\] where \(\mathbb E[(|G|-t)_+]= 2 f(t) - 2 t \Phi(-t)\), \(f\) is the standard normal pdf, and \(\Phi(-t)=\int_t^\infty f(u)du\).
Amelunxen, Dennis, Martin Lotz, Michael B. McCoy, and Joel A. Tropp. 2014. “Living on the Edge: Phase Transitions in Convex Programs with Random Data.” Information and Inference.
Bellec, Pierre C., and Takuya Koriyama. 2023. “Existence of Solutions to the Nonlinear Equations Characterizing the Precise Error of M-Estimators.” arXiv preprint.