Introduction

Significant efforts have been made in recent research to perform genetic profiling and drug screening in human cancer cell lines to explore how genomic backgrounds influence response to therapy1. These efforts have resulted in valuable cancer cell line (CCL) data resources, such as the Cancer Cell Line Encyclopedia (CCLE)2. CCLE data provide responses to 24 anticancer drugs in hundreds of cell lines across multiple tumor types, along with genomic information about these cell lines, such as the expression of nearly 20,000 genes. These data are often used to build computational models to predict drug response3. For example, to predict drug response using gene expression, researchers have used several methods, including Ridge regression4, mixture regression5, support vector machines6, random forests7, and neural networks8.

Given that recent large public CCL datasets, such as CCLE, include multiple cancer types, there is a need to develop statistical inference to assess the heterogeneity of the effects of gene expression on a drug across multiple tumor types. Quantifying the heterogeneity of the effects of genes across different tumors can provide valuable information in drug response modeling, as cancer tissue heterogeneity must be considered when modeling drug response across different cancer types9. Different tumor types may be characterized by their own tumor-specific genes, and thus tumor type interactions may need to be considered. In addition, it is often observed that patients with different types of cancer have similar expression patterns of certain genes, but the efficacy of treatments varies. For example, HER2 overexpression has been observed in subsets of patients with cancers such as breast and gastric cancer, but a HER2 inhibitor, pertuzumab, is more effective in treating HER2-positive breast cancer compared to HER2-positive gastric cancer10. Therefore, interactions with tumor type may need to be considered for more accurate statistical analysis. Several studies have attempted to develop statistical modeling methods for drug response that account for heterogeneity between different cancer types11,12. However, a common approach is still to analyze several tumor types simultaneously2,13. This approach assumes that there are no interaction effects with tumor-specific factors. Such inconsistencies in the current literature raise a critical research question regarding the uniformity of the effects of gene expression on drug response across different cancer types. To date, this question remains unanswered. This paper aims to fill this gap by developing a statistical test for homogeneity in the effects of gene expression on drug response across multiple cancer types. This task is particularly challenging due to the high-dimensional nature of CCLE data. In a different context, heterogeneity testing between studies in meta-analyses has been investigated14,15.

In this study, we focus on statistical inference for the heterogeneity of the effects of gene expression on drug sensitivity across different cancer types representing distinct subpopulations. We consider a binary response setting, where the response for a cell line represents whether that cell line is sensitive or resistant to a drug, because determining whether a patient is sensitive or resistant to anticancer drugs is critical to treatment. More specifically, we focus on statistical inference in a high-dimensional logistic regression model where the population is stratified into heterogeneous subpopulations, i.e., cancer types. We consider the problem of testing whether a given covariate has the same effect on the binary response in different subpopulations in high-dimensional logistic regression settings as follows: for some \(j \in \{1,\ldots ,p\}\) and \(c \in {\mathbb {R}}\),

$$\begin{aligned} {} H_0: \beta ^{(1)}_{j} = \cdots = \beta ^{(G)}_{j} = c \quad \text{ vs } \quad H_1: \text{ not } H_0, \end{aligned}$$
(1)

where \(\beta ^{(g)}_{j}\) is the underlying regression coefficient of the jth covariate for the gth subpopulation, and c can be specified, e.g. \(c=0\), or c can be unspecified. The null hypothesis with unspecified c indicates homogeneity, i.e., equal effects of the covariate. Testing (1) with unspecified c can provide valuable information in pharmacogenomics research, which studies how genes influence response to a drug. In addition, it may be helpful in understanding variations in drug response between cancer types in terms of gene expression. On the other hand, specifying c, e.g. \(c=0\), suggests that the covariate has zero effect, i.e., it is not significant. Testing (1) with \(c=0\) can provide candidates for gene expression markers of drug sensitivity that can be applied to multiple cancer types. Such versatile gene expressions are very useful for research and clinical settings16. While testing (1) can provide valuable insights, standard maximum likelihood estimation cannot be used for CCLE data analysis due to the high dimensionality of genes compared to the number of cell lines for tumor types. In recent years, much effort has been devoted to statistical inference for high-dimensional generalized linear models. Several studies have considered inference for either low-dimensional coefficients or a single coefficient in the presence of a large number of nuisance parameters. For example, Van de Geer et al.17 studied the theoretical properties of a bias-corrected Lasso18 estimator called desparsified Lasso or debiased Lasso. Theoretical properties of different types of bias-corrected estimators have also been studied under high-dimensional linear regression settings19,20. In addition, Ning and Liu21 proposed a decorrelated score test that can be applied to generic penalized M estimators. Other researchers have considered interval estimation for a single coefficient22,23. More recently, Ma et al.24 considered the multiple testing problem for high-dimensional logistic regression in two-sample settings.

Despite the appealing theoretical properties of these methods in high-dimensional regimes, these methods were developed in classical single-population settings and thus may not be optimal for analyzing data consisting of heterogeneous subpopulations. In addition, the combination of limited sample sizes for cancer types and the high dimensionality of gene expression data poses challenges in obtaining accurate results when testing the homogeneity or significance of a gene’s effect on a drug across different cancer types. Instead of considering inference based on Lasso penalized estimation, a common approach in existing high-dimensional inference, we propose statistical inference based on the fused group Lasso. Our proposed testing procedure consists of two steps. In the first step, we compute a suitable estimator for the underlying coefficients using variable selection and homogeneity detection. Our proposed objective function includes a negative log-likelihood, a group Lasso-type penalty25, and a fusion-type penalty26. The group Lasso-type penalty controls overall sparsity in the model, i.e., removes irrelevant covariates in all subpopulations, while the fusion-type penalty promotes similarities between coefficients across subpopulations, which can improve estimation efficiency by clustering regression coefficients27. Thus, the combination of group Lasso-type and fusion-type penalties allows for sparsity control and integration of samples from different cancer types, effectively addressing the challenges posed by small sample sizes for cancer types. The effectiveness of our tests over Lasso-based approaches is demonstrated by a simulation model in section “Simulation study for an imbalanced design”, where sample sizes for subpopulations are relatively limited. In the second step, we develop a bias correction procedure to correct the bias of our fused group Lasso estimator obtained in the first step. In essence, we extend the desparsified Lasso concept of Javanmard and Montanari19 to the fused group Lasso method, thereby facilitating statistical inference in high-dimensional regression scenarios. This advancement requires a rigorous theoretical investigation of the fused group Lasso technique; however, to the best of our knowledge, theoretical investigations of fused group Lasso problems remain limited.

Note that Zhou et al.28 considered fused sparse group Lasso in a multiple response linear regression model without rigorous convergence and inference analyses. The theoretical analysis of the combined penalty of group Lasso-type and fusion-type is highly non-trivial. In particular, the penalty function is not decomposable with respect to the support of the parameter, which means that the existing theory of decomposable regularizers29 cannot be applied30. In addition, we provide a computationally efficient algorithm to address the computational difficulties arising from the logistic log-likelihood and the fused Lasso penalty. By integrating the principles of the proximal gradient method and the ADMM into our framework, we effectively handle the computational complexity associated with our settings. Furthermore, our theory and methods can be easily extended to general fused group lasso settings, such as those with discrete responses.

To test the hypothesis (1), existing high-dimensional inference methods based on \(\ell _1\) norm penalization, such as the Lasso, could be considered. A naive approach would be to test (1) by applying a separate penalized logistic regression for each subpopulation, using an existing \(\ell _1\) norm-based penalization method, such as the debiased Lasso17. However, when the sparsity patterns of the underlying coefficients are similar across different subpopulations—as is the case in the CCLE data example—the proposed fused group Lasso approach can combine subpopulations and increase the sample size, resulting in better performance. This was also observed in our simulation analysis and in different contexts31.

The rest of the paper is organized as follows. Section “Method and theory” describes the proposed penalization method, its debiased version, and the test statistics with their theoretical properties. Section “Implementation” presents an algorithm that solves the proposed penalization method. Section “Simulation study” examines the finite sample performance of the proposed test along with other competing methods. Section “Application to the CCLE data” illustrates the application of our approach to the Cancer Cell Line Encyclopedia (CCLE) data2. Finally, Section “Conclusion” concludes the paper. Additional simulation results are presented in the Supplementary material.

Method and theory

In this section, we introduce the proposed penalization and test statistics. First, we introduce the notations that will be used throughout the paper. For any positive integer d, let \(\varvec{I}_d\) be the \(d \times d\) identity matrix, and \(\varvec{1}_d\) and \(\varvec{0}_d\) be the \(d \times 1\) vectors of all 1’s and 0’s, respectively. Let \({\varvec{0}_{d_1 \times d_2}}\) and \({\varvec{1}_{d_1 \times d_2}}\) be \(d_1\times d_2\) matrices whose entries are all 0’s and 1’s, respectively. When the size of the matrix is obvious, the subscript is sometimes omitted. For an index \(1 \le l \le d\), let \({ \varvec{e}_l}\) denote the \(d \times 1\) vector with one at the lth location and zero everywhere else. For any \(d \times 1\) vector \({\varvec{a}}=(a_1,\ldots ,a_d)^\top\), let \(\Vert {\varvec{a}}\Vert _q := \left( \sum _{\ell =1}^{d} |a_\ell |^q \right) ^{1/q}\) for \(1 \le q < \infty\) and \(\Vert {\varvec{a}}\Vert _{\max } := \max _{1\le \ell \le d}|a_\ell |\). For a set S, let |S| denote the cardinality of S. For a vector \({\varvec{a}}\) and an index set of elements, say S, let \(\varvec{a}_{S}\) be the \(|S|\times 1\) sub-vector of \(\varvec{a}\) with elements in S. For a matrix \({\varvec{A}}=(A_{ij})_{d_1 \times d_2}\), we let

$$\Vert {\varvec{A}}\Vert _{\max } := \max _{i,j} |A_{ij} |, \quad \Vert {\varvec{A}}\Vert _1 := \sum _{i,j}|A_{ij}| , \quad \Vert {\varvec{A}}\Vert _F := \sqrt{\sum _{i,j}A^2_{ij}}, \quad |\!|\!|\!{\varvec{A}}|\!|\!|\!_1 := \max _j \sum _i |A_{ij}|,$$

and use \(\text{ vec }({\varvec{A}})\) to represent vectorization by stacking the columns of a matrix \({\varvec{A}}\). For a symmetric matrix \({\varvec{A}}\), let \(\lambda _{\min }({\varvec{A}})\) be the minimum eigenvalue of \({\varvec{A}}\). Given a matrix \({\varvec{A}}\) and an index set of rows, say R, let \({{\varvec{A}}_{R,\cdot }}\) denote the sub-matrix of \({\varvec{A}}\) with rows in R. We also use \({\varvec{A}}_{i,\cdot }\) to represent the ith row of \({\varvec{A}}\). For any vectors \({\varvec{a}}\) and \(\varvec{b}\) of equal length, let \(\langle {\varvec{a}}, \varvec{b} \rangle := \sum _i a_i b_i\). For any matrices \({\varvec{A}}\) and \({\varvec{B}}\) with equal dimensions, let \(\langle {\varvec{A}},{\varvec{B}}\rangle :=\sum _{i,j}A_{ij}B_{ij}\). For non-negative sequences \(\{a_n\}_{n=1}^{\infty }\) and \(\{b_n\}_{n=1}^{\infty }\), we write \(a_n \ll b_n\) or \(a_n = o(b_n)\) if \(b_n>0\) and \(a_n/b_n \rightarrow 0\). We also write \(a_n = O(b_n)\) or \(a_n \lesssim b_n\) if \(a_n \le C b_n\) for some positive constant C. Let \(a \vee b\) and \(a \wedge b\) denote \(\max (a,b)\) and \(\min (a,b)\), respectively.

Setting and problem

Here we present a logistic regression model for heterogeneous subpopulations. Suppose the data come from G independent subpopulations, such as tumor sample groups generated independently in CCLE data. For each subpopulation \(1 \le g \le G\), there exist \(n_g\) pairs \(\left\{ \textbf{x}^{(g)}_{i},y^{(g)}_{i}\right\} _{i=1}^{n_g}\), where \(y^{(g)}_{i} \in \{0,1\}\) represents a binary response (e.g., binary drug response) of the ith subject in the gth group (e.g., tumor group) and \(\textbf{x}^{(g)}_{i} =\left( x^{(g)}_{i1},\ldots , x^{(g)}_{ip}\right) ^\top \in {\mathbb {R}}^p\) represents a p-dimensional vector of covariates, e.g. gene expression variables. We consider the following logistic regression model

$$\begin{aligned} y^{(g)}_{i} \mid \textbf{x}^{(g)}_{i} {\sim } \; \text{ Bernoulli }\left( \frac{\exp \left( \left[ \textbf{x}^{(g)}_{i} \right] ^\top {\varvec{\beta }}^{(g)}\right) }{1+\exp \left( \left[ \textbf{x}^{(g)}_{i} \right] ^\top {\varvec{\beta }}^{(g)}\right) }\right) \end{aligned}$$
(2)

for each group \(g=1,\ldots ,G\) and samples \(i=1,\dots ,n_g\), where \({\varvec{\beta }^{(g)}}:=(\beta ^{(g)}_1,\ldots , \beta ^{(g)}_p)^\top\) represents the underlying group-specific coefficient vector for the gth group. Let \({\varvec{\beta }}_{(j)}:= \left( \beta ^{(1)}_j,\ldots , \beta ^{(G)}_j \right) ^\top\) denote the underlying coefficient vector for the jth covariate. We consider the high-dimensional setting where the number of covariates p increases with the sample sizes \(n_1, \ldots ,n_G\). Let \(n = \sum _{g=1}^G n_g\) be the total sample size. We assume that the groups are heterogeneous but share similar characteristics in the sense that most regression covariates have similar effects on the response across different groups. In this study, we consider the case where only a few covariates, e.g., a small number of gene expressions in the CCLE data example, are relevant to the response across different groups, i.e., the index set \(S := \{j: \Vert {\varvec{\beta }}_{(j)} \Vert _2 \ne 0 \}\) is sparse in that \(s := |S| \ll p\). We also assume that only a few pairs of groups have different covariate effects, i.e., \(\Omega :=\left\{ (j,g,g'): \beta ^{(g)}_{j} \ne \beta ^{(g')}_{j}\right\}\) is sparse in that \({\tilde{s}}:=|\Omega | \ll sG^2\).

For a matrix notation, let

$$\begin{aligned} \textbf{y}^{(g)}&= { \left[ y^{(g)}_1,\ldots , y^{(g)}_{n_g} \right] ^\top } \in {\mathbb {R}}^{n_g} \\ \textbf{X}^{(g)}&= \left[ \textbf{x}^{(g)}_1, \ldots , \textbf{x}^{(g)}_{n_g} \right] ^\top \in {\mathbb {R}}^{n_g \times p} \end{aligned}$$

for \(1 \le g \le G\), where each column of \(\textbf{X}^{(g)}\) has a zero mean and \(\ell _2\) norm \(\sqrt{n_g}\). Let \(\textbf{y}\) be the binary response vector and \(\textbf{X}\) be the covariate matrix defined by

$$\begin{aligned} \textbf{y}&= \left[ (\textbf{y}^{(1)})^\top , \ldots , (\textbf{y}^{(G)})^\top \right] ^\top \in {\mathbb {R}}^{n} \\ \textbf{X}&= \text{ diag } \left( \textbf{X}^{(1)},\ldots , \textbf{X}^{(G)} \right) \in {\mathbb {R}}^{n \times pG}, \end{aligned}$$

where \(\textbf{X}\) is a block-diagonal matrix, consisting of \(\textbf{X}^{(g)}\)’s. Define the coefficient matrix \({\varvec{B}}\) as \({\varvec{B}}= \left[ {\varvec{\beta }}^{(1)},\ldots , {\varvec{\beta }}^{(G)} \right] \in {\mathbb {R}}^{p \times G}.\) Our main goal is to test the homogeneity of the effects of the lth covariate across the G groups, i.e., for \(1 \le l \le p\),

$$\begin{aligned} {} H_0: \beta ^{(1)}_l = \cdots = \beta ^{(G)}_l\quad \text{ vs } \quad H_1: \text{ not } H_0. \end{aligned}$$
(3)

Another goal is to test whether the lth covariate is significant to at least one of the groups:

$$\begin{aligned} {} H_0: \beta ^{(1)}_l = \cdots = \beta ^{(G)}_l =0 \quad \text{ vs } \quad H_1: \text{ not } H_0. \end{aligned}$$
(4)
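
For concreteness, a minimal R sketch of assembling the stacked response \(\textbf{y}\) and the block-diagonal covariate matrix \(\textbf{X}\) from the group-wise data is given below; the list names X_list and y_list are hypothetical.

```r
# A minimal sketch: assemble the stacked response y and the block-diagonal
# design X from group-wise data. X_list and y_list are hypothetical names for
# lists of length G holding the n_g x p matrices X^(g) (columns assumed centered
# and scaled to l2 norm sqrt(n_g), as in the text) and binary vectors y^(g).
build_stacked_data <- function(X_list, y_list) {
  G <- length(X_list)
  p <- ncol(X_list[[1]])
  n_g <- sapply(X_list, nrow)
  n <- sum(n_g)
  y <- unlist(y_list)                      # stacked response in R^n
  X <- matrix(0, nrow = n, ncol = p * G)   # block-diagonal matrix in R^{n x pG}
  row_end <- cumsum(n_g)
  row_start <- c(1, head(row_end, -1) + 1)
  for (g in seq_len(G)) {
    cols <- ((g - 1) * p + 1):(g * p)
    X[row_start[g]:row_end[g], cols] <- X_list[[g]]
  }
  list(y = y, X = X)
}
```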

The penalization method

We propose to test (3) or (4) based on the penalized method using the fused group Lasso. For any \(y \in \{0,1\}\) and \(\nu \in {\mathbb {R}}\), define \(\ell (y,\nu )= -y\nu + \log (1+\exp (\nu ))\). For the loss function \(\ell (y,\nu )\), let \({\dot{\ell }}(y,\nu )\) and \(\ddot{\ell }(y,\nu )\) denote its first and second derivatives with respect to \(\nu\), respectively. For a matrix \({\varvec{\Delta }}= ({\Delta }_{jg})_{1 \le j \le p, 1 \le g \le G} = \left( {\varvec{\Delta }}^{(1)},\ldots , {\varvec{\Delta }}^{(G)}\right) = \left( {\varvec{\Delta }}^\top _{(1)}, \ldots , {\varvec{\Delta }}^\top _{(p)}\right) ^\top\), let

$$\begin{aligned} L_n({\varvec{\Delta }})= \frac{1}{n} \sum _{g=1}^{G} \sum _{i=1}^{n_g}\ell \left( y^{(g)}_i, \left( \textbf{x}^{(g)}_i \right) ^\top {\varvec{\Delta }}^{(g)} \right) . \end{aligned}$$

We propose \({\hat{{\varvec{B}}}}\), which solves the following optimization problem:

$$\begin{aligned} {\hat{{\varvec{B}}}}&:= \mathop {\mathrm {arg\,min}}\limits _{{\varvec{\Delta }}\in {\mathbb {R}}^{p \times G}} L_n({\varvec{\Delta }}) + { \lambda _1\sum _{j=1}^{p} w_{j} \Vert {\varvec{\Delta }}_{(j)}\Vert _2 + \lambda _2\sum _{j=1}^{p} \sum _{g < g'} v_{j,gg'} |{\Delta }_{jg} - {\Delta }_{jg'}|}, \end{aligned}$$
(5)

where \(\lambda _1\) and \(\lambda _2\) are non-negative penalty parameters and \(w_j\)’s and \(v_{j,gg'}\)’s are non-negative weights.
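
To make the objective concrete, the following R sketch evaluates (5) at a candidate coefficient matrix; the weights \(w_j\) and \(v_{j,gg'}\) are taken as inputs (their construction is described below), and the function and argument names are hypothetical. The actual minimization is carried out by the algorithm in section “Implementation”.

```r
# A minimal sketch of evaluating the objective in (5) at a p x G matrix Delta.
# X_list / y_list are hypothetical lists holding X^(g) and y^(g); w is the
# length-p vector of weights w_j and V is a p x G x G array whose upper
# triangle (in the last two indices) holds the weights v_{j,gg'}.
fgl_objective <- function(Delta, X_list, y_list, lambda1, lambda2, w, V) {
  G <- ncol(Delta)
  n <- sum(sapply(y_list, length))
  # negative log-likelihood L_n(Delta), with l(y, nu) = -y * nu + log(1 + exp(nu))
  nll <- 0
  for (g in seq_len(G)) {
    eta <- as.vector(X_list[[g]] %*% Delta[, g])
    nll <- nll + sum(-y_list[[g]] * eta + log1p(exp(eta)))
  }
  Ln <- nll / n
  # group Lasso-type penalty over the rows Delta_(j)
  pen1 <- lambda1 * sum(w * sqrt(rowSums(Delta^2)))
  # fusion-type penalty over all pairs g < g'
  pen2 <- 0
  for (g in 1:(G - 1)) for (gp in (g + 1):G) {
    pen2 <- pen2 + sum(V[, g, gp] * abs(Delta[, g] - Delta[, gp]))
  }
  Ln + pen1 + lambda2 * pen2
}
```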

Let \({{\hat{{\varvec{B}}}}}= \left( {\hat{\beta }}_{jg} \right) _{1 \le j \le p, 1 \le g \le G} = \left( {\hat{{\varvec{\beta }}}}^{(1)},\ldots , {\hat{{\varvec{\beta }}}}^{(G)} \right) = \left( {\hat{{\varvec{\beta }}}}^\top _{(1)}, \ldots , {\hat{{\varvec{\beta }}}}^\top _{(p)} \right) ^\top\). In (5), the two penalty terms are based on the sparsity assumption, as explained in section “Setting and problem”: The group Lasso-type penalty promotes overall sparsity in the coefficient matrix, i.e., many \({\hat{{\varvec{\beta }}}}_{(j)}\)’s are the zero vector, while the fusion-type penalty promotes similarities among \({{\hat{\beta }}}_{jg}\)’s for \(1 \le g\le G\). For the weights \(w_j\) and \(v_{j,gg'}\), we consider the following optimization problem:

$$\begin{aligned} {\tilde{{\varvec{B}}}}&:=\mathop {\mathrm {arg\,min}}\limits _{{\varvec{\Delta }}\in {\mathbb {R}}^{p \times G}} L_n({\varvec{\Delta }}) + {{\tilde{\lambda }}}_1 \sum _{j=1}^{p}\Vert {\varvec{\Delta }}_{(j)}\Vert _2 + {{\tilde{\lambda }}}_2 \sum _{j=1}^{p} \sum _{g < g'} |{\Delta }_{jg} - {\Delta }_{jg'}|. \end{aligned}$$
(6)

Let \({{\tilde{{\varvec{B}}}}}= \left( {{\tilde{\beta }}}_{jg}\right) _{1 \le j \le p, 1 \le g \le G} = \left( {{\tilde{{\varvec{\beta }}}}}^{(1)},\ldots , {{\tilde{{\varvec{\beta }}}}}^{(G)}\right) = \left( {{\tilde{{\varvec{\beta }}}}}^\top _{(1)}, \ldots , {{\tilde{{\varvec{\beta }}}}}^\top _{(p)}\right) ^\top\). Following Zou32, we set

$$\begin{aligned} w_{j} = 1/\Vert {{\tilde{{\varvec{\beta }}}}}_{(j)}\Vert _2, \quad v_{j,gg'} = 1/|{\tilde{\beta }}_{jg}-{\tilde{\beta }}_{jg'}|. \end{aligned}$$
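
A minimal sketch of this weight construction is given below, assuming B_tilde is the \(p \times G\) matrix of initial estimates from (6); the small constant eps guarding against division by zero is our own assumption, not specified in the text.

```r
# A sketch of the adaptive weights, assuming B_tilde is the p x G initial
# estimate from (6); eps is a small hypothetical constant guarding against
# division by zero (an implementation detail not specified in the text).
adaptive_weights <- function(B_tilde, eps = 1e-8) {
  p <- nrow(B_tilde); G <- ncol(B_tilde)
  w <- 1 / pmax(sqrt(rowSums(B_tilde^2)), eps)   # w_j = 1 / ||beta_tilde_(j)||_2
  V <- array(0, dim = c(p, G, G))                # v_{j,gg'} for g < g'
  for (g in 1:(G - 1)) for (gp in (g + 1):G) {
    V[, g, gp] <- 1 / pmax(abs(B_tilde[, g] - B_tilde[, gp]), eps)
  }
  list(w = w, V = V)
}
```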

The following theorem presents an estimation error bound for the initial estimator \({\tilde{{\varvec{B}}}}\). For the details of Conditions 1–2 assumed to derive Theorem 1, please refer to section S3.1 of the Supplementary materials.

Theorem 1

Assume that Conditions 1–2 in the Supplementary materials hold, and \(\max _{1\le g\le G} \Vert {\varvec{\beta }}^{(g)}\Vert _1 \le \tilde{C}\) for some absolute constant \({{\tilde{C}}}>0\). Let the penalty parameters \({{\tilde{\lambda }}}_1\) and \({{\tilde{\lambda }}}_2\) be chosen so that

$$\begin{aligned} {{\tilde{\lambda }}}_1 \ge \sqrt{\frac{16G (\log p + \log G)}{n}}, \quad {{\tilde{\lambda }}}_2 = G^{-3/2} {{\tilde{\lambda }}}_1. \end{aligned}$$

Then, it holds that with probability at least \(1- 1/(pG)\),

$$\begin{aligned} \Vert {\tilde{{\varvec{B}}}}-{\varvec{B}}\Vert _F^2 \lesssim {{\tilde{\lambda }}}_1^2 s + {{\tilde{\lambda }}}_2^2 {\tilde{s}}. \end{aligned}$$

Theorem 1 implies that if we take

$$\begin{aligned} {{\tilde{\lambda }}}_1 \asymp \sqrt{\frac{G(\log p + \log G)}{n}}, \quad {{\tilde{\lambda }}}_2 \asymp \sqrt{\frac{\log p +\log G}{nG^2}}, \end{aligned}$$

then the initial estimator \({\tilde{{\varvec{B}}}}\) satisfies

$$\begin{aligned} \Vert {\tilde{{\varvec{B}}}}-{\varvec{B}}\Vert _F^2 = O_p\left( \Big (sG + \frac{{\tilde{s}}}{G^2}\Big ) \frac{\log (p \vee G)}{n}\right) . \end{aligned}$$

The following theorem presents the theoretical properties of the estimator \({\hat{{\varvec{B}}}}\). For the details of the additional Conditions 3–4 required to prove Theorem 2, please see section S3.1 in the Supplementary materials.

Theorem 2

Assume that the conditions of Theorem 1 and Conditions 3–4 in the Supplementary materials hold. Consider a minimizer \({\hat{{\varvec{B}}}}\) with \(\lambda _2 = G^{-3/2} \lambda _1\) and

$$\begin{aligned} 8{{\tilde{C}}} \sqrt{\frac{G(\log p + \log G)}{n}} \le \lambda _1 \asymp \sqrt{\frac{G(\log p + \log G)}{n}}. \end{aligned}$$

Define the estimators of S and \(\Omega\) by

$$\begin{aligned} {\hat{S}}:= \{j:\ \Vert {{\hat{{\varvec{\beta }}}}}_{(j)} \Vert _2 \ne 0 \}, \ {{\hat{\Omega }}}:=\{(j,g,g'): {\hat{\beta }}_{jg} \ne {\hat{\beta }}_{jg'}\}. \end{aligned}$$

Then, \(P({\hat{S}} = S) \rightarrow 1\) and \(P({{\hat{\Omega }}} = \Omega ) \rightarrow 1\).

Theorem 2 shows that the proposed fused group Lasso consistently selects relevant covariates and finds the same covariate effects across groups under additional conditions on the correlations between relevant and irrelevant covariates and on the minimum signal strengths. However, if a covariate of interest j is not selected in \({\hat{{\varvec{B}}}}\), i.e., \(j \notin {\hat{S}}\), statistical inference about its effects cannot be based on \({\hat{{\varvec{B}}}}\) alone under the original model, whereas the debiased version of \({\hat{{\varvec{B}}}}\) can still provide useful information, e.g., asymptotic distributions and p-values.

Debiased test

In this subsection, we develop the debiased version of the fused group Lasso estimator \({\hat{{\varvec{B}}}}\) for statistical inference. For \(1\le g \le G\), let

$$\begin{aligned} {\varvec{\Sigma }^{(g)}}&:= E\left[ n_g^{-1}\sum _{i=1}^{n_g}\ddot{\ell }\left( y^{(g)}_i, [\textbf{x}^{(g)}_{i}]^\top {\varvec{\beta }}^{(g)}\right) \textbf{x}^{(g)}_{i} [\textbf{x}^{(g)}_{i}]^\top \right] \\ {\varvec{M}^{(g)}}&:= ({\varvec{\Sigma }^{(g)}})^{-1}. \end{aligned}$$

Let \(\hat{{\varvec{\Sigma }}}^{(g)}\) be an estimator of \({\varvec{\Sigma }^{(g)}}\) defined by

$$\begin{aligned} \hat{{\varvec{\Sigma }}}^{(g)}:=n_g^{-1}\sum _{i=1}^{n_g}\ddot{\ell }\left( y^{(g)}_i, [\textbf{x}^{(g)}_i]^\top {{\hat{{\varvec{\beta }}}}}^{(g)}\right) \textbf{x}^{(g)}_{i} [\textbf{x}^{(g)}_{i}]^\top . \end{aligned}$$

Using the main idea of Javanmard and Montanari19, let \({\hat{\varvec{M}}}^{(g)} := \left[ {\hat{\varvec{m}}}^{(g)}_{1},\ldots , {\hat{ \varvec{m}}}^{(g)}_{p} \right] ^\top \in {\mathbb {R}}^{p \times p}\) be the estimate of \({\varvec{M}^{(g)}}\) defined by solving the quadratic programming

$$\begin{aligned} {\hat{\varvec{m}}}^{(g)}_{j}&= \mathop {\mathrm {arg\,min}}\limits _{{\varvec{m}} \in {\mathbb {R}}^{p}} \frac{1}{2} {\varvec{m}}^\top { {\hat{{\varvec{\Sigma }}}}^{(g)} } {\varvec{m}} \quad \text{ subject } \text{ to } { \Vert {\hat{{\varvec{\Sigma }}}}^{(g)} \varvec{m} -\varvec{e}_j\Vert _{\max }} \le \mu _g \end{aligned}$$
(7)

for each \(1 \le j \le p\) and \(1 \le g\le G\), where \(\mu _g =C_1 \sqrt{{\log p}/{n_g}}\) for some constant \(C_1>0\). We propose a debiased estimator: for \(g=1,\dots ,G\),

$$\begin{aligned} {\hat{\varvec{b}}^{(g)}}:={\hat{{\varvec{\beta }}}}^{(g)} - \frac{ {\hat{\varvec{ M}}}^{(g)} }{n_g}\sum _{i=1}^{n_g}{\dot{\ell }}\left( y^{(g)}_{i}, [\textbf{x}^{(g)}_i]^\top {\hat{{\varvec{\beta }}}}^{(g)}\right) \textbf{x}^{(g)}_{i}. \end{aligned}$$
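
As an illustration, given \(\hat{\varvec{M}}^{(g)}\) from (7), the debiased estimator for a single group can be computed as in the following sketch (all object names are hypothetical; solving (7) itself requires a quadratic programming solver and is not shown).

```r
# A sketch of the debiased estimator for one group g, assuming:
#   Xg   : n_g x p covariate matrix X^(g)
#   yg   : length-n_g binary response y^(g)
#   bhat : length-p fused group Lasso estimate beta_hat^(g)
#   Mhat : p x p matrix M_hat^(g) obtained from the quadratic program (7)
debias_group <- function(Xg, yg, bhat, Mhat) {
  eta <- as.vector(Xg %*% bhat)
  mu  <- 1 / (1 + exp(-eta))        # logistic mean, so l_dot(y, eta) = mu - y
  score <- crossprod(Xg, mu - yg)   # sum_i l_dot(y_i, x_i' bhat) x_i, a p x 1 vector
  drop(bhat - Mhat %*% score / nrow(Xg))
}
```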

In the debiased estimation approach, Javanmard and Montanari19 considered only linear regression, while Van de Geer et al.17 examined a generalized linear model. Van de Geer et al.17 used nodewise regression33 to estimate standard errors for the debiased Lasso estimator. While we use the bias correction technique developed by Javanmard and Montanari19, our proposed method differs from existing methods. Specifically, we use the quadratic programming considered in Javanmard and Montanari19 to estimate the \({\varvec{M}}^{(g)}\)’s instead of approximating the inverse of the sample covariance matrix for covariates. Compared to our method, most existing debiased inference methods for high-dimensional generalized linear models use nodewise regression (e.g. Ma et al.24, Tian and Feng34, Caner35). In addition, most of these existing methods were developed for statistical inference about a parameter in the classical single-population setting, and thus use the Lasso to estimate parameters for a single population. In contrast, we develop tests based on our fused group Lasso for inference about parameters for multiple subpopulations.

To derive the asymptotic normality of the debiased Lasso17, Van de Geer et al.17 assumed that each row of \({\varvec{M}^{(g)}}\) has a small number of non-zero entries. However, this assumption of exact sparsity may not hold in the generalized linear model, as pointed out by Xia et al.36. As demonstrated in Javanmard and Montanari19, we can achieve asymptotic normality of the proposed debiased estimator without assuming exact sparsity for \({\varvec{M}^{(g)}}\). Theorem 3 presents the theoretical properties of the proposed debiased estimator \({\hat{\varvec{b}}^{(g)}}\). See section S3.2 of the Supplementary materials for the details of Condition 5, which is crucial for proving Theorem 3.

Theorem 3

Suppose conditions of Theorem 2 and Condition 5 in the Supplementary materials hold. Then, the debiased estimator satisfies

$$\begin{aligned} \sqrt{n_g} \left( {\hat{\varvec{b}}^{(g)}}-{\varvec{\beta }}^{(g)}\right) = - \frac{ {\hat{\varvec{ M}}}^{(g)}}{\sqrt{n_g}}\sum _{i=1}^{n_g} {\dot{\ell }}\left( y^{(g)}_{i}, \left[ \textbf{x}^{(g)}_i \right] ^\top {\varvec{\beta }}^{(g)}\right) \textbf{x}^{(g)}_i + {\varvec{\Delta }_g}, \end{aligned}$$

where \(\max _g \Vert {\varvec{\Delta }_g}\Vert _{\max } = o_p(1)\).

Let \({\hat{\varvec{V}}}^{(g)} :={{\hat{{\varvec{M}}}}^{{(g)}}{\hat{{\varvec{\Sigma }}}}^{{(g)}}} \left[ {\hat{\varvec{M}}}^{(g)}\right] ^\top\) and \({\hat{\varvec{V}}_{(j)}}\) be the \(G \times G\) diagonal matrix with diagonal elements \(\left\{ \frac{1}{n_g} \hat{V}^{(g)}_{jj}\right\} _{g=1}^{G}\). Let \({\hat{\varvec{b}}_{(j)}}:= \left[ {\hat{b}}^{(1)}_j, \ldots , {\hat{b}}^{(G)}_j \right] ^\top\). Define \({\varvec{S}_j} = \left[ S_{j1},\ldots , S_{jG}\right] ^\top := {\hat{\varvec{V}}_{(j)}}^{-1/2} \left( {\hat{\varvec{b}}_{(j)}}- {\varvec{\beta }}_{(j)}\right)\) and \({\varvec{S}^0_j}:= {\hat{\varvec{V}}^{-1/2}_{(j)}} {\hat{\varvec{b}}_{(j)}}\). For a given significance level \(0<\alpha <1\), we reject the null hypothesis \(H_0\) in (4) when \(\Vert {\varvec{S}^0_j}\Vert _2^2 > \chi ^2_{\alpha }(G)\), where \(\chi ^2_{\alpha }(G)\) represents the upper \(\alpha\)-quantile of a central \(\chi ^2\) distribution with G degrees of freedom. The notion of weak convergence is not well defined in our setting, since we consider the case where G diverges with n. Theorem 4 shows that our test for the hypothesis (4) is still valid in the \(\chi ^2\) approximation.

Theorem 4

Suppose conditions of Theorem 3 hold, and \(G^{7/2} = o(\min _{1\le g\le G} n_g)\). Then, we have

$$\begin{aligned} \sup _x |P \left( \Vert {\varvec{S}_j}\Vert _2^2 \le x \right) - P({\chi ^2(G)} \le x)| \rightarrow 0. \end{aligned}$$

The following Corollary implies \({\hat{\varvec{b}}_{(j)}}^\top \hat{V}^{-1}_{(j)}{\hat{\varvec{b}}_{(j)}}\overset{d}{\rightarrow }\ \chi ^2_G\) under the null hypothesis \(H_0: \beta ^{(1)}_j = \cdots = \beta ^{(G)}_j =0\) in (4).

Corollary 1

Assume Conditions of Theorem 4. Then, under the null hypothesis \(H_0\) in (4), for any significance level \(0<\alpha < 1\), we have

$$\begin{aligned} \lim _{n \rightarrow \infty } P\big (\Vert {\varvec{S}^0_j}\Vert _2^2 > \chi ^2_{\alpha }(G) \big ) = \alpha . \end{aligned}$$

Based on this result, the corresponding chi-square test statistic for the hypothesis (4) is \(\Vert S^0_j\Vert _2^2\), i.e., reject \(H_0\) in (4) if \(\Vert S^0_j\Vert _2^2 > \chi ^2_{\alpha }(G)\).
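
A minimal sketch of this significance test for a single covariate j is given below, assuming the debiased vectors \(\hat{\varvec{b}}^{(g)}\) and the diagonal entries of \(\hat{\varvec{V}}^{(g)}\) have already been computed (object names are hypothetical).

```r
# A sketch of the significance test (4) for covariate j, assuming:
#   b_list : list of length G with the debiased vectors b_hat^(g)
#   v_list : list of length G with the diagonal entries of V_hat^(g)
#   n_vec  : group sample sizes (n_1, ..., n_G)
test_significance <- function(j, b_list, v_list, n_vec, alpha = 0.05) {
  G <- length(b_list)
  bj <- sapply(b_list, function(b) b[j])                            # b_hat_(j)
  vj <- sapply(seq_len(G), function(g) v_list[[g]][j] / n_vec[g])   # diag of V_hat_(j)
  stat <- sum(bj^2 / vj)                                            # ||S_j^0||_2^2
  pval <- pchisq(stat, df = G, lower.tail = FALSE)
  list(statistic = stat, p.value = pval,
       reject = stat > qchisq(1 - alpha, df = G))
}
```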

Next, we consider testing the homogeneity hypothesis in (3), which can be rewritten as

$$\begin{aligned} H_0: {\varvec{D}}{\varvec{\beta }}_{(j)} = 0 \quad \text{ vs } \quad H_1: \text{ not } H_0, \end{aligned}$$

where \({\varvec{D}}\) represents the \((G-1) \times G\) matrix such that \(D_{\ell \ell }=1\), \(D_{\ell , \ell +1}=-1\) for \(\ell =1,\ldots ,G-1\), and all other entries are zero. Define

$$\begin{aligned} {\varvec{K}_j}&:={\left( {\varvec{D}}{\hat{{\varvec{V}}}}_{(j)}{\varvec{D}}^\top \right) }^{-1/2} {\varvec{D}}({\hat{\varvec{b}}_{(j)}}-{{\varvec{\beta }}_{(j)}}) \\ {\varvec{K}_j^0}&:={\left( {\varvec{D}}{\hat{\varvec{{\varvec{V}}}}_{(j)}}{\varvec{D}}^\top \right) }^{-1/2} {\varvec{D}}{\hat{\varvec{b}}_{(j)}}. \end{aligned}$$

Theorem 5 shows that the test procedure for the hypothesis (3) based on \(\Vert {\varvec{K}_j}\Vert _2^2\) admits a \(\chi ^2\) approximation.

Theorem 5

Assume Conditions of Theorem 4. Then, we have

$$\begin{aligned} \sup _x \Big |P\big ( \Vert {\varvec{K}_j}\Vert ^2_2 \le x \big ) - P\big ({\chi ^2(G-1)}\le x \big ) \Big | \rightarrow 0. \end{aligned}$$

Corollary 2

Assume the conditions of Theorem 4. Then, under the null hypothesis \(H_0\) in (3), for any significance level \(0<\alpha < 1\), we have

$$\begin{aligned} \lim _{n \rightarrow \infty } P\left( \Vert {\varvec{K}_j^0}\Vert _2^2> \chi ^2_{\alpha }(G-1)\right) =\alpha . \end{aligned}$$

Corollary 2 implies that under \(H_0: \beta ^{(1)}_{j} = \cdots =\beta ^{(G)}_{j}\) in (3), it holds that

$$\begin{aligned} {\hat{\varvec{b}}_{(j)}}^\top {\varvec{D}}^\top ({\varvec{D}}{\hat{\varvec{{\varvec{V}}}}_{(j)}}{\varvec{D}}^\top )^{-1}{\varvec{D}}{\hat{\varvec{b}}_{(j)}}\overset{d}{\rightarrow }\ \chi ^2_{G-1}. \end{aligned}$$

Based on this result, the corresponding chi-square test statistic for the hypothesis (3) is \(\Vert {\varvec{K}_j^0}\Vert _2^2\), i.e., reject \(H_0\) in (3) if \(\Vert {\varvec{K}_j^0}\Vert _2^2 > \chi ^2_{\alpha }(G-1)\).
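
Analogously, the homogeneity test can be carried out as in the following sketch with the same (hypothetical) inputs; the sign convention of the difference matrix does not affect the quadratic form.

```r
# A sketch of the homogeneity test (3) for covariate j, with the same
# (hypothetical) inputs as the significance test above.
test_homogeneity <- function(j, b_list, v_list, n_vec, alpha = 0.05) {
  G <- length(b_list)
  bj <- sapply(b_list, function(b) b[j])
  Vj <- diag(sapply(seq_len(G), function(g) v_list[[g]][j] / n_vec[g]))
  D  <- diff(diag(G))   # rows are e_{l+1} - e_l; sign flip of the text's D
  stat <- as.numeric(t(D %*% bj) %*% solve(D %*% Vj %*% t(D)) %*% (D %*% bj))
  pval <- pchisq(stat, df = G - 1, lower.tail = FALSE)
  list(statistic = stat, p.value = pval,
       reject = stat > qchisq(1 - alpha, df = G - 1))
}
```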

Implementation

In this section we present the computational algorithm for solving (6). The algorithm for (5) can be obtained in a similar way. We use the proximal gradient method to solve (6). Let \({\tilde{{\varvec{B}}}}^{(t)}\) be the tth update in the proximal gradient method. Then the \((t+1)\)th update \({\tilde{{\varvec{B}}}}^{(t+1)}\) is given by

$$\begin{aligned}&\mathop {\mathrm {arg\,min}}\limits _{{\varvec{\Delta }}\in {\mathbb {R}}^{p \times G}} L_n \left( \tilde{\varvec{B}}^{(t)}\right) + \left\langle \nabla L_n({{\tilde{{\varvec{B}}}}}^{(t)}), {\varvec{\Delta }}- \tilde{\varvec{B}}^{(t)} \right\rangle + \frac{\eta }{2} { \Vert {\varvec{\Delta }}-{\tilde{{\varvec{B}}}}^{(t)}\Vert ^2_F} + \tilde{\lambda }_1\sum _{j=1}^{p}\Vert {\varvec{\Delta }}_{j,\cdot }\Vert _2 + \tilde{\lambda }_2\sum _{j=1}^{p}\sum _{g < g'}|{\Delta }_{jg}-{\Delta }_{jg'}|. \end{aligned}$$
(8)

We set \(\eta =\sum _{g=1}^{G}\Vert \textbf{X}^{(g)}\Vert ^2_F/(4n)\) based on the convergence properties of the proximal gradient method37. Note that \(\nabla L_n({\varvec{\Delta }})\) is Lipschitz continuous with Lipschitz parameter \(\sum _{g=1}^{G}\Vert \textbf{X}^{(g)}\Vert ^2_F/(4n)\). Let \(L_{n,{\varvec{A}}}({\varvec{\Delta }})\) be the first-order approximation of \(L_n({\varvec{\Delta }})\) at a matrix \({\varvec{A}}\). Then, (8) can be rewritten as

$$\begin{aligned}&\mathop {\mathrm {arg\,min}}\limits _{{\varvec{\Delta }}} L_{n,{\tilde{{\varvec{B}}}}^{(t)}}({\varvec{\Delta }}) + {\frac{\eta }{2}\Vert {\varvec{\Delta }}-{\tilde{{\varvec{B}}}}^{(t)}\Vert ^2_F}+ \tilde{\lambda }_1\sum _{j=1}^{p}\Vert {\varvec{\Delta }}_{j,\cdot }\Vert _2 + {{\tilde{\lambda }}}_2\sum _{j=1}^{p}\sum _{g < g'}|{\Delta }_{jg}-{\Delta }_{jg'}|. \end{aligned}$$
(9)

To compute (9), we use the alternating direction method of multipliers (ADMM)38. Let \({\varvec{H}}\) be the G by \(G(G-1)/2\) matrix satisfying

$$\begin{aligned} \sum _{j=1}^{p}\sum _{g < g'}|{\Delta }_{jg}-{\Delta }_{jg'}| = \Vert {\varvec{\Delta }}{\varvec{H}}\Vert _1. \end{aligned}$$
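
A minimal sketch of constructing \({\varvec{H}}\) follows: each column corresponds to one pair \(g < g'\) and has entries \(+1\) and \(-1\) in rows g and \(g'\), respectively (the function name is hypothetical).

```r
# A sketch of the pairwise-difference matrix H (G x G(G-1)/2): each column has
# +1 in row g and -1 in row g', so the columns of Delta %*% H are
# Delta[, g] - Delta[, g'] over all pairs g < g'.
make_H <- function(G) {
  pairs <- combn(G, 2)
  H <- matrix(0, nrow = G, ncol = ncol(pairs))
  for (k in seq_len(ncol(pairs))) {
    H[pairs[1, k], k] <- 1
    H[pairs[2, k], k] <- -1
  }
  H
}
# sum(abs(Delta %*% make_H(G))) then equals the (unweighted) fusion penalty term.
```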

By introducing surrogate variables \({\varvec{A}}\) and \({\varvec{F}}\), (9) can be converted to solving the following optimization problem:

$$\begin{aligned} \min _{{\varvec{\Delta }}, {\varvec{A}}, {\varvec{F}}}&\, \, L_{n,{\tilde{{\varvec{B}}}}^{(t)}}({\varvec{\Delta }}) + \frac{\eta }{2}\Vert {\varvec{\Delta }}- {\tilde{{\varvec{B}}}}^{(t)} \Vert ^2_F + {{\tilde{\lambda }}}_1 \sum _{j=1}^{p}\Vert {\varvec{A}}_{j,\cdot }\Vert _2+ {{\tilde{\lambda }}}_2 \Vert {\varvec{F}}\Vert _1 \nonumber \\ {}&\text{ subject } \text{ to } {\varvec{A}}={\varvec{\Delta }} \text{ and } {\varvec{\Delta }}{\varvec{H}}= {\varvec{F}}. \end{aligned}$$
(10)

The corresponding augmented Lagrangian is

$$\begin{aligned} K^{(t)}({\varvec{\Delta }}, {\varvec{A}},{\varvec{F}},{\varvec{W}},{\varvec{U}}) :=&L_{n,{\tilde{{\varvec{B}}}}^{(t)}}({\varvec{\Delta }}) + \frac{\eta }{2}\Vert {\varvec{\Delta }}-{\tilde{{\varvec{B}}}}^{(t)}\Vert ^2_F + {{\tilde{\lambda }}}_1 \sum _{j=1}^{p}\left\| {\varvec{A}}_{j,\cdot }\right\| _2 + {{\tilde{\lambda }}}_2 \Vert {\varvec{F}}\Vert _1 \\&+ \langle {\varvec{U}},{\varvec{\Delta }}-{\varvec{A}}\rangle + \langle {\varvec{W}}, {\varvec{\Delta }}{\varvec{H}}- {\varvec{F}}\rangle + \frac{\rho }{2}\Vert {\varvec{\Delta }}-{\varvec{A}}\Vert ^2_F + \frac{\rho }{2}\Vert {\varvec{\Delta }}{\varvec{H}}-{\varvec{F}}\Vert ^2_F, \end{aligned}$$

where \({\varvec{U}}\) and \({\varvec{W}}\) represent dual variables and \(\rho >0\) is a fixed parameter. Let \({\tilde{{\varvec{B}}}}^{(t+1,s)}\) be the sth update in the ADMM to compute \({\tilde{{\varvec{B}}}}^{(t+1)}\). Then, \({\tilde{{\varvec{B}}}}^{(t+1)}\) for \(t=0,1,2,\ldots\) is obtained by iterating the following updates: starting with \({\varvec{A}}^{(0)} = {\varvec{U}}^{(0)} ={\varvec{0}_{p \times G}}\) and \({\varvec{F}}^{(0)} = {\varvec{W}}^{(0)}={\varvec{0}_{p \times \frac{G(G-1)}{2}}}\), we repeat for \(s=1,\ldots ,S\),

$$\begin{aligned} {\tilde{{\varvec{B}}}}^{(t+1, s)}&=\mathop {\mathrm {arg\,min}}\limits _{{\varvec{\Delta }}}K^{(t)}\left( {\varvec{\Delta }},{{\varvec{A}}}^{(s-1)},{{\varvec{F}}}^{(s-1)},{{\varvec{W}}}^{(s-1)},{{\varvec{U}}}^{(s-1)}\right) \nonumber \\ {{\varvec{A}}}^{(s)}&=\mathop {\mathrm {arg\,min}}\limits _{{\varvec{A}}}K^{(t)} \left( {\tilde{{\varvec{B}}}}^{(t+1,s)},{\varvec{A}},{{\varvec{F}}}^{(s-1)},{{\varvec{W}}}^{(s-1)},{{\varvec{U}}}^{(s-1)} \right) \nonumber \\ {{\varvec{F}}}^{(s)}&=\mathop {\mathrm {arg\,min}}\limits _{{\varvec{F}}} K^{(t)} \left( {\tilde{{\varvec{B}}}}^{(t+1,s)},{{\varvec{A}}}^{(s)},{\varvec{F}},{{\varvec{W}}}^{(s-1)},{{\varvec{U}}}^{(s-1)} \right) \nonumber \\ {{\varvec{W}}}^{(s)}&=\ {{\varvec{W}}}^{(s-1)} + { \rho \left( {\tilde{{\varvec{B}}}}^{(t+1,s)}{\varvec{H}}-{{\varvec{F}}}^{(s)} \right) }, \nonumber \\ {{\varvec{U}}}^{(s)}&=\ {{\varvec{U}}}^{(s-1)} + {\rho \left( {\tilde{{\varvec{B}}}}^{(t+1,s)}-{{\varvec{A}}}^{(s)} \right) }, \end{aligned}$$
(11)

and \({\tilde{{\varvec{B}}}}^{(t+1)}\) is updated as \({\tilde{{\varvec{B}}}}^{(t+1)} = {\tilde{{\varvec{B}}}}^{(t+1,S)}\), where the derivations of each update in (11) can be found in section “ADMM update”. We set \(S=50\), \(\rho =1\), and the maximum number of iterations \(T=200\) based on preliminary analysis. The proposed ADMM algorithm is summarized in Algorithm 1. In our simulation and real data examples, the algorithm achieved fast convergence, completing in less than one second on average on an Intel Xeon (2.20 GHz) processor. See also section “Computational time comparison”.

Algorithm 1: ADMM algorithm for the fused group Lasso logistic regression in (6).

ADMM update

In this section, we provide details of the ADMM updates presented in (11).

Update for \({\tilde{{\varvec{B}}}}^{(t+1,s)}\): For simplicity, let

$$\begin{aligned} K^{(t,s-1)}\left( {\varvec{\Delta }}\right) :=K^{(t)} \left( {\varvec{\Delta }},{\varvec{A}}^{(s-1)}, {\varvec{F}}^{(s-1)}, {\varvec{W}}^{(s-1)}, {\varvec{U}}^{(s-1)}\right) . \end{aligned}$$

By the convexity, \({\tilde{{\varvec{B}}}}^{(t+1,s)}\) must satisfy

$$\begin{aligned} \frac{\partial K^{(t,s-1)}\left( {\varvec{\Delta }}\right) }{\partial {\varvec{\Delta }}} \mid _{{\varvec{\Delta }}={\tilde{{\varvec{B}}}}^{(t+1,s)}} = {\varvec{0}_{p \times G}}. \end{aligned}$$

Thus, it holds that

$$\begin{aligned} {\tilde{{\varvec{B}}}}^{(t+1,s)}\left( \rho {\varvec{H}}{\varvec{H}}^\top + \left( \eta +\rho \right) {\varvec{I}_{G}}\right) = {\varvec{V}}, \end{aligned}$$

where \({\varvec{V}}:= -\nabla L_n \left( \tilde{{\varvec{B}}}^{(t)}\right) +\eta \tilde{{\varvec{B}}}^{(t)} + \rho {\varvec{A}}^{(s-1)} + \rho {\varvec{F}}^{(s-1)} {\varvec{H}}^\top - {\varvec{U}}^{(s-1)} - {\varvec{W}}^{(s-1)} {\varvec{H}}^\top .\)
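
In R, this \({\varvec{\Delta }}\)-update reduces to a single \(G \times G\) linear solve, as in the following sketch (the matrix \({\varvec{V}}\) is assumed to be formed as defined above; names are hypothetical).

```r
# A sketch of the Delta-update: solve B (rho H H' + (eta + rho) I_G) = V for B,
# where V is the p x G matrix defined above.
update_B <- function(V, H, rho, eta) {
  G <- nrow(H)
  M <- rho * H %*% t(H) + (eta + rho) * diag(G)   # G x G, symmetric positive definite
  V %*% solve(M)                                  # the p x G solution
}
```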

Update for \({\varvec{A}}^{(s)}\): \({\varvec{A}}^{(s)}\) is defined by

$$\begin{aligned} {\varvec{A}}^{(s)}&=\mathop {\mathrm {arg\,min}}\limits _{{\varvec{A}}} \ -\langle {\varvec{U}}^{(s-1)}, {\varvec{A}} \rangle + \frac{\rho }{2}\Vert {\tilde{{\varvec{B}}}}^{(t+1,s)}- {\varvec{A}}\Vert ^2_F +{{\tilde{\lambda }}_1}\sum _{j=1}^{p}\Vert {\varvec{A}}_{j,\cdot }\Vert _2, \end{aligned}$$

which is separable with respect to j’s. For each \(1\le j \le p\), let \({\varvec{s}}_{j}\) be the subgradient of \(\Vert {\varvec{x}}\Vert _2\) at \({\varvec{x}}= {\varvec{A}}^{(s)}_{j.}\), i.e.,

$$\begin{aligned} {\varvec{s}}_j={\left\{ \begin{array}{ll} \text{ some } {\varvec{x}} \in {\mathbb {R}}^{1 \times G} \text{ with } \Vert {\varvec{x}}\Vert _2 \le 1 &{} \text{ if } {\varvec{A}}^{(s)}_{j.} = {\varvec{0}^\top _G} \\ {\varvec{A}}^{(s)}_{j,\cdot } /\Vert {\varvec{A}}^{(s)}_{j,\cdot } \Vert _2 &{} \text{ otherwise. } \end{array}\right. } \end{aligned}$$

By the convexity, it holds that for \(1 \le j \le p\),

$$\begin{aligned} -\rho \left( {\tilde{{\varvec{B}}}}^{(t+1,s)}_{j,\cdot } - {\varvec{A}}^{(s)}_{j,\cdot }\right) - {\varvec{U}}^{(s-1)}_{j,\cdot } + {{\tilde{\lambda }}_1} {\varvec{s}}_j ={\varvec{0}^\top _{G}}. \end{aligned}$$

By the definition of \({\varvec{s}}_j\), we obtain that for \(j=1,\ldots ,p\),

$$\begin{aligned} {\varvec{A}}^{(s)}_{j,\cdot }&=\max \left( 0,1-\frac{{{\tilde{\lambda }}_1}/ \rho }{ \Vert {\tilde{{\varvec{B}}}}^{(t+1,s)}_{j,\cdot } + {\varvec{U}}^{(s-1)}_{j,\cdot }/\rho \Vert _2}\right) \cdot \Bigg (\tilde{ {\varvec{B}}}^{(t+1,s)}_{j,\cdot } + \frac{{\varvec{U}}^{(s-1)}_{j,\cdot }}{\rho } \Bigg ). \end{aligned}$$
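
This update is a row-wise group soft-thresholding, sketched below (names are hypothetical; zero rows of the argument are mapped to zero rows).

```r
# A sketch of the A-update: row-wise group soft-thresholding of B + U / rho.
update_A <- function(B_new, U, lambda1, rho) {
  Z <- B_new + U / rho
  shrink <- pmax(0, 1 - (lambda1 / rho) / sqrt(rowSums(Z^2)))
  Z * shrink   # row j of Z is scaled by max(0, 1 - (lambda1/rho) / ||Z_j||_2)
}
```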

Update for \({\varvec{F}}^{(s)}\): \({\varvec{F}}^{(s)}\) can be derived using the definition of subgradient of \(\ell _1\) norm. For each \((i,j) \in \{1,\ldots , p\} \times \left\{ 1,\ldots ,\frac{G(G-1)}{2}\right\}\), it must hold that

$$\begin{aligned} -W^{(s-1)}_{ij} -\rho \left[ {{\tilde{{\varvec{B}}}}}^{(t+1,s)} {\varvec{H}}-{\varvec{F}}^{(s)}\right] _{ij} + {{{\tilde{\lambda }}}_2} \zeta _{ij} = 0, \end{aligned}$$

where \(\zeta _{ij}\) is defined by

$$\begin{aligned} \zeta _{ij} = {\left\{ \begin{array}{ll} 1 &{} \text{ if }\quad F^{(s)}_{ij} > 0\\ -1 &{} \text{ if }\quad F^{(s)}_{ij} < 0\\ \text{ some } x \text{ with } |x| \le 1 &{} \text{ if }\quad F^{(s)}_{ij} =0. \end{array}\right. } \end{aligned}$$

By the definition, \({\varvec{F}}^{(s)}=(F^{(s)}_{ij})_{1\le i\le p,\ 1\le j\le \frac{G(G-1)}{2}}\) is given by

$$\begin{aligned} F^{(s)}_{ij} = {\left\{ \begin{array}{ll} \left[ {{\tilde{{\varvec{B}}}}}^{(t+1,s)} {\varvec{H}}\right] _{ij} + \frac{W^{(s-1)}_{ij}}{\rho } - \frac{ {{{\tilde{\lambda }}}_2} }{\rho } &{} \text{ if } {\breve{F}^{(s)}_{ij}} > \frac{ {{{\tilde{\lambda }}}_2} }{\rho } \\ \left[ {{\tilde{{\varvec{B}}}}}^{(t+1,s)} {\varvec{H}}\right] _{ij} + \frac{W^{(s-1)}_{ij}}{\rho } + \frac{{{{\tilde{\lambda }}}_2}}{\rho } &{} \text{ if } {\breve{F}^{(s)}_{ij}} < \frac{-{{{\tilde{\lambda }}}_2}}{\rho }\\ 0 &{} \text{ otherwise }\\ \end{array}\right. } \end{aligned}$$

where \({\breve{F}^{(s)}_{ij}} = \left[ {{\tilde{{\varvec{B}}}}}^{(t+1,s)} {\varvec{H}}\right] _{ij} + \frac{W^{(s-1)}_{ij}}{\rho }\).
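
This update is an elementwise soft-thresholding, as in the following sketch (names are hypothetical).

```r
# A sketch of the F-update: elementwise soft-thresholding of B H + W / rho.
update_F <- function(B_new, H, W, lambda2, rho) {
  Z <- B_new %*% H + W / rho
  sign(Z) * pmax(abs(Z) - lambda2 / rho, 0)
}
```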

Simulation study

In this section, we present empirical results through simulation analysis. A main objective of the simulation analysis is to investigate the finite sample performance of the proposed method (Debiased Fused Group Lasso; DFGL) in testing the following hypotheses:

$$\begin{aligned} H_0: \beta ^{(1)}_j&= \cdots = \beta ^{(G)}_j = 0\ \text{ vs } \ H_{1}: \text{ not } H_{0}, \end{aligned}$$
(13)
$$\begin{aligned} H_0: \beta ^{(1)}_j&= \cdots = \beta ^{(G)}_j\ \text{ vs } \ H_{1}: \text{ not } H_{0}. \end{aligned}$$
(14)

To investigate the advantages of the proposed tests over existing \(\ell _1\) penalized approaches that do not use a fusion penalty, we compared the proposed method with the following methods:

  • DL (Debiased Lasso) : Chi-squared test based on applying debiased Lasso17 separately for each subpopulation

  • DL-B: Bonferroni correction using p-values obtained by applying debiased Lasso17 separately for each subpopulation

  • DL-E (Debiased Lasso based on the Exact inverse of the information matrix): Chi-squared test based on applying debiased Lasso36 separately for each subpopulation, where this bias-correction method is developed for the scenario where the sample size is greater than the number of regressors

  • DL-E-B: Bonferroni correction using p-values obtained by applying debiased Lasso36 separately for each subpopulation

We also compared the proposed method with DR-B, Bonferroni correction based on debiased Ridge39. R code (https://github.com/luxia-bios/DebiasedLassoGLMs) was used to implement DL-E, and R package hdi40 was used to implement DL and DR-B. We also used R code (https://web.stanford.edu/~montanar/ssLasso/) to solve the quadratic programming (7). Following Javanmard and Montanari19, we set \(\mu _g\) in (7) as \(\mu _g =c\sqrt{{\log p}/{n_g}}\) for some positive constant c. Specifically, we set \(c=0.7\) based on the results of the sensitivity analysis of c summarized in the Supplementary material. 5-fold cross-validation was used to determine the regularization parameters for the proposed penalization method.

Simulation setting

In our simulation study, we fix \(G=7\) and simulate \(\textbf{x}^{(g)}_{i}\) for all (gi) from the p-dimensional multivariate normal distribution with mean \({{\varvec{0}}_{p}}\) and covariance matrix \({\varvec{\Sigma }_{x}}\). Specifically, we consider the following two different covariance matrices: (1) AR(1): \([{\varvec{\Sigma }_{x}}]_{ij}=0.5^{|i-j|}\); or (2) Block: \({\varvec{\Sigma }_{x}}\) is a \(p \times p\) block diagonal matrix consisting of p/4 identical blocks, where the \(4 \times 4\) sub-block matrix, denoted by \({\varvec{\Sigma }_b}\), is a Toeplitz matrix such that \([{\varvec{\Sigma }_b}]_{ij}=0.5^{|i-j|}\), \(i=1,\ldots ,4;~ j=1,\ldots ,4\). Then, the response variables are generated independently as follows: for \(1 \le g \le G, \, 1 \le i \le n_g\),

$$\begin{aligned} y^{(g)}_{i} {\sim } \text{ Bernoulli }\left( \frac{\exp ([\textbf{x}^{(g)}_{i}]^\top {\varvec{\beta }}^{(g)})}{ 1+ \exp ([\textbf{x}^{(g)}_{i}]^\top {\varvec{\beta }}^{(g)}) }\right) , \end{aligned}$$

where \({\varvec{\beta }}_{(j)}=\left( \beta ^{(1)}_{j},\ldots , \beta ^{(G)}_{j}\right) ^\top \in {\mathbb {R}}^{G}\) is set as

$$\begin{aligned} {\varvec{\beta }}_{(j)}:= {\left\{ \begin{array}{ll} (-\,0.6, -\,0.6, 0.6, 0.6, 0.6, 0.6, 0.6) &{} j=1 \\ (0.6, 0.6, -\,0.6, -\,0.6, 0.6, 0.6, 0.6) &{} j=2 \\ (0.6, 0.6, 0.6, 0.6, -\,0.4, -\,0.4, 0.6) &{} j=3 \\ (-\,0.4, 0.6, 0.6, 0.6, 0.6, 0.6, -\,0.4) &{} j=4 \\ (0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4) &{} j=5 \\ (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0) &{} j=6,7 \\ -\,(0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4) &{} j=8 \\ (1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5) &{} j=9,10 \\ (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0) &{} j=11 \\ (2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5) &{} j=12 \\ (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0) &{} otherwise\\ \end{array}\right. }. \end{aligned}$$

We set \(n_1=\cdots =n_G=m\) and consider the following two specifications of (mp): \((m,p)=(200,80)\) or \((m,p)=(300,120)\). For each case, we simulate \(M=100\) Monte Carlo samples and summarize the results over 100 replications.
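
For reproducibility, a minimal sketch of this data-generating process is given below; it assumes the MASS package for multivariate normal sampling, and the function name is hypothetical.

```r
# A sketch of the data-generating process in this section, assuming the MASS
# package; beta_g is the g-th column of the coefficient matrix specified above,
# m is the common group sample size, and p is assumed divisible by 4.
library(MASS)

simulate_group <- function(m, p, beta_g, type = c("AR1", "Block")) {
  type <- match.arg(type)
  if (type == "AR1") {
    Sigma <- 0.5^abs(outer(1:p, 1:p, "-"))
  } else {
    block <- 0.5^abs(outer(1:4, 1:4, "-"))
    Sigma <- kronecker(diag(p / 4), block)   # p/4 identical 4 x 4 Toeplitz blocks
  }
  X <- mvrnorm(m, mu = rep(0, p), Sigma = Sigma)
  prob <- 1 / (1 + exp(-as.vector(X %*% beta_g)))
  y <- rbinom(m, size = 1, prob = prob)
  list(X = X, y = y)
}
```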

Simulation results

First, we present simulation results when testing for homogeneity (14). Next, we present simulation results when testing for overall significance (13).

Testing homogeneity

We consider \(j=1,2,3,4\) and \(j=5,8,9,10,12\) to measure powers and type I errors, respectively. Tables 1 and 2 record the powers and type I errors of different methods when the significance level \(\alpha\) is set to \(\alpha =0.05\). As shown in Table 1, DFGL outperforms the other approaches in terms of power. DL ranks second in terms of power in most cases, but fails to control type I errors when the covariates of interest have strong signals. Competing approaches based on a debiased Lasso using the exact inverse of the information matrix also produce type I errors higher than the nominal level when the covariate with the strongest signal is considered. In contrast to the desparsified Lasso-based chi-squared test procedures, our proposed DFGL and other Bonferroni-corrected test procedures, including the Ridge-based DR-B, yield type I errors less than or close to the nominal level \(\alpha =0.05\), but these Bonferroni-corrected tests are conservative, as shown in Table 1. Note that the Ridge-based approaches are known to be conservative in various contexts19,41.

Table 1 Power for testing \(H_0:\beta ^{(1)}_{j}=\cdots =\beta ^{(G)}_{j}\) vs \(H_1:\) not \(H_0\) at \(\alpha =0.05\), where \(n_1,\ldots ,n_G\) are set as \(n_1=\cdots =n_G=m\).
Table 2 Type I error for testing \(H_0:\beta ^{(1)}_{j}=\cdots =\beta ^{(G)}_{j}\) vs \(H_1:\) not \(H_0\) at \(\alpha =0.05\), where \(n_1,\ldots ,n_G\) are set as \(n_1=\cdots =n_G=m\).

Testing significance

In this subsection, we examine the performance of the proposed significance test. We consider \(j=1,2,3,4,5,8,9,10,12\) to measure power, while we consider \(j=6,7,11,15,20\) to measure type I error. Tables S5 and S6 in the Supplementary materials report the performance of each method in terms of power and Type I error, respectively, when \(\alpha =0.05\). The proposed method generally has higher power compared to the competing approaches. In particular, the proposed method is superior to the competing approaches in terms of higher power when the covariates of interest have relatively weak signals. This result was also observed in a previous study41, which investigated debiased group Lasso for linear regression. When sample sizes are set as \(n_1=\ldots =n_G=300\), DL and the proposed DFGL sometimes provide similar power. However, DL fails to control Type I errors when the covariate of interest is correlated with covariates with strong signals. The proposed DFGL and the other methods, except DL, have type I errors less than or close to the significance level in all cases considered, but the other methods are conservative.

Multiple testing

We evaluate the empirical performance of the proposed testing procedures in the context of multiple testing. We consider the following two multiple testing problems, (15) and (16), respectively:

$$\begin{aligned} H_{0,j}: \beta ^{(1)}_j&= \cdots = \beta ^{(G)}_j \quad \text{ vs } \quad H_{1}: \text{ not } H_{0,j}, \quad j=1,\ldots ,p \end{aligned}$$
(15)
$$\begin{aligned} H_{0,j}: \beta ^{(1)}_j&= \cdots = \beta ^{(G)}_j = 0 \quad \text{ vs } \quad H_{1}: \text{ not } H_{0,j}, \quad j=1,\ldots ,p. \end{aligned}$$
(16)

To control the familywise error rate (FWER), we adjust p-values using the Bonferroni-Holm (BH) procedure42. We also apply the BH procedure to p-values from DL and those from DL-E. We do not consider the methods using Bonferroni correction, i.e., DL-B, DL-E-B, and DR-B, for multiple testing. This is because they are too conservative, as observed in sections “Testing homogeneity” and “Testing significance”.

When considering multiple testing for significance (16), we measure FWER and power as follows:

  • FWER: The percentage of the cases where \(H_{0,j}\) is rejected for some \(j \in S^c\),

  • Power: Average of the empirical power \(\sum _{j \in {S}} I\left( H_{0,j} \text { is rejected }\right) / s\),

where \(S=\{j: H_{0,j} \text{ is false }\}\) with cardinality \(s=|S|\). Power and FWER are measured in the same way when considering homogeneity tests. Table 3 summarizes the results. The proposed DFGL has the highest power in all cases, while providing FWER below the nominal level \(\alpha =0.05\). Among the competing methods, DL fails to control the FWER for the homogeneity test (15), and DL-E has poor power, especially for the homogeneity test (15).
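
For reference, the per-replicate FWER indicator and empirical power underlying Table 3 can be computed as in the following sketch, assuming pvals is the vector of p-values for the p hypotheses in one replicate and S is the index set of false null hypotheses (names are hypothetical); the Holm adjustment is applied via p.adjust.

```r
# A sketch of the per-replicate FWER indicator and empirical power, assuming
# pvals holds the p p-values of one replicate and S indexes the false nulls;
# averaging fwer_event and power over replicates gives the reported FWER/power.
eval_multiple_testing <- function(pvals, S, alpha = 0.05) {
  rejected <- which(p.adjust(pvals, method = "holm") <= alpha)
  fwer_event <- any(!(rejected %in% S))   # at least one true null rejected
  power <- mean(S %in% rejected)          # fraction of false nulls rejected
  c(fwer_event = fwer_event, power = power)
}
```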

Table 3 Performances of multiple testing at \(\alpha =0.05\), where \(n_1,\ldots ,n_G\) are set as \(n_1=\cdots =n_G=m\).

Simulation study for an imbalanced design

In this section, we perform a simulation analysis to assess the performance of DFGL when the sample sizes for the groups vary. Specifically, we consider the following model parameters: \(G=3\), \(p=300\), \(n_1=90\), \(n_2=70\), and \(n_3=40\), reflecting the dimension of a subset of the CCLE data analyzed in section “Application to the CCLE data”. For each group, we simulate the covariates \(\textbf{x}^{{(g)}}_i\) from p-dimensional multivariate normal distribution with mean \({{\varvec{0}}_{p}}\) and covariance matrix \({\varvec{\Sigma }_{x}}\) where \([{\varvec{\Sigma }_{x}}]_{ij}=0.5^{|i-j|}\). We set \(s=|S|=6\) where \(S=\{j:\Vert {\varvec{\beta }}_{(j)}\Vert _2 > 0 \}\) and randomly draw elements of S from \(\{1,\ldots ,p\}\). As a result, we obtained \(S=\{85,129,167,187,211,270\}\), and we set \({\varvec{\beta }}_{(j)}=(\beta ^{(1)}_{j},\cdots , \beta ^{(G)}_{j})^\top \in {\mathbb {R}}^{G}\) as follows:

$$\begin{aligned} {\varvec{\beta }}_{(j)}:= {\left\{ \begin{array}{ll} (1.5,1.5,1.5) &{} j=85 \\ (2.0,2.0,2.0) &{} j=129 \\ (1.0,1.0,-1.0) &{} j=167 \\ (-1.0,1.0,1.0) &{} j=187 \\ (0.8,0.8,-0.8) &{} j=211 \\ (0.8,-0.8,0.8) &{} j=270 \\ (0.0,0.0,0.0) &{} otherwise\\ \end{array}\right. }. \end{aligned}$$

Due to the relatively small sample sizes (\(n_1=90\), \(n_2=70\), and \(n_3=40\)), we consider DPL (Debiased Pooled-Lasso) as an additional competing approach. DPL refers to a chi-squared test based on applying bias correction17 to Pooled-Lasso, which adapts the Lasso to analyze samples from different groups together, i.e., \(\{\textbf{y}, \textbf{X}\}\). Here, \(\textbf{y}= \left[ (\textbf{y}^{(1)})^\top , \ldots , (\textbf{y}^{(G)})^\top \right] ^\top\), and \(\textbf{X}=\text{ diag } \left( \textbf{X}^{(1)},\ldots , \textbf{X}^{(G)}\right)\) represents a block diagonal matrix consisting of \(\textbf{X}^{(1)},\ldots , \textbf{X}^{(G)}\). Note that DL-E cannot be used in this simulation analysis because \(p > n_g\) for \(g=1,2,3\).

Testing homogeneity

We consider \(j=167,187,211,270\) and \(j=1,2,3,85,129\) to measure powers and type I errors, respectively. Tables 4 and 5 show the powers and type I errors of different methods when \(\alpha =0.05\). DFGL outperforms the other methods in terms of higher power. In addition, DFGL successfully controls Type I errors; however, Lasso-based methods, including DL, DPL, and DL-B, produce Type I errors higher than the significance level when testing the null hypothesis \(H_0: \beta ^{(1)}_{129}=\beta ^{(2)}_{129}=\beta ^{(3)}_{129}\) where \(\beta ^{(1)}_{129}=\beta ^{(2)}_{129}=\beta ^{(3)}_{129}=2\).

Table 4 Power for testing \(H_0:\beta ^{(1)}_{j}=\cdots =\beta ^{(G)}_{j}\) vs \(H_1:\) not \(H_0\) at \(\alpha =0.05\).
Table 5 Type I error for testing \(H_0:\beta ^{(1)}_{j}=\cdots =\beta ^{(G)}_{j}\) vs \(H_1:\) not \(H_0\) at \(\alpha =0.05\).

Testing significance

We consider \(j=85,129,167,187,211,270\) and \(j=1,2,3\) to measure powers and type I errors, respectively. Tables 6 and 7 show the powers and type I errors of different methods when \(\alpha =0.05\). While all methods produce type I errors close to or below the significance level, DFGL has higher powers compared to all other methods.

The observed higher power of DFGL in testing for significance and homogeneity compared to Lasso-based approaches is attributed to the fused group Lasso regularization. This regularization method allows for increasing sample sizes in regression parameter estimation by combining subpopulations, leading to more accurate statistical inference. These results suggest the effectiveness of our DFGL in analyzing data characterized by limited sample sizes for groups, such as the CCLE data.

Table 6 Power for testing \(H_0:\beta ^{(1)}_{j}=\cdots =\beta ^{(G)}_{j}\) vs \(H_1:\) not \(H_0\) at \(\alpha =0.05\).
Table 7 Type I error for testing \(H_0:\beta ^{(1)}_{j}=\cdots =\beta ^{(G)}_{j}\) vs \(H_1:\) not \(H_0\) at \(\alpha =0.05\).

Multiple testing

In this section, we consider the two multiple testing problems (15) and (16), respectively. As in section “Multiple testing”, we use the BH procedure to control the familywise error rate. In this analysis, we do not consider DR-B because it is conservative, as observed in sections “Testing homogeneity” and “Testing significance”. Table 8 summarizes the results at the \(\alpha =0.05\) significance level. Despite the limited sample size, which results in relatively low power, especially when testing for homogeneity, DFGL outperforms the other methods in terms of power. All methods provide FWER below the nominal level \(\alpha =0.05\).

Table 8 Performances of multiple testing at \(\alpha =0.05\).

Computational time comparison

In this subsection, we discuss the computational cost of computing the DFGL estimate and compare it with the costs of computing the three estimates used in DL, DL-E, and DR-B. To compute the Lasso or Ridge penalized estimates, we use the R package glmnet. For the penalty parameters, we consider 100 candidates for \((\lambda _1,\lambda _2)\) for DFGL, and 100 candidates for the penalty parameter for DL, DL-E, and DR-B. The penalty parameters are determined using five-fold cross-validation in all methods. Parallel computing is employed to compute estimates for each subpopulation separately in DL and DL-E.

To assess computational efficiency, we examine a simulation scenario with the AR(1) covariance matrix for the covariates, \(n_1=\cdots =n_G=200\), and \(p=80\). Table 9 presents the computational times required to compute the four estimates, averaged over 100 Monte Carlo simulations and executed on an Intel Xeon (2.20 GHz) processor. A noteworthy observation from Table 9 is that while DFGL requires slightly more time than the other methods, it remains computationally fast. This is explained by the fact that DFGL addresses a more complex large-scale problem, with a combined penalty of group Lasso-type and fusion-type components, whereas the other methods involve a simpler Lasso or Ridge penalty.

Table 9 Average computational times (in seconds) for implementing estimates.

Application to the CCLE data

The dataset

In this section, we present real data analyses in which our method is applied to the CCLE data. The CCLE data contain information on cancer treatment responses for 24 drugs on 504 cancer cell lines of 23 cancer types, where the transcription profile of each cell line is characterized by the measured expression levels of 19,177 probes. Cancer cell lines are widely used to understand cancer biology and test the efficacy of novel therapies43, and are also used to identify predictive biomarkers for anticancer drug sensitivity13,44. In our CCLE data analysis, we consider three cancer types that each include at least 30 cancer cell lines: Lymphoid, Lung, and Skin. Our main objective is to check whether a specific gene is significantly associated with the binary drug response in at least one of these cancer types, and whether such significant genes have heterogeneous effects on a drug across the cancer types. Analyses similar to ours could be useful in two ways. First, examining the significance of the effects of a gene across different cancer types may lead to the identification of potential gene expression markers of drug response that can be used for multiple cancer types. Such versatile gene expression markers are valuable in research16. Second, studying the heterogeneity of a gene’s effects across cancer types can provide insights into understanding differences in drug sensitivity across cancer types.

Following Park et al.45, we classify cancer cell lines into two categories for each drug. If a drug response value (IC50) is less than 0.5, the cancer cell line is assigned to the “sensitive” category; otherwise, it is assigned to the “resistant” category. Under this classification, for most drugs, the cell lines of at least one of the three cancer types fall almost entirely into a single category. For example, all lymphoid cancer cell lines are resistant to Erlotinib, and only one of the lung cancer cell lines is sensitive to Panobinostat. After removing such imbalanced drugs from our analysis, we consider the following five drugs: 17-AAG, AZD6244, Irinotecan, PD-0325901, and Topotecan. Table 10 presents the number of sensitive cell lines for each of these five drugs.
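For concreteness, the dichotomization rule can be written as a one-line transformation in R; the data frame resp and its column names below are hypothetical placeholders.

    ## Minimal sketch, assuming a data frame resp with columns cell_line, cancer_type, drug, ic50
    resp$sensitive <- as.integer(resp$ic50 < 0.5)      # 1 = sensitive, 0 = resistant
    ## Inspect class balance per drug and cancer type to flag imbalanced drugs
    with(resp, table(drug, cancer_type, sensitive))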

Table 10 The number of sensitive cell lines for each drug across cancer types.

The analysis of ultra-high-dimensional (UHD) data, such as the CCLE data, involves several challenges, including high collinearity, spurious correlation, noise accumulation, and a substantial computational burden. To alleviate these difficulties inherent in UHD data, it is desirable to reduce the dimensionality of the feature space46,47. Similar to sure independence screening46, we removed genes that are relatively irrelevant to each drug before fitting the models. The gene screening procedure is as follows:

1. We selected the top 3,000 genes with the largest sample variances.

2. For \(j=1,\ldots , 3000\), we applied logistic regression using each gene and two dummy variables indicating cancer types.

3. We selected the top 300 genes with the smallest p-values for the significance test.

These screening procedures have been used in the literature on high-dimensional regressions (e.g. Park et al.48, Wang et al.49, Li et al.45).
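A minimal R sketch of the three screening steps is given below; expr (a cell line by gene expression matrix), y (the binary drug response), and type (a cancer-type factor) are assumed placeholders, and the cut-offs of 3,000 and 300 genes follow the procedure described above.

    ## Minimal sketch of the screening procedure (assumed objects: expr, y, type)
    ## Step 1: keep the 3,000 genes with the largest sample variances
    top_var <- order(apply(expr, 2, var), decreasing = TRUE)[1:3000]
    expr_v <- expr[, top_var]
    ## Step 2: per-gene logistic regression with cancer-type dummy variables
    pvals <- sapply(seq_len(ncol(expr_v)), function(j) {
      fit <- glm(y ~ expr_v[, j] + type, family = binomial)
      summary(fit)$coefficients[2, 4]                  # p-value for the gene's coefficient
    })
    ## Step 3: keep the 300 genes with the smallest p-values
    expr_screened <- expr[, top_var[order(pvals)[1:300]]]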

While we analyze the set of \(p=300\) genes obtained from the screening procedure described above, the number of genes still exceeds the sample sizes for the cancer types. As a result, standard maximum likelihood estimation cannot be used for statistical inference in a model containing all \(p=300\) genes. Lasso-based approaches that do not incorporate a fusion penalty, including the debiased Lasso17, can be considered for testing the heterogeneity or significance of a gene’s effects. However, approaches based on Lasso penalization may yield inaccurate results when applied to data with limited sample sizes, as is the case for our CCLE dataset. This limitation was demonstrated in our simulation study in section “Simulation study for an imbalanced design”. To achieve more accurate statistical inference, we use our DFGL, which is based on a fusion penalty that combines subpopulations.

Results

First, we identify important genes using the proposed penalized estimation (5). Figure 1 shows the estimated cancer-specific coefficients for selected genes. For each drug, the penalized estimates of the coefficients for most genes are similar for at least two of the three cancer types. Next, we apply the proposed simultaneous significance test to investigate whether the genes detected through the penalization have significant effects on the drug after adjusting for the effects of the other genes. Figure 2 shows the debiased estimates corresponding to genes identified by the proposed significance test at the significance level \(\alpha =0.05\). Comparing Figs. 1 and 2, we found that most genes selected by the penalization are also significant at the 0.05 level. However, some genes that were not detected by the penalization were found to be significant for those drugs. Such genes appear to have a significant effect in only some of the cancer types or to have weak signals across cancer types, as shown in Fig. 2. For example, the absolute value of the estimated effect of BASP1 on PD-0325901 is approximately 0.25 in skin cancer and less than 0.04 in the other cancer types.

Figure 1
figure 1

Heatmaps for fused group Lasso estimates \({\hat{\beta }}^{(g)}_{j}\). Genes whose estimated regression coefficients are all zero are omitted from the heatmap. The heatmap was created using the ggplot2 package50 (version 3.4.3; https://cran.r-project.org/web/packages/ggplot2/index.html) in R software51 (version 4.2.2 for Windows; https://cran.r-project.org/bin/windows/base/old/).
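For reference, a heatmap of this kind can be produced with geom_tile in ggplot2; the coefficient data frame below is a hypothetical placeholder rather than our actual estimates.

    ## Minimal sketch of a coefficient heatmap with ggplot2 (hypothetical data)
    library(ggplot2)
    coef_df <- expand.grid(gene = paste0("gene", 1:5),
                           cancer = c("Lymphoid", "Lung", "Skin"))
    coef_df$estimate <- rnorm(nrow(coef_df), sd = 0.2)  # placeholder coefficient estimates
    ggplot(coef_df, aes(x = cancer, y = gene, fill = estimate)) +
      geom_tile() +
      scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
      labs(x = "Cancer type", y = "Gene", fill = "Estimate")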

We also use DL, DL-B, and DR-B for comparison with DFGL. In the simulation in section “Simulation study for an imbalanced design”, DPL shows performance similar to that of DL, so we do not consider DPL in this analysis. Note that DL-E cannot be used here because the number of genes is larger than the sample sizes for the cancer types. Figure 3 compares the results of these methods in terms of testing the significance of specific genes. We observe that DR-B appears to be too conservative, as in our simulation analysis. However, both the Lasso-based approaches and our method identify some common genes. For example, as shown in Fig. 3, all methods except DR-B indicate the significance of the effects of SLFN11 on Irinotecan and Topotecan across cancer types. SLFN11 was previously identified as relevant to Irinotecan and Topotecan when a penalized mixture regression was applied to the CCLE data5.

Figure 2
figure 2

Heatmaps for debiased estimates \(\hat{b}^{(g)}_j\). Genes whose p-values for the simultaneous significance test exceed 0.05 are omitted from the heatmap. The heatmap was created using the ggplot2 package50 (version 3.4.3; https://cran.r-project.org/web/packages/ggplot2/index.html) in R software51 (version 4.2.2 for Windows; https://cran.r-project.org/bin/windows/base/old/).

Figure 3
figure 3

The results of testing significance for selected genes. The heatmap was created using the ggplot2 package50 (version 3.4.4; https://cran.r-project.org/web/packages/ggplot2/index.html) in R software51 (version 4.3.1 for Windows; https://cran.r-project.org/bin/windows/base/old/).

When performing the Bootstrap-based sensitivity analysis in section S4.4 of the Supplementary materials, we observe that the following drug-gene pairs are relatively frequently identified by our significance test: SLFN11-Topotecan, SLFN11-Irinotecan, NQO1-17-AAG, SIRPA-AZD6244, and ETV4-PD-0325901. These results support the significance of SLFN11 for Topotecan and Irinotecan. Notably, the remaining three gene-drug pairs were not detected by the Lasso-based approaches at \(\alpha =0.05\), as shown in Fig. 3. However, NQO1 was identified as an important gene for 17-AAG in previous analyses of the CCLE data5,52, and ETV4 was identified as a gene related to PD-0325901 in the CCLE data analysis performed by Liang et al.52. In addition, expression of ETV4, which we detect as significant for PD-0325901, might modulate sensitivity to the MEK inhibitor trametinib53. Hayashi et al.54 discovered that activation of MEK is induced by ligation of SIRP\(\beta\), and we find SIRP\(\alpha\) (SIRPA) to be significant for AZD6244. Despite the empirical evidence supporting our significance test results, the gene-drug pairs NQO1-17-AAG, ETV4-PD-0325901, and SIRPA-AZD6244 were not identified by any of the other significance tests at \(\alpha =0.05\). These results suggest that, by combining different cancer types, DFGL may provide more accurate results for the significance of the association between a gene and a drug than approaches based on Lasso or Ridge. Furthermore, as shown in Fig. 2, the associations SIRPA-AZD6244 and ETV4-PD-0325901 appear to be relatively weak in specific cancer types. These relatively weak effects of SIRPA and ETV4 may cause Lasso-based approaches to fail to detect them at the 0.05 significance level. This observation is consistent with our simulation results in section “Simulation study for an imbalanced design”, which indicate that Lasso-based approaches have lower power than our approach, especially when testing the significance of effects for covariates with relatively weak signals in simulated data with small sample sizes.

Table 11 Genes identified by the proposed homogeneity test at \(\alpha =0.05\). Genes are listed in order of p-value.

Our homogeneity test shows different results for different cancer types, as summarized in Table 11; that is, some genes have heterogeneous coefficients depending on the cancer type at the significance level \(\alpha =0.05\). The estimated effect of SLFN11 on Topotecan is relatively large in lung cancer cell lines compared with the other cancer types, and the p-value of the homogeneity test for SLFN11 is less than 0.05 when Topotecan is considered. However, we note that the other methods suggest a lack of significance for the heterogeneity of the effects of SLFN11 on Topotecan at the \(\alpha =0.05\) significance level. Further investigation is needed to determine whether SLFN11 has heterogeneous effects on the response to Topotecan in patients with different cancer types. Nevertheless, in the Bootstrap-based sensitivity analysis in section S4.4 of the Supplementary materials, the association between SLFN11 and Topotecan appears to be the most significant among all gene-drug pairs in terms of heterogeneous effects across cancer types. In addition, in a previous analysis of the CCLE data5, a penalized mixture regression suggested that the effects of SLFN11 differ between some of the clusters. These results suggest that our method may uncover underlying heterogeneous effects of a gene across cancer types that are difficult to capture using Lasso- or Ridge-based methods. We also observed that some genes identified by the homogeneity test have positive estimated coefficients only in specific cancer types. For Topotecan, the estimated effect of the gene MAOA is positive only in skin cancer cell lines. Low expression of MAOA has been observed in melanoma skin cancer compared with normal samples55, whereas high expression of MAOA has been observed in lung cancer tissues56 and lymphoma57.

Conclusion

In this paper, we propose two different tests: (1) a test of the homogeneity of the effects of a covariate across different groups and (2) a test of the significance of a covariate over the groups. We develop non-asymptotic analyses for the proposed fused group Lasso and prove that the debiased test statistics admit chi-squared approximations even in the presence of high-dimensional covariates. The proposed tests generally outperform the existing bias-correction methods based on Lasso17,36 or Ridge39 in that they provide higher power while controlling the type I error well, as shown in section “Simulation study”. Through the CCLE data analysis, we observe that the proposed method can lead to meaningful scientific discoveries.

From a methodological point of view, there are several possible extensions of our method. First, although we focus on logistic regression, our tests can be applied to other generalized linear models, including linear regression and Poisson regression, and our theoretical analyses can be extended to the generalized linear model (GLM) setting. Second, we expect that the performance of the proposed tests can be improved by simultaneously estimating the inverses of the information matrices across subpopulations, as in the joint estimation of precision matrices58,59.

From the perspective of CCLE data analysis, there are several interesting directions for future research. Although our primary focus in the CCLE data analysis was on gene expression, which is known to be predictive of drug response60, other omics features, such as DNA copy number, are available in the analysis of CCL data. It is of great interest to investigate which omics data are most predictive of drug response in a specific cancer type or have heterogeneous effects on drug response across cancer types. Given the different characteristics of different types of omics data, our method may have some limitations in the analysis of multi-omics data, and a sophisticated extension of our method in the estimation and construction of test statistics will be needed for such analyses. In addition, the CCLE data include cell lines with missing responses to a drug; an extension of our method to incorporate such cell lines would be beneficial for the analysis of the CCLE data.