Abstract
For many complex cancers, a single genetic effect or a single environmental effect is not enough for effective prediction, so identifying the gene-environment interactions associated with complex diseases has become a major challenge for pathology and bioinformatics research on high-dimensional data. To address the high dimensionality, heterogeneity, and censoring of survival data, we propose a method based on the accelerated failure time (AFT) model for identifying gene-environment interactions. The novelty of the method lies in an objective function that combines a least absolute deviation (LAD) loss function with a smoothly clipped absolute deviation (SCAD) penalty function, which reduces the influence of unbalanced data and selects interaction terms that obey the strong hierarchical structure between main effects and interaction effects. The objective function is optimized with the concave-convex procedure (CCCP) algorithm. Simulation and empirical studies carried out in R show that the method robustly selects appropriate gene effects and gene-environment interaction effects, with good predictive performance and stability. The method also effectively shrinks the set of candidate variables, so the selected model is parsimonious and easy to interpret.
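As a rough illustration of the kind of objective the abstract describes, the sketch below evaluates an LAD loss plus the standard SCAD penalty for a linear model on the log-time scale. It is a minimal sketch under stated assumptions, not the paper's formulation: the function names (`scad_penalty`, `lad_loss`, `objective`), the optional observation weights, and the default `a = 3.7` are illustrative choices, and the strong-hierarchy constraint, the handling of censoring, and the CCCP solver are not reproduced here.

```python
def scad_penalty(beta, lam, a=3.7):
    """Standard SCAD penalty for a single coefficient (a > 2, lam >= 0)."""
    b = abs(beta)
    if b <= lam:
        # Lasso-like linear region near zero
        return lam * b
    if b <= a * lam:
        # Quadratic transition region
        return (2 * a * lam * b - b * b - lam * lam) / (2 * (a - 1))
    # Constant region: large coefficients are not penalized further
    return lam * lam * (a + 1) / 2


def lad_loss(log_times, fitted, weights=None):
    """Weighted sum of absolute residuals on the log-time scale.

    The weights are a hypothetical hook (for example, for censoring
    adjustments); with weights=None every observation counts equally.
    """
    if weights is None:
        weights = [1.0] * len(log_times)
    return sum(w * abs(y - f) for w, y, f in zip(weights, log_times, fitted))


def objective(log_times, X, beta, lam, weights=None):
    """LAD loss plus SCAD penalty for a linear predictor X @ beta."""
    fitted = [sum(x * b for x, b in zip(row, beta)) for row in X]
    penalty = sum(scad_penalty(b, lam) for b in beta)
    return lad_loss(log_times, fitted, weights) + penalty
```

Because the SCAD penalty flattens out beyond `a * lam`, large coefficients are not shrunk further, while the absolute-value loss limits the influence of outlying survival times; the columns of `X` would hold main effects and their products for interaction terms.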
Source
《应用数学进展》 (Advances in Applied Mathematics)
2021, Issue 5, pp. 1765-1775 (11 pages)