Objective: To use the gene chip of Pseudomonas aeruginosa as a research sample and explore it at the omics level, aiming to elucidate the co-expression network characteristics of the virulence genes exoS and exoU of Pseudomonas aeruginosa in the lower respiratory tract from the perspective of molecular biology and to identify its key regulatory genes. Methods: From March 2016 to May 2018, 312 patients with Pseudomonas aeruginosa infection of the lower respiratory tract who were admitted to the Department of Respiratory Medicine of Baogang Hospital and received follow-up treatment in the hospital were selected as subjects by cluster sampling. Alveolar lavage fluid and sputum collected from these patients were used as biological specimens. The genes of Pseudomonas aeruginosa were detected with oligonucleotide probes, and the chip data were pre-processed. A total of 8 common antibiotics against Gram-negative bacteria (ceftazidime, gentamicin, piperacillin, amikacin, ciprofloxacin, levofloxacin, doripenem and ticarcillin) were selected to determine the drug resistance of the biological specimens. The MCODE algorithm was used to construct a co-expression network model of the drug-resistance genes centered on exoS/exoU. Results: The expression level of exoS/exoU in the drug-resistance group was significantly higher than that in the non-resistance group (p<0.05). The top 5 differentially expressed genes in the alveolar lavage fluid specimens from the drug-resistance group were, from high to low, RAC1, ITGB1, ITGB5, CRK and IGF1R. In the sputum specimens, the top 5 differentially expressed genes were RAC1, CRK, IGF1R, ITGB1 and ITGB5. In the alveolar lavage fluid specimens, only RAC1 was positively correlated with the expression of exoS and exoU (p<0.05). In the sputum specimens, RAC1, ITGB1, ITGB5, CRK and IGF1R were positively correlated with the expression of exoS and exoU (p<0.05). The co-expression network comprised exoS, exoU, RAC1, ITGB1, ITGB5, CRK, CAMK2D, RHOA, FLNA, IGF1R, TGFBR2 and FOS. Among them, RAC1 had the highest regulatory ability score (72.00) and the largest number of regulatory genes (6), followed by ITGB1, ITGB5 and CRK. Conclusions: The high expression of exoS and exoU in the sputum specimens suggests that Pseudomonas aeruginosa has a higher probability of becoming resistant to antibiotics; RAC1, ITGB1, ITGB5 and CRK may be the key genes that regulate the expression of exoS and exoU.
Lung cancer remains a significant global health challenge, and identifying it at an early stage is essential for improving patient outcomes. The study focuses on developing and optimizing gene expression-based models for classifying cancer types using machine learning techniques. After applying Log2 normalization to the gene expression data and conducting Wilcoxon rank sum tests, the researchers employed various classifiers and Incremental Feature Selection (IFS) strategies. The study culminated in two optimized models using the XGBoost classifier, comprising 10 and 74 genes respectively. The 10-gene model, due to its simplicity, is proposed for easier clinical implementation, whereas the 74-gene model exhibited superior performance in terms of Specificity, AUC (Area Under the Curve), and Precision. These models were evaluated based on their sensitivity, AUC, and specificity, aiming to achieve high sensitivity and AUC while maintaining reasonable specificity.
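The preprocessing described in this abstract (Log2 normalization followed by Wilcoxon rank sum tests) can be sketched roughly as below. This is a minimal illustration, not the authors' pipeline: the pseudocount, the normal approximation without tie correction, the ranking by absolute z score, and the gene names `G1` and `G2` are all assumptions made for the example.

```python
import math


def log2_normalize(values, pseudocount=1.0):
    """Log2-transform raw values; the pseudocount (an assumption) avoids log2(0)."""
    return [math.log2(v + pseudocount) for v in values]


def rank_sum_z(group_a, group_b):
    """Wilcoxon rank-sum z score of group_a against group_b.

    Uses the normal approximation without tie correction (a simplification).
    """
    pooled = sorted([(v, 0) for v in group_a] + [(v, 1) for v in group_b])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n_a, n_b = len(group_a), len(group_b)
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    mean = n_a * (n_a + n_b + 1) / 2
    sd = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    return (w - mean) / sd


def rank_genes(expr, labels):
    """Order genes by |z|, most discriminative first.

    expr maps a gene name to one value per sample; labels holds 0/1 per sample.
    """
    scores = {}
    for gene, values in expr.items():
        logged = log2_normalize(values)
        cases = [v for v, y in zip(logged, labels) if y == 1]
        controls = [v for v, y in zip(logged, labels) if y == 0]
        scores[gene] = abs(rank_sum_z(cases, controls))
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: G1 separates the classes, G2 does not.
toy_expr = {"G1": [1, 2, 3, 100, 120, 150], "G2": [5, 6, 5, 6, 5, 6]}
toy_labels = [0, 0, 0, 1, 1, 1]
ranked = rank_genes(toy_expr, toy_labels)
```

A ranked list of this kind is what an incremental feature selection loop would consume, adding genes one at a time and re-evaluating the classifier after each addition.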
CHDTEPDB (URL: http://chdtepdb.com/) is a manually integrated database for congenital heart disease (CHD) that stores CHD expression profiling data derived from published papers, aiming to provide rich resources for investigating a deeper correlation between human CHD and aberrant transcriptome expression. The development of human diseases involves important regulatory roles of RNAs, and expression profiling data can reflect the underlying etiology of inherited diseases. Hence, collecting and compiling expression profiling data is of critical significance for a comprehensive understanding of the mechanisms and functions that underpin genetic diseases. CHDTEPDB stores the expression profiles of over 200 sets of 7 types of CHD and provides users with convenient basic analytical functions. Because clinical indicators such as disease type differ among datasets and detection errors are unavoidable, users can customize their selection of data for personalized analysis. Moreover, we provide a submission page where researchers can submit their own data, so that more expression profiles and other histological data can be added to the database. CHDTEPDB has a user-friendly interface that allows users to quickly browse, retrieve, download, and analyze their target samples. CHDTEPDB will significantly improve current knowledge of expression profiling data in CHD and has the potential to become an important tool for future research on the disease.
Gene expression data form a matrix in which each row represents a gene and each column a condition. Microarrays are used in the laboratory to measure the expression of thousands of genes at a time. Genes encode proteins, which in turn dictate cell function. The production of messenger RNA and its processing are the two main stages of gene expression. The complexity of biological networks, together with the volume of data containing imprecision and outliers, makes such data challenging to handle. Clustering methods are hence essential for identifying the patterns present in massive gene data. Available techniques include hierarchical, partitioning, grid-based, density-based, model-based and soft clustering approaches. Understanding gene regulation and extracting other useful information from these data is possible only through effective clustering algorithms. Though many methods are discussed in the literature, we concentrate on a soft clustering approach for analyzing gene expression data. The population elements are grouped based on the fuzziness principle, and a degree of membership is assigned to every element. An improved Fuzzy clustering by Local Approximation of Memberships (FLAME) is proposed in this work, which overcomes the limitations of other approaches in dealing with non-linear relationships and provides better segregation of biological functions.
Acute leukemia is an aggressive disease with high mortality rates worldwide. The error rate can be as high as 40% when classifying acute leukemia into its subtypes, so there is an urgent need to support hematologists during the classification process. More than two decades ago, researchers used microarray gene expression data to classify cancer and adopted acute leukemia as a test case. The high classification accuracy they achieved confirmed that it is possible to classify cancer subtypes using microarray gene expression data. Ensemble machine learning is an effective method that combines individual classifiers to classify new samples. Ensemble classifiers are recognized as powerful algorithms with numerous advantages over traditional classifiers. Over the past few decades, researchers have devoted a great deal of attention to ensemble classifiers in a wide variety of fields, including but not limited to disease diagnosis, finance, bioinformatics, healthcare, manufacturing, and geography. This paper reviews the recent ensemble classifier approaches used for acute leukemia gene expression data classification. Moreover, a framework for classifying acute leukemia gene expression data is proposed. The pairwise correlation gene selection method and the Rotation Forest of Bayesian Networks are both used in this framework. Experimental outcomes show that the classification accuracy achieved by the acute leukemia ensemble classifiers constructed according to the suggested framework compares well with the classification accuracy achieved in other studies.
In bioinformatics applications, the examination of microarray data has received significant interest for diagnosing diseases. Microarray gene expression data are characterized by a massive search space, which poses a primary challenge for the appropriate selection of genes. Microarray data classification incorporates multiple disciplines such as bioinformatics, machine learning (ML), data science, and pattern classification. This paper designs an optimal deep neural network based microarray gene expression classification (ODNN-MGEC) model for bioinformatics applications. The proposed ODNN-MGEC technique performs data normalization to bring the data onto a uniform scale. In addition, an improved fruit fly optimization (IFFO) based feature selection technique is used to reduce the high dimensionality of the biomedical data. Moreover, a deep neural network (DNN) model is applied to classify the microarray gene expression data, and the hyperparameters of the DNN model are tuned using the Symbiotic Organisms Search (SOS) algorithm. The use of the IFFO and SOS algorithms paves the way for maximizing gene expression classification outcomes. To examine the improved outcomes of the ODNN-MGEC technique, a wide-ranging experimental analysis is carried out on benchmark datasets. The extensive comparison with recent approaches demonstrates the enhanced outcomes of the ODNN-MGEC technique in terms of different measures.
This work evaluates a recently developed multivariate statistical method based on the creation of pseudo or latent variables using principal component analysis (PCA). The application is the data mining of gene expression data to find a small subset of the most important genes in a set of thousands or tens of thousands of genes from a relatively small number of experimental runs. The method was previously developed and evaluated on artificially generated data and real data sets. Those evaluations assessed its ability to rank the genes against known truth in simulated data studies and to identify known important genes in real data studies. The purpose of the work described here is to identify a ranked set of genes in an experimental study and then, for a few of the most highly ranked unverified genes, experimentally verify their importance. This method was evaluated using the transcriptional response of Escherichia coli to treatment with four distinct inhibitory compounds: nitric oxide, S-nitrosoglutathione, serine hydroxamate and potassium cyanide. Our analysis identified genes previously recognized in the response to these compounds and also identified new genes. Three of these new genes, ycbR, yJhA and yahN, were found to significantly (p-values < 0.002) affect the sensitivity of E. coli to nitric oxide-mediated growth inhibition. Given that the three genes were not highly ranked in the selected ranked set (RS), these results support strong sensitivity in the ability of the method to successfully identify genes related to challenge by NO and GSNO. This ability to identify genes related to the response to an inhibitory compound is important for engineering tolerance to inhibitory metabolic products, such as biofuels, and utilization of cheap sugar streams, such as biomass-derived sugars or hydrolysate.
We propose a new method for tumor classification from gene expression data, which consists of three main steps. First, the original DNA microarray gene expression data are modeled by independent component analysis (ICA). Second, the most discriminant eigenassays extracted by ICA are selected by the sequential floating forward selection technique. Finally, a support vector machine is used to classify the modeled data. To show the validity of the proposed method, we applied it to classify three DNA microarray datasets involving various human normal and tumor tissue samples. The experimental results show that the method is efficient and feasible.
The rapid development of technologies that generate arrays of gene data enables a global view of the transcription levels of hundreds of thousands of genes simultaneously. The outlier detection problem for gene data is important, but it comes with the difficulty of high dimensionality. The sparsity of data in high-dimensional space makes each point a relatively good outlier under traditional distance-based definitions. Thus, finding outliers in high-dimensional data is more complex. In this paper, some basic outlier analysis algorithms are discussed and a new genetic algorithm is presented. This algorithm finds the best dimension projections based on a revised cell-based algorithm and gives explanations for the solutions. It can solve the outlier detection problem for gene expression data and for other high-dimensional data as well.
There have been many skewed cancer gene expression datasets in the post-genomic era. Extracting differentially expressed genes or constructing decision rules from these skewed datasets with traditional algorithms will seriously underestimate the performance of the minority class, leading to inaccurate diagnosis in clinical trials. This paper presents a skewed gene selection algorithm that introduces a weighted metric into the gene selection procedure. The extracted genes are paired as decision rules to distinguish both classes, and these decision rules are then integrated into an ensemble learning framework by majority voting to recognize test examples, thus avoiding tedious data normalization and classifier construction. The mining and integration of a few reliable decision rules gave higher or at least comparable classification performance than many traditional class imbalance learning algorithms on four benchmark imbalanced cancer gene expression datasets.
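The idea of pairing genes into decision rules and combining them by majority voting can be illustrated with the sketch below. The abstract does not give the exact form of a rule, so the within-sample comparison used here (vote for one class when gene A is expressed above gene B, otherwise for the other) is an assumption, as are the gene names and class labels. Because each rule compares two genes inside the same sample, it does not depend on cross-sample scaling, which is one plausible reading of how normalization is avoided.

```python
from collections import Counter


def pair_rule_vote(sample, rules):
    """Classify one sample by majority vote over gene-pair rules.

    sample maps a gene name to its expression value.
    Each rule is (gene_a, gene_b, label_if_a_higher, label_otherwise).
    """
    votes = Counter()
    for gene_a, gene_b, if_higher, otherwise in rules:
        label = if_higher if sample[gene_a] > sample[gene_b] else otherwise
        votes[label] += 1
    # most_common(1) returns the label with the most votes.
    return votes.most_common(1)[0][0]


# Hypothetical rules and sample; an odd number of rules avoids tied votes.
toy_rules = [
    ("g1", "g2", "tumor", "normal"),
    ("g3", "g4", "tumor", "normal"),
    ("g5", "g6", "tumor", "normal"),
]
toy_sample = {"g1": 5.0, "g2": 1.0, "g3": 2.0, "g4": 3.0, "g5": 9.0, "g6": 4.0}
prediction = pair_rule_vote(toy_sample, toy_rules)
```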
The order-preserving submatrix (OPSM) has become important for modelling biologically meaningful subspace clusters, capturing the general tendency of gene expression across a subset of conditions. With the advance of microarray and analysis techniques, large volumes of gene expression datasets and OPSM mining results are produced. An OPSM query can efficiently retrieve relevant OPSMs from this huge amount of OPSM data. However, improving OPSM query relevancy remains a difficult task in real-life exploratory data analysis. First, it is hard to capture subjective interestingness aspects, e.g., the analyst's expectation given her/his domain knowledge. Second, even when these expectations can be declaratively specified, it is still challenging to use them during the computational process of OPSM queries. To the best of our knowledge, existing methods mainly focus on batch OPSM mining, while few works address OPSM queries. To solve the above problems, the paper proposes two constrained OPSM query methods, which exploit user-defined constraints to search for relevant results using the two kinds of indices introduced. Extensive experiments are conducted on real datasets, and the results demonstrate that queries based on the multi-dimension index (cIndex) and the enumerating sequence index (esIndex) perform better than brute force search.
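To make the OPSM notion concrete, the sketch below checks which rows (genes) of an expression matrix follow a given column (condition) order, which corresponds to the brute force baseline mentioned in the abstract rather than the index-based methods the paper proposes. The function names, the strict-increase criterion, and the `min_rows` threshold are assumptions for illustration.

```python
def supports_order(row, column_order):
    """Return True if the row's values strictly increase along column_order."""
    values = [row[c] for c in column_order]
    return all(x < y for x, y in zip(values, values[1:]))


def query_opsm(matrix, column_order, min_rows=2):
    """Brute-force OPSM query over a list-of-lists expression matrix.

    Returns the indices of rows that follow column_order, or an empty list
    when fewer than min_rows rows qualify.
    """
    rows = [i for i, row in enumerate(matrix) if supports_order(row, column_order)]
    return rows if len(rows) >= min_rows else []


# Hypothetical 3-gene by 3-condition matrix.
toy_matrix = [
    [1, 5, 3],
    [2, 9, 4],
    [7, 1, 3],
]
hits = query_opsm(toy_matrix, [0, 2, 1])    # condition 0 < condition 2 < condition 1
misses = query_opsm(toy_matrix, [1, 2, 0])  # only one row follows this order
```

Every query in this sketch scans the whole matrix, which is the cost that an index built in advance is meant to avoid.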
The classification of cancer is a major research topic in bioinformatics. The high dimensionality and small sample size of gene expression data, however, make the classification quite challenging. Although principal component analysis (PCA) is of particular interest for high-dimensional data, it may overemphasize some aspects and ignore other important information contained in the richly complex data, because it displays only the differences in the first two- or three-dimensional PC subspaces. Based on PCA, a principal component accumulation (PCAcc) method was proposed. It employs the information contained in multiple PC subspaces and improves the class separability of cancers. The effectiveness of the method was evaluated on four commonly used gene expression datasets, and the results show that the method performs well for cancer classification.
Constructing biological networks is one of the most important issues in systems biology. However, constructing a network from data manually takes a considerable amount of time, so an automated procedure is advocated. To automate network construction, in this work we use two intelligent computing techniques, genetic programming and neural computation, to infer two kinds of network models that use continuous variables. To verify the presented approaches, experiments have been conducted, and the preliminary results show that both approaches can be used to infer networks successfully.
Programmed cell death protein 1 (PD-1)/programmed cell death ligand 1 (PD-L1) blockade is an important therapeutic strategy for melanoma, despite its low clinical response. It is important to identify genes and pathways that may reflect the clinical outcomes of this therapy in patients. We analyzed clinical dataset GSE96619, which contains clinical information from five melanoma patients before and after anti-PD-1 therapy (five pairs of data). We identified 704 DEGs using these five pairs of data, and the number of DEGs was then narrowed down to 286 in patients who responded to treatment. Next, we performed KEGG pathway enrichment and constructed a DEG-associated protein-protein interaction network. Smooth muscle actin 2 (ACTA2) and tyrosine kinase growth factor receptor (KDR) were identified as the hub genes, which were significantly downregulated in the tumor tissue of the two patients who responded to treatment. To confirm our analysis, we demonstrated an expression tendency similar to the clinical data for the two hub genes in a B16F10 subcutaneous xenograft model. This study demonstrates that ACTA2 and KDR are valuable responsive markers for PD-1/PD-L1 blockade therapy.
This paper formally states the basic principle of program data flow analysis and introduces the concept of a data flow expression. On the basis of this concept, an algorithm for finding data flow exceptions is presented. The algorithm is highly general, which makes it easy to develop a program testing tool from it, so it is practical in application.
In previous gene expression data analyses, supervised learning has mainly focused on the classification of nominal attribute data, such as different experimental conditions, different known classes of the same tumor, and sex. However, supervised learning classification is not suitable for interval-scaled attributes, such as the age and survival outcome of cancer patients. To address this problem, this paper proposes a new method that combines two well-known methods: principal component analysis (PCA) and Fisher analysis (FA). The method, PCA-FA, realizes supervised learning with two types of attributes (nominal attributes and interval-scaled attributes). Fuzzy FA was introduced to model the interval-scaled attributes. In this paper, an approximate linear relationship between the gene expression data of lung adenocarcinoma patients and survival outcome is successfully revealed by PCA-FA.
Clustering is perhaps one of the most widely used tools for microarray data analysis. Proposed roles for genes of unknown function are inferred from clusters of genes similarly expressed across many biological conditions. However, whether function annotation by similarity metrics is reliable, and to what extent similarity in gene expression patterns is useful for annotating gene functions, has not been evaluated. This paper comprehensively studies the correlation between the similarity of expression data and the similarity of gene functions using Gene Ontology. It was found that although the similarity in expression patterns and the similarity in gene functions are significantly dependent on each other, this association is rather weak. In addition, among the three categories of Gene Ontology, the similarity of expression data is more useful for cellular component annotation than for biological process and molecular function. The results presented are of interest for research on gene function prediction.
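The comparison this abstract describes, expression similarity on one side and functional similarity on the other, can be sketched for a single gene pair as follows. The abstract does not state which measures were used, so the Pearson correlation for expression profiles and the Jaccard overlap for annotation sets are assumptions, and the term labels are placeholders rather than real Gene Ontology identifiers.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def jaccard(terms_a, terms_b):
    """Overlap of two annotation sets: |intersection| / |union|."""
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


# Hypothetical profiles across four conditions and placeholder annotation terms.
profile_1 = [1.0, 2.0, 3.0, 4.0]
profile_2 = [2.0, 4.0, 6.0, 8.0]
expression_similarity = pearson(profile_1, profile_2)
function_similarity = jaccard({"term_a", "term_b"}, {"term_b", "term_c"})
```

Computing both quantities for many gene pairs and then testing their dependence is one way to ask the question the abstract poses; a weak association would show up as high expression similarity often paired with low annotation overlap.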
BACKGROUND The objectives of this study were to identify hub genes and biological pathways involved in lung adenocarcinoma (LUAD) via bioinformatics analysis, and to investigate potential therapeutic targets. AIM To determine reliable prognostic biomarkers for early diagnosis and treatment of LUAD. METHODS To identify potential therapeutic targets for LUAD, two microarray datasets from the Gene Expression Omnibus (GEO) database, GSE116959 and GSE118370, were analyzed. Differentially expressed genes (DEGs) between LUAD and normal tissues were identified using the GEO2R tool. The Hiplot database was then used to generate a volcano plot of the DEGs. Weighted gene co-expression network analysis was conducted to cluster the genes in GSE116959 and GSE118370 into different modules and to identify immune genes shared between them. A protein-protein interaction network was established using the Search Tool for the Retrieval of Interacting Genes database, and the CytoNCA and CytoHubba components of Cytoscape software were then used to visualize the genes. Hub genes with high scores and co-expression were identified, and the Database for Annotation, Visualization and Integrated Discovery was used to perform enrichment analysis of these genes. The diagnostic and prognostic values of the hub genes were calculated using receiver operating characteristic curves and Kaplan-Meier survival analysis, and gene-set enrichment analysis was conducted. The University of Alabama at Birmingham Cancer data analysis portal was used to analyze relationships between the hub genes and normal specimens, as well as their expression during tumor progression. Lastly, protein expression of the identified hub genes was validated via the Human Protein Atlas database. RESULTS Three hub genes with high connectivity were identified: cellular retinoic acid binding protein 2 (CRABP2), matrix metallopeptidase 12 (MMP12), and DNA topoisomerase II alpha (TOP2A). High expression of these genes was associated with a poor LUAD prognosis, and the genes exhibited high diagnostic value. CONCLUSION Expression levels of CRABP2, MMP12, and TOP2A in LUAD were higher than those in normal lung tissue. This observation has diagnostic value and is linked to poor LUAD prognosis. These genes may be biomarkers and therapeutic targets in LUAD, but further research is warranted to investigate their usefulness in these respects.
Behavior-based malware analysis is an important technique for automatically analyzing and detecting malware, and it has received considerable attention from both academic and industrial communities. By considering how malware behaves, we can tackle the malware obfuscation problem, which cannot be processed by traditional static analysis approaches, and we can also derive the as-built behavior specifications and cover the entire behavior space of the malware samples. Although there have been several works focusing on malware behavior analysis, such research is far from mature, and no overviews have been put forward to date to investigate current developments and challenges. In this paper, we conduct a survey on malware behavior description and analysis considering three aspects: malware behavior description, behavior analysis methods, and visualization techniques. First, existing behavior data types and emerging techniques for malware behavior description are explored, especially the goals, principles, characteristics, and classifications of behavior analysis techniques proposed in the existing approaches. Second, the inadequacies and challenges in malware behavior analysis are summarized from different perspectives. Finally, several possible directions are discussed for future research.
文摘Objective:To use the gene chip of pseudomonas aeruginosa as a research sample and to explore it at an omics level,aiming at elucidating the co-expression network characteristics of the virulence genes exoS and exoU of pseudomonas aeruginosa in the lower respiratory tract from the perspective of molecular biology and identifying its key regulatory genes.Methods:From March 2016 to May 2018,312 patients infected with pseudomonas aeruginosa in the lower respiratory tract who were admitted to Department of Respiratory Medicine of Baogang Hospital and given follow-up treatments in the hospital were selected as subjects by use of cluster sampling.Alveolar lavage fluid and sputum collected from those patients were used as biological specimens.The genes of pseudomonas aeruginosa were detected with the help of oligonucleotide probes to make a pre-processing of chip data.A total of 8 common antibiotics(ceftazidime,gentamicin,piperacillin,amikacin,ciprofloxacin,levofloxacin,doripenem and ticarcillin)against Gram-negative bacteria were selected to determine the drug resistance of biological specimens.MCODE algorithm was used to construct a co-expression network model of the drug-resistance genes focused on exoS/exoU.Results:The expression level of exoS/exoU in the drug-resistance group was significantly higher than that in the non-resistance group(p<0.05).The top 5 differentially expressed genes in the alveolar lavage fluid specimens from the drug-resistance group were RAC1,ITGB1,ITGB5,CRK and IGF1R in the order from high to low.In the sputum specimens,the top 5 differentially expressed genes were RAC1,CRK,IGF1R,ITGB1 and ITGB5.In the alveolar lavage fluid specimens,only RAC1 had a positive correlation with the expression of exoS and exoU(p<0.05).In the sputum specimens,RAC1,ITGB1,ITGB5,CRK and IGF1R were positively correlated with the expression of exoS and exoU(p<0.05).The genes included in the co-expression network contained 
exoS,exoU,RAC1,ITGB1,ITGB5,CRK,CAMK2D,RHOA,FLNA,IGF1R,TGFBR2 and FOS.Among them,RAC1 had a highest score in the aspect of regulatory ability(72.00)and the largest number of regulatory genes(6);followed by ITGB1,ITGB5 and CRK genes.Conclusions:The high expression of exoS and exoU in the sputum specimens suggests that pseudomonas aeruginosa has a higher probability to get resistant to antibiotics;RAC1,ITGB1,ITGB5 and CRK genes may be the key genes that can regulate the expression of exoS and exoU.
文摘Lung cancer remains a significant global health challenge and identifying lung cancer at an early stage is essential for enhancing patient outcomes. The study focuses on developing and optimizing gene expression-based models for classifying cancer types using machine learning techniques. By applying Log2 normalization to gene expression data and conducting Wilcoxon rank sum tests, the researchers employed various classifiers and Incremental Feature Selection (IFS) strategies. The study culminated in two optimized models using the XGBoost classifier, comprising 10 and 74 genes respectively. The 10-gene model, due to its simplicity, is proposed for easier clinical implementation, whereas the 74-gene model exhibited superior performance in terms of Specificity, AUC (Area Under the Curve), and Precision. These models were evaluated based on their sensitivity, AUC, and specificity, aiming to achieve high sensitivity and AUC while maintaining reasonable specificity.
文摘CHDTEPDB(URL:http://chdtepdb.com/)is a manually integrated database for congenital heart disease(CHD)that stores the expression profiling data of CHD derived from published papers,aiming to provide rich resources for investigating a deeper correlation between human CHD and aberrant transcriptome expression.The develop-ment of human diseases involves important regulatory roles of RNAs,and expression profiling data can reflect the underlying etiology of inherited diseases.Hence,collecting and compiling expression profiling data is of critical significance for a comprehensive understanding of the mechanisms and functions that underpin genetic diseases.CHDTEPDB stores the expression profiles of over 200 sets of 7 types of CHD and provides users with more convenient basic analytical functions.Due to the differences in clinical indicators such as disease type and unavoidable detection errors among various datasets,users are able to customize their selection of corresponding data for personalized analysis.Moreover,we provide a submission page for researchers to submit their own data so that increasing expression profiles as well as some other histological data could be supplemented to the database.CHDTEPDB is a user-friendly interface that allows users to quickly browse,retrieve,download,and analyze their target samples.CHDTEPDB will significantly improve the current knowledge of expression profiling data in CHD and has the potential to be exploited as an important tool for future research on the disease.
Abstract: Gene expression data is represented as a matrix in which each row corresponds to a gene and each column to a condition. Microarrays are used in the laboratory to detect the expression of thousands of genes at a time. Genes encode proteins, which in turn dictate cell function. The production of messenger RNA and its subsequent processing are the two main stages of gene expression. The complexity of biological networks, together with the volume of data containing imprecision and outliers, increases the challenges of dealing with them. Clustering methods are hence essential to identify the patterns present in massive gene data. Many techniques, including hierarchical, partitioning, grid-based, density-based, model-based and soft clustering approaches, are used for dealing with gene expression data. Understanding gene regulation and extracting other useful information from this data is possible only through effective clustering algorithms. Though many methods are discussed in the literature, we concentrate on providing a soft clustering approach for analyzing gene expression data. The population elements are grouped based on the fuzziness principle, and a degree of membership is assigned to every element. An improved Fuzzy clustering by Local Approximation of Memberships (FLAME) is proposed in this work, which overcomes the limitations of other approaches in dealing with non-linear relationships and provides better segregation of biological functions.
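The idea of assigning every element a degree of membership in each cluster can be illustrated with the textbook fuzzy c-means membership formula. This is a sketch under stated assumptions: it is not FLAME or the improved variant proposed in the abstract, and the function name, the fuzzifier value of 2, and the toy points are chosen only for illustration.

```python
import math

def fuzzy_memberships(point, centers, fuzzifier=2.0):
    """Fuzzy c-means style membership of one point in each cluster.

    u_i = 1 / sum_k (d_i / d_k) ** (2 / (fuzzifier - 1)),
    where d_i is the distance from the point to center i.
    """
    distances = [math.dist(point, center) for center in centers]
    # A point that coincides with a center belongs fully to it.
    if any(d == 0.0 for d in distances):
        hits = [1.0 if d == 0.0 else 0.0 for d in distances]
        total = sum(hits)
        return [h / total for h in hits]
    exponent = 2.0 / (fuzzifier - 1.0)
    memberships = []
    for d_i in distances:
        denom = sum((d_i / d_k) ** exponent for d_k in distances)
        memberships.append(1.0 / denom)
    return memberships

# Hypothetical two-cluster example: the point is closer to the first center.
centers = [(0.0, 0.0), (3.0, 0.0)]
u = fuzzy_memberships((1.0, 0.0), centers)
```

The memberships of a point always sum to 1, so a gene lying between two expression patterns is shared between the clusters rather than forced into one of them.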
Abstract: Acute leukemia is an aggressive disease with high mortality rates worldwide. The error rate can be as high as 40% when classifying acute leukemia into its subtypes, so there is an urgent need to support hematologists during the classification process. More than two decades ago, researchers used microarray gene expression data to classify cancer and adopted acute leukemia as a test case. The high classification accuracy they achieved confirmed that it is possible to classify cancer subtypes using microarray gene expression data. Ensemble machine learning is an effective method that combines individual classifiers to classify new samples. Ensemble classifiers are recognized as powerful algorithms with numerous advantages over traditional classifiers. Over the past few decades, researchers have devoted a great deal of attention to ensemble classifiers in a wide variety of fields, including but not limited to disease diagnosis, finance, bioinformatics, healthcare, manufacturing, and geography. This paper reviews recent ensemble classifier approaches used for acute leukemia gene expression data classification. Moreover, a framework for classifying acute leukemia gene expression data is proposed. The framework uses both the pairwise correlation gene selection method and the Rotation Forest of Bayesian Networks. Experimental outcomes show that the acute leukemia ensemble classifiers constructed according to the suggested framework achieve good classification accuracy compared with other studies.
Funding: The authors extend their appreciation to the Deanship of Scientific Research at King Khalid University for funding this work under grant number (RGP 2/42/43). This work was supported by the Taif University Researchers Supporting Program (project number: TURSP-2020/200), Taif University, Saudi Arabia.
Abstract: In bioinformatics applications, the examination of microarray data has received significant interest for diagnosing diseases. Microarray gene expression data is characterized by a massive search space, which poses a primary challenge for the appropriate selection of genes. Microarray data classification incorporates multiple disciplines such as bioinformatics, machine learning (ML), data science, and pattern classification. This paper designs an optimal deep neural network based microarray gene expression classification (ODNN-MGEC) model for bioinformatics applications. The proposed ODNN-MGEC technique performs a data normalization step to bring the data onto a uniform scale. In addition, an improved fruit fly optimization (IFFO) based feature selection technique is used to reduce the high dimensionality of the biomedical data. Moreover, a deep neural network (DNN) model is applied for the classification of microarray gene expression data, and the hyperparameter tuning of the DNN model is carried out using the Symbiotic Organisms Search (SOS) algorithm. The use of the IFFO and SOS algorithms paves the way for achieving maximum gene expression classification outcomes. To examine the improved outcomes of the ODNN-MGEC technique, a wide-ranging experimental analysis is carried out on benchmark datasets. The extensive comparison with recent approaches demonstrates the enhanced outcomes of the ODNN-MGEC technique in terms of different measures.
Abstract: This work evaluates a recently developed multivariate statistical method based on the creation of pseudo or latent variables using principal component analysis (PCA). The application is the data mining of gene expression data to find a small subset of the most important genes in a set of thousands or tens of thousands of genes from a relatively small number of experimental runs. The method was previously developed and evaluated on artificially generated data and real data sets. Those evaluations consisted of its ability to rank the genes against known truth in simulated data studies and to identify known important genes in real data studies. The purpose of the work described here is to identify a ranked set of genes in an experimental study and then, for a few of the most highly ranked unverified genes, experimentally verify their importance. This method was evaluated using the transcriptional response of Escherichia coli to treatment with four distinct inhibitory compounds: nitric oxide, S-nitrosoglutathione, serine hydroxamate and potassium cyanide. Our analysis identified genes previously recognized in the response to these compounds and also identified new genes. Three of these new genes, ycbR, yJhA and yahN, were found to significantly (p-values < 0.002) affect the sensitivity of E. coli to nitric oxide-mediated growth inhibition. Given that the three genes were not highly ranked in the selected ranked set (RS), these results support strong sensitivity in the ability of the method to successfully identify genes related to challenge by NO and GSNO. This ability to identify genes related to the response to an inhibitory compound is important for engineering tolerance to inhibitory metabolic products, such as biofuels, and for the utilization of cheap sugar streams, such as biomass-derived sugars or hydrolysate.
Funding: The National Natural Science Foundation of China (No. 30700161); the National High-Tech Research and Development Program (863 Program) of China (No. 2007AA01Z167 and 2006AA02Z309); China Postdoctoral Science Foundation (No. 20070410223); Doctor Scientific Research Startup Foundation of Qufu Normal University (No. Bsqd2007036).
Abstract: We propose a new method for tumor classification from gene expression data, which consists of three main steps. First, the original DNA microarray gene expression data are modeled by independent component analysis (ICA). Second, the most discriminant eigenassays extracted by ICA are selected by the sequential floating forward selection technique. Finally, a support vector machine is used to classify the modeled data. To show the validity of the proposed method, we applied it to classify three DNA microarray datasets involving various human normal and tumor tissue samples. The experimental results show that the method is efficient and feasible.
Abstract: The rapid development of technologies that generate arrays of gene data enables a global view of the transcription levels of hundreds of thousands of genes simultaneously. Outlier detection for gene data is important but is made difficult by high dimensionality. The sparsity of data in high-dimensional space makes each point a relatively good outlier under traditional distance-based definitions. Thus, finding outliers in high-dimensional data is more complex. In this paper, some basic outlier analysis algorithms are discussed and a new genetic algorithm is presented. This algorithm finds the best dimension projections based on a revised cell-based algorithm and gives explanations for the solutions. It can solve the outlier detection problem for gene expression data and for other high-dimensional data as well.
Funding: Supported by the National Natural Science Foundation of China (No. 61105057) and the Ph.D Foundation of Jiangsu University of Science and Technology (Nos. 35301002 and 35211104).
Abstract: There have been many skewed cancer gene expression datasets in the post-genomic era. Extraction of differentially expressed genes or construction of decision rules from these skewed datasets by traditional algorithms will seriously underestimate the performance of the minority class, leading to inaccurate diagnosis in clinical trials. This paper presents a skewed gene selection algorithm that introduces a weighted metric into the gene selection procedure. The extracted genes are paired as decision rules to distinguish both classes, and these decision rules are then integrated into an ensemble learning framework by majority voting to recognize test examples, thus avoiding tedious data normalization and classifier construction. The mining and integration of a few reliable decision rules gave higher or at least comparable classification performance compared with many traditional class imbalance learning algorithms on four benchmark imbalanced cancer gene expression datasets.
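The combination of gene-pair decision rules and majority voting described in this abstract can be sketched as below. The abstract does not give the paper's weighted metric, so `balanced_rule_score` (the mean of per-class accuracies) is a hypothetical stand-in; the gene names, the class labels, and the toy samples are likewise assumptions for illustration.

```python
def pair_rule(sample, gene_a, gene_b):
    """Predict 'tumor' if gene_a is expressed above gene_b, else 'normal'."""
    return "tumor" if sample[gene_a] > sample[gene_b] else "normal"

def balanced_rule_score(samples, labels, gene_a, gene_b):
    """Mean of per-class accuracies, so the minority class counts equally."""
    per_class = {}
    for sample, label in zip(samples, labels):
        hit = pair_rule(sample, gene_a, gene_b) == label
        correct, total = per_class.get(label, (0, 0))
        per_class[label] = (correct + int(hit), total + 1)
    return sum(c / t for c, t in per_class.values()) / len(per_class)

def majority_vote(sample, rules):
    """Combine several gene-pair rules; use an odd number of rules to avoid ties."""
    votes = [pair_rule(sample, a, b) for a, b in rules]
    tumor_votes = votes.count("tumor")
    return "tumor" if tumor_votes > len(votes) - tumor_votes else "normal"

# Hypothetical imbalanced toy set: one tumor sample, three normal samples.
samples = [
    {"g1": 9, "g2": 2, "g3": 7, "g4": 1},
    {"g1": 1, "g2": 8, "g3": 2, "g4": 6},
    {"g1": 2, "g2": 9, "g3": 1, "g4": 5},
    {"g1": 3, "g2": 7, "g3": 4, "g4": 8},
]
labels = ["tumor", "normal", "normal", "normal"]
rules = [("g1", "g2"), ("g3", "g4"), ("g1", "g4")]

predictions = [majority_vote(s, rules) for s in samples]
good_score = balanced_rule_score(samples, labels, "g1", "g2")
bad_score = balanced_rule_score(samples, labels, "g2", "g1")
```

Because each rule only compares two expression values within the same sample, the prediction does not change under any monotone rescaling of that sample, which is one way to read the abstract's remark about avoiding data normalization.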
Funding: The authors thank the anonymous referees for their useful comments that greatly improved the quality of the paper. This work was supported in part by the National Basic Research Program 973 of China (2012CB316203), the Natural Science Foundation of China (Grant Nos. 61033007, 61272121, 61332014, 61572367, 61332006, 61472321, and 61502390), the National High Technology Research and Development Program 863 of China (2015AA015307), the Fundamental Research Funds for the Central Universities (3102015JSJ0011, 3102014JSJ0005, and 3102014JSJ0013), and the Graduate Starting Seed Fund of Northwestern Polytechnical University (Z2012128).
Abstract: The order-preserving submatrix (OPSM) has become important in modelling biologically meaningful subspace clusters, capturing the general tendency of gene expressions across a subset of conditions. With the advance of microarray and analysis techniques, large volumes of gene expression datasets and OPSM mining results are produced. An OPSM query can efficiently retrieve relevant OPSMs from this huge amount of OPSM data. However, improving OPSM query relevancy remains a difficult task in real-life exploratory data analysis. First, it is hard to capture subjective aspects of interestingness, e.g., the analyst's expectation given her/his domain knowledge. Second, even when these expectations can be declaratively specified, it is still challenging to use them during the computational process of OPSM queries. To the best of our knowledge, existing methods mainly focus on batch OPSM mining, while few works involve OPSM queries. To solve the above problems, the paper proposes two constrained OPSM query methods, which exploit user-defined constraints to search for relevant results using the two kinds of indices introduced. Extensive experiments are conducted on real datasets, and the results demonstrate that queries based on the multi-dimension index (cIndex) and the enumerating sequence index (esIndex) perform better than brute force search.
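The order-preserving property and the brute force baseline mentioned in this abstract can be sketched as follows. This is a minimal illustration that assumes strict increase along the queried column order; the function names, the `min_rows` threshold, and the toy matrix are hypothetical, and the cIndex and esIndex structures from the paper are not reproduced.

```python
def supports_order(row, column_order):
    """True if the row's values strictly increase along column_order."""
    values = [row[c] for c in column_order]
    return all(a < b for a, b in zip(values, values[1:]))

def query_opsm(matrix, column_order, min_rows=2):
    """Brute-force query: indices of rows (genes) that follow column_order.

    Returns an empty list when fewer than min_rows rows support the order.
    """
    rows = [i for i, row in enumerate(matrix) if supports_order(row, column_order)]
    return rows if len(rows) >= min_rows else []

# Hypothetical expression matrix: rows are genes, columns are conditions.
matrix = [
    [1.0, 5.0, 3.0, 9.0],
    [0.2, 0.9, 0.4, 1.5],
    [7.0, 2.0, 6.0, 1.0],
    [2.0, 8.0, 3.0, 8.5],
]

# Genes whose expression rises across conditions 0, 2, 1, 3 in that order.
matching = query_opsm(matrix, [0, 2, 1, 3])
# Only one gene follows this order, so the query returns no cluster.
too_small = query_opsm(matrix, [3, 1, 2, 0])
```

Rows 0, 1 and 3 have very different magnitudes but share the same relative ordering across the chosen conditions, which is exactly the tendency an OPSM captures. A scan like this touches every row for every query, which is the cost that index-based methods aim to avoid.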
Funding: Supported by the National Natural Science Foundation of China (20835002) and the International Science and Technology Cooperation Program of the Ministry of Science and Technology (MOST) of China (2008DFA32250).
Abstract: The classification of cancer is a major research topic in bioinformatics. The high dimensionality and small sample size of gene expression data, however, make the classification quite challenging. Although principal component analysis (PCA) is of particular interest for high-dimensional data, it may overemphasize some aspects and ignore other important information contained in the richly complex data, because it displays only the differences in the first two- or three-dimensional PC subspaces. Based on PCA, a principal component accumulation (PCAcc) method was proposed. It employs the information contained in multiple PC subspaces and improves the class separability of cancers. The effectiveness of the method was evaluated on four commonly used gene expression datasets, and the results show that the method performs well for cancer classification.
Abstract: Constructing biological networks is one of the most important issues in systems biology. However, constructing a network from data manually takes a considerable amount of time, so an automated procedure is advocated. To automate network construction, in this work we use two intelligent computing techniques, genetic programming and neural computation, to infer two kinds of network models that use continuous variables. To verify the presented approaches, experiments have been conducted, and the preliminary results show that both approaches can be used to infer networks successfully.
Funding: This work was supported by the CAMS Innovation Fund for Medical Sciences [2017-I2M-1-010].
Abstract: Programmed cell death protein 1 (PD-1)/programmed cell death ligand 1 (PD-L1) blockade is an important therapeutic strategy for melanoma, despite its low clinical response. It is important to identify genes and pathways that may reflect the clinical outcomes of this therapy in patients. We analyzed clinical dataset GSE96619, which contains clinical information from five melanoma patients before and after anti-PD-1 therapy (five pairs of data). We identified 704 DEGs using these five pairs of data, and the number of DEGs was then narrowed down to 286 in patients who responded to treatment. Next, we performed KEGG pathway enrichment and constructed a DEG-associated protein-protein interaction network. Smooth muscle actin 2 (ACTA2) and tyrosine kinase growth factor receptor (KDR) were identified as the hub genes, which were significantly downregulated in the tumor tissue of the two patients who responded to treatment. To confirm our analysis, we demonstrated an expression tendency similar to the clinical data for the two hub genes in a B16F10 subcutaneous xenograft model. This study demonstrates that ACTA2 and KDR are valuable responsive markers for PD-1/PD-L1 blockade therapy.
Abstract: This paper states the basic principle of program data flow analysis in a formal way and introduces the concept of a data flow expression. On the basis of this concept, an algorithm for finding data flow exceptions is presented. The algorithm is highly general, which makes it easy to build a program testing tool on top of it, so it is practical in application.
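The abstract does not spell out its algorithm, so the following is only a textbook-style sketch of what finding data flow exceptions can look like: a linear scan over define, use and undefine events that flags three classic anomalies. The event encoding, the anomaly names, and the function name are all assumptions made for this example, and branching control flow is not handled.

```python
def find_anomalies(events):
    """Scan (action, variable) events; actions are 'def', 'use' and 'undef'.

    Flags three anomaly patterns per variable:
      use with no prior definition, definition overwritten without use,
      and definition discarded without use.
    """
    state = {}       # variable -> last action seen for it
    anomalies = []
    for index, (action, var) in enumerate(events):
        previous = state.get(var, "undef")
        if action == "use" and previous == "undef":
            anomalies.append((index, var, "use before definition"))
        elif action == "def" and previous == "def":
            anomalies.append((index, var, "redefined without use"))
        elif action == "undef" and previous == "def":
            anomalies.append((index, var, "defined but never used"))
        state[var] = action
    return anomalies

# Hypothetical straight-line trace of a small program.
events = [
    ("def", "x"),
    ("def", "x"),    # x assigned twice with no read in between
    ("use", "x"),
    ("use", "y"),    # y read without ever being assigned
    ("def", "z"),
    ("undef", "z"),  # z goes out of scope without being read
]
anomalies = find_anomalies(events)
```

Each reported tuple carries the position of the offending event, which is the kind of output a program testing tool could map back to source lines.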
Abstract: In previous gene expression data analyses, supervised learning has mainly focused on the classification of nominal attribute data, such as different experimental conditions, different known classes of the same tumor, and sex. However, supervised learning classification is not suitable for interval-scaled attributes, such as the age and survival outcome of cancer patients. To address this problem, this paper proposes a new method that combines two well-known methods: principal component analysis (PCA) and Fisher analysis (FA). The method, PCA-FA, realizes supervised learning with two types of attributes (nominal attributes and interval-scaled attributes). Fuzzy FA is introduced to model the interval-scaled attributes. In this paper, an approximate linear relationship between the gene expression data of lung adenocarcinoma patients and survival outcome is successfully revealed by PCA-FA.
Funding: Project supported by the Key Program of Basic Research of Science & Technology Commission of Shanghai Municipality (No. 04dz14004) and the Shanghai Natural Science Foundation (No. 03ZR14065). Dedicated to Professor Xikui Jiang on the occasion of his 80th birthday.
Abstract: Clustering is perhaps one of the most widely used tools for microarray data analysis. Proposed roles for genes of unknown function are inferred from clusters of genes similarly expressed across many biological conditions. However, whether function annotation by similarity metrics is reliable, and to what extent the similarity in gene expression patterns is useful for annotating gene functions, has not been evaluated. This paper presents a comprehensive study of the correlation between the similarity of expression data and the similarity of gene functions using Gene Ontology. It was found that although the similarity in expression patterns and the similarity in gene functions are significantly dependent on each other, this association is rather weak. In addition, among the three categories of Gene Ontology, the similarity of expression data is more useful for cellular component annotation than for biological process and molecular function. The results are of interest for research on gene function prediction.
Abstract: BACKGROUND The objectives of this study were to identify hub genes and biological pathways involved in lung adenocarcinoma (LUAD) via bioinformatics analysis, and to investigate potential therapeutic targets. AIM To determine reliable prognostic biomarkers for early diagnosis and treatment of LUAD. METHODS To identify potential therapeutic targets for LUAD, two microarray datasets derived from the Gene Expression Omnibus (GEO) database were analyzed, GSE116959 and GSE118370. Differentially expressed genes (DEGs) in LUAD and normal tissues were identified using the GEO2R tool. The Hiplot database was then used to generate a volcano plot of the DEGs. Weighted gene co-expression network analysis was conducted to cluster the genes in GSE116959 and GSE118370 into different modules and to identify immune genes shared between them. A protein-protein interaction network was established using the Search Tool for the Retrieval of Interacting Genes database, then the CytoNCA and CytoHubba components of Cytoscape software were used to visualize the genes. Hub genes with high scores and co-expression were identified, and the Database for Annotation, Visualization and Integrated Discovery was used to perform enrichment analysis of these genes. The diagnostic and prognostic values of the hub genes were calculated using receiver operating characteristic curves and Kaplan-Meier survival analysis, and gene-set enrichment analysis was conducted. The University of Alabama at Birmingham Cancer data analysis portal was used to analyze relationships between the hub genes and normal specimens, as well as their expression during tumor progression. Lastly, validation of protein expression was conducted on the identified hub genes via the Human Protein Atlas database. RESULTS Three hub genes with high connectivity were identified: cellular retinoic acid binding protein 2 (CRABP2), matrix metallopeptidase 12 (MMP12), and DNA topoisomerase II alpha (TOP2A). High expression of these genes was associated with a poor LUAD prognosis, and the genes exhibited high diagnostic value. CONCLUSION Expression levels of CRABP2, MMP12, and TOP2A in LUAD were higher than those in normal lung tissue. This observation has diagnostic value and is linked to poor LUAD prognosis. These genes may be biomarkers and therapeutic targets in LUAD, but further research is warranted to investigate their usefulness in these respects.
Funding: Project supported by the National Natural Science Foundation of China (No. 61472437)
Abstract: Behavior-based malware analysis is an important technique for automatically analyzing and detecting malware, and it has received considerable attention from both academic and industrial communities. By considering how malware behaves, we can tackle the malware obfuscation problem, which cannot be processed by traditional static analysis approaches, and we can also derive the as-built behavior specifications and cover the entire behavior space of the malware samples. Although there have been several works focusing on malware behavior analysis, such research is far from mature, and no overviews have been put forward to date to investigate current developments and challenges. In this paper, we conduct a survey on malware behavior description and analysis considering three aspects: malware behavior description, behavior analysis methods, and visualization techniques. First, existing behavior data types and emerging techniques for malware behavior description are explored, especially the goals, principles, characteristics, and classifications of behavior analysis techniques proposed in the existing approaches. Second, the inadequacies and challenges in malware behavior analysis are summarized from different perspectives. Finally, several possible directions are discussed for future research.