Typically, magnesium alloys have been designed using a so-called hill-climbing approach, with rather incremental advances over the past century. Iterative and incremental alloy design is slow and expensive, but more importantly it does not harness all the data that exists in the field. In this work, a new approach is proposed that utilises data science and provides a detailed understanding of the data that exists in the field of Mg-alloy design to date. In this approach, first a consolidated alloy database that incorporates 916 datapoints was developed from the literature and experimental work. To analyse the characteristics of the database, alloying and thermomechanical processing effects on mechanical properties were explored via composition-process-property matrices. An unsupervised machine learning (ML) method, clustering, was also implemented on the unlabelled data, with the aim of revealing potentially useful information in an alloy representation space of low dimensionality. In addition, the alloy database was correlated to thermodynamically stable secondary phases to further understand the relationships between microstructure and mechanical properties. This work not only introduces an invaluable open-source database but also provides, for the first time, data insights that enable future accelerated digital Mg-alloy design.
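To make the clustering step concrete, here is a minimal sketch of grouping alloy records in a low-dimensional representation space; the column names and values are hypothetical placeholders, not entries from the 916-datapoint database.

```python
# A minimal sketch of the clustering workflow the abstract describes, with
# hypothetical alloy records; not the authors' exact pipeline.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Hypothetical records: composition (wt.%), processing flag, properties.
df = pd.DataFrame({
    "Al": [3.0, 9.0, 0.0, 6.0], "Zn": [1.0, 1.0, 0.5, 0.0],
    "extruded": [1, 0, 1, 0],          # thermomechanical processing flag
    "yield_MPa": [180, 160, 220, 140], # mechanical properties
    "elongation_pct": [12, 6, 18, 8],
})

X = StandardScaler().fit_transform(df)    # put all features on one scale
Z = PCA(n_components=2).fit_transform(X)  # low-dimensional alloy space
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
print(labels)  # unlabelled datapoints grouped by similarity
```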
In the Tano River Basin, groundwater serves as a crucial resource; however, its quantity and quality with regard to trace elements and microbiological loadings remain poorly understood due to the lack of groundwater logs and limited water research. This study presents a comprehensive analysis of the Tano River Basin, focusing on three key objectives. First, it investigated the aquifer hydraulic parameters, and the results showed significant spatial variations in borehole depths, yields, transmissivity, hydraulic conductivity, and specific capacity. Deeper boreholes were concentrated in the northeastern and southeastern zones, while geological formations, particularly the Apollonian Formation, exhibit a strong influence on borehole yields. The study identified areas with high transmissivity and hydraulic conductivity in the southern and eastern regions, suggesting good groundwater availability and suitability for sustainable water supply. Secondly, the research investigated the groundwater quality and observed that the majority of borehole samples fall within WHO limits (Guidelines for Drinking-water Quality, Environmental Health Criteria, Geneva, 2011, 2017. http://www.who.int). However, some samples have pH levels below the standards, although the groundwater generally qualifies as freshwater. The study further explores hydrochemical facies and health risk assessment, highlighting the dominance of the Ca–HCO3 water type. Trace element analysis reveals minimal health risks from most elements, with chromium (Cr) as the primary contributor to chronic health risk. Overall, this study has provided key insights into the Tano River Basin's hydrogeology and associated health risks. The outcome of this research contributes to the broader understanding of hydrogeological dynamics and the importance of managing groundwater resources sustainably in complex geological environments.
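For the health-risk step, chronic risk from a trace element such as Cr is typically quantified by a hazard quotient (chronic daily intake divided by a reference dose); a minimal sketch follows, with illustrative intake parameters and a hypothetical Cr concentration rather than the study's measurements.

```python
# A minimal sketch of the standard hazard-quotient calculation typically used
# in such assessments; all inputs below are illustrative assumptions.
def hazard_quotient(conc_mg_L, intake_L_day=2.0, body_kg=70.0,
                    ef_days=365, ed_years=30, rfd_mg_kg_day=0.003):
    """Chronic daily intake via ingestion divided by the reference dose."""
    at_days = ed_years * 365                       # averaging time
    cdi = (conc_mg_L * intake_L_day * ef_days * ed_years) / (body_kg * at_days)
    return cdi / rfd_mg_kg_day                     # HQ > 1 flags chronic risk

print(f"HQ(Cr) = {hazard_quotient(0.005):.2f}")    # hypothetical 5 ug/L Cr
```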
This paper investigates the intelligent load monitoring problem with applications to practical energy management scenarios in smart grids. As one of the critical components for paving the way to smart grids' success, an intelligent and feasible non-intrusive load monitoring (NILM) algorithm is urgently needed. However, most recent research on NILM has not dealt with the practical problems that arise when it is applied to the power grid, i.e., (1) limited communication for slow-change systems; (2) the requirement of low-cost hardware at the users' side; and (3) the inconvenience of adapting to new households. Therefore, a novel NILM algorithm based on a biology-inspired spiking neural network (SNN) has been developed to overcome the existing challenges. To provide intelligence in NILM, the developed SNN features an unsupervised learning rule, i.e., spike-time-dependent plasticity (STDP), which only requires the user to label one instance for each appliance when adapting to a new household. To improve the feasibility of NILM, the designed spiking neurons mimic the mechanism of human brain neurons and can be constructed from a resistor-capacitor (RC) circuit. In addition, a distributed computing system has been designed that divides the SNN into two parts, i.e., smart outlets and local servers. Since the information flows as sparse binary vectors among spiking neurons in the developed SNN-based NILM, the high-frequency data can easily be compressed as spike times and sent to the local server over a link with limited communication capability, which traditional NILM cannot handle. Finally, a series of experiments is conducted using a benchmark public dataset, and the effectiveness of the developed SNN-based NILM is demonstrated through comparisons with other emerging NILM algorithms such as convolutional neural networks.
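To illustrate the RC-circuit neuron the abstract describes, here is a minimal leaky integrate-and-fire sketch in which a membrane voltage integrates input current through a resistor-capacitor pair and emits sparse spike times; all parameter values are illustrative assumptions.

```python
# A minimal sketch of an RC-circuit (leaky integrate-and-fire) spiking neuron;
# parameter values are illustrative, not the paper's hardware design.
import numpy as np

def lif_spike_times(current, dt=1e-3, R=1.0, C=10e-3, v_thresh=1.0, v_reset=0.0):
    """Integrate an input current through an RC membrane; return spike times."""
    tau = R * C                      # membrane time constant of the RC circuit
    v, spikes = 0.0, []
    for step, i_in in enumerate(current):
        v += dt * (-(v / tau) + i_in / C)   # leaky integration
        if v >= v_thresh:                   # threshold crossing -> spike
            spikes.append(step * dt)
            v = v_reset                     # reset after firing
    return spikes

# A constant drive yields a sparse spike-time encoding of the signal level.
print(lif_spike_times(np.full(200, 1.5)))
```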
Since the emergence of Bitcoin, cryptocurrencies have grown significantly, not only in terms of capitalization but also in number. Consequently, the cryptocurrency market can be a conducive arena for investors, as it offers many opportunities. However, it is difficult to understand. This study aims to describe, summarize, and segment the main trends of the entire cryptocurrency market in 2018, using data analysis tools. Accordingly, we propose a new clustering-based methodology that provides complementary views of the financial behavior of cryptocurrencies and that looks for associations between the clustering results and other factors not involved in clustering. In particular, the methodology involves applying three different partitional clustering algorithms, each of which uses a different representation for cryptocurrencies, namely (i) the yearly mean and standard deviation of the returns, (ii) the distribution of returns, which has not previously been applied to financial markets, and (iii) the time series of returns. Because each representation provides a different outlook of the market, we also examine the integration of the three clustering results to obtain a fine-grained analysis of the main trends of the market. Finally, we analyze the association of the clustering results with other descriptive features of cryptocurrencies, including age, technological attributes, and financial ratios derived from them. This helps to enhance the profiling of the clusters with additional descriptive insights and to find associations with other variables. Consequently, this study describes the whole market based on graphical information and a scalable methodology that can be reproduced by investors who want to understand the main trends in the market quickly, as well as by those who look for cryptocurrencies with different financial performance. In our analysis of the extended 2018-2019 period, we found that the market can typically be segmented into a few clusters (five or fewer), and, even considering the intersections, the six most populated groups account for 75% of the market. Regarding the associations between the clusters and descriptive features, we find associations between some clusters and volume, market capitalization, and some financial ratios, which could be explored in future research.
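As a sketch of one of the three representations, the snippet below clusters hypothetical coins by the yearly mean and standard deviation of their returns; the coin names and return series are fabricated placeholders, not market data.

```python
# A minimal sketch of one representation (yearly mean and standard deviation
# of returns) fed to a partitional clustering; all data is synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
returns = {                       # hypothetical daily return series
    "coinA": rng.normal(0.001, 0.02, 365),
    "coinB": rng.normal(-0.002, 0.08, 365),
    "coinC": rng.normal(0.000, 0.03, 365),
}
# Represent each cryptocurrency by the (mean, std) of its returns.
X = np.array([[r.mean(), r.std()] for r in returns.values()])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(dict(zip(returns, labels)))
```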
Based on four reanalysis datasets, namely CMA-RA, ERA5, ERA-Interim, and FNL, this paper proposes an improved intelligent method for shear line identification by introducing a second-order zonal-wind shear. Climatic characteristics of shear lines and related rainstorms over the Southern Yangtze River Valley (SYRV) during the summers (June-August) of 2008 to 2018 are then analyzed using two unsupervised machine learning algorithms, the t-distributed stochastic neighbor embedding method (t-SNE) and the k-means clustering method. The results are as follows: (1) The reproducibility of the 850 hPa wind fields over the SYRV using China's reanalysis product CMA-RA is superior to that of the European and American products ERA5, ERA-Interim, and FNL. (2) Theory and observations indicate that the introduction of a second-order zonal-wind shear criterion can effectively eliminate continuous cyclonic curvature of the wind field and identify shear lines with significant discontinuities. (3) The occurrence frequencies of shear lines in the daytime and nighttime are almost equal, but the intensity and the accompanying rainstorms have a clear diurnal variation: they are significantly stronger during daytime than at nighttime. (4) Half (47%) of the shear lines can cause short-duration rainstorms (≥20 mm (3 h)^(-1)), and shear-line rainstorms account for one-sixth (16%) of the total summer short-duration rainstorms. Rainstorms caused by shear lines are significantly stronger than those caused by other synoptic forcing. (5) Under the influence of stronger water vapor transport and barotropic instability, shear lines and related rainstorms in the north and middle of the SYRV are stronger than those in the south.
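A schematic of the shear criterion on a regular latitude grid might look like the following; the paper's exact thresholds and sign conventions are not reproduced, so the cutoffs here are illustrative assumptions.

```python
# A schematic of combining first- and second-order zonal-wind shear on a
# regular latitude grid; thresholds are illustrative, not the paper's values.
import numpy as np

def shear_line_mask(u, dy):
    """Flag grid points where the meridional shear of u is strong and its
    second derivative peaks, suppressing smooth cyclonic curvature."""
    du_dy = np.gradient(u, dy, axis=0)        # first-order zonal-wind shear
    d2u_dy2 = np.gradient(du_dy, dy, axis=0)  # second-order zonal-wind shear
    return (np.abs(du_dy) > 1e-5) & (d2u_dy2 > 1e-10)

u850 = np.random.default_rng(1).normal(0, 10, (73, 144))  # fake 850 hPa u-wind
print(shear_line_mask(u850, dy=2.5 * 111e3).sum(), "grid points flagged")
```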
Cluster analysis is a crucial technique in unsupervised machine learning, pattern recognition, and data analysis. However, current clustering algorithms suffer from the need for manual determination of parameter values, low accuracy, and inconsistent performance with respect to data size and structure. To address these challenges, a novel clustering algorithm called the fully automated density-based clustering method (FADBC) is proposed. The FADBC method consists of two stages: parameter selection and cluster extraction. In the first stage, a proposed method extracts optimal parameters for the dataset, including the epsilon size and minimum-number-of-points thresholds. These parameters are then used in a density-based technique that scans each point in the dataset and evaluates neighborhood densities to find clusters. The proposed method was evaluated on different benchmark datasets and metrics, and the experimental results demonstrate its competitive performance without requiring manual inputs. The results show that the FADBC method outperforms well-known clustering methods such as the agglomerative hierarchical method, k-means, spectral clustering, DBSCAN, FCDCSD, Gaussian mixtures, and density-based spatial clustering methods, and that it performs consistently across datasets of varying size and structure.
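The two-stage idea can be sketched as follows; FADBC's actual parameter estimator is not reproduced here, so this sketch substitutes the common k-distance heuristic for stage one and a standard density scan (DBSCAN) for stage two.

```python
# The two-stage idea in miniature: estimate epsilon and min_samples from the
# data, then run a density scan. Not FADBC's own estimator, which this
# sketch replaces with the common k-distance heuristic.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

min_pts = 2 * X.shape[1]                          # rule-of-thumb threshold
dists, _ = NearestNeighbors(n_neighbors=min_pts).fit(X).kneighbors(X)
eps = np.median(dists[:, -1])                     # stage 1: parameter selection

labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)  # stage 2
print(f"eps={eps:.3f}, clusters={len(set(labels)) - (-1 in labels)}")
```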
Prediction and diagnosis of cardiovascular diseases (CVDs), based among other things on medical examinations and patient symptoms, are among the biggest challenges in medicine. About 17.9 million people die from CVDs annually, accounting for 31% of all deaths worldwide. With a timely prognosis and thorough consideration of the patient's medical history and lifestyle, it is possible to predict CVDs and take preventive measures to eliminate or control this life-threatening disease. In this study, we used various patient datasets from a major hospital in the United States as prognostic factors for CVD. The data was obtained by monitoring a total of 918 adult patients aged 28-77 years. We present a data mining modeling approach to analyze the performance, classification accuracy, and number of clusters on cardiovascular disease prognostic datasets in unsupervised machine learning (ML) using the Orange data mining software. Various techniques are then used to classify the model parameters, such as k-nearest neighbors, support vector machine, random forest, artificial neural network (ANN), naïve Bayes, logistic regression, stochastic gradient descent (SGD), and AdaBoost. To determine the number of clusters, various unsupervised ML clustering methods were used, such as k-means, hierarchical, and density-based spatial clustering of applications with noise. The results showed that the best model performance and classification accuracy came from SGD and ANN, both of which scored 0.900 on the cardiovascular disease prognostic datasets. Based on the results of most clustering methods, such as k-means and hierarchical clustering, the datasets can be divided into two clusters. The prognostic accuracy for CVD depends on the accuracy of the proposed model in determining the diagnostic model: the more accurate the model, the better it can predict which patients are at risk for CVD.
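The study used the Orange GUI; an equivalent scripted sketch with scikit-learn is shown below, run on synthetic stand-in data rather than the hospital dataset, with the cluster count picked by silhouette score (a substitution, since the abstract does not state the selection criterion).

```python
# A scripted stand-in for the Orange workflow: score a classifier and pick a
# cluster count, all on synthetic data shaped like the 918-patient dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import SGDClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, y = make_classification(n_samples=918, n_features=11, random_state=0)

# Supervised side: cross-validated accuracy of an SGD classifier.
acc = cross_val_score(SGDClassifier(random_state=0), X, y, cv=5).mean()

# Unsupervised side: choose k for k-means by the silhouette coefficient.
scores = {k: silhouette_score(X, KMeans(k, n_init=10, random_state=0).fit_predict(X))
          for k in (2, 3, 4)}
print(f"SGD accuracy={acc:.3f}, best k={max(scores, key=scores.get)}")
```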
This paper discusses low-cost approaches capable of ranking traffic intersections for the purpose of signal re-timing. We extracted intersections composed of multiple roads, defined by alphanumeric traffic message channel segment codes per international classification standards. Each of these road segments includes a variety of metrics, including congestion, planning time index, and bottleneck ranking information provided by the Regional Integrated Transportation Information System. Our first approach was to use a ranking formula to calculate intersection rankings as a score between 0 and 10, considering data for different times of the day and different days of the week, weighting weekdays more heavily than weekends and morning and evening commute times more heavily than other times of day. The second method was to utilize unsupervised machine learning algorithms, primarily k-means clustering, to accomplish the intersection ranking task. We first approached this by checking the performance of basic k-means clustering on our dataset. We then explored the ranking problem further by utilizing data provided by traffic professionals in the state of Tennessee. This exploration involved using MATLAB to minimize the mean-squared error of intersection rankings to determine the optimum weights in the ranking formula based on a city's professional data. We then attempted an optimization of our weights via a brute-force search to minimize the distance from the ranking formula results to the clustering results. All the ranking information was aggregated into an online SQL database hosted on Amazon Web Services and accessed via the PHP scripting language.
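A minimal sketch of the first approach follows: a 0-10 intersection score computed as a weighted average of per-period congestion, with weekday peaks weighted most heavily. The weights and period names are illustrative assumptions, not the paper's calibration.

```python
# A minimal sketch of a 0-10 intersection score from segment metrics;
# the period names and weights below are illustrative assumptions.
def intersection_score(congestion_by_period, weights):
    """Weighted average of per-period congestion, scaled to 0-10."""
    total_w = sum(weights.values())
    raw = sum(weights[p] * c for p, c in congestion_by_period.items()) / total_w
    return 10 * raw   # congestion values assumed normalized to [0, 1]

weights = {"weekday_am": 3, "weekday_pm": 3, "weekday_midday": 2, "weekend": 1}
congestion = {"weekday_am": 0.8, "weekday_pm": 0.9, "weekday_midday": 0.4,
              "weekend": 0.2}
print(f"score = {intersection_score(congestion, weights):.2f}")  # ~6.78
```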
Nowadays, in almost every computer system, log files are used to keep records of occurring events. Those log files are then used for analyzing and debugging system failures. Because of this important utility, researchers have worked on finding fast and efficient ways to detect anomalies in a computer system by analyzing its log records. Research in log-based anomaly detection can be divided into two main categories: batch log-based anomaly detection and streaming log-based anomaly detection. Batch log-based anomaly detection is computationally heavy and does not allow anomalies to be detected instantaneously. Streaming anomaly detection, on the other hand, allows for immediate alerts; however, current streaming approaches are mainly supervised. In this work, we propose a fully unsupervised framework that can detect anomalies in real time. We test our framework on HDFS log files and successfully detect anomalies with an F1 score of 83%.
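The paper's framework is not spelled out in the abstract, so the sketch below only illustrates the streaming-unsupervised idea: score each incoming line against the template frequencies seen so far and alert on rare patterns immediately, with no labels.

```python
# An illustration of streaming, unsupervised log anomaly detection (not the
# paper's framework): mask variable fields into templates and flag rare ones.
from collections import Counter
import re

class StreamingLogDetector:
    def __init__(self, min_count=2):
        self.counts, self.seen, self.min_count = Counter(), 0, min_count

    def _template(self, line):
        return re.sub(r"\d+", "<*>", line)   # mask numbers -> log template

    def observe(self, line):
        """Return True if the line looks anomalous, then learn from it."""
        t = self._template(line)
        anomalous = self.seen > 20 and self.counts[t] < self.min_count
        self.counts[t] += 1
        self.seen += 1
        return anomalous

det = StreamingLogDetector()
for line in ["block 123 served"] * 30 + ["block 42 CORRUPT checksum"]:
    if det.observe(line):
        print("ALERT:", line)   # fires in real time, no labels needed
```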
Power plant performance can decrease over the plant's life span and move away from the design and commissioning targets. Maintenance issues, operational practices, market restrictions, and financial objectives may lead to that behavior, and knowledge of appropriate actions could help the system recover its original operational performance. This paper applies unsupervised machine learning techniques to identify operating patterns in a power plant's historical data, leading to the identification of conditions for adequate steam generator efficiency. The selected operational variables are evaluated with respect to their impact on system performance, quantified by a proposed Variable Importance Index. That metric identifies, among a much wider set of monitored data, the variables whose variation impacts overall power plant operation and that should therefore be controlled with more attention. Principal component analysis (PCA) and k-means++ clustering are used to identify suitable operational conditions from a one-year-long dataset with 27 recorded variables from the steam generator of a 360 MW thermal power plant. The adequate number of clusters is identified by the average silhouette coefficient, and the Variable Importance Index sorts out the nine most relevant variables, from which recommended settings are grouped to achieve the target conditions. Results show efficiency gains from the historical average of 73.5% (and the lowest recorded efficiency of 68%) to the target steam generator efficiency of 76%.
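A minimal sketch of the PCA + k-means++ pipeline with silhouette-based cluster-count selection is given below; the 27 monitored variables are replaced by synthetic stand-ins, and the Variable Importance Index is approximated by PCA loading magnitudes (an assumption, not the paper's formula).

```python
# PCA + k-means++ with silhouette-based selection of the cluster count;
# synthetic stand-ins replace the plant's 27 monitored variables, and PCA
# loadings stand in for the paper's Variable Importance Index.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(2000, 27)))  # fake history

pca = PCA(n_components=5).fit(X)
Z = pca.transform(X)

# Pick the number of clusters by the average silhouette coefficient.
best_k = max(range(2, 7), key=lambda k: silhouette_score(
    Z, KMeans(n_clusters=k, init="k-means++", n_init=10,
              random_state=0).fit_predict(Z)))

# Rank variables by how strongly they load on the retained components.
importance = np.abs(pca.components_).sum(axis=0)
print("best k:", best_k, "| top variables:", np.argsort(importance)[::-1][:9])
```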
Funding (Mg-alloy design study): supported by the Monash-IITB Academy Scholarship; funded in part by the Australian Research Council (DP190103592).
Funding (NILM study): supported by the SGCC Science and Technology Program under the project "Distributed High-Speed Frequency Control Under UHVDC Bipolar Blocking Fault Scenario" (No. SGGR0000DLJS1800934).
Funding (cryptocurrency market study): provided by EIT Digital (Grant No. 825215) and the European Cooperation in Science and Technology (COST Action 19130).
Funding (shear line study): Open Project Fund of the Guangdong Provincial Key Laboratory of Regional Numerical Weather Prediction, CMA (J202009); Heavy Rain and Drought-Flood Disasters in Plateau and Basin Key Laboratory of Sichuan Province (SZKT202005); Innovation and Development Project of the China Meteorological Administration (CXFZ2021J020); Key Projects of the Hunan Meteorological Service (XQKJ21A003, XQKJ21A004, XQKJ22A004).
Funding (FADBC clustering study): the Deanship of Scientific Research at Umm Al-Qura University, Grant Code 23UQU4361009DSR001.
Funding (traffic intersection ranking study): the Tennessee Department of Transportation, for its support and funding for the duration of the project.
Funding (power plant study): the authors acknowledge Energy of Portugal (EDP) for financial and technical support of this project. J. Duarte acknowledges financial support from CNPq 154147/2020-6 for her undergraduate scholarship; L.W. Vieira acknowledges INCT-GD and financial support from CAPES 23038.000776/2017-54 for her Ph.D. grant; A.D. Marques acknowledges financial support from CNPq 132422/2020-4 for his M.Sc. grant; P.S. Schneider acknowledges CNPq for his research grant (PQ 301619/2019-0); and T.S. Prass acknowledges the support of FAPERGS (ARD 01/2017, Processo 17/2551-0000826-0).