A new direct method for solving unsymmetrical sparse linear systems(USLS) arising from meshless methods was introduced. Computation of certain meshless methods such as meshless local Petrov-Galerkin (MLPG) method ...A new direct method for solving unsymmetrical sparse linear systems(USLS) arising from meshless methods was introduced. Computation of certain meshless methods such as meshless local Petrov-Galerkin (MLPG) method need to solve large USLS. The proposed solution method for unsymmetrical case performs factorization processes symmetrically on the upper and lower triangular portion of matrix, which differs from previous work based on general unsymmetrical process, and attains higher performance. It is shown that the solution algorithm for USLS can be simply derived from the existing approaches for the symmetrical case. The new matrix factorization algorithm in our method can be implemented easily by modifying a standard JKI symmetrical matrix factorization code. Multi-blocked out-of-core strategies were also developed to expand the solution scale. The approach convincingly increases the speed of the solution process, which is demonstrated with the numerical tests.展开更多
In the previous papers, a high performance sparse static solver with two-level unrolling based on a cell-sparse storage scheme was reported. Although the solver reaches quite a high efficiency for a big percentage of ...In the previous papers, a high performance sparse static solver with two-level unrolling based on a cell-sparse storage scheme was reported. Although the solver reaches quite a high efficiency for a big percentage of finite element analysis benchmark tests, the MFLOPS (million floating operations per second) of LDL^T factorization of benchmark tests vary on a Dell Pentium IV 850 MHz machine from 100 to 456 depending on the average size of the super-equations, i.e., on the average depth of unrolling. In this paper, a new sparse static solver with two-level unrolling that employs the concept of master-equations and searches for an appropriate depths of unrolling is proposed. The new solver provides higher MFLOPS for LDL^T factorization of benchmark tests, and therefore speeds up the solution process.展开更多
Further improving the railway innovation capacity and technological strength is the important goal of the 14th Five-Year Plan for railway scientific and technological innovation.It includes promoting the deep integrat...Further improving the railway innovation capacity and technological strength is the important goal of the 14th Five-Year Plan for railway scientific and technological innovation.It includes promoting the deep integration of cutting-edge technologies with the railway systems,strengthening the research and application of intelligent railway technologies,applying green computing technologies and advancing the collaborative sharing of transportation big data.The high-speed rail system tasks need to process huge amounts of data and heavy workload with the requirement of ultra-fast response.Therefore,it is of great necessity to promote computation efficiency by applying High Performance Computing(HPC)to high-speed rail systems.The HPC technique is a great solution for improving the performance,efficiency,and safety of high-speed rail systems.In this review,we introduce and analyze the application research of high performance computing technology in the field of highspeed railways.These HPC applications are cataloged into four broad categories,namely:fault diagnosis,network and communication,management system,and simulations.Moreover,challenges and issues to be addressed are discussed and further directions are suggested.展开更多
This paper analyzes the physical potential, computing performance benefi t and power consumption of optical interconnects. Compared with electrical interconnections, optical ones show undoubted advantages based on phy...This paper analyzes the physical potential, computing performance benefi t and power consumption of optical interconnects. Compared with electrical interconnections, optical ones show undoubted advantages based on physical factor analysis. At the same time, since the recent developments drive us to think about whether these optical interconnect technologies with higher bandwidth but higher cost are worthy to be deployed, the computing performance comparison is performed. To meet the increasing demand of large-scale parallel or multi-processor computing tasks, an analytic method to evaluate parallel computing performance ofinterconnect systems is proposed in this paper. Both bandwidth-limit model and full-bandwidth model are under our investigation. Speedup and effi ciency are selected to represent the parallel performance of an interconnect system. Deploying the proposed models, we depict the performance gap between the optical and electrically interconnected systems. Another investigation on power consumption of commercial products showed that if the parallel interconnections are deployed, the unit power consumption will be reduced. Therefore, from the analysis of computing influence and power dissipation, we found that parallel optical interconnect is valuable combination of high performance and low energy consumption. Considering the possible data center under construction, huge power could be saved if parallel optical interconnects technologies are used.展开更多
GPU computing is expected to play an integral part in all modern Exascale supercomputers.It is also expected that higher order Godunov schemes will make up about a significant fraction of the application mix on such s...GPU computing is expected to play an integral part in all modern Exascale supercomputers.It is also expected that higher order Godunov schemes will make up about a significant fraction of the application mix on such supercomputers.It is,therefore,very important to prepare the community of users of higher order schemes for hyperbolic PDEs for this emerging opportunity.Not every algorithm that is used in the space-time update of the solution of hyperbolic PDEs will take well to GPUs.However,we identify a small core of algorithms that take exceptionally well to GPU computing.Based on an analysis of available options,we have been able to identify weighted essentially non-oscillatory(WENO)algorithms for spatial reconstruction along with arbitrary derivative(ADER)algorithms for time extension followed by a corrector step as the winning three-part algorithmic combination.Even when a winning subset of algorithms has been identified,it is not clear that they will port seamlessly to GPUs.The low data throughput between CPU and GPU,as well as the very small cache sizes on modern GPUs,implies that we have to think through all aspects of the task of porting an application to GPUs.For that reason,this paper identifies the techniques and tricks needed for making a successful port of this very useful class of higher order algorithms to GPUs.Application codes face a further challenge—the GPU results need to be practically indistinguishable from the CPU results—in order for the legacy knowledge bases embedded in these applications codes to be preserved during the port of GPUs.This requirement often makes a complete code rewrite impossible.For that reason,it is safest to use an approach based on OpenACC directives,so that most of the code remains intact(as long as it was originally well-written).This paper is intended to be a one-stop shop for anyone seeking to make an OpenACC-based port of a higher order Godunov scheme to GPUs.We focus on three broad and high-impact areas where higher order Godunov schemes are used.The first area is computational fluid dynamics(CFD).The second is computational magnetohydrodynamics(MHD)which has an involution constraint that has to be mimetically preserved.The third is computational electrodynamics(CED)which has involution constraints and also extremely stiff source terms.Together,these three diverse uses of higher order Godunov methodology,cover many of the most important applications areas.In all three cases,we show that the optimal use of algorithms,techniques,and tricks,along with the use of OpenACC,yields superlative speedups on GPUs.As a bonus,we find a most remarkable and desirable result:some higher order schemes,with their larger operations count per zone,show better speedup than lower order schemes on GPUs.In other words,the GPU is an optimal stratagem for overcoming the higher computational complexities of higher order schemes.Several avenues for future improvement have also been identified.A scalability study is presented for a real-world application using GPUs and comparable numbers of high-end multicore CPUs.It is found that GPUs offer a substantial performance benefit over comparable number of CPUs,especially when all the methods designed in this paper are used.展开更多
The high-performance computing paradigm needs high-speed switching fabrics to meet the heavy traffic generated by their applications.These switching fabrics are efficiently driven by the deployed scheduling algorithms...The high-performance computing paradigm needs high-speed switching fabrics to meet the heavy traffic generated by their applications.These switching fabrics are efficiently driven by the deployed scheduling algorithms.In this paper,we proposed two scheduling algorithms for input queued switches whose operations are based on ranking procedures.At first,we proposed a Simple 2-Bit(S2B)scheme which uses binary ranking procedure and queue size for scheduling the packets.Here,the Virtual Output Queue(VOQ)set with maximum number of empty queues receives higher rank than other VOQ’s.Through simulation,we showed S2B has better throughput performance than Highest Ranking First(HRF)arbitration under uniform,and non-uniform traffic patterns.To further improve the throughput-delay performance,an Enhanced 2-Bit(E2B)approach is proposed.This approach adopts an integer representation for rank,which is the number of empty queues in a VOQ set.The simulation result shows E2B outperforms S2B and HRF scheduling algorithms with maximum throughput-delay performance.Furthermore,the algorithms are simulated under hotspot traffic and E2B proves to be more efficient.展开更多
This paper proposes algorithm for Increasing Virtual Machine Security Strategy in Cloud Computing computations.Imbalance between load and energy has been one of the disadvantages of old methods in providing server and...This paper proposes algorithm for Increasing Virtual Machine Security Strategy in Cloud Computing computations.Imbalance between load and energy has been one of the disadvantages of old methods in providing server and hosting,so that if two virtual severs be active on a host and energy load be more on a host,it would allocated the energy of other hosts(virtual host)to itself to stay steady and this option usually leads to hardware overflow errors and users dissatisfaction.This problem has been removed in methods based on cloud processing but not perfectly,therefore,providing an algorithm not only will implement a suitable security background but also it will suitably divide energy consumption and load balancing among virtual severs.The proposed algorithm is compared with several previously proposed Security Strategy including SC-PSSF,PSSF and DEEAC.Comparisons show that the proposed method offers high performance computing,efficiency and consumes lower energy in the network.展开更多
Today the PC class machines are quite popular for HPC area, especially on the problemsthat require the good cost/performance ratios. One of the drawback of these machines is the poormemory throughput performance. And ...Today the PC class machines are quite popular for HPC area, especially on the problemsthat require the good cost/performance ratios. One of the drawback of these machines is the poormemory throughput performance. And one of the reasons of the poor performance is depend on the lack of the mapping capability of the TLB which is a buffer to accelerate the virtual memory access. In this report, I present that the mapping capability and the performance can be improved with the multi granularity TLB feature that some processors have. And I also present that the new TLB handling routine can be incorporated into the demand paging system of Linux.展开更多
In recent years, the widespread adoption of parallel computing, especially in multi-core processors and high-performance computing environments, ushered in a new era of efficiency and speed. This trend was particularl...In recent years, the widespread adoption of parallel computing, especially in multi-core processors and high-performance computing environments, ushered in a new era of efficiency and speed. This trend was particularly noteworthy in the field of image processing, which witnessed significant advancements. This parallel computing project explored the field of parallel image processing, with a focus on the grayscale conversion of colorful images. Our approach involved integrating OpenMP into our framework for parallelization to execute a critical image processing task: grayscale conversion. By using OpenMP, we strategically enhanced the overall performance of the conversion process by distributing the workload across multiple threads. The primary objectives of our project revolved around optimizing computation time and improving overall efficiency, particularly in the task of grayscale conversion of colorful images. Utilizing OpenMP for concurrent processing across multiple cores significantly reduced execution times through the effective distribution of tasks among these cores. The speedup values for various image sizes highlighted the efficacy of parallel processing, especially for large images. However, a detailed examination revealed a potential decline in parallelization efficiency with an increasing number of cores. This underscored the importance of a carefully optimized parallelization strategy, considering factors like load balancing and minimizing communication overhead. Despite challenges, the overall scalability and efficiency achieved with parallel image processing underscored OpenMP’s effectiveness in accelerating image manipulation tasks.展开更多
Low temperature complementary metal oxide semiconductor(CMOS)or cryogenic CMOS is a promising avenue for the continuation of Moore’s law while serving the needs of high performance computing.With temperature as a con...Low temperature complementary metal oxide semiconductor(CMOS)or cryogenic CMOS is a promising avenue for the continuation of Moore’s law while serving the needs of high performance computing.With temperature as a control“knob”to steepen the subthreshold slope behavior of CMOS devices,the supply voltage of operation can be reduced with no impact on operating speed.With the optimal threshold voltage engineering,the device ON current can be further enhanced,translating to higher performance.In this article,the experimentally calibrated data was adopted to tune the threshold voltage and investigated the power performance area of cryogenic CMOS at device,circuit and system level.We also presented results from measurement and analysis of functional memory chips fabricated in 28 nm bulk CMOS and 22 nm fully depleted silicon on insulator(FDSOI)operating at cryogenic temperature.Finally,the challenges and opportunities in the further development and deployment of such systems were discussed.展开更多
To improve the efficiency of evolutionary algorithms(EAs)for solving complex problems with large populations,this paper proposes a scalable parallel evolution optimization(SPEO)framework with an elastic asynchronous m...To improve the efficiency of evolutionary algorithms(EAs)for solving complex problems with large populations,this paper proposes a scalable parallel evolution optimization(SPEO)framework with an elastic asynchronous migration(EAM)mechanism.SPEO addresses two main challenges that arise in large-scale parallel EAs:(1)heavy communication workload from extensive information exchange across numerous processors,which reduces computational efficiency,and(2)loss of population diversity due to similar solutions generated and shared by many processors.The EAM mechanism introduces a self-adaptive communication scheme to mitigate communication overhead,while a diversity-preserving buffer helps maintain diversity by filtering similar solutions.Experimental results on eight CEC2014 benchmark functions using up to 512 CPU cores on the Australian National Computational Infrastructure(NCI)platform demonstrate that SPEO not only scales efficiently with an increasing number of processors but also achieves improved solution quality compared to state-of-the-art island-based EAs.展开更多
The integration of clusters,grids,clouds,edges and other computing platforms result in contemporary technology of jungle computing.This novel technique has the aptitude to tackle high performance computation systems a...The integration of clusters,grids,clouds,edges and other computing platforms result in contemporary technology of jungle computing.This novel technique has the aptitude to tackle high performance computation systems and it manages the usage of all computing platforms at a time.Federated learning is a collaborative machine learning approach without centralized training data.The proposed system effectively detects the intrusion attack without human intervention and subsequently detects anomalous deviations in device communication behavior,potentially caused by malicious adversaries and it can emerge with new and unknown attacks.The main objective is to learn overall behavior of an intruder while performing attacks to the assumed target service.Moreover,the updated system model is send to the centralized server in jungle computing,to detect their pattern.Federated learning greatly helps the machine to study the type of attack from each device and this technique paves a way to complete dominion over all malicious behaviors.In our proposed work,we have implemented an intrusion detection system that has high accuracy,low False Positive Rate(FPR)scalable,and versatile for the jungle computing environment.The execution time taken to complete a round is less than two seconds,with an accuracy rate of 96%.展开更多
High performance computing(HPC)is a powerful tool to accelerate the Kohn–Sham density functional theory(KS-DFT)calculations on modern heterogeneous supercomputers.Here,we describe a massively parallel implementation ...High performance computing(HPC)is a powerful tool to accelerate the Kohn–Sham density functional theory(KS-DFT)calculations on modern heterogeneous supercomputers.Here,we describe a massively parallel implementation of discontinuous Galerkin density functional theory(DGDFT)method on the Sunway Taihu Light supercomputer.The DGDFT method uses the adaptive local basis(ALB)functions generated on-the-fly during the self-consistent field(SCF)iteration to solve the KS equations with high precision comparable to plane-wave basis set.In particular,the DGDFT method adopts a two-level parallelization strategy that deals with various types of data distribution,task scheduling,and data communication schemes,and combines with the master–slave multi-thread heterogeneous parallelism of SW26010 processor,resulting in large-scale HPC KS-DFT calculations on the Sunway Taihu Light supercomputer.We show that the DGDFT method can scale up to 8,519,680 processing cores(131,072 core groups)on the Sunway Taihu Light supercomputer for studying the electronic structures of twodimensional(2 D)metallic graphene systems that contain tens of thousands of carbon atoms.展开更多
In this paper, a brief survey of smart citiy projects in Europe is presented. This survey shows the extent of transport and logistics in smart cities. We concentrate on a smart city project we have been working on tha...In this paper, a brief survey of smart citiy projects in Europe is presented. This survey shows the extent of transport and logistics in smart cities. We concentrate on a smart city project we have been working on that is related to A Logistic Mobile Application (ALMA). The application is based on Internet of Things and combines a communication infrastructure and a High Performance Computing infrastructure in order to deliver mobile logistic services with high quality of service and adaptation to the dynamic nature of logistic operations.展开更多
With supercomputers developing towards exascale, the number of compute cores increases dramatically, making more complex and larger-scale applications possible. The input/output (I/O) requirements of large-scale app...With supercomputers developing towards exascale, the number of compute cores increases dramatically, making more complex and larger-scale applications possible. The input/output (I/O) requirements of large-scale applications, workflow applications, and their checkpointing include substantial bandwidth and an extremely low latency, posing a serious challenge to high performance computing (HPC) storage systems. Current hard disk drive (HDD) based underlying storage systems are becoming more and more incompetent to meet the requirements of next-generation exascale supercomputers. To rise to the challenge, we propose a hierarchical hybrid storage system, on-line and near-line file system (ONFS). It leverages dynamic random access memory (DRAM) and solid state drive (SSD) in compute nodes, and HDD in storage servers to build a three-level storage system in a unified namespace. It supports portable operating system interface (POSIX) semantics, and provides high bandwidth, low latency, and huge storage capacity. In this paper, we present the technical details on distributed metadata management, the strategy of memory borrow and return, data consistency, parallel access control, and mechanisms guiding downward and upward migration in ONFS. We implement an ONFS prototype on the TH-1A supercomputer, and conduct experiments to test its I/O performance and scalability. The results show that the bandwidths of single-thread and multi-thread 'read'/'write' are 6-fold and 5-fold better than HDD-based Lustre, respectively. The I/O bandwidth of data-intensive applications in ONFS can be 6.35 timcs that in Lustre.展开更多
Parallel vector buffer analysis approaches can be classified into 2 types:algorithm-oriented parallel strategy and the data-oriented parallel strategy.These methods do not take its applicability on the existing geogra...Parallel vector buffer analysis approaches can be classified into 2 types:algorithm-oriented parallel strategy and the data-oriented parallel strategy.These methods do not take its applicability on the existing geographic information systems(GIS)platforms into consideration.In order to address the problem,a spatial decomposition approach for accelerating buffer analysis of vector data is proposed.The relationship between the number of vertices of each feature and the buffer analysis computing time is analyzed to generate computational intensity transformation functions(CITFs).Then,computational intensity grids(CIGs)of polyline and polygon are constructed based on the relative CITFs.Using the corresponding CIGs,a spatial decomposition method for parallel buffer analysis is developed.Based on the computational intensity of the features and the sub-domains generated in the decomposition,the features are averagely assigned within the sub-domains into parallel buffer analysis tasks for load balance.Compared with typical regular domain decomposition methods,the new approach accomplishes greater balanced decomposition of computational intensity for parallel buffer analysis and achieves near-linear speedups.展开更多
A new approach for the implementation of variogram models and ordinary kriging using the R statistical language, in conjunction with Fortran, the MPI (Message Passing Interface), and the "pbdDMAT" package within R...A new approach for the implementation of variogram models and ordinary kriging using the R statistical language, in conjunction with Fortran, the MPI (Message Passing Interface), and the "pbdDMAT" package within R on the Bridges and Stampede Supercomputers will be described. This new technique has led to great improvements in timing as compared to those in R alone, or R with C and MPI. These improvements include processing and forecasting vectors of size 25,000 in an average time of 6 minutes on the Stampede Supercomputer and 2.5 minutes on the Bridges Supercomputer as compared to previous processing times of 3.5 hours.展开更多
A multifrontal code is introduced for the efficient solution of the linear system of equations arising from the analysis of structures. The factorization phase is reduced into a series of interleaved element assembly ...A multifrontal code is introduced for the efficient solution of the linear system of equations arising from the analysis of structures. The factorization phase is reduced into a series of interleaved element assembly and dense matrix operations for which the BLAS3 kernels are used. A similar approach is generalized for the forward and back substitution phases for the efficient solution of structures having multiple load conditions. The program performs all assembly and solution steps in parallel. Examples are presented which demonstrate the code’s performance on single and dual core processor computers.展开更多
Traditional high performance computing(HPC)systems provide a standard preset environment to support scientific computation.However,HPC development needs to provide support for more and more diverse applications,such a...Traditional high performance computing(HPC)systems provide a standard preset environment to support scientific computation.However,HPC development needs to provide support for more and more diverse applications,such as artificial intelligence and big data.The standard preset environment can no longer meet these diverse requirements.If users still run these emerging applications on HPC systems,they need to manually maintain the specific dependencies(libraries,environment variables,and so on)of their applications.This increases the development and deployment burden for users.Moreover,the multi-user mode brings about privacy problems among users.Containers like Docker and Singularity can encapsulate the job’s execution environment,but in a highly customized HPC system,cross-environment application deployment of Docker and Singularity is limited.The introduction of container images also imposes a maintenance burden on system administrators.Facing the above-mentioned problems,in this paper we propose a self-deployed execution environment(SDEE)for HPC.SDEE combines the advantages of traditional virtualization and modern containers.SDEE provides an isolated and customizable environment(similar to a virtual machine)to the user.The user is the root user in this environment.The user develops and debugs the application and deploys its special dependencies in this environment.Then the user can load the job to compute nodes directly through the traditional HPC job management system.The job and its dependencies are analyzed,packaged,deployed,and executed automatically.This process enables transparent and rapid job deployment,which not only reduces the burden on users,but also protects user privacy.Experiments show that the overhead introduced by SDEE is negligible and lower than those of both Docker and Singularity.展开更多
基金Project supported by the National Natural Science Foundation of China (Nos. 10232040, 10572002 and 10572003)
文摘A new direct method for solving unsymmetrical sparse linear systems(USLS) arising from meshless methods was introduced. Computation of certain meshless methods such as meshless local Petrov-Galerkin (MLPG) method need to solve large USLS. The proposed solution method for unsymmetrical case performs factorization processes symmetrically on the upper and lower triangular portion of matrix, which differs from previous work based on general unsymmetrical process, and attains higher performance. It is shown that the solution algorithm for USLS can be simply derived from the existing approaches for the symmetrical case. The new matrix factorization algorithm in our method can be implemented easily by modifying a standard JKI symmetrical matrix factorization code. Multi-blocked out-of-core strategies were also developed to expand the solution scale. The approach convincingly increases the speed of the solution process, which is demonstrated with the numerical tests.
基金Project supported by the Research Fund for the Doctoral Program of Higher Education (No.20030001112).
文摘In the previous papers, a high performance sparse static solver with two-level unrolling based on a cell-sparse storage scheme was reported. Although the solver reaches quite a high efficiency for a big percentage of finite element analysis benchmark tests, the MFLOPS (million floating operations per second) of LDL^T factorization of benchmark tests vary on a Dell Pentium IV 850 MHz machine from 100 to 456 depending on the average size of the super-equations, i.e., on the average depth of unrolling. In this paper, a new sparse static solver with two-level unrolling that employs the concept of master-equations and searches for an appropriate depths of unrolling is proposed. The new solver provides higher MFLOPS for LDL^T factorization of benchmark tests, and therefore speeds up the solution process.
基金supported in part by the Talent Fund of Beijing Jiaotong University(2023XKRC017)in part by Research and Development Project of China State Railway Group Co.,Ltd.(P2022Z003).
文摘Further improving the railway innovation capacity and technological strength is the important goal of the 14th Five-Year Plan for railway scientific and technological innovation.It includes promoting the deep integration of cutting-edge technologies with the railway systems,strengthening the research and application of intelligent railway technologies,applying green computing technologies and advancing the collaborative sharing of transportation big data.The high-speed rail system tasks need to process huge amounts of data and heavy workload with the requirement of ultra-fast response.Therefore,it is of great necessity to promote computation efficiency by applying High Performance Computing(HPC)to high-speed rail systems.The HPC technique is a great solution for improving the performance,efficiency,and safety of high-speed rail systems.In this review,we introduce and analyze the application research of high performance computing technology in the field of highspeed railways.These HPC applications are cataloged into four broad categories,namely:fault diagnosis,network and communication,management system,and simulations.Moreover,challenges and issues to be addressed are discussed and further directions are suggested.
基金supported in part by National 863 Program (2009AA01Z256,No.2009AA01A345)National 973 Program (2007CB310705)the NSFC (60932004),P.R.China
文摘This paper analyzes the physical potential, computing performance benefi t and power consumption of optical interconnects. Compared with electrical interconnections, optical ones show undoubted advantages based on physical factor analysis. At the same time, since the recent developments drive us to think about whether these optical interconnect technologies with higher bandwidth but higher cost are worthy to be deployed, the computing performance comparison is performed. To meet the increasing demand of large-scale parallel or multi-processor computing tasks, an analytic method to evaluate parallel computing performance ofinterconnect systems is proposed in this paper. Both bandwidth-limit model and full-bandwidth model are under our investigation. Speedup and effi ciency are selected to represent the parallel performance of an interconnect system. Deploying the proposed models, we depict the performance gap between the optical and electrically interconnected systems. Another investigation on power consumption of commercial products showed that if the parallel interconnections are deployed, the unit power consumption will be reduced. Therefore, from the analysis of computing influence and power dissipation, we found that parallel optical interconnect is valuable combination of high performance and low energy consumption. Considering the possible data center under construction, huge power could be saved if parallel optical interconnects technologies are used.
基金support via the NSF grants NSF-19-04774,NSF-AST-2009776,NASA-2020-1241the NASA grant 80NSSC22K0628。
文摘GPU computing is expected to play an integral part in all modern Exascale supercomputers.It is also expected that higher order Godunov schemes will make up about a significant fraction of the application mix on such supercomputers.It is,therefore,very important to prepare the community of users of higher order schemes for hyperbolic PDEs for this emerging opportunity.Not every algorithm that is used in the space-time update of the solution of hyperbolic PDEs will take well to GPUs.However,we identify a small core of algorithms that take exceptionally well to GPU computing.Based on an analysis of available options,we have been able to identify weighted essentially non-oscillatory(WENO)algorithms for spatial reconstruction along with arbitrary derivative(ADER)algorithms for time extension followed by a corrector step as the winning three-part algorithmic combination.Even when a winning subset of algorithms has been identified,it is not clear that they will port seamlessly to GPUs.The low data throughput between CPU and GPU,as well as the very small cache sizes on modern GPUs,implies that we have to think through all aspects of the task of porting an application to GPUs.For that reason,this paper identifies the techniques and tricks needed for making a successful port of this very useful class of higher order algorithms to GPUs.Application codes face a further challenge—the GPU results need to be practically indistinguishable from the CPU results—in order for the legacy knowledge bases embedded in these applications codes to be preserved during the port of GPUs.This requirement often makes a complete code rewrite impossible.For that reason,it is safest to use an approach based on OpenACC directives,so that most of the code remains intact(as long as it was originally well-written).This paper is intended to be a one-stop shop for anyone seeking to make an OpenACC-based port of a higher order Godunov scheme to GPUs.We focus on three broad and high-impact areas where higher order Godunov schemes are used.The first area is computational fluid dynamics(CFD).The second is computational magnetohydrodynamics(MHD)which has an involution constraint that has to be mimetically preserved.The third is computational electrodynamics(CED)which has involution constraints and also extremely stiff source terms.Together,these three diverse uses of higher order Godunov methodology,cover many of the most important applications areas.In all three cases,we show that the optimal use of algorithms,techniques,and tricks,along with the use of OpenACC,yields superlative speedups on GPUs.As a bonus,we find a most remarkable and desirable result:some higher order schemes,with their larger operations count per zone,show better speedup than lower order schemes on GPUs.In other words,the GPU is an optimal stratagem for overcoming the higher computational complexities of higher order schemes.Several avenues for future improvement have also been identified.A scalability study is presented for a real-world application using GPUs and comparable numbers of high-end multicore CPUs.It is found that GPUs offer a substantial performance benefit over comparable number of CPUs,especially when all the methods designed in this paper are used.
文摘The high-performance computing paradigm needs high-speed switching fabrics to meet the heavy traffic generated by their applications.These switching fabrics are efficiently driven by the deployed scheduling algorithms.In this paper,we proposed two scheduling algorithms for input queued switches whose operations are based on ranking procedures.At first,we proposed a Simple 2-Bit(S2B)scheme which uses binary ranking procedure and queue size for scheduling the packets.Here,the Virtual Output Queue(VOQ)set with maximum number of empty queues receives higher rank than other VOQ’s.Through simulation,we showed S2B has better throughput performance than Highest Ranking First(HRF)arbitration under uniform,and non-uniform traffic patterns.To further improve the throughput-delay performance,an Enhanced 2-Bit(E2B)approach is proposed.This approach adopts an integer representation for rank,which is the number of empty queues in a VOQ set.The simulation result shows E2B outperforms S2B and HRF scheduling algorithms with maximum throughput-delay performance.Furthermore,the algorithms are simulated under hotspot traffic and E2B proves to be more efficient.
文摘This paper proposes algorithm for Increasing Virtual Machine Security Strategy in Cloud Computing computations.Imbalance between load and energy has been one of the disadvantages of old methods in providing server and hosting,so that if two virtual severs be active on a host and energy load be more on a host,it would allocated the energy of other hosts(virtual host)to itself to stay steady and this option usually leads to hardware overflow errors and users dissatisfaction.This problem has been removed in methods based on cloud processing but not perfectly,therefore,providing an algorithm not only will implement a suitable security background but also it will suitably divide energy consumption and load balancing among virtual severs.The proposed algorithm is compared with several previously proposed Security Strategy including SC-PSSF,PSSF and DEEAC.Comparisons show that the proposed method offers high performance computing,efficiency and consumes lower energy in the network.
文摘Today the PC class machines are quite popular for HPC area, especially on the problemsthat require the good cost/performance ratios. One of the drawback of these machines is the poormemory throughput performance. And one of the reasons of the poor performance is depend on the lack of the mapping capability of the TLB which is a buffer to accelerate the virtual memory access. In this report, I present that the mapping capability and the performance can be improved with the multi granularity TLB feature that some processors have. And I also present that the new TLB handling routine can be incorporated into the demand paging system of Linux.
文摘In recent years, the widespread adoption of parallel computing, especially in multi-core processors and high-performance computing environments, ushered in a new era of efficiency and speed. This trend was particularly noteworthy in the field of image processing, which witnessed significant advancements. This parallel computing project explored the field of parallel image processing, with a focus on the grayscale conversion of colorful images. Our approach involved integrating OpenMP into our framework for parallelization to execute a critical image processing task: grayscale conversion. By using OpenMP, we strategically enhanced the overall performance of the conversion process by distributing the workload across multiple threads. The primary objectives of our project revolved around optimizing computation time and improving overall efficiency, particularly in the task of grayscale conversion of colorful images. Utilizing OpenMP for concurrent processing across multiple cores significantly reduced execution times through the effective distribution of tasks among these cores. The speedup values for various image sizes highlighted the efficacy of parallel processing, especially for large images. However, a detailed examination revealed a potential decline in parallelization efficiency with an increasing number of cores. This underscored the importance of a carefully optimized parallelization strategy, considering factors like load balancing and minimizing communication overhead. Despite challenges, the overall scalability and efficiency achieved with parallel image processing underscored OpenMP’s effectiveness in accelerating image manipulation tasks.
基金funded by the Defense Advanced Research Project Agency(DARPA)Low Temperature Logic Technology(LTLT)program.
文摘Low temperature complementary metal oxide semiconductor(CMOS)or cryogenic CMOS is a promising avenue for the continuation of Moore’s law while serving the needs of high performance computing.With temperature as a control“knob”to steepen the subthreshold slope behavior of CMOS devices,the supply voltage of operation can be reduced with no impact on operating speed.With the optimal threshold voltage engineering,the device ON current can be further enhanced,translating to higher performance.In this article,the experimentally calibrated data was adopted to tune the threshold voltage and investigated the power performance area of cryogenic CMOS at device,circuit and system level.We also presented results from measurement and analysis of functional memory chips fabricated in 28 nm bulk CMOS and 22 nm fully depleted silicon on insulator(FDSOI)operating at cryogenic temperature.Finally,the challenges and opportunities in the further development and deployment of such systems were discussed.
基金This research was funded by the Zhejiang’JIANBING’R&D Project(No.2022C01055)Zhejiang Provincial Department of Transport Technology Project(No.2024011).
文摘To improve the efficiency of evolutionary algorithms(EAs)for solving complex problems with large populations,this paper proposes a scalable parallel evolution optimization(SPEO)framework with an elastic asynchronous migration(EAM)mechanism.SPEO addresses two main challenges that arise in large-scale parallel EAs:(1)heavy communication workload from extensive information exchange across numerous processors,which reduces computational efficiency,and(2)loss of population diversity due to similar solutions generated and shared by many processors.The EAM mechanism introduces a self-adaptive communication scheme to mitigate communication overhead,while a diversity-preserving buffer helps maintain diversity by filtering similar solutions.Experimental results on eight CEC2014 benchmark functions using up to 512 CPU cores on the Australian National Computational Infrastructure(NCI)platform demonstrate that SPEO not only scales efficiently with an increasing number of processors but also achieves improved solution quality compared to state-of-the-art island-based EAs.
文摘The integration of clusters,grids,clouds,edges and other computing platforms result in contemporary technology of jungle computing.This novel technique has the aptitude to tackle high performance computation systems and it manages the usage of all computing platforms at a time.Federated learning is a collaborative machine learning approach without centralized training data.The proposed system effectively detects the intrusion attack without human intervention and subsequently detects anomalous deviations in device communication behavior,potentially caused by malicious adversaries and it can emerge with new and unknown attacks.The main objective is to learn overall behavior of an intruder while performing attacks to the assumed target service.Moreover,the updated system model is send to the centralized server in jungle computing,to detect their pattern.Federated learning greatly helps the machine to study the type of attack from each device and this technique paves a way to complete dominion over all malicious behaviors.In our proposed work,we have implemented an intrusion detection system that has high accuracy,low False Positive Rate(FPR)scalable,and versatile for the jungle computing environment.The execution time taken to complete a round is less than two seconds,with an accuracy rate of 96%.
基金partly supported by the Supercomputer Application Project Trail Funding from Wuxi Jiangnan Institute of Computing Technology(BB2340000016)the Strategic Priority Research Program of Chinese Academy of Sciences(XDC01040100)+6 种基金the National Natural Science Foundation of China(21688102,21803066)the Anhui Initiative in Quantum Information Technologies(AHY090400)the National Key Research and Development Program of China(2016YFA0200604)the Fundamental Research Funds for Central Universities(WK2340000091)the Chinese Academy of Sciences Pioneer Hundred Talents Program(KJ2340000031)the Research Start-Up Grants(KY2340000094)the Academic Leading Talents Training Program(KY2340000103)from University of Science and Technology of China。
文摘High performance computing(HPC)is a powerful tool to accelerate the Kohn–Sham density functional theory(KS-DFT)calculations on modern heterogeneous supercomputers.Here,we describe a massively parallel implementation of discontinuous Galerkin density functional theory(DGDFT)method on the Sunway Taihu Light supercomputer.The DGDFT method uses the adaptive local basis(ALB)functions generated on-the-fly during the self-consistent field(SCF)iteration to solve the KS equations with high precision comparable to plane-wave basis set.In particular,the DGDFT method adopts a two-level parallelization strategy that deals with various types of data distribution,task scheduling,and data communication schemes,and combines with the master–slave multi-thread heterogeneous parallelism of SW26010 processor,resulting in large-scale HPC KS-DFT calculations on the Sunway Taihu Light supercomputer.We show that the DGDFT method can scale up to 8,519,680 processing cores(131,072 core groups)on the Sunway Taihu Light supercomputer for studying the electronic structures of twodimensional(2 D)metallic graphene systems that contain tens of thousands of carbon atoms.
文摘In this paper, a brief survey of smart citiy projects in Europe is presented. This survey shows the extent of transport and logistics in smart cities. We concentrate on a smart city project we have been working on that is related to A Logistic Mobile Application (ALMA). The application is based on Internet of Things and combines a communication infrastructure and a High Performance Computing infrastructure in order to deliver mobile logistic services with high quality of service and adaptation to the dynamic nature of logistic operations.
基金Project supported by the National Key Research and Development Program of China(No.2016YFB0200402)
文摘With supercomputers developing towards exascale, the number of compute cores increases dramatically, making more complex and larger-scale applications possible. The input/output (I/O) requirements of large-scale applications, workflow applications, and their checkpointing include substantial bandwidth and an extremely low latency, posing a serious challenge to high performance computing (HPC) storage systems. Current hard disk drive (HDD) based underlying storage systems are becoming more and more incompetent to meet the requirements of next-generation exascale supercomputers. To rise to the challenge, we propose a hierarchical hybrid storage system, on-line and near-line file system (ONFS). It leverages dynamic random access memory (DRAM) and solid state drive (SSD) in compute nodes, and HDD in storage servers to build a three-level storage system in a unified namespace. It supports portable operating system interface (POSIX) semantics, and provides high bandwidth, low latency, and huge storage capacity. In this paper, we present the technical details on distributed metadata management, the strategy of memory borrow and return, data consistency, parallel access control, and mechanisms guiding downward and upward migration in ONFS. We implement an ONFS prototype on the TH-1A supercomputer, and conduct experiments to test its I/O performance and scalability. The results show that the bandwidths of single-thread and multi-thread 'read'/'write' are 6-fold and 5-fold better than HDD-based Lustre, respectively. The I/O bandwidth of data-intensive applications in ONFS can be 6.35 timcs that in Lustre.
基金the National Natural Science Foundation of China(No.41971356,41701446)National Key Research and Development Program of China(No.2017YFB0503600,2018YFB0505500,2017YFC0602204).
文摘Parallel vector buffer analysis approaches can be classified into 2 types:algorithm-oriented parallel strategy and the data-oriented parallel strategy.These methods do not take its applicability on the existing geographic information systems(GIS)platforms into consideration.In order to address the problem,a spatial decomposition approach for accelerating buffer analysis of vector data is proposed.The relationship between the number of vertices of each feature and the buffer analysis computing time is analyzed to generate computational intensity transformation functions(CITFs).Then,computational intensity grids(CIGs)of polyline and polygon are constructed based on the relative CITFs.Using the corresponding CIGs,a spatial decomposition method for parallel buffer analysis is developed.Based on the computational intensity of the features and the sub-domains generated in the decomposition,the features are averagely assigned within the sub-domains into parallel buffer analysis tasks for load balance.Compared with typical regular domain decomposition methods,the new approach accomplishes greater balanced decomposition of computational intensity for parallel buffer analysis and achieves near-linear speedups.
文摘A new approach for the implementation of variogram models and ordinary kriging using the R statistical language, in conjunction with Fortran, the MPI (Message Passing Interface), and the "pbdDMAT" package within R on the Bridges and Stampede Supercomputers will be described. This new technique has led to great improvements in timing as compared to those in R alone, or R with C and MPI. These improvements include processing and forecasting vectors of size 25,000 in an average time of 6 minutes on the Stampede Supercomputer and 2.5 minutes on the Bridges Supercomputer as compared to previous processing times of 3.5 hours.
文摘A multifrontal code is introduced for the efficient solution of the linear system of equations arising from the analysis of structures. The factorization phase is reduced into a series of interleaved element assembly and dense matrix operations for which the BLAS3 kernels are used. A similar approach is generalized for the forward and back substitution phases for the efficient solution of structures having multiple load conditions. The program performs all assembly and solution steps in parallel. Examples are presented which demonstrate the code’s performance on single and dual core processor computers.
基金the Tianhe Supercomputer Project(No.2018YFB0204301)the National Natural Science Foundation of China(No.61902405)+1 种基金the PDL Research Fund(No.6142110190404)the National High-Level Personnel for Defense Technology Program(No.2017-JCJQ-ZQ-013)。
文摘Traditional high performance computing(HPC)systems provide a standard preset environment to support scientific computation.However,HPC development needs to provide support for more and more diverse applications,such as artificial intelligence and big data.The standard preset environment can no longer meet these diverse requirements.If users still run these emerging applications on HPC systems,they need to manually maintain the specific dependencies(libraries,environment variables,and so on)of their applications.This increases the development and deployment burden for users.Moreover,the multi-user mode brings about privacy problems among users.Containers like Docker and Singularity can encapsulate the job’s execution environment,but in a highly customized HPC system,cross-environment application deployment of Docker and Singularity is limited.The introduction of container images also imposes a maintenance burden on system administrators.Facing the above-mentioned problems,in this paper we propose a self-deployed execution environment(SDEE)for HPC.SDEE combines the advantages of traditional virtualization and modern containers.SDEE provides an isolated and customizable environment(similar to a virtual machine)to the user.The user is the root user in this environment.The user develops and debugs the application and deploys its special dependencies in this environment.Then the user can load the job to compute nodes directly through the traditional HPC job management system.The job and its dependencies are analyzed,packaged,deployed,and executed automatically.This process enables transparent and rapid job deployment,which not only reduces the burden on users,but also protects user privacy.Experiments show that the overhead introduced by SDEE is negligible and lower than those of both Docker and Singularity.