Funding: This work was supported by the National Grand Fundamental Research "973" Program of China (No. 2004CB719401).
Abstract: Objective: To reduce the execution time of neural network training. Methods: A parallel particle swarm optimization algorithm based on the master-slave model is proposed to train radial basis function neural networks; it is implemented on a cluster using MPI libraries for inter-process communication. Results: A high speed-up factor is achieved and the execution time is greatly reduced. In addition, the resulting neural network has good classification accuracy on both the training sets and the test sets. Conclusion: Because fitness evaluation is computationally intensive, parallel particle swarm optimization offers a clear advantage in speeding up neural network training.
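The master-slave pattern described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a thread pool stands in for the MPI worker processes, a sphere function stands in for the radial basis function network's training error, and the inertia and acceleration coefficients (0.7, 1.5, 1.5) are conventional illustrative values. The names `sphere` and `parallel_pso` are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def sphere(position):
    """Stand-in fitness (assumption): sum of squares, lower is better."""
    return sum(x * x for x in position)


def parallel_pso(fitness, dim=4, swarm_size=12, iterations=60, workers=4, seed=0):
    rng = random.Random(seed)
    # Master state: particle positions and velocities.
    pos = [[rng.uniform(-5.0, 5.0) for _ in range(dim)] for _ in range(swarm_size)]
    vel = [[0.0] * dim for _ in range(swarm_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Workers ("slaves") evaluate fitness; the master collects the scores.
        pbest_val = list(pool.map(fitness, pos))
        pbest = [p[:] for p in pos]
        best_i = min(range(swarm_size), key=pbest_val.__getitem__)
        gbest, gbest_val = pbest[best_i][:], pbest_val[best_i]
        initial_best = gbest_val
        for _ in range(iterations):
            # Cheap step, done serially on the master: move every particle.
            for i in range(swarm_size):
                for d in range(dim):
                    r1, r2 = rng.random(), rng.random()
                    vel[i][d] = (0.7 * vel[i][d]
                                 + 1.5 * r1 * (pbest[i][d] - pos[i][d])
                                 + 1.5 * r2 * (gbest[d] - pos[i][d]))
                    pos[i][d] += vel[i][d]
            # Expensive step, farmed out to the workers: fitness evaluation.
            scores = list(pool.map(fitness, pos))
            for i, score in enumerate(scores):
                if score < pbest_val[i]:
                    pbest_val[i], pbest[i] = score, pos[i][:]
                    if score < gbest_val:
                        gbest_val, gbest = score, pos[i][:]
    return gbest, gbest_val, initial_best
```

Because `pool.map` returns results in submission order, the run is reproducible for a fixed seed. Python threads will not speed up a CPU-bound pure-Python fitness function; the sketch only shows where the parallel boundary sits, which is around the fitness evaluations.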
Funding: Peng Xie acknowledges the support from the China Scholarship Council (Grant No. 201804910829).
Abstract: Parallel multi-thread processing in advanced intelligent processors is central to realizing high-speed and high-capacity signal processing systems. Optical neural networks (ONNs) have the native advantages of high parallelization, large bandwidth, and low power consumption, which suit the demands of big data. Here, we demonstrate a dual-layer ONN with a Mach-Zehnder interferometer (MZI) network and a nonlinear layer, in which the nonlinear activation function is achieved by optical-electronic signal conversion. Two frequency components from a microcomb source carrying digit datasets are simultaneously imposed and recognized through the ONN. We successfully classify the digits carried by the different frequency components by demultiplexing the output signal and testing the power distribution. The feasibility of efficient parallelization with wavelength division multiplexing is demonstrated in our high-dimensional ONN. This work provides a high-performance architecture for future parallel high-capacity optical analog computing.
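To make the MZI building block concrete, the sketch below models a single ideal 2×2 MZI as two 50:50 couplers with an external phase shifter `phi` and an internal phase shifter `theta`. The matrix convention, the lossless assumption, and the function names (`mzi`, `output_powers`) are illustrative assumptions and are not taken from the paper; real meshes also differ in where the phase shifters sit.

```python
import cmath
import math


def matmul2(a, b):
    """Multiply two 2x2 complex matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def shifter(angle):
    """Phase shifter acting on the upper arm only."""
    return [[cmath.exp(1j * angle), 0], [0, 1]]


def mzi(theta, phi):
    """Transfer matrix: phase phi -> coupler -> phase theta -> coupler."""
    s = 1 / math.sqrt(2)
    coupler = [[s, 1j * s], [1j * s, s]]  # ideal 50:50 directional coupler
    m = matmul2(coupler, shifter(phi))
    m = matmul2(shifter(theta), m)
    return matmul2(coupler, m)


def output_powers(theta, phi, field):
    """Optical power at the two output ports for a two-port input field."""
    m = mzi(theta, phi)
    out = [m[0][0] * field[0] + m[0][1] * field[1],
           m[1][0] * field[0] + m[1][1] * field[1]]
    return [abs(x) ** 2 for x in out]
```

With light entering only the upper port, this convention gives output powers of sin²(theta/2) and cos²(theta/2), so `theta` steers power between the ports while the total is conserved. Because the device is linear, each wavelength channel can be propagated through the same matrix independently, which is the property wavelength division multiplexing relies on (ignoring dispersion, another assumption here).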
Funding: The National Natural Science Foundation of China (No. 61603091).
Abstract: To improve the detection accuracy of small objects, a neighborhood fusion-based hierarchical parallel feature pyramid network (NFPN) is proposed. Unlike the layer-by-layer structure adopted in the feature pyramid network (FPN) and the deconvolutional single shot detector (DSSD), where the bottom layer of the feature pyramid network relies on the top layer, NFPN builds the feature pyramid network with no connections between the upper and lower layers. That is, it fuses only shallow features on similar scales. NFPN is highly portable and can be embedded in many models to further boost performance. Extensive experiments on the PASCAL VOC 2007, PASCAL VOC 2012, and COCO datasets demonstrate that the NFPN-based SSD, without intricate tricks, exceeds the DSSD model in both detection accuracy and inference speed, especially for small objects, e.g., 4% to 5% higher mAP (mean average precision) than SSD and 2% to 3% higher mAP than DSSD. On the VOC 2007 test set, the NFPN-based SSD with 300×300 input reaches 79.4% mAP at 34.6 frames/s, and the mAP rises to 82.9% with the multi-scale testing strategy.
Funding: The Scientific Research Program Funded by Shaanxi Provincial Education Department (20JY058).
Abstract: To study and optimize the performance of general-purpose computing on graphics processing units (GPGPU) based on a single instruction multiple threads (SIMT) processor in neural network applications, this work contributes a self-developed SIMT processor named Pomelo and a corresponding assembly program. The parallel mechanism of the SIMT computing mode and the self-developed Pomelo processor are briefly introduced. A common convolutional neural network (CNN) is built to verify the compatibility and functionality of the Pomelo processor. A CNN computing flow with task-level and hardware-level optimization is adopted on the Pomelo processor. A specific algorithm for organizing a Z-shaped memory structure is developed to reduce memory access in mass data computing tasks. With this combined adaptation and optimization strategy, the experimental results demonstrate that reducing memory access in the SIMT computing mode plays a crucial role in improving performance. A 6.52-fold performance gain is achieved in the case of 4 processing elements.
Funding: National Natural Science Foundation of China (No. 60234020).
Abstract: A neurocomputing model for the genetic algorithm (GA) is proposed to break the speed bottleneck of GA. With all genetic operations implemented in parallel by NN-based sub-modules, the model integrates the strengths of both parallel GA (PGA) and hardware GA (HGA). A new crossover operator, named universe crossover, is also proposed to suit the NN-based realization. The model was tested on a benchmark function set, and the experimental results validated its potential. The significance of this model is that HGA and PGA can be integrated and the inherent parallelism of GA can be realized explicitly and to the fullest extent. As a result, the optimization speed of GA can be accelerated by one or two orders of magnitude compared with a serial implementation on hardware of the same speed, and GA is turned from an algorithm into a machine.
Funding: Supported by the General Program of National Natural Science Foundation of China (No. 61872043).
Abstract: Efficient support for artificial intelligence (AI) applications on heterogeneous mobile platforms is important, especially the coordinated execution of a deep neural network (DNN) model on multiple computing devices of one mobile platform. This paper proposes HOPE, an end-to-end heterogeneous inference framework that runs on mobile platforms and distributes the operators in a DNN model to different computing devices. The problem is formalized as an integer linear programming (ILP) problem, and a heuristic algorithm is proposed to determine a near-optimal heterogeneous execution plan. The experimental results demonstrate that HOPE reduces inference latency by up to 36.2% (22.0% on average) compared with MOSAIC, by up to 22.0% (10.2% on average) compared with StarPU, and by up to 41.8% (18.4% on average) compared with μLayer.
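A toy version of the operator placement problem helps show the trade-off being optimized. The sketch below is not HOPE's ILP formulation or its heuristic: it assumes the model is a simple chain of operators (real DNN graphs are generally DAGs), a hypothetical per-device latency table, and a single fixed transfer cost whenever consecutive operators run on different devices. Under those assumptions, dynamic programming finds the exact optimum. The function name `best_placement` is hypothetical.

```python
def best_placement(op_latency, transfer_cost):
    """Return (plan, total_latency) for a chain of operators.

    op_latency[i][d] is the latency of operator i on device d.
    transfer_cost is added whenever consecutive operators switch devices.
    """
    devices = range(len(op_latency[0]))
    # cost[d]: best latency of the prefix whose last operator runs on device d.
    cost = list(op_latency[0])
    back = []  # back[i][d]: best device for operator i given operator i+1 on d
    for lat in op_latency[1:]:
        new_cost, choice = [], []
        for d in devices:
            prev = min(devices,
                       key=lambda p: cost[p] + (0 if p == d else transfer_cost))
            switch = 0 if prev == d else transfer_cost
            new_cost.append(cost[prev] + switch + lat[d])
            choice.append(prev)
        cost = new_cost
        back.append(choice)
    # Walk the back-pointers from the best final device to recover the plan.
    d = min(devices, key=cost.__getitem__)
    total = cost[d]
    plan = [d]
    for choice in reversed(back):
        d = choice[d]
        plan.append(d)
    plan.reverse()
    return plan, total


# Hypothetical latencies (ms) for 4 operators on device 0 (CPU) and device 1 (GPU).
latencies = [[2, 5], [8, 3], [9, 2], [1, 4]]
plan, total = best_placement(latencies, transfer_cost=2)
print(plan, total)  # [0, 1, 1, 0] 12
```

In this example, running everything on one device costs 20 ms (CPU) or 14 ms (GPU), while the mixed plan costs 12 ms even after paying for two transfers. That gap is the kind of saving a heterogeneous execution plan targets.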
Abstract: A configurable U-Net architecture is trained to solve multi-scale elliptic partial differential equations. The motivation is to reduce the computational cost of numerically solving the Navier-Stokes equations, the governing equations of fluid dynamics. Building on the underlying concept of V-cycle multigrid methods, a neural network framework using the U-Net architecture is optimized to solve the Poisson and Helmholtz equations, the characteristic forms of the discretized Navier-Stokes equations. The results demonstrate that the optimized U-Net captures the high-dimensional mathematical features of the elliptic operator and converges better than the multigrid method. The best trade-off between error and FLOPS is the (3, 2, 5) case, with 3 stacks of U-Nets, 2 initial features, 5 depth layers, and ELU activation. Further, by training the network with multi-scale synthetic data, the finer features of the physical system are captured.
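For context on the cost that motivates this work, the sketch below solves a 1D Poisson problem with plain Jacobi sweeps, the kind of smoother that multigrid methods apply across a hierarchy of grids. It is a classical baseline under illustrative assumptions (1D domain, zero boundary values, a manufactured solution sin(πx)), not the paper's U-Net or its multigrid reference. The names `jacobi_poisson_1d` and `max_error` are hypothetical.

```python
import math


def jacobi_poisson_1d(f, n=32, sweeps=5000):
    """Solve -u'' = f on (0, 1) with u(0) = u(1) = 0 using Jacobi sweeps."""
    h = 1.0 / n
    x = [i * h for i in range(n + 1)]
    u = [0.0] * (n + 1)
    for _ in range(sweeps):
        new = u[:]
        for i in range(1, n):
            # Central difference: (-u[i-1] + 2*u[i] - u[i+1]) / h**2 = f(x[i])
            new[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f(x[i]))
        u = new
    return x, u


def max_error(n=32, sweeps=5000):
    """Max-norm error against the exact solution u(x) = sin(pi * x)."""
    x, u = jacobi_poisson_1d(
        lambda t: math.pi ** 2 * math.sin(math.pi * t), n, sweeps)
    return max(abs(ui - math.sin(math.pi * xi)) for xi, ui in zip(x, u))
```

On this 32-interval grid the smooth error component shrinks by only about 0.5% per sweep, so thousands of sweeps are needed before the discretization error dominates. That slow convergence on smooth modes is what multigrid V-cycles, and learned solvers inspired by them, aim to avoid.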