Developing parallel applications on heterogeneous processors is facing the challenges of 'memory wall',due to limited capacity of local storage,limited bandwidth and long latency for memory access. Aiming at t...Developing parallel applications on heterogeneous processors is facing the challenges of 'memory wall',due to limited capacity of local storage,limited bandwidth and long latency for memory access. Aiming at this problem,a parallelization approach was proposed with six memory optimization schemes for CG,four schemes of them aiming at all kinds of sparse matrix-vector multiplication (SPMV) operation. Conducted on IBM QS20,the parallelization approach can reach up to 21 and 133 times speedups with size A and B,respectively,compared with single power processor element. Finally,the conclusion is drawn that the peak bandwidth of memory access on Cell BE can be obtained in SPMV,simple computation is more efficient on heterogeneous processors and loop-unrolling can hide local storage access latency while executing scalar operation on SIMD cores.展开更多
The Long Term Evolution (LTE) system imposes high requirements for dispatching delay.Moreover,very large air interface rate of LTE requires good processing capability for the devices processing the baseband signals.Co...The Long Term Evolution (LTE) system imposes high requirements for dispatching delay.Moreover,very large air interface rate of LTE requires good processing capability for the devices processing the baseband signals.Consequently,the single-core processor cannot meet the requirements of LTE system.This paper analyzes how to use multi-core processors to achieve parallel processing of uplink demodulation and decoding in LTE systems and designs an approach to parallel processing.The test results prove that this approach works quite well.展开更多
The availability of computers and communication networks allows us to gather and analyse data on a far larger scale than previously. At present, it is believed that statistics is a suitable method to analyse networks ...The availability of computers and communication networks allows us to gather and analyse data on a far larger scale than previously. At present, it is believed that statistics is a suitable method to analyse networks with millions, or more, of vertices. The MATLAB language, with its mass of statistical functions, is a good choice to rapidly realize an algorithm prototype of complex networks. The performance of the MATLAB codes can be further improved by using graphic processor units (GPU). This paper presents the strategies and performance of the GPU implementation of a complex networks package, and the Jacket toolbox of MATLAB is used. Compared with some commercially available CPU implementations, GPU can achieve a speedup of, on average, 11.3x. The experimental result proves that the GPU platform combined with the MATLAB language is a good combination for complex network research.展开更多
Virtualization is the key technology of cloud computing. Network virtualization plays an important role in this field. Its performance is very relevant to network virtualizing. Nowadays its implementations are mainly ...Virtualization is the key technology of cloud computing. Network virtualization plays an important role in this field. Its performance is very relevant to network virtualizing. Nowadays its implementations are mainly based on the idea of Software Define Network (SDN). Open vSwitch is a sort of software virtual switch, which conforms to the OpenFlow protocol standard. It is basically deployed in the Linux kernel hypervisor. This leads to its performance relatively poor because of the limited system resource. In turn, the packet process throughput is very low.In this paper, we present a Cavium-based Open vSwitch implementation. The Cavium platform features with multi cores and couples of hard ac-celerators. It supports zero-copy of packets and handles packet more quickly. We also carry some experiments on the platform. It indicates that we can use it in the enterprise network or campus network as convergence layer and core layer device.展开更多
Modern shared-memory multi-core processors typically have shared Level 2(L2)or Level 3(L3)caches.Cache bottlenecks and replacement strategies are the main problems of such architectures,where multiple cores try to acc...Modern shared-memory multi-core processors typically have shared Level 2(L2)or Level 3(L3)caches.Cache bottlenecks and replacement strategies are the main problems of such architectures,where multiple cores try to access the shared cache simultaneously.The main problem in improving memory performance is the shared cache architecture and cache replacement.This paper documents the implementation of a Dual-Port Content Addressable Memory(DPCAM)and a modified Near-Far Access Replacement Algorithm(NFRA),which was previously proposed as a shared L2 cache layer in a multi-core processor.Standard Performance Evaluation Corporation(SPEC)Central Processing Unit(CPU)2006 benchmark workloads are used to evaluate the benefit of the shared L2 cache layer.Results show improved performance of the multicore processor’s DPCAM and NFRA algorithms,corresponding to a higher number of concurrent accesses to shared memory.The new architecture significantly increases system throughput and records performance improvements of up to 8.7%on various types of SPEC 2006 benchmarks.The miss rate is also improved by about 13%,with some exceptions in the sphinx3 and bzip2 benchmarks.These results could open a new window for solving the long-standing problems with shared cache in multi-core processors.展开更多
Network processors (NPs) are widely used for programmable and high-performance networks;however, the programs for NPs are less portable, the number of NP program developers is small, and the development cost is high. ...Network processors (NPs) are widely used for programmable and high-performance networks;however, the programs for NPs are less portable, the number of NP program developers is small, and the development cost is high. To solve these problems, this paper proposes an open, high-level, and portable programming language called “Phonepl”, which is independent from vendor-specific proprietary hardware and software but can be translated into an NP program with high performance especially in the memory use. A common NP hardware feature is that a whole packet is stored in DRAM, but the header is cached in SRAM. Phonepl has a hardware-independent abstraction of this feature so that it allows programmers mostly unconscious of this hardware feature. To implement the abstraction, four representations of packet data type that cover all the packet operations (including substring, concatenation, input, and output) are introduced. Phonepl have been implemented on Octeon NPs used in plug-ins for a network-virtualization environment called the VNode Infrastructure, and several packet-handling programs were evaluated. As for the evaluation result, the conversion throughput is close to the wire rate, i.e., 10 Gbps, and no packet loss (by cache miss) occurs when the packet size is 256 bytes or larger.展开更多
As the traditional RISC+ASIC/ASSP approach for network processor design can not meet the today’s requirements, this paper described an alternate approach, Reconfigurable Processing Architecture, to boost the performa...As the traditional RISC+ASIC/ASSP approach for network processor design can not meet the today’s requirements, this paper described an alternate approach, Reconfigurable Processing Architecture, to boost the performance to ASIC level while reserve the programmability of the traditional RISC based system. This paper covers both the hardware architecture and the software development environment architecture.展开更多
Network processors are used in the core node of network to flexibly process packet streams. With the increase of performance, the power of network processor increases fast, and power and cooling become a bottleneck. A...Network processors are used in the core node of network to flexibly process packet streams. With the increase of performance, the power of network processor increases fast, and power and cooling become a bottleneck. Architecture-level power conscious design must go beyond low-level circuit design. Architectural power and performance tradeoff should be considered at the same time. Simulation is an efficient method to design modem network processor before making chip. In order to achieve the tradeoff between performance and power, the processor simulator is used to design the architecture of network processor. Using Netbeneh, Commubench benchmark and processor simulator-SimpleScalar, the performance and power of network processor are quantitatively evaluated. New performance tradeoff evaluation metric is proposed to analyze the architecture of network processor. Based on the high performance lnteI IXP 2800 Network processor eonfignration, optimized instruction fetch width and speed ,instruction issue width, instruction window size are analyzed and selected. Simulation resuits show that the tradeoff design method makes the usage of network processor more effectively. The optimal key parameters of network processor are important in architecture-level design. It is meaningful for the next generation network processor design.展开更多
This paper presents a new encryption embedded processor aimed at the application requirement of wireless sensor network (WSN). The new encryption embedded processor not only offers Rivest Shamir Adlemen (RSA), Adv...This paper presents a new encryption embedded processor aimed at the application requirement of wireless sensor network (WSN). The new encryption embedded processor not only offers Rivest Shamir Adlemen (RSA), Advanced Encryption Standard (AES), 3 Data Encryption Standard (3 DES) and Secure Hash Algorithm 1 (SHA - 1 ) security engines, but also involves a new memory encryption scheme. The new memory encryption scheme is implemented by a memory encryption cache (MEC), which protects the confidentiality of the memory by AES encryption. The experi- ments show that the new secure design only causes 1.9% additional delay on the critical path and cuts 25.7% power consumption when the processor writes data back. The new processor balances the performance overhead, the power consumption and the security and fully meets the wireless sensor environment requirement. After physical design, the new encryption embedded processor has been successfully tape-out.展开更多
为探索民机驾驶舱人机交互典型场景中人为差错发生的认知层面原因,运用人的排队网络信息加工模型(Queuing Network-Model Human Processor, QN-MHP)和人因可靠性方法对空速不可靠场景下的飞行员行为进行仿真研究。首先,通过设计任务及...为探索民机驾驶舱人机交互典型场景中人为差错发生的认知层面原因,运用人的排队网络信息加工模型(Queuing Network-Model Human Processor, QN-MHP)和人因可靠性方法对空速不可靠场景下的飞行员行为进行仿真研究。首先,通过设计任务及场景进行任务建模;然后,对模型中表示各脑区功能服务器的处理时间、处理容量及实体处理路径与差错概率赋值,进行24次仿真模拟;最后,通过设计模拟飞行试验,验证QN-MHP模型在民机驾驶舱人机交互研究中的可行性。结果表明,在空客A320机型空速不可靠处置任务中,飞行员在处置路径上易发生人为差错,在故障的识别、判断等关键节点也有少数差错发生,且任务过程中飞行员眼部利用率较高。研究表明,飞行员过高的用眼负荷是导致驾驶舱人机交互失效的原因之一,在未来驾驶舱人机交互流程设计及飞行训练中应予以重点关注。展开更多
为研究异构多核片上系统(multi-processor system on chip,MPSoC)在密集并行计算任务中的潜力,文章设计并实现了一种适用于粗粒度数据特征、面向任务级并行应用的异构多核系统动态调度协处理器,采用了片上缓存、任务输出的多级写回管理...为研究异构多核片上系统(multi-processor system on chip,MPSoC)在密集并行计算任务中的潜力,文章设计并实现了一种适用于粗粒度数据特征、面向任务级并行应用的异构多核系统动态调度协处理器,采用了片上缓存、任务输出的多级写回管理、任务自动映射、通讯任务乱序执行等机制。实验结果表明,该动态调度协处理器不仅能够实现任务级乱序执行等基本设计目标,还具有极低的调度开销,相较于基于动态记分牌算法的调度器,运行多个子孔径距离压缩算法的时间降低达17.13%。研究结果证明文章设计的动态调度协处理器能够有效优化目标场景下的任务调度效果。展开更多
The accuracy of present flatness predictive method is limited and it just belongs to software simulation. In order to improve it, a novel flatness predictive model via T-S cloud reasoning network implemented by digita...The accuracy of present flatness predictive method is limited and it just belongs to software simulation. In order to improve it, a novel flatness predictive model via T-S cloud reasoning network implemented by digital signal processor(DSP) is proposed. First, the combination of genetic algorithm(GA) and simulated annealing algorithm(SAA) is put forward, called GA-SA algorithm, which can make full use of the global search ability of GA and local search ability of SA. Later, based on T-S cloud reasoning neural network, flatness predictive model is designed in DSP. And it is applied to 900 HC reversible cold rolling mill. Experimental results demonstrate that the flatness predictive model via T-S cloud reasoning network can run on the hardware DSP TMS320 F2812 with high accuracy and robustness by using GA-SA algorithm to optimize the model parameter.展开更多
基金Project(2008AA01A201) supported the National High-tech Research and Development Program of ChinaProjects(60833004, 60633050) supported by the National Natural Science Foundation of China
文摘Developing parallel applications on heterogeneous processors is facing the challenges of 'memory wall',due to limited capacity of local storage,limited bandwidth and long latency for memory access. Aiming at this problem,a parallelization approach was proposed with six memory optimization schemes for CG,four schemes of them aiming at all kinds of sparse matrix-vector multiplication (SPMV) operation. Conducted on IBM QS20,the parallelization approach can reach up to 21 and 133 times speedups with size A and B,respectively,compared with single power processor element. Finally,the conclusion is drawn that the peak bandwidth of memory access on Cell BE can be obtained in SPMV,simple computation is more efficient on heterogeneous processors and loop-unrolling can hide local storage access latency while executing scalar operation on SIMD cores.
文摘The Long Term Evolution (LTE) system imposes high requirements for dispatching delay.Moreover,very large air interface rate of LTE requires good processing capability for the devices processing the baseband signals.Consequently,the single-core processor cannot meet the requirements of LTE system.This paper analyzes how to use multi-core processors to achieve parallel processing of uplink demodulation and decoding in LTE systems and designs an approach to parallel processing.The test results prove that this approach works quite well.
基金Project supported by the Science Fund for Creative Research Groups of the National Natural Science Foundation of China (Grant No.60921062)the National Natural Science Foundation of China (Grant No.60873014)the Young Scientists Fund of the National Natural Science Foundation of China (Grant Nos.61003082 and 60903059)
文摘The availability of computers and communication networks allows us to gather and analyse data on a far larger scale than previously. At present, it is believed that statistics is a suitable method to analyse networks with millions, or more, of vertices. The MATLAB language, with its mass of statistical functions, is a good choice to rapidly realize an algorithm prototype of complex networks. The performance of the MATLAB codes can be further improved by using graphic processor units (GPU). This paper presents the strategies and performance of the GPU implementation of a complex networks package, and the Jacket toolbox of MATLAB is used. Compared with some commercially available CPU implementations, GPU can achieve a speedup of, on average, 11.3x. The experimental result proves that the GPU platform combined with the MATLAB language is a good combination for complex network research.
文摘Virtualization is the key technology of cloud computing. Network virtualization plays an important role in this field. Its performance is very relevant to network virtualizing. Nowadays its implementations are mainly based on the idea of Software Define Network (SDN). Open vSwitch is a sort of software virtual switch, which conforms to the OpenFlow protocol standard. It is basically deployed in the Linux kernel hypervisor. This leads to its performance relatively poor because of the limited system resource. In turn, the packet process throughput is very low.In this paper, we present a Cavium-based Open vSwitch implementation. The Cavium platform features with multi cores and couples of hard ac-celerators. It supports zero-copy of packets and handles packet more quickly. We also carry some experiments on the platform. It indicates that we can use it in the enterprise network or campus network as convergence layer and core layer device.
文摘Modern shared-memory multi-core processors typically have shared Level 2(L2)or Level 3(L3)caches.Cache bottlenecks and replacement strategies are the main problems of such architectures,where multiple cores try to access the shared cache simultaneously.The main problem in improving memory performance is the shared cache architecture and cache replacement.This paper documents the implementation of a Dual-Port Content Addressable Memory(DPCAM)and a modified Near-Far Access Replacement Algorithm(NFRA),which was previously proposed as a shared L2 cache layer in a multi-core processor.Standard Performance Evaluation Corporation(SPEC)Central Processing Unit(CPU)2006 benchmark workloads are used to evaluate the benefit of the shared L2 cache layer.Results show improved performance of the multicore processor’s DPCAM and NFRA algorithms,corresponding to a higher number of concurrent accesses to shared memory.The new architecture significantly increases system throughput and records performance improvements of up to 8.7%on various types of SPEC 2006 benchmarks.The miss rate is also improved by about 13%,with some exceptions in the sphinx3 and bzip2 benchmarks.These results could open a new window for solving the long-standing problems with shared cache in multi-core processors.
文摘Network processors (NPs) are widely used for programmable and high-performance networks;however, the programs for NPs are less portable, the number of NP program developers is small, and the development cost is high. To solve these problems, this paper proposes an open, high-level, and portable programming language called “Phonepl”, which is independent from vendor-specific proprietary hardware and software but can be translated into an NP program with high performance especially in the memory use. A common NP hardware feature is that a whole packet is stored in DRAM, but the header is cached in SRAM. Phonepl has a hardware-independent abstraction of this feature so that it allows programmers mostly unconscious of this hardware feature. To implement the abstraction, four representations of packet data type that cover all the packet operations (including substring, concatenation, input, and output) are introduced. Phonepl have been implemented on Octeon NPs used in plug-ins for a network-virtualization environment called the VNode Infrastructure, and several packet-handling programs were evaluated. As for the evaluation result, the conversion throughput is close to the wire rate, i.e., 10 Gbps, and no packet loss (by cache miss) occurs when the packet size is 256 bytes or larger.
文摘As the traditional RISC+ASIC/ASSP approach for network processor design can not meet the today’s requirements, this paper described an alternate approach, Reconfigurable Processing Architecture, to boost the performance to ASIC level while reserve the programmability of the traditional RISC based system. This paper covers both the hardware architecture and the software development environment architecture.
基金Sponsored by the National Defence Research Foundation of China(Grant No.413460303).
文摘Network processors are used in the core node of network to flexibly process packet streams. With the increase of performance, the power of network processor increases fast, and power and cooling become a bottleneck. Architecture-level power conscious design must go beyond low-level circuit design. Architectural power and performance tradeoff should be considered at the same time. Simulation is an efficient method to design modem network processor before making chip. In order to achieve the tradeoff between performance and power, the processor simulator is used to design the architecture of network processor. Using Netbeneh, Commubench benchmark and processor simulator-SimpleScalar, the performance and power of network processor are quantitatively evaluated. New performance tradeoff evaluation metric is proposed to analyze the architecture of network processor. Based on the high performance lnteI IXP 2800 Network processor eonfignration, optimized instruction fetch width and speed ,instruction issue width, instruction window size are analyzed and selected. Simulation resuits show that the tradeoff design method makes the usage of network processor more effectively. The optimal key parameters of network processor are important in architecture-level design. It is meaningful for the next generation network processor design.
文摘This paper presents a new encryption embedded processor aimed at the application requirement of wireless sensor network (WSN). The new encryption embedded processor not only offers Rivest Shamir Adlemen (RSA), Advanced Encryption Standard (AES), 3 Data Encryption Standard (3 DES) and Secure Hash Algorithm 1 (SHA - 1 ) security engines, but also involves a new memory encryption scheme. The new memory encryption scheme is implemented by a memory encryption cache (MEC), which protects the confidentiality of the memory by AES encryption. The experi- ments show that the new secure design only causes 1.9% additional delay on the critical path and cuts 25.7% power consumption when the processor writes data back. The new processor balances the performance overhead, the power consumption and the security and fully meets the wireless sensor environment requirement. After physical design, the new encryption embedded processor has been successfully tape-out.
文摘为探索民机驾驶舱人机交互典型场景中人为差错发生的认知层面原因,运用人的排队网络信息加工模型(Queuing Network-Model Human Processor, QN-MHP)和人因可靠性方法对空速不可靠场景下的飞行员行为进行仿真研究。首先,通过设计任务及场景进行任务建模;然后,对模型中表示各脑区功能服务器的处理时间、处理容量及实体处理路径与差错概率赋值,进行24次仿真模拟;最后,通过设计模拟飞行试验,验证QN-MHP模型在民机驾驶舱人机交互研究中的可行性。结果表明,在空客A320机型空速不可靠处置任务中,飞行员在处置路径上易发生人为差错,在故障的识别、判断等关键节点也有少数差错发生,且任务过程中飞行员眼部利用率较高。研究表明,飞行员过高的用眼负荷是导致驾驶舱人机交互失效的原因之一,在未来驾驶舱人机交互流程设计及飞行训练中应予以重点关注。
文摘为研究异构多核片上系统(multi-processor system on chip,MPSoC)在密集并行计算任务中的潜力,文章设计并实现了一种适用于粗粒度数据特征、面向任务级并行应用的异构多核系统动态调度协处理器,采用了片上缓存、任务输出的多级写回管理、任务自动映射、通讯任务乱序执行等机制。实验结果表明,该动态调度协处理器不仅能够实现任务级乱序执行等基本设计目标,还具有极低的调度开销,相较于基于动态记分牌算法的调度器,运行多个子孔径距离压缩算法的时间降低达17.13%。研究结果证明文章设计的动态调度协处理器能够有效优化目标场景下的任务调度效果。
基金Project(E2015203354)supported by Natural Science Foundation of Steel United Research Fund of Hebei Province,ChinaProject(ZD2016100)supported by the Science and the Technology Research Key Project of High School of Hebei Province,China+1 种基金Project(LJRC013)supported by the University Innovation Team of Hebei Province Leading Talent Cultivation,ChinaProject(16LGY015)supported by the Basic Research Special Breeding of Yanshan University,China
文摘The accuracy of present flatness predictive method is limited and it just belongs to software simulation. In order to improve it, a novel flatness predictive model via T-S cloud reasoning network implemented by digital signal processor(DSP) is proposed. First, the combination of genetic algorithm(GA) and simulated annealing algorithm(SAA) is put forward, called GA-SA algorithm, which can make full use of the global search ability of GA and local search ability of SA. Later, based on T-S cloud reasoning neural network, flatness predictive model is designed in DSP. And it is applied to 900 HC reversible cold rolling mill. Experimental results demonstrate that the flatness predictive model via T-S cloud reasoning network can run on the hardware DSP TMS320 F2812 with high accuracy and robustness by using GA-SA algorithm to optimize the model parameter.