期刊文献+
共找到2篇文章
< 1 >
每页显示 20 50 100
国产SW26010-Pro处理器上3级BLAS函数众核并行优化 被引量:3
1
作者 胡怡 陈道琨 +5 位作者 杨超 马文静 刘芳芳 宋超博 孙强 史俊达 《软件学报》 EI CSCD 北大核心 2024年第3期1569-1584,共16页
BLAS(basic linear algebra subprograms)是最基本、最重要的底层数学库之一.在一个标准的BLAS库中,BLAS 3级函数涵盖的矩阵-矩阵运算尤为重要,在许多大规模科学与工程计算应用中被广泛调用.另外,BLAS 3级属于计算密集型函数,对充分发... BLAS(basic linear algebra subprograms)是最基本、最重要的底层数学库之一.在一个标准的BLAS库中,BLAS 3级函数涵盖的矩阵-矩阵运算尤为重要,在许多大规模科学与工程计算应用中被广泛调用.另外,BLAS 3级属于计算密集型函数,对充分发挥处理器的计算性能有至关重要的作用.针对国产SW26010-Pro处理器研究BLAS 3级函数的众核并行优化技术.具体而言,根据SW26010-Pro的存储层次结构,设计多级分块算法,挖掘矩阵运算的并行性.在此基础上,基于远程内存访问(remote memory access,RMA)机制设计数据共享策略,提高从核间的数据传输效率.进一步地,采用三缓冲、参数调优等方法对算法进行全面优化,隐藏直接内存访问(direct memory access,DMA)访存开销和RMA通信开销.此外,利用SW26010-Pro的两条硬件流水线和若干向量化计算/访存指令,还对BLAS 3级函数的矩阵-矩阵乘法、矩阵方程组求解、矩阵转置操作等若干运算进行手工汇编优化,提高了函数的浮点计算效率.实验结果显示,所提出的并行优化技术在SW26010-Pro处理器上为BLAS 3级函数带来了明显的性能提升,单核组BLAS 3级函数的浮点计算性能最高可达峰值性能的92%,多核组BLAS 3级函数的浮点计算性能最高可达峰值性能的88%. 展开更多
关键词 BLAS 3级 SW26010-Pro众核处理器 直接内存访问 远程内存访问 浮点计算效率
在线阅读 下载PDF
Design of area and power efficient Radix-4 DIT FFT butterfly unit using floating point fused arithmetic 被引量:2
2
作者 Prabhu E Mangalam H Karthick S 《Journal of Central South University》 SCIE EI CAS CSCD 2016年第7期1669-1681,共13页
In this work, power efficient butterfly unit based FFT architecture is presented. The butterfly unit is designed using floating-point fused arithmetic units. The fused arithmetic units include two-term dot product uni... In this work, power efficient butterfly unit based FFT architecture is presented. The butterfly unit is designed using floating-point fused arithmetic units. The fused arithmetic units include two-term dot product unit and add-subtract unit. In these arithmetic units, operations are performed over complex data values. A modified fused floating-point two-term dot product and an enhanced model for the Radix-4 FFT butterfly unit are proposed. The modified fused two-term dot product is designed using Radix-16 booth multiplier. Radix-16 booth multiplier will reduce the switching activities compared to Radix-8 booth multiplier in existing system and also will reduce the area required. The proposed architecture is implemented efficiently for Radix-4 decimation in time(DIT) FFT butterfly with the two floating-point fused arithmetic units. The proposed enhanced architecture is synthesized, implemented, placed and routed on a FPGA device using Xilinx ISE tool. It is observed that the Radix-4 DIT fused floating-point FFT butterfly requires 50.17% less space and 12.16% reduced power compared to the existing methods and the proposed enhanced model requires 49.82% less space on the FPGA device compared to the proposed design. Also, reduced power consumption is addressed by utilizing the reusability technique, which results in 11.42% of power reduction of the enhanced model compared to the proposed design. 展开更多
关键词 floating-point arithmetic floating-point fused dot product Radix-16 booth multiplier Radix-4 FFT butterfly fast fouriertransform decimation in time
在线阅读 下载PDF
上一页 1 下一页 到第
使用帮助 返回顶部