摘要
随着量子计算技术的发展,传统加密算法受到的威胁日益严重.为应对量子计算时代的挑战,各国正积极加强后量子密码算法的实现和迁移部署工作.由于NTRU密码方案具有结构简洁、计算效率高、尺寸较小、无专利风险等优点,因此NTRU格基密钥封装算法对于后量子时代的密码技术储备和应用具有重要意义.同时,图形处理器(Graphics Processing Unit,GPU)以其强大的并行计算能力、高吞吐量、低能耗等特性,已成为当前高并发密码工程实现的重要平台.本文给出后量子密码算法CTRU/CNTR的首个GPU高性能实现方案.对GPU主要资源占用进行分析,我们综合考虑并行计算、内存访问、数据布局和算法优化等多个方面,采用一系列计算和内存优化技术,旨在并行加速计算、优化访存、合理占用GPU资源以及减少I/O时延,从而提高本方案的计算能力和性能.本文的主要贡献在于以下几个方面:首先,针对模约减操作,使用NVIDIA并行指令集实现,有效减少所需指令条数;其次,针对耗时的多项式乘法模块,采用混合基NTT,并采用层融合、循环展开和延迟约减等方法,加快计算速度;此外,针对内存重复访问和冲突访问等问题,通过合并访存、核函数融合等优化技术,实现内存的高效访问;最后,为实现高并行的算法,设计恰当的线程块大小和数量,采用内存池机制,实现多任务的快速访存和高效处理.基于NVIDIA RTX4090平台,本方案CTRU768实现中密钥生成、封装和解封装的吞吐量分别为每秒1170.9万次、926.7万次和315.4万次.与参考实现相比,密钥生成、封装和解封装的吞吐量分别提高了336倍、174倍和128倍.本方案CNTR768实现中密钥生成、封装和解封装的吞吐量分别为每秒1117.3万次、971.8万次和322.2万次.与参考实现相比,密钥生成、封装和解封装的吞吐量分别提高了329倍、175倍和134倍;与开源Kyber实现相比,密钥生成、密钥封装和密钥解封装的吞吐量分别提升10.84~11.36倍、9.49~9.95倍和5.11~5.22倍.高性能的密钥封装实现在大规模任务处理场景下具有较大的应用潜力,对保障后量子时代的信息和数据安全具有重要意义.
As quantum computing technology advances,the threats to traditional encryption algorithms are becoming more and more serious.This entails not only developing robust encryption techniques resilient to quantum attacks but also implementing comprehensive strategies to ensure the security of sensitive data and communication channels in the post-quantum era.In order to meet the challenges of the quantum computing era,countries are actively strengthening the implementation,migration and deployment of post-quantum cryptographic algorithms.Since the NTRU cryptographic scheme has the advantages of simple structure,high computational efficiency,small size,and no patent risk,the NTRU lattice-based key encapsulation algorithm is of great significance to the reserve and application of cryptographic technology in the post-quantum era.At the same time,the Graphics Processing Unit(GPU),with its powerful parallel computing capabilities,high throughput,low energy consumption and other characteristics,has become an important platform in the implementation of current high-concurrency cryptographic engineering.We propose the first efficient GPU implementation of the post-quantum cryptographic algorithm CTRU/CNTR.Taking into account multiple aspects such as parallel computing,memory access,data layout and algorithm optimization,the main resource occupancy of the GPU is analyzed.We use a series of computation and memory optimization techniques to accelerate parallel computing,optimize memory access,reasonably occupy GPU resources,and reduce I/O delays,thereby improving the computing power and performance of this solution.The main contributions of our work are as follows:First,for the modular reduction operation,it is implemented using the NVIDIA parallel instruction set,which effectively reduces the number of instructions required.Besides,for the time-consuming polynomial multiplication module,we employ a mixed-base Number-Theoretic Transform(NTT)and utilize methods such as layer fusion,loop unrolling,and delayed reduction to speed up calculations.Additionally,for problems such as repeated memory access and conflict access,efficient memory access is achieved through optimization technologies such as memory coalescing and kernel fusion.Finally,to achieve high parallel algorithm computation,we design appropriate thread block size,and design a memory pool mechanism to achieve rapid memory access and efficient processing of multitasks.Based on the RTX 4090,our implementation of CTRU768 demonstrates throughputs of 11709k,9267k and 3154k operations per second for key generation,encapsulation and decapsulation,respectively.Compared with the C language reference implementation,the throughput of key generation,encapsulation and decapsulation is increased by 336×,174×and 128×,respectively.CNTR768 demonstrates throughputs of 11173k,9718k and 3222k operations per second for key generation,encapsulation and decapsulation,respectively.Compared with the C language reference implementation,the throughput of key generation,encapsulation and decapsulation is increased by 329×,175×and 134×,respectively.Compared with the latest open source Kyber implementation,the throughput of key generation,key encapsulation and key decapsulation is increased by 10.84~11.36×,9.49~9.95×and 5.11~5.22×respectively.High-performance key encapsulation implementation by this work is the key to ensuring the smooth operation of large-capacity applications,and is helpful to ensure information and data security in the post-quantum era.
作者
李文倩
沈诗羽
赵运磊
LI Wen-Qian;SHEN Shi-Yu;ZHAO Yun-Lei(School of Computer Science,Fudan University,Shanghai 200433;State Key Laboratory of Cryptology,Beijing 100878)
出处
《计算机学报》
EI
CAS
CSCD
北大核心
2024年第9期2163-2178,共16页
Chinese Journal of Computers
基金
国家重点研发计划基金资助项目(No.2022YFB2701601)
密码科学技术国家重点实验室面上课题基金资助项目(No.MMKFKT202227)
上海市科委技术标准基金资助项目(No.21DZ2200500)
上海市协同创新基金资助项目(No.XTCX-KJ-2023-54)
上海市科委区块链关键技术攻关专项基金资助项目(No.23511100300)资助.
关键词
后量子密码
格基密码
密钥封装方案
并行处理
图形处理器
post-quantum cryptography
lattice-based cryptography
key encapsulation mechanism
parallel processing
Graphics Processing Units(GPU)