Funding: This paper was supported by the National Natural Science Foundation of China (Grant No. 61972261), the Natural Science Foundation of Guangdong Province (No. 2023A1515011667), the Key Basic Research Foundation of Shenzhen (No. JCYJ20220818100205012), and the Basic Research Foundation of Shenzhen (No. JCYJ20210324093609026).
Abstract: Random sample partition (RSP) is a recently developed big data representation and management model for big data approximate computation problems. Academic research and practical applications have confirmed that RSP is an efficient solution for big data processing and analysis. However, a key challenge in implementing RSP is determining an appropriate sample size for RSP data blocks: a large sample size increases the computational burden, while a small one provides insufficient distribution information in each RSP data block. To address this problem, this paper presents a novel density estimation-based method (DEM) to determine the optimal sample size for RSP data blocks. First, a theoretical sample size is calculated from the multivariate Dvoretzky-Kiefer-Wolfowitz (DKW) inequality using the fixed-point iteration (FPI) method. Second, a practical sample size is determined by minimizing the validation error of a kernel density estimator (KDE) constructed on RSP data blocks as the sample size increases. Finally, a series of experiments is conducted to validate the feasibility, rationality, and effectiveness of DEM. The results show that (1) the iteration function of the FPI method converges when calculating the theoretical sample size from the multivariate DKW inequality; (2) the KDE constructed on RSP data blocks with the sample size determined by DEM yields a good approximation of the probability density function (p.d.f.); and (3) DEM provides more accurate sample sizes than existing sample size determination methods from the perspective of p.d.f. estimation. This demonstrates that DEM is a viable approach to the sample size determination problem in big data RSP implementations.
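The abstract does not spell out the iteration function, but the fixed-point step can be illustrated with a minimal Python sketch. It assumes a multivariate DKW-type tail bound of the form d(n+1)·exp(-2nε²); bounding this by a confidence level δ and solving for n gives the fixed-point equation n = ln(d(n+1)/δ)/(2ε²). The function name, starting point, and tolerance are illustrative, not taken from the paper.

```python
import math

def dkw_sample_size(d, epsilon, delta, tol=1e-6, max_iter=1000):
    """Fixed-point iteration for the theoretical sample size n satisfying
    n = ln(d*(n+1)/delta) / (2*epsilon**2), obtained by bounding the assumed
    multivariate DKW tail d*(n+1)*exp(-2*n*epsilon**2) by delta."""
    n = 1.0  # initial guess
    for _ in range(max_iter):
        n_next = math.log(d * (n + 1.0) / delta) / (2.0 * epsilon ** 2)
        if abs(n_next - n) < tol:
            break
        n = n_next
    return math.ceil(n)

# Example: 10-dimensional data, CDF error at most 0.05 with 95% confidence.
print(dkw_sample_size(d=10, epsilon=0.05, delta=0.05))
```

The iteration map grows only logarithmically in n, so its slope at the fixed point is well below one and the sequence converges quickly in practice, consistent with the convergence claim in the abstract.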
Abstract: To address the problems of numerous interfering factors in complex backgrounds, irregular circle shapes, and the slow speed and low accuracy of circle detection with the traditional randomized Hough transform, an irregular circle detection method for complex backgrounds is proposed. The method first obtains the region of interest (ROI) of the target circle with a connected-component labeling algorithm, and then samples the circle's edge points by partition to improve the effectiveness of random sampling. After the circle center is located, additional circles are fitted using each of the center's 8-neighborhood pixels as candidate centers, which improves the detection accuracy for irregular circles. Center detection results on thermal protector calibration points show that the proposed method detects circles accurately in complex backgrounds, with faster speed and higher precision.
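As an illustration only, the sketch below reimplements the main steps of this pipeline with OpenCV and NumPy: the ROI is taken from the largest connected component, edge points are partitioned into angular sectors before random sampling, and the detected center is refined by testing its 8-neighborhood. The thresholds, the sector count, and the spread criterion used in the refinement are assumptions of the sketch, not details given in the abstract.

```python
import cv2
import numpy as np

def detect_circle(gray):
    """Sketch: ROI via connected-component labeling, sector-partitioned
    sampling of edge points, then 8-neighborhood refinement of the center."""
    # 1. ROI of the target circle from the largest connected component.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    n_labels, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])  # skip background label 0
    x, y, w, h = stats[largest, :4]
    roi = gray[y:y + h, x:x + w]

    # 2. Edge points, partitioned into angular sectors around the ROI centroid.
    edges = cv2.Canny(roi, 50, 150)
    pts = np.column_stack(np.nonzero(edges))[:, ::-1].astype(float)  # (x, y)
    centroid = pts.mean(axis=0)
    angles = np.arctan2(pts[:, 1] - centroid[1], pts[:, 0] - centroid[0])
    sectors = [pts[(angles >= a) & (angles < a + 2 * np.pi / 3)]
               for a in (-np.pi, -np.pi / 3, np.pi / 3)]

    # 3. Randomly sample one point per sector and solve for the circumcenter.
    rng = np.random.default_rng(0)
    centers = []
    for _ in range(50):
        p1, p2, p3 = (s[rng.integers(len(s))] for s in sectors)
        A = 2 * np.array([p2 - p1, p3 - p1])
        b = np.array([p2 @ p2 - p1 @ p1, p3 @ p3 - p1 @ p1])
        if abs(np.linalg.det(A)) > 1e-6:
            centers.append(np.linalg.solve(A, b))
    cx, cy = np.median(centers, axis=0)

    # 4. Refine: try the center and its 8 neighbors, keep the candidate whose
    #    edge-point distances have the smallest spread (better for irregular circles).
    best = None
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            c = np.array([cx + dx, cy + dy])
            dists = np.linalg.norm(pts - c, axis=1)
            spread = dists.std()
            if best is None or spread < best[0]:
                best = (spread, c, dists.mean())
    _, center, radius = best
    return center + (x, y), radius  # back to full-image coordinates
```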
Abstract: To address the problems of large covariance matrices, insufficient coverage of feature information by subspaces, and high inter-node communication overhead in the random forest algorithm under big data, a parallel random forest algorithm based on PCA and hierarchical subspace selection, PLA-PRF (PCA and subspace layer sampling on parallel random forest algorithm), is proposed. For the initial feature set, a PCA-based matrix factorization strategy (MFS) is proposed to compress the original feature set and extract principal component features, solving the problem of the large covariance matrix during feature transformation. Based on the principal component features, an error-constrained hierarchical subspace construction algorithm (EHSCA) is proposed to select informative features layer by layer and build feature subspaces, solving the problem of insufficient subspace feature coverage. For parallel training of decision trees in the Spark environment, a data reuse strategy (DRS) is designed: RDD data are partitioned vertically and combined with an index table to enable feature reuse, solving the problem of high inter-node communication overhead. Experimental results show that PLA-PRF achieves better classification performance and higher parallel efficiency.
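The Spark-side data reuse strategy is not reproduced here, but the PCA compression and layered subspace sampling ideas can be sketched with scikit-learn and NumPy. The layering rule (splitting components into layers by explained variance and drawing from every layer), the parameter values, and the function name are assumptions made for illustration; the paper's EHSCA additionally applies an error constraint that is omitted here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

def pca_layered_subspaces(X, n_subspaces=10, subspace_size=8,
                          variance_kept=0.95, n_layers=3, seed=0):
    """Sketch: compress features with PCA, then build each feature subspace
    by drawing components from layers ranked by explained variance, so that
    every subspace mixes high-, mid-, and low-variance components."""
    pca = PCA(n_components=variance_kept).fit(X)   # MFS-like compression step
    Z = pca.transform(X)                           # principal-component features
    order = np.argsort(pca.explained_variance_ratio_)[::-1]
    layers = np.array_split(order, n_layers)       # layer 0 = most informative
    rng = np.random.default_rng(seed)
    subspaces = []
    for _ in range(n_subspaces):
        picks = []
        for layer in layers:
            k = max(1, subspace_size // n_layers)
            picks.extend(rng.choice(layer, size=min(k, len(layer)), replace=False))
        subspaces.append(np.array(picks))
    return Z, subspaces

# Usage on synthetic data: train one tree (a 1-tree forest) per subspace.
X, y = make_classification(n_samples=500, n_features=40, random_state=0)
Z, subspaces = pca_layered_subspaces(X)
trees = [RandomForestClassifier(n_estimators=1, random_state=i).fit(Z[:, s], y)
         for i, s in enumerate(subspaces)]
```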
Abstract: To address the long registration time and low registration accuracy of current image registration algorithms, this paper proposes an image registration algorithm based on improved stratified random selection random sample consensus (SRS-RANSAC). First, feature points are extracted from the reference image with the ORB (Oriented FAST and Rotated BRIEF) algorithm. Second, the minimum-distance criterion is used to preliminarily filter out mismatches. Finally, within the random sample consensus (RANSAC) framework, stratified random selection (SRS) is used to pick feature points that are relatively dispersed and evenly distributed, further filtering out the remaining mismatched feature points from the initial matches, which improves registration accuracy while reducing running time. Comparative experiments against other algorithms on the Oxford standard dataset and on real-world photographs show that the proposed algorithm improves both matching accuracy and running efficiency.
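A minimal OpenCV sketch of this pipeline is shown below: ORB features, a minimum-distance pre-filter, a grid-based stratified selection so the retained matches are spatially spread out, and a RANSAC homography estimate. The grid bucketing, thresholds, and per-cell quota are illustrative assumptions; the paper's exact SRS scheme may differ.

```python
import cv2
import numpy as np

def register(img_ref, img_mov, grid=4, keep_per_cell=20):
    """Sketch: ORB features, minimum-distance pre-filtering, grid-stratified
    selection of spatially spread matches, then RANSAC homography."""
    orb = cv2.ORB_create(2000)
    k1, d1 = orb.detectAndCompute(img_ref, None)
    k2, d2 = orb.detectAndCompute(img_mov, None)

    # Brute-force Hamming matching + minimum-distance pre-filter.
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    min_dist = min(m.distance for m in matches)
    matches = [m for m in matches if m.distance <= max(2 * min_dist, 30.0)]

    # Stratified selection: bucket matches by a grid over the reference image
    # and keep at most keep_per_cell per bucket, so the retained points are
    # spread out rather than clustered.
    h, w = img_ref.shape[:2]
    buckets = {}
    for m in matches:
        x, y = k1[m.queryIdx].pt
        cell = (int(x * grid / w), int(y * grid / h))
        buckets.setdefault(cell, []).append(m)
    rng = np.random.default_rng(0)
    selected = []
    for cell_matches in buckets.values():
        rng.shuffle(cell_matches)
        selected.extend(cell_matches[:keep_per_cell])

    # RANSAC homography on the spatially stratified matches.
    src = np.float32([k1[m.queryIdx].pt for m in selected]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in selected]).reshape(-1, 1, 2)
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H, mask
```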