This paper investigates the fundamental data detection problem with burst interference in massive multiple-input multiple-output orthogonal frequency division multiplexing (MIMO-OFDM) systems. In particular, burst interference may occur only on data symbols but not on pilot symbols, which means that interference information cannot be premeasured. To cancel the burst interference, we first revisit the uplink multi-user system and develop a matrix-form system model, where the covariance pattern and the low-rank property of the interference matrix are discussed. Then, we propose a turbo message passing based burst interference cancellation (TMP-BIC) algorithm to solve the data detection problem, where the constellation information of the target data is fully exploited to refine its estimate. Furthermore, in the TMP-BIC algorithm, we design one module to cope with the interference matrix by exploiting its low-rank property. Numerical results demonstrate that the proposed algorithm can effectively mitigate the adverse effects of burst interference and approach the interference-free bound.
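The low-rank module above is described only at a high level; as a hedged illustration, a generic rank-truncation step (the Eckart-Young best low-rank approximation via SVD, not the authors' actual TMP-BIC module) recovers a low-rank interference pattern from a noisy observation:

```python
import numpy as np

def lowrank_denoise(Y, rank):
    """Project an observed matrix onto its best rank-`rank` approximation
    via truncated SVD -- a generic stand-in for the low-rank
    interference-estimation step described in the abstract."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank, :]

# A rank-1 interference pattern plus small noise is recovered almost exactly.
rng = np.random.default_rng(0)
u, v = rng.standard_normal(64), rng.standard_normal(32)
interference = np.outer(u, v)                      # exactly rank 1
noisy = interference + 0.01 * rng.standard_normal((64, 32))
estimate = lowrank_denoise(noisy, rank=1)
err = np.linalg.norm(estimate - interference) / np.linalg.norm(interference)
```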
When castings become complicated and the demands for precision of numerical simulation become higher, the numerical data of casting simulation become more massive. On a general personal computer, these massive numerical data may exceed the capacity of available memory, resulting in failure of rendering. Based on the out-of-core technique, this paper proposes a method to effectively utilize external storage and reduce memory usage dramatically, so as to solve the problem of insufficient memory for massive data rendering on general personal computers. Based on this method, a new post-processor is developed. It is capable of illustrating the filling and solidification processes of a casting, as well as thermal stress. The new post-processor also provides fast interaction with simulation results. Theoretical analysis as well as several practical examples prove that the memory usage and loading time of the post-processor are independent of the size of the relevant files and depend only on the proportion of cells on the surface. Meanwhile, the speed of rendering and of fetching values at the mouse position is appreciable, and the demands of real-time interaction are satisfied.
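As a minimal sketch of the out-of-core idea (illustrative only; the post-processor's actual file format and traversal are not given in the abstract), a large on-disk array can be scanned through a small memory-mapped window instead of being loaded whole:

```python
import numpy as np
import tempfile, os

def chunked_max(path, shape, dtype=np.float32, chunk_rows=1024):
    """Scan a large binary result file while holding only `chunk_rows`
    rows in memory at a time; the rest of the file stays on disk."""
    data = np.memmap(path, dtype=dtype, mode="r", shape=shape)
    best = -np.inf
    for start in range(0, shape[0], chunk_rows):
        block = np.asarray(data[start:start + chunk_rows])  # small copy only
        best = max(best, float(block.max()))
    return best

# Write a result file, then scan it without loading it in one piece.
tmp = os.path.join(tempfile.mkdtemp(), "field.bin")
arr = np.arange(10_000, dtype=np.float32).reshape(2500, 4)
arr.tofile(tmp)
peak = chunked_max(tmp, shape=(2500, 4))
```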
Because of the limited memory of current wearable devices and the increasing amount of information they hold, the processing capacity of the servers in the storage system cannot keep up with the speed of information growth, resulting in low load balancing, long load-balancing time, and data-processing delay. Therefore, a data load balancing technology is applied to the massive storage systems of wearable devices in this paper. We first analyze the object-oriented load balancing method and formally describe the dynamic load balancing issue, treating load balancing as a mapping problem. Then, the task of assigning each data node and the request for the corresponding data node's actual processing capacity are completed. Different data are allocated to the corresponding data storage nodes to compute the comprehensive weight of each data storage node. According to the load information of each data storage node collected by the scheduler in the storage system, the load weight of the current data storage node is calculated and distributed, realizing data load balancing for the massive storage system of wearable devices. The experimental results show that the average load-balancing time of this method is 1.75 h, which is much lower than that of traditional methods. The results show that the proposed technology offers short load-balancing time, high load-balancing quality, strong data-processing capability, and short processing time.
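The "comprehensive weight" computation is not spelled out in the abstract; a minimal proportional-capacity sketch (function and node names are assumptions, not the paper's algorithm) distributes requests in proportion to each node's advertised capacity:

```python
def assign_loads(requests, capacities):
    """Distribute `requests` across storage nodes in proportion to each
    node's processing capacity -- a toy stand-in for the 'comprehensive
    weight' based distribution described in the abstract."""
    total = sum(capacities.values())
    shares = {node: cap / total for node, cap in capacities.items()}
    return {node: int(round(requests * share)) for node, share in shares.items()}

# Two fast nodes and one slow node splitting 1000 requests.
plan = assign_loads(1000, {"node-a": 4, "node-b": 4, "node-c": 2})
```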
With increasingly complex website structures and continuously advancing web technologies, accurate recognition of user clicks from massive HTTP data, which is critical for web usage mining, becomes more difficult. In this paper, we propose a dependency graph model to describe the relationships between web requests. Based on this model, we design and implement a heuristic parallel algorithm to distinguish user clicks with the assistance of cloud computing technology. We evaluate the proposed algorithm with real massive data. The dataset, collected from a mobile core network, is 228.7 GB in size and covers more than three million users. The experimental results demonstrate that the proposed algorithm achieves higher accuracy than previous methods.
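The abstract does not detail the dependency-graph construction; a toy sketch, assuming the graph edges come from Referer headers (field names are illustrative, not the paper's data model), separates page-level click candidates from embedded object requests:

```python
def page_requests(log):
    """Toy dependency-graph heuristic: each request implies an edge from
    its Referer to its URL. A URL that appears as the Referer of some
    other request served as a page the user actually viewed, so such
    URLs (plus referrer-less requests) are kept as click candidates."""
    referred = {req["referer"] for req in log if req["referer"]}
    return [req["url"] for req in log
            if req["url"] in referred or not req["referer"]]

log = [
    {"url": "a.com/index.html", "referer": None},               # typed-in click
    {"url": "a.com/logo.png",   "referer": "a.com/index.html"}, # embedded
    {"url": "a.com/news.html",  "referer": "a.com/index.html"}, # clicked link
    {"url": "a.com/ad.js",      "referer": "a.com/news.html"},  # embedded
]
clicks = page_requests(log)
```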
The vegetation data of the Fengyun meteorological satellite are segmented according to latitude and longitude and can be written into 648 blocks. However, the processing efficiency is low because of the sheer volume of the data. This paper presents a data processing method based on RAM (h) for Fengyun-3 vegetation data. First, we introduce the Locality-Aware model to segment the input data, then locate the data based on geographic location, and finally fuse the independent data based on geographic location. Experimental results show that the proposed method can effectively improve data processing efficiency.
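Assuming the 648 blocks form an 18 x 36 grid of 10-degree latitude/longitude cells (an assumption; the abstract does not state the exact tiling), locating a record by geographic position reduces to a simple block-index computation:

```python
def block_index(lat, lon, cell_deg=10):
    """Map a (lat, lon) in degrees to one of 648 grid blocks, ASSUMING an
    18 x 36 grid of 10-degree cells; the real Fengyun-3 tiling may differ."""
    row = int((lat + 90) // cell_deg)     # 0..17
    col = int((lon + 180) // cell_deg)    # 0..35
    cols = 360 // cell_deg
    return row * cols + col

idx = block_index(39.9, 116.4)   # every point in one 10-degree cell shares an index
```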
At present, digital image information fusion suffers from inaccurate data cleaning and frequent omission of repeated data, resulting in poor information fusion. In this regard, a visualized multi-component information fusion method for big data based on radar maps is proposed in this paper. The data model of the perceptual digital image is constructed using linear regression analysis. The ID tags of the collected image data, used as Transaction Identification (TID), are compared: if two records share the same TID, repeated-data detection is carried out. After this test, the data set is processed repeatedly according to the above procedure to improve the precision of data cleaning and reduce omission. Based on radar images, hierarchical visualization of the processed multi-level information fusion is realized. Experiments show that the method can clean redundant data accurately and achieve efficient fusion of multi-level big-data information in digital images.
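The TID-based repeated-data detection can be sketched directly (record fields are illustrative; the paper's cleaning pipeline also involves the regression-based data model):

```python
def dedup_by_tid(records):
    """Repeated-data detection as described in the abstract: two records
    with the same TID are treated as duplicates; only the first is kept."""
    seen, clean = set(), []
    for rec in records:
        if rec["tid"] not in seen:
            seen.add(rec["tid"])
            clean.append(rec)
    return clean

records = [{"tid": 1, "v": 10}, {"tid": 2, "v": 20}, {"tid": 1, "v": 10}]
cleaned = dedup_by_tid(records)
```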
In this paper, we consider unified optimal subsampling estimation and inference on the low-dimensional parameter of main interest in the presence of a nuisance parameter for low/high-dimensional generalized linear models (GLMs) with massive data. We first present a general subsampling decorrelated score function to reduce the influence of the less accurate nuisance parameter estimation with its slow convergence rate. The consistency and asymptotic normality of the resultant subsample estimator from a general decorrelated score subsampling algorithm are established, and two optimal subsampling probabilities are derived under the A- and L-optimality criteria to downsize the data volume and reduce the computational burden. The proposed optimal subsampling probabilities provably improve the asymptotic efficiency of the subsampling schemes in low-dimensional GLMs and perform better than the uniform subsampling scheme in high-dimensional GLMs. A two-step algorithm is further proposed for implementation, and the asymptotic properties of the corresponding estimators are also given. Simulations show satisfactory performance of the proposed estimators, and two applications to census income and Fashion-MNIST datasets also demonstrate their practical applicability.
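As a hedged sketch of nonuniform subsampling for a GLM (an A-optimality-style heuristic with probabilities proportional to per-point score norms, not the paper's decorrelated-score rule; the "pilot" probabilities here reuse the true model for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)

# Full logistic-regression data.
n, d = 20_000, 5
X = rng.standard_normal((n, d))
beta = np.ones(d)
p = 1 / (1 + np.exp(-X @ beta))
y = rng.random(n) < p

# Per-point score norms |y_i - p_i| * ||x_i|| -> sampling probabilities.
# A real two-step scheme would use a pilot estimate of p, not the truth.
resid = y - p
scores = np.abs(resid) * np.linalg.norm(X, axis=1)
probs = scores / scores.sum()
idx = rng.choice(n, size=1_000, replace=True, p=probs)
subsample = X[idx]                      # informative points are over-represented
```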
In this paper, we consider distributed inference for heterogeneous linear models with massive datasets. Noting that heterogeneity may exist not only in the expectations of the subpopulations but also in their variances, we propose the heteroscedasticity-adaptive distributed aggregation (HADA) estimation, which is shown to be communication-efficient and asymptotically optimal regardless of homoscedasticity or heteroscedasticity. Furthermore, a distributed test for parameter heterogeneity across subpopulations is constructed based on the HADA estimator. The finite-sample performance of the proposed methods is evaluated using simulation studies and the NYC flight data.
Similarity queries are widely used in information retrieval, biology, and network security to analyze relationships between data. Traditional methods must compare the query point against every record in the database, so the computation grows linearly with the data volume. To improve the efficiency of distributed similarity queries over massive data, this paper proposes an HBase-based similarity index structure, HSIT (HBase Similarity Index Tree). The algorithm builds the similarity index tree dynamically as data are stored. According to a similarity threshold, HSIT places similar data in adjacent HBase regions; when a user issues a similarity query, the query node can quickly retrieve the similar regions through HSIT. The index enables efficient pruning, so that only similar regions require pairwise computation. In experiments where the dataset grows exponentially from 20,000 to 1,280,000 records, similarity queries with HSIT are more efficient than with the DSCS-LTS algorithm.
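The region-level pruning idea can be sketched in one dimension (a toy model, not the HSIT/HBase implementation): each region stores a representative value and a radius, and only regions that could possibly contain matches are searched pairwise:

```python
def prune_then_compare(query, regions, threshold):
    """Answer a similarity query by first pruning whole regions. A region
    is searched only if its representative lies within threshold + radius
    of the query (adding the radius keeps the pruning lossless); pairwise
    distance checks then run inside the surviving regions only."""
    hits = []
    for rep, radius, members in regions:
        if abs(rep - query) <= threshold + radius:   # region-level pruning
            hits.extend(m for m in members if abs(m - query) <= threshold)
    return sorted(hits)

# Three regions: (representative, radius, members). Only the first survives.
regions = [(10, 2, [9, 10, 12]), (50, 5, [48, 50, 55]), (100, 1, [99, 101])]
matches = prune_then_compare(query=11, regions=regions, threshold=4)
```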
Recently, the internet has stimulated explosive progress in knowledge discovery from big-volume data resources, mining valuable hidden rules by computation. Simultaneously, wireless channel measurement data exhibit a big-volume feature, considering the massive antennas, huge bandwidth, and versatile application scenarios. This article first presents a comprehensive survey of channel measurement and modeling research for mobile communication, especially for the 5th Generation (5G) and beyond. Considering the progress of big data research, a cluster-nuclei based model is then proposed, which takes advantage of both stochastic and deterministic models. The novel model has low complexity with a limited number of cluster nuclei, while the cluster nuclei have a physical mapping to real propagation objects. Combining the variation principles of channel properties with the antenna size, frequency, mobility, and scenario information extracted from the channel data, the proposed model can be extended to versatile applications to support future mobile research.
This paper proposes a virtual quadtree (VQT) based loose architecture of multi-level massive geospatial data for integrating massive geospatial data dispersed among departments of different hierarchies in the same sector into a unified GIS (Geographic Information System) platform. By virtualizing the nodes of the quadtree, the VQT separates the structure of data organization from data storage and screens the difference between data stored on the local computer and on remote computers in a network environment. By mounting, the VQT easily integrates data from remote computers into the local VQT so as to implement seamless integration of distributed multi-level massive geospatial data. Based on this model, the paper built an application system with over 1200 GB of geospatial data distributed across 12 servers deployed in 12 cities. The experiment showed that all data can be browsed seamlessly and rapidly, with smooth zooming in and out.
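Quadtree node addressing, which underlies such multi-level tiling architectures, can be sketched with the common bit-interleaved quadkey scheme (a generic map-tiling convention; the paper's virtual-node scheme is its own design):

```python
def quadkey(x, y, level):
    """Address the tile at column x, row y of a quadtree with `level`
    levels by interleaving the bits of x and y: one base-4 digit per
    level, so a key prefix names the ancestor node at that depth."""
    key = ""
    for i in range(level, 0, -1):
        digit = 0
        mask = 1 << (i - 1)
        if x & mask:
            digit += 1
        if y & mask:
            digit += 2
        key += str(digit)
    return key

k = quadkey(3, 5, 3)   # tile (3, 5) at level 3
```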
This paper designs and develops a framework on a distributed computing platform for massive multi-source spatial data using a column-oriented database (HBase). This platform consists of four layers, including the ETL (extraction, transformation, loading) tier, data processing tier, data storage tier, and data display tier, achieving long-term storage, real-time analysis, and inquiry for massive data. Finally, a real dataset cluster is simulated, made up of 39 nodes including 2 master nodes and 37 data nodes; function tests of the data importing and real-time query modules are performed, together with performance tests of HDFS I/O, the MapReduce cluster, batch loading, and real-time query of massive data. The test results indicate that this platform achieves high performance in terms of response time and linear scalability.
Digital data have become a torrent engulfing every area of business, science, and engineering, gushing into every economy, every organization, and every user of digital technology. In the age of big data, deriving values and insights from big data using rich analytics becomes important for achieving competitiveness, success, and leadership in every field. The Internet of Things (IoT) is causing the number and types of products emitting data to grow at an unprecedented rate. Heterogeneity, scale, timeliness, complexity, and privacy problems with large data impede progress at all phases of the pipeline that can create value from data. With the push of such massive data, we are entering a new era of computing driven by novel and groundbreaking research innovation on elastic parallelism, partitioning, and scalability. Designing a scalable system for analysing, processing, and mining huge real-world datasets has become one of the challenging problems facing both systems researchers and data management researchers. In this paper, we give an overview of computing infrastructure for IoT data processing, focusing on the architecture and major challenges of massive data. We briefly discuss emerging computing infrastructures and technologies that are promising for improving massive data management.
Funding: supported by the National Key Laboratory of Wireless Communications Foundation, China (IFN20230204).
Funding: supported by the New Century Excellent Talents in University (NCET-09-0396), the National Science & Technology Key Projects of Numerical Control (2012ZX04014-031), the Natural Science Foundation of Hubei Province (2011CDB279), and the Foundation for Innovative Research Groups of the Natural Science Foundation of Hubei Province, China (2010CDA067).
Funding: supported in part by the Fundamental Research Funds for the Central Universities under Grant No. 2013RC0114 and the 111 Project of China under Grant No. B08004.
Funding: the 2018 National Grade Innovation and Entrepreneurship Training Program for College Students, China (No. 201811562005); the Research Project of Gansu University, China (No. 2016A-105); and the Innovation and Entrepreneurship Education Project of Gansu Province in 2019, China (No. 2019024).
Funding: this work was supported by the Fundamental Research Funds for the Central Universities, the National Natural Science Foundation of China (Grant No. 12271272), and the Key Laboratory for Medical Data Analysis and Statistical Research of Tianjin.
Funding: supported by the National Science Foundation of China (Grant No. 12271014), the China Postdoctoral Science Foundation (Grant No. 2022M720334), and the MOE (Ministry of Education in China) Project of Humanities and Social Sciences (Grant No. 23YJCZH259).
Funding: supported in part by the National Natural Science Foundation of China (61322110, 6141101115) and the Doctoral Fund of the Ministry of Education (201300051100013).
Funding: the National Natural Science Foundation of China (Grant No. 2006AA12Z208) and the CAS Innovation Program (Grant No. KZCX2-YW-304-02).
Funding: supported by the National Science and Technology Support Project (No. 2012BAH01F02) from the Ministry of Science and Technology of China and the Director Fund (No. IS201116002) from the Institute of Seismology, CEA.