The proliferation of intelligent,connected Internet of Things(IoT)devices facilitates data collection.However,task workers may be reluctant to participate in data collection due to privacy concerns,and task requesters...The proliferation of intelligent,connected Internet of Things(IoT)devices facilitates data collection.However,task workers may be reluctant to participate in data collection due to privacy concerns,and task requesters may be concerned about the validity of the collected data.Hence,it is vital to evaluate the quality of the data collected by the task workers while protecting privacy in spatial crowdsourcing(SC)data collection tasks with IoT.To this end,this paper proposes a privacy-preserving data reliability evaluation for SC in IoT,named PARE.First,we design a data uploading format using blockchain and Paillier homomorphic cryptosystem,providing unchangeable and traceable data while overcoming privacy concerns.Secondly,based on the uploaded data,we propose a method to determine the approximate correct value region without knowing the exact value.Finally,we offer a data filtering mechanism based on the Paillier cryptosystem using this value region.The evaluation and analysis results show that PARE outperforms the existing solution in terms of performance and privacy protection.展开更多
Advanced cloud computing technology provides cost saving and flexibility of services for users.With the explosion of multimedia data,more and more data owners would outsource their personal multimedia data on the clou...Advanced cloud computing technology provides cost saving and flexibility of services for users.With the explosion of multimedia data,more and more data owners would outsource their personal multimedia data on the cloud.In the meantime,some computationally expensive tasks are also undertaken by cloud servers.However,the outsourced multimedia data and its applications may reveal the data owner’s private information because the data owners lose the control of their data.Recently,this thought has aroused new research interest on privacy-preserving reversible data hiding over outsourced multimedia data.In this paper,two reversible data hiding schemes are proposed for encrypted image data in cloud computing:reversible data hiding by homomorphic encryption and reversible data hiding in encrypted domain.The former is that additional bits are extracted after decryption and the latter is that extracted before decryption.Meanwhile,a combined scheme is also designed.This paper proposes the privacy-preserving outsourcing scheme of reversible data hiding over encrypted image data in cloud computing,which not only ensures multimedia data security without relying on the trustworthiness of cloud servers,but also guarantees that reversible data hiding can be operated over encrypted images at the different stages.Theoretical analysis confirms the correctness of the proposed encryption model and justifies the security of the proposed scheme.The computation cost of the proposed scheme is acceptable and adjusts to different security levels.展开更多
With the increasing popularity of cloud computing,privacy has become one of the key problem in cloud security.When data is outsourced to the cloud,for data owners,they need to ensure the security of their privacy;for ...With the increasing popularity of cloud computing,privacy has become one of the key problem in cloud security.When data is outsourced to the cloud,for data owners,they need to ensure the security of their privacy;for cloud service providers,they need some information of the data to provide high QoS services;and for authorized users,they need to access to the true value of data.The existing privacy-preserving methods can't meet all the needs of the three parties at the same time.To address this issue,we propose a retrievable data perturbation method and use it in the privacy-preserving in data outsourcing in cloud computing.Our scheme comes in four steps.Firstly,an improved random generator is proposed to generate an accurate "noise".Next,a perturbation algorithm is introduced to add noise to the original data.By doing this,the privacy information is hidden,but the mean and covariance of data which the service providers may need remain unchanged.Then,a retrieval algorithm is proposed to get the original data back from the perturbed data.Finally,we combine the retrievable perturbation with the access control process to ensure only the authorized users can retrieve the original data.The experiments show that our scheme perturbs date correctly,efficiently,and securely.展开更多
An attempt has been made to develop a distributed software infrastructure model for onboard data fusion system simulation, which is also applied to netted radar systems, onboard distributed detection systems and advan...An attempt has been made to develop a distributed software infrastructure model for onboard data fusion system simulation, which is also applied to netted radar systems, onboard distributed detection systems and advanced C3I systems. Two architectures are provided and verified: one is based on pure TCP/IP protocol and C/S model, and implemented with Winsock, the other is based on CORBA (common object request broker architecture). The performance of data fusion simulation system, i.e. reliability, flexibility and scalability, is improved and enhanced by two models. The study of them makes valuable explore on incorporating the distributed computation concepts into radar system simulation techniques.展开更多
With the development of Internet of Things(IoT),the delay caused by network transmission has led to low data processing efficiency.At the same time,the limited computing power and available energy consumption of IoT t...With the development of Internet of Things(IoT),the delay caused by network transmission has led to low data processing efficiency.At the same time,the limited computing power and available energy consumption of IoT terminal devices are also the important bottlenecks that would restrict the application of blockchain,but edge computing could solve this problem.The emergence of edge computing can effectively reduce the delay of data transmission and improve data processing capacity.However,user data in edge computing is usually stored and processed in some honest-but-curious authorized entities,which leads to the leakage of users’privacy information.In order to solve these problems,this paper proposes a location data collection method that satisfies the local differential privacy to protect users’privacy.In this paper,a Voronoi diagram constructed by the Delaunay method is used to divide the road network space and determine the Voronoi grid region where the edge nodes are located.A random disturbance mechanism that satisfies the local differential privacy is utilized to disturb the original location data in each Voronoi grid.In addition,the effectiveness of the proposed privacy-preserving mechanism is verified through comparison experiments.Compared with the existing privacy-preserving methods,the proposed privacy-preserving mechanism can not only better meet users’privacy needs,but also have higher data availability.展开更多
The fast proliferation of edge devices for the Internet of Things(IoT)has led to massive volumes of data explosion.The generated data is collected and shared using edge-based IoT structures at a considerably high freq...The fast proliferation of edge devices for the Internet of Things(IoT)has led to massive volumes of data explosion.The generated data is collected and shared using edge-based IoT structures at a considerably high frequency.Thus,the data-sharing privacy exposure issue is increasingly intimidating when IoT devices make malicious requests for filching sensitive information from a cloud storage system through edge nodes.To address the identified issue,we present evolutionary privacy preservation learning strategies for an edge computing-based IoT data sharing scheme.In particular,we introduce evolutionary game theory and construct a payoff matrix to symbolize intercommunication between IoT devices and edge nodes,where IoT devices and edge nodes are two parties of the game.IoT devices may make malicious requests to achieve their goals of stealing privacy.Accordingly,edge nodes should deny malicious IoT device requests to prevent IoT data from being disclosed.They dynamically adjust their own strategies according to the opponent's strategy and finally maximize the payoffs.Built upon a developed application framework to illustrate the concrete data sharing architecture,a novel algorithm is proposed that can derive the optimal evolutionary learning strategy.Furthermore,we numerically simulate evolutionarily stable strategies,and the final results experimentally verify the correctness of the IoT data sharing privacy preservation scheme.Therefore,the proposed model can effectively defeat malicious invasion and protect sensitive information from leaking when IoT data is shared.展开更多
Wireless sensor networks(WSNs)consist of a great deal of sensor nodes with limited power,computation,storage,sensing and communication capabilities.Data aggregation is a very important technique,which is designed to s...Wireless sensor networks(WSNs)consist of a great deal of sensor nodes with limited power,computation,storage,sensing and communication capabilities.Data aggregation is a very important technique,which is designed to substantially reduce the communication overhead and energy expenditure of sensor node during the process of data collection in a WSNs.However,privacy-preservation is more challenging especially in data aggregation,where the aggregators need to perform some aggregation operations on sensing data it received.We present a state-of-the art survey of privacy-preserving data aggregation in WSNs.At first,we classify the existing privacy-preserving data aggregation schemes into different categories by the core privacy-preserving techniques used in each scheme.And then compare and contrast different algorithms on the basis of performance measures such as the privacy protection ability,communication consumption,power consumption and data accuracy etc.Furthermore,based on the existing work,we also discuss a number of open issues which may intrigue the interest of researchers for future work.展开更多
This paper presents a comprehensive exploration into the integration of Internet of Things(IoT),big data analysis,cloud computing,and Artificial Intelligence(AI),which has led to an unprecedented era of connectivity.W...This paper presents a comprehensive exploration into the integration of Internet of Things(IoT),big data analysis,cloud computing,and Artificial Intelligence(AI),which has led to an unprecedented era of connectivity.We delve into the emerging trend of machine learning on embedded devices,enabling tasks in resource-limited environ-ments.However,the widespread adoption of machine learning raises significant privacy concerns,necessitating the development of privacy-preserving techniques.One such technique,secure multi-party computation(MPC),allows collaborative computations without exposing private inputs.Despite its potential,complex protocols and communication interactions hinder performance,especially on resource-constrained devices.Efforts to enhance efficiency have been made,but scalability remains a challenge.Given the success of GPUs in deep learning,lever-aging embedded GPUs,such as those offered by NVIDIA,emerges as a promising solution.Therefore,we propose an Embedded GPU-based Secure Two-party Computation(EG-STC)framework for Artificial Intelligence(AI)systems.To the best of our knowledge,this work represents the first endeavor to fully implement machine learning model training based on secure two-party computing on the Embedded GPU platform.Our experimental results demonstrate the effectiveness of EG-STC.On an embedded GPU with a power draw of 5 W,our implementation achieved a secure two-party matrix multiplication throughput of 5881.5 kilo-operations per millisecond(kops/ms),with an energy efficiency ratio of 1176.3 kops/ms/W.Furthermore,leveraging our EG-STC framework,we achieved an overall time acceleration ratio of 5–6 times compared to solutions running on server-grade CPUs.Our solution also exhibited a reduced runtime,requiring only 60%to 70%of the runtime of previously best-known methods on the same platform.In summary,our research contributes to the advancement of secure and efficient machine learning implementations on resource-constrained embedded devices,paving the way for broader adoption of AI technologies in various applications.展开更多
With the rapid development of information technology,IoT devices play a huge role in physiological health data detection.The exponential growth of medical data requires us to reasonably allocate storage space for clou...With the rapid development of information technology,IoT devices play a huge role in physiological health data detection.The exponential growth of medical data requires us to reasonably allocate storage space for cloud servers and edge nodes.The storage capacity of edge nodes close to users is limited.We should store hotspot data in edge nodes as much as possible,so as to ensure response timeliness and access hit rate;However,the current scheme cannot guarantee that every sub-message in a complete data stored by the edge node meets the requirements of hot data;How to complete the detection and deletion of redundant data in edge nodes under the premise of protecting user privacy and data dynamic integrity has become a challenging problem.Our paper proposes a redundant data detection method that meets the privacy protection requirements.By scanning the cipher text,it is determined whether each sub-message of the data in the edge node meets the requirements of the hot data.It has the same effect as zero-knowledge proof,and it will not reveal the privacy of users.In addition,for redundant sub-data that does not meet the requirements of hot data,our paper proposes a redundant data deletion scheme that meets the dynamic integrity of the data.We use Content Extraction Signature(CES)to generate the remaining hot data signature after the redundant data is deleted.The feasibility of the scheme is proved through safety analysis and efficiency analysis.展开更多
Federated learning for edge computing is a promising solution in the data booming era,which leverages the computation ability of each edge device to train local models and only shares the model gradients to the centra...Federated learning for edge computing is a promising solution in the data booming era,which leverages the computation ability of each edge device to train local models and only shares the model gradients to the central server.However,the frequently transmitted local gradients could also leak the participants’private data.To protect the privacy of local training data,lots of cryptographic-based Privacy-Preserving Federated Learning(PPFL)schemes have been proposed.However,due to the constrained resource nature of mobile devices and complex cryptographic operations,traditional PPFL schemes fail to provide efficient data confidentiality and lightweight integrity verification simultaneously.To tackle this problem,we propose a Verifiable Privacypreserving Federated Learning scheme(VPFL)for edge computing systems to prevent local gradients from leaking over the transmission stage.Firstly,we combine the Distributed Selective Stochastic Gradient Descent(DSSGD)method with Paillier homomorphic cryptosystem to achieve the distributed encryption functionality,so as to reduce the computation cost of the complex cryptosystem.Secondly,we further present an online/offline signature method to realize the lightweight gradients integrity verification,where the offline part can be securely outsourced to the edge server.Comprehensive security analysis demonstrates the proposed VPFL can achieve data confidentiality,authentication,and integrity.At last,we evaluate both communication overhead and computation cost of the proposed VPFL scheme,the experimental results have shown VPFL has low computation costs and communication overheads while maintaining high training accuracy.展开更多
The introduction of the Internet of Things(IoT)paradigm serves as pervasive resource access and sharing platform for different real-time applications.Decentralized resource availability,access,and allocation provide a...The introduction of the Internet of Things(IoT)paradigm serves as pervasive resource access and sharing platform for different real-time applications.Decentralized resource availability,access,and allocation provide a better quality of user experience regardless of the application type and scenario.However,privacy remains an open issue in this ubiquitous sharing platform due to massive and replicated data availability.In this paper,privacy-preserving decision-making for the data-sharing scheme is introduced.This scheme is responsible for improving the security in data sharing without the impact of replicated resources on communicating users.In this scheme,classification learning is used for identifying replicas and accessing granted resources independently.Based on the trust score of the available resources,this classification is recurrently performed to improve the reliability of information sharing.The user-level decisions for information sharing and access are made using the classification of the resources at the time of availability.This proposed scheme is verified using the metrics access delay,success ratio,computation complexity,and sharing loss.展开更多
The Internet of Things(IoT)has revolutionized how we interact with and gather data from our surrounding environment.IoT devices with various sensors and actuators generate vast amounts of data that can be harnessed to...The Internet of Things(IoT)has revolutionized how we interact with and gather data from our surrounding environment.IoT devices with various sensors and actuators generate vast amounts of data that can be harnessed to derive valuable insights.The rapid proliferation of Internet of Things(IoT)devices has ushered in an era of unprecedented data generation and connectivity.These IoT devices,equipped with many sensors and actuators,continuously produce vast volumes of data.However,the conventional approach of transmitting all this data to centralized cloud infrastructures for processing and analysis poses significant challenges.However,transmitting all this data to a centralized cloud infrastructure for processing and analysis can be inefficient and impractical due to bandwidth limitations,network latency,and scalability issues.This paper proposed a Self-Learning Internet Traffic Fuzzy Classifier(SLItFC)for traffic data analysis.The proposed techniques effectively utilize clustering and classification procedures to improve classification accuracy in analyzing network traffic data.SLItFC addresses the intricate task of efficiently managing and analyzing IoT data traffic at the edge.It employs a sophisticated combination of fuzzy clustering and self-learning techniques,allowing it to adapt and improve its classification accuracy over time.This adaptability is a crucial feature,given the dynamic nature of IoT environments where data patterns and traffic characteristics can evolve rapidly.With the implementation of the fuzzy classifier,the accuracy of the clustering process is improvised with the reduction of the computational time.SLItFC can reduce computational time while maintaining high classification accuracy.This efficiency is paramount in edge computing,where resource constraints demand streamlined data processing.Additionally,SLItFC’s performance advantages make it a compelling choice for organizations seeking to harness the potential of IoT data for real-time insights and decision-making.With the Self-Learning process,the SLItFC model monitors the network traffic data acquired from the IoT Devices.The Sugeno fuzzy model is implemented within the edge computing environment for improved classification accuracy.Simulation analysis stated that the proposed SLItFC achieves 94.5%classification accuracy with reduced classification time.展开更多
As an essential component of intelligent transportation systems(ITS),electric vehicles(EVs)can store massive amounts of electric power in their batteries and send power back to a charging station(CS)at peak hours to b...As an essential component of intelligent transportation systems(ITS),electric vehicles(EVs)can store massive amounts of electric power in their batteries and send power back to a charging station(CS)at peak hours to balance the power supply and generate profits.However,when the system collects the corresponding power data,several severe security and privacy issues are encountered.The identity and private injection data may be maliciously intercepted by network attackers and be tampered with to damage the services of ITS and smart grids.Existing approaches requiring high computational overhead render them unsuitable for the resource-constrained Internet of Things(IoT)environment.To address above problems,this paper proposes a blockchain-enabled secure and privacy-preserving data aggregation scheme for fog-based ITS.First,a fog computing and blockchain co-aware aggregation framework of power injection data is designed,which provides strong support for ITS to achieve secure and efficient power injection.Second,Paillier homomorphic encryption,the batch aggregation signature mechanism and a Bloom filter are effectively integrated with efficient aggregation of power injection data with security and privacy guarantees.In addition,the fine-grained homomorphic aggregation is designed for power injection data generated by all EVs,which provides solid data support for accurate power dispatching and supply management in ITS.Experiments show that the total computational cost is significantly reduced in the proposed scheme while providing security and privacy guarantees.The proposed scheme is more suitable for ITS with latency-sensitive applications and is also adapted to deploying devices with limited resources.展开更多
In this paper,we provide a new approach to data encryption using generalized inverses.Encryption is based on the implementation of weighted Moore–Penrose inverse A y MNenxmT over the nx8 constant matrix.The square He...In this paper,we provide a new approach to data encryption using generalized inverses.Encryption is based on the implementation of weighted Moore–Penrose inverse A y MNenxmT over the nx8 constant matrix.The square Hermitian positive definite matrix N8x8 p is the key.The proposed solution represents a very strong key since the number of different variants of positive definite matrices of order 8 is huge.We have provided NIST(National Institute of Standards and Technology)quality assurance tests for a random generated Hermitian matrix(a total of 10 different tests and additional analysis with approximate entropy and random digression).In the additional testing of the quality of the random matrix generated,we can conclude that the results of our analysis satisfy the defined strict requirements.This proposed MP encryption method can be applied effectively in the encryption and decryption of images in multi-party communications.In the experimental part of this paper,we give a comparison of encryption methods between machine learning methods.Machine learning algorithms could be compared by achieved results of classification concentrating on classes.In a comparative analysis,we give results of classifying of advanced encryption standard(AES)algorithm and proposed encryption method based on Moore–Penrose inverse.展开更多
In a smart grid, a huge amount of data is collected for various applications, such as load monitoring and demand response. These data are used for analyzing the power state and formulating the optimal dispatching stra...In a smart grid, a huge amount of data is collected for various applications, such as load monitoring and demand response. These data are used for analyzing the power state and formulating the optimal dispatching strategy. However, these big energy data in terms of volume, velocity and variety raise concern over consumers' privacy. For instance, in order to optimize energy utilization and support demand response, numerous smart meters are installed at a consumer's home to collect energy consumption data at a fine granularity, but these fine-grained data may contain information on the appliances and thus the consumer's behaviors at home. In this paper, we propose a privacy-preserving data aggregation scheme based on secret sharing with fault tolerance in a smart grid, which ensures that the control center obtains the integrated data without compromising privacy. Meanwhile, we also consider fault tolerance and resistance to differential attack during the data aggregation. Finally, we perform a security analysis and performance evaluation of our scheme in comparison with the other similar schemes. The analysis shows that our scheme can meet the security requirement, and it also shows better performance than other popular methods.展开更多
Medical data mining has become an essential task in healthcare sector to secure the personal and medical data of patients using privacy policy.In this background,several authentication and accessibility issues emerge ...Medical data mining has become an essential task in healthcare sector to secure the personal and medical data of patients using privacy policy.In this background,several authentication and accessibility issues emerge with an inten-tion to protect the sensitive details of the patients over getting published in open domain.To solve this problem,Multi Attribute Case based Privacy Preservation(MACPP)technique is proposed in this study to enhance the security of privacy-preserving data.Private information can be any attribute information which is categorized as sensitive logs in a patient’s records.The semantic relation between transactional patient records and access rights is estimated based on the mean average value to distinguish sensitive and non-sensitive information.In addition to this,crypto hidden policy is also applied here to encrypt the sensitive data through symmetric standard key log verification that protects the personalized sensitive information.Further,linear integrity verification provides authentication rights to verify the data,improves the performance of privacy preserving techni-que against intruders and assures high security in healthcare setting.展开更多
In recent years,with the development of the social Internet of Things(IoT),all kinds of data accumulated on the network.These data,which contain a lot of social information and opinions.However,these data are rarely f...In recent years,with the development of the social Internet of Things(IoT),all kinds of data accumulated on the network.These data,which contain a lot of social information and opinions.However,these data are rarely fully analyzed,which is a major obstacle to the intelligent development of the social IoT.In this paper,we propose a sentence similarity analysis model to analyze the similarity in people’s opinions on hot topics in social media and news pages.Most of these data are unstructured or semi-structured sentences,so the accuracy of sentence similarity analysis largely determines the model’s performance.For the purpose of improving accuracy,we propose a novel method of sentence similarity computation to extract the syntactic and semantic information of the semi-structured and unstructured sentences.We mainly consider the subjects,predicates and objects of sentence pairs and use Stanford Parser to classify the dependency relation triples to calculate the syntactic and semantic similarity between two sentences.Finally,we verify the performance of the model with the Microsoft Research Paraphrase Corpus(MRPC),which consists of 4076 pairs of training sentences and 1725 pairs of test sentences,and most of the data came from the news of social data.Extensive simulations demonstrate that our method outperforms other state-of-the-art methods regarding the correlation coefficient and the mean deviation.展开更多
Data mining is the extraction of vast interesting patterns or knowledge from huge amount of data. The initial idea of privacy-preserving data mining PPDM was to extend traditional data mining techniques to work with t...Data mining is the extraction of vast interesting patterns or knowledge from huge amount of data. The initial idea of privacy-preserving data mining PPDM was to extend traditional data mining techniques to work with the data modified to mask sensitive information. The key issues were how to modify the data and how to recover the data mining result from the modified data. Privacy-preserving data mining considers the problem of running data mining algorithms on confidential data that is not supposed to be revealed even to the party running the algorithm. In contrast, privacy-preserving data publishing (PPDP) may not necessarily be tied to a specific data mining task, and the data mining task may be unknown at the time of data publishing. PPDP studies how to transform raw data into a version that is immunized against privacy attacks but that still supports effective data mining tasks. Privacy-preserving for both data mining (PPDM) and data publishing (PPDP) has become increasingly popular because it allows sharing of privacy sensitive data for analysis purposes. One well studied approach is the k-anonymity model [1] which in turn led to other models such as confidence bounding, l-diversity, t-closeness, (α,k)-anonymity, etc. In particular, all known mechanisms try to minimize information loss and such an attempt provides a loophole for attacks. The aim of this paper is to present a survey for most of the common attacks techniques for anonymization-based PPDM & PPDP and explain their effects on Data Privacy.展开更多
Wearable technologies have the potential to become a valuable influence on human daily life where they may enable observing the world in new ways,including,for example,using augmented reality(AR)applications.Wearable ...Wearable technologies have the potential to become a valuable influence on human daily life where they may enable observing the world in new ways,including,for example,using augmented reality(AR)applications.Wearable technology uses electronic devices that may be carried as accessories,clothes,or even embedded in the user's body.Although the potential benefits of smart wearables are numerous,their extensive and continual usage creates several privacy concerns and tricky information security challenges.In this paper,we present a comprehensive survey of recent privacy-preserving big data analytics applications based on wearable sensors.We highlight the fundamental features of security and privacy for wearable device applications.Then,we examine the utilization of deep learning algorithms with cryptography and determine their usability for wearable sensors.We also present a case study on privacy-preserving machine learning techniques.Herein,we theoretically and empirically evaluate the privacy-preserving deep learning framework's performance.We explain the implementation details of a case study of a secure prediction service using the convolutional neural network(CNN)model and the Cheon-Kim-Kim-Song(CHKS)homomorphic encryption algorithm.Finally,we explore the obstacles and gaps in the deployment of practical real-world applications.Following a comprehensive overview,we identify the most important obstacles that must be overcome and discuss some interesting future research directions.展开更多
In this study, we delve into the realm of efficient Big Data Engineering and Extract, Transform, Load (ETL) processes within the healthcare sector, leveraging the robust foundation provided by the MIMIC-III Clinical D...In this study, we delve into the realm of efficient Big Data Engineering and Extract, Transform, Load (ETL) processes within the healthcare sector, leveraging the robust foundation provided by the MIMIC-III Clinical Database. Our investigation entails a comprehensive exploration of various methodologies aimed at enhancing the efficiency of ETL processes, with a primary emphasis on optimizing time and resource utilization. Through meticulous experimentation utilizing a representative dataset, we shed light on the advantages associated with the incorporation of PySpark and Docker containerized applications. Our research illuminates significant advancements in time efficiency, process streamlining, and resource optimization attained through the utilization of PySpark for distributed computing within Big Data Engineering workflows. Additionally, we underscore the strategic integration of Docker containers, delineating their pivotal role in augmenting scalability and reproducibility within the ETL pipeline. This paper encapsulates the pivotal insights gleaned from our experimental journey, accentuating the practical implications and benefits entailed in the adoption of PySpark and Docker. By streamlining Big Data Engineering and ETL processes in the context of clinical big data, our study contributes to the ongoing discourse on optimizing data processing efficiency in healthcare applications. The source code is available on request.展开更多
基金This work was supported by the National Natural Science Foundation of China under Grant 62233003the National Key Research and Development Program of China under Grant 2020YFB1708602.
文摘The proliferation of intelligent,connected Internet of Things(IoT)devices facilitates data collection.However,task workers may be reluctant to participate in data collection due to privacy concerns,and task requesters may be concerned about the validity of the collected data.Hence,it is vital to evaluate the quality of the data collected by the task workers while protecting privacy in spatial crowdsourcing(SC)data collection tasks with IoT.To this end,this paper proposes a privacy-preserving data reliability evaluation for SC in IoT,named PARE.First,we design a data uploading format using blockchain and Paillier homomorphic cryptosystem,providing unchangeable and traceable data while overcoming privacy concerns.Secondly,based on the uploaded data,we propose a method to determine the approximate correct value region without knowing the exact value.Finally,we offer a data filtering mechanism based on the Paillier cryptosystem using this value region.The evaluation and analysis results show that PARE outperforms the existing solution in terms of performance and privacy protection.
基金This work was supported by the National Natural Science Foundation of China(No.61702276)the Startup Foundation for Introducing Talent of Nanjing University of Information Science and Technology under Grant 2016r055 and the Priority Academic Program Development(PAPD)of Jiangsu Higher Education Institutions.The authors are grateful for the anonymous reviewers who made constructive comments and improvements.
文摘Advanced cloud computing technology provides cost saving and flexibility of services for users.With the explosion of multimedia data,more and more data owners would outsource their personal multimedia data on the cloud.In the meantime,some computationally expensive tasks are also undertaken by cloud servers.However,the outsourced multimedia data and its applications may reveal the data owner’s private information because the data owners lose the control of their data.Recently,this thought has aroused new research interest on privacy-preserving reversible data hiding over outsourced multimedia data.In this paper,two reversible data hiding schemes are proposed for encrypted image data in cloud computing:reversible data hiding by homomorphic encryption and reversible data hiding in encrypted domain.The former is that additional bits are extracted after decryption and the latter is that extracted before decryption.Meanwhile,a combined scheme is also designed.This paper proposes the privacy-preserving outsourcing scheme of reversible data hiding over encrypted image data in cloud computing,which not only ensures multimedia data security without relying on the trustworthiness of cloud servers,but also guarantees that reversible data hiding can be operated over encrypted images at the different stages.Theoretical analysis confirms the correctness of the proposed encryption model and justifies the security of the proposed scheme.The computation cost of the proposed scheme is acceptable and adjusts to different security levels.
基金supported in part by NSFC under Grant No.61172090National Science and Technology Major Project under Grant 2012ZX03002001+3 种基金Research Fund for the Doctoral Program of Higher Education of China under Grant No.20120201110013Scientific and Technological Project in Shaanxi Province under Grant(No.2012K06-30, No.2014JQ8322)Basic Science Research Fund in Xi'an Jiaotong University(No. XJJ2014049,No.XKJC2014008)Shaanxi Science and Technology Innovation Project (2013SZS16-Z01/P01/K01)
文摘With the increasing popularity of cloud computing,privacy has become one of the key problem in cloud security.When data is outsourced to the cloud,for data owners,they need to ensure the security of their privacy;for cloud service providers,they need some information of the data to provide high QoS services;and for authorized users,they need to access to the true value of data.The existing privacy-preserving methods can't meet all the needs of the three parties at the same time.To address this issue,we propose a retrievable data perturbation method and use it in the privacy-preserving in data outsourcing in cloud computing.Our scheme comes in four steps.Firstly,an improved random generator is proposed to generate an accurate "noise".Next,a perturbation algorithm is introduced to add noise to the original data.By doing this,the privacy information is hidden,but the mean and covariance of data which the service providers may need remain unchanged.Then,a retrieval algorithm is proposed to get the original data back from the perturbed data.Finally,we combine the retrievable perturbation with the access control process to ensure only the authorized users can retrieve the original data.The experiments show that our scheme perturbs date correctly,efficiently,and securely.
文摘An attempt has been made to develop a distributed software infrastructure model for onboard data fusion system simulation, which is also applied to netted radar systems, onboard distributed detection systems and advanced C3I systems. Two architectures are provided and verified: one is based on pure TCP/IP protocol and C/S model, and implemented with Winsock, the other is based on CORBA (common object request broker architecture). The performance of data fusion simulation system, i.e. reliability, flexibility and scalability, is improved and enhanced by two models. The study of them makes valuable explore on incorporating the distributed computation concepts into radar system simulation techniques.
文摘With the development of Internet of Things(IoT),the delay caused by network transmission has led to low data processing efficiency.At the same time,the limited computing power and available energy consumption of IoT terminal devices are also the important bottlenecks that would restrict the application of blockchain,but edge computing could solve this problem.The emergence of edge computing can effectively reduce the delay of data transmission and improve data processing capacity.However,user data in edge computing is usually stored and processed in some honest-but-curious authorized entities,which leads to the leakage of users’privacy information.In order to solve these problems,this paper proposes a location data collection method that satisfies the local differential privacy to protect users’privacy.In this paper,a Voronoi diagram constructed by the Delaunay method is used to divide the road network space and determine the Voronoi grid region where the edge nodes are located.A random disturbance mechanism that satisfies the local differential privacy is utilized to disturb the original location data in each Voronoi grid.In addition,the effectiveness of the proposed privacy-preserving mechanism is verified through comparison experiments.Compared with the existing privacy-preserving methods,the proposed privacy-preserving mechanism can not only better meet users’privacy needs,but also have higher data availability.
基金supported in part by Zhejiang Provincial Natural Science Foundation of China under Grant nos.LZ22F020002 and LY22F020003National Natural Science Foundation of China under Grant nos.61772018 and 62002226the key project of Humanities and Social Sciences in Colleges and Universities of Zhejiang Province under Grant no.2021GH017.
文摘The fast proliferation of edge devices for the Internet of Things(IoT)has led to massive volumes of data explosion.The generated data is collected and shared using edge-based IoT structures at a considerably high frequency.Thus,the data-sharing privacy exposure issue is increasingly intimidating when IoT devices make malicious requests for filching sensitive information from a cloud storage system through edge nodes.To address the identified issue,we present evolutionary privacy preservation learning strategies for an edge computing-based IoT data sharing scheme.In particular,we introduce evolutionary game theory and construct a payoff matrix to symbolize intercommunication between IoT devices and edge nodes,where IoT devices and edge nodes are two parties of the game.IoT devices may make malicious requests to achieve their goals of stealing privacy.Accordingly,edge nodes should deny malicious IoT device requests to prevent IoT data from being disclosed.They dynamically adjust their own strategies according to the opponent's strategy and finally maximize the payoffs.Built upon a developed application framework to illustrate the concrete data sharing architecture,a novel algorithm is proposed that can derive the optimal evolutionary learning strategy.Furthermore,we numerically simulate evolutionarily stable strategies,and the final results experimentally verify the correctness of the IoT data sharing privacy preservation scheme.Therefore,the proposed model can effectively defeat malicious invasion and protect sensitive information from leaking when IoT data is shared.
基金supported in part by the National Natural Science Foundation of China(No.61272084,61202004)the Natural Science Foundation of Jiangsu Province(No.BK20130096)the Project of Natural Science Research of Jiangsu University(No.14KJB520031,No.11KJA520002)
文摘Wireless sensor networks(WSNs)consist of a great deal of sensor nodes with limited power,computation,storage,sensing and communication capabilities.Data aggregation is a very important technique,which is designed to substantially reduce the communication overhead and energy expenditure of sensor node during the process of data collection in a WSNs.However,privacy-preservation is more challenging especially in data aggregation,where the aggregators need to perform some aggregation operations on sensing data it received.We present a state-of-the art survey of privacy-preserving data aggregation in WSNs.At first,we classify the existing privacy-preserving data aggregation schemes into different categories by the core privacy-preserving techniques used in each scheme.And then compare and contrast different algorithms on the basis of performance measures such as the privacy protection ability,communication consumption,power consumption and data accuracy etc.Furthermore,based on the existing work,we also discuss a number of open issues which may intrigue the interest of researchers for future work.
基金supported in part by Major Science and Technology Demonstration Project of Jiangsu Provincial Key R&D Program under Grant No.BE2023025in part by the National Natural Science Foundation of China under Grant No.62302238+2 种基金in part by the Natural Science Foundation of Jiangsu Province under Grant No.BK20220388in part by the Natural Science Research Project of Colleges and Universities in Jiangsu Province under Grant No.22KJB520004in part by the China Postdoctoral Science Foundation under Grant No.2022M711689.
文摘This paper presents a comprehensive exploration into the integration of Internet of Things(IoT),big data analysis,cloud computing,and Artificial Intelligence(AI),which has led to an unprecedented era of connectivity.We delve into the emerging trend of machine learning on embedded devices,enabling tasks in resource-limited environ-ments.However,the widespread adoption of machine learning raises significant privacy concerns,necessitating the development of privacy-preserving techniques.One such technique,secure multi-party computation(MPC),allows collaborative computations without exposing private inputs.Despite its potential,complex protocols and communication interactions hinder performance,especially on resource-constrained devices.Efforts to enhance efficiency have been made,but scalability remains a challenge.Given the success of GPUs in deep learning,lever-aging embedded GPUs,such as those offered by NVIDIA,emerges as a promising solution.Therefore,we propose an Embedded GPU-based Secure Two-party Computation(EG-STC)framework for Artificial Intelligence(AI)systems.To the best of our knowledge,this work represents the first endeavor to fully implement machine learning model training based on secure two-party computing on the Embedded GPU platform.Our experimental results demonstrate the effectiveness of EG-STC.On an embedded GPU with a power draw of 5 W,our implementation achieved a secure two-party matrix multiplication throughput of 5881.5 kilo-operations per millisecond(kops/ms),with an energy efficiency ratio of 1176.3 kops/ms/W.Furthermore,leveraging our EG-STC framework,we achieved an overall time acceleration ratio of 5–6 times compared to solutions running on server-grade CPUs.Our solution also exhibited a reduced runtime,requiring only 60%to 70%of the runtime of previously best-known methods on the same platform.In summary,our research contributes to the advancement of secure and efficient machine learning implementations on resource-constrained embedded devices,paving the way for broader adoption of AI technologies in various applications.
基金sponsored by the National Natural Science Foundation of China under grant number No. 62172353, No. 62302114, No. U20B2046 and No. 62172115Innovation Fund Program of the Engineering Research Center for Integration and Application of Digital Learning Technology of Ministry of Education No.1331007 and No. 1311022+1 种基金Natural Science Foundation of the Jiangsu Higher Education Institutions Grant No. 17KJB520044Six Talent Peaks Project in Jiangsu Province No.XYDXX-108
文摘With the rapid development of information technology,IoT devices play a huge role in physiological health data detection.The exponential growth of medical data requires us to reasonably allocate storage space for cloud servers and edge nodes.The storage capacity of edge nodes close to users is limited.We should store hotspot data in edge nodes as much as possible,so as to ensure response timeliness and access hit rate;However,the current scheme cannot guarantee that every sub-message in a complete data stored by the edge node meets the requirements of hot data;How to complete the detection and deletion of redundant data in edge nodes under the premise of protecting user privacy and data dynamic integrity has become a challenging problem.Our paper proposes a redundant data detection method that meets the privacy protection requirements.By scanning the cipher text,it is determined whether each sub-message of the data in the edge node meets the requirements of the hot data.It has the same effect as zero-knowledge proof,and it will not reveal the privacy of users.In addition,for redundant sub-data that does not meet the requirements of hot data,our paper proposes a redundant data deletion scheme that meets the dynamic integrity of the data.We use Content Extraction Signature(CES)to generate the remaining hot data signature after the redundant data is deleted.The feasibility of the scheme is proved through safety analysis and efficiency analysis.
基金supported by the National Natural Science Foundation of China(No.62206238)the Natural Science Foundation of Jiangsu Province(Grant No.BK20220562)the Natural Science Research Project of Universities in Jiangsu Province(No.22KJB520010).
文摘Federated learning for edge computing is a promising solution in the data booming era,which leverages the computation ability of each edge device to train local models and only shares the model gradients to the central server.However,the frequently transmitted local gradients could also leak the participants’private data.To protect the privacy of local training data,lots of cryptographic-based Privacy-Preserving Federated Learning(PPFL)schemes have been proposed.However,due to the constrained resource nature of mobile devices and complex cryptographic operations,traditional PPFL schemes fail to provide efficient data confidentiality and lightweight integrity verification simultaneously.To tackle this problem,we propose a Verifiable Privacypreserving Federated Learning scheme(VPFL)for edge computing systems to prevent local gradients from leaking over the transmission stage.Firstly,we combine the Distributed Selective Stochastic Gradient Descent(DSSGD)method with Paillier homomorphic cryptosystem to achieve the distributed encryption functionality,so as to reduce the computation cost of the complex cryptosystem.Secondly,we further present an online/offline signature method to realize the lightweight gradients integrity verification,where the offline part can be securely outsourced to the edge server.Comprehensive security analysis demonstrates the proposed VPFL can achieve data confidentiality,authentication,and integrity.At last,we evaluate both communication overhead and computation cost of the proposed VPFL scheme,the experimental results have shown VPFL has low computation costs and communication overheads while maintaining high training accuracy.
基金supported by the Deanship of Scientific Research(DSR),King Abdulaziz University,Jeddah,under grant No.(DF-203-611-1441)。
文摘The introduction of the Internet of Things(IoT)paradigm serves as pervasive resource access and sharing platform for different real-time applications.Decentralized resource availability,access,and allocation provide a better quality of user experience regardless of the application type and scenario.However,privacy remains an open issue in this ubiquitous sharing platform due to massive and replicated data availability.In this paper,privacy-preserving decision-making for the data-sharing scheme is introduced.This scheme is responsible for improving the security in data sharing without the impact of replicated resources on communicating users.In this scheme,classification learning is used for identifying replicas and accessing granted resources independently.Based on the trust score of the available resources,this classification is recurrently performed to improve the reliability of information sharing.The user-level decisions for information sharing and access are made using the classification of the resources at the time of availability.This proposed scheme is verified using the metrics access delay,success ratio,computation complexity,and sharing loss.
基金This research is funded by 2023 Henan Province Science and Technology Research Projects:Key Technology of Rapid Urban Flood Forecasting Based onWater Level Feature Analysis and Spatio-Temporal Deep Learning(No.232102320015)Henan Provincial Higher Education Key Research Project Program(Project No.23B520024)a Multi-Sensor-Based Indoor Environmental Parameters Monitoring and Control System.
文摘The Internet of Things(IoT)has revolutionized how we interact with and gather data from our surrounding environment.IoT devices with various sensors and actuators generate vast amounts of data that can be harnessed to derive valuable insights.The rapid proliferation of Internet of Things(IoT)devices has ushered in an era of unprecedented data generation and connectivity.These IoT devices,equipped with many sensors and actuators,continuously produce vast volumes of data.However,the conventional approach of transmitting all this data to centralized cloud infrastructures for processing and analysis poses significant challenges.However,transmitting all this data to a centralized cloud infrastructure for processing and analysis can be inefficient and impractical due to bandwidth limitations,network latency,and scalability issues.This paper proposed a Self-Learning Internet Traffic Fuzzy Classifier(SLItFC)for traffic data analysis.The proposed techniques effectively utilize clustering and classification procedures to improve classification accuracy in analyzing network traffic data.SLItFC addresses the intricate task of efficiently managing and analyzing IoT data traffic at the edge.It employs a sophisticated combination of fuzzy clustering and self-learning techniques,allowing it to adapt and improve its classification accuracy over time.This adaptability is a crucial feature,given the dynamic nature of IoT environments where data patterns and traffic characteristics can evolve rapidly.With the implementation of the fuzzy classifier,the accuracy of the clustering process is improvised with the reduction of the computational time.SLItFC can reduce computational time while maintaining high classification accuracy.This efficiency is paramount in edge computing,where resource constraints demand streamlined data processing.Additionally,SLItFC’s performance advantages make it a compelling choice for organizations seeking to harness the potential of IoT data for real-time insights and decision-making.With the Self-Learning process,the SLItFC model monitors the network traffic data acquired from the IoT Devices.The Sugeno fuzzy model is implemented within the edge computing environment for improved classification accuracy.Simulation analysis stated that the proposed SLItFC achieves 94.5%classification accuracy with reduced classification time.
基金The authors received Funding for this study from the National Natural Science Foundation of China(No.61971235)the China Postdoctoral Science Foundation(No.2018M630590)+1 种基金the Jiangsu Planned Projects for Postdoctoral Research Funds(No.2021K501C)the 333 High-level Talents Training Project of Jiangsu Province,and the 1311 Talents Plan of NJUPT.
文摘As an essential component of intelligent transportation systems(ITS),electric vehicles(EVs)can store massive amounts of electric power in their batteries and send power back to a charging station(CS)at peak hours to balance the power supply and generate profits.However,when the system collects the corresponding power data,several severe security and privacy issues are encountered.The identity and private injection data may be maliciously intercepted by network attackers and be tampered with to damage the services of ITS and smart grids.Existing approaches requiring high computational overhead render them unsuitable for the resource-constrained Internet of Things(IoT)environment.To address above problems,this paper proposes a blockchain-enabled secure and privacy-preserving data aggregation scheme for fog-based ITS.First,a fog computing and blockchain co-aware aggregation framework of power injection data is designed,which provides strong support for ITS to achieve secure and efficient power injection.Second,Paillier homomorphic encryption,the batch aggregation signature mechanism and a Bloom filter are effectively integrated with efficient aggregation of power injection data with security and privacy guarantees.In addition,the fine-grained homomorphic aggregation is designed for power injection data generated by all EVs,which provides solid data support for accurate power dispatching and supply management in ITS.Experiments show that the total computational cost is significantly reduced in the proposed scheme while providing security and privacy guarantees.The proposed scheme is more suitable for ITS with latency-sensitive applications and is also adapted to deploying devices with limited resources.
基金the support of Network Communication Technology(NCT)Research Groups,FTSM,UKM in providing facilities for this research.This paper is supported under the Dana Impak Perdana UKM DIP-2018-040 and Fundamental Research Grant Scheme FRGS/1/2018/TK04/UKM/02/7.
文摘In this paper,we provide a new approach to data encryption using generalized inverses.Encryption is based on the implementation of weighted Moore–Penrose inverse A y MNenxmT over the nx8 constant matrix.The square Hermitian positive definite matrix N8x8 p is the key.The proposed solution represents a very strong key since the number of different variants of positive definite matrices of order 8 is huge.We have provided NIST(National Institute of Standards and Technology)quality assurance tests for a random generated Hermitian matrix(a total of 10 different tests and additional analysis with approximate entropy and random digression).In the additional testing of the quality of the random matrix generated,we can conclude that the results of our analysis satisfy the defined strict requirements.This proposed MP encryption method can be applied effectively in the encryption and decryption of images in multi-party communications.In the experimental part of this paper,we give a comparison of encryption methods between machine learning methods.Machine learning algorithms could be compared by achieved results of classification concentrating on classes.In a comparative analysis,we give results of classifying of advanced encryption standard(AES)algorithm and proposed encryption method based on Moore–Penrose inverse.
文摘In a smart grid, a huge amount of data is collected for various applications, such as load monitoring and demand response. These data are used for analyzing the power state and formulating the optimal dispatching strategy. However, these big energy data in terms of volume, velocity and variety raise concern over consumers' privacy. For instance, in order to optimize energy utilization and support demand response, numerous smart meters are installed at a consumer's home to collect energy consumption data at a fine granularity, but these fine-grained data may contain information on the appliances and thus the consumer's behaviors at home. In this paper, we propose a privacy-preserving data aggregation scheme based on secret sharing with fault tolerance in a smart grid, which ensures that the control center obtains the integrated data without compromising privacy. Meanwhile, we also consider fault tolerance and resistance to differential attack during the data aggregation. Finally, we perform a security analysis and performance evaluation of our scheme in comparison with the other similar schemes. The analysis shows that our scheme can meet the security requirement, and it also shows better performance than other popular methods.
文摘Medical data mining has become an essential task in healthcare sector to secure the personal and medical data of patients using privacy policy.In this background,several authentication and accessibility issues emerge with an inten-tion to protect the sensitive details of the patients over getting published in open domain.To solve this problem,Multi Attribute Case based Privacy Preservation(MACPP)technique is proposed in this study to enhance the security of privacy-preserving data.Private information can be any attribute information which is categorized as sensitive logs in a patient’s records.The semantic relation between transactional patient records and access rights is estimated based on the mean average value to distinguish sensitive and non-sensitive information.In addition to this,crypto hidden policy is also applied here to encrypt the sensitive data through symmetric standard key log verification that protects the personalized sensitive information.Further,linear integrity verification provides authentication rights to verify the data,improves the performance of privacy preserving techni-que against intruders and assures high security in healthcare setting.
基金supported by the Major Scientific and Technological Projects of CNPC under Grant ZD2019-183-006partially supported by the Shandong Provincial Natural Science Foundation,China under Grant ZR2020MF006partially supported by“the Fundamental Research Funds for the Central Universities”of China University of Petroleum(East China)under Grant 20CX05017A,18CX02139A.
文摘In recent years,with the development of the social Internet of Things(IoT),all kinds of data accumulated on the network.These data,which contain a lot of social information and opinions.However,these data are rarely fully analyzed,which is a major obstacle to the intelligent development of the social IoT.In this paper,we propose a sentence similarity analysis model to analyze the similarity in people’s opinions on hot topics in social media and news pages.Most of these data are unstructured or semi-structured sentences,so the accuracy of sentence similarity analysis largely determines the model’s performance.For the purpose of improving accuracy,we propose a novel method of sentence similarity computation to extract the syntactic and semantic information of the semi-structured and unstructured sentences.We mainly consider the subjects,predicates and objects of sentence pairs and use Stanford Parser to classify the dependency relation triples to calculate the syntactic and semantic similarity between two sentences.Finally,we verify the performance of the model with the Microsoft Research Paraphrase Corpus(MRPC),which consists of 4076 pairs of training sentences and 1725 pairs of test sentences,and most of the data came from the news of social data.Extensive simulations demonstrate that our method outperforms other state-of-the-art methods regarding the correlation coefficient and the mean deviation.
文摘Data mining is the extraction of vast interesting patterns or knowledge from huge amount of data. The initial idea of privacy-preserving data mining PPDM was to extend traditional data mining techniques to work with the data modified to mask sensitive information. The key issues were how to modify the data and how to recover the data mining result from the modified data. Privacy-preserving data mining considers the problem of running data mining algorithms on confidential data that is not supposed to be revealed even to the party running the algorithm. In contrast, privacy-preserving data publishing (PPDP) may not necessarily be tied to a specific data mining task, and the data mining task may be unknown at the time of data publishing. PPDP studies how to transform raw data into a version that is immunized against privacy attacks but that still supports effective data mining tasks. Privacy-preserving for both data mining (PPDM) and data publishing (PPDP) has become increasingly popular because it allows sharing of privacy sensitive data for analysis purposes. One well studied approach is the k-anonymity model [1] which in turn led to other models such as confidence bounding, l-diversity, t-closeness, (α,k)-anonymity, etc. In particular, all known mechanisms try to minimize information loss and such an attempt provides a loophole for attacks. The aim of this paper is to present a survey for most of the common attacks techniques for anonymization-based PPDM & PPDP and explain their effects on Data Privacy.
文摘Wearable technologies have the potential to become a valuable influence on human daily life where they may enable observing the world in new ways,including,for example,using augmented reality(AR)applications.Wearable technology uses electronic devices that may be carried as accessories,clothes,or even embedded in the user's body.Although the potential benefits of smart wearables are numerous,their extensive and continual usage creates several privacy concerns and tricky information security challenges.In this paper,we present a comprehensive survey of recent privacy-preserving big data analytics applications based on wearable sensors.We highlight the fundamental features of security and privacy for wearable device applications.Then,we examine the utilization of deep learning algorithms with cryptography and determine their usability for wearable sensors.We also present a case study on privacy-preserving machine learning techniques.Herein,we theoretically and empirically evaluate the privacy-preserving deep learning framework's performance.We explain the implementation details of a case study of a secure prediction service using the convolutional neural network(CNN)model and the Cheon-Kim-Kim-Song(CHKS)homomorphic encryption algorithm.Finally,we explore the obstacles and gaps in the deployment of practical real-world applications.Following a comprehensive overview,we identify the most important obstacles that must be overcome and discuss some interesting future research directions.
文摘In this study, we delve into the realm of efficient Big Data Engineering and Extract, Transform, Load (ETL) processes within the healthcare sector, leveraging the robust foundation provided by the MIMIC-III Clinical Database. Our investigation entails a comprehensive exploration of various methodologies aimed at enhancing the efficiency of ETL processes, with a primary emphasis on optimizing time and resource utilization. Through meticulous experimentation utilizing a representative dataset, we shed light on the advantages associated with the incorporation of PySpark and Docker containerized applications. Our research illuminates significant advancements in time efficiency, process streamlining, and resource optimization attained through the utilization of PySpark for distributed computing within Big Data Engineering workflows. Additionally, we underscore the strategic integration of Docker containers, delineating their pivotal role in augmenting scalability and reproducibility within the ETL pipeline. This paper encapsulates the pivotal insights gleaned from our experimental journey, accentuating the practical implications and benefits entailed in the adoption of PySpark and Docker. By streamlining Big Data Engineering and ETL processes in the context of clinical big data, our study contributes to the ongoing discourse on optimizing data processing efficiency in healthcare applications. The source code is available on request.