Privacy-preserving clustering for big data in cyber-physical-social systems: Survey and perspectives (2020)
Clustering techniques play a critical role in data mining and have achieved great success in application problems such as community analysis, image retrieval, personalized recommendation, and activity prediction. This paper first reviews the traditional clustering and the emerging multiple clustering methods, respectively. Although the existing methods perform well on some small or specific datasets, they fall short when clustering is performed on CPSS big data because of the high cost of computation and storage. With powerful cloud computing, this challenge can be effectively addressed, but outsourcing brings enormous threats to individual or corporate privacy. Privacy-preserving data mining has therefore attracted widespread attention in academia. Compared to other reviews, this paper focuses on privacy-preserving clustering techniques, providing a detailed overview and discussion. Specifically, we introduce a novel privacy-preserving tensor-based multiple clustering, propose a privacy-preserving tensor-based multiple clustering analytic and service framework, and give an illustrative case study on a public transportation dataset. Furthermore, we indicate the remaining challenges of privacy-preserving clustering and discuss significant future research in this area.
Keywords: CPSS | Big data | Cloud computing | Privacy preserving | Clustering
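The paper's tensor-based method is not detailed in the abstract, but the general idea of privacy-preserving clustering can be sketched with k-means whose centroid updates are perturbed by Laplace noise, a standard differential-privacy building block. The data, noise scale, and parameters below are all hypothetical, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated synthetic 2-D clusters (hypothetical data).
data = np.vstack([rng.normal(0.0, 0.1, (50, 2)),
                  rng.normal(5.0, 0.1, (50, 2))])

def private_kmeans(points, k=2, epsilon=1.0, iters=10, seed=0):
    """k-means where each centroid update is perturbed with Laplace noise,
    so that no single record dominates the published cluster centers."""
    gen = np.random.default_rng(seed)
    centers = points[gen.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assign every point to its nearest current center.
        labels = np.argmin(
            np.linalg.norm(points[:, None] - centers[None], axis=2), axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):
                # Noise scale shrinks as clusters grow: one record's
                # influence on the mean is bounded by 1/len(members).
                noise = gen.laplace(0.0, 1.0 / (epsilon * len(members)),
                                    size=points.shape[1])
                centers[j] = members.mean(axis=0) + noise
    return centers, labels

centers, labels = private_kmeans(data)
```

Smaller `epsilon` (stronger privacy) injects larger noise and degrades cluster quality, which is exactly the privacy/utility trade-off the survey discusses.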
A hybrid deep learning model for efficient intrusion detection in big data environment (2020)
The volume of network and Internet traffic is expanding daily, with data being created at the petabyte to zettabyte scale at an exceptionally high rate. These data can be characterized as big data, because they are large in volume, variety, velocity, and veracity. Security threats to networks, the Internet, websites, and organizations are growing alongside this growth in usage. Detecting intrusions in such a big data environment is difficult. Various intrusion-detection systems (IDSs) using artificial intelligence or machine learning have been proposed for different types of network attacks, but most of these systems either cannot recognize unknown attacks or cannot respond to such attacks in real time. Deep learning models, recently applied to large-scale big data analysis, have shown remarkable performance in general but have not been examined for detection of intrusions in a big data environment. This paper proposes a hybrid deep learning model to efficiently detect network intrusions based on a convolutional neural network (CNN) and a weight-dropped long short-term memory (WDLSTM) network. We use the deep CNN to extract meaningful features from IDS big data and the WDLSTM to retain long-term dependencies among the extracted features, with dropped weights on the recurrent connections preventing overfitting. The proposed hybrid method was compared with traditional approaches in terms of performance on a publicly available dataset, demonstrating its satisfactory performance.
Keywords: Big data | Intrusion detection | Deep learning | Convolutional neural network | Weight-dropped long short-term memory network
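The CNN-then-recurrent pipeline the abstract describes can be illustrated in miniature with plain NumPy; the fixed kernels and exponentially decaying state below are hypothetical stand-ins for the learned CNN filters and the WDLSTM cell, not the paper's actual architecture:

```python
import numpy as np

def conv1d_relu(x, kernels):
    """Toy CNN stage: valid 1-D convolution of each kernel with the input,
    followed by ReLU. Stands in for learned convolutional feature maps."""
    feats = np.array([np.convolve(x, k, mode="valid") for k in kernels])
    return np.maximum(feats, 0.0)

def recurrent_pool(feats, decay=0.9):
    """Toy recurrent stage: an exponentially decaying running state per
    feature map, a crude stand-in for the long-term memory of an LSTM."""
    state = np.zeros(feats.shape[0])
    for t in range(feats.shape[1]):
        state = decay * state + (1 - decay) * feats[:, t]
    return state

# Hypothetical traffic signal (e.g. a packet-size time series), length 200.
x = np.sin(np.linspace(0, 8 * np.pi, 200)) + 0.1
kernels = [np.array([1.0, -1.0, 0.0]),   # crude edge detector
           np.array([0.25, 0.5, 0.25])]  # crude smoother
summary = recurrent_pool(conv1d_relu(x, kernels))
```

In the real model the `summary` vector would feed a classifier that labels the traffic window as normal or as one of the attack classes.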
The digital surgeon: How big data, automation, and artificial intelligence will change surgical practice (2020)
Exponential growth in computing power, data storage, and sensing technology has led to a world in which we can both capture and analyze incredible amounts of data. The evolution of machine learning has further advanced the ability of computers to develop insights from massive data sets that are beyond the capacity of human analysis. The convergence of computational power, data storage, connectivity, and Artificial Intelligence (AI) has led to health technologies that, to date, have focused on diagnostic areas such as radiology and pathology. The question remains how the digital revolution will translate in the realm of surgery. There are three main areas where the authors believe that AI could impact surgery in the near future: enhancement of training modalities, cognitive enhancement of the surgeon, and procedural automation. While the promise of Big Data, AI, and Automation is high, there have been unanticipated missteps in the use of such technologies that are worth considering as we evaluate how such technologies could/should be adopted in surgical practice. Surgeons must be prepared to adopt smarter training modalities, supervise the learning of machines that can enhance cognitive function, and ultimately oversee autonomous surgery without allowing for a decay in the surgeon's operating skills.
Keywords: Future pediatric surgery | Automation and artificial intelligence in pediatric surgery
What Can We Learn About Drug Safety and Other Effects in the Era of Electronic Health Records and Big Data That We Would Not Be Able to Learn From Classic Epidemiology? (2020)
As more and more health systems have converted to the use of electronic health records, the amount of searchable and analyzable data is exploding. This includes not just provider- or laboratory-created data but also data collected by instruments, personal devices, and patients themselves, among others. This has led to more attention being paid to the analysis of these data to answer previously unaddressed questions. This is especially important given the number of therapies previously found to be beneficial in clinical trials that are currently being re-scrutinized. Because there are orders of magnitude more information contained in these data sets, a fundamentally different approach needs to be taken to their processing and analysis and the generation of knowledge. Health care and medicine are drivers of this phenomenon and will ultimately be the main beneficiaries. Concurrently, many different types of questions can now be asked using these data sets. Research groups have become increasingly active in mining large data sets, including nationwide health care databases, to learn about associations of medication use and various unrelated diseases such as cancer. Given the recent increase in research activity in this area, its promise to radically change clinical research, and the relative lack of widespread knowledge about its potential and advances, we surveyed the available literature to understand the strengths and limitations of these new tools. We also outline new databases and techniques that are available to researchers worldwide, with special focus on work pertaining to the broad and rapid monitoring of drug safety and secondary effects.
Keywords: Electronic health record | Big data | Drug safety | Health care database | Cancer risk
UNIC: A fast nonparametric clustering (2020)
Clustering is among the tools for exploring, analyzing, and deriving information from data. In the case of large data sets, the real burden to the application of clustering algorithms can be their complexity and demand for control parameters. We present a new fast nonparametric clustering algorithm, UNIC, to address these challenges. To identify clusters, the algorithm evaluates the distances between selected points and other points in the set. While assessing these distances, it employs methods of robust statistics to identify the cluster borders. The performance of the proposed algorithm is assessed in an experimental study and compared with several existing clustering methods over a variety of benchmark data sets.
Keywords: Cluster analysis | Hard (conventional, crisp) clustering | Nonparametric algorithms | Data mining | Big data
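The abstract's core idea, detecting cluster borders from distances using robust statistics, might be sketched as follows; the median/MAD gap test and the toy data are illustrative assumptions, not the actual UNIC algorithm:

```python
import numpy as np

def robust_border(dists, c=5.0):
    """Locate a cluster border in a vector of distances: sort them and flag
    the first successive gap that is a robust outlier, i.e. larger than
    median + c * MAD of all gaps. Returns how many points fall inside the
    border. A sketch of the border-detection idea, not the full algorithm."""
    d = np.sort(np.asarray(dists, dtype=float))
    gaps = np.diff(d)
    med = np.median(gaps)
    mad = np.median(np.abs(gaps - med))       # median absolute deviation
    jumps = np.where(gaps > med + c * (mad + 1e-12))[0]
    return int(jumps[0] + 1) if len(jumps) else len(d)

# Hypothetical 1-D example: four points near 0, three points near 5.
pts = np.array([0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2])
inside = robust_border(np.abs(pts - pts[0]))  # distances from the first point
```

Because median and MAD ignore the single huge gap when setting the threshold, the test stays stable even though that gap dominates the mean, which is the point of using robust statistics here.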
GeoVReality: A computational interactive virtual reality visualization framework and workflow for geophysical research (2020)
We present a new interactive computational virtual reality (VR) visualization framework for geophysical Big Data and models, aimed at the development of immersive collaborative virtual reality applications with a focus on targeted processing and interaction of Big Data. The framework includes a high-performance scalable persistent storage solution for the spatial analysis of Geographic Information System (GIS) data, which uses an engine based on efficient in-memory computing. To more effectively visualize and interact in a VR environment, a machine learning algorithm library is used for compressing and extracting visual data. The framework supports mainstream rendering engines and VR hardware. It is extensible, customizable, cross-platform, and based only on open source tools. A workflow is introduced, and the geophysical data visualization and interaction effects are demonstrated using the abyssal data of the Mariana Trench as an example.
Keywords: Virtual reality | Geophysical model | Interactive visualization | Unreal engine | Unity 3D | Big data
Leveraging Google Earth Engine (GEE) and machine learning algorithms to incorporate in situ measurements from different times for rangelands monitoring (2020)
Mapping and monitoring of indicators of soil cover, vegetation structure, and various native and non-native species is a critical aspect of rangeland management. With the advancement in satellite imagery as well as cloud storage and computing, the capability now exists to conduct planetary-scale analysis, including mapping of rangeland indicators. Combined with recent investments in the collection of large amounts of in situ data in the western U.S., new approaches using machine learning can enable prediction of surface conditions at times and places when no in situ data are available. However, little analysis has yet been done on how the temporal relevancy of training data influences model performance. Here, we have leveraged the Google Earth Engine (GEE) platform and a machine learning algorithm (Random Forest, after comparison with other candidates) to identify the potential impact of different sampling times (across months and years) on estimation of rangeland indicators from the Bureau of Land Management's (BLM) Assessment, Inventory, and Monitoring (AIM) and Landscape Monitoring Framework (LMF) programs. Our results indicate that temporally relevant training data improve predictions, though the training data need not be from the exact same month and year for a prediction to be temporally relevant. Moreover, inclusion of training data from the time when predictions are desired leads to lower prediction error, but the addition of training data from other times does not add to overall model error. Using all of the available training data can lead to biases toward the mean for times when indicator values are especially high or low. However, for mapping purposes, limiting training data to just the time when predictions are desired can lead to poor predictions of values outside the spatial range of the training data for that period.
We conclude that the best Random Forest prediction maps will use training data from all possible times with the understanding that estimates at the extremes will be biased.
Keywords: Google earth engine | Big data | Machine learning | Domain adaptation | Transfer learning | Feature selection | Rangeland monitoring
Comparative analysis of surface water quality prediction performance and identification of key water parameters using different machine learning models based on big data (2020)
The water quality prediction performance of machine learning models may depend not only on the models themselves but also on the parameters in the data set chosen for training them. Moreover, the key water parameters should also be identified by the learning models, in order to further reduce prediction costs and improve prediction efficiency. Here we endeavored for the first time to compare the water quality prediction performance of 10 learning models (7 traditional and 3 ensemble models) using big data (33,612 observations) from the major rivers and lakes in China from 2012 to 2018, based on precision, recall, F1-score, and weighted F1-score, and to explore potential key water parameters for future model prediction. Our results showed that bigger data improved the performance of learning models in predicting water quality. Compared to the other 7 models, decision tree (DT), random forest (RF), and deep cascade forest (DCF) trained on data sets of pH, DO, CODMn, and NH3-N had significantly better performance in predicting all 6 levels of water quality defined by the Chinese government. Moreover, two key water parameter sets (DO, CODMn, and NH3-N; CODMn and NH3-N) were identified and validated by DT, RF, and DCF to have high specificity for predicting water quality. Therefore, DT, RF, and DCF with selected key water parameters could be prioritized for future water quality monitoring and timely water quality warning.
Keywords: Water quality prediction | Machine learning models | Ensemble methods | Deep cascade forest | Key water parameters
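A trained decision tree over the key parameters (DO, CODMn, NH3-N) effectively learns threshold rules. A hand-written stand-in with illustrative per-class limits (loosely modeled on the Chinese surface-water standard; the values and the voting rule are assumptions, not taken from the paper) looks like:

```python
# Illustrative per-class limits (Classes I-V) for the three key parameters.
DO_MIN    = [7.5, 6.0, 5.0, 3.0, 2.0]    # dissolved oxygen, mg/L (lower bound)
CODMN_MAX = [2.0, 4.0, 6.0, 10.0, 15.0]  # permanganate index, mg/L (upper bound)
NH3N_MAX  = [0.15, 0.5, 1.0, 1.5, 2.0]   # ammonia nitrogen, mg/L (upper bound)

def quality_level(do, codmn, nh3n):
    """Grade a sample from Level 1 (best) to Level 6 (worst): each parameter
    votes for the worst class it satisfies, and the sample gets the maximum."""
    def first(conds):
        # Index of the first class whose limit the value satisfies.
        for i, ok in enumerate(conds):
            if ok:
                return i + 1
        return 6  # worse than Class V
    return max(first([do >= t for t in DO_MIN]),
               first([codmn <= t for t in CODMN_MAX]),
               first([nh3n <= t for t in NH3N_MAX]))

level = quality_level(do=8.0, codmn=1.5, nh3n=0.1)  # all within Class I
```

A real DT/RF/DCF would learn such split points from the 33,612 observations rather than take them from a table, but the resulting decision surface has this same axis-aligned threshold structure.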
A nonlinear data-driven reduced order model for computational homogenization with physics/pattern-guided sampling (2020)
Developing an accurate nonlinear reduced order model from simulation data has been an outstanding research topic for many years. For many physical systems, data collection is very expensive and the optimal data distribution is not known in advance. Thus, maximizing the information gain remains a grand challenge. In a recent paper, Bhattacharjee and Matouš (2016) proposed a manifold-based nonlinear reduced order model for multiscale problems in mechanics of materials. Expanding this work here, we develop a novel sampling strategy based on the physics/pattern-guided data distribution. Our adaptive sampling strategy relies on enrichment of sub-manifolds based on the principal stretches and rotational sensitivity analysis. This novel sampling strategy substantially decreases the number of snapshots needed for accurate reduced order model construction (i.e., ∼ 5× reduction of snapshots over Bhattacharjee and Matouš (2016)). Moreover, we build the nonlinear manifold using the displacement rather than deformation gradient data. We provide rigorous verification and error assessment. Finally, we demonstrate both localization and homogenization of the multiscale solution on a large particulate composite unit cell.
Keywords: Computational homogenization | Nonlinear manifold | Reduced order model | Machine learning | Parallel computing | Big data
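The snapshot-based reduced order modeling that this line of work builds on can be illustrated with plain proper orthogonal decomposition (POD) via the SVD. This generic sketch with synthetic snapshots is not the authors' manifold-based model; sizes, seed, and the energy cutoff are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic snapshot matrix: 200 DOFs x 30 snapshots that actually live on a
# 3-dimensional subspace plus tiny noise (a stand-in for simulation data).
modes_true = rng.normal(size=(200, 3))
coeffs = rng.normal(size=(3, 30))
snapshots = modes_true @ coeffs + 1e-6 * rng.normal(size=(200, 30))

# POD: keep the leading left singular vectors of the snapshot matrix.
U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
energy = np.cumsum(s**2) / np.sum(s**2)
r = int(np.searchsorted(energy, 0.9999)) + 1  # smallest rank with 99.99% energy
basis = U[:, :r]

# Reduce and reconstruct one snapshot through the low-dimensional basis.
x = snapshots[:, 0]
x_hat = basis @ (basis.T @ x)                 # project, then lift back
err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
```

The sampling-strategy contribution of the paper addresses exactly the question this sketch dodges: which snapshots to compute in the first place so that few of them suffice for an accurate basis.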
Deciphering the recreational use of urban parks: Experiments using multi-source big data for all Chinese cities (2020)
China's rapid urbanization process has accentuated the disparity between the demand for and supply of its park recreational services. Estimations of park use and an understanding of the factors that influence it are critical for increasing these services. However, the data traditionally used to quantify park use are often subjective as well as costly and laborious to procure. This paper assessed the use of parks through an analysis of check-in data obtained from the Weibo social media platform for 13,759 parks located in all 287 cities at prefecture level and above across China. We investigated how park attributes, accessibility, and the socioeconomic environment affected the number and density of park check-ins. We used multiple linear regression models to analyze the factors influencing check-ins for park visits. The results showed that in all the cities, the influence of external factors on the number and density of check-in visits, notably the densities of points of interest (POIs) and bus stops around the parks, was significantly positive, with the density of POIs being the most influential factor. Conversely, park attributes, which included the park service area and the landscape shape index (LSI), negatively influenced park use. The density of POIs and bus stops located around the park positively influenced the density of the recreational use of urban parks in cities within all administrative tiers, whereas the impact of park service areas was negative in all of them. Finally, the factors with the greatest influence varied according to the administrative tiers of the cities. These findings provide valuable inputs for increasing the efficiency of park use and improving recreational services according to the characteristics of different cities.
Keywords: Weibo check-ins | Park attributes | Regression models | Park usage | China
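The multiple linear regression used in such studies can be sketched with NumPy least squares on synthetic data whose coefficient signs mimic the reported findings (POI and bus-stop density positive, service area negative); all numbers here are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300

# Hypothetical predictors for each park (standardized units).
poi_density  = rng.normal(size=n)   # points of interest around the park
bus_density  = rng.normal(size=n)   # bus stops around the park
service_area = rng.normal(size=n)   # park service area

# Synthetic response with the reported signs baked in (coefficients made up).
checkins = (1.5 * poi_density + 0.8 * bus_density
            - 0.6 * service_area + rng.normal(scale=0.3, size=n))

# Ordinary least squares: intercept column plus the three predictors.
X = np.column_stack([np.ones(n), poi_density, bus_density, service_area])
beta, *_ = np.linalg.lstsq(X, checkins, rcond=None)
```

The fitted `beta` recovers the sign pattern of the study: positive weights on POI and bus-stop density, a negative weight on service area, with POI density the largest in magnitude.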