sensitivity in data mining

This dataset is the simplified version of diabetes data available at Kaggle. Upper and lower bounds: the function is used to initialize the enumeration tree. We will take a simple binary class classification problem to calculate the confusion matrix and evaluate accuracy, sensitivity, and specificity. Although the algorithm of [11] reduces the risk of privacy leakage to a certain extent, the proposed K-anonymity model cannot solve the problems of homogeneous attacks and background attacks. 14 of Conferences in Research and Practice in Information Technology, ACS, Maebashi City, Japan. The Flexible Partition algorithm is briefly described below. The dataset selected in the data mining uses two kinds of tourist reviews of the Tongcheng tourism website as an experimental dataset. Firstly, the enumeration tree is initialized according to the initial sliding window dataset and absolute support degree; then the algorithm uses the time characteristics of data arrival and departure to mine sensitive data and prunes the enumerate trees. The experiment acquires the total data length, the longest data item length, the shortest data item length, and the running time. Check if you have access through your login credentials or your institution to get full access on this article.

The sensitive data recognition rate is 100%, and it is not difficult to see that when the threshold value is 1/13, the running time is the least, that is, the optimal threshold point, 3. Necessary cookies are absolutely essential for the website to function properly. The threshold cannot be dynamically changed in the algorithm FIMoTS, so it is not suitable for the mining of sensitive data of various sizes of datasets, resulting in low mining efficiency. & Hamilton, H. J. To manage your alert preferences, click on the button below. This has led to the mining and protection of sensitive information, that is, through the mining of super large amounts of data to obtain important information of users. And through mathematical derivation, we also saw that at the optimum cut-off point the accuracy is also equal to Sensitivity and Specificity. ]], Rachels, J. This paper improves the algorithm FIMoTS in the process of mining sensitive data. 8 0 obj proposed the SWM-MFI algorithm in 2015 [7]. electrical things lighting problems buffet warren differently sir ll think cord royal However, it has been found in practice that this method does not protect user information well, and it is difficult for data recovery, which destroys the usefulness of data. These cookies do not store any personal information. (1999), Heuristic measures of interestingness, in J. Zytkow & J. Rauch, eds, '3rd European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'99)', Vol. 439--450. In view of the above problems, this paper mainly uses the following steps to mine and protect the data. 61871412) and Natural Science Foundation of Anhui Province (No. Although the above method can efficiently mine sensitive data, it ignores the most important temporal characteristics of data flow. endobj 11 0 obj /Type/Font Use to save sensitive data, and a single sensitive dataset consists of and two datasets, which represent changes in the upper and lower bounds of sensitive data; use to save nonsensitive data, and a single nonsensitive dataset consists of and two datasets, indicating nonsensitive data type changes. We will build the model and calculate the confusion matrix using a step-by-step approach. For a large dataset , it can be divided into two anonymized groups and , where , and when the equation is established. Anonymous groups can be divided into W anonymous groups. /Filter[/FlateDecode] <<

Massive text clustering and topic extraction based on sensitive data can obtain accurate and sensitive information, but it is not suitable for mining sensitive information in social networks [2]. Among them, the data stream has strong time characteristics, and there is also the risk of sensitive information being tampered with and eavesdropped. 9 0 obj When there are itemsets removed, the lower bound of the type change of becomes . resistance ground testing grounding system electrical earthing supplier pabx voltage email automatic philippine thru attendance low previous regulator We also use third-party cookies that help us analyze and understand how you use this website. So far, we have calculated the confusion matrix and accuracy with cut-off=0.5. We renamed the algorithm based on the dynamic rounding function K- anonymous algorithm as DIDF algorithm, and the specific algorithm steps are as shown in Algorithm 2. The training dataset will be used to build the model, Based on training data we will create a new dataframe named dib_train which will include original and predicted diabetes, Create predicted diabetes based on a 0.5 cut-off probability. The DIDF algorithm always tries to split the dataset of a single time period into more anonymous groups and has stronger advantages in processing the data on the boundary line, which can fully reflect the time characteristics of the data flow in the social network. (1975), 'Why privacy is important', Philosophy and Public Affairs4(4). (2001), Evaluation of interestingness measures for ranking discovered knowledge, in D. W.-L. Cheung, G. J. Williams & Q. Li, eds, '5th Pacific-Asia Conference on Knowledge Discovery and Data Mining - PAKDD 2001', Vol. It is an NxN matrix which helps to evaluate the performance of machine learning model for classification problem. The dataset studied in this paper mainly comes from online commentary. Let us calculate the value of Sensitivity, Specificity, and accuracy at the optimum point. Let the phrase after lexical analysis be . The model performance in a classification problem is assessed through a confusion matrix. Data Science and Machine Learning is my recent interest area. Contact: FarhadMalik84@googlemail.com, Review on Suspicious Human Action Recognition, Chart 2.6 a lot of new visualizations, negative values and more flexible parameters, Bertelsmann Avarato: Customer Segmentation and Potential Customer Prediction, Data Science current and future landscape. The phrase was used to analyze the word and word frequency to obtain a new phrase . 147/quotedblleft/quotedblright/bullet/endash/emdash/tilde/trademark/scaron/guilsinglright/oe However, SWM-FI and FIUI-Strean are based on sliding window to mine sensitive data. With the rapid development of network technology, Internet platforms such as search engines, social networks, and e-commerce have generated a large amount of data when it is convenient for users. Diabetes is dependent or response variable. Table 1 lists the data characteristics of the two datasets. ]], Evfimievski, A., Srikant, R., Agrawal, R. & Gehrke, J. (2001), 'Information week research'. This paper optimizes the K-anonymity algorithm [10] known as Flexible Partition algorithm based on the rounding partition function, which regards time as an important attribute. We intend to explore several directions in future work, including extending the algorithm to deal with the frequent pattern mining on data stream in distributed environment. This blog aims to bridge the gap between technologists, mathematicians and financial experts and helps them understand how fundamental concepts work within each field. (1998), 'Data mining: Staking a claim on your privacy'. IEEE International Conference on Data Mining Workshop on Privacy, Security, and Data Mining, Vol. The intermediate threshold value is in the range of [1/14, 1/12]. Let , where denotes the -th network comment. At the optimum cut-off or crossing point, the sensitivity and specificity are equal. Therefore, how to effectively protect the sensitive data excavated becomes another important research area. stream This method can determine the optimal threshold in the same dataset, thereby improving the experimental efficiency. As can be seen from the table, there is a certain wrong prediction or misclassification. The algorithm mainly uses enumeration tree as the data structure to save data. Firstly, we segment the data stream and extract sensitive words. 4. Human and societal aspects of security and privacy. Based on the sliding window processing method, the time period is the processing unit, which increases the computational efficiency. Among them, denotes the phrase after the segmentation of the -th dataset. Ch. Figures 2(a) and 2(b) show the relationship between the threshold and the sensitivity data recognition rate; there are maps showing that when the threshold range is 1/14, 1/12, sensitive data identification rate is highest, up to 100%. 14/Zcaron/zcaron/caron/dotlessi/dotlessj/ff/ffi/ffl 30/grave/quotesingle/space/exclam/quotedbl/numbersign/dollar/percent/ampersand/quoteright/parenleft/parenright/asterisk/plus/comma/hyphen/period/slash/zero/one/two/three/four/five/six/seven/eight/nine/colon/semicolon/less/equal/greater/question/at/A/B/C/D/E/F/G/H/I/J/K/L/M/N/O/P/Q/R/S/T/U/V/W/X/Y/Z/bracketleft/backslash/bracketright/asciicircum/underscore/quoteleft/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u/v/w/x/y/z/braceleft/bar/braceright/asciitilde ]], Pinkas, B. 159/Ydieresis 161/exclamdown/cent/sterling/currency/yen/brokenbar/section/dieresis/copyright/ordfeminine/guillemotleft/logicalnot/hyphen/registered/macron/degree/plusminus/twosuperior/threesuperior/acute/mu/paragraph/periodcentered/cedilla/onesuperior/ordmasculine/guillemotright/onequarter/onehalf/threequarters/questiondown/Agrave/Aacute/Acircumflex/Atilde/Adieresis/Aring/AE/Ccedilla/Egrave/Eacute/Ecircumflex/Edieresis/Igrave/Iacute/Icircumflex/Idieresis/Eth/Ntilde/Ograve/Oacute/Ocircumflex/Otilde/Odieresis/multiply/Oslash/Ugrave/Uacute/Ucircumflex/Udieresis/Yacute/Thorn/germandbls/agrave/aacute/acircumflex/atilde/adieresis/aring/ae/ccedilla/egrave/eacute/ecircumflex/edieresis/igrave/iacute/icircumflex/idieresis/eth/ntilde/ograve/oacute/ocircumflex/otilde/odieresis/divide/oslash/ugrave/uacute/ucircumflex/udieresis/yacute/thorn/ydieresis] Any cookies that may not be particularly necessary for the website to function and is used specifically to collect user personal data via analytics, ads, other embedded contents are termed as non-necessary cookies. ]], Miller, M. (1991), A model of statistical database compromise incorporating supplementary knowledge, in B. Srinivasan & J. Zeleznikow, eds, 'Second Australian Database-Information Systems Conference', World Scientific, Sydney. In the enumeration tree, the parent node data item is a subset of the child nodes. >> endobj These cookies will be stored in your browser only with your consent. ]], Johnson, D. G. & Nissenbaum, H. (1995), Computers, ethics and social values, Prentice-Hall, New Jersey. Furthermore, because sensitive data mining may lead to personal privacy leakage, we intend to add a differential privacy method into our SL-SDMA method. Copyright 2018 Xiaoyao Zheng et al. (1995), On subjective measures of interestingness in knowledge discovery, in U. M. Fayyad & R. Uthurusamy, eds, 'First International Conference on Knowledge Discovery and Data Mining (KDD-95)', AAAI Press, Menlo Park, CA, USA, Montreal, Quebec, Canada, pp. Theoretical analysis and experimental results show that the algorithm can effectively mine the sensitive data in the data stream and can effectively protect the sensitive data. And Blood Sugar Levelis an independent variable or feature variable to predict diabetes. This paper first uses NLPs lexical data package THULAC to preprocess the dataset. At the same time, threshold sensitivity analysis is used to find out the optimal threshold. In addition, the sliding window in our algorithm is divided according to the time, which can fully reflect the time characteristics of the data stream, so the time efficiency of SL-SDMA is higher. The experimental results after the operation are similar to the experimental results of the Flexible Partition algorithm which is shown in Figure 4. Read the winning articles. /Type/Encoding >> /Differences[1/dotaccent/fi/fl/fraction/hungarumlaut/Lslash/lslash/ogonek/ring 11/breve/minus The SL-SDMA algorithm adopted in this paper changes the threshold value dynamically. As can be seen from the figure, the blue dashed line is the dividing line, and the line graph of the red threshold and time is roughly divided into three parts. Finally, the threshold self-learning function is used to determine the threshold for finding the minimum time spent in ensuring the accuracy of mining data. 2035 of Lecture Notes in Computer Science, Springer, Hong Kong, China, pp. << The specific steps are shown in Figure 1.

The specific algorithm steps are as shown in Algorithm 1. ]], Estivill-Castro, V., Brankovic, L. & Dowe, D. (1999), 'Privacy in data mining', Privacy - Law and Policy Reporter6(3), 33--35. This assumes that the data is divided exactly at 0.5 probability. For example, when and dataset , is known based on the algorithm, so after the algorithm operation, two anonymous groups and are generated. The cross point provides the optimum cutoff to create boundaries between classes. This category only includes cookies that ensures basic functionalities and security features of the website. Finally, the algorithm sets the upper and lower boundaries of data changes to improve the mining efficiency. This website uses cookies to improve your experience while you navigate through the website. FN = confusion[1,0] # false negatives, cut-off point. Among them, denotes the phrase after the word segmentation of the -th dataset, denotes the -th phrase after the lexical analysis of the -th group dataset, and denotes the part of speech of the phrase . The storage of the calculation results is realized. /Type/Font [4] presented the FIUT-Stream algorithm in 2013, and Yin Shaohong et al. The plot between sensitivity, specificity, and accuracy shows their variation with various values of cut-off. xYK W-TM}\N*#FHMXIjE'9~ Ea=wIw1Qyzwa=~1`[W^O2xhjG, Threshold self-learning-sensitive data mining algorithm: Mining time of experiment on two datasets. /Type/Font ]], Hilderman, R. J. (2002), 'Randomization in privacy-preserving data mining', SigKDD Explorations4(2), 28--34. Therefore, when the dataset size is 28900, the threshold value is 1/13, which can not only guarantee the highest recognition rate of sensitive data, but also guarantee the shortest running time, which is the optimal threshold point in this paper. The FIMoTS algorithm mentioned in [6] is more in line with the characteristics of data flow in todays social networks, which emphasizes the influence of time on the value of data. The process caters for differing sensitivities at the attribute value level and allows a variety of sensitivity combination functions to be employed. Moreover, the accuracy of most, if not all, of the classification machine learning models, is measured by their specificity and sensitivity.

& Tuzhilin, A. 14 of Conferences in Research and Practice in Information Technology, ACS, Maebashi City, Japan, p. Next, we find the optimal threshold. (Source:-https://towardsdatascience.com/confusion-matrix-for-your-multi-class-machine-learning-model-ff9aa3bf7826), From the confusion matrix Accuracy, Sensitivity and Specificity is evaluated using the following equations. Wu, Q.-M. Tang, W.-W. Ni, and Z.-H. Sun, Algorithm for k-anonymity based on rounded partition function,, A. Masoumzadeh and J. Joshi, An alternative approach to k-anonymity for location-based services,, L. Yang, H. Zhifeng, and W. Wen, Research on differential privacy preserving k-means clustering,. So, at the cut-off point where sensitivity and specificity are equal, the accuracy is also the same. ]], Clarke, R. (1997), Privacy and dataveillance, and organisational strategy, in 'Region 8 EDPAC'96 Information Systems Audit and Control Assoc. /Type/Font By defining the upper and lower bounds of the data item type, the enumeration tree and the data collection information are updated only when the relative support degree reaches the upper and lower bounds of the type change, thereby saving the calculation time. The advantage is that the threshold determined by self-learning can make the accuracy of data extraction the highest, but when comparing threshold effects, a large number of calculations are generated, which greatly reduces the mining efficiency. Current methods for protecting sensitive data include the privacy protection method based on K-anonymity [10], anonymized privacy protection technology based on clustering [11], and differential privacy protection [12]. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. /Subtype/Type1 13 0 obj Therefore it is one of the most important topics that all data scientists should be familiar with as it explains the concepts behind the two most common metrics.

sensitivity in data miningbest stand for samsung rear speakers

Compare & Book

Cheap Flights, Trains, Buses and more

Your journey starts when you leave the doorstep.
Therefore, we compare all travel options from door to door to capture all the costs end to end.

Flights

Ride share

Bicycle

Coach travel

Trains

Taxi

All travel options in one overview

CombiTrip is unique

Popular Bus, Train and Flight routes around Europe

Popular routes in The Netherlands

Popular Bus, Train and Flight routes in France

Popular Bus, Train and Flight routes in Germany

Popular Bus, Train and Flight routes in Spain

Popular Bus, Train and Flight routes in Italy