Tuesday, 31 January 2017

Article: Data Mining Case Study



Introduction
Constantly evolving technology has made it easier to store data, and a range of methods has been developed to analyze it. The field that brings these methods together is called data mining. In this article we first look at data mining and its algorithms in general. We then examine clustering: in which areas it is applied and for what purpose. After that we focus on k-means, one of the clustering algorithms, and its methodology. Finally, we pose a research question, analyze the properties of the data, and solve the problem in the WEKA application.


What is Data Mining?
Technology has advanced rapidly, and with it the ease of accessing information. Storing information has become as easy as accessing it, so the volume of data is growing rapidly and data is everywhere. This data has to be processed and analyzed. The many data analysis algorithms developed for this purpose are gathered under one roof called data mining. In short, data mining is the process of extracting useful information from large amounts of data by applying a number of algorithms.

Data mining combines many disciplines, including database systems, statistics, machine learning, and pattern recognition. The algebraic, geometric, and probabilistic viewpoints of data play a key role in data mining. Given a dataset of n points in a d-dimensional space, the fundamental analysis and mining tasks include exploratory data analysis, frequent pattern discovery, data clustering, and classification models (Data Mining and Analysis: Fundamental Concepts and Algorithms, Mohammed J. Zaki and Wagner Meira Jr., Cambridge University Press, pg: 26).

Data mining is mostly used in market analysis and management, corporate analysis and risk management, and fraud detection. The main problems are identifying target customer types, forecasting the future, and prevention. Data mining can also be used in many ways in everyday life. Some examples:
·         Evaluating treatment requests made to hospitals by time, place, and need helps in the early stages of epidemic risk assessment, control, and resource planning.

·         A model that identifies the profiles of people who use energy illegally makes it possible to predict potential illegal users and to combat illegal use effectively at low cost.

·         A study that predicts highway traffic density by region and time helps minimize accident rates, for example, through correct resource planning at the right time.

·         When implementing public support schemes, institutional risk scoring increases the success of the programs by directing the right amount of support to organizations with the right goals. Similarly, identifying the profiles that carry a risk of non-payment when allocating credit reduces the amount of bad loans.


Commonly cited application areas of data mining are: marketing, banking, retailing and sales, manufacturing and production, brokerage and securities trading, government and defense, computer hardware and software, airlines, health care, broadcasting, homeland security, insurance, and police (Week4Presentation, Keziban Seçkin, AYBU, pg: 12).


Looking at data mining algorithms in general, the main groups are: classification algorithms (the most commonly used are Naive Bayes and decision trees), clustering algorithms (the most commonly used are hierarchical clustering and k-means), association rules (the Apriori algorithm), text mining, and web mining.

Classification is used to analyze the historical data stored in a database and to automatically generate a model that can predict future behavior. Its objective is to find a derived model that describes and distinguishes data classes or concepts (Week4Presentation, Keziban Seçkin, AYBU, pg: 8).
·         Naive Bayes: a classification algorithm, so it needs training data to build a classifier (model) and test data to estimate the accuracy of the classification rules (supervised learning). The learned model can then be used to make predictions.


·         Decision tree: a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in the tree is the root node (Data Mining Tutorial, Tutorials Point, pg: 31).
Clustering: partitioning a database into segments in which the members of a segment share similar qualities. We examine this topic in detail later.
Association: a category of data mining algorithm that establishes relationships between items that occur together in a given record.
Text mining: the application of data mining to unstructured or less structured text files. It entails generating meaningful numerical indices from the unstructured text and then processing these indices with various data mining algorithms. That is, by scanning an existing text (counting how many times each word occurs and looking at the frequency distribution), it yields a meaningful result.
Web mining: the discovery and analysis of interesting and useful information from the Web, about the Web, and usually through Web-based tools (Week4Presentation, Keziban Seçkin, AYBU).
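The word-counting idea behind text mining can be illustrated with a small sketch. This is only a minimal example under simple assumptions: the function name `term_frequencies` and the sample sentence are made up for illustration, and real text mining tools do much more (stop-word removal, stemming, weighting).

```python
import re
from collections import Counter

def term_frequencies(text):
    """Return a Counter mapping each lowercase word to how often it occurs."""
    # Split the text into words made of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Hypothetical sample text, used only for illustration.
sample = "Data mining finds patterns in data. Mining text means counting words in text."
freqs = term_frequencies(sample)

# The counts are the "numerical indices" that later algorithms can work on.
print(freqs.most_common(3))
```

The resulting counts turn unstructured text into numbers, which can then be fed to other data mining algorithms such as clustering.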

CLUSTERING ALGORITHMS
What is Clustering?

Clustering is the grouping of data so that similar items are kept close together. Classification is mostly used as a supervised learning method, clustering as an unsupervised one. The logic of clustering is this: make the similarity within each cluster as high as possible and the difference between clusters as large as possible. Clustering is among the oldest data mining techniques. It can be used in many areas. Consider, for example, the records of the products bought by the customers of a business. Clustering algorithms can extract useful information from such data sets, for example how many shirts to produce in each size (small, medium, large, etc.).

A more academic definition: clustering is a standard procedure in multivariate data analysis. It is designed to explore an inherent natural structure of the data objects, where objects in the same cluster are as similar as possible and objects in different clusters are as dissimilar as possible.

Clustering is a form of exploratory data analysis. Therefore, the explorer might have little or no information about the parameters of the resulting cluster analysis. In typical uses of clustering the goal is to determine all of the following: the number of clusters, the absolute and relative positions of the clusters, the size of the clusters, the shape of the clusters, and the density of the clusters.
The cluster properties are explored in the process of the cluster analysis, which can be split into the following steps.
1. Definition of objects: Which are the objects for the cluster analysis?
2. Definition of clustering purpose: What is the interest in clustering the objects?
3. Definition of features: Which are the features that describe the objects?
4. Definition of similarity measure: How can the objects be compared?
5. Definition of clustering algorithm: Which algorithm is suitable for clustering the data?
6. Definition of cluster quality: How good is the clustering result? What is the interpretation? (Clustering Algorithms and Evaluations, pg: 180)
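As an illustration of step 4, one common way to compare two objects described by numerical features is the Euclidean distance. The sketch below is only one possible choice of similarity measure, and the function name `euclidean_distance` is introduced here for illustration.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points with the same number of features."""
    if len(a) != len(b):
        raise ValueError("both points must have the same number of features")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two objects described by two numerical features each.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```

The smaller the distance, the more similar the two objects are considered to be. For categorical or binary features a different measure would be needed.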
Application of Cluster Analysis
Cluster analysis is widely used in areas such as market research, pattern recognition, data analysis, and image processing.
Clustering can also help marketers find distinct target groups in their customer base and describe those groups based on their purchasing patterns.
In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functions and gain insight into structures inherent to populations.
Clustering also helps in description of areas of similar land use in an earth observation database. It also helps in the description of groups of houses in a city according to house type, value, and geographic location.
Clustering also helps in classifying texts on the web for information exploration.
Clustering is also used in outlier detection applications such as the detection of credit card fraud.

Requirements of Clustering in Data Mining
The following points describe what is required of clustering algorithms in data mining:
Scalability - We need highly scalable clustering algorithms to deal with large databases.
Ability to deal with different kinds of attributes - Algorithms should be applicable to any kind of data, such as interval-based (numerical), categorical, and binary data.
Discovery of clusters with arbitrary shape - The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be limited to distance measures that tend to find only small spherical clusters.
High dimensionality - The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional spaces.
Ability to deal with noisy data - Databases contain noisy, missing, or erroneous data. Some algorithms are sensitive to such data, which may lead to poor-quality clusters.
Interpretability - The clustering results should be interpretable, comprehensible, and usable.

Clustering Algorithms
In this section we describe the most well-known clustering algorithms. The main reason for having many clustering methods is the fact that the notion of “cluster” is not precisely defined (Estivill-Castro, 2000). Consequently many clustering methods have been developed, each of which uses a different induction principle. Fraley and Raftery (1998) suggest dividing the clustering methods into two main groups: hierarchical and partitioning methods. Han and Kamber (2001) suggest three additional main categories: density-based methods, model-based clustering, and grid-based methods. An alternative categorization based on the induction principle of the various clustering methods is presented in (Estivill-Castro, 2000).
We briefly discuss hierarchical methods and then focus on the k-means algorithm.
Hierarchical Methods
Hierarchical clustering involves creating clusters that have a predetermined ordering from top to bottom.
Typical clustering applications include:
      the discovery of different customer groups in grocery retail and of the shopping patterns of these groups,
      the grouping of similar genes according to plant and animal taxonomies and functions in biology,
      the grouping of houses according to type, value, and geographical location in city planning,
      the grouping of documents for information discovery on the Internet.

Summary of Hierarchical Clustering Methods
• No need to specify the number of clusters in advance.
• The hierarchical structure maps nicely onto human intuition for some domains.
• They do not scale well: time complexity of at least O(n^2), where n is the total number of objects.
• As with any heuristic search algorithm, local optima are a problem.
• Interpretation of results is (very) subjective.
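To make the bottom-up (agglomerative) flavor of hierarchical clustering concrete, here is a minimal sketch. It assumes single linkage (the distance between two clusters is the distance between their closest members), which is only one of several possible choices, and the name `single_linkage` is introduced for illustration. The `num_clusters` argument simply cuts the hierarchy at a chosen level; the merge order itself does not depend on it.

```python
import math

def single_linkage(points, num_clusters):
    """Repeatedly merge the two closest clusters until num_clusters remain."""
    clusters = [[p] for p in points]  # start with every point in its own cluster
    while len(clusters) > num_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])  # merge cluster j into cluster i
        del clusters[j]                  # j > i, so index i stays valid
    return clusters

# Hypothetical two-dimensional data, used only for illustration.
points = [(1.0, 1.0), (1.5, 1.0), (5.0, 5.0), (5.5, 5.2), (9.0, 1.0)]
for cluster in single_linkage(points, 3):
    print(cluster)
```

Comparing every pair of clusters in every round is what makes this naive version scale poorly as the number of objects grows.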
K-MEANS
K-means, one of the earliest clustering algorithms, was developed by J. B. MacQueen in 1967. It is an unsupervised clustering method that groups data into k clusters so as to minimize the sum of squared distances between the data points and the mean (center) of their cluster. It has two main goals:

1- The values within a cluster are as similar as possible.
2- The values in different clusters are as dissimilar as possible.
The main idea is to define a center for each cluster. The number of clusters, k, is not found by the algorithm; it is chosen in advance by the analyst. Choosing k is the most difficult phase: if k is too small, objects that we want in different clusters can fall into the same cluster, and if k is too large, the objects are dispersed too much. Once k is chosen, the initial cluster centers are selected, usually at random. A simple alternative is to choose data points that are far apart from one another. After the centers are selected, each data point is assigned to the nearest center using the Euclidean distance. New centers are then computed as the mean of each cluster, and the assignment is repeated. These steps are iterated until no data point changes cluster. The k-means assignment mechanism allows each data point to belong to only one cluster.
In summary, the data set is separated into k clusters, the distance of each point to the cluster centers is measured, and each center is recomputed as the mean of its cluster. This is where the name k-means comes from.
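The sum-of-squares criterion that k-means tries to minimize can be written out explicitly. The notation is an assumption introduced here for illustration: C_j is the j-th cluster, \mu_j is its mean (center), and x_i is a data point.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

A smaller J means the points lie closer to the centers of their own clusters. Neither assigning points to their nearest center nor recomputing the centers as means can increase J.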

How K-Means Works

1) Randomly select ‘k’ cluster centers.
2) Calculate the distance between each data point and cluster centers.
3) Assign each data point to the cluster center that is nearest to it.
4) Recalculate the cluster centers as the mean of the points assigned to each cluster.
5) Recalculate the distance between each data point and the newly obtained cluster centers.
6) If no data point was reassigned then stop, otherwise repeat.
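The steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function name `k_means`, the `seed` and `max_iterations` parameters, and the sample data are assumptions made for the example. Points are assumed to be tuples of numbers.

```python
import math
import random

def k_means(points, k, max_iterations=100, seed=0):
    """Cluster points (tuples of numbers) into k groups; return (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1: pick k data points as initial centers
    labels = None
    for _ in range(max_iterations):
        # Steps 2-3: assign each point to the index of its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # step 6: nothing was reassigned, so stop
            break
        labels = new_labels
        # Step 4: move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centers, labels

# Hypothetical two-dimensional data, used only for illustration.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centers, labels = k_means(points, k=2)
print(centers)
print(labels)
```

Because the initial centers are chosen at random, a different `seed` can lead to a different final clustering. Running the algorithm several times and keeping the best result is a common workaround.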
ADVANTAGES:
*It scales to large data sets and can be run with MapReduce.
*It works well when the clusters are symmetric (roughly spherical).
*Fast, robust, and easy to understand.
DISADVANTAGES:
*It is difficult to choose the number of clusters (k).
*If two clusters are very similar (overlapping), k-means cannot tell that there are two clusters.
*Different representations of the same data can give different results.
*Random selection of the initial cluster centers may lead to poor results.
*The algorithm does not work well on non-linear (non-linearly separable) data sets.
*It is sensitive to noisy data and outliers, because these points are also assigned to clusters.
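The sensitivity to noisy data follows directly from the use of the mean as the cluster center. A small worked example with made-up one-dimensional values shows how a single outlier pulls the center away from the rest of the cluster.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

cluster = [1.0, 2.0, 3.0]          # a tight cluster
with_outlier = cluster + [100.0]   # the same cluster plus one noisy value

print(mean(cluster))       # 2.0
print(mean(with_outlier))  # 26.5
```

With the outlier included, the center no longer lies near any of the original points, which in turn can change which points are assigned to the cluster.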




CONCLUSION OF THE ARTICLE
We examined data mining and saw that it draws on machine learning. Data has to be analyzed, and that requires methods; data mining gathers these algorithms together. Using them makes life easier, because they help us understand what large amounts of data mean and how to act in the future. Preparing strategies is also very important: using today's data, firms can predict the future and apply strategic plans. In conclusion, data mining is of great importance in the Information Age, because data can be reached everywhere, but understanding it is a science.




References
Data Clustering: A Review, A. K. Jain (Michigan State University), M. N. Murty (Indian Institute of Science), and P. J. Flynn (The Ohio State University).
Comparison Between Data Clustering Algorithms, Osama Abu Abbas, Computer Science Department, Yarmouk University, Jordan.
From Data Mining to Knowledge Discovery in Databases, Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth.
Clustering Algorithms and Evaluations.
An Efficient K-Means Clustering Algorithm, Khaled Alsabti (Syracuse University), Sanjay Ranka (University of Florida), and Vineet Singh (Hitachi America, Ltd.).
Cluster Analysis: Basic Concepts and Algorithms.
Data Clustering: Algorithms and Applications, edited by Charu C. Aggarwal and Chandan K. Reddy.
Clustering Methods, Lior Rokach and Oded Maimon, Department of Industrial Engineering, Tel-Aviv University.
Data Mining and Analysis: Fundamental Concepts and Algorithms, Mohammed J. Zaki (Rensselaer Polytechnic Institute, Troy, New York) and Wagner Meira Jr. (Universidade Federal de Minas Gerais, Brazil).
K-means Algorithm, Cluster Analysis in Data Mining, edited by Zijun Zhang.
K-means Algorithm, Mark Herbster, University College London, Department of Computer Science.
K-means Clustering, edited by Ke Chen.
K-means Clustering via Principal Component Analysis, Chris Ding and Xiaofeng He.
An Efficient k-Means Clustering Algorithm: Analysis and Implementation, Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu.
Our course lecture presentations.

