Introduction
Constantly evolving technology has made it easier to store data, and in parallel many methods have been developed for analyzing that data. The science that collects these methods together is called data mining. In this article we examine data mining and its algorithms. Our goal is first to discuss data mining and its algorithms in general. We then examine clustering algorithms, the areas in which they are applied, and the purposes for which they are used. After that, we focus on the k-means clustering algorithm and its methodology. In the final step, we state a research question, analyze the properties of the data, and solve it in the WEKA application.
What is Data Mining?
Technology has advanced rapidly, and so has the ease of accessing information. Storing information is now as easy as accessing it, so data is growing rapidly and is available everywhere. This data has to be processed and analyzed. The many data analysis algorithms gathered under one roof are called data mining. In other words, data mining is a process of extracting useful information from a heap of data. In conclusion, data mining is the process of extracting useful information from large data sets by applying a number of algorithms.
Data mining is a combination of many disciplines: database systems, statistics, machine learning, and pattern recognition. The algebraic, geometric, and probabilistic viewpoints of data play a key role in data mining. Given a dataset of n points in a d-dimensional space, the fundamental analysis and mining tasks include exploratory data analysis, frequent pattern discovery, data clustering, and classification models (Zaki, M. J. and Meira Jr., W., Data Mining and Analysis: Fundamental Concepts and Algorithms, Cambridge University Press, p. 26).
Data mining is mostly used in market analysis and management, corporate analysis and risk management, and fraud detection. The main problems it addresses are identifying target customer types, forecasting the future, and prevention. Data mining can also be used in many ways in everyday life. Some of these can be listed as follows:
· Evaluating treatment claims made to hospitals according to time, place, and need helps in the early stages of epidemic risk assessment, control, and resource planning.
· A model that identifies the profiles of illegal (unmetered) energy users allows utilities to fight energy theft effectively at low cost and to predict potential illegal users.
· A study that predicts highway traffic intensity by region and time makes it possible, for example, to minimize accident rates through correct resource planning at the right time.
· When implementing public support schemes, scoring institutional risk increases the success of the programs by directing the right amount of support to organizations with the right goals. Similarly, identifying the risk of non-payment when allocating credit reduces the amount of bad loans.
Commonly used data mining application areas can be listed as follows: marketing, banking, retailing and sales, manufacturing and production, brokerage and securities trading, government and defense, computer hardware and software, airlines, health care, broadcasting, homeland security, insurance, and police work (Week 4 Presentation, Keziban Seçkin, AYBU, p. 12).
If we look at data mining algorithms in general, the main families are classification algorithms (the most commonly used being Naive Bayes and decision trees), clustering algorithms (the most commonly used being hierarchical clustering and k-means), association rules (the Apriori algorithm), text mining, and web mining.
Classification is used to analyze historical data stored in a database and to automatically generate a model that can predict future behavior. Its objective is to find a derived model that describes and distinguishes data classes or concepts (Week 4 Presentation, Keziban Seçkin, AYBU, p. 8).
· Naive Bayes is based on classification, so we need training data from which the classification algorithm builds a classifier (model), and test data with which we estimate the accuracy of the classification rules (supervised learning). The learned model can then be used to make predictions (see the sketch after this list).
· Decision Tree: A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in the tree is the root node (Data Mining Tutorial, tutorialspoint, p. 31).
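As a small illustration of both methods, the following is a minimal sketch that assumes scikit-learn and its bundled Iris data set are available; it trains a Naive Bayes classifier, estimates its accuracy on held-out test data, and prints the structure (root, branches, leaves) of a small decision tree.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier, export_text

# Labeled data split into training data (to build the model)
# and test data (to estimate the accuracy of the model).
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0)

# Naive Bayes: build the classifier from the training data,
# then estimate accuracy on the test data (supervised learning).
nb = GaussianNB().fit(X_train, y_train)
print("Naive Bayes accuracy:", accuracy_score(y_test, nb.predict(X_test)))

# Decision tree: each internal node tests one attribute, each branch
# is a test outcome, and each leaf node holds a class label.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)
print(export_text(tree, feature_names=data.feature_names))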
Clustering is the partitioning of a database into segments in which the members of a segment share similar qualities. We will examine this topic in detail later.
Association is a category of data mining algorithms that establishes relationships among items that occur together in a given record.
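As an illustration of the idea (not the full Apriori algorithm), the following sketch counts how often pairs of items occur together in a small, made-up set of records and keeps the pairs that reach a minimum support.

from itertools import combinations
from collections import Counter

# Hypothetical records: each one is the set of items that occur together.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]

# Count the co-occurrences of every pair of items across the records.
pair_counts = Counter(
    pair for t in transactions for pair in combinations(sorted(t), 2)
)

# Keep the pairs that appear in at least half of the records.
min_support = len(transactions) / 2
print([pair for pair, count in pair_counts.items() if count >= min_support])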
Text mining is the application of data mining to unstructured or less structured text files. It entails generating meaningful numerical indices from the unstructured text and then processing these indices with various data mining algorithms. That is, by scanning an existing text (counting how often each word occurs and examining its frequency distribution), it yields a meaningful result.
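A minimal sketch of turning unstructured text into such a numerical index, assuming a plain Python environment and a made-up sentence rather than a real document:

from collections import Counter

# Hypothetical unstructured text; in practice this would be read from a file.
text = "data mining extracts useful information from data, and data keeps growing"

# Build a simple numerical index: how often each word occurs.
words = [w.strip(",.").lower() for w in text.split()]
print(Counter(words).most_common(3))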
Web mining is the discovery and analysis of interesting and useful information from the Web, about the Web, and usually through Web-based tools (Week 4 Presentation, Keziban Seçkin, AYBU).
ALGORITHMS OF CLUSTERING
What is Clustering?
Clustering is the process of forming classes by keeping similar data close together. Classification is mostly used as a supervised learning method, whereas clustering is used for unsupervised learning. The logic of clustering is as follows: increase intra-cluster similarity as much as possible, and make the difference between clusters as large as possible. Clustering is one of the oldest data analysis techniques. It can be used in many areas; for example, consider the records of the products purchased by the customers of a business. From such data sets, clustering algorithms can provide useful information, for example how many shirts to produce in each size (small, medium, large, etc.).
To give the academic definition of clustering: clustering is a standard procedure in multivariate data analysis. It is designed to explore an inherent natural structure of the data objects, where objects in the same cluster are as similar as possible and objects in different clusters are as dissimilar as possible.
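This idea can be made concrete with a small formula sketch. Assuming numeric objects, Euclidean distance, k clusters C_1, ..., C_k with means \mu_i, and overall mean \mu, the within-cluster and between-cluster scatter are:

W = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2 , \qquad
B = \sum_{i=1}^{k} |C_i| \, \lVert \mu_i - \mu \rVert^2

Since W + B is fixed for a given data set, making objects in the same cluster as similar as possible (small W) is equivalent to making the clusters as dissimilar as possible (large B).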
Clustering is an exploratory data analysis technique. Therefore, the explorer might have little or no information about the parameters of the resulting cluster analysis. In typical uses of clustering the goal is to determine all of the following: the number of clusters, the absolute and relative positions of the clusters, the size of the clusters, the shape of the clusters, and the density of the clusters. The cluster properties are explored in the process of the cluster analysis, which can be split into the following steps.
1. Definition of objects: Which are the objects for the cluster analysis?
2. Definition of clustering purpose: What is the interest in clustering the objects?
3. Definition of features: Which are the features that describe the objects?
4. Definition of similarity measure: How can the objects be compared?
5. Definition of clustering algorithm: Which algorithm is suitable for clustering the data?
6. Definition of cluster quality: How good is the clustering result? What is the interpretation? (Clustering Algorithms and Evaluations, p. 180)
Applications of Cluster Analysis
Cluster analysis is widely used in areas such as market research, pattern recognition, data analysis, and image processing. Clustering can help marketers find distinct target groups in their customer base, and they can then describe these customer groups based on their purchasing patterns.
In the field of biology, it can be used to derive plant and
animal taxonomies, categorize genes with similar functions and gain insight
into structures inherent to populations.
Clustering also helps in the description of areas of similar land use in an earth observation database, and in the description of groups of houses in a city according to house type, value, and geographic location.
Clustering also helps in classifying texts on the web for
information exploration.
Clustering is also used in outlier detection applications such as the detection of credit card fraud.
Requirements
of Clustering in Data Mining
The following points shed light on what is required of clustering in data mining:
Scalability - We need highly scalable clustering algorithms to deal with large databases.
Ability to deal with different kinds of attributes - Algorithms should be able to operate on any kind of data, such as interval-based (numerical), categorical, and binary data.
Discovery of clusters with arbitrary shape - The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be limited to distance measures that tend to find small spherical clusters.
High dimensionality - The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional data.
Ability to deal
with noisy data - Databases contain noisy, missing or erroneous data. Some
algorithms are sensitive to such data and may lead to poor quality clusters.
Interpretability - The clustering results should be
interpretable, comprehensible, and usable.
Algorithms of Clustering
In this section we describe the most well-known clustering algorithms. The main reason for having many clustering methods is the fact that the notion of “cluster” is not precisely defined (Estivill-Castro, 2000). Consequently many clustering methods have been developed, each of which uses a different induction principle. Fraley and Raftery (1998) suggest dividing the clustering methods into two main groups: hierarchical and partitioning methods. Han and Kamber (2001) suggest three additional main categories: density-based methods, model-based clustering, and grid-based methods. An alternative categorization based on the induction principle of the various clustering methods is presented in (Estivill-Castro, 2000).
We will talk briefly about hierarchical methods, and after that we will focus on the k-means algorithm.
Hierarchical
Methods
Hierarchical clustering involves creating clusters
that have a predetermined ordering from top to bottom.
Typical clustering applications include:
• the discovery of different customer groups in a grocery business and the emergence of the shopping patterns of these groups,
• the classification of similar genes according to plant and animal taxonomies and their functions in biology,
• the classification of houses according to type, value, and geographical location in city planning,
• the classification of documents for information discovery on the Internet.
Summary of Hierarchical Clustering Methods
• There is no need to specify the number of clusters in advance.
• The hierarchical structure maps nicely onto human intuition for some domains.
• They do not scale well: the time complexity is at least O(n^2), where n is the total number of objects.
• As with any heuristic search algorithm, local optima are a problem.
• Interpretation of the results is (very) subjective.
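As an illustration, the following is a minimal sketch of agglomerative (bottom-up) hierarchical clustering; it assumes NumPy and SciPy are available and uses a small, made-up set of two-dimensional points.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six two-dimensional points forming two well-separated groups.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# Build the full merge hierarchy (dendrogram) using average linkage.
Z = linkage(X, method="average")

# Cut the hierarchy into two flat clusters; the number of clusters
# is chosen only at this final step, not in advance.
print(fcluster(Z, t=2, criterion="maxclust"))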
K-MEANS
K-means, one of the earliest clustering algorithms, was developed by J. B. MacQueen in 1967. It is an unsupervised clustering method: it groups the data into k clusters and attempts to assign data points so as to minimize the sum of squared distances to their cluster means. There are two main goals:
1- The values within a cluster are as similar as possible.
2- Values in different clusters are as dissimilar as possible.
The main idea is to define a center for each cluster.
The number of clusters k is chosen by the user, and selecting k is the most difficult phase: if we choose k too small, objects that should end up in different clusters may fall into the same cluster, and if we choose k too large, we disperse the objects too much. After k is chosen, the initial cluster centers are selected at random; the easiest way to select the centers is to choose data points that are far apart from each other. Once the centers are selected, the data are clustered according to their Euclidean distance to the centers. Afterwards, new centers are computed and the points are reassigned, and these steps are repeated iteratively until the assignments no longer change. The k-means assignment mechanism allows each data point to belong to only one cluster.
In short, the data set is separated into k clusters, the distance of each point to its centroid is measured, and the mean (centroid) of each cluster is recalculated; hence the name of the algorithm, k-means.
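Formally, the quantity that k-means tries to minimize can be sketched as follows, assuming numeric data points, k clusters C_1, ..., C_k, and Euclidean distance:

J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2 , \qquad
\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x

Here \mu_i is the mean (centroid) of cluster C_i, which is why the algorithm is called k-means.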
How K-Means Works
1) Randomly select 'k' cluster centers.
2) Calculate the distance between each data point and the cluster centers.
3) Assign each data point to the cluster center to which its distance is the minimum over all cluster centers.
4) Recalculate the new cluster centers.
5) Recalculate the distance between each data point and the new cluster centers.
6) If no data point was reassigned, stop; otherwise repeat from step 3.
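The steps above can be turned into a short sketch. The following is a minimal illustration in Python, assuming NumPy and purely numeric data; it is not the implementation used by WEKA, and it does not handle empty clusters.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly select k distinct points as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Steps 2-3: compute Euclidean distances and assign each point
        # to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recalculate each center as the mean of its assigned points
        # (empty clusters are not handled in this sketch).
        new_centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step 6: stop when the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Tiny usage example with two obvious groups.
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.2],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
labels, centers = kmeans(X, k=2)
print(labels)
print(centers)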
ADVANTAGES:
* It is suitable for large data sets and can be run with MapReduce.
* It works well when the clusters are compact and symmetric.
* It is fast, robust, and easy to understand.
DISADVANTAGES:
* It is difficult to choose the number of clusters (k).
* If two clusters are very similar (overlapping), k-means cannot recognize that there are two clusters.
* Different results can be obtained from different initializations.
* Random selection of the initial cluster centers is inefficient.
* The algorithm does not work well for non-linearly separable data sets.
* It is sensitive to noisy data, which is nevertheless included in the clusters.
CONCLUSION OF THE ARTICLE
We examined data mining and learned that it is closely related to machine learning. We must analyze data, so we need appropriate methodologies, and data mining gathers all of these algorithms under one roof. Using these algorithms makes life easier, because we can understand what a large amount of data means and how to act for the future. Preparing strategies is also very important: using today's data, firms can predict the future and apply strategic plans. In conclusion, data mining is highly important for the Information Era, because data can be reached everywhere, but understanding it is a science.
References
Jain, A. K., Murty, M. N., and Flynn, P. J., Data Clustering: A Review, Michigan State University, Indian Institute of Science, and The Ohio State University.
Abu Abbas, O., Comparison Between Data Clustering Algorithms, Computer Science Department, Yarmouk University, Jordan.
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P., From Data Mining to Knowledge Discovery in Databases.
Clustering Algorithms and Evaluations.
Alsabti, K., Ranka, S., and Singh, V., An Efficient K-Means Clustering Algorithm, Syracuse University, University of Florida, and Hitachi America, Ltd.
Cluster Analysis: Basic Concepts and Algorithms.
Aggarwal, C. C. and Reddy, C. K. (eds.), Data Clustering: Algorithms and Applications.
Rokach, L. and Maimon, O., Clustering Methods, Department of Industrial Engineering, Tel-Aviv University.
Zaki, M. J. and Meira Jr., W., Data Mining and Analysis: Fundamental Concepts and Algorithms, Rensselaer Polytechnic Institute and Universidade Federal de Minas Gerais.
Zhang, Z. (ed.), K-means Algorithm: Cluster Analysis in Data Mining.
Herbster, M., K-means Algorithm, Department of Computer Science, University College London.
Chen, K. (ed.), K-means Clustering.
Ding, C. and He, X., K-means Clustering via Principal Component Analysis.
Kanungo, T., Mount, D. M., Netanyahu, N. S., Piatko, C. D., Silverman, R., and Wu, A. Y., An Efficient k-Means Clustering Algorithm: Analysis and Implementation, IEEE.
Week 4 Presentation, Keziban Seçkin, AYBU (course lecture slides).