Concept of Data Mining and Steps to Efficiently Mine Unstructured Data

By CIOReview | Monday, September 12, 2016

Data mining is the process of finding important information in large amounts of data stored in data warehouses, databases, and other information repositories. A huge amount of data is available in electronic form, and it is important to convert the extracted data into useful information. This information is then used for business management, market analysis, decision support, and other purposes. Data mining has attracted great attention in the IT industry in recent years and has emerged as an interdisciplinary subfield of computer science.

Extracting useful information is not the only process performed in data mining; it also involves pattern evaluation, data integration, data cleaning, data transformation, and data presentation. After these processes have been applied to the extracted data, the resulting information is used in areas such as science exploration, production control, astrology, sports, customer retention, and Internet Web Surf-Aid. The value of unstructured data will remain underdeveloped if data mining is executed on an inadequate, static data model that was designed for structured data.

The adaptability of a data model is at stake when it is applied to unstructured data mining. A common problem is that deep analysis of data leads to more questions. If the data model is not designed to handle new questions, then the model requires extensive modification, which is a complex process that can take months. The problem becomes even more cumbersome with unstructured data because, by its nature, it is not organized in a predefined manner. It is important to enhance current taxonomies and to find the right combination of algorithmic assistance and domain expertise.

Experts have concluded that the analytical process cannot be fully automated, especially with unstructured information. Even structured data requires an analyst in the process to simplify the results. For unstructured information, the analyst needs to guide the operation with a blend of domain expertise, algorithmic assistance, and a useful set of metrics to assist in interpretation. The most important thing is to use analysts, an expensive and scarce resource, efficiently rather than to eliminate them completely.

A Systematic Methodology

Settling on a common analytical methodology is no easy task. Marketers need to go through several iterations to arrive at a single one, and it has to cover all of the business objectives, domains, and information sources. The systematic method consists of three phases: Explore, Understand, and Analyze. Each phase has a unique capability, and the phases depend on one another.

In the Explore phase, a marketer often has to deal with large data repositories, but not all of the information present is relevant. What is relevant depends on the business objectives and information sources. A series of techniques can be used to find the relevant data in a large set of information. For unstructured information, search can be used, and the results can later be combined in different ways using a set of operations. For structured information, queries specify conditions on the structured fields in the database to select the required subset. Standard SQL is very helpful for finding sub-collections of the required data; it is a powerful technique that can draw on any attribute in the database to find the relevant data.
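As a minimal sketch of the query step, the following Python snippet uses the standard-library sqlite3 module to select a sub-collection by its structured fields. The table name tickets, its columns, the helper select_subset, and the sample rows are all hypothetical and stand in for whatever repository is actually being mined.

```python
import sqlite3

# Hypothetical in-memory table standing in for a real data repository.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tickets (id INTEGER PRIMARY KEY, product TEXT, "
    "region TEXT, opened TEXT, comment TEXT)"
)
conn.executemany(
    "INSERT INTO tickets (product, region, opened, comment) VALUES (?, ?, ?, ?)",
    [
        ("router", "EU", "2016-07-03", "Device resets after firmware update"),
        ("router", "US", "2016-08-11", "Slow wifi in the evening"),
        ("modem", "EU", "2016-08-20", "Cannot connect after power cut"),
        ("router", "EU", "2016-08-25", "Firmware update fails halfway"),
    ],
)


def select_subset(connection, product, region, since):
    """Return (id, comment) rows whose structured fields match the filters."""
    cursor = connection.execute(
        "SELECT id, comment FROM tickets "
        "WHERE product = ? AND region = ? AND opened >= ? "
        "ORDER BY id",
        (product, region, since),
    )
    return cursor.fetchall()


# Structured fields narrow the collection; the free-text comment comes along.
subset = select_subset(conn, "router", "EU", "2016-08-01")
print(subset)
```

The structured columns do the filtering here, while the unstructured comment text of the selected rows is what the later phases would go on to examine.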

In many cases, a single query or search is not enough to find the optimal data, so it is necessary to perform set operations on collections of data. Join and Intersect are the most commonly used operations. Intersection is useful for finding the subset shared by two attributes, from either the unstructured or the structured fields. If the results are too large to analyze, then sampling techniques can be used to get a statistically valid subset. These results are later used as inputs to other explore operations.
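The set operations and sampling described above can be sketched with Python's built-in sets and the random module. The document ids are made up, the helper sample_ids is hypothetical, and Join is interpreted here as a set union, which is an assumption about what the article means by that operation.

```python
import random

# Hypothetical document ids returned by two earlier steps.
search_hits = {1, 2, 3, 5, 8, 13, 21}   # e.g. a keyword search over free text
query_hits = {2, 3, 4, 5, 6, 13, 34}    # e.g. a query over structured fields

# Intersect: documents that satisfy both conditions.
both = search_hits & query_hits

# Join (assumed here to mean a union): documents that satisfy either condition.
either = search_hits | query_hits


def sample_ids(ids, k, seed=0):
    """Draw a simple random sample of at most k ids, reproducible via seed."""
    rng = random.Random(seed)
    ordered = sorted(ids)  # fixed order so the seed fully determines the sample
    return rng.sample(ordered, min(k, len(ordered)))


sample = sample_ids(either, 4)
print(sorted(both), len(either), sample)
```

A simple random sample is only one option; whether it is statistically adequate depends on the size of the collection and on what is being estimated.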

Understand Explored Data

This phase is about discovering what the information really comprises. The analyst needs to deduce the underlying structure inherent in the given unstructured information, and the data model that captures it should reflect the analyst's understanding of the domain knowledge and business objectives. Statistics are the most important tool for understanding what the data comprises. For example, from the summary of a game you can learn a lot about what happened in it without watching it live. In the same way, summary statistics work as a good indicator of what the data contains.

A large amount of data can be divided into small parts, just as a book is divided into chapters, paragraphs, sentences, and words. Breaking a large entity into small parts makes the data easier to summarize and makes statistics possible. There are many ways to partition data, each with its own advantages, but the best methods create sections of data that fit the purpose of the analysis. After partitioning, the analyst needs to decide which events to measure and which statistics to keep, that is, to hold on to the important data and discard the rest.
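To illustrate the book analogy, here is a small Python sketch that partitions raw text into paragraphs, sentences, and words. The function name partition and the splitting rules (blank lines between paragraphs, sentence-ending punctuation followed by whitespace) are simplifying assumptions; real text usually needs a more careful tokenizer.

```python
import re


def partition(text):
    """Split text into paragraphs, each into sentences, each into words."""
    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    result = []
    for paragraph in paragraphs:
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
        # Words are runs of letters and apostrophes; punctuation is dropped.
        result.append([re.findall(r"[A-Za-z']+", s) for s in sentences])
    return result


text = "Data is large. It can be split.\n\nEach part is easier to count!"
parts = partition(text)
for number, paragraph in enumerate(parts, start=1):
    print(number, paragraph)
```

The nested lists keep the hierarchy, so statistics can be computed at whichever level suits the analysis.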

The statistical analysis of text is done by measuring, for example, the average word length, the number of words per sentence, or the average number of times each letter of the alphabet occurs. These statistics are later used to get a rough idea of the readability of a section of text. Next, the data is clustered to seed the taxonomy-building process quickly and easily. Clustering is an attempt to automatically group documents into thematic categories by using algorithms. These categories later form a taxonomy that provides an overview of the information the documents contain.
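The text statistics mentioned above are straightforward to compute. The following Python sketch is one possible version; the function name text_statistics and the simple rules for what counts as a word or a sentence are assumptions made for illustration.

```python
import re
from collections import Counter


def text_statistics(text):
    """Compute average word length, words per sentence, and letter counts."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z]+", text)
    # Count only the 26 unaccented letters, ignoring case.
    letters = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "avg_words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "letter_counts": letters,
    }


stats = text_statistics("The cat sat. The dog ran away.")
print(stats["avg_word_length"], stats["avg_words_per_sentence"])
print(stats["letter_counts"].most_common(3))
```

Measures like these are crude on their own, but they are cheap to compute for every partition and give a first indication of how sections of text differ.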

Visualization of the taxonomy is then used to create pictures of the information that the human brain can process to locate areas of special interest, such as relationships or patterns. The relationship between structured and unstructured information can be displayed through trees, bar graphs, scatter plots, and pie charts. This helps in understanding the information and in modifying the taxonomies to reflect business motives. Using the vector space model, visual representations of text can be calculated automatically from the data source. Through these representations, a computer can plot a set of documents and enable analysts to explore the text space much as an astronomer explores planets and stars.
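In the vector space model, each document becomes a vector of term weights, and documents with similar vocabulary end up close together. The Python sketch below uses raw term counts and cosine similarity; the sample documents and the function names to_vector and cosine_similarity are hypothetical, and practical systems often use weighted terms (for example TF-IDF) instead of raw counts.

```python
import math
import re
from collections import Counter


def to_vector(text):
    """Represent a document as a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents for illustration.
docs = {
    "d1": "router firmware update failed",
    "d2": "firmware update failed again",
    "d3": "invoice payment overdue",
}
vectors = {name: to_vector(text) for name, text in docs.items()}
sim_12 = cosine_similarity(vectors["d1"], vectors["d2"])
sim_13 = cosine_similarity(vectors["d1"], vectors["d3"])
print(sim_12, sim_13)
```

Pairwise similarities like these are the kind of input that a clustering algorithm or a two-dimensional layout could work from, so that related documents appear near one another.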


After completing the second phase, analysts have multiple taxonomies that represent characteristics of the unstructured information, along with a feature set that describes the individual documents in each taxonomy. But a taxonomy alone cannot achieve the objective of mining unstructured information. Thus, the final step, Analyze, involves combining the unstructured and structured information to reveal the relationships, trends, and patterns inherent in the data and to use them to make better business decisions.

Trend analysis is an important part of data mining because it helps in detecting current categories as well as predicting future ones. Trending is especially important for detecting emerging events. Further, co-occurrence analysis reveals hidden relationships between the concepts and attributes captured by the taxonomies.
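Both analyses can be approximated with simple counting. In the Python sketch below, trend counts documents per month for one category, and cooccurrence compares how often a category and a structured attribute appear together with how often they would be expected to if they were independent. The records, field layout, and function names are hypothetical, and comparing observed with expected counts is only one of several ways to score co-occurrence.

```python
from collections import Counter

# Hypothetical records: (month, taxonomy category, structured attribute).
records = [
    ("2016-06", "firmware", "router"),
    ("2016-07", "firmware", "router"),
    ("2016-07", "billing", "modem"),
    ("2016-08", "firmware", "router"),
    ("2016-08", "firmware", "router"),
    ("2016-08", "firmware", "modem"),
    ("2016-08", "billing", "modem"),
]


def trend(rows, category):
    """Count documents per month for one category, in chronological order."""
    counts = Counter(month for month, cat, _ in rows if cat == category)
    return sorted(counts.items())


def cooccurrence(rows):
    """Map each (category, attribute) pair to (observed, expected) counts.

    The expected count assumes category and attribute are independent.
    """
    total = len(rows)
    pair_counts = Counter((cat, attr) for _, cat, attr in rows)
    cat_counts = Counter(cat for _, cat, _ in rows)
    attr_counts = Counter(attr for _, _, attr in rows)
    table = {}
    for (cat, attr), observed in pair_counts.items():
        expected = cat_counts[cat] * attr_counts[attr] / total
        table[(cat, attr)] = (observed, expected)
    return table


firmware_trend = trend(records, "firmware")
table = cooccurrence(records)
print(firmware_trend)
for pair, (observed, expected) in sorted(table.items()):
    print(pair, observed, round(expected, 2))
```

A pair whose observed count is well above its expected count is a candidate relationship worth an analyst's attention; with very small counts, such differences may simply be noise.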

Once the analyst has a taxonomy that models the important aspects of the information, the next step is to apply the classification scheme to new unstructured data. Many classification algorithms exist, and the analyst needs to choose the best one for the given taxonomy and collection of information. The general approach is to select the algorithm that best represents each category; the accuracy of each modeling approach is tested by randomly sampling documents in the category.
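As one illustration of this step, the Python sketch below represents each category by the combined term vector of its documents (a centroid), assigns a new document to the most similar centroid, and estimates accuracy on a random sample of labeled documents. The centroid approach is just one of the many possible algorithms, chosen here for brevity; the sample documents and all function names are hypothetical.

```python
import math
import random
import re
from collections import Counter


def to_vector(text):
    """Bag-of-words term-count vector for one document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def train_centroids(labeled_docs):
    """Sum each category's term vectors; cosine ignores the overall scale."""
    centroids = {}
    for text, category in labeled_docs:
        centroids.setdefault(category, Counter()).update(to_vector(text))
    return centroids


def classify(text, centroids):
    """Return the category whose centroid is most similar to the text."""
    vector = to_vector(text)
    # sorted() makes tie-breaking deterministic (alphabetical order).
    return max(sorted(centroids), key=lambda c: cosine(vector, centroids[c]))


def sampled_accuracy(labeled_docs, centroids, k, seed=0):
    """Fraction of a random sample of labeled documents classified correctly."""
    rng = random.Random(seed)
    sample = rng.sample(labeled_docs, min(k, len(labeled_docs)))
    correct = sum(1 for text, cat in sample if classify(text, centroids) == cat)
    return correct / len(sample)


# Hypothetical documents already assigned to taxonomy categories.
training = [
    ("firmware update failed on router", "firmware"),
    ("router resets after firmware upgrade", "firmware"),
    ("invoice shows wrong payment amount", "billing"),
    ("payment charged twice on invoice", "billing"),
]
centroids = train_centroids(training)
label = classify("new firmware will not install", centroids)
accuracy = sampled_accuracy(training, centroids, k=3)
print(label, accuracy)
```

For brevity the sample here is drawn from the same documents used to build the centroids, which overstates accuracy; in practice the sampled documents would be ones the model has not seen, checked by an analyst.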