Hadoop: Text Mining Framework

Framework for Data Analytics Applications
In his Forbes article on Hadoop, IT consultant Dan Woods describes Hadoop this way: “Much like any other operating system, Hadoop has the basic constructs needed to perform computing: It has a file system, a way to write programs, a way of managing the distribution of those programs over a distributed cluster and a way of accepting the results of those programs, ultimately combining them back into one result set.”

Distributed Processing Controller
Because it can coordinate the parallel, distributed processing of data, Hadoop has become, or soon will become, the DOS or Linux of text mining: the base layer on which tools run to analyze the multi-terabyte volumes of data held by many companies. That data can be unstructured text in almost any electronic format, such as customer feedback and support inquiries, email, spreadsheets, PowerPoint presentations, PDFs, text stored in databases, and standard correspondence.

Hadoop Foundational Entity
MapReduce, the programming model that Google introduced for processing large data sets on distributed clusters, is the foundation of Hadoop rather than a tool built on top of it. Hadoop is an open source implementation of that model, and Yahoo! was an early large-scale contributor and user. A MapReduce job takes a large body of unstructured text, such as HTML pages, and splits it into smaller “chunks.” The Map function processes each chunk independently and emits intermediate key-value pairs for the items of interest. The framework groups those pairs by key and passes them to the Reduce function, which combines them into an output data set for further analysis. Other applications designed to run on a Hadoop foundation include the open source Apache Hive and Apache Pig projects.
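The map, group, and reduce steps are easier to see in a small example. The following is a minimal single-machine sketch in plain Python, not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample text are assumptions made for illustration, and word counting stands in for whatever a real text mining job would extract.

```python
import re
from collections import defaultdict

def map_phase(chunk):
    """Emit a (word, 1) pair for every word found in one chunk of text."""
    return [(word, 1) for word in re.findall(r"[a-z']+", chunk.lower())]

def shuffle(pairs):
    """Group emitted values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)

def word_count(chunks):
    # Map every chunk independently, then group and reduce the emitted pairs.
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

# Hypothetical input: two chunks of a larger body of text.
chunks = ["Hadoop stores data", "Hadoop processes data in parallel"]
counts = word_count(chunks)
print(counts)
```

Because each call to map_phase depends only on its own chunk, the chunks could be handed to different machines, which is the property a framework such as Hadoop exploits.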

The Necessity For Parallel Processing
Hadoop is well suited to the parallel processing that text mining requires. Parallel processing is often the only practical way to handle very large volumes of collected text in a manageable amount of time and with acceptable fault tolerance. It relies on multiple computers (called “nodes”) that are “ganged” together in clusters and work together over a network. Hadoop not only runs user application programs for specified mining tasks; it also controls and coordinates the distributed network.
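As a rough illustration of dividing work and merging the results, the sketch below spreads documents across a pool of workers and combines their partial term counts. It makes several assumptions: threads on one machine stand in for cluster nodes, the names count_terms and parallel_term_counts are hypothetical, and the sample documents are invented. It does not attempt the scheduling or failure handling that a real cluster framework provides.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_terms(document):
    """Work done by one worker: count the terms in a single document."""
    return Counter(document.lower().split())

def parallel_term_counts(documents, workers=4):
    """Hand each document to a worker, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_terms, documents))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total

# Hypothetical customer-feedback snippets.
documents = ["slow shipping", "great support", "slow support response"]
totals = parallel_term_counts(documents)
print(totals.most_common(2))
```

On a real cluster the workers would be separate machines, and the coordinating layer would also have to rerun tasks whose node failed, which is part of what Hadoop takes on.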

 
