The volumes of information available to companies today are growing exponentially. In fact, information has never been the problem in text mining. There’s plenty of information. The problem is deriving knowledge from all that information, since information is useless if it can’t be put to good use. That is the primary goal of developers of text analytics software.
Information must be controlled and exploited in ways that reduce the irrelevant, eliminate the redundant, and sift out the semantic chaff, while accepting that human analysts will still have to assess the output of their tools. The knowledge gained must be a more “distilled” volume of information, so that the operator’s workload becomes less arduous. At the very minimum, if we may mix a metaphor, the software must narrow the “ballpark volume” so users at least know where within the body of information their research can begin.
We’re reminded of another project, about thirty years ago, in which an attempt was made to create a computer program for a cognitive-assessment task that had previously required human intellect exclusively. In the late 1970s the US Navy developed software for a particular sonar system that would collect and catalog underwater sounds by their discrete frequencies. The computer would assign a figure of merit to each frequency and make value judgments about its source, based on an algorithm that accounted for water conditions and other oceanographic data. In the end, it turned out that the Navy couldn’t do away with its human sonarmen after all. That project never involved textual analytics of any kind, but we believe the need for good human assessment expertise will long be a part of any text mining process. We could be wrong, of course, but we can’t help hearing the continuous refrain from industry observers that the current state of text analytics software is nowhere near as accurate as the data mining tools now in use in the data warehouses. We just hope such observers aren’t being paid handsomely for statements that obvious.
Different Data Warehouses
That’s an easy call to make, since data mining and text mining are still miles apart in the relative effectiveness of their tools. And that’s not a pejorative statement against text mining, because text analytics software may never reach the levels of accuracy that data mining has. That likely won’t happen until text mining tools can make inferences and judgments about unstructured text (i.e., regular, printed pages). Data mining tools are accurate because they have made good use of technology in a mature field that requires only the extraction of statistical facts from highly structured databases.
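The contrast can be made concrete with a small sketch. The sales table and the scrap of report text below are invented for illustration; the point is only that a query over structured rows returns a definite statistic, while a keyword hit in free text still needs a human reader to judge what it means.

```python
import sqlite3
import re
from collections import Counter

# Structured data: a data-mining query yields an exact statistical fact.
# (Table and figures are made up for this example.)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)",
               [("east", 120.0), ("east", 80.0), ("west", 50.0)])
total_east = db.execute(
    "SELECT SUM(amount) FROM sales WHERE region = 'east'").fetchone()[0]
print(total_east)  # 200.0 -- unambiguous, no interpretation needed

# Unstructured text: a naive keyword count is only a starting point.
# The count can't tell that "mine" below has nothing to do with mining.
report = "The eastern region did well. That success is mine to keep."
words = Counter(re.findall(r"[a-z']+", report.lower()))
print(words["mine"])  # 1 -- a "hit" whose meaning a human must still assess
```

The database answer is complete on its own; the text-mining answer is raw material for a skilled researcher, which is precisely the gap the paragraph above describes.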
Still The Indispensable Element
Which means, as the Navy found out, usability and interpretation of the data, in this case text, will require a certain amount of user interaction. The degree to which that requirement is eased by continued upgrading of the technology will ultimately determine how effective text mining tools have become, not how they compare with their counterpart tools in data warehousing. Another factor will be the learning curve for the new tools. Procurement and installation will be the easy parts; training people to use the software will be the bottleneck, since a skilled researcher must interpret some portions of the mined knowledge.