Projects are almost always customized, and include a consulting component. We assemble a specific solution from a toolbox of Precognox proprietary software and also some open source components.
Step 1: Search – Find relevant information, organize into groups and topics
- 1. Build Indexes – use open-source full-text search engines, e.g. Solr/Lucene
- 2. Group Results – cluster related terms; e.g. “fever” and “infection” are not synonyms but are related
- 3. Semantic Search – include terms that are synonyms and antonyms, in multiple languages
- 4. Display Results – build custom screens to display the results in a usable fashion
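The inverted index at the heart of engines such as Lucene/Solr can be illustrated in a few lines. This is a minimal sketch for explanation only, not Precognox code; the documents and function names are invented:

```python
# Minimal inverted index: the core data structure behind full-text
# search engines such as Lucene/Solr (illustrative toy, not a real engine).
from collections import defaultdict

def build_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (AND search)."""
    sets = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*sets) if sets else set()

docs = {
    1: "Patient reports high fever and chills",
    2: "Bacterial infection confirmed by lab tests",
    3: "Fever caused by a viral infection",
}
index = build_index(docs)
hits = search(index, "fever infection")  # only doc 3 contains both terms
```

Real engines add tokenization rules, ranking, and scalable storage on top of this basic structure, but lookups remain set operations on the index rather than scans of the raw text.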
In this process, the Precognox software learns and improves its results by identifying and recognizing entities such as companies, people, locations, and diseases. Human assistance can be valuable in identifying relationships, such as nicknames, which the system then uses to improve the results.
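The role of human-supplied relationships such as nicknames can be sketched with a toy gazetteer-based recognizer. All entity data here is invented for illustration; production systems use statistical NLP components rather than a fixed lookup table:

```python
# Toy gazetteer-based entity recognizer. Human-curated aliases (e.g.
# nicknames) map surface forms back to one canonical entity, showing
# how manual input can improve recognition. Hypothetical example data.
ENTITIES = {
    "william clinton": ("PERSON", "William Clinton"),
    "bill clinton":    ("PERSON", "William Clinton"),  # nickname alias
    "budapest":        ("LOCATION", "Budapest"),
}

def recognize(text):
    """Return sorted (type, canonical name) pairs found in the text."""
    lowered = text.lower()
    found = {entry for surface, entry in ENTITIES.items() if surface in lowered}
    return sorted(found)

result = recognize("Bill Clinton visited Budapest last week.")
```

Because both “Bill Clinton” and “William Clinton” resolve to the same canonical entity, documents using either form are grouped together in the results.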
Step 2: Build results into a structured format, extracting data from unstructured text
- 1. Focused Crawling – cyclical, targeted walkthrough of web sources
- 2. Text Parsing – breaking text into sentences and paragraphs, retrieving data, and storing it in structured form
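The parsing step above can be sketched as splitting raw text into paragraphs and sentences and emitting structured records. The naive regex splitter below is for illustration only; production parsers must handle abbreviations, quotations, and other edge cases:

```python
# Split raw text into paragraph/sentence records (toy sketch).
import re

def parse(text):
    """Return one structured record per sentence, tagged with positions."""
    records = []
    for p_num, para in enumerate(text.split("\n\n"), start=1):
        sentences = [s.strip()
                     for s in re.split(r"(?<=[.!?])\s+", para.strip())
                     if s.strip()]
        for s_num, sent in enumerate(sentences, start=1):
            records.append({"paragraph": p_num, "sentence": s_num, "text": sent})
    return records

raw = "First sentence. Second sentence!\n\nA new paragraph."
rows = parse(raw)
```

Once text is in this tabular form, it can be stored, filtered, and queried like any other structured data.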
We take into account the type of information the client desires and develop a heuristic approach that achieves good results with optimal use of computing resources. A heuristic approach to discovery means a practical method that is not guaranteed to be perfect but is sufficient for the immediate goal. For example, in news articles one might predefine where to expect the “leader” section, rather than searching through all the text.
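The “leader” heuristic mentioned above can be made concrete with a short sketch. The assumed article layout (headline, then leader, then body) is an invented example, not a description of any particular source:

```python
# Heuristic leader extraction: assume the first paragraph after the
# headline is the summary, so the full body never needs to be scanned.
# The layout assumption is hypothetical and would be tuned per source.
def extract_leader(article_text):
    """Return the first paragraph after the headline as the leader."""
    paragraphs = [p.strip() for p in article_text.split("\n\n") if p.strip()]
    # Heuristic: paragraphs[0] is the headline, paragraphs[1] the leader.
    return paragraphs[1] if len(paragraphs) > 1 else ""

article = "Markets Rally\n\nStocks rose sharply on Monday.\n\nAnalysts debated the causes."
leader = extract_leader(article)
```

The heuristic can fail on unusual layouts, which is exactly the trade-off described above: it is not guaranteed to be perfect, but it is fast and good enough for the immediate goal.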
As an example of building structured information from unstructured text, we could analyze thousands of restaurant reviews for a city. For each review, we would take the restaurant name, even if it was misspelled, and use sentiment and emotion analysis (see later) to identify if this was a positive or negative review, and whether the review was angry or enthusiastic. Our software would then tabulate these results into an easy-to-read spreadsheet.
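A drastically simplified version of that review pipeline might look as follows. The restaurant names, word lists, and scoring rule are invented stand-ins; real sentiment analysis (see later) is far more sophisticated than counting lexicon words:

```python
# Toy review pipeline: fuzzy-match misspelled restaurant names, score
# sentiment with a tiny lexicon, and tabulate the results as CSV.
# All names and word lists are hypothetical examples.
import csv
import difflib
import io

KNOWN = ["Blue Danube Bistro", "Corner Grill"]
POSITIVE = {"great", "delicious", "friendly"}
NEGATIVE = {"awful", "slow", "rude"}

def canonical_name(name):
    """Match a possibly misspelled name to the closest known restaurant."""
    match = difflib.get_close_matches(name, KNOWN, n=1, cutoff=0.6)
    return match[0] if match else name

def sentiment(text):
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

reviews = [
    ("Blue Danub Bistro", "delicious food and friendly staff"),  # misspelled name
    ("Corner Grill", "slow service and rude waiter"),
]
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["restaurant", "sentiment"])
for name, text in reviews:
    writer.writerow([canonical_name(name), sentiment(text)])
table = out.getvalue()
```

The CSV output can be opened directly as the easy-to-read spreadsheet described above, with one row per review.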
Step 3: Text and Data Mining – predictive analytics
- 1. Key-Phrase Extraction – Gather natural keywords from unstructured text.
- 2. Natural Language Processing – It is impossible for a human to read all the documents in a collection, so we need sophisticated automatic solutions. Natural Language Processing (NLP) provides us with mature techniques for extracting information from vast amounts of textual data. For example, humans find it easy to identify names and dates in an article, but for machines this can be a very difficult task. Components of an NLP system can accomplish such identification. Our company often uses open-source Java components such as OpenNLP for these text analysis tasks.
- 3. Link data – Use SKOS (the Simple Knowledge Organization System) to build a thesaurus of related (but not identical) terms.
- 4. Predictive Analytics – For example, analyze job seeker sites to predict unemployment.
- 5. Text Classification and Clustering – Group search results into meaningful groups, categorize documents based on their contents. For example, many news sources already categorize articles into USA, EU, Business, Opinion, etc. However, many sources do not have such categorization. The Precognox methodology categorizes such articles, based on the frequency of words. An article mentioning several countries might be classified as Foreign News. After the initial result, the criteria can be refined for the particular data store.
- 6. Text Tagging – Extract the main concepts and key-phrases of a text. Tags can be used in search applications to provide related documents. Though many articles now have such tags, they are often very inconsistent. For example, are “bus”, “bus travel”, and “bus factory” tags related or unrelated? As with each step, Precognox includes a consulting component to refine the analysis.
- 7. Extractive Text Summarization – For example, report the 5 main sentences of a text.
- 8. Sentiment Analysis – Determine whether a comment (e.g. a tweet) is positive, negative, or neutral.
- 9. Emotion Analysis – A more fine-grained process that determines the emotion and intensity of a comment, for example anger, enthusiasm, surprise, disgust, approval, etc.
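The frequency-based categorization described under Text Classification can be sketched with a tiny keyword classifier. The category word lists and the example headline are invented; a real system would learn these vocabularies from data and refine them per client:

```python
# Minimal frequency-based text classifier: count category-indicative
# words and pick the category with the highest count. The vocabularies
# here are hypothetical illustrations, not a trained model.
CATEGORY_WORDS = {
    "Foreign News": {"france", "germany", "china", "embassy", "treaty"},
    "Business":     {"market", "shares", "profit", "revenue", "stocks"},
}

def classify(text):
    """Return the category whose vocabulary matches the most words."""
    words = text.lower().split()
    scores = {cat: sum(w in vocab for w in words)
              for cat, vocab in CATEGORY_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "Uncategorized"

label = classify("France and Germany signed a new treaty in China")
```

An article mentioning several countries scores highest for Foreign News, matching the intuition described above; after this initial result, the vocabularies would be refined for the particular data store.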
Precognox never uses a “brute force” approach to data analysis. We refine and develop heuristics for each situation, and thereby economize on computing resources. However, cognitive systems do often require extensive computation. We optimize time and cost through parallel processing and the Hadoop approach. Hadoop is a software framework that runs on inexpensive commodity hardware, with the expectation that hardware failures will occur. The Hadoop software compensates for these expected failures, delivering reliable results at low cost.
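The map/shuffle/reduce pattern that Hadoop distributes across many machines can be illustrated in a single process. This word-count sketch shows the pattern only; it omits the distribution, scheduling, and fault tolerance that Hadoop itself provides:

```python
# In-process sketch of MapReduce word counting: map emits (word, 1)
# pairs, shuffle groups them by key, reduce sums each group. Hadoop
# runs these same phases in parallel across many machines.
from collections import defaultdict

def map_phase(docs):
    for doc in docs:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    return {key: sum(values) for key, values in grouped.items()}

docs = ["big data big insight", "data mining"]
counts = reduce_phase(shuffle(map_phase(docs)))
```

Because each map call touches one document and each reduce call one key, the phases parallelize naturally, which is what lets the pattern scale across a cluster of inexpensive machines.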
With a combination of our consulting expertise and experience, coupled with our software toolkit, Precognox can bring order and insight into your large unstructured data stores.