The importance of entity recognition
Entity extraction is an increasingly important technique for finding mentions of people, locations, organizations, dates, or products in large volumes of text. In patent searches, law enforcement, sentiment analysis, ad targeting, content recommendation, e-discovery, and anti-fraud, entity extraction enables swift analysis of gigabytes of data.
In which cases is entity extraction important?
In law enforcement (OSINT – open source intelligence), sentiment analysis, ad targeting, content recommendation, patent searches, customer identification, and fraud prevention, entity recognition and retrieval are of paramount importance.
What entities exist?
The most common entities are personal names, locations (geographical names), organizations, dates, languages, nationalities, and units of measurement, but there are many other types as well, such as emails, ID numbers, diseases, religions, or events. Rosette Entity Extractor (REX), one of the world’s leading entity extraction solutions, developed by Basis Technology, is based on a machine learning approach and can identify 29 entity types and more than 450 subtypes.
Advanced name matching by entity extraction
Among entity types, personal names are especially important, and identifying them reliably is essential.
Basis Technology’s world-leading name identification solution, Rosette Name Indexer (RNI), includes 13 name matching methods. Thanks to these methods, the solution is particularly useful for organizations and businesses in the security sector: law enforcement, border protection, justice, and airport and other security services.
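To show why name matching needs more than exact string comparison, here is a minimal sketch in Python. It is not RNI's method or API. The function names `normalize_name` and `name_similarity` are hypothetical, and the technique (accent stripping, token reordering, and a generic string similarity ratio from the standard library) is only one simple way to compare names.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and commas, and sort the name tokens."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop combining marks so that "á" compares equal to "a".
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = no_accents.replace(",", " ").lower().split()
    # Sorting makes "Surname, Given" and "Given Surname" identical.
    return " ".join(sorted(tokens))


def name_similarity(a: str, b: str) -> float:
    """Return a similarity score between 0.0 and 1.0 for two names."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()


# Illustrative names only: same person, different order and accents.
same_person = name_similarity("Kovács, János", "Janos Kovacs")
different_person = name_similarity("Jane Doe", "John Smith")
print(same_person, different_person)
```

A production system would also need to handle transliteration across scripts, nicknames, and missing name parts, which this sketch does not attempt.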
Learn more about the topic: enterprise search for security.
- Credit card
- ID number
- Chinese, Traditional
- Chinese, Simplified
Supported entity types may vary from language to language, but users can add new ones.
Learn more about entity extraction!
More accurate tagging and faster annotation
Rosette Adaptation Studio (RAS) is a user-friendly application designed for nontechnical users. In addition to the entities extracted by REX, the intuitive interface allows the user to specify new and unique tag categories. Clients can do this themselves, without the need for a data scientist or NLP expert.
Using the application speeds up the annotation process.
Improving accuracy by using field training
Step 1: Only adding data
The easiest level of adaptation, which can be almost completely user-driven, is called “Unsupervised Field Training.” Rosette provides access to a state-of-the-art clustering tool chain, and the user adds any quantity of data without the need for annotation.
Any documents that represent the data the user needs to extract from are sufficient for REX to build a new model adapted to the idiosyncrasies of that data, which significantly increases entity extraction accuracy.
Based on the idea of word clusters, this unsupervised process allows Rosette to locate entities more accurately in the genre, style, and vocabulary of your data.
Rosette Entity Extractor understands the context surrounding unfamiliar words and, as a result, assigns them to existing, previously defined clusters.
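The idea of placing an unfamiliar word into an existing cluster based on its context can be illustrated with a toy Python sketch. This is not Rosette's clustering tool chain. The functions, the sample sentences, and the cluster names below are all hypothetical, and counting shared neighbouring words is a deliberately simple stand-in for real word clustering.

```python
from collections import Counter
from typing import Dict, List


def context_profile(word: str, sentences: List[str], window: int = 2) -> Counter:
    """Count the words that appear within `window` tokens of `word`."""
    profile: Counter = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token == word:
                start = max(0, i - window)
                profile.update(tokens[start:i] + tokens[i + 1:i + 1 + window])
    return profile


def assign_to_cluster(word: str, clusters: Dict[str, List[str]], sentences: List[str]) -> str:
    """Pick the cluster whose members share the most context words with `word`."""
    target = context_profile(word, sentences)

    def shared_context(cluster_name: str) -> int:
        combined: Counter = Counter()
        for member in clusters[cluster_name]:
            combined += context_profile(member, sentences)
        # Counter intersection keeps the minimum count of each shared word.
        return sum((target & combined).values())

    return max(clusters, key=shared_context)


# Made-up unannotated text and two made-up clusters.
corpus = [
    "she flew to paris on monday",
    "he flew to berlin on friday",
    "they paid with visa at noon",
    "we paid with mastercard at night",
    "i flew to szeged on sunday",
]
clusters = {"city": ["paris", "berlin"], "card": ["visa", "mastercard"]}
print(assign_to_cluster("szeged", clusters, corpus))
```

The word "szeged" is never labeled, yet it lands in the "city" cluster because it appears in the same contexts as the known city names.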
Step 2: Better results by additional annotation
For better accuracy, users can annotate a small quantity of their data and actively teach Rosette the particular contexts for entities that are common in the documents. Just a few hundred annotated documents can produce significant improvements in accuracy. Rosette Adaptation Studio (RAS) makes adding annotated documents to boost the existing REX model much faster and more efficient than traditional annotation methods.
How is it done?
Leveraging interim models: the training is bootstrapped by tagging a tiny number of documents to build an interim model.
Efficient annotation: active learning technology prioritizes the untagged documents that the interim model is least confident about, so a greater variety of events is tagged sooner.
Computer-assisted tagging: the interim model pre-tags unannotated documents so that annotators only need to correct errors, which is much faster than hand-tagging every event.
Iterative model evaluation: the system continuously measures the model’s accuracy, allowing annotators to stop the process as soon as the desired accuracy is achieved.
Previously, annotators could not know whether they had annotated enough documents to reach the desired level of accuracy. Rosette Adaptation Studio can coordinate and harmonize the work of numerous annotators, and thus creates training data considerably faster than traditional methods.
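The prioritization and stopping steps above can be sketched in a few lines of Python. This is a generic illustration of least-confidence sampling, not the implementation used in Rosette Adaptation Studio. The function names, the document IDs, the confidence scores, and the accuracy target are all assumptions made for the example.

```python
from typing import Dict, List


def prioritize_for_annotation(confidence_by_doc: Dict[str, float], batch_size: int) -> List[str]:
    """Return the untagged documents the interim model is least confident about."""
    # Sort by confidence ascending; ties are broken by document ID.
    ranked = sorted(confidence_by_doc, key=lambda doc_id: (confidence_by_doc[doc_id], doc_id))
    return ranked[:batch_size]


def should_stop(accuracy_history: List[float], target: float) -> bool:
    """Stop annotating once the most recent measured accuracy reaches the target."""
    return bool(accuracy_history) and accuracy_history[-1] >= target


# Hypothetical confidence scores produced by an interim model (made-up values).
scores = {"doc-a": 0.91, "doc-b": 0.42, "doc-c": 0.67, "doc-d": 0.38}
next_batch = prioritize_for_annotation(scores, batch_size=2)
print(next_batch)  # ['doc-d', 'doc-b']

# Hypothetical accuracy measurements after each annotation round.
print(should_stop([0.70, 0.86], target=0.85))  # True
```

In a real workflow, the interim model would be retrained after each batch and the confidence scores recomputed before the next batch is selected.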
Lower costs, better models
By reducing the amount of data and time required, Rosette Adaptation Studio (RAS) makes model training shorter and more efficient, especially for very specific natural language processing (NLP) models.
Extracting custom entities becomes much simpler, and customers can train and build a new NLP model very quickly. The solution is 4 times faster than traditional text annotation methods*.
* based on Basis Technology’s internal tests
Rosette Adaptation Studio is an excellent complement to REX and is now available free of charge to Rosette Entity Extractor users.
Since most customers welcome guidance in selecting data, building a new model, and evaluating results, Precognox, as a partner of Basis Technology, offers professional services for the training process*.
*if the customer has ordered Rosette solutions through Precognox
The result of advanced entity extraction: more efficient and intelligent search
Entity extraction from text is not an end in itself, but the cornerstone of a much more efficient and user-friendly search process. The different entity types (names, dates, times, locations, currencies, etc.) can all serve as filtering options in search engines such as TAS Enterprise Search, and can be used to narrow down search results in a matter of seconds.
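As a rough illustration of how extracted entities can act as search filters, the Python sketch below narrows a result list by entity type and value. It does not reflect the API of TAS Enterprise Search or Rosette. The document structure, the entity type labels, and the `filter_by_entities` function are hypothetical.

```python
from typing import Dict, List, Set

# Each search result carries the entities extracted from it, grouped by type.
Result = Dict[str, object]


def filter_by_entities(results: List[Result], required: Dict[str, str]) -> List[Result]:
    """Keep only the results that contain every required (entity type, value) pair."""
    matches = []
    for doc in results:
        entities: Dict[str, Set[str]] = doc["entities"]  # type: ignore[assignment]
        if all(value in entities.get(etype, set()) for etype, value in required.items()):
            matches.append(doc)
    return matches


# Made-up search results with made-up extracted entities.
results = [
    {"id": 1, "entities": {"PERSON": {"Jane Doe"}, "LOCATION": {"Budapest"}}},
    {"id": 2, "entities": {"PERSON": {"Jane Doe"}, "LOCATION": {"Vienna"}}},
    {"id": 3, "entities": {"ORGANIZATION": {"Acme"}, "LOCATION": {"Budapest"}}},
]

narrowed = filter_by_entities(results, {"PERSON": "Jane Doe", "LOCATION": "Budapest"})
print([doc["id"] for doc in narrowed])  # [1]
```

Each additional filter shrinks the result set, which is how entity facets help users reach the relevant documents quickly.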