TAS Data Collector

Enables the collection of both structured and unstructured data from numerous online sources, which can be used independently or integrated into other services.

What is TAS Data Collector?

With TAS Data Collector, users can download unstructured data (textual content) from the Internet and structure it, making it accessible to other information systems and suitable for further processing, analysis, or visualization.

The content collected by TAS Data Collector can be used immediately or serve as a basis for text analysis workflows implemented with other built-in modules of the TAS Platform.

Data collection workflow

  • the service collects data (textual content) from the web pages, or parts of pages, specified by the customer
  • further steps (data cleaning, data enrichment, validation) are carried out under the supervision of our specialists
  • the result is a structured database that can be used for further data processing (analysis, visualization) or serve as a basis for further text analytics solutions
  • the collected, properly formatted content is delivered to the customer (optionally through an authenticated, password-protected channel)
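
The collection, cleaning, and validation steps above can be pictured with a minimal Python sketch. It is illustrative only and does not show how TAS Data Collector is implemented: the names VisibleTextParser, collect, clean, validate, and run_workflow are hypothetical, the record fields (url, text) are assumed, and the pages are passed in as HTML strings so that no network access is needed.

```python
import json
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def collect(url, html):
    """Step 1: turn one page into a raw record (hypothetical record layout)."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return {"url": url, "text": " ".join(parser.chunks)}


def clean(record):
    """Step 2: a simple cleaning rule, collapsing runs of whitespace."""
    return {**record, "text": " ".join(record["text"].split())}


def validate(record):
    """Step 2: a record counts as valid if it has a URL and non-empty text."""
    return bool(record["url"]) and bool(record["text"])


def run_workflow(pages):
    """Steps 1-3: returns (valid, broken) lists of structured records."""
    valid, broken = [], []
    for url, html in pages.items():
        record = clean(collect(url, html))
        (valid if validate(record) else broken).append(record)
    return valid, broken


if __name__ == "__main__":
    pages = {
        "https://example.com/a": "<h1>Title</h1><p>Hello   world</p>",
        "https://example.com/b": "<script>var x = 1;</script>",
    }
    valid, broken = run_workflow(pages)
    # Step 4: serialize the structured records for delivery.
    print(json.dumps(valid, indent=2))
```

In this sketch, the first page yields one valid record, while the second page contains no visible text and ends up in the broken list.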

Features of the TAS Data Collector

  • TAS Data Collector can extract the visible data, metadata (tags, picture descriptions), or pagination of a website.
  • Whole sites, subpages, login-required pages, hierarchical sites, pages with a slideshow component, and pages with multilingual content pose no problem for TAS Data Collector.
  • When data is recognized as hidden, we offer a screenshot solution that preserves the exact original appearance of the data.
  • In some cases, robots.txt forbids data collection. We respect this, although collecting such data is technically possible.
  • We can extract text from many document and image formats (PDF, spreadsheet, diagram, or image files).
  • We are prepared to produce and deliver any required output format, even ones that require software development.
  • The default output format is JSON, but other formats are also possible, for example a MySQL database table that can be analyzed and visualized immediately with the most well-known business intelligence tools (details in the technical description section).
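
As an illustration of what respecting robots.txt can look like in practice, the following sketch uses Python's standard urllib.robotparser module to filter a list of candidate URLs. It is not taken from TAS Data Collector: the helper names build_robots_checker and allowed_urls and the user agent string "example-collector" are hypothetical, and the robots.txt content is supplied as a string instead of being fetched.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, user_agent, urls):
    """Keep only the URLs that the robots.txt rules allow for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    robots_txt = "User-agent: *\nDisallow: /private/\n"
    parser = build_robots_checker(robots_txt)
    candidates = [
        "https://example.com/news/1",
        "https://example.com/private/report",
    ]
    # "example-collector" is a made-up user agent used only for this sketch.
    print(allowed_urls(parser, "example-collector", candidates))
```

Running the check before any page is requested keeps disallowed paths out of the download queue altogether.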

World-class data collection

Content from the Internet can also form part of a company’s data assets or may be the basis for world-class projects such as DIGIWHIST, which deals with public procurement data. Precognox’s solution for collecting this kind of web content is TAS Data Collector.

Reaching the goal of data collection

Collecting data is rarely a standalone process; the main goal is usually comprehensive searchability across the company's entire data assets. Learn more about how to get from successful enterprise data collection to intelligent search.

What can the collected content be used for?

  • research and development projects
  • new content and publications
  • service, information, thematic sites, blogs, public interest and open data portals
  • analyses, statistics, visualizations
  • enterprise processes / operations, data backup
  • competitor and media monitoring
  • searchable databases
  • artificial intelligence, machine learning processes
  • data change monitoring

Lossless data collection

For business content, the most important rule is that corporate data collection occurs without any loss. Our data collection mechanism includes an integrated control method that ensures this.
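
The control method itself is not described here, so the following is only a sketch of one common way to check for loss: reconciling the identifiers that were expected against those that were actually collected. The function name reconcile and the use of URLs as identifiers are assumptions made for this example, not a description of the TAS mechanism.

```python
def reconcile(expected_ids, collected_ids):
    """Compare expected and collected identifiers and report any gaps."""
    expected = set(expected_ids)
    collected = set(collected_ids)
    missing = sorted(expected - collected)      # expected but never collected
    unexpected = sorted(collected - expected)   # collected but not expected
    return {
        "missing": missing,
        "unexpected": unexpected,
        "complete": not missing,
    }


if __name__ == "__main__":
    expected = ["https://example.com/a", "https://example.com/b"]
    collected = ["https://example.com/a"]
    report = reconcile(expected, collected)
    # A non-empty "missing" list signals that some items must be fetched again.
    print(report)
```

A check of this kind can run after every collection round, with the missing items queued for another attempt.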

Appearance of TAS Data Collector

The TAS Data Collector GUI lets users monitor the download stream. The appearance of the interface matches the corporate identity of the TAS Platform.

The interface provides information about:

  • an overview of resources: which ones are connected and how many records have been received
  • the number of valid and broken records
  • overview of the total number of records
  • the date of the data collection
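
The figures listed above could be derived from the collected records along the following lines. This is a hypothetical sketch: the Record fields (source, valid, collected_on) and the summarize function are assumed for illustration and do not reflect the actual data model behind the GUI.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    source: str          # resource the record came from (assumed field)
    valid: bool          # whether the record passed validation (assumed field)
    collected_on: date   # date of the data collection (assumed field)


def summarize(records):
    """Compute the overview figures: per-resource counts, valid/broken, total, date."""
    valid = sum(1 for record in records if record.valid)
    return {
        "records_per_source": dict(Counter(record.source for record in records)),
        "valid": valid,
        "broken": len(records) - valid,
        "total": len(records),
        "last_collection": max(
            (record.collected_on for record in records), default=None
        ),
    }


if __name__ == "__main__":
    records = [
        Record("site-a", True, date(2024, 1, 10)),
        Record("site-a", False, date(2024, 1, 10)),
        Record("site-b", True, date(2024, 1, 11)),
    ]
    print(summarize(records))
```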