Best Practice

Text Analytics: dataset creation

Blue Reply, the Reply Group company specialised in digital transformation through services, consulting and the implementation of solutions based on IBM technologies, brings Artificial Intelligence technologies to companies through the use of the IBM Watson suite.

Extracting value from data

Every day, 2.5 quintillion bytes of new data are produced, much of it in the form of unstructured documents of various types written in natural language: requests, reports, complaints, medical prescriptions and claims, in many different languages. Because of the unstructured nature of this data, it has been estimated that organisations typically fail to capitalise on more than 8% of their information assets. Today, Artificial Intelligence is experiencing strong growth, driven by the potential offered by Cloud computing, and natural language processing has always been one of its most closely followed areas.

Within the landscape of the Artificial Intelligence technologies available, about a dozen solutions have excelled and are now leaders in the market.

Blue Reply chose to adopt the IBM Watson technologies: compared to major competitors, these are characterised by a high degree of maturity in terms of Machine Learning capabilities, a wide range of products (both on-premise and Cloud-based) with many out-of-the-box features, internationalisation and a high degree of flexibility in designing the solutions.

The three approaches

Blue Reply offers customers highly specialised skills and consulting expertise, and supports them in selecting software and in defining architectures and solutions that apply cognitive technologies to extract business value from documents written in natural language.

Documents written in natural language can be processed to extract entities such as people, products, geographical references and organisations, together with the relations between them, both within a general domain and within a specific business domain. For a computer, however, written text is nothing more than a sequence of words without meaning.

Without training, the system cannot determine whether a sequence of characters corresponds to a sentence, a word or a number. It must be trained to recognise the patterns that help identify such entities. The relationships between the various entities must also be identified within the text, so that the meaning is interpreted correctly in relation to the context. The result is a model composed of entities and relationships.
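As an illustration of what a model composed of entities and relationships might look like in practice, the following minimal sketch represents entities as typed text spans and a relation as a typed link between two of them. The class names, entity types, relation name and sample sentence are all assumptions made for this example; they do not reflect the type system of any specific Watson product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str   # surface form as it appears in the document
    type: str   # illustrative type label, e.g. "PERSON"
    start: int  # character offset where the mention begins
    end: int    # character offset just past the mention


@dataclass(frozen=True)
class Relation:
    type: str       # illustrative relation label
    source: Entity  # entity the relation starts from
    target: Entity  # entity the relation points to


def make_entity(document: str, surface: str, entity_type: str) -> Entity:
    """Locate the first occurrence of `surface` and wrap it as an Entity."""
    start = document.index(surface)  # raises ValueError if the text is absent
    return Entity(surface, entity_type, start, start + len(surface))


# Hypothetical sentence and annotations, invented for this sketch.
document = "Maria Rossi filed a claim with Acme Insurance."
person = make_entity(document, "Maria Rossi", "PERSON")
org = make_entity(document, "Acme Insurance", "ORGANISATION")
relation = Relation("FILED_CLAIM_WITH", person, org)

print(relation.source.text, relation.type, relation.target.text)
```

Storing character offsets alongside the text keeps each annotation tied to its exact position in the document, which matters when the same word appears more than once.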

The solutions designed by Blue Reply to extract information from data make it possible to train the system using a rules-based, Machine Learning or hybrid approach.
These three approaches have different characteristics.

Rules-based approach

Uses predefined rules for the analysis of natural language;
Facilitates simple tracing and debugging;
Requires human intervention to program complex rules;
Becomes difficult to maintain as complexity increases.
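A minimal sketch of the rules-based approach, using only regular expressions from the Python standard library. The entity types, the patterns (including the `POL-` policy number format) and the sample sentence are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical rules: each entity type is mapped to a hand-written pattern.
RULES = {
    "POLICY_NUMBER": re.compile(r"\bPOL-\d{6}\b"),
    "DATE": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "AMOUNT": re.compile(r"\bEUR \d+(?:\.\d{2})?\b"),
}


def extract_with_rules(text):
    """Return (entity_type, matched_text, start, end) tuples sorted by position."""
    found = []
    for entity_type, pattern in RULES.items():
        for match in pattern.finditer(text):
            found.append((entity_type, match.group(), match.start(), match.end()))
    return sorted(found, key=lambda item: item[2])


sample = "Claim on policy POL-123456 dated 03/02/2021 for EUR 1500.00."
for entity in extract_with_rules(sample):
    print(entity)
```

Every match can be traced back to the single rule that produced it, which is what makes debugging simple. The cost is that each new format (another date style, another currency) needs a further hand-written rule.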

Machine learning approach

Uses inferences and statistical models to analyse natural language;
Learns through examples, does not require coding;
Is recommended when the process involves a large volume of data;
Can be opaque to the developer, which makes debugging more difficult;
Requires the creation of a knowledge base of annotated examples (ground truth).
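To make "learning through examples" concrete, the sketch below trains a very small statistical model on a handful of labelled sentences. It uses a multinomial naive Bayes classifier with add-one smoothing as a stand-in technique; this is an assumption made for illustration, not a description of how the Watson services are implemented. The labels and sentences are invented and play the role of a tiny ground truth.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count words per label from (text, label) pairs: the ground truth."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    # Words never seen in training carry no information and are skipped.
    known_words = [w for w in text.lower().split() if w in vocabulary]
    scores = {}
    for label, doc_count in label_counts.items():
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        score = math.log(doc_count / total_docs)
        for word in known_words:
            # Add-one smoothing avoids a zero probability for unseen pairs.
            score += math.log((word_counts[label][word] + 1) / denominator)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical ground truth: a few manually labelled sentences.
ground_truth = [
    ("my car was damaged in an accident", "claim"),
    ("water leak damaged my kitchen", "claim"),
    ("please send me a quote for home insurance", "request"),
    ("i would like information about a new policy", "request"),
]
model = train(ground_truth)
print(predict(model, "the accident damaged my car"))
```

No rule mentions the word "accident" or "quote": the behaviour comes entirely from the labelled examples, which is why the quality and size of the ground truth matter so much.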

Hybrid approach

Combines the Rules-based and Machine Learning approaches;
Makes it possible to start with the Rules-based approach and then move towards Machine Learning;
Uses rules to speed up training and improve the accuracy of the Machine Learning models;
Requires the development of a solution for integrating the two approaches.
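One possible way to use rules to speed up training is to let them pre-annotate documents, so that a domain expert only has to confirm or reject candidates rather than annotate from scratch. The sketch below assumes this workflow; the rule, the field names, the `needs_review` and `confirmed` statuses and the sample documents are all hypothetical.

```python
import re

# Hypothetical rule: policy numbers follow a fixed pattern.
POLICY_RULE = re.compile(r"\bPOL-\d{6}\b")


def pre_annotate(documents):
    """Apply the rule to each document and return candidate annotations
    for a domain expert to review before they are used for training."""
    candidates = []
    for doc_id, text in enumerate(documents):
        for match in POLICY_RULE.finditer(text):
            candidates.append({
                "doc_id": doc_id,
                "type": "POLICY_NUMBER",
                "text": match.group(),
                "span": (match.start(), match.end()),
                "status": "needs_review",  # not yet part of the ground truth
            })
    return candidates


def confirm(candidates, rejected_texts=()):
    """Keep the candidates the reviewer did not reject, marked as confirmed."""
    return [
        dict(candidate, status="confirmed")
        for candidate in candidates
        if candidate["text"] not in rejected_texts
    ]


documents = [
    "Policy POL-000123 was renewed.",
    "Test record POL-999999, ignore.",
    "No policy mentioned here.",
]
candidates = pre_annotate(documents)
# The reviewer rejects the test record; the rest becomes training data.
training_annotations = confirm(candidates, rejected_texts={"POL-999999"})
print(training_annotations)
```

The confirmed annotations can then feed a Machine Learning model, while the rule itself remains available as a fallback for cases the model has not yet learned.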

Particular attention should be paid to the creation of the dataset, that is, the sample of documents used to train the system. Performance can be measured against a small set of documents that have been manually annotated. In fully manual data extraction processes, software development specialists and domain experts work in isolation and struggle to interface with one another, either because of gaps in domain knowledge or because the language under study is often ambiguous. The Watson technologies can help make this process simpler and more intuitive: by sharing a collaborative platform and integrating products and APIs, cognitive specialists and domain experts can work together to develop an automated solution designed to process large volumes of data.

The creation of the dataset using Watson is therefore:

Intuitive: the different nuances of natural language are learned without the need to write code;
Collaborative: two users with different types of skills can access the tool simultaneously and carry out their work;
Convenient: the processing speed and the SaaS model, which makes it possible to purchase only the elements required to meet the customer's specific needs, also make the solution cost-effective.
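Evaluating performance against a small, manually annotated set of documents, as mentioned above, can be sketched as follows. Precision, recall and F1 with exact matching are assumed here as the metrics, since the text does not name specific ones; the annotations are invented for the example.

```python
def evaluate(predicted, gold):
    """Compare system output with manual annotations using exact matching.

    Both arguments are collections of (doc_id, entity_type, text) tuples.
    Returns (precision, recall, f1).
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical manual annotations (the reference set).
gold = {
    (0, "PERSON", "Maria Rossi"),
    (0, "ORGANISATION", "Acme Insurance"),
    (1, "DATE", "03/02/2021"),
    (1, "AMOUNT", "EUR 1500.00"),
}
# Hypothetical system output: two correct entities and one error.
predicted = {
    (0, "PERSON", "Maria Rossi"),
    (1, "DATE", "03/02/2021"),
    (1, "ORGANISATION", "EUR"),
}

precision, recall, f1 = evaluate(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision shows how much of the system's output is correct, while recall shows how much of the manually annotated content the system found. Tracking both helps decide whether the dataset needs more examples or cleaner annotations.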

The solution is designed for anyone who needs to process natural language documents in order to extract information or to identify the intent and/or meaning of a document. Potentially interested customers may operate, for example, in the insurance, healthcare, telco, retail, banking and manufacturing sectors.