Natalie is a Data Science Consultant at TÜV Austria Data Intelligence located in Vienna. She is responsible for supporting companies in their digitization processes by developing custom data-driven solutions based on statistical methods. Her work is focused on gaining maximum value from data, starting from data engineering and feature extraction as a basis for machine learning algorithms, up to providing BI tools and software solutions to clients.
After completing her master’s degree in Technical Mathematics at the Alpen-Adria University of Klagenfurt in 2015, she started a research career at Carinthian Tech Research (CTR). Her main focus was her Ph.D. work on modeling computer simulation output with Gaussian process surrogates. In 2018 she moved into the field of data science by starting to work with Applied Statistics. She completed her doctoral studies in 2020.
Technical Vision Talk: “Natural Language Processing for the classification of documents in an industrial environment”
Most companies, especially production facilities, need to handle thousands to millions of documents efficiently, including operating instructions, technical drawings, licensing documentation and more. Even when data warehouse concepts are already in place to store this huge amount of data centrally, finding specific information of interest can still be quite cumbersome. For big data problems of this size, a full-text search is no longer feasible. Sorting documents into predefined groups is therefore of particular interest, because it makes the required information accessible. However, performing this classification manually is a very expensive and time-consuming task, so the benefit of replacing it with an AI tool is obvious.
The basic idea is to retrieve information directly from the document texts, which is part of the field of natural language processing. The general workflow can be summarized as follows: 1. preprocess the raw texts to generate a vocabulary; 2. generate features based on word counts or term frequency–inverse document frequency transformations; 3. use these features as input variables for machine learning algorithms. This approach will be demonstrated in more detail on an example in which machine-readable documents of different types need to be classified into approximately 200 groups describing the document content. A linear support vector machine was trained on 2.3 million documents and evaluated on a separate test set of 800,000 documents, achieving an accuracy of 90%. For the use case at hand, the successful project completion was related to savings of 150k €/year in capital expenditures for the client.
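Steps 1 and 2 of the workflow above can be illustrated with a minimal Python sketch. It is not the implementation used in the project: the tokenizer, the three sample documents, the function names and the smoothed idf formula (ln((1 + n) / (1 + df)) + 1, followed by L2 normalization) are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Step 1: lowercase the raw text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(token_lists):
    """Step 1: map every distinct token to a column index (sorted for determinism)."""
    distinct = sorted({tok for toks in token_lists for tok in toks})
    return {tok: i for i, tok in enumerate(distinct)}


def tfidf_matrix(token_lists, vocab):
    """Step 2: turn per-document word counts into L2-normalized TF-IDF vectors."""
    n_docs = len(token_lists)
    # Document frequency: in how many documents does each token occur?
    df = Counter(tok for toks in token_lists for tok in set(toks))
    # Smoothed inverse document frequency (an assumed, commonly used variant).
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    rows = []
    for toks in token_lists:
        row = [0.0] * len(vocab)
        for tok, count in Counter(toks).items():
            row[vocab[tok]] = count * idf[tok]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return rows


# Hypothetical sample documents, not taken from the project data.
docs = [
    "Operating instructions for the pump",
    "Technical drawing of the pump housing",
    "Licensing documentation for the plant",
]
tokens = [tokenize(d) for d in docs]
vocab = build_vocabulary(tokens)
features = tfidf_matrix(tokens, vocab)
```

Each row of `features` is the input vector for one document. In step 3, such rows would be passed together with the known document groups to a classifier such as a linear support vector machine, which is not shown here. Words that occur in every document (here "the") receive the lowest weight, while words specific to one document receive the highest.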
__________________
Fri. Oct 1 | 9:30 am – Technical Vision Talk: “Natural Language Processing for the classification of documents in an industrial environment”