Renato Campos Afonso is a student at the University of São Paulo (USP) in Brazil.
In the context of Big Data, characterized by the five Vs (variety, volume, velocity, value, and veracity), it is important to work with tools that perform efficiently in order to derive business intelligence that supports decision making. The analysis presented in this material was performed through an ETL (extract, transform, and load) process using an end-to-end Big Data management platform called HPCC Systems. The ETL process consists of extracting information from different data sources in various formats (e.g., txt, csv, xls) and applying transformation techniques (e.g., normalization); after these two steps, the cleaned data are loaded into a data warehouse to be processed and analyzed.
The goal of this project is to create an arbitrary ranking of Brazilian companies using data provided by the federal government. The ranking should classify companies into three categories (good, regular, and bad), with the aim of offering life insurance to those that have a good rating. In addition, the classification system must be built without considering information from existing “accidents”, which makes the process somewhat more difficult; however, an IBGE (Brazilian Institute of Geography and Statistics) database containing socioeconomic indicators by region was made available, allowing for better data segmentation.
HPCC Systems was needed because the government data are very large and hard to manipulate; the ETL treatment made it possible to clean the data (removing duplicates by CNPJ and region simultaneously), group the tables by region, and classify companies, making the information available through queries with optimized processing. Both qualitative and quantitative variables were used for the scores. For the quantitative variables (e.g., survival rate, aging rate, development index), scores were generated by region and by quartile: +200 for regions between the maximum value and the third quartile; +150 between the third quartile and the median; +100 between the median and the first quartile; +50 between the first quartile and the minimum value. For the qualitative variables (e.g., company size, risk activity, registration status), the scores were assigned by judgment; e.g., large and low-risk companies receive +200. Finally, companies are classified according to their score: >=800 is good; <800 and >=600 is regular; <600 is bad.
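The quartile scoring and the final classification described above can be illustrated with a minimal Python sketch. This is not the code used in the project, which ran on HPCC Systems; the function names (`quartile_score`, `region_scores`, `classify`) are hypothetical, the sample values are made up for illustration, and two details are assumptions because the text does not specify them: a value lying exactly on a quartile boundary is placed in the higher band, and the quartiles are computed with the default method of Python's `statistics.quantiles`.

```python
import statistics


def quartile_score(value, q1, median, q3):
    """Return the score for one quantitative indicator.

    Assumption: a value exactly on a boundary goes to the higher band.
    """
    if value >= q3:
        return 200  # between the third quartile and the maximum
    if value >= median:
        return 150  # between the median and the third quartile
    if value >= q1:
        return 100  # between the first quartile and the median
    return 50       # between the minimum and the first quartile


def region_scores(indicator_by_region):
    """Score every region for one indicator, e.g. survival rate.

    Assumption: quartiles come from statistics.quantiles with its
    default ("exclusive") method.
    """
    q1, median, q3 = statistics.quantiles(indicator_by_region.values(), n=4)
    return {
        region: quartile_score(value, q1, median, q3)
        for region, value in indicator_by_region.items()
    }


def classify(total_score):
    """Map a company's total score to the three categories in the text."""
    if total_score >= 800:
        return "good"
    if total_score >= 600:
        return "regular"
    return "bad"


if __name__ == "__main__":
    # Made-up indicator values for five hypothetical regions.
    sample = {"A": 10, "B": 20, "C": 30, "D": 40, "E": 50}
    print(region_scores(sample))
    print(classify(850), classify(600), classify(599))
```

A company's total would be the sum of its quantitative scores (taken from its region) and its qualitative scores, which is then passed to `classify`.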
After processing all the information, covering approximately forty-four million companies, it was possible to classify approximately six million as good, sixteen million as regular, and twenty-two million as bad. This shows that the method is very strict, and it may be necessary to adjust the scoring to market requirements. Despite all the effort to organize the information effectively, some records lacked values in specific fields and were therefore not used in the results.
In this Video Recording, Renato provides a tour and explanation of his poster content.
Open Data processing with HPCC Systems; classifying Brazilian companies into risk groups