Poster Abstract

The Brazilian government has a limited budget to spend on its population's welfare. Consequently, financial aid must be directed precisely to those who need it most. Although systems and algorithms regulate applications for such benefits, a handful of people have managed to obtain multiple benefits in the same period, which is an irregularity for some social programs. In addition, differences in the names entered under the same registration number are a common mistake, which makes it harder to track a person's subscriptions and determine whether they are irregular. Such irregularities create structural weaknesses, which in turn create risks of fraud and cost the country's federal bank a considerable amount of funds.

The main objective of this project is to analyze inconsistencies between entries in different datasets from two distinct databases. In other words, this work seeks to illustrate frequent types of inconsistencies in credential entries in databases. To deal with large datasets, a high-efficiency programming language was needed to process large amounts of data in a short time. ECL was chosen for this task and was used to perform the extraction, transformation, and loading steps.

The first step of the analysis was extracting and combining the databases. The CSV files available for download are split by month and span several years up to the present. With the files gathered in a folder, the next step was to combine them into a single dataset containing thousands of rows with the name and NIS identification number of each person. The same procedure was repeated for the other datasets. To obtain a dataset with unique credentials for each person, basic transformations available in the ECL environment were used to remove duplicate rows and then sort the data alphabetically.
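The combine, deduplicate, and sort steps described above can be sketched as follows. This is an illustrative Python sketch, not the ECL code used in the project; the column names `name` and `nis`, the function name, and the sample rows are assumptions made for the example.

```python
import csv
import io
from typing import Iterable, TextIO


def combine_unique_credentials(files: Iterable[TextIO]) -> list[tuple[str, str]]:
    """Merge monthly CSV files into one sorted list of unique (name, nis) pairs."""
    seen = set()
    for handle in files:
        for row in csv.DictReader(handle):
            # Column names "name" and "nis" are assumed for illustration.
            seen.add((row["name"].strip(), row["nis"].strip()))
    # Sort alphabetically by name, then by NIS for a deterministic order.
    return sorted(seen)


# Two tiny in-memory "monthly files" standing in for the real downloads.
january = io.StringIO("name,nis\nMARIA SILVA,100\nJOAO SOUZA,200\n")
february = io.StringIO("name,nis\nMARIA SILVA,100\nMARIA SILVIA,100\n")

credentials = combine_unique_credentials([january, february])
for name, nis in credentials:
    print(nis, name)
```

Using a set removes exact duplicate rows across months, while spelling variants under the same NIS (such as the two hypothetical spellings above) are kept so they can be examined later.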

The poster will contain examples of the datasets used in the analysis, along with charts outlining the most common inconsistencies and examples of each type. It will also present a score measuring the degree of inconsistency for each user. Finally, it will show the impact of the COVID-19 pandemic on the usage of the fisherman's benefits.
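The abstract does not define how the inconsistency score is computed. As one possible illustration only, the Python sketch below counts how many extra name spellings appear under the same NIS number after light normalization. The scoring rule, the function name, and the sample records are all assumptions, not the project's actual method.

```python
from collections import defaultdict


def inconsistency_scores(records):
    """Return {nis: number of distinct name spellings minus one}.

    A score of 0 means every entry for that NIS uses the same name.
    This rule is a hypothetical stand-in for the poster's score.
    """
    names_by_nis = defaultdict(set)
    for name, nis in records:
        # Normalize case and repeated whitespace so trivial differences
        # are not counted as separate spellings.
        normalized = " ".join(name.upper().split())
        names_by_nis[nis].add(normalized)
    return {nis: len(names) - 1 for nis, names in names_by_nis.items()}


# Hypothetical records: (name, nis)
records = [
    ("Maria Silva", "100"),
    ("MARIA  SILVA", "100"),   # same name, different case and spacing
    ("Maria Silvia", "100"),   # different spelling under the same NIS
    ("Joao Souza", "200"),
]

scores = inconsistency_scores(records)
print(scores)
```

A count-based score is easy to explain and chart; a string-similarity measure could replace it if the degree of difference between spellings matters.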

As a result, we found a considerable rate of misspelled names in government databases, as well as irregular recipients who received multiple financial benefits within a single month. This project is intended to benefit the government by providing data on the most frequent inconsistencies observed in its databases, giving it a diagnosis of the most common mistakes users make when entering credentials for a given service. Additionally, the results could help developers create solutions that reduce the frequency of inconsistencies in user registrations and, by reducing discrepancies between registrations, limit the opportunities for fraud in government benefit programs.
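The check for recipients who receive more than one benefit in the same month could look roughly like the Python sketch below. The record layout, the program labels, and the rule that any two distinct programs in one month are flagged are assumptions for illustration; the abstract only states that this is an irregularity for some programs.

```python
from collections import defaultdict


def find_multiple_benefits(payments):
    """Return {(nis, month): [programs]} for recipients paid by 2+ programs."""
    programs = defaultdict(set)
    for nis, month, program in payments:
        programs[(nis, month)].add(program)
    # Keep only the (nis, month) pairs with more than one distinct program.
    return {
        key: sorted(progs)
        for key, progs in programs.items()
        if len(progs) > 1
    }


# Hypothetical payment records: (nis, month, program)
payments = [
    ("100", "2020-04", "fisherman_benefit"),
    ("100", "2020-04", "other_program"),
    ("100", "2020-05", "fisherman_benefit"),
    ("200", "2020-04", "other_program"),
    ("200", "2020-04", "other_program"),  # repeated row, same program
]

flagged = find_multiple_benefits(payments)
print(flagged)
```

Collecting programs in a set means a repeated row for the same program is not mistaken for a second benefit.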

Presentation

In this Video Recording, Matheus provides a tour and explanation of his poster content.

Massive data analysis applied to the identification of inconsistencies in governmental data bases
