Luiz Fernando Cavalcante Silva is studying for a Bachelor of Civil Engineering at the University of São Paulo in Brazil.
Luiz started this project as part of a scientific initiation program offered by his university, and he will work on it for the next year.
The amount of open data made available by government agencies keeps growing over time. The result is a large number of datasets with different layouts, formats, and update frequencies, which can fall under the domain of Big Data. Although difficult to analyze, these datasets contain rich information that could be useful for applications involving public policy.
One such example is the São Paulo real estate registry, made publicly available by the São Paulo city government, which contains a variety of information about each property in the city, such as its address, land area, building area, land and construction values, number of floors, and property type. Since São Paulo is a large city with more than 3 million registered real estate properties, this dataset holds a large amount of data that can be analyzed for different purposes. For instance, since property taxes are calculated from a property's location and physical characteristics, the dataset can be used to group properties sharing similar characteristics and check whether their tax values are also similar. Outliers within these groups could be considered candidates for tax evasion or fraud. Based on the list of outliers, the city council can assign tax inspectors to physically visit these properties and verify their characteristics, which could help the city hall fight tax evasion.
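The box-plot rule used later in the project gives a concrete way to flag such candidates: within a group of similar properties, any tax value falling outside the interquartile-range fences is suspicious. Here is a minimal Python sketch of that rule; the tax values are hypothetical toy data, not drawn from the registry, and the project's actual implementation is in ECL:

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (the box-plot rule)."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the group
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical tax values for one group of similar properties;
# the last one is far from the rest and gets flagged.
group_taxes = [1200, 1250, 1230, 1190, 1210, 1275, 5400]
print(iqr_outliers(group_taxes))  # → [5400]
```

A flagged value does not prove fraud; it only marks the property as worth a physical inspection.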
Addressing challenges like this requires combining different Big Data technologies: a powerful platform for data extraction, transformation, and loading (ETL), plus machine learning algorithms. One example of such an end-to-end Big Data management platform is HPCC Systems.
The objective of this project is to develop a machine learning pipeline using HPCC Systems that can ultimately be used to identify outliers in the São Paulo city government's real estate registry extract. To achieve this, a complete ETL pipeline had to be structured to prepare the data before unsupervised machine learning algorithms could be used to cluster it.
This approach required extensive use of HPCC Systems functionality at various stages of data extraction and transformation. In the extraction phase, the dataset was declared to the platform with appropriate data types. The number of fields per record also had to be reduced, because a large number of fields is difficult to manipulate and interpret. To this end, ECL code was developed to calculate the correlation between fields, and the correlation values were used to combine the fields into factors. ECL code was also developed to normalize the resulting field values using the Machine Learning Core bundle. Lastly, the resulting dataset was submitted to the K-Means clustering algorithm from the HPCC Systems K-Means bundle to group the remaining data into clusters. Besides giving information about how properties are organized across the city, the clusters also allowed the identification of outliers relative to each cluster using the box-plot method.
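The pipeline itself was written in ECL using the HPCC Systems bundles. As a language-agnostic sketch of the same stages, the following Python fragment illustrates pairwise field correlation, min-max normalization, and a plain K-Means (Lloyd's algorithm) pass on toy data. The field names and values are hypothetical; this is an illustration of the techniques, not the project's implementation:

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two equal-length field columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def normalize(rows):
    """Min-max scale every column to [0, 1] so no field dominates the distances."""
    scaled = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0  # guard against constant columns
        scaled.append([(v - lo) / span for v in col])
    return [list(r) for r in zip(*scaled)]

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and recomputation."""
    centroids = random.Random(seed).sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new = [[sum(dim) / len(dim) for dim in zip(*c)] if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:  # converged
            break
        centroids = new
    return centroids, clusters

# Toy records with hypothetical fields (building_area, floors).
rows = [[100, 2], [110, 2], [105, 3], [900, 30], [920, 28], [910, 29]]
centroids, clusters = kmeans(normalize(rows), k=2)
```

In the project, these steps correspond respectively to the ECL correlation code, the ML Core normalization, and the K-Means bundle; the box-plot rule is then applied inside each resulting cluster to flag its outliers.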
In this video recording, Luiz provides a tour and explanation of his poster content.
Massive data analysis in public management: A proposal to identify outliers in the São Paulo city government's real estate registry