André Fontanez Bravo is studying Industrial Engineering at the University of São Paulo in Brazil.
André has been working on a scientific research project for undergraduate students in the field of Data Analysis. The project was developed in partnership with LexisNexis Risk Solutions Group and the University of São Paulo. The HPCC Systems platform handles the large volume of data available, and the HPCC Systems Machine Learning Library was used to build the ML models.
Best Poster - Research
Big Data and its applications are becoming increasingly important across many different fields. In this context, techniques and tools that can process the immense flow of information to create value can be powerful instruments. This study focuses on the application of data analysis to financial investments on LendingClub’s platform. LendingClub is an American peer-to-peer lending company. The company’s platform allows users to file loan requests and others to finance them, becoming investors. Each loan is broken up into Notes that each represent a fraction of the loan. These Notes can also be traded among investors, much as securities are traded on the stock market. Investors can choose the loans they wish to finance based on a wide range of information about the loan and the borrower, such as the loan’s interest rate and the borrower’s purpose and credit score. Even though LendingClub assesses all loan requests before making them available on its platform, the company’s public historical database shows that around 12.5% of loans were charged off. Because investing in loans that end up not being paid incurs financial losses, it would be useful to have a way to identify loan requests that have a higher probability of being paid on time.
With that in mind, the goal of this study is to develop a logistic regression model capable of identifying the best options for investment among the loan requests on LendingClub’s platform, using the information available to investors. This binary model should calculate the likelihood of a loan being paid and then classify it as a “good loan” or not.
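As a rough illustration of how such a binary model turns loan information into a decision, the sketch below scores one loan with the logistic function and applies a cut-off. It is written in Python rather than on the HPCC Systems platform used in the study, and the two features, the coefficients, and the 0.5 threshold are all invented for the example; they are not taken from the study.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps any real score to a probability in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow for very negative z
    return e / (1.0 + e)


def predict_paid_probability(features, weights, bias):
    """Estimated probability that a loan is paid, from numeric features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def classify(probability, threshold=0.5):
    """Label a loan 'good loan' when its probability reaches the threshold."""
    return "good loan" if probability >= threshold else "not good loan"


# Hypothetical coefficients for two made-up features:
# interest rate (as a fraction) and a credit score scaled to [0, 1].
weights = [-8.0, 3.0]
bias = 0.5

loan = [0.12, 0.7]  # 12% interest rate, scaled credit score of 0.7
p = predict_paid_probability(loan, weights, bias)
print(f"probability of being paid: {p:.3f} -> {classify(p)}")
```

In a fitted model the weights and bias would be learned from historical loans rather than chosen by hand, and the threshold would be tuned to the investor's goal.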
Given the size of the company’s dataset (over two million records, with dozens of columns), this project was developed on the HPCC Systems platform, which is able to handle large volumes of data and also has a logistic regression module. The modelling process involved four main stages:
- Data extraction
- Data cleaning and preparation
- Model training and evaluation
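The stages listed above can be sketched end to end on a toy scale. The example below is a Python stand-in, not the HPCC Systems workflow used in the study: the eight records, the column names (`int_rate`, `credit_score`, `status`), the feature scaling, and the gradient-descent settings are all assumptions made for illustration.

```python
import csv
import io
import math

# Hypothetical miniature extract; column names and values are invented.
RAW = """int_rate,credit_score,status
0.08,760,Fully Paid
0.22,640,Charged Off
0.11,720,Fully Paid
,700,Fully Paid
0.25,620,Charged Off
0.09,740,Fully Paid
0.19,660,Charged Off
0.13,700,Fully Paid
"""


def extract(text):
    """Data extraction: read the raw records into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def prepare(rows):
    """Cleaning and preparation: drop incomplete rows, scale, encode label."""
    X, y = [], []
    for r in rows:
        if not r["int_rate"] or not r["credit_score"]:
            continue  # skip records with missing values
        X.append([float(r["int_rate"]) * 10,
                  (float(r["credit_score"]) - 600) / 100])
        y.append(1 if r["status"] == "Fully Paid" else 0)
    return X, y


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(X, y, lr=0.5, epochs=2000):
    """Training: fit weights by batch gradient descent on the log loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                gw[j] += err * xj
            gb += err
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def evaluate(X, y, w, b, threshold=0.5):
    """Evaluation: share of records whose predicted class matches the label."""
    correct = 0
    for xi, yi in zip(X, y):
        p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
        correct += int((p >= threshold) == bool(yi))
    return correct / len(y)


rows = extract(RAW)
X, y = prepare(rows)
w, b = train(X, y)
# Accuracy is measured on the training rows only to keep the sketch short;
# a real evaluation would use held-out data.
accuracy = evaluate(X, y, w, b)
print(f"kept {len(X)} of {len(rows)} rows, training accuracy {accuracy:.2f}")
```

At the scale described here (over two million records), the same steps would run on a distributed platform instead of in-memory Python lists.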
By the end of the study, two final models had been obtained through two different optimization methods. The first is better at filtering loan requests to obtain a higher proportion of good loans, while the second is better at filtering loan requests while discarding as few good loans as possible.
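The two goals described here correspond to what are commonly called precision (the share of selected loans that turn out to be good) and recall (the share of all good loans that are selected). The sketch below shows how the two measures pull in different directions on a single set of scores. The probabilities, outcomes, and thresholds are hypothetical, and varying a threshold is only one way to obtain this trade-off; it is not presented as the optimization methods used in the study.

```python
def precision_recall(probs, labels, threshold):
    """Precision and recall of the 'good loan' class at a given cut-off."""
    selected = [label for p, label in zip(probs, labels) if p >= threshold]
    true_positives = sum(selected)
    total_good = sum(labels)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / total_good if total_good else 0.0
    return precision, recall


# Hypothetical predicted probabilities and true outcomes (1 = loan was paid).
probs = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30]
labels = [1, 1, 1, 0, 1, 1, 0, 1]

# A strict cut-off keeps few loans, but a high share of them are good.
strict = precision_recall(probs, labels, threshold=0.75)
# A lenient cut-off discards few good loans, but lets more bad ones through.
lenient = precision_recall(probs, labels, threshold=0.35)

print(f"strict:  precision={strict[0]:.2f}, recall={strict[1]:.2f}")
print(f"lenient: precision={lenient[0]:.2f}, recall={lenient[1]:.2f}")
```

An investor who wants to avoid bad loans above all would lean toward the strict setting, while one who does not want to miss good opportunities would lean toward the lenient one.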
In this Video Recording, André provides a tour and explanation of his poster content.
Big Data and Logistic Regression Applied to Analysis of Loan Requests