Analyzing Clustered Latent Dirichlet Allocation (CLDA)
Christopher Gropp is a research assistant and Masters student in Computer Science at Clemson University. The aim of this project was to analyse the impact of computing resources on research output. Dynamic Topic Models (DTM) are typically used to extract time-variant information from a collection of documents, but this is usually slow, taking days to process around half a million records. This performance hit becomes prohibitive on Big Data, so the team decided to look at technologies that would allow them to gather topics across long time periods without sacrificing performance.
This project began in 2015 using HPCC Systems, and a working prototype was produced early in 2016. It is now a fully realised system used in their research. Christopher wanted to see how topics (both their key words and their proportional size across all documents) change over time. They discovered that CLDA could process, in minutes, datasets that would take weeks using DTM, while generating broadly comparable output. For example, the first test results showed that DTM took 58 hours to process what CLDA could process in 12 minutes using nearly identical resources. Moreover, by using more cores (480 specifically), he was able to process a much larger dataset in 20 minutes using CLDA, which would have taken DTM 29 weeks.
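To put the runtimes quoted above on a common scale, the short Python sketch below converts each figure to minutes and computes the ratio between DTM and CLDA. The helper name `speedup` and the constants are purely illustrative and are not part of the project's code. Only the first comparison was run on nearly identical resources; the second uses 480 cores for CLDA and a projected DTM runtime, so its ratio should be read as a rough indication rather than a like-for-like measurement.

```python
# Runtimes quoted in the text, converted to minutes.
# All names here are illustrative, not taken from the CLDA project.
MIN_PER_HOUR = 60
MIN_PER_WEEK = 7 * 24 * MIN_PER_HOUR


def speedup(dtm_minutes, clda_minutes):
    """Return how many times faster CLDA was than DTM."""
    return dtm_minutes / clda_minutes


# First test: 58 hours (DTM) vs 12 minutes (CLDA), nearly identical resources.
first_test = speedup(58 * MIN_PER_HOUR, 12)

# Larger dataset: projected 29 weeks (DTM) vs 20 minutes (CLDA on 480 cores).
large_test = speedup(29 * MIN_PER_WEEK, 20)

print(f"First test: roughly {first_test:.0f}x faster")
print(f"Larger dataset: roughly {large_test:.0f}x faster")
```

On the quoted figures this gives a ratio of about 290 for the first test and about 14,600 for the larger dataset.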
Many of the components used for CLDA were readily available in the HPCC Systems Machine Learning Library. Others are under active development, which means that when they are complete, they can be easily combined to produce a fast and powerful text analysis method for HPCC Systems.
Work is still underway on this project to improve the metrics used to evaluate the quality of the topic models. Christopher is working on developing more rigorous tests to properly evaluate the output of CLDA and other models. The aim is to provide the text analysis community with a testing framework for quantitatively determining the effectiveness of their methods, and how various methods compare to one another.
Christopher's poster presentation was entered into our competition held on Community Day at the HPCC Systems Engineering Summit in 2016.