HaaS: HPCC Systems as a Service
Chin-Yung Hsu - North Carolina State University
1st Place Winning Entry
Amazon Web Services (AWS) is the premier IaaS provider. It leads the pack by offering more and better services at lower prices. Furthermore, AWS continuously improves and innovates to stay in front. There are numerous reasons to use an IaaS for HPCC Systems instead of dedicated hardware, especially if the workload does not execute 24/7.
We developed a CloudFormation template (CFT) and an Amazon Machine Image (AMI) for HPCC Systems. This poster presents the tools we created and explains and justifies the reference architecture and many of the configuration options for managing HPCC Systems clusters in AWS.
Unicode Implementations for HPCC Systems Standard Library Functions
David Skaff - NSU University School
2nd Place Winning Entry
The capability to effectively manipulate unstructured text continues to gain importance as HPCC Systems faces increasing variation across the documents it works with. This variation makes it challenging to maintain efficiency while accounting for the different possibilities; implementing Unicode support, however, helps universalize the range of input that HPCC Systems standard library functions can handle.
Unicode is an encoding system that represents a total of 136,690 characters as of June 2017, and through the libraries of ICU (International Components for Unicode) we can manipulate these characters and the strings they form while accounting for a wide variety of writing systems. For instance, we can count the number of words in sentences written in English as well as in languages that don’t use whitespace.
We define these ECL programs in C++, each of which has its own test cases and supporting documentation to ensure accurate performance of the code and complete understanding for the users. We have accounted for potential errors in processes like character iteration by writing these programs to support Unicode’s different encoding methods and to supply optional normalization support that makes given strings consistent in their glyph representations. Each of the implemented functions has been tested to work with several different languages and technical cases, establishing progress towards globalization.
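The normalization issue mentioned above can be illustrated outside ICU with Python's standard library: the same glyph can be encoded as one precomposed code point or as a base character plus a combining mark, and naive comparison treats the two as different. A minimal sketch (this stands in for the poster's ICU-based C++ implementation, which it does not reproduce):

```python
import unicodedata

def normalize_equal(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization, so precomposed
    and decomposed glyph sequences compare equal."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# "é" as a single code point vs. "e" followed by a combining acute accent
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"

print(precomposed == decomposed)                 # False: different code points
print(normalize_equal(precomposed, decomposed))  # True: same glyphs
```

ICU's Normalizer2 class plays the role of `unicodedata.normalize` here, with the added benefit of handling the full range of Unicode versions and normalization forms.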
Cohesive Framework for Legislative Documents and Research Papers
George Mathew - North Carolina State University
3rd Place Winning Entry
What is the connection between the laws we write and the papers generated by researchers? Do government directives guide research? Does government legislation respond appropriately to new research results? How can we check?
To answer these questions, we explore text mining and LDA for legislative and research documents. Specifically, we a) build a vocabulary for the corpus using the top words in each document type, b) construct a vectorized representation of the words, c) create vectors for each document using the word vectors, and d) generate a mapping between different document types based on document similarity. With this approach, we are able to achieve a cosine similarity of up to 97% between legislative and research documents (by way of comparison, the cosine similarity between random items is around 75%).
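The document-matching step above reduces to comparing document vectors by cosine similarity. A simplified sketch using raw term-frequency vectors (the poster builds vectors from word embeddings, which this example does not reproduce; the corpus below is hypothetical):

```python
import math
from collections import Counter

def doc_vector(tokens, vocab):
    """Term-frequency vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical mini-corpus standing in for legislative vs. research text
legislative = "data privacy regulation shall protect consumer data".split()
research = "consumer data privacy research evaluates data protection".split()

vocab = sorted(set(legislative) | set(research))
sim = cosine_similarity(doc_vector(legislative, vocab), doc_vector(research, vocab))
print(round(sim, 2))
```

In the poster's pipeline, step d) computes exactly this similarity, but between embedding-based document vectors rather than raw counts.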
This preliminary study was conducted on a relatively small set of 100,000 legislative and 75,000 research documents. Our future plans focus on repeating this analysis on a larger corpus and on handling temporal analysis of research with respect to legislation.
Is the Secret to Longevity Eating Chocolate?
Cerise Walker - Wayne State University
In this study, correlation and regression analysis are explored to compare chocolate consumption, life expectancy, and happiness. To determine whether a strong correlation exists, amounts of chocolate consumption worldwide will be compared alongside the average life expectancy and happiness index of the corresponding countries. Using the HPCC Systems ML library, the correlation module will be utilized to determine whether there is a strong correlation between the variables, followed by the OLS linear regression module to model the linear regression. Additionally, the visualizer bundle in HPCC Systems will be used for charting and graphing results from the correlation and regression analysis to provide visual representations.
Fast Retrieval of Relevant Information through HPCC Systems
Zhe Yu - North Carolina State University
In the age of big data, identifying “useful” information buried under “useless” information has become a critical problem for researchers. Common research activities start with this process in order to understand state-of-the-art results and to avoid reinvention. The state-of-the-art solution is to use active learning methods to assist human classification, since a) machine learning alone cannot accurately separate “useful” information from “useless” information, and b) the cost is too high for humans to work alone without machine suggestions. Because of its ability to massively scale up training of the learner and its fast distributed data storage, HPCC Systems is an ideal platform for implementing active learning solutions, and it allows multiple human experts to work on one project collaboratively. For these reasons, we built the tool FASTREAD_ECL with the ECL-ML library to support human experts in identifying “useful” information at a reduced cost in time and effort. Our results suggest that a large portion of manual work can be eliminated with the help of this tool. We plan to explore further issues such as scalability, conflict resolution, and search incorporation based on data collected through use of the tool.
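The human-plus-machine loop described above is typically implemented as pool-based active learning with uncertainty sampling: the model repeatedly asks the human to label the item it is least certain about. A toy sketch of that loop (the learner, scorer, and "useful means value ≥ 5" rule below are illustrative stand-ins, not FASTREAD_ECL's actual components):

```python
import math
import random

def uncertainty_sampling_loop(pool, oracle, train, predict_proba, budget):
    """Pool-based active learning: repeatedly query the human oracle
    on the unlabeled item the current model is least certain about."""
    labeled = {}
    random.seed(0)
    # seed with one random example so the model can be trained at all
    first = random.choice(pool)
    labeled[first] = oracle(first)
    for _ in range(budget):
        model = train(labeled)
        unlabeled = [x for x in pool if x not in labeled]
        if not unlabeled:
            break
        # least-confident item: predicted probability closest to 0.5
        query = min(unlabeled, key=lambda x: abs(predict_proba(model, x) - 0.5))
        labeled[query] = oracle(query)   # human review happens here
    return labeled

# Toy stand-ins: items are numbers; "useful" means value >= 5
pool = list(range(10))
oracle = lambda x: x >= 5
# "model" is just a threshold: the mean of positively labeled items
train = lambda labeled: sum(k for k, v in labeled.items() if v) / max(1, sum(labeled.values()))
predict_proba = lambda model, x: 1 / (1 + math.exp(model - x))

labels = uncertainty_sampling_loop(pool, oracle, train, predict_proba, budget=4)
print(len(labels), "items labeled out of", len(pool))
```

The savings the abstract reports come from stopping this loop long before the whole pool is labeled, once the learner's suggestions are reliable.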
Optimizing ECL-ML Yinyang K-Means Clustering Algorithm on HPCC Systems
Lily Xu - Clemson University
The Yinyang K-Means clustering algorithm is a recent attempt to improve the scalability and speed of the classic K-Means algorithm. The algorithm involves two steps: the assignment step, which assigns each point to its closest center, and the update step, which relocates the K centers. Yinyang K-Means improves the assignment step by applying a group filter and a local filter to avoid unnecessary calculations, and the results of the optimized assignment step naturally reduce the computation in the update step. It gives a speedup over standard K-Means ranging from two times to an order of magnitude.
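The filters rest on the triangle inequality: after centers move, a point's cached distance bounds can prove its assignment unchanged without recomputing any distances. A minimal sketch of that idea using a single global lower bound (a simplification of Yinyang's per-group bounds; the data and drift values are illustrative):

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def filtered_assignment(points, centers, assigned, ub, lb, drifts):
    """One Yinyang-style assignment step, simplified to one global
    lower bound. After the update step moves center j by drifts[j],
    the bounds are adjusted; a point whose upper bound stays below
    its lower bound cannot change assignment, so its distances are
    never recomputed."""
    max_drift = max(drifts)
    skipped = 0
    for i, p in enumerate(points):
        ub[i] += drifts[assigned[i]]   # assigned center may have moved away
        lb[i] -= max_drift             # any other center may have moved closer
        if ub[i] < lb[i]:
            skipped += 1               # filter: assignment provably unchanged
            continue
        d = [dist(p, c) for c in centers]        # fall back to exact distances
        assigned[i] = min(range(len(centers)), key=d.__getitem__)
        ub[i] = d[assigned[i]]
        lb[i] = sorted(d)[1]           # distance to the second-closest center
    return skipped

# Two tight clusters; with zero drift, every point passes the filter
points = [(0.0,), (0.5,), (10.0,), (10.5,)]
centers = [(0.25,), (10.25,)]
assigned = [0, 0, 1, 1]
ub = [dist(p, centers[assigned[i]]) for i, p in enumerate(points)]
lb = [dist(p, centers[1 - assigned[i]]) for i, p in enumerate(points)]
skipped = filtered_assignment(points, centers, assigned, ub, lb, drifts=[0.0, 0.0])
print(skipped)  # 4: all assignments proven unchanged without distance work
```

Yinyang's group filter applies the same test per group of centers, which is why the grouping concept mentioned below matters so much for performance.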
In my previous work, I implemented the Yinyang K-Means clustering algorithm in HPCC Systems. However, that implementation ran about three times slower than the existing K-Means implementation in HPCC Systems. The main reasons for this performance gap were the lack of the grouping concept, insufficient optimization of the implementation, the size of the test sets, and differences in the programming environment.
This poster presents the work I did this summer to improve the performance of that Yinyang K-Means implementation in HPCC Systems. We focused on large datasets when testing the optimized Yinyang K-Means because, given the time complexities of Yinyang K-Means and K-Means, the overhead of running K-Means at the initialization step is comparable to the runtime of K-Means on a small dataset. We also focused on a large number of centroids, where we can exploit the grouping concept implemented in the optimized Yinyang K-Means. The results show that performance is greatly improved compared to the previous work.
Spark and HPCC Systems: Strangers? No more!
Vivek Nair - North Carolina State University
The motivation of this work is to solve two specific problems that LexisNexis engineers face every day. First, even though the ECL machine learning library hosts multiple parallel machine learning (ML) algorithms, there are other ML algorithms that it does not implement. In such situations, ECL users might want to try other parallel ML frameworks (such as SparkML), but currently the result handoff between systems is time-consuming. The second challenge is data movement between two big data systems (such as HPCC Systems and Spark), since the transfer of data introduces technical as well as compliance issues.
We introduce a solution: Spark-HPCC, which seamlessly integrates two big data systems, HPCC Systems and Spark. Spark-HPCC enables (i) Spark users to use HPCC Systems as a data store (solving the problem of data movement) and (ii) ECL users to use subroutines written in (Py)Spark (solving the problem of result handoff). We leverage HPCCFuseJ, a Python-based FUSE plugin for HPCC Systems, and Apache Livy, a REST-based API for submitting and tracking Spark jobs. Even though this work demonstrates interoperability between the systems, the technique is not yet scalable due to the I/O bottleneck introduced by HPCCFuseJ. In the future, we plan to extend HPCCFuseJ to allow parallel transfer of data.
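Livy's batch API makes the Spark side of this handoff a plain HTTP call. A sketch of constructing such a request (the endpoint URL, job file path, and HPCCFuseJ mount path are hypothetical; only the `POST /batches` shape comes from Livy's REST API):

```python
import json
import urllib.request

LIVY_URL = "http://livy-server:8998"   # hypothetical Livy endpoint

def make_batch_request(pyspark_file, args=()):
    """Build the POST /batches request that asks Livy to run a
    (Py)Spark job; the job itself would read from and write to
    HPCC Systems through an HPCCFuseJ mount."""
    payload = {"file": pyspark_file, "args": list(args)}
    return urllib.request.Request(
        f"{LIVY_URL}/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical job reading a Thor file exposed through the FUSE mount
req = make_batch_request("file:///jobs/score_model.py",
                         ["--input", "/hpcc/fuse/thor_file"])
print(req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` returns a batch id that can be polled at `GET /batches/{id}` to track the job, which is how the tracking half of the integration works.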
Implementing A Sentiment-Change-Driven Event Discovery System on HPCC Systems
Lili Zhang - Kennesaw State University
The emergence and prevalence of social sites provide public platforms where people exchange information and express their opinions, making a huge amount of data available for study. We present a system leveraging the HPCC Systems platform to automatically discover, from Twitter data, important events that have significantly driven people’s sentiment changes towards a target. The system provides the time, importance, and description of the events associated with these sentiment changes.
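The core detection idea, flagging times when aggregate sentiment shifts sharply, can be sketched as a simple change test on daily mean scores (the thresholding rule and data below are illustrative assumptions, not the system's actual detector):

```python
from collections import defaultdict

def sentiment_change_days(tweets, threshold=0.3):
    """Flag days whose mean sentiment shifts sharply from the previous
    day; such shifts mark candidate event dates. `tweets` is a list of
    (day, sentiment_score) pairs with scores in [-1, 1]."""
    by_day = defaultdict(list)
    for day, score in tweets:
        by_day[day].append(score)
    daily = {d: sum(s) / len(s) for d, s in by_day.items()}
    days = sorted(daily)
    return [d for prev, d in zip(days, days[1:])
            if abs(daily[d] - daily[prev]) >= threshold]

# Hypothetical scored tweets: sentiment collapses on day 3
tweets = [(1, 0.4), (1, 0.6), (2, 0.5), (2, 0.5), (3, -0.4), (3, -0.6)]
print(sentiment_change_days(tweets))
```

The full system would then rank the flagged dates by the magnitude of the shift (importance) and summarize the tweets around each date to describe the event.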
Representativeness of Latent Dirichlet Allocation Topics Estimated from Data Samples with Application to Common Crawl
Yuheng Du - Clemson University
Common Crawl is a massive multi-petabyte dataset hosted by Amazon. It contains archived HTML web page data from 2008 to date and has been widely used for text mining purposes. Using data extracted from Common Crawl has several advantages over a direct crawl of web data, among which is removing the likelihood of a user’s home IP address becoming blacklisted for accessing a given web site too frequently. However, Common Crawl is a data sample, so questions arise about its quality as a sample of the original data. We perform a systematic test of the representativeness of topics estimated from Common Crawl compared to topics estimated from the full data of online forums. Our target is online discussions from a user forum for automotive enthusiasts. We show that topic proportions estimated from Common Crawl are not significantly different from those estimated on the full data. We also show that the topics are similar in terms of their word compositions, and no worse than the topic similarity obtained under true random sampling, which we simulate through a series of experiments. To leverage the power of HPCC Systems, we plan to port our data pipeline to the HPCC Systems platform to improve data storage management and to reduce data preprocessing and topic analysis time. Our research will be of interest to analysts who wish to use Common Crawl and HPCC Systems to study topics of interest in user forum data.
Malware Detection Using Frequency Based Graph Mining
Nusrat Asrafi - Kennesaw State University
Malicious software poses a major threat to the security of computer systems. The amount and diversity of its variants render classic security defenses ineffective, such that millions of hosts on the internet are infected with malware in the form of computer viruses, Internet worms, and Trojan horses. While the obfuscation and polymorphism employed by malware largely impede detection at the file level, the dynamic analysis of malware during run-time provides an instrument for characterizing and defending against the threat of malicious software. Traditionally, the analysis of malicious software has been only a semi-automated process, often requiring a skilled human analyst. As new malware appears at an increasingly alarming rate, now over 100,000 new variants each day, there is a need for automated techniques for identifying suspicious behaviors in programs; it is not possible for human analysts to handle this amount of data alone. In this poster, we present a method for extracting statistically malicious behavior from system call graphs (obtained by running malware in a sandbox). We employ a new approach, frequency-based graph mining on the HPCC Systems platform, to extract characteristic patterns from a collection of malware graphs and apply supervised machine learning to detect malware automatically.
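Before a supervised learner can consume the call graphs, each graph must become a fixed-length feature vector; one common scheme counts how often each edge (caller–callee pair) occurs. A sketch of that featurization step (the call graphs and edge vocabulary below are hypothetical, and the poster's actual pattern-mining is richer than plain edge counts):

```python
from collections import Counter

def edge_frequency_features(call_graph, vocab):
    """Turn a system-call graph, given as a list of (caller, callee)
    edges, into a fixed-length vector of relative edge frequencies;
    frequent edges act as the characteristic patterns a supervised
    classifier is trained on."""
    counts = Counter(call_graph)
    total = sum(counts.values()) or 1
    return [counts[e] / total for e in vocab]

# Hypothetical call graphs recorded by a sandbox
malware = [("open", "read"), ("read", "send"), ("read", "send"), ("open", "read")]
benign = [("open", "read"), ("read", "close")]

# Shared edge vocabulary so every graph maps to the same feature space
vocab = sorted(set(malware) | set(benign))
print(edge_frequency_features(malware, vocab))
print(edge_frequency_features(benign, vocab))
```

With all graphs mapped into one feature space like this, any off-the-shelf classifier can be trained on labeled malware and benign samples.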