MPI - Proof of Concept
Saminda Wijeratne - Georgia Tech
1st Place Winning Entry
The data delivery of a distributed application can be no faster than its underlying communication layer, which handles all the data traffic. In HPCC Systems this layer connects all the different components and worker nodes seamlessly, making it simple to access and perform non-trivial data flows while keeping the design APIs flexible for adaptation. This allows the communication layer to adopt more advanced and evolved message passing frameworks without disrupting application-level designs or implementations. In this poster we present our evaluation of the popular MPI communication framework.
The MPI framework is now the most widely used distributed message passing library, thanks to its intuitive usage and its inherent optimizations for the underlying network and system hardware. Our main goal is to examine how MPI handles the communication patterns and load typical of a running HPCC Systems platform.
Measuring the geo social distribution of Opioid Prescriptions
Nicole Navarro - New College of Florida
2nd Place Winning Entry
Drug overdose was the leading cause of accidental death in the US in 2015, and the number of drug overdoses involving opioids in 2016 was 42,249 – an increase of 18% per year since 2014. For my project I utilized the open source HPCC Systems capabilities around knowledge engineering to create data features and interactive visualizations. These were designed to allow research into Drug Socialization across social groups and geographical regions with a focus on opioid prescription rates.
Drug Socialization is used to measure drug diversion within organized social groups. This behavior is concerning because it can be indicative of prescription resale to drug dealers in a community, or drug seeking social groups in which drugs are being supplied to friends or family. This project was designed to broaden the understanding of potential diversion behavior across geographic regions and social groups. The main goal was to generate new features to help create interactive visualization tools that can be used to isolate communities and social groups containing both high social interconnectedness and high opiate prescription rates. These groups could then be considered for potential intervention or disruption through a combination of treatment programs or provider education.
Distributed deep learning with TensorFlow
Robert Kennedy - Florida Atlantic University
3rd Place Winning Entry
The training process for modern deep neural networks requires big data and large amounts of computational power. Combining HPCC Systems and Google’s TensorFlow, I implemented, during my summer internship, a parallel stochastic gradient descent algorithm to provide a basis for future deep neural network research and to enhance the distributed neural network training capabilities of HPCC Systems.
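The general idea of synchronous, data-parallel SGD (each worker computes a gradient on its own data shard, and the gradients are averaged before every update) can be sketched in a few lines. This toy, single-process sketch only illustrates the technique; it does not reflect the poster's actual TensorFlow implementation, and the model, data, and learning rate are all invented:

```python
import random

# Toy illustration of synchronous data-parallel SGD: each "worker"
# draws a stochastic sample from its own shard, and the per-worker
# gradients are averaged before each parameter update.
random.seed(0)

# Synthetic data for y = 3x, split across two worker shards.
data = [(x, 3.0 * x) for x in [random.uniform(-1, 1) for _ in range(100)]]
shards = [data[:50], data[50:]]

w = 0.0   # single parameter of the model y = w * x
lr = 0.1
for _ in range(200):
    grads = []
    for shard in shards:
        x, y = random.choice(shard)        # one stochastic sample per worker
        grads.append(2 * (w * x - y) * x)  # d/dw of the squared error
    w -= lr * sum(grads) / len(grads)      # average gradients, then update

print(round(w, 2))  # converges near the true slope 3.0
```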
Equivalence terms of the Text Search Bundle
Farah Alshanik - Clemson University
The sources of big data are rapidly growing within various domains and industries: science, finance, engineering, and medicine, to name a few. Nearly every business sector imaginable accumulates huge amounts of valuable information. Some of this information is available as text, and is used by text search tools to return specific information based on search queries. While the possible use of these huge data sets is important, there is a need to find a way to process this information quickly and effectively. HPCC Systems is an ideal platform for the implementation of text search tools because of its ability for massive scale-up, its fast distributed data storage, and its ability to retrieve a set of documents or text based on the search query.
Text searching presents several challenges; searching with synonyms is one of the inherent ones. Checking for the absence or presence of keywords is a common method in full-text searching, but it returns only the set of documents that contain the target keywords. This is inefficient for finding information, since it does not return documents that contain synonyms of those keywords. Abbreviations and initialisms are additional challenges that make keyword-only full-text searching ineffective for information retrieval. In this project we demonstrate results using a new method to solve these problems. We developed a text search bundle in HPCC Systems using the ECL language and the Moby thesaurus to find the equivalence terms for the set of words in a search query. We also use word2vec to discover the similarity of terms based on context, and implement a method to handle initialisms and abbreviations in search queries.
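The thesaurus-based expansion step can be sketched as follows. The mini-thesaurus and function names here are invented for illustration; the actual bundle uses the Moby thesaurus and is written in ECL:

```python
# Toy thesaurus: each term maps to its equivalence set (invented data).
THESAURUS = {
    "car": {"automobile", "vehicle"},
    "fast": {"quick", "rapid"},
}

def expand_query(terms):
    """Expand each query term with its thesaurus equivalence terms."""
    expanded = set()
    for term in terms:
        expanded.add(term)
        expanded |= THESAURUS.get(term, set())
    return expanded

print(sorted(expand_query(["fast", "car"])))
# ['automobile', 'car', 'fast', 'quick', 'rapid', 'vehicle']
```

Matching documents against the expanded set, rather than the raw query terms, is what lets the search return documents that use synonyms instead of the original keywords.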
How to be rich - A study of monsters and mice of American industry
Zhe Yu - North Carolina State University
One way to ensure that the HPCC Systems software suite reaches a larger audience is to provide examples of significant business-level insights. LexisNexis stores large amounts of information on United States companies. What can we learn from that kind of data? And do those conclusions have important business impact?
This research reports the results of a study on 7,108 10-K filings from United States public companies in 2017. After some elaborate entity-recognition pre-processing, it was possible to extract a graph of which people served on the boards of which companies. The graph could be divided into “components” A, B, C, ... with the properties that (1) no company in component A shared a board member with any company in component B, and (2) any two companies within a single component were connected to each other, directly or transitively, through shared board membership.
The results were startling. While the data divided into 3,159 components, we found one “major” component and many, many “minor” ones. Specifically, the major component held 2,867 / 7,108 = 40% of the companies, while the remaining components typically consisted of small groups of companies sharing the same board members. This raises the question: what is so special about the major component? Analyzing the US Fortune 500 companies, we found that 348 / 500 = 70% of them are connected with each other in the “major” component, while the rest stay isolated. This might suggest that building connections with board members in the “major” component helps a company grow larger. However, a contradictory view is that, among the Fortune 500 companies, the isolated ones held more revenue on average, which may suggest that the isolated Fortune 500 companies would be a better investment than the connected ones.
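The component decomposition described above can be illustrated with a small sketch: companies are linked when they share a board member, and a depth-first search collects the connected components. The companies and board members here are invented, and this is only a toy version of the analysis:

```python
from collections import defaultdict

# Invented board-membership data: company -> set of board members.
boards = {
    "AcmeCorp": {"Alice", "Bob"},
    "BetaInc":  {"Bob", "Carol"},   # shares Bob with AcmeCorp
    "GammaLLC": {"Dave"},           # isolated company
}

# Build company-to-company edges via shared board members.
companies = list(boards)
edges = defaultdict(set)
for i, a in enumerate(companies):
    for b in companies[i + 1:]:
        if boards[a] & boards[b]:   # at least one board member in common
            edges[a].add(b)
            edges[b].add(a)

def components():
    """Collect connected components with a depth-first search."""
    seen, comps = set(), []
    for start in companies:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            comp.add(node)
            stack.extend(edges[node] - seen)
        comps.append(comp)
    return comps

print(components())  # two components: {AcmeCorp, BetaInc} and {GammaLLC}
```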
To further explore the information in the knowledge graph of US public companies, heterogeneous text mining tools like those in HPCC Systems are vital, offering timely business analytics from the kinds of complex data seen in current corporate America.
HPCC Systems Robotics Data Ingestion
Aramis Tanelus - American Heritage School
Robots are collecting more and more data these days, and that data requires ever heavier computing power. Building an interface between HPCC Systems and robots is therefore a necessity if HPCC Systems is to be relevant to robotics in schools, universities, and industry.
To build this interface, ROS packages along with ECL code were written to ingest data from three common robot sensors: a GPS sensor, infrared cameras, and lidar. All of these sensors have one thing in common: big data.
All of the sensors used in this project are part of an autonomous agricultural robotics project at American Heritage, in which a robot travels up and down the rows of plants, inspecting them, collecting data, and taking soil samples.
To collect GPS data, a SwiftNav Piksi GNSS receiver was used with a computer running Ubuntu and the newest version of ROS. The swiftnav_ros package published the data, and the messages were recorded with rosbag. Using rosbag, the package could be tested as if it were receiving a live stream of data from the GNSS receiver.
For the infrared camera, we started with a digital camera that provided images at 1080p resolution and 30 frames per second. To record images from the camera, a simple ROS package was written that used OpenCV to read images from the camera, and cv_bridge alongside Image Transport to convert the images into ROS messages.
After completing the integration of the digital camera, work began with the infrared camera, a FLIR C2. Drivers exist for the camera, but none could be found for Linux. Instead, the same method used on the digital camera was applied, and it successfully procured images from the infrared camera.
As with the GPS/GNSS data, a ROS package was used to feed measurements from a physical sensor (an RPLidar A2) into an ECL workspace. However, the same queries could not be used. Since the data coming from laser scans are much larger than those from GNSS fixes, the data needed to be split into batches to avoid sending all of it at once. This was done by grouping the records by timestamp. Since records cannot be indexed by reals, each timestamp was multiplied by 1,000,000 and the new value stored as an unsigned integer. When a request came in asking for records within a certain timestamp range, the reals in the request were multiplied the same way, allowing comparison with the unsigned timestamps.
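The timestamp-to-key trick described above can be sketched as follows. The helper names and the toy timestamp values are invented for illustration; the poster's implementation is in ECL:

```python
SCALE = 1_000_000  # shift microsecond precision into the integer part

def to_key(ts_seconds: float) -> int:
    """Convert a real-valued timestamp into an unsigned integer index key."""
    return int(ts_seconds * SCALE)

def in_range(record_key: int, start: float, end: float) -> bool:
    """Scale the requested real-valued bounds the same way before comparing."""
    return to_key(start) <= record_key <= to_key(end)

# A record timestamped 2.5 s becomes key 2500000 and matches a
# request for the range [2.0 s, 3.0 s].
key = to_key(2.5)
print(key, in_range(key, 2.0, 3.0))
```

Because both the stored records and the incoming request bounds are scaled by the same constant, range comparisons on the unsigned keys give the same results as comparisons on the original real-valued timestamps.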
ECL record structures were created for all the sensors, along with ECL attributes to ingest the data sprayed in from all three robotic sensors.
All the code for this project is available online in the ROS Packages repository for everyone to use. Although no data was actually analyzed by HPCC Systems for this project, the next logical step is to apply machine learning and other techniques to the big data collected by the agricultural robot.
Dimensionality reduction using PBblas
Shah Muhammad Hamdi - Georgia State University
Because of the advances in data collection, processing, and storage technologies, high dimensional data has become ubiquitous in almost any type of data-centric analysis. Such high dimensionality of data is also known as the curse of dimensionality because it leads to poor performance of supervised (e.g., classifiers) and unsupervised (e.g., clustering algorithms) learning methods. Reducing the dimensionality of high-dimensional data has been a prominent research problem in statistics and machine learning for decades. Principal Component Analysis (PCA) is a widely used dimensionality reduction technique, which reduces the dimensionality by Singular Value Decomposition (SVD)-based matrix decomposition. Recently, kernel PCA has been introduced to separate linearly inseparable data points. Parallel Block Basic Linear Algebra Subsystem (PBblas) is a library for the HPCC Systems platform which helps execute matrix operations using the parallel processing capabilities of HPCC Systems clusters. In this work, we have implemented PCA and kernel PCA for the HPCC Systems platform using the PBblas library. In this poster, we present the basic concepts of PCA and kernel PCA, implementation details of these dimensionality reduction techniques using PBblas, and the helper functions which can extend the capability of PBblas for handling machine learning projects.
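As a minimal illustration of what PCA computes (not the PBblas/SVD implementation described in the poster), the principal axis of a 2-D data set can be derived directly from its covariance matrix; the function name and data here are invented:

```python
import math

def pca_2d(points):
    """Return the unit principal axis of 2-D points from the covariance matrix."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Covariance matrix entries [[a, b], [b, c]].
    a = sum((x - mx) ** 2 for x, _ in points) / n
    c = sum((y - my) ** 2 for _, y in points) / n
    b = sum((x - mx) * (y - my) for x, y in points) / n
    # Largest eigenvalue of the symmetric 2x2 covariance matrix.
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    # Corresponding eigenvector (handle the diagonal case b == 0).
    vx, vy = (b, lam - a) if abs(b) > 1e-12 else ((1.0, 0.0) if a >= c else (0.0, 1.0))
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

# Points along the line y = x: principal axis is ~(0.707, 0.707).
print(pca_2d([(0, 0), (1, 1), (2, 2), (3, 3)]))
```

Projecting each point onto the leading eigenvectors found this way is the dimensionality reduction step; PBblas performs the equivalent decomposition in parallel across the cluster for matrices far too large for a single node.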
Cervical cancer risk factors: Exploratory analysis using HPCC Systems
Itauma Itauma - Keiser University
Cervical cancer is a leading cause of cancer-related death among women, with about half a million new cases worldwide in 2018. Around 90% of cervical cancer deaths occur in low-resource settings. This mortality could be reduced through effective prevention, screening, and treatment programs. HPV vaccination reduces cervical cancer risk; however, not all populations have access to HPV vaccination. Using HPCC Systems, an exploratory analysis of a cervical cancer database, with data visualizations, was performed. The findings from this study could be beneficial in resource-scarce settings with limited access to cervical cancer screening and HPV vaccination.
The future of automotive telemetry
Everett Matthew Upchurch Butler - Kennesaw State University
The concept of self-driving vehicles is one that has fascinated mankind for generations. It was once a topic thought to be so far-fetched that it could only exist in a child’s cartoon. Yet, society now finds itself thoroughly entrenched in this new reality. Many enterprises have already taken major strides toward realizing this once fanciful idea. As we race closer to that dream, we must pause to consider the potential risks that we may be creating for ourselves.
This project will focus on assessing two major areas of concern regarding autonomous vehicles. First, using HPCC Systems’ big data analytics platform (ECL), there will be an intensive focus on telemetry as it relates to autonomous vehicles. Provided in this research is a dataset that is composed of over 700,000 data points. These points represent simulations of the pertinent information autonomous vehicles will track and log. The categories in this data include: individual vehicle ID, timestamp, x coordinate, y coordinate, and velocity. ECL provides the ability to quickly analyze and manipulate this dataset in order to determine useful conclusions about the risks of autonomous vehicles to insurers and, moreover, the automotive industry sector as a whole.
As a secondary focus of this project, there will be an assessment of the potential cybersecurity risks associated with this topic. Autonomous vehicles will be highly computerized by nature and will require various levels of network interconnectivity. Assuming that these new vehicles will be connected to the internet and to the infrastructure in the world around them, the question is posed: how will these vehicles respond to cyberattacks? This project will assess the risks and liabilities associated with autonomous vehicles by utilizing tools such as penetration testing, TCP/IP attacks, ICMP attacks, and other cybersecurity attack methods that may prove to be useful.
Additionally, there will be an assessment of how the data collected from these vehicles will be stored and protected. These data can be stored by manufacturers, insurance agencies, or information-housing entities such as LexisNexis. For this section of the project, interviews will be conducted with various industry professionals to determine how their individual companies are addressing their own information security policies. This will be done in a deliberate attempt to determine how this sensitive information could be compromised by either an ethical or unethical hacker.
Using HPCC Systems machine learning to map public records data descriptions to standard categories
Lili Xu - Clemson University
The public driving records from a state’s Department of Motor Vehicles (DMV) can assist in identifying individuals with a deficient driving history. This information is important for making business decisions, such as flagging a bad risk for insurance companies. A standard code appears in the Driving Record section of a driving history report to map the various descriptions of state motor vehicle record violations to standard categories.
As more and more driving records become available, it is important to process incoming data efficiently. The proposed work explores driving records data with the machine learning library in HPCC Systems. The experimental results uncover the linguistic characteristics of driving violation descriptions and the violation terminology specific to different states, utilizing the massively parallel computing capability of HPCC Systems.