The aim of this project was to research and develop GPU-accelerated deep learning algorithms on HPCC Systems, and to incorporate many of the points outlined in the Create HPCC Systems VM on Hyper-V project, which would benefit both the current proposed scope of this project and future work.
Training modern deep neural networks requires big data and large computational power. Although HPCC Systems excels at both, it is currently limited to using the CPU only. GPU acceleration has been shown to vastly improve deep learning training time. This project will greatly expand HPCC Systems, not only by providing what is (to my knowledge) its first GPU-accelerated library but also by extending its deep neural network capabilities.
This project provides a software library (consisting of ECL and Python code) that gives HPCC Systems GPU-accelerated neural network training and expands and improves the existing deep learning framework. Additionally, it will increase the number of configurations on which HPCC Systems can be deployed by offering a Hyper-V image. The project’s outcome would also serve as a building block for future development of distributed configurations that were not previously available on HPCC Systems, such as model parallelism, and for enabling deep learning with asynchronous algorithms; current implementations are synchronous and suited only to synchronous algorithms.
Substance use disorders and mental illness affect a large number of people in the United States. In 2017, an estimated 19.7 million individuals aged 12 or older had a substance use disorder (SUD) and 46.6 million individuals aged 18 or over had any mental illness (AMI) in the past year (SAMHSA, 2018).
The burden of both mental illness and substance use continues to worsen, and overdose deaths are increasing (Pew Research Center, 2018). Anxiety and depression now top the list of problems teenagers see among their peers, with drug addiction and alcohol use listed third and fourth (Pew Research Center, 2019). Almost half of U.S. adults report that they have had a family member or close friend who was addicted to drugs (Pew Research Center, 2017), while suicide was the second leading cause of death in 2017 among people aged 10 to 34, with the total suicide rate increasing over time (NIMH, 2019).
The purpose of this study is to utilise data analytics to gain a deeper understanding of mental illness and substance abuse. Using the National Survey on Drug Use and Health and the HPCC Systems Machine Learning Library, this study will explore the relationship between co-occurring diagnoses of mental illness and substance use disorder as follows:
Since January 2019, I have been surveying log analysis techniques in the literature for detecting abnormal activities on computing and networking systems, as part of my thesis supervisor’s research collaboration with LexisNexis Risk Solutions Group. During the course of my study, unsupervised anomaly detection attracted my attention, as this technique has good potential to detect unknown cybersecurity threats. Hence, I was really excited to learn of this internship opportunity, which allowed me to learn various machine learning and big data analysis techniques and adapt them to implement an algorithm for this real-world problem.
This project is mainly based on the following paper: Experience Report: System Log Analysis for Anomaly Detection by Shilin He et al., published in 2016 at the IEEE 27th International Symposium on Software Reliability Engineering.
Log analysis can be divided into four main steps:
Log collection: getting the raw logs
Log parsing: getting log templates from raw logs
Feature extraction: getting relevant log sequences that are then fed to the machine learning algorithm
Anomaly detection: running unsupervised learning algorithms on our extracted features. I intend to use two clustering approaches: K-Means and Hierarchical Agglomerative Clustering.
I used HPCC Systems Roxie Logs, downloadable from ECL Watch.
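The feature extraction and anomaly detection steps listed above can be illustrated with a small sketch. This is not the project's ECL implementation: the session ids, the template ids and the distance-to-centroid scoring (a simplified stand-in for the K-Means and agglomerative clustering mentioned above) are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def event_count_matrix(sessions, templates):
    """Turn parsed log sessions into event-count vectors, one row per session."""
    rows = {}
    for session_id, events in sessions.items():
        counts = Counter(events)
        rows[session_id] = [counts.get(t, 0) for t in templates]
    return rows


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_anomaly(rows):
    """Sort session ids by distance from the centroid, farthest first."""
    center = centroid(list(rows.values()))
    return sorted(rows, key=lambda sid: distance(rows[sid], center), reverse=True)


# Hypothetical parsed sessions: each list holds log-template ids in order.
sessions = {
    "s1": ["T1", "T2", "T3"],
    "s2": ["T1", "T2", "T3"],
    "s3": ["T1", "T2", "T3"],
    "s4": ["T1", "T4", "T4", "T4", "T4"],
}
templates = ["T1", "T2", "T3", "T4"]

rows = event_count_matrix(sessions, templates)
ranking = rank_by_anomaly(rows)
```

In this toy data, session `s4` repeats a template the other sessions never emit, so it lies farthest from the centroid and is ranked first. A real clustering run would assign sessions to several clusters before scoring them.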
Many commercial cloud resources are available to deploy high performance systems, but managing these resources can be tedious. It can also be expensive when resources are left running for longer than they are needed. Developers and researchers may also spend a lot of time configuring clusters and environments manually. This project looks at how to automate, in a single step, the provisioning of a cluster, the execution of jobs on that cluster and the deprovisioning of the cluster once the jobs have completed.
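The provision, run and deprovision flow described above can be sketched with a context manager, which guarantees that teardown happens even if the job fails. This is only an illustration of the pattern: `provision_cluster`, `deprovision_cluster` and `run_job` are hypothetical placeholders that record what they would do, not calls to any real cloud provider or HPCC Systems API.

```python
from contextlib import contextmanager

events = []  # records the order of steps so the flow can be inspected


def provision_cluster(nodes):
    """Hypothetical stand-in for a cloud provider's 'create cluster' call."""
    events.append(f"provision:{nodes}")
    return {"nodes": nodes}


def deprovision_cluster(cluster):
    """Hypothetical stand-in for tearing the cluster down again."""
    events.append("deprovision")


@contextmanager
def temporary_cluster(nodes):
    """Provision a cluster, hand it to the caller, and always tear it down."""
    cluster = provision_cluster(nodes)
    try:
        yield cluster
    finally:
        deprovision_cluster(cluster)


def run_job(cluster, job):
    """Hypothetical stand-in for submitting a job and waiting for it."""
    events.append(f"run:{job}")
    return f"{job} finished on {cluster['nodes']} nodes"


with temporary_cluster(nodes=4) as cluster:
    result = run_job(cluster, "word_count")
```

Because the teardown sits in a `finally` clause, the cluster is released as soon as the job ends, which addresses the cost concern of resources running longer than needed.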
Digital threats and compromised data are growing at epic proportions, and safety on school campuses and in other public places is just as great a concern. When it comes to safety, especially that of school-aged children, the disturbing reality is that today they are not safe alone, whether online or outside. So what is the solution? By integrating HPCC Systems into our Autonomous Security Robot, we will be able to ingest data and apply advanced sorting techniques to develop a safety tool that can recognise potential risks on campus that might otherwise be missed by the human eye.
Intelligence-led policing (ILP) refers to technology-driven crime data analysis activities that support the design of effective crime prevention and prosecution strategies. This is a new approach to fighting crime that has been gaining strength due to the convergence of two technological streams: the digitization and release of public information related to the occurrence of crimes, and the development of technological platforms that allow the proper handling of such information, such as the HPCC Systems platform.
In this context, an element that can be exploited through this approach and that can be of great potential value for the public safety of a city is the understanding of the criminal mind or, more specifically, studying the patterns of victims and places of preference of criminals. Based on such knowledge, it could be possible to develop more precise actions for crime prevention and repression, as well as the development of predictive models to estimate the probability of occurrence of crime toward a specific individual profile or geographical location.
Motivated by this context, the objective of this study is to use the HPCC platform to analyze crime patterns in the state of São Paulo in Brazil between the years of 2006 and 2017. For this purpose, a public database was used. The choice of the database of the state of São Paulo is justified by the fact that it is the most populous state in Brazil, with just over 45 million inhabitants, including areas with the highest crime rates in the world. These features provide a rich crime database to be exploited by a high performance computing platform such as HPCC Systems.
The database contains information about the person harmed in each crime occurrence, such as their age, gender, profession and educational level. It also contains information regarding the type and the location of the crime.
Based on the type of information available in this database, the analysis conducted in this work focused on the creation of patterns around victim profiles, crime concentration, time of occurrence, as well as seeking trends among crime types.
The purpose of this project was to analyse the infrastructure statistics of elementary schools in Brazil in urban and rural areas. The goal was to investigate the basic infrastructure available for all students and school dependents such as water supply, electrical network, sewage network, internet access and the availability of ramps, handrails, signage and accessible toilets for people with special needs. The dataset used was public, covering data from 2015-2018.
Vehicle traffic crashes increased by more than 25 percent from 2010 to 2016, according to national statistics. The average auto insurance fee rose by around 20 percent from 2008 to 2016, according to the National Association of Insurance Commissioners (NAIC, 2018). This project simulates a telematics system and performs a preliminary analysis to evaluate the utility of telematics data for understanding driving habits. The analytical results of this project will benefit business applications in vehicle insurance risk assessment and offer initial observations on the value of the large amount of existing data in telematics systems. The study will also contribute to behavior analysis in both academia and industry.
The project consists of three modules: (1) Telematics system simulation, (2) Apache Kafka message system and (3) HPCC Systems analysis system implementation.
The telematics system simulation consists of multiple instances on Google Cloud. The system generates data for more than 1 million trips by 10,000 vehicles over a period of 180 days, with four to seven trips per day for each vehicle. The dates of all the data span no more than 180 days. The clients running on the Google Cloud instances export this real-time data as JSON messages and deliver them to the Kafka message queue over time.
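The trip generation described above can be sketched as a small generator that emits one JSON message per trip. The field names (`vehicle_id`, `trip_start`, `distance_km`), the start date and the distance range are assumptions for illustration; the actual message schema of the simulation is not given here, and delivery to Kafka is left out.

```python
import json
import random
from datetime import datetime, timedelta


def simulate_trips(num_vehicles, num_days, start, seed=0):
    """Yield one JSON message per simulated trip."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same data
    for vehicle_id in range(num_vehicles):
        for day in range(num_days):
            # Four to seven trips per vehicle per day, as in the description.
            for _ in range(rng.randint(4, 7)):
                departure = start + timedelta(
                    days=day, minutes=rng.randrange(24 * 60)
                )
                yield json.dumps({
                    "vehicle_id": vehicle_id,
                    "trip_start": departure.isoformat(),
                    "distance_km": round(rng.uniform(1.0, 60.0), 1),
                })


# A tiny run: 3 vehicles over 2 days instead of 10,000 vehicles over 180 days.
start = datetime(2019, 1, 1)
messages = list(simulate_trips(num_vehicles=3, num_days=2, start=start))
```

At the scale described (10,000 vehicles, 180 days, four to seven trips per day), a generator like this would emit between 7.2 and 12.6 million messages, so a real producer would stream them instead of collecting them in a list.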
Apache Kafka is an open-source stream-processing software platform. Kafka can deliver in-order, persistent, scalable messaging, and it enables you to build distributed applications. Kafka can connect real-time data from the telematics demo to the HPCC Systems platform via Kafka Connect, and it also provides Kafka Streams.
HPCC Data Analysis is an ECL program that processes data on the HPCC Systems platform. This demo fetches message data from the Kafka message queue, parses the messages into the required format, removes unnecessary data, saves the results to datasets, analyzes these datasets, and sends the result to our client. Two important concepts, Hard Acceleration and Hard Braking, play a large part in our analysis. Hard Acceleration or Hard Braking is a driver event in which more force than normal is applied to the vehicle’s accelerator or brake system. It can be an indicator of aggressive or unsafe driving behavior. Depending on the trip data and the threshold, the analysis can help our client identify good or bad driving habits.
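Threshold-based detection of hard acceleration and hard braking can be sketched as follows. The input format (speed samples in metres per second at a fixed interval) and the threshold of 3.0 m/s² are assumptions chosen for illustration; neither the trip data layout nor the threshold used in the project is stated here.

```python
def classify_events(speeds_mps, interval_s=1.0, threshold_mps2=3.0):
    """Label each sampling interval whose speed change exceeds the threshold.

    Returns (sample_index, label) pairs. The threshold is an illustrative
    assumption, not a value taken from the project.
    """
    events = []
    for i in range(1, len(speeds_mps)):
        accel = (speeds_mps[i] - speeds_mps[i - 1]) / interval_s
        if accel >= threshold_mps2:
            events.append((i, "hard_acceleration"))
        elif accel <= -threshold_mps2:
            events.append((i, "hard_braking"))
    return events


# Hypothetical one-second speed samples for a short trip.
trip = [0.0, 2.0, 6.0, 8.0, 8.0, 3.0, 2.0]
events = classify_events(trip)
```

Counting such events per trip or per vehicle gives a simple feature that could feed the kind of driving-habit analysis described above.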
In summary, our project implemented a data processing pipeline on HPCC Systems for the vehicle industry, demonstrating its substantial value to the industry. The analysis results could also potentially reduce the damage from vehicle accidents caused by human behavior.
Text cleaning is becoming an essential step in text classification. Stop word removal is a crucial technique in text cleaning that saves huge amounts of space in text indexing. Many domain-based common words differ from one domain to another and have no value within a particular domain. Which words qualify depends on the document collection: for example, the word "protein" would be a stop word in a collection of articles on bioinformatics, but not in a collection describing political events. Eliminating these words will reduce the size of the corpus and enhance the performance of text mining. In this project we used the Text Vectors bundle (CBOW) in HPCC Systems to find the domain-based common words. The idea behind using text vectors is the ability to map each unique token in the corpus to a vector of n dimensions. Text vectors map words into a high-dimensional vector space such that similar words are grouped together, and the distances between words can reveal relationships. Using the vector representation of words, we can find the center of the space; by computing the distance between each unique word in the corpus and that center, we can identify the domain-based common words as those with the shortest distance from the center. To test our methodology, we applied commonly used text classification methods, such as ClassificationForest, before and after eliminating the common words. Eliminating domain-based common words will enhance the performance of the classification methods.
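The distance-to-center idea described above can be sketched in a few lines. The two-dimensional vectors below are made-up values for illustration, not output from the Text Vectors bundle, and Euclidean distance is an assumption, since the distance measure used in the project is not stated.

```python
from math import sqrt


def centroid(vectors):
    """Component-wise mean of all word vectors: the center of the space."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def common_words(word_vectors, k):
    """Return the k words whose vectors lie closest to the center."""
    center = centroid(list(word_vectors.values()))
    ranked = sorted(word_vectors, key=lambda w: euclidean(word_vectors[w], center))
    return ranked[:k]


# Hypothetical 2-dimensional vectors; real word vectors have many more dimensions.
word_vectors = {
    "protein": [0.1, 0.0],
    "gene": [0.0, 0.1],
    "kinase": [2.0, 2.0],
    "ribosome": [-2.0, 2.0],
    "enzyme": [2.0, -2.0],
    "membrane": [-2.0, -2.0],
}

candidates = common_words(word_vectors, k=2)
```

Here "protein" and "gene" sit nearest the center and would be the candidate domain stop words. How many words to drop (`k`) is a tuning choice that the classification comparison described above could guide.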
The ultimate goal in any sport is to win. Coaches and athletes must strive to compete at peak performance. Sport science is a rapidly growing field that looks to address the questions that come with trying to help athletes be at their best when it matters most. This takes a strong understanding of exercise physiology as well as data science. Some questions may be: how hard does an athlete need to train on a given day? Or how does a practice session compare to a game in terms of physical demand? These are the types of questions that we try to answer using HPCC Systems with our data at NC State University Strength and Conditioning. From uploading the data into HPCC Systems, to processing and manipulation, to visualization, see how HPCC Systems can be used to optimize obtaining information from raw GPS data collected with our soccer teams to help us best develop training programs to improve sport performance.
Forensic sound analysis is a fairly new field of science, which, in its current form, can be traced back to 1973 and the aftermath of the Watergate scandal. After the release of a series of tapes that revealed several crucial conversations between the then president Richard Nixon and his counsel, investigators discovered an 18 1⁄2-minute gap on one of the tapes. This "gap" was determined to have been created intentionally, by re-recording "noise" over the original audio, obscuring it beyond recognition. The discovery prompted a team of audio experts to analyse the altered tape to try and recover the lost audio. Although they were unable to restore the tapes, their forensic process was well documented, and it now forms the basis of forensic audio analysis procedure today.
Today, with the increasing availability of smartphones and other handheld recording devices, people are generating vast amounts of data in the form of digital audio. Yet, despite a move from analog equipment to digital plug-ins, many of the fundamental processes used in forensic sound analysis have remained relatively unchanged. This project aims to use HPCC Systems ECL, in tandem with TensorFlow’s machine learning libraries, to offer a modern solution to some of the problems presented in forensic sound analysis.
The aim of this project was to design a program that can take an audio file as an input and give a description of where it was recorded as an output, by utilising a combination of machine learning techniques and convolution reverb. Mainly, I am interested in the forensic applications of this type of sound analysis system and the research conducted during this project, as well as the program itself, was explored from this perspective.
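Convolution reverb, mentioned above, applies the acoustic signature of a space to a sound by convolving the dry signal with an impulse response recorded in that space. The sketch below shows only that operation on short made-up sample lists; how the project combined it with machine learning is not described here, so this should not be read as the project's pipeline.

```python
def convolve(signal, impulse_response):
    """Direct discrete convolution of a dry signal with an impulse response."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


# Hypothetical samples: a dry signal and a room with one quiet echo.
dry = [1.0, 0.0, 0.0, 0.5]
room = [1.0, 0.0, 0.25]
wet = convolve(dry, room)
```

Each input sample is followed two samples later by a copy at a quarter of its amplitude, which is the echo encoded in `room`. For real audio, an FFT-based convolution would be far faster than this direct double loop.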
In the present era of e-commerce, where millions of transactions occur every second, verifying a consumer’s account details with the bank and then transferring the amount from the bank to the merchant is both time-consuming and costly. Stored-value cards were created as a solution: the card holds a prepaid amount of money, and transactions are made simply by deducting the required amount from the card.
The same features that make this card attractive to consumers (data privacy and anonymity) also carry the hidden disadvantage of making the card susceptible to fraud. Identifying fraudulent methods in a cost-effective and timely manner has been a major concern for companies that use such cards.
Applying CNN models to transactional data to identify anomalies and fraud has proved quite promising, as the model itself takes care of most of the static and dynamic feature extraction, making detection easier than with random forests. In this project we aim to solve the problem outlined above by applying a CNN model for dynamic detection of fraud in stored-value cards.
HPCC Systems is an open source big data processing platform used to provide big data solutions. Big data processing applications such as image processing and audio processing often make use of mathematical computations; however, there is currently little provision for executing extensive mathematical computations on the HPCC Systems platform.
The objective of the project was to support Octave by allowing the embedding of Octave database queries within ECL code, with the help of simple wrapper classes to handle scalar values and structured data, including multithreaded access from the ECL side. GNU Octave is an open source scientific programming language for numerical computations.
The poster explains the need for Octave on a big data processing platform like HPCC Systems and outlines the implementation of the Octave plugin. The Octave plugin gives HPCC Systems a new dimension in mathematical computation; in particular, the simplicity of Octave enhances ECL’s numerical computation power.