
...

Since January 2019, I have been surveying log analysis techniques in the literature for detecting abnormal activities on computing and networking systems, as part of my thesis supervisor’s research collaboration with LexisNexis Risk Solutions Group. During the course of my study, unsupervised anomaly detection attracted my attention because it has good potential to detect unknown cybersecurity threats. Hence, I was really excited to learn of this internship opportunity, which allowed me to learn various machine learning and big data analysis techniques and apply them to implement an algorithm for this real-world problem.

...

The project consists of three modules: (1) the telematics system simulation, (2) the Apache Kafka message system, and (3) the HPCC Systems data analysis implementation.

...

HPCC Data Analysis is an ECL program that processes data on the HPCC Systems platform. This demo fetches message data from the Kafka message queue, parses the messages into the required format, removes unnecessary data, saves the results to datasets, analyzes these datasets, and sends the results to our client (a rough sketch of this flow appears below). There are

...
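As a rough illustration of the fetch-parse-clean-save flow described above, the ECL sketch below consumes messages through the HPCC Systems kafka plugin's KafkaConsumer interface, parses each payload, drops malformed rows, and persists the result for downstream analysis. The topic name, broker address, payload layout, and output file path are assumptions for illustration only; this is a minimal sketch of the pattern, not the project's actual code.

IMPORT kafka;
IMPORT Std;

// Consumer for the (assumed) telematics topic on a local broker
consumer := kafka.KafkaConsumer('telematics', 'localhost');

// Pull up to 10,000 raw messages from the queue
rawMsgs := consumer.GetMessages(10000);

// Target layout after parsing (hypothetical fields)
Telemetry := RECORD
    STRING20  vehicleID;
    UNSIGNED8 eventTime;   // epoch seconds
    REAL4     speedMPH;
END;

// Assume each payload is a comma-separated 'vehicleID,timestamp,speed' string
Telemetry parseMsg(RECORDOF(rawMsgs) m) := TRANSFORM
    parts := Std.Str.SplitWords(m.message, ',');
    SELF.vehicleID := parts[1];
    SELF.eventTime := (UNSIGNED8) parts[2];
    SELF.speedMPH  := (REAL4) parts[3];
END;

parsed := PROJECT(rawMsgs, parseMsg(LEFT));

// Clean step: drop malformed rows before persisting
cleaned := parsed(vehicleID != '' AND eventTime > 0);

// Save to a dataset for the downstream analysis and reporting steps
OUTPUT(cleaned, , '~demo::telematics::cleaned', OVERWRITE);

In the real pipeline the parsed fields would match the telematics simulator's actual message schema, and the analysis and client-notification steps would run over the persisted file.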

In summary, our project implemented a data processing pipeline for the vehicle industry on the HPCC Systems platform, which demonstrated great value for the industry, and the analysis results could also potentially reduce the damage from vehicle accidents caused by human behavior.

...

Text cleaning is becoming an essential step in text classification. Stop word removal is a crucial space-saving technique in text cleaning, saving huge amounts of space in text indexing. There are many domain-based common words which differ from one domain to another and have no value within a particular domain. They depend on the document collection; for example, the word "protein" would be a stop word in a collection of articles addressing bioinformatics, but not in a collection describing political events. Eliminating these words reduces the size of the corpus and enhances the performance of text mining.

In this project we used the Text Vectors bundle (CBOW) in HPCC Systems to find the domain-based common words. The idea behind using text vectors is the ability to map each unique token in the corpus to a vector of n dimensions. Text vectors map words into a high-dimensional vector space such that similar words are grouped together, and the distances between words can reveal relationships. Using the vector representation of words, we can find the center of the space, and by computing the distance between each unique word in the corpus and that center, we can find the domain-based common words, which have the shortest distance from the center (a rough sketch of this step appears below).

To test our methodology we applied some commonly used text classification methods, such as ClassificationForest, before and after eliminating the common words. Eliminating domain-based common words enhances the performance of the classification methods.
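As a rough sketch of the centroid step only (not the project's actual code), the ECL below assumes the word embeddings have already been produced and stored one row per unique token; the logical file name, record layout, embedding dimension, and candidate count are hypothetical. It computes the center of the vector space as the per-dimension mean and ranks words by Euclidean distance from that center, so the closest words become candidate domain-based common words.

// Hypothetical layout: one row per unique token and its embedding
WordVec := RECORD
    STRING       word;
    SET OF REAL8 vec;   // n-dimensional embedding from the text-vectors model
END;

wordVecs := DATASET('~demo::corpus::wordvecs', WordVec, THOR);  // assumed file

VecDim := 100;  // assumed embedding dimension

// One row per (word, dimension, value) to make per-dimension averaging easy
Elem := RECORD
    STRING    word;
    UNSIGNED2 dim;
    REAL8     val;
END;

elems := NORMALIZE(wordVecs, VecDim,
                   TRANSFORM(Elem,
                             SELF.word := LEFT.word,
                             SELF.dim  := COUNTER,
                             SELF.val  := LEFT.vec[COUNTER]));

// Center of the vector space: mean value per dimension
center := TABLE(elems, {dim, REAL8 cval := AVE(GROUP, val)}, dim);

// Squared per-dimension differences from the center
diffs := JOIN(elems, center, LEFT.dim = RIGHT.dim,
              TRANSFORM({STRING word, REAL8 sqdiff},
                        SELF.word   := LEFT.word,
                        SELF.sqdiff := (LEFT.val - RIGHT.cval) * (LEFT.val - RIGHT.cval)));

// Euclidean distance of each word from the center
dists := TABLE(diffs, {word, REAL8 dist := SQRT(SUM(GROUP, sqdiff))}, word);

// Words nearest the center are candidate domain-based common words
stopCandidates := TOPN(dists, 200, dist);
OUTPUT(stopCandidates);

The cutoff of 200 candidates is arbitrary here; in practice the threshold would be tuned against the ClassificationForest results mentioned above.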

...

Today, with the increasing availability of smartphones and other handheld recording devices, people are generating vast amounts of data in the form of digital audio. Yet, despite a move from analog equipment to digital plug-ins, many of the fundamental processes used in forensic sound analysis have remained relatively unchanged. This project aims to use HPCC Systems ECL, in tandem with TensorFlow’s machine learning libraries, to offer a modern solution to some of the problems presented in forensic sound analysis.

...