Arya Adesh is studying for a Bachelor's degree in Computer Science and Engineering at RVCE in India. Arya suggested this project idea himself, producing a proposal to complete a piece of research in an area of interest to him. His proposal was accepted and he joined the 2022 HPCC Systems Intern Program to complete this research, contributing a new anomaly detection algorithm to the HPCC Systems Machine Learning Library as one of the deliverables.
As well as the resources included here, read Arya's intern blog journal, which includes a more in-depth look at his work.
Anomaly detection is the process of finding unexpected abnormalities in a dataset. Anomalies are rare occurrences that differ from the norm. An anomaly in a real-time dataset may indicate a critical incident such as bank fraud, data compromise, infrastructure failure, or another deviation, so it is important to identify such anomalies for further action. Local Outlier Factor (LOF) is an unsupervised, density-based anomaly detection method that identifies anomalies without training. It assigns a degree of outlier-ness, called the local outlier factor, to each point in the dataset. LOF can find both global outliers (points that are outlying with respect to all the points in the dataset) and local outliers (points that are outlying with respect to their neighbors). Other anomaly detection algorithms find global anomalies accurately but fail to identify local outliers because they assume that the data is uniformly distributed. LOF is best suited to unevenly distributed datasets because it makes no assumptions about the distribution. Outliers are determined by comparing the local density around each data point with the local density around its neighboring points.
The major steps in the classical algorithm are:
- Find the K nearest neighbors (KNNs) of all points
- Find the reachability distance of every point with respect to all its KNNs
- Find the local reachability density of a point with respect to all its KNNs, from the reachability distance
- Find the local outlier factor from the local reachability density
- Assign binary labels to points with a high local outlier factor
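The steps above can be sketched on a single machine as follows. This is a minimal illustration of the classical algorithm in plain Python, not the HPCC Systems implementation. The function names, the brute-force neighbor search, and the labeling threshold of 1.5 are assumptions made for the example. It also takes exactly k neighbors per point rather than every point tied at the k-distance.

```python
import math

def knn(points, i, k):
    """Step 1: the k nearest neighbors of points[i] as (distance, index), nearest first.
    Brute force is used here only to keep the sketch short."""
    dists = sorted(
        (math.dist(points[i], q), j) for j, q in enumerate(points) if j != i
    )
    return dists[:k]

def local_outlier_factors(points, k):
    n = len(points)
    neighbors = [knn(points, i, k) for i in range(n)]
    # k-distance of a point: distance to its k-th nearest neighbor.
    k_distance = [nbrs[-1][0] for nbrs in neighbors]

    # Steps 2 and 3: reachability distance and local reachability density (lrd).
    # reach_dist(p, o) = max(k_distance(o), d(p, o))
    # lrd(p) = 1 / mean reach_dist(p, o) over the neighbors o of p
    # Note: this divides by zero if more than k points are exact duplicates.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], d) for d, j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # Step 4: LOF(p) = mean lrd of p's neighbors divided by lrd(p).
    lof = []
    for i in range(n):
        mean_neighbor_lrd = sum(lrd[j] for _, j in neighbors[i]) / len(neighbors[i])
        lof.append(mean_neighbor_lrd / lrd[i])
    return lof

def label_outliers(lof, threshold=1.5):
    """Step 5: binary labels; the threshold is an assumed value for illustration."""
    return [1 if score > threshold else 0 for score in lof]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = local_outlier_factors(points, k=2)
labels = label_outliers(scores)
```

A score close to 1 means a point is about as dense as its neighbors, while a score well above 1 means it sits in a sparser region than they do. In this toy dataset only the isolated point (10, 10) is labeled as an outlier.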
In total, the LOF of a point p depends on k + k² + k³ points in the dataset, where k is a user input. Mining the nearest neighbors is the most important and time-consuming step, so it is important to reduce redundant steps to improve the run time of the algorithm. KD trees and Ball Trees will be used to find the KNNs in a distributed fashion in HPCC Systems. The steps in distributed systems like HPCC Systems differ from the classical algorithm to reduce redundancy and improve performance. Implementation of LOF in HPCC Systems will be discussed using flow charts.
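To illustrate why a tree structure speeds up the neighbor search, here is a minimal single-machine KD tree sketch in Python. It is not the distributed HPCC Systems implementation. The names `build_kd_tree` and `k_nearest` are hypothetical and points are assumed to be tuples of equal length. The key idea is that a whole subtree can be skipped when the splitting plane is farther from the query than the current k-th best distance.

```python
import heapq
import math

def build_kd_tree(points, depth=0):
    """Recursively build a KD tree; each node is (point, left, right) or None."""
    if not points:
        return None
    axis = depth % len(points[0])          # cycle through the coordinates
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                 # median point splits the space
    return (
        points[mid],
        build_kd_tree(points[:mid], depth + 1),
        build_kd_tree(points[mid + 1:], depth + 1),
    )

def k_nearest(tree, query, k):
    """Return the k stored points closest to query as (distance, point), nearest first."""
    heap = []  # max-heap via negated distances; holds at most k candidates

    def visit(node, depth):
        if node is None:
            return
        point, left, right = node
        d = math.dist(point, query)
        if len(heap) < k:
            heapq.heappush(heap, (-d, point))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, point))
        axis = depth % len(query)
        diff = query[axis] - point[axis]
        near, far = (left, right) if diff < 0 else (right, left)
        visit(near, depth + 1)
        # Cross the splitting plane only if a closer point could lie beyond it.
        if len(heap) < k or abs(diff) < -heap[0][0]:
            visit(far, depth + 1)

    visit(tree, 0)
    return sorted((-neg_d, p) for neg_d, p in heap)

tree = build_kd_tree([(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)])
nearest = k_nearest(tree, (0.9, 0.8), k=2)
```

If the query point is itself stored in the tree, it is returned as its own nearest neighbor at distance 0. For LOF, one would therefore request k + 1 neighbors and drop the point itself.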
In this Video Recording, Arya provides a tour and explanation of his poster content.
Implementation of Local Outlier factor Algorithm for Anomaly Detection in HPCC Systems