Sumanth Hegde is a 3rd-year student in the Computer Science undergraduate program at R.V. College of Engineering.
HPCC Systems is a leading platform for large-volume data engineering. Because of the large amounts of data involved, it is essential for both developers and users to understand and interpret the data flow within a program. External data providers can see how the data they supplied was used, and developers can get an overview of the data flow without reading the entire code. A typical Enterprise Control Language (ECL) job involves multiple datasets, each with multiple fields affected by many operations and transformations. The current system shows operations only at the dataset level. Our project helps users understand how each individual field contributes to the outputs.
- Explore how to extract relevant information from the various stages of ECL code compilation, in order to understand the operations that transform the input datasets into the output dataset.
- Use the extracted information to infer relationships between dataset fields and the operations performed on them.
- Test the results on a variety of ECL codes and datasets.
Input Gathering and Preprocessing: To track field-level dependencies, we use the intermediate representation (IR) generated by the ECL compiler. The IR contains variables corresponding to type definitions, field definitions, literals, and operations. Its format is highly structured and can easily be tokenized with a simple string-delimiter approach; the tokens are stored in a C++ std::vector. Each vector entry records information such as the independent and dependent variables on the given IR line, the operation involved, and any annotations. A companion map over this vector is built for faster access to this information.
Parsing: Parsing consists of backtracking from the output tokens to the tokens linked to them, according to custom-defined rules. Extracting the useful information about a token requires a forward trace, while generating the graph requires a backward trace through the IR. To make backtracking efficient, we therefore forward-trace the vector beforehand and cache information about these tokens for direct use during backtracking. The backtracking process produces the desired directed graph.
Outcome: The data-flow graph generated for ECL code can be used by both developers and end users. The various dependencies (immediate and long-term) can be explored via UI options.
In this Video Recording, Sumanth provides a tour and explanation of his poster content.
Field Level Tracking of Datasets in an ECL WorkFlow