Anshu Ranjan contributed to this project as part of the HPCC Systems Summer Internship Program. He’s been working on this project with his mentors Gavin Halliday and Jamie Noss.
While this project has had limited success, there are good reasons why that is the case. The code generator is, in effect, the engine room of the system: contributing to it requires not only an eye for detail but also a good overview of the entire system and how each component interacts with the others. Gavin Halliday is our resident expert in this area and, through years of experience, knows it inside out. It is, in short, a complex and challenging area.
One of the things we hoped to get out of this project was feedback on how to improve our internal developer documentation so that others can contribute to the codebase in the future. Having a student working in this area has certainly highlighted some specific improvements we can make, and we have already made some as a result of Anshu joining the team.
Unfortunately, it was not possible to complete this project, although work is still going on behind the scenes. There may be more to post about this project in the future.
Queries executed on the HPCC system can be very complex. In particular, they often contain "child queries", which range from the very simple (e.g., filter the list of addresses associated with a person) to the complex (e.g., for a person, look up all the information available about them, perform some scoring, select the best of those results, and use that to retrieve further information to augment the current record).
There are generally two ways of executing these child queries: some can be executed with C++ code generated inline, while others require sub-graphs to be generated. Currently, the selection between the two options is fairly simplistic, and the two options do not have the same optimizations applied to them. This can lead to poor code being generated for inline child datasets, e.g., common sub-expressions not being spotted and unused fields not being removed. Processing child queries also involves converting the declarative ECL into a more imperative execution graph. This often involves a trade-off between ensuring that shared expressions are executed only once, that conditional expressions are executed only when required, and that all expressions can be evaluated efficiently. Improving that trade-off can produce large benefits.
Completion of this project involves:
- Development of test cases.
- Designing a new mechanism for processing child queries, which may involve changes to the code generator and the execution engines.
- Improving the code generator to generate code that is significantly better for many of the test cases.
- Ensuring that there is no significant degradation in the code generated from the queries in the regression suites.
- Coordinating with the maintainers of the engines to implement any extensions required.
In particular, the solution needs to address:
- Ensuring both inline and out-of-line datasets are optimized.
- Ensuring unshared inline datasets are evaluated as efficiently as they are currently.
- Ensuring child queries are not executed if they can be short-circuited.
- Ensuring any shared code is only executed once.
- Ensuring results that are used conditionally are only evaluated when required (this can conflict with the previous requirement).
- Providing a framework which is clearly understood and can be extended.
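One common way to reconcile the last two evaluation requirements, i.e. shared code runs at most once while conditionally used results are only evaluated when actually required, is lazy, memoized evaluation. The following is only an illustrative sketch of that idea in Python; it makes no assumptions about the actual design chosen for the code generator, and all names in it are invented.

```python
class Lazy:
    """Defers evaluation until first use, then caches the result."""
    _UNSET = object()

    def __init__(self, fn):
        self._fn = fn
        self._value = Lazy._UNSET

    def get(self):
        if self._value is Lazy._UNSET:
            self._value = self._fn()  # evaluated at most once
        return self._value

evaluations = []

def expensive_lookup():
    evaluations.append("lookup")  # record each real evaluation
    return 42

shared = Lazy(expensive_lookup)

def child_query(condition):
    # The shared result is computed only if the condition demands it,
    # and never more than once across all uses.
    if condition:
        return shared.get() + shared.get()
    return 0

print(child_query(False), evaluations)  # 0 []  -- short-circuited, never evaluated
print(child_query(True), evaluations)   # 84 ['lookup']  -- evaluated exactly once
```

In code-generator terms, the thunk plays the role of a once-only helper that both the conditional and unconditional uses of a shared child query can call, so neither requirement has to be sacrificed for the other.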
By the GSoC mid-term review we would expect you to have a provisional design for a new child query algorithm, and an implementation that shows some improvements on a selection of test cases.
Backup Mentor: Jamie Noss
End of project