HPCC Systems 4.2.x Releases

Welcome to HPCC Systems® 4.2. Please review the information included in both categories shown below.

General - Details of the impact of a change and information about any modifications you may need to make to your ECL code.

Significant New Features - We recommend you consider using these features right away.

General

DeleteLogicalFile in a PROJECT or APPLY requires NOTHOR

Thor slaves are prevented from having direct access to Dali, which means that most FileServices calls cannot be used in a PROJECT/APPLY if they are executed on the slaves. To avoid getting an error, wrap the PROJECT/APPLY in a NOTHOR to force the expression/action to be executed in the ECL Agent context. For example, instead of using the following code:

APPLY(ddd,STD.File.DeleteLogicalFile(ddd.name)); //this does not work

Use:

NOTHOR(APPLY(ddd,STD.File.DeleteLogicalFile(ddd.name))); // NOTHOR forces the expression to be evaluated outside (using ECL Agent)
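As a fuller illustration, here is a minimal self-contained sketch; the 'temp::*' pattern and the leading '~' are illustrative assumptions:

IMPORT STD;
// Hypothetical: list logical files matching a pattern, then delete them.
// STD.File.LogicalFileList returns a dataset whose records include a name field.
ddd := STD.File.LogicalFileList('temp::*');
// NOTHOR forces the APPLY to run in the ECL Agent, where Dali access is allowed.
NOTHOR(APPLY(ddd, STD.File.DeleteLogicalFile('~' + ddd.name)));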

This new approach applies to all releases since 3.10.8.

https://track.hpccsystems.com/browse/HPCC-12514

Using the addScopes utility to control user access to private files (fixed in 5.0)

When a new ESP user account is created a private “hpccinternal::<user>” file scope is also created granting the new user full access to that scope and restricting access to other users. This file scope is used to store temporary HPCC Systems® files such as spill files and temp files.

If you are enabling LDAP file scope security and already have user accounts, you should run the addScopes utility program to create the hpccinternal::<user> scope for those existing users. Users who already have this scope defined are ignored, so it can safely be used on both new and legacy ESP user accounts.

The addScopes tool is located in /opt/HPCCSystems/bin and to run it you must pass the location of daliconf.xml, for example:

/opt/HPCCSystems/bin/addScopes  /var/lib/HPCCSystems/mydali/daliconf.xml

https://track.hpccsystems.com/browse/HPCC-10815

ECL using the 4.0.4 DeleteOwnedSubFiles feature does not compile on 4.2.0 (fixed in 4.2.2)

A function called DeleteOwnedSubFiles was added in 4.0.4 to allow subfiles to be removed from a superfile if they were referenced only from that one superfile.

The same change was also applied in 4.2, except that the function was named RemoveOwnedSubFiles rather than DeleteOwnedSubFiles, and it took an additional parameter to indicate whether the files in question should be deleted physically as well as being removed from the superfile.

Unfortunately, this means that ECL code taking advantage of the 4.0.4 feature will not compile on 4.2.0. In 4.2.2, the DeleteOwnedSubFiles function will be reinstated with the same semantics as 4.0.4, though we would encourage the use of RemoveOwnedSubFiles where possible.

Should it be necessary to work around this issue for code that needs to run on both 4.0.4 and 4.2.0, it should be possible to use #ISDEFINED to create code that will compile on both.

You will want to say something like:

IMPORT Std;

#IF (#ISDEFINED(Std.File.RemoveOwnedSubFiles))
    Std.File.RemoveOwnedSubFiles(superkeyname + '_Delete');
#ELSE
    Std.File.DeleteOwnedSubFiles(superkeyname + '_Delete');
#END

https://track.hpccsystems.com/browse/HPCC-10288

Embedded Python code fails with undefined symbols on some distros (fixed in 4.2.2)

On some distros (CentOS in particular) the standard Python 2.6 packages have been built in such a way that the standard Python libraries do not link statically to the Python core. This results in undefined symbols (typically _Py_ZeroStruct) when trying to execute embedded Python code that uses one of these libraries. The workaround is to add the following line to the top of /opt/HPCCSystems/sbin/hpcc_setenv (note that the library name will vary by distro):

export LD_PRELOAD=/usr/lib64/libpython2.6.so.1.0

From release 4.2.2, code has been added to the plugin that supports embedded Python to perform this load automatically, so the workaround will not be required.
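For reference, the kind of code affected is any embedded Python that imports such a library. A minimal sketch; the function name and the choice of module are illustrative:

IMPORT Python;

// A minimal embedded Python function; importing a standard library module
// is what triggered the undefined-symbol failure on affected distros.
STRING pyGreeting() := EMBED(Python)
  import json  # any module that links against the Python core could trigger the issue
  return json.dumps({'hello': 'world'})
ENDEMBED;

OUTPUT(pyGreeting());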

The expression EXISTS(a + b + ... + last) occasionally being evaluated as EXISTS(last) (Fixed in 4.2)

This happened only if the expression was evaluated inside a transform/filter, rather than generating a child query.

In the few cases where it was observed, it was being triggered by code that defined a dataset by appending several datasets (inline tables, child datasets, or other simple datasets). Other code then checked EXISTS() to determine whether the combined dataset contained any elements.
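A minimal sketch of the affected pattern; the inline datasets are illustrative:

r := {UNSIGNED4 n};
d1 := DATASET([{1}, {2}], r); // non-empty
d2 := DATASET([], r);         // empty
combined := d1 + d2;
// Should output TRUE; the bug could evaluate this as EXISTS(d2) and output FALSE.
OUTPUT(EXISTS(combined));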

As a result of the fix included in 4.2, you may notice a difference in the results of a query where the code has been relying on this incorrect behaviour and/or where the results were previously incorrect.

https://track.hpccsystems.com/browse/HPCC-10309

Superfile issue causing incorrect compression symptoms (fixed in 4.2.2)

This problem was seen when using a superfile with a few superfiles as subfiles, the first of which was empty. This caused the file descriptor for the superfile to configure the common shared attributes incorrectly; in particular, it failed to set @blockCompression, which caused the engines to treat the file as uncompressed when in fact all the subfiles were compressed.

This in turn led to deserialization problems which, in the case reported, resulted in the deserializer trying to allocate a massive buffer and running out of memory. Other symptoms could include deserialization errors relating to reading beyond the end of the stream, or record size mismatch errors.

https://track.hpccsystems.com/browse/HPCC-10319

New environment.conf setting using epoll() instead of select() (Fixed in 4.2)

When listening for input on a number of sockets (dafilesrv and Thor do so quite often), we now use the epoll() system call rather than the select() system call. This can be much more efficient when large numbers of sockets are involved.

It is possible to force the old usage of select() using a setting in environment.conf by specifying:

use_epoll=false

https://track.hpccsystems.com/browse/HPCC-9415

Persist file per code-hash (Fixed in 4.2)

In previous versions of the platform, a persist generated a single output file with a name that matched the string supplied within the PERSIST(). It was rebuilt if a query was submitted that was based on different code, or if the input files had changed.

In 4.2 this has changed so that the output filename is derived by appending a unique identifier, derived from the ECL code, onto the end of the filename. This means that if two users are using older and newer versions of the same persist, both can rebuild and co-exist independently. There are a couple of implications to note which may cause backward compatibility problems:

  1. Disk usage may go up because there are more copies of the persist. This can be alleviated by using the new persist options (https://track.hpccsystems.com/browse/HPCC-10213), or by manually deleting out of date persists.

  2. Very occasionally, ECL users have declared the persist files as ECL DATASETs and then read them directly. (This isn't recommended!) If this method is being used, the persist definitions should have the SINGLE option added to ensure the unmodified filename is used, as shown in the sketch after this list.
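A minimal sketch of a persist definition declared with SINGLE; the record layout, file names, and definition names are illustrative assumptions:

personRec := {STRING20 name, UNSIGNED4 id};
rawData := DATASET('~thor::in::people', personRec, FLAT);
// SINGLE keeps a single persist file with the unmodified name,
// restoring the pre-4.2 naming behaviour.
cleaned := SORT(rawData, id) : PERSIST('~temp::persist::cleaned', SINGLE);
OUTPUT(cleaned);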

https://track.hpccsystems.com/browse/HPCC-10022

Other improvements to the PERSIST functionality (Fixed in 4.2)

Improvements have also been made to the way that the expiry of persist files is handled. Rather than calculating the expiry date as a fixed time from when the file was created, we now track the last access to a file and expire based on that.
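As an illustration of the expiry handling, a hedged sketch using the EXPIRE persist option; the names and the seven-day period are illustrative:

personRec := {STRING20 surname, UNSIGNED4 id};
people := DATASET('~thor::in::people', personRec, FLAT);
// EXPIRE allows the persist to be removed automatically once it has not
// been accessed for the given number of days.
summary := TABLE(people, {surname, UNSIGNED4 cnt := COUNT(GROUP)}, surname)
           : PERSIST('~temp::persist::summary', EXPIRE(7));
OUTPUT(summary);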

Finally, the persist handling has been refactored so that ‘intermediate’ persist files can be safely deleted without causing persists that are dependent on them to rebuild unnecessarily.

https://track.hpccsystems.com/browse/HPCC-9985

Improvements to the version of JOIN that allows an ATMOST attribute with a condition and a limit (Fixed in 4.2)

The previous implementation was not quite correct for global joins in Thor. It sorted and distributed the input files by id and name, which meant that rows with the same id could end up on different nodes, potentially losing some matches. Now it distributes by id only, but sorts by id and name.
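A minimal sketch of the kind of join affected; the record layout, file names, and the limit of 100 are illustrative:

rec := {UNSIGNED4 id, STRING20 name};
l := DATASET('~thor::in::lhs', rec, FLAT);
r := DATASET('~thor::in::rhs', rec, FLAT);
// ATMOST with a condition and a limit: if more than 100 candidate rows
// match on id alone, those matches are skipped rather than failing.
j := JOIN(l, r,
          LEFT.id = RIGHT.id AND LEFT.name = RIGHT.name,
          TRANSFORM(LEFT),
          ATMOST(LEFT.id = RIGHT.id, 100));
OUTPUT(j);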

https://track.hpccsystems.com/browse/HPCC-10005

Previously, the sort implementation also did not allow the distribution to be a subset of the sort condition. This has now been changed to use a hash join, which will have the following impact:

  1. Extra matches if there were any boundary cases.

  2. A different performance profile for a hash join versus a standard join.

  3. Possibly different graphs, since self-hash joins are converted to a hash distribute followed by a local join.

https://track.hpccsystems.com/browse/HPCC-9711

Changes to ENUM where the first element matches a typedef name could affect existing layouts (fixed in 4.2)

Code which attempted to use a typedef'd type as the base type for an enumeration was not interpreted as the user expected. This could lead to unexpected behaviour in any code that attempted to use this feature. The enumerated type name was instead treated as the first element of the enumeration.

Previously, the type name would have been added as a label. From 4.2, the type name will be used to define the base type (where previously it was ignored). If a field of this type has been included in a dataset, then the type of that field will change.

If the elements of the enum do not have explicit values associated with them, then their values will change. For example:

baseType := unsigned2;
myEnum := ENUM(baseType, one, two, three);
OUTPUT(myEnum.two);

This would have previously generated 3.  In 4.2 it will now generate 2.
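If stable values are needed across versions, explicit values can be assigned, as in this sketch:

baseType := unsigned2;
// Explicit values are unaffected by the change in how the base type is interpreted.
myEnum := ENUM(baseType, one = 1, two = 2, three = 3);
OUTPUT(myEnum.two); // 2 on versions both before and after the change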

https://track.hpccsystems.com/browse/HPCC-9553

Optimization of MANY LOOKUP

In previous releases, a MANY LOOKUP join with a high number of matching right-hand-side key values caused a severe degradation in speed. The symptom would be seen as a large delay after the lookup join had read all its right-hand-side rows. In severe cases, this gave the impression that the JOIN had stalled before outputting any matches. This has been resolved in 4.2.
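A minimal sketch of the join form affected; the record layout, file names, and field names are illustrative:

rec := {UNSIGNED4 key, STRING20 val};
l := DATASET('~thor::in::lhs', rec, FLAT);
r := DATASET('~thor::in::rhs', rec, FLAT);
// MANY LOOKUP holds the right side in memory and allows multiple right-hand
// matches per key; very common key values triggered the slowdown.
j := JOIN(l, r, LEFT.key = RIGHT.key, TRANSFORM(LEFT), MANY LOOKUP);
OUTPUT(j);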

Significant New Features

New ZAP button in ECL Watch for information gathering when reporting issues

The ZAP (Zipped Analysis Package) button is located on the workunits details page in ECL Watch.

When experiencing problems that require investigation by the HPCC Systems® platform development team, pressing this button collects any associated log files and workunit information and zips them up so they can be emailed to the developers or attached to a JIRA ticket. A form is displayed asking you to provide other details about the events leading up to the discovery of the problem. For example:

  • The Workunit ID

  • ESP IP address

  • Thor IP address

  • Build/Release number

  • Timings information

This feature is available in both the old and new ECL Watch.

https://track.hpccsystems.com/browse/HPCC-7899

New GROUP JOIN

The new GROUP JOIN syntax allows you to efficiently join two datasets on one condition, but have the result grouped by another condition. This is useful for solving some relationship-matching problems efficiently. As a first approximation, the following ECL:

R := JOIN(l, r, LEFT.key = RIGHT.key, t(LEFT,RIGHT), GROUP(leftId))

where leftId is a value assigned from LEFT.id inside the transform t(), is equivalent to:

DL := DISTRIBUTE(l, HASH(key));
DR := DISTRIBUTE(r, HASH(key));
SL := SORT(DL, id, LOCAL);
J := JOIN(SL, DR, LEFT.key = RIGHT.key, t(LEFT,RIGHT), MANY LOOKUP, LOCAL);
DJ := DISTRIBUTE(J, HASH(leftId), MERGE(leftId));
R := GROUP(DJ, leftId, LOCAL);

https://track.hpccsystems.com/browse/HPCC-9951

A new flexible lookup join - JOIN, SMART

A SMART join attempts to perform an in-memory LOOKUP join. If there is insufficient memory, the SMART join will automatically ensure that both sides are efficiently distributed and attempt to perform a LOCAL LOOKUP join.

If there is still insufficient memory, the SMART join will fall back to a LOCAL HASH join, which is not limited by memory.
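A minimal sketch of the syntax; the record layout, file names, and field names are illustrative:

rec := {UNSIGNED4 key, STRING20 val};
l := DATASET('~thor::in::lhs', rec, FLAT);
r := DATASET('~thor::in::rhs', rec, FLAT);
// SMART starts as an in-memory LOOKUP join and degrades gracefully to a
// distributed LOOKUP join, then a LOCAL HASH join, as memory requires.
j := JOIN(l, r, LEFT.key = RIGHT.key, TRANSFORM(LEFT), SMART);
OUTPUT(j);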

https://track.hpccsystems.com/browse/HPCC-9951

 
