[SYSTEMDS-3859] Improved Relational Algebra Builtin Functions #2284

maxrankl · 2025-06-30T13:15:37Z

This PR improves the runtime of ra_groupby in the builtin functions. Both methods (nested loop and permutation matrix) pass the tests provided by SystemDS.

I created an additional benchmarking test that uses a star schema benchmark data set generator (ssb-dbgen); the repo is at https://github.com/eyalroz/ssb-dbgen. I modified the lineorder table for scale factor 1 and kept only the first five columns, because ra_groupby only allows numerical values. Then I compared the two functions for different numbers of rows (10, 50, 100, 1000, 5000, 10000). I used the Python API to run the tests and visualize the results.

Nested loop
For the input matrix X (N x M) and the output matrix Y (N' x M), the previous implementation needed the following number of iterations:

previous version = #groups * N

The improved version iterates only once over the entire input matrix X:

current version = N
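To put the two counts side by side, here is a small Python sketch that plugs in group counts for column 4 from the table further down in this description. The function names are only illustrative, and the cost of the sort used by the new version is deliberately not counted.

```python
# Group counts for column 4, copied from the table in this description.
groups_col4 = {10: 10, 100: 100, 1000: 997, 10000: 9756}


def previous_iterations(n_rows, n_groups):
    # previous version = #groups * N
    return n_groups * n_rows


def current_iterations(n_rows):
    # current version = N (the sort itself is not counted here)
    return n_rows


for n_rows, n_groups in groups_col4.items():
    print(n_rows, previous_iterations(n_rows, n_groups), current_iterations(n_rows))
```

For 10000 rows and 9756 groups this gives 97,560,000 row visits for the previous version against 10,000 for the current one.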

The idea of this implementation is to save the initial order of X and then sort X by the column chosen for grouping. Because X is then ordered by group, once the rows of one group have been selected, the function can continue directly with the next group without restarting the iteration at the beginning of X. The selected rows are then copied into Y.
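The sort-then-single-pass idea can be sketched in plain Python. This is not the DML code of this PR; `group_by_sorted` is a hypothetical helper, and the output layout (group key first, then the remaining values of each row in the group, zero-padded to a common width) is an assumption made for illustration.

```python
def group_by_sorted(X, col):
    """Group the rows of X by X[r][col] with one pass over the sorted rows."""
    # Sort row indexes by the grouping key. sorted() is stable, so rows of
    # the same group keep their original relative order.
    order = sorted(range(len(X)), key=lambda r: X[r][col])

    groups = []                      # one output row per group: [key, values...]
    for r in order:                  # single pass over all N rows
        row = X[r]
        rest = row[:col] + row[col + 1:]
        if groups and groups[-1][0] == row[col]:
            groups[-1].extend(rest)  # same key: append to the current group
        else:
            groups.append([row[col]] + rest)  # key changed: start next group

    # Zero-pad so that all output rows have the same width (assumed layout).
    width = max((len(g) for g in groups), default=0)
    return [g + [0] * (width - len(g)) for g in groups]


X = [[1, 10], [2, 20], [1, 30], [3, 40], [2, 50], [1, 60]]
Y = group_by_sorted(X, 0)
print(Y)
```

Because the rows arrive ordered by key, a change of key is enough to detect the start of the next group, so X never has to be scanned again from the beginning.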

The results for the other four columns are the same.

Permutation matrix
The previous implementation uses a permutation matrix P with which the input matrix X is multiplied. This creates a temporary Y matrix, from which the selected column is removed; the temporary Y is then reshaped into the dimensions of the final Y (N' x M).

The new implementation also uses the permutation matrix P, but creates it differently. While the initial version needs a comparison matrix (matrix multiplication and comparison operations) to create the column index for P, the new version only creates P when padding is needed. This is the case when the groups do not all have the same number of rows. The new version therefore calculates the frequency of each group in the data set and then checks whether padding is needed.
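The frequency check can be sketched in Python as follows. `padding_needed` is a hypothetical helper name, not a function from this PR; it returns the per-group row counts and, for every group that is smaller than the largest one, the number of padding rows it is missing.

```python
from collections import Counter


def padding_needed(X, col):
    """Return (frequency per group, missing rows per group that needs padding)."""
    freq = Counter(row[col] for row in X)
    max_rows = max(freq.values(), default=0)
    # Only groups with fewer rows than the largest group need padding.
    missing = {key: max_rows - n for key, n in freq.items() if n < max_rows}
    return freq, missing


freq, missing = padding_needed([[1, 10], [2, 20], [1, 30], [3, 40], [2, 50], [1, 60]], 0)
print(dict(freq), missing)
```

An empty `missing` dictionary corresponds to the no-padding case; otherwise its values describe how much padding each group needs.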

Case no padding (all groups have the same number of rows, or there are no groups)
X is sorted by the selected column and saved in the temporary Y. The selected column is removed and the temporary Y is reshaped into Y.
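A minimal Python sketch of this case, assuming that every group really has the same number of rows. `group_by_no_padding` is a hypothetical name, and returning the group keys separately from Y is a choice made for the illustration only.

```python
def group_by_no_padding(X, col):
    """Sort by the key column, drop it, and reshape to one row per group.

    Precondition (not checked here): all groups have the same number of rows.
    """
    tmp = sorted(X, key=lambda row: row[col])      # temporary Y, sorted by key
    keys = sorted({row[col] for row in X})
    per_group = len(X) // len(keys) if keys else 0
    # Remove the selected column and flatten the remaining values row by row.
    flat = [v for row in tmp for i, v in enumerate(row) if i != col]
    width = per_group * (len(X[0]) - 1) if X else 0
    # Reshape: every group gets one output row of `width` values.
    Y = [flat[g * width:(g + 1) * width] for g in range(len(keys))]
    return keys, Y


keys, Y = group_by_no_padding([[2, 6], [1, 5], [2, 8], [1, 7]], 0)
print(keys, Y)
```

No permutation matrix is involved here: the sort already places the rows of each group next to each other, so a plain reshape is sufficient.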

This benchmark uses just two columns whose values are the indexes from 1 to the respective number of rows, so there are no groups.

Case padding (groups have different numbers of rows)
In this case the distribution of the missing padding is computed and a two-column matrix is created, with zeros in one column and the groups in the other. To repeat the group values for the padding, a loop iterates over the groups that need padding. This was necessary to make the calculation more stable for larger data sets. This construct is appended to another two-column matrix that holds the extracted selected column in both columns. The combined matrix is sorted by group and the indexes are saved. Then the indexes whose value in the matrix was zero are removed. The result is the column index of the permutation matrix.
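The following Python sketch shows a simplified variant of the padding idea. It is an assumption-laden illustration rather than the DML code: instead of deriving the column index of P, it appends zero rows tagged with their group directly to the data, sorts by group, and reshapes. `group_by_with_padding` is a hypothetical name and the input is assumed to be non-empty.

```python
from collections import Counter


def group_by_with_padding(X, col):
    """Pad smaller groups with zero rows, sort by group, and reshape."""
    n_cols = len(X[0])
    freq = Counter(row[col] for row in X)
    max_rows = max(freq.values())

    padded = [list(row) for row in X]
    # Loop over the groups; only those smaller than the largest get padding.
    for key, n in freq.items():
        for _ in range(max_rows - n):
            pad = [0] * n_cols
            pad[col] = key           # tag the zero row with its group
            padded.append(pad)

    # Stable sort: padding rows were appended last, so they end up at the
    # end of their group.
    padded.sort(key=lambda row: row[col])

    keys = sorted(freq)
    flat = [v for row in padded for i, v in enumerate(row) if i != col]
    width = max_rows * (n_cols - 1)
    return keys, [flat[g * width:(g + 1) * width] for g in range(len(keys))]


keys, Y = group_by_with_padding(
    [[1, 10], [2, 20], [1, 30], [3, 40], [2, 50], [1, 60]], 0)
print(keys, Y)
```

After padding, every group has the same number of rows, so the final step is the same sort-and-reshape as in the no-padding case.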

Because hardly any data set has equally distributed groups, this case is considered the more important one.

The same benchmarking approach was used as for nested loop.

This is the number of groups per grouping column for the different numbers of rows:

columns / rows	10	50	100	1000	5000	10000	20000	30000
1	3	12	23	248	1223	2469	4932	7416
2	5	7	7	7	7	7	7	7
3	3	12	23	247	1180	2315	4332	6122
4	10	50	100	997	4942	9756	19005	27843
5	10	50	99	791	1845	1980	2000	2000

columns / rows	40000	50000	60000	70000	80000	90000	100000
1	9940	12440	14959	17444	19927	22387	24855
2	7	7	7	7	7	7	7
3	7706	9069	10255	11293	12196	12990	13702
4	36209	44201	51867	59099	65923	72418	78610
5	2000	2000	2000	2000	2000	2000	2000
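For reference, group counts like the ones in these tables can be obtained by counting the distinct values of each column within the first n rows. The helper below is a sketch with a hypothetical name and a tiny made-up input; it does not read the ssb-dbgen data.

```python
def group_counts(rows, sizes):
    """For each column (1-based), count distinct values in the first n rows."""
    n_cols = len(rows[0])
    return {
        c + 1: [len({r[c] for r in rows[:n]}) for n in sizes]
        for c in range(n_cols)
    }


# Tiny illustrative input, not taken from the benchmark data.
sample = [[1, 7], [2, 7], [1, 8], [3, 7]]
counts = group_counts(sample, [2, 4])
print(counts)
```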

The value -1 means the computation failed for this number of rows.

Baunsgaard · 2025-07-02T08:45:40Z

Hi @maxrankl

Please combine your pull requests into one, and rename the pull request title to a corresponding JIRA ticket.

I would also suggest using commit messages that use imperative language to describe what you did.

Thanks

maxrankl · 2025-07-02T09:57:40Z

Hi @Baunsgaard,

Thank you for your comment. I will improve the pull request this week. I just wanted to share the code for the LDE Project on time.

…e initial order of Y and then restoring it after copying. In order to copy the values it was necessary to also order X. Previously a copy matrix was needed, which can be avoided as well by saving and restoring the initial order of X.

codecov · 2025-07-21T21:25:20Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.51%. Comparing base (64455b9) to head (0188b4c).
⚠️ Report is 33 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #2284      +/-   ##
============================================
- Coverage     72.96%   72.51%   -0.45%     
- Complexity    46097    46322     +225     
============================================
  Files          1479     1491      +12     
  Lines        172654   174897    +2243     
  Branches      33796    34277     +481     
============================================
+ Hits         125970   126834     +864     
- Misses        37192    38500    +1308     
- Partials       9492     9563      +71

…ation matrix. It uses a loop to generate the padding to ensure robustness compared to the original version, which fails for bigger data sets. It performs better in most cases. It still needs some cleanup and renaming of variables for better comprehension.

…ation matrix. It uses a loop to generate the padding to ensure robustness compared to the original version, which fails for bigger data sets. It performs better in most cases. I added the comments and cleaned up the naming of the variables.

maxrankl added 5 commits June 29, 2025 23:46

Current status, unfortunately, it does not insert the values correctl…

f5d4f22

…y in raGroupby_exp1.dml

Added the Becnhmarking framework (Python)

caaa93d

Found the error, should beat the performance of nested loop, sorry fo…

a961759

…r the last pull request

Merge branch 'experiment1'

cc85eaf

Found the error

Found the error, should beat the performance of nested loop, sorry fo…

fed2aa4

…r the last pull request

github-project-automation bot added this to SystemDS PR Queue Jun 30, 2025

github-project-automation bot moved this to In Progress in SystemDS PR Queue Jun 30, 2025

Merge branch 'apache:main' into main

243ff1c

maxrankl added 2 commits July 4, 2025 21:45

Removed additional files. Copied content into the correct file. Added…

80293ce

… switch if the first and the last column ist selected. Passes now all provided tests.

maxrankl changed the title ~~Performance improvement for nested-loop~~ SystenDS-#3859 Improved Relational Algebra Builtin Functions Jul 6, 2025

maxrankl and others added 13 commits July 6, 2025 22:19

Removed print statement for debugging

a79420e

Removed print statement for debugging, forgot one

de96b3e

Merge branch 'apache:main' into main

a086101

commit to merge the fix of permutation matrix

3ffaa34

Added the Becnhmarking framework (Python)

f07c387

Current status, unfortunately, it does not insert the values correctl…

a93c2f5

…y in raGroupby_exp1.dml

Found the error, should beat the performance of nested loop, sorry fo…

4ac8268

…r the last pull request

Found the error, should beat the performance of nested loop, sorry fo…

59305fc

…r the last pull request

Removed additional files. Copied content into the correct file. Added…

a85d3d2

… switch if the first and the last column ist selected. Passes now all provided tests.

Finished merge for the ra_groupby permutation matrix fix

a669e42

commit to merge the fix of permutation matrix

66656c7

merged the changes of apache main into the main branch of this project

2e4a6ad

maxrankl added 3 commits July 24, 2025 11:08

alternative version of permuatation works except edge cases

09cc1d8

alternative version of permuatation works with edge cases, but needs …

81d4785

…cleanup

permutation matrix is not a real permutation amtrix anymore because i…

effc323

…t uses one loop to avoid heap overload for bigger data sets

maxrankl added 4 commits July 26, 2025 17:50

alternative version of permutation matrix sorts X and calculates the …

debdffc

…padding via matrix multiplikation. binding the padding to the existing X and sorting it according to their group. Then X gets reshaped to Y

Removed additional added files. Both implementations of the methods o…

f1ce29c

…f ra_groupby should beat performance in most cases and be more stable than the original implementation

maxrankl changed the title ~~SystenDS-#3859 Improved Relational Algebra Builtin Functions~~ [SYSTEMDS-3859]Improved Relational Algebra Builtin Functions Jul 30, 2025

maxrankl changed the title ~~[SYSTEMDS-3859]Improved Relational Algebra Builtin Functions~~ [SYSTEMDS-3859] Improved Relational Algebra Builtin Functions Jul 30, 2025

gaturchenko and others added 2 commits August 8, 2025 14:59

chore(raGroupBy): edits for the pull request

a3adfbc

Merge pull request #2 from gaturchenko/grigorii-revision

0188b4c

Final clean up for better code readability
