@maxrankl maxrankl commented Jun 30, 2025

This PR improves the runtime of the ra_groupby builtin function. Both methods (nested loop and permutation matrix) pass the tests provided by SystemDS.

I created an additional benchmarking test that uses a Star Schema Benchmark data set generator (ssb-dbgen). This is the link to the repo (https://github.com/eyalroz/ssb-dbgen). I modified the lineorder table for scale factor 1 and chose only the first five columns, because ra_groupby only allows numerical values. Then I compared the two functions for different numbers of rows (10, 50, 100, 1000, 5000, 10000). I used the Python API to run the test and visualize the results.

Nested loop
For an input matrix X (N x M) and an output matrix Y (N' x M), the previous implementation performed the following number of row iterations:

previous version = #groups * N

The improved version iterates over the entire input matrix X only once:

current version = N
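The two iteration counts above can be illustrated with a small sketch. This is not the DML code from the PR; `visits_nested` and `visits_sorted` are hypothetical helpers that only count how many rows each strategy touches. The count for the sorted variant ignores the cost of the sort itself, which is typically O(N log N).

```python
def visits_nested(keys):
    """Row visits when the whole input is rescanned once per group."""
    visits = 0
    for _group in set(keys):      # one outer iteration per distinct group
        for _key in keys:         # full scan of all N rows each time
            visits += 1
    return visits


def visits_sorted(keys):
    """Row visits when the input is sorted first and then scanned once."""
    visits = 0
    for _key in sorted(keys):     # a single pass over the N sorted rows
        visits += 1
    return visits


keys = [3, 1, 3, 2, 1, 3]         # N = 6 rows, 3 groups
print(visits_nested(keys), visits_sorted(keys))
```

With few groups the difference is small, but when the number of groups grows with N (as for most columns in the benchmark), the nested variant approaches quadratic behaviour.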

The idea of this implementation is to save the initial order of X and then sort X by the column chosen for grouping. Because X is then ordered by group, once the rows of one group have been selected for copying into Y, the function can continue directly with the next group without restarting the iteration at the beginning of X. The selected rows are then copied into Y.
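A minimal Python sketch of the sort-then-scan idea described above. It is not the DML implementation from this PR. The output layout (group key first, followed by the remaining values of all rows in the group, zero-padded to the largest group) is an assumption made for illustration, and `group_by_sorted` is a hypothetical name.

```python
def group_by_sorted(rows, col):
    """Group rows by rows[i][col] using one sort and one linear pass."""
    # Sort row indices by the grouping column. sorted() is stable, so rows
    # within a group keep their original relative order.
    order = sorted(range(len(rows)), key=lambda i: rows[i][col])

    groups = []  # list of (key, [row without the key column, ...])
    for i in order:
        row = rows[i]
        key = row[col]
        rest = row[:col] + row[col + 1:]
        if groups and groups[-1][0] == key:
            groups[-1][1].append(rest)    # same group: keep appending
        else:
            groups.append((key, [rest]))  # key changed: next group starts here

    # Flatten each group into one output row and zero-pad short groups.
    largest = max((len(members) for _, members in groups), default=0)
    n_rest = len(rows[0]) - 1 if rows else 0
    out = []
    for key, members in groups:
        flat = [v for member in members for v in member]
        flat += [0] * (largest * n_rest - len(flat))
        out.append([key] + flat)
    return out


print(group_by_sorted([[1, 10], [2, 20], [1, 30]], col=0))
```

Because the rows are already ordered by group, the loop never has to look back: a change of key is enough to know that the previous group is complete.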

Figure: LDE_NL_column1

The results for the other four columns are the same.

Permutation matrix
The previous implementation is based on a permutation matrix P that the input matrix X is multiplied with. This creates a temporary Y matrix, from which the selected column is removed; the temporary Y matrix is then reshaped into the dimensions of the final Y (N' x M).
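As a rough illustration of the permutation-matrix idea, the sketch below builds a square P from a sort order and multiplies it with X. It uses plain Python lists rather than SystemDS matrices, assumes the simplest case without padding rows, and `permutation_matrix` and `matmul` are hypothetical helpers.

```python
def permutation_matrix(order):
    """Return P with P[i][order[i]] = 1, so that (P @ X)[i] == X[order[i]]."""
    n = len(order)
    return [[1 if j == order[i] else 0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain dense matrix product of two list-of-lists matrices."""
    inner, cols = len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(len(a))]


X = [[2, 20], [1, 10], [2, 21]]
# Row order that sorts X by the grouping column (column 0 here).
order = sorted(range(len(X)), key=lambda i: X[i][0])
P = permutation_matrix(order)
print(matmul(P, X))  # rows of X, now ordered by group
```

Once the rows are ordered by group, removing the grouping column and reshaping yields one output row per group.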

The new implementation also uses the permutation matrix P, but creates it differently. While the initial version needs a comparison matrix (matrix multiplication and comparison operations) to create the column index for P, the new version creates P only when padding is needed. This is the case when the groups do not all contain the same number of rows. So the new version calculates the frequency of each group in the data set and then checks whether padding is needed.
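The frequency check can be sketched as follows. This is an illustrative Python version, not the DML from the PR, and `needs_padding` is a hypothetical name.

```python
from collections import Counter


def needs_padding(keys):
    """True if at least two groups contain a different number of rows."""
    counts = Counter(keys)                  # frequency of each group
    return len(set(counts.values())) > 1    # more than one distinct group size


print(needs_padding([1, 2, 3]))     # every key unique: all groups have size 1
print(needs_padding([1, 1, 2, 2]))  # two groups of equal size
print(needs_padding([1, 1, 2]))     # group sizes 2 and 1 differ
```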

Case: no padding (all groups have the same number of rows, or there are no groups)
X is sorted by the selected column and stored in the temporary Y. The selected column is removed and the temporary Y is reshaped into Y.

This benchmark uses just two columns whose values are the indexes from 1 to the respective number of rows, so every value is unique and there are no groups.
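A minimal sketch of the no-padding path, assuming that every group has exactly `group_size` rows and that the group key is kept as the first value of each output row (both are assumptions for illustration; `group_equal_sized` is a hypothetical name).

```python
def group_equal_sized(rows, col, group_size):
    """Sort by the grouping column, drop it, and reshape into one row per group."""
    ordered = sorted(rows, key=lambda r: r[col])   # stable sort by group
    out = []
    for start in range(0, len(ordered), group_size):
        block = ordered[start:start + group_size]  # the rows of one group
        key = block[0][col]
        # Concatenate the remaining columns of every row in the group.
        flat = [v for r in block for j, v in enumerate(r) if j != col]
        out.append([key] + flat)
    return out


print(group_equal_sized([[2, 5], [1, 7], [2, 6], [1, 8]], col=0, group_size=2))
```

No permutation matrix is needed here, because a sort followed by a reshape already puts every value into its final position.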

Figure: LDE_PM_nogroups

Case: padding (groups have different numbers of rows)
In this case, the number of missing padding rows per group is determined, and a two-column matrix is created with zeros in one column and the group values in the other. To repeat the group values for the padding, a loop iterates over the groups that need padding. This was necessary to make the calculation more stable for larger data sets. This construct is appended to another two-column matrix that holds the extracted selected column in both columns. The combined matrix is sorted by group and the resulting indexes are saved. Then the indexes whose value in the matrix was zero are removed. The result is the column index of the permutation matrix.

Because hardly any data set has equally sized groups, this case is considered the more important one.
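One possible reading of the padding steps, sketched in Python rather than DML. The marker values (1 for real rows, 0 for padding) and the name `padded_slots` are assumptions for illustration. The function returns, for every slot of the padded layout, the index of the source row in X, or `None` for a padding slot. The non-`None` entries correspond to the column indexes of the permutation matrix.

```python
from collections import Counter


def padded_slots(keys):
    """Per output slot: the source row index, or None for a padding slot."""
    counts = Counter(keys)
    target = max(counts.values()) if counts else 0  # size of the largest group

    # Real rows: (group, marker 1, source row index).
    entries = [(key, 1, i) for i, key in enumerate(keys)]

    # Loop only over the groups that are too short and add marker-0 rows.
    for group, count in counts.items():
        if count < target:
            entries.extend((group, 0, None) for _ in range(target - count))

    # Sort by group; inside a group, real rows (marker 1) precede padding.
    entries.sort(key=lambda e: (e[0], -e[1]))
    return [source for _, _, source in entries]


slots = padded_slots([2, 1, 2, 3])
print(slots)
# Column indexes for the permutation matrix: drop the padding slots.
print([s for s in slots if s is not None])
```

After this step every group occupies the same number of slots, so the padded matrix can be reshaped in the same way as in the no-padding case.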

The same benchmarking approach was used as for nested loop.

This is the number of groups per grouping column for the different numbers of rows:

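| Column / Rows | 10 | 50 | 100 | 1000 | 5000 | 10000 | 20000 | 30000 |
|---|---|---|---|---|---|---|---|---|
| 1 | 3 | 12 | 23 | 248 | 1223 | 2469 | 4932 | 7416 |
| 2 | 5 | 7 | 7 | 7 | 7 | 7 | 7 | |
| 3 | 3 | 12 | 23 | 247 | 1180 | 2315 | 4332 | 6122 |
| 4 | 10 | 50 | 100 | 997 | 4942 | 9756 | 19005 | 27843 |
| 5 | 10 | 50 | 99 | 791 | 1845 | 1980 | 2000 | 2000 |

| Column / Rows | 40000 | 50000 | 60000 | 70000 | 80000 | 90000 | 100000 |
|---|---|---|---|---|---|---|---|
| 1 | 9940 | 12440 | 14959 | 17444 | 19927 | 22387 | 24855 |
| 2 | 7 | 7 | 7 | 7 | 7 | 7 | 7 |
| 3 | 7706 | 9069 | 10255 | 11293 | 12196 | 12990 | 13702 |
| 4 | 36209 | 44201 | 51867 | 59099 | 65923 | 72418 | 78610 |
| 5 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 |

Figures: LDE_PM_column1, LDE_PM_column2, LDE_PM_column3, LDE_PM_column4, LDE_PM_column5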

A value of -1 means the computation failed for that number of rows.
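A timing harness that records -1 for failed runs could look roughly like this. It is a generic Python sketch, not the benchmark script used for the plots; `time_or_fail` is a hypothetical helper and the function being timed is a stand-in for the actual ra_groupby call through the Python API.

```python
import time


def time_or_fail(fn, *args):
    """Return the elapsed seconds of fn(*args), or -1 if the call raises."""
    start = time.perf_counter()
    try:
        fn(*args)
    except Exception:
        return -1   # sentinel that shows up as -1 in the plot
    return time.perf_counter() - start


print(time_or_fail(sorted, [3, 1, 2]))   # some small non-negative duration
print(time_or_fail(lambda: 1 / 0))       # the call fails, so -1 is recorded
```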

@Baunsgaard
Contributor

Hi @maxrankl

Please combine your pull requests into one, and rename the pull request title to a corresponding JIRA ticket.

I would also suggest using commit messages that use imperative language to describe what you did.

Thanks

@maxrankl
Author

maxrankl commented Jul 2, 2025

Hi @Baunsgaard,

Thank you for your comment. I will improve the pull request this week. I just wanted to share the code for the LDE Project on time.

maxrankl added 2 commits July 4, 2025 21:45
…e initial order of Y and then restoring it after copying. To copy the values, it was also necessary to order X. Previously a copy matrix was needed, which can also be avoided by saving and restoring the initial order of X
… switch if the first and the last column is selected. Now passes all provided tests.
@maxrankl maxrankl changed the title Performance improvement for nested-loop SystenDS-#3859 Improved Relational Algebra Builtin Functions Jul 6, 2025

codecov bot commented Jul 21, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.51%. Comparing base (64455b9) to head (0188b4c).
⚠️ Report is 33 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2284      +/-   ##
============================================
- Coverage     72.96%   72.51%   -0.45%     
- Complexity    46097    46322     +225     
============================================
  Files          1479     1491      +12     
  Lines        172654   174897    +2243     
  Branches      33796    34277     +481     
============================================
+ Hits         125970   126834     +864     
- Misses        37192    38500    +1308     
- Partials       9492     9563      +71     


maxrankl added 4 commits July 26, 2025 17:50
…padding via matrix multiplication. Binding the padding to the existing X and sorting it according to the group. Then X gets reshaped to Y
…ation matrix. It uses a loop to generate the padding to ensure robustness compared to the original version, which fails for bigger data sets. It outperforms the original in most cases. It still needs some cleanup and renaming of variables for better comprehension.
…ation matrix. It uses a loop to generate the padding to ensure robustness compared to the original version, which fails for bigger data sets. It outperforms the original in most cases. I added the comments and cleaned up the naming of the variables.
…f ra_groupby should outperform the original implementation in most cases and be more stable
@maxrankl maxrankl changed the title SystenDS-#3859 Improved Relational Algebra Builtin Functions [SYSTEMDS-3859]Improved Relational Algebra Builtin Functions Jul 30, 2025
@maxrankl maxrankl changed the title [SYSTEMDS-3859]Improved Relational Algebra Builtin Functions [SYSTEMDS-3859] Improved Relational Algebra Builtin Functions Jul 30, 2025