[SYSTEMDS-3859] Improved Relational Algebra Builtin Functions #2284
…y in raGroupby_exp1.dml
…r the last pull request
Found the error
…r the last pull request
Hi @maxrankl, please combine your pull requests into one, and rename the pull request title to a corresponding JIRA ticket. I would also suggest using commit messages that use imperative language to describe what you did. Thanks
Hi @Baunsgaard, thank you for your comment. I will improve the pull request this week. I just wanted to share the code for the LDE Project on time.
…e initial order of Y and then restoring it after copying. To copy the values, it was also necessary to order X. Previously a copy matrix was needed, which can also be avoided by saving and restoring the initial order of X
… switch if the first and the last column is selected. Now passes all provided tests.
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@             Coverage Diff              @@
##               main    #2284      +/-   ##
============================================
- Coverage     72.96%   72.51%   -0.45%
- Complexity    46097    46322     +225
============================================
  Files          1479     1491      +12
  Lines        172654   174897    +2243
  Branches      33796    34277     +481
============================================
+ Hits         125970   126834     +864
- Misses        37192    38500    +1308
- Partials       9492     9563      +71
…t uses one loop to avoid heap overload for bigger data sets
…padding via matrix multiplication. Binding the padding to the existing X and sorting it according to its group. Then X gets reshaped to Y
…ation matrix. It uses a loop to generate the padding to ensure robustness compared to the original version, which fails for bigger data sets. It beats performance in most cases. It still needs some cleanup and renaming of variables for better comprehension.
…ation matrix. It uses a loop to generate the padding to ensure robustness compared to the original version, which fails for bigger data sets. It beats performance in most cases. I added the comments and cleaned up the naming of the variables.
…f ra_groupby should beat performance in most cases and be more stable than the original implementation
Final clean up for better code readability
This improves the runtime of ra_group_by in the builtin functions. Both methods pass the tests provided by SystemDS.
I created an additional benchmarking test that uses a star schema benchmark data set generator (ssb-dbgen). This is the link to the repo (https://github.com/eyalroz/ssb-dbgen). I modified the lineorder table for scale factor 1 and chose only the first five columns, because ra_groupby only allows numerical values. Then I compared the two functions for different numbers of rows (10, 50, 100, 1000, 5000, 10000). I used the Python API to run the tests and visualize the results.
Nested loop
For an input matrix X (N x M) and an output matrix Y (N' x M), the previous implementation performed the following number of iterations:
previous version = #groups * N
The improved version iterates only once over the entire input matrix X:
current version = N
The idea of this implementation is to save the initial order of X and then sort X by the column chosen for grouping. Because X is then ordered by group, once the rows to copy into Y have been selected for one group, the function can continue directly at the next group without restarting the iteration at the beginning of X. The selected rows are then copied into Y.
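The difference between the two iteration schemes can be sketched in plain Python. This is only an illustration under assumptions, not the DML code of this pull request: the function names are hypothetical, columns are 0-based, and the output layout (one row per group, the group key first, then the remaining columns of each member row, zero-padded to the largest group) is assumed.

```python
def pad(keys, groups):
    """Build Y: key first, then the flattened group rows, zero-padded to equal width."""
    width = max((len(g) for g in groups), default=0)
    return [[k] + g + [0] * (width - len(g)) for k, g in zip(keys, groups)]


def groupby_rescan(X, col):
    """Previous scheme: one full scan of X per group (#groups * N row visits)."""
    keys = sorted({row[col] for row in X})
    groups = []
    for key in keys:
        flat = []
        for row in X:  # restarts at the top of X for every group
            if row[col] == key:
                flat.extend(row[:col] + row[col + 1:])
        groups.append(flat)
    return pad(keys, groups)


def groupby_sorted_scan(X, col):
    """Improved scheme: sort once, then walk the rows a single time (N row visits)."""
    # Sorting row indexes keeps X itself untouched; the sort is stable, so rows
    # keep their input order inside each group.
    order = sorted(range(len(X)), key=lambda i: X[i][col])
    keys, groups = [], []
    for i in order:
        row = X[i]
        if not keys or row[col] != keys[-1]:
            # A new group starts exactly where the previous one ended.
            keys.append(row[col])
            groups.append([])
        groups[-1].extend(row[:col] + row[col + 1:])
    return pad(keys, groups)
```

Both functions should return the same Y for the same input. The second one pays for a single sort up front and then never revisits a row, which is where the reduction from #groups * N to N row visits comes from.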
The result is the same for the other four columns.
Permutation matrix
The previous implementation multiplies the input matrix X by a permutation matrix P. This creates a temporary Y matrix, from which the selected column is removed; the temporary Y is then reshaped into the dimensions of the final Y (N' x M).
The new implementation also uses the permutation matrix P, but creates it differently. While the initial version needs a comparison matrix (matrix multiplication and comparison operations) to create the column index for P, the new version only creates P when padding is needed. This is the case when the groups do not all contain the same number of rows. The new version therefore calculates the frequency of each group in the data set and then checks whether padding is needed.
Case: no padding (groups have the same number of rows, or there are no groups)
X is sorted by the selected column and saved in the temporary Y. The selected column is removed and the temporary Y is reshaped into Y.
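A minimal plain-Python sketch of this case, including the frequency check that decides whether padding is needed. The function name is hypothetical, columns are 0-based, and placing the group key in the first output column is an assumption about the output layout rather than something stated here.

```python
from collections import Counter


def groupby_equal_groups(X, col):
    """Group X by column `col` when every group has the same number of rows."""
    counts = Counter(row[col] for row in X)  # frequency of each group
    if len(set(counts.values())) > 1:
        raise ValueError("group sizes differ, so padding would be needed")
    size = next(iter(counts.values()), 0)  # rows per group (0 for empty input)

    rows = sorted(X, key=lambda r: r[col])  # stable sort by the grouping column
    Y = []
    for start in range(0, len(rows), size or 1):
        chunk = rows[start:start + size]
        # Drop the grouping column and flatten the group's rows into one row.
        flat = [v for r in chunk for v in r[:col] + r[col + 1:]]
        Y.append([chunk[0][col]] + flat)
    return Y
```

When every key is unique (no groups), each group has size one and the result is simply X sorted by the selected column.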
This benchmark uses just two columns whose values are the row indexes, from 1 to the respective number of rows. So there are no groups.
Case: padding (groups have different numbers of rows)
In this case the distribution of the missing padding is computed and a two-column matrix is created, with zeros in one column and the group values in the other. To repeat the group values for the padding, a loop iterates over the groups that need padding. This was necessary to make the calculation more stable for larger data sets. This construct is appended to another two-column matrix that holds the extracted selected column in both columns. The combined matrix is sorted by group and the resulting indexes are saved. Then the indexes whose value in the matrix was zero are removed. The result is the column index of the permutation matrix.
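The padding idea can be illustrated with a small plain-Python sketch. It follows the description above only loosely and is not the DML code of this pull request: the helper names are hypothetical, padding slots are marked with None instead of zeros, and P is built as a dense list of lists purely for readability.

```python
from collections import Counter


def padded_slots(keys_column):
    """For every slot of the padded layout, give the source row index or None (padding)."""
    counts = Counter(keys_column)
    width = max(counts.values(), default=0)  # rows in the largest group
    # One entry per real row: (group key, padding flag, source row index).
    entries = [(k, 0, i) for i, k in enumerate(keys_column)]
    for k, n in counts.items():
        if n < width:  # loop only over the groups that need padding
            entries += [(k, 1, None)] * (width - n)
    # Stable sort: by group, real rows before padding, input order preserved.
    entries.sort(key=lambda e: (e[0], e[1]))
    return [e[2] for e in entries], width


def groupby_padded(X, col):
    """Group X by column `col` via a selection matrix P with zero rows as padding."""
    slots, width = padded_slots([row[col] for row in X])
    rest = [row[:col] + row[col + 1:] for row in X]  # X without the grouping column
    ncols = len(rest[0]) if rest else 0
    n = len(X)
    # Row s of P selects source row slots[s]; a padding slot is an all-zero row.
    P = [[1 if src == j else 0 for j in range(n)] for src in slots]
    T = [[sum(p[j] * rest[j][c] for j in range(n)) for c in range(ncols)] for p in P]
    keys = sorted(set(row[col] for row in X))
    # Reshape: every block of `width` rows of T becomes one output row.
    return [
        [key] + [v for r in T[g * width:(g + 1) * width] for v in r]
        for g, key in enumerate(keys)
    ]
```

For the grouping column [1, 2, 1, 3, 1], the slots are rows 0, 2, 4 for group 1, then row 1 plus two padding slots for group 2, then row 3 plus two padding slots for group 3. A real implementation would keep P sparse; the dense product here only makes the role of the zero rows visible.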
Because hardly any data set has equally distributed groups, this case is considered the more important one.
The same benchmarking approach was used as for the nested loop.
This is the number of groups for the different numbers of rows:
The value -1 means the computation failed for this amount of rows.