Skip to content

Conversation

Isso-W
Copy link

@Isso-W Isso-W commented Jun 19, 2025

This pull request introduces a new framework for column decoding in Apache SystemDS, with the addition of a base class ColumnDecoder and several specialized implementations (ColumnDecoderBin, ColumnDecoderComposite, ColumnDecoderRecode,ColumnDecoderPassthrough and ColumnDecoderDummycode). These changes provide a flexible and extensible structure for decoding encoded data in matrix-to-frame transformations. Below are the most important changes grouped by theme:

Core Framework for Column Decoding

  • Added ColumnDecoder as an abstract base class to define the structure for decoding operations, including methods for decoding (columnDecode), handling sub-range decoding (subRangeDecoder), and metadata initialization (initMetaData). It also implements Externalizable for efficient serialization.

Current Issues

  • ColumnDecoderDummycode ist not supported yet, as well as the test case ColumnDecoderMixedMethodsTest
  • There are still some issues with DecoderDummyCode, it do not work together with DecoderRecode

Copy link

codecov bot commented Jun 20, 2025

Codecov Report

Attention: Patch coverage is 52.98913% with 173 lines in your changes missing coverage. Please review.

Project coverage is 72.55%. Comparing base (11c5a74) to head (64651ec).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
.../runtime/transform/decode/ColumnDecoderRecode.java 45.33% 39 Missing and 2 partials ⚠️
.../sysds/runtime/transform/decode/ColumnDecoder.java 35.71% 27 Missing ⚠️
...runtime/transform/decode/ColumnDecoderFactory.java 65.71% 23 Missing and 1 partial ⚠️
...sds/runtime/transform/decode/ColumnDecoderBin.java 57.40% 21 Missing and 2 partials ⚠️
...ntime/transform/decode/ColumnDecoderDummycode.java 55.76% 20 Missing and 3 partials ⚠️
...ntime/transform/decode/ColumnDecoderComposite.java 52.27% 21 Missing ⚠️
...ime/transform/decode/ColumnDecoderPassThrough.java 47.61% 11 Missing ⚠️
...sds/runtime/transform/decode/DecoderComposite.java 0.00% 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2275      +/-   ##
============================================
- Coverage     72.58%   72.55%   -0.04%     
- Complexity    46221    46275      +54     
============================================
  Files          1489     1496       +7     
  Lines        174193   174561     +368     
  Branches      34182    34232      +50     
============================================
+ Hits         126434   126646     +212     
- Misses        38196    38347     +151     
- Partials       9563     9568       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@phaniarnab
Copy link
Contributor

Thank you for the patch. I will take a look into this next week @Isso-W.
Meanwhile, please add the missing license headers to the new files.

import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.util.*;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid all import. Only import the required classes.

Comment on lines 69 to 75
for( int j=0; j<_colList.length; j++ ) {
int colID = _colList[j];
double val = UtilFunctions.objectToDouble(
out.getSchema()[colID-1], out.get(i, colID-1));
long key = UtilFunctions.toLong(val);
out.set(i, colID-1, getRcMapValue(j, key));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you iterating all the columns? A column decoder should be called for each column.

Comment on lines +74 to +80
if( _onOut ) { //recode on output (after dummy)
for( int i=rl; i<ru; i++ ) {
for( int j=0; j<_colList.length; j++ ) {
int colID = _colList[j];
double val = UtilFunctions.objectToDouble(
out.getSchema()[colID-1], out.get(i, colID-1));

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove these empty lines that you added.

Comment on lines 40 to 45
protected int[] _colList;
protected String[] _colnames = null;
protected ColumnDecoder(ValueType[] schema, int[] colList) {
_schema = schema;
_colList = colList;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why a column list? A column encoder should work on a single column.

long b1 = System.nanoTime();
out.ensureAllocatedColumns(in.getNumRows());

final int outColIndex = _colList[0] - 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the outColIndex is always _colList[0] - 1?

Comment on lines 108 to 119
for (int j = 0; j < _colList.length; j++) {
double val = in.get(i, j);
if (!Double.isNaN(val)) {
int key = (int) Math.round(val);
double bmin = _binMins[j][key - 1];
double bmax = _binMaxs[j][key - 1];
double oval = bmin + (bmax - bmin) / 2 + (val - key) * (bmax - bmin);
out.getColumn(_colList[j] - 1).set(i, oval);
} else {
out.getColumn(_colList[j] - 1).set(i, val);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why you are iterating all columns in a column decoder.

Comment on lines 66 to 73
for( int j=0; j<_colList.length; j++ )
for( int k=_clPos[j]; k<_cuPos[j]; k++ )
if( in.get(i, k-1) != 0 ) {
int col = _colList[j] - 1;
Object val = UtilFunctions.doubleToObject(out.getSchema()[col], k-_clPos[j]+1);
synchronized(out) { out.set(i, col, val); }
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A column decoder should work on a single column that is provided.

@phaniarnab
Copy link
Contributor

@Isso-W, please address the comments. And fix the tests as well. Your tests are failing in the transform package, which you should be able to reproduce.

@phaniarnab
Copy link
Contributor

The latest changes look good @Isso-W.
Please make sure all the unit tests pass and we get a green tick.
As discussed, for the remaining time, focus on extending the FTBench benchmark with decoding scripts and report performance improvement by multithreading for those use cases.

@phaniarnab
Copy link
Contributor

@Isso-W, can you please post your plots of FTBench (decoding) here for others to see?

@Isso-W
Copy link
Author

Isso-W commented Aug 10, 2025

Sure @phaniarnab, here is the plot.
Full decode means the time cost for whole decode function in dml
Pure decoder means the time cost for decoder part in decoder function in Java code
And there is an additional comparison for these both.

FTbench pure decoder FTbench full decode FTbench compare

I also created a new test using part of FTbench T9 by running DML, I can also put that test with responding json in PR to make the plot reproduciable. B.t.w. will this PR be merged into main?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

3 participants