-
Notifications
You must be signed in to change notification settings - Fork 55
Added ICU charset conversion implementation #64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
8809cd5
10defe7
827c164
362f7e9
8b638bb
4e4cbfd
eafadfb
8abbee4
9b06bf1
340d545
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -1197,6 +1197,112 @@ std::string kaitai::kstream::bytes_to_str(const std::string src, int codepage) { | |||||
return utf8; | ||||||
} | ||||||
|
||||||
#elif defined(KS_STR_ENCODING_ICU) | ||||||
#include <unicode/ucnv.h> | ||||||
#include <unicode/ustring.h> | ||||||
|
||||||
std::string kaitai::kstream::bytes_to_str(const std::string src, const char *src_enc) { | ||||||
UErrorCode err = U_ZERO_ERROR; | ||||||
|
||||||
// Open the source converter | ||||||
UConverter* conv = ucnv_open(src_enc, &err); | ||||||
if (U_FAILURE(err)) { | ||||||
if (err == U_FILE_ACCESS_ERROR) { | ||||||
throw unknown_encoding(src_enc); | ||||||
} | ||||||
throw bytes_to_str_error(u_errorName(err)); | ||||||
} | ||||||
|
||||||
// Open UTF-8 converter | ||||||
UConverter* utf8Conv = ucnv_open("UTF-8", &err); | ||||||
if (U_FAILURE(err)) { | ||||||
ucnv_close(conv); | ||||||
throw bytes_to_str_error(u_errorName(err)); | ||||||
} | ||||||
|
||||||
// Configure source converter to stop on illegal sequences | ||||||
err = U_ZERO_ERROR; | ||||||
ucnv_setToUCallBack( | ||||||
conv, | ||||||
UCNV_TO_U_CALLBACK_STOP, | ||||||
nullptr, | ||||||
nullptr, | ||||||
nullptr, | ||||||
&err); | ||||||
if (U_FAILURE(err)) { | ||||||
ucnv_close(conv); | ||||||
ucnv_close(utf8Conv); | ||||||
throw illegal_seq_in_encoding(u_errorName(err)); | ||||||
} | ||||||
|
||||||
// Allocate buffer for UTF-16 intermediate representation | ||||||
const int32_t uniStrCapacity = UCNV_GET_MAX_BYTES_FOR_STRING(src.length(), ucnv_getMaxCharSize(conv)); | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is
Whereas here we use it for a target buffer for conversion to Unicode (UTF-16), which is the opposite. The examples of returned values also illustrate the possible problem nicely:
SBCS apparently stands for single-byte character set (I've never heard this acronym before). This makes sense: if we take ISO-8859-1 as an example, then any It seems that this mismatch in meaning wasn't caught by our tests because you're also using
So the first parameter of the macro is There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. According to the docs for
So we should probably use this instead. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm, after thinking about it for a while, I've convinced myself that this calculated capacity should (coincidentally) always be enough, even if the calculation is logically incorrect and should definitely be fixed. In theory, the only scenario in which it could underallocate memory would be for a SBCS (single-byte character set) with some character outside the BMP (Basic Multilingual Plane) - then However, such SBCS seemingly doesn't exist, because BMP pretty much covers all the common characters that any sane (real-world) SBCS would use. The most popular supplementary characters (U+10000 and above) are emojis, but of course there is no real-world SBCS with emojis, since all SBCSs predate emojis by a lot. So underallocation isn't a problem here, but overallocation by a certain factor is quite likely. For example, if we're decoding a UTF-8 string, then
So |
||||||
UChar* uniStr = new UChar[uniStrCapacity]; | ||||||
|
||||||
// Convert from source encoding to UTF-16 | ||||||
err = U_ZERO_ERROR; | ||||||
int32_t uniLength = ucnv_toUChars( | ||||||
conv, | ||||||
uniStr, | ||||||
uniStrCapacity, | ||||||
src.c_str(), | ||||||
src.length(), | ||||||
&err); | ||||||
if (U_FAILURE(err)) { | ||||||
delete[] uniStr; | ||||||
ucnv_close(conv); | ||||||
ucnv_close(utf8Conv); | ||||||
throw illegal_seq_in_encoding(u_errorName(err)); | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is the only place where Therefore, |
||||||
} | ||||||
|
||||||
// Configure target converter to stop on illegal sequences | ||||||
err = U_ZERO_ERROR; | ||||||
ucnv_setFromUCallBack( | ||||||
utf8Conv, | ||||||
UCNV_FROM_U_CALLBACK_STOP, | ||||||
nullptr, | ||||||
nullptr, | ||||||
nullptr, | ||||||
&err); | ||||||
if (U_FAILURE(err)) { | ||||||
delete[] uniStr; | ||||||
ucnv_close(conv); | ||||||
ucnv_close(utf8Conv); | ||||||
throw illegal_seq_in_encoding(u_errorName(err)); | ||||||
} | ||||||
|
||||||
// Allocate buffer for UTF-8 output | ||||||
const int32_t dstCapacity = UCNV_GET_MAX_BYTES_FOR_STRING(uniLength, ucnv_getMaxCharSize(utf8Conv)); | ||||||
char* dst = new char[dstCapacity]; | ||||||
|
||||||
// Convert from UTF-16 to UTF-8 | ||||||
err = U_ZERO_ERROR; | ||||||
int32_t outputLength = ucnv_fromUChars( | ||||||
utf8Conv, | ||||||
dst, | ||||||
dstCapacity, | ||||||
uniStr, | ||||||
uniLength, | ||||||
&err); | ||||||
if (U_FAILURE(err)) { | ||||||
delete[] uniStr; | ||||||
delete[] dst; | ||||||
ucnv_close(conv); | ||||||
ucnv_close(utf8Conv); | ||||||
throw illegal_seq_in_encoding(u_errorName(err)); | ||||||
} | ||||||
|
||||||
// Create result string | ||||||
std::string result(dst, outputLength); | ||||||
|
||||||
// Clean up | ||||||
delete[] uniStr; | ||||||
delete[] dst; | ||||||
ucnv_close(conv); | ||||||
ucnv_close(utf8Conv); | ||||||
|
||||||
return result; | ||||||
} | ||||||
#else | ||||||
#error Need to decide how to handle strings: please define one of: KS_STR_ENCODING_ICONV, KS_STR_ENCODING_WIN32API, KS_STR_ENCODING_NONE | ||||||
#error Need to decide how to handle strings: please define one of: KS_STR_ENCODING_ICONV, KS_STR_ENCODING_WIN32API, KS_STR_ENCODING_ICU, KS_STR_ENCODING_NONE | ||||||
#endif |
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think these
err = U_ZERO_ERROR
assignments are necessary, because we check the value oferr
after every call to an ICU function and throw an exceptionif (U_FAILURE(err))
. Therefore, we know that if we get past anyif (U_FAILURE(err)) { ... }
block, we know thaterr
indicates success.Strictly speaking, it might not be equal to
U_ZERO_ERROR = 0
, because negativeerr
values are used for warnings andU_ZERO_ERROR
is only used when there is no warnings. But warnings don't represent errors, so they will not cause ICU functions to exit if they see one - only errors do, seeUErrorCode
API docs:Note that
U_FAILURE
is defined like this -icu4c/source/common/unicode/utypes.h:713-717
:Bottom line, I'd remove them because they are unnecessary (to be clear, we still need to initialize the variable as
UErrorCode err = U_ZERO_ERROR;
at the beginning, but then we don't have to worry about it anymore).