Unicode bug? Non-English characters in install path fails
by Matthew Jessick · in Torque Game Engine · 02/20/2008 (12:20 pm) · 7 replies
I wanted to add some more visibility to the bug first reported here:
www.garagegames.com/mg/forums/result.thread.php?qt=72017
Using WinXP, if non-English characters (presumably those above ASCII 127) are used in the install path, the resource manager apparently can't find any zip files in the install folder, so the MOD can't be found if it is zipped.
If this bug persists, anyone who installs my game into a folder using these higher characters will have a very short user experience (can't find any mods, then quits.) Anyone using a language where Windows "program files" gets translated into a string that contains any higher characters would presumably have the same problem.
For example, a game installed below this folder:
(DELETED - forum can't take the non-English characters ;( )
would result in zipped mods not being found.
(I'm told that in Spanish, this folder has something to do with babies and storks.)
One part of the problem may be: recurseDumpPath in winFileio.cc fails the call to FindFirstFile.
INVALID_HANDLE_VALUE is returned,
and GetLastError() == 3 == ERROR_PATH_NOT_FOUND
The UNICODE macro is defined. Presumably the FindFirstFileW variant is being called. In any event, changing the call to FindFirstFileW specifically results in the same behavior.
It seems odd that the system can find the top level main.cs, yet not the modName.zip file in the working directory. Unzipping the mods works, and the game loads successfully.
The ERROR_PATH_NOT_FOUND seems very odd, suggesting that perhaps the Unicode UTF-8 to UTF-16 conversion had some problem. I glanced at the UTF-16 version of the path and it seems to have bumped up the four instances of the code values for the higher characters into large two-byte integer values as expected. I would assume that any problem in this function would have been noticed before, however. Weird.
As you can see, I have mucked around a bit looking into this, but my Unicode is weak and severe depression beckons.
#2
02/20/2008 (4:01 pm)
I believe this weirdness comes from Platform::getWorkingDirectory() getting the working directory as ANSI (see the use of GetCurrentDirectoryA below), while recurseDumpPath assumes it is valid UTF-8. Unfortunately, the codes > 127 used for special Spanish characters (for example) are NOT preserved in UTF-8. For valid UTF-8, I believe they would be multibyte. The UTF-8 to UTF-16 translations fall into failure branches for these miscoded characters, so FindFirstFileW fails because of an invalid UTF-16 path.
From winFileio.cc:
StringTableEntry Platform::getWorkingDirectory()
{
   static StringTableEntry cwd = NULL;
   if (!cwd)
   {
      char cwd_buf[2048];
      GetCurrentDirectoryA(2047, cwd_buf);
      forwardslash(cwd_buf);
      cwd = StringTable->insert(cwd_buf);
   }
   return cwd;
}
A full fix for the zip'd mod finding problem might be to read the current directory as UTF-16 (?) on Windows UNICODE builds, saving it in the cwd stringtable as UTF-8. But I can imagine some problems if certain code assumes that all the characters are single byte. For example, the forwardslash scanning above could conceivably break, depending on the specific codes and how it was written.
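To make the encoding mismatch described in this post concrete, here is a minimal sketch of a structural UTF-8 check. The function name looksLikeUtf8 is hypothetical and is not part of Torque; it only tests lead and continuation byte patterns, so it is a simplified check rather than a full validator.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a Torque function): returns true if the bytes in s
// follow the UTF-8 lead/continuation pattern. Simplified: it does not reject
// every overlong form or encoded surrogate.
bool looksLikeUtf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size())
    {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;                  // continuation bytes expected
        if (lead < 0x80)                       extra = 0;
        else if (lead >= 0xC2 && lead <= 0xDF) extra = 1;
        else if (lead >= 0xE0 && lead <= 0xEF) extra = 2;
        else if (lead >= 0xF0 && lead <= 0xF4) extra = 3;
        else return false;                      // stray continuation or invalid lead

        if (s.size() - i <= extra) return false; // sequence is cut off

        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false; // not a 10xxxxxx byte
        }
        i += extra + 1;
    }
    return true;
}
```

In Latin-1 the letter ñ is the single byte 0xF1, while in UTF-8 it is the two bytes 0xC3 0xB1. A path holding the lone 0xF1 byte fails this check, which is consistent with the conversion failure described above, whereas the two-byte form passes.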
#3
02/21/2008 (9:32 am)
I have fixed this "Zip'd mod finding bug" in my code (only tested for Windows) by changing the winFileio.cc routine Platform::getWorkingDirectory() to return a UTF-8 encoded path, rather than the previous ANSI (Latin-1?) encoded path. This likely increases the size of the path slightly, since characters with diacritical marks, like my Spanish test character, are encoded as two-byte UTF-8 codes. While this appears to work for me without introducing more errors (so far!), Platform::getWorkingDirectory() is used about 20 times in the code. I have not yet analyzed and regression tested all impacts of this change. Use at your own risk!
This fix replaces the winFileio.cc Platform::getWorkingDirectory() method (see stock method above) with a UTF-8 version.
There could well be a more efficient way to implement this: please improve. (However, the two large temporary buffers are only used once per game run then discarded.)
//MVJ fix to Zip'd resource finding bug.
// see: http://www.garagegames.com/mg/forums/result.thread.php?qt=72017
// and: http://www.garagegames.com/mg/forums/result.thread.php?qt=72297
//
// FIX: read current working directory as UTF-16 (Native form of Microsoft Windows 2000/XP/2003/Vista/CE)
// and store it as UTF-8. This converts the previous single byte values of the Latin-1 encoding > 127
// into two byte UTF-8 sequences.
//
// This fixes this particular bug because the Zip finding routines (e.g.: see recurseDumpPath above)
// use UTF-16 Windows routines and convert from UTF-8 to UTF-16 internally to make the Windows calls.
// so Platform::getWorkingDirectory() returning UTF-8 (inserted in the function CALL)
// works properly for these routines.
//
// WARNING: The approximately 20 uses of Platform::getWorkingDirectory()
// have not been adequately regression tested! This fix works for me,
// without regressive errors (so far) - use at your own risk!
//
//
/// @return UTF8 encoded working directory. (Used to be Latin-1-ish)
StringTableEntry Platform::getWorkingDirectory()
{
   static StringTableEntry cwdUTF8 = NULL;
   if (!cwdUTF8)
   {
      // Microsoft Windows 2000/XP/2003/Vista/CE and future, are native UTF-16
      char cwd_buf_UTF16[2048];
      GetCurrentDirectoryW(2047, (UTF16*)cwd_buf_UTF16); // W variant is the UTF-16 function
      char cwd_buf_UTF8[2048];
      convertUTF16toUTF8((UTF16 *)cwd_buf_UTF16, (UTF8 *)cwd_buf_UTF8, sizeof(cwd_buf_UTF8));
      forwardslash(cwd_buf_UTF8); // works transparently on UTF-8 encoded text (probably not on UTF-16)
      cwdUTF8 = StringTable->insert(cwd_buf_UTF8);
   }
   return cwdUTF8;
}
#4
02/23/2008 (3:34 am)
This looks pretty good, but I think you have a potential crash bug with buffer sizes. GetCurrentDirectoryW() expects the first parameter to be number of TCHARs in the buffer, and for Unicode, a TCHAR is 16 bits, not 8. So you are telling it your buffer is 2x what is really allocated.
Also, the longest theoretically possible UTF8 string byte count is 3 * UTF16 string byte count, or 6 * UTF16 char count, and + 1 byte for nul terminator. So in extreme cases your conversion buffer for the UTF8 may not be long enough and the string will be truncated. In practical terms this won't happen very often -- but it doesn't really hurt to make the buffer bigger, either.
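The bytes-versus-characters point can be checked with a little arithmetic. The sketch below uses char16_t as a portable stand-in for a 16-bit Windows wide character, and both helper names are hypothetical, introduced only for illustration.

```cpp
#include <cstddef>

// Hypothetical helper: how many 16-bit code units fit in a byte buffer.
// char16_t stands in for the Windows wide character type here.
constexpr std::size_t utf16Capacity(std::size_t byteCount)
{
    return byteCount / sizeof(char16_t);
}

// Hypothetical helper: element count of a fixed-size array, whatever its
// element type, so the length passed to an API matches the allocation.
template <typename T, std::size_t N>
constexpr std::size_t elementCount(const T (&)[N])
{
    return N;
}
```

With these, utf16Capacity(2048) is 1024, so a char buffer of 2048 bytes holds about half of the 2047 characters the earlier call claimed. Declaring the buffer with the wide element type and passing elementCount(buffer) as the length keeps the two numbers in step.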
#5
02/23/2008 (9:04 am)
Good points, thanks! But I believe the longest possible UTF-8 byte sequence is 4 bytes. (One could probably make a case for only 3 because of the current Torque Unicode UTF-16 implementation's BMP-only constraint, but the translations are complicated: UTF-16 => UTF-32 => UTF-8, and the Unicode implementation could be upgraded in the future.)
Taking better care with the buffers per Ed's suggestions, this may be clearer, safer and more long lasting.
I also left the old routine in a non #ifdef UNICODE branch, although this version would have the ZIP finding error because of modern Windows native use of UTF-16.
#ifdef UNICODE
//MVJ fix to Zip'd resource finding bug.
// see: http://www.garagegames.com/mg/forums/result.thread.php?qt=72017
// and: http://www.garagegames.com/mg/forums/result.thread.php?qt=72297
//
// FIX: read current working directory as UTF-16 (Native form of Microsoft Windows 2000/XP/2003/Vista/CE)
// and store it as UTF-8. This converts the previous single byte values of the Latin-1 encoding > 127
// into two byte UTF-8 sequences.
//
// This fixes this particular bug because the Zip finding routines (e.g.: see recurseDumpPath above)
// use UTF-16 Windows routines and convert from UTF-8 to UTF-16 internally to make the Windows calls.
// so Platform::getWorkingDirectory() returning UTF-8 (inserted in the function CALL)
// works properly for these routines.
//
// WARNING: The approximately 20 uses of Platform::getWorkingDirectory()
// have not been adequately regression tested! This fix works for me,
// without regressive errors (so far) - use at your own risk!
//
//
/// @return UTF8 encoded working directory. (Used to be Latin-1-ish)
StringTableEntry Platform::getWorkingDirectory()
{
   static StringTableEntry cwdUTF8 = NULL;
   if (!cwdUTF8)
   {
      // Microsoft Windows 2000/XP/2003/Vista/CE and future, are native UTF-16
      char cwd_buf_UTF16[(MAX_PATH+1)*2]; // max windows path length 260 TCHAR
      GetCurrentDirectoryW((MAX_PATH+1), (UTF16*)cwd_buf_UTF16); // W variant is the UTF-16 function
      char cwd_buf_UTF8[(MAX_PATH+1)*4];
      convertUTF16toUTF8((UTF16 *)cwd_buf_UTF16, (UTF8 *)cwd_buf_UTF8, sizeof(cwd_buf_UTF8));
      forwardslash(cwd_buf_UTF8); // works transparently on UTF-8 encoded text (probably not on UTF-16)
      cwdUTF8 = StringTable->insert(cwd_buf_UTF8);
   }
   return cwdUTF8;
}
#else
StringTableEntry Platform::getWorkingDirectory()
{
   static StringTableEntry cwd = NULL;
   if (!cwd)
   {
      char cwd_buf[2048];
      GetCurrentDirectoryA(2047, cwd_buf);
      forwardslash(cwd_buf);
      cwd = StringTable->insert(cwd_buf);
   }
   return cwd;
}
#endif
#endif
#6
02/24/2008 (8:30 am)
I think the question of 4 versus 6 bytes is a little complex. The current UTF-8 specification as per RFC 3629 limits UTF-8 to the formal range of declared character space, all of which is encoded in 4 bytes or fewer. However, the encoding used for UTF-8 is technically capable of using up to 6 bytes, so if Unicode ever assigns characters to the unused space, then your maximum-length string composed entirely of these hypothetical 5 or 6 byte long new characters would break via memory stomp.
For all practical purposes, using 4 bytes is going to be fine for the foreseeable future. I'm just paranoid and hate memory stomps :-)
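For sizing the conversion buffer, the numbers can be worked out per code point. This is a minimal sketch assuming standard UTF-8 as limited by RFC 3629 and a converter that emits nothing longer; the function names are hypothetical, and Torque's convertUTF16toUTF8 is not modelled here.

```cpp
#include <cstddef>
#include <cstdint>

// Bytes needed to encode one code point as UTF-8 under RFC 3629.
// Returns 0 for values beyond U+10FFFF. Surrogate values are not
// special-cased in this sketch.
std::size_t utf8Length(std::uint32_t codePoint)
{
    if (codePoint < 0x80)      return 1;
    if (codePoint < 0x800)     return 2;
    if (codePoint < 0x10000)   return 3;
    if (codePoint <= 0x10FFFF) return 4;
    return 0;
}

// UTF-16 code units needed for the same code point: characters outside
// the BMP take a surrogate pair.
std::size_t utf16Units(std::uint32_t codePoint)
{
    return codePoint < 0x10000 ? 1 : 2;
}

// Upper bound on UTF-8 bytes, including the nul terminator, for a string
// of the given number of UTF-16 code units.
std::size_t worstCaseUtf8Bytes(std::size_t utf16UnitCount)
{
    return utf16UnitCount * 3 + 1;
}
```

A 4-byte UTF-8 sequence always comes from a surrogate pair, which is two UTF-16 code units, so the worst ratio is 3 bytes per code unit, reached by BMP characters at or above U+0800. Under those assumptions the (MAX_PATH+1)*4 buffer in the revised code leaves room to spare.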
#7
03/18/2009 (9:24 pm)
Does TGEA 1.7.1 or a later version have this?
Torque 3D Owner Matthew Jessick