1/9/2025 at 5:25:04 PM
This is a tough one. It’s systemic: MS provides a “best fit” code mapping from wide Unicode to ASCII, which is a known, published, “vibes-based” mapper. This best-fit mapper is used in a lot of places, and I’m sure MS’s view of backward compatibility requires keeping it. It’s linked in by default everywhere, whether or not you know you included it.
The exploits largely revolve around supplying an unusual code point that “vibes” into, say, a slash, a hyphen, or a quote. These code points are typically evaluated one way (correct, full Unicode evaluation) inside a modern programming language, but are vibes-downed when passed to shell commands or other Win32 APIs. Crucially, this happens after you check them, because it happens once you’ve passed control.
To quote the curl maintainer, “curl is a victim” here, but who is the culprit? It seems certain that curl will be used by servers to retrieve user-supplied data automatically. When that server interprets user input one way for validation and another way when it is handed to system libraries, you’re going to have a problem.
It seems to me like maybe the solution is to provide an opt-out of “best fit” munging in the Win32 space, but I’m not a Windows guy, so I’m speculating. At least then open source providers could just add the opt-out to best practices, and deal with the many terrible problems that things like a Unicode wide variant of " or \ deliver to them.
And of course even if you do that, you’ll interact with officially shipped APIs and software that has not opted out.
by vessenes
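A minimal sketch of the check-then-convert gap described above. The three-entry table is a hand-picked stand-in, not Microsoft's actual per-code-page data, and `looks_safe` and `simulated_best_fit` are hypothetical helper names used only for illustration.

```python
# Illustrative only: a tiny stand-in for a best-fit table.
# Real best-fit mappings are per code page and much larger.
HYPOTHETICAL_BEST_FIT = {
    "\uff02": '"',   # FULLWIDTH QUOTATION MARK -> ASCII double quote
    "\uff3c": "\\",  # FULLWIDTH REVERSE SOLIDUS -> ASCII backslash
    "\uff0d": "-",   # FULLWIDTH HYPHEN-MINUS -> ASCII hyphen
}


def looks_safe(arg: str) -> bool:
    """Validation performed on the full Unicode string, before hand-off."""
    has_meta = any(ch in arg for ch in '"\\')
    return not has_meta and not arg.startswith("-")


def simulated_best_fit(arg: str) -> str:
    """What a narrow-API consumer might see after the lossy conversion."""
    return "".join(HYPOTHETICAL_BEST_FIT.get(ch, ch) for ch in arg)


user_input = "\uff0d\uff0doutput\uff02evil"
checked = looks_safe(user_input)                  # the check sees fullwidth characters
seen_downstream = simulated_best_fit(user_input)  # the consumer sees ASCII ones
still_safe = looks_safe(seen_downstream)
print(checked, seen_downstream, still_safe)
```

The check passes on the original string and would fail on the converted one, which is the ordering problem: the conversion happens after validation.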
1/9/2025 at 5:45:03 PM
The opt-out is to use the Unicode Windows APIs (the functions ending in "W" instead of "A"). This also magically fixes all issues with paths longer than 260 characters (if you add a "\\?\" prefix or set your manifest correctly), and has been available and recommended since Windows XP.
I'm not sure why the non-Unicode APIs are still so commonly used. I can't imagine it's out of a desire to support Windows 98 or Windows 2000.
by wongarsu
1/9/2025 at 7:01:44 PM
_Or_ set your application to use UTF-8 for the "A" APIs. Apparently this is supported as of a Windows 10 update from 2019. [1]
[1] https://learn.microsoft.com/en-us/windows/apps/design/global...
by comex
1/10/2025 at 1:35:31 AM
It should have been supported approximately 20 years earlier than that. I was coding against Win32 long before 2019 and wondering for years why they wouldn't let you.
An explanation I heard ~10 years prior is that doing so exposed bugs in CRT and nobody wanted to fix them.
by asveikau
1/10/2025 at 12:27:01 PM
> An explanation I heard ~10 years prior is that doing so exposed bugs in CRT and nobody wanted to fix them.
What I've heard is that the issue is not with the CRT, but with applications using fixed-size byte buffers. IIRC, converting from UTF-16 to any of the traditional Windows code pages requires at most two bytes for each UTF-16 code unit, while the UTF-8 "code page" can need three bytes. That would lead to buffer overflows in these legacy applications if the "ANSI" code page was changed to UTF-8.
by cesarb
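The byte counts above can be checked with a short sketch. It uses Python's own `cp1252` and `cp932` codecs as stand-ins for the Windows code pages (they are strict and do no best-fit substitution); `sizes` is a hypothetical helper name.

```python
def sizes(text: str, legacy_codec: str) -> tuple[int, int, int]:
    """Return (UTF-16 code units, legacy code page bytes, UTF-8 bytes)."""
    utf16_units = len(text.encode("utf-16-le")) // 2
    legacy_bytes = len(text.encode(legacy_codec))
    utf8_bytes = len(text.encode("utf-8"))
    return utf16_units, legacy_bytes, utf8_bytes


euro = sizes("\u20ac", "cp1252")  # EURO SIGN: one unit, one byte, three bytes
kanji = sizes("\u8a9e", "cp932")  # a CJK ideograph: one unit, two bytes, three bytes

# A buffer sized as "two bytes per UTF-16 code unit" holds the cp932 form
# of ten ideographs, but not the UTF-8 form of the same string.
units, legacy_len, utf8_len = sizes("\u8a9e" * 10, "cp932")
buffer_size = 2 * units
fits_legacy = legacy_len <= buffer_size
fits_utf8 = utf8_len <= buffer_size
print(euro, kanji, fits_legacy, fits_utf8)
```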
1/10/2025 at 10:26:24 AM
Not sure what that has to do with CRT, given that it isn't part of Win32.
by pjmlp
1/10/2025 at 12:11:51 PM
The CRT, in the form of the msvcrt.dll file, has had a de facto presence in Windows since the end of the 1990s. Later on, since 2018 or so, CRT availability was formalized in the Windows API in the form of the ucrtbase.dll module.
by garganzol
1/10/2025 at 6:45:59 PM
msvcrt was never for applications to use: https://devblogs.microsoft.com/oldnewthing/20140411-00/?p=12...
by ygra
1/10/2025 at 6:53:27 PM
The one bundled with Windows wasn't. However, the same "feature" exists in redistributed versions of msvcrt.
by okanat
1/10/2025 at 1:13:37 PM
Which doesn't change the fact that Win32 doesn't depend on it.
by pjmlp
1/10/2025 at 6:50:57 PM
It is extremely hard to create an application that doesn't depend on the CRT on Windows. The CRT provides tables of handlers for SEH exceptions and provides the default event handlers. Win32 headers have hard dependencies on the handler tables the CRT provides. So you need to go quite a bit out of your way to hack deep Win32 headers. Loading DLLs etc. may also call CRT functions.
You can read the Mingw64 source to see how many hacks they had to do to make it work.
by okanat
1/11/2025 at 4:54:33 AM
That's the "vcruntime", not the "ucrt". There has been a distinction since the ucrt was made an official part of the OS.
It's very easy to make a win32 program without the ucrt filesystem APIs so long as you don't mind being platform-specific (or making your own cross-platform wrappers).
by ChrisSD
1/11/2025 at 8:24:44 AM
I have been developing for Microsoft platforms since MS-DOS 3.3. Win16 and Win32 development without any function from the standard C library has been a thing for decades among those doing C development without caring about portability, such as for demoscene competitions and gaming.
Using C++ is another matter.
by pjmlp
1/10/2025 at 3:10:32 PM
It's still an important piece of the app compatibility story.
by asveikau
1/10/2025 at 7:41:50 AM
Does that mean that in this UTF-8 mode, GetCommandLineA would, when the full-width double quote occurs in the command line, return the UTF-8 bytes for that double quote, rather than steamrolling it to an ASCII double quote with the WorstFit mapping?
by kazinator
1/10/2025 at 7:41:02 PM
Yes, I wanted to suggest the same. I modified some old tools I wrote 15 years ago to do that a while ago. Not because I was aware of any vulnerability, but because a few places still used char* and I figured this would basically make it never fail with any weird filenames regardless of the code page.
So now it seems even if you think your app is fully Unicode, still do this just in case? :)
by iforgotpassword
1/15/2025 at 12:17:38 PM
> I figured this would basically make it never fail with any weird filenames regardless of the code page.
Windows filenames are not guaranteed to be valid UTF-16, so A functions with the UTF-8 code page can still fail to access some files. If you want 100% compatibility you need to realize that Windows is a WTF-16 system and make your own compatibility wrappers for the W functions under that assumption.
by account42
1/10/2025 at 8:57:22 PM
It sounds like something Cygwin ought to do across their ecosystem.
by kazinator
1/15/2025 at 12:21:14 PM
UTF-8 ACP might fix these exploits but it doesn't fix the root issue that your application encoding can't represent the whole internal system encoding (WTF-16, NOT UTF-16 despite what it claims).
by account42
1/9/2025 at 6:30:06 PM
As mentioned elsewhere in this discussion, 99% of the time the cause is likely the use of standard C functions (or C++ `std::string`…) instead of MS's nonstandard wide versions. Which of course is a ubiquitous practice in portable command-line software like curl.
by Sharlin
1/10/2025 at 8:34:29 AM
A lot of the details are in the linked curl HackerOne report: https://hackerone.com/reports/2550951
by smatija
1/15/2025 at 12:18:58 PM
std::string is not an issue; how you get strings from the environment into it is.
You can use W functions and convert the WTF-16 strings you get to WTF-8 and use that in std::string without problems.
by account42
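A minimal sketch of that WTF-16 to WTF-8 conversion, written in Python rather than against the Win32 API. It assumes the input is little-endian 16-bit code units as the W functions would return them, and it leans on Python's `surrogatepass` error handler to carry unpaired surrogates through; the function names are hypothetical.

```python
def wtf16le_to_wtf8(raw: bytes) -> bytes:
    """Convert possibly ill-formed UTF-16LE to WTF-8-style bytes.

    Valid surrogate pairs become ordinary 4-byte UTF-8 sequences;
    unpaired surrogates are kept as 3-byte sequences instead of failing.
    """
    text = raw.decode("utf-16-le", "surrogatepass")
    return text.encode("utf-8", "surrogatepass")


def wtf8_to_wtf16le(data: bytes) -> bytes:
    """Inverse conversion, for handing a name back to a W function."""
    text = data.decode("utf-8", "surrogatepass")
    return text.encode("utf-16-le", "surrogatepass")


# "a", an unpaired high surrogate (0xD800), "b", then a valid pair (U+1F600).
name_utf16 = b"a\x00" + b"\x00\xd8" + b"b\x00" + b"\x3d\xd8\x00\xde"
name_wtf8 = wtf16le_to_wtf8(name_utf16)
round_tripped = wtf8_to_wtf16le(name_wtf8)

# Strict UTF-8 cannot represent the same name at all.
try:
    name_utf16.decode("utf-16-le", "surrogatepass").encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(name_wtf8, round_tripped == name_utf16, strict_ok)
```

The strict encode failing is the same limitation a UTF-8 code page has: the name exists on disk but has no valid UTF-8 spelling.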
1/9/2025 at 9:10:45 PM
So the culprit is still the software writer. They should have wrapped the C++ library for OS-specific behavior on Windows, because they are publishing buggy software and calling it cross-platform.
by pishpash
1/9/2025 at 9:20:17 PM
curl was first released in 1996, shortly after Windows 95 was born, and it runs on numerous Windows versions even today. So, how many different versions should be maintained? Are you going to help with one of these versions?
On top of that, how many new gotchas do these “modern” Windows functions hide, and how many fix cycles are required to polish them to the required level?
by bayindirh
1/10/2025 at 1:14:01 AM
If we're talking about curl specifically, I absolutely think they would (NOT "should") fix/work around it if there are actually common problems caused by it.
Yes, it would have required numerous fix cycles, but curl in my mind is such a polished product that they would have bitten the bullet.
by thrdbndndn
1/10/2025 at 7:18:34 AM
You're right, if the problems created by this are big enough, the team will fix them without any fanfare and whining.
However, in neither case is this a shortcoming of curl. They'd be responding to a complicated problem caused by the platform they're running on.
by bayindirh
1/10/2025 at 4:11:55 AM
Why would/should they? I've never paid for curl. Who even develops it? Sounds like a thankless job to fix obscure worstfit bugs.
by 8n4vidtmkvmk
1/10/2025 at 7:24:07 AM
> Why would/should they?
Because they care. That's it.
> I've never paid for curl.
I'm sure the people who develop it don't want money and fame; they're just doing what they like. However, curl has commercial support contracts if you need them.
> Who even develops it?
Daniel Stenberg et al. Daniel can be found at https://daniel.haxx.se.
> Sounds like a thankless job to fix obscure worstfit bugs.
It may look thankless, but it's not. curl is critical infrastructure at this point. While https://xkcd.com/2347/ applies squarely to cURL, it's actually nice that the lead developer is making some money out of his endeavor.
by bayindirh
1/10/2025 at 6:54:17 AM
Why would they develop curl at all by your logic?
They fix bugs because they simply want their product to be better, if I were to take a guess. Like, I'm sure curl's contributors have worked on OS-specific problems before, and this wouldn't be the last.
> to fix obscure worstfit bugs.
Again, my premise is "if there are actually common problems caused by it". This specific bug doesn't sound like that, at least not for now.
by thrdbndndn
1/15/2025 at 12:23:36 PM
I'm sure those software writers will be happy to refund your purchase. It's not their fault that Microsoft's standard C implementation is faulty.
by account42
1/10/2025 at 5:32:15 AM
> I'm not sure why the non-Unicode APIs are still so commonly used.
Even argv is affected on Windows. That's part of the C and C++ standards, not really a Windows API. Telling all C/C++ devs they need to stop using argv is kind of a tough ask.
by Thorrez
1/10/2025 at 1:56:39 PM
You also have to use wmain instead of main, with a wchar_t argv, otherwise the compiled-in arg parser will call the ANSI version. In other words, anyone using MSVC with the normal, standardised, cross-platform C entry point is hit by this.
Oh, and wmain is a Visual C thing. It isn't found on other platforms. Not standardised.
by shakna
1/10/2025 at 6:50:43 PM
Writing cross-platform code which consistently uses UCS-2 wchar_t* on Windows and UTF-8 char* on UNIX-like systems sounds like absolute hell
by mort96
1/15/2025 at 12:28:18 PM
It's not that bad really - you just convert at the win32 API call boundary.
Also, it's not UCS-2. Also not UTF-16. Windows uses WTF-16 internally and if you want 100% compatibility that's what you need to target.
by account42
1/11/2025 at 3:40:28 AM
A wchar_t "native" libc implementation would be an interesting thing.
by lmz
1/9/2025 at 6:06:20 PM
I think the issue is that native OS things like the Windows command line, say, don’t always do this. Check the results of its ‘cd’ command with Japanese Yen characters introduced. You can see that the path descriptor has somehow updated to a directory name with a Yen sign (or a wide backslash) in it, while the file system underneath has munged the name and put you into an actual directory. It’s precisely the problem that you can’t control the rest of the API surface to use W that is the source of the difficulties.
by vessenes
1/10/2025 at 3:00:02 AM
Using \\?\ has a downside: since it bypasses Win32's path processing, it also prevents relative paths like d:test.txt from working. Kind of annoying on the command line with tools like 7z.exe.
by ack_complete
1/15/2025 at 12:29:48 PM
Sounds more like an upside TBH.
by account42
1/9/2025 at 7:47:46 PM
> I'm not sure why the non-Unicode APIs are still so commonly used. I can't imagine it's out of a desire to support Windows 98 or Windows 2000.
Nowadays, it's either for historical reasons (code written back when supporting Windows 9x was important, or even code migrated from Windows 3.x), or out of a desire to support non-Windows systems. Most operating systems use a byte-based multi-byte encoding (nowadays usually UTF-8) as their native encoding, instead of UTF-16.
by cesarb
1/10/2025 at 1:33:59 AM
I share your recommendation of always using PWSTR when using Windows APIs.
> I'm not sure why the non-Unicode APIs are still so commonly used
I think because the rest of the C world uses char* with utf-8, so that is what people are habituated to. Setting the ACP to CP_UTF8 would have solved a lot of problems, but I believe that's only been supported for a short period of time, bafflingly.
by asveikau
1/15/2025 at 12:31:50 PM
> Setting the ACP to CP_UTF8 would have solved a lot of problems, but I believe that's only been supported for a short period of time, bafflingly.
It wouldn't solve all encoding problems though, because most Windows APIs can store/return invalid UTF-16, which you can't represent in CP_UTF8. You'd need a CP_WTF8 for that, which doesn't even exist, so you have to use the W APIs and do the conversion yourself.
by account42
1/15/2025 at 12:12:20 PM
> I'm not sure why the non-Unicode APIs are still so commonly used. I can't imagine it's out of a desire to support Windows 98 or Windows 2000.
A lot of the uses are indirect, via standard C API functions that are effectively broken on Windows but work just well enough (i.e. work with ASCII) that their use goes unnoticed when someone ports something to Windows.
by account42
1/10/2025 at 6:22:24 PM
> I'm not sure why the non-Unicode APIs are still so commonly used.
Simple: portable code meant to run on Unix (where UTF-8 is king) and Windows -> want to use UTF-8 codepage on Windows and the "A" APIs.
by cryptonector
1/10/2025 at 4:17:08 PM
The other opt-out might be to opt into UTF-8 support for the "A" functions.
by cryptonector
1/9/2025 at 6:50:15 PM
> paths longer than 260 characters (if you add a "\\?\" prefix or set your manifest correctly)
A build of Windows 10 released long ago did this automatically, so there's no need for adjustments anymore; 32k is the max...
...except for Office! It can't handle long paths. But Office has always been hacky (the title bar, for example).
by p_ing
1/10/2025 at 7:47:24 AM
Windows has had a way of opting out of legacy behavior since Windows XP - manifest files. If you don't include a manifest, even GetVersionEx will not return the current version IIRC. It should not be too hard to add an opt-out in there (and at some point make it the default in Visual Studio).
I think what is also needed is some kind of linting - there is usually no need to call ANSI WinAPI functions in a modern application (unless you set the locale to UTF-8 and only use the 8-bit functions, but I don't know how well that works). I think there are also a couple of settings and headers to include to make everything "just work" - meaning argv, printf and std::cout work with UTF-8, you get no strange conversions, and you just have functions to convert between UTF-8 and UTF-16 to use WinAPI. I'm pretty sure I have a Visual Studio project lying around somewhere where it works. But all the necessary steps need to be documented and put in one place by MS.
by captainmuon
1/15/2025 at 12:37:02 PM
> If you don't include a manifest, even GetVersionEx will not return the current version IIRC.
Worse than that, even reading relevant registry keys will be faked.
by account42
1/10/2025 at 8:18:37 AM
Using UTF-8 internally and converting strings for W API calls is a way to gain some performance.
by Arwill
1/10/2025 at 4:23:41 PM
More like it's a way to keep your Windows port code to a minimum so that the rest can run on Unix. I.e., you want to use UTF-8 because that's the standard on Unix, and you don't want to have completely different versions of your code for Windows and Unix because now you have twice the maintenance trouble.
by cryptonector
1/15/2025 at 12:38:50 PM
*WTF-8 unless you want to not be able to handle all possible filenames.
by account42
1/15/2025 at 12:10:21 PM
> To quote the curl maintainer, “curl is a victim” here, but who is the culprit?
Security vulnerability or not, it's a bug with curl on Windows as it doesn't correctly handle Unicode arguments.
by account42
1/10/2025 at 2:37:53 AM
The loosey-goosey mapping of code points to characters has always bothered me about Unicode.
by UltraSane
1/10/2025 at 4:24:39 PM
This isn't about Unicode having "loosey-goosey" anything. It's about a mapping that Microsoft came up with to map Unicode to non-Unicode.
by cryptonector
1/10/2025 at 7:44:44 PM
Yeah, they could have mapped code points to their textual descriptions. That'd require reallocations, but converting ＂ to UNICODE_FULLWIDTH_QUOTATION_MARK_U+FF02 would be unambiguous. Ugly, but obvious what happened. Better than � IMO!
by SAI_Peregrinus
1/10/2025 at 9:30:52 PM
Since there are two possible antecedents for "they" (the Unicode Consortium, and Microsoft) here you'll have to clarify. Also, my question really was for u/UltraSane.
Microsoft should just never have created Best-Fit -- it's a disaster. If you have to lose information, use an ASCII character to denote loss of information and be done. (I hesitate to offer `?` as that character.) Or fail to spawn the process with an error indicating the impossibility of transcoding. Failure is better actually.
by cryptonector
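The "fail instead of guessing" option can be sketched with Python's strict codec behavior. This only illustrates the policy and is not how any Windows API behaves; `to_code_page_or_fail` is a hypothetical name, and `cp1252` is just an example target.

```python
def to_code_page_or_fail(arg: str, codec: str = "cp1252") -> bytes:
    """Transcode to a narrow code page, refusing anything unmappable.

    errors="strict" raises UnicodeEncodeError rather than substituting
    a look-alike character.
    """
    return arg.encode(codec, errors="strict")


plain = to_code_page_or_fail("caf\u00e9")  # representable in cp1252

try:
    to_code_page_or_fail("\uff02quoted\uff02")  # fullwidth quotes are not
    refused = False
except UnicodeEncodeError:
    refused = True
print(plain, refused)
```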
1/11/2025 at 11:05:18 PM
Oh, failure is way better. But a lot of the original APIs didn't have failures, they just returned a value, and MS doesn't like to break backwards compatibility even when it'd be easier for everyone if they did.
For "they", I mean MS could have made BestFit work as follows: if an input string contains characters not in the user's code page, then return a new string with those characters replaced by the name of that code point as assigned by the Unicode Consortium (and maybe also the textual code point number U+<number>). This requires a new allocation and copies of the parts of the string not needing replacement, but loses no information and creates no security holes.
by SAI_Peregrinus
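A rough sketch of that name-substitution idea, assuming ASCII as the target and the `UNICODE_<NAME>_U+<hex>` spelling used earlier in the thread; `describe_unmappable` is a hypothetical helper. Python's built-in `namereplace` error handler does something similar with its own `\N{...}` notation.

```python
import unicodedata


def describe_unmappable(arg: str, codec: str = "ascii") -> str:
    """Replace each character the codec cannot encode with its Unicode name."""
    out = []
    for ch in arg:
        try:
            ch.encode(codec)
            out.append(ch)
        except UnicodeEncodeError:
            # Fall back to a placeholder for code points without a name.
            name = unicodedata.name(ch, "UNNAMED").replace(" ", "_")
            out.append(f"UNICODE_{name}_U+{ord(ch):04X}")
    return "".join(out)


spelled = describe_unmappable("x\uff02y")
builtin = "x\uff02y".encode("ascii", errors="namereplace")
print(spelled)
print(builtin)
```

Either way the output grows, so the caller needs a fresh allocation, as the comment above notes.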
1/13/2025 at 4:32:32 PM
CreateProcess() and related APIs can fail.
by cryptonector
1/15/2025 at 12:41:38 PM
CreateProcess() doesn't know what functions the resulting process will use to access the command-line and environment.
by account42