---
title: "Path length limit on Windows"
author: "Tomas Kalibera"
date: 2023-03-07
categories: ["Internals", "User-visible Behavior", "Windows"]
tags: ["MAX_PATH", "PATH_MAX", "long paths"]
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE)
```

When testing development versions of Rtools for Windows, I ran into strange failures of several CRAN packages where R could not find, read from or write to some files. The files should have been in temporary directories which get automatically deleted, so it took some effort to find out that they actually existed and were accessible. That didn't make any sense at first, but eventually I got to this output:

```
Warning in gzfile(file, "wb") :
  cannot open compressed file 'C:\msys64\home\tomas\ucrt3\svn\ucrt3\r_packages\pkgcheck\CRAN\ADAPTS\tmp\RtmpKWYapj/gList.Mast.cells_T.cells.follicular.helper_T.cells.CD4.memory.activated_T.cells.CD4.memory.resting_T.cells.CD4.naive_T.cells.C_Plasma.cells_B.cells.memory_B.cells.naive.RData.RData', probable reason 'No such file or directory'
Error in gzfile(file, "wb") : cannot open the connection
Calls: remakeLM22p -> save -> gzfile
Execution halted
```

Such a long file name. The entire path in the warning message takes 265 bytes. Perhaps it is too long and, for some reason, it can be created but not read in a particular way?

To confirm the theory, I created a mapped drive to get rid of the `/msys64/home/tomas/ucrt3/svn/ucrt3/r_packages/pkgcheck` prefix of the path. This package and several others started to pass their checks. Interestingly, a junction didn't work that well, because path normalization followed it in some cases, producing the long paths again.

R has been improved since and is more likely to provide a hint (warning or error) that the path is too long, so diagnosing the problem is often easier than this, yet the message may also be too pessimistic (more below).

This text provides some background on path-length limits and offers recommendations for what to do about them. It reports on recent improvements in R, which allow R and packages to work with longer paths on recent Windows 10, where and when the system limit can be overridden.

Following the changes in R, some R packages will have to be updated as well to work with long paths. Primarily authors of packages using `PATH_MAX` or even `MAX_PATH` in their code are advised to continue reading. The changes in R make the updating of packages possible (they can be tested), but also more important (they could crash when seeing long paths). It is therefore not advised to enable long paths on production systems yet - the feature needs to be considered experimental within the R ecosystem.

# Background

On Windows, there used to be a limit on the length of the entire path name imposed by the operating system. It is derived from the constant `MAX_PATH` (260, not likely to change) and limits the number of UCS-2 (16-bit, so only BMP characters) words including the terminator.

Depending on the API, it may in addition be applied directly to the number of bytes accepted as path names in ANSI functions, e.g. 259 UTF-8 bytes plus a 1-byte terminator. But it may also be applied only once converted to UCS-2, and then `259` BMP characters with a 2-byte terminator may correspond to up to `3*259` UTF-8 bytes with a 1-byte terminator.

However, for quite some time, much longer path names have been able to exist on Windows. The file system normally used (NTFS) allows that. Windows API started supporting the so-called extended-length path syntax (`\\?\D:\long_path`) in some functions, which made it possible to overcome the limit, even though anecdotally it is not used much. In addition, where it seemed safe with respect to the applications, Windows API started accepting much longer path names even with the regular syntax, primarily in Unicode variants of the functions.
Hence, while some Windows applications are written assuming that no path can be longer than `MAX_PATH`, such paths may and do exist in practice.

How come the old applications making that assumption still seem to be (mostly) working? The trick is that Windows hides long paths from old applications in APIs where it is believed they could cause trouble, which typically means APIs where the path is being returned to the application. The idea is that long paths are rare anyway, and users would be unlikely to try using them, especially with old applications.

Once an application is updated to work with long paths, it can opt in to see them by declaring it is long-path-aware in its manifest (so in the `.EXE` file, at build time). In addition, this needs to be allowed system-wide. It is supported since somewhat recent Windows 10 and is not enabled by default.

The current path length limit imposed by Windows is approximately 32,767 UCS-2 words. An exact single limit does not exist (the documentation says it is approximate and depends on internal expansions), and that is in addition to the uncertainty due to encoding and ANSI vs Unicode functions described before.

R uses MinGW-W64 on Windows, which defines `PATH_MAX` to the same value as `MAX_PATH`, so 260, to help compiling code written originally for POSIX systems. The macros have a similar meaning, but the details are different.

Readers interested in the exact wording in POSIX are advised to check the specification. I didn't try to find out whether that was the correct interpretation in the past, but today `PATH_MAX` is not a limit for the entire path length that may exist in the system.

When `PATH_MAX` is defined, it is the maximum number of bytes that will be stored to a user buffer by functions that do not allow specifying the buffer size. Such calls are rare today (R uses `realpath` for instance) and `PATH_MAX` is then explicitly mentioned in their documentation.
Also, if `PATH_MAX` is defined and the OS limits path lengths, it cannot limit them to a smaller number than `PATH_MAX`. But the OS may accept much longer paths and much longer paths may exist.

In addition, the limit may differ based on the file system. On Unix, all file systems are mounted to the same tree, so essentially the limit may depend on a path. If it does, `PATH_MAX` shall be undefined and instead the user can use `pathconf` (or `fpathconf`) to find the limit for a particular path. Again, no limit may be given. Also, a limit too large for allocation may be given.

Some applications tend to define `PATH_MAX`, when not defined, to a certain large constant, which may complicate reviewing the code (essentially it then becomes an application-imposed limit).

The actual value of `PATH_MAX` is not defined by the standard and differs between systems; common values are 4096 on Linux and 1024 on macOS.

In summary, there is no (exactly, always, at compile time) known limit on the entire path name length, neither on Windows nor on Unix (POSIX). The actual limits imposed definitely differ between the main platforms on which R runs (Linux, macOS, Windows) and there may be some variation even on a single machine (on Windows there definitely is, on Unix POSIX allows it).

# Declaring long-path awareness

For R and packages to work with paths longer than 260 characters on Windows, when it is allowed in the system, R needs to be made long-path aware and declare this to Windows. For example, Python already does that and the Python installer offers to enable long paths in the system.

To declare it, one sets `longPathAware` to `true` in manifests of all R front-ends (`Rterm.exe`, `Rgui.exe`, etc.), so in the same place where R opts in for UTF-8 as the system and native encoding. That this is done at process level means that applications embedding R would have to do it as well to get the support.
Once R does it (R-devel already does), the packages and libraries it uses will also receive long paths, so they should be made long-path aware, but that can hardly be done without testing.

To resolve the chicken-and-egg problem, Windows has a system-wide setting for long paths. By default, this is still disabled. It can be enabled by enthusiast users, by people who really need it for specific applications and choose to take the risk of running into problems in selected packages/libraries, and by developers of those packages, who could hardly make them long-path aware without being able to test and debug.

The setting is in the registry, under `[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem]`, field `LongPathsEnabled`. It can also be controlled by Group Policy ("Enable Win32 long paths").

# Ensuring long-path awareness

The key part of making an application long-path aware is rewriting the code without an assumption that there is a fixed maximum length of an entire path name.

Such an assumption may have led to static allocation of buffers for paths, to limited checking of return values from system functions, to limited buffer-overflow checking when constructing path names, and to now possibly unhelpful validity checking of paths given by the user (printing warnings/errors about paths being too long).

In my view, it would make sense to get rid of this assumption in all code, not only in Windows-specific parts. An obvious result of such a rewrite is that the code will never or almost never use the `PATH_MAX` or `MAX_PATH` macros.

In addition, on Windows there are a number of system functions and components which do not support long paths even though their API would allow it. It is necessary to find them and replace them with modern API. The limitations are not always documented, so we are stuck with testing.
This may also be a natural opportunity to replace calls to deprecated Windows API functions by more recent ones, even when the old ones support long paths, because the code has to be rewritten anyway. Increased code complexity coming with this change may require local refactoring.

## Figuring out the required buffer size

Most Windows API functions returning paths accept a pointer to a user buffer and the buffer size. When the size is sufficient, they fill in the buffer and return the number of bytes used (excluding the terminator). When the buffer size is too small, they return the number of bytes needed (including the terminator). Unicode versions of the functions do the same but the unit is UCS-2 (16-bit) words. So, one can call the function twice, the first time to find out the needed buffer size and the second time with a sufficiently large buffer.

Old code like this (excluding error handling):

```
/* fixed-size buffer: assumes no path is longer than MAX_PATH */
char cwd[MAX_PATH];
GetCurrentDirectory(MAX_PATH, cwd);
```

can be changed into:

```
char *cwd;
/* first call: returns the required size, including the terminator */
DWORD res = GetCurrentDirectory(0, NULL);
cwd = (char *)malloc(res);
/* second call: fills the buffer */
GetCurrentDirectory(res, cwd);
```

One could try to optimize the code by using a non-empty buffer already during the first call, so that in "most cases" only one call to `GetCurrentDirectory` would suffice. The downside would be increased code complexity and complicated testing: longer paths would be rare, and hence the code path would rarely be tested. The initial size could indeed even be `MAX_PATH`.

While error handling is excluded from the example, calling the function twice comes with a (theoretical, but still) risk that the external conditions would change in between. In this case, another thread could change the current working directory of the process to a value requiring a longer buffer, so in theory even the second call could fail due to insufficient buffer size.

One needs to be careful when checking the return values of such functions, because there may be slight variations in semantics.
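
The same measure-then-fill pattern can be tried outside Windows. The following is a minimal sketch in portable C using `snprintf`, which reports the required length excluding the terminator, so one byte has to be added before allocating. The helper name `join_path` is hypothetical and chosen only for this illustration; it is not part of R or of any Windows API.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: joins dir and file with '/' into a freshly
   allocated string. Returns NULL on error; the caller releases the
   result with free(). */
char *join_path(const char *dir, const char *file)
{
    /* First call: a NULL buffer of size 0 only measures. The returned
       length excludes the terminator. */
    int needed = snprintf(NULL, 0, "%s/%s", dir, file);
    if (needed < 0)
        return NULL;

    size_t size = (size_t)needed + 1;   /* add room for the terminator */
    char *buf = malloc(size);
    if (!buf)
        return NULL;

    /* Second call: fill the buffer and verify that the result fits. */
    int written = snprintf(buf, size, "%s/%s", dir, file);
    if (written < 0 || (size_t)written >= size) {
        free(buf);
        return NULL;
    }
    return buf;
}
```

Even the second call is checked here, in the spirit of the remark above that conditions may change between the two calls.
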
Some functions return the required buffer size _without_ the terminator, such as `DragQueryFile`. This matches the behavior of e.g. the C `snprintf` function.

Some Windows API calls already return a dynamically allocated result, e.g.

```
wchar_t *mydocs;
SHGetKnownFolderPath(&FOLDERID_Documents, KF_FLAG_CREATE, NULL, &mydocs);
// copy mydocs
CoTaskMemFree(mydocs);
```

There, finding the result length is easy (e.g. `wcslen()`). One can allocate a buffer for the result in the preferred way, copy it, and free the original using the correct free function following the documentation of the specific API function (allocation is discussed later below).

There are Windows API calls which do not return the required buffer size, but only return an error signalling that the provided buffer was not large enough. One then needs to call the function several times, increasing the buffer size. This example is for `GetModuleFileName`:

```
DWORD size = 1;
char *buf = NULL;
for(;;) {
    buf = (char *)malloc(size);
    if (!buf)
        return NULL;
    DWORD res = GetModuleFileName(NULL, buf, size);
    if (res > 0 && res < size) /* success */
        break;
    free(buf);
    if (res != size) /* error */
        return NULL;
    size *= 2; /* try again with 2x larger buffer */
}
```

The POSIX `getcwd()` function is another example where one needs to iterate to find out the required buffer size, even though some extensions allow returning a dynamically allocated result.

Iterating is not a suitable solution in all such cases. For instance, the `GetOpenFileName` function opens a dialog asking the user to select a file to be opened. The caller provides a buffer for the file name and the size. The function reports an error if the buffer was too small. True, the application could increase the buffer size, open the dialog again, and ask the user to make the choice again. This would hardly be practical, and using a hard-coded large limit is probably better for most uses.
There is probably also a limit to how long a path a user would normally be willing to select manually.

## Dynamic allocation

While it is natural to use dynamic allocation for paths given there is no useful upper limit on their length, introducing dynamic allocation where it hasn't been before has to be done with care.

Using `malloc()` requires checking for a memory allocation failure and deciding what to do when it happens: map it to error codes returned by the function at hand, or throw an R error. Throwing an R error requires additional care: if this introduces a possible R error in a function where it wasn't possible before (so at any call site), it may also be introducing a resource leak (e.g. some open file or another dynamically allocated object not arranged to be released on a long jump).

If there is any call to R API between the `malloc()` and `free()` calls, there is a risk of a long jump there, and the buffer allocated by `malloc()` hence should be arranged to be freed if that happens. There is API to do that, both internally in R and public for packages, but it may be tedious to handle all cases.

Another problem of introducing `malloc()` is releasing the memory by the caller. If a function previously returned a pointer to a statically allocated buffer and we change it to return memory allocated by `malloc()`, the callers will have to know to release it, and will have to have access to the correct matching function to free it. This is easily possible only for rarely used or internal functions.

An example of a function changed this way on Windows is `getRUser()`. It now returns memory that should be freed using the `freeRUser()` function by R front-ends and embedding applications. Older applications would not know to free the memory, because a statically allocated buffer was used before, but this function is normally called just once during R startup, so the leak is not a problem. `malloc()` was the choice in startup code as the R heap is not yet available.
However, in typical package code as well as often in base R itself, when R is already running, it is easier to use `R_alloc` than `malloc` for temporarily created buffers. Introducing `R_alloc` in these cases usually doesn't require the callers to be modified: the memory is automatically freed at the end of `.Call`/`.External` or can be managed explicitly by `vmaxset/vmaxget` in a stack-based manner.

Care has to be taken when there is a risk that the modified function will be called a large number of times before the cleanup takes place. Also, there must not be an undesirable cleanup using `vmaxset` before the buffer is to be used.

`R_alloc` introduces allocation from the R heap, and this means potentially also a garbage collection. Therefore, care must be taken whether this is safe to introduce and whether it would not introduce PROTECT errors.

In theory, `R_alloc` also introduces a possible long jump, because of a potential allocation error. However, memory allocated by `R_alloc` gets cleaned on long jumps (the allocation stack depth is restored at the corresponding contexts), so one does not have to worry about memory leaks.

In base R, calls to Windows API have been mostly rewritten to dynamic allocation, using `malloc` in startup code and `R_alloc` elsewhere. Despite the discussion above, deciding which allocation function to use hasn't always been hard: often `R_alloc` had already been used, so it wasn't newly introduced. But some static allocation remains.

## Static allocation

In some cases, changing existing code for dynamic allocation of paths may still seem overwhelming or too intrusive. It may be easier, in some cases at least as a temporary solution, to give up on supporting arbitrarily long paths, and instead impose an application-specific limit (much larger than 260 bytes on Windows).

It is still necessary to handle things that weren't handled in code that assumed a length limit on any existing path.
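
As an illustration only, the following portable C sketch shows one way an application-imposed limit might be combined with detection of too-long results. The constant `APP_PATH_LIMIT`, its value and the helper `build_path` are assumptions made for this example, not names used in R.

```c
#include <stdio.h>

/* Application-imposed limit. The name and the value are assumptions
   made for this sketch (4096 matches PATH_MAX on Linux). */
#define APP_PATH_LIMIT 4096

/* Hypothetical helper: builds "dir/file" in a caller-provided buffer
   of bufsize bytes. Returns 0 on success and -1 when the result would
   not fit, so that the caller can report an error instead of using a
   truncated path. */
int build_path(char *buf, size_t bufsize, const char *dir, const char *file)
{
    int n = snprintf(buf, bufsize, "%s/%s", dir, file);
    if (n < 0 || (size_t)n >= bufsize)
        return -1;   /* encoding error, or the path would be truncated */
    return 0;
}
```

A caller would declare `char path[APP_PATH_LIMIT];`, pass `sizeof path` as the size, and turn a `-1` result into a warning or an error. The point of a separate constant is that the limit is visibly the application's own rather than `PATH_MAX` or `MAX_PATH`.
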
Compared to dynamic allocation, one does not have to worry about introducing garbage collection (PROTECT errors) and resource leaks (the client not freeing the memory). But there is still the issue of introducing error paths, and hence potential resource leaks.

Unlike with dynamic allocation, one needs to carefully protect against buffer overflows and detect when a too-long path would arise, e.g. from concatenation. One needs to report that as an error rather than corrupting memory or silently truncating. Also, the code may become complicated by having to deal with multiple path-length limits when the OS API introduces one and the application another.

In base R, static allocation was still used for a few widely called utility functions (where changing/reviewing the callers would be too difficult), for incorporated external code where the change would complicate maintenance, where one could not find the buffer size anyway, and in some code used also on Unix, where `PATH_MAX` is usually large enough so that it does not cause trouble.

## Functions to be avoided

Some code has to be rewritten to use different API to support long paths. Only several examples are given here to illustrate the problem.

An old POSIX function `getwd()` (removed from the standard in 2008) doesn't allow specifying the size of the user buffer. The buffer needs to be at least of size `PATH_MAX` and the function returns an error if the path is longer than `PATH_MAX`. Another example is `realpath`.

These functions in their old form are broken by design, because in current POSIX, `PATH_MAX` may not even be defined, or may be a number too large to allocate a buffer of that size, etc. Still, such functions are rare on both Unix and Windows.

Unfortunately, even calls which have semantics that would allow supporting long paths sometimes do not support them on Windows.
For example, to locate the "Documents" folder, R previously used `SHGetSpecialFolderLocation/SHGetPathFromIDList`, but to support long paths, this was changed to `SHGetKnownFolderPath`, because `SHGetPathFromIDList` does not support long paths. This illustrates that such a limitation sometimes exists even when the API already returns a dynamically allocated result.

`GetFullPathNameA` (the ANSI version) does not work with long paths, but `GetFullPathNameW` does. Hence, calls to the ANSI version need to be replaced by a conversion and a call to the Unicode version. This doesn't make much sense, because the ANSI version should be doing just that, and because the API would allow supporting long paths, as the buffer size is accepted and the real size signalled. Still, at least it is documented.

Many API functions document the limit for the ANSI version and refer to the Unicode version to overcome it, but that seems surprising (or perhaps outdated) given the new support for UTF-8 and the recommendation to use the ANSI functions. Often the ANSI functions happen to work with long paths (when opted in). For example, `GetShortPathName` does, while it is documented to have that limitation in the ANSI version as well.

The old dialog for choosing a directory, `SHBrowseForFolder`, does not support long paths (it is used in Rgui) and had to be replaced by `IFileOpenDialog`, which required more than several lines of code.

## Directory traversal

R internally uses POSIX `opendir/readdir/closedir` functions for listing files in a directory. These are not available on Windows directly, but R has been using MinGW-W64 implementations, both the ANSI and the Unicode variants.

It should be said here that there is also a limit on the length of an individual file name. Luckily, this limit is about the same on all systems where R runs and it hasn't changed (at least not recently). So, it is not a problem that these functions allocate a single file name statically (`d_name`).
However, the MinGW-W64 (in version 10) implementation of these functions uses `GetFullPathName` on a statically allocated buffer of `PATH_MAX` characters; it is applied to the input path used to start the search. So, R now has its own re-implementation of a subset of the functionality of `opendir/readdir/closedir` which does support long paths.

The functions for directory traversal also had to be refactored not to make assumptions about a limit for the full paths that may exist in the system. Such functions internally need to keep appending directory names to build the currently visited path. This previously used a statically allocated buffer, but now uses a dynamically allocated string buffer, which is automatically expanded if needed.

## Checking of return values

An example to illustrate the need for reviewing old code which assumed that no path could be longer than `MAX_PATH` is from the implementation of `Sys.which`:

```
int iexts = 0;
const char *exts[] = { ".exe" , ".com" , ".cmd" , ".bat" , NULL };
while (exts[iexts]) {
    strcpy(dest, exts[iexts]); // modifies fl
    /* only zero (error) is checked; "buffer too small" is not */
    if ((d = SearchPath(NULL, fl, NULL, MAX_PATH, fn, &f)))
        break;
    iexts++ ;
}
```

The loop tries to find an executable on PATH using different suffixes. A non-zero return value of `SearchPath` is taken as a success. The function returns zero on error. It returns a value larger than `nBufferLength` (which received the value of `MAX_PATH`) to indicate that the buffer wasn't large enough, but that wasn't checked in the old code as it was assumed to be impossible.

So, when there is a very long path on PATH, say at the beginning of it, `Sys.which()` would fail for files that in fact were on PATH. It doesn't fail in R 4.2.2 and earlier, because Windows hides such long path components from R and `SearchPath` skips them. But it would fail in R-devel on a system with long paths enabled.
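
The three outcomes described above for `SearchPath` can be told apart with a small check. This is a sketch under stated assumptions, not the code used in R-devel: the enum and the helper `classify_path_call` are hypothetical, and plain `unsigned long` stands in for `DWORD` so that the fragment compiles outside Windows.

```c
/* Outcome of a Windows-API-style call that fills a caller-provided
   buffer with a path. All names here are hypothetical. */
typedef enum {
    PATH_CALL_ERROR,
    PATH_CALL_OK,
    PATH_CALL_TOO_SMALL
} path_call_status;

/* ret:     value returned by the call
   bufsize: size of the buffer that was passed in
   Assumed convention, as described above for SearchPath:
     ret == 0           error
     0 < ret < bufsize  success, ret excludes the terminator
     ret > bufsize      buffer too small, ret is the required size
   ret == bufsize is treated as "too small" as well, which is an
   assumption made here to stay on the safe side. */
path_call_status classify_path_call(unsigned long ret, unsigned long bufsize)
{
    if (ret == 0)
        return PATH_CALL_ERROR;
    if (ret < bufsize)
        return PATH_CALL_OK;
    return PATH_CALL_TOO_SMALL;
}
```

In a loop like the one above, only `PATH_CALL_OK` would count as a match; on `PATH_CALL_TOO_SMALL` the call would be repeated with a buffer of at least `ret` units instead of silently accepting the result.
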
## Checking of path lengths

Given that there is no known limit on the entire path length in the system, it is questionable whether preventive checks make sense, and particularly so with the `MAX_PATH` limit on Windows.

It is true that, unless long paths are enabled in the system, even R-devel would be prone to this limit, but as described earlier, only some functions in some cases are prone to it, while other functions work. So, an error may be premature and a warning may be confusing.

Certainly the checks make sense if an application decides to impose its own limit: they are needed to protect static buffers from overflow on input.

# Limitations

Long path support in Windows is only available in Windows 10 since version 1607 (released in 2016). On older systems, R would still be subject to the `MAX_PATH` limit.

Windows applications ("Win32") cannot be started with the current directory being a long path, even when the long path support is enabled. This quite significantly restricts potential use of long paths. In R package development, one would easily run into this when checking or building packages, which in turn often executes external commands. This also means that testing the long path support is difficult.

Some Windows components still do not support long paths. Hopefully this will change over time, but it is already over 6 years since the feature was released. For example, it is not possible to print a document to a file with a long path - I ran into this while testing different functions of Rgui with long paths, and I didn't find an alternative API. After all, several Windows applications I tried had the same limit.

Inevitably, a number of existing applications would not support long paths, and some may be used together with R, so R supporting them would not help.

As noted before, the feature in base R is to be treated as experimental, particularly because packages have not yet been updated.
While it seems there are no more than 100 CRAN and Bioconductor packages using the `PATH_MAX` (or `MAX_PATH`) constants in their code, it is not clear how many would be affected in bad ways. It is not easily possible to "run checks" for all CRAN/Bioconductor packages to test that, because of the limitations in executing from long paths. So, the level of long path support and testing in packages will be mostly left to manual work.

# Recommendations

I offer my recommendations based on reading about this problem and implementing long-path support in base R.

## Work-arounds

Users who run into the problem of long paths when using an R package on already released versions of R should ideally first check whether the package allows influencing the length of the path: whether it can be told where to create files or how to name them.

If not, or if that is already minimal or default, it is worth trying to use a drive mapping (`subst` command) to get rid of any directory prefix. After all, the author of the package probably tested it in some directory, probably without long paths enabled, so this should create a setup that is not more limiting.

Finally, if that does not help, try to make sure that 8.3 names are enabled (to confuse matters, they are sometimes also called "short names") for the drive and directories involved (see `dir /x` command, `fsutil file setshortname`). Try to make the package use the 8.3 name variants; it is even possible to set them manually, and so influence their length further.

How hard it would be to make the package use them depends on the situation: it might happen automatically, it might work by specifying those to the package functions in the short form, or it might not work at all when the package intentionally normalizes paths or otherwise expands short names.

## Use reasonably short names

Path length in practice is a shared resource: different components of the path are named by different entities and software.
In my example, `msys64\home` comes from Msys2 conventions, `tomas` is my user name, `ucrt3\svn` was my local decision on the system, `ucrt3\r_packages` is how a subversion repository is structured, `pkgcheck\CRAN` was a design decision in package checks scripts (`CRAN` is indeed a name of the package repository), `ADAPTS` is the name of the package, `RtmpKWYapj` is named by R (a temporary session directory), and finally

```
gList.Mast.cells_T.cells.follicular.helper_T.cells.CD4.memory.activated_T.cells.CD4.memory.resting_T.cells.CD4.naive_T.cells.C_Plasma.cells_B.cells.memory_B.cells.naive.RData.RData
```

is a name created by the package.

Path length being a shared resource, responsible parties would choose a reasonably shallow nesting level and particularly reasonably short names of the components, of the files and directories.

This example is an extreme case where clearly the file name takes unfairly too much. The length of the file name should be constant with respect to the size of the input. Someone might argue that my prefix was also a bit too deep.

Despite the long path support in Windows and efforts like this, it will take a very long time, at the least, before one can rely on paths longer than 260 characters on Windows. Prevention will thus probably remain the key part of the solution for a long time.

## Write code robust to arbitrarily long names

According to the current standards and implementations, there is no (known, reasonably small) limit on current systems for the length of the entire path name.

At a minimum, code should make it clear when it is imposing its own limit on path name length. It should be robust to paths longer than that: report an error or perhaps skip them, but definitely not crash or silently truncate. Any self-imposed limit should ideally be at least what `PATH_MAX` is on Linux today (4096).

Still, in most cases it seems natural to use dynamic allocation and support arbitrarily long path names.
It would probably be a natural solution for new code.

## Make packages long-path aware

It makes sense to first review all uses of `MAX_PATH` and `PATH_MAX` in the code. This identifies places that need to be rewritten to support long paths. Ideally these constants would only be used with API that explicitly depends on them (e.g. `realpath`, very rare). In cases when the limit is application-imposed, they should be replaced by a different constant to make that clear.

I would recommend modifying the code such that the same code path is taken for short and long names. That way, the code would get tested using the currently available tests and by common regular use. Only optimize if ever needed, which would probably be rare in file-system operations, but is not impossible. Ideally there would be a switch to force the long-path code path while testing, e.g. by setting the initial size to a very small value when iterating to find the required buffer size.

Testing is essential to find any remaining problems, including limitations in the used libraries and in Windows itself. One cannot rely on the documentation. Also, it is of course easy to overlook problems without testing, even when the code attempts to check path lengths. I initially saw a lot of crashes of base R when long paths were enabled.

To check an R package, one may run `R CMD check --output=DIR` to select an output directory, and hence avoid running from a long directory. One may start R in a short directory and then change the current working directory to a long one when that helps the testing. One should now be able to install packages into a long directory, both from source and from binary versions. Bash in Msys2 as well as cmd.exe and Powershell can work with long directories.

# Summary

Updating R to support long paths on Windows took a bit over a month of work and changed about 4300 lines (added or deleted) in 70 files. So, the investment was quite large and this comes with a risk of introducing bugs.
Bug reports on suspicious changes in behavior of file-system operations, on Windows as well as on Unix, are particularly welcome, and sooner is better so that they can be fixed before the 4.3 release.

Some of the Windows-specific code has been updated along the way to avoid using deprecated functions, so there may be some maintenance benefit even regardless of long paths.