When testing development versions of Rtools for Windows, I’ve ran into strange failures of several CRAN packages where R could not find, read from or write to some files. The files should have been in temporary directories which get automatically deleted, so it took some effort to find out that actually they existed and were accessible. That didn’t make any sense at first, but eventually I got to this output:
Warning in gzfile(file, "wb") :
cannot open compressed file 'C:\msys64\home\tomas\ucrt3\svn\ucrt3\r_packages\pkgcheck\CRAN\ADAPTS\tmp\RtmpKWYapj/gList.Mast.cells_T.cells.follicular.helper_T.cells.CD4.memory.activated_T.cells.CD4.memory.resting_T.cells.CD4.naive_T.cells.C_Plasma.cells_B.cells.memory_B.cells.naive.RData.RData', probable reason 'No such file or directory'
Error in gzfile(file, "wb") : cannot open the connection
Calls: remakeLM22p -> save -> gzfile
Execution halted
Such a long file name. The entire path in the warning message takes 265 bytes. Perhaps it is too long and, for some reason it can be created but not read in a particular way?
To confirm the theory, I’ve created a mapped drive to get rid of the
/msys64/home/tomas/ucrt3/svn/ucrt3/r_packages/pkgcheck
prefix of the path.
This package and several other started to pass their checks. Interestingly,
a junction didn’t work that well, because path normalization followed it in
some cases, getting again the long paths. R has been improved since and is
more likely to provide a hint (warning or error) that the path is too long,
so diagnosing the problem is often easier than this, yet the message may
also be too pessimistic (more below).
This text provides some background on path-length limits and offers recommendations for what to do about them. It reports on recent improvements in R, which allow R and packages to work with longer paths on recent Windows 10, where and when the system limit can be overriden.
Following the changes in R, some of R packages will have to be updated as
well to work with long paths. Primarily authors of packages using
PATH_MAX
or even MAX_PATH
in their code are advised to continue reading.
The changes in R make the updating of packages possible (they can be
tested), but also more important (they could crash when seeing long paths).
It is therefore not advised to enable long paths on production systems, yet
- the feature needs to be considered experimental with the R ecosystem.
Background
On Windows, there used to be a limit on the length of the entire path name
imposed by the operating system. It is derived from constant MAX_PATH
(260, not likely to change) and limits the number of UCS-2 (16-bit, so only
BMP characters) words including the terminator. Depending on the API, it
may be in addition applied directly to the number of bytes accepted as path
names in ANSI functions, e.g. 259 UTF-8 bytes plus a 1-byte terminator.
But, it may also be only applied once converted to UCS-2, and then 259
BMP
characters with a 2-byte terminator may correspond up to 3*259
UTF-8 bytes
with a 1-byte terminator.
However, for quite some time, much longer path names can exist on Windows.
The file system normally used (NTFS) allows that. Windows API started
supporting so called extended-length path syntax (\\?\D:\long_path
) in
some functions which allowed to overcome the limit, even though
anecdotically it is not used much. In addition, where it seemed safe wrt to
the applications, Windows API started accepting much longer path names even
with the regular syntax, primarily in Unicode variants of the functions.
Hence, while some Windows applications are written assuming that no path can
be longer than MAX_PATH
, such paths may and do exist in practice. How
come that the old applications making that assumptions still seem to be
(mostly) working?
The trick is that Windows hides long paths from old applications in APIs where it is believed they could cause trouble, which typically means APIs where the path is being returned to the application. The idea is that long paths are rare, anyway, and users would unlikely try using them especially with old applications.
Once an application is updated to work with long paths, it can opt in to see
them by declaring it is long-path-aware in its manifest (so in the .EXE
file, at build time). In addition, this needs to be allowed system-wide.
It is supported since somewhat recent Windows 10 and is not enabled by
default.
The current path length limit imposed by Windows is approximately 32,767 UCS-2 words. An exact single limit does not exist (the documentation says it is approximate and depends on internal expansions), and that is in addition to the mentioned uncertainty due to encoding and ANSI vs Unicode functions described before.
R uses MinGW-W64 on Windows, which defines PATH_MAX
to the same value as
MAX_PATH
, so 260, to help compiling code written originally for POSIX
systems. The macros have a similar meaning, but the details are different.
Readers interested in the exact wording in POSIX are advised to check the
specification. I didn’t try to find out whether that was the correct
interpretation in the past, but today PATH_MAX
is not a limit for the
entire path length that may exist in the system. When PATH_MAX
is
defined, it is the maximum number of bytes that will be stored to a user
buffer by functions that do not allow to specify the buffer size. Such
calls are rare today (R uses realpath
for instance) and PATH_MAX
is then
explicitly mentioned in their documentation. Also, if PATH_MAX
is defined
and the OS limits path lengths, it cannot limit them to a smaller number
than PATH_MAX
. But, the OS may accept much longer paths and much longer
paths may exist.
In addition, the limit may differ based on the file system. On Unix, all
file systems are mounted to the same tree, so essentially the limit may
depend on a path. If it does, PATH_MAX
shall be undefined and instead the
user can use pathconf
(or fpathconf
) to find the limit for particular
path. Again, no limit may be given. Also, a limit too large for allocation
may be given. Some applications tend to define PATH_MAX
, when not
defined, to a certain large constant, which may complicate reviewing the
code (essentially it then becomes an application-imposed limit).
The actual value of PATH_MAX
is not defined by the standard and differs on
different systems, common values are 4096 on Linux and 1024 on macOS.
In summary, there is no (exactly, always, at compile time) known limit on the entire path name length, neither on Windows nor on Unix (POSIX). The actual limits imposed definitely differ between main platforms on which R runs (Linux, macOS, Windows) and there may be some variation even on a single machine (on Windows there definitely is, on Unix POSIX allows it).
Declaring long-path awareness
For R and packages to work with paths longer than 260 characters on Windows, when it is allowed in the system, R needs to be made long-path aware and declare this to Windows. E.g. Python already does that and the Python installer offers to enable long paths in the system.
To declare it, one sets longPathAware
to true
in manifests of all R
front-ends (Rterm.exe
, Rgui.exe
, etc.), so in the same place where R
opts in for UTF-8 as the system and native encoding. That this is done at
process level means that applications embedding R would have to do it as well
to get the support. Once R does it (R-devel already does), the packages and
libraries it uses will also receive long paths, so, they should be made
long-path aware, but could hardly without testing.
To resolve the chicken-and-egg problem, there is the system setting of long
paths by Windows. By default, this is still disabled. It can be enabled by
enthusiast users, people who really need it for specific applications and
choose to take the risk of running into problems in selected
packages/libraries, and by developers of those packages, who could hardly
make them long-path aware without being able to test and debug. The setting
is in the registry, under
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem]
, field
LongPathsEnabled
. It can also be controlled by Group Policy (“Enable Win32
long paths”).
Ensuring long-path awareness
The key part of making an application long-path aware is rewriting the code without an assumption that there is a fixed maximum length of an entire path name. Such an assumption may have lead to static allocation of buffers for paths, to limited checking of return values from system functions, to limited buffer-overflow checking when constructing path names, and to now possibly unhelpful validity checking of paths given by user (printing warnings/errors about paths being too long).
In my view, it would make sense to get rid of this assumption in all code,
not only in Windows-specific parts. An obvious result of such a rewrite is
that the code will never or almost never use PATH_MAX
nor MAX_PATH
macros.
In addition, on Windows there is number of system functions and components which do not support long paths even though their API would allow it. It is necessary to find them and replace them by modern API. Not always the limitations are documented, so we are stuck with testing.
This also may be a natural opportunity to replace calls to deprecated Windows API functions by more recent ones, even when the old ones support long paths, because of the necessity to rewrite the code, anyway. Increased code complexity coming with this change may require local refactoring.
Figuring out the required buffer size
Most Windows API functions returning paths accept a pointer to a user buffer and the buffer size. When the size is sufficient, they fill in the buffer and return the number of bytes used (excluding the terminator). When the buffer size is too small, they return the number of bytes needed (including the terminator). Unicode versions of the functions do the same but the unit is UCS-2 (16-bit) words. So, one can call the function twice, first time to find out the needed buffer size and second time with sufficiently large buffer.
Old code like this (excluding error handling):
char cwd[MAX_PATH];
GetCurrentDirectory(MAX_PATH, cwd);
can be changed into:
char *cwd;
DWORD res = GetCurrentDirectory(0, NULL);
cwd = (char *)malloc(res);
GetCurrentDirectory(res, cwd);
One could try to optimize the code by using a non-empty buffer already
during the first call, so that in “most cases” only one call to
GetCurrentDirectory
would suffice. The downside would be increased code
complexity and complicated testing: longer paths would be rare, and hence
the code path would rarely be tested. The initial size could indeed even be
MAX_PATH
.
While error handling is excluded from the example, calling the function twice comes with a (theoretical, but still) risk that the external conditions would change in between, in this case another thread could change the current working directory of the process to a value requiring a longer buffer, so in theory even the second call could fail due to insufficient buffer size.
One needs to be careful when checking the return values of such functions,
because there may be slight variations in semantics. Some functions return
the required buffer size without the terminator, such as DragQueryFile
.
This matches the behavior of e.g. C snprintf
function.
Some Windows API calls already return a dynamically allocated result, e.g.
wchar_t *mydocs;
SHGetKnownFolderPath(&FOLDERID_Documents, KF_FLAG_CREATE, NULL, &mydocs);
// copy mydocs
CoTaskMemFree(mydocs);
There finding the result length is easy (e.g. wcslen()
). One can allocate
a buffer for the result in the preferred way, copy it, and free the
original using the correct free function following the documentation of the
specific API function (allocation is discussed later below).
There are Windows API calls which do not return the required buffer size,
but only return an error signalling that the provided buffer was not large
enough. One then needs to call the function with several times, increasing
the buffer size. This example is for GetModuleFileName
:
DWORD size = 1;
char *buf = NULL;
for(;;) {
buf = (char *)malloc(size);
if (!buf)
return NULL;
DWORD res = GetModuleFileName(NULL, buf, size);
if (res > 0 && res < size) /* success */
break;
free(buf);
if (res != size) /* error */
return NULL;
size *= 2; /* try again with 2x larger buffer */
}
POSIX getcwd()
functions is another example, where one needs to iterate to
find out the required buffer size, even though some extensions allow to
return a dynamically allocated result.
Iterating is not a suitable solution in all such cases. For instance,
GetOpenFileName
function opens a dialog asking the user to select a file
to be opened. The caller provides a buffer for the file name and the size.
The function reports an error if the buffer was too small. Right, the
application could increase the buffer size, open the dialog again, and ask
the user again to make the choice. This would unlikely be practical and
using a hard-coded large limit is probably better for most uses. There is
probably also a limit to how long path would a user normally be willing to
select manually.
Dynamic allocation
While it is natural to use dynamic allocation for paths given there is no useful upper limit on their length, introducing dynamic allocation where it hasn’t been before has to be done with care.
Using malloc()
requires checking for a memory allocation failure and
deciding what to do when it happens: map it to error codes returned by the
function at hand, or throw an R error. Throwing an R error requires
additional care: if this introduces a possible R error in a function where
it wasn’t possible before (so at any call site), it may be introducing also
a resource leak (e.g. some open file or another dynamically allocated
object not arranged to be released on a long jump). If in between the
malloc()
and free()
calls there is any call to R API, there is a risk of
a long jump there, and the buffer allocated by malloc()
hence should be
arranged to be freed if that happens. There is API to do that, both
internally in R and public for packages, but it may be tedious to handle all
cases.
Another problem of introducing malloc()
is releasing the memory by the
caller. If a function previously returned a pointer to a statically
allocated buffer and we change it to return memory allocated by malloc()
,
the callers will have to know to release it, and will have to have access to
the correct matching function to free it. This is easily possible only for
rarely used or internal functions.
An example of a function changed this way on Windows is getRUser()
. It
now returns memory that should be freed using freeRUser()
function by R
front-ends and embedding applications. Older applications would not know to
free the memory, because a statically allocated buffer was used before, but
this function is normally called just once during R startup, so the leak is
not a problem. malloc()
was the choice in startup code as R heap is not
yet available.
However, in typical package code as well as often in base R itself, when R
is already running, it is easier to use R_alloc
than malloc
for
temporarily created buffers. Introducing R_alloc
in these cases usually
doesn’t require the callers to be modified: the memory is automatically
freed at the end of .Call
/.External
or can be managed explicitly by
vmaxset/vmaxget
in stack-based manner. Care has to be taken when there is
a risk the function modified will be called a large number of times before
the cleaning would take place. Also, there must not be an undesirable
cleanup using vmaxset
before the buffer is to be used.
R_alloc
introduces allocation from the R heap, and this means potentially
also a garbage collection. Therefore, care must be taken whether this is
safe to introduce, whether it would not introduce PROTECT errors. In
theory, R_alloc
also introduces a possible long-jump, because of a
potential allocation error. However, memory allocated by R_alloc
gets
cleaned on long jumps (the allocation stack depth is restored at the
corresponding contexts), so one does not have to worry about memory leaks.
In base R, calls to Windows API have been mostly rewritten to dynamic
allocation, using malloc
in startup code and R_alloc
elsewhere. Despite
the discussion above, deciding on which function to allocate memory to use
hasn’t always been hard: often R_alloc
has already been used, so wasn’t
newly introduced. But some static allocation remains.
Static allocation
In some cases, changing existing code for dynamic allocation of paths may still seem overwhelming or too intrusive. It may be easier, in some cases at least as a temporary solution, to give up on supporting arbitrarily long paths, but instead impose an application-specific limit (much larger than 260 bytes on Windows). It is still necessary to handle things that weren’t handled in code that assumed a length limit on any existing path.
Compared to dynamic allocation, one does not have to worry about introducing garbage collection (PROTECT errors) and resource leaks (the client not freeing the memory). But, there is still an issue of introducing error paths, and hence potential resource leaks.
Unlike dynamic allocation, one needs to carefully protect against buffer overflows and detect when a too-long path would arise e.g. from concatenation. One needs to report that as an error rather than corrupting memory or silently truncating. Also, the code may become complicated by having to deal with multiple path-length limits when the OS API introduces one and the application another.
In base R, static allocation was still used for few widely called utility
functions (where changing/reviewing the callers would be too difficult), for
incorporated external code where the change would complicate maintenance,
where one could not find the buffer size, anyway, and in some code used also
on Unix, where PATH_MAX
is usually large enough so that it does not cause
trouble.
Functions to be avoided
Some code has to be rewritten to use different API to support long paths. Only several examples are given here to illustrate the problem.
An old POSIX function getwd()
(removed from the standard in 2008) doesn’t
allow to specify the size of the user buffer. The buffer needs to be at
least of size PATH_MAX
and the function returns an error if the path is
longer than PATH_MAX
. Another example is realpath
. These functions in
their old form are broken by design, because in current POSIX, PATH_MAX
may not even be defined, or may be a number too large to allocate a buffer
of that size, etc. Still, such functions are rare on both Unix and Windows.
Unfortunately, even calls which have semantics that would allow supporting long paths sometimes do not support them on Windows.
For example, to locate the “Documents” folder, R previously used
SHGetSpecialFolderLocation/SHGetPathFromIDList
, but to support long paths,
this was changed to SHGetKnownFolderPath
, because SHGetPathFromIDList
does not support long paths. This illustrates that such a limitation
sometimes exists even when the API already returns a dynamically allocated
result.
GetFullPathNameA
(the ANSI version) does not work with long paths, but
GetFullPathNameW
does. Hence, calls to the ANSI version need to be
replaced by a conversion and a call the Unicode version. This doesn’t make
much sense, because the ANSI version should be doing just that, and because
the API would allow supporting long paths, as the buffer size is accepted
and real size signalled. Still at least it is documented.
Many API functions document the limit for the ANSI version and refer to the
Unicode version to overcome it, but that seems surprising (or perhaps
outdated) given the new support for UTF-8 and recommendation to use the ANSI
functions. Often the ANSI functions happen to work with long paths (when
opted in). For example, GetShortPathName
does, while it is documented to
have that limitation in the ANSI version as well.
The old dialog for choosing a directory SHBrowserForFolder
does not
support long paths (it is used in Rgui) and had to be replaced by
IFileOpenDialog
, which required more than several lines of code.
Directory traversal
R internally uses POSIX opendir/readdir/closedir
functions for listing
files in a directory. These are not available on Windows directly, but R has
been using MinGW-W64 implementations, both the ANSI and the Unicode
variants.
It should be said here that there is also a limit on the length of an
individual file. Luckily, this limit is about the same on all systems where
R runs and it hasn’t changed (at least not recently). So, it is not a
problem that these functions allocate a single file name statically
(d_name
).
However, the MinGW-W64 (in version 10) implementation of these functions use
GetFullPathName
on a statically allocated buffer of PATH_MAX
characters;
they use it on the input path used to start the search. So, R now has its
own re-implementation of a subset of the functionality of
opendir/readdir/closedir
which does support long paths.
The functions for directory traversal also had to be re-factored not to make assumptions about a limit for the full paths that may exist in the system. Such functions internally need to keep appending directory names to build the currently visited path. This previously used a statically allocated buffer, but now uses a dynamically allocated string buffer, which is automatically expanded if needed.
Checking of return values
An example to illustrate the need for reviewing old code which assumed that
no path could be longer than MAX_PATH
is from the implementation of
Sys.which
:
int iexts = 0;
const char *exts[] = { ".exe" , ".com" , ".cmd" , ".bat" , NULL };
while (exts[iexts]) {
strcpy(dest, exts[iexts]); // modifies fl
if ((d = SearchPath(NULL, fl, NULL, MAX_PATH, fn, &f))) break;
iexts++ ;
}
The loop tries to find an executable on PATH using different suffixes. A
non-zero exit value of SearchPath
is taken as a success. The
function returns zero on error. It returns a value larger than
nBufferLength
(which received the value of MAX_PATH) to indicate that the
buffer wasn’t large enough, but that wasn’t checked in the old code as it
was assumed to be impossible.
So, when there is a very long path on PATH, say at the beginning of it,
Sys.which()
would fail for files that in fact were on PATH. It doesn’t
fail in R 4.2.2 and earlier, because Windows hides such long path components
from R, SearchPath
skips it. But it would fail in R-devel on system with
enabled long paths.
Checking of path lengths
Given that there is no known limit on the entire path length in the system,
it is questionable whether preventive checks make sense, and particularly so
with the MAX_PATH
limit on Windows. It is true that, unless the long
paths are enabled in the system, even R-devel would be prone to this limit,
but as described earlier, it is only some functions in some cases that are
prone to it, some other functions work. So, an error may be premature and a
warning may be confusing. Certainly the checks make sense if an application
decides to impose its own limit: it is needed to protect static buffers on
input from overflow.
Limitations
Long path support in Windows is only available in Windows 10 since version
1607 (released in 2016). On older systems, R would still be subject to the
MAX_PATH
limit.
Windows applications (“Win32”) cannot be started with the current directory being the long path, even when the long path support is enabled. This quite significantly restricts potential use of long paths. In R package development, one would easily run into this when checking or building packages, which in turn often executes external commands. This also means that testing the long path support is difficult.
Some Windows components still do not support long paths. Hopefully this will change over time, but it is already over 6 years since the feature has been released. For example it is not possible to print a document to a file with a long path - I’ve ran into this while testing different functions of Rgui with long paths, and I didn’t find alternative API. After all, several Windows applications I tried had the same limit.
Inevitably, a number of existing applications would not support long paths, and some may be used together with R, so R supporting them would not help.
As noted before, the feature in base R is to be treated as experimental
particularly because packages have not yet been updated. While it seems
there is no more than 100 CRAN and Bioconductor packages using PATH_MAX
(or MAX_PATH
) constants in their code, it is not clear how many would be
affected in bad ways. It is not easily possible to “run checks” for all
CRAN/Bioconductor packages to test that, because of the limitations in
executing from paths with the long name. So, the level on the long path
support and testing in packages will be mostly left to manual work.
Recommendations
I offer my recommendations based on reading about this problem and implementing long-path support in base R.
Work-arounds
Users who run into the problem of long paths when using an R package on already released versions of R should ideally first check whether the package allows to influence the length of the path: whether it can be told where to create files or how to name them.
If not, or if that is already minimal or default, it is worth trying to use
a drive mapping (subst
command) to get rid of any directory prefix. After
all, the author of the package probably tested it in some directory,
probably without long paths enabled, so this should create a setup that is
not more limiting.
Finally, if that does not help, try to make sure that 8.3 names are enabled
(to confuse matters, they are sometimes also called “short names”) for the
drive and directories involved (see dir /x
command, fsutil file setshortname
). Try to make the package use the 8.3 name variants; it is
even possible to set them manually, so influence their length further. How
hard would it be to make the package use them would depend on the situation:
it might happen automatically, it might work by specifying those to the
package functions in the short form, or it might not work at all when the
package intentionally normalizes paths or otherwise expands short names.
Use reasonably short names
Path length in practice is a shared resource, different components of the
path are named by different entities and software. In my example
msys64\home
comes from Msys2 conventions, tomas
is my user name,
ucrt3\svn
was my local decision on the system ucrt3\r_packages
is how a
subversion repository is structured, pkgcheck\CRAN
was a design decision
in package checks scripts (CRAN
is indeed a name of the package
repository), ADAPTS
is the name of the package, RtmpKWYapj
is named by R
(a temporary session directory), and finally
gList.Mast.cells_T.cells.follicular.helper_T.cells.CD4.memory.activated_T.cells.CD4.memory.resting_T.cells.CD4.naive_T.cells.C_Plasma.cells_B.cells.memory_B.cells.naive.RData.RData
is a name created by the package.
Path length being a shared resource, responsible parties would choose reasonably shallow nesting level and particularly reasonably short names of the components, of the files and directories. This example is an extreme case where clearly the file name takes unfairly too much. The file name should be constant wrt to the size of the input. Someone might argue that my prefix was also a bit too deep.
Despite the long path support in Windows and efforts like this, it will take “at least” very long before one could reliably rely on paths longer than 260 characters on Windows. Prevention will thus probably remain the key part of the solution for a long time.
Write code robust to arbitrarily long names
According to the current standards and implementations, there is no (known, reasonably small) limit on current systems for the length of the entire path name.
At a minimum, code should make it clear when it is imposing its own limit on
path name length. It should be robust to paths longer than that: report an
error or perhaps skip them, but definitely do not let the code crash or
silently truncate. Any self-imposed limit should ideally be at least what
PATH_MAX
is on Linux today (4096).
Still, in most cases it seems natural to use dynamic allocation and support path names of arbitrarily long names. It would probably be a natural solution for new code.
Make packages long-path aware
It makes sense to first review all uses of MAX_PATH
and PATH_MAX
in the
code. This identifies places that need to be rewritten to support long
paths. Ideally these constants would only be used with API that explicitly
depends on them (e.g. realpath
, very rare). In cases when the limit is
application-imposed, they should be replaced by a different constant to make
that clear.
I would recommend modifying the code such that the same code path is taken for short and long names. That way, the code would get tested using the currently available tests and by common regular use. Only optimize if ever needed, which would probably be rare in file-system operations, but is not impossible. Ideally there would be a switch to use the long path while testing, e.g. by setting the initial size to a very small value when iterating to find the required buffer size.
Testing is essential to find any remaining problems, including limitations in the used libraries and in Windows itself. One cannot rely on the documentation. Also, it is of course easy to overlook problems without testing, even when the code attempts to check path lengths. I’ve initially seen a lot of crashes of base R when enabled long paths.
To check an R package, one may run R CMD check --output=DIR
to select an
output directory, hence avoid running from a long directory. One may start
R in a short directory and then change the current working directory to a
long one when that helps the testing. One should now be able to install
packages into a long directory, both from source and from binary versions.
Bash in Msys2 as well as cmd.exe and Powershell can work with long directories.
Summary
Updating R to support long paths on Windows took a bit over a month of work, changed about 4300 lines (added or deleted) in 70 files. So, the investment was quite large and this comes with a risk of introducing bugs. Bug reports on suspicious changes in behavior of file-system operations, on Windows as well as on Unix, are particularly welcome, and sooner is better so that they could be fixed before the 4.3 release.
Some of the Windows-specific code has been updated on the way to avoid using deprecated functions, so they may be some maintenance benefit even regardless of long paths.