Age | Commit message (Collapse) | Author | Files | Lines |
|
syscall.h was chosen as the header to declare it, since its intended
usage is alongside syscalls as a fallback for operations the direct
syscall does not support.
|
|
to support the GNU extension of allocating a buffer for getcwd's
result when a null pointer is passed without incurring a link
dependency on free, we use a PATH_MAX-sized buffer on the stack and
only duplicate it to allocated storage after the operation succeeds.
unfortunately this imposed excessive stack usage on all callers,
including those not making use of the GNU extension.
instead, use a VLA to make stack allocation conditional.
|
|
|
|
the Linux SYS_nice syscall is unusable because it does not return the
newly set priority. always use SYS_setpriority. also avoid overflows
in addition of inc by handling large inc values directly without
examining the old nice value.
|
|
Currently getcwd(3) can succeed without returning an absolute path
because the underlying getcwd syscall, starting with linux commit
v2.6.36-rc1~96^2~2, may succeed without returning an absolute path.
This is a conformance issue because "The getcwd() function shall
place an absolute pathname of the current working directory
in the array pointed to by buf, and return buf".
Fix this by checking the path returned by syscall and failing with
ENOENT if the path is not absolute. The error code is chosen for
consistency with the case when the current directory is unlinked.
Similar issue was fixed in glibc recently, see
https://sourceware.org/bugzilla/show_bug.cgi?id=22679
|
|
commit f9fb20b42da0e755d93de229a5a737d79a0e8f60 switched from using a
pipe for the result to conveying it via the child process exit status.
Alexander Monakov pointed out that the latter could fail if the
application is not expecting faccessat to produce a child and performs
a wait operation with __WCLONE or __WALL, and that it is not clear
whether it's guaranteed to work when SIGCHLD's disposition has been
set to SIG_IGN.
in addition, that commit introduced a bug that caused EACCES to be
produced instead of EBUSY due to an exit path that was overlooked when
the error channel was changed, and introduced a spurious retry loop
around the wait operation.
|
|
The flags argument was missing, causing uninitalized data to be passed
to fchownat(2). The correct value of flags should match the fallback for
chown(3).
|
|
commit 0a950dcf15bb9f7274c804dca490e9e20e475f3e added checking that
the pathname a tty device was opened with actually matches the device,
which can fail to hold when a container inherits a tty from outside
the container. the error code added at the time was ENOENT; however,
discussions between affected applications and glibc developers
resulted in glibc adopting ENODEV as the error for this condition, and
this has now been documented in the man pages project as well. adopt
the same error code for consistency.
patch by Christian Brauner.
|
|
linux containers use separate mount namespace so the /proc
symlink might not point to the right device if the fd was
opened in the parent namespace, in this case return ENOENT.
|
|
despite sh not generally using register-pair alignment for 64-bit
syscall arguments, there are arch-specific versions of the syscall
entry points for pread and pwrite which include a dummy argument for
alignment before the 64-bit offset argument.
|
|
based on patch submitted by Jaydeep Patil, with minor changes.
|
|
patch by Mahesh Bodapati and Jaydeep Patil of Imagination
Technologies.
|
|
nominally the low bits of the trap number on sh are the number of
syscall arguments, but they have never been used by the kernel, and
some code making syscalls does not even know the number of arguments
and needs to pass an arbitrary high number anyway.
sh3/sh4 traditionally used the trap range 16-31 for syscalls, but part
of this range overlapped with hardware exceptions/interrupts on sh2
hardware, so an incompatible range 32-47 was chosen for sh2.
using trap number 31 everywhere, since it's in the existing sh3/sh4
range and does not conflict with sh2 hardware, is a proposed
unification of the kernel syscall convention that will allow binaries
to be shared between sh2 and sh3/sh4. if this is not accepted into the
kernel, we can refit the sh2 target with runtime selection mechanisms
for the trap number, but doing so would be invasive and would entail
non-trivial overhead.
|
|
the equivalent checks for newly opened stdio output streams, used to
determine buffering mode, are also fixed.
on most archs, the TCGETS ioctl command shares a value with
SNDCTL_TMR_TIMEBASE, part of the OSS sound API which was apparently
used with certain MIDI and timer devices. for file descriptors
referring to such a device, TCGETS will not fail with ENOTTY as
expected; it may produce a different error, or may succeed, and if it
succeeds it changes the mode of the device. while it's unlikely that
such devices are in use, this is in principle very harmful behavior
for an operation which is supposed to do nothing but query whether the
fd refers to a tty.
TIOCGWINSZ, used to query logical window size for a terminal, was
chosen as an alternate ioctl to perform the isatty check. it does not
share a value with any other ioctl commands, and it succeeds on any
tty device.
this change also cleans up strace output to be less ugly and
misleading.
|
|
commit 82dc1e2e783815e00a90cd3f681436a80d54a314 addressed the
resolution of Austin Group issue 529, which requires close to leave
the fd open when failing with EINTR, by returning the newly defined
error code EINPROGRESS. this turns out to be a bad idea, though, since
legacy applications not aware of the new specification are likely to
interpret any error from close except EINTR as a hard failure.
|
|
previously, aio operations were not tracked by file descriptor; each
operation was completely independent. this resulted in non-conforming
behavior for non-seekable/append-mode writes (which are required to be
ordered) and made it impossible to implement aio_cancel, which in turn
made closing file descriptors with outstanding aio operations unsafe.
the new implementation is significantly heavier (roughly twice the
size, and seems to be slightly slower) and presently aims mainly at
correctness, not performance.
most of the public interfaces have been moved into a single file,
aio.c, because there is little benefit to be had from splitting them.
whenever any aio functions are used, aio_cancel and the internal
queue lifetime management and fd-to-queue mapping code must be linked,
and these functions make up the bulk of the code size.
the close function's interaction with aio is implemented with weak
alias magic, to avoid pulling in heavy aio cancellation code in
programs that don't use aio, and the expensive cancellation path
(which includes signal blocking) is optimized out when there are no
active aio queues.
|
|
these are mandatory cancellation points per POSIX, so their omission
was a conformance bug.
|
|
in the current version of __synccall, the callback is always run, so
failure to handle this case did not matter. however, the upcoming
overhaul of __synccall will have failure cases, in which case the
callback does not run and errno is already set. the changes being
committed now are in preparation for that.
|
|
the code being removed was introduced to work around "partial failure"
of multi-threaded set*id() operations, where some threads would
succeed in changing their ids but an RLIMIT_NPROC setting would
prevent the rest from succeeding, leaving the process in an
inconsistent and dangerous state. however, the workaround code did not
handle important usage cases like swapping real and effective uids
then restoring their original values, and the wrongful kernel
enforcement of RLIMIT_NPROC at setuid time was removed in Linux 3.1,
making the workaround obsolete.
since the partial failure still is dangerous on old kernels, and could
in principle happen on post-fix kernels as well if set*id() syscalls
fail for another spurious reason such as resource-related failures,
new code is added to detect and forcibly kill the process if/when such
a situation arises. future documentation releases should be updated to
reflect that setting RLIMIT_NPROC to RLIM_INFINITY is necessary to
avoid this forced-kill on old kernels. ideally, at some point the
kernel will get proper multi-threaded set*id() syscalls capable of
performing their actions atomically, and all of the userspace code to
emulate them can be treated as a fallback for outdated kernels.
|
|
opening /dev/tty then using ttyname_r on it does not produce a
canonical terminal name; it simply yields "/dev/tty".
it would be possible to make ctermid determine the actual controlling
terminal device via field 7 of /proc/self/stat, but doing so would
introduce a buffer overflow into applications built with L_ctermid==9,
which glibc defines, adversely affecting the quality of ABI compat.
|
|
such archs are expected to omit definitions of the SYS_* macros for
syscalls their kernels lack from arch/$ARCH/bits/syscall.h. the
preprocessor is then able to select the an appropriate implementation
for affected functions. two basic strategies are used on a
case-by-case basis:
where the old syscalls correspond to deprecated library-level
functions, the deprecated functions have been converted to wrappers
for the modern function, and the modern function has fallback code
(omitted at the preprocessor level on new archs) to make use of the
old syscalls if the new syscall fails with ENOSYS. this also improves
functionality on older kernels and eliminates the incentive to program
with deprecated library-level functions for the sake of compatibility
with older kernels.
in other situations where the old syscalls correspond to library-level
functions which are not deprecated but merely lack some new features,
such as the *at functions, the old syscalls are still used on archs
which support them. this may change at some point in the future if or
when fallback code is added to the new functions to make them usable
(possibly with reduced functionality) on old kernels.
|
|
linux, gcc, etc. all use "sh" as the name for the superh arch. there
was already some inconsistency internally in musl: the dynamic linker
was searching for "ld-musl-sh.path" as its path file despite its own
name being "ld-musl-superh.so.1". there was some sentiment in both
directions as to how to resolve the inconsistency, but overall "sh"
was favored.
|
|
|
|
the workaround/fallback code for supporting O_PATH file descriptors
when the kernel lacks support for performing these operations on them
caused EBADF to get replaced by ENOENT (due to missing entry in
/proc/self/fd). this is unlikely to affect real-world code (calls that
might yield EBADF are generally unsafe, especially in library code)
but it was breaking some test cases.
the fix I've applied is something of a tradeoff: it adds one syscall
to these operations on kernels where the workaround is needed. the
alternative would be to catch ENOENT from the /proc lookup and
translate it to EBADF, but I want to avoid doing that in the interest
of not touching/depending on /proc at all in these functions as long
as the kernel correctly supports the operations. this is following the
general principle of isolating hacks to code paths that are taken on
broken systems, and keeping the code for correct systems completely
hack-free.
|
|
|
|
this is purely a wrapper for close since Linux does not support EINTR
semantics for the close syscall.
|
|
now that we're waiting for the exit status of the child process, the
result can be conveyed in the exit status rather than via a pipe.
since the error value might not fit in 7 bits, a table is used to
translate possible meaningful error values to small integers.
|
|
I mistakenly assumed that clone without a signal produced processes
that would not become zombies; however, waitpid with __WCLONE is
required to release their pids.
|
|
as usual, this is needed to avoid fd leaks. as a better solution, the
use of fds could possibly be replaced with mmap and a futex.
|
|
this fixes an issue reported by Daniel Thau whereby faccessat with the
AT_EACCESS flag did not work in cases where the process is running
suid or sgid but without root privileges. per POSIX, when the process
does not have "appropriate privileges", setuid changes the euid, not
the real uid, and the target uid must be equal to the current real or
saved uid; if this condition is not met, EPERM results. this caused
the faccessat child process to fail.
using the setreuid syscall rather than setuid works. POSIX leaves it
unspecified whether setreuid can set the real user id to the effective
user id on processes without "appropriate privileges", but Linux
allows this; if it's not allowed, there would be no way for this
function to work.
|
|
based on patch by Michael Forney. at the same time, I've changed the
if branch to be more clear, avoiding the comma operator.
the underlying issue is that Linux always returns ERANGE when size is
too short, even when it's zero, rather than returning EINVAL for the
special case of zero as required by POSIX.
|
|
clone will pass the return value of the start function to SYS_exit
anyway; there's no need to call the syscall directly.
|
|
the child process's stack may be insufficient size to support a signal
frame, and there is no reason these signal handlers should run in the
child anyway.
|
|
this is another case of the kernel syscall failing to support flags
where it needs to, leading to horrible workarounds in userspace. this
time the workaround requires changing uid/gid, and that's not safe to
do in the current process. in the worst case, kernel resource limits
might prevent recovering the original values, and then there would be
no way to safely return. so, use the safe but horribly inefficient
alternative: forking. clone is used instead of fork to suppress
signals from the child.
fortunately this worst-case code is only needed when effective and
real ids mismatch, which mainly happens in suid programs.
|
|
on newer kernels, fchdir and fstat work anyway. this same fix should
be applied to any other syscalls that are similarly affected.
with this change, the current definitions of O_SEARCH and O_EXEC as
O_PATH are mostly conforming to POSIX requirements. the main remaining
issue is that O_NOFOLLOW has different semantics.
|
|
I intend to add more Linux workarounds that depend on using these
pathnames, and some of them will be in "syscall" functions that, from
an anti-bloat standpoint, should not depend on the whole snprintf
framework.
|
|
also clean up, optimize, and simplify the code, removing branches by
simply pre-setting the result string to an empty string, which will be
preserved if other operations fail.
|
|
|
|
SYS_pipe is not usable directly in general, since mips has a very
broken calling convention for the pipe syscall. instead, just call the
function, so that the mips-specific ugliness is isolated in
mips/pipe.s and not copied elsewhere.
|
|
also, don't waste code/time on F_GETFL since pipes always have blank
flags initially (at least on old kernels, which are all this fallback
code matters for).
|
|
this bug seems to have caused any failure by pipe2 on such systems to
set errno to 1, rather than the proper error code.
|
|
1. don't open /dev/null just as a basis to copy flags; use shared
__fmodeflags function to get the right file flags for the mode.
2. handle the case (probably invalid, but whatever) case where the
original stream's file descriptor was closed; previously, the logic
re-closed it.
3. accept the "e" mode flag for close-on-exec; update dup3 to fallback
to using dup2 so we can simply call __dup3 instead of putting fallback
logic in freopen itself.
|
|
since we target systems without overcommit, special care should be
taken that system() and popen(), like posix_spawn(), do not fail in
processes whose commit charges are too high to allow ordinary forking.
this in turn requires special precautions to ensure that the parent
process's signal handlers do not end up running in the shared-memory
child, where they could corrupt the state of the parent process.
popen has also been updated to use pipe2, so it does not have a
fd-leak race in multi-threaded programs. since pipe2 is missing on
older kernels, (non-atomic) emulation has been added.
some silly bugs in the old code should be gone too.
|
|
these interfaces have been adopted by the Austin Group for inclusion
in the next version of POSIX.
|
|
|
|
austin group interpretation for defect #529
(http://austingroupbugs.net/view.php?id=529) tightens the
requirements on close such that, if it returns with EINTR, the file
descriptor must not be closed. the linux kernel developers vehemently
disagree with this, and will not change it. we catch and remap EINTR
to EINPROGRESS, which the standard allows close() to return when the
operation was not finished but the file descriptor has been closed.
|
|
|
|
|
|
|
|
note that POSIX does not specify these functions as _Noreturn, because
POSIX is aligned with C99, not the new C11 standard. when POSIX is
eventually updated to C11, it will almost surely give these functions
the _Noreturn attribute. for now, the actual _Noreturn keyword is not
used anyway when compiling with a c99 compiler, which is what POSIX
requires; the GCC __attribute__ is used instead if it's available,
however.
in a few places, I've added infinite for loops at the end of _Noreturn
functions to silence compiler warnings. presumably
__buildin_unreachable could achieve the same thing, but it would only
work on newer GCCs and would not be portable. the loops should have
near-zero code size cost anyway.
like the previous _Noreturn commit, this one is based on patches
contributed by philomath.
|