Age | Commit message (Collapse) | Author | Files | Lines |
|
The received length field in the message may be greater than the
size of the 'answer' buffer in which the message resides. Currently,
ABUF_SIZE is 768. And if we get a larger 'alens[i]', it will result
in an out-of-bounds reading in __dns_parse().
To fix this, limit the length to the size of the received buffer.
|
|
EAI_OVERFLOW should be propagated as ERANGE to inform the caller about
the need to expand the buffer.
|
|
If the buffer passed to getservbyport_r is just enough to store two
pointers after aligning it, getnameinfo is called with buflen == 0
(which means that service name is not needed) and trivially succeeds.
Then, strtol is called on the address just past the buffer end, and
if it doesn't happen to find the port number there, getservbyport_r
spuriously succeeds and returns the same bad address to the caller.
Fix this by ensuring that buflen is at least 1 when passed to
getnameinfo.
|
|
getifaddrs computes &ctx->first->ifa even if ctx->first is NULL. While
this shouldn't be possible on the success path because the loopback
interface is hardcoded into the kernel, this is still possible on the
error path (for example, if __rtnetlink_enumerate couldn't create a
socket due to exceeding the fd limit).
|
|
accept4 emulation via accept ignores unknown flags, so it can spuriously
succeed instead of failing (or succeed without doing the action implied
by an unknown flag if it's added in a future kernel). Worse, unknown
flags trigger the fallback code even on modern kernels if the real
accept4 syscall returns EINVAL, because this is indistinguishable from
socketcall returning EINVAL due to lack of accept4 support.
Fix this by always failing with EINVAL if unknown flags are present and
the syscall is missing or failed with EINVAL.
|
|
This is completely analoguous to commit 633183b5d1c2.
Similar code called from __lookup_name is not affected because it checks
that the line contains the host name surrounded by blanks.
|
|
When IPv6 nameservers are present, __res_msend_rc attempts to disable
IPV6_V6ONLY socket option to ensure that it can communicate with IPv4
nameservers (if they are present too) via IPv4-mapped IPv6 addresses.
However, this option can't be disabled on bound sockets, so setsockopt
always fails.
|
|
A zero returned from recvmsg is currently treated as if some data were
received, so if a DNS server closes its TCP socket before sending the
full answer, __res_msend_rc will spin until the timeout elapses because
POLLIN event will be reported on each poll. Fix this by treating an
early EOF as an error.
|
|
DNS parsing callbacks pass the response buffer end instead of the actual
response end to dn_expand, so a malformed DNS response can use message
compression to make dn_expand jump past the response end and attempt to
parse uninitialized parts of that buffer, which might succeed and return
garbage.
|
|
There are several issues with range checks in this function:
* The question section parsing loop can read up to two out-of-bounds
bytes before doing the range check and bailing out.
* The answer section parsing loop, in addition to the same issue as
above, uses the wrong length in the range check that doesn't prevent
OOB reads when computing len later.
* The len range check before calling the callback is off by 10. Also,
p+len can overflow in a (probably theoretical) case when p is within
2^16 from UINTPTR_MAX.
Because __dns_parse is used only with stack-allocated buffers, such
small overreads can't result in a segfault. The first two also don't
affect the function result, but the last one may result in getaddrinfo
incorrectly succeeding and returning up to 10 bytes past the
response buffer as a part of the IP address, and in (canon) name
returned by getaddrinfo/getnameinfo being affected by memory past the
response buffer (because dn_expand might interpret it as a pointer).
|
|
Before this commit, DNS timeouts always used CLOCK_REALTIME, which
could produce spurious timeouts or delays if wall time changed for
whatever reason.
Now we try CLOCK_MONOTONIC and only fall back to CLOCK_REALTIME when
it is unavailable.
|
|
When a dot is encountered, the loop counter is incremented before
exiting the loop, but the corresponding ip array element is left
uninitialized, so the subsequent memmove (if "::" was seen) and the
loop copying ip to the output buffer will operate on an uninitialized
uint16_t.
The uninitialized data never directly influences the control flow and
is overwritten on successful return by the second half of the parsed
IPv4 address. But it's better to fix this to avoid unexpected
transformations by a sufficiently smart compiler and reports from
UB-detection tools.
|
|
The kernel defines a limit on the number of fds that can be passed
through an SCM_RIGHTS ancillary message as SCM_MAX_FD. The value was
255 before kernel 2.6.38 (after that it is 253), and an SCM_RIGHTS
ancillary message with 255 fds requires 1040 bytes, slightly more than
the current 1024 byte internal buffer in sendmsg. 1024 is an arbitrary
size, so increase it to match the the arbitrary size limit in the
kernel. This fixes tests that are verifying they support up to
SCM_MAX_FD fds.
|
|
commit f081d5336a80b68d3e1bed789cc373c5c3d6699b fixed
gethostbyname[2]_r to treat negative results as a non-error, leaving
gethostbyname[2] wrongly returning a pointer to the unfilled result
buffer rather than a null pointer. since, as documented with commit
fe82bb9b921be34370e6b71a1c6f062c20999ae0, the caller of
gethostby{name[2],addr}_r can always rely on the result pointer being
set, use that consistently rather than trying to duplicate logic about
whether we have a result or not in gethostby{name[2],addr}.
|
|
the only functional change here should be that MAXADDRS is only
checked for RRs that provide address results, so that a CNAME which
appears after an excessive number of address RRs does not get ignored.
I'm not aware of any servers that order the RRs this way, and it may
even be forbidden to do so, but I prefer having the callback logic not
be order dependent.
other than that, the motivation for this change is that the A and AAAA
cases were mostly duplicate code that could be combined as a single
code path.
|
|
returning -1 rather than 0 from the parse function causes __dns_parse
to bail out and return an error. presently, name_from_dns does not
check the return value anyway, so this does not matter, but if it ever
started treating this as an error, lookups with large numbers of
addresses would break. this is a consequence of adding TCP support and
extending the buffer size used in name_from_dns.
|
|
reportedly there is nameserver software with question-rewriting
"functionality" which gives A answers when AAAA is queried. since we
made no effort to validate that the answer RR type actually
corresponds to the question asked, it was possible (depending on
flags, etc.) for these answers to leak through, which the caller might
not be prepared for. indeed, our implementation of gethostbyname2_r
makes an assumption that the resulting addresses are in the family
requested, and will misinterpret the results if they don't.
commit 45ca5d3fcb6f874bf5ba55d0e9651cef68515395 already noted in
fixing CVE-2017-15650 that this could happen, but did nothing to
validate that the RR type of the answer matches the question; it just
enforced the limit on number of results to preclude overflow.
presently, name_from_dns ignores the return value of __dns_parse, so
it doesn't really matter whether we return 0 (ignoring the RR) or -1
(parse-ending error) upon encountering the mismatched RR. if that ever
changes, though, ignoring irrelevant answer RRs sounds like the
semantically correct thing to do, so for now let's return 0 from the
callback when this happens.
|
|
we already attempt to preclude this case by having res_send use a
sufficiently large temporary buffer even if the caller did not provide
one as large as or larger than the udp dns max of 512 bytes. however,
it's possible that the caller passed a custom-crafted query packet
using EDNS0, e.g. to get detailed DNSSEC results, with a larger udp
size allowance.
I have also seen claims that there are some broken nameservers in the
wild that do not honor the dns udp limit of 512 and send large answers
without the TC bit set, when the query was not using EDNS.
we generally don't aim to support broken nameservers, but in this case
both problems, if the latter is even real, have a common solution:
using recvmsg instead of recvfrom so we can examine the MSG_TRUNC
flag.
|
|
the size of 512 is not sufficient to get at least one address in the
worst case where the name is at or near max length and resolves to a
CNAME at or near max length. prior to tcp fallback, there was nothing
we could do about this case anyway, but now it's fixable.
the new limit 768 is chosen so as to admit roughly the number of
addresses with a worst-case CNAME as could fit for a worst-case name
that's not a CNAME in the old 512-byte limit. outside of this
worst-case, the number of addresses that might be obtained is
increased.
MAXADDRS (48) was originally chosen as an upper bound on the combined
number of A and AAAA records that could fit in 512-byte packets (31
and 17, respectively). it is not increased at this time.
so as to prevent a situation where the A records consume almost all of
these slots (at 768 bytes, a "best-case" name can fit almost 47 A
records), the order of parsing is swapped to process AAAA first. this
ensures roughly half of the slots are available to each address
family.
|
|
tcp fallback was originally deemed unwanted and unnecessary, since we
aim to return a bounded-size result from getaddrinfo anyway and
normally plenty of address records fit in the 512-byte udp dns limit.
however, this turned out to have several problems:
- some recursive nameservers truncate by omitting all the answers,
rather than sending as many as can fit.
- a pathological worst-case CNAME for a worst-case name can fill the
entire 512-byte space with just the two names, leaving no room for
any addresses.
- the res_* family of interfaces allow querying of non-address records
such as TLSA (DANE), TXT, etc. which can be very large. for many of
these, it's critical that the caller see the whole RRset. also,
res_send/res_query are specified to return the complete, untruncated
length so that the caller can retry with an appropriately-sized
buffer. determining this is not possible without tcp.
so, it's time to add tcp fallback.
the fallback strategy implemented here uses one tcp socket per
question (1 or 2 questions), initiated via tcp fastopen when possible.
the connection is made to the nameserver that issued the truncated
answer. right now, fallback happens unconditionally when truncation is
seen. this can, and may later be, relaxed for queries made by the
getaddrinfo system, since it will only use a bounded number of results
anyway.
retry is not attempted again after failure over tcp. the logic could
easily be adapted to do that, but it's of questionable value, since
the tcp stack automatically handles retransmission and the successs
answer with TC=1 over udp strongly suggests that the nameserver has
the full answer ready to give. further retry is likely just "take
longer to fail".
|
|
for extremely small buffer sizes, the DNS query core in __res_msend
may malfunction completely, being unable to get even the headers to
determine the response code. but there is also a problem for
reasonable sizes under 512 bytes: __res_msend is unable to determine
if the udp answer was truncated at the recv layer, in which case it
may be incomplete, and res_send is then unable to honor its contract
to return the length of the full, non-truncated answer.
at present, res_send does not honor that contract anyway when the full
answer would exceed 512 bytes, since there is no tcp fallback, but
this change at least makes it consistent in a context where this is
the only "full answer" to be had.
|
|
this is groundwork for TCP fallback support, but does not itself
change behavior in any way.
|
|
this was apparently omitted long ago out of a lack of understanding of
its importance and the fact that POSIX doesn't specify it. despite not
being officially standardized, however, it turns out that at least
AIX, glibc, NetBSD, OpenBSD, QNX, and Solaris document and support it.
in certain usage cases, such as implementing a DNS gateway on top of
the stub resolver interfaces, it's necessary to distinguish the case
where a name does not exit (NxDomain) from one where it exists but has
no addresses (or other records) of the requested type (NODATA). in
fact, even the legacy gethostbyname API had this distinction, which we
were previously unable to support correctly because the backend lacked
it.
apart from fixing an important functionality gap, adding this
distinction helps clarify to users how search domain fallback works
(falling back in cases corresponding to EAI_NONAME, not in ones
corresponding to EAI_NODATA), a topic that has been a source of
ongoing confusion and frustration.
as a result of this change, EAI_NONAME is no longer a valid universal
error code for getaddrinfo in the case where AI_ADDRCONFIG has
suppressed use of all address families. in order to return an accurate
result in this case, getaddrinfo is modified to still perform at least
one lookup. this will almost surely fail (with a network error, since
there is no v4 or v6 network to query DNS over) unless a result comes
from the hosts file or from ip literal parsing, but in case it does
succeed, the result is replaced by EAI_NODATA.
glibc has a related error code, EAI_ADDRFAMILY, that could be used for
the AI_ADDRCONFIG case and certain NODATA cases, but distinguishing
them properly in full generality seems to require additional DNS
queries that are otherwise not useful. on glibc, it is only used for
ip literals with mismatching family, not for DNS or hosts file results
where the name has addresses only in the opposite family. since this
seems misleading and inconsistent, and since EAI_NODATA already covers
the semantic case where the "name" exists but doesn't have any
addresses in the requested family, we do not adopt EAI_ADDRFAMILY at
this time. this could be changed at some point if desired, but the
logic for getting all the corner cases with AI_ADDRCONFIG right is
slightly nontrivial.
|
|
EAI_MEMORY is not possible (but would not provide errno if it were)
and EAI_FAIL does not provide errno. treat the latter as EBADMSG to
match how it's handled in gethostbyname2_r (it indicates erroneous or
failure response from the nameserver).
|
|
EAI_MEMORY is not possible because the resolver backend does not
allocate. if it did, it would be necessary for us to explicitly return
ENOMEM as the error, since errno is not guaranteed to reflect the
error cause except in the case of EAI_SYSTEM, so the existing code was
not correct anyway.
|
|
these functions are horribly underspecified, inconsistent between
historical systems, and should never have been included. however, the
signatures we have match the glibc ones, and the glibc behavior is to
treat NxDomain and NODATA results as a success condition, not an
ENOENT error.
|
|
this distinction only affects search, but allows search to continue
when concatenating one of the search domains onto the requested name
produces a result that's not valid. this can happen when the
concatenation is too long, or one of the search list entries is
itself not valid.
as a consequence of this change, having "." in the search domains list
will now be ignored/skipped rather than making the lookup abort with
no results (due to producing a concatenation ending in ".."). this
behavior could be changed later if needed.
|
|
the main loop already errors out on zero-length labels within the
name, but terminates before having a chance to check for an erroneous
final zero-length label, instead producing a malformed query packet
with a '.' byte instead of the terminating zero.
rather than poke at the look logic, simply detect this condition early
and error out without doing anything.
this also fixes behavior of getaddrinfo when "." appears in the search
domain list, which produces a name ending in ".." after concatenation,
at least in the sense of no longer emitting malformed packets on the
network. however, due to other issues, the lookup will still fail.
|
|
if resolv.conf lists no nameservers at all, the default of 127.0.0.1
is used. however, another "no nameservers" case arises where the
system has ipv6 support disabled/configured-out and resolv.conf only
contains v6 nameservers. this caused the resolver to repeat socket
operations that will necessarily fail (sending to one or more
wrong-family addresses) while waiting for a timeout.
it would be contrary to configured intent to query 127.0.0.1 in this
case, but the current behavior is not conducive to diagnosing the
configuration problem. instead, fail immediately with EAI_SYSTEM and
errno==EAFNOSUPPORT so that the configuration error is reportable.
|
|
apparently this code path was never tested, as it's not usual to have
v6 nameservers listed on a system without v6 networking support. but
it was always intended to work.
when reverting to binding a v4 address, also revert the family in the
sockaddr structure and the socklen for it. otherwise bind will just
fail due to mismatched family/sockaddr size.
fix dns resolver fallback when v6 nameservers are listed by
|
|
this code attempts to use the value of errno from failure of socket or
connect to infer availability of the requested address family (v4 or
v6). however, in the case where connect failed, there is an
intervening call to close between connect and the use of errno. close
is not required to preserve errno on success, and in fact the
__aio_close code, which is called whenever aio is linked and thus
always called in dynamic-linked programs, unconditionally clobbers
errno. as a result, getaddrinfo fails with EAI_SYSTEM and errno=ENOENT
rather than correctly determining that the address family was
unavailable.
this fix is based on report/patch by Jussi Nieminen, but simplified
slightly to avoid breaking the case where socket, not connect, failed.
|
|
assuming a reasonable realtime clock, res_mkquery is highly unlikely
to generate the same query id twice in a row, but it's possible with a
very low-resolution system clock or under extreme delay of forward
progress. when it happens, res_msend fails to wait for both answers,
and instead stops listening after getting two answers to the same
query (A or AAAA).
to avoid this, increment one byte of the second query's id if it
matches the first query's. don't bother checking if the second byte is
also equal, since it doesn't matter; we just need to ensure that at
least one byte is distinct.
|
|
the wrong name works only by accident.
|
|
|
|
prior to commit e68c51ac46a9f273927aef8dcebc89912ab19ece, h_errno was
actually an external data object not a macro. bring back the symbol,
and use it as the storage for the main thread's h_errno.
technically this still doesn't provide full compatibility if the
application was multithreaded, but at the time there were no res_*
functions (and they did not set h_errno anyway), so any use of h_errno
would have been via thread-unsafe functions. thus a solution that just
fixes single-threaded applications seems acceptable.
|
|
while it's not clearly documented anywhere, this is the historical
behavior which some applications expect. applications which need to
see the response packet in these cases, for example to distinguish
between nonexistence in a secure vs insecure zone, must already use
res_mkquery with res_send in order to be portable, since most if not
all other implementations of res_query don't provide it.
|
|
the framework to do this always existed but it was deemed unnecessary
because the only [ex-]standard functions using h_errno were not
thread-safe anyway. however, some of the nonstandard res_* functions
are also supposed to set h_errno to indicate the cause of error, and
were unable to do so because it was not thread-safe. this change is a
prerequisite for fixing them.
|
|
prior to this change, the canonical name came from the first hosts
file line matching the requested family, so the canonical name for a
given hostname could differ depending on whether it was requested with
AF_UNSPEC or a particular family (AF_INET or AF_INET6). now, the
canonical name is deterministically the first one to appear with the
requested name as an alias.
|
|
the existing code clobbered the canonical name already discovered
every time another matching line was found, which will necessarily be
the case when a hostname has both IPv4 and v6 definitions.
patch by Wolf.
|
|
the internal __res_msend returns 0 on timeout without having obtained
any conclusive answer, but in this case has not filled in meaningful
anslen. res_send wrongly treated that as success, but returned a zero
answer length. any reasonable caller would eventually end up treating
that as an error when attempting to parse/validate it, but it should
just be reported as an error.
alternatively we could return the last-received inconclusive answer
(typically servfail), but doing so would require internal changes in
__res_msend. this may be considered later.
|
|
the old logic here likely dates back, at least in inspiration, to
before it was recognized that transient errors must not be allowed to
reflect the contents of successful results and must be reported to the
application.
here, the dns backend for getaddrinfo, when performing a paired query
for v4 and v6 addresses, accepted results for one address family even
if the other timed out. (the __res_msend backend does not propagate
error rcodes back to the caller, but continues to retry until timeout,
so other error conditions were not actually possible.)
this patch moves the checks to take place before answer parsing, and
performs them for each answer rather than only the answer to the first
query. if nxdomain is seen it's assumed to apply to both queries since
that's how dns semantics work.
|
|
the AD (authenticated data) bit in outgoing dns queries is defined by
rfc3655 to request that the nameserver report (via the same bit in the
response) whether the result is authenticated by DNSSEC. while all
results returned by a DNSSEC conforming nameserver will be either
authenticated or cryptographically proven to lack DNSSEC protection,
for some applications it's necessary to be able to distinguish these
two cases. in particular, conforming and compatible handling of DANE
(TLSA) records requires enforcing them only in signed zones.
when the AD bit was first defined for queries, there were reports of
compatibility problems with broken firewalls and nameservers dropping
queries with it set. these problems are probably a thing of the past,
and broken nameservers are already unsupported. however, since there
is no use in the AD bit with the netdb.h interfaces, explicitly clear
it in the queries they make. this ensures that, even with broken
setups, the standard functions will work, and at most the res_*
functions break.
|
|
commit 59324c8b0950ee94db846a50554183c845ede160 added __socketcall
analogous to __syscall, returning the negated error rather than
setting errno. use it to simplify the fallback path of socket(),
avoiding extern calls and access to errno.
Author: Rich Felker <dalias@aerifal.cx>
Date: Tue Jul 30 17:51:16 2019 -0400
make __socketcall analogous to __syscall, error-returning
|
|
always try the time64 syscall first since we can use its success to
conclude that no conversion is needed (any setsockopt for the
timestamp options would have succeeded without need for fallbacks).
otherwise, we have to remember the original controllen for each
msghdr, requiring O(vlen) space, so vlen must be bounded. linux clamps
it to IOV_MAX for sendmmsg only (not recvmmsg), but doing the same for
recvmmsg is not unreasonable, especially since the limitation will
only apply to old kernels.
we could optimize to avoid trying SYS_recvmmsg_time64 first if all
msghdrs have controllen zero, or support unlimited vlen by looping and
emulating the timeout logic, but I'm not inclined to do complex and
error-prone optimizations on a function that has so many underlying
problems it should really never be used.
|
|
the definitions of SO_TIMESTAMP* changed on 32-bit archs in commit
38143339646a4ccce8afe298c34467767c899f51 to the new versions that
provide 64-bit versions of timeval/timespec structure in control
message payload. socket options, being state attached to the socket
rather than function calls, are not trivial to implement as fallbacks
on ENOSYS, and support for them was initially omitted on the
assumption that the ioctl-based polling alternatives (SIOCGSTAMP*)
could be used instead by applications if setsockopt fails.
unfortunately, it turns out that SO_TIMESTAMP is sufficiently old and
widely supported that a number of applications assume it's available
and treat errors as fatal.
this patch introduces emulation of SO_TIMESTAMP[NS] on pre-time64
kernels by falling back to setting the "_OLD" (time32) versions of the
options if the time64 ones are not recognized, and performing
translation of the SCM_TIMESTAMP[NS] control messages in recvmsg.
since recvmsg does not know whether its caller is legacy time32 code
or time64, it performs translation for any SCM_TIMESTAMP[NS]_OLD
control messages it sees, leaving the original time32 timestamp as-is
(it can't be rewritten in-place anyway, and memmove would be mildly
expensive) and appending the converted time64 control message at the
end of the buffer. legacy time32 callers will see the converted one as
a spurious control message of unknown type; time64 callers running on
pre-time64 kernels will see the original one as a spurious control
message of unknown type. a time64 caller running on a kernel with
native time64 support will only see the time64 version of the control
message.
emulation of SO_TIMESTAMPING is not included at this time since (1)
applications which use it seem to be prepared for the possibility that
it's not present or working, and (2) it can also be used in sendmsg
control messages, in a manner that looks complex to emulate
completely, and costly even when running on a time64-supporting
kernel.
corresponding changes in recvmmsg are not made at this time; they will
be done separately.
|
|
somewhat analogous to commit d0b547dfb5f7678cab6bc39dd736ed6454357ca4,
but here the omission of the null timeout check was in the time64
syscall code path. this code is not yet used except on x32.
|
|
without this, the SO_RCVTIMEO and SO_SNDTIMEO socket options would
stop working on pre-5.1 kernels after time_t is switched to 64-bit and
their values are changed to the new time64 versions.
new code is written such that it's statically unreachable on 64-bit
archs, and on existing 32-bit archs until the macro values are changed
to activate 64-bit time_t.
|
|
the time64 syscall is used only if the timeout does not fit in 32
bits. after preprocessing, the code is unchanged on 64-bit archs. for
32-bit archs, the timeout now goes through an intermediate copy,
meaning that the caller does not get back the updated timeout. this is
based on my reading of the documentation, which does not document the
updating as a contract you can rely on, and mentions that the whole
recvmmsg timeout mechanism is buggy and unlikely to be useful. if it
turns out that there's interest in making the remaining time
officially available to callers, such functionality could be added
back later.
|
|
The original logic considered each byte until it either found a 0
value or a value >= 192. This means if a string segment contained any
byte >= 192 it was interepretted as a compressed segment marker even
if it wasn't in a position where it should be interpretted as such.
The fix is to adjust dn_skipname to increment by each segments size
rather than look at each character. This avoids misinterpretting
string segment characters by not considering those bytes.
|
|
addressing &out[k].sa was arguably undefined, despite &out[k] being
defined the slot one past the end of an array, since the member access
.sa is intervening between the [] operator and the & operator.
|