Bug 2291062 - DNF and Firefox take extremely long to start when VPN is active due to problems looking up the local hostname
Keywords:
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: systemd
Version: rawhide
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: systemd-maint
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Duplicates: 2314175
Depends On:
Blocks: 2257197
 
Reported: 2024-06-09 07:56 UTC by Francesco Ciocchetti
Modified: 2025-05-23 11:04 UTC
CC: 26 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2024-10-09 15:05:47 UTC
Type: ---
Embargoed:




Links
- GitHub containers/toolbox issue 1528 (open): Hostname changing causes unexpected issues. Last updated 2024-08-27 16:12:52 UTC
- GitHub systemd/systemd issue 33870 (open): myhostname works in toolbox but resolve plugin does not. Last updated 2024-07-30 09:34:26 UTC
- GitHub systemd/systemd issue 33871 (open): resolved: when DNSoverTLS is configured on local interface but it does not respond, timeout is too long. Last updated 2024-07-30 09:34:26 UTC

Description Francesco Ciocchetti 2024-06-09 07:56:37 UTC
I have a brand-new install of `Fedora Silverblue 40` on a new laptop. I created a `fedora:f40` **toolbox** with `toolbox create`, but **every dnf** or **yum** command hangs for a few minutes before completing successfully.

I installed `strace` and can see it hanging on the `/run/systemd/resolve/io.systemd.Resolve` socket:

```
futex(0x7f8a79e7d900, FUTEX_WAKE_PRIVATE, 2147483647) = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/run/systemd/resolve/io.systemd.Resolve"}, 42) = 0
sendto(3, "{\"method\":\"io.systemd.Resolve.Re"..., 90, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 90
brk(0x557406ed9000)                     = 0x557406ed9000
recvfrom(3, 0x557406e98760, 131080, MSG_DONTWAIT, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
ppoll([{fd=3, events=POLLIN}], 1, {tv_sec=119, tv_nsec=999960000}, NULL, 8
``` 

After a minute or two, it `times out` and the command **proceeds successfully**:

```
ppoll([{fd=3, events=POLLIN}], 1, {tv_sec=119, tv_nsec=999960000}, NULL, 8) = 0 (Timeout)
recvfrom(3, 0x557406e98760, 131080, MSG_DONTWAIT, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)


rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
close(3)                                = 0
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=13539, ...}) = 0
mmap(NULL, 13539, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f8a7cddc000
close(3)                                = 0
openat(AT_FDCWD, "/lib64/libnss_myhostname.so.2", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=174416, ...}) = 0
mmap(NULL, 174360, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f8a79e25000
mmap(0x7f8a79e28000, 90112, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3000) = 0x7f8a79e28000
mmap(0x7f8a79e3e000, 49152, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x19000) = 0x7f8a79e3e000
mmap(0x7f8a79e4a000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x24000) = 0x7f8a79e4a000
close(3)                                = 0
mprotect(0x7f8a79e4a000, 20480, PROT_READ) = 0
munmap(0x7f8a7cddc000, 13539)           = 0
rt_sigprocmask(SIG_BLOCK, [HUP USR1 USR2 PIPE ALRM CHLD TSTP URG VTALRM PROF WINCH IO], [], 8) = 0
uname({sysname="Linux", nodename="toolbox", ...}) = 0
```
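For context, the blocked call in the strace above is a Varlink request to systemd-resolved's `io.systemd.Resolve` socket (the method name is truncated in the trace; `io.systemd.Resolve.ResolveHostname` is assumed below). A minimal Python sketch of the same exchange, using a much shorter deadline than the ~120 s `ppoll` shown:

```python
import json
import socket

SOCKET_PATH = "/run/systemd/resolve/io.systemd.Resolve"

def encode_call(method, parameters):
    """Varlink frames each call as one JSON object terminated by a NUL byte."""
    return json.dumps({"method": method, "parameters": parameters}).encode() + b"\0"

def resolve_hostname(name, timeout=5.0):
    """Send one ResolveHostname call; return the decoded reply, or None on timeout."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)  # the strace above shows a ~120 s ppoll deadline
        s.connect(SOCKET_PATH)
        s.sendall(encode_call("io.systemd.Resolve.ResolveHostname", {"name": name}))
        buf = b""
        try:
            while not buf.endswith(b"\0"):
                chunk = s.recv(65536)
                if not chunk:
                    break
                buf += chunk
        except socket.timeout:
            return None  # the hang the reporter sees, just with a shorter deadline
    return json.loads(buf.rstrip(b"\0")) if buf else None
```

On an affected system, `resolve_hostname("toolbox")` would be expected to hit the timeout branch, mirroring the hang dnf runs into.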

Note that 

* This **only happens** with a toolbox based on `f40`; I created (and am currently using) one based on `f39`, and it works just fine
* There are no available updates in the `f40` container
* `dns resolution` and `systemd-resolved` **work just fine** both inside and outside the toolbox container
* I tried disabling `selinux`, but it did not help

```
$ ls -la /run/systemd/resolve/io.systemd.Resolve
srw-rw-rw-. 1 nobody nobody 0 Jun  8 15:59 /run/systemd/resolve/io.systemd.Resolve

⬢[@toolbox ~]$ resolvectl 
Global
           Protocols: LLMNR=resolve -mDNS +DNSOverTLS DNSSEC=yes/supported
    resolv.conf mode: stub
  Current DNS Server: 1.1.1.2#cloudflare-dns.com
         DNS Servers: 1.1.1.2#cloudflare-dns.com 1.0.0.2#cloudflare-dns.com
Fallback DNS Servers: 8.8.8.8#dns.google 8.8.4.4#dns.google

Link 3 (wlp0s20f3)
    Current Scopes: DNS LLMNR/IPv4 LLMNR/IPv6
         Protocols: +DefaultRoute LLMNR=resolve -mDNS +DNSOverTLS DNSSEC=yes/supported
Current DNS Server: 192.168.100.1
       DNS Servers: 192.168.100.1
        DNS Domain: lan

Link 4 (enp85s0)
    Current Scopes: none
         Protocols: -DefaultRoute LLMNR=resolve -mDNS +DNSOverTLS DNSSEC=yes/supported

Link 5 (docker0)
    Current Scopes: none
         Protocols: -DefaultRoute LLMNR=resolve -mDNS +DNSOverTLS DNSSEC=yes/supported



$ resolvectl query mirrors.fedoraproject.org
mirrors.fedoraproject.org: 2600:1f14:fad:5c02:7c8a:72d0:1c58:c189 -- link: wlp0s20f3
                           2600:2701:4000:5211:dead:beef:fe:fed3 -- link: wlp0s20f3
                           2604:1580:fe00:0:dead:beef:cafe:fed1 -- link: wlp0s20f3
                           2605:bc80:3010:600:dead:beef:cafe:fed9 -- link: wlp0s20f3
                           2620:52:3:1:dead:beef:cafe:fed6 -- link: wlp0s20f3
                           2620:52:3:1:dead:beef:cafe:fed7 -- link: wlp0s20f3
                           8.43.85.67          -- link: wlp0s20f3
                           8.43.85.73          -- link: wlp0s20f3
                           34.221.3.152        -- link: wlp0s20f3
                           38.145.60.20        -- link: wlp0s20f3
                           38.145.60.21        -- link: wlp0s20f3
                           67.219.144.68       -- link: wlp0s20f3
                           140.211.169.196     -- link: wlp0s20f3
                           152.19.134.142      -- link: wlp0s20f3
                           152.19.134.198      -- link: wlp0s20f3
                           (wildcard.fedoraproject.org)

-- Information acquired via protocol DNS in 144.7ms.
-- Data is authenticated: yes; Data was acquired via local or encrypted transport: yes
-- Data from: network
```


Reproducible: Always

Steps to Reproduce:
1. Install Silverblue f40
2. Create a toolbox from `fedora f40`
3. Run `dnf update`
4. Wait


Expected Results:
dnf commands work without long delays

Comment 1 Debarshi Ray 2024-06-20 15:23:08 UTC
Interesting.  Just to be sure.  A container created from fedora-toolbox:40 on a Fedora 40 host shows this problem, but a container from fedora-toolbox:39 on a Fedora 40 host doesn't show this problem.  Right?

Comment 2 Francesco Ciocchetti 2024-06-20 18:19:01 UTC
That is correct. 

I forgot to update that I found this change in nsswitch.conf that seems to address the problem. The linked issue was not about toolbox but it does fix the problem inside toolbox as well

https://discussion.fedoraproject.org/t/dnf-and-firefox-take-extreemly-long-to-start-when-vpn-active-on-f40/114604/4

Comment 3 Debarshi Ray 2024-06-27 23:50:42 UTC
(In reply to Francesco Ciocchetti from comment #2)
> That is correct. 
> 
> I forgot to update that I found this change in nsswitch.conf that seems to
> address the problem. The linked issue was not about toolbox but it does fix
> the problem inside toolbox as well
> 
> https://discussion.fedoraproject.org/t/dnf-and-firefox-take-extreemly-long-
> to-start-when-vpn-active-on-f40/114604/4

Thanks for digging that up!

I can track the change to the hosts database configuration in /etc/nsswitch.conf to this upstream pull request:
https://github.com/authselect/authselect/pull/366

... which was added to Fedora >= 40 through these commits:
https://src.fedoraproject.org/rpms/authselect/c/714bad65d2a09836ba84911ea5f6c6b011f2c480
https://src.fedoraproject.org/rpms/authselect/c/f411c0ecd9eef866fbe13e710867a3fcebaaf87d

... and was discussed in:
https://bugzilla.redhat.com/show_bug.cgi?id=2257197

Comment 4 Debarshi Ray 2024-06-27 23:51:49 UTC
Could you confirm that you are also experiencing this problem when using VPNs, as mentioned in:
https://discussion.fedoraproject.org/t/dnf-and-firefox-take-extreemly-long-to-start-when-vpn-active-on-f40/114604

Comment 5 Debarshi Ray 2024-06-28 00:00:39 UTC
Toolbx doesn't touch /etc/nsswitch.conf; it is used as-is from the Fedora default.  Given that this behaviour is observed both on Fedora 40 hosts and in containers made from the fedora-toolbox:40 images, I am confident that this has nothing to do with Toolbx.

Reassigning to authselect.

Comment 6 Petr Menšík 2024-06-28 10:34:50 UTC
These are unfortunate hacks. I think the primary problem should be solved by good caching instead of by synthesizing non-existent records in the myhostname plugin. That plugin prevents obtaining the working name available on the network itself, because what myhostname delivers is visible only on this host and is probably not resolvable from the other hosts on the network. It also never contains the full hostname including the domain.

But programs doing reverse address queries are typically trying to obtain the name under which other hosts see this machine.

There are roughly 2 different cases:

1) Showing network traffic with names, where I want hostnames to display it in a user-friendly way. myhostname works great for this case.
2) Finding the name under which other hosts would resolve me. This is what SMTP servers try, and what `hostname -A` should provide, i.e. I do not want a name synthesized by my own host but the one seen by the network. The myhostname plugin prevents this when it is used before real resolution plugins like "dns".

I do not think a simple fix is possible with what we have now. The same API is used for both cases, with no flag available to clearly indicate which is desired.
But the question is why dnf needs reverse resolution in the first place. Somehow I do not think that should be necessary for any dnf operation.

I think we may want a new getnameinfo(3) flag that suppresses the synthesized names provided by the myhostname plugin, so that things like `hostname -A` or mail server daemons can resolve the real addresses visible from the network.

Ideally, the local DNS cache would be tried first, and myhostname would be consulted only when local network DNS did not provide any name or timed out. To prevent a timeout on every query, it would remember, with a TTL of minutes, that the record does not resolve, and fall through to myhostname. Unless we have clear use cases, this is not simple to fix. We could also lower the timeout when a name provided by the myhostname plugin is queried, but the NSS plugin architecture does not allow that to be solved in a simple way.
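For reference, the ordering Petr describes lives in the `hosts` line of /etc/nsswitch.conf. The lines below are illustrative of the two orderings being discussed, not the exact Fedora defaults:

```
# Old-style order: myhostname answers the local hostname instantly,
# before resolve/dns are ever consulted.
hosts: files myhostname mdns4_minimal [NOTFOUND=return] resolve [!UNAVAIL=return] dns

# New-style order (after the authselect change for hostname --fqdn):
# a name that only myhostname knows must first fail -- or time out --
# in resolve/dns before myhostname is reached.
hosts: files mdns4_minimal [NOTFOUND=return] resolve [!UNAVAIL=return] myhostname dns
```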

Comment 7 Petr Menšík 2024-06-28 11:01:28 UTC
A good way to provide feedback is the `getent -s` parameter, which allows only selected plugins to be used.

On a decent network with good connectivity, it should not take long to query names. If it does, some part of the local DNS setup is misconfigured.

What is the output of this command on that network?

for plugin in myhostname resolve dns; do echo "# $plugin"; time getent -s $plugin ahosts $HOSTNAME; done

Or better, with different databases too:
for plugin in myhostname resolve dns; do for DB in ahosts ahostsv4 ahostsv6; do echo "# $plugin; $DB"; time getent -s $plugin $DB $HOSTNAME; done; done

Sometimes IPv6 queries cause visible timeouts because the local DNS cache is not well configured. Could it be dnsmasq?

But since DNS over TLS is forced on, do all of those servers respond to DoT? Can the servers be queried to check whether they respond?

$ dig -d @192.168.100.1 +tls $HOSTNAME

Comment 8 Petr Menšík 2024-06-28 12:12:28 UTC
Is it possible that resolved's fallback servers need to kick in because the Wi-Fi resolver does not support DNS over TLS, and it times out before the fallback server gets tried? Enabled DNSSEC might also cause issues, if DNSSEC-specific queries time out instead of producing positive or negative answers.

I am not sure how exactly this should be visible in resolvectl output.

I mean, the later resolvectl query is reasonably fast: "Information acquired via protocol DNS in 144.7ms" is not slow enough to be noticeable.

Comment 9 Michael Catanzaro 2024-06-28 14:09:56 UTC
Thank you Petr. This is very helpful.

(In reply to Petr Menšík from comment #6)
> These are unfortunate hacks. I think primary problem should be solved by a
> good caching instead of synthetizing non-existent records by myhostname
> plugin. That plugin prevents obtaining working name available at the network
> itself. Because what myhostname plugin delivers is visible only at the host,
> but probably is not resolvable on the other hosts of this network. It never
> contains also full hostname including domain.

systemd-resolved will also synthesize a record for the local hostname, so it's probably coming from nss-resolve rather than nss-myhostname. i.e. when systemd-resolved is running, nss-resolve will always return a result; the other NSS modules will not be used and are only there as a fallback for when systemd-resolved is disabled. Therefore, surely nss-myhostname is never used (after the change in bug #2257197) and the synthesized result is not actually coming from nss-myhostname anymore.

A good short term workaround would be to revert the change from bug #2257197 to get dnf and Firefox working properly again, since that is more important than hostname --fqdn, but that is not a good long term fix. In the long term, we should probably keep nss-myhostname at the end and move nss-mdns4_minimal instead. But this is easier said than done, because systemd-resolved doesn't seem to handle mDNS properly, bug #1867830.

> I think we may want getnameinfo(3) new flag, which might suppress
> synthetized names provided by myhostname plugin. To be able to resolve real
> addresses visible from network, which things like hostname -A or mail server
> daemons need.

Hm, there is already NI_NOFQDN though. Could we use that?
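For reference, NI_NOFQDN only trims the domain part of a name already resolved for a local host, so it likely cannot suppress synthesized answers by itself; a quick probe (sketch; the non-numeric result depends on the local resolver configuration):

```python
import socket

addr = ("127.0.0.1", 0)

# Numeric form: no resolution at all, fully deterministic.
host, _ = socket.getnameinfo(addr, socket.NI_NUMERICHOST)
print(host)  # 127.0.0.1

# Reverse lookup with NI_NOFQDN: trims the domain part for local hosts,
# but the name it trims still comes from whichever NSS source answered
# first -- which is exactly the ordering this bug is about.
print(socket.getnameinfo(addr, socket.NI_NOFQDN)[0])
```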

> In ideal case local DNS cache should be tried first and myhostname would get
> consulted only in case local network DNS did not provide any name or even
> timeouted. To prevent timeout on each query, it would remember with TTL of
> minutes that the record does not work and follow to myhostname. Unless we
> have clear use cases, it is not simple to fix. We could also tune timeout to
> lower value when name provided by myhostname plugin were queried. But nss
> plugins architecture does not allow it to be resolved a simple way.

I think it's designed on the assumption that the local hostname is more trusted than DNS and should always resolve to the local computer. Changes would probably need to be compatible with that. But maybe something similar could work. We'd need to discuss with upstream.

The NSS plugin architecture shouldn't be a problem because everything is happening inside systemd-resolved.

Comment 10 Petr Menšík 2024-06-28 14:13:43 UTC
Hmm, I am finding something rotten with `dig +search +showsearch host` after I set the `lan` suffix with this command:

sudo resolvectl domain enp1s0 lan

When I run dig, it times out instead of giving at least a cached response. It sends the query to the local nameserver over the DoT protocol, which at least my resolver does not support. Could this also be the case for the original reporter?

Confirmed by running tcpdump in the background:

sudo tcpdump -n port llmnr or port domain or port domain-s &

I get a lot of lines like:
09:42:51.900967 IP 192.168.122.19.44602 > 192.168.122.1.domain-s: Flags [S], seq 1017153061, win 32120, options [mss 1460,sackOK,TS val 1259108054 ecr 0,nop,wscale 7,tfo  cookiereq,nop,nop], length 0

And indeed, it takes a long time before it fails. It does not behave well, and in particular the status report in resolvectl does not clearly indicate which server is actually responding and which is not.
I think this is caused by DNSoverTLS being enabled globally on all interfaces. But no status is provided for "Current DNS Server" on whether it responds over the specified protocols or not.

I have put into /etc/systemd/resolved.conf:
[Resolve]
# Some examples of DNS servers which may be used for DNS= and FallbackDNS=:
# Cloudflare: 1.1.1.1#cloudflare-dns.com 1.0.0.1#cloudflare-dns.com 2606:4700:4700::1111#cloudflare-dns.com 2606:4700:4700::1001#cloudflare-dns.com
# Google:     8.8.8.8#dns.google 8.8.4.4#dns.google 2001:4860:4860::8888#dns.google 2001:4860:4860::8844#dns.google
# Quad9:      9.9.9.9#dns.quad9.net 149.112.112.112#dns.quad9.net 2620:fe::fe#dns.quad9.net 2620:fe::9#dns.quad9.net
DNS=1.1.1.1#cloudflare-dns.com 1.0.0.1#cloudflare-dns.com
FallbackDNS=8.8.8.8#dns.google 8.8.4.4#dns.google
#Domains=
DNSSEC=yes
DNSOverTLS=yes

$ time getent ahosts host8

real	0m18.467s
user	0m0.001s
sys	0m0.004s

Until I disabled DNSoverTLS toward my local device, resolution of hostnames without dots did not work at all, and it took a lot of retries on port 853. In my case that port is not protected by a firewall, so it fails immediately; if the port caused a timeout instead, the failure could take much longer.

When I ran `sudo resolvectl dnsovertls enp1s0 off`, responses became immediate. But I also disabled LLMNR.

$ resolvectl 
Global
           Protocols: LLMNR=resolve -mDNS +DNSOverTLS DNSSEC=yes/supported
    resolv.conf mode: stub
  Current DNS Server: 1.1.1.1#cloudflare-dns.com
         DNS Servers: 1.1.1.1#cloudflare-dns.com 1.0.0.1#cloudflare-dns.com
Fallback DNS Servers: 8.8.8.8#dns.google 8.8.4.4#dns.google

Link 2 (enp1s0)
    Current Scopes: DNS
         Protocols: +DefaultRoute -LLMNR -mDNS -DNSOverTLS DNSSEC=yes/supported
Current DNS Server: 192.168.122.1
       DNS Servers: 192.168.122.1
        DNS Domain: lan
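The DoT failure mode described above can be checked independently of resolved with a small probe that just attempts a TLS handshake on port 853 (a sketch; a real check would also send a DNS query over the established channel):

```python
import socket
import ssl

def dot_responds(server, timeout=3.0):
    """Rough reachability check for DNS over TLS: open TCP port 853
    and complete a TLS handshake."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False   # reachability only; do not validate the cert
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((server, 853), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=server):
                return True
    except OSError:
        return False
```

A resolver that refuses or drops port 853, like the local 192.168.122.1 here, returns False; a short timeout keeps the probe from hanging the way resolved does.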

Comment 11 Petr Menšík 2024-06-28 14:44:15 UTC
I have also seen unexpected plaintext query names at the link resolver address. resolved does not send DNSSEC queries only to the expected global configuration; it also sends them to the 192.168.122.1 server. I expected only *.lan. names to be sent there, but that does not seem to be the case for systemd-resolved-255.7-1.fc40.x86_64.

I know the systemd folks do not use DNSSEC and do not recommend it. I am seeing traffic both on the TLS address 1.1.1.1 and on the plaintext 192.168.122.1, including DNSSEC records not for *.lan. I am not sure how this is supposed to work with the configuration from comment 10, but I would have expected different behaviour. This does not seem to be a good way to combine global DoT servers with a local plaintext server for local-only records, because it leaks queries that should have stayed protected by TLS (IMHO).

Sadly, global DNS in the NetworkManager configuration does not allow TLS to be enabled by default. There does not seem to be a better configuration available now.

Comment 12 Petr Menšík 2024-06-28 15:23:40 UTC
If privacy is desired, I would recommend turning off LLMNR resolution. It should be possible to disable it per NM profile or globally. That should speed up resolution of non-existent bare names, i.e. names without a dot.

$ nmcli c show
NAME                UUID                                  TYPE      DEVICE 
Wired connection 1  5a99be05-107a-3c54-a21b-d275cca70b0c  ethernet  enp1s0 
lo                  6c56b7d0-efa8-4da8-a913-cbfed360efa6  loopback  lo     

$ nmcli c edit 5a99be05-107a-3c54-a21b-d275cca70b0c

set connection.dns-over-tls no
set connection.llmnr no
set ipv4.dns-search lan
save
activate

And have /etc/systemd/resolved.conf:

[Resolve]
# Some examples of DNS servers which may be used for DNS= and FallbackDNS=:
# Cloudflare: 1.1.1.1#cloudflare-dns.com 1.0.0.1#cloudflare-dns.com 2606:4700:4700::1111#cloudflare-dns.com 2606:4700:4700::1001#cloudflare-dns.com
# Google:     8.8.8.8#dns.google 8.8.4.4#dns.google 2001:4860:4860::8888#dns.google 2001:4860:4860::8844#dns.google
# Quad9:      9.9.9.9#dns.quad9.net 149.112.112.112#dns.quad9.net 2620:fe::fe#dns.quad9.net 2620:fe::9#dns.quad9.net
DNS=1.1.1.2#cloudflare-dns.com 1.0.0.2#cloudflare-dns.com
FallbackDNS=8.8.8.8#dns.google 8.8.4.4#dns.google
#Domains=
DNSSEC=yes
DNSOverTLS=yes 
LLMNR=no

It seems that with DefaultRoute disabled on the link, it will indeed send everything to the global TLS servers and only *.lan to the local plaintext resolver. I am not sure how this is supposed to be configured permanently in resolved.conf in combination with NetworkManager.

Comment 13 Michael Catanzaro 2024-06-28 17:07:07 UTC
The problem is that you have global DNS configured *and* link-specific DNS configured by NetworkManager. That is almost always wrong, because it is going to generate two requests. I think systemd-resolved is stupid to interpret the configuration this way and should really change its behavior somehow, but anyway, suffice it to say you should stick to link-specific configuration only (unless you're going to disable NetworkManager).

Comment 14 bugzilla.redhat.com.quake198 2024-07-12 06:26:20 UTC
There are name-resolution issues even without a VPN. In Toolbox, the hostname in /etc/hostname is "toolbox", while the environment variable HOSTNAME is inherited (?) from outside the toolbox container and is set to the real machine's hostname. Is this disparity a problem too?

In the Fedora 40 image, the priority change between myhostname and resolve has the consequence that the container's hostname "toolbox" is not resolved in a timely manner. This causes name-resolution timeouts in any application that attempts to use the hostname from /etc/hostname. The most notable issue is that some X11 applications hang the whole Plasma desktop in Kinoite for a few seconds. In Fedora 39 toolbox images, myhostname takes precedence and resolves /etc/hostname's value "toolbox", and applications work correctly there.

Command `resolvectl query toolbox` fails in both Fedora 40 and 39 toolbox images.
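The disparity between the kernel hostname, the inherited HOSTNAME variable, and /etc/hostname can be displayed with a short sketch (all three values are environment-dependent):

```python
import os
import socket
from pathlib import Path

# Inside a toolbox container these three sources can disagree:
kernel_hostname = socket.gethostname()         # uname() nodename, e.g. "toolbox"
env_hostname = os.environ.get("HOSTNAME")      # may be inherited from the host shell
hostname_file = Path("/etc/hostname")
file_hostname = hostname_file.read_text().strip() if hostname_file.exists() else None

print(f"kernel={kernel_hostname} env={env_hostname} file={file_hostname}")
```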

Comment 15 Petr Menšík 2024-07-17 13:40:38 UTC
The problem with `resolvectl query toolbox` might happen because LLMNR is enabled in systemd-resolved, but since no such host exists, nothing returns a negative reply immediately. Non-existent names take a long time to time out on multicast protocols, at least 3 seconds.

If unicast DNS with applied search domains is tried first, the answer should be fast, provided the name with the search suffix applied exists in DNS.
The question is whether a unicast query can be fast when DNSoverTLS is desired but the server is not responding, be it because the server is misconfigured, overloaded, or broken.
I am not sure whether systemd-resolved caches the temporary unavailability of a link-local server.

Can the original reporter also provide the output of these commands:

- resolvectl show-server-state
- resolvectl statistics

Unfortunately, the recent response time does not seem to be visible.

Comment 16 Petr Menšík 2024-07-17 13:57:34 UTC
(In reply to Michael Catanzaro from comment #13)
> Problem is you have global DNS configured *and* the link-specific DNS
> configured by NetworkManager. That is almost always wrong because it's going
> to generate two requests. I think systemd-resolved is stupid to interpret
> the configuration this way and should really change its behavior somehow,
> but anyway, suffice to say you should stick to link-specific configuration
> only (unless you're going to disable NetworkManager).

I am afraid that is wrong only because it was not considered a supported situation.

But especially with DNS over TLS directed to public IP addresses, this would be exactly what I want:
use DNS over TLS for everything, except local-only domains accessible on local link servers,
because names like example.lan. cannot be resolved on a global server like Cloudflare or Google DNS.
Unfortunately, it is not simple to configure global servers to be used for everything except the local-only domains of individual connections.

You cannot specify that DNSoverTLS is desired for the global connection while keeping different (default) settings for ordinary interfaces.

For example, DNSoverTLS=yes for my global setting (where I can be very sure it is supported), but DNSoverTLS=opportunistic for link-provided servers (where it might be supported, but often will not be).

It should work if the NetworkManager default were set to a different value. That should be possible when NetworkManager.conf contains:

[connection]
connection.dns-over-tls=1  # means opportunistic

That would change the default for link settings in NM, meaning it would provide its own value. But still, I am not sure how to make it -DefaultRoute.

Comment 17 Michael Catanzaro 2024-07-17 17:44:21 UTC
(In reply to Petr Menšík from comment #15)
> Can original reporter provide also output of command:
> 
> - resolvectl show-server-state
> - resolvectl statistics
> 
> Unfortunately recent response time does not seem to be visible

These questions are still unanswered.

Comment 18 Michael Catanzaro 2024-07-17 17:55:54 UTC
Oh sorry, I see Petr just asked for that a couple hours ago. :D

Comment 19 Michael Catanzaro 2024-07-17 17:58:34 UTC
(In reply to Petr Menšík from comment #16)
> I am afraid that is wrong only because it were not considered supported
> situation.
> 
> But especially with DNS over TLS directed to public IP addresses, this would
> be exactly what I want.

Yeah, your argument is persuasive.

But that's also a tangent, not directly related to this bug report.

Comment 20 Petr Menšík 2024-07-17 19:10:03 UTC
It seems my idea of a global DNS over TLS server is described in upstream issue https://github.com/systemd/systemd/issues/33579, which requests different DefaultRoute handling.

Since both systemd-resolved and the myhostname NSS plugin are part of systemd, should this issue be switched to the systemd component? It would notify more maintainers than just Zbigniew. It seems the issue is not caused by the ordering itself; especially since minutes rather than a few annoying seconds are involved, something else is likely responsible. authselect is very likely not where the primary fix belongs.

If the reproduction steps work, it should be simple to reproduce with the correct settings; I think I arrived at them in comment #10. A wrong dns order should add only the default timeout and retry attempts at most, which is a maximum of 15 seconds for a single resolution. The report does not contain a specific number, but 1 minute is a lot for DNS.

I think it needs both modified configuration AND better marking of unresponsive TLS servers. systemd-resolved as a daemon should be able to wait longer, but it should emit a resolution failure to the client sooner.

Comment 21 Owen Taylor 2024-07-28 18:54:39 UTC
It's worth noting that there are two separate issues here:

 A) Negative lookups are slow to time out (this happens on my system too, with no VPN involved; I'm happy to provide details, but it's just a pretty standard Wi-Fi connection to a home network)

 B) The hostname value inside toolbox should resolve, not time out as a negative lookup

B) could be solved by going back to preferring nss-myhostname over nss-resolve, or by making nss-resolve somehow behave correctly in a UTS namespace with a hostname different from the one seen by systemd-resolved.

Comment 22 Petr Menšík 2024-07-30 09:34:27 UTC
Correct, it works quite fine on my system because I have disabled systemd-resolved; therefore myhostname comes before dns.

The hostname resolved from the container is different from what systemd-resolved itself sees.
The assumption that it caches what myhostname would have done per process does not hold in those cases, because it caches, say, fedora.lan, while the toolbox is using toolbox.lan. The resolve plugin would need to pass the current hostname as part of the request.

- Created systemd upstream issue: https://github.com/systemd/systemd/issues/33870

But one resolution with broken TLS takes way too long:

resolvectl domain eth0 lan
# I know my resolver does not support DNS over TLS
resolvectl dnsovertls eth0 yes

- Created systemd upstream issue https://github.com/systemd/systemd/issues/33871

I do not think this should be fixed anywhere else. Yes, we have two parts, but both are better fixed on the systemd side, IMO.

Comment 23 Pavel Březina 2024-07-30 11:12:03 UTC
Should I revert the patch in authselect or should I keep it and we'll wait for systemd-resolved to be fixed?

Comment 24 David Tardon 2024-07-30 11:53:40 UTC
(In reply to Pavel Březina from comment #23)
> Should I revert the patch in authselect or should I keep it and we'll wait
> for systemd-resolved to be fixed?

I don't think that's going to be quick.

Comment 25 Michael Catanzaro 2024-07-30 15:35:26 UTC
Does the long timeout happen when DNS over TLS is not enabled? That is a non-default configuration *specifically* because we know fallback doesn't work well when DoT is not supported by the DNS server. That's why the change proposal https://fedoraproject.org/wiki/Changes/DNS_Over_TLS failed.

I'm asking not because it's OK for this to be broken, but because if it's only broken in a non-default configuration then we probably don't want to make major design changes (authselect moving NSS modules around) as a workaround.

Comment 26 Petr Menšík 2024-08-27 16:12:53 UTC
Systemd upstream does not think the problem is on their side. I also filed an issue on toolbox; maybe there is something to adjust there.

https://github.com/containers/toolbox/issues/1528

Comment 27 Pavel Březina 2024-09-23 09:17:17 UTC
*** Bug 2314175 has been marked as a duplicate of this bug. ***

Comment 28 Pavel Březina 2024-09-23 09:22:01 UTC
There does not seem to be any action taken so far. What is your guidance?

I can revert the patch in authselect, breaking hostname --fqdn again. That seems like a reasonable thing to do instead of letting users experience timeouts. I can apply the patch again once this is fixed.

Comment 29 venanocta 2024-09-23 11:03:30 UTC
While they are cooking up a long-term solution, might I suggest introducing a "with-domain" option?
If this option is given, the currently deployed behavior is engaged, where resolved is asked to resolve the FQDN.
Otherwise, and by default, the option is not activated, so all users who do not deploy the package in a domain environment are left unaffected.

Because at this point in time, trying to get people to transition to distributions using this rpm package is impossible, given the tail of bugs this change causes. The alternative is that administrators must deploy a third-party plugin that fixes this patch.

Comment 30 Michael Catanzaro 2024-09-23 13:27:38 UTC
Unfortunately the bug is waiting on the bug reporter to provide information, but the reporter has disappeared.

venanocta, since you're hitting the same issues, can you answer Petr's questions in comment #15 and my question in comment #25 please?

(In reply to Pavel Březina from comment #28)
> There does not seem to be any action taken so far. What is your guidance?

If this is happening with DNS over TLS enabled, then I would do nothing, per my reasoning in comment #25.

Otherwise, I suggest my strategy from comment #9:

(In reply to Michael Catanzaro from comment #9)
> A good short term workaround would be to revert the change from bug #2257197
> to get dnf and Firefox working properly again, since that is more important
> than hostname --fqdn, but that is not a good long term fix. In the long
> term, we should probably keep nss-myhostname at the end and move
> nss-mdns4_minimal instead. But this is easier said than done, because
> systemd-resolved doesn't seem to handle mDNS properly, bug #1867830.

Comment 31 venanocta 2024-09-24 22:18:45 UTC
Since you asked me to answer your questions, I have spent a little more time digging into the issue and found what might be the core of the problem. But before that, I want to explain the setup on which I am experiencing the issue.

-- ANSWERS --
DNS over TLS => default = not configured
VPN          => not enabled; it actually makes no difference

$ resolvectl show-server-state
| Server: 10.10.255.254                           
|                               Type: link
|                          Interface: rnet
|                    Interface Index: 13
|             Verified feature level: n/a
|             Possible feature level: TLS+EDNS0+DO
|                        DNSSEC Mode: no
|                   DNSSEC Supported: yes
| Maximum UDP fragment size received: 512
|                Failed UDP attempts: 0
|                Failed TCP attempts: 0
|              Seen truncated packet: no
|           Seen OPT RR getting lost: no
|              Seen RRSIG RR missing: no
|                Seen invalid packet: no
|             Server dropped DO flag: no
|                                                 
| Server: 192.168.100.254                      
|                               Type: link
|                          Interface: enp67s0
|                    Interface Index: 3
|             Verified feature level: UDP+EDNS0
|             Possible feature level: UDP+EDNS0
|                        DNSSEC Mode: no
|                   DNSSEC Supported: yes
| Maximum UDP fragment size received: 512
|                Failed UDP attempts: 0
|                Failed TCP attempts: 0
|              Seen truncated packet: no
|           Seen OPT RR getting lost: no
|              Seen RRSIG RR missing: no
|                Seen invalid packet: no
|             Server dropped DO flag: no

$ resolvectl statistics
| Transactions                                      
|                        Current Transactions:     0
|                          Total Transactions: 11602
|                                                   
| Cache                                             
|                          Current Cache Size:    40
|                                  Cache Hits:  6307
|                                Cache Misses:  6117
|                                                   
| Failure Transactions                              
|                              Total Timeouts:   652
|          Total Timeouts (Stale Data Served):     0
|                     Total Failure Responses:   120
| Total Failure Responses (Stale Data Served):     0
|                                                   
| DNSSEC Verdicts                                   
|                                      Secure:     0
|                                    Insecure:     0
|                                       Bogus:     0
|                               Indeterminate:     0


-- SETUP --
The PC called 'workstation-linux' is a workstation based on the Threadripper 3960X and is connected through a ~3m 2.5Gb RJ45 connection to a router/firewall based on OPNsense (IPv6 is not configured / turned off).
Since the workstation can dual boot, it has two hostnames: Fedora = 'workstation-linux' & Windows = 'workstation-win10'.
Additionally, the OPNsense router has a static DHCP record set for the workstation to 'workstation.home.lan'.

Furthermore, the router provides 2 Networks:
1) LAN:
LAN is provided directly on the native interface.
DNS:
  Unbound is accepting connections on the native interface
DHCP:
  Gateway:           192.168.100.254 ( the router ip )
  DNS Server:        192.168.100.254 ( the router ip )
  DNS Domain:        home.lan
  DNS Search Domain: home.lan, srv.lan (another VLAN with service hosts - not relevant)

2) RNET
RNET is provided on VLAN 10 on the same interface as (1).
DNS:
  Unbound is NOT accepting connections on VLAN 10
DHCP:
  Gateway:           10.10.255.254 ( the router ip )
  DNS Server:        ( not defined => interface ip by default = 10.10.255.254 )
  DNS Domain:        sector1.rnet
  DNS Search Domain: sector1.rnet, rnet

-- TESTING --
For testing I ran 'resolvectl flush-caches' followed by `time resolvectl query workstation-linux`..
.. a) with only connection (1) enabled in NetworkManager:
  $ time resolvectl query workstation-linux
  | workstation-linux: 192.168.100.1                 -- link: enp67s0
  |                    fe80::5303:3bac:39e5:be66%3   -- link: enp67s0
  | 
  | -- Information acquired via protocol DNS in 2.4ms.
  | -- Data is authenticated: yes; Data was acquired via local or encrypted transport: yes
  | -- Data from: synthetic
  | 
  | real	0m0,008s
  | user	0m0,002s
  | sys	        0m0,004s
.. b) with connections (1) & (2) in NetworkManager enabled:
  # RNET (2) has Ipv6 disabled in NM!
  $ time resolvectl query workstation-linux
  | workstation-linux: resolve call failed: Connection timed out
  | 
  | real	2m0,062s
  | user	0m0,001s
  | sys	        0m0,004s


-- THOUGHTS --
From my tests it seems that the timeout happens whenever a DNS name is resolved over a connection whose DNS servers do not respond.

Furthermore, after seeing these results I enabled the DNS server for RNET (VLAN 10), and the issue disappeared, with the following result:
c)
  $ time resolvectl query workstation-linux
  | workstation-linux: 10.10.128.6                       -- link: rnet
  |                    (workstation-linux.sector1.rnet)
  | 
  | -- Information acquired via protocol DNS in 1.9ms.
  | -- Data is authenticated: no; Data was acquired via local or encrypted transport: no
  | -- Data from: network
  | 
  | real	0m0,007s
  | user	0m0,000s
  | sys	        0m0,005s

If, instead of enabling Unbound, I set the DNS server of the RNET connection to 1.1.1.1 in nmcli, the following happens:
d-1)
  $ time resolvectl query workstation-linux
  | workstation-linux: resolve call failed: Lookup failed due to system error: No route to host
  | 
  | real	0m16,927s
  | user	0m0,002s
  | sys	        0m0,004s
# after 'resolvectl flush-caches' (& time?):
d-2)
  $ time resolvectl query workstation-linux
  | workstation-linux: 192.168.100.1                 -- link: enp67s0
  |                    10.10.128.1                   -- link: rnet
  |                    10.10.128.6                   -- link: rnet
  |                    fe80::5303:3bac:39e5:be66%3   -- link: enp67s0
  | 
  | -- Information acquired via protocol DNS in 68.8ms.
  | -- Data is authenticated: yes; Data was acquired via local or encrypted transport: yes
  | -- Data from: synthetic
  | 
  | real	0m0,075s
  | user	0m0,001s
  | sys	        0m0,005s


From what I have seen, this looks like a bug in 'resolve': it should fail fast when it detects that no DNS server is responding on the NM connection that sets the domain entries.
Nonetheless, I see the difficulty: 'resolve' cannot really tell the state of a DNS server, since DNS is UDP-based.
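To illustrate why probing is the only way to tell whether a UDP-based DNS server is alive, here is a minimal, self-contained sketch (a hypothetical helper, not resolved's actual logic): a datagram is sent, and silence before a short deadline is the only evidence that the server may be down.

```python
import socket
import threading

def probe_udp(host: str, port: int, payload: bytes = b"ping",
              timeout: float = 0.5) -> bool:
    """Return True if anything answers the datagram before the deadline.

    UDP carries no connection state, so 'no reply' is the only signal
    that a server may be down -- and learning it costs a full timeout.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(payload, (host, port))
        try:
            s.recvfrom(512)
            return True
        except (socket.timeout, OSError):
            return False

def _echo_once(sock: socket.socket) -> None:
    """Answer one datagram, simulating a responsive DNS server."""
    data, addr = sock.recvfrom(512)
    sock.sendto(data, addr)

# A "live" server: a UDP socket on localhost that echoes one packet.
srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
srv.bind(("127.0.0.1", 0))
live_port = srv.getsockname()[1]
threading.Thread(target=_echo_once, args=(srv,), daemon=True).start()
print(probe_udp("127.0.0.1", live_port))  # responsive -> True

# A "dead" server: a port we just freed, so nothing listens there.
tmp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tmp.bind(("127.0.0.1", 0))
dead_port = tmp.getsockname()[1]
tmp.close()
print(probe_udp("127.0.0.1", dead_port))  # no listener -> False
```

The dead-server probe only returns after the full timeout elapses, which is exactly the per-server cost that multiplies into the multi-minute stalls reported above.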

Comment 32 Michael Catanzaro 2024-09-25 20:39:57 UTC
Well it's probably bad for your DNS servers to be nonresponsive. Surely that's your main problem.

I'll just reiterate my suggestion from comment #9: revert for now, move nss-myhostname back to where it was before, and let 'hostname --fqdn' remain broken until somebody can figure out how to fix this properly.

Comment 33 Pavel Březina 2024-09-26 10:52:53 UTC
Ok, I'll revert it for now, I'm fine with it.

But please keep in mind that the original order, before systemd-resolved was made the default, was "hosts: files dns myhostname"; hostname --fqdn worked and nobody experienced any timeouts. I have been moving myhostname back and forth ever since resolved was introduced, every time with the blessing of the systemd developers, and it's buggy in either place. It would be really good if somebody could take ownership of resolved and start actively working on it, or open a change page to remove it from the default configuration.

Comment 34 Fedora Update System 2024-10-09 10:35:22 UTC
FEDORA-2024-02a5688338 (authselect-1.5.0-8.fc42) has been submitted as an update to Fedora 42.
https://bodhi.fedoraproject.org/updates/FEDORA-2024-02a5688338

Comment 35 Pavel Březina 2024-10-09 10:37:53 UTC
This is reverted in
* rawhide: https://bodhi.fedoraproject.org/updates/FEDORA-2024-02a5688338
* F41: https://bodhi.fedoraproject.org/updates/FEDORA-2024-d7f0d7c65b
* F40: https://bodhi.fedoraproject.org/updates/FEDORA-2024-d7caacc700

So the problem should be fixed now, at the cost of breaking `hostname --fqdn` again. I would still like to see this fixed properly, though.

Comment 36 Fedora Update System 2024-10-09 15:05:47 UTC
FEDORA-2024-02a5688338 (authselect-1.5.0-8.fc42) has been pushed to the Fedora 42 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 37 Pavel Březina 2024-10-09 19:00:54 UTC
Reopening since the problem was not solved. Just mitigated in authselect.

Comment 38 Vladislav Grigoryev 2024-10-10 05:05:21 UTC
(In reply to Pavel Březina from comment #35)
> So the problem should be fixed now by the cost of breaking `hostname --fqdn`

It works correctly for me:
```
> rpm -q authselect
authselect-1.5.0-8.fc41.x86_64

> sudo authselect select local -f --nobackup
Profile "local" was selected.

> grep -e ^hosts: /etc/nsswitch.conf
hosts:      files myhostname resolve [!UNAVAIL=return] dns

> sudo hostnamectl hostname fedora.example.org

> hostname -s
fedora

> hostname --short
fedora

> hostname -f
fedora.example.org

> hostname --fqdn 
fedora.example.org
```

Comment 39 Pavel Březina 2024-10-10 10:54:53 UTC
(In reply to Vladislav Grigoryev from comment #38)
> (In reply to Pavel Březina from comment #35)
> > So the problem should be fixed now by the cost of breaking `hostname --fqdn`
> 
> It works correctly for me:
> ```
> > rpm -q authselect
> authselect-1.5.0-8.fc41.x86_64
> 
> > sudo authselect select local -f --nobackup
> Profile "local" was selected.
> 
> > grep -e ^hosts: /etc/nsswitch.conf
> hosts:      files myhostname resolve [!UNAVAIL=return] dns
> 
> > sudo hostnamectl hostname fedora.example.org
> 
> > hostname -s
> fedora
> 
> > hostname --short
> fedora
> 
> > hostname -f
> fedora.example.org
> 
> > hostname --fqdn 
> fedora.example.org
> ```

It won't work if you set the hostname to a short name. In that case, --fqdn should look it up via a reverse DNS lookup, which is however intercepted by myhostname. See: https://bugzilla.redhat.com/show_bug.cgi?id=2257197
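To make the ordering problem concrete, here is a toy simulation (not the real NSS machinery; all names and addresses are made up) of how the position of a myhostname-style module decides whether a reverse lookup ever reaches the DNS-backed module:

```python
# Toy model of the nsswitch "hosts" pipeline for reverse lookups.
# Each module maps an address to a canonical name, or returns None.

def myhostname(addr):
    """Synthesizes an answer for the local address: only the short name."""
    return "fedora" if addr == "192.168.100.1" else None

def dns(addr):
    """Reverse DNS: knows the fully qualified name."""
    zone = {"192.168.100.1": "fedora.example.org"}
    return zone.get(addr)

def resolve_reverse(addr, modules):
    """First module that answers wins, as in nsswitch.conf."""
    for mod in modules:
        name = mod(addr)
        if name is not None:
            return name
    return None

# myhostname before dns: the synthesized short name shadows the FQDN,
# so a `hostname --fqdn`-style lookup never sees "fedora.example.org".
print(resolve_reverse("192.168.100.1", [myhostname, dns]))

# dns before myhostname: the FQDN comes back, but every lookup now
# goes through the network path first (the timeout problem in this bug).
print(resolve_reverse("192.168.100.1", [dns, myhostname]))
```

Either ordering sacrifices one property: a fast local answer or a correct FQDN, which is why the order keeps being moved back and forth.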

Comment 40 Aoife Moloney 2025-04-25 10:58:05 UTC
This message is a reminder that Fedora Linux 40 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 40 on 2025-05-13.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '40'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version. Note that the version field may be hidden.
Click the "Show advanced fields" button if you do not see it.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 40 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 41 Daniel Roesen 2025-05-23 09:37:15 UTC
FWIW, this change breaks Postfix startup for me. I have /etc/postfix/main.cf inet_interfaces = foo.example.com (my host's FQDN), and with the NSS myhostname plugin in charge, getent ahosts $(hostname) also returns the link-local IPv6 address of the host, and Postfix cannot deal with it. I haven't investigated yet whether a) NSS myhostname should return link-local addresses, b) it does so with the relevant interface scope (otherwise meaningless), and c) Postfix is supposed to be able to handle link-local addresses, so it is unclear which behaviour is wrong here.

Comment 42 Daniel Roesen 2025-05-23 11:04:23 UTC
Actually, getent ahosts does indeed return the scope ID, so b) is answered. And looking at Postfix's code, it seems that Postfix completely ignores address scoping, so it can't really deal with link-local addresses. And judging by a similar problem Debian dealt with in 2013 regarding ping6 vs. myhostname, the myhostname author expects the link-local addresses to be reported.

So to me, it looks like Postfix needs to better deal with link-local addresses. Either ignoring them, or properly handling them.
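As a sketch of the "ignore them" option, here is a small, hypothetical filter (illustrative only, not Postfix code) using Python's ipaddress module: strip the %scope suffix before parsing, and drop anything in fe80::/10.

```python
import ipaddress

def drop_link_local(addrs):
    """Filter out IPv6 link-local addresses (fe80::/10).

    getent/getaddrinfo may return entries such as 'fe80::1%eth0';
    older Pythons (<3.9) cannot parse the %scope suffix, so we split
    it off before handing the address to ipaddress.
    """
    kept = []
    for a in addrs:
        bare = a.split("%", 1)[0]       # 'fe80::1%eth0' -> 'fe80::1'
        ip = ipaddress.ip_address(bare)
        if not (ip.version == 6 and ip.is_link_local):
            kept.append(a)
    return kept

addrs = ["192.0.2.10", "2001:db8::10", "fe80::5303:3bac:39e5:be66%enp67s0"]
print(drop_link_local(addrs))  # ['192.0.2.10', '2001:db8::10']
```

Properly *handling* them instead would mean carrying the scope ID through to bind()/connect(), which is the harder of the two options.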

