Bug 1933433 - systemd-resolved: stub resolver is not working properly
Summary: systemd-resolved: stub resolver is not working properly
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: systemd
Version: 34
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: systemd-maint
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-02-27 19:27 UTC by Seppo Yli-Olli
Modified: 2021-04-07 23:22 UTC (History)

Fixed In Version: systemd-248~rc2-3.fc34
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-18 22:00:25 UTC
Type: Bug


Attachments
debug log of resolve failure with latest systemd (32.22 KB, text/plain)
2021-03-12 01:59 UTC, Adam Williamson


Links
System ID Private Priority Status Summary Last Updated
Github systemd systemd issues 18819 0 None closed systemd-resolved: stub resolver is not following chain of multiple CNAMEs for resolution 2021-03-19 01:16:35 UTC
Github systemd systemd issues 18972 0 None closed Stub resolver cannot resolve www.vox.com or bugzilla.redhat.com, missing A records in response from stub resolver 2021-03-19 01:16:35 UTC
Github systemd systemd pull 18892/ 0 None None None 2021-03-19 01:16:38 UTC

Description Seppo Yli-Olli 2021-02-27 19:27:53 UTC
Description of problem:
Names that resolve only through a CNAME cannot be resolved via the stub resolver. This effectively breaks all direct use of DNS outside the glibc resolver, including flatpak runtimes that do not ship systemd 247 or newer and various host utilities.

Version-Release number of selected component (if applicable):
systemd-networkd-248~rc2-1.fc34.x86_64

How reproducible:
Always

Steps to Reproduce:
1. nslookup www.netflix.com 127.0.0.53

Actual results:
No IP addresses in output, only CNAME

Expected results:
CNAME and IP addresses it resolves to are in output

Additional info:

Created upstream bug at https://github.com/systemd/systemd/issues/18819

Comment 1 Chris Murphy 2021-02-28 01:31:52 UTC
Fedora 33 Workstation:

$ nslookup www.netflix.com 127.0.0.53
Server:		127.0.0.53
Address:	127.0.0.53#53

Non-authoritative answer:
www.netflix.com	canonical name = www.dradis.netflix.com.
www.dradis.netflix.com	canonical name = www.us-west-2.internal.dradis.netflix.com.
www.us-west-2.internal.dradis.netflix.com	canonical name = dualstack.apiproxy-website-nlb-prod-2-e98cb8cf33ff3581.elb.us-west-2.amazonaws.com.
Name:	dualstack.apiproxy-website-nlb-prod-2-e98cb8cf33ff3581.elb.us-west-2.amazonaws.com
Address: 44.237.234.25
Name:	dualstack.apiproxy-website-nlb-prod-2-e98cb8cf33ff3581.elb.us-west-2.amazonaws.com
Address: 44.234.232.238
Name:	dualstack.apiproxy-website-nlb-prod-2-e98cb8cf33ff3581.elb.us-west-2.amazonaws.com
Address: 44.242.60.85
Name:	dualstack.apiproxy-website-nlb-prod-2-e98cb8cf33ff3581.elb.us-west-2.amazonaws.com
Address: 2600:1f14:62a:de82:822d:a423:9e4c:da8d
Name:	dualstack.apiproxy-website-nlb-prod-2-e98cb8cf33ff3581.elb.us-west-2.amazonaws.com
Address: 2600:1f14:62a:de81:b848:82ee:2416:447e
Name:	dualstack.apiproxy-website-nlb-prod-2-e98cb8cf33ff3581.elb.us-west-2.amazonaws.com
Address: 2600:1f14:62a:de80:69a8:7b12:8e5f:855d

$ rpm -q systemd
systemd-246.10-1.fc33.x86_64


Fedora 34 Workstation:

$ nslookup www.netflix.com 127.0.0.53
Server:		127.0.0.53
Address:	127.0.0.53#53

Non-authoritative answer:
www.netflix.com	canonical name = www.dradis.netflix.com.
www.dradis.netflix.com	canonical name = www.us-west-2.internal.dradis.netflix.com.

$ rpm -q systemd
systemd-248~rc2-1.fc34.x86_64

Comment 2 Fedora Blocker Bugs Application 2021-02-28 01:34:17 UTC
Proposed as a Freeze Exception for 34-beta by Fedora user chrismurphy using the blocker tracking app because:

 This is a regression, it'd be good to fix it. I'm not sure what criterion this would fall under for a blocker though.

Comment 3 Geoffrey Marr 2021-03-01 21:25:56 UTC
Discussed during the 2021-03-01 blocker review meeting: [0]

The decision to classify this bug as an "AcceptedFreezeException (Beta)" was made as it is a noticeable issue that cannot be fixed with an update.

[0] https://meetbot.fedoraproject.org/fedora-blocker-review/2021-03-01/f34-blocker-review.2021-03-01-17.01.txt

Comment 4 Michael Catanzaro 2021-03-01 21:52:34 UTC
This is effectively "the internet is broken," so we need to ensure it gets fixed no matter what, regardless of whether it meets any defined blocker criterion.

Comment 5 Zbigniew Jędrzejewski-Szmek 2021-03-01 22:21:17 UTC
This must be caused by the recent rework to CNAME handling.

Comment 6 Chris Murphy 2021-03-02 18:23:00 UTC
(In reply to Michael Catanzaro from comment #4)
> This is effectively "the internet is broken," so we need to ensure it gets
> fixed no matter what, regardless of whether it meets any defined blocker
> criterion.

I agree, but could this be explained in both general and practical terms so we might figure out what criterion applies? Or alternatively get FESCo to just say it's a blocker? Because I'm getting dnf and GNOME Software updates and package installs, and the web browser is working. And yet flatpaks are all over the map: some have working internet, others don't, with no discernible pattern.

Comment 7 Michael Catanzaro 2021-03-02 18:45:31 UTC
CNAME records are... sort of like symlinks, but for DNS A and AAAA records instead of files. For example, say I have alias.example.com and want it to be served by the same server that handles foobar.example.com. You could configure it like this:

A alias.example.com 192.0.2.0
A foobar.example.com 192.0.2.0

Or you could configure it like this:

A foobar.example.com 192.0.2.0
CNAME alias.example.com foobar.example.com

They are equivalent. Sort of. (This explanation is probably not 100% technically correct, but that's more or less how it works.) The latter configuration is currently broken, which means a large chunk of the internet will not be working, with no easily discernible pattern as to why some websites work and others don't.
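The symlink analogy can be made concrete with a small sketch. It models the two configurations above as in-memory tables and follows CNAME records until an address is found. The `resolve` helper and the table layout are assumptions made for illustration; this is not how systemd-resolved is implemented.

```python
# Toy model of the two zone configurations above; not systemd-resolved's code.
# The record layout and the resolve() helper are assumptions for illustration.

def resolve(records, name, max_hops=8):
    """Follow CNAME records from `name` until A records are found."""
    for _ in range(max_hops):
        addresses = records.get((name, "A"))
        if addresses:
            return name, addresses
        target = records.get((name, "CNAME"))
        if target is None:
            return name, []  # chain ends without an address
        name = target
    raise RuntimeError("CNAME chain too long or looping")


# Configuration 1: two independent A records.
two_a_records = {
    ("alias.example.com", "A"): ["192.0.2.0"],
    ("foobar.example.com", "A"): ["192.0.2.0"],
}

# Configuration 2: one A record plus a CNAME pointing at it.
a_plus_cname = {
    ("foobar.example.com", "A"): ["192.0.2.0"],
    ("alias.example.com", "CNAME"): "foobar.example.com",
}

print(resolve(two_a_records, "alias.example.com"))
print(resolve(a_plus_cname, "alias.example.com"))
```

Both lookups end at 192.0.2.0; only the canonical name differs. A client that is handed the CNAME without the address it leads to has nothing to connect to.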

In the case of Netflix, if we run 'dig www.netflix.com' on Fedora 33, which is not broken, we see this:

;; ANSWER SECTION:
www.netflix.com.	11	IN	CNAME	www.dradis.netflix.com.
www.dradis.netflix.com.	59	IN	CNAME	www.us-west-2.internal.dradis.netflix.com.
www.us-west-2.internal.dradis.netflix.com. 59 IN CNAME dualstack.apiproxy-website-nlb-prod-2-e98cb8cf33ff3581.elb.us-west-2.amazonaws.com.
dualstack.apiproxy-website-nlb-prod-2-e98cb8cf33ff3581.elb.us-west-2.amazonaws.com. 59 IN A 44.234.232.238
dualstack.apiproxy-website-nlb-prod-2-e98cb8cf33ff3581.elb.us-west-2.amazonaws.com. 59 IN A 44.237.234.25
dualstack.apiproxy-website-nlb-prod-2-e98cb8cf33ff3581.elb.us-west-2.amazonaws.com. 59 IN A 44.242.60.85

Notice that there is a CNAME record for www.netflix.com pointing to www.dradis.netflix.com, and there is no A record for www.netflix.com, so it matches the second example from above. In Fedora 34, systemd-resolved does not handle the CNAMEs properly, and so we wind up with several "broken links" between www.netflix.com and dualstack.apiproxy-website-nlb-prod-2-e98cb8cf33ff3581.elb.us-west-2.amazonaws.com. In short: the internet is broken.
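A rough way to spot such a "broken link" is to walk the answer section and see where the chain stops. The sketch below assumes the whitespace-separated "name TTL class type rdata" layout shown above. `FEDORA_34_ANSWER` is an assumption: it is reconstructed from the nslookup output in comment 1, not from a captured dig run, and `dangling_name` is a hypothetical helper.

```python
# Sketch: find where a CNAME chain in a dig-style ANSWER SECTION stops.
# Parsing is simplified; dangling_name() is a hypothetical helper.

def dangling_name(answer_text, qname):
    """Return None if qname reaches an A/AAAA record, else the name where the chain stops."""
    cnames, has_address = {}, set()
    for line in answer_text.strip().splitlines():
        fields = line.split()
        if len(fields) != 5 or line.startswith(";"):
            continue
        name, _ttl, _cls, rtype, rdata = fields
        if rtype == "CNAME":
            cnames[name] = rdata
        elif rtype in ("A", "AAAA"):
            has_address.add(name)
    name, seen = qname, set()
    while name not in has_address:
        if name not in cnames or name in seen:
            return name  # no further record for this name (or a loop)
        seen.add(name)
        name = cnames[name]
    return None


# Taken from the Fedora 33 output above (one A record kept for brevity).
FEDORA_33_ANSWER = """
www.netflix.com. 11 IN CNAME www.dradis.netflix.com.
www.dradis.netflix.com. 59 IN CNAME www.us-west-2.internal.dradis.netflix.com.
www.us-west-2.internal.dradis.netflix.com. 59 IN CNAME dualstack.apiproxy-website-nlb-prod-2-e98cb8cf33ff3581.elb.us-west-2.amazonaws.com.
dualstack.apiproxy-website-nlb-prod-2-e98cb8cf33ff3581.elb.us-west-2.amazonaws.com. 59 IN A 44.234.232.238
"""

# Assumption: what the truncated Fedora 34 answer would look like in dig format.
FEDORA_34_ANSWER = """
www.netflix.com. 11 IN CNAME www.dradis.netflix.com.
www.dradis.netflix.com. 59 IN CNAME www.us-west-2.internal.dradis.netflix.com.
"""

print(dangling_name(FEDORA_33_ANSWER, "www.netflix.com."))
print(dangling_name(FEDORA_34_ANSWER, "www.netflix.com."))
```

The first call returns `None` because the chain reaches an address. The second returns `www.us-west-2.internal.dradis.netflix.com.`, the name for which the answer contains neither a further CNAME nor an address.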

If it's not fixed upstream ASAP, then the bad systemd update should be reverted in the meantime, because an issue like this would effectively block normal usage of Fedora 34. I'm sure not going to upgrade before it's fixed. :P Asking for a special FESCo blocker seems like a good idea to me, because I doubt this fails any of our existing release criteria. The release criteria try to anticipate as far as possible the sort of bugs that are likely to be so bad as to warrant blocking a release, but they can't predict everything. This is the sort of wild issue that's unlikely to ever happen again and is not worth adding to the release criteria unless it can be heavily generalized to something basic like "DNS should work."

Comment 8 Michael Catanzaro 2021-03-03 00:46:31 UTC
(In reply to Chris Murphy from comment #6)
> And yet flatpaks are all
> over the map, some have working internet others don't, with no discernible
> pattern.

flatpak shouldn't have *any* impact on this. It may be that something unrelated is going wrong for flatpaks.

Comment 9 Chris Murphy 2021-03-03 05:58:38 UTC
See the domain list here: https://pagure.io/fesco/issue/2585#comment-718637

Every domain in that list resolves successfully in the Firefox (flathub) and Ungoogled Chromium (fedora) flatpaks on Fedora 33 with resolv.conf mode: stub, and on Fedora 34 with resolv.conf mode: foreign, managed by NetworkManager. Those domains all fail to resolve on Fedora 34 with resolv.conf mode: stub in those same flatpaks, but they resolve with Fedora's rpm Firefox.

If flatpaks are excluded, the scope of "the internet is broken" is limited to netflix. That's not likely worth blocking release. If there were a better test to understand the scope of the problem, that would be nice.

Comment 10 Michael Catanzaro 2021-03-03 14:09:40 UTC
So there are two cases here:

 * freedesktop-sdk <= 20.08 flatpaks: these will use nss-dns, read 127.0.0.53 from /etc/resolv.conf, and speak DNS to systemd-resolved on the host without knowing anything about systemd-resolved
 * Fedora 33/34 flatpaks, freedesktop-sdk 21.08 flatpaks: these will attempt to use nss-resolve and speak directly to systemd-resolved via varlink. This requires flatpak 1.10 or it will fail. If it fails, then it should fall back to nss-dns and then work the same as with older flatpaks.

I don't know how else to explain why names would be less likely to resolve in flatpaks, because regardless of how the flatpak app speaks to systemd-resolved, it should always receive the same results. I guess there is some other, separate bug that we don't understand and which has not been reported yet. E.g. the fallback from nss-resolve to nss-dns was broken in Fedora 33 for some time, eventually fixed by https://src.fedoraproject.org/rpms/systemd/c/779685bf4b1cdb74f6f20a6153299178a565e506?branch=f33. That particular issue could not have reappeared because the affected code no longer exists in Fedora 34, but it's possible that some sort of similar issue has appeared.

Comment 11 Michael Catanzaro 2021-03-03 14:13:03 UTC
(In reply to Michael Catanzaro from comment #10)
> So there are two cases here:

Er, it's actually three cases:

>  * Fedora 33/34 flatpaks, freedesktop-sdk 21.08 flatpaks: these will attempt
> to use nss-resolve and speak directly to systemd-resolved via varlink. This
> requires flatpak 1.10 or it will fail. If it fails, then it should fall back
> to nss-dns and then work the same as with older flatpaks.

Because Fedora 33 flatpaks are different. There, the flatpak will attempt to use the older Fedora 33 version of nss-resolve, which will attempt to speak D-Bus to systemd-resolved. That will be blocked by xdg-dbus-proxy because the app will not have permission. Then it should fall back to nss-dns.

Fedora 34 flatpaks and freedesktop-sdk 21.08 flatpaks have newer nss-resolve that will use varlink, which should hopefully work if you have flatpak 1.10. (But nobody has ever tested it before now, because the runtimes didn't exist yet. And it's not really possible to test if CNAMEs aren't working!)
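The three cases can be summarised as a small decision function. This is only a sketch of the description above: the function name, the runtime labels and the version tuple are assumptions made for illustration, and the real behaviour is decided by each runtime's NSS configuration, not by code like this.

```python
# Sketch of the three cases described above. All names here are hypothetical.

def resolution_path(runtime, flatpak_version):
    """Return how name lookups from inside a flatpak reach systemd-resolved."""
    if runtime == "freedesktop-sdk-20.08":  # stands for 20.08 and older
        # nss-dns reads 127.0.0.53 from /etc/resolv.conf and speaks plain DNS.
        return "nss-dns -> stub resolver"
    if runtime == "fedora-33":
        # Older nss-resolve tries D-Bus, which xdg-dbus-proxy blocks.
        return "nss-resolve (D-Bus, blocked) -> nss-dns -> stub resolver"
    if runtime in ("fedora-34", "freedesktop-sdk-21.08"):
        if flatpak_version >= (1, 10):
            return "nss-resolve -> varlink"
        # Without flatpak 1.10 the varlink attempt fails and nss-dns takes over.
        return "nss-resolve (varlink, fails) -> nss-dns -> stub resolver"
    raise ValueError(f"unknown runtime: {runtime}")


for runtime in ("freedesktop-sdk-20.08", "fedora-33", "fedora-34"):
    print(runtime, "=>", resolution_path(runtime, (1, 10)))
```

Every path that ends at the stub resolver depends on the stub returning complete answers.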

Comment 12 Michael Catanzaro 2021-03-03 14:55:03 UTC
(In reply to Chris Murphy from comment #9)
> If flatpaks are excluded, the scope of "the internet is broken" is limited
> to netflix. That's not likely worth blocking release. If there were a better
> test to understand the scope of the problem, that would be nice.

FWIW I assumed from the bug description that all CNAMEs were broken. It sounds like that is not the case after all....

Comment 13 Seppo Yli-Olli 2021-03-03 16:03:24 UTC
(In reply to Michael Catanzaro from comment #11)
> (In reply to Michael Catanzaro from comment #10)
> > So there are two cases here:
> 
> Er, it's actually three cases:
> 
> >  * Fedora 33/34 flatpaks, freedesktop-sdk 21.08 flatpaks: these will attempt
> > to use nss-resolve and speak directly to systemd-resolved via varlink. This
> > requires flatpak 1.10 or it will fail. If it fails, then it should fall back
> > to nss-dns and then work the same as with older flatpaks.

> Fedora 34 flatpaks and freedesktop-sdk 21.08 flatpaks have newer nss-resolve
> that will use varlink, which should hopefully work if you have flatpak 1.10.
> (But nobody has ever tested it before now, because the runtimes didn't exist
> yet. And it's not really possible to test if CNAMEs aren't working!)

FWIW I did test that freedesktop-sdk 21.08 successfully uses the varlink interface from flatpak; presumably the Fedora 34 runtime will work as well if the resolver is shipped. The exact criteria for which CNAMEs fail to resolve are unknown. It might have to do with multiple levels of indirection, but I don't run a DNS server for a test setup. Some CNAMEs are fine, e.g. www.youtube.com (there is a simple CNAME setup there, not multiple levels of indirection).

Comment 14 Seppo Yli-Olli 2021-03-03 16:05:28 UTC
For what it's worth, I'm currently using "ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf" as a workaround, which makes resolv.conf point at the real DNS servers instead of the stub resolver. That is possibly a less disruptive workaround in general than rolling back systemd. The systemd-resolved varlink interface is just fine; only the stub resolver is working incorrectly.
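To check which of the two setups a machine is in, one can look at the nameserver lines of /etc/resolv.conf. The sketch below is a simplified parser; the function name and the two sample file contents are assumptions for illustration, and 127.0.0.53 is the stub address used throughout this report.

```python
# Sketch: does a resolv.conf point at the systemd-resolved stub listener?
# uses_stub_resolver() and the sample texts are hypothetical.

STUB_ADDRESS = "127.0.0.53"


def uses_stub_resolver(resolv_conf_text):
    """Return True if any nameserver line names the stub address."""
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver" and fields[1] == STUB_ADDRESS:
            return True
    return False


# Hypothetical contents before and after the symlink workaround.
STUB_MODE = "# managed by systemd-resolved\nnameserver 127.0.0.53\nsearch example.lan\n"
DIRECT_MODE = "nameserver 192.168.1.1\nsearch example.lan\n"

print(uses_stub_resolver(STUB_MODE))
print(uses_stub_resolver(DIRECT_MODE))
```

On a live system the same function could be fed the contents of /etc/resolv.conf.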

Comment 15 Michael Catanzaro 2021-03-03 18:11:15 UTC
OK, that explains the behavior difference from flatpak then. Anything using nss-dns or reading /etc/resolv.conf directly is broken, but anything using nss-resolve -- our default -- works properly.

IMO this should still be a release blocker, but I would downgrade this to a final blocker instead of a beta blocker.

Comment 16 Michael Catanzaro 2021-03-03 18:15:48 UTC
Actually that still doesn't *fully* explain what is going on. Why is www.netflix.com broken in system Firefox? That should be using nss-resolve, not the stub resolver, right? So something must be wrong with more than just the stub resolver?

Comment 17 Chris Murphy 2021-03-03 19:39:04 UTC
This is a beta blocker now.
https://pagure.io/fesco/issue/2585#comment-718986

Comment 18 Chris Murphy 2021-03-03 19:40:46 UTC
>Why is www.netflix.com broken in system Firefox?
And today it's working for me! On the same two laptops in the same configuration it had been failing for the past ~2 days.

Comment 19 Seppo Yli-Olli 2021-03-03 20:23:46 UTC
I found another reproducer: www.akamai.com. As said in the upstream ticket, this pattern (CNAME1 -> CNAME2 -> A) seems to be quite common with Akamai sites, so there are probably a lot of affected sites.

Comment 20 Allan 2021-03-04 01:03:53 UTC
www.kernel.org has the same problem for me (several chained CNAMEs too).
Going back to systemd 247.3-3 fixed it for me.

Comment 21 Ben Cotton 2021-03-04 13:26:58 UTC
Declared a F34 Beta blocker by FESCo with the caveat that "if the scope is really small or something we can revisit next week."
https://pagure.io/fesco/issue/2585#comment-718986

Comment 22 Michael Catanzaro 2021-03-05 18:38:48 UTC
Upstream fix: https://github.com/systemd/systemd/pull/18892/

Comment 23 Adam Williamson 2021-03-05 19:52:49 UTC
Blocker status supersedes FE status, no need for both.

Comment 24 Adam Williamson 2021-03-05 19:54:14 UTC
"And today it's working for me! On the same two laptops in the same configuration it had been failing for the past ~2 days."

This is the pattern I'm seeing in the openQA case that I think is caused by this, too: *sometimes* it works, sometimes it doesn't, and it seems to go in patches (no test run will hit it for a while, then *every* test run will hit it for a while...)

Comment 25 Adam Williamson 2021-03-05 21:11:22 UTC
Here's a scratch build with the patch backported:
https://koji.fedoraproject.org/koji/taskinfo?taskID=63160807
please test and see how it goes for you, thanks.

Comment 26 Chris Murphy 2021-03-05 21:33:31 UTC
Are CNAMEs typically ephemeral and this accounts for the variable behavior?

Comment 27 Seppo Yli-Olli 2021-03-05 21:59:03 UTC
Tested with that scratch build: both www.akamai.com and www.netflix.com have their addresses in the ANSWER SECTION as expected, and Netflix is accessible again with Firefox inside flatpak.

Comment 28 Chris Murphy 2021-03-05 22:02:54 UTC
Works for me as well, and now Firefox and Ungoogled Chromium flatpaks are resolving all the previous sites that were failing.

Comment 29 Adam Williamson 2021-03-06 00:26:42 UTC
"Are CNAMEs typically ephemeral and this accounts for the variable behavior?"

Not typically, no, you don't usually want to change your DNS records too much. I'd think it must be something else, though I've no idea what. Round-robin responses, possibly.

Comment 30 Seppo Yli-Olli 2021-03-06 11:55:37 UTC
First-level CNAMEs quite typically have a very short TTL in certain special scenarios such as CDNs. If you want to debug further, we need actual sample hostnames for the failures. But it seems the systemd-resolved fix was sufficient. The question is whether we will get a regular build with it as a backport or wait for the next systemd RC, which presumably has the fix. Note that for the openQA use case it would be helpful to understand the architecture there: is the failing test using a DNS client that reads /etc/resolv.conf rather than using the glibc resolver? If yes, then it would likely be affected by this same thing.

Comment 31 Andrew Thurman 2021-03-06 15:27:32 UTC
(In reply to Adam Williamson from comment #25)
> Here's a scratch build with the patch backported:
> https://koji.fedoraproject.org/koji/taskinfo?taskID=63160807
> please test and see how it goes for you, thanks.

Netflix, Akamai, Ask Fedora, Kernel.org, and all the other cases seem to be accessible from this build on Silverblue 34.

Comment 32 Adam Williamson 2021-03-06 17:02:54 UTC
I'd rather not take a new RC. We're frozen. We need specific backports of specific fixes for the accepted FE and blocker bugs.

Comment 33 Fedora Update System 2021-03-06 22:15:08 UTC
FEDORA-2021-ead59f24eb has been submitted as an update to Fedora 34. https://bodhi.fedoraproject.org/updates/FEDORA-2021-ead59f24eb

Comment 34 lethalwp 2021-03-07 13:33:59 UTC
I think it is incompletely fixed.

I am running Fedora 34 with systemd-248~rc2-3.fc34.x86_64.
My internet connection is IPv4+IPv6 dual-stack.

Under that I have a Windows VM that I use with virt-manager. It uses systemd-resolved as DNS.

On the previous buggy version I had the CNAME bug: nslookup www.intel.com gave me an empty result (because no IP was given with the CNAME).

This now works.


But I think there's another bug concerning IPv6.

results from the cmd:
C:\Users\lethalwp>nslookup
Default Server:  little.lethalwp
Address:  192.168.122.1

> www.intel.com
Server:  little.lethalwp
Address:  192.168.122.1

Non-authoritative answer:
Name:    e11.dsca.akamaiedge.net   ----> THIS ONE IS NOW OK.
Address:  23.61.4.6
Aliases:  www.intel.com
          intel11.cn.edgekey.net

> mail.google.com
Server:  little.lethalwp
Address:  192.168.122.1

Non-authoritative answer:
Name:    googlemail.l.google.com
Address:  2a00:1450:400e:802::2005   ---> no V4 given?
Aliases:  mail.google.com




C:\Users\lethalwp>ping mail.google.com
Ping request could not find host mail.google.com. Please check the name and try again.   <--- it tries to reach the IPV6 i suppose? No IPV4 Address available.

This makes mail.google.com unavailable in the Windows VM.


I can't compare with how it worked on Fedora 33 previously; I don't have that system available anymore.

Comment 35 lethalwp 2021-03-07 13:40:11 UTC
dig shows both AAAA and A being returned; I don't know why Windows only shows the IPv6 address.
My Windows VM is also dual-stack, but its IPv6 address is only link-local.


I also notice inconsistent results:
[lethalwp@little ~]$ dig @192.168.122.1 mail.google.com

; <<>> DiG 9.16.11-RedHat-9.16.11-5.fc34 <<>> @192.168.122.1 mail.google.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 28315
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;mail.google.com.		IN	A

;; ANSWER SECTION:
mail.google.com.	5951	IN	CNAME	googlemail.l.google.com.

;; AUTHORITY SECTION:
googlemail.l.google.com. 227	IN	AAAA	2a00:1450:400e:802::2005
googlemail.l.google.com. 207	IN	A	172.217.17.37

;; Query time: 0 msec
;; SERVER: 192.168.122.1#53(192.168.122.1)
;; WHEN: dim mar 07 14:35:27 CET 2021
;; MSG SIZE  rcvd: 115





10 seconds later:

[lethalwp@little ~]$ dig @192.168.122.1 mail.google.com

; <<>> DiG 9.16.11-RedHat-9.16.11-5.fc34 <<>> @192.168.122.1 mail.google.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 13136
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;mail.google.com.		IN	A

;; ANSWER SECTION:
mail.google.com.	5849	IN	CNAME	googlemail.l.google.com.

;; Query time: 0 msec
;; SERVER: 192.168.122.1#53(192.168.122.1)
;; WHEN: dim mar 07 14:37:08 CET 2021
;; MSG SIZE  rcvd: 81

Comment 36 Michael Catanzaro 2021-03-07 14:44:39 UTC
Well it certainly looks like something is wrong... but that doesn't appear to be related to CNAMEs, does it? So please file a new bug. (Well, ideally two: an upstream bug is most important, but we also need a downstream bug for blocker or freeze exception purposes.)

Comment 37 Michael Catanzaro 2021-03-07 14:48:39 UTC
(In reply to lethalwp from comment #35)
> [lethalwp@little ~]$ dig @192.168.122.1 mail.google.com

Wait, you're showing results from your router... this is not coming from systemd-resolved's stub resolver. The stub resolver would be 127.0.0.53.

This bug report is for issues with CNAMEs using the stub resolver. But it looks like whatever is going wrong for you involves neither CNAMEs nor the stub resolver.

Comment 38 Seppo Yli-Olli 2021-03-07 16:22:23 UTC
Well, I can reproduce the same thing here, really. nslookup gives both A and AAAA for mail.google.com with real DNS but only AAAA with the stub resolver. While the answer is incomplete, I don't think this is a blocker on the same level as the original CNAME issue. I didn't notice such issues since I have a fully functional native IPv6 stack, so the AAAA responses actually worked for me. File a separate bug *at least* on the systemd side. I think it would be clearer to have a separate bug in RHBZ for this as well.

Comment 39 Seppo Yli-Olli 2021-03-07 16:37:50 UTC
Basically compare the output of "dig @127.0.0.53 mail.google.com" vs "dig @127.0.0.53 mail.google.com A" vs "dig @127.0.0.53 mail.google.com AAAA" vs "dig @8.8.8.8 mail.google.com" vs "dig @8.8.8.8 mail.google.com AAAA". The responses are wildly different. Also compare "nslookup mail.google.com 127.0.0.53" vs "nslookup mail.google.com 8.8.8.8". If you query the AAAA record, you will get a sensible response from the stub resolver. But with the default query or A you again get a broken answer section.

Comment 40 Seppo Yli-Olli 2021-03-07 18:47:57 UTC
I think the mail.google.com issue is something quite different, though, since the behaviour I was seeing stopped reproducing later today. The specific, reproducible issue I originally reported is fixed.

Comment 41 Fedora Update System 2021-03-08 21:42:30 UTC
FEDORA-2021-ead59f24eb has been pushed to the Fedora 34 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 42 Michael Catanzaro 2021-03-11 20:36:42 UTC
I upgraded to F34, I have systemd-248~rc2-3, and anything using the stub resolver is almost unusable. The problem is slightly different now in that A records are broken, not missing CNAMEs, but the effect is the same: a large subset of the internet does not work. Reopening.

Comment 43 Seppo Yli-Olli 2021-03-11 20:57:50 UTC
Right. In the scope of generic issues I have been also seeing various sites not resolve for a while only to start resolving just when I'm typing a bug report. There's definitely something wrong with systemd-resolved but I haven't yet been able to produce good reproducers so didn't file another bug report. Something that may affect these test cases is the DNS cache. Ensuring clean cache might help reproducing the issues better. Debug logging from systemd-resolved might be useful, iirc there was some environment variable to toggle it if you run a daemon manually.

Comment 44 Chris Murphy 2021-03-11 22:31:30 UTC
Testers looking to get more info out of resolved without enabling full systemd debug...

$ sudo systemctl edit systemd-resolved

Add the following two lines in the section for overrides.

[Service]
Environment=SYSTEMD_LOG_LEVEL=debug

Save it.

$ sudo systemctl restart systemd-resolved

Comment 45 Adam Williamson 2021-03-12 01:27:40 UTC
To support Michael and Seppo, even with systemd-248~rc2-3.fc34, openQA is running into issues resolving mirrors.fedoraproject.org quite a lot on F34 (but not on previous releases). That's from within Fedora infra, where the record should look like this:

;; QUESTION SECTION:
;mirrors.fedoraproject.org.	IN	A

;; ANSWER SECTION:
mirrors.fedoraproject.org. 300	IN	CNAME	wildcard.fedoraproject.org.
wildcard.fedoraproject.org. 60	IN	A	10.3.163.75
wildcard.fedoraproject.org. 60	IN	A	10.3.163.76
wildcard.fedoraproject.org. 60	IN	A	10.3.163.77
wildcard.fedoraproject.org. 60	IN	A	10.3.163.74

It returns the four A records for wildcard in a different order each time; it's a round-robin setup.
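Because the order rotates, two responses should be compared as sets rather than line by line. The sketch below drops TTLs and ordering before comparing. `SECOND` is a hypothetical reordering of the record above, `CNAME_ONLY` is a hypothetical truncated answer, and `record_set` is an illustrative helper.

```python
# Sketch: compare dig-style answers while ignoring TTLs and record order.
# record_set() is a hypothetical helper; SECOND and CNAME_ONLY are made-up inputs.

def record_set(answer_text):
    """Return the answer as a set of (name, type, rdata) tuples."""
    records = set()
    for line in answer_text.strip().splitlines():
        fields = line.split()
        if len(fields) == 5 and not line.startswith(";"):
            name, _ttl, _cls, rtype, rdata = fields
            records.add((name, rtype, rdata))
    return records


FIRST = """
mirrors.fedoraproject.org. 300 IN CNAME wildcard.fedoraproject.org.
wildcard.fedoraproject.org. 60 IN A 10.3.163.75
wildcard.fedoraproject.org. 60 IN A 10.3.163.76
wildcard.fedoraproject.org. 60 IN A 10.3.163.77
wildcard.fedoraproject.org. 60 IN A 10.3.163.74
"""

# Same records, rotated and with different TTLs.
SECOND = """
mirrors.fedoraproject.org. 212 IN CNAME wildcard.fedoraproject.org.
wildcard.fedoraproject.org. 41 IN A 10.3.163.77
wildcard.fedoraproject.org. 41 IN A 10.3.163.74
wildcard.fedoraproject.org. 41 IN A 10.3.163.75
wildcard.fedoraproject.org. 41 IN A 10.3.163.76
"""

# An answer that stops at the CNAME.
CNAME_ONLY = """
mirrors.fedoraproject.org. 300 IN CNAME wildcard.fedoraproject.org.
"""

print(record_set(FIRST) == record_set(SECOND))
print(sorted(record_set(FIRST) - record_set(CNAME_ONLY)))
```

The first comparison treats the rotated answer as identical; the set difference lists the A records that a truncated answer is missing.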

Comment 46 Adam Williamson 2021-03-12 01:59:54 UTC
Created attachment 1762882
debug log of resolve failure with latest systemd

So I got a debug log from a resolve failure in openQA (thanks Chris), here it is.

Comment 47 Adam Williamson 2021-03-12 23:43:19 UTC
https://koji.fedoraproject.org/koji/taskinfo?taskID=63668153 is a scratch build with a workaround mcatanzaro suggested: it includes a config snippet that should disable resolved's cache. If affected folks could test it out that'd be great.
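The contents of that config snippet are not quoted in this report. resolved.conf does have a `Cache=` option in its `[Resolve]` section, so the sketch below assumes a drop-in of that shape and checks whether it turns the cache off. `SAMPLE_DROPIN` and `cache_disabled` are hypothetical, and the parsing is a simplification of systemd's own config parser.

```python
# Sketch: check whether a resolved.conf-style snippet disables the cache.
# SAMPLE_DROPIN is an assumed shape, not the file shipped in the scratch build.
import configparser

SAMPLE_DROPIN = "[Resolve]\nCache=no\n"


def cache_disabled(dropin_text):
    """Return True if the snippet sets Cache= to a false value in [Resolve]."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(dropin_text)
    # resolved caches by default, so a missing setting counts as enabled.
    value = parser.get("Resolve", "Cache", fallback="yes")
    return value.strip().lower() in ("no", "false", "off", "0")


print(cache_disabled(SAMPLE_DROPIN))
print(cache_disabled("[Resolve]\nCache=no-negative\n"))
```

`Cache=no-negative` only stops caching of negative answers, so the second call reports the cache as still enabled.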

Comment 48 Fedora Update System 2021-03-13 02:28:32 UTC
FEDORA-2021-c2bfa5e4f6 has been submitted as an update to Fedora 34. https://bodhi.fedoraproject.org/updates/FEDORA-2021-c2bfa5e4f6

Comment 49 Chris Murphy 2021-03-13 05:13:49 UTC
systemd-248~rc2-6.fc34 
f34 now matches f33 behavior; however, for www.vox.com I get different answers from the two servers:

dig @127.0.0.53
;; ANSWER SECTION:
www.vox.com.            54460   IN      CNAME   vox-chorus.map.fastly.net.
vox-chorus.map.fastly.net. 21   IN      A       151.101.69.52

dig @8.8.8.8
;; ANSWER SECTION:
www.vox.com.            13853   IN      CNAME   vox-chorus.map.fastly.net.
vox-chorus.map.fastly.net. 10   IN      A       151.101.1.52
vox-chorus.map.fastly.net. 10   IN      A       151.101.65.52
vox-chorus.map.fastly.net. 10   IN      A       151.101.129.52
vox-chorus.map.fastly.net. 10   IN      A       151.101.193.52

Comment 50 Michael Catanzaro 2021-03-13 14:38:24 UTC
Does it work? As long as it's a fastly IP that successfully loads www.vox.com, that's probably fine. The stub resolver is not expected to return the same results as a real DNS server.

(In reply to Adam Williamson from comment #47)
> https://koji.fedoraproject.org/koji/taskinfo?taskID=63668153 is a scratch
> build with a workaround mcatanzaro suggested: it includes a config snippet
> that should disable resolved's cache. If affected folks could test it out
> that'd be great.

Oh good idea. That should at least significantly reduce the impact of this bug.

Comment 51 Adam Williamson 2021-03-13 18:36:44 UTC
The scratch build worked well in some openQA tests I ran, so I sent out an official build and update with the same change. That's https://bodhi.fedoraproject.org/updates/FEDORA-2021-c2bfa5e4f6 . Please test it and see how it works for you. RC2 will include it.

Comment 52 Fedora Update System 2021-03-13 19:27:49 UTC
FEDORA-2021-c2bfa5e4f6 has been pushed to the Fedora 34 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --advisory=FEDORA-2021-c2bfa5e4f6`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2021-c2bfa5e4f6

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 53 Michael Catanzaro 2021-03-15 13:53:47 UTC
(In reply to Adam Williamson from comment #47)
> https://koji.fedoraproject.org/koji/taskinfo?taskID=63668153 is a scratch
> build with a workaround mcatanzaro suggested: it includes a config snippet
> that should disable resolved's cache. If affected folks could test it out
> that'd be great.

This build is good. Of course it would be better to not disable the DNS cache, but now we have downgraded a release blocker to just a regular bug. Excellent.

Comment 54 Fedora Update System 2021-03-16 00:29:12 UTC
FEDORA-2021-c2bfa5e4f6 has been pushed to the Fedora 34 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 55 Adam Williamson 2021-03-16 01:43:00 UTC
-6 went stable and does seem to work around this successfully, so I'm gonna call that good enough to drop blocker status. There's now a proper fix pending upstream, I will do a build+update for that tonight or tomorrow if no-one else gets to it first.

Comment 56 Andrew Thurman 2021-03-17 14:37:52 UTC
I am having various issues on F34 testing (systemd-248~rc2-8.fc34) but not stable (systemd-248~rc2-6.fc34). I am not technical enough to know if this is related, but I'll post it here anyway.

On systemd-248~rc2-8.fc34 I cannot access https://wiki.gnome.org/, many fonts and images are not rendered properly on various sites, I cannot accept my self-signed certificate in GNOME calendar flatpak (but I can in nautilus and Firefox), and I cannot access Flathub on the command line:
```
[andythurman@rockhopper ~]$ flatpak install --verbose --ostree-verbose foo
F: No installations directory in /etc/flatpak/installations.d. Skipping
F: Opening system flatpak installation at path /var/lib/flatpak
F: Opening user flatpak installation at path /var/home/andythurman/.local/share/flatpak
Looking for matches…
F: Calling system helper: GenerateOciSummary
F: Fetching summary index file for remote ‘flathub’
F: Loading https://dl.flathub.org/repo/summary.idx using libsoup
F: Failed to download optional summary index: Could not connect: Network is unreachable
F: An error was encountered searching remote ‘flathub’ for ‘foo’: Unable to load summary from remote flathub: Could not connect: Network is unreachable
F: Fetching summary index file for remote ‘flathub-beta’
F: Loading https://dl.flathub.org/beta-repo/summary.idx using libsoup
F: Failed to download optional summary index: Could not connect: Network is unreachable
F: An error was encountered searching remote ‘flathub-beta’ for ‘foo’: Unable to load summary from remote flathub-beta: Could not connect: Network is unreachable
F: Fetching summary index file for remote ‘gnome-nightly’
F: Loading https://nightly.gnome.org/repo/summary.idx using libsoup
F: Failed to download optional summary index: Could not connect: Network is unreachable
F: An error was encountered searching remote ‘gnome-nightly’ for ‘foo’: Unable to load summary from remote gnome-nightly: Could not connect: Network is unreachable
Found similar ref(s) for ‘foo’ in remote ‘fedora’ (system).
Use this remote? [Y/n]: n
error: No remote chosen to resolve matches for ‘foo’
```

Comment 57 Ondrej Holy 2021-03-17 15:10:06 UTC
Just note that I see perhaps related network issues in F34 toolbox as well: https://bugzilla.redhat.com/show_bug.cgi?id=1934788.

Comment 58 Adam Williamson 2021-03-17 15:25:00 UTC
Thanks a lot for reporting. So either I muffed the patch backport, or we still have issues here. I'll ask Lennart to take a look at it.

Comment 59 Andrew Thurman 2021-03-17 15:42:28 UTC
(In reply to Adam Williamson from comment #58)
> Thanks a lot for reporting. So either I muffed the patch backport, or we
> still have issues here. I'll ask Lennart to take a look at it.

Scratch that. This seems to be unrelated, as the issue persists even with the old systemd overridden into 34-testing. I'm going to dig a little deeper.

Comment 60 Michael Catanzaro 2021-03-17 16:45:25 UTC
I've tested Adam's systemd-248~rc2-8.fc34 and it fixes this issue for me.

Comment 61 Garrett LeSage 2021-03-18 13:19:53 UTC
I've seen the same issues when using flatpak from the command line on systemd-248~rc2-8.fc34 as Andrew has... and systemd-248~rc2-6.fc34 is working fine here. I've rolled between versions using Silverblue on the same machine and also tested it on my non-Silverblue laptop, which is on F34 beta with systemd-248~rc2-6.fc34.

It's even a problem outside of flatpak: the issue also arises with other network commands run against dl.flathub.org, such as `ping dl.flathub.org` (and mtr / traceroute).

Once in a while, ping worked, but most of the time, I'd get:

$ ping dl.flathub.org
/usr/bin/ping: connect: Network is unreachable

Whereas it always works in systemd-248~rc2-6.fc34.

Not sure if it matters, but my home network is IPv4 & IPv6 whereas my ISP only provides IPv4. (Wild guess: It could be trying to route to an external IPv6 address and failing?)

Comment 62 Michael Catanzaro 2021-03-18 14:17:31 UTC
Hm... Garrett, could you please post the output of:

$ ping -v -c1 dl.flathub.org
$ resolvectl query dl.flathub.org
$ dig dl.flathub.org
$ dig @1.1.1.1 dl.flathub.org

At least the output when ping is failing, though if you're able to see a difference between good and bad output, that would be good too.

Adam, I think we'd better stick with -6 for F34 beta.

Comment 63 Adam Williamson 2021-03-18 16:37:35 UTC
Yes, Beta RC3 has -6 still, we didn't pull in -8.

Comment 64 Adam Williamson 2021-03-18 20:32:56 UTC
Looks like I'm having trouble with -8 also, with retrace.fedoraproject.org.

[adamw@xps13k ~]$ ping -v retrace.fedoraproject.org
ping: connect: Network is unreachable
[adamw@xps13k ~]$ resolvectl query retrace.fedoraproject.org
retrace.fedoraproject.org: 2620:52:3:1:dead:beef:cafe:c005 -- link: wlp58s0
                           (retrace03.rdu-cc.fedoraproject.org)

-- Information acquired via protocol DNS in 2.3ms.
-- Data is authenticated: no; Data was acquired via local or encrypted transport: no
-- Data from: cache
[adamw@xps13k ~]$ dig @127.0.0.53 retrace.fedoraproject.org

; <<>> DiG 9.16.11-RedHat-9.16.11-5.fc34 <<>> @127.0.0.53 retrace.fedoraproject.org
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 47990
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;retrace.fedoraproject.org.	IN	A

;; ANSWER SECTION:
retrace.fedoraproject.org. 53	IN	CNAME	retrace03.rdu-cc.fedoraproject.org.
retrace03.rdu-cc.fedoraproject.org. 6953 IN A	8.43.85.61

;; AUTHORITY SECTION:
retrace03.rdu-cc.fedoraproject.org. 53 IN AAAA	2620:52:3:1:dead:beef:cafe:c005

;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Thu Mar 18 13:32:35 PDT 2021
;; MSG SIZE  rcvd: 129

[adamw@xps13k ~]$ dig @1.1.1.1 retrace.fedoraproject.org

; <<>> DiG 9.16.11-RedHat-9.16.11-5.fc34 <<>> @1.1.1.1 retrace.fedoraproject.org
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 5324
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;retrace.fedoraproject.org.	IN	A

;; ANSWER SECTION:
retrace.fedoraproject.org. 300	IN	CNAME	retrace03.rdu-cc.fedoraproject.org.
retrace03.rdu-cc.fedoraproject.org. 86400 IN A	8.43.85.61

;; Query time: 715 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Thu Mar 18 13:32:43 PDT 2021
;; MSG SIZE  rcvd: 101

Comment 65 Adam Williamson 2021-03-18 20:42:33 UTC
The output from 'resolvectl' compared to the result of 'dig' seems revealing: resolvectl is giving the IPv6 address, and I do not have IPv6 connectivity.
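
A minimal Python sketch of the mismatch described here, under stated assumptions: the helper names `resolved_families` and `has_route` are hypothetical, and the probe relies on the fact that connecting a UDP socket sends no packet but fails when the kernel has no route for that address family. It is only a diagnostic aid and says nothing about how systemd-resolved or ping pick an address.

```python
import socket


def resolved_families(host, resolver=socket.getaddrinfo):
    """Return the address families ('IPv4', 'IPv6') the resolver yields for host.

    `resolver` is injectable only so the function can be exercised offline;
    by default it is the system resolver via socket.getaddrinfo.
    """
    names = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}
    try:
        infos = resolver(host, None, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr).
    return {names[info[0]] for info in infos if info[0] in names}


def has_route(family, probe_addr):
    """Report whether the kernel has a route to probe_addr for this family.

    connect() on a UDP socket transmits nothing; it only asks the kernel to
    pick a route, so it raises OSError ("Network is unreachable") when there
    is none. Any other socket error is also treated as "no route".
    """
    try:
        with socket.socket(family, socket.SOCK_DGRAM) as sock:
            sock.connect((probe_addr, 53))
        return True
    except OSError:
        return False
```

On a machine in the state described above, one might expect `resolved_families("retrace.fedoraproject.org")` to contain `"IPv6"` while `has_route(socket.AF_INET6, "2620:52:3:1:dead:beef:cafe:c005")` returns `False`; that combination would match a resolver handing out an address the host cannot reach.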

Comment 66 Michael Catanzaro 2021-03-18 20:53:25 UTC
You also proved it's not an issue with the stub resolver this time, though, because it happens with 'resolvectl query', which does not use the stub resolver, and the stub resolver returns only an A record in the ANSWER section. I suspected this already, since Garrett mentioned the issue is occurring outside flatpaks. So I think it's time to create a new (final blocker) bug report, as we seem to have finally reached the end of the stub resolver CNAME madness, and now have something totally different here.

FWIW it seems like retrace.fedoraproject.org really is broken (or dropping ICMPv6), because it's not pingable via 2620:52:3:1:dead:beef:cafe:c005, even though I do have working IPv6.

Comment 67 David Sebek 2021-03-18 21:02:15 UTC
I am experiencing a similar problem on a fresh Fedora 34 installation after all available updates are installed. My computer has a link-local IPv6 address, but I only have IPv4 access to the internet.

What I did:
1. Installed Fedora 34 (Fedora-Workstation-Live-x86_64-34-20210317.n.0.iso)
2. Installed updates (sudo dnf --refresh upgrade).

After that, I occasionally get errors when using dnf or flatpak to check for updates or to download new packages. The problem is intermittent; at other times everything works without any problem.

Sometimes, "sudo dnf --refresh upgrade" cannot update repository metadata and gives me a bunch of Curl errors:
Errors during downloading metadata for repository 'fedora-modular':
  - Curl error (7): Couldn't connect to server for https://mirrors.fedoraproject.org/metalink?repo=fedora-modular-34&arch=x86_64 []

At the same time, "ping mirrors.fedoraproject.org" returns "Network is unreachable", I cannot access https://mirrors.fedoraproject.org from Firefox, and dig outputs:
[david@pc4 ~]$ dig mirrors.fedoraproject.org

; <<>> DiG 9.16.11-RedHat-9.16.11-5.fc34 <<>> mirrors.fedoraproject.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 13998
;; flags: qr rd ra; QUERY: 1, ANSWER: 11, AUTHORITY: 6, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;mirrors.fedoraproject.org.	IN	A

;; ANSWER SECTION:
mirrors.fedoraproject.org. 37	IN	CNAME	wildcard.fedoraproject.org.
wildcard.fedoraproject.org. 37	IN	A	209.132.190.2
wildcard.fedoraproject.org. 37	IN	A	152.19.134.142
wildcard.fedoraproject.org. 37	IN	A	38.145.60.21
wildcard.fedoraproject.org. 37	IN	A	8.43.85.67
wildcard.fedoraproject.org. 37	IN	A	38.145.60.20
wildcard.fedoraproject.org. 37	IN	A	8.43.85.73
wildcard.fedoraproject.org. 37	IN	A	140.211.169.206
wildcard.fedoraproject.org. 37	IN	A	67.219.144.68
wildcard.fedoraproject.org. 37	IN	A	152.19.134.198
wildcard.fedoraproject.org. 37	IN	A	140.211.169.196

;; AUTHORITY SECTION:
wildcard.fedoraproject.org. 37	IN	AAAA	2610:28:3090:3001:dead:beef:cafe:fed3
wildcard.fedoraproject.org. 37	IN	AAAA	2620:52:3:1:dead:beef:cafe:fed7
wildcard.fedoraproject.org. 37	IN	AAAA	2605:bc80:3010:600:dead:beef:cafe:feda
wildcard.fedoraproject.org. 37	IN	AAAA	2605:bc80:3010:600:dead:beef:cafe:fed9
wildcard.fedoraproject.org. 37	IN	AAAA	2620:52:3:1:dead:beef:cafe:fed6
wildcard.fedoraproject.org. 37	IN	AAAA	2604:1580:fe00:0:dead:beef:cafe:fed1

;; Query time: 1 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Thu Mar 18 15:58:18 EDT 2021
;; MSG SIZE  rcvd: 405

When I run "curl --verbose https://mirrors.fedoraproject.org/metalink?repo=fedora-modular-34&arch=x86_64", it seems to try those IPv6 addresses and fails with "Network is unreachable" because my internet connection is IPv4.

But sometimes everything works fine. I noticed that in such cases, "dig mirrors.fedoraproject.org" does not print the AUTHORITY section with AAAA records, and dnf can successfully download files from the internet; curl and ping are also successful.

Comment 68 Michael Catanzaro 2021-03-18 22:02:26 UTC
The output again confirms the stub resolver is working properly (only 'A' records in the ANSWER section), so it's time for a new bug report please.
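
The check being made by eye in these comments (address records in the ANSWER section, AAAA records turning up under AUTHORITY) could be sketched in Python roughly as follows. This is an illustration under assumptions, not a tool anyone in this report used: the function names are hypothetical, the parser only understands the default `dig` text layout shown above, and `SAMPLE` is an excerpt of the comment 64 output with whitespace normalized.

```python
SAMPLE = """\
;; QUESTION SECTION:
;retrace.fedoraproject.org. IN A

;; ANSWER SECTION:
retrace.fedoraproject.org. 53 IN CNAME retrace03.rdu-cc.fedoraproject.org.
retrace03.rdu-cc.fedoraproject.org. 6953 IN A 8.43.85.61

;; AUTHORITY SECTION:
retrace03.rdu-cc.fedoraproject.org. 53 IN AAAA 2620:52:3:1:dead:beef:cafe:c005

;; Query time: 0 msec
"""


def records_by_section(dig_output):
    """Map each dig section name (e.g. 'ANSWER') to a list of (name, type, rdata)."""
    sections = {}
    current = None
    for raw in dig_output.splitlines():
        line = raw.strip()
        if line.startswith(";;") and line.endswith(" SECTION:"):
            # ";; ANSWER SECTION:" -> "ANSWER"; "OPT PSEUDOSECTION:" is not matched.
            current = line[2:-len(" SECTION:")].strip()
            sections[current] = []
        elif not line:
            # A blank line ends the current section.
            current = None
        elif current and not line.startswith(";"):
            # Record lines look like: name TTL class type rdata...
            fields = line.split()
            if len(fields) >= 5:
                sections[current].append((fields[0], fields[3], " ".join(fields[4:])))
    return sections


def misplaced_aaaa(dig_output):
    """Return AAAA records that appear in the AUTHORITY section."""
    authority = records_by_section(dig_output).get("AUTHORITY", [])
    return [record for record in authority if record[1] == "AAAA"]
```

Calling `misplaced_aaaa()` on the text of a failing `dig` run and on a working one would make the difference described in comment 67 explicit, without relying on spotting the extra section by eye.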

Comment 69 Pavel Raiskup 2021-03-19 07:55:48 UTC
(In reply to Michael Catanzaro from comment #66)
> [snip] So I think it's time to create a new (final blocker) bug report, as we
> seem to have finally reached the end of the stub resolver CNAME madness, and
> now have something totally different here.

Has anyone debugged the problems so far, and reported the blocker?  I can't
reproduce myself, but we have bug 1933506 that somewhat transitively links here.

Comment 70 Michael Catanzaro 2021-03-19 13:52:13 UTC
I'm not aware of any downstream bug report yet.

There is an upstream report at https://github.com/systemd/systemd/issues/19049.

Comment 71 Adam Williamson 2021-03-19 16:53:48 UTC
I filed https://bugzilla.redhat.com/show_bug.cgi?id=1940715 .

Comment 72 Adam Williamson 2021-04-07 23:22:20 UTC
Sorry, wrong link. I filed https://bugzilla.redhat.com/show_bug.cgi?id=1947214 .

Comment 73 Adam Williamson 2021-04-07 23:22:52 UTC
goddamnit. wrong bug. i have too many tabs.

