Bug 1896437 - NetworkManager assigns ~. DNS routing domain to the wrong network interface
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: NetworkManager
Version: 33
Hardware: x86_64
OS: Linux
Target Milestone: ---
Assignee: Lubomir Rintel
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-11-10 14:37 UTC by Gergely Gombos
Modified: 2021-03-17 14:24 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-17 14:24:37 UTC
Type: Bug
Embargoed:



Description Gergely Gombos 2020-11-10 14:37:55 UTC
Description of problem:
Some websites behind my company VPN stopped working after upgrading to F33.
I use openconnect to connect to the VPN.
The cause appears to be systemd-resolved, because disabling it solves the issue.

Version-Release number of selected component (if applicable):
F33

How reproducible:
100%

Steps to Reproduce:
1. Use F33 with systemd-resolved enabled
2. Use openconnect with company VPN.
3. Open mycompany-protected-site.com

Actual results:
The site treats me as if the VPN weren't running.

Expected results:
VPN working.

Additional info:
Let me know what you need to debug the issue.

Comment 1 Gergely Gombos 2020-11-10 20:21:32 UTC
Some more debugging info. 

When systemd-resolved is ENABLED and I'm connected to the VPN: (some sites not working)

$ resolvectl domain
Global:
Link 2 (eno1): ~. home
Link 3 (virbr0):
Link 4 (virbr0-nic):
Link 5 (br-d0d9e69bbf7c):
Link 6 (br-04098e67146f):
Link 7 (br-50016b2406a1):
Link 8 (br-814207b17af2):
Link 9 (docker0):
Link 10 (vpn0): somecompanydomain.net

$ resolvectl dns
Global:
Link 2 (eno1): 213.46.246.53 213.46.246.54
Link 3 (virbr0):
Link 4 (virbr0-nic):
Link 5 (br-d0d9e69bbf7c):
Link 6 (br-04098e67146f):
Link 7 (br-50016b2406a1):
Link 8 (br-814207b17af2):
Link 9 (docker0):
Link 10 (vpn0): 10.228.145.105 10.228.226.10 # company DNS?


When systemd-resolved is DISABLED and I'm connected to the (working) VPN:

$ cat /etc/resolv.conf
# Generated by NetworkManager
search mycompanydns.net home
nameserver 10.228.145.105 # company DNS?
nameserver 10.228.226.10 # company DNS?
nameserver 213.46.246.53
# NOTE: the libc resolver may not support more than 3 nameservers.
# The nameservers listed below may not be recognized.
nameserver 213.46.246.54

Also, the hostnames that don't work resolve to different IP addresses depending on whether systemd-resolved is enabled or disabled.

Comment 2 Gergely Gombos 2020-11-10 20:50:27 UTC
If "Use this connection only for resources on its network" is unchecked, these websites still wouldn't work. Output:

$ resolvectl dns

Global: 10.228.145.105 10.228.226.10 213.46.246.53 213.46.246.54
Link 2 (eno1): 213.46.246.53 213.46.246.54
Link 3 (virbr0):
Link 4 (virbr0-nic):
Link 5 (br-d0d9e69bbf7c):
Link 6 (br-04098e67146f):
Link 7 (br-50016b2406a1):
Link 8 (br-814207b17af2):
Link 9 (docker0):
Link 10 (vpn0): 10.228.145.105 10.228.226.10

$ resolvectl domain

Global: somecompanydomain.net home
Link 2 (eno1): ~. home
Link 3 (virbr0):
Link 4 (virbr0-nic):
Link 5 (br-d0d9e69bbf7c):
Link 6 (br-04098e67146f):
Link 7 (br-50016b2406a1):
Link 8 (br-814207b17af2):
Link 9 (docker0):
Link 10 (vpn0): somecompanydomain.net

Comment 3 Michael Catanzaro 2020-11-10 22:08:05 UTC
(Reassigning to NetworkManager, not because I suspect a NetworkManager bug yet, but because we know any systemd-resolved configuration issue will be at the NetworkManager level.)

10.228.145.105 10.228.226.10 is surely your company DNS, yes.

The debug in comment #1 looks good to me. At least, you should be able to successfully load resources for somecompanydomain.net, right? Are you having trouble accessing other pages *not* on somecompanydomain.net? That would indicate a server bug (the VPN server failed to tell NetworkManager about the additional domains), and there's not much you can do about it other than to manually add DNS domains for every missing internal domain that you need to access (you can do this using nm-connection-editor).

(In reply to Gergely Gombos from comment #2)
> If "Use this connection only for resources on its network" is unchecked,
> these websites still wouldn't work. Output:

OK, the debug in comment #2 looks weird. With that checkbox unchecked, you should see a ~. DNS domain on vpn0, not on eno1. That's pretty strange. At the risk of this being obvious: are you certain you disconnected and reconnected the VPN? Maybe to be completely paranoid, you could 'systemctl restart NetworkManager'? If you still see the ~. DNS domain on eno1, then NetworkManager is doing something *really* weird.

Comment 4 Gergely Gombos 2020-11-10 23:55:12 UTC
Thanks for the quick reply!

> At least, you should be able to successfully load resources for somecompanydomain.net, right?

I can load some pages that should only be accessible through VPN, but not all.
I can't load some others. 
Examples: 
build.companyprotected.com loads successfully
nexus.companyprotected.com times out
confluence.anothercompanyprotected.net loads a "forbidden" page as if I wasn't connected to VPN.

Note that none of the above domains (companyprotected.com, anothercompanyprotected.net) show up in resolvectl domain output.
(Only somecompanydomain.net does which I never used. :))

> Are you having trouble accessing other pages *not* on somecompanydomain.net?

All non-VPN pages load correctly, and as above, many protected pages are not even on this domain.

> If you still see the ~. DNS domain on eno1, then NetworkManager is doing something *really* weird.

I restarted, reconnected etc. and confirmed the above output, so ~. shows up besides eno1.
(And the output is definitely different than having that "Use this connection only for resources on its network" checkbox checked, since we have those global items set up.)

---

So it looks like the VPN's DNS servers are not being used to resolve those hostnames when systemd-resolved is enabled. The names probably resolve through the non-VPN DNS servers to a different address (likely some public IP outside the VPN's 10.xxx.xxx.xxx network), which I'm unable to access through the public interface.

The question ultimately is, why is this resolution scheme different than without using systemd-resolved?

I guess with resolv.conf, those DNS servers were just prepended to the beginning, so always used first for *any* hostname?
(Is that also the default behavior of the Cisco AnyConnect VPN clients on Mac and Windows?)

Let me know how I can help with dig, traceroute or whatever.

Comment 5 Michael Catanzaro 2020-11-11 17:13:38 UTC
(In reply to Gergely Gombos from comment #4)
> Thanks for the quick reply!
> 
> > At least, you should be able to successfully load resources for somecompanydomain.net, right?
> 
> I can load some pages that should only be accessible through VPN, but not
> all.
> I can't load some others. 
> Examples: 
> build.companyprotected.com loads successfully
> nexus.companyprotected.com times out
> confluence.anothercompanyprotected.net loads a "forbidden" page as if I
> wasn't connected to VPN.
> 
> Note that none of the above domains (companyprotected.com,
> anothercompanyprotected.net) show up in resolvectl domain output.
> (Only somecompanydomain.net does which I never used. :))

Right, so it's expected that all of that is going to fail, because you're using public DNS for everything except somecompanydomain.net. build.companyprotected.com is apparently available via public DNS, and the others are not.

The VPN server is responsible for sending a list of domains that should be resolved by the VPN. Since your VPN server is not doing so, you have to do that manually. You can use nm-connection-editor for that. (Ideally, these settings would exist in gnome-control-center as well, but they don't currently.)

So I don't think there is any bug here, just more local configuration to be done.
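
For anyone who prefers the command line, the same setting can probably be applied with nmcli instead of nm-connection-editor. The sketch below is a dry run by default and only prints the command. The connection name CompanyVPN and the helper names run and add_vpn_dns_domains are made up for illustration, and it assumes that the editor's additional search domains field corresponds to the ipv4.dns-search property.

```shell
# Hypothetical helper: print the command (default) or execute it (DRY_RUN=0).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$@"; else "$@"; fi
}

# Append DNS domains to a NetworkManager connection profile.
# $1 = connection name (an assumption here), remaining args = domains.
add_vpn_dns_domains() {
    conn=$1
    shift
    # Join the remaining arguments with commas, the list format nmcli accepts.
    domains=$(IFS=,; printf '%s' "$*")
    run nmcli connection modify "$conn" +ipv4.dns-search "$domains"
}

add_vpn_dns_domains CompanyVPN companyprotected.com anothercompanyprotected.net
```

With DRY_RUN=0 the command is actually executed. The VPN would then need to be disconnected and reconnected before resolvectl domain shows the new domains on vpn0.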

> > If you still see the ~. DNS domain on eno1, then NetworkManager is doing something *really* weird.
> 
> I restarted, reconnected etc. and confirmed the above output, so ~. shows up
> besides eno1.
> (And the output is definitely different than having that "Use this
> connection only for resources on its network" checkbox checked, since we
> have those global items set up.)

But this is a totally different situation. My guess is that this is a strange NetworkManager bug. This is really wild.

> I guess with resolv.conf, those DNS servers were just prepended to the
> beginning, so always used first for *any* hostname?

Right. Previously, all your DNS went to the first three servers listed in resolv.conf, in order. Now, if you select "use this connection only for resources on its network," it gets split.
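
To make the split concrete, here is a rough model of how a name is matched against the per-link domains that resolvectl domain prints. It is a simplification, not systemd-resolved's actual logic: the function name pick_link is made up, it picks the single link with the longest matching domain suffix, and it ignores ties, the DefaultRoute setting and global servers.

```shell
# Simplified model: choose the link whose configured domain is the longest
# suffix of the queried host name. stdin holds lines of "LINK DOMAIN...".
pick_link() {
    awk -v host="$1" '
        BEGIN { best = -1 }
        {
            for (i = 2; i <= NF; i++) {
                d = $i
                sub(/^~/, "", d)              # "~" marks a routing-only domain
                if (d == ".") {
                    len = 0                   # the root domain matches every name
                } else if (host == d || (length(host) > length(d) &&
                           substr(host, length(host) - length(d)) == ("." d))) {
                    len = length(d)
                } else {
                    continue
                }
                if (len > best) { best = len; link = $1 }
            }
        }
        END { if (best >= 0) print link }
    '
}

# Domain assignments taken from comment #1.
printf '%s\n' 'eno1 ~. home' 'vpn0 somecompanydomain.net' |
    pick_link nexus.companyprotected.com        # prints eno1
printf '%s\n' 'eno1 ~. home' 'vpn0 somecompanydomain.net' |
    pick_link host.somecompanydomain.net        # prints vpn0
```

Under this model only names ending in somecompanydomain.net go to the VPN's DNS servers, which matches the behavior described above for companyprotected.com and anothercompanyprotected.net.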

Comment 6 Gergely Gombos 2020-11-12 07:25:39 UTC
> So I don't think there is any bug here, just more local configuration to be done.

I'll try to do that, adding these domains. What I'm saying here is that
- prepending resolv.conf with DNS servers was a default behavior for this and probably many other VPNs
- probably this is the default on Windows and Mac Cisco AnyConnect, too
- this behavior broke without notice in F33 due to systemd-resolved and can only be fixed by entering company domains manually. There is an unknown number of those, which is why the company DNS servers were the default in the list.

But likely this is how it's supposed to be: it's not a bug but a feature, it was wrong before, and I'm supposed to uncheck "Use this connection only for resources on its network".

By the way, this bug is very similar to: https://github.com/systemd/systemd/issues/6076
It's a 3-year-old bug report, closed by Lennart Poettering himself a week ago. But it still exists. :)

Apparently people just used manual scripts to place ~. on the VPN interface to keep the same behavior as before.

So 2 questions remain:
- Should this incompatibility from F32 to F33 be documented somewhere/somehow?
- How can I debug this additional buggy behavior of not making the VPN DNS servers default in NetworkManager when I uncheck "Use this connection only for resources on its network"?

Comment 7 Michael Catanzaro 2020-11-12 15:59:52 UTC
(In reply to Gergely Gombos from comment #6)
> > So I don't think there is any bug here, just more local configuration to be done.
> 
> I'll try to do that, adding these domains. What I'm saying here is that
> - prepending resolv.conf with DNS servers was a default behavior for this
> and probably many other VPNs

Well, that's wrong behavior. Any third-party VPN software that does this is simply broken. Anything that uses glibc/nsswitch to perform name resolution -- which is most things -- no longer looks at /etc/resolv.conf at all, so trying to edit this file is not going to work.

But this isn't your problem, because you are using NetworkManager, and NetworkManager knows to push configuration directly to systemd-resolved and let it manage /etc/resolv.conf.

> - probably this is the default on Windows and Mac Cisco AnyConnect, too
> - this behavior broke without notice in F33 due to systemd-resolved and can
> only be fixed by entering company domains manually, of which there are an
> unknown amount, that's why the company DNS servers were the default in the
> list.
> 
> But likely this is how it supposed to be, it's not a bug but a feature, it
> was wrong before and I'm supposed to uncheck "Use this connection only for
> resources on its network".

Right, if your company accepts traffic for public resources outside its network, then unchecking that box ought to work. (Except, in your case, it doesn't!)

> By the way, this bug is very similar to:
> https://github.com/systemd/systemd/issues/6076
> It's a 3-year-old bug report, closed by Lennart Poettering himself a week
> ago. But it still exists. :)

Thing is, Lennart is right. I've left a comment in that issue now to hopefully reduce confusion. Configuring the ~. domain is the responsibility of NetworkManager or third-party VPN software. It's not systemd-resolved's job. systemd-resolved just sends your DNS where it is told to do so by other software.

Anyway, that's not relevant to your issue here. In your case, we already know that systemd-resolved is doing what it's told, but is being misconfigured by NetworkManager.

> Apparently people just used manual scripts to place ~. on the VPN interface
> to keep the same behavior as before.

This *should* only be required if you have disabled NetworkManager -- if you disable NetworkManager, then you're on your own! -- or if you're using third-party VPN software that doesn't know how to talk to systemd-resolved. (But your case is different. In your case, NetworkManager is just doing the wrong thing, and I don't know why.)
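
Such a manual script might look roughly like the sketch below. It is a dry run by default and only prints the resolvectl commands. The helper names run and route_all_dns_via are made up, and vpn0 and somecompanydomain.net are simply the values from this report. Changes made through resolvectl are runtime-only, and I would expect NetworkManager to overwrite them on links it manages the next time it reconfigures them.

```shell
# Hypothetical helper: print the command (default) or execute it (DRY_RUN=0).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$@"; else "$@"; fi
}

# Put the ~. routing domain on one link so that it receives all DNS queries.
# $1 = link name, remaining args = domains to keep on that link.
route_all_dns_via() {
    link=$1
    shift
    # "resolvectl domain LINK ..." replaces the link's whole domain list,
    # so any existing domains have to be passed again.
    run resolvectl domain "$link" '~.' "$@"
    run resolvectl default-route "$link" true
}

route_all_dns_via vpn0 somecompanydomain.net
```

Running it for real (DRY_RUN=0) usually needs root or polkit authorization. If another link such as eno1 still carries ~. as well, it would have to be cleared there in the same way.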

> So 2 questions remain:
> - Should this incompatibility from F32 to F33 be documented
> somewhere/somehow?

Hm, it's documented extensively at https://fedoramagazine.org/systemd-resolved-introduction-to-split-dns/. Of course, more documentation in other places might be good too, but hopefully that blog post should be high in web search results....

> - How can I debug this additional buggy behavior of not making the VPN DNS
> servers default in NetworkManager when I uncheck "Use this connection only
> for resources on its network"?

I don't know. Let's wait for the NetworkManager developers to respond to this bug report. (They receive a lot of bug reports, and will hopefully get to this one soon.) That really seems very likely to be a NetworkManager bug. At least, I can't think of any reason why it might be happening....

Comment 8 Michael Catanzaro 2020-11-12 16:16:19 UTC
(In reply to Michael Catanzaro from comment #7)
> Right, if your company accepts traffic for public resources outside its
> network, then unchecking that box ought to work. (Except, in your case, it
> doesn't!)

I mean: "Except, in your case, it doesn't work!"

I did not mean: your company doesn't accept traffic for public domains. (It might, or it might not. Some employers want you to send all your traffic through them so they can see all your traffic, while others forbid this to reduce load on their VPN servers.)

Comment 9 Manni Heumann 2020-11-17 10:52:10 UTC
Just wanted to say "Hi!" and that I have/had a very similar problem with my corporate VPN since upgrading to 33. For now, I disabled systemd-resolved, as this was the only feasible workaround. The list of domain names that need to be looked up using the corporate nameservers is unknown and potentially endless. Of course, one could argue that the corporate VPN setup is broken, but the fact remains that everything works without a hitch while NOT using resolved.

I'm happy to help debug this if more information is needed.

Comment 10 Ron Flory 2021-03-09 22:59:32 UTC
Hi-

 This same bug/feature is killing my FC33 setup as well, unless systemd-resolved is disabled. It was a surprise, since FC32 was working fine until the upgrade to FC33.

 Work VPN uses PulseSecure client.  

 please do not assume that NetworkManager is (or should even be) in the picture.  

 I'd be happy to help debug/test.

Comment 11 Michael Catanzaro 2021-03-09 23:37:25 UTC
If your VPN software is not a NetworkManager plugin, then it is itself responsible for configuring systemd-resolved. Report bugs to your VPN software, not here.

(In reply to Ron Flory from comment #10)
> please do not assume that NetworkManager is (or should even be) in the
> picture.

If you're not using NetworkManager-openconnect, then your problem is not this bug. This bug is a NetworkManager bug. From comment #2:

(In reply to Gergely Gombos from comment #2)
> $ resolvectl domain
> 
> Global: somecompanydomain.net home
> Link 2 (eno1): ~. home
> Link 3 (virbr0):
> Link 4 (virbr0-nic):
> Link 5 (br-d0d9e69bbf7c):
> Link 6 (br-04098e67146f):
> Link 7 (br-50016b2406a1):
> Link 8 (br-814207b17af2):
> Link 9 (docker0):
> Link 10 (vpn0): somecompanydomain.net

This output is wrong because "Use this connection only for resources on its network" has been unchecked, so the ~. routing domain belongs on vpn0 rather than eno1. NetworkManager has messed this up for unknown reasons.
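
A quick way to check this on any system is to filter the resolvectl domain output for links that carry ~. in their domain list. The function name links_with_root_domain is made up for this sketch, and the parsing assumes the "Link N (name): domains" line format shown in this report.

```shell
# Print the name of every link whose domain list contains the ~. routing domain.
links_with_root_domain() {
    awk '
        /^Link [0-9]+ \(/ {
            name = $3                    # e.g. "(eno1):"
            gsub(/[():]/, "", name)      # strip the parentheses and the colon
            for (i = 4; i <= NF; i++)
                if ($i == "~.") { print name; break }
        }
    '
}

# Shortened sample taken from comment #2.
links_with_root_domain <<'EOF'
Global: somecompanydomain.net home
Link 2 (eno1): ~. home
Link 9 (docker0):
Link 10 (vpn0): somecompanydomain.net
EOF
# prints eno1
```

On a live system the input would come from resolvectl domain piped into links_with_root_domain. With the checkbox unchecked, the expected answer is vpn0.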

Comment 12 Michael Catanzaro 2021-03-09 23:38:46 UTC
I'll add this is really extremely strange. I've triaged a *lot* of VPN-related bug reports in the past several months, but none look like this. My guess is something weird defined on your NetworkManager connection, but I'll wait for NetworkManager developers to respond since they'll know what debug is needed....

Comment 13 Michael Catanzaro 2021-03-09 23:43:30 UTC
Actually, you know what? Since NetworkManager-1.26.6-1.fc33, NetworkManager should no longer assign ~. to anything other than a VPN interface. Instead, it now relies on the DefaultRoute setting of systemd-resolved. So... I don't know whether I would expect your problems to be *fixed*, per se, but they should definitely be *different* than before. Please check and see what's happening now when "use this connection only for resources on its network" is unchecked. (I don't see any evidence that anything is wrong when "use this connection only for resources on its network" is checked. Again, you can use nm-connection-editor to configure your VPN to resolve domains other than somecompanydomain.net.)
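
To tell whether an installed NetworkManager is at least the version mentioned above, a small comparison like the following may help. The function name version_at_least is made up, the comparison relies on GNU sort -V, and the package release suffix (such as -1.fc33) is simply dropped before comparing.

```shell
# Succeed if version $1 is greater than or equal to version $2.
version_at_least() {
    have=${1%%-*}    # drop the release suffix, e.g. "-1.fc33"
    want=$2
    # With version sort, the smaller of the two versions comes first.
    [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

if version_at_least "1.26.6-1.fc33" 1.26.6; then
    echo "NetworkManager should rely on DefaultRoute here"
fi
```

On a live system the first argument would be the output of NetworkManager --version, and resolvectl default-route should list the DefaultRoute setting for each link.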

Comment 14 Gergely Gombos 2021-03-17 08:21:22 UTC
Hi! Yes, you are right, NM's behavior has indeed changed.
Now, VPN (company) domains work as expected without specifying them in the "additional lookup domains" field in nm-connection-editor.
"use this connection only for resources on its network" is unchecked.

~ $ NetworkManager --version
1.26.6-1.fc33
~ $ resolvectl domain
Global: somecompanydomain.net
Link 2 (eno1):
Link 3 (virbr0):
Link 4 (virbr0-nic):
Link 5 (docker0):
Link 6 (br-3708a908043f):
Link 7 (br-50016b2406a1):
Link 8 (br-814207b17af2):
Link 9 (br-922772d136fe):
Link 10 (vpn0): somecompanydomain.net

(Compare this with comment #2: there are no "home" or ~. assignments anymore.)

In this case, Wireshark suggests that all DNS traffic is now routed through the VPN.

Comment 15 Michael Catanzaro 2021-03-17 14:24:37 UTC
Right, if "use this connection only for resources on its network" is unchecked then everything is expected to be routed through the VPN. All right, we never figured out what was causing the original issue, but clearly your troubles are resolved.

Well, mostly. I see you somehow have configured somecompanydomain.net on the global fake interface. I assume you did that manually somehow? It's probably not what you want as mixing global and link-specific configuration is rarely done intentionally. Regardless, it's not the original issue here, so closing.

