Description of problem:

I am finding /etc/resolv.conf sometimes refers to the actual DNS servers listed by the local DHCP server instead of 127.0.0.1. I need it to remain 127.0.0.1 because I have a vpn tunnel running and need to do split-DNS tricks with the local dnsmasq. That works perfectly - when it works - but of course it doesn't if /etc/resolv.conf doesn't refer to 127.0.0.1.

Version-Release number of selected component (if applicable):

How reproducible:
Sometimes - 50% of the time when there's a network change?

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
I can almost guarantee it's due to a NetworkManager/dispatcher.d/ script I wrote (which is about when this problem began). So this ticket is really about why a script in there (which doesn't have anything to do with DNS - it disables wifi if it sees ethernet is working) could erratically cause my resolv.conf problem. I'm probably about to make the problem worse by changing that script to compensate for this issue - whereas I'd rather you guys tell me if NetworkManager really should remain responsible for such behaviour :-)
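For context, a minimal sketch of the kind of dispatcher hook described above (the script path, the interface name "eno1", and the toggle_wifi helper are all assumptions, not the reporter's actual script):

```shell
#!/bin/sh
# Hypothetical /etc/NetworkManager/dispatcher.d/70-wifi-toggle (sketch only):
# disable Wi-Fi when the wired interface comes up, re-enable it when the
# wire goes down. "eno1" is an assumed interface name.
toggle_wifi() {
    iface="$1"
    action="$2"
    if [ "$iface" = "eno1" ]; then
        case "$action" in
            up)   nmcli radio wifi off ;;
            down) nmcli radio wifi on ;;
        esac
    fi
}

# NetworkManager invokes dispatcher scripts as: <script> <interface> <action>
toggle_wifi "${1:-}" "${2:-}"
```

Note that a hook like this only calls nmcli and never touches resolv.conf directly, which is what makes the reported side effect surprising.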
(In reply to Jason Haar from comment #0)
> Description of problem:
>
> I am finding /etc/resolv.conf sometimes refers to the actual DNS servers
> listed by the local DHCP server instead of 127.0.0.1

Hi, can you please paste the output of:

NetworkManager --print-config

and the content of /etc/resolv.conf when the problem happens?

And also set:

[logging]
domains=DEFAULT,DNS:TRACE

in /etc/NetworkManager/NetworkManager.conf, do a "systemctl restart NetworkManager" and attach the output of "journalctl -u NetworkManager -b" when resolv.conf is incorrect?

Thanks!
I can do that - but I think I'm onto the root cause.

I mentioned running a vpn tunnel, but what I didn't mention is that it's not a NetworkManager-related vpn tunnel. We have a complex, highly-redundant "always on" vpn solution based around openvpn, and it runs under systemd as a service - independent of NetworkManager.

I have just discovered that if I shut down the openvpn service, NetworkManager notices the interface change, re-evaluates DNS, and creates a new /var/run/NetworkManager/resolv.conf - with 127.0.0.1 as the "nameserver". But once I enable the openvpn service, it seems to get confused and I end up with nameserver entries referring to the real local DNS server (ie the DHCP entries from the local Ethernet/WiFi interfaces).

I would rather NetworkManager totally stayed out of the business of the openvpn service, so right now I just added "unmanaged-devices=interface-name:tun0" in the vain hope that would stop NetworkManager noticing activity involving that interface - but that didn't work (yes, I did restart NetworkManager). This openvpn service involves all sorts of routing changes - and DNS fiddling: as it adds geoip-related VPN-only DNS domain names to /etc/NetworkManager/dnsmasq.d/filename, it needs to HUP dnsmasq so it learns of the change.

So I'm thinking that NetworkManager is getting confused by this activity occurring outside its control, and that triggers it to get things wrong with the resolv.conf settings? (BTW I have checked and my openvpn script doesn't do anything but read resolv.conf.)

I just commented out the dnsmasq HUP'ing and that seemed to fix the issue - but that isn't a fix for me, because I want to re-evaluate DNS settings whenever I'm plugged into a different network, so I need to tell dnsmasq when that occurs. If I could notify NetworkManager to re-evaluate DNS, that would be great. Restarting NetworkManager would also probably work, but that's an infinite loop in the making there...
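As an aside, the split-DNS arrangement described above is typically expressed as per-domain server= lines in a dnsmasq fragment; a hypothetical example (the domain names and server address are made up, not taken from this report):

```
# Hypothetical /etc/NetworkManager/dnsmasq.d/vpn-domains.conf
# Forward queries for VPN-only domains to the work DNS server reached
# over the tunnel; all other queries follow the normal upstream servers.
server=/corp.example.com/10.77.30.10
server=/internal.example.com/10.77.30.10
```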
(In reply to Jason Haar from comment #2)
> I can do that - but I think I'm onto the root cause
>
> I mentioned running a vpn tunnel, but what I didn't mention is that it's not
> a NetworkManager-related vpn tunnel. We have a complex highly-redundant
> "always on" vpn solution based around openvpn and it runs under systemd as a
> service - independent to NetworkManager.
>
> I have just discovered that if I shut down the openvpn service,
> NetworkManager notices the interface change, re-evaluates DNS, and creates a
> new /var/run/NetworkManager/resolv.conf - with 127.0.0.1 as the
> "nameserver". But once I enable the openvpn service, it seems to get
> confused and I end up with nameserver entries referring to the real local
> DNS server (ie the DHCP entries from the local Ethernet/WiFi interfaces)

Are you sure it's not the external openvpn instance that is overwriting resolv.conf? Do you see the comment "# Generated by NetworkManager" in the file when the content is wrong?

> I would rather NetworkManager totally stayed out of the business of the
> openvpn service, so right now I just added
> "unmanaged-devices=interface-name:tun0" in the vain hope that would stop
> NetworkManager noticing activity involving that interface - but that didn't
> work (yes I did restart NetworkManager). This openvpn service involves all
> sorts of routing changes - and DNS fiddling (as it adds geoip-related
> VPN-only DNS domain names to /etc/NetworkManager/dnsmasq.d/filename, it
> needed to HUP dnsmasq to learn of the change)

It seems very strange to me that NM is writing something different from 127.0.0.1 in resolv.conf if it is configured to use dnsmasq. Logs would be helpful to understand what's happening.

> So I'm thinking that NetworkManager is getting confused by this activity
> occurring outside it's control and that triggers it to get things wrong with
> the resolv.conf settings? (BTW I have checked and my openvpn script doesn't
> do anything but read resolv.conf).
> I just commented out the dnsmasq HUP'ing
> and that seemed to fix the issue - but that isn't a fix for me because I am
> wanting to re-evaluate DNS settings whenever plugged into a different
> network, so need to tell dnsmasq when that occurs.

Where did you comment out the signaling of dnsmasq?

> If I could notify
> NetworkManager to re-evaluate DNS, that would be great. Restarting
> NetworkManager would also probably work, but that's an infinite loop in the
> making there...

You can send SIGHUP to NM to force it to rewrite resolv.conf (it will also restart dnsmasq).
Well, this is getting complicated - now I'm picking up different behaviour. Maybe there's been a NM upgrade (I update daily), but now the problem is much worse.

Each time I change networks, NM does its fiddle with dnsmasq, but I end up unable to do DNS lookups using 127.0.0.1. What appears to be happening is NM tells dnsmasq to reload its config, and also tells it what interface the forwarding DNS servers are on ("adding nameserver" log lines). The problem I have is that my home network is on the 10 subnet - and so is my vpn/work network. So dnsmasq insists on pushing out DNS queries to 10.X DNS server IPs via my Ethernet card, whereas the routing table clearly shows the 10/8 network is via my vpn interface - and so dnsmasq does not work.

If I do a full "kill" on dnsmasq, it restarts without these "bindings" and starts working properly - it's NM that makes it act up. I attach the logfile.

To answer your other questions:

1. openvpn calls scripts that I wrote, and no, they don't fiddle with resolv.conf

2. I kill dnsmasq from one of my scripts called when openvpn brings up the vpn interface. Unfortunately, there's a race between my script trying to make dnsmasq work by restarting it and NetworkManager noticing that the vpn interface state changed, leading to unpredictable outcomes. I'd rather hard-wire NetworkManager to ignore all interface and routing changes to do with the openvpn interface - but have had no luck with that

3. SIGHUP'ing NM makes dnsmasq hard-wire itself to the Ethernet card - breaking dnsmasq entirely, so I still need to kill dnsmasq to make it stop sending packets out the wrong interface

Final comment: what is that binding-dnsmasq-to-interfaces thing about? Why aren't you just relying on the routing table? I assume there's some reason behind that - but here it is breaking a more classic approach.
Created attachment 1207695 [details] NetworkManager logs
(In reply to Jason Haar from comment #4)
> Well this is getting complicated, now I'm picking up different behaviour.
> Maybe there's been a NM upgrade (I update daily), but now the problem is
> much worse.
>
> Each time I change networks, NM does it's fiddle with dnsmasq, but I end up
> unable to do DNS lookups using 127.0.0.1. What appears to be happening is NM
> tells dnsmasq to reload it's config, and also tells it what interface the
> forwarding DNS servers are on ("adding nameserver" log lines). The problem I
> have is that my home network is on the 10 subnet - and so is my vpn/work
> network. So dnsmasq insists on pushing out DNS queries to 10.X DNS server
> IPs via my Ethernet card, whereas the routing table clearly shows the 10/8
> network is via my vpn interface - and so dnsmasq does not work.

How did you receive those nameservers - from the VPN or from the local DHCP? If the log contains:

dnsmasq[0x7fa120016890]: adding nameserver '10.77.30.10@eno1'

I would think that the nameserver was received from the ethernet connection, not from the VPN. If this is true, why would you want to send queries to the VPN?

> If I do a full "kill" on dnsmasq, it restarts without these "bindings" and
> starts working properly - it's NM that makes it act up.

What do you mean by "without these bindings"? NM should restart dnsmasq and pass the same parameters as before (in the log I see only a single invocation of dnsmasq from NM, and that has bindings to @eno1).

> I attach the logfile
>
> To answer your other questions:
>
> 1. openvpn calls scripts that I wrote, and no, they don't fiddle with
> resolv.conf

I understand resolv.conf is correctly set to 127.0.0.1 now, right?

> 2. I kill dnsmasq from one of my scripts called when openvpn brings up the
> vpn interface. Unfortunately, there's a race between my script trying to
> make dnsmasq work by restarting it and NetworkManager noticing that the vpn
> interface state changed, leading to unpredictable outcomes.
> I'd rather
> hard-wire NetworkManager to ignore all interface and routing changes to do
> with the openvpn interface - but have had no luck with that

Could you provide logs for that too? Preferably with:

[logging]
domains=TRACE

> 2. SIGHUP'ing NM makes dnsmasq hard-wire itself to the Ethernet card -
> breaking dnsmasq entirely, so I still need to kill dnsmasq to make it stop
> sending packets out the wrong interface
>
> Final comment. What is that binding dnsmasq to interfaces thing about? Why
> aren't you just relying on the routing table? I assume there's some reason
> behind that - but here it is breaking a more classic approach

The reason why we do this is to ensure that a DNS server learned through an interface isn't accidentally contacted over a different interface. In many cases routing doesn't help make the right choice (e.g. if there are multiple interfaces with a default route). See https://bugzilla.gnome.org/show_bug.cgi?id=765153 for more details.
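For reference, the per-interface binding described above appears in dnsmasq's configuration language as an @interface suffix on server= entries; an illustrative fragment matching the "adding nameserver" log lines quoted earlier in this thread:

```
# Illustrative only: dnsmasq accepts an @interface suffix on server=
# entries, which forces queries to that upstream server out through the
# named interface regardless of what the routing table says.
server=10.77.30.10@eno1
```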
(In reply to Beniamino Galvani from comment #6)
> Could you provide logs for that too? Preferably with:
>
> [logging]
> domains=TRACE

should be:

[logging]
level=TRACE
domains=ALL
The explanation is a bit complicated, but you asked...

We have guest networks at work that are configured to use our existing internal 10.X DNS servers. So DHCP gives you a 172.16 address with 10.X DNS servers.

Then you run openvpn on the guest network to connect to "work", and openvpn is configured to use the same DNS servers, but to also add a route for 10/8 back over the VPN. This isn't run under NetworkManager and cannot be, because it's too complicated: the openvpn solution uses GeoIP lookups to recreate its config file every time a network change or server outage occurs, and runs a range of "up/down" scripts - stuff that NetworkManager-openvpn cannot handle.

So the end result appears to be that NetworkManager hardwires dnsmasq to expect to send and receive DNS packets to our 10.X DNS servers over eno1, but I suspect the kernel (or iptables - that's in play too) is expecting those packets to come back over tun0 - because that's what the routing table says? I'm guessing at this point. All I know is that nslookup doesn't work, even though I can see the DNS packets going out and coming back through eno1. But if I kill dnsmasq and it's auto-restarted by NM, then nslookup starts working again.

When NM formally talks to dnsmasq during a network change, I can see those "adding nameserver '10.77.30.10@eno1'" lines, and after I kill dnsmasq, I also see them - but they don't appear to work. After the "kill", tcpdump on eno1 shows ZERO DNS activity - it all flows over the tun0 interface (as expected by the routing table). So it's like NM does something differently after a "crash" than it does normally - even though the logs are consistent. BTW, just to be explicit, there's no reference in the logs about dnsmasq and tun0 - which seems correct, as NM doesn't know openvpn is in play.

Frankly, dnsmasq could use either eno1 or tun0 for DNS in this case - both should work from my perspective, but unfortunately that isn't the case.
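One way to compare what the routing table would actually do for those DNS server IPs against the interface dnsmasq is bound to is to parse `ip route get` output; a sketch (the route_dev helper name is made up, and the commented-out address is only an example):

```shell
#!/bin/sh
# Hypothetical helper: report which interface the kernel routing table
# would pick for a given destination, so it can be compared against the
# "@eno1" bindings seen in the dnsmasq log lines.
route_dev() {
    # extract the "dev <iface>" token from `ip route get` output
    ip route get "$1" 2>/dev/null | sed -n 's/.* dev \([^ ]*\).*/\1/p' | head -n1
}

# Example usage (result depends on the local routing table):
# route_dev 10.77.30.10
```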
When I am at home (192.168 with 192.168 DNS), this weird DNS-not-responding problem doesn't occur (10/8 is exclusively over tun0 - no conflict), but I still get the "nameserver 192.168.." problem.

Agh - I've probably mixed unrelated issues in one ticket here: having the wrong nameserver entry in resolv.conf, and having dnsmasq fail to work - the latter only occurs on this "special case" work guest network.

Anyway, I'll go back to getting the debug logs :-)
(In reply to Jason Haar from comment #8)
> The explanation is a bit complicated, but you asked...
>
> We have guest networks at work that are configured to use our existing
> internal 10.X DNS servers. So DHCP gives you a 172.16 address with 10.X DNS
> servers
>
> Then you run openvpn on the guest network to connect to "work", and openvpn
> is configured to use the same DNS servers, but to also add a route for 10/8
> back over the VPN. This isn't run under NetworkManager and cannot be because
> it's too complicated: the openvpn solution uses GeoIP lookups to recreate
> it's config file every time a network change or server outage occurs, and
> runs ranges of "up/down" scripts - stuff that NetworkManager-openvpn cannot
> handle
>
> So end result appears to be that NetworkManager hardwires dnsmasq to expect
> to send and receive DNS packets to our 10.X DNS servers over eno1, but I
> suspect the kernel (or iptables - that's in play too) is expecting those
> packets to come back over tun0 - because that's what the routing table says?

That is possible; but since you could reach 10.X through eno1 before activating the VPN, I think you should have a route to the server on eno1 (a default or explicit route) and that route should still be present (but unused, because the VPN probably has a lower metric).

You could enable logging of martian packets:

echo 1 > /proc/sys/net/ipv4/conf/eno1/log_martians

and see if dmesg output shows anything. If you see martian packets, probably disabling the reverse-path filter would help:

echo 0 > /proc/sys/net/ipv4/conf/eno1/rp_filter

> I'm guessing at this point. All I know is that nslookup doesn't work, even
> though I can see the DNS packets going out and coming back through eno1. But
> if I kill dnsmasq and it's auto-restarted by NM, then nslookup starts
> working again.
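If the rp_filter change does help, the persistent equivalent of the echo commands above would be a sysctl.d fragment (the filename here is hypothetical):

```
# Hypothetical /etc/sysctl.d/90-vpn-dns.conf
net.ipv4.conf.eno1.log_martians = 1
net.ipv4.conf.eno1.rp_filter = 0
```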
When debugging such problems I often find the logging of queries in dnsmasq useful:

echo log-queries > /etc/NetworkManager/dnsmasq.d/log-queries
killall -HUP NetworkManager

Then you should see in the journal where dnsmasq is sending each query.

> When NM formally talks to dnsmasq during a network change, I can see those
> "adding nameserver '10.77.30.10@eno1" lines, and after I kill dnsmasq, I
> also see them - but they don't appear to work. After the "kill", tcpdump on
> eno1 shows ZERO DNS activity - it all flows over the tun0 interface (as
> expected by the routing table).

This sounds like a bug in dnsmasq; probably the query logging will help to investigate the issue. Note that we recently discovered a similar bug in dnsmasq (see bug 1367772), which is fixed upstream but not yet in Fedora.

> So it's like NM does something differently
> after a "crash" than it does normally - even though the logs are consistent.
> BTW, just to be explicit, there's no reference in the logs about dnsmasq and
> tun0 - which seems correct as NM doesn't know openvpn is in play
>
> Frankly, dnsmasq could use either eno1 or tun0 for DNS in this case - both
> should work from my perspective, but unfortunately that isn't the case. When
> I am at home (192.168 with 192.168 DNS), this weird DNS-not-responding
> problem doesn't occur (10/8 is exclusively over tun0 - no conflict), but I
> still get the "nameserver 192.168.." problem. Agh - I've mixed probably
> unrelated issues in one ticket here. Having the wrong nameserver entry in
> resolv.conf and having dnsmasq fail to work - the latter only occurs on
> this "special case" work guest network.

Yep, this seems to be a different problem, but I'm pretty sure it's not NM's fault. NM writes 127.0.0.1 when using dnsmasq; however, depending on configuration or build options, it can write resolv.conf directly or rely on other tools like resolvconf or netconfig. I guess in your case one of these tools is adding extra servers to the file.
Do you see the comment "# Generated by NetworkManager" at the top of the file when the content is wrong?

> Anyway, I'll go back to getting the debug logs :-)

Great, thanks!
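A quick way to answer that question from a script is to check for both the NM marker comment and the local nameserver line; a sketch (the check_resolv helper name is made up):

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the given resolv.conf (default:
# /etc/resolv.conf) carries the NetworkManager marker comment AND points
# at the local dnsmasq on 127.0.0.1.
check_resolv() {
    file="${1:-/etc/resolv.conf}"
    grep -q '^# Generated by NetworkManager' "$file" &&
    grep -q '^nameserver 127\.0\.0\.1' "$file"
}
```

Running this when resolution breaks would distinguish "NM wrote the wrong content" from "something else overwrote the file entirely".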
This message is a reminder that Fedora 24 is nearing its end of life. Approximately 2 (two) weeks from now Fedora will stop maintaining and issuing updates for Fedora 24. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '24'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 24 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version prior to this bug being closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 24 changed to end-of-life (EOL) status on 2017-08-08. Fedora 24 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.