Bug 1728701 - dnsmasq: stops responding to TCP queries with --bind-dynamic when interface index number changes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: dnsmasq
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Petr Menšík
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On: 1721668
Blocks: 1728698
Reported: 2019-07-10 13:40 UTC by Petr Menšík
Modified: 2019-08-21 20:12 UTC

Fixed In Version: dnsmasq-2.80-7.fc30 dnsmasq-2.79-9.fc29
Clone Of: 1721668
Environment:
Last Closed: 2019-08-03 01:17:07 UTC
Type: ---
Embargoed:


Links
System: Github
ID: InfrastructureServices/dnsmasq-tests/blob/master/bz1728701.sh
Private: 0
Priority: None
Status: None
Summary: None
Last Updated: 2020-09-24 12:20:04 UTC

Description Petr Menšík 2019-07-10 13:40:51 UTC
+++ This bug was initially created as a clone of Bug #1721668 +++

Description of problem:
In OpenShift environments the dnsmasq daemon is used with --bind-dynamic, because the "shift" process is already bound to 127.0.0.1:53 and therefore dnsmasq cannot bind to the 0.0.0.0 wildcard address.

Because interfaces in OpenShift environments come and go (openswitch restarts, etc.), a network interface sometimes changes its index number while dnsmasq is still running and listening on its IP address.

Because the interface index number has changed, the TCP query does not pass the test for allowed interfaces, the TCP connection is shut down and no response is sent. This persists until dnsmasq is restarted.

UDP queries are not affected.

This is a highly critical issue that affects many OpenShift installations that still use --bind-dynamic. The customer expects a z-stream fix.

When the issue occurs, this is the code flow:

File: dnsmasq.c
1617: 		  for (iface = daemon->interfaces; iface; iface = iface->next)
1618: 		    if (iface->index == if_index)
1619: 		      break;

The "break;" condition above is never met, so the loop iterates over all of daemon->interfaces until the end is reached. "iface" is then a null pointer.

Then the following condition is met and client_ok = 0:

File: dnsmasq.c
1621: 		  if (!iface && !loopback_exception(listener->tcpfd, tcp_addr.sa.sa_family, &addr, intr_name))
1622: 		    client_ok = 0;
1623: 		}

Subsequently, the connection is closed with shutdown():

File: dnsmasq.c.orig
1644: 	  if (!client_ok)
1645: 	    {
1646: 	      shutdown(confd, SHUT_RDWR);
1647: 	      while (retry_send(close(confd)));
1648: 	    }
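
The flow above can be modeled in isolation. This is a minimal sketch and not dnsmasq's actual code: `struct iface_rec`, `find_by_index` and `tcp_client_ok` are hypothetical names, and the call to loopback_exception() is reduced to a boolean parameter.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for dnsmasq's interface record. */
struct iface_rec {
  int index;                 /* kernel interface index stored at scan time */
  struct iface_rec *next;
};

/* Walk the list the same way as the loop at line 1617.
   Returns NULL when no stored record carries if_index. */
static struct iface_rec *find_by_index(struct iface_rec *head, int if_index)
{
  struct iface_rec *iface;

  for (iface = head; iface; iface = iface->next)
    if (iface->index == if_index)
      break;

  return iface;
}

/* Returns 1 if the TCP client would be accepted, 0 if the connection
   would be shut down. loopback_ok is an assumed stand-in for the
   result of loopback_exception(). */
static int tcp_client_ok(struct iface_rec *head, int if_index, int loopback_ok)
{
  int client_ok = 1;

  if (!find_by_index(head, if_index) && !loopback_ok)
    client_ok = 0;

  return client_ok;
}
```

With a stored record for index 5 and a query arriving on index 7, tcp_client_ok() returns 0 unless the loopback exception applies, which corresponds to the connection being shut down without a response.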


Version-Release number of selected component (if applicable):
dnsmasq-2.76-7.el7

How reproducible:
Always

Steps to Reproduce:
1. Use a virtual machine, add a new interface and assign an IP address so that there are at least two interfaces.

2. Run dnsmasq with --bind-dynamic.

3. Send a TCP DNS query from another host:
    # nslookup -vc <hostname> <IP address on the newly added interface>

4. Remove the newly added interface from the virtual machine.

5. Add the interface again, and again assign the same IP address. The interface index number is changed.

6. Run the TCP DNS query from another host again:
    # nslookup -vc <hostname> <IP address on the newly added interface>


Actual results:
    # nslookup -vc <hostname> <IP address>
    ;; communications error to <IP address>#53: end of file

Expected results:
   Response to the TCP DNS request.

Additional info:

--- Additional comment from Petr Menšík on 2019-07-01 21:52:12 CEST ---

Cannot reproduce yet; it seems the indexes are updated in my testing configuration.

dnsmasq_daemon->interfaces is updated and the index matches after I create and drop the interface by hand. Old interfaces found earlier are still kept but are not marked with found=1. Tested with a dummy0 interface.

listener->iface is NOT updated and shows the wrong index when the client is accepted, but that gets changed after the client_ok test on line 1622.
But at least for now, it seems to work. The client is always accepted and a response is sent back. The result might be wrong because I am testing the interface on the local host, without a remote interface.

Testing the interface going up and down using a script:
DEV=dummy0
IP=10.129.0.26/32

--- Additional comment from Petr Menšík on 2019-07-03 13:57:09 CEST ---

Interesting. I checked this issue with an ipip tunnel, where I could reproduce the correct behaviour. However, it seems this issue is not entirely in user space in dnsmasq. I was unable to reproduce this issue on my Fedora workstation, but I was able to reproduce it on a RHEL 7 virtual machine. It seems to depend on the kernel used and the netlink messages emitted.

On Fedora:
dnsmasq: Found interface ipip0(#35) 
dnsmasq: Found interface ipip0(#36) 
dnsmasq: Found interface ipip0(#37) 
dnsmasq: Found interface ipip0(#38) 
dnsmasq: Found interface ipip0(#39) 
dnsmasq: Found interface ipip0(#39) 
dnsmasq: Found interface ipip0(#40) 

$ uname -r
5.1.11-200.fc29.x86_64

On RHEL7:
dnsmasq: Found interface ipip1(#5) 
dnsmasq: Found interface ipip1(#5) 
dnsmasq: Skipping interface ipip1(#5) 
dnsmasq: Skipping interface eth0(#3) 
dnsmasq: Skipping interface eth0(#3) 
dnsmasq: Skipping interface ipip1(#5) 
dnsmasq: Skipping interface eth0(#3) 
dnsmasq: Skipping interface eth0(#3) 
dnsmasq: TCP query from interface ipip1(#7) denied

$ uname -r
3.10.0-693.21.1.el7.x86_64

--- Additional comment from Petr Menšík on 2019-07-03 19:50 CEST ---

The original change compares the address of the allowed interface. If an interface with the same address exists, no new interface is created. However, the old interface is also not fully updated, which leaves a stale stored index that belongs to the destroyed interface.

There is a potential problem with this fix, however. It might break bind-interfaces on systems with frequently changing interfaces. New interfaces would be added but old ones never released, so memory usage could grow.

It seems the kernel is not responsible. It just behaves in a slightly different way, and the result is different.

There is something wrong with the way new interfaces are handled in dnsmasq. With bind-interfaces, interfaces are scanned on arrival of a (TCP) query; with bind-dynamic this is handled automatically. The last interface check uses a speedup based on known addresses: if a new interface has an address that is already stored for any interface, dnsmasq assumes it is the same interface again. However, it does not update other parameters such as the interface index, name, netmask and many other things. This breaks TCP connections, which check the interface number.
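
The effect of that address-based shortcut can be sketched as follows. This is an illustration under assumptions, not the dnsmasq source: `struct iface_rec`, `reuse_candidate` and the `match_index` switch are hypothetical, and only IPv4 addresses are modeled.

```c
#include <netinet/in.h>
#include <stddef.h>

/* Hypothetical, simplified interface record (IPv4 only). */
struct iface_rec {
  struct in_addr addr;       /* address the record was created for */
  int index;                 /* interface index stored at creation time */
  int found;                 /* set when the record is seen in a scan */
  struct iface_rec *next;
};

/* Returns the stored record that a scan would reuse for (addr, index),
   or NULL if a new record would have to be created.
   match_index == 0: match on address only, as in the behaviour described
                     above; the stored index is left untouched.
   match_index != 0: also require the stored index to match, so a
                     recreated interface is not mistaken for the old one. */
static struct iface_rec *reuse_candidate(struct iface_rec *head,
                                         struct in_addr addr,
                                         int index, int match_index)
{
  struct iface_rec *iface;

  for (iface = head; iface; iface = iface->next)
    if (iface->addr.s_addr == addr.s_addr &&
        (!match_index || iface->index == index))
      {
        iface->found = 1;
        return iface;
      }

  return NULL;
}
```

With address-only matching, a recreated interface that keeps its address but gets index 7 is mapped onto the old record that still says index 5. Requiring the index to match as well makes the scan report that a new record is needed.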

I think the number of those entries might grow. But they already grow for IPv6 addresses in this case, because the IPv6 sockaddr structure contains the interface number (scope_id). That always changes, so a new entry for IPv6 is created anyway. If that has not been an issue until now, it should be OK.
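
The scope_id effect can be shown with the standard socket structures. This is a sketch: the helper names `make_sa6`, `same_addr6` and `same_sockaddr6` are made up for illustration, and it only assumes that the comparison covers sin6_scope_id, as the comment describes.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/* Fill an IPv6 socket address from text and a scope (interface index).
   Returns 0 on success, -1 if the text is not a valid IPv6 address. */
static int make_sa6(struct sockaddr_in6 *sa, const char *text,
                    unsigned int scope_id)
{
  memset(sa, 0, sizeof(*sa));
  sa->sin6_family = AF_INET6;
  sa->sin6_scope_id = scope_id;
  return inet_pton(AF_INET6, text, &sa->sin6_addr) == 1 ? 0 : -1;
}

/* Compare the 128-bit address only. */
static int same_addr6(const struct sockaddr_in6 *a,
                      const struct sockaddr_in6 *b)
{
  return memcmp(&a->sin6_addr, &b->sin6_addr, sizeof(a->sin6_addr)) == 0;
}

/* Compare address and scope_id together. */
static int same_sockaddr6(const struct sockaddr_in6 *a,
                          const struct sockaddr_in6 *b)
{
  return same_addr6(a, b) && a->sin6_scope_id == b->sin6_scope_id;
}
```

Two link-local addresses with identical bytes but scope_id 5 and 7 compare equal by address and unequal as socket addresses, so a lookup keyed on the whole structure never matches the old entry after the interface is recreated.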

--- Additional comment from Petr Menšík on 2019-07-04 18:27:59 CEST ---

I found the difference in behaviour between Fedora and RHEL 7. It comes from a difference in IPv6 support between those platforms. Because the newer version automatically assigns an IPv6 address to every device used, dnsmasq will have a record with an up-to-date interface index for the new IPv6 address. Because the scope_id of the address is always different, even when the IP address is still the same, a new interface record is created instead of reusing the existing one.

It is more coincidence than intention that TCP queries are allowed again on Fedora. The lookup finds the IPv6 interface record even for an incoming query on an IPv4 address, which allows the connection again after the interface is recreated! This fix should be applied to Fedora as well.

--- Additional comment from Petr Menšík on 2019-07-09 14:41 CEST ---

Helps with debugging of listening on dynamic interfaces. Logs when listening starts, along with the address used. Also logs when listening stops.

--- Additional comment from Petr Menšík on 2019-07-09 14:44 CEST ---

Add a new interface when the interface index changes. Also accept TCP queries only when an interface with the correct address has been scanned and accepted.

--- Additional comment from Petr Menšík on 2019-07-09 14:47 CEST ---

Free entries of old interfaces that are no longer found on the running system. Do not accept connections on addresses that are no longer used, and release memory for old interfaces.
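
The cleanup described here follows a mark-and-sweep pattern over the interface list. The sketch below is an assumption about the general shape of such a pass, not the posted patch: `struct iface_rec`, `push_rec` and `sweep_unfound` are hypothetical, and resetting the `found` flag on survivors is a design choice of this example.

```c
#include <stdlib.h>

/* Hypothetical, simplified interface record. */
struct iface_rec {
  int index;                 /* interface index stored for the record */
  int found;                 /* 1 if the last scan saw this interface */
  struct iface_rec *next;
};

/* Prepend a heap-allocated record; returns the new head, or the old
   head unchanged if allocation fails. */
static struct iface_rec *push_rec(struct iface_rec *head, int index, int found)
{
  struct iface_rec *rec = malloc(sizeof(*rec));

  if (!rec)
    return head;

  rec->index = index;
  rec->found = found;
  rec->next = head;
  return rec;
}

/* Unlink and free every record the last scan did not mark as found.
   Survivors get their found flag cleared so the next scan can mark
   them again. Returns the number of records freed. */
static int sweep_unfound(struct iface_rec **head)
{
  struct iface_rec **up = head, *iface;
  int freed = 0;

  while ((iface = *up) != NULL)
    {
      if (!iface->found)
        {
          *up = iface->next;   /* unlink before freeing */
          free(iface);
          freed++;
        }
      else
        {
          iface->found = 0;
          up = &iface->next;
        }
    }

  return freed;
}
```

Running the sweep after each scan keeps the list bounded by the number of interfaces that currently exist, which addresses the memory growth concern raised earlier in this report.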

--- Additional comment from Petr Menšík on 2019-07-09 14:49:13 CEST ---

Upstream patches [1] posted to the dnsmasq mailing list. Waiting for a reply.

1. http://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2019q3/013114.html

Comment 2 Fedora Update System 2019-07-31 18:46:29 UTC
FEDORA-2019-b0b2b9b380 has been submitted as an update to Fedora 30. https://bodhi.fedoraproject.org/updates/FEDORA-2019-b0b2b9b380

Comment 3 Fedora Update System 2019-07-31 19:36:57 UTC
FEDORA-2019-8ad16085e2 has been submitted as an update to Fedora 29. https://bodhi.fedoraproject.org/updates/FEDORA-2019-8ad16085e2

Comment 4 Fedora Update System 2019-08-01 03:28:49 UTC
dnsmasq-2.80-7.fc30 has been pushed to the Fedora 30 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-b0b2b9b380

Comment 5 Fedora Update System 2019-08-01 05:33:50 UTC
dnsmasq-2.79-9.fc29 has been pushed to the Fedora 29 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-8ad16085e2

Comment 6 Fedora Update System 2019-08-03 01:17:07 UTC
dnsmasq-2.80-7.fc30 has been pushed to the Fedora 30 stable repository. If problems still persist, please make note of it in this bug report.

Comment 7 Fedora Update System 2019-08-15 18:51:37 UTC
dnsmasq-2.79-9.fc29 has been pushed to the Fedora 29 stable repository. If problems still persist, please make note of it in this bug report.

Comment 8 Petr Menšík 2019-08-21 20:12:29 UTC
A reproducer for this bug was created at https://github.com/InfrastructureServices/dnsmasq-tests/blob/master/bz1728701.sh

