Bug 1720703

Summary: Latest bind (9.9) update now causes zone transfer to always fail.
Product: Red Hat Enterprise Linux 7
Component: bind
Version: 7.6
Status: CLOSED ERRATA
Severity: high
Priority: urgent
Keywords: Regression, ZStream
Reporter: John Horne <john.horne>
Assignee: Petr Menšík <pemensik>
QA Contact: Petr Sklenar <psklenar>
CC: benji.wakely, btoonk, cadel2010, cbesson, cobrown, dcortijo, donna.myers, dwoodruf, eric.eisenhart, eskil, fkrska, mooney, nick.hall, psklenar, rcadova, rhbugs, rik.theys, robert.esser, roshan.hendahewa, rwahyudi, scott.frazer, stein, surkum, tcrider, thozza, uwe.knop, villapla, yozone, ysoni
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Linux
Clones: 1724071
Bug Blocks: 1724071
Last Closed: 2019-08-06 12:39:42 UTC
Type: Bug

Attachments:
- Log snippet - failed zone transfer
- Log snippet - success after rollback

Description John Horne 2019-06-14 15:14:01 UTC
Description of problem:
After a recent yum update of 'bind' from version 9.9.4-73 to version 9.9.4-74, a local zone transfer (using AXFR) from one local server to a second local server now always fails (times out).

The zone file is 2.0MB in size, and contains 58277 lines. (There are no include directives.)

This particular zone transfer has worked fine for the past few years.


Version-Release number of selected component (if applicable):
bind-9.9.4-74.el7_6.1.x86_64
(Uname output): Linux cent-8-508 3.10.0-957.21.2.el7.x86_64 #1 SMP Tue May 28 09:26:43 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
Occurs all the time here now and has been reported elsewhere.


Steps to Reproduce:
1. Install bind on master server and create zone file.
2. Create secondary bind server and configure for zone. 
3. Transfer the zone from master to secondary.
4. The zone is partially transferred, then it stalls, resumes briefly, and stalls again. It eventually fails with a connection timeout error.
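
(For anyone trying to reproduce this, a minimal configuration sketch for steps 1-2; the zone name, file paths and addresses below are placeholders, not our actual setup:)

=======
# On the master (named.conf)
zone "example.com" IN {
    type master;
    file "/var/named/example.com.zone";
    allow-transfer { 192.0.2.2; };   # secondary's address
};

# On the secondary (named.conf)
zone "example.com" IN {
    type slave;
    file "slaves/example.com.zone";
    masters { 192.0.2.1; };          # master's address
};
=======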

Actual results:
Transfer eventually fails with:

;; connection timed out; no servers could be reached

Expected results:
Zone should transfer in its entirety with no problems.


Additional info:
After a 'yum downgrade' of bind to version 9.9.4-73, the transfer works again.
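
For reference, a sketch of the rollback; the exact package set and service name depend on the install (e.g. named-chroot for chrooted setups):

=======
# Roll the bind packages back one release, then restart named
yum downgrade 'bind*'
systemctl restart named
=======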

The difference between versions 9.9.4-74 and 9.9.4-73, as far as I can tell, is the fix for CVE-2018-5743. This relates to TCP clients.
In our config we limit the number of TCP connections to 250 (on both servers), but both servers are lightly used and never reach anywhere near that number of connections.
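
That limit is set with the standard tcp-clients option; a sketch, with the value 250 from our config:

=======
options {
    tcp-clients 250;    // cap on simultaneous TCP client connections
    // other options unchanged
};
=======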

For some reason, increasing the debug log level to 9 makes the transfer work.
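
A sketch of the kind of logging change involved; the channel and file names here are placeholders, not our actual config:

=======
logging {
    channel xfer_debug {
        file "data/xfer-debug.log";
        severity debug 9;    // raising this to debug 9 makes the transfer work
        print-time yes;
    };
    category xfer-out { xfer_debug; };
    category xfer-in  { xfer_debug; };
};
=======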

Comment 2 David Cortijo 2019-06-14 18:44:26 UTC
I experienced an identical issue on our production nameservers, which I had to downgrade to 9.9.4-73 prior to finding this bugzilla, so I could not test raising the debug log level.

Comment 3 Petr Menšík 2019-06-17 10:12:01 UTC
Thanks for the report, this seems to be important. Are these issues caught in the logs?

Did you check rndc status for connection count at the time of transfer? Is anything about that failed transfer recorded in the log?

Comment 4 John Horne 2019-06-17 10:24:42 UTC
'rndc status' during the transfers shows 'tcp clients: 6/250' on both the client and the sending server.
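
(That counter comes from the standard 'rndc status' output; a sketch of the check, output abbreviated:)

=======
$ rndc status | grep -i 'tcp clients'
tcp clients: 6/250
=======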

The client server logs show nothing about the transfer.

The sending server logs show:

======
17-Jun-2019 11:17:41.716 xfer-out: info: client 141.163.177.1#35903 (.): transfer of './IN': AXFR started
17-Jun-2019 11:17:53.834 xfer-out: info: client 141.163.177.1#36202 (.): transfer of './IN': AXFR started
17-Jun-2019 11:18:05.959 xfer-out: info: client 141.163.177.1#53706 (.): transfer of './IN': AXFR started
======

After that the connection timed out on the client. Part of the zone seems to get sent each time, but as can be seen above, the connection keeps being restarted.

Comment 5 John Horne 2019-06-17 10:34:39 UTC
As mentioned above, changing the logging debug level to 9 made the transfer work. I just tried level 8 and that too worked. Level 7 did not.

Comment 6 John Horne 2019-06-17 10:40:28 UTC
I changed the debug logging level to 7 (on both servers) and tried the transfer again.

The client logs again showed nothing.

The sending server log showed:

===========
17-Jun-2019 11:31:08.413 xfer-out: debug 6: client 141.163.177.1#41504 (.): AXFR request
17-Jun-2019 11:31:08.413 xfer-out: debug 6: client 141.163.177.1#41504 (.): transfer of './IN': AXFR question section OK
17-Jun-2019 11:31:08.413 xfer-out: debug 6: client 141.163.177.1#41504 (.): transfer of './IN': AXFR authority section OK
17-Jun-2019 11:31:08.414 xfer-out: info: client 141.163.177.1#41504 (.): transfer of './IN': AXFR started
17-Jun-2019 11:31:20.495 xfer-out: debug 6: client 141.163.177.1#47763 (.): AXFR request
17-Jun-2019 11:31:20.495 xfer-out: debug 6: client 141.163.177.1#47763 (.): transfer of './IN': AXFR question section OK
17-Jun-2019 11:31:20.496 xfer-out: debug 6: client 141.163.177.1#47763 (.): transfer of './IN': AXFR authority section OK
17-Jun-2019 11:31:20.496 xfer-out: info: client 141.163.177.1#47763 (.): transfer of './IN': AXFR started
17-Jun-2019 11:31:32.571 xfer-out: debug 6: client 141.163.177.1#55668 (.): AXFR request
17-Jun-2019 11:31:32.571 xfer-out: debug 6: client 141.163.177.1#55668 (.): transfer of './IN': AXFR question section OK
17-Jun-2019 11:31:32.571 xfer-out: debug 6: client 141.163.177.1#55668 (.): transfer of './IN': AXFR authority section OK
17-Jun-2019 11:31:32.571 xfer-out: info: client 141.163.177.1#55668 (.): transfer of './IN': AXFR started
===========

So, not much help really.

Comment 7 Scott Frazer 2019-06-17 14:45:06 UTC
I can add some other data points to this bug, which we started experiencing last week.

We're running bind package:
Name        : bind
Arch        : x86_64
Epoch       : 32
Version     : 9.9.4
Release     : 74.el7_6.1

and bind-chroot package:
Name        : bind-chroot
Arch        : x86_64
Epoch       : 32
Version     : 9.9.4
Release     : 74.el7_6.1

The transfers only fail if they are above a certain byte (not record) threshold. In our case that threshold is 38,300 bytes. I can cause the problem to appear using "dig @{target server} {example.com} AXFR"

The transfers do not fail if they are on the same local network segment (i.e. they only fail when crossing a network boundary: 10.1.1.12 pulling from 10.1.1.11 works, but 10.2.1.12 pulling from 10.1.1.11 does not).

We only experience this issue on servers in our physical data centers (we run Proxmox hypervisors and CentOS VMs there), but not in AWS (where we run the latest CentOS-provided 1901 AMI, e.g. ami-0f2b4fc905b0bd1f1 in the Ohio region).

The failure manifests as a certain number of records being transmitted, then a pause, then the same number of records being transmitted again. In a pcap this is flagged as a retransmission due to a failed ACK.

During this pause, an available TCP connection is used. With the default idle/transfer timeouts, in production we were seeing the remote end retrying the connection often enough to exhaust our connections (new connections were denied with "quota" errors).

Setting our log severity to debug 9 didn't affect the results, and didn't appear to put anything new into the log; perhaps I did that incorrectly.

Downgrading to:
Name        : bind
Arch        : x86_64
Epoch       : 32
Version     : 9.9.4
Release     : 73.el7_6

Also did not fix our problem.

Based on the above data we've been treating it as a network-related issue.

Comment 8 Scott Frazer 2019-06-17 15:15:23 UTC
EDIT. I had an incorrect server in use during testing. Please disregard my previous message and use this comment instead. Sorry for my confusion.

I can add some other data points to this bug, which we started experiencing last week.

We're running bind package:
Name        : bind
Arch        : x86_64
Epoch       : 32
Version     : 9.9.4
Release     : 74.el7_6.1

and bind-chroot package:
Name        : bind-chroot
Arch        : x86_64
Epoch       : 32
Version     : 9.9.4
Release     : 74.el7_6.1

The transfers only fail if they are above a certain byte (not record) threshold. In our case that threshold is 38,300 bytes. I can cause the problem to appear using "dig @{target server} {example.com} AXFR"

The transfers do not fail if they are on the same local network segment (i.e. they only fail when crossing a network boundary: 10.1.1.12 pulling from 10.1.1.11 works, but 10.2.1.12 pulling from 10.1.1.11 does not).

We only experience this issue on servers in our physical data centers (we run Proxmox hypervisors and CentOS VMs there), but not in AWS (where we run the latest CentOS-provided 1901 AMI, e.g. ami-0f2b4fc905b0bd1f1 in the Ohio region).

The failure manifests as a certain number of records being transmitted, then a pause, then the same number of records being transmitted again. In a pcap this is flagged as a retransmission due to a failed ACK.

During this pause, an available TCP connection is used. With the default idle/transfer timeouts, in production we were seeing the remote end retrying the connection often enough to exhaust our connections (new connections were denied with "quota" errors).
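
A sketch of the kind of capture used to see this; the interface name is a placeholder and the addresses match the example above:

=======
# On the master, capture the AXFR TCP session while a slave (or dig) pulls the zone
tcpdump -i eth0 -w axfr-fail.pcap 'tcp port 53 and host 10.2.1.12'
# Then inspect axfr-fail.pcap (e.g. in Wireshark) for retransmissions / failed ACKs
=======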

Setting log severity to debug 9 on xfer-in and xfer-out doesn't always resolve the problem. It becomes intermittent at that point. 

Downgrading to:
Name        : bind
Arch        : x86_64
Epoch       : 32
Version     : 9.9.4
Release     : 73.el7_6

DID fix our problem. (It really helps if you downgrade the server you are actually testing against.)

When the transfer fails with logging at level 9, the following is posted to the logs:

17-Jun-2019 10:05:57.272 client 10.45.48.222#59552 (example.com): view internal: AXFR request
17-Jun-2019 10:05:57.273 client 10.45.48.222#59552 (example.com): view internal: transfer of 'example.com/IN': AXFR question section OK
17-Jun-2019 10:05:57.273 client 10.45.48.222#59552 (example.com): view internal: transfer of 'example.com/IN': AXFR authority section OK
17-Jun-2019 10:05:57.273 client 10.45.48.222#59552 (example.com): view internal: transfer of 'example.com/IN': AXFR started
17-Jun-2019 10:05:57.273 example.com.	300	IN	SOA	ns2198.dns.dyn.com. hostmaster.example.com. 2019061401 3600 900 1209600 300

... records ...

17-Jun-2019 10:05:57.300 client 10.45.48.222#59552 (example.com): view internal: transfer of 'example.com/IN': sending TCP message of 18368 bytes

... records ...

17-Jun-2019 10:05:57.315 9.example.com. 300	IN	CNAME	secure35.example.com.
17-Jun-2019 10:05:57.315 client 10.45.48.222#59552 (example.com): view internal: transfer of 'example.com/IN': sending TCP message of 17775 bytes

// 9.example.com is the last record that the dig machine saw.

... records ...

// The logfile doesn't indicate a completed send at this point and appears to continue to try to send the rest of the records even though the remote dig is no longer listening. 

17-Jun-2019 10:07:17.574 client 10.45.48.222#59812 (example.com): view internal: transfer of 'example.com/IN': sending TCP message of 22052 bytes



I'll add more as I find it.

Comment 9 Scott Frazer 2019-06-17 15:18:34 UTC
Eventually this was added to the log:
17-Jun-2019 10:15:57.273 client 10.45.48.222#59552 (example.com): view internal: transfer of 'example.com/IN': aborted: timed out
17-Jun-2019 10:16:37.406 client 10.45.48.222#50956 (example.com): view internal: transfer of 'example.com/IN': aborted: timed out
17-Jun-2019 10:17:17.528 client 10.45.48.222#59812 (example.com): view internal: transfer of 'example.com/IN': aborted: timed out

Comment 10 David Cortijo 2019-06-17 15:33:18 UTC
In our case, we were working with the exact same version/release of bind-chroot - 9.9.4-74.el7_6.1. A downgrade to release 73.el7_6 was required to fix the issue. There are some inconsistencies with what Scott is seeing, but all signs point to this being a server-side issue with the bind packages.

Specifically, we had upgraded (and had to roll back) the "master" side of the zone transfer - the slave servers were affected as soon as the master was patched and the zone file required an update.

We were seeing failures on the same network segment, but our zones were failing at significantly larger transfer sizes - ranging from 875k to 6.4M - in a single morning.

One of the two downstream slaves happened to have debug 7 logging on, so I have quite a bit of data from that one slave (which was actually running bind-chroot-9.9.4-61.el7.x86_64 when the upstream master server was upgraded).

Our timeline (I'll be uploading some log snippets to show how these play out):
- Master server was upgraded to bind-chroot-9.9.4-74.el7_6.1 on the evening of June 4th
- Slaves were unable to transfer overnight on June 4th-5th. There were no explicit record changes at this time, but our DNSSEC signatures update as needed at 2am daily - failures became visible at roughly 3am.
-- This is consistent with transfers taking 1 hour to fail, as shown in the example client log snippet (dcortijo-failed_transfers.log)
- Master server was rolled back at 2:13:52 pm on June 14th; notifies were issued for the zone, and transfers were immediately successful (as shown in client log dcortijo-after_rollback.log)
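
(The notifies were issued with the standard rndc command; a sketch with a placeholder zone name:)

=======
rndc notify example.com
=======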

At least in our case, the client was taking 1 hour to register a failure (sitting completely idle as the transfer failed) before it gave up and immediately tried again.

It should be noted that upgrading the other slave I was analyzing this issue on to 9.9.4-74.el7_6.1 did not help (or otherwise change) the issue. After downgrading the master, that slave is still running the latest release without any problems.

Comment 11 David Cortijo 2019-06-17 15:34:12 UTC
Created attachment 1581483 [details]
Log snippet - failed zone transfer

Comment 12 David Cortijo 2019-06-17 15:34:36 UTC
Created attachment 1581484 [details]
Log snippet - success after rollback

Comment 13 Petr Menšík 2019-06-18 09:32:51 UTC
I tried multiple transfers of the root zone on my servers running this version and it worked fine. How many cores do you have? Is it possible it happens only on a multi-core system? Even multiple simultaneous transfers, approaching the TCP limit, worked well.

Comment 14 Petr Menšík 2019-06-18 09:37:30 UTC
Ok, maybe I am testing the wrong side of the issue. I was testing bind serving those transfers. But maybe the issue is in bind receiving the transfers, right? dig -t AXFR seems to work fine.

Comment 15 Petr Menšík 2019-06-18 09:45:24 UTC
*** Bug 1720689 has been marked as a duplicate of this bug. ***

Comment 16 John Horne 2019-06-18 10:05:11 UTC
Our primary DNS server is quite small - 2 cores, 2GB memory, 20GB disk. The problem is on the sending server (the primary). The clients are currently running the latest BIND RPMs; it is the primary that we had to downgrade.

Comment 17 Pete Fry 2019-06-18 10:57:59 UTC
Our servers have 2 cores, 8GB RAM and a 24GB disk.

We only see the problem when performing zone transfers from this server to a slave, or when using a client to simulate this.

On one of the slaves we run:

dig @MASTER.IP ZONE axfr

This hangs.

Comment 18 Petr Menšík 2019-06-18 15:10:57 UTC
Ok, I can reproduce the issue. The sending part is broken, or it does not correctly receive acknowledgements.

Comment 19 Tim Mooney 2019-06-19 22:15:41 UTC
We experienced this exact same problem in early June and ended up rolling back our primary nameserver to the previous bind* packages we had been using.

I'm very glad to see that Petr has a reproducer.  I look forward to errata for this.

Comment 20 Tim Mooney 2019-06-19 22:40:20 UTC
Our principal network engineer asked me to add a couple of additional data points, in case it helps the debugging.

During his testing, he discovered that for the zone that always failed to transfer to our secondaries, if he used dig on the master itself, he could reliably transfer the entire zone (i.e. just testing from the master to itself).  He first noticed that it always succeeded over localhost, which initially made us suspect it might be MTU-related, but further testing showed that it also always succeeded when he did the transfer over the master's network interface.
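
A sketch of the two local tests described; the zone name and address are placeholders:

=======
# Run on the master itself - both of these reliably transferred the whole zone
dig @localhost example.com AXFR
dig @192.0.2.1 example.com AXFR    # the master's own network interface address
=======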

My testing from a different RHEL 7 host on the same network segment as the master, with dig from bind-utils-9.9.4-74.el7_6.1, showed that I could get more of the zone than our secondaries were getting, but still not a fully successful AXFR.  This was between beefy physical hosts.  So our experiences matched David's in comment #10, but not Scott's experience in comment #8.

The other datapoint our principal network engineer had was this:

> In addition we saw failures with two zones.  One zone always failed, except with the dig test on the same host.
> The other zone failed intermittently when transferring to devices on our campus but failed 100% of the time to a host on the other side of the state.

> Combined with the reports of the problem appearing above a certain byte size, that might indicate that the zone transfer fails if it does not succeed within a certain amount of time.

Comment 21 Rianto Wahyudi 2019-06-20 03:15:12 UTC
I can also reproduce the issue but with a varying degree of success. 

Our master is a small VM (VMWare ESXi 6.7) with 2 CPU & 2GB of RAM. 
Slaves are mixed of physical and Virtual with a minimum of 2 CPU & 6GB of RAM. 

- Hosts (virtual or physical) on the same subnet as the master do not have the zone transfer issue.
- Physical hosts on the same subnet or a different subnet from the master do not have issues.
- VM hosts on a different network and in a different geographic location are having zone transfer timeout issues.

At first I thought this was a network issue, so I took a tcpdump capture while the problem was happening - if you need it, please let me know.
Downgrading the master fixed the problem.

Comment 22 Robert Esser 2019-06-20 23:19:33 UTC
We've been able to consistently reproduce this issue on a number of hosts supporting larger zones.

From our testing, it appears we can work around the issue if we explicitly set the "keep-response-order" option to "any;" in the global options:

=======
    keep-response-order { any; };
=======

I'm not certain if this will work in all cases, but so far it seems to be working for a variety of different cases and may be a viable solution for others as well.
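
In context, the workaround is just this addition to the global options block (a sketch; everything else in options stays as-is), applied with 'rndc reconfig' or a restart of named:

=======
options {
    keep-response-order { any; };    // workaround for the stalled AXFRs
    // existing options unchanged
};
=======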

Comment 23 John Horne 2019-06-21 09:33:06 UTC
(In reply to Robert Esser from comment #22)
> We've been able to consistently reproduce this issue on a number of hosts
> supporting larger zones.
> 
> From our testing, it appears we can work around the issue if we explicitly
> set the "keep-response-order" option to "any;" in the global options:
> 
> =======
>     keep-response-order { any; };
> =======
> 
> I'm not certain if this will work in all cases, but so far it seems to be
> working for a variety of different cases and may be a viable solution for
> others as well.

It works for me :-)
Thanks.

Comment 24 Petr Menšík 2019-06-21 11:11:01 UTC
Sure, that works reliably with any trace level used. Strange. I noticed it just today as well. It should have been the default value anyway.

Comment 27 Tom Crider 2019-06-21 17:57:20 UTC
*** Bug 1722911 has been marked as a duplicate of this bug. ***

Comment 28 David Cortijo 2019-06-21 19:00:55 UTC
(In reply to Petr Menšík from comment #24)
> Sure, that works reliably with any trace level used. Strange. I noticed it
> just today as well. It should have been the default value anyway.

I grabbed the source RPM (from here: https://access.redhat.com/downloads/content/rhel---7/x86_64/2456/bind/9.9.4-74.el7_6.1/x86_64/fd431d51/package) to look at the differences between the versions, and line 286 of bind-9.11-rt37821.patch jumped out at me:

+#      keep-response-order {any;};\n\


That section is the modifications to bin/named/config.c (so, the default values, if I'm not mistaken), but it looks like this default "any" line is actually commented out, which doesn't really match the expected defaults surrounding it. Could this be causing the issue?

I did some further tests with rpmbuild -bp, and found that line 286 of the patch file inserts the default, but commented out. Removing the comment from the patch and rerunning the rpmbuild inserts the line without a comment. I have not gotten a chance to test the full rpmbuild, but I also don't really have a server I could test a built RPM on at the moment anyway.
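
A sketch of the inspection steps described, assuming a standard rpmbuild setup (paths and the unpacked source tree name may differ):

=======
# Install the source RPM, then run just the prep stage (-bp applies the patches;
# --nodeps skips the build-dependency check since we only need %prep)
rpm -ivh bind-9.9.4-74.el7_6.1.src.rpm
rpmbuild -bp --nodeps ~/rpmbuild/SPECS/bind.spec

# Locate the commented-out default in the patch and in the patched source tree
grep -n 'keep-response-order' ~/rpmbuild/SOURCES/bind-9.11-rt37821.patch
grep -rn 'keep-response-order' ~/rpmbuild/BUILD/*/bin/named/config.c
=======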

Comment 39 Roshan 2019-08-01 02:17:43 UTC
We were affected by this issue this week. We are running BIND 9.9.4-RedHat-9.9.4-74.el7_6.1

Only AXFR transfers failed. 

Adding keep-response-order {any;}; to options fixed the problem.

Comment 40 Tomáš Hozza 2019-08-01 07:40:33 UTC
(In reply to Roshan from comment #39)
> We were affected by this issue this week. We are running BIND
> 9.9.4-RedHat-9.9.4-74.el7_6.1
> 
> Only AXFR transfers failed. 
> 
> Adding keep-response-order {any;}; to options fixed the problem.

The fixed package bind-9.9.4-74.el7_6.2 was released on Monday; please use that instead of the workaround.

Comment 42 errata-xmlrpc 2019-08-06 12:39:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:2057