Bug 1328143 - Long ironic timeouts because of unresolvable reverse DNS entry (ServFail)
Summary: Long ironic timeouts because of unresolvable reverse DNS entry (ServFail)
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic
Version: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: async
Target Release: ---
Assignee: Lucas Alvares Gomes
QA Contact: Raviv Bar-Tal
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-04-18 14:55 UTC by Filip Hubík
Modified: 2016-08-17 09:38 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-17 09:38:32 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Launchpad 1572201 0 None None None 2016-04-19 15:23:31 UTC

Description Filip Hubík 2016-04-18 14:55:37 UTC
Description of problem:
When Ironic cannot resolve the reverse DNS entry for the IP assigned to the br-ctlplane interface (the resolver does not even receive an NXDOMAIN response), all ironic commands take a very long time to execute (the lookups time out, but the commands still succeed).

[undercloud]: $ time ironic node-list
+--------------------------------------+------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+------+---------------+-------------+--------------------+-------------+
...
real    0m55.383s
user    0m0.248s
sys     0m0.043s

Version-Release number of selected component (if applicable):
Tested with OSPd8

How reproducible (example with IP 10.100.100.1):
OSP director 8 deployment:
[undercloud]: $ ip a
...
7: br-ctlplane: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether <macaddr> brd ff:ff:ff:ff:ff:ff
    inet 10.100.100.1/24 brd 10.100.100.255 scope global br-ctlplane
       valid_lft forever preferred_lft forever
...

Configure your DNS server to not respond (even with NXDOMAIN) for 10.100.100.1:

[undercloud]: $ time host 10.100.100.1
;; connection timed out; no servers could be reached
real    0m14.005s
user    0m0.003s
sys     0m0.003s

[undercloud]: $ time dig -x 10.100.100.1
...
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 20304
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
...
;; connection timed out; no servers could be reached
real    0m21.007s
user    0m0.003s
sys     0m0.004s

[undercloud]: $ time nslookup 10.100.100.1                                
;; connection timed out; trying next origin
;; connection timed out; trying next origin
;; Got SERVFAIL reply from XYZ, trying next server
;; connection timed out; trying next origin
;; connection timed out; trying next origin
;; connection timed out; no servers could be reached
real    0m50.008s
user    0m0.002s
sys     0m0.009s
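
The same delay can be measured from Python, the language Ironic is written in. The following is a minimal sketch, not code from Ironic: the helper name timed_reverse_lookup and its resolver parameter are hypothetical, and it only assumes that the slow step is an ordinary reverse lookup such as socket.gethostbyaddr.

```python
import socket
import time


def timed_reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return (hostname_or_None, elapsed_seconds) for one reverse DNS lookup.

    `resolver` is a hypothetical injection point so the helper can be
    exercised without a real DNS server; by default it is the stdlib call.
    """
    start = time.monotonic()
    try:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
        hostname = resolver(ip)[0]
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError
        hostname = None
    elapsed = time.monotonic() - start
    return hostname, elapsed
```

Calling timed_reverse_lookup("10.100.100.1") on an affected undercloud should block for roughly as long as the host command above, since both go through the system resolver configuration.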

Actual results:
Each ironic command can take 20-60 seconds in this case.

Expected results:
Ironic should have a mechanism to deal with this; commands should take milliseconds rather than tens of seconds:
[undercloud]: $ time ironic node-list
+--------------------------------------+------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+------+---------------+-------------+--------------------+-------------+
...
real    0m0.393s
user    0m0.244s
sys     0m0.041s

Comment 2 Dmitry Tantsur 2016-04-18 15:23:26 UTC
There are 2 components in the fix for this issue:
1. DNS resolving must not hang on any hosts;
2. Ironic should not hang if DNS resolving hangs.

The latter seems to be fixed in OSPd9; the former should be fixed in the environment.
I'm not sure if we can backport the Ironic fix, as it's pretty risky.
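
For illustration only, the second point (not hanging when DNS resolution hangs) can be sketched as a lookup with a deadline. This is not the Ironic fix referred to above, whose details are not given in this report; the function name reverse_lookup_with_deadline, the timeout value, and the resolver parameter are all hypothetical.

```python
import queue
import socket
import threading


def reverse_lookup_with_deadline(ip, timeout=1.0, resolver=socket.gethostbyaddr):
    """Resolve `ip` to a hostname, giving up after `timeout` seconds.

    Returns the hostname on success, or the IP string unchanged when the
    lookup fails or does not finish in time.
    """
    result = queue.Queue(maxsize=1)

    def worker():
        try:
            result.put(resolver(ip)[0])
        except OSError:
            # Lookup failed outright (e.g. socket.herror): fall back to the IP
            result.put(ip)

    # Daemon thread, so a stuck resolver cannot keep the process alive
    threading.Thread(target=worker, daemon=True).start()
    try:
        return result.get(timeout=timeout)
    except queue.Empty:
        # The resolver is still blocked; stop waiting and use the raw IP
        return ip
```

The caller is unblocked after at most timeout seconds, but the worker thread keeps waiting on the resolver in the background, so this limits latency rather than load. Fixing the DNS environment (point 1) is still needed.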

Note that a potential workaround is to set host_ip=0.0.0.0 in ironic.conf. I'm not sure why it works, though.

Comment 3 Dmitry Tantsur 2016-08-17 09:38:32 UTC
Hello,

The backport was rejected upstream, and it seems like we do not have problems in the later versions, so unfortunately we can't fix this bug.

