Description of problem:
When Ironic cannot get a reverse DNS entry for the IP assigned to the br-ctlplane interface (it does not even receive an NXDOMAIN error), all ironic commands take a very long time to execute (the lookup times out, but the commands still succeed).

[undercloud]: $ time ironic node-list
+--------------------------------------+------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+------+---------------+-------------+--------------------+-------------+
...

real 0m55.383s
user 0m0.248s
sys 0m0.043s

Version-Release number of selected component (if applicable):
Tested with OSPd8

How reproducible (example with IP 10.100.100.1):
OSP director 8 deployment:

[undercloud]: $ ip a
...
7: br-ctlplane: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether <macaddr> brd ff:ff:ff:ff:ff:ff
    inet 10.100.100.1/24 brd 10.100.100.255 scope global br-ctlplane
       valid_lft forever preferred_lft forever
...

Configure your DNS server to not respond (not even with NXDOMAIN) for 10.100.100.1:

[undercloud]: $ time host 10.100.100.1
;; connection timed out; no servers could be reached

real 0m14.005s
user 0m0.003s
sys 0m0.003s

[undercloud]: $ time dig -x 10.100.100.1
...
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 20304
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
...
;; connection timed out; no servers could be reached

real 0m21.007s
user 0m0.003s
sys 0m0.004s

[undercloud]: $ time nslookup 10.100.100.1
;; connection timed out; trying next origin
;; connection timed out; trying next origin
;; Got SERVFAIL reply from XYZ, trying next server
;; connection timed out; trying next origin
;; connection timed out; trying next origin
;; connection timed out; no servers could be reached

real 0m50.008s
user 0m0.002s
sys 0m0.009

Actual results:
Each Ironic command can take 20-60 seconds in this case.

Expected results:
Ironic should have a mechanism to deal with this; commands should take milliseconds rather than tens of seconds:

[undercloud]: $ time ironic node-list
+--------------------------------------+------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+------+---------------+-------------+--------------------+-------------+
...

real 0m0.393s
user 0m0.244s
sys 0m0.041s
There are two components to the fix for this issue:
1. DNS resolution must not hang for any host;
2. Ironic should not hang if DNS resolution hangs.
The latter seems to be fixed in OSPd9; the former should be fixed in the environment. I'm not sure we can backport the Ironic fix, as it's pretty risky. Note that a potential workaround is to set host_ip=0.0.0.0 in ironic.conf. I'm not sure why it works, though.
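
For illustration only, here is a minimal Python sketch of what "should not hang if DNS resolution hangs" can look like in general. This is not the Ironic change referred to above. The function name reverse_lookup, the one-second default timeout, and the injectable resolver parameter are all assumptions made for this example. The idea is to run the blocking reverse lookup in a worker thread and fall back to the plain IP address if no answer arrives in time.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool (size chosen arbitrarily for this sketch). A worker
# stuck in a slow resolver call stays busy until the OS resolver gives up,
# but the caller is released after `timeout` seconds.
_POOL = ThreadPoolExecutor(max_workers=4)


def reverse_lookup(ip, timeout=1.0, resolver=socket.gethostbyaddr):
    """Return the hostname for `ip`, or `ip` itself on failure or timeout.

    `resolver` is a hypothetical hook so the function can be exercised
    without a real DNS server; it must behave like socket.gethostbyaddr.
    """
    future = _POOL.submit(resolver, ip)
    try:
        hostname, _aliases, _addresses = future.result(timeout=timeout)
        return hostname
    except FutureTimeout:
        # The resolver did not answer in time; do not block the caller.
        return ip
    except OSError:
        # socket.herror and socket.gaierror are OSError subclasses
        # (for example, an NXDOMAIN-style "unknown host" answer).
        return ip


if __name__ == "__main__":
    # Example address taken from this report.
    print(reverse_lookup("10.100.100.1"))
```

Note that the timeout only bounds how long the caller waits. The underlying lookup keeps running in its worker thread, so this complements, rather than replaces, fixing DNS in the environment.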
Hello,
The backport was rejected upstream, and the problem does not seem to occur in later versions, so unfortunately we can't fix this bug.