Description of problem:
OSPd7 bulk introspection times out during the PXE boot phase when the dnsmasq hash function produces a collision in the IP addresses initially offered to nodes (explained in http://permalink.gmane.org/gmane.network.dns.dnsmasq.general/3846). This can happen when the IP pool configured in dnsmasq is small enough and the MAC addresses are similar enough: dnsmasq then initially sends DHCPOFFERs proposing the same IP to nodes with different MAC addresses. One node's DHCPREQUEST for the offered IP is ACKed, but the other node gets no ACK: its DHCPREQUEST is retried and answered with DHCPNAK ("address in use") 4 times, as defined in the DHCP workflow. Unfortunately, PXE firmware usually does not tolerate this delay (it waits roughly 10-15 seconds for a successful handshake), so PXE boot times out and introspection of the affected node fails.

Version-Release number of selected component (if applicable):
RHEL 7.2
puddle 2016-01-22.1
dnsmasq-2.66-14.el7_1.x86_64
dnsmasq-utils-2.66-14.el7_1.x86_64
instack-undercloud-2.1.2-37.el7ost.noarch
python-rdomanager-oscplugin-0.0.10-26.el7ost.noarch
openstack-ironic-discoverd-1.1.0-8.el7ost.noarch
openstack-ironic-common-2015.1.2-2.el7ost.noarch
openstack-ironic-conductor-2015.1.2-2.el7ost.noarch
openstack-ironic-api-2015.1.2-2.el7ost.noarch
openstack-tripleo-puppet-elements-0.0.1-5.el7ost.noarch
openstack-tripleo-image-elements-0.9.6-10.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-112.el7ost.noarch
openstack-tripleo-common-0.0.1.dev6-5.git49b57eb.el7ost.noarch
openstack-tripleo-0.0.7-0.1.1664e566.el7ost.noarch

A) Steps to reproduce in a bare-metal environment:

1. Deploy OSPd7 with an instackenv.json containing, for example, the following MACs:

   {
     "nodes": [
       {
         "pm_type": "pxe_ipmitool",
         "mac": [ "34:17:eb:e6:45:33" ],
         ...
       },
       {
         "pm_type": "pxe_ipmitool",
         "mac": [ "34:17:eb:e6:45:f0" ],
         ...
       },
       ...
     ]
   }

2. Import the nodes and configure boot:

   $ openstack baremetal import --json instackenv.json
   $ openstack baremetal configure boot

3. If it is not configured already, check the dnsmasq configuration. To reproduce the issue, it is important that the pool/range in the "dhcp-range" field has exactly 20 addresses available; the actual IPs do not matter.

   $ cat /etc/ironic-discoverd/dnsmasq.conf
   port=0
   interface=br-ctlplane
   bind-interfaces
   dhcp-range=10.200.200.100,10.200.200.120,29
   enable-tftp
   tftp-root=/tftpboot
   dhcp-match=ipxe,175
   dhcp-boot=tag:!ipxe,undionly.kpxe,localhost.localdomain,10.200.200.1
   dhcp-boot=tag:ipxe,http://10.200.200.1:8088/discoverd.ipxe

4. Start bulk introspection:

   $ openstack baremetal introspection bulk start

5. Discovery sees the following MACs broadcasting DHCPDISCOVERs on the isolated network:

   34:17:eb:e6:45:32
   34:17:eb:e6:45:ef

6. The whole exchange looks like this:

   $ sudo journalctl -f -u openstack-ironic-discoverd-dnsmasq
   Jan 22 23:50:55 DHCPOFFER(br-ctlplane) 10.200.200.115 34:17:eb:e6:45:32
   Jan 22 23:50:55 DHCPDISCOVER(br-ctlplane) 34:17:eb:e6:45:ef
   Jan 22 23:50:55 DHCPOFFER(br-ctlplane) 10.200.200.115 34:17:eb:e6:45:ef
   Jan 22 23:50:56 DHCPDISCOVER(br-ctlplane) 34:17:eb:e6:45:32
   Jan 22 23:50:56 DHCPOFFER(br-ctlplane) 10.200.200.115 34:17:eb:e6:45:32
   Jan 22 23:50:56 DHCPDISCOVER(br-ctlplane) 34:17:eb:e6:45:ef
   Jan 22 23:50:56 DHCPOFFER(br-ctlplane) 10.200.200.115 34:17:eb:e6:45:ef
   Jan 22 23:51:04 DHCPREQUEST(br-ctlplane) 10.200.200.115 34:17:eb:e6:45:32
   Jan 22 23:51:04 DHCPACK(br-ctlplane) 10.200.200.115 34:17:eb:e6:45:32
   Jan 22 23:51:04 DHCPREQUEST(br-ctlplane) 10.200.200.115 34:17:eb:e6:45:ef
   Jan 22 23:51:04 DHCPNAK(br-ctlplane) 10.200.200.115 34:17:eb:e6:45:ef address in use
   Jan 22 23:51:04 DHCPDISCOVER(br-ctlplane) 34:17:eb:e6:45:32
   Jan 22 23:51:04 DHCPOFFER(br-ctlplane) 10.200.200.115 34:17:eb:e6:45:32
   Jan 22 23:51:04 DHCPREQUEST(br-ctlplane) 10.200.200.115 34:17:eb:e6:45:ef
   Jan 22 23:51:04 DHCPNAK(br-ctlplane) 10.200.200.115 34:17:eb:e6:45:ef address in use
   Jan 22 23:51:05 DHCPREQUEST(br-ctlplane) 10.200.200.115 34:17:eb:e6:45:ef
   Jan 22 23:51:05 DHCPNAK(br-ctlplane) 10.200.200.115 34:17:eb:e6:45:ef address in use
   Jan 22 23:51:07 DHCPREQUEST(br-ctlplane) 10.200.200.115 34:17:eb:e6:45:ef
   Jan 22 23:51:07 DHCPNAK(br-ctlplane) 10.200.200.115 34:17:eb:e6:45:ef address in use
   Jan 22 23:51:08 DHCPDISCOVER(br-ctlplane) 34:17:eb:e6:45:32
   Jan 22 23:51:08 DHCPOFFER(br-ctlplane) 10.200.200.115 34:17:eb:e6:45:32
   Jan 22 23:51:16 DHCPREQUEST(br-ctlplane) 10.200.200.115 34:17:eb:e6:45:32
   Jan 22 23:51:16 DHCPACK(br-ctlplane) 10.200.200.115 34:17:eb:e6:45:32

7. Introspection of node 34:17:eb:e6:45:ef fails because PXE boot could not get an IP address in time.

B) Steps to reproduce in an emulated environment (PXE using QEMU):

In my case this happens on bare-metal machines whose MAC addresses cannot be "spoofed" before PXE boot (the BIOS does not support it, and without the right MAC setup the issue cannot be reproduced), so an emulated QEMU PXE boot can be used to reproduce it instead.

1. Deploy OSPd7 manually up to the step "Introspect Nodes" on bare metal, cancel introspection while it is in progress (the "waiting for nodes ..." stage), and from this point on work only on the instack node.

2. Check that the iptables discovery chain does not block the spoofed MACs listed below (iptables -L discovery). If it does, delete the DROP rules (iptables -D discovery XY) and stop the discoverd service (service openstack-ironic-discoverd stop) so that it cannot create the DROP rules again.

3. Add two tap devices as connection points for the emulated PXE consoles, add them as ports to br-ctlplane, and make sure all created interfaces are UP:

   ip tuntap add tap0 mode tap && ip link set tap0 up
   ip tuntap add tap1 mode tap && ip link set tap1 up
   ovs-vsctl add-port br-ctlplane tap0
   ovs-vsctl add-port br-ctlplane tap1

4. Boot PXE virtually with MACs that cause collisions in our IP range:

   qemu-system-x86_64 -boot n -net nic,macaddr=34:17:eb:e6:45:32 -net tap,ifname=tap0,script=no
   qemu-system-x86_64 -boot n -net nic,macaddr=34:17:eb:e6:45:ef -net tap,ifname=tap1,script=no

5. Even in the emulated environment, PXE boot fails for one of the nodes (whichever node gets its DHCPACK first wins; the other one times out).

Actual results:
Introspection of the bare-metal node with MAC 34:17:eb:e6:45:ef fails with a timeout.

$ openstack baremetal introspection bulk status
| Node UUID                            | Finished | Error                 |
+--------------------------------------+----------+-----------------------+
| 3e926bb2-4f30-4eef-9021-269752ae42f4 | True     | Introspection timeout |
| 3217ecf2-b251-41f8-a23a-0d4d1a43b7aa | True     | None                  |
...
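The colliding offers in the dnsmasq log from step 6 can also be found mechanically. The following is a minimal sketch, assuming the journal lines keep the "DHCPOFFER(interface) IP MAC" layout shown in this report; the function name find_offer_collisions and the SAMPLE_LOG excerpt are illustrative and not part of any OpenStack or dnsmasq tooling. Over a long-running log the same IP may legitimately be offered to different MACs after a lease expires, so this is only meaningful for a short window such as a single introspection run.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   Jan 22 23:50:55 DHCPOFFER(br-ctlplane) 10.200.200.115 34:17:eb:e6:45:32
OFFER_RE = re.compile(
    r"DHCPOFFER\([^)]*\)\s+"
    r"(?P<ip>\d+(?:\.\d+){3})\s+"
    r"(?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2})",
    re.IGNORECASE,
)


def find_offer_collisions(log_text):
    """Return {ip: sorted list of MACs} for every IP offered to more than one MAC."""
    offered = defaultdict(set)
    for line in log_text.splitlines():
        match = OFFER_RE.search(line)
        if match:
            offered[match.group("ip")].add(match.group("mac").lower())
    return {ip: sorted(macs) for ip, macs in offered.items() if len(macs) > 1}


# Short excerpt of the log from this report (illustrative input only).
SAMPLE_LOG = """\
Jan 22 23:50:55 DHCPOFFER(br-ctlplane) 10.200.200.115 34:17:eb:e6:45:32
Jan 22 23:50:55 DHCPDISCOVER(br-ctlplane) 34:17:eb:e6:45:ef
Jan 22 23:50:55 DHCPOFFER(br-ctlplane) 10.200.200.115 34:17:eb:e6:45:ef
Jan 22 23:51:04 DHCPNAK(br-ctlplane) 10.200.200.115 34:17:eb:e6:45:ef address in use
"""

if __name__ == "__main__":
    for ip, macs in find_offer_collisions(SAMPLE_LOG).items():
        print(ip, "offered to", ", ".join(macs))
```

To check a real run, the output of the journalctl command from step 6 could be saved to a file and its contents passed to find_offer_collisions.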
Expected results:
Introspection passes.

Additional info:
This issue might be resolved by https://review.openstack.org/#/c/203040/, which introduces the introspection delay referenced in https://bugs.launchpad.net/ironic-inspector/+bug/1473024; a backport might be a direct solution.

Workarounds:
- Run introspection manually, so that a delay can be added between the introspection of each node.
- Change the IP pool size.
- Spoof the MAC addresses so that dnsmasq cannot produce colliding offers.
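To illustrate why changing the pool size can work around the problem, here is a toy model of hash-based address selection. It is NOT dnsmasq's actual hash function: the byte-sum hash and the name toy_first_offer are assumptions made purely for illustration, and the addresses it computes are not expected to match the 10.200.200.115 seen in the log. It only shows the general effect, namely that two MACs whose hashes differ by a multiple of the pool size get the same first offer, and that a different pool size can break the tie.

```python
import ipaddress


def mac_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' into its six raw bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))


def toy_first_offer(mac, range_start, range_end):
    """Toy model only: offer range_start + (sum of MAC bytes mod pool size).

    The byte-sum hash is an assumption for illustration, not dnsmasq's code.
    The range is treated as inclusive on both ends.
    """
    start = int(ipaddress.IPv4Address(range_start))
    end = int(ipaddress.IPv4Address(range_end))
    pool_size = end - start + 1
    offset = sum(mac_bytes(mac)) % pool_size
    return str(ipaddress.IPv4Address(start + offset))


if __name__ == "__main__":
    macs = ["34:17:eb:e6:45:32", "34:17:eb:e6:45:ef"]
    # Range from this report, then a hypothetical wider range for comparison.
    for range_end in ("10.200.200.120", "10.200.200.149"):
        offers = [toy_first_offer(m, "10.200.200.100", range_end) for m in macs]
        status = "collision" if offers[0] == offers[1] else "no collision"
        print(range_end, offers, status)
```

In this toy model the two MACs from the report collide for the range used in dnsmasq.conf and stop colliding for the wider, hypothetical range ending at 10.200.200.149. Whether a particular pool size avoids collisions with the real dnsmasq depends on its own hash, so any change should be verified against the dnsmasq log.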
*** This bug has been marked as a duplicate of bug 1301659 ***