Created attachment 1142241 [details]
Ironic Inspector logs. Issue described occurred on 2016-03-30

Description of problem:

I've been deploying TripleO all day today, and after rebuilding the undercloud a couple of times, I always hit the same issue when introspecting. Here is an example with just one node:

[stack@undercloud ~]$ openstack baremetal introspection bulk start
Setting nodes for introspection to manageable...
Starting introspection of node: 304e594a-5585-4c25-a754-01864d872346
Waiting for introspection to finish...
Introspection for UUID 304e594a-5585-4c25-a754-01864d872346 finished with error: Preprocessing hook validate_interfaces: No suitable interfaces found in {u'eth1': {'ip': None, 'mac': u'10:60:4b:a9:b8:7c'}, u'eth0': {'ip': None, 'mac': u'10:60:4b:a9:b8:78'}}
Setting manageable nodes to available...
Introspection completed with errors:
304e594a-5585-4c25-a754-01864d872346: Preprocessing hook validate_interfaces: No suitable interfaces found in {u'eth1': {'ip': None, 'mac': u'10:60:4b:a9:b8:7c'}, u'eth0': {'ip': None, 'mac': u'10:60:4b:a9:b8:78'}}

Physical configuration:

The physical configuration was working before in Liberty, but just in case, I'm describing it here for your input. The undercloud (virtual) server connects with NIC 1 to the internet and NIC 2 to a non-internet-connected switch. The overcloud servers connect with NIC 1 to that same switch, and with NIC 2 either to the internet or left unplugged.

Console:

When looking at the console on the overcloud node, I can see that the machine obtains a valid DHCP lease on NIC 1 and sends a DHCPDISCOVER on NIC 2 that gets nowhere.
A warning message appears as:

ironic-python-agent[895]: date 895 WARNING ironic_python_agent.ironic_api_client [-] POST failed: HTTPConnectionPool(host='127.0.0.1', port=6385): Max retries exceeded with url: /v1/drivers/agent_ipmitool/vendor_passthru/lookup (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x2617f10>: Failed to establish a new connection: [Errno 111] Connection refused',))

If I try to run the command again, the server will not PXE boot again.

Mar 30 22:29:23 undercloud.hq.ltg dnsmasq-dhcp[7741]: DHCPDISCOVER(tap3d3ea1c3-d0) 10:60:4b:a9:b8:78 no address available
Mar 30 22:29:25 undercloud.hq.ltg dnsmasq-dhcp[7741]: DHCPDISCOVER(tap3d3ea1c3-d0) 10:60:4b:a9:b8:78 no address available
Mar 30 22:29:30 undercloud.hq.ltg dnsmasq-dhcp[7741]: DHCPDISCOVER(tap3d3ea1c3-d0) 10:60:4b:a9:b8:78 no address available

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
I realized we don't store ramdisk logs in this case. Could you please set always_store_ramdisk_logs to true in /etc/ironic-inspector/inspector.conf, restart openstack-ironic-inspector, and rerun introspection? This should give you the ramdisk logs in /var/log/ironic-inspector/ramdisk.
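For anyone else reproducing this, the steps above can be sketched as shell commands (a minimal sketch only; the sed pattern assumes a commented-out "always_store_ramdisk_logs" line already exists in your inspector.conf — append the option by hand if it does not):

```shell
# Sketch: enable storing ramdisk logs on every introspection run.
# The sed pattern assumes an "#always_store_ramdisk_logs = ..." line
# is present in the file; verify before relying on this.
CONF=/etc/ironic-inspector/inspector.conf
sudo sed -i 's/^#\?\s*always_store_ramdisk_logs\s*=.*/always_store_ramdisk_logs = true/' "$CONF"
sudo systemctl restart openstack-ironic-inspector
openstack baremetal introspection bulk start
```

Logs should then appear under /var/log/ironic-inspector/ramdisk as described above.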
Created attachment 1142296 [details]
ramdisk logs

Including the ramdisk log after modifying /etc/ironic-inspector/inspector.conf.
Hmm, that's an error which I thought we had fixed. Where does your introspection image (ironic-python-agent) come from? Did you build it yourself or download it from somewhere?
I tried both ways. I first built it myself and had the errors, then downloaded it from https://ci.centos.org/artifacts/rdo/images/mitaka/delorean/stable/ and, after removing the image from glance, reimporting it, and recreating the ironic nodes, I got the same behavior.
I've unpacked the image, and it has the fix in question. Still, in the logs I see:

Mar 31 16:01:20 localhost ironic-python-agent[823]: 2016-03-31 16:01:17.448 823 INFO ironic_python_agent.inspector [-] network interfaces: {'eth1': {'ip': None, 'mac': '10:60:4b:a9:b8:7c'}, 'eth0': {'ip': None, 'mac': '10:60:4b:a9:b8:78'}}
...
Mar 31 16:01:20 localhost dhcp-all-interfaces.sh[826]: Inspecting interface: eth0...Configured eth0
Mar 31 16:01:20 localhost ifup[883]: Determining IP information for eth0... done.

cc dprince
Can traffic on NIC1 see the traffic on NIC2 in any way? If so, I'm wondering if this might be related to ARP flux (essentially the DHCP queries causing MAC address confusion on the DHCP server or something). If this is the cause, then some sysctl tuning of arp_filter, arp_ignore, and arp_announce might help. Either that, or you can disconnect your NIC1/NIC2 switches so the traffic isn't getting across. Another idea would be to try using the stable-interface-names element when building your IPA discovery ramdisk. This should help ensure stable NIC ordering across reboots. Just some ideas.
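If ARP flux turns out to be the cause, the sysctl tuning mentioned above could look roughly like this (a sketch only, not verified against this setup; the values shown are the common "strict" settings, and whether they help depends on the topology):

```shell
# Sketch: restrict ARP handling to the interface that owns the address.
# arp_filter=1:   only answer ARP on the interface holding the target IP.
# arp_ignore=1:   ignore requests for addresses not on the receiving NIC.
# arp_announce=2: always pick the best local source address for ARP.
cat <<'EOF' | sudo tee /etc/sysctl.d/90-arp-flux.conf
net.ipv4.conf.all.arp_filter = 1
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2
EOF
sudo sysctl --system
```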
In the previous (RDO-specific) ramdisk we had a lot of sleeps waiting for DHCP. I think we might introduce something like that.
(In reply to Dan Prince from comment #6)
> Can traffic on NIC1 see the traffic on NIC2 in any way.

No, the two NICs belong to separate networks, configured on the blade enclosures. The second NIC is either unconnected or connected to the Internet. NIC 1 is connected to an internal network with just the OpenStack servers, so there are no DHCP conflicts of any kind.
Fixed in the newton release. Unfortunately, it's not something we can backport...
I am seeing this too on RDO tripleo Mitaka and Red Hat OSP 8 RC.
After more upstream discussion, we decided that this bug fix is backportable.
(In reply to Dmitry Tantsur from comment #11)
> After more upstream discussion, we decided that this bug fix is backportable.

Thanks, please let me know if you need more logs, testing, etc.
I'm seeing this in OSP8 GA. The upstream fix [1] makes inspection wait 60 seconds for *all* NICs to get their IP addresses. A custom image with this change should work as a workaround.

A cruder workaround is to delete and re-introspect in a loop until all of the nodes return no errors [2]. As it's a race condition, this crude approach worked for me on the third iteration, with six nodes introspecting with ironic-python-agent-8.0-20160415.1.el7ost.tar.

[1] https://git.openstack.org/cgit/openstack/ironic-python-agent/commit/?id=3fba1ee8db0aa0b1519ef2135e602268488570f4

[2]
while true; do
  for uuid in $(ironic node-list | awk '{print $2}' | egrep -v "UUID|^$"); do
    ironic node-set-power-state $uuid off
    ironic node-delete $uuid
  done
  openstack baremetal import --json ~/instackenv.json
  openstack baremetal configure boot
  openstack baremetal introspection bulk start
  intro_done=$(openstack baremetal introspection bulk status | awk '{print $6}' | egrep -v "^$|\||None")
  if [[ -z "$intro_done" ]]; then
    break
  fi
done
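For anyone adapting the loop in [2]: the ironic/awk/egrep pipeline just pulls the UUID column out of the CLI's ASCII table output. Its behaviour can be illustrated against a made-up fixture of that table (the node UUID, name, and columns below are invented for the example):

```shell
# Illustrative only: extract the UUID column from `ironic node-list`-style
# table output. The here-doc mimics the real table layout with fake data;
# awk takes the second whitespace-separated field, then the header row and
# the empty fields from the +---+ border rows are filtered out.
uuids=$(awk '{print $2}' <<'EOF' | grep -Ev 'UUID|^$'
+--------------------------------------+------+-------------+
| UUID                                 | Name | Power State |
+--------------------------------------+------+-------------+
| 304e594a-5585-4c25-a754-01864d872346 | n1   | power off   |
+--------------------------------------+------+-------------+
EOF
)
echo "$uuids"
```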
John, we have a tracking bug for OSPd8, please move your comment there: https://bugzilla.redhat.com/show_bug.cgi?id=1327255
We've made one more patch to stable/mitaka. Hopefully, it will fix this problem as soon as it propagates to RDO (which may take substantial time).
Hi,

I am hitting this issue with RDO Mitaka. When are we expecting a backport to make it to RDO?

Regards,
Graeme
It depends on when the images will get updated to contain the fix.
(In reply to Dmitry Tantsur from comment #17)
> It depends on when the images will get updated to contain the fix.

Which image, and isn't there a procedure for users to rebuild the image on their own?

When moving to MODIFIED, please provide the NVR in the Fixed In Version field.

For fixes which landed on stable/mitaka, you can use packages from the RDO Mitaka trunk repo:
https://trunk.rdoproject.org/centos7-mitaka/current-passed-ci/delorean.repo

Images built by the tripleo-quickstart CI gate job for that repo are saved at:
http://buildlogs.centos.org/centos/7/cloud/x86_64/tripleo_images/mitaka/delorean/
I'd really suggest this warrants a release to stable/mitaka rather than asking users to rebuild from mitaka trunk.
> Images build by tripleo-quickstart CI gate job for that repo are saved at
> http://buildlogs.centos.org/centos/7/cloud/x86_64/tripleo_images/mitaka/delorean/

Hi,

I just installed TripleO Mitaka stable and had the same issue (with 15 HP blades). I was using my own CentOS images. I replaced the images with the ones you suggested above and the issue has now gone away.

Thank you,
Charles
(In reply to Charles Short from comment #20)
> I just installed TripleO Mitaka Stable and had the same issue (with 15 HP
> Blades). I was using my own CentOS images. I replaced the images with the
> ones you suggested above and now the issue has gone away.

To add: those images do not have python-redis installed, so in an HA TripleO deployment Ceilometer fails to start in pacemaker. This can be resolved by simply installing python-redis.

Charles
I hit the same issue in an OSP9 virt test environment.
===================================================
10-08 06:33:22.695 25401 ERROR ironic_inspector.utils [-] [node: MAC 52:54:00:5a:6b:12] The following failures happened during running processing hook validate_interfaces: No suitable interfaces found in {u'ens4': {'ip': None, 'mac': u'52:54:00:5a:6b:12'}, u'ens3': {'ip': u'172.16.0.150', 'mac': u'52:54:00:5a:6b:11'}}
Look up error: Could not find a node for attributes {'bmc_address': u'', 'mac': None}
After changing the undercloud and overcloud-node network NIC model from e1000 to virtio, and following the steps in Bug 1234601 comment 19, introspection succeeded.
https://bugzilla.redhat.com/show_bug.cgi?id=1234601#c19
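For reference, switching a libvirt guest's NIC model from e1000 to virtio can be sketched like this (the domain name "undercloud" is illustrative; the sed pattern assumes libvirt's usual single-quoted XML attributes — adjust for your domains):

```shell
# Sketch: change every NIC model of a libvirt domain from e1000 to virtio.
# The change takes effect after the guest is shut down and started again.
virsh dumpxml undercloud | sed "s/model type='e1000'/model type='virtio'/g" > /tmp/undercloud.xml
virsh define /tmp/undercloud.xml
```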
OK, since this is fixed in the updated images, closing.