Description of problem:
After the undercloud installation and before starting introspection, I changed ifcfg-eth2 (not the local_interface - that is eth0) and restarted the network service: systemctl restart network. I then tried to run introspection, but it failed. After debugging it appears that DHCP requests were reaching the undercloud but received no response. After I restarted the openstack-ironic* services, introspection started to work.

Version-Release number of selected component (if applicable):
openstack-ironic-api-10.1.3-5.el7ost.noarch
openstack-ironic-inspector-7.2.1-2.el7ost.noarch
python-ironic-inspector-client-3.1.1-1.el7ost.noarch
puppet-ironic-12.4.0-2.el7ost.noarch
python2-ironicclient-2.2.1-1.el7ost.noarch
openstack-ironic-common-10.1.3-5.el7ost.noarch
openstack-ironic-staging-drivers-0.9.0-4.el7ost.noarch
python-ironic-lib-2.12.1-1.el7ost.noarch
python2-ironic-neutron-agent-1.0.0-1.el7ost.noarch
openstack-ironic-conductor-10.1.3-5.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Install the undercloud
2. Restart the network service
3. Run introspection (see the command sketch under Additional info below)

Actual results:
Introspection failed because the overcloud nodes did not receive DHCP offers.

Expected results:
Introspection passes.

Additional info:
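A minimal reproduction sketch on the undercloud, assuming the overcloud nodes are already registered and in the manageable state (these are the same commands used in the comments below):

sudo systemctl restart network
openstack overcloud node introspect --all-manageable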
Can you indicate when the sosreport was taken? Was it after the ironic service was restarted or before? After the network service was restarted?

It looks like you changed eth2 to a static IP instead of using DHCP. It also looks like the local_ip (192.168.24.1) is configured on eth0/br-ctlplane:

1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
1: lo    inet6 ::1/128 scope host \       valid_lft forever preferred_lft forever
2: eth0    inet6 fe80::5054:ff:fef9:1393/64 scope link \       valid_lft forever preferred_lft forever
3: eth1    inet 172.16.0.94/24 brd 172.16.0.255 scope global noprefixroute dynamic eth1\       valid_lft 2573sec preferred_lft 2573sec
3: eth1    inet6 fe80::5054:ff:fe5f:6af2/64 scope link noprefixroute \       valid_lft forever preferred_lft forever
4: eth2    inet 10.46.23.2/26 brd 10.46.23.63 scope global noprefixroute eth2\       valid_lft forever preferred_lft forever
4: eth2    inet6 fe80::5054:ff:febf:901b/64 scope link \       valid_lft forever preferred_lft forever
7: docker0    inet 172.17.0.1/16 scope global docker0\       valid_lft forever preferred_lft forever
11: br-ctlplane    inet 192.168.24.1/24 brd 192.168.24.255 scope global br-ctlplane\       valid_lft forever preferred_lft forever
11: br-ctlplane    inet6 fe80::5054:ff:fef9:1393/64 scope link \       valid_lft forever preferred_lft forever

cat etc/sysconfig/network-scripts/ifcfg-eth0
# This file is autogenerated by os-net-config
DEVICE=eth0
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
DEVICETYPE=ovs
TYPE=OVSPort
OVS_BRIDGE=br-ctlplane
BOOTPROTO=none
MTU=1500

Can you confirm if you ran tcpdump on br-ctlplane and saw DHCP requests but no offer?
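For reference, a capture along these lines on the undercloud should show whether any DHCP offers are being sent back on the provisioning bridge (br-ctlplane is assumed to be the ctlplane bridge, per the address output above):

sudo tcpdump -i br-ctlplane port 67 or port 68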
Can you also check if dnsmasq is running after the restart of network services?

Note that we may want to adjust the severity of this bug - manually editing the network-scripts and restarting network services isn't really a recommended procedure, and in addition there is a workaround, which is to restart the ironic services.
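For reference, the unit to check is the inspector's dnsmasq service (the unit name matches the status output quoted in the reply below):

sudo systemctl status openstack-ironic-inspector-dnsmasq.service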
I see lots of neutron errors in the nova-compute.log, not sure if this was after the neutron restart:

2018-08-27 04:10:08.111 27131 DEBUG neutronclient.v2_0.client [req-7c926a05-fb0d-468a-a5fc-9e16db63fe92 86935d11be3947649440aeed0c33297d cca8c770aa0f4cbcbaf533bb55c5766a - default default] Error message: {"NeutronError": {"message": "The resource could not be found.", "type": "HTTPNotFound", "detail": ""}} _handle_fault_response /usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py:259

2018-08-27 04:10:08.210 27131 DEBUG nova.network.base_api [req-7c926a05-fb0d-468a-a5fc-9e16db63fe92 86935d11be3947649440aeed0c33297d cca8c770aa0f4cbcbaf533bb55c5766a - default default] [instance: be7fbe90-6c6b-470d-8834-003c8dd8a2f6] Updating instance_info_cache with network_info: [{"profile": {}, "ovs_interfaceid": null, "preserve_on_delete": true, "network": {"bridge": null, "subnets": [{"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "192.168.24.6"}], "version": 4, "meta": {"dhcp_server": "192.168.24.5"}, "dns": [], "routes": [{"interface": null, "cidr": "169.254.169.254/32", "meta": {}, "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "192.168.24.1"}}], "cidr": "192.168.24.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "192.168.24.1"}}], "meta": {"injected": false, "tenant_id": "e6d303ee3a8145818f18052a07148bd5", "mtu": 1500}, "id": "c126cb72-159c-4afe-8e23-ea3f7e93db9b", "label": "ctlplane"}, "devname": "tap300993b5-5d", "vnic_type": "baremetal", "qbh_params": null, "meta": {}, "details": {}, "address": "52:54:00:4c:4f:fb", "active": true, "type": "other", "id": "300993b5-5d26-4a70-bc2c-0ba7490479b7", "qbg_params": null}] update_instance_cache_with_nw_info /usr/lib/python2.7/site-packages/nova/network/base_api.py:48
(In reply to Bob Fournier from comment #2)
> Can you indicate when the sosreport was taken? Was it after the ironic
> service was restarted or before? After the network service was restarted?
> ...
> Can you confirm if you ran tcpdump on br-ctlplane and saw DHCP requests but
> no offer?

Hi Bob,

The sosreport was taken after the network and ironic services were restarted.

Yes, the tcpdump was taken on the undercloud on eth0.

The service is running:

● openstack-ironic-inspector-dnsmasq.service - PXE boot dnsmasq service for Ironic Inspector
   Loaded: loaded (/usr/lib/systemd/system/openstack-ironic-inspector-dnsmasq.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2018-08-22 11:21:07 EDT; 5 days ago
  Process: 24549 ExecStart=/sbin/dnsmasq --conf-file=/etc/ironic-inspector/dnsmasq.conf (code=exited, status=0/SUCCESS)
 Main PID: 24551 (dnsmasq)
    Tasks: 1
   CGroup: /system.slice/openstack-ironic-inspector-dnsmasq.service
           └─24551 /sbin/dnsmasq --conf-file=/etc/ironic-inspector/dnsmasq.conf

As reported in the bug, I did change the network configuration. Since eth2 is not the OpenStack network (eth0), there is no recommended procedure for it; a customer can add, edit, or remove NICs as they please.

I want to bring the customer perspective here. For a sysadmin, changing the network configuration is a basic, common practice. When such a basic action leads to a failure in another service, I see it as bad UX and damaging to the product's reputation. Let's keep the severity, and you can change the priority as you/PM see fit.
To test, I confirmed introspection was working and then restarted the network service:

sudo systemctl restart network

When starting introspection there were no DHCP responses from dnsmasq and introspection failed:

(undercloud) [stack@host01 ~]$ sudo tcpdump -i br-ctlplane port 67 or port 68
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on br-ctlplane, link-type EN10MB (Ethernet), capture size 262144 bytes
07:47:30.534958 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b0:83:fe:c6:63:86 (oui Unknown), length 548
07:47:35.916196 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b0:83:fe:c6:53:21 (oui Unknown), length 548
07:47:38.574080 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b0:83:fe:c6:63:86 (oui Unknown), length 548
07:47:43.938163 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b0:83:fe:c6:53:21 (oui Unknown), length 548

Then I restarted dnsmasq:

sudo systemctl restart openstack-ironic-inspector-dnsmasq.service

And DHCP responses were returned and introspection worked OK:

08:25:49.337396 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b0:83:fe:c6:53:21 (oui Unknown), length 548
08:25:52.093743 IP local_ip.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 302

Introspection of node 4de51a48-abdc-4da7-85d1-4393b2bf9a90 completed. Status:SUCCESS. Errors:None
Introspection of node 4de51a48-abdc-4da7-85d1-4393b2bf9a90 completed. Status:SUCCESS. Errors:None
Introspection of node ec6694c6-8c0c-437f-a7f6-82e7c2ab3793 completed. Status:SUCCESS. Errors:None
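A quick way to confirm the state of dnsmasq's DHCP socket after the network restart (a sketch; the exact ss output format varies by version, but the bound address column is what matters) is something like:

sudo ss -ulnp | grep dnsmasq

If dnsmasq no longer shows a usable binding for UDP port 67 on the ctlplane address, that is consistent with the bind-interfaces behaviour discussed in the next comment.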
The "bind-interfaces" option that we set in /etc/ironic-inspector/dnsmasq.conf seems to have an affect here. From the man pages: --bind-interfaces On systems which support it, dnsmasq binds the wildcard address, even when it is listening on only some interfaces. It then discards requests that it shouldn't reply to. This has the advantage of working even when interfaces come and go and change address. This option forces dnsmasq to really bind only the interfaces it is listening on. About the only time when this is useful is when running another nameserver (or another instance of dnsmasq) on the same machine. Setting this option also enables multiple instances of dnsmasq which provide DHCP service to run in the same machine I removed the setting of bind-interfaces and restarted the network service. Introspection worked fine then:(undercloud) [stack@host01 ~]$ openstack overcloud node introspect --all-manageable Waiting for introspection to finish... Started Mistral Workflow tripleo.baremetal.v1.introspect_manageable_nodes. Execution ID: ab406330-563c-4584-ae87-12d96462a06e Waiting for messages on queue 'tripleo' with no timeout. Introspection of node 4de51a48-abdc-4da7-85d1-4393b2bf9a90 completed. Status:SUCCESS. Errors:None Introspection of node ec6694c6-8c0c-437f-a7f6-82e7c2ab3793 completed. Status:SUCCESS. Errors:None Successfully introspected 2 node(s). I also tried using "bind-dynamic" instead of "bind-interfaces" but this did not work. Will investigate more the removal of bind-interfaces from /etc/ironic-inspector/dnsmasq.conf.
There is some related discussion here - https://bugs.launchpad.net/ubuntu/+source/dnsmasq/+bug/876458

Man page: http://www.thekelleys.org.uk/dnsmasq/docs/dnsmasq-man.html
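To check which binding mode a given undercloud is actually using, inspecting the conf file referenced by the systemd unit should be enough, e.g.:

grep -nE 'bind-interfaces|bind-dynamic' /etc/ironic-inspector/dnsmasq.conf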
Verified.

Environment:
puppet-ironic-12.4.0-3.el7ost.noarch

(undercloud) [stack@undercloud-0 ~]$ sudo systemctl restart network
(undercloud) [stack@undercloud-0 ~]$ openstack overcloud node introspect --all-manageable
Waiting for introspection to finish...
Started Mistral Workflow tripleo.baremetal.v1.introspect_manageable_nodes. Execution ID: 7306d948-6d56-4903-8b82-ee48ee790ba7
Waiting for messages on queue 'tripleo' with no timeout.
Introspection of node 3ecb3aad-a548-4d9d-bac4-401b8271f27b completed. Status:SUCCESS. Errors:None
Introspection of node d7afdc79-03c1-4849-9bcf-d438327d956e completed. Status:SUCCESS. Errors:None
Successfully introspected 2 node(s).
Introspection completed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3587