1622719 – Introspection failed after restart of network service on the undercloud

Summary: Introspection failed after restart of network service on the undercloud

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	puppet-ironic
Sub Component:
Version:	13.0 (Queens)
Hardware:	All
OS:	All
Priority:	medium
Severity:	urgent
Target Milestone:	z3
Target Release:	13.0 (Queens)
Assignee:	Bob Fournier
QA Contact:	Alexander Chuzhoy
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1627596 1628775

Reported:	2018-08-27 20:25 UTC by Udi Shkalim
Modified:	2018-11-13 22:29 UTC
CC List:	11 users
Fixed In Version:	puppet-ironic-12.4.0-3.el7ost
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1627596 1628775
Environment:
Last Closed:	2018-11-13 22:28:47 UTC
Target Upstream Version:
Embargoed:
Flags:	lmarsh: needinfo- lmarsh: needinfo-

Links
System	ID	Priority	Status	Summary	Last Updated
OpenStack gerrit	600068	'None'	MERGED	Remove ironic inspector dnsmasq bind-interfaces setting	2020-11-03 17:57:45 UTC
OpenStack gerrit	608060	'None'	MERGED	Remove ironic inspector dnsmasq bind-interfaces setting	2020-11-03 17:57:44 UTC
Red Hat Product Errata	RHBA-2018:3587	None	None	None	2018-11-13 22:29:30 UTC

Description Udi Shkalim 2018-08-27 20:25:39 UTC

Description of problem:
After undercloud installation and before starting introspection, I changed ifcfg-eth2 (not the local_interface, which is eth0) and restarted the network service: systemctl restart network.
I then tried to run introspection, but it failed. Debugging showed that DHCP requests reach the undercloud but get no response.
After I restarted the openstack-ironic* services, introspection started to work.

Version-Release number of selected component (if applicable):
openstack-ironic-api-10.1.3-5.el7ost.noarch
openstack-ironic-inspector-7.2.1-2.el7ost.noarch
python-ironic-inspector-client-3.1.1-1.el7ost.noarch
puppet-ironic-12.4.0-2.el7ost.noarch
python2-ironicclient-2.2.1-1.el7ost.noarch
openstack-ironic-common-10.1.3-5.el7ost.noarch
openstack-ironic-staging-drivers-0.9.0-4.el7ost.noarch
python-ironic-lib-2.12.1-1.el7ost.noarch
python2-ironic-neutron-agent-1.0.0-1.el7ost.noarch
openstack-ironic-conductor-10.1.3-5.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Install undercloud
2. restart the network service
3. run introspection

Actual results:
Introspection failed because the overcloud nodes did not get DHCP offers

Expected results:
introspection passed

Additional info:

Comment 2 Bob Fournier 2018-08-27 21:21:46 UTC

Can you indicate when the sosreport was taken?  Was it after the ironic service was restarted or before?  After the network service was restarted?

It looks like you changed eth2 to be a static ip instead of using dhcp.  It looks like the local_ip (192.168.24.1) at least is configured on eth0/br-ctlplane.
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
1: lo    inet6 ::1/128 scope host \       valid_lft forever preferred_lft forever
2: eth0    inet6 fe80::5054:ff:fef9:1393/64 scope link \       valid_lft forever preferred_lft forever
3: eth1    inet 172.16.0.94/24 brd 172.16.0.255 scope global noprefixroute dynamic eth1\       valid_lft 2573sec preferred_lft 2573sec
3: eth1    inet6 fe80::5054:ff:fe5f:6af2/64 scope link noprefixroute \       valid_lft forever preferred_lft forever
4: eth2    inet 10.46.23.2/26 brd 10.46.23.63 scope global noprefixroute eth2\       valid_lft forever preferred_lft forever
4: eth2    inet6 fe80::5054:ff:febf:901b/64 scope link \       valid_lft forever preferred_lft forever
7: docker0    inet 172.17.0.1/16 scope global docker0\       valid_lft forever preferred_lft forever
11: br-ctlplane    inet 192.168.24.1/24 brd 192.168.24.255 scope global br-ctlplane\       valid_lft forever preferred_lft forever
11: br-ctlplane    inet6 fe80::5054:ff:fef9:1393/64 scope link \       valid_lft forever preferred_lft forever

cat etc/sysconfig/network-scripts/ifcfg-eth0 
# This file is autogenerated by os-net-config
DEVICE=eth0
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
DEVICETYPE=ovs
TYPE=OVSPort
OVS_BRIDGE=br-ctlplane
BOOTPROTO=none
MTU=1500

Can you confirm if you ran tcpdump on br-ctlplane and saw DHCP requests but no offer?

Comment 3 Bob Fournier 2018-08-27 21:28:32 UTC

Can you also check if dnsmasq is running after the restart of network services?

Note that we may want to adjust the severity of this bug. Manually editing the network-scripts and restarting network services isn't really a recommended procedure, and there is a recovery: restarting ironic.

Comment 4 Bob Fournier 2018-08-27 21:37:49 UTC

I see lots of neutron errors in the nova-compute.log, not sure if this was after the neutron restart:
2018-08-27 04:10:08.111 27131 DEBUG neutronclient.v2_0.client [req-7c926a05-fb0d-468a-a5fc-9e16db63fe92 86935d11be3947649440aeed0c33297d cca8c770aa0f4cbcbaf533bb55c5766a - default default] Error message: {"NeutronError": {"message": "The resource could not be found.", "type": "HTTPNotFound", "detail": ""}} _handle_fault_response /usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py:259
2018-08-27 04:10:08.210 27131 DEBUG nova.network.base_api [req-7c926a05-fb0d-468a-a5fc-9e16db63fe92 86935d11be3947649440aeed0c33297d cca8c770aa0f4cbcbaf533bb55c5766a - default default] [instance: be7fbe90-6c6b-470d-8834-003c8dd8a2f6] Updating instance_info_cache with network_info: [{"profile": {}, "ovs_interfaceid": null, "preserve_on_delete": true, "network": {"bridge": null, "subnets": [{"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "192.168.24.6"}], "version": 4, "meta": {"dhcp_server": "192.168.24.5"}, "dns": [], "routes": [{"interface": null, "cidr": "169.254.169.254/32", "meta": {}, "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "192.168.24.1"}}], "cidr": "192.168.24.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "192.168.24.1"}}], "meta": {"injected": false, "tenant_id": "e6d303ee3a8145818f18052a07148bd5", "mtu": 1500}, "id": "c126cb72-159c-4afe-8e23-ea3f7e93db9b", "label": "ctlplane"}, "devname": "tap300993b5-5d", "vnic_type": "baremetal", "qbh_params": null, "meta": {}, "details": {}, "address": "52:54:00:4c:4f:fb", "active": true, "type": "other", "id": "300993b5-5d26-4a70-bc2c-0ba7490479b7", "qbg_params": null}] update_instance_cache_with_nw_info /usr/lib/python2.7/site-packages/nova/network/base_api.py:48

Comment 5 Udi Shkalim 2018-08-28 11:35:23 UTC

(In reply to Bob Fournier from comment #2)
> Can you indicate when the sosreport was taken?  Was it after the the ironic
> service was restarted or before?  After the network service was restarted?
> 
> It looks like you changed eth2 to be a static ip instead of using dhcp.  It
> looks like the local_ip (192.168.24.1) at least is configured on
> eth0/br-ctlplane.
> 1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever
> preferred_lft forever
> 1: lo    inet6 ::1/128 scope host \       valid_lft forever preferred_lft
> forever
> 2: eth0    inet6 fe80::5054:ff:fef9:1393/64 scope link \       valid_lft
> forever preferred_lft forever
> 3: eth1    inet 172.16.0.94/24 brd 172.16.0.255 scope global noprefixroute
> dynamic eth1\       valid_lft 2573sec preferred_lft 2573sec
> 3: eth1    inet6 fe80::5054:ff:fe5f:6af2/64 scope link noprefixroute \      
> valid_lft forever preferred_lft forever
> 4: eth2    inet 10.46.23.2/26 brd 10.46.23.63 scope global noprefixroute
> eth2\       valid_lft forever preferred_lft forever
> 4: eth2    inet6 fe80::5054:ff:febf:901b/64 scope link \       valid_lft
> forever preferred_lft forever
> 7: docker0    inet 172.17.0.1/16 scope global docker0\       valid_lft
> forever preferred_lft forever
> 11: br-ctlplane    inet 192.168.24.1/24 brd 192.168.24.255 scope global
> br-ctlplane\       valid_lft forever preferred_lft forever
> 11: br-ctlplane    inet6 fe80::5054:ff:fef9:1393/64 scope link \      
> valid_lft forever preferred_lft forever
> 
> cat etc/sysconfig/network-scripts/ifcfg-eth0 
> # This file is autogenerated by os-net-config
> DEVICE=eth0
> ONBOOT=yes
> HOTPLUG=no
> NM_CONTROLLED=no
> PEERDNS=no
> DEVICETYPE=ovs
> TYPE=OVSPort
> OVS_BRIDGE=br-ctlplane
> BOOTPROTO=none
> MTU=1500
> 
> Can you confirm if you ran tcpdump on br-ctlplane and saw DHCP requests but
> no offer?


Hi Bob,
The sosreport was taken after the network and ironic services restart.
Yes, the tcpdump was taken on the undercloud on eth0.

The service is running:
● openstack-ironic-inspector-dnsmasq.service - PXE boot dnsmasq service for Ironic Inspector
   Loaded: loaded (/usr/lib/systemd/system/openstack-ironic-inspector-dnsmasq.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2018-08-22 11:21:07 EDT; 5 days ago
  Process: 24549 ExecStart=/sbin/dnsmasq --conf-file=/etc/ironic-inspector/dnsmasq.conf (code=exited, status=0/SUCCESS)
 Main PID: 24551 (dnsmasq)
    Tasks: 1
   CGroup: /system.slice/openstack-ironic-inspector-dnsmasq.service
           └─24551 /sbin/dnsmasq --conf-file=/etc/ironic-inspector/dnsmasq.conf


As reported in the bug, I did change the network configuration. Since it is not the OpenStack network (eth0), there is no recommended procedure; a customer can add, edit, or remove NICs as they please.

I want to bring the customer perspective here.
For a sysadmin, changing the network configuration is a basic, common practice. When a basic action leads to a failure in another service, I see it as bad UX and as damage to the product's reputation.
Let's keep the severity and you can change the priority as you/PM see it.

Comment 6 Bob Fournier 2018-09-03 12:50:56 UTC

To test I confirmed introspection was working and then restarted the network service:
sudo systemctl restart network

When starting introspection there were no dhcp responses from dnsmasq and introspection failed:

(undercloud) [stack@host01 ~]$ sudo tcpdump -i br-ctlplane port 67 or port 68
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on br-ctlplane, link-type EN10MB (Ethernet), capture size 262144 bytes
07:47:30.534958 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b0:83:fe:c6:63:86 (oui Unknown), length 548
07:47:35.916196 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b0:83:fe:c6:53:21 (oui Unknown), length 548
07:47:38.574080 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b0:83:fe:c6:63:86 (oui Unknown), length 548
07:47:43.938163 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b0:83:fe:c6:53:21 (oui Unknown), length 548

Then I restarted dnsmasq:
sudo systemctl restart openstack-ironic-inspector-dnsmasq.service

And dhcp responses were returned and introspection worked OK:
08:25:49.337396 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b0:83:fe:c6:53:21 (oui Unknown), length 548
08:25:52.093743 IP local_ip.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 302

Introspection of node 4de51a48-abdc-4da7-85d1-4393b2bf9a90 completed. Status:SUCCESS. Errors:None
Introspection of node 4de51a48-abdc-4da7-85d1-4393b2bf9a90 completed. Status:SUCCESS. Errors:None
Introspection of node ec6694c6-8c0c-437f-a7f6-82e7c2ab3793 completed. Status:SUCCESS. Errors:None
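
The captures in this comment show the failure signature: BOOTP/DHCP requests on br-ctlplane with no reply. As a rough illustration only, a small shell function could summarize tcpdump text in the same way. The function name dhcp_summary is hypothetical and not part of any tool mentioned in this bug; it simply matches the "BOOTP/DHCP, Request" and "BOOTP/DHCP, Reply" strings as they appear in the output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the bug report): count BOOTP/DHCP requests
# and replies in tcpdump text read from stdin.
# Prints "requests=N replies=M" and returns 1 when requests were seen but
# no reply was, which is the failure pattern described in this comment.
dhcp_summary() {
    local requests=0 replies=0 line
    while IFS= read -r line; do
        case "$line" in
            *"BOOTP/DHCP, Request"*) requests=$((requests + 1)) ;;
            *"BOOTP/DHCP, Reply"*)   replies=$((replies + 1)) ;;
        esac
    done
    echo "requests=${requests} replies=${replies}"
    if [ "$requests" -gt 0 ] && [ "$replies" -eq 0 ]; then
        return 1
    fi
    return 0
}
```

It could be fed from a saved capture, or from a live one such as "sudo tcpdump -l -i br-ctlplane port 67 or port 68", stopped after a few PXE attempts. A non-zero return would suggest dnsmasq is not answering, assuming the nodes are actually sending requests during the capture.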

Comment 7 Bob Fournier 2018-09-04 15:11:22 UTC

The "bind-interfaces" option that we set in /etc/ironic-inspector/dnsmasq.conf seems to have an effect here. From the man pages:
--bind-interfaces
On systems which support it, dnsmasq binds the wildcard address, even when it is listening on only some interfaces. It then discards requests that it shouldn't reply to. This has the advantage of working even when interfaces come and go and change address. This option forces dnsmasq to really bind only the interfaces it is listening on. About the only time when this is useful is when running another nameserver (or another instance of dnsmasq) on the same machine. Setting this option also enables multiple instances of dnsmasq which provide DHCP service to run in the same machine

I removed the setting of bind-interfaces and restarted the network service. Introspection worked fine then:
(undercloud) [stack@host01 ~]$ openstack overcloud node introspect --all-manageable
Waiting for introspection to finish...
Started Mistral Workflow tripleo.baremetal.v1.introspect_manageable_nodes. Execution ID: ab406330-563c-4584-ae87-12d96462a06e
Waiting for messages on queue 'tripleo' with no timeout.
Introspection of node 4de51a48-abdc-4da7-85d1-4393b2bf9a90 completed. Status:SUCCESS. Errors:None
Introspection of node ec6694c6-8c0c-437f-a7f6-82e7c2ab3793 completed. Status:SUCCESS. Errors:None
Successfully introspected 2 node(s).

I also tried using "bind-dynamic" instead of "bind-interfaces" but this did not work.

Will investigate further the removal of bind-interfaces from /etc/ironic-inspector/dnsmasq.conf.
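
For anyone wanting to try the same experiment by hand, the following is a minimal sketch, assuming the option appears on a line of its own as "bind-interfaces". The function names has_bind_interfaces and disable_bind_interfaces are hypothetical helpers, not part of dnsmasq or puppet-ironic, and a manual edit like this may be overwritten the next time puppet regenerates the file.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not from the bug report) for inspecting a dnsmasq
# configuration file passed as the first argument.

# Return 0 if an uncommented "bind-interfaces" line is present in the file.
has_bind_interfaces() {
    grep -Eq '^[[:space:]]*bind-interfaces[[:space:]]*$' "$1"
}

# Comment out the setting in place, keeping a .bak copy of the original file.
disable_bind_interfaces() {
    sed -i.bak -E 's/^([[:space:]]*)bind-interfaces[[:space:]]*$/\1#bind-interfaces/' "$1"
}
```

Run as root against /etc/ironic-inspector/dnsmasq.conf, these would report and then comment out the setting; the openstack-ironic-inspector-dnsmasq service would then need a restart for the change to take effect.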

Comment 8 Bob Fournier 2018-09-04 15:16:30 UTC

There is some related discussion here: https://bugs.launchpad.net/ubuntu/+source/dnsmasq/+bug/876458

Man page is http://www.thekelleys.org.uk/dnsmasq/docs/dnsmasq-man.html

Comment 14 Alexander Chuzhoy 2018-10-30 18:00:34 UTC

Verified:
Environment: puppet-ironic-12.4.0-3.el7ost.noarch

(undercloud) [stack@undercloud-0 ~]$ sudo systemctl restart network 

(undercloud) [stack@undercloud-0 ~]$ openstack overcloud node introspect --all-manageable                                                                                                
Waiting for introspection to finish...                                                                                                                                                    
Started Mistral Workflow tripleo.baremetal.v1.introspect_manageable_nodes. Execution ID: 7306d948-6d56-4903-8b82-ee48ee790ba7                                                             
Waiting for messages on queue 'tripleo' with no timeout.                                                                                                                                  
Introspection of node 3ecb3aad-a548-4d9d-bac4-401b8271f27b completed. Status:SUCCESS. Errors:None                                                                                         
Introspection of node d7afdc79-03c1-4849-9bcf-d438327d956e completed. Status:SUCCESS. Errors:None                                                                                         
Successfully introspected 2 node(s).                                                                                                                                                      
Introspection completed.

Comment 18 errata-xmlrpc 2018-11-13 22:28:47 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3587
