Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1633287

Summary: Node cleaning in the undercloud occasionally fails for a subset of nodes.
Product: Red Hat OpenStack    Reporter: Alexander Chuzhoy <sasha>
Component: openstack-ironic    Assignee: RHOS Maint <rhos-maint>
Status: CLOSED WONTFIX QA Contact: mlammon
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 14.0 (Rocky)    CC: bfournie, derekh, dtantsur, hjensas, mburns
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:    Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:    Environment:
Last Closed: 2018-10-16 22:57:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
sosreports none

Description Alexander Chuzhoy 2018-09-26 15:23:45 UTC
Node cleaning in the undercloud occasionally fails for a subset of nodes.

Environment:
python2-ironicclient-2.5.0-0.20180810135843.fb94fb8.el7ost.noarch
python2-ironic-inspector-client-3.3.0-0.20180810080932.53bf4e8.el7ost.noarch
instack-undercloud-9.3.1-0.20180831000258.e464799.el7ost.noarch
puppet-ironic-13.3.1-0.20180831191239.61387eb.el7ost.noarch


Steps to reproduce:
1. Introspect all nodes.
2. Try to clean all nodes with:
    openstack overcloud node clean --all-manageable


Result:
openstack overcloud node clean --all-manageable
Waiting for messages on queue 'tripleo' with no timeout.
Cleaned 7 node(s)

openstack baremetal node provide ceph1-leaf0
The requested action "provide" can not be performed on node "1c5f5638-851d-4a6f-8017-620a15c46c7c" while it is in state "clean failed". (HTTP 400)



(undercloud) [stack@undercloud ~]$ openstack baremetal node list
+--------------------------------------+--------------------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name               | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------------+---------------+-------------+--------------------+-------------+
| 09bc03a0-e9e4-4b4d-b200-3270883c4812 | ceph0-leaf0        | None          | power off   | available          | False       |
| 1c5f5638-851d-4a6f-8017-620a15c46c7c | ceph1-leaf0        | None          | power off   | clean failed       | True        |
| 924f4d83-f199-47e1-aa06-9f9755c68b5b | compute-0-leaf0    | None          | power off   | clean failed       | True        |
| 4b9c0106-9c79-48bb-972a-b55814632914 | ceph2-leaf1        | None          | power off   | manageable         | False       |
| 660756af-1035-4556-9d3a-db72393b3692 | controller-0-leaf1 | None          | power off   | manageable         | False       |
| 6ca2f3d9-b08d-4abe-8189-4ef82f93717a | controller-1-leaf1 | None          | power off   | manageable         | False       |
| 56863c26-5791-48f6-a807-80795587b32d | controller-2-leaf1 | None          | power off   | manageable         | False       |
+--------------------------------------+--------------------+---------------+-------------+--------------------+-------------+

The issue is intermittent: sometimes all nodes are cleaned successfully, and sometimes one or more nodes end up in the "clean failed" state.

Comment 1 Alexander Chuzhoy 2018-09-26 15:31:24 UTC
Created attachment 1487305 [details]
sosreports

Comment 2 Bob Fournier 2018-10-05 18:29:07 UTC
It looks like node 1c5f5638-851d-4a6f-8017-620a15c46c7c was successfully introspected, but when cleaning was attempted we got these errors:

ironic-conductor.log.1:2018-09-25 23:26:46.057 1 DEBUG ironic.conductor.manager [req-64a12ece-1b82-4ec8-8441-a845170a93ab a066e8e0c0044a37917bb2b7605a67ad 0d268b757aff4effbbeff6d5182a882e - default default] Starting manual cleaning for node 1c5f5638-851d-4a6f-8017-620a15c46c7c _do_node_clean /usr/lib/python2.7/site-packages/ironic/conductor/manager.py:1201
ironic-conductor.log.1:2018-09-25 23:26:50.764 1 DEBUG ironic.common.neutron [req-64a12ece-1b82-4ec8-8441-a845170a93ab a066e8e0c0044a37917bb2b7605a67ad 0d268b757aff4effbbeff6d5182a882e - default default] Removing ports on network b68ab7d3-2518-470b-a0b8-7f541e8dbaa2 on node 1c5f5638-851d-4a6f-8017-620a15c46c7c. remove_ports_from_network /usr/lib/python2.7/site-packages/ironic/common/neutron.py:305
ironic-conductor.log.1:2018-09-25 23:26:52.502 1 DEBUG ironic.common.neutron [req-64a12ece-1b82-4ec8-8441-a845170a93ab a066e8e0c0044a37917bb2b7605a67ad 0d268b757aff4effbbeff6d5182a882e - default default] No ports to remove for node 1c5f5638-851d-4a6f-8017-620a15c46c7c remove_neutron_ports /usr/lib/python2.7/site-packages/ironic/common/neutron.py:331
ironic-conductor.log.1:2018-09-25 23:26:52.502 1 INFO ironic.drivers.modules.network.flat [req-64a12ece-1b82-4ec8-8441-a845170a93ab a066e8e0c0044a37917bb2b7605a67ad 0d268b757aff4effbbeff6d5182a882e - default default] Adding cleaning network to node 1c5f5638-851d-4a6f-8017-620a15c46c7c
ironic-conductor.log.1:2018-09-25 23:26:56.737 1 DEBUG ironic.common.neutron [req-64a12ece-1b82-4ec8-8441-a845170a93ab a066e8e0c0044a37917bb2b7605a67ad 0d268b757aff4effbbeff6d5182a882e - default default] For node 1c5f5638-851d-4a6f-8017-620a15c46c7c, creating neutron ports on network b68ab7d3-2518-470b-a0b8-7f541e8dbaa2 using FlatNetwork network interface. add_ports_to_network /usr/lib/python2.7/site-packages/ironic/common/neutron.py:217
ironic-conductor.log.1:2018-09-25 23:26:58.454 1 WARNING ironic.common.neutron [req-64a12ece-1b82-4ec8-8441-a845170a93ab a066e8e0c0044a37917bb2b7605a67ad 0d268b757aff4effbbeff6d5182a882e - default default] Could not create neutron port for node's 1c5f5638-851d-4a6f-8017-620a15c46c7c port 6840f692-1803-4930-a4cc-f3fa40075206 on the neutron network b68ab7d3-2518-470b-a0b8-7f541e8dbaa2. Host 1c5f5638-851d-4a6f-8017-620a15c46c7c is not connected to any segments on routed provider network 'b68ab7d3-2518-470b-a0b8-7f541e8dbaa2'.  It should be connected to one.
ironic-conductor.log.1:Neutron server returns request_ids: ['req-5e859a3f-8b88-47a2-9ea6-576cefc618bd']: Conflict: Host 1c5f5638-851d-4a6f-8017-620a15c46c7c is not connected to any segments on routed provider network 'b68ab7d3-2518-470b-a0b8-7f541e8dbaa2'.  It should be connected to one.

Followed by:
2018-09-25 23:27:00.127 1 DEBUG ironic.common.neutron [req-64a12ece-1b82-4ec8-8441-a845170a93ab a066e8e0c0044a37917bb2b7605a67ad 0d268b757aff4effbbeff6d5182a882e - default default] No ports to remove for node 1c5f5638-851d-4a6f-8017-620a15c46c7c remove_neutron_ports /usr/lib/python2.7/site-packages/ironic/common/neutron.py:331
2018-09-25 23:27:00.127 1 ERROR ironic.conductor.manager [req-64a12ece-1b82-4ec8-8441-a845170a93ab a066e8e0c0044a37917bb2b7605a67ad 0d268b757aff4effbbeff6d5182a882e - default default] Failed to prepare node 1c5f5638-851d-4a6f-8017-620a15c46c7c for cleaning: Failed to create neutron ports for any PXE enabled port on node 1c5f5638-851d-4a6f-8017-620a15c46c7c.: NetworkError: Failed to create neutron ports for any PXE enabled port on node 1c5f5638-851d-4a6f-8017-620a15c46c7c.
2018-09-25 23:27:00.127 1 ERROR ironic.conductor.manager Traceback (most recent call last):
2018-09-25 23:27:00.127 1 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/conductor/manager.py", line 1244, in _do_node_clean
2018-09-25 23:27:00.127 1 ERROR ironic.conductor.manager     prepare_result = task.driver.deploy.prepare_cleaning(task)
2018-09-25 23:27:00.127 1 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic_lib/metrics.py", line 60, in wrapped
2018-09-25 23:27:00.127 1 ERROR ironic.conductor.manager     result = f(*args, **kwargs)
2018-09-25 23:27:00.127 1 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/iscsi_deploy.py", line 614, in prepare_cleaning
2018-09-25 23:27:00.127 1 ERROR ironic.conductor.manager     task, manage_boot=True)
2018-09-25 23:27:00.127 1 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/deploy_utils.py", line 887, in prepare_inband_cleaning
2018-09-25 23:27:00.127 1 ERROR ironic.conductor.manager     task.driver.network.add_cleaning_network(task)
2018-09-25 23:27:00.127 1 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/network/flat.py", line 119, in add_cleaning_network
2018-09-25 23:27:00.127 1 ERROR ironic.conductor.manager     task, self.get_cleaning_network_uuid(task))
2018-09-25 23:27:00.127 1 ERROR ironic.conductor.manager   File "/usr/lib/python2.7/site-packages/ironic/common/neutron.py", line 277, in add_ports_to_network
2018-09-25 23:27:00.127 1 ERROR ironic.conductor.manager     "on node %s.") % node.uuid)
2018-09-25 23:27:00.127 1 ERROR ironic.conductor.manager NetworkError: Failed to create neutron ports for any PXE enabled port on node 1c5f5638-851d-4a6f-8017-620a15c46c7c.
2018-09-25 23:27:00.127 1 ERROR ironic.conductor.manager

Note that there are no neutron logs in the sosreport:
ovpn-125-245:containers bfournie$ pwd
/Users/bfournie/bug1633287/sosreport-undercloud.localdomain-20180926112915/var/log/containers
ovpn-125-245:containers bfournie$ ls
heat			ironic			mistral
httpd			ironic-inspector	zaqar

Comment 5 Bob Fournier 2018-10-06 20:02:20 UTC
This will potentially be helped by adding a delay after setting the port's physical network:
https://code.engineering.redhat.com/gerrit/#/c/151962/

Comment 6 Alexander Chuzhoy 2018-10-16 21:24:40 UTC
The issue doesn't reproduce with a 30-second delay added.
Not sure if we want to document that, or whether such delays occur naturally on real (non-automated) deployments.

Comment 7 Alexander Chuzhoy 2018-10-16 21:25:03 UTC
WDYT re comment #6?

Comment 8 Alexander Chuzhoy 2018-10-16 21:25:38 UTC
WDYT re comment #6?

Comment 9 Alexander Chuzhoy 2018-10-16 22:56:03 UTC
Filed a doc bug to document waiting 30 seconds after setting ports:
https://bugzilla.redhat.com/show_bug.cgi?id=1639924

Comment 10 Alexander Chuzhoy 2018-10-16 22:56:26 UTC
No need for more info.

Comment 11 Alexander Chuzhoy 2018-10-16 22:57:06 UTC
Closing per comment #9. A 30-second wait was needed.