Bug 1829086 - Introspection fails after scale up
Summary: Introspection fails after scale up
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: beta
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: RHOS Maint
QA Contact: Alistair Tonner
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-04-28 20:42 UTC by David Rosenfeld
Modified: 2020-07-29 07:52 UTC
CC: 9 users

Fixed In Version: openstack-ironic-13.0.4-0.20200529150915.911bc51.el8ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-29 07:52:15 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Product Errata RHBA-2020:3148 (last updated 2020-07-29 07:52:43 UTC)

Description David Rosenfeld 2020-04-28 20:42:59 UTC
Description of problem: Introspection is failing after a scale-up is performed. This is seen in the logs when introspection runs:

STDOUT:

Waiting for introspection to finish...
Introspection of node completed:82c00b53-c17b-467e-864d-47d57c4b38eb. Status:FAILED. Errors:Failed to set boot device to PXE: Timed out waiting for a reply to message ID e38a3a305549458e9e588f4a0d7a5d17 (HTTP 500)
Retrying 1 nodes that failed introspection. Attempt 1 of 3 
Introspection of node completed:82c00b53-c17b-467e-864d-47d57c4b38eb. Status:FAILED. Errors:Failed to set boot device to PXE: Node 82c00b53-c17b-467e-864d-47d57c4b38eb is locked by host undercloud-0.redhat.local, please retry after the current operation is completed. (HTTP 409)
Retrying 1 nodes that failed introspection. Attempt 2 of 3 
Introspection of node completed:82c00b53-c17b-467e-864d-47d57c4b38eb. Status:FAILED. Errors:Failed to set boot device to PXE: Node 82c00b53-c17b-467e-864d-47d57c4b38eb is locked by host undercloud-0.redhat.local, please retry after the current operation is completed. (HTTP 409)
Retrying 1 nodes that failed introspection. Attempt 3 of 3 
Introspection of node completed:82c00b53-c17b-467e-864d-47d57c4b38eb. Status:FAILED. Errors:Failed to set boot device to PXE: Timed out waiting for a reply to message ID 020dbfdd32484ef9a91b3456998a856c (HTTP 500)
Retry limit reached with 1 nodes still failing introspection


STDERR:

Waiting for messages on queue 'tripleo' with no timeout.
Introspection completed with errors: Retry limit reached with 1 nodes still failing introspection


Version-Release number of selected component (if applicable): RHOS-16.1-RHEL-8-20200424.n.0


How reproducible: Every time one of the df-rfe scale-up jobs is performed, e.g. DFG-df-rfe-16.1-virsh-3cont_3comp_3ceph-blacklist-1compute-1control-scaleup


Steps to Reproduce: 
1. Execute jenkins job: DFG-df-rfe-16.1-virsh-3cont_3comp_3ceph-blacklist-1compute-1control-scaleup

Actual results: Introspection fails with the error seen above.


Expected results: Introspection is successful.


Additional info:

Comment 2 Bob Fournier 2020-05-01 15:35:06 UTC
I'm seeing the same issue. I imported a new node using ipmitool after I had already deployed a stack. When I attempted to introspect it, I immediately got:
$ openstack overcloud node introspect hp-dl360-g9-02 --provide
Waiting for introspection to finish...
Waiting for messages on queue 'tripleo' with no timeout.
Introspection of node completed:a574065d-febc-47fd-8990-8a5aeaeddbe2. Status:FAILED. Errors:Failed to set boot device to PXE: Gateway Timeout (HTTP 504)
Retrying 1 nodes that failed introspection. Attempt 1 of 3 
Introspection of node completed:a574065d-febc-47fd-8990-8a5aeaeddbe2. Status:FAILED. Errors:Failed to set boot device to PXE: Node a574065d-febc-47fd-8990-8a5aeaeddbe2 is locked by host hardprov-dl360-g9-01.snedlab.lab.eng.rdu2.redhat.com, please retry after the current operation is completed. (HTTP 409)
Retrying 1 nodes that failed introspection. Attempt 2 of 3 
Introspection of node completed:a574065d-febc-47fd-8990-8a5aeaeddbe2. Status:FAILED. Errors:Failed to set boot device to PXE: Node a574065d-febc-47fd-8990-8a5aeaeddbe2 is locked by host hardprov-dl360-g9-01.snedlab.lab.eng.rdu2.redhat.com, please retry after the current operation is completed. (HTTP 409)
Retrying 1 nodes that failed introspection. Attempt 3 of 3 
Introspection of node completed:a574065d-febc-47fd-8990-8a5aeaeddbe2. Status:FAILED. Errors:Failed to set boot device to PXE: Gateway Timeout (HTTP 504)
Retry limit reached with 1 nodes still failing introspection
Introspection completed with errors: Retry limit reached with 1 nodes still failing introspection

ironic_inspector.log
containers/ironic-inspector/ironic-inspector.log:2020-05-01 10:30:16.732 7 INFO ironic_inspector.introspect [-] [node: a574065d-febc-47fd-8990-8a5aeaeddbe2 state starting] The following attributes will be used for look up: {'bmc_address': ['10.9.103.29'], 'mac': ['48:df:37:12:0e:14']}
containers/ironic-inspector/ironic-inspector.log:2020-05-01 10:30:26.724 7 INFO eventlet.wsgi.server [req-ea534936-545a-4e52-ac3c-191466d69ba4 8b55a2650e584ea5811c51ec24e33a0d bec9c7aab90547aeb255ab9ef7564ecd - default default] fe32:dead:beef::2,fe32:dead:beef::4 "GET /v1/introspection/a574065d-febc-47fd-8990-8a5aeaeddbe2 HTTP/1.1" status: 200  len: 492 time: 0.0066347
containers/ironic-inspector/ironic-inspector.log:2020-05-01 10:31:40.945 7 ERROR ironic_inspector.utils [-] [node: a574065d-febc-47fd-8990-8a5aeaeddbe2 state starting] Failed to set boot device to PXE: Timed out waiting for a reply to message ID f665caeb4bc34df2bd5be342f7a22703 (HTTP 500): ironicclient.common.apiclient.exceptions.InternalServerError: Timed out waiting for a reply to message ID f665caeb4bc34df2bd5be342f7a22703 (HTTP 500)
containers/ironic-inspector/ironic-inspector.log:2020-05-01 10:31:40.946 7 ERROR ironic_inspector.node_cache [-] [node: a574065d-febc-47fd-8990-8a5aeaeddbe2 state starting] Processing the error event because of an exception <class 'ironic_inspector.utils.Error'>: Failed to set boot device to PXE: Timed out waiting for a reply to message ID f665caeb4bc34df2bd5be342f7a22703 (HTTP 500) raised by ironic_inspector.introspect._background_introspect: ironic_inspector.utils.Error: Failed to set boot device to PXE: Timed out waiting for a reply to message ID f665caeb4bc34df2bd5be342f7a22703 (HTTP 500)

I do see calls logged in ironic, but not at the same timestamp as above:
containers/ironic/ironic-conductor.log:2020-05-01 10:30:40.943 7 DEBUG ironic.conductor.manager [req-dd4537be-0acf-4557-bea4-2bd4e4ccd79b 7eac58e2c3ac44169b007f2d339589e0 9de13d993d53445aaff2a73a8b49e832 - default default] RPC set_boot_device called for node a574065d-febc-47fd-8990-8a5aeaeddbe2 with device pxe set_boot_device /usr/lib/python3.6/site-packages/ironic/conductor/manager.py:3049


Not sure if this is related to the openstacksdk change (https://review.opendev.org/#/c/672179/) but that's the last change in inspector to use set_node_boot_device - https://github.com/openstack/ironic-inspector/blob/master/ironic_inspector/introspect.py#L118.
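For reference, the call that inspector is failing on can be exercised directly from the undercloud with the OSC baremetal plugin (node UUID taken from the log above; stackrc location assumed to be the usual one):

$ source ~/stackrc
# Same operation inspector performs before powering the node on for introspection:
# ask ironic to set the one-time boot device to PXE.
$ openstack baremetal node boot device set a574065d-febc-47fd-8990-8a5aeaeddbe2 pxe

When the conductor is stuck in a slow ipmitool run, this should come back with the same 500/504 after roughly the RPC timeout.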

Comment 3 Bob Fournier 2020-05-01 17:29:43 UTC
I've also seen the same issue with a 500 error:
2020-05-01 13:16:42.169 7 ERROR ironic_inspector.utils [-] [node: c87251c5-7804-473e-bea9-0f0772d1d870 state starting] Failed to set boot device to PXE: Timed out waiting for a reply to message ID 7366b9d106254a26bb929aeb4943fa21 (HTTP 500): 

So we're getting either 500 or 504 errors when attempting the set_boot_device command.

Comment 4 Bob Fournier 2020-05-01 18:33:14 UTC
Seeing this in ironic/app.log for the particular message that is timing out:

2020-05-01 13:16:42.167 27 ERROR wsme.api [req-2905f9ce-be9e-491e-89fc-d5e2aee09a41 7eac58e2c3ac44169b007f2d339589e0 9de13d993d53445aaff2a73a8b49e832 - default default] Server-side error: "Timed out waiting for a reply to message ID 7366b9d106254a26bb929aeb4943fa21". Detail:
Traceback (most recent call last):

  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 397, in get
    return self._queues[msg_id].get(block=True, timeout=timeout)

  File "/usr/lib64/python3.6/queue.py", line 172, in get
    raise Empty

queue.Empty


During handling of the above exception, another exception occurred:


Traceback (most recent call last):

  File "/usr/lib/python3.6/site-packages/wsmeext/pecan.py", line 85, in callfunction
    result = f(self, *args, **kwargs)

  File "/usr/lib/python3.6/site-packages/ironic/api/controllers/v1/node.py", line 233, in put
    topic=topic)

  File "/usr/lib/python3.6/site-packages/ironic/conductor/rpcapi.py", line 645, in set_boot_device
    device=device, persistent=persistent)

  File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 181, in call
    transport_options=self.transport_options)

  File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 129, in _send
    transport_options=transport_options)

  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 646, in send
    transport_options=transport_options)

  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 634, in _send
    call_monitor_timeout)

  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 523, in wait
    message = self.waiters.get(msg_id, timeout=timeout)

  File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 401, in get
    'to message ID %s' % msg_id)

oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 7366b9d106254a26bb929aeb4943fa21
: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 7366b9d106254a26bb929aeb4943fa21
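For context, the wait in amqpdriver.py above is bounded by oslo.messaging's rpc_response_timeout (default 60 seconds), so any conductor-side operation that blocks longer than that will surface this MessagingTimeout in ironic-api. A quick way to see what the deployment actually uses (container name and config path assume the standard TripleO layout on the undercloud):

$ sudo podman exec ironic_api grep -n rpc_response_timeout /etc/ironic/ironic.conf
# No match means the oslo.messaging default of 60 seconds is in effect.

Raising that timeout would only paper over the slow BMC calls, so it is a diagnostic data point rather than a fix.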

Comment 5 Bob Fournier 2020-05-04 15:30:00 UTC
So it looks like this is a similar issue to https://bugzilla.redhat.com/show_bug.cgi?id=1813889: ipmitool commands are taking up to 2 minutes, which causes the inspector timeout.

Seeing lots of:
2020-05-04 11:16:53.205 7 DEBUG ironic.common.utils [req-eb49faaa-94bd-4f0e-badd-064272ba1ebc - - - - -] Command stderr is: "Unable to Get Channel Cipher Suites
" execute /usr/lib/python3.6/site-packages/ironic/common/utils.py:79

It's not clear why this only occurs on scale up. The first time a set of nodes is introspected and the stack deployed, it works fine. It's only when bringing in another node afterwards that we get these ipmitool timeouts. I am seeing this in a baremetal setup; I believe the initial report of this bug is from a virtual setup.
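One way to confirm the slow BMC path outside of ironic is to time a direct ipmitool call against the same BMC (address taken from the inspector lookup log in comment 2; interface and credentials here are placeholders):

$ time ipmitool -I lanplus -H 10.9.103.29 -U <user> -P <password> chassis power status
# On the affected setup this is expected to print "Unable to Get Channel Cipher
# Suites" and take on the order of 2 minutes; -v gives more detail on the
# session setup if needed.

A delay of that length on every BMC call is more than enough to trip the API-side RPC timeout seen in comment 4.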

Comment 6 Bob Fournier 2020-05-04 15:35:46 UTC
Versions used by ironic_conductor

(undercloud) [stack@hardprov-dl360-g9-01 log]$ sudo podman exec -it ironic_conductor /bin/bash
()[ironic@hardprov-dl360-g9-01 /]$ rpm -qa | grep ipmi
ipmitool-1.8.18-14.el8.x86_64
()[ironic@hardprov-dl360-g9-01 /]$ rpm -qa | grep pyg
python3-pyghmi-1.0.22-3.el8ost.noarch

Comment 9 David Rosenfeld 2020-05-05 13:43:14 UTC
I was seeing the introspection error in multiple scale-up jobs. However, with the latest build (RHOS-16.1-RHEL-8-20200428.n.0) those same jobs pass without the introspection error. With that build I still see the introspection error in the replace-controller stage of this job:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/df/view/controller_replacement/job/DFG-df-controller_replacement-16.1-virthost-3cont_3comp_3ceph-yes_UC_SSL-yes_OC_SSL-ceph-ipv4-geneve-replace_controller-RHELOSP-31864/4/

I searched all my logs and don't see pyghmi being installed.

Comment 10 Bob Fournier 2020-05-05 14:07:23 UTC
Thanks David - I verified in the logs that you are seeing the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1813889, which is caused by the ipmitool commands taking 2 minutes to complete because of "Unable to Get Channel Cipher Suites":

2020-04-30 15:56:21.645 7 DEBUG ironic.common.utils [req-d964b773-c1df-474e-b9d0-fffdcb912b08 0272b8c63c7c41b5ac5b3b613a268498 2b656bb559154893ba9c3fe97122a1ab - default default] Command stderr is: "Unable to Get Channel Cipher Suites
" execute /usr/lib/python3.6/site-packages/ironic/common/utils.py:79

This was resolved with the latest pyghmi in https://bugzilla.redhat.com/show_bug.cgi?id=1813889; from the Trello card associated with that bug [1]:
"update: the issue with slow ipmitool is gone from OSP CI as we use the always the newest pyghmi with vbmc (from pip), as of this moment it's pyghmi-1.5.13"

Is it possible that in the replace controller stage it is using an older version of pyghmi which needs to be updated via pip?

[1] https://trello.com/c/WVFOGaUG/1393-cixbz1813889ospphase1osp161python-virtualbmcosp161-rhel82-ipmitool-commands-via-vbmc-virtualbmc-take-too-long-and-cause-overclou

Comment 11 David Rosenfeld 2020-05-05 21:01:59 UTC
The Infrared task [vbmc : Install package from pypi] in the replace-controller stage does create a virtual environment using Python 2.7.5:

created virtual environment CPython2.7.5.final.0-64 in 358ms
  creator CPython2Posix(dest=/root/.virtualenvs/vbmc, clear=False, global=True)
  seeder FromAppData(download=False, pip=latest, setuptools=latest, wheel=latest, via=copy, app_data_dir=/root/.local/share/virtualenv/seed-app-data/v1.0.1)
  activators PythonActivator,CShellActivator,FishActivator,PowerShellActivator,BashActivator

This is the pyghmi version installed in the virtual environment:

Collecting pyghmi===1.4.1
  Using cached pyghmi-1.4.1-py2.py3-none-any.whl (196 kB)

which is older than the version referenced above.
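For completeness, a way to check and, as a CI-side workaround, bump pyghmi inside that vbmc virtualenv on the virthost, using the path reported above (whether pyghmi >= 1.5.13 installs cleanly into a Python 2.7 virtualenv would need verifying):

$ /root/.virtualenvs/vbmc/bin/pip show pyghmi
# Reports 1.4.1 per the log above.
$ /root/.virtualenvs/vbmc/bin/pip install --upgrade 'pyghmi>=1.5.13'
# vbmcd has to be restarted afterwards so the upgraded library is actually used;
# how to restart it depends on how Infrared launched it.

This would only patch the test environment; it doesn't change what the product ships.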

Comment 13 Bob Fournier 2020-06-01 16:54:24 UTC
David - we have a fix in Ironic that gets past the issue introduced in ipmitool. Please retest with the latest compose. Marking as MODIFIED.

Comment 17 David Rosenfeld 2020-06-08 13:55:53 UTC
Yes, the scale-up jobs now pass with no introspection error. Logs from a passing run using build RHOS-16.1-RHEL-8-20200604.n.1:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/df/view/rfe/job/DFG-df-rfe-16.1-virsh-3cont_3comp_3ceph-blacklist-1compute-1control-scaleup/15/

Comment 20 Alex McLeod 2020-06-16 12:31:03 UTC
If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text.

If this bug does not require doc text, please set the 'requires_doc_text' flag to '-'.

Comment 22 errata-xmlrpc 2020-07-29 07:52:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3148

