Bug 1831893

Summary: Baremetal nodes with HP BMCs fail introspection due to ipmitool timeout
Product: Red Hat OpenStack Reporter: Bob Fournier <bfournie>
Component: openstack-ironic Assignee: RHOS Maint <rhos-maint>
Status: CLOSED ERRATA QA Contact: Alistair Tonner <atonner>
Severity: high Docs Contact:
Priority: high    
Version: 16.1 (Train) CC: amcleod, bfournie, cpaquin, dtantsur, gcheresh, imelofer, itbrown, jschluet, mburns, racedoro, rhos-maint, rpittau, sclewis, slinaber
Target Milestone: beta Keywords: TestBlockerForLayeredProduct, Triaged
Target Release: 16.1 (Train on RHEL 8.2)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-ironic-13.0.4-0.20200529150915.911bc51.el8ost Doc Type: Bug Fix
Doc Text:
A regression introduced in ipmitool-1.8.18-11 caused IPMI access to take over two minutes for certain BMCs that do not support the "Get Channel Cipher Suites" command. As a result, introspection could fail and deployments could take much longer than before. + With this update, ipmitool retries are handled differently, introspection passes, and deployments succeed. + [NOTE] This issue with ipmitool is resolved in ipmitool-1.8.18-17.
Story Points: ---
Clone Of:
: 1849038 Environment:
Last Closed: 2020-07-29 07:52:21 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1831158    
Bug Blocks: 1849038    

Description Bob Fournier 2020-05-05 20:29:05 UTC
Description of problem:

This is an OSP bug to track the ipmitool bug - https://bugzilla.redhat.com/show_bug.cgi?id=1831158

With the version of ipmitool that is used in RHEL 8.2, we are getting introspection failures when using HP iLO BMCs.

This was seen on an HP ProLiant DL360 Gen9.  Introspection fails with:

openstack overcloud node introspect hp-dl360-g9-02 --provide
Waiting for introspection to finish...
Waiting for messages on queue 'tripleo' with no timeout.
Introspection of node completed:ced799b5-6619-44db-90cd-71c3955e3043. Status:FAILED. Errors:Failed to set boot device to PXE: Timed out waiting for a reply to message ID a3c7ab7325004808b4ae6411dce0f2db (HTTP 500)
Retrying 1 nodes that failed introspection. Attempt 1 of 3 
Introspection of node completed:ced799b5-6619-44db-90cd-71c3955e3043. Status:FAILED. Errors:Failed to set boot device to PXE: Timed out waiting for a reply to message ID ff20e76a05d444eabecf80031e9a518d (HTTP 500)
Retrying 1 nodes that failed introspection. Attempt 2 of 3 
Introspection of node completed:ced799b5-6619-44db-90cd-71c3955e3043. Status:FAILED. Errors:Failed to set boot device to PXE: Timed out waiting for a reply to message ID adcc70c4f43d4d06818e31718f1882e2 (HTTP 500)
Retrying 1 nodes that failed introspection. Attempt 3 of 3 
Introspection of node completed:ced799b5-6619-44db-90cd-71c3955e3043. Status:FAILED. Errors:Failed to set boot device to PXE: Timed out waiting for a reply to message ID 5c641cd7bf5847fb8d643ef4ad120243 (HTTP 500)
Retry limit reached with 1 nodes still failing introspection


In the logs we see:
containers/ironic/ironic-conductor.log.1:2020-05-04 23:55:41.385 7 DEBUG ironic.common.utils [req-eb49faaa-94bd-4f0e-badd-064272ba1ebc - - - - -] Command stderr is: "Unable to Get Channel Cipher Suites
containers/ironic/ironic-conductor.log.1:2020-05-04 23:57:52.657 7 DEBUG ironic.common.utils [req-eb49faaa-94bd-4f0e-badd-064272ba1ebc - - - - -] Command stderr is: "Unable to Get Channel Cipher Suites
containers/ironic/ironic-conductor.log.1:2020-05-05 00:00:03.935 7 DEBUG ironic.common.utils [req-eb49faaa-94bd-4f0e-badd-064272ba1ebc - - - - -] Command stderr is: "Unable to Get Channel Cipher Suites
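
The spacing between the three stderr entries above can be worked out from their timestamps. The following is a minimal Python sketch, assuming the log lines are available as strings; the helper name `stderr_intervals` is hypothetical and not part of ironic, and the request ID in each line is abbreviated to `[...]` for brevity.

```python
import re
from datetime import datetime

# The three conductor log entries quoted above (request ID abbreviated).
LOG_LINES = [
    'containers/ironic/ironic-conductor.log.1:2020-05-04 23:55:41.385 7 DEBUG '
    'ironic.common.utils [...] Command stderr is: "Unable to Get Channel Cipher Suites',
    'containers/ironic/ironic-conductor.log.1:2020-05-04 23:57:52.657 7 DEBUG '
    'ironic.common.utils [...] Command stderr is: "Unable to Get Channel Cipher Suites',
    'containers/ironic/ironic-conductor.log.1:2020-05-05 00:00:03.935 7 DEBUG '
    'ironic.common.utils [...] Command stderr is: "Unable to Get Channel Cipher Suites',
]

# Matches the "YYYY-MM-DD HH:MM:SS.mmm" timestamp format used in these entries.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}")
MARKER = "Unable to Get Channel Cipher Suites"


def stderr_intervals(lines):
    """Return the seconds elapsed between consecutive MARKER entries."""
    stamps = [
        datetime.strptime(TIMESTAMP.search(line).group(0), "%Y-%m-%d %H:%M:%S.%f")
        for line in lines
        if MARKER in line
    ]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


print(stderr_intervals(LOG_LINES))  # [131.272, 131.278]
```

The entries are about 131 seconds apart, close to the 2m6s measured for the manual ipmitool run.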

Running the ipmitool command manually takes just over 2 minutes to complete:

()[ironic@hardprov-dl360-g9-01 /]$ time ipmitool -I lanplus -H 10.9.103.29 -U DMINISTRATOR -P XXX -v -R 12 -N 5 chassis status
...
real	2m6.271s
user	0m0.002s
sys	0m0.004s


This issue was also seen with vbmc, where it was resolved with a new version of pyghmi in https://bugzilla.redhat.com/show_bug.cgi?id=1813889. However, pyghmi is not used for baremetal BMC access, so that fix does not help here.

Version-Release number of selected component (if applicable):

HP ProLiant DL360 Gen9 - iLO versions 2.54 (Jun 15 2017) and 2.60 (latest available, May 23 2018)

ipmitool-1.8.18-14.el8.x86_64


How reproducible:

Happens every time with this BMC.  It works fine with Dell systems that have been tested.

Comment 4 Bob Fournier 2020-06-08 14:37:58 UTC
Package is in compose RHOS-16.1-RHEL-8-20200604.n.1.

Comment 5 Bob Fournier 2020-06-09 17:41:01 UTC
Verified that we no longer get a two-minute response from ipmitool due to the Cipher Suites issue.  ipmitool commands are now issued with "-R 1 -N 1", and retries are done by ironic.
Running cmd (subprocess): ipmitool -I lanplus -H 172.16.0.28 -L ADMINISTRATOR -p 6230 -U admin -R 1 -N 1 -f /tmp/tmpebyzf379

ipmi.use_ipmitool_retries      = False
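
The approach verified in this comment (ipmitool invoked with "-R 1 -N 1" while the caller handles retries) can be illustrated with a short Python sketch. This is not ironic's implementation: the helper names `build_ipmitool_argv` and `run_with_retries`, the attempt count, and the delay are assumptions chosen for illustration. Only the ipmitool flags shown in the log line above are used.

```python
import subprocess
import time


def build_ipmitool_argv(host, user, password_file, command):
    """Build an ipmitool command line with ipmitool's own retries minimized."""
    return [
        "ipmitool", "-I", "lanplus", "-H", host, "-U", user,
        "-R", "1", "-N", "1",  # same values as in the verified log line
        "-f", password_file,   # read the password from a file
    ] + list(command)


def run_with_retries(argv, attempts=3, delay=1.0,
                     runner=subprocess.run, sleep=time.sleep):
    """Retry a failing command in the caller (hypothetical helper).

    `runner` and `sleep` are injectable so the logic can be exercised
    without a real BMC.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return runner(argv, check=True, capture_output=True, text=True)
        except subprocess.CalledProcessError as exc:
            last_error = exc
            if attempt < attempts:
                sleep(delay)  # back off before the next attempt
    raise last_error
```

Because each ipmitool invocation now returns quickly, a BMC that rejects a request costs the caller seconds per attempt rather than minutes, and the retry policy stays under the caller's control.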

Comment 7 Alex McLeod 2020-06-16 12:30:57 UTC
If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text.

If this bug does not require doc text, please set the 'requires_doc_text' flag to '-'.

Comment 10 errata-xmlrpc 2020-07-29 07:52:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3148