Bug 1802307 - Introspection fails to transition from waiting to error
Summary: Introspection fails to transition from waiting to error
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-hardware
Version: 16.0 (Train)
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: z3
: 16.1 (Train on RHEL 8.2)
Assignee: Steve Baker
QA Contact: Paras Babbar
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-02-12 20:52 UTC by Kevin Jones
Modified: 2020-12-15 18:36 UTC (History)
10 users (show)

Fixed In Version: python-hardware-0.23.0-0.20200310212453.59211cc.el8ost
Doc Type: No Doc Update
Doc Text:
-
Clone Of:
: 1888446 (view as bug list)
Environment:
Last Closed: 2020-12-15 18:35:44 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
inspection stdout logs from undercloud (531.42 KB, application/gzip)
2020-02-12 20:52 UTC, Kevin Jones
no flags Details
Message in /var/log/messages on overcloud node during introspection (46.47 KB, image/png)
2020-02-12 22:00 UTC, Kevin Jones
no flags Details
Full messages file from node (158.05 KB, text/plain)
2020-02-12 22:19 UTC, Kevin Jones
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github redhat-cip hardware pull 133 0 None closed Handle check_output exception for rtc 2020-12-05 22:41:20 UTC
Red Hat Product Errata RHEA-2020:5413 0 None None None 2020-12-15 18:36:45 UTC

Description Kevin Jones 2020-02-12 20:52:56 UTC
Created attachment 1662770 [details]
inspection stdout logs from undercloud

Description of problem:
Trying to introspect 4 ASRock Rack boards using RHOSP 16.0. Systems boot into inspector ramdisk. However, they error out on the hardware detect utility. This throws an error on the undercloud that says:

2020-02-12T15:28:56.248664667-05:00 stderr F 2020-02-12 15:28:56.245 7 ERROR ironic_inspector.utils [-] [node: 868edfff-3526-4d29-a329-18d94215bf77 state waiting] Can not transition from state 'waiting' on event 'error' (no defined transition): automaton.exceptions.NotFound: Can not transition from state 'waiting' on event 'error' (no defined transition)

Version-Release number of selected component (if applicable):
16.0

How reproducible:
100% in my environment

Steps to Reproduce:
1. Import node into ironic on undercloud
2. Start introspection on node
3. Watch stdout logs for introspection

Actual results:
Node stays in waiting state and just continues to loop over introspection process

Expected results:
Introspection should fail and transition to error state


Additional info:

Comment 1 Bob Fournier 2020-02-12 21:42:25 UTC
Hi Kevin,

When you say "they error out on the hardware detect utility" - was that seen on the node's console during introspection?  Are you able to capture the console and provide it here?

Comment 2 Kevin Jones 2020-02-12 22:00:10 UTC
Created attachment 1662800 [details]
Message in /var/log/messages on overcloud node during introspection

Here is an image. I'll try and pull the messages log off the node and attach.

Comment 3 Kevin Jones 2020-02-12 22:19:43 UTC
Created attachment 1662812 [details]
Full messages file from node

I was able to pull the entire messages log that captures from start of introspection to the failure. Let me know if you need anything more.

Comment 4 Bob Fournier 2020-02-12 22:34:09 UTC
Thanks Kevin.  Not quite sure what is going on but it may help to remove the extra-hardware collector so that hardware-detect doesn't run.
To do that edit /var/lib/ironic/httpboot/inspector.ipxe and remove the string "extra-hardware", and rerun introspection.

It looks like this is the error:
Feb 12 17:14:42 localhost ironic-python-agent[1446]: 2020-02-12 17:14:42.622 1446 ERROR ironic_python_agent.utils [-] failed to run hardware-detect utility: Unexpected error while running command.#012Command: hardware-detect#012Exit code: 1#012Stdout: ''#012Stderr: '/bin/sh: cli64: command not found\nconnect: Connection refused\nFailed to connect to lldpad - clif_open: Invalid argument\nconnect: Connection refused\nFailed to connect to lldpad - clif_open: Invalid argument\nconnect: Connection refused\nFailed to connect to lldpad - clif_open: Invalid argument\nconnect: Connection refused\nFailed to connect to lldpad - clif_open: Invalid argument\n/bin/sh: lsb_release: command not found\n/bin/sh: lsb_release: command not found\nmodprobe: FATAL: Module ipmi_smb not found in directory /lib/modules/4.18.0-147.5.1.el8_1.x86_64\n/bin/sh: ddr-timings-x86_64: command not found\nFailed to query server: The name org.freedesktop.timedate1 was not provided by any .service files\nInfo: detect_areca: No controller found\nCannot find megacli on the system\nReading SMART for nvme\nread_smart_nvme: Reading S.M.A.R.T information on /dev/nvme0n1\nread_smart: Reading S.M.A.R.T information on /dev/sdd\nread_smart_ata: Found S.M.A.R.T information on /dev/sdd\nread_smart: Reading S.M.A.R.T information on /dev/sdb\nread_smart_ata: Found S.M.A.R.T information on /dev/sdb\nread_smart: Reading S.M.A.R.T information on /dev/sdc\nread_smart_ata: Found S.M.A.R.T information on /dev/sdc\nread_smart: Reading S.M.A.R.T information on /dev/sda\nread_smart_ata: Found S.M.A.R.T information on /dev/sda\nInfo: Probing ipmi_si failed\nInfo: Probing ipmi_devintf failed\nInfo: No Infiniband device found\nReporting DDR Timings\nTraceback (most recent call last):\n  File "/usr/bin/hardware-detect", line 10, in <module>\n    sys.exit(main())\n  File "/usr/lib/python3.6/site-packages/hardware/detect.py", line 972, in main\n    detect_rtc_clock(hrdw)\n  File "/usr/lib/python3.6/site-packages/hardware/detect.py", line 867, in detect_rtc_clock\n    hw_lst.append((\'system\', \'rtc\', \'utc\', rtc.get_rtc()))\n  File "/usr/lib/python3.6/site-packages/hardware/rtc.py", line 30, in get_rtc\n    stdout = check_output(cmd).decode("utf-8")\n  File "/usr/lib64/python3.6/subprocess.py", line 356, in check_output\n    **kwargs).stdout\n  File "/usr/lib64/python3.6/subprocess.py", line 438, in run\n    output=stdout, stderr=stderr)\nsubprocess.CalledProcessError: Command \'[

Comment 5 Kevin Jones 2020-02-13 07:19:51 UTC
Yep, that works. I remove extra-hardware from inspector.ipxe and this allows introspection to complete. Thanks so much for finding that.

It looks like inspection_extras is enabled by default in 16 undercloud.conf sample.

# Whether to enable extra hardware collection during the inspection
# process. Requires python-hardware or python-hardware-detect package
# on the introspection image. (boolean value)
#inspection_extras = true

Comment 6 Bob Fournier 2020-02-13 13:07:52 UTC
Thanks Kevin, from that output looks like a problem reading the real time clock - https://github.com/redhat-cip/hardware/blob/master/hardware/rtc.py#L30, which causes a CalledProcessError which is not being handled properly.

Traceback (most recent call last):\n
File "/usr/bin/hardware-detect", line 10, in <module>\n
sys.exit(main())\n
File "/usr/lib/python3.6/site-packages/hardware/detect.py", line 972, in main\n
detect_rtc_clock(hrdw)\n
File "/usr/lib/python3.6/site-packages/hardware/detect.py", line 867, in detect_rtc_clock\n 
hw_lst.append((\'system\', \'rtc\', \'utc\', rtc.get_rtc()))\n
File "/usr/lib/python3.6/site-packages/hardware/rtc.py", line 30, in get_rtc\n 
stdout = check_output(cmd).decode("utf-8")\n 
File "/usr/lib64/python3.6/subprocess.py", line 356, in check_output\n  
**kwargs).stdout\n  File "/usr/lib64/python3.6/subprocess.py", line 438, in run\n
output=stdout, stderr=stderr)\nsubprocess.CalledProcessError: Command \'[

I think get_rtc() needs to be changed to handle CalledProcessError the same as https://github.com/redhat-cip/hardware/blob/master/hardware/detect_utils.py#L29.

Comment 7 Bob Fournier 2020-02-19 15:16:02 UTC
Fix merged upstream.

Comment 8 Bob Fournier 2020-06-16 21:27:36 UTC
Fix is in python-hardware-0.25.1-0.20200615141931.b36b733.el8ost, which is not currently being picked up for OSP-16.1.

Comment 11 Bob Fournier 2020-09-22 18:16:02 UTC
Note that the build in FIV is actually tagged for 17, not for 16. Moving this back to POST.

Comment 12 Steve Baker 2020-10-14 00:52:16 UTC
I think the following needs to happen for this change to get into 16.x from upstream:

https://github.com/redhat-openstack/rdoinfo/blob/master/tags/train.yml#L190
  increment train hardware version to 0.25

https://github.com/rdo-packages/hardware-distgit
  backport the following packaging changes to train-rdo:
  - https://review.rdoproject.org/r/#/c/25617/
  - https://review.rdoproject.org/r/#/c/24578/

Comment 13 Steve Baker 2020-10-14 00:55:28 UTC
Actually the hardware-distgit changes are needed for bug #1774254, not this one

Comment 27 errata-xmlrpc 2020-12-15 18:35:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.3 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:5413


Note You need to log in before you can comment on or make changes to this bug.