Created attachment 1662770 [details] inspection stdout logs from undercloud Description of problem: Trying to introspect 4 ASRock Rack boards using RHOSP 16.0. Systems boot into inspector ramdisk. However, they error out on the hardware detect utility. This throws an error on the undercloud that says: 2020-02-12T15:28:56.248664667-05:00 stderr F 2020-02-12 15:28:56.245 7 ERROR ironic_inspector.utils [-] [node: 868edfff-3526-4d29-a329-18d94215bf77 state waiting] Can not transition from state 'waiting' on event 'error' (no defined transition): automaton.exceptions.NotFound: Can not transition from state 'waiting' on event 'error' (no defined transition) Version-Release number of selected component (if applicable): 16.0 How reproducible: 100% in my environment Steps to Reproduce: 1. Import node into ironic on undercloud 2. Start introspection on node 3. Watch stdout logs for introspection Actual results: Node stays in waiting state and just continues to loop over introspection process Expected results: Introspection should fail and transition to error state Additional info:
Hi Kevin, When you say "they error out on the hardware detect utility" - was that seen on the node's console during introspection? Are you able to capture the console and provide it here?
Created attachment 1662800 [details] Message in /var/log/messages on overcloud node during introspection Here is an image. I'll try and pull the messages log off the node and attach.
Created attachment 1662812 [details] Full messages file from node I was able to pull the entire messages log that captures from start of introspection to the failure. Let me know if you need anything more.
Thanks Kevin. Not quite sure what is going on but it may help to remove the extra-hardware collector so that hardware-detect doesn't run. To do that edit /var/lib/ironic/httpboot/inspector.ipxe and remove the string "extra-hardware", and rerun introspection. It looks like this is the error: Feb 12 17:14:42 localhost ironic-python-agent[1446]: 2020-02-12 17:14:42.622 1446 ERROR ironic_python_agent.utils [-] failed to run hardware-detect utility: Unexpected error while running command.#012Command: hardware-detect#012Exit code: 1#012Stdout: ''#012Stderr: '/bin/sh: cli64: command not found\nconnect: Connection refused\nFailed to connect to lldpad - clif_open: Invalid argument\nconnect: Connection refused\nFailed to connect to lldpad - clif_open: Invalid argument\nconnect: Connection refused\nFailed to connect to lldpad - clif_open: Invalid argument\nconnect: Connection refused\nFailed to connect to lldpad - clif_open: Invalid argument\n/bin/sh: lsb_release: command not found\n/bin/sh: lsb_release: command not found\nmodprobe: FATAL: Module ipmi_smb not found in directory /lib/modules/4.18.0-147.5.1.el8_1.x86_64\n/bin/sh: ddr-timings-x86_64: command not found\nFailed to query server: The name org.freedesktop.timedate1 was not provided by any .service files\nInfo: detect_areca: No controller found\nCannot find megacli on the system\nReading SMART for nvme\nread_smart_nvme: Reading S.M.A.R.T information on /dev/nvme0n1\nread_smart: Reading S.M.A.R.T information on /dev/sdd\nread_smart_ata: Found S.M.A.R.T information on /dev/sdd\nread_smart: Reading S.M.A.R.T information on /dev/sdb\nread_smart_ata: Found S.M.A.R.T information on /dev/sdb\nread_smart: Reading S.M.A.R.T information on /dev/sdc\nread_smart_ata: Found S.M.A.R.T information on /dev/sdc\nread_smart: Reading S.M.A.R.T information on /dev/sda\nread_smart_ata: Found S.M.A.R.T information on /dev/sda\nInfo: Probing ipmi_si failed\nInfo: Probing ipmi_devintf failed\nInfo: No Infiniband device found\nReporting DDR Timings\nTraceback (most recent call last):\n File "/usr/bin/hardware-detect", line 10, in <module>\n sys.exit(main())\n File "/usr/lib/python3.6/site-packages/hardware/detect.py", line 972, in main\n detect_rtc_clock(hrdw)\n File "/usr/lib/python3.6/site-packages/hardware/detect.py", line 867, in detect_rtc_clock\n hw_lst.append((\'system\', \'rtc\', \'utc\', rtc.get_rtc()))\n File "/usr/lib/python3.6/site-packages/hardware/rtc.py", line 30, in get_rtc\n stdout = check_output(cmd).decode("utf-8")\n File "/usr/lib64/python3.6/subprocess.py", line 356, in check_output\n **kwargs).stdout\n File "/usr/lib64/python3.6/subprocess.py", line 438, in run\n output=stdout, stderr=stderr)\nsubprocess.CalledProcessError: Command \'[
Yep, that works. I remove extra-hardware from inspector.ipxe and this allows introspection to complete. Thanks so much for finding that. It looks like inspection_extras is enabled by default in 16 undercloud.conf sample. # Whether to enable extra hardware collection during the inspection # process. Requires python-hardware or python-hardware-detect package # on the introspection image. (boolean value) #inspection_extras = true
Thanks Kevin, from that output looks like a problem reading the real time clock - https://github.com/redhat-cip/hardware/blob/master/hardware/rtc.py#L30, which causes a CalledProcessError which is not being handled properly. Traceback (most recent call last):\n File "/usr/bin/hardware-detect", line 10, in <module>\n sys.exit(main())\n File "/usr/lib/python3.6/site-packages/hardware/detect.py", line 972, in main\n detect_rtc_clock(hrdw)\n File "/usr/lib/python3.6/site-packages/hardware/detect.py", line 867, in detect_rtc_clock\n hw_lst.append((\'system\', \'rtc\', \'utc\', rtc.get_rtc()))\n File "/usr/lib/python3.6/site-packages/hardware/rtc.py", line 30, in get_rtc\n stdout = check_output(cmd).decode("utf-8")\n File "/usr/lib64/python3.6/subprocess.py", line 356, in check_output\n **kwargs).stdout\n File "/usr/lib64/python3.6/subprocess.py", line 438, in run\n output=stdout, stderr=stderr)\nsubprocess.CalledProcessError: Command \'[ I think get_rtc() needs to be changed to handle CalledProcessError the same as https://github.com/redhat-cip/hardware/blob/master/hardware/detect_utils.py#L29.
Fix merged upstream.
Fix is in python-hardware-0.25.1-0.20200615141931.b36b733.el8ost, which is not currently being picked up for OSP-16.1.
Note that the build in FIV is actually tagged for 17, not for 16. Moving this back to POST.
I think the following needs to happen for this change to get into 16.x from upstream: https://github.com/redhat-openstack/rdoinfo/blob/master/tags/train.yml#L190 increment train hardware version to 0.25 https://github.com/rdo-packages/hardware-distgit backport the following packaging changes to train-rdo: - https://review.rdoproject.org/r/#/c/25617/ - https://review.rdoproject.org/r/#/c/24578/
Actually the hardware-distgit changes are needed for bug #1774254, not this one
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.3 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:5413