Bug 1813889
Summary: | osp16.1, rhel8.2: ipmitool commands via vbmc (virtualbmc) take too long and cause overcloud introspection to fail | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 8 | Reporter: | Waldemar Znoinski <wznoinsk> | ||||||
Component: | ipmitool | Assignee: | Vaclav Dolezal <vdolezal> | ||||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Kernel-QE - Hardware <kernel-qe-hw> | ||||||
Severity: | urgent | Docs Contact: | |||||||
Priority: | urgent | ||||||||
Version: | 8.2 | CC: | aschultz, bfournie, dciabrin, dtantsur, hjensas, jpretori, lmiccini, michele, tonyb | ||||||
Target Milestone: | rc | ||||||||
Target Release: | 8.0 | ||||||||
Hardware: | x86_64 | ||||||||
OS: | Linux | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | Doc Type: | If docs needed, set a value | |||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | Environment: | ||||||||
Last Closed: | 2020-03-19 10:23:53 UTC | Type: | Bug | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | |||||||||
Bug Blocks: | 1813468, 1814398 | ||||||||
Attachments: |
|
Description
Waldemar Znoinski
2020-03-16 12:21:26 UTC
Hi, can you provide access to the undercloud so we can look at logs? Any time I see unexpected delays, I suspect DNS. Could you check that the reverse DNS lookup / DNS lookup for unknown hosts doesn't hang for long? Additionally, I could not find vbmc logs in the undercloud tarball, could you provide them? Looks like same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1813468 Additionally, may be pasting the result of `ipmitool -v -v -v ...` would be helpful as well. Created attachment 1670542 [details]
messages (have some vbmc logs)
(In reply to Bob Fournier from comment #4) > Looks like same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1813468 It's two different BMC's. RHBZ#1813468 ^^ uses https://opendev.org/openstack/openstack-virtual-baremetal/src/branch/master/openstack_virtual_baremetal , while this bug is using https://opendev.org/openstack/virtualbmc. Created attachment 1670543 [details]
ipmitool vvverbose
@Bob - I'm working with Dmitry on the affected undercloud now @Ilia - I've attached output of ipmitool -v -v -v... More specifically, ipmitool seems to loop on repeated instances of
>> Sending IPMI command payload
>> netfn : 0x06
>> command : 0x54
>> data : 0x0e 0x00 0x80
BUILDING A v2 COMMAND
Local RqAddr 0x20 transit 0:0 target 0x20:0 bridgePossible 0
Nothing interesting happens in virtualbmc logs at the same time.
A curious fact: if I remove the retries, it returns MUCH faster: $ time ipmitool -I lanplus -H 172.16.0.95 -L ADMINISTRATOR -p 6232 -U admin -P password chassis status Unable to Get Channel Cipher Suites System Power : on Power Overload : false Power Interlock : inactive Main Power Fault : false Power Control Fault : false Power Restore Policy : always-off Last Power Event : Chassis Intrusion : inactive Front-Panel Lockout : inactive Drive Fault : false Cooling/Fan Fault : false real 0m10.506s user 0m0.003s sys 0m0.004s Thanks Waldemar, seeing lots of these log messages in the attached file. Mar 16 08:33:38 undercloud-0 podman[44739]: 2020-03-16 08:33:38.158446199 +0000 UTC m=+0.080653190 container create b4d6dceaa17242e222a571ef752a90b195592f95c6fd6e7de8aadee414ce2522 (image=undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-heat-api:16.1_20200311.2, name=heat_api) Mar 16 08:33:38 undercloud-0 systemd[1]: Reloading. Mar 16 08:33:38 undercloud-0 systemd[1]: machine.slice: Failed to set cpuset.cpus: Device or resource busy Mar 16 08:33:38 undercloud-0 systemd[1]: machine.slice: Failed to set cpuset.mems: Device or resource busy Mar 16 08:33:38 undercloud-0 systemd[1]: libpod-00e589f7d1e6aa3e2061c1daed2901b1bf05edb21f6907285e2f75f2f19cd911.scope: Failed to set cpuset.cpus: No space left on device Mar 16 08:33:38 undercloud-0 systemd[1]: libpod-00e589f7d1e6aa3e2061c1daed2901b1bf05edb21f6907285e2f75f2f19cd911.scope: Failed to set cpuset.mems: No space left on device Mar 16 08:33:38 undercloud-0 systemd[1]: libpod-3671ea66cdbe673e84cf2cf7214541ef0406936aacbad7aeed15885de271e43a.scope: Failed to set cpuset.cpus: No space left on device Mar 16 08:33:38 undercloud-0 systemd[1]: libpod-3671ea66cdbe673e84cf2cf7214541ef0406936aacbad7aeed15885de271e43a.scope: Failed to set cpuset.mems: No space left on device Mar 16 08:33:38 undercloud-0 systemd[1]: libpod-119a6ecd7cd5f97d6c1490a4d1f92045482821f9322ca062476a0e5dc00f3e1b.scope: Failed to set cpuset.cpus: No space left on device Mar 16 08:33:38 undercloud-0 systemd[1]: libpod-119a6ecd7cd5f97d6c1490a4d1f92045482821f9322ca062476a0e5dc00f3e1b.scope: Failed to set cpuset.mems: No space left on device Mar 16 08:33:38 undercloud-0 systemd[1]: libpod-0bba30f62117d2265b790eae3119f2ddff4921ebe18607525c82d1180f52b426.scope: Failed to set cpuset.cpus: No space left on device Mar 16 08:33:38 undercloud-0 systemd[1]: libpod-0bba30f62117d2265b790eae3119f2ddff4921ebe18607525c82d1180f52b426.scope: Failed to set cpuset.mems: No space left on device Mar 16 08:33:38 undercloud-0 systemd[1]: libpod-253ed5fc2e548f6395e88471f98b0374a250f7906e85edfabe88558a4b1373a4.scope: Failed to set cpuset.cpus: No space left on device Mar 16 08:33:38 undercloud-0 systemd[1]: libpod-253ed5fc2e548f6395e88471f98b0374a250f7906e85edfabe88558a4b1373a4.scope: Failed to set cpuset.mems: No space left on device Mar 16 08:33:38 undercloud-0 systemd[1]: libpod-479032d577dc634024288771b4365394f90da501a3cee983b15081e2b5bd477d.scope: Failed to set cpuset.cpus: No space left on device Mar 16 08:33:38 undercloud-0 systemd[1]: libpod-479032d577dc634024288771b4365394f90da501a3cee983b15081e2b5bd477d.scope: Failed to set cpuset.mems: No space left on device Mar 16 08:33:38 undercloud-0 systemd[1]: libpod-dc9a8618e1d18cedbb522f53e0f03fa402a5043b7ec32e0767a8d780b1601ad2.scope: Failed to set cpuset.cpus: No space left on device Mar 16 08:33:38 undercloud-0 systemd[1]: libpod-dc9a8618e1d18cedbb522f53e0f03fa402a5043b7ec32e0767a8d780b1601ad2.scope: Failed to set cpuset.mems: No space left on device This is a regression starting with 1.8.18-12. It appears to now take about <the value of -R> * <the value of -N> to complete. with 1.8.18-12, it scales such that -R 1 = ~5s, -R 2 = ~11s, -R 3 = ~18s for -N 5. It's worse with 1.8.18-14. ()[root@undercloud /]# time ipmitool -I lanplus -H 192.168.1.15 -L ADMINISTRATOR -U admin -v -R 3 -N 5 -P password chassis bootdev pxe Running Get PICMG Properties my_addr 0x20, transit 0, target 0x20 Error response 0xc1 from Get PICMG Properities Running Get VSO Capabilities my_addr 0x20, transit 0, target 0x20 Invalid completion code received: Invalid command Discovered IPMB address 0x0 Set Boot Device to pxe real 0m0.519s user 0m0.000s sys 0m0.005s ()[root@undercloud /]# rpm -qa | grep ipmitool ipmitool-1.8.18-10.el8.x86_64 ()[root@undercloud /]# time ipmitool -I lanplus -H 192.168.1.15 -L ADMINISTRATOR -U admin -v -R 3 -N 5 -P password chassis bootdev pxe Unable to Get Channel Cipher Suites Running Get PICMG Properties my_addr 0x20, transit 0, target 0x20 Error response 0xc1 from Get PICMG Properities Running Get VSO Capabilities my_addr 0x20, transit 0, target 0x20 Invalid completion code received: Invalid command Discovered IPMB address 0x0 Set Boot Device to pxe real 0m18.889s user 0m0.001s sys 0m0.006s ()[root@undercloud /]# rpm -qa | grep ipmitool ipmitool-1.8.18-12.el8.x86_64 ()[root@undercloud /]# time ipmitool -I lanplus -H 192.168.1.15 -L ADMINISTRATOR -U admin -v -R 3 -N 5 -P password chassis bootdev pxe Unable to Get Channel Cipher Suites Running Get PICMG Properties my_addr 0x20, transit 0, target 0x20 Error response 0xc1 from Get PICMG Properities Running Get VSO Capabilities my_addr 0x20, transit 0, target 0x20 Invalid completion code received: Invalid command Discovered IPMB address 0x0 Set Boot Device to pxe real 0m22.575s user 0m0.001s sys 0m0.006s ()[root@undercloud /]# rpm -qa | grep ipmitool ipmitool-1.8.18-14.el8.x86_64 Note that "Unable to Get Channel Cipher Suites" is also new in the affected versions. I wonder if it's somehow related. > Note that "Unable to Get Channel Cipher Suites" is also new in the affected versions. I wonder if it's somehow related. This was added in the version that is failing as 1.8.18-10 works fine. * Tue Oct 15 2019 Václav Doležal <vdolezal> - 1.8.18-11 - Choose the best cipher suite available when connecting over LAN (#1749360) See https://bugzilla.redhat.com/show_bug.cgi?id=1749360 I've tried this with baremetal nodes and IPMI works fine using ipmitool-1.8.18-14.el8.x86_64. $ sudo podman exec -it ironic_conductor /bin/bash ()[ironic@hardprov-dl360-g9-01 /]$ rpm -qa | grep ipmi ipmitool-1.8.18-14.el8.x86_64 Harald has suggested this may be a pyghmi issue as pyghmi is the common python IPMI implementation used by both OVB and VBMC Even if pyghmi doing something fishy, it's still a regression. In any case, we need some guidance from people familiar with ipmitool. I was thinking it might be python3-pyghmi-1.0.22-2.el8ost.noarch, being quite dated, some fix could be present in later version. But, for OVB (atleast in my case) pyghmi-1.5.12.tar.gz is installed from pip. Assuming other people's OVB environments also get this version of pyghmi it unlikely that there is a fix. Yeah, it looks like it stalls on the "Get Channel Cipher Suites" command. I don't thinks this is bug in ipmitool. I see "Get Channel Cipher Suites" command listed as "Mandatory if IPMI v2.0/RMCP+ session-based channels are implemented.". And I suppose BMC should return some error code (e.g. 0xC1 Invalid command) instead of ignoring the request. I think you can work around this timeout by forcing cipher suite, e.g. "-C 3" It could be an explanation if ipmitool outright failed or just always waited. But what we observe is that ipmitool receives a correct response and then retries it until timeout. The lower number of retries (-R and -N) - the faster ipmitool (successfully!) returns. This behavior looks illogical to me. I don't see in the log where "Get Channel Cipher Suites" command (NetFn, cmd = 0x06, 0x54) receives a correct response. The -R and -N parameters apply to *each* IPMI command. So ipmitool, in order to use best cipher suite, sends "Get Channel Cipher Suites" command, times out, retries, finally gives up, uses fallback value and continue with the rest of the commands. So, it looks like pyghmi sends code 0xC1 on not implemented features: https://opendev.org/x/pyghmi/src/branch/master/pyghmi/ipmi/bmc.py#L182. Is it the right thing to do (assuming we cannot just simply implement "Get Channel Cipher Suites")? Yes, it is right thing to do. Note that ipmitool sends "Get Channel Cipher Suites" before authentication so it may be handled in different code path. Okay, then shouldn't ipmitool give up and fallback to the default on receiving that? The practical implication of my question is that we may be able to fix pyghmi, but we may not be able to fix BMCs out there that will get broken with RHEL 8.2. > Okay, then shouldn't ipmitool give up and fallback to the default on receiving that? Well, I don't see ipmitool receiving any reply in the log. On quick look into pyghmi sources, it looks that it is handled (or ignored in this case) in pyghmi/ipmi/private/serversession.py:IpmiServer:sessionless_data() (line 301). > The practical implication of my question is that we may be able to fix > pyghmi, but we may not be able to fix BMCs out there that will get broken > with RHEL 8.2. Well, I hope that most BMCs will behave correctly. Without this patch, for some newer BMCs user will need to manually specify "-C 17" because they reject cipher suite 3 as obsolete. And older ones don't support cipher suite 17… I have confirmed in IPMI spec that "Get Channel Cipher Suites" "works at any privilege level, can be sent prior to a session being established". > it looks that it is handled (or ignored in this case) in pyghmi/ipmi/private/serversession.py:IpmiServer:sessionless_data() Thanks for the pointer! I'll see if I can just implement it. > Well, I hope that most BMCs will behave correctly. I no longer hope that.. But I guess you're right, at least some response is expected (although I've seen BMCs that just ignore you when they don't like something..). I've reached out to the pyghmi maintainer and he promised to help. I'm personally going to go insane from trying to understand the IPMI spec (I haven't been seriously exposed to it before). pyghmi 1.5.13 has been released with a fix. Since virtualbmc is installed from pip, I'm closing this bug as fixed. *** Bug 1813468 has been marked as a duplicate of this bug. *** |