Description of problem:
The timeouts inside of ipmitool are too short.
The timeout is 4 seconds for lanplus and 2 seconds for lan. Both plugins treat errors as a matter of dropped packets, not slow BMCs. In practice, certain Romley platform BMCs bog down or hang for various reasons, usually near power on/off events. More commonly, a BMC runs very slowly and can take a long time to send an IPMI message reply. Once ipmitool times out waiting for a reply to a message, it retries.
Once the retry is sent, two important bad things can happen:
* The BMC is now getting ANOTHER session, further bogging it down
* The eventual reply from the first message does come back but out of order.
Depending on timing, this out-of-order reply can confuse ipmitool.
-> ipmitool may produce a nonsensical error message
-> ipmitool may segfault
In all our experiments across thousands of boards, the timeout setting has been key to waiting out slow BMCs. Note that the BMC slowness is not perpetual; it is brief and the BMC recovers. We have rarely seen a failure with a timeout longer than 6 seconds, and never with one longer than 8 seconds. 8 seconds is not an inappropriate timeout value, and it makes sense when the timeout/retry is viewed as handling a BMC slowdown (or a BMC restarting itself) rather than a dropped packet.
Version-Release number of selected component (if applicable):
How reproducible:
It happens intermittently using ipmitool lan or lanplus with certain Romley platform BMCs, usually near power on/off events.
Actual results:
ipmitool may segfault, or it may produce a nonsensical error message.
Expected results:
No segfault or nonsensical message
Hello, was this tested with the commands that support setting the retransmit period with -N?
Do you want these adjusted timeout values with "-o sgi" or "-o romley" or some different oem string?
The response from the engineer is:
Yes, this timeout setting change was tested using the -N option. The hindrance to simply resolving the issues with the -N option is that the software stack above ipmitool has no way to determine if it is using an older or newer version of ipmitool that has the -N option.
No. SGI-specific or Romley-specific options have the same basic problem as the -N option: the aforementioned software stack has no way of determining whether the underlying ipmitool is a version that supports the new options.
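For reference, the -N testing mentioned above corresponds to an invocation along the following lines. This is only a sketch: the function name, the `IPMITOOL` override variable, the retry count passed to -R, and the host and user arguments are illustrative assumptions, not values from this report. Only the 8-second timeout comes from the description.

```shell
# Sketch only: ipmi_power_status and the IPMITOOL override are hypothetical names.
# -N sets the lan/lanplus timeout in seconds, -R sets the retry count,
# -E reads the password from the IPMI_PASSWORD environment variable.
ipmi_power_status() {
    bmc_host=$1   # placeholder: BMC hostname or address
    bmc_user=$2   # placeholder: BMC user name
    "${IPMITOOL:-ipmitool}" -I lanplus -N 8 -R 2 \
        -H "$bmc_host" -U "$bmc_user" -E chassis power status
}
```

A caller would run something like `ipmi_power_status bmc.example.com admin`. As noted above, this only helps when the installed ipmitool is new enough to accept -N.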
Would making ipmitool aware of environment variable options be enough?
That is, parse environment variables and merge them with the command-line options. Or does the software stack reset the environment?
To check whether the environment is reset:
1. Set some variable before the software stack starts.
2. Replace the ipmitool binary with a wrapper script that captures the environment it receives.
3. Alternatively, if a single ipmitool instance runs long enough on that system, read its environment directly:
cat /proc/`ps -C ipmitool -o pid | tail -n +2 | head -n 1 `/environ | tr '\0' '\n'
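The wrapper-script check suggested above might be sketched as follows. The path of the relocated real binary (`/usr/bin/ipmitool.real`), the log directory, and the names `REAL_IPMITOOL`, `ENV_LOG_DIR` and `ipmitool_wrapper` are all assumptions made for illustration.

```shell
# Sketch only: all names and paths below are hypothetical.
# The real binary is assumed to have been moved aside before installing the wrapper.
REAL_IPMITOOL="${REAL_IPMITOOL:-/usr/bin/ipmitool.real}"
ENV_LOG_DIR="${ENV_LOG_DIR:-/tmp}"

ipmitool_wrapper() {
    # Record the environment this invocation inherited, one file per shell PID.
    env > "${ENV_LOG_DIR}/ipmitool-env.$$"
    # Hand the original arguments to the real binary unchanged.
    "$REAL_IPMITOOL" "$@"
}
```

A wrapper installed in place of ipmitool would end with `ipmitool_wrapper "$@"`. If the variable set in step 1 shows up in the captured files, the software stack is not resetting the environment.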
The SGI engineer replies:
Environment variables would be an excellent solution. In our problem space, ipmitool invocations are very short lived but frequent. We can quickly deploy environment variables for ipmitool options.
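To illustrate the merge order being proposed, here is a minimal caller-side sketch. The proposal is for ipmitool itself to read the environment; this shell function only mimics that behavior. The variable name `IPMITOOL_EXTRA_OPTS`, the `IPMITOOL` override, and the function name are hypothetical and are not taken from any ipmitool release.

```shell
# Sketch only: IPMITOOL_EXTRA_OPTS and run_ipmitool are hypothetical names.
run_ipmitool() {
    # Options from the environment are placed first, so that explicit
    # command-line options, which come later, can override them when the
    # option parser keeps the last value it sees.
    # The variable is left unquoted on purpose so it splits into words.
    "${IPMITOOL:-ipmitool}" ${IPMITOOL_EXTRA_OPTS:-} "$@"
}
```

With `IPMITOOL_EXTRA_OPTS="-N 8"` exported once by the software stack, every short-lived invocation through `run_ipmitool` picks up the longer timeout without changes to the individual call sites.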
Status check: Is this still on target for 6.6?
rawhide 1.8.13-4 ( 9512c1786385bf0cd66269bfa1474695d45024e2 )
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.