Bug 1028163

Summary: ipmitool default timeout values are too short
Product: Red Hat Enterprise Linux 6 Reporter: Tony Ernst <tee>
Component: ipmitool Assignee: Ales Ledvinka <aledvink>
Status: CLOSED ERRATA QA Contact: Frantisek Sumsal <fsumsal>
Severity: medium Docs Contact:
Priority: high    
Version: 6.6 CC: aledvink, azelinka, ctatman, erikj, fsumsal, gbeshers, jdonohue, jkurik, lmiksik, loriann, mnavrati, peterm, psklenar, rja, salmy, tee, tsmetana
Target Milestone: rc Keywords: ZStream
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: ipmitool-1.8.11-21.el6 Doc Type: Bug Fix
Doc Text:
Previously, the default ipmitool timeout values were too short. As a consequence, during retries, ipmitool could terminate unexpectedly with a segmentation fault, or produce a nonsensical error message. With this update, ipmitool options are parsed correctly from the IPMITOOL_OPTS and IPMI_OPTS environment variables, and IPMITOOL_* variables take precedence over IPMI_* variables. As a result, ipmitool no longer crashes in the described situation.
Story Points: ---
Clone Of:
: 1147593 1210339 Environment:
Last Closed: 2015-07-22 06:59:05 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 996235, 1080044, 1147593, 1210339    

Description Tony Ernst 2013-11-07 19:32:04 UTC
Description of problem:
The timeouts inside of ipmitool are too short.

The timeout for lanplus is 4 seconds, and for lan it is 2 seconds. Both plugins treat errors as a matter of dropped packets, not slow BMCs. For various reasons, certain Romley platform BMCs bog down or hang, usually near power on/off events. More commonly, a BMC runs very slowly and can take a long time to send an IPMI message reply. Once ipmitool times out waiting for a reply to a message, it retries.

Once the retry is sent, two important bad things can happen:
* The BMC is now getting ANOTHER session, further bogging it down
* The eventual reply from the first message does come back but out of order.
  Depending on timing, this out-of-order reply can confuse ipmitool.
     -> ipmitool may produce a nonsensical error message
     -> ipmitool may segfault

In all our experiments across thousands of boards, the timeout setting has been the key to waiting out slow BMCs. Note that the BMC slowness is not perpetual; it is brief and the BMC recovers. We have rarely seen a failure with a timeout longer than 6 seconds, and never with one longer than 8 seconds. An 8-second timeout is not inappropriate, and it makes sense when the timeout/retry is viewed as handling a BMC slowdown (or a BMC restarting itself) rather than a dropped packet.

Version-Release number of selected component (if applicable):
ipmitool-1.8.11-16.el6.x86_64

How reproducible:
It happens intermittently using ipmitool lan or lanplus with certain Romley platform BMCs, usually near power on/off events.

Actual results:
ipmitool may segfault, or it may produce a nonsensical error message.

Expected results:
No segfault or nonsensical message

Additional info:

Comment 2 Ales Ledvinka 2013-11-14 11:42:34 UTC
Hello, was this tested with the commands that support setting the retransmit period with -N?

Do you want these adjusted timeout values with "-o sgi" or "-o romley" or some different oem string?

Comment 3 Russ Anderson 2013-11-14 14:53:19 UTC
The response from the engineer is:

Yes, this timeout setting change was tested using the -N option. The hindrance to simply resolving the issues with the -N option is that the software stack above ipmitool has no way to determine if it is using an older or newer version of ipmitool that has the -N option.

No, having sgi-specific or romley-specific options has the same basic problem as the -N option: the aforementioned software stack has no way of determining whether the underlying ipmitool is a version that supports the new options.
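
For readers unfamiliar with the option, a test with -N along the lines described in this comment might look like the sketch below. It assumes an ipmitool build that accepts -N; the helper name, host, and user are placeholders invented for illustration, and the function only prints the command line instead of contacting a BMC.

```shell
# Hypothetical helper (not part of ipmitool): print the command line for a
# power status query with an explicit retransmit period.
# Host and user are placeholders; nothing is sent to a BMC.
build_ipmitool_cmd() {
    host=$1
    user=$2
    period=$3   # seconds passed to -N
    printf 'ipmitool -I lanplus -H %s -U %s -N %s chassis power status\n' \
        "$host" "$user" "$period"
}

# 8 seconds is the value discussed in the description, not a shipped default.
build_ipmitool_cmd bmc.example.com admin 8
```

On a version without -N the same command line would be rejected, which is the compatibility problem the engineer describes.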

Comment 4 Ales Ledvinka 2013-11-14 15:34:09 UTC
Would making ipmitool aware of environment variable options be enough?

That is, parse the environment variable and merge it with the command-line options. Or does the software stack reset the environment?
To check for an environment reset:
1. Set some variable before the software stack starts.
2. Replace the ipmitool binary with a wrapper script that captures the env output.
or
2. If ipmitool runs long enough as a single instance on that system:
cat /proc/`ps -C ipmitool -o pid | tail -n +2 | head -n 1 `/environ | tr '\0' '\n'
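
The wrapper-script variant of step 2 could be sketched roughly as follows. This is only an illustration: the paths in REAL_IPMITOOL and ENV_LOG and the function name capture_env are assumptions, and the real binary is assumed to have been moved aside before the wrapper is installed under the name ipmitool.

```shell
#!/bin/sh
# Hypothetical wrapper installed in place of the ipmitool binary.
# Both paths below are assumptions chosen for this sketch.
REAL_IPMITOOL="${REAL_IPMITOOL:-/usr/bin/ipmitool.real}"
ENV_LOG="${ENV_LOG:-/tmp/ipmitool-env.log}"

# Append a header line and the sorted environment to the log file.
capture_env() {
    {
        printf '=== %s pid=%s args=%s\n' \
            "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$$" "$*"
        env | sort
    } >> "$ENV_LOG"
}

# Only act as a wrapper when invoked under the name "ipmitool".
if [ "${0##*/}" = "ipmitool" ]; then
    capture_env "$@"
    exec "$REAL_IPMITOOL" "$@"
fi
```

If the variable set in step 1 shows up in the log, the software stack does not reset the environment before calling ipmitool.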

Comment 5 Russ Anderson 2013-11-14 15:54:08 UTC
The SGI engineer replies:

Environment variables would be an excellent solution. In our problem space, ipmitool invocations are very short lived but frequent. We can quickly deploy environment variables for ipmitool options.
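
A deployment along these lines might look like the sketch below. The variable names IPMITOOL_OPTS and IPMI_OPTS and the precedence rule come from the Doc Text of this bug; the assumption that the value uses command-line option syntax, the "-N 8" value, and the helper effective_ipmi_opts are illustrative only and do not reproduce how ipmitool itself parses the variables.

```shell
# Assumption: the variable holds options in the same syntax as the command line.
# "-N 8" is the retransmit period discussed in this bug, not a shipped default.
IPMITOOL_OPTS="-N 8"
export IPMITOOL_OPTS

# Illustrative helper (not part of ipmitool): report which variable would win
# under the rule from the Doc Text, IPMITOOL_* before IPMI_*.
effective_ipmi_opts() {
    if [ -n "${IPMITOOL_OPTS:-}" ]; then
        printf '%s\n' "$IPMITOOL_OPTS"
    else
        printf '%s\n' "${IPMI_OPTS:-}"
    fi
}

effective_ipmi_opts
```

Because the setting travels in the environment, the calling software stack does not have to change its ipmitool command lines; whether a build ignores an unknown variable is not covered by this sketch.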

Comment 6 Russ Anderson 2014-01-27 19:18:22 UTC
Status check: Is this still on target for 6.6?
Thanks.

Comment 9 Ales Ledvinka 2014-04-08 11:12:45 UTC
rawhide 1.8.13-4 ( 9512c1786385bf0cd66269bfa1474695d45024e2 )

Comment 22 errata-xmlrpc 2015-07-22 06:59:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1351.html