Bug 1464131 - ironic sometimes sends the wrong iDRAC WS-Man request when power cycling a server
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic
10.0 (Newton)
Unspecified Unspecified
unspecified Severity unspecified
: ---
: ---
Assigned To: RHOS Maint
Depends On:
Reported: 2017-06-22 09:36 EDT by Eduard Barrera
Modified: 2017-08-23 08:23 EDT (History)
6 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2017-08-23 08:23:45 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments
power cycle fail (5.46 KB, text/plain)
2017-06-22 09:36 EDT, Eduard Barrera
power cycle success (4.58 KB, text/plain)
2017-06-22 09:37 EDT, Eduard Barrera

Description Eduard Barrera 2017-06-22 09:36:09 EDT
Created attachment 1290711 [details]
power cycle fail

Description of problem:

I have been troubleshooting a weird issue where introspection sometimes succeeds and sometimes fails because the server is not being powered on. I kept sending power on and power off commands until one of the two nodes failed; once it failed, it kept failing. I then captured the output of both the failed power cycle and a correct power cycle, and attached both outputs.

It seems that for a power cycle request

==> /var/log/ironic/ironic-api.log <==
2017-06-22 14:34:49.665 4863 INFO eventlet.wsgi.server [req-6dfccd04-ff90-4f96-951e-23f1ba7b8ae3 admin admin - - -] "PUT /v1/nodes/e7215ec4-8806-4886-b7bd-837daf137c6a/states/power HTTP/1.1" status: 202  len: 384 time: 0.1393850

we are sending a query for lifecycle jobs:

2017-06-22 14:34:50.336 1769 DEBUG dracclient.wsman [req-6dfccd04-ff90-4f96-951e-23f1ba7b8ae3 - - - - -] Sending request to <s:Envelope xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/08/addressing" xmlns:s="http://www.w3.org/2003/05/soap-envelope" xmlns:wsman="http://schemas.dmtf.org/wbem/wsman/1/wsman.xsd"><s:Header><wsa:Tos:mustUnderstand="true"></wsa:To><wsman:ResourceURI s:mustUnderstand="true">http://schemas.dell.com/wbem/wscim/1/cim-schema/2/DCIM_LifecycleJob</wsman:ResourceURI><wsa:MessageID s:mustUnderstand="true">uuid:c3efb16f-7f9d-4f21-afbf-5c4f19918cfd</wsa:MessageID><wsa:ReplyTo><wsa:Address>http://schemas.xmlsoap.org/ws/2004/08/addressing/role/anonymous</wsa:Address></wsa:ReplyTo><wsa:Action s:mustUnderstand="true">http://schemas.xmlsoap.org/ws/2004/09/enumeration/Enumerate</wsa:Action></s:Header><s:Body><wsen:Enumerate xmlns:wsen="http://schemas.xmlsoap.org/ws/2004/09/enumeration"><wsman:Filter Dialect="http://schemas.dmtf.org/wbem/cql/1/dsp0202.pdf">select * from DCIM_LifecycleJob where Name != "CLEARALL" and JobStatus != "Reboot Completed" and JobStatus != "Completed" and JobStatus != "Completed with Errors" and JobStatus != "Failed"</wsman:Filter><wsman:OptimizeEnumeration/><wsman:MaxElements>100</wsman:MaxElements></wsen:Enumerate></s:Body></s:Envelope> _do_request /usr/lib/python2.7/site-packages/dracclient/wsman.py:60

instead of the power cycle command that is sent in the successful case:

Sending request to <s:Envelope xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/08/addressing" xmlns:s="http://www.w3.org/2003/05/soap-envelope" xmlns:wsman="http://schemas.dmtf.org/wbem/wsman/1/wsman.xsd"><s:Header><wsa:Tos:mustUnderstand="true"></wsa:To><wsman:ResourceURI s:mustUnderstand="true">http://schemas.dell.com/wbem/wscim/1/cim-schema/2/DCIM_ComputerSystem</wsman:ResourceURI><wsa:MessageID s:mustUnderstand="true">uuid:2f798edd-27a6-4760-923e-752e7aaee83f</wsa:MessageID><wsa:ReplyTo><wsa:Address>http://schemas.xmlsoap.org/ws/2004/08/addressing/role/anonymous</wsa:Address></wsa:ReplyTo><wsa:Action s:mustUnderstand="true">http://schemas.dell.com/wbem/wscim/1/cim-schema/2/DCIM_ComputerSystem/RequestStateChange</wsa:Action><wsman:SelectorSet><wsman:Selector Name="CreationClassName">DCIM_ComputerSystem</wsman:Selector><wsman:Selector Name="Name">srv:system</wsman:Selector></wsman:SelectorSet></s:Header><s:Body><ns0:RequestStateChange_INPUT xmlns:ns0="http://schemas.dell.com/wbem/wscim/1/cim-schema/2/DCIM_ComputerSystem"><ns0:RequestedState>2</ns0:RequestedState></ns0:RequestStateChange_INPUT></s:Body></s:Envelope> _do_request /usr/lib/python2.7/site-packages/dracclient/wsman.py:60
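The two requests above can be told apart in the debug logs by their `wsa:Action` SOAP header: the failing case carries the `enumeration/Enumerate` action (the lifecycle-job query), while the successful case carries the `DCIM_ComputerSystem/RequestStateChange` action. As a minimal sketch of how to extract that header when triaging such logs (the sample envelope below is hand-written for illustration, not copied from the attachments):

```python
import xml.etree.ElementTree as ET

WSA = "http://schemas.xmlsoap.org/ws/2004/08/addressing"
SOAP = "http://www.w3.org/2003/05/soap-envelope"

def soap_action(envelope_xml):
    """Return the wsa:Action URI of a WS-Man request envelope, or None."""
    root = ET.fromstring(envelope_xml)
    action = root.find(f".//{{{SOAP}}}Header/{{{WSA}}}Action")
    return None if action is None else action.text

# Hand-written sample resembling the failing request's headers.
sample = (
    f'<s:Envelope xmlns:s="{SOAP}" xmlns:wsa="{WSA}">'
    "<s:Header><wsa:Action>"
    "http://schemas.xmlsoap.org/ws/2004/09/enumeration/Enumerate"
    "</wsa:Action></s:Header><s:Body/></s:Envelope>"
)
# The Enumerate action indicates a lifecycle-job query rather than a
# RequestStateChange (power cycle) invocation.
print(soap_action(sample))
```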

Version-Release number of selected component (if applicable):
OSP 10

How reproducible:
sometimes

Steps to Reproduce:
1. # ironic node-set-power-state <uuid> on 

Actual results:
The server is sometimes not power cycled.

Expected results:
The server is always power cycled.

Additional info:
Comment 1 Eduard Barrera 2017-06-22 09:37 EDT
Created attachment 1290712 [details]
power cycle success
Comment 2 Dmitry Tantsur 2017-06-26 11:18:11 EDT
Could you please attach the complete ironic-conductor logs? What's your python-dracclient version? Also, could you please try fetching and temporarily installing a python-dracclient RPM from the latest release?

I have a feeling that we do query for lifecycle jobs from time to time.
Comment 3 Bob Fournier 2017-08-22 18:07:48 EDT
Hi Eduard - can we get the requested data?  Thanks.
Comment 4 Eduard Barrera 2017-08-23 03:49:55 EDT
Hi Bob,

In the end, the issue was the following:

Regarding the boot issue: the Dell server itself has a job queue that receives remote management commands and then executes them. This job queue can fill up, causing recently sent commands to be ignored. We just needed to delete the job queue so it would stop ignoring the director's commands; that was all.
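To illustrate the failure mode described above, here is a toy model (not the iDRAC firmware's actual implementation) of a bounded job queue that silently drops new commands once it is full and accepts them again after being cleared:

```python
from collections import deque

class ToyJobQueue:
    """Toy model of a bounded management-job queue (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._jobs = deque()

    def submit(self, job):
        """Accept the job, or silently ignore it when the queue is full."""
        if len(self._jobs) >= self.capacity:
            return False  # command dropped; the caller sees no effect
        self._jobs.append(job)
        return True

    def clear(self):
        """Delete all queued jobs, as was done to resolve this bug."""
        self._jobs.clear()

q = ToyJobQueue(capacity=2)
q.submit("power off")
q.submit("power on")
accepted = q.submit("power cycle")   # queue is full: ignored
q.clear()
retried = q.submit("power cycle")    # accepted after clearing the queue
print(accepted, retried)  # False True
```

On an actual iDRAC, the job queue can be inspected and cleared with the racadm utility (e.g. `racadm jobqueue view` and `racadm jobqueue delete -i JID_CLEARALL`); the exact invocation depends on the iDRAC firmware version.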

Do you still need more data, or shall we close the bugzilla?
Comment 5 Bob Fournier 2017-08-23 08:23:45 EDT
Thanks a lot Eduard.

Yes I think we can close this.

Note that there are a couple configurable IPMI settings in /etc/ironic/ironic.conf
that help with the frequency and time between IPMI commands.

# Maximum time in seconds to retry IPMI operations. There is a
# tradeoff when setting this value. Setting this too low may
# cause older BMCs to crash and require a hard reset. However,
# setting too high can cause the sync power state periodic
# task to hang when there are slow or unresponsive BMCs.
# (integer value)
#retry_timeout = 60
retry_timeout = 15

# Minimum time, in seconds, between IPMI operations sent to a
# server. There is a risk with some hardware that setting this
# too low may cause the BMC to crash. Recommended setting is 5
# seconds. (integer value)
#min_command_interval = 5
