Red Hat Bugzilla – Bug 1464131
ironic sometimes send wrong idrac wsman request for power cycling a server
Last modified: 2017-08-23 08:23:45 EDT
Created attachment 1290711 [details]
power cycle fail
Description of problem:
I have been troubleshooting a weird issue where sometimes the introspection success and sometimes it fails due the server is not being powered on. I have been sending power on and power off until one of the two nodes failed, one it failed it kept failing. The I got the output of the failed power cycle and to correct power cycle, attached the outputs.
It seems that for a power cycle
==> /var/log/ironic/ironic-api.log <==
2017-06-22 14:34:49.665 4863 INFO eventlet.wsgi.server [req-6dfccd04-ff90-4f96-951e-23f1ba7b8ae3 admin admin - - -] 192.0.2.1 "PUT /v1/nodes/e7215ec4-8806-4886-b7bd-837daf137c6a/states/power HTTP/1.1" status: 202 len: 384 time: 0.1393850
we are sending a query for lifcycle jobs:
2017-06-22 14:34:50.336 1769 DEBUG dracclient.wsman [req-6dfccd04-ff90-4f96-951e-23f1ba7b8ae3 - - - - -] Sending request to https://172.16.40.3:443/wsman: <s:Envelope xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/08/addressing" xmlns:s="http://www.w3.org/2003/05/soap-envelope" xmlns:wsman="http://schemas.dmtf.org/wbem/wsman/1/wsman.xsd"><s:Header><wsa:Tos:mustUnderstand="true">https://172.16.40.3:443/wsman</wsa:To><wsman:ResourceURI s:mustUnderstand="true">http://schemas.dell.com/wbem/wscim/1/cim-schema/2/DCIM_LifecycleJob</wsman:ResourceURI><wsa:MessageID s:mustUnderstand="true">uuid:c3efb16f-7f9d-4f21-afbf-5c4f19918cfd</wsa:MessageID><wsa:ReplyTo><wsa:Address>http://schemas.xmlsoap.org/ws/2004/08/addressing/role/anonymous</wsa:Address></wsa:ReplyTo><wsa:Action s:mustUnderstand="true">http://schemas.xmlsoap.org/ws/2004/09/enumeration/Enumerate</wsa:Action></s:Header><s:Body><wsen:Enumerate xmlns:wsen="http://schemas.xmlsoap.org/ws/2004/09/enumeration"><wsman:Filter Dialect="http://schemas.dmtf.org/wbem/cql/1/dsp0202.pdf">select * from DCIM_LifecycleJob where Name != "CLEARALL" and JobStatus != "Reboot Completed" and JobStatus != "Completed" and JobStatus != "Completed with Errors" and JobStatus != "Failed"</wsman:Filter><wsman:OptimizeEnumeration/><wsman:MaxElements>100</wsman:MaxElements></wsen:Enumerate></s:Body></s:Envelope> _do_request /usr/lib/python2.7/site-packages/dracclient/wsman.py:60
instead of a power cycle command as it is done in the success case:
Sending request to https://172.16.40.8:443/wsman: <s:Envelope xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/08/addressing" xmlns:s="http://www.w3.org/2003/05/soap-envelope" xmlns:wsman="http://schemas.dmtf.org/wbem/wsman/1/wsman.xsd"><s:Header><wsa:Tos:mustUnderstand="true">https://172.16.40.8:443/wsman</wsa:To><wsman:ResourceURI s:mustUnderstand="true">http://schemas.dell.com/wbem/wscim/1/cim-schema/2/DCIM_ComputerSystem</wsman:ResourceURI><wsa:MessageID s:mustUnderstand="true">uuid:2f798edd-27a6-4760-923e-752e7aaee83f</wsa:MessageID><wsa:ReplyTo><wsa:Address>http://schemas.xmlsoap.org/ws/2004/08/addressing/role/anonymous</wsa:Address></wsa:ReplyTo><wsa:Action s:mustUnderstand="true">http://schemas.dell.com/wbem/wscim/1/cim-schema/2/DCIM_ComputerSystem/RequestStateChange</wsa:Action><wsman:SelectorSet><wsman:Selector Name="CreationClassName">DCIM_ComputerSystem</wsman:Selector><wsman:Selector Name="Name">srv:system</wsman:Selector></wsman:SelectorSet></s:Header><s:Body><ns0:RequestStateChange_INPUT xmlns:ns0="http://schemas.dell.com/wbem/wscim/1/cim-schema/2/DCIM_ComputerSystem"><ns0:RequestedState>2</ns0:RequestedState></ns0:RequestStateChange_INPUT></s:Body></s:Envelope> _do_request /usr/lib/python2.7/site-packages/dracclient/wsman.py:60
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. # ironic node-set-power-state <uuid> on
server not power cycled some times
server powered cycled always
Created attachment 1290712 [details]
power cycle success
Could you please attach the complete ironic-conductor logs? What's your python-dracclient version? Also could you please try fetching and temporary installing a python-dracclient RPM from the latest release?
I have a feeling that we do query for lifecycle jobs from time to time..
Hi Eduard - can we get the requested data? Thanks.
At the end the issue was the following:
regarding to boot issue, the DELL server itself has a job queue that receive remote management commands then executes it. this job queue may filled up ignoring other commands sent recently, it just needed to delete job queue to prevent it ignoring director commands, that was all of it.
Do you still need more data or we shall close the bugzilla ?
Thanks a lot Eduard.
Yes I think we can close this.
Note that there are a couple configurable IPMI settings in /etc/ironic/ironic.conf
that help with the frequency and time between IPMI commands.
# Maximum time in seconds to retry IPMI operations. There is a
# tradeoff when setting this value. Setting this too low may
# cause older BMCs to crash and require a hard reset. However,
# setting too high can cause the sync power state periodic
# task to hang when there are slow or unresponsive BMCs.
# (integer value)
#retry_timeout = 60
retry_timeout = 15
# Minimum time, in seconds, between IPMI operations sent to a
# server. There is a risk with some hardware that setting this
# too low may cause the BMC to crash. Recommended setting is 5
# seconds. (integer value)
#min_command_interval = 5