Bug 1464131

Summary: ironic sometimes sends the wrong iDRAC WS-Man request when power cycling a server
Product: Red Hat OpenStack
Reporter: Eduard Barrera <ebarrera>
Component: openstack-ironic
Assignee: RHOS Maint <rhos-maint>
Status: CLOSED NOTABUG
QA Contact: mlammon
Severity: unspecified
Priority: unspecified
Version: 10.0 (Newton)
CC: bfournie, ebarrera, mburns, rhel-osp-director-maint, slinaber, srevivo
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-08-23 12:23:45 UTC
Type: Bug
Attachments:
  power cycle fail (flags: none)
  power cycle success (flags: none)

Description Eduard Barrera 2017-06-22 13:36:09 UTC
Created attachment 1290711 [details]
power cycle fail

Description of problem:

I have been troubleshooting a weird issue where introspection sometimes succeeds and sometimes fails because the server is not powered on. I kept sending power on and power off commands until one of the two nodes failed; once it failed, it kept failing. I then captured the output of the failed power cycle and of a correct power cycle, and have attached both outputs.


It seems that for a power cycle request like this one:

==> /var/log/ironic/ironic-api.log <==
2017-06-22 14:34:49.665 4863 INFO eventlet.wsgi.server [req-6dfccd04-ff90-4f96-951e-23f1ba7b8ae3 admin admin - - -] 192.0.2.1 "PUT /v1/nodes/e7215ec4-8806-4886-b7bd-837daf137c6a/states/power HTTP/1.1" status: 202  len: 384 time: 0.1393850

we are sending a query for lifecycle jobs:

2017-06-22 14:34:50.336 1769 DEBUG dracclient.wsman [req-6dfccd04-ff90-4f96-951e-23f1ba7b8ae3 - - - - -] Sending request to https://172.16.40.3:443/wsman: <s:Envelope xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/08/addressing" xmlns:s="http://www.w3.org/2003/05/soap-envelope" xmlns:wsman="http://schemas.dmtf.org/wbem/wsman/1/wsman.xsd"><s:Header><wsa:Tos:mustUnderstand="true">https://172.16.40.3:443/wsman</wsa:To><wsman:ResourceURI s:mustUnderstand="true">http://schemas.dell.com/wbem/wscim/1/cim-schema/2/DCIM_LifecycleJob</wsman:ResourceURI><wsa:MessageID s:mustUnderstand="true">uuid:c3efb16f-7f9d-4f21-afbf-5c4f19918cfd</wsa:MessageID><wsa:ReplyTo><wsa:Address>http://schemas.xmlsoap.org/ws/2004/08/addressing/role/anonymous</wsa:Address></wsa:ReplyTo><wsa:Action s:mustUnderstand="true">http://schemas.xmlsoap.org/ws/2004/09/enumeration/Enumerate</wsa:Action></s:Header><s:Body><wsen:Enumerate xmlns:wsen="http://schemas.xmlsoap.org/ws/2004/09/enumeration"><wsman:Filter Dialect="http://schemas.dmtf.org/wbem/cql/1/dsp0202.pdf">select * from DCIM_LifecycleJob where Name != "CLEARALL" and JobStatus != "Reboot Completed" and JobStatus != "Completed" and JobStatus != "Completed with Errors" and JobStatus != "Failed"</wsman:Filter><wsman:OptimizeEnumeration/><wsman:MaxElements>100</wsman:MaxElements></wsen:Enumerate></s:Body></s:Envelope> _do_request /usr/lib/python2.7/site-packages/dracclient/wsman.py:60


instead of the power cycle command that is sent in the success case:

Sending request to https://172.16.40.8:443/wsman: <s:Envelope xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/08/addressing" xmlns:s="http://www.w3.org/2003/05/soap-envelope" xmlns:wsman="http://schemas.dmtf.org/wbem/wsman/1/wsman.xsd"><s:Header><wsa:Tos:mustUnderstand="true">https://172.16.40.8:443/wsman</wsa:To><wsman:ResourceURI s:mustUnderstand="true">http://schemas.dell.com/wbem/wscim/1/cim-schema/2/DCIM_ComputerSystem</wsman:ResourceURI><wsa:MessageID s:mustUnderstand="true">uuid:2f798edd-27a6-4760-923e-752e7aaee83f</wsa:MessageID><wsa:ReplyTo><wsa:Address>http://schemas.xmlsoap.org/ws/2004/08/addressing/role/anonymous</wsa:Address></wsa:ReplyTo><wsa:Action s:mustUnderstand="true">http://schemas.dell.com/wbem/wscim/1/cim-schema/2/DCIM_ComputerSystem/RequestStateChange</wsa:Action><wsman:SelectorSet><wsman:Selector Name="CreationClassName">DCIM_ComputerSystem</wsman:Selector><wsman:Selector Name="Name">srv:system</wsman:Selector></wsman:SelectorSet></s:Header><s:Body><ns0:RequestStateChange_INPUT xmlns:ns0="http://schemas.dell.com/wbem/wscim/1/cim-schema/2/DCIM_ComputerSystem"><ns0:RequestedState>2</ns0:RequestedState></ns0:RequestStateChange_INPUT></s:Body></s:Envelope> _do_request /usr/lib/python2.7/site-packages/dracclient/wsman.py:60
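
For reference, here is a minimal sketch (my own illustration, using python-dracclient's public API directly rather than the exact code path inside ironic's DRAC driver; the address and credentials are placeholders) of the two client calls that produce the requests above: list_jobs() results in the DCIM_LifecycleJob enumeration seen in the failing trace, while set_power_state() results in the DCIM_ComputerSystem RequestStateChange seen in the successful one.

from dracclient import client
from dracclient.constants import POWER_ON

# Placeholder iDRAC address and credentials -- not taken from this environment.
drac = client.DRACClient(host='172.16.40.3',
                         username='root',
                         password='calvin')

# Enumerates DCIM_LifecycleJob: the request seen in the failing trace.
unfinished = drac.list_jobs(only_unfinished=True)
print(unfinished)

# Invokes DCIM_ComputerSystem.RequestStateChange (RequestedState=2,
# i.e. power on): the request seen in the successful trace.
drac.set_power_state(POWER_ON)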


Version-Release number of selected component (if applicable):
OSP 10

How reproducible:
Sometimes.

Steps to Reproduce:
1. # ironic node-set-power-state <uuid> on
2. # ironic node-set-power-state <uuid> off
3. Repeat, alternating on and off, until one of the nodes stops being power cycled (see the loop sketch below).
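
A rough sketch of that loop (my own illustration, not from the report), using python-ironicclient; the auth values and node UUID are placeholders and the polling interval is a guess:

import time
from ironicclient import client

# Placeholder undercloud credentials and endpoint.
ironic = client.get_client(
    1,
    os_auth_url='http://192.0.2.1:5000/v2.0',
    os_username='admin',
    os_password='secret',
    os_tenant_name='admin',
)

NODE = 'e7215ec4-8806-4886-b7bd-837daf137c6a'  # node UUID seen in the API log

for i in range(20):
    target = 'on' if i % 2 == 0 else 'off'
    # Equivalent to "ironic node-set-power-state <uuid> on|off".
    ironic.node.set_power_state(NODE, target)
    time.sleep(60)  # give the asynchronous request time to complete
    node = ironic.node.get(NODE)
    if node.power_state != 'power %s' % target:
        print('iteration %d: power state is %s, expected power %s'
              % (i, node.power_state, target))
        break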

Actual results:
The server is sometimes not power cycled.

Expected results:
The server is always power cycled.

Additional info:

Comment 1 Eduard Barrera 2017-06-22 13:37:11 UTC
Created attachment 1290712 [details]
power cycle success

Comment 2 Dmitry Tantsur 2017-06-26 15:18:11 UTC
Could you please attach the complete ironic-conductor logs? What's your python-dracclient version? Also, could you please try fetching and temporarily installing a python-dracclient RPM from the latest release?

I have a feeling that we do query for lifecycle jobs from time to time..

Comment 3 Bob Fournier 2017-08-22 22:07:48 UTC
Hi Eduard - can we get the requested data?  Thanks.

Comment 4 Eduard Barrera 2017-08-23 07:49:55 UTC
Hi Bob,

At the end the issue was the following:

"""
Regarding the boot issue: the Dell server itself has a job queue that receives remote management commands and then executes them. This job queue can fill up and start ignoring recently sent commands; it just needed the job queue to be deleted so it would stop ignoring the director's commands. That was all of it.
"""

Do you still need more data, or shall we close the bugzilla?

Comment 5 Bob Fournier 2017-08-23 12:23:45 UTC
Thanks a lot Eduard.

Yes I think we can close this.

Note that there are a couple of configurable IPMI settings in /etc/ironic/ironic.conf
that control how long IPMI operations are retried and the minimum time between IPMI commands.

# Maximum time in seconds to retry IPMI operations. There is a
# tradeoff when setting this value. Setting this too low may
# cause older BMCs to crash and require a hard reset. However,
# setting too high can cause the sync power state periodic
# task to hang when there are slow or unresponsive BMCs.
# (integer value)
#retry_timeout = 60
retry_timeout = 15

# Minimum time, in seconds, between IPMI operations sent to a
# server. There is a risk with some hardware that setting this
# too low may cause the BMC to crash. Recommended setting is 5
# seconds. (integer value)
#min_command_interval = 5