Bug 1464131 - Ironic sometimes sends the wrong iDRAC WSMAN request when power cycling a server
Summary: Ironic sometimes sends the wrong iDRAC WSMAN request when power cycling a server
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: RHOS Maint
QA Contact: mlammon
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-06-22 13:36 UTC by Eduard Barrera
Modified: 2020-07-16 09:52 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-23 12:23:45 UTC
Target Upstream Version:
Embargoed:


Attachments
power cycle fail (5.46 KB, text/plain)
2017-06-22 13:36 UTC, Eduard Barrera
power cycle success (4.58 KB, text/plain)
2017-06-22 13:37 UTC, Eduard Barrera

Description Eduard Barrera 2017-06-22 13:36:09 UTC
Created attachment 1290711 [details]
power cycle fail

Description of problem:

I have been troubleshooting a weird issue where introspection sometimes succeeds and sometimes fails because the server is not being powered on. I kept sending power on and power off commands until one of the two nodes failed; once it failed, it kept failing. I then captured the output of the failed power cycle and of a correct power cycle; both outputs are attached.


It seems that for a power cycle request

==> /var/log/ironic/ironic-api.log <==
2017-06-22 14:34:49.665 4863 INFO eventlet.wsgi.server [req-6dfccd04-ff90-4f96-951e-23f1ba7b8ae3 admin admin - - -] 192.0.2.1 "PUT /v1/nodes/e7215ec4-8806-4886-b7bd-837daf137c6a/states/power HTTP/1.1" status: 202  len: 384 time: 0.1393850

we are sending a query for lifecycle jobs:

2017-06-22 14:34:50.336 1769 DEBUG dracclient.wsman [req-6dfccd04-ff90-4f96-951e-23f1ba7b8ae3 - - - - -] Sending request to https://172.16.40.3:443/wsman: <s:Envelope xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/08/addressing" xmlns:s="http://www.w3.org/2003/05/soap-envelope" xmlns:wsman="http://schemas.dmtf.org/wbem/wsman/1/wsman.xsd"><s:Header><wsa:To s:mustUnderstand="true">https://172.16.40.3:443/wsman</wsa:To><wsman:ResourceURI s:mustUnderstand="true">http://schemas.dell.com/wbem/wscim/1/cim-schema/2/DCIM_LifecycleJob</wsman:ResourceURI><wsa:MessageID s:mustUnderstand="true">uuid:c3efb16f-7f9d-4f21-afbf-5c4f19918cfd</wsa:MessageID><wsa:ReplyTo><wsa:Address>http://schemas.xmlsoap.org/ws/2004/08/addressing/role/anonymous</wsa:Address></wsa:ReplyTo><wsa:Action s:mustUnderstand="true">http://schemas.xmlsoap.org/ws/2004/09/enumeration/Enumerate</wsa:Action></s:Header><s:Body><wsen:Enumerate xmlns:wsen="http://schemas.xmlsoap.org/ws/2004/09/enumeration"><wsman:Filter Dialect="http://schemas.dmtf.org/wbem/cql/1/dsp0202.pdf">select * from DCIM_LifecycleJob where Name != "CLEARALL" and JobStatus != "Reboot Completed" and JobStatus != "Completed" and JobStatus != "Completed with Errors" and JobStatus != "Failed"</wsman:Filter><wsman:OptimizeEnumeration/><wsman:MaxElements>100</wsman:MaxElements></wsen:Enumerate></s:Body></s:Envelope> _do_request /usr/lib/python2.7/site-packages/dracclient/wsman.py:60


instead of the power state change command that is sent in the success case:

Sending request to https://172.16.40.8:443/wsman: <s:Envelope xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/08/addressing" xmlns:s="http://www.w3.org/2003/05/soap-envelope" xmlns:wsman="http://schemas.dmtf.org/wbem/wsman/1/wsman.xsd"><s:Header><wsa:To s:mustUnderstand="true">https://172.16.40.8:443/wsman</wsa:To><wsman:ResourceURI s:mustUnderstand="true">http://schemas.dell.com/wbem/wscim/1/cim-schema/2/DCIM_ComputerSystem</wsman:ResourceURI><wsa:MessageID s:mustUnderstand="true">uuid:2f798edd-27a6-4760-923e-752e7aaee83f</wsa:MessageID><wsa:ReplyTo><wsa:Address>http://schemas.xmlsoap.org/ws/2004/08/addressing/role/anonymous</wsa:Address></wsa:ReplyTo><wsa:Action s:mustUnderstand="true">http://schemas.dell.com/wbem/wscim/1/cim-schema/2/DCIM_ComputerSystem/RequestStateChange</wsa:Action><wsman:SelectorSet><wsman:Selector Name="CreationClassName">DCIM_ComputerSystem</wsman:Selector><wsman:Selector Name="Name">srv:system</wsman:Selector></wsman:SelectorSet></s:Header><s:Body><ns0:RequestStateChange_INPUT xmlns:ns0="http://schemas.dell.com/wbem/wscim/1/cim-schema/2/DCIM_ComputerSystem"><ns0:RequestedState>2</ns0:RequestedState></ns0:RequestStateChange_INPUT></s:Body></s:Envelope> _do_request /usr/lib/python2.7/site-packages/dracclient/wsman.py:60
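
For reference, here is a minimal python-dracclient sketch of the two operations seen in the traces above. This is an illustration only: the iDRAC address and credentials are placeholders, not values from this environment.

from dracclient import client

# Placeholder iDRAC endpoint and credentials.
drac = client.DRACClient('172.16.40.3', 'root', 'calvin')

# Enumerates DCIM_LifecycleJob over WSMAN, as in the failed trace above.
# The DRAC driver issues this kind of query when it checks for unfinished jobs.
unfinished = drac.list_jobs(only_unfinished=True)

# Invokes DCIM_ComputerSystem.RequestStateChange with RequestedState=2
# (CIM "Enabled", i.e. power on), as in the successful trace above.
drac.set_power_state('POWER_ON')

Both calls go through dracclient.wsman.Client._do_request(), which is what the debug lines above are logging.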


Version-Release number of selected component (if applicable):
OSP 10

How reproducible:
Sometimes

Steps to Reproduce:
1. # ironic node-set-power-state <uuid> on
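
A quick way to exercise this is to toggle the power state in a loop until a node gets stuck, as described above. A sketch using the same CLI; <uuid> is the node UUID placeholder from step 1:

for i in $(seq 1 20); do
  ironic node-set-power-state <uuid> off
  ironic node-set-power-state <uuid> on
done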

Actual results:
The server is sometimes not power cycled.

Expected results:
The server is always power cycled.

Additional info:

Comment 1 Eduard Barrera 2017-06-22 13:37:11 UTC
Created attachment 1290712 [details]
power cycle success

Comment 2 Dmitry Tantsur 2017-06-26 15:18:11 UTC
Could you please attach the complete ironic-conductor logs? What's your python-dracclient version? Also, could you please try fetching and temporarily installing a python-dracclient RPM from the latest release?

I have a feeling that we do query for lifecycle jobs from time to time...

Comment 3 Bob Fournier 2017-08-22 22:07:48 UTC
Hi Eduard - can we get the requested data?  Thanks.

Comment 4 Eduard Barrera 2017-08-23 07:49:55 UTC
Hi Bob,

In the end, the issue was the following:

"""
Regarding the boot issue: the Dell server itself has a job queue that receives remote management commands and then executes them. This job queue can fill up and start ignoring newly sent commands; it was just necessary to delete the job queue so that it would stop ignoring director commands. That was all there was to it.
"""

Do you still need more data, or shall we close the bugzilla?

Comment 5 Bob Fournier 2017-08-23 12:23:45 UTC
Thanks a lot Eduard.

Yes I think we can close this.

Note that there are a couple of configurable IPMI settings in /etc/ironic/ironic.conf
that control the retry timing and the interval between IPMI commands.

# Maximum time in seconds to retry IPMI operations. There is a
# tradeoff when setting this value. Setting this too low may
# cause older BMCs to crash and require a hard reset. However,
# setting too high can cause the sync power state periodic
# task to hang when there are slow or unresponsive BMCs.
# (integer value)
#retry_timeout = 60
retry_timeout = 15

# Minimum time, in seconds, between IPMI operations sent to a
# server. There is a risk with some hardware that setting this
# too low may cause the BMC to crash. Recommended setting is 5
# seconds. (integer value)
#min_command_interval = 5
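
Put together, a tuned /etc/ironic/ironic.conf would carry both values. A sketch, assuming the Newton-era [ipmi] option group; 15 and 5 match the values shown above and should be tuned for the BMCs in use:

[ipmi]
# Maximum time in seconds to retry IPMI operations (default 60).
retry_timeout = 15
# Minimum time in seconds between IPMI operations sent to a server (default 5).
min_command_interval = 5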

