Bug 1381009 - Cisco UCS fencing using "Power fencing policy" in RHEVM is not shutting down the fence device.
Summary: Cisco UCS fencing using "Power fencing policy" in RHEVM is not shutting down ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.6.8
Hardware: Unspecified
OS: Linux
Priority: high
Severity: medium
Target Milestone: ovirt-4.1.1
Assignee: Martin Perina
QA Contact: Petr Matyáš
URL:
Whiteboard:
Depends On: 1412722
Blocks:
 
Reported: 2016-10-02 09:08 UTC by Ulhas Surse
Modified: 2020-01-17 15:59 UTC
CC: 17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1410881
Environment:
Last Closed: 2017-04-25 00:57:13 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments


Links:
  System: Red Hat Product Errata
  ID: RHEA-2017:0998
  Private: 0
  Priority: normal
  Status: SHIPPED_LIVE
  Summary: VDSM bug fix and enhancement update 4.1 GA
  Last Updated: 2017-04-18 20:11:39 UTC

Description Ulhas Surse 2016-10-02 09:08:57 UTC
Description of problem:
The power saving fencing policy cannot successfully power a host down and then cannot successfully power the host back up. The reason this is broken is that when RHEV shuts down a host under this policy, it connects to the host (I assume via SSH) and issues a shutdown command from within the host OS. The host then shuts down and powers off, but nothing tells Cisco UCS that the "desired" power state is now down. After the policy has shut a host down, Cisco UCS still has a "desired" power state of "UP", so when RHEV later tries to power the host up again, UCS ignores the request because UCS thinks the host is still UP.
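
For illustration, the broken flow can be reproduced manually along these lines (a rough sketch only; the UCSM address, plug name and credentials are placeholders, and the exact fence_cisco_ucs options may need adjusting for the environment):

~~~
# Shut the host down from inside the OS, the same way the power saving policy does
ssh root@<HOST1> 'shutdown -h now'

# The blade powers off; the status action may even report OFF, but UCSM's
# "desired power state" for the service profile stays UP
fence_cisco_ucs -a <UCSM_ADDRESS> -l <USERNAME> -p <PASSWORD> -n <PLUG> -o status

# A later power-on request is then effectively ignored by UCSM and times out
fence_cisco_ucs -a <UCSM_ADDRESS> -l <USERNAME> -p <PASSWORD> -n <PLUG> -o on
# -> Failed: Timed out waiting to power ON
~~~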

Version-Release number of selected component (if applicable):
RHEV 3.6

How reproducible:
Always in customer environment

Steps to Reproduce:
1. Set the power fencing policy (leave "Disable policy control of power management" unchecked) in the Power Management window of the RHEVM GUI.
2. When RHEVM tries to shut down the host, it never sets the "desired power state" to down.
3. When it later tries to start the host up again, it fails with the error:
ssl=yes out [] err ['Failed: Timed out waiting to power ON', '', '']

Actual results:
ssl=yes out [] err ['Failed: Timed out waiting to power ON', '', '']

Expected results:
The host should be powered down by the policy and then powered ON again when needed.

Additional info:
1. When the host is powered down from the dropdown menu in the GUI, start/stop/restart work fine; the issue occurs only when "Disable policy control of power management" is left unchecked in the configuration.

Comment 1 Ulhas Surse 2016-10-06 02:12:48 UTC
Hello,

I am able to reproduce the issue in a lab environment with the following configuration.

This is an issue with 3.6.9 as well. What I have tested so far:

Host1:
Red Hat Enterprise Linux Server release 7.2 (Maipo)
vdsm-4.17.35-1.el7ev.noarch
3.10.0-327.36.1.el7.x86_64

Host2: 
Red Hat Enterprise Linux Server release 7.2 (Maipo)
vdsm-4.17.35-1.el7ev.noarch
3.10.0-327.36.1.el7.x86_64

RHEVM:
rhevm-3.6.9.2-0.1.el6.noarch

Cluster Scheduling Policy (custom configuration): 

 - CpuOverCommitDurationMinutes: 2
 - MaxFreeMemoryForOverUtilized: 26000
 - HighUtilization: 50
 - LowUtilization: 4
 - MinFreeMemoryForUnderUtilized: 500000
 - EnableAutomaticHostPowerManagement: true
 - HostsInReserve: 1

Checked with the following options:

Optimized for Utilization
Enable HA Reservation

---
In Edit Host:

Check:
Enable Power Management

Leave unchecked: "Disable policy control of power management"

- I kept the hosts empty (no load); then one of the hosts was stopped by the system:
     Host Hypervisor1 was stopped by SYSTEM.
     Power management stop of Host Hypervisor1 succeeded.

Afterwards, when you check the Cisco UCS console properties, it shows the "desired power state" as UP, but ideally it should be down.

- Then I increased the load on the other hypervisor, which results in:

     Host Hypervisor1 was started by SYSTEM.
     Power management start of Host Hypervisor1 succeeded.

Hypervisor1 does not start and gets stuck at the following, because the "desired power state" was never set to down:

     ssl=yes out [] err ['Failed: Timed out waiting to power ON', '', '']

Not sure why, when the same commands are sent manually and by the policy, there is no change in the status of the UCS host.
We need to debug why the policy-driven shutdown does not set the host's "desired power state" to down.

Hope this helps.
If I get time, I will try this with RHV4.

Comment 2 Eli Mesika 2016-10-09 08:20:07 UTC
Hi

It seems the shutdown takes more time, so the wait timeout for the status check should be set explicitly.
Please try setting the following option as well in the "options" field:

"power_wait=30"

Then try the scenario again.
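
For manual testing outside the engine, the same option can be passed to the agent directly; a rough sketch with placeholder address, credentials and plug name (the engine passes the Options field to the agent as key=value pairs on stdin):

~~~
# Command-line equivalent of power_wait=30 for a manual test
fence_cisco_ucs -a <UCSM_ADDRESS> -l <USERNAME> -p <PASSWORD> -n <PLUG> --power-wait=30 -o off ; echo $?

# Roughly what the agent receives on stdin when "power_wait=30" is set in the
# Options field (key names follow the fence-agents stdin interface)
echo -e "ipaddr=<UCSM_ADDRESS>\nlogin=<USERNAME>\npasswd=<PASSWORD>\nport=<PLUG>\npower_wait=30\naction=status" | fence_cisco_ucs
~~~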

Comment 3 Ulhas Surse 2016-10-10 14:46:04 UTC
No, that did not help. RHEVM says the host is not needed and shuts it down, but when the host is required again, it gets stuck at the same point:

~~~
 Thread-16498::DEBUG::2016-10-10 20:05:10,449::bindingxmlrpc::1281::vds::(wrapper) return fenceNode with {'status': {'message': ['Failed: Timed out waiting to power ON', '', ''], 'code': 1}, 'operationStatus': 'initiated', 'power': 'unknown'}
~~~

I waited more than 90 seconds for the correct power status to be set, but it does not change. The "desired power state" remains UP.

Comment 7 Eli Mesika 2016-10-13 14:25:36 UTC
Well, I checked, and the reason this is not working is as follows:

1) When you stop/start the host from the power management menu, it actually uses the fencing device to do so, and therefore the fencing device reports an 'off' status after the fencing operation completes.

2) When you shut down all of a host's VMs while the cluster is in 'power saving' mode, the host is powered off via SSH and not by the fencing agent.

But I wonder why the cisco_ucs device still reports an 'ON' status after the host has been shut down over SSH.

Marek, if a host that has power management configured is shut down by something other than its fencing agent, should the fencing device report an 'ON' status?

To me it seems like a bug in the fencing agent implementation, which is saving or caching its state instead of checking the real status.

Comment 8 Marek Grac 2016-10-16 16:54:48 UTC
@Eli:

Fence agents do not store any information, so caching state is not an option. But we had a bug with Cisco UCS (https://bugzilla.redhat.com/show_bug.cgi?id=1298430) that was fixed relatively recently. According to Cisco, our latest package (z-stream) should be fine.

Bug #1298430 concerns issues where UCS returns ON even for machines that are OFF, and vice versa. So it might also be your case.

Comment 9 Martin Perina 2016-10-17 10:54:35 UTC
@Ulhas, could you please test with the latest fence-agents-4.0.11-27.el7_2.9?

Comment 10 Ulhas Surse 2016-10-17 11:02:50 UTC
Hi Martin,

I see all the installed fence-agents packages are already at that version:


~~~
# rpm -qa | grep fence-agents
fence-agents-rsb-4.0.11-27.el7_2.9.x86_64
fence-agents-drac5-4.0.11-27.el7_2.9.x86_64
fence-agents-cisco-ucs-4.0.11-27.el7_2.9.x86_64
fence-agents-apc-4.0.11-27.el7_2.9.x86_64
fence-agents-ibmblade-4.0.11-27.el7_2.9.x86_64
fence-agents-apc-snmp-4.0.11-27.el7_2.9.x86_64
fence-agents-vmware-soap-4.0.11-27.el7_2.9.x86_64
fence-agents-common-4.0.11-27.el7_2.9.x86_64
fence-agents-rsa-4.0.11-27.el7_2.9.x86_64
fence-agents-emerson-4.0.11-27.el7_2.9.x86_64
fence-agents-ilo-moonshot-4.0.11-27.el7_2.9.x86_64
fence-agents-ilo-mp-4.0.11-27.el7_2.9.x86_64
fence-agents-hpblade-4.0.11-27.el7_2.9.x86_64
fence-agents-kdump-4.0.11-27.el7_2.9.x86_64
fence-agents-wti-4.0.11-27.el7_2.9.x86_64
fence-agents-bladecenter-4.0.11-27.el7_2.9.x86_64
fence-agents-cisco-mds-4.0.11-27.el7_2.9.x86_64
fence-agents-intelmodular-4.0.11-27.el7_2.9.x86_64
fence-agents-eaton-snmp-4.0.11-27.el7_2.9.x86_64
fence-agents-mpath-4.0.11-27.el7_2.9.x86_64
fence-agents-compute-4.0.11-27.el7_2.9.x86_64
fence-agents-all-4.0.11-27.el7_2.9.x86_64
fence-agents-eps-4.0.11-27.el7_2.9.x86_64
fence-agents-rhevm-4.0.11-27.el7_2.9.x86_64
fence-agents-brocade-4.0.11-27.el7_2.9.x86_64
fence-agents-ilo-ssh-4.0.11-27.el7_2.9.x86_64
fence-agents-ilo2-4.0.11-27.el7_2.9.x86_64
fence-agents-ifmib-4.0.11-27.el7_2.9.x86_64
fence-agents-ipdu-4.0.11-27.el7_2.9.x86_64
fence-agents-ipmilan-4.0.11-27.el7_2.9.x86_64
fence-agents-scsi-4.0.11-27.el7_2.9.x86_64
~~~

Comment 11 Eli Mesika 2016-10-25 09:09:59 UTC
(In reply to Ulhas Surse from comment #10)
> Hi Martin,
> 
> I see all the packages are at the same version: 

Is that done on the host that was selected as a proxy for the fencing operation?

Comment 12 Ulhas Surse 2016-10-25 11:11:06 UTC
Yes, there are two hosts and the packages are the same on both.

Comment 13 Martin Perina 2016-10-25 11:49:37 UTC
Ulhas, could you please verify that the following flows work in your environment? (Let's assume you want to fence host1 and that the commands are executed from host2.)

A. Turn off the host1 using PM and check for PM status (assuming host is turned on)

  1. Turn off the host

       fence_cisco_ucs -a <HOST1_PM_ADDRESS> -l <USERNAME> -p <PASSWORD> -o off ; echo $?

     Expected output:

       Success: Powered OFF
       0

  2. Check PM status of the host (this step should return OFF immediately after the successful action in step 1, but for safety execute it several times with a few seconds' delay; see the polling sketch at the end of this comment)

       fence_cisco_ucs -a <HOST1_PM_ADDRESS> -l <USERNAME> -p <PASSWORD> -o status ; echo $?

     Expected output:

       Status: OFF
       2

B. Turn off the host using the shutdown command and check for PM status (assuming host is turned on)

  1. Turn off the host using shutdown

     ssh root@host1
     shutdown -h now

  2. Check console of host1 and wait until host1 is really turned off

  3. Check PM status of host1 (once you have verified on the console that the host is turned off, this should return an OFF status)

       fence_cisco_ucs -a <HOST1_PM_ADDRESS> -l <USERNAME> -p <PASSWORD> -o status ; echo $?

     Expected output:

       Status: OFF
       2

C. Turn on the host using PM and check for PM status

  1. Turn on the host

       fence_cisco_ucs -a <HOST1_PM_ADDRESS> -l <USERNAME> -p <PASSWORD> -o on ; echo $?

     Expected output:

       Success: Powered ON
       0
  
  2. Check PM status of the host (this step should return ON immediately after the successful action in step 1, but for safety execute it several times with a few seconds' delay)

       fence_cisco_ucs -a <HOST1_PM_ADDRESS> -l <USERNAME> -p <PASSWORD> -o status ; echo $?

     Expected output:

       Status: ON
       0

If any fence_cisco_ucs execution doesn't work as expected, please consult the man page and try setting additional parameters until it succeeds. Those additional parameters need to be specified as stdin parameters in the Options field of the Fence Agent dialog in webadmin.
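
Since steps A.2 and C.2 above ask for the status check to be repeated a few times with a short delay, a small polling loop can be used; a sketch with the same placeholders as above:

~~~
# Poll the PM status several times with a few seconds' delay
# (the status action returns exit code 0 for ON and 2 for OFF)
for i in 1 2 3 4 5; do
    fence_cisco_ucs -a <HOST1_PM_ADDRESS> -l <USERNAME> -p <PASSWORD> -o status
    echo "exit code: $?"
    sleep 5
done
~~~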

Comment 14 Ulhas Surse 2016-10-25 12:19:46 UTC
Hello Martin,
I think this will work fine; as I have already mentioned in the bug description under additional info, the issue occurs only when the power saving cluster policy is applied.

I have re-installed my servers and am now testing with RHV 4.0 to see whether the issue appears there as well.

I will still try your commands, but they will now be run on RHV 4.

Comment 18 Marek Grac 2016-10-31 08:22:23 UTC
According to the logs, the status action is working as expected:

operPower="off" ... operState="power-off" ... presence="equipped"

It looks like the machine is there but is powered off.

The question is why our command to start the machine works only in some cases. This looks like a question for Cisco.

Comment 19 Ulhas Surse 2016-11-08 02:40:21 UTC
Hello,

Thanks for the update. Could you share the complete analysis so that it can be submitted to Cisco for discussion about the point at which they need to check the status?

Comment 20 Marek Grac 2016-11-08 15:31:56 UTC
@Ulhas:

It would be best to give them access to all messages in this thread. The main issue is that we can get into a state where, using their API, we are not able to turn on the machine, and a machine that should be OFF is still reported as ON.

We have solved a similar type of issue before (https://bugzilla.redhat.com/show_bug.cgi?id=1298430). So maybe, if we can send it to the same people, they will know what we are doing wrong.

Comment 24 Ulhas Surse 2016-11-10 02:51:45 UTC
Here are the test results from the environment: 


I have tested the steps you provided.
I am getting the error while following the steps, after step 2, at step 3:


A. Turn off the host1 using PM and check for PM status (assuming host is turned on)

1] 

# fence_cisco_ucs -a <host_name> -n <plug> -l <username> -p <password> -o off ; echo $?
Success: Powered OFF
0

2] 

# fence_cisco_ucs -a <host_name> -n <plug> -l <username> -p <password>  -o status ; echo $?
Status: OFF
2

After waiting for some time, the host got fenced:

# fence_cisco_ucs -a <host_name> -n <plug> -l <username> -p <password>  -o status ; echo $?
Status: ON
0

=================================================================================================


B. Turn off the host using the shutdown command and check for PM status (assuming host is turned on)

1]

# shutdown -h now

2]

Host really off: NO  <<====== Here, from the console it still shows UP.
                              Overall Status:       Powered Off
                              Desired Power State:  UP 

3]

# fence_cisco_ucs -a <host_name> -n <plug> -l <username> -p <password>  -o status ; echo $?
Status: OFF
2


=================================================================================================


C. Turn on the host using PM and check for PM status

1] 

# fence_cisco_ucs -a <host_name> -n <plug> -l <username> -p <password>  -o on ;
(28, 'Operation timed out after 3001 milliseconds with 0 out of -1 bytes received')

Connection timed out <<===== FAILED


# fence_cisco_ucs -a <host_name> -n <plug> -l <username> -p <password>  -o on ;
Failed: Timed out waiting to power ON <<======= TIMEOUT


=================================================================================================


I have tried the following options:

lanplus=1,power_wait=20,ssl_insecure=1

This still fails.
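
For the Cisco discussion, the failing power-on can also be reproduced manually with verbose output, which should capture the requests and responses exchanged with UCSM; a sketch with placeholder values, assuming the installed agent version supports these flags (note that lanplus is a fence_ipmilan option and is most likely ignored by fence_cisco_ucs):

~~~
# Manual power-on attempt with verbose output for Cisco's analysis
fence_cisco_ucs -a <UCSM_ADDRESS> -l <USERNAME> -p <PASSWORD> -n <PLUG> \
    --ssl-insecure --power-wait=20 --verbose -o on ; echo $?
~~~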

Comment 29 Tony Pearce 2016-11-15 02:24:48 UTC
Hi all, 
I am the customer who initially reported this problem to RH Support.

I've followed the comments and updates here and feel I need to point out that the problem is not with the power up. The problem lies in how the script shuts down the blade. You can check this yourself: allow the script to shut down the blade, then go in the UCSM Java application to Servers > Service Profiles > the associated service profile name (the one the script just shut down), check the "overall status", and expand the drop-down field to see "Desired power state". The issue is in effect when "overall status" = off/down and "desired power state" = up. This is the condition the script leaves the blade in. Therefore, when the script attempts power up, because the existing condition is not checked, the script sends the power-up command, the desired state is already up, and so nothing happens.
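
For anyone who wants to compare the two values outside the UCSM GUI, the service profile's desired power state and the blade's actual power state can also be read over the UCSM XML API. This is a rough sketch only: the /nuova endpoint, DNs and attribute names are assumptions based on the standard UCS Manager XML API, and the profile name, chassis/blade numbers and credentials are placeholders.

~~~
# Log in to the UCSM XML API and extract the session cookie
COOKIE=$(curl -sk "https://<UCSM_ADDRESS>/nuova" \
    -d '<aaaLogin inName="<USERNAME>" inPassword="<PASSWORD>"/>' \
    | sed -n 's/.*outCookie="\([^"]*\)".*/\1/p')

# Desired power state kept by the service profile (lsPower object, "state" attribute)
curl -sk "https://<UCSM_ADDRESS>/nuova" \
    -d "<configResolveDn cookie=\"$COOKIE\" dn=\"org-root/ls-<PROFILE>/power\" inHierarchical=\"false\"/>"

# Actual power state reported by the physical blade ("operPower" attribute)
curl -sk "https://<UCSM_ADDRESS>/nuova" \
    -d "<configResolveDn cookie=\"$COOKIE\" dn=\"sys/chassis-<N>/blade-<M>\" inHierarchical=\"false\"/>"
~~~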

In addition, if you compare the responses from UCSM for the following two conditions, you can see the difference:
1. The first condition is a working scenario. To reproduce it, shut down the blade gracefully from UCSM (i.e. do not let the script shut it down). Check that the desired power state has been changed to reflect the down status. THEN use the script to power up the host. This will work. Capture the response from UCSM to RHEVH and you will see an extra value there stating that the power up is successful.

2. The second condition is the non-working one. Allow the fence script to shut down the host. Notice that the desired power state remains up. Now try to power up the host from the script and capture the response from UCSM. You will see that a field is missing and that the response is different. The response is "command OK", but this is not an indication of a successful power up of the host.

Kind regards

Comment 30 Marek Grac 2016-11-15 16:29:45 UTC
Hi Tony,

Thanks for the description. I would just add a minor clarification to this point:

> Therefore when the script attempts power up, because there is no checking of the existing condition; the script sends the power up command and the desired state is already up so nothing happens.

That is not completely true, because in the fence agent we check the state of the machine before and after running power on/off. So the final timeout occurs not because the power ON command fails, but because the agent keeps waiting for the machine to report that its status is ON.

Comment 31 Bradley Frank 2016-12-05 17:46:29 UTC
Hi, has there been any progress on this issue? I believe the customer has opened a new case, 01750696, for this and is asking what the status of the issue is. Apparently the problem is still occurring. I am linking that case with this Bugzilla.

