Description of problem:
The power-saving fencing policy can neither successfully power down the Host nor successfully power it up again afterwards. The reason this is broken is that, with method 1, when RHEV shuts down a Host it connects to the Host (I assume via SSH) and issues a shutdown command from within the Host OS. The Host then shuts down and powers off, but nothing tells Cisco UCS that the "desired" power state is down. After the policy has shut down a Host, Cisco UCS still has a "desired" power state of "UP", so when RHEV later tries to power this Host up again, UCS ignores the request because UCS thinks the Host is still UP.

Version-Release number of selected component (if applicable):
RHEV 3.6

How reproducible:
Always in customer environment

Steps to Reproduce:
1. Set the power fencing policy ("Disable policy control of power management" unchecked) in the power management window of the RHEVM GUI.
2. When RHEVM tries to shut down the host, it never turns the "desired" power state to down.
3. RHEVM then waits to start the host up again and gets the error:
   ssl=yes out [] err ['Failed: Timed out waiting to power ON', '', '']

Actual results:
ssl=yes out [] err ['Failed: Timed out waiting to power ON', '', '']

Expected results:
The host should be powered down and then back ON again by the policy.

Additional info:
1. When the host is powered down from the dropdown menu in the GUI, start/stop/restart works fine; the issue occurs only when "Disable policy control of power management" is unchecked in the configuration.
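For reference, the underlying sequence can be reproduced outside the GUI. The sketch below mirrors the ordering of operations the policy performs (OS-level shutdown, then a fence power-on); host/UCS addresses, credentials and the service profile name are placeholders, and this is not the exact code path RHEV uses:
~~~
#!/usr/bin/env python
# Sketch of the failing sequence outside the GUI: shut the host down from
# inside the OS (as the policy does), then ask the fence agent to power it
# back on. Addresses and credentials below are placeholders.
import subprocess
import time

HOST = "root@<HOST1>"
FENCE = ["fence_cisco_ucs",
         "-a", "<UCS_MANAGER_ADDRESS>",
         "-l", "<USERNAME>",
         "-p", "<PASSWORD>",
         "-n", "<SERVICE_PROFILE_NAME>"]

# 1. OS-level shutdown, as the power-saving policy does (UCS is not told anything)
subprocess.call(["ssh", HOST, "shutdown", "-h", "now"])
time.sleep(120)  # give the blade time to actually power off

# 2. The agent sees the blade as OFF (exit code 2) ...
subprocess.call(FENCE + ["-o", "status"])

# 3. ... but powering it back on times out, because UCS's "desired" power
#    state is still "up" and the power-on request is effectively a no-op
subprocess.call(FENCE + ["-o", "on"])
~~~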
Hello,

I am able to reproduce the issue in a lab environment with the following configuration. This is an issue with 3.6.9 as well.

What I have tested so far:

Host1: Red Hat Enterprise Linux Server release 7.2 (Maipo)
  vdsm-4.17.35-1.el7ev.noarch
  kernel 3.10.0-327.36.1.el7.x86_64
Host2: Red Hat Enterprise Linux Server release 7.2 (Maipo)
  vdsm-4.17.35-1.el7ev.noarch
  kernel 3.10.0-327.36.1.el7.x86_64
RHEVM: rhevm-3.6.9.2-0.1.el6.noarch

Cluster Scheduling Policy (custom configuration):
- CpuOverCommitDurationMinutes: 2
- MaxFreeMemoryForOverUtilized: 26000
- HighUtilization: 50
- LowUtilization: 4
- MinFreeMemoryForUnderUtilized: 500000
- EnableAutomaticHostPowerManagement: true
- HostsInReserve: 1

Checked with the following options: Optimized for Utilization, Enable HA Reservation.

In "Edit Host":
- checked: Enable Power Management
- left unchecked: "Disable policy control of power management"

- I tried to keep the hosts empty (no load); one of the hosts was then stopped by the system:
  Host Hypervisor1 was stopped by SYSTEM.
  Power management stop of Host Hypervisor1 succeeded.
  Afterwards, when you check the properties in the Cisco UCS console, it says "desired" power state UP, but ideally it should be down.

- I then increased the load on the other Hypervisor, which results in:
  Host Hypervisor1 was started by SYSTEM.
  Power management start of Host Hypervisor1 succeeded.
  Hypervisor1 does not start and gets stuck at the following, because the "desired" power state was never set to down:
  ssl=yes out [] err ['Failed: Timed out waiting to power ON', '', '']

Not sure why, when the same commands are sent manually as well as by the policy, there is no change in the status of the UCS host. We need to debug why the policy shutdown does not set the host's "desired" power state to down.

Hope this helps. If I get time, I will try this with RHV 4.
Hi,

It seems the shutdown takes more time, so the wait timeout for the status check should be set explicitly. Please try setting the following option as well in the "Options" field:

power_wait=30

Then try the scenario again.
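For reference, the options from the webadmin "Options" field are passed to the fence agent as key=value parameters on its stdin (via the proxy host). A minimal sketch of testing the same thing from a host's shell, assuming placeholder address/credentials; exact option names should be checked against man fence_cisco_ucs for your version:
~~~
#!/usr/bin/env python
# Minimal sketch: feed key=value options (as the engine does through the proxy
# host) to fence_cisco_ucs on stdin, including power_wait=30.
# Address, credentials and plug name below are placeholders, not real values.
import subprocess

options = "\n".join([
    "action=status",               # or off / on to test the full cycle
    "ipaddr=<UCS_MANAGER_ADDRESS>",
    "login=<USERNAME>",
    "passwd=<PASSWORD>",
    "port=<SERVICE_PROFILE_NAME>",
    "ssl_insecure=1",
    "power_wait=30",               # wait 30s after a power action before re-checking status
]) + "\n"

proc = subprocess.Popen(["fence_cisco_ucs"], stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = proc.communicate(options.encode())
print(out.decode())
print(err.decode())
print("exit code: %d" % proc.returncode)
~~~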
Nope, that did not help. RHEVM does say that the host is not needed and shuts it down, but when the host is required again, it gets stuck at the same point:
~~~
Thread-16498::DEBUG::2016-10-10 20:05:10,449::bindingxmlrpc::1281::vds::(wrapper) return fenceNode with {'status': {'message': ['Failed: Timed out waiting to power ON', '', ''], 'code': 1}, 'operationStatus': 'initiated', 'power': 'unknown'}
~~~
I waited more than 90 seconds for the correct power status to be set, but it does not change. The "desired power state" remains UP.
Well, I checked, and the reason why this is not working is as follows:
1) When you stop/start the host from the power management menu, it actually uses the fencing device to do so, and therefore the fencing device reports an 'off' status after the fencing operation completes.
2) When you shut down all of a host's VMs while the cluster is in 'power saving' mode, the host is powered off via SSH and not by the fencing agent.
But I wonder why, after the host is shut down via the SSH session, the cisco_ucs device still reports an 'ON' status.
Marek, if a host that has power management configured is shut down not via its fencing agent, should the fencing device still report an 'ON' status? To me it seems like a bug in the fencing agent implementation, as if it were saving or caching its state and not checking the real status.
@Eli: Fence agents do not store any information, so a cached state is not an option. But we had a bug with Cisco UCS (https://bugzilla.redhat.com/show_bug.cgi?id=1298430) that was fixed relatively recently. According to Cisco, our latest package (z-stream) should be fine. Bug #1298430 concerns cases where UCS returns ON even for machines that are OFF and vice versa, so it might also be your case.
@Ulhas, could you please test with the latest fence-agents-4.0.11-27.el7_2.9?
Hi Martin, I see all the packages are at the same version: ~~~ # rpm -qa | grep fence-agents fence-agents-rsb-4.0.11-27.el7_2.9.x86_64 fence-agents-drac5-4.0.11-27.el7_2.9.x86_64 fence-agents-cisco-ucs-4.0.11-27.el7_2.9.x86_64 fence-agents-apc-4.0.11-27.el7_2.9.x86_64 fence-agents-ibmblade-4.0.11-27.el7_2.9.x86_64 fence-agents-apc-snmp-4.0.11-27.el7_2.9.x86_64 fence-agents-vmware-soap-4.0.11-27.el7_2.9.x86_64 fence-agents-common-4.0.11-27.el7_2.9.x86_64 fence-agents-rsa-4.0.11-27.el7_2.9.x86_64 fence-agents-emerson-4.0.11-27.el7_2.9.x86_64 fence-agents-ilo-moonshot-4.0.11-27.el7_2.9.x86_64 fence-agents-ilo-mp-4.0.11-27.el7_2.9.x86_64 fence-agents-hpblade-4.0.11-27.el7_2.9.x86_64 fence-agents-kdump-4.0.11-27.el7_2.9.x86_64 fence-agents-wti-4.0.11-27.el7_2.9.x86_64 fence-agents-bladecenter-4.0.11-27.el7_2.9.x86_64 fence-agents-cisco-mds-4.0.11-27.el7_2.9.x86_64 fence-agents-intelmodular-4.0.11-27.el7_2.9.x86_64 fence-agents-eaton-snmp-4.0.11-27.el7_2.9.x86_64 fence-agents-mpath-4.0.11-27.el7_2.9.x86_64 fence-agents-compute-4.0.11-27.el7_2.9.x86_64 fence-agents-all-4.0.11-27.el7_2.9.x86_64 fence-agents-eps-4.0.11-27.el7_2.9.x86_64 fence-agents-rhevm-4.0.11-27.el7_2.9.x86_64 fence-agents-brocade-4.0.11-27.el7_2.9.x86_64 fence-agents-ilo-ssh-4.0.11-27.el7_2.9.x86_64 fence-agents-ilo2-4.0.11-27.el7_2.9.x86_64 fence-agents-ifmib-4.0.11-27.el7_2.9.x86_64 fence-agents-ipdu-4.0.11-27.el7_2.9.x86_64 fence-agents-ipmilan-4.0.11-27.el7_2.9.x86_64 fence-agents-scsi-4.0.11-27.el7_2.9.x86_64 ~~~
(In reply to Ulhas Surse from comment #10)
> Hi Martin,
>
> I see all the packages are at the same version:

Is that on the host that was selected as the proxy for the fencing operation?
Yes, there are two hosts and the packages are the same on both of them.
Ulhas, could you please verify that the following flows work in your environment (let's assume you want to fence host1 and the commands are executed from host2)?

A. Turn off host1 using PM and check the PM status (assuming the host is turned on)
1. Turn off the host:
   fence_cisco_ucs -a <HOST1_PM_ADDRESS> -l <USERNAME> -p <PASSWORD> -o off ; echo $?
   Expected output:
   Success: Powered OFF
   0
2. Check the PM status of the host (this step should return OFF immediately after a successful action in step 1, but for safety execute this action several times with a few seconds delay):
   fence_cisco_ucs -a <HOST1_PM_ADDRESS> -l <USERNAME> -p <PASSWORD> -o status ; echo $?
   Expected output:
   Status: OFF
   2

B. Turn off the host using the shutdown command and check the PM status (assuming the host is turned on)
1. Turn off the host using shutdown:
   ssh root@host1 shutdown -h now
2. Check the console of host1 and wait until host1 is really turned off.
3. Check the PM status of host1 (once you have verified on the console that the host is turned off, this should return OFF status):
   fence_cisco_ucs -a <HOST1_PM_ADDRESS> -l <USERNAME> -p <PASSWORD> -o status ; echo $?
   Expected output:
   Status: OFF
   2

C. Turn on the host using PM and check the PM status
1. Turn on the host:
   fence_cisco_ucs -a <HOST1_PM_ADDRESS> -l <USERNAME> -p <PASSWORD> -o on ; echo $?
   Expected output:
   Success: Powered ON
   0
2. Check the PM status of the host (this step should return ON immediately after a successful action in step 1, but for safety execute this action several times with a few seconds delay):
   fence_cisco_ucs -a <HOST1_PM_ADDRESS> -l <USERNAME> -p <PASSWORD> -o status ; echo $?
   Expected output:
   Status: ON
   0

If any fence_cisco_ucs execution doesn't work as expected, please consult the man page and try to set additional parameters until it succeeds. Those additional parameters need to be specified as stdin parameters in the Options field of the Fence Agent dialog inside webadmin.
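For the repeated status checks in steps A.2 and C.2, a small polling sketch that can be run from host2 (addresses and credentials are placeholders; exit code 0 corresponds to Status: ON and exit code 2 to Status: OFF, as in the expected outputs above):
~~~
#!/usr/bin/env python
# Minimal sketch: poll fence_cisco_ucs status until the expected state is
# reported or the attempts run out. Exit code 0 means Status: ON, exit code 2
# means Status: OFF. Address/credentials are placeholders.
import subprocess
import sys
import time

def wait_for_status(expected_rc, attempts=10, delay=5):
    cmd = ["fence_cisco_ucs",
           "-a", "<HOST1_PM_ADDRESS>",
           "-l", "<USERNAME>",
           "-p", "<PASSWORD>",
           "-o", "status"]
    for i in range(attempts):
        rc = subprocess.call(cmd)
        print("attempt %d: exit code %d" % (i + 1, rc))
        if rc == expected_rc:
            return True
        time.sleep(delay)
    return False

if __name__ == "__main__":
    # e.g. after "-o off": wait for OFF (exit code 2)
    ok = wait_for_status(expected_rc=2)
    sys.exit(0 if ok else 1)
~~~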
Hello Martin,

I think this will work fine, as I already mentioned in the description of this bugzilla under additional info: the issue only appears when the power-saving cluster policy is applied. I have re-installed my servers and am testing with RHV 4.0 now to see whether the issue shows up there as well. I will still try your commands, but those will be on RHV 4 now.
According to the logs, the status action is working as expected:
operPower="off" ... operState="power-off" ... presence="equipped"
It looks like the machine is there but is powered off. The question is why our command to start the machine works only in some cases. This looks like a question for Cisco.
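For anyone debugging this against UCSM directly: the attributes above come from the UCS XML API. A minimal sketch of pulling both the physical blade's operPower and the service profile's "desired" power state; the /nuova endpoint and the aaaLogin/configResolveDn methods are standard UCS XML API calls, but the manager address, credentials and DNs below are assumptions and must be adjusted to your org and service profile names:
~~~
#!/usr/bin/env python
# Minimal sketch (not the fence agent code): query the UCS XML API for the
# blade's actual power state and the service profile's desired power state.
# Manager address, credentials and DNs are placeholders/assumptions.
import xml.etree.ElementTree as ET

import requests  # assumes python-requests is available

UCS = "https://<UCS_MANAGER_ADDRESS>/nuova"

def api(body):
    resp = requests.post(UCS, data=body, verify=False)  # lab only: skip cert check
    return ET.fromstring(resp.content)

# 1. Log in and grab a session cookie
login = api('<aaaLogin inName="<USERNAME>" inPassword="<PASSWORD>" />')
cookie = login.get("outCookie")

# 2. Desired power state: lsPower object under the service profile
desired = api('<configResolveDn cookie="%s" inHierarchical="false" '
              'dn="org-root/ls-<PROFILE_NAME>/power" />' % cookie)
for el in desired.iter("lsPower"):
    print("desired power state: %s" % el.get("state"))   # e.g. "up" / "down"

# 3. Actual power state: operPower on the physical blade
blade = api('<configResolveDn cookie="%s" inHierarchical="false" '
            'dn="sys/chassis-<N>/blade-<M>" />' % cookie)
for el in blade.iter("computeBlade"):
    print("operPower: %s, operState: %s" % (el.get("operPower"), el.get("operState")))

# 4. Log out
api('<aaaLogout inCookie="%s" />' % cookie)
~~~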
Hello,

Thanks for the update. Would it be possible for you to share the complete analysis so that it can be submitted to Cisco for discussion, including at which point they have to check the status?
@Ulhas: It would be best to give them access to all messages in this thread. The main issue is that, using their API, we can get into a state where we are not able to turn a machine on, and a machine which should be OFF is still reported as ON. We have solved a similar type of issue before (https://bugzilla.redhat.com/show_bug.cgi?id=1298430), so if we can send this to the same people, they should know what we are doing wrong.
Here are the test results from the environment. I have tested the steps you provided; I am getting an error after step 2, at step 3:

A. Turn off host1 using PM and check the PM status (assuming the host is turned on)
1] # fence_cisco_ucs -a <host_name> -n <plug> -l <username> -p <password> -o off ; echo $?
   Success: Powered OFF
   0
2] # fence_cisco_ucs -a <host_name> -n <plug> -l <username> -p <password> -o status ; echo $?
   Status: OFF
   2
After waiting for some time, the host got fenced:
   # fence_cisco_ucs -a <host_name> -n <plug> -l <username> -p <password> -o status ; echo $?
   Status: ON
   0
=================================================================================================
B. Turn off the host using the shutdown command and check the PM status (assuming the host is turned on)
1] # shutdown -h now
2] Host really off: NO  <<====== Here, the console still shows UP.
   Overall Status: Powered Off
   Desired Power State: UP
3] # fence_cisco_ucs -a <host_name> -n <plug> -l <username> -p <password> -o status ; echo $?
   Status: OFF
   2
=================================================================================================
C. Turn on the host using PM and check the PM status
1] # fence_cisco_ucs -a <host_name> -n <plug> -l <username> -p <password> -o on ;
   (28, 'Operation timed out after 3001 milliseconds with 0 out of -1 bytes received')
   Connection timed out  <<===== FAILED
   # fence_cisco_ucs -a <host_name> -n <plug> -l <username> -p <password> -o on ;
   Failed: Timed out waiting to power ON  <<======= TIMEOUT
=================================================================================================
I have tried the following options:
lanplus=1,power_wait=20,ssl_insecure=1
This still fails.
Hi all,

I am the customer that initially reported this as a problem to RH Support. I've followed the comments / updates here and feel that I need to point out that the problem is not with the power up. The problem lies with how the script is shutting down the blade.

You can check this yourself: allow the script to shut down the blade and then go to the UCSM java application > Servers > Service Profiles > the associated service profile name (that the script just shut down). Then check the "Overall Status" and expand the drop-down field to see "Desired Power State". The issue is in effect when "Overall Status" = off/down and "Desired Power State" = up. This is the condition in which the script leaves the blade. Therefore, when the script attempts power up, because there is no checking of the existing condition, the script sends the power-up command; the desired state is already up, so nothing happens.

In addition, if you compare the responses from UCSM for the following two conditions you can see the difference:

1. The first condition is a working scenario. To do this, shut down the blade gracefully from UCSM (i.e. do not allow the script to shut it down). Check that the desired power state has been changed to reflect the down status. THEN use the script to power up the host. This will work. Capture the response from UCSM to RHEVH and you will see an extra value there stating that the power up is successful.

2. The second condition is the non-working one. Allow the fence script to shut down the Host. Notice that the desired power state remains up. Now try to power up the Host from the script and capture the response from UCSM. You will see that there is a field missing and that the response is different. The response is "command OK", but this is not an indication of a successful power up of the host.

Kind regards
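One way to capture the responses described above is to run the power-on through the fence agent with its debug output enabled and compare the resulting logs. A sketch, assuming the -v/--verbose and -D/--debug-file options from the fence-agents man pages (check man fence_cisco_ucs on your version); addresses and credentials are placeholders:
~~~
#!/usr/bin/env python
# Minimal sketch: run the power-on and save the agent's debug output (the
# request/response exchange with UCSM) to a file for comparison between the
# two conditions above. Options, addresses and credentials are assumptions.
import subprocess

def power_on_with_debug(logfile):
    cmd = ["fence_cisco_ucs",
           "-a", "<UCS_MANAGER_ADDRESS>",
           "-l", "<USERNAME>",
           "-p", "<PASSWORD>",
           "-n", "<SERVICE_PROFILE_NAME>",
           "-o", "on",
           "-v",            # verbose output
           "-D", logfile]   # write the debug exchange to a file
    rc = subprocess.call(cmd)
    print("%s: exit code %d" % (logfile, rc))

if __name__ == "__main__":
    # Run once per condition (prepare the blade state as described above first),
    # with a different log file name each time, then diff the two log files.
    power_on_with_debug("ucs_poweron_condition1.log")
~~~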
Hi Tony,

Thanks for the description. I would just add a minor clarification to

"Therefore when the script attempts power up, because there is no checking of the existing condition; the script sends the power up command and the desired state is already up so nothing happens."

That is not completely true, because in the fence agent we check the state of the machine before and after running power on/off. So the final timeout happens not because the command to power ON fails, but because we are waiting for the machine to report that its status is ON.
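To make that concrete, a simplified sketch of the fence action flow described above (this is an illustration, not the actual fencing library code; get_power_status/set_power_status stand in for the agent's UCS calls and the timeout values are examples):
~~~
#!/usr/bin/env python
# Simplified illustration of the flow: check status, send the power command,
# then poll until the device reports the requested state or the timeout expires.
import time

POWER_TIMEOUT = 60   # example value; the real agent uses --power-timeout
POWER_WAIT = 5       # example value; the real agent uses --power-wait

def fence_action(action, get_power_status, set_power_status):
    # 1. Status check before the action
    if get_power_status() == action:
        return "Success: Already %s" % action.upper()

    # 2. Send the power command to the device (UCS in this case)
    set_power_status(action)
    time.sleep(POWER_WAIT)

    # 3. Poll the device until it reports the requested state
    deadline = time.time() + POWER_TIMEOUT
    while time.time() < deadline:
        if get_power_status() == action:
            return "Success: Powered %s" % action.upper()
        time.sleep(1)

    # This is the branch hit in this bug: the "on" command is accepted, but the
    # status never becomes ON because the desired power state is already "up".
    return "Failed: Timed out waiting to power %s" % action.upper()
~~~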
Hi, has there been any progress on this case? I believe the customer has opened a new case 01750696 for this and is asking what the status of the issue is. Apparently this problem is still occurring. I am linking that case with this bugzilla.