Bug 1254012 - RHEV power management not shutting down/restarting hosts properly
Status: CLOSED INSUFFICIENT_DATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.3
Hardware: x86_64 Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assigned To: Eli Mesika
QA Contact: Petr Matyáš
Whiteboard: infra
Depends On:
Blocks:
Reported: 2015-08-16 13:22 EDT by Robert McSwain
Modified: 2016-02-10 14:38 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-11-29 03:30:55 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Robert McSwain 2015-08-16 13:22:43 EDT
Description of problem:
An upgraded RHEV host, "azariah", was determined to be non-responsive, but when it was fenced it was shut down rather than restarted, leaving its VMs hanging in the meantime.

Version-Release number of selected component (if applicable):
RHEV 3.5.3

How reproducible:
Intermittent

Steps to Reproduce:
- Dropped the kernel version on the hypervisor down to an older version. This was an attempt to test something else.
- Upon bootup, RHEV-M now reports that it fails to verify host power management.
- Interestingly enough, that does not prevent the server from being placed back into RHEV. It still makes the hypervisor active and I can migrate servers to it.
- If I then have RHEV-M restart the server, it powers the server down instead of power cycling it.
- For at least a little while (the server was shut down a few minutes ago), the VMs which were running on it still show as being "up" and connected. They have not gone to "?" status yet.
- Still, I cannot do anything with the VMs. Power cycle, shutdown, migrate, suspend... nothing works.

Actual results:
Host is sometimes powered off rather than power cycled

Expected results:
Host is power cycled as expected 

Additional info:
Data coming in a future update
Comment 4 Robert McSwain 2015-08-25 13:52:31 EDT
Eli,

The customer already is using lanplus=1, and all of their hosts are RHEL 6 based rather than RHEL 7 based. Is there any resolution for RHEL 6 hosts given they're already using lanplus=1?

Regards,
Robert McSwain
Comment 5 Eli Mesika 2015-08-26 03:40:16 EDT
Please attach here the output of running the operation directly with /usr/sbin/fence_ipmilan on the host used as a proxy for this operation.
Comment 6 Robert McSwain 2015-08-26 10:04:07 EDT
I've requested this info and will update again when I get a reply. Thanks!
Comment 7 Robert McSwain 2015-08-26 13:39:28 EDT
From the customer after testing and waiting a bit longer for the process to complete:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 
# fence_ipmilan -a 192.168.26.138 -l fence -p picket -P -M cycle -v
Rebooting machine @ IPMI:192.168.26.138...Spawning: '/usr/bin/ipmitool -I lanplus -H '192.168.26.138' -U 'fence' -P '[set]' -v chassis power status'...
Spawning: '/usr/bin/ipmitool -I lanplus -H '192.168.26.138' -U 'fence' -P '[set]' -v chassis power cycle'...
Done

This actually *does* finally cycle the power. I looked at the server status immediately after the command was issued and the server showed as being down. But after about a minute or so, the server did restart. Perhaps, like the other command, there is a stop portion and a start portion of the command, and I only looked at the status after the stop portion.

At this point the question I have is: What is RHEVM/VDSM actually using to do the fencing?
Comment 8 Eli Mesika 2015-08-27 03:38:01 EDT
(In reply to Robert McSwain from comment #7)
RHEVM/VDSM does not use the cycle option, because the engine has to poll for the actual status and verify that the host was stopped before migrating HA VMs to another host. If cycle is used, polling cannot tell whether the host went down and started again or whether the stop operation failed when the fencing agent returns an ON status, and that can lead to corruption if VMs end up running on two hosts.

So, what I need is the output of fence_ipmilan with -m off followed by output of fence_ipmilan with -m on
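
To illustrate the ambiguity described above, here is a minimal Python sketch. It is not engine code: the function name and the sample lists are hypothetical, and it only models what a poller could conclude from a series of status readings taken after a cycle request.

```python
def classify_after_cycle(samples):
    """Return what a poller can conclude from status samples
    taken after a 'cycle' request (hypothetical helper)."""
    if "off" in samples:
        return "host was seen powered off"
    # Every reading was ON: either the cycle finished between two polls,
    # or the request never took effect. The poller cannot tell which.
    return "ambiguous: cycle completed between polls, or the request failed"


# Hypothetical readings: a cycle that completed between polls and a
# cycle that failed outright produce exactly the same samples.
fast_cycle = ["on", "on", "on"]
failed_cycle = ["on", "on", "on"]

print(classify_after_cycle(fast_cycle))
print(classify_after_cycle(failed_cycle))
print(classify_after_cycle(["on", "off", "on"]))
```

With separate off and on requests the poller can wait until it has actually observed an OFF reading before anything is started elsewhere, which is the guarantee the cycle option cannot give.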
Comment 9 Robert McSwain 2015-10-02 09:29:04 EDT
Eli,

Here's the information you've requested:


When testing yesterday, I was testing with a server fencing itself when what I should have done is to have another server fence the one which needed fencing. It turns out there is a difference here because the -m onoff option actually *DOES* restart the server, just in a different way than the -m cycle option.  The -m onoff option first turns the other server off, then monitors for a while, then turns it back on, like this:
------------------
[root@peter.library.nd.edu no_ora ~]# fence_ipmilan --ip=192.168.26.138 --username=fence --password=picket -m onoff --lanplus -v
Delay 0 second(s) before logging in to the fence device
Executing: /usr/bin/ipmitool -I lanplus -H 192.168.26.138 -U fence -P picket -p 623 -L ADMINISTRATOR chassis power status

0 Chassis Power is on


Executing: /usr/bin/ipmitool -I lanplus -H 192.168.26.138 -U fence -P picket -p 623 -L ADMINISTRATOR chassis power off

0 Chassis Power Control: Down/Off


Executing: /usr/bin/ipmitool -I lanplus -H 192.168.26.138 -U fence -P picket -p 623 -L ADMINISTRATOR chassis power status

0 Chassis Power is on


Executing: /usr/bin/ipmitool -I lanplus -H 192.168.26.138 -U fence -P picket -p 623 -L ADMINISTRATOR chassis power status

0 Chassis Power is on


Executing: /usr/bin/ipmitool -I lanplus -H 192.168.26.138 -U fence -P picket -p 623 -L ADMINISTRATOR chassis power status

0 Chassis Power is on


Executing: /usr/bin/ipmitool -I lanplus -H 192.168.26.138 -U fence -P picket -p 623 -L ADMINISTRATOR chassis power status

1  Error in open session response message : insufficient resources for session

Error: Unable to establish IPMI v2 / RMCP+ session


Executing: /usr/bin/ipmitool -I lanplus -H 192.168.26.138 -U fence -P picket -p 623 -L ADMINISTRATOR chassis power status

1  Error in open session response message : insufficient resources for session

Error: Unable to establish IPMI v2 / RMCP+ session


Executing: /usr/bin/ipmitool -I lanplus -H 192.168.26.138 -U fence -P picket -p 623 -L ADMINISTRATOR chassis power status

1  Error in open session response message : insufficient resources for session

Error: Unable to establish IPMI v2 / RMCP+ session


Executing: /usr/bin/ipmitool -I lanplus -H 192.168.26.138 -U fence -P picket -p 623 -L ADMINISTRATOR chassis power status

0 Chassis Power is off


Executing: /usr/bin/ipmitool -I lanplus -H 192.168.26.138 -U fence -P picket -p 623 -L ADMINISTRATOR chassis power on

0 Chassis Power Control: Up/On


Executing: /usr/bin/ipmitool -I lanplus -H 192.168.26.138 -U fence -P picket -p 623 -L ADMINISTRATOR chassis power status

0 Chassis Power is off


Executing: /usr/bin/ipmitool -I lanplus -H 192.168.26.138 -U fence -P picket -p 623 -L ADMINISTRATOR chassis power status

0 Chassis Power is off


Executing: /usr/bin/ipmitool -I lanplus -H 192.168.26.138 -U fence -P picket -p 623 -L ADMINISTRATOR chassis power status

0 Chassis Power is off


Executing: /usr/bin/ipmitool -I lanplus -H 192.168.26.138 -U fence -P picket -p 623 -L ADMINISTRATOR chassis power status

0 Chassis Power is off


Executing: /usr/bin/ipmitool -I lanplus -H 192.168.26.138 -U fence -P picket -p 623 -L ADMINISTRATOR chassis power status

0 Chassis Power is off


Executing: /usr/bin/ipmitool -I lanplus -H 192.168.26.138 -U fence -P picket -p 623 -L ADMINISTRATOR chassis power status

0 Chassis Power is off


Executing: /usr/bin/ipmitool -I lanplus -H 192.168.26.138 -U fence -P picket -p 623 -L ADMINISTRATOR chassis power status

0 Chassis Power is on


Success: Rebooted

------------------
This is much different than the -m cycle option, which simply does this:

# fence_ipmilan --ip=192.168.26.138 --username=fence --password=picket -m cycle --lanplus -v
Delay 0 second(s) before logging in to the fence device
Executing: /usr/bin/ipmitool -I lanplus -H 192.168.26.138 -U fence -P picket -p 623 -L ADMINISTRATOR chassis power status

0 Chassis Power is on


Executing: /usr/bin/ipmitool -I lanplus -H 192.168.26.138 -U fence -P picket -p 623 -L ADMINISTRATOR chassis power cycle

0 Chassis Power Control: Cycle


Success: Rebooted
-------------------

When I ran this yesterday, as I said, I had the server fence itself because I wanted to make sure I used the newest version of the fence_ipmilan command. What I did not know is that the fencing server monitors for the target to go down and then powers it back up. Obviously, if a server turns itself off, it cannot monitor for itself to be down and then turn itself back on, so the server remained down.
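
For comparing traces like the two above without reading every line, a small Python sketch along these lines can pull the status readings out of the verbose output. The function names are hypothetical; the only strings it relies on are the "Chassis Power is on/off" and "Unable to establish IPMI" lines that appear in the output pasted in this comment.

```python
import re

# Matches status readings only; "Chassis Power Control: ..." lines are
# acknowledgements of a request and are deliberately not matched.
STATUS_RE = re.compile(r"Chassis Power is (on|off)")
ERROR_MARK = "Unable to establish IPMI"


def power_timeline(verbose_output):
    """Return the ordered list of 'on' / 'off' / 'error' status readings."""
    samples = []
    for line in verbose_output.splitlines():
        match = STATUS_RE.search(line)
        if match:
            samples.append(match.group(1))
        elif ERROR_MARK in line:
            samples.append("error")
    return samples


def saw_power_off(verbose_output):
    """True if at least one status reading reported the chassis as off."""
    return "off" in power_timeline(verbose_output)


sample = """\
0 Chassis Power is on
0 Chassis Power Control: Down/Off
0 Chassis Power is on
Error: Unable to establish IPMI v2 / RMCP+ session
0 Chassis Power is off
"""
print(power_timeline(sample))  # ['on', 'on', 'error', 'off']
```

Run against the onoff trace, this shows an OFF reading before the power-on request; the cycle trace contains no OFF reading at all.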
Comment 12 Eli Mesika 2015-10-07 05:30:26 EDT
(In reply to Robert McSwain from comment #9)

This does not match the way the engine performs a reboot sequence.

Please attach the output of the following sequence, which exactly matches the sequence of commands sent by the engine:

1)
# fence_ipmilan --ip=192.168.26.138 --username=fence --password=picket -m off --lanplus -v
2)
# fence_ipmilan --ip=192.168.26.138 --username=fence --password=picket -m status --lanplus -v

repeat 2) until you get an 'off' status

3)
# fence_ipmilan --ip=192.168.26.138 --username=fence --password=picket -m on --lanplus -v

4)
# fence_ipmilan --ip=192.168.26.138 --username=fence --password=picket -m status --lanplus -v

repeat 4) until you get an 'on' status
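
The four steps above can be sketched in Python as follows. This is an illustration under assumptions, not the engine's implementation: `send` is a hypothetical callable that performs one fence operation ("off", "on" or "status") and, for "status", returns "on" or "off". Wiring it to fence_ipmilan is left out on purpose, and `FakeBmc` is a stand-in so the sequence can be run without hardware.

```python
import time


def wait_for(send, wanted, attempts=20, delay=0.0):
    """Poll 'status' until it equals `wanted`; return the number of polls used."""
    for n in range(1, attempts + 1):
        if send("status") == wanted:
            return n
        time.sleep(delay)
    raise TimeoutError(f"host never reported {wanted!r}")


def engine_style_restart(send, attempts=20, delay=0.0):
    """Steps 1-4: off, poll until off, on, poll until on."""
    send("off")
    off_polls = wait_for(send, "off", attempts, delay)
    send("on")
    on_polls = wait_for(send, "on", attempts, delay)
    return off_polls, on_polls


class FakeBmc:
    """Hypothetical BMC that needs `lag` status polls before a change shows up."""

    def __init__(self, lag=3):
        self.state = "on"
        self.target = "on"
        self.lag = lag
        self.pending = 0

    def __call__(self, action):
        if action in ("on", "off"):
            self.target = action
            self.pending = self.lag
            if self.lag == 0:
                self.state = action
            return "ok"
        # Any other action is treated as a status query.
        if self.pending > 0:
            self.pending -= 1
            if self.pending == 0:
                self.state = self.target
        return self.state


print(engine_style_restart(FakeBmc(lag=3)))  # (3, 3)
```

The default `delay` of 0.0 only suits the fake; against a real BMC a delay of a few seconds between polls would be the obvious choice, since the trace in comment 9 shows the BMC rejecting sessions for a while after the power-off request.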
