Bug 906389

Summary: engine: we are fencing a host when putting it in maintenance after failed reinstall.
Product: Red Hat Enterprise Virtualization Manager Reporter: Dafna Ron <dron>
Component: ovirt-engine Assignee: Eli Mesika <emesika>
Status: CLOSED CURRENTRELEASE QA Contact: Tareq Alayan <talayan>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.1.2 CC: acathrow, bazulay, dyasny, iheim, lpeer, masayag, ofrenkel, pstehlik, Rhev-m-bugs, talayan, yeylon, ykaul, yzaslavs
Target Milestone: --- Keywords: Regression, Reopened
Target Release: 3.2.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: infra
Fixed In Version: sf14 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-02-18 13:50:56 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description      Flags
log              none
engine.log.sf11  none

Description Dafna Ron 2013-01-31 14:55:41 UTC
Created attachment 690983 [details]
log

Description of problem:

One of the hosts on rhevm-3 got a kernel panic and we decided to reinstall the OS.
After I reinstalled the OS, I tried reinstalling the host in rhevm and the reinstall failed.
When I tried putting the host in maintenance, it went down for reboot.
Looking at the log: when the host changes state to preparing for maintenance, we send a query to the host and get a network exception because the rhevm network has not yet been installed, and so the host is fenced.
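
The sequence described above can be sketched as a small model. This is only an illustration under stated assumptions, not the engine's code: `HostStatus`, `Host`, `on_network_exception`, and `move_to_maintenance` are hypothetical names, and the rule that an unreachable host with power management configured gets fenced is inferred from the log excerpt in this report.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List


class HostStatus(Enum):
    INSTALL_FAILED = auto()
    PREPARING_FOR_MAINTENANCE = auto()
    MAINTENANCE = auto()


@dataclass
class Host:
    name: str
    status: HostStatus
    power_management: bool
    reachable: bool  # False while the management network is not installed


def on_network_exception(host: Host) -> str:
    """Hypothetical monitoring reaction to a failed stats query."""
    # Assumption: a host that stops responding while it is being prepared
    # for maintenance is fenced when power management is configured.
    if host.status is HostStatus.PREPARING_FOR_MAINTENANCE and host.power_management:
        return "fence"
    return "ignore"


def move_to_maintenance(host: Host) -> List[str]:
    """Replay the reported flow and return the resulting event trail."""
    events = []
    host.status = HostStatus.PREPARING_FOR_MAINTENANCE
    events.append("status -> PreparingForMaintenance")
    if not host.reachable:
        # The stats query fails because the host cannot be reached.
        events.append("VDSNetworkException")
        events.append(on_network_exception(host))
    else:
        host.status = HostStatus.MAINTENANCE
        events.append("status -> Maintenance")
    return events
```

In this model, a host that failed installation, has power management configured, and is unreachable ends its event trail with a fence action, which matches the behaviour reported here.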

Version-Release number of selected component (if applicable):

si26.1

How reproducible:

100%

Steps to Reproduce:
1. install a clean rhel on a host which is already installed in rhevm  (host needs to be configured with power management) 
2. do not register the host and try to reinstall it in rhevm
3. after install fails -> put the host in maintenance 
  
Actual results:

host is fenced when we put it in maintenance 

Expected results:

host should not be fenced. 

Additional info: logs

2013-01-31 15:52:09,421 INFO  [org.ovirt.engine.core.bll.MaintananceNumberOfVdssCommand] (pool-3-thread-5) [27e46562] Running command: MaintananceNumberOfVdssCommand internal: false. Entities affected :  ID: 50077e50-ffcf-11e0-9807-00145e832c40 Type: VDS
2013-01-31 15:52:09,438 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (pool-3-thread-5) [27e46562] START, SetVdsStatusVDSCommand(HostName = master-vds13, HostId = 50077e50-ffcf-11e0-9807-00145e832c40, status=PreparingForMaintenance, nonOperationalReason=NONE), log id: 6dc3d1c7
2013-01-31 15:52:09,462 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (pool-3-thread-5) [27e46562] FINISH, SetVdsStatusVDSCommand, log id: 6dc3d1c7
2013-01-31 15:52:09,485 ERROR [org.ovirt.engine.core.vdsbroker.VdsManager] (QuartzScheduler_Worker-95) VDS::handleNetworkException Server failed to respond,  vds_id = 50077e50-ffcf-11e0-9807-00145e832c40, vds_name = master-vds13, error = VDSNetworkException:

2013-01-31 15:52:09,563 ERROR [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] (QuartzScheduler_Worker-95) vds::refreshVdsStats Failed getVdsStats,  vds = 50077e50-ffcf-11e0-9807-00145e832c40 : master-vds13, error = VDSNetworkException: VDSNetworkException:
2013-01-31 15:52:09,580 INFO  [org.ovirt.engine.core.bll.MaintananceVdsCommand] (pool-3-thread-5) [27e46562] Running command: MaintananceVdsCommand internal: true. E


2013-01-31 15:52:09,682 INFO  [org.ovirt.engine.core.bll.FencingExecutor] (pool-3-thread-3) Executing <Status> Power Management command, Proxy Host:master-vds8, Agent:bladecenter, Target Host:master-vds13, Management IP:qabc3-mgmt.qa.lab.tlv.redhat.com, User:USERID, Options:port,secure=False,slot=6

Comment 2 Barak 2013-02-03 11:00:45 UTC
 (In reply to comment #0)
> Created attachment 690983 [details]
> log
> 
> Description of problem:
> 

> How reproducible:
> 
> 100%
> 
> Steps to Reproduce:
> 1. install a clean rhel on a host which is already installed in rhevm  (host
> needs to be configured with power management)

what was the status of the Host before running the yum update ?

 
> 2. do not register the host and try to reinstall it in rhevm

I assume you refer to RHN registration.

> 3. after install fails -> put the host in maintenance 
>

Comment 3 Dafna Ron 2013-02-03 11:57:53 UTC
> > 1. install a clean rhel on a host which is already installed in rhevm  (host
> > needs to be configured with power management)
> 
> what was the status of the Host before running the yum update ?

There was no yum update; it was a complete installation of a clean OS.
However, the host has to be in maintenance state before you can reinstall it in rhevm.
> 
>  
> > 2. do not register the host and try to reinstall it in rhevm
> 
> I assume you refer to RHN registration.
> 

Yes. A host that is not registered with RHN will fail the install, so this is an easy way to make the install fail in its early stages.

Comment 5 Barak 2013-02-10 10:37:18 UTC
This BZ looks like a duplicate of Bug 894231.

Comment 21 Dan Yasny 2013-02-21 12:24:08 UTC
Dafna, can you please open a bz for 
> one is that we can use rest to reinstall a host in non-operational state

For this BZ, the solution will be to block reinstall for non-operational hosts and to block maintenance for "install failed" hosts, as per Comment 19.

I'm acking this report with the reqs above

Comment 23 Eli Mesika 2013-02-21 13:04:45 UTC
(In reply to comment #21)
> Dafna, can you please open a bz for 
> > one is that we can use rest to reinstall a host in non-operational state
> 
> For this BZ,  block re install for non-operational hosts, 
>and block maintenance for "install failed"

I checked the code; AFAIK, this already works that way ...

> hosts will be the solution, as per Comment
> 19
> 
> I'm acking this report with the reqs above

Comment 24 Eli Mesika 2013-02-25 18:29:57 UTC
fixing in commit: 5242f13

Comment 26 Tareq Alayan 2013-03-22 10:54:52 UTC
failed QA (sf12)
================

steps:
Reinstalled host
Failed installation 
put host in maintenance

from event tab:
2013-Mar-22, 12:50 Host aqua6 is rebooting.
2013-Mar-22, 12:50 Host aqua6 was started by Engine.
2013-Mar-22, 12:49 Manual fence for host aqua6 was started.
2013-Mar-22, 12:49 Host aqua6 was stopped by Engine.
2013-Mar-22, 12:49 Host aqua6 is non-responsive.
2013-Mar-22, 12:49 Host aqua6 was switched to Maintenance mode by admin@internal.

attaching engine log.

Comment 27 Tareq Alayan 2013-03-22 10:56:53 UTC
Created attachment 714481 [details]
engine.log.sf11

Comment 28 Barak 2013-03-24 10:41:17 UTC
I think the discussion above missed the major point of this scenario.

We must make sure:
- An "Install Failed" host should not be allowed to move to maintenance, because there may be various situations under "install failed" that cannot be handled.
- Such a host can be removed and then reinstalled (this should be the official way of handling this scenario).
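
The two rules above could be expressed as status checks along the following lines. This is a minimal sketch only: the status names and the functions `can_move_to_maintenance` and `can_remove` are hypothetical and do not reproduce the engine's validation code, and the set of statuses from which removal is allowed is an assumption made for illustration.

```python
from enum import Enum, auto


class HostStatus(Enum):
    UP = auto()
    MAINTENANCE = auto()
    NON_OPERATIONAL = auto()
    INSTALL_FAILED = auto()


def can_move_to_maintenance(status: HostStatus) -> bool:
    """Rule 1: an "Install Failed" host must not be moved to maintenance."""
    return status is not HostStatus.INSTALL_FAILED


# Assumption: removal is allowed only from these statuses.
REMOVABLE_STATUSES = {HostStatus.MAINTENANCE, HostStatus.INSTALL_FAILED}


def can_remove(status: HostStatus) -> bool:
    """Rule 2: an "Install Failed" host can be removed (and then reinstalled)."""
    return status in REMOVABLE_STATUSES
```

Rejecting the maintenance request up front means the host never enters the preparing-for-maintenance state, so the failed query that leads to fencing is never issued.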

Comment 29 Eli Mesika 2013-03-24 20:40:05 UTC
(In reply to comment #28)
> I think the discussion above missed the major point of this scenario.
> 
> We must make sure:
> - "Install Failed" host should not be allowed to move to maintenance,
> because there may be various situations under "install failed" that can not
> be handled.

This is already implemented in MaintenanceVdsCommand::canDoAction

> - Such a host can be removed and than reinstalled. (this should be the
> official 
> way of handling this scenario)

This is already implemented in RemoveVdsCommand::canDoAction

Comment 30 Eli Mesika 2013-04-02 12:50:44 UTC
(In reply to comment #29)
> (In reply to comment #28)
> > I think the discussion above missed the major point of this scenario.
> > 
> > We must make sure:
> > - "Install Failed" host should not be allowed to move to maintenance,
> > because there may be various situations under "install failed" that can not
> > be handled.
> 
> This is already implemented in MaintenanceVdsCommand::canDoAction
> 
> > - Such a host can be removed and than reinstalled. (this should be the
> > official 
> > way of handling this scenario)
> 
> This is already implemented in RemoveVdsCommand::canDoAction

The above is valid for 3.1 as well...

Comment 31 Eli Mesika 2013-04-06 19:43:02 UTC
Barak, following our talk on this BZ, please advise how to proceed ...

Comment 34 Eli Mesika 2013-04-15 06:57:35 UTC
fixed in commit: f813bb9

Comment 36 Eli Mesika 2013-04-22 12:41:20 UTC
Should fix: enable moving a non-operational host to maintenance

Comment 37 Eli Mesika 2013-04-22 13:51:37 UTC
(In reply to comment #36)
> Should fix : enabling moving non-operational host to maintenance

fixed in commit: 95d2c0e

Comment 38 Tareq Alayan 2013-05-19 13:37:33 UTC
verified. 

- Host failed to install 
- Host can be removed or re-installed
- Host cannot go to Maintenance state

Comment 39 Itamar Heim 2013-06-11 08:22:26 UTC
3.2 has been released

Comment 40 Itamar Heim 2013-06-11 08:24:44 UTC
3.2 has been released