Bug 927151
| Summary: | engine: after SPM election of a host, if it suddenly becomes non-responsive we try to contend a new host as SPM (contending will be blocked by sanlock) | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Dafna Ron <dron> |
| Component: | ovirt-engine | Assignee: | Ayal Baron <abaron> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Dafna Ron <dron> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.1.3 | CC: | acathrow, dyasny, hateya, iheim, lpeer, Rhev-m-bugs, scohen, yeylon, ykaul |
| Target Milestone: | --- | | |
| Target Release: | 3.2.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | storage | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-04-07 07:21:25 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
```
[root@gold-vdsc ~]# vdsClient -s 0 getSpmStatus f92efae9-b68b-4e10-be1b-39ab7ab4e026
	spmId = 1
	spmStatus = SPM
	spmLver = 0
[root@gold-vdsd ~]# vdsClient -s 0 getSpmStatus f92efae9-b68b-4e10-be1b-39ab7ab4e026
	spmId = 1
	spmStatus = Contend
	spmLver = 0
[root@gold-vdsd ~]# vdsClient -s 0 getSpmStatus f92efae9-b68b-4e10-be1b-39ab7ab4e026
	spmId = 1
	spmStatus = Free
	spmLver = 0
```

Dafna, I don't understand the use case. What's the end result, and what's the impact on the pool? If I understand correctly, the old SPM host is now up and running; does it become SPM? Do you now need to manually fence the new one so that the old one becomes SPM?

1. There could be two SPMs if we have more than one domain. I was using only one domain in my tests, which is why sanlock was blocking the contention on that domain. But with more than one domain we can combine reconstruct with this action, and since sanlock will not block the new master domain we can end up with two SPMs.
2. As far as the user is concerned, the domains move to non-operational.

(In reply to comment #3)
> 1. there could be two SPM's if we have more than one domain - I was using
> only 1 domain in my tests this is why the sanlock was locking the domain.

Can you run this with 2 domains and actually reach this state?

(In reply to comment #3)
> 2. as far as the user is considered we move the domains to non-operational.

To be clear, the moment the engine tries to run SPM on the new host, it should not try to run reconstruct before fencing that host, so I don't think we'll reach this state.
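The vdsClient output above (SPM → Contend → Free) can be illustrated with a minimal Python toy model of a sanlock-style lease. This is an illustrative sketch only, not the real sanlock API: with a single storage domain, the old SPM still holds the master lease, so the new host's contention attempt is rejected.

```python
import threading

class StorageDomainLease:
    """Toy stand-in for a sanlock resource lease on a storage domain.
    (Illustrative model only; real sanlock leases live on shared storage.)"""

    def __init__(self):
        self._lock = threading.Lock()
        self.owner = None

    def try_acquire(self, host):
        # A lease is granted only while no live host holds it.
        if self._lock.acquire(blocking=False):
            self.owner = host
            return True
        return False  # contention is blocked

    def release(self):
        self.owner = None
        self._lock.release()

# Single-domain pool: the old SPM holds the master lease, so the
# contender fails and its spmStatus falls back from Contend to Free.
master = StorageDomainLease()
assert master.try_acquire("gold-vdsc")       # old SPM holds the lease
assert not master.try_acquire("gold-vdsd")   # new contender is blocked
```

With more than one domain, the contention described in comment #3 would target a different (new master) domain whose lease is free, which is why two SPMs could result in that scenario.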
Created attachment 715912 [details]
logs

Description of problem:
After I rebooted my SPM and selected "Confirm host has been rebooted", as soon as the old SPM started I blocked the new SPM from the engine with an iptables REJECT rule, and the engine automatically contended the new SPM. It appears to come from this command:

```
2013-03-25 10:21:10,428 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-42) [795ce3ba] Irs placed on server null failed. Proceed Failover
```

Version-Release number of selected component (if applicable):
3.1.3

How reproducible:
100%

Steps to Reproduce:
1. In a two-host cluster with NFS storage, reboot the SPM.
2. When the host becomes non-responsive, select "Confirm host has been rebooted".
3. When the old SPM host changes state to Up, block connectivity between the engine and the host using iptables with REJECT.

Actual results:
The engine tries to contend SPM again although the host is non-responsive; we should not be contending SPM in such cases.

Expected results:
We should not contend SPM if we have a network connectivity issue.

Additional info:
logs attached
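The expected behaviour can be sketched as a small guard on the failover path. This is a hypothetical illustration, not the actual engine code: the names `should_contend_spm`, `host_status`, and `failure_is_network` are invented for this sketch. The point is that a failure caused by lost engine-to-host connectivity (the iptables REJECT in the steps above) should not trigger a new SPM contention.

```python
def should_contend_spm(host_status, failure_is_network):
    """Hypothetical guard, not the real ovirt-engine logic: skip SPM
    contention when the failure is a connectivity problem or the
    current SPM host is non-responsive (fence it first instead)."""
    if host_status == "NonResponsive" or failure_is_network:
        return False
    return True

# The scenario from the steps to reproduce: engine-to-host traffic is
# REJECTed by iptables, so the failure is a network issue and the
# engine should not proceed to contend SPM.
assert should_contend_spm("NonResponsive", True) is False
# A genuine SPM failure with a responsive host may proceed to failover.
assert should_contend_spm("Up", False) is True
```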