Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1068865

Summary: SPM semi-randomly switches between hosts
Product: [Retired] oVirt
Component: vdsm
Version: 3.3
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: low
Priority: low
Reporter: Gerasimos Melissaratos <gmelis>
Assignee: Nir Soffer <nsoffer>
QA Contact: Aharon Canan <acanan>
Docs Contact:
CC: acathrow, amureini, bazulay, bugs, danken, fsimonce, gklein, gmelis, iheim, mgoldboi, yeylon
Target Milestone: ---
Keywords: Triaged
Target Release: 3.5.0
Whiteboard: storage
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-05-27 15:59:55 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
partial vdsm.log describing an spmId change to -1 (flags: none)
Ovirt GUI console output showing the spm flapping (flags: none)
Log files from all 4 hosts (vdsm.log and sanlock.log) from 13:00 to 14:30 (flags: none)
spm-lock.log from all hosts (flags: none)

Description Gerasimos Melissaratos 2014-02-22 16:14:15 UTC
Description of problem:

In an oVirt cluster with 4 hosts and 3 NFS storage domains, the SPM keeps flapping between hosts semi-randomly for no apparent reason, and the logs, as far as I can see, are clean. Below I have attached the vdsm.log of a host for a period of time during which the spmId changed from 3 to -1. This seems to happen mostly when one of the storage domains is experiencing higher usage, but is still reachable and responsive enough. The VMs are usable during the whole period, and the SPM can change at a very high rate, like once per minute.

Version-Release number of selected component (if applicable):
vdsm-cli-4.13.3-3.el6.noarch
vdsm-xmlrpc-4.13.3-3.el6.noarch
vdsm-4.13.3-3.el6.x86_64
vdsm-bootstrap-4.13.3-3.el6.noarch
vdsm-python-4.13.3-3.el6.x86_64


How reproducible:
Haven't got any clue.

Steps to Reproduce:
1.
2.
3.

Actual results:
SPM flapping between hosts

Expected results:
SPM should stay put

Additional info:

Comment 1 Gerasimos Melissaratos 2014-02-22 16:16:30 UTC
Created attachment 866428 [details]
partial vdsm.log describing an spmId change to -1

Comment 2 Itamar Heim 2014-02-23 08:28:12 UTC
Setting target release to current version for consideration and review. please
do not push non-RFE bugs to an undefined target release to make sure bugs are
reviewed for relevancy, fix, closure, etc.

Comment 3 Gerasimos Melissaratos 2014-02-23 08:28:50 UTC
Created attachment 866570 [details]
Ovirt GUI console output showing the spm flapping

Comment 4 Nir Soffer 2014-02-23 11:58:21 UTC
Gerasimos, can you attach vdsm.log and sanlock.log from all hosts?

Since the SPM switches every minute or so, I guess that one hour of logs from when the system is experiencing this will be enough.

Comment 5 Nir Soffer 2014-02-23 12:25:47 UTC
Gerasimos, what did you change in the system before this started?

Comment 6 Gerasimos Melissaratos 2014-02-23 14:19:44 UTC
Created attachment 866679 [details]
Log files from all 4 hosts (vdsm.log and sanlock.log) from 13:00 to 14:30

Comment 7 Gerasimos Melissaratos 2014-02-23 14:34:33 UTC
This is a system that started off as oVirt 3.1 and ended up, through upgrades, at 3.3. The problem became noticeable a few months ago on 3.2, but not much attention was paid to it. It became a real pain more recently, but since this is a live system I was not very willing to tinker with it. A week ago I attempted the upgrade to 3.3 (very afraid of something going wrong), hoping the problem would automagically go away; everything went smoothly and I ended up with a 3.3 system, only the problem persisted.

As far as upgrades are concerned: when I went from 3.1 to 3.2 the database upgrade failed, and it took me about a week to finish the upgrade with no errors. Thankfully the hosts kept going smoothly during that week, and when the engine came back up and could see everything, I removed all hosts and added them again one by one, to make sure the engine would be in sync with the hosts.

Comment 8 Nir Soffer 2014-02-23 15:26:59 UTC
It looks like you are still working with an old data center and cluster compatibility version (< 3.3). In particular, you are using storage domain version 0, which uses the older SafeLease cluster lock. Although the old version is still supported, the chances of getting fixes for this code are lower.

We recommend upgrading the cluster and data center compatibility version to 3.3. This version uses sanlock for the cluster lock.

Comment 9 Nir Soffer 2014-02-23 15:47:59 UTC
A strange event seen in the logs may explain some of the SPM switches:

$ grep 'signal 10' logs_*/vdsm.log
logs_ekeini/vdsm.log:MainThread::DEBUG::2014-02-23 13:58:57,734::vdsm::50::vds::(sigusr1Handler) Received signal 10
logs_ekeinos/vdsm.log:MainThread::DEBUG::2014-02-23 12:42:28,922::vdsm::50::vds::(sigusr1Handler) Received signal 10
logs_ekeinos/vdsm.log:MainThread::DEBUG::2014-02-23 13:02:30,558::vdsm::50::vds::(sigusr1Handler) Received signal 10
logs_ekeinos/vdsm.log:MainThread::DEBUG::2014-02-23 14:11:53,657::vdsm::50::vds::(sigusr1Handler) Received signal 10
logs_ekeinos/vdsm.log:MainThread::DEBUG::2014-02-23 14:12:52,824::vdsm::50::vds::(sigusr1Handler) Received signal 10

When receiving SIGUSR1, vdsm stops the SPM if this host is the SPM.

Federico, do we expect to receive this signal from someone?
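For reference, the handler behavior described above can be sketched as follows. This is an illustrative stand-in, not vdsm's actual code: the `SpmRole` class and its attributes are assumptions made for the sketch; only the signal number (SIGUSR1, i.e. signal 10) and the "stop the SPM only if this host holds it" rule come from the bug.

```python
import os
import signal


class SpmRole:
    """Minimal stand-in for the host's SPM state (illustrative only)."""

    def __init__(self, is_spm=True):
        self.is_spm = is_spm
        self.stopped = False

    def stop(self):
        self.stopped = True
        self.is_spm = False


spm = SpmRole()


def sigusr1_handler(signum, frame):
    # Mirrors the logged behavior: on signal 10 (SIGUSR1), stop the
    # SPM role, but only if this host currently holds it.
    if spm.is_spm:
        spm.stop()


signal.signal(signal.SIGUSR1, sigusr1_handler)
```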

Comment 10 Gerasimos Melissaratos 2014-02-23 19:34:00 UTC
Just updated data center compatibility version to 3.3. Moving some gigs from one storage to another to see if anything breaks.

Comment 11 Gerasimos Melissaratos 2014-02-23 20:44:39 UTC
The SPM is not flapping any more, or so it seems. The upgrade to compatibility version 3.3 seems to have solved the problem. I'll post a follow-up later in the week to confirm it's still working (hopefully). Thanks for the prompt responses.

Comment 12 Dan Kenigsberg 2014-02-23 23:12:46 UTC
Nir, spmprotect.sh sends USR1 when it fails to pet its lease. Hints on why this has happened may be found in /var/log/vdsm/spm-lock.log. Gerasimos, could you attach the latter?
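The mechanism described above (a watchdog loop that sends USR1 when lease renewal fails) could be sketched roughly like this. All names and parameters here are assumptions for illustration; this is not spmprotect.sh's actual logic, only the contract it implies: keep renewing the SPM lease, and after repeated failures ask vdsm to give up the SPM role by sending it SIGUSR1.

```python
import os
import signal
import time


def protect_spm_lease(renew, vdsm_pid, interval=1.0, max_failures=3):
    """Illustrative lease-protection loop (hypothetical, not vdsm code).

    Calls renew() every `interval` seconds; after `max_failures`
    consecutive failures, sends SIGUSR1 (signal 10) to the vdsm
    process so it relinquishes the SPM role, then exits.
    """
    failures = 0
    while True:
        if renew():
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:
                os.kill(vdsm_pid, signal.SIGUSR1)
                return
        time.sleep(interval)
```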

Comment 13 Gerasimos Melissaratos 2014-02-24 08:10:42 UTC
Created attachment 866890 [details]
spm-lock.log from all hosts

Comment 14 Gerasimos Melissaratos 2014-02-27 07:03:33 UTC
SPM has been stable since the configuration upgrade to version 3.3, so my vote goes for closure of this problem report. Not one SPM change for 4 days.

Comment 15 Nir Soffer 2014-02-27 07:19:03 UTC
Thanks for updating Gerasimos!

But I think we should investigate the spm-lock.log and understand why this error happens with old domain version.

Moving it to next version.

Comment 16 Allon Mureinik 2014-05-27 15:59:55 UTC
(In reply to Nir Soffer from comment #15)
> Thanks for updating Gerasimos!
> 
> But I think we should investigate the spm-lock.log and understand why this
> error happens with old domain version.
> 
> Moving it to next version.
Since upgrading solves this, and since we aren't going to put the effort into fixing safelease, closing.