Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1068865

Summary: SPM semi-randomly switches between hosts
Product: [Retired] oVirt
Component: vdsm
Version: 3.3
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: low
Priority: low
Reporter: Gerasimos Melissaratos <gmelis>
Assignee: Nir Soffer <nsoffer>
QA Contact: Aharon Canan <acanan>
Docs Contact:
CC: acathrow, amureini, bazulay, bugs, danken, fsimonce, gklein, gmelis, iheim, mgoldboi, yeylon
Target Milestone: ---
Keywords: Triaged
Target Release: 3.5.0
Whiteboard: storage
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-05-27 15:59:55 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
partial vdsm.log describing an spmId change to -1 (flags: none)
Ovirt GUI console output showing the spm flapping (flags: none)
Log files from all 4 hosts (vdsm.log and sanlock.log) from 13:00 to 14:30 (flags: none)
spm-lock.log from all hosts (flags: none)

Description Gerasimos Melissaratos 2014-02-22 16:14:15 UTC
Description of problem:

In an oVirt cluster with 4 hosts and 3 NFS storage domains, the SPM keeps flapping between hosts semi-randomly for no apparent reason, and the logs, as far as I can see, are clean. Below I have attached the vdsm.log of a host for a period of time during which the spmId changed from 3 to -1. This seems to happen mostly when one of the storage domains is experiencing higher usage, but is still reachable and responsive enough. The VMs are usable during the whole period, and the SPM can change at a very high rate, like once per minute.

Version-Release number of selected component (if applicable):
vdsm-cli-4.13.3-3.el6.noarch
vdsm-xmlrpc-4.13.3-3.el6.noarch
vdsm-4.13.3-3.el6.x86_64
vdsm-bootstrap-4.13.3-3.el6.noarch
vdsm-python-4.13.3-3.el6.x86_64


How reproducible:
Haven't got any clue.

Steps to Reproduce:
1.
2.
3.

Actual results:
SPM flapping between hosts

Expected results:
SPM should stay put

Additional info:

Comment 1 Gerasimos Melissaratos 2014-02-22 16:16:30 UTC
Created attachment 866428 [details]
partial vdsm.log describing an spmId change to -1

Comment 2 Itamar Heim 2014-02-23 08:28:12 UTC
Setting target release to current version for consideration and review. please
do not push non-RFE bugs to an undefined target release to make sure bugs are
reviewed for relevancy, fix, closure, etc.

Comment 3 Gerasimos Melissaratos 2014-02-23 08:28:50 UTC
Created attachment 866570 [details]
Ovirt GUI console output showing the spm flapping

Comment 4 Nir Soffer 2014-02-23 11:58:21 UTC
Gerasimos, can you attach vdsm.log and sanlock.log from all hosts?

Since the SPM switches every minute or so, I guess that one hour of logs from when the system is experiencing this will be enough.

Comment 5 Nir Soffer 2014-02-23 12:25:47 UTC
Gerasimos, what did you change in the system before this started?

Comment 6 Gerasimos Melissaratos 2014-02-23 14:19:44 UTC
Created attachment 866679 [details]
Log files from all 4 hosts (vdsm.log and sanlock.log) from 13:00 to 14:30

Comment 7 Gerasimos Melissaratos 2014-02-23 14:34:33 UTC
This is a system that started off as oVirt 3.1 and ended up, through upgrades, at 3.3. The problem became noticeable a few months ago on 3.2, but not much attention was paid to it. It became a real pain more recently, but since this is a live system I was not very willing to tinker with it. A week ago I attempted the upgrade to 3.3 (very afraid of something going wrong), hoping the problem would automagically go away; everything went smoothly and I ended up with a 3.3 system, only the problem persisted.

As far as upgrades are concerned: when I went from 3.1 to 3.2 the database upgrade failed, and it took me about a week to finish the upgrade with no errors. Thankfully the hosts kept going smoothly during that week, and when the engine came back up and could see everything, I removed all hosts and added them again one by one, to make sure the engine would be in sync with the hosts.

Comment 8 Nir Soffer 2014-02-23 15:26:59 UTC
It looks like you are still working with an old data center and cluster compatibility version (< 3.3). In particular, you are using storage domain version 0, which uses the older SafeLease cluster lock. Although the old version is still supported, the chances of getting fixes for this code are lower.

We recommend upgrading the cluster and data center compatibility version to 3.3. This version uses sanlock for the cluster lock.

Comment 9 Nir Soffer 2014-02-23 15:47:59 UTC
A strange event seen in the logs may explain some of the SPM switches:

$ grep 'signal 10' logs_*/vdsm.log
logs_ekeini/vdsm.log:MainThread::DEBUG::2014-02-23 13:58:57,734::vdsm::50::vds::(sigusr1Handler) Received signal 10
logs_ekeinos/vdsm.log:MainThread::DEBUG::2014-02-23 12:42:28,922::vdsm::50::vds::(sigusr1Handler) Received signal 10
logs_ekeinos/vdsm.log:MainThread::DEBUG::2014-02-23 13:02:30,558::vdsm::50::vds::(sigusr1Handler) Received signal 10
logs_ekeinos/vdsm.log:MainThread::DEBUG::2014-02-23 14:11:53,657::vdsm::50::vds::(sigusr1Handler) Received signal 10
logs_ekeinos/vdsm.log:MainThread::DEBUG::2014-02-23 14:12:52,824::vdsm::50::vds::(sigusr1Handler) Received signal 10

When receiving SIGUSR1, vdsm stops the SPM if this host is the SPM.

Federico, do we expect to receive this signal from someone?
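For reference, the handler behavior described above can be sketched as follows. This is an illustrative stand-in, not vdsm's actual code: the `SpmRole` class and its attributes are assumptions made for the sketch; only the signal number (SIGUSR1, i.e. signal 10) and the "stop the SPM only if this host holds it" rule come from the bug.

```python
import os
import signal


class SpmRole:
    """Minimal stand-in for the host's SPM state (illustrative only)."""

    def __init__(self, is_spm=True):
        self.is_spm = is_spm
        self.stopped = False

    def stop(self):
        self.stopped = True
        self.is_spm = False


spm = SpmRole()


def sigusr1_handler(signum, frame):
    # Mirrors the logged behavior: on signal 10 (SIGUSR1), stop the
    # SPM role, but only if this host currently holds it.
    if spm.is_spm:
        spm.stop()


signal.signal(signal.SIGUSR1, sigusr1_handler)
```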

Comment 10 Gerasimos Melissaratos 2014-02-23 19:34:00 UTC
Just updated data center compatibility version to 3.3. Moving some gigs from one storage to another to see if anything breaks.

Comment 11 Gerasimos Melissaratos 2014-02-23 20:44:39 UTC
The SPM is not flapping any more, or so it seems. The upgrade to compatibility version 3.3 seems to have solved the problem. I'll post a follow-up later in the week to confirm it's still working (hopefully). Thanks for the prompt responses.

Comment 12 Dan Kenigsberg 2014-02-23 23:12:46 UTC
Nir, spmprotect.sh sends USR1 when it fails to pet its lease. Hints on why this has happened may be found in /var/log/vdsm/spm-lock.log. Gerasimos, could you attach the latter?
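The mechanism described above (a watchdog loop that sends USR1 when lease renewal fails) could be sketched roughly like this. All names and parameters here are assumptions for illustration; this is not spmprotect.sh's actual logic, only the contract it implies: keep renewing the SPM lease, and after repeated failures ask vdsm to give up the SPM role by sending it SIGUSR1.

```python
import os
import signal
import time


def protect_spm_lease(renew, vdsm_pid, interval=1.0, max_failures=3):
    """Illustrative lease-protection loop (hypothetical, not vdsm code).

    Calls renew() every `interval` seconds; after `max_failures`
    consecutive failures, sends SIGUSR1 (signal 10) to the vdsm
    process so it relinquishes the SPM role, then exits.
    """
    failures = 0
    while True:
        if renew():
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:
                os.kill(vdsm_pid, signal.SIGUSR1)
                return
        time.sleep(interval)
```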

Comment 13 Gerasimos Melissaratos 2014-02-24 08:10:42 UTC
Created attachment 866890 [details]
spm-lock.log from all hosts

Comment 14 Gerasimos Melissaratos 2014-02-27 07:03:33 UTC
SPM has been stable since the configuration upgrade to version 3.3, so my vote goes for closure of this problem report. Not one SPM change for 4 days.

Comment 15 Nir Soffer 2014-02-27 07:19:03 UTC
Thanks for updating Gerasimos!

But I think we should investigate the spm-lock.log and understand why this error happens with old domain version.

Moving it to next version.

Comment 16 Allon Mureinik 2014-05-27 15:59:55 UTC
(In reply to Nir Soffer from comment #15)
> Thanks for updating Gerasimos!
> 
> But I think we should investigate the spm-lock.log and understand why this
> error happens with old domain version.
> 
> Moving it to next version.
Since upgrading solves this, and since we aren't going to put the effort into fixing safelease, closing.