Bug 1068865
| Summary: | SPM semi-randomly switches between hosts | ||
|---|---|---|---|
| Product: | [Retired] oVirt | Reporter: | Gerasimos Melissaratos <gmelis> |
| Component: | vdsm | Assignee: | Nir Soffer <nsoffer> |
| Status: | CLOSED WONTFIX | QA Contact: | Aharon Canan <acanan> |
| Severity: | low | Docs Contact: | |
| Priority: | low | ||
| Version: | 3.3 | CC: | acathrow, amureini, bazulay, bugs, danken, fsimonce, gklein, gmelis, iheim, mgoldboi, yeylon |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | 3.5.0 | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | storage | ||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2014-05-27 15:59:55 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Attachments: | |||
Description
Gerasimos Melissaratos
2014-02-22 16:14:15 UTC
Created attachment 866428 [details]
partial vdsm.log describing an spmId change to -1
Setting target release to current version for consideration and review. Please do not push non-RFE bugs to an undefined target release, to make sure bugs are reviewed for relevancy, fix, closure, etc.

Created attachment 866570 [details]
Ovirt GUI console output showing the spm flapping
Gerasimos, can you attach vdsm.log and sanlock.log from all hosts? Since the SPM switches every minute or so, I guess that 1 hour of logs from when the system is experiencing this will be enough.

Gerasimos, what did you change in the system before this started?

Created attachment 866679 [details]
Log files from all 4 hosts (vdsm.log and sanlock.log) from 13:00 to 14:30
This is a system that started off as oVirt 3.1 and ended up, through upgrades, as a 3.3. The problem became noticeable a few months ago at 3.2, but not much attention was paid to it. It became a real pain more recently, but this being a live system I was not very willing to tinker with it. A week ago I attempted the upgrade to 3.3, hoping the problem would automagically go away (and very afraid of something going wrong). Everything went smoothly and I ended up with a 3.3 system, only the problem persisted.

As far as upgrades are concerned, when I went from 3.1 to 3.2 the database upgrade failed and it took me about a week to finish the upgrade with no errors. Thankfully the hosts kept going smoothly during this week, and when the engine came back up and could see everything, I removed all hosts and added them again one by one, to make sure the engine would be in sync with the hosts.

It looks like you are still working with an old data center and cluster compatibility version (< 3.3). In particular, you are using storage domain version 0, which uses the older SafeLease cluster lock. Although the old version is still supported, the chances of getting fixes for this code are lower. We recommend upgrading the cluster and data center compatibility version to 3.3. This version uses sanlock for the cluster lock.
Strange event seen in the logs, that may explain some of the SPM switches:

$ grep 'signal 10' logs_*/vdsm.log
logs_ekeini/vdsm.log:MainThread::DEBUG::2014-02-23 13:58:57,734::vdsm::50::vds::(sigusr1Handler) Received signal 10
logs_ekeinos/vdsm.log:MainThread::DEBUG::2014-02-23 12:42:28,922::vdsm::50::vds::(sigusr1Handler) Received signal 10
logs_ekeinos/vdsm.log:MainThread::DEBUG::2014-02-23 13:02:30,558::vdsm::50::vds::(sigusr1Handler) Received signal 10
logs_ekeinos/vdsm.log:MainThread::DEBUG::2014-02-23 14:11:53,657::vdsm::50::vds::(sigusr1Handler) Received signal 10
logs_ekeinos/vdsm.log:MainThread::DEBUG::2014-02-23 14:12:52,824::vdsm::50::vds::(sigusr1Handler) Received signal 10

When receiving SIGUSR1, vdsm stops the SPM if this host is the SPM. Federico, do we expect to receive this signal from someone?

Just updated data center compatibility version to 3.3. Moving some gigs from one storage to another to see if anything breaks. The SPM is not flapping any more, or so it seems.

The upgrade to compatibility version 3.3 seems to have solved the problem. I'll post a follow-up later in the week to confirm it's still working (hopefully). Thanks for the prompt responses.

Nir, spmprotect.sh sends USR1 when it fails to pet its lease. Hints on why this has happened may be found in /var/log/vdsm/spm-lock.log. Gerasimos, could you attach the latter?

Created attachment 866890 [details]
spm-lock.log from all hosts
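For anyone wanting to line up the SIGUSR1 events above against SPM switches in the engine, here is a rough Python sketch that pulls the timestamps of the "Received signal 10" entries out of vdsm.log lines and reports the gaps between them. It assumes only the log line layout visible in the grep output above (fields separated by `::`, timestamp in the third field); the function names and the `SAMPLE` list are made up for illustration and are not part of vdsm.

```python
from datetime import datetime

# Marker text copied from the log lines quoted in this report.
MARKER = "(sigusr1Handler) Received signal 10"


def sigusr1_times(lines):
    """Return the timestamps of all SIGUSR1 log entries, in input order."""
    times = []
    for line in lines:
        if MARKER not in line:
            continue
        # Assumed layout: thread::level::timestamp::module::lineno::logger::message
        # A leading "path/vdsm.log:" prefix from grep does not shift the fields.
        fields = line.strip().split("::")
        times.append(datetime.strptime(fields[2], "%Y-%m-%d %H:%M:%S,%f"))
    return times


def gaps_in_seconds(times):
    """Return the time between consecutive events, in seconds."""
    ordered = sorted(times)
    return [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]


# The four entries reported for host "ekeinos" in the grep output above.
SAMPLE = [
    "MainThread::DEBUG::2014-02-23 12:42:28,922::vdsm::50::vds::(sigusr1Handler) Received signal 10",
    "MainThread::DEBUG::2014-02-23 13:02:30,558::vdsm::50::vds::(sigusr1Handler) Received signal 10",
    "MainThread::DEBUG::2014-02-23 14:11:53,657::vdsm::50::vds::(sigusr1Handler) Received signal 10",
    "MainThread::DEBUG::2014-02-23 14:12:52,824::vdsm::50::vds::(sigusr1Handler) Received signal 10",
]

if __name__ == "__main__":
    for gap in gaps_in_seconds(sigusr1_times(SAMPLE)):
        print(f"{gap:.3f} s between consecutive SIGUSR1 events")
```

Running the same functions over each host's full vdsm.log, and comparing the timestamps with the entries in spm-lock.log, may help show whether every SPM change coincides with a failed lease renewal.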
SPM has been stable since the configuration upgrade to version 3.3, so my vote goes for closure of this problem report. Not one SPM change for 4 days.

Thanks for updating Gerasimos!

But I think we should investigate the spm-lock.log and understand why this error happens with old domain version.

Moving it to next version.

(In reply to Nir Soffer from comment #15)
> Thanks for updating Gerasimos!
>
> But I think we should investigate the spm-lock.log and understand why this
> error happens with old domain version.
>
> Moving it to next version.

Since upgrading solves this, and since we aren't going to put the effort into fixing safelease, closing.