Bug 1367813
Summary: Shutting down N-1 nodes at once causes cluster with lms qdevice to lose quorum
Product: Red Hat Enterprise Linux 7
Component: corosync
Version: 7.3
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Reporter: Roman Bednář <rbednar>
Assignee: Jan Friesse <jfriesse>
QA Contact: cluster-qe <cluster-qe>
CC: ccaulfie, cluster-maint, jfriesse, jkortus, mjuricek, rbednar
Target Milestone: rc
Fixed In Version: corosync-2.4.0-4.el7
Type: Bug
Bug Blocks: 614122
Last Closed: 2016-11-04 06:50:09 UTC

Description
Roman Bednář
2016-08-17 14:28:46 UTC
Reassigning to Chrissie because LMS is her field. @Martin: Does the same problem also happen with ffsplit?

QE guys, debug logs would be helpful. Qnetd logs to syslog, but debug output has to be enabled first, so open /etc/sysconfig/corosync-qnetd and change the line

    COROSYNC_QNETD_OPTIONS=""

to

    COROSYNC_QNETD_OPTIONS="-dd"

Qdevice logging depends on the corosync.conf configuration, but syslog is generally enabled, so please use the following configuration in /etc/corosync/corosync.conf before reporting problems:

    logging {
        ...
        logger_subsys {
            subsys: QDEVICE
            debug: on
        }
        ...
    }

Also, a comment on both "reports": I'm unable to reproduce either of them (trying killall -9 corosync or a sysrq trigger). Would you mind sharing corosync.conf (together with the debug logs)?

As discussed with Martin, I was able to reproduce the issue. It is really necessary to crash the node rather than just stop corosync and/or qdevice. Basically, what happens:
- Node 1 dies but the disconnect cannot be sent
- Node 2 finds out that Node 1 is dead and starts forming a new membership, sending the membership change to qnetd
- The qnetd LMS algo sees Node 1 as alive and Node 2 as split but not leader -> sends NACK to Node 2
- Eventually qnetd finds out Node 2 died

The solution used in ffsplit is that qnetd_algo_lms_client_disconnect is handled and the current status is re-evaluated. This is probably not a good choice for lms, because lms keeps the vote (if it has one) until a change occurs (to overcome the problem of an accidental disconnect from qnetd).

Created attachment 1194051 [details]
Proposed patch
Solves the situation when the tie-breaker node dies in a 2-node cluster. Because
the code contained two bugs, the other node got NACK instead of ACK.
- The algo timer is not a stack, so calling abort and schedule in the timer
callback without setting reschedule is a no-op.
- It is necessary to check not only what the current node thinks about
membership, but also what the other nodes think. If the views diverge -> wait.
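
The second point can be illustrated with a small model. The following is only a sketch of the idea, not the qnetd code: the names `Vote`, `views_diverge` and `decide_vote` are hypothetical, "diverge" is assumed to mean that two nodes disagree about whether they share a membership, and "lowest node ID among the nodes qnetd still has on record" is used as a stand-in for the tie-breaker rule.

```python
from enum import Enum


class Vote(Enum):
    ACK = "ACK"
    NACK = "NACK"
    WAIT = "WAIT"  # hypothetical label for "no decision yet, ask again later"


def views_diverge(requester, views):
    """Return True if any other recorded node disagrees with the requester.

    `views` maps node ID -> set of node IDs that node last reported as members.
    Assumption: two views agree when membership is symmetric, i.e. the
    requester lists a node exactly when that node lists the requester.
    """
    own_view = views[requester]
    return any(
        (node in own_view) != (requester in view)
        for node, view in views.items()
        if node != requester
    )


def decide_vote(requester, views):
    """Decide the answer for `requester` in this simplified model."""
    if views_diverge(requester, views):
        return Vote.WAIT
    # Stand-in tie-breaker: the partition containing the lowest recorded ID wins.
    tie_breaker = min(views)
    return Vote.ACK if tie_breaker in views[requester] else Vote.NACK


# Node 1 crashed without disconnecting, so its stale view is still on record.
views = {1: frozenset({1, 2}), 2: frozenset({2})}
while_stale = decide_vote(2, views)

# Once the dead node is dropped from the records, node 2 can be answered.
del views[1]
after_cleanup = decide_vote(2, views)

# A clean split has symmetric views, so it is decided by the tie-breaker at once.
split = {1: frozenset({1}), 2: frozenset({2})}
split_votes = (decide_vote(1, split), decide_vote(2, split))
```

In this model the surviving node gets WAIT while the stale view of the crashed node is still recorded and ACK after it is gone, whereas a genuine split is still resolved immediately by the tie-breaker.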
Just a note, I'm still unable to reproduce the first bug reported by Roman. Roman, can you please paste the logs as Martin did?

Martin, thanks for the logs. Next time, please make sure to set

    logging {
        ...
        logger_subsys {
            subsys: QDEVICE
            debug: on
        }
        ...
    }

(please note subsys: QDEVICE, not subsys: VOTEQ). Anyway, I believe the proposed patch also solves this problem. Would you mind testing a scratch build?

Sounds great, thanks for testing!

Created attachment 1195934 [details]
Patch with slightly better English comments
ACK to the patch, thanks for spotting that.
I've fixed the English in the comments somewhat but the logic seems fine to me.
Chrissie, thanks for the review. The patch is now upstream as b0c850f308d44ddcdf1a1f881c1e1142ad489385

Created attachment 1196271 [details]
Man: Fix corosync-qdevice-net-certutil link
Created attachment 1196272 [details]
man: mention qdevice incompatibilites in votequorum.5
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2463.html