Bug 1654630
| Summary: | node will not rejoin cluster after reboot | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | michal novacek <mnovacek> |
| Component: | corosync | Assignee: | Christine Caulfield <ccaulfield> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.0 | CC: | abeekhof, ccaulfie, cluster-maint, jfriesse, mnovacek |
| Target Milestone: | rc | Keywords: | TestBlocker |
| Target Release: | 8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | corosync-2.99.5-2.el8 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-06-14 01:25:06 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description (michal novacek, 2018-11-29 09:55:40 UTC)
Small correction to the description: the packages are corosync-2.99.4-1.el8 and pacemaker-2.0.0-11.el8.

Looking at the incident of interest:

Nov 29 10:14:51 virt-004.cluster-qe.lab.eng.brq.redhat.com pacemakerd [18629] (main) notice: Starting Pacemaker 2.0.0-11.el8 | build=efbf81b659 features: generated-manpages agent-manpages ncurses libqb-logging libqb-ipc systemd nagios corosync-native atomic-attrd acls
Nov 29 10:22:22 virt-004.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [18639] (tengine_stonith_notify) notice: Peer virt-005 was terminated (reboot) by virt-004 on behalf of stonith-api.25789: OK | initiator=virt-004 ref=2708d2b0-15c8-48bd-9317-2484014ea877
Nov 29 10:26:05 virt-005.cluster-qe.lab.eng.brq.redhat.com pacemakerd [995] (main) notice: Starting Pacemaker 2.0.0-11.el8 | build=efbf81b659 features: generated-manpages agent-manpages ncurses libqb-logging libqb-ipc systemd nagios corosync-native atomic-attrd acls
Nov 29 10:26:13 virt-005.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [1187] (crm_update_peer_proc) info: cluster_connect_cpg: Node virt-005[2] - corosync-cpg is now online
Nov 29 10:26:13 virt-005.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [1187] (peer_update_callback) info: Node virt-005 is now a peer | DC=<none> old=0x0000000 new=0x4000000
Nov 29 10:26:13 virt-005.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [1187] (init_cs_connection_once) info: Connection to 'corosync': established
Nov 29 10:26:13 virt-005.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [1187] (cluster_connect_quorum) warning: Quorum lost
Nov 29 10:26:15 virt-005.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [1187] (pcmk_quorum_notification) info: Quorum still lost | membership=196 members=1
Nov 29 10:26:16 virt-005.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [1187] (pcmk_cpg_membership) info: Group event crmd.0: node 2 joined
Nov 29 10:26:16 virt-005.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [1187] (pcmk_cpg_membership) info: Group event crmd.0: node 2 (virt-005) is member

A cluster is not being formed at the corosync level. Oddly, corosync doesn't log anything during this time:

Nov 29 09:25:59 [778] virt-005.cluster-qe.lab.eng.brq.redhat.com corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
Nov 29 10:28:46 [778] virt-005.cluster-qe.lab.eng.brq.redhat.com corosync notice  [MAIN  ] Node was shut down by a signal

Reassigning to corosync for further investigation.

@mnovacek:

1. It's very hard to read the logs because after a reboot the clock jumps back by one hour:

Nov 29 10:29:06 [1673] virt-005.cluster-qe.lab.eng.brq.redhat.com corosync notice  [MAIN  ] Corosync Cluster Engine exiting normally
Nov 29 09:32:55 [768] virt-005.cluster-qe.lab.eng.brq.redhat.com corosync notice  [MAIN  ] Corosync Cluster Engine ('2.99.4'): started and ready to provide service.

Could you please send a report without this "feature" enabled?

2. Would you mind testing with the udpu transport?

3. Could you please turn on debug in corosync? The time of the incident (= when the node was rebooted) would also be helpful.

After trying really hard, I'm unable to reproduce the problem. It may, however, be because of DNS resolution and/or the network not being activated. So in addition to the previous comment, please check whether your hosts are in /etc/hosts. If not, does adding them help?
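For reference, the udpu and debug requests above map to corosync.conf settings roughly like the following. This is a sketch, not the reporter's actual configuration; option placement and defaults are per corosync.conf(5):

```
totem {
    # test with UDP unicast instead of the default transport
    transport: udpu
}

logging {
    # verbose output for the investigation
    debug: on
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
}
```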
Adding the FQDNs to /etc/hosts did not change the behaviour:

> $ cat /etc/hosts
> 10.37.166.131 virt-004.ipv4.cluster-qe.lab.eng.brq.redhat.com
> 2620:52:0:25a4:1800:ff:fe00:4 virt-004.ipv6.cluster-qe.lab.eng.brq.redhat.com virt-004.cluster-qe.lab.eng.brq.redhat.com
>
> 10.37.166.132 virt-005.ipv4.cluster-qe.lab.eng.brq.redhat.com
> 2620:52:0:25a4:1800:ff:fe00:5 virt-005.ipv6.cluster-qe.lab.eng.brq.redhat.com virt-005.cluster-qe.lab.eng.brq.redhat.com

---

Adding the short form virt-00{4,5} to the ipv6 entries solves the problem:

> $ cat /etc/hosts
> 10.37.166.131 virt-004.ipv4.cluster-qe.lab.eng.brq.redhat.com
> 2620:52:0:25a4:1800:ff:fe00:4 virt-004.ipv6.cluster-qe.lab.eng.brq.redhat.com virt-004.cluster-qe.lab.eng.brq.redhat.com virt-004
>
> 10.37.166.132 virt-005.ipv4.cluster-qe.lab.eng.brq.redhat.com virt-005
> 2620:52:0:25a4:1800:ff:fe00:5 virt-005.ipv6.cluster-qe.lab.eng.brq.redhat.com virt-005.cluster-qe.lab.eng.brq.redhat.com virt-005

---

We might have a slightly strange DNS setup in our lab, where virt-004 will by default resolve to its IPv6 address:

[root@virt-004 ~]# host virt-004
virt-004.cluster-qe.lab.eng.brq.redhat.com has address 10.37.166.131
virt-004.cluster-qe.lab.eng.brq.redhat.com has IPv6 address 2620:52:0:25a4:1800:ff:fe00:4
[root@virt-004 ~]# ping virt-004
PING virt-004(virt-004.cluster-qe.lab.eng.brq.redhat.com (2620:52:0:25a4:1800:ff:fe00:4)) 56 data bytes
...
[root@virt-004 ~]# ping virt-004.ipv4
PING virt-004.ipv4.cluster-qe.lab.eng.brq.redhat.com (10.37.166.131) 56(84) bytes of data.
...

In the same environment, with no added /etc/hosts lines, RHEL 7.6 pacemaker clusters worked well. What should we do differently to avoid having to add entries to /etc/hosts? (corosync.conf ring0_addr?)

@mnovacek: Chrissie is now working on some DNS issues which I've found (DNS resolution errors are not checked), so maybe that will also fix the problem you've reported. It's not necessary to do anything differently than before.
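One way to sidestep hostname resolution entirely, as the ring0_addr hint suggests, is to put literal IP addresses into the corosync.conf nodelist. A sketch using the lab addresses quoted above (assumed layout, not the reporter's actual configuration):

```
nodelist {
    node {
        nodeid: 1
        ring0_addr: 10.37.166.131   # virt-004 as an IPv4 literal: no DNS lookup
    }
    node {
        nodeid: 2
        ring0_addr: 10.37.166.132   # virt-005
    }
}
```

With literal addresses both nodes are guaranteed to bind to the same address family, regardless of what the resolver would have returned.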
It should work (or fail with an error, or at least show a warning), and if it doesn't, then it's a bug and must be fixed (customers' setups may be even weirder, and the current behavior, where no error is reported and corosync doesn't work, is simply wrong). All the questions I've asked were just to allow us to reproduce the issue.

It's now evident that the problem is IPv4 vs IPv6. Needle (2.x, RHEL 7) defaulted to IPv4; camelback (3.x, RHEL 8) has no strict default, so in the lab setup it apparently resolves to IPv6.

@mnovacek: Would you mind testing https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=19353248 without /etc/hosts modified (the same configuration as you had at the time of reporting the bug)? It will probably not work, but the logs should contain at least some clues now. Hopefully helpful repo link: http://brew-task-repos.usersys.redhat.com/repos/scratch/jfriesse/corosync/2.99.4/1.el8.jf1/corosync-2.99.4-1.el8.jf1-scratch.repo

Created attachment 1512426 [details]
pcs cluster report with test version of corosync-2.99.4
After a brief look at the logs, I've found at least the following problems:

- Time is still moving one hour back and forward. This is NOT a problem for corosync, but it makes the logs very hard to investigate. Please try to fix your VM (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/virtualization_administration_guide/sect-virtualization-tips_and_tricks-libvirt_managed_timers)
- It looks like virt15 was never running corosync 2.99.4 (cluster-log.txt, corosync-blackbox-live.txt and journal.log never log 2.99.4), so it was not using the patched version
- The log collector probably doesn't work as expected, because virt14's cluster-log.txt contains version 2.99.3, but corosync-blackbox-live.txt (correctly) contains 2.99.4

(In reply to Jan Friesse from comment #10)
> After brief look to logs, I've found at least following problems:
>
> - Time is still moving one hour back and forward. This is NOT a problem for
> corosync, but it's very hard to investigate such logs. Please try to fix
> your VM
> (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/
> html/virtualization_administration_guide/sect-virtualization-tips_and_tricks-
> libvirt_managed_timers)

I put <clock offset='localtime' /> into the virts' definition. Hope it helps.

> - looks like virt15 was never running corosync 2.99.4 (cluster-log.txt,
> corosync-blackbox-live.txt and journal.log never logs 2.99.4) so it was not
> using patched version

I retried with a reboot and cluster re-creation, where the only installed version on both nodes is 2.99.4.

> - log collector probably doesn't work as expected, because virt14
> cluster-log.txt contains version 2.99.3, but corosync-blackbox-live.txt
> (correctly) contains 2.99.4

I can give you access to these machines so you can check whatever you need, in case this is really the case.

Created attachment 1512598 [details]
pcs cluster report, #2
@mnovacek: Thank you for the update. Time still seems to be moving back and forwards, but at least corosync.quorum is complete and the problem is evident. For some reason, the node with id 2 is bound to IPv4 and the node with id 1 is bound to IPv6. Chrissie sent a patch which I've NACKed because I believe it's not needed, but it's at least worth a try, so let me build a new test package.

Another test build, with https://patch-diff.githubusercontent.com/raw/corosync/corosync/pull/409.patch included: http://brew-task-repos.usersys.redhat.com/repos/scratch/jfriesse/corosync/2.99.5/1.el8.jf1/

The patch from comment #14 seems to correct the issue.

Perfect. I've merged the patch as https://github.com/corosync/corosync/commit/3d7f136f86a56dd9d9caa9060f7a01e8b681eb7f

Created attachment 1513348 [details]
man: Add some information about address resolution
man: Add some information about address resolution
to corosync.conf(5)
Signed-off-by: Christine Caulfield <ccaulfie>
Reviewed-by: Jan Friesse <jfriesse>
Created attachment 1513349 [details]
config: Look up hostnames in a defined order

config: Look up hostnames in a defined order
Current practice is to let getaddrinfo() decide which address we get
but this is not necessarily deterministic as DNS servers won't
always return addresses in the same order if a node has
several. While this doesn't deal with node names that have
multiple IP addresses of the same family (that's an installation issue
IMHO) we can, at least, force a definite order for IPv6/IPv4 name
resolution.
I've chosen IPv6 then IPv4 as that's what happens on my test system
(using /etc/hosts) and it also seems more 'future proof'.
Signed-off-by: Christine Caulfield <ccaulfie>
Reviewed-by: Jan Friesse <jfriesse>
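The merged fix lives in corosync's C config parsing, but the idea in the commit message, imposing a fixed address-family order on getaddrinfo() results instead of trusting resolver order, can be sketched in Python. This is illustrative only; resolve_deterministic is a hypothetical name, not a corosync function:

```python
import socket

def resolve_deterministic(host, prefer=(socket.AF_INET6, socket.AF_INET)):
    """Return host's addresses in a fixed family order (IPv6 first by
    default), rather than whatever order getaddrinfo() happens to produce."""
    infos = socket.getaddrinfo(host, None)
    rank = {fam: i for i, fam in enumerate(prefer)}
    # list.sort() is stable, so addresses within one family keep resolver order
    infos.sort(key=lambda ai: rank.get(ai[0], len(prefer)))
    return [ai[4][0] for ai in infos]

print(resolve_deterministic("localhost"))
```

Two nodes running this logic will agree on which family to bind, even if their DNS servers return records in different orders, which is exactly the failure mode seen in this bug.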