Bug 1654630
| Summary: | node will not rejoin cluster after reboot | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | michal novacek <mnovacek> |
| Component: | corosync | Assignee: | Christine Caulfield <ccaulfield> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.0 | CC: | abeekhof, ccaulfie, cluster-maint, jfriesse, mnovacek |
| Target Milestone: | rc | Keywords: | TestBlocker |
| Target Release: | 8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | corosync-2.99.5-2.el8 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-06-14 01:25:06 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description (michal novacek, 2018-11-29 09:55:40 UTC)
Small correction to the description: the packages are corosync-2.99.4-1.el8 and pacemaker-2.0.0-11.el8.

Looking at the incident of interest:

Nov 29 10:14:51 virt-004.cluster-qe.lab.eng.brq.redhat.com pacemakerd [18629] (main) notice: Starting Pacemaker 2.0.0-11.el8 | build=efbf81b659 features: generated-manpages agent-manpages ncurses libqb-logging libqb-ipc systemd nagios corosync-native atomic-attrd acls
Nov 29 10:22:22 virt-004.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [18639] (tengine_stonith_notify) notice: Peer virt-005 was terminated (reboot) by virt-004 on behalf of stonith-api.25789: OK | initiator=virt-004 ref=2708d2b0-15c8-48bd-9317-2484014ea877
Nov 29 10:26:05 virt-005.cluster-qe.lab.eng.brq.redhat.com pacemakerd [995] (main) notice: Starting Pacemaker 2.0.0-11.el8 | build=efbf81b659 features: generated-manpages agent-manpages ncurses libqb-logging libqb-ipc systemd nagios corosync-native atomic-attrd acls
Nov 29 10:26:13 virt-005.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [1187] (crm_update_peer_proc) info: cluster_connect_cpg: Node virt-005[2] - corosync-cpg is now online
Nov 29 10:26:13 virt-005.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [1187] (peer_update_callback) info: Node virt-005 is now a peer | DC=<none> old=0x0000000 new=0x4000000
Nov 29 10:26:13 virt-005.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [1187] (init_cs_connection_once) info: Connection to 'corosync': established
Nov 29 10:26:13 virt-005.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [1187] (cluster_connect_quorum) warning: Quorum lost
Nov 29 10:26:15 virt-005.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [1187] (pcmk_quorum_notification) info: Quorum still lost | membership=196 members=1
Nov 29 10:26:16 virt-005.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [1187] (pcmk_cpg_membership) info: Group event crmd.0: node 2 joined
Nov 29 10:26:16 virt-005.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [1187] (pcmk_cpg_membership) info: Group event crmd.0: node 2 (virt-005) is member

A cluster is not being formed at the corosync level. Oddly, corosync doesn't log anything during this time:

Nov 29 09:25:59 [778] virt-005.cluster-qe.lab.eng.brq.redhat.com corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
Nov 29 10:28:46 [778] virt-005.cluster-qe.lab.eng.brq.redhat.com corosync notice  [MAIN  ] Node was shut down by a signal

Reassigning to corosync for further investigation.

@mnovacek:

1. It's very hard to read the logs because after a reboot the clock jumps back by one hour:

Nov 29 10:29:06 [1673] virt-005.cluster-qe.lab.eng.brq.redhat.com corosync notice  [MAIN  ] Corosync Cluster Engine exiting normally
Nov 29 09:32:55 [768] virt-005.cluster-qe.lab.eng.brq.redhat.com corosync notice  [MAIN  ] Corosync Cluster Engine ('2.99.4'): started and ready to provide service.

Could you please send a report without this "feature" enabled?

2. Would you mind testing with the udpu transport?

3. Could you please turn on debug in corosync? The time of the incident (= when the node was rebooted) would also be helpful.

After trying really hard, I'm unable to reproduce the problem. It may, however, be because of DNS resolution and/or the network not being activated. So in addition to the previous comment, please check whether your hosts are in /etc/hosts. If not, does adding them help?
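For reference, the udpu and debug requests above map to corosync.conf settings roughly like the following. This is a sketch, not the reporter's actual configuration; option placement and defaults are per corosync.conf(5):

```
totem {
    # test with UDP unicast instead of the default transport
    transport: udpu
}

logging {
    # verbose output for the investigation
    debug: on
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
}
```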
Adding the FQDNs to /etc/hosts did not change the behaviour:

> $ cat /etc/hosts
> 10.37.166.131 virt-004.ipv4.cluster-qe.lab.eng.brq.redhat.com
> 2620:52:0:25a4:1800:ff:fe00:4 virt-004.ipv6.cluster-qe.lab.eng.brq.redhat.com virt-004.cluster-qe.lab.eng.brq.redhat.com
>
> 10.37.166.132 virt-005.ipv4.cluster-qe.lab.eng.brq.redhat.com
> 2620:52:0:25a4:1800:ff:fe00:5 virt-005.ipv6.cluster-qe.lab.eng.brq.redhat.com virt-005.cluster-qe.lab.eng.brq.redhat.com

---

Adding the short form virt-00{4,5} to the ipv6 entries solves the problem:

> $ cat /etc/hosts
> 10.37.166.131 virt-004.ipv4.cluster-qe.lab.eng.brq.redhat.com
> 2620:52:0:25a4:1800:ff:fe00:4 virt-004.ipv6.cluster-qe.lab.eng.brq.redhat.com virt-004.cluster-qe.lab.eng.brq.redhat.com virt-004
>
> 10.37.166.132 virt-005.ipv4.cluster-qe.lab.eng.brq.redhat.com virt-005
> 2620:52:0:25a4:1800:ff:fe00:5 virt-005.ipv6.cluster-qe.lab.eng.brq.redhat.com virt-005.cluster-qe.lab.eng.brq.redhat.com virt-005

---

We might have a slightly strange DNS setup in our lab, where virt-004 will by default resolve to its IPv6 address:

[root@virt-004 ~]# host virt-004
virt-004.cluster-qe.lab.eng.brq.redhat.com has address 10.37.166.131
virt-004.cluster-qe.lab.eng.brq.redhat.com has IPv6 address 2620:52:0:25a4:1800:ff:fe00:4
[root@virt-004 ~]# ping virt-004
PING virt-004(virt-004.cluster-qe.lab.eng.brq.redhat.com (2620:52:0:25a4:1800:ff:fe00:4)) 56 data bytes
...
[root@virt-004 ~]# ping virt-004.ipv4
PING virt-004.ipv4.cluster-qe.lab.eng.brq.redhat.com (10.37.166.131) 56(84) bytes of data.
...

In the same environment, with no added /etc/hosts lines, RHEL 7.6 pacemaker clusters worked well. What should we do differently to avoid having to add entries to /etc/hosts? (corosync.conf ring0_addr?)

@mnovacek: Chrissie is now working on some DNS issues which I've found (DNS resolution errors are not checked), so maybe that will also fix the problem you've reported. It's not necessary to do anything differently than before.
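One way to sidestep hostname resolution entirely, as the ring0_addr hint suggests, is to put literal IP addresses into the corosync.conf nodelist. A sketch using the lab addresses quoted above (assumed layout, not the reporter's actual configuration):

```
nodelist {
    node {
        nodeid: 1
        ring0_addr: 10.37.166.131   # virt-004 as an IPv4 literal: no DNS lookup
    }
    node {
        nodeid: 2
        ring0_addr: 10.37.166.132   # virt-005
    }
}
```

With literal addresses both nodes are guaranteed to bind to the same address family, regardless of what the resolver would have returned.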
It should work (or fail with an error, or at least show a warning), and if it doesn't, then it's a bug and must be fixed (customers' setups may be even weirder, and the current behavior, where no error is reported and corosync doesn't work, is simply wrong). All the questions I've asked were just to allow us to reproduce the issue.

It's now evident that the problem is IPv4 vs IPv6. Needle (2.x, RHEL 7) defaulted to IPv4; camelback (3.x, RHEL 8) has no strict default, so in the lab setup it apparently resolves to IPv6.

@mnovacek: Would you mind testing https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=19353248 without /etc/hosts modified (the same configuration as you had at the time of reporting the bug)? It will probably not work, but the logs should contain at least some clues now. Hopefully helpful repo link: http://brew-task-repos.usersys.redhat.com/repos/scratch/jfriesse/corosync/2.99.4/1.el8.jf1/corosync-2.99.4-1.el8.jf1-scratch.repo

Created attachment 1512426 [details]
pcs cluster report with test version of corosync-2.99.4
After a brief look at the logs, I've found at least the following problems:

- Time is still moving one hour back and forward. This is NOT a problem for corosync, but it makes the logs very hard to investigate. Please try to fix your VM (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/virtualization_administration_guide/sect-virtualization-tips_and_tricks-libvirt_managed_timers)
- It looks like virt15 was never running corosync 2.99.4 (cluster-log.txt, corosync-blackbox-live.txt and journal.log never log 2.99.4), so it was not using the patched version
- The log collector probably doesn't work as expected, because virt14's cluster-log.txt contains version 2.99.3, but corosync-blackbox-live.txt (correctly) contains 2.99.4

(In reply to Jan Friesse from comment #10)
> After brief look to logs, I've found at least following problems:
>
> - Time is still moving one hour back and forward. This is NOT a problem for
> corosync, but it's very hard to investigate such logs. Please try to fix
> your VM
> (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/
> html/virtualization_administration_guide/sect-virtualization-tips_and_tricks-
> libvirt_managed_timers)

I put <clock offset='localtime' /> into the virts' definition. Hope it helps.

> - looks like virt15 was never running corosync 2.99.4 (cluster-log.txt,
> corosync-blackbox-live.txt and journal.log never logs 2.99.4) so it was not
> using patched version

I retried with a reboot and cluster re-creation, where the only installed version on both nodes is 2.99.4.

> - log collector probably doesn't work as expected, because virt14
> cluster-log.txt contains version 2.99.3, but corosync-blackbox-live.txt
> (correctly) contains 2.99.4

I can give you access to these machines so you can check whatever you need, in case this is really the case.

Created attachment 1512598 [details]
pcs cluster report, #2
@mnovacek: Thank you for the update. Time still seems to be moving back and forwards, but at least corosync.quorum is complete and the problem is evident. For some reason, the node with id 2 is bound to IPv4 and the node with id 1 is bound to IPv6. Chrissie sent a patch which I've NACKed because I believe it's not needed, but it's at least worth a try, so let me build a new test package.

Another test build, with https://patch-diff.githubusercontent.com/raw/corosync/corosync/pull/409.patch included: http://brew-task-repos.usersys.redhat.com/repos/scratch/jfriesse/corosync/2.99.5/1.el8.jf1/

The patch from comment #14 seems to correct the issue.

Perfect. I've merged the patch as https://github.com/corosync/corosync/commit/3d7f136f86a56dd9d9caa9060f7a01e8b681eb7f

Created attachment 1513348 [details]
man: Add some information about address resolution
man: Add some information about address resolution
to corosync.conf(5)
Signed-off-by: Christine Caulfield <ccaulfie>
Reviewed-by: Jan Friesse <jfriesse>
Created attachment 1513349 [details]
config: Look up hostnames in a defined order

config: Look up hostnames in a defined order
Current practice is to let getaddrinfo() decide which address we get
but this is not necessarily deterministic as DNS servers won't
always return addresses in the same order if a node has
several. While this doesn't deal with node names that have
multiple IP addresses of the same family (that's an installation issue
IMHO) we can, at least, force a definite order for IPv6/IPv4 name
resolution.
I've chosen IPv6 then IPv4 as that's what happens on my test system
(using /etc/hosts) and it also seems more 'future proof'.
Signed-off-by: Christine Caulfield <ccaulfie>
Reviewed-by: Jan Friesse <jfriesse>
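The merged fix lives in corosync's C config parsing, but the idea in the commit message, imposing a fixed address-family order on getaddrinfo() results instead of trusting resolver order, can be sketched in Python. This is illustrative only; resolve_deterministic is a hypothetical name, not a corosync function:

```python
import socket

def resolve_deterministic(host, prefer=(socket.AF_INET6, socket.AF_INET)):
    """Return host's addresses in a fixed family order (IPv6 first by
    default), rather than whatever order getaddrinfo() happens to produce."""
    infos = socket.getaddrinfo(host, None)
    rank = {fam: i for i, fam in enumerate(prefer)}
    # list.sort() is stable, so addresses within one family keep resolver order
    infos.sort(key=lambda ai: rank.get(ai[0], len(prefer)))
    return [ai[4][0] for ai in infos]

print(resolve_deterministic("localhost"))
```

Two nodes running this logic will agree on which family to bind, even if their DNS servers return records in different orders, which is exactly the failure mode seen in this bug.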