Bug 1614166
Summary: Always close corosync IPC when dispatch function gets error

Product: Red Hat Enterprise Linux 8
Component: pacemaker
Version: 8.3
Target Milestone: rc
Target Release: 8.4
Hardware: x86_64
OS: Linux
Severity: low
Priority: medium
Status: CLOSED ERRATA
Reporter: haidong li <haili>
Assignee: Ken Gaillot <kgaillot>
QA Contact: cluster-qe <cluster-qe>
CC: ccaulfie, cluster-maint, ctrautma, fleitner, haili, jfriesse, jishi, kgaillot, mmichels, msmazova, qding
Fixed In Version: pacemaker-2.0.5-1.el8
Doc Type: No Doc Update
Doc Text: This is low-level enough to be below most users' visibility
Last Closed: 2021-05-18 15:26:41 UTC
Type: Bug
Bug Depends On: 1885645
Description    haidong li    2018-08-09 06:29:29 UTC

"No space left on device" implies the disk is full. Are there any directories with more entries than expected? Are there any unusually large files (perhaps logs)? I want to try to determine whether the issue is in OVN, OVS, corosync, pacemaker, or something else on the system.

(In reply to Mark Michelson from comment #2)
The log file isn't large:

[root@dell-per730-49 log]# pwd
/var/log
[root@dell-per730-49 log]# ll -h
total 78M
...

Is it caused by the cache? I found the cache can't be released:

[root@dell-per730-49 log]# free -g
              total        used        free      shared  buff/cache   available
Mem:             62           1          18          31          43          28
Swap:            27           0          27
[root@dell-per730-49 log]# cat /proc/sys/vm/drop_caches
0
[root@dell-per730-49 log]# echo 1 > /proc/sys/vm/drop_caches
[root@dell-per730-49 log]# free -g
              total        used        free      shared  buff/cache   available
Mem:             62           1          29          31          32          29
Swap:            27           0          27

You can log in to the machine dell-per730-49.rhts.eng.pek2.redhat.com with root/redhat if you want to check.

Thanks, I logged in and looked around a bit:

[root@dell-per730-49 ~]# df -h
Filesystem                              Size  Used Avail Use% Mounted on
/dev/mapper/rhel_dell--per730--49-root   50G   15G   36G  29% /
devtmpfs                                 32G     0   32G   0% /dev
tmpfs                                    32G   32G   15M 100% /dev/shm
tmpfs                                    32G  595M   31G   2% /run
tmpfs                                    32G     0   32G   0% /sys/fs/cgroup
/dev/sda1                              1014M  146M  869M  15% /boot
/dev/mapper/rhel_dell--per730--49-home  201G   34M  201G   1% /home
tmpfs                                   6.3G     0  6.3G   0% /run/user/0

/dev/shm is 100% used.

[root@dell-per730-49 ~]# ls -l /dev/shm | wc -l
63117

It looks like either libqb or corosync is not cleaning up temporary files.

Hi, it appears that the system is no longer up. Is it possible for you to reproduce this and leave the test system up for someone to look into? I found some people who are interested in having a look when the system gets into its broken state. Thanks!

(In reply to Mark Michelson from comment #6)
Hi Mark, I have reproduced it on another machine; you can log in and check it with root/redhat: dell-per730-20.rhts.eng.pek2.redhat.com

[root@dell-per730-20 ~]# df -h
Filesystem                                            Size  Used Avail Use% Mounted on
/dev/mapper/rhel_dell--per730--20-root                 50G   13G   38G  26% /
devtmpfs                                               32G     0   32G   0% /dev
tmpfs                                                  32G   32G  9.6M 100% /dev/shm
tmpfs                                                  32G  211M   32G   1% /run
tmpfs                                                  32G     0   32G   0% /sys/fs/cgroup
/dev/sda1                                            1014M  236M  779M  24% /boot
/dev/mapper/rhel_dell--per730--20-home                201G   33M  201G   1% /home
ibm-x3250m4-03.rhts.eng.pek2.redhat.com:/data/vmcore  1.4T  1.1T  326G  77% /var/crash
tmpfs                                                 6.3G     0  6.3G   0% /run/user/0

Looking into this more, the files on the filesystem seem to be the result of internal corosync operations. Even if there are errors in our scripts, they shouldn't result in leaks of these types of files on the file system. I'm changing the component of this issue to try to bring in someone who can provide input on it.

The problem you are seeing is really caused by a full /dev/shm; the question is why it becomes full. Another problem is ifdown: corosync doesn't handle ifdown well. It's a long-known problem, which is fixed in the soon-to-be-released corosync 3.x, and we have no plan to fix it properly in corosync 2.x. So the question is: are you able to reproduce the problem without using ifdown (you can use iptables to block UDP traffic)? Could you share your corosync.conf? Are you able to reproduce the problem without pacemaker (so just using corosync)? Also, could you please share the corosync, pacemaker and libqb versions?

Created attachment 1485405 [details]
corosync.conf
I have attached the corosync.conf and listed the versions of corosync, pacemaker and libqb. I will try to reproduce the issue using iptables to block UDP traffic. If it's necessary to reproduce it with corosync only, without pacemaker, can you give some explanation of how to do that, or some commands I should use? I have never used it before. Thanks!
[root@dell-per730-57 ~]# rpm -qa | grep pacemaker
pacemaker-cluster-libs-1.1.19-7.el7.x86_64
pacemaker-cli-1.1.19-7.el7.x86_64
pacemaker-libs-1.1.19-7.el7.x86_64
pacemaker-1.1.19-7.el7.x86_64
[root@dell-per730-57 ~]# rpm -qa | grep corosync
corosync-2.4.3-4.el7.x86_64
corosynclib-2.4.3-4.el7.x86_64
[root@dell-per730-57 ~]# rpm -qa | grep libqb
libqb-1.0.1-7.el7.x86_64
Ok, so corosync.conf is pretty standard and the libs are the newest possible versions, so there should be no problem. It's not entirely necessary to reproduce the issue without pacemaker, but it could reduce the problem space. By "corosync only" I mean just starting corosync: create the cluster as usual, then stop the pacemaker service on all nodes and retry your test. The bug can probably be reproduced on a single node, so you can try just one node.

Another possibility is to take a look at /dev/shm. First, please try to find out which files are using the most space. There should also be files like qb-*-[event|request|response]-PID1-PID2-random-[data|header]. PID1 is the server pid and should be the corosync one; please check that it really is. PID2 should be the client pid, so please take a look at which process is associated with that pid.

@haidong li: Thank you for the login credentials. I can now fully understand what is happening:
1. ifdown (as I wrote, ifdown is unsupported) + ifup results in a corosync crash.
2. The corosync crash is detected by pacemaker, which also terminates with an error code, but it apparently does not close all IPCs (call the *_finalize functions), so the /dev/shm files are not properly deleted.
3. systemd restarts pacemaker because the pacemaker unit is configured to restart on-failure. The pacemaker unit file depends on corosync, so corosync is started first.
4. goto 1

To conclude:
- This bug is not reproducible using the iptables method of failure.
- Virtually all clusters use power fencing (I believe this is the only approved configuration), so the node would be fenced (because corosync was killed) and the problem would never appear - that's the reason I changed the priority and severity of the bug to low.
- We have a partial fix for the ifdown problem for corosync 2.x upstream (https://github.com/corosync/corosync/commit/96354fba72b7e7065610f37df0c0547b1e93ad51), so it will land with the next rebase (if allowed).
-> There is really nothing more we can fix in corosync itself.

That said, I'm reassigning to pacemaker to consider enhancing pacemaker so that it closes the IPC (= calls the *_finalize functions) when corosync crashes (= a *_dispatch function returns an error other than CS_ERR_TRY_AGAIN). This will be considered for RHEL 8 only.

We had trouble reproducing the issue, but we believe it would be fixed by commit dc341923, which is now merged upstream. Can you please provide a reproducer for QE testing?

I succeeded in reproducing the issue on openvswitch2.10-ovn-common-2.10.0-0.20180724git1ac6908.el7fdp.x86_64 with the following steps:
1. install pacemaker on 3 nodes and start pcsd
2. install ovs and ovn on 3 nodes, start openvswitch, and disable selinux
3.
start pcs setenforce 0 systemctl start openvswitch ip_s=1.1.1.16 ip_c1=1.1.1.17 ip_c2=1.1.1.18 ip_v=1.1.1.100 (sleep 2;echo "hacluster"; sleep 2; echo "redhat" ) |pcs cluster auth $ip_c1 $ip_c2 $ip_s sleep 5 pcs cluster setup --force --start --name my_cluster $ip_c1 $ip_c2 $ip_s pcs cluster enable --all pcs property set stonith-enabled=false pcs property set no-quorum-policy=ignore pcs cluster cib tmp-cib.xml sleep 10 cp tmp-cib.xml tmp-cib.deltasrc pcs resource delete ip-$ip_v pcs resource delete ovndb_servers-master sleep 5 pcs status pcs -f tmp-cib.xml resource create ip-$ip_v ocf:heartbeat:IPaddr2 ip=$ip_v op monitor interval=30s sleep 5 pcs -f tmp-cib.xml resource create ovndb_servers ocf:ovn:ovndb-servers manage_northd=yes master_ip=$ip_v nb_master_port=6641 sb_master_port=6642 master sleep 5 pcs -f tmp-cib.xml resource meta ovndb_servers-master notify=true pcs -f tmp-cib.xml constraint order start ip-$ip_v then promote ovndb_servers-master pcs -f tmp-cib.xml constraint colocation add ip-$ip_v with master ovndb_servers-master pcs -f tmp-cib.xml constraint location ip-$ip_v prefers $ip_s=1500 pcs -f tmp-cib.xml constraint location ovndb_servers-master prefers $ip_s=1500 pcs -f tmp-cib.xml constraint location ip-$ip_v prefers $ip_c2=1000 pcs -f tmp-cib.xml constraint location ovndb_servers-master prefers $ip_c2=1000 pcs -f tmp-cib.xml constraint location ip-$ip_v prefers $ip_c1=500 pcs -f tmp-cib.xml constraint location ovndb_servers-master prefers $ip_c1=500 pcs cluster cib-push tmp-cib.xml diff-against=tmp-cib.deltasrc 4. after setup, master is 1.1.1.17 5. set interface down and up on 1.1.1.17 with following script for ((i=0;i<10000;i++));do ip link set p1p1 down;sleep 15;ip link set p1p1 up;sleep 15;done 6. after hours, the issue happens [root@wsfd-advnetlab18 ovn2.10.0]# pcs status Cluster name: my_cluster WARNINGS: Corosync and pacemaker node names do not match (IPs used in setup?) 
Stack: corosync Current DC: wsfd-advnetlab16.anl.lab.eng.bos.redhat.com (version 1.1.19-8.el7-c3c624ea3d) - partition with quorum Last updated: Wed Oct 21 21:52:29 2020 Last change: Wed Oct 21 16:14:36 2020 by root via crm_attribute on wsfd-advnetlab16.anl.lab.eng.bos.redhat.com 3 nodes configured 4 resources configured Online: [ wsfd-advnetlab16.anl.lab.eng.bos.redhat.com wsfd-advnetlab18.anl.lab.eng.bos.redhat.com ] OFFLINE: [ wsfd-advnetlab17.anl.lab.eng.bos.redhat.com ] Full list of resources: ip-1.1.1.100 (ocf::heartbeat:IPaddr2): Started wsfd-advnetlab16.anl.lab.eng.bos.redhat.com Master/Slave Set: ovndb_servers-master [ovndb_servers] Masters: [ wsfd-advnetlab16.anl.lab.eng.bos.redhat.com ] Slaves: [ wsfd-advnetlab18.anl.lab.eng.bos.redhat.com ] Stopped: [ wsfd-advnetlab17.anl.lab.eng.bos.redhat.com ] Daemon Status: corosync: active/enabled pacemaker: active/enabled pcsd: active/disabled [root@wsfd-advnetlab17 ~]# rpm -qa | grep -E "openvswitch|ovn" openvswitch-selinux-extra-policy-1.0-3.el7fdp.noarch openvswitch2.10-ovn-central-2.10.0-0.20180724git1ac6908.el7fdp.x86_64 openvswitch2.10-2.10.0-0.20180724git1ac6908.el7fdp.x86_64 openvswitch2.10-ovn-common-2.10.0-0.20180724git1ac6908.el7fdp.x86_64 openvswitch2.10-ovn-host-2.10.0-0.20180724git1ac6908.el7fdp.x86_64 [root@wsfd-advnetlab17 ~]# tail /var/log/cluster/corosync.log [133905] wsfd-advnetlab17.anl.lab.eng.bos.redhat.com corosyncnotice [TOTEM ] adding new UDPU member {1.1.1.18} [133905] wsfd-advnetlab17.anl.lab.eng.bos.redhat.com corosyncnotice [TOTEM ] adding new UDPU member {1.1.1.16} [133905] wsfd-advnetlab17.anl.lab.eng.bos.redhat.com corosyncnotice [TOTEM ] The network interface [1.1.1.17] is now up. [133905] wsfd-advnetlab17.anl.lab.eng.bos.redhat.com corosyncnotice [TOTEM ] adding new UDPU member {1.1.1.17} [133905] wsfd-advnetlab17.anl.lab.eng.bos.redhat.com corosyncnotice [TOTEM ] adding new UDPU member {1.1.1.18} [133905] wsfd-advnetlab17.anl.lab.eng.bos.redhat.com corosyncnotice [TOTEM ] adding new UDPU member {1.1.1.16} [133905] wsfd-advnetlab17.anl.lab.eng.bos.redhat.com corosyncerror [QB ] couldn't create file for mmap [133905] wsfd-advnetlab17.anl.lab.eng.bos.redhat.com corosyncerror [QB ] qb_rb_open:cmap-event-133906-133941-25: No space left on device (28) [133905] wsfd-advnetlab17.anl.lab.eng.bos.redhat.com corosyncerror [QB ] shm connection FAILED: No space left on device (28) [133905] wsfd-advnetlab17.anl.lab.eng.bos.redhat.com corosyncerror [QB ] Error in connection setup (133906-133941-25): No space left on device (28) @Jianlin Shi: Thank you for the reproducer (comment 19). I tried to use it to test the issue on 3-node pacemaker cluster with pacemaker-2.0.5-4.el8, on rhosp-openvswitch-2.13-8.el8ost.noarch and rhosp-ovn-2.13-8.el8ost.noarch, but I was not successful - the ovndb-servers resource failed to start. I tried several times to set up the databases and find out why the ovsdb-server failed to start, but since I haven't worked with either ovs or ovn before I did not succeed. Can you please help and verify if the issue is fixed in pacemaker-2.0.5-4.el8 ? 
[root@virt-138 ~]# rpm -q pacemaker pacemaker-2.0.5-4.el8.x86_64 [root@virt-138 ~]# rpm -qa | grep -E "openvswitch|ovn" openvswitch2.13-2.13.0-71.el8fdp.x86_64 openvswitch-selinux-extra-policy-1.0-22.el8fdp.noarch network-scripts-openvswitch2.13-2.13.0-71.el8fdp.x86_64 network-scripts-openvswitch2.11-2.11.3-74.el8fdp.x86_64 rhosp-ovn-2.13-8.el8ost.noarch rhosp-openvswitch-2.13-8.el8ost.noarch ovn2.13-20.09.0-17.el8fdp.x86_64 [root@virt-138 ~]# systemctl is-active pcsd active [root@virt-138 ~]# getenforce Permissive [root@virt-138 ~]# systemctl is-active openvswitch active > Cluster set-up as described in comment 19 [ ... ] > After cluster setup, the ovndb-servers failed to start. [root@virt-138 ~]# pcs status Cluster name: my_cluster Cluster Summary: * Stack: corosync * Current DC: virt-138 (version 2.0.5-4.el8-ba59be7122) - partition with quorum * Last updated: Fri Jan 8 19:00:15 2021 * Last change: Fri Jan 8 18:59:35 2021 by root via cibadmin on virt-138 * 3 nodes configured * 4 resource instances configured Node List: * Online: [ virt-138 virt-140 virt-150 ] Full List of Resources: * VirtualIP (ocf::heartbeat:IPaddr2): Stopped * Clone Set: ovndb_servers-clone [ovndb_servers] (promotable): * Stopped: [ virt-138 virt-140 virt-150 ] Failed Resource Actions: * ovndb_servers_start_0 on virt-150 'error' (1): call=11, status='Timed Out', exitreason='', last-rc-change='2021-01-08 18:59:36 +01:00', queued=0ms, exec=30010ms * ovndb_servers_start_0 on virt-138 'error' (1): call=11, status='Timed Out', exitreason='', last-rc-change='2021-01-08 18:59:36 +01:00', queued=0ms, exec=30005ms * ovndb_servers_start_0 on virt-140 'error' (1): call=11, status='Timed Out', exitreason='', last-rc-change='2021-01-08 18:59:36 +01:00', queued=0ms, exec=30007ms Daemon Status: corosync: active/enabled pacemaker: active/enabled pcsd: active/enabled [root@virt-138 ~]# pcs resource debug-start ovndb_servers Operation start for ovndb_servers (ocf:ovn:ovndb-servers) failed: 'Timed Out' (2) > stdout: Starting ovsdb-nb [FAILED] > stdout: Starting ovsdb-sb [FAILED] > stderr: ovsdb-server: "db:OVN_Northbound,NB_Global,connections": no database named OVN_Northbound > stderr: > stderr: ovn-nbctl: unix:/var/run/ovn/ovnnb_db.sock: database connection failed (No such file or directory) > stderr: ovsdb-server: "db:OVN_Southbound,SB_Global,connections": no database named OVN_Southbound > stderr: (In reply to Markéta Smazová from comment #23) > @Jianlin Shi: Thank you for the reproducer (comment 19). I tried to use it > to test the issue on 3-node pacemaker > cluster with pacemaker-2.0.5-4.el8, on > rhosp-openvswitch-2.13-8.el8ost.noarch and rhosp-ovn-2.13-8.el8ost.noarch, > but I was not successful - the ovndb-servers resource failed to start. I > tried several times to set up the databases > and find out why the ovsdb-server failed to start, but since I haven't > worked with either ovs or ovn before I did not succeed. > > Can you please help and verify if the issue is fixed in > pacemaker-2.0.5-4.el8 ? the issue in Description in reported on rhel7. I tried to reproduce on rhel8, but failed because of bz1915129. so I can't confirm if pacemaker-2.0.5-4.el8 works before bz1915129 is verified. (In reply to Jianlin Shi from comment #24) > (In reply to Markéta Smazová from comment #23) > > @Jianlin Shi: Thank you for the reproducer (comment 19). 
I tried to use it > > to test the issue on 3-node pacemaker > > cluster with pacemaker-2.0.5-4.el8, on > > rhosp-openvswitch-2.13-8.el8ost.noarch and rhosp-ovn-2.13-8.el8ost.noarch, > > but I was not successful - the ovndb-servers resource failed to start. I > > tried several times to set up the databases > > and find out why the ovsdb-server failed to start, but since I haven't > > worked with either ovs or ovn before I did not succeed. > > > > Can you please help and verify if the issue is fixed in > > pacemaker-2.0.5-4.el8 ? > > the issue in Description in reported on rhel7. > I tried to reproduce on rhel8, but failed because of bz1915129. > so I can't confirm if pacemaker-2.0.5-4.el8 works before bz1915129 is > verified. The issue of corosync crashing after an ifdown was fixed as of RHEL 8.0, so that exact issue can't occur in RHEL 8. However anything causing corosync to crash should trigger it. I haven't verified, but I would expect a reproducer would be: 1. Configure a cluster of two nodes where the nodes are VMs with small amounts of memory, and fencing disabled. 2. On one node, kill -9 corosync. Pacemaker should exit, then systemd should start corosync and pacemaker again. Without the fix, repeating that a large number of times should show /dev/shm getting fuller and eventually full. As an aside, with power fencing, this couldn't happen because the node would reboot, which would clear /dev/shm. (In reply to Ken Gaillot from comment #25) > (In reply to Jianlin Shi from comment #24) > > (In reply to Markéta Smazová from comment #23) > > > @Jianlin Shi: Thank you for the reproducer (comment 19). I tried to use it > > > to test the issue on 3-node pacemaker > > > cluster with pacemaker-2.0.5-4.el8, on > > > rhosp-openvswitch-2.13-8.el8ost.noarch and rhosp-ovn-2.13-8.el8ost.noarch, > > > but I was not successful - the ovndb-servers resource failed to start. I > > > tried several times to set up the databases > > > and find out why the ovsdb-server failed to start, but since I haven't > > > worked with either ovs or ovn before I did not succeed. > > > > > > Can you please help and verify if the issue is fixed in > > > pacemaker-2.0.5-4.el8 ? > > > > the issue in Description in reported on rhel7. > > I tried to reproduce on rhel8, but failed because of bz1915129. > > so I can't confirm if pacemaker-2.0.5-4.el8 works before bz1915129 is > > verified. > > The issue of corosync crashing after an ifdown was fixed as of RHEL 8.0, so > that exact issue can't occur in RHEL 8. However anything causing corosync to > crash should trigger it. I haven't verified, but I would expect a reproducer > would be: > > 1. Configure a cluster of two nodes where the nodes are VMs with small > amounts of memory, and fencing disabled. > > 2. On one node, kill -9 corosync. Pacemaker should exit, then systemd should > start corosync and pacemaker again. Without the fix, repeating that a large > number of times should show /dev/shm getting fuller and eventually full. > > As an aside, with power fencing, this couldn't happen because the node would > reboot, which would clear /dev/shm. so with the reproducer, ovn is not mandatory. Markéta, could you follow the steps to reproduce? I tried the reproducer in comment 25 on RHEL8. With the previous version of pacemaker-2.0.4-6.el8 I was able to reproduce the initial issue of /dev/shm getting full after corosync crash. However while testing it on pacemaker-2.0.5-4.el8 I ran into some issues and could not verify the fix. 
After crashing corosync few times pacemaker stopped and would not restart. Details are below "after fix", I also attached a pacemaker.log. @Ken: Could you please help with verification of the fix in pacemaker-2.0.5-4.el8 ? before fix ----------- > [root@virt-016 ~]# rpm -q pacemaker > pacemaker-2.0.4-6.el8.x86_64 Setup two node cluster, disable fencing: > [root@virt-016 ~]# pcs host auth virt-0{16,17} -u hacluster -p password > virt-016: Authorized > virt-017: Authorized > [root@virt-016 ~]# pcs cluster setup test_cluster virt-016 virt-017 --start --wait > [...] > Cluster has been successfully set up. > Starting cluster on hosts: 'virt-016', 'virt-017'... > Waiting for node(s) to start: 'virt-016', 'virt-017'... > virt-016: Cluster started > virt-017: Cluster started > [root@virt-016 ~]# pcs cluster enable --all; pcs property set stonith-enabled=false > virt-016: Cluster Enabled > virt-017: Cluster Enabled > [root@virt-016 ~]# pcs status > Cluster name: test_cluster > Cluster Summary: > * Stack: corosync > * Current DC: virt-017 (version 2.0.4-6.el8-2deceaa3ae) - partition with quorum > * Last updated: Thu Jan 14 15:28:21 2021 > * Last change: Thu Jan 14 15:28:14 2021 by root via cibadmin on virt-016 > * 2 nodes configured > * 0 resource instances configured > Node List: > * Online: [ virt-016 virt-017 ] > Full List of Resources: > * No resources > Daemon Status: > corosync: active/enabled > pacemaker: active/enabled > pcsd: active/enabled Decrease size of `/dev/shm` : > [root@virt-016 15:28:21 ~]# mount -o remount,size=250M /dev/shm; df -h > Filesystem Size Used Avail Use% Mounted on > devtmpfs 989M 0 989M 0% /dev > tmpfs 250M 32M 219M 13% /dev/shm > tmpfs 1009M 8.5M 1000M 1% /run > tmpfs 1009M 0 1009M 0% /sys/fs/cgroup > /dev/mapper/rhel_virt--016-root 6.2G 3.3G 3.0G 52% / > /dev/vda1 1014M 197M 818M 20% /boot > tmpfs 202M 0 202M 0% /run/user/0 Disable "SuccessExitStatus" variable in /usr/lib/systemd/system/pacemaker.service and reload pacemaker. This step is done just for the purpose of this reproducer (the corosync crash results in pacemaker crash and exit with status 100 and the exit status is preventing systemd from restarting pacemaker). > [root@virt-016 ~]# sed -i 's/SuccessExitStatus=100/#SuccessExitStatus=100/' /usr/lib/systemd/system/pacemaker.service > [root@virt-016 ~]# grep SuccessExitStatus=100 /usr/lib/systemd/system/pacemaker.service > #SuccessExitStatus=100 > [root@virt-016 ~]# systemctl daemon-reload > [root@virt-016 ~]# systemctl is-active pacemaker > active On one of the cluster nodes kill corosync, wait until it is restarted and repeat that 10 times: > [root@virt-016 15:30:17 ~]# for ((i=0;i<10;i++)); do while ! systemctl is-active corosync; do sleep 20; done; sleep 15; killall -9 corosync; done > active > failed > [...] After that /dev/shm gets full: > [root@virt-016 15:49:55 ~]# df -h > Filesystem Size Used Avail Use% Mounted on > devtmpfs 989M 0 989M 0% /dev > tmpfs 250M 245M 5.4M 98% /dev/shm > tmpfs 1009M 8.5M 1000M 1% /run > tmpfs 1009M 0 1009M 0% /sys/fs/cgroup > /dev/mapper/rhel_virt--016-root 6.2G 3.3G 3.0G 53% / > /dev/vda1 1014M 197M 818M 20% /boot > tmpfs 202M 0 202M 0% /run/user/0 The cluster node stays offline: > [root@virt-016 15:49:09 ~]# pcs status > Error: error running crm_mon, is pacemaker running? 
> Could not connect to the CIB: Transport endpoint is not connected > crm_mon: Error: cluster is not available on this node > [root@virt-017 15:49:09 ~]# pcs status > Cluster name: test_cluster > Cluster Summary: > * Stack: corosync > * Current DC: virt-017 (version 2.0.4-6.el8-2deceaa3ae) - partition with quorum > * Last updated: Thu Jan 14 16:33:28 2021 > * Last change: Thu Jan 14 15:28:14 2021 by root via cibadmin on virt-016 > * 2 nodes configured > * 0 resource instances configured > Node List: > * Online: [ virt-017 ] > * OFFLINE: [ virt-016 ] > Full List of Resources: > * No resources > Daemon Status: > corosync: active/enabled > pacemaker: active/enabled > pcsd: active/enabled Corosync log excerpts: > [root@virt-016 ~]# tail -f /var/log/cluster/corosync.log | grep -i QB > [...] > Jan 14 15:48:43 [8320] virt-016.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] couldn't create file for mmap > Jan 14 15:48:43 [8320] virt-016.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] qb_rb_open:/dev/shm/qb-8320-8331-35-qXmZoc/qb-event-cmap: No space left on device (28) > Jan 14 15:48:43 [8320] virt-016.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] shm connection FAILED: No space left on device (28) > Jan 14 15:48:43 [8320] virt-016.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] Error in connection setup (/dev/shm/qb-8320-8331-35-qXmZoc/qb): No space left on device (28) > Jan 14 15:48:44 [8320] virt-016.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] couldn't create file for mmap > Jan 14 15:48:44 [8320] virt-016.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] qb_rb_open:/dev/shm/qb-8320-8333-35-4yvZ8r/qb-event-cmap: No space left on device (28) > Jan 14 15:48:44 [8320] virt-016.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] shm connection FAILED: No space left on device (28) > Jan 14 15:48:44 [8320] virt-016.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] Error in connection setup (/dev/shm/qb-8320-8333-35-4yvZ8r/qb): No space left on device (28) after fix ---------- > [root@virt-242 ~]# rpm -q pacemaker > pacemaker-2.0.5-4.el8.x86_64 > [root@virt-242 ~]# pcs host auth virt-2{42,43} -u hacluster -p password > virt-243: Authorized > virt-242: Authorized > [root@virt-242 ~]# pcs cluster setup test_cluster virt-2{42,43} --start --wait > [...] > Cluster has been successfully set up. > Starting cluster on hosts: 'virt-242', 'virt-243'... > Waiting for node(s) to start: 'virt-242', 'virt-243'... 
> virt-242: Cluster started > virt-243: Cluster started > [root@virt-242 ~]# pcs cluster enable --all; pcs property set stonith-enabled=false > virt-242: Cluster Enabled > virt-243: Cluster Enabled > [root@virt-242 ~]# pcs status > Cluster name: test_cluster > Cluster Summary: > * Stack: corosync > * Current DC: virt-243 (version 2.0.5-4.el8-ba59be7122) - partition with quorum > * Last updated: Fri Jan 15 13:40:36 2021 > * Last change: Fri Jan 15 13:40:25 2021 by root via cibadmin on virt-242 > * 2 nodes configured > * 0 resource instances configured > Node List: > * Online: [ virt-242 virt-243 ] > Full List of Resources: > * No resources > Daemon Status: > corosync: active/enabled > pacemaker: active/enabled > pcsd: active/enabled Decrease size of `/dev/shm` : > [root@virt-242 13:40:36 ~]# mount -o remount,size=250M /dev/shm; df -h > Filesystem Size Used Avail Use% Mounted on > devtmpfs 2.0G 0 2.0G 0% /dev > tmpfs 250M 32M 219M 13% /dev/shm > tmpfs 2.0G 65M 2.0G 4% /run > tmpfs 2.0G 0 2.0G 0% /sys/fs/cgroup > /dev/mapper/rhel_virt--242-root 6.2G 3.3G 3.0G 53% / > /dev/vda1 1014M 196M 819M 20% /boot > tmpfs 404M 0 404M 0% /run/user/0 > [root@virt-242 ~]# sed -i 's/SuccessExitStatus=100/#SuccessExitStatus=100/' /usr/lib/systemd/system/pacemaker.service > [root@virt-242 ~]# grep SuccessExitStatus=100 /usr/lib/systemd/system/pacemaker.service > #SuccessExitStatus=100 > [root@virt-242 ~]# systemctl daemon-reload I tried the same script as previously, but after few corosync crashes pacemaker didn't restart, so corosync remained stopped. > [root@virt-242 13:41:12 ~]# for ((i=0;i<10;i++)); do while ! systemctl is-active corosync; do sleep 20; done; sleep 15; killall -9 corosync; done > active > failed > failed > failed > failed > failed > active > failed > failed > failed > active > active > corosync: no process found > failed > failed > [...] > ^C Also /dev/shm got fuller. > [root@virt-242 13:50:16 ~]# df -h > Filesystem Size Used Avail Use% Mounted on > devtmpfs 2.0G 0 2.0G 0% /dev > tmpfs 250M 49M 202M 20% /dev/shm > tmpfs 2.0G 65M 2.0G 4% /run > tmpfs 2.0G 0 2.0G 0% /sys/fs/cgroup > /dev/mapper/rhel_virt--242-root 6.2G 3.3G 3.0G 53% / > /dev/vda1 1014M 196M 819M 20% /boot > tmpfs 404M 0 404M 0% /run/user/0 Pacemaker log excerpt (full pacemaker.log is in attachment): [...] 
> Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [781558] (crmd_fast_exit) error: Could not recover from internal error > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [781558] (crm_xml_cleanup) info: Cleaning up memory from libxml2 > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [781558] (crm_exit) info: Exiting pacemaker-controld | with status 1 > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemakerd [781531] (pcmk_child_exit) error: pacemaker-controld[781558] exited with status 1 (Error occurred) > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemakerd [781531] (stop_child) notice: Stopping pacemaker-schedulerd | sent signal 15 to process 781536 > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemaker-schedulerd[781536] (crm_signal_dispatch) notice: Caught 'Terminated' signal | 15 (invoking handler) > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemaker-schedulerd[781536] (qb_ipcs_us_withdraw) info: withdrawing server sockets > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemaker-schedulerd[781536] (crm_xml_cleanup) info: Cleaning up memory from libxml2 > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemaker-schedulerd[781536] (crm_exit) info: Exiting pacemaker-schedulerd | with status 0 > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemakerd [781531] (pcmk_child_exit) info: pacemaker-schedulerd[781536] exited with status 0 (OK) > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemakerd [781531] (stop_child) notice: Stopping pacemaker-execd | sent signal 15 to process 781534 > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemaker-execd [781534] (crm_signal_dispatch) notice: Caught 'Terminated' signal | 15 (invoking handler) > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemaker-execd [781534] (lrmd_exit) info: Terminating with 0 clients > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemaker-execd [781534] (qb_ipcs_us_withdraw) info: withdrawing server sockets > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemaker-execd [781534] (crm_xml_cleanup) info: Cleaning up memory from libxml2 > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemaker-execd [781534] (crm_exit) info: Exiting pacemaker-execd | with status 0 > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemakerd [781531] (pcmk_child_exit) info: pacemaker-execd[781534] exited with status 0 (OK) > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemakerd [781531] (pcmk_shutdown_worker) notice: Shutdown complete > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemakerd [781531] (qb_ipcs_us_withdraw) info: withdrawing server sockets > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemakerd [781531] (crm_xml_cleanup) info: Cleaning up memory from libxml2 > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemakerd [781531] (crm_exit) info: Exiting pacemakerd | with status 0 Created attachment 1748466 [details] pacemaker.log I am attaching pacemaker.log from after fix reproducer (comment 27). I'm not sure what's going wrong in the "after" case. We might have missed some places where the corosync shutdown functions should be called, or something might be exiting abnormally. Now that we have a reliable reproducer, maybe we can find it easier. Chrissie, would you have time to look at this? 
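The actual pacemaker change (upstream commit dc341923) is not quoted anywhere in this bug, but the pattern proposed earlier - finalize the corosync handle whenever a dispatch call returns anything other than CS_ERR_TRY_AGAIN - looks roughly like the following sketch against the corosync CPG API. The dispatch_cpg() wrapper and its error handling are illustrative only, not pacemaker's real code.

/* Minimal sketch of "close the corosync IPC when dispatch fails".
 * In a real daemon this would be called from the main loop whenever the
 * fd obtained from cpg_fd_get() becomes readable. */
#include <stdbool.h>
#include <stdio.h>
#include <corosync/corotypes.h>
#include <corosync/cpg.h>

/* Returns false when the connection is gone and the caller should stop
 * polling this fd (and eventually exit, letting systemd restart things). */
static bool
dispatch_cpg(cpg_handle_t *handle)
{
    cs_error_t rc = cpg_dispatch(*handle, CS_DISPATCH_ALL);

    if (rc == CS_OK || rc == CS_ERR_TRY_AGAIN) {
        return true;    /* OK or transient: keep the connection open */
    }

    fprintf(stderr, "CPG dispatch failed: %d, closing corosync IPC\n",
            (int) rc);

    /* Finalize on any hard dispatch error. Without this, the client side
     * of the IPC is never torn down and the per-connection qb-* buffers
     * stay in /dev/shm until the node reboots or the tmpfs fills up. */
    cpg_finalize(*handle);
    *handle = 0;
    return false;
}

Finalizing the handle is what allows libqb to unlink the per-connection buffers under /dev/shm; simply exiting after a failed dispatch leaves them behind, which is the leak seen in the outputs above.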
If you kill -9 corosync then it's guaranteed that files will be left in /dev/shm as there is nothing left to tidy up after it. If it's a bug anywhere for this then it's in libqb. There is already a lot of code to try and tidy up after failed processes but ultimately it's almost impossible to fix it completely for the kill -9 case. (In reply to Christine Caulfield from comment #30) > If you kill -9 corosync then it's guaranteed that files will be left in > /dev/shm as there is nothing left to tidy up after it. If it's a bug > anywhere for this then it's in libqb. There is already a lot of code to try > and tidy up after failed processes but ultimately it's almost impossible to > fix it completely for the kill -9 case. Is there some way QA can confirm this fix is improving the situation? Maybe no /dev/shm files owned by hacluster? Do you have any thoughts why Pacemaker didn't restart in some cases? I'm thinking maybe systemd is throttling respawns at that point. @ccaulfie I was trying to reproduce this issue and I found quite interesting results (not saying where is problem, maybe it is in testcpg) which you may look to: Starting with empty /dev/shm and running testcpg on one terminal and ` for i in `seq 1 10`;do corosync; sleep 10; killall -9 corosync;done` on second terminal. Results are quite interesting: ls -la total 90288 drwxrwxrwt 13 root root 700 Jan 21 08:28 . drwxr-xr-x 20 root root 3160 Nov 20 12:25 .. drwxrwx--- 2 root root 160 Jan 21 08:27 qb-362779-362777-31-0z4a8o drwxrwx--- 2 root root 160 Jan 21 08:27 qb-362796-362777-31-hr9UUw drwxrwx--- 2 root root 160 Jan 21 08:27 qb-362811-362777-31-hjt1mD drwxrwx--- 2 root root 160 Jan 21 08:27 qb-362825-362777-31-jbbhD1 drwxrwx--- 2 root root 160 Jan 21 08:27 qb-362839-362777-31-aljMzi drwxrwx--- 2 root root 160 Jan 21 08:28 qb-362853-362777-31-QPTgmI drwxrwx--- 2 root root 160 Jan 21 08:28 qb-362867-362777-31-8VMKQ6 drwxrwx--- 2 root root 160 Jan 21 08:28 qb-362881-362777-31-Adjmvv drwxrwx--- 2 root root 40 Jan 21 08:28 qb-362895-362777-31-cmN52U drwxrwx--- 2 root root 160 Jan 21 08:28 qb-362909-362777-31-cSy1yo drwxrwx--- 2 root root 160 Jan 21 08:28 qb-362923-362777-31-yDw5fN -rw------- 1 root root 8392704 Jan 21 08:27 qb-corosync-362779-blackbox-data -rw------- 1 root root 8248 Jan 21 08:27 qb-corosync-362779-blackbox-header -rw------- 1 root root 8392704 Jan 21 08:27 qb-corosync-362796-blackbox-data -rw------- 1 root root 8248 Jan 21 08:27 qb-corosync-362796-blackbox-header -rw------- 1 root root 8392704 Jan 21 08:27 qb-corosync-362811-blackbox-data -rw------- 1 root root 8248 Jan 21 08:27 qb-corosync-362811-blackbox-header -rw------- 1 root root 8392704 Jan 21 08:27 qb-corosync-362825-blackbox-data -rw------- 1 root root 8248 Jan 21 08:27 qb-corosync-362825-blackbox-header -rw------- 1 root root 8392704 Jan 21 08:27 qb-corosync-362839-blackbox-data -rw------- 1 root root 8248 Jan 21 08:27 qb-corosync-362839-blackbox-header -rw------- 1 root root 8392704 Jan 21 08:28 qb-corosync-362853-blackbox-data -rw------- 1 root root 8248 Jan 21 08:28 qb-corosync-362853-blackbox-header -rw------- 1 root root 8392704 Jan 21 08:28 qb-corosync-362867-blackbox-data -rw------- 1 root root 8248 Jan 21 08:28 qb-corosync-362867-blackbox-header -rw------- 1 root root 8392704 Jan 21 08:28 qb-corosync-362881-blackbox-data -rw------- 1 root root 8248 Jan 21 08:28 qb-corosync-362881-blackbox-header -rw------- 1 root root 8392704 Jan 21 08:28 qb-corosync-362895-blackbox-data -rw------- 1 root root 8248 Jan 21 08:28 qb-corosync-362895-blackbox-header 
-rw------- 1 root root 8392704 Jan 21 08:28 qb-corosync-362909-blackbox-data -rw------- 1 root root 8248 Jan 21 08:28 qb-corosync-362909-blackbox-header -rw------- 1 root root 8392704 Jan 21 08:28 qb-corosync-362923-blackbox-data -rw------- 1 root root 8248 Jan 21 08:28 qb-corosync-362923-blackbox-header The blackbox data are expected (there is really nobody who can delete them) but what is not expected is difference between content of qb-362895-362777-31-cmN52U and qb-362909-362777-31-cSy1yo. One is empty: # ls -la /dev/shm/qb-362895-362777-31-cmN52U total 0 drwxrwx--- 2 root root 40 Jan 21 08:28 . drwxrwxrwt 13 root root 700 Jan 21 08:28 .. And second one is full of files: ls -la /dev/shm/qb-362909-362777-31-cSy1yo total 3120 drwxrwx--- 2 root root 160 Jan 21 08:28 . drwxrwxrwt 13 root root 700 Jan 21 08:28 .. -rw------- 1 root root 1052672 Jan 21 08:28 qb-event-cpg-data -rw------- 1 root root 8248 Jan 21 08:28 qb-event-cpg-header -rw------- 1 root root 1052672 Jan 21 08:28 qb-request-cpg-data -rw------- 1 root root 8252 Jan 21 08:28 qb-request-cpg-header -rw------- 1 root root 1052672 Jan 21 08:28 qb-response-cpg-data -rw------- 1 root root 8248 Jan 21 08:28 qb-response-cpg-header So I was thinking who is actually responsible for deleting client files in /dev/shm/ and if my memory doesn't fool me it is client. So I've tried running testcpg in strace: ... munmap(0x7f66f3384000, 2105344) = 0 munmap(0x7f66f6651000, 8248) = 0 openat(AT_FDCWD, "/dev/shm/qb-362962-362959-31-Q1aMpj", O_RDONLY|O_PATH|O_DIRECTORY) = 3 unlinkat(3, "qb-event-cpg-data", 0) = 0 unlinkat(3, "qb-event-cpg-header", 0) = 0 close(3) = 0 munmap(0x7f66f3182000, 2105344) = 0 munmap(0x7f66f664e000, 8248) = 0 munmap(0x7f66f653e000, 1052672) = 0 write(1, "Finalize+restart result is 2 (sh"..., 43Finalize+restart result is 2 (should be 1) ... so it seems to really be client. Running same cycle (but now testcpg running under strace) and final result is: drwxrwx--- 2 root root 40 Jan 21 08:33 qb-362962-362959-31-Q1aMpj drwxrwx--- 2 root root 40 Jan 21 08:34 qb-362976-362959-31-mg0dOq drwxrwx--- 2 root root 160 Jan 21 08:34 qb-362990-362959-31-UabRiQ drwxrwx--- 2 root root 40 Jan 21 08:34 qb-363004-362959-31-6BT7Te drwxrwx--- 2 root root 40 Jan 21 08:34 qb-363018-362959-31-WozfnE drwxrwx--- 2 root root 40 Jan 21 08:34 qb-363032-362959-31-KuUA52 drwxrwx--- 2 root root 40 Jan 21 08:34 qb-363046-362959-31-uqgbow drwxrwx--- 2 root root 160 Jan 21 08:34 qb-363060-362959-31-WS13kU drwxrwx--- 2 root root 160 Jan 21 08:35 qb-363074-362959-31-4zCUBj drwxrwx--- 2 root root 40 Jan 21 08:35 qb-363088-362959-31-OwIjJI ... So you can see most of the directories were emptied but not all of them so maybe there is some race? Honza is right, it's a libqb race. What happens is that libqb client will clean up after a server that unexpected exits IFF it is sure that the server is definitely no longer running. (normally it's the server's responsibility to tidy up after clients). To do this it tries a kill(pid, 0) and only cleans up the server parts if it returns ESRCH. If the server takes slightly longer to exit than libqb does to get to that bit of code then the files will be left lying around because libqb thinks the server was still running and doesn't clean up - because it expects that the server will do it's own tidying. This will always be a race. 
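For illustration (this is not libqb source code), the client-side check described above boils down to something like the sketch below: the client only removes the server's ring-buffer files when kill(pid, 0) proves the server is already gone, so a server that is still mid-exit wins the race and its files stay in /dev/shm. The helper remove_shm_files_for() is a hypothetical stand-in for libqb's real cleanup.

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical stand-in for libqb's cleanup: real code would unlink the
 * qb-*-data / qb-*-header files belonging to this connection. */
static void
remove_shm_files_for(pid_t server_pid)
{
    printf("would remove /dev/shm buffers left by dead server %ld\n",
           (long) server_pid);
}

static void
client_disconnect_cleanup(pid_t server_pid)
{
    if (kill(server_pid, 0) == -1 && errno == ESRCH) {
        /* The server is provably gone, so nobody else will tidy up:
         * the client removes the server-side buffers itself. */
        remove_shm_files_for(server_pid);
    } else {
        /* The server still appears to exist - possibly only because it
         * has not finished dying after SIGKILL. The client assumes the
         * server will clean up after itself, but a killed corosync never
         * will, so the /dev/shm files are leaked. This is the race. */
    }
}

int
main(void)
{
    client_disconnect_cleanup(1);   /* pid 1 exists: no cleanup happens */
    return 0;
}

This is consistent with the observation above that some qb-* directories end up empty while others keep their data files: which happens depends on which side of the race the client lands on.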
I'm going to look into giving libqb a few tries at this (it knows if the server has closed the connection on the client, so it won't delay normal shutdown - I hope) and see if that's a reasonable solution. It won't close the race fully (that's not really possible, server shutdowns might take ages under some circumstances),, but it might mitigate them a bit more. The patch to pacemaker (the actual subject of this BZ) will help mitigate the problem, but doesn't fully solve it. libqb patch here: https://github.com/ClusterLabs/libqb/pull/434 before fix ----------- > [root@virt-153 ~]# rpm -q pacemaker > pacemaker-2.0.4-6.el8.x86_64 Setup two node cluster, disable fencing: > [root@virt-153 ~]# pcs host auth virt-1{53,54} -u hacluster -p password > virt-154: Authorized > virt-153: Authorized > [root@virt-153 ~]# pcs cluster setup test_cluster virt-153 virt-154 --start --wait > [...] > Cluster has been successfully set up. > Starting cluster on hosts: 'virt-153', 'virt-154'... > Waiting for node(s) to start: 'virt-153', 'virt-154'... > virt-154: Cluster started > virt-153: Cluster started > [root@virt-153 ~]# pcs cluster enable --all; pcs property set stonith-enabled=false > virt-153: Cluster Enabled > virt-154: Cluster Enabled > [root@virt-153 ~]# pcs status > Cluster name: test_cluster > Cluster Summary: > * Stack: corosync > * Current DC: virt-153 (version 2.0.4-6.el8-2deceaa3ae) - partition with quorum > * Last updated: Thu Feb 11 13:51:50 2021 > * Last change: Thu Feb 11 13:51:43 2021 by root via cibadmin on virt-153 > * 2 nodes configured > * 0 resource instances configured > Node List: > * Online: [ virt-153 virt-154 ] > Full List of Resources: > * No resources > Daemon Status: > corosync: active/enabled > pacemaker: active/enabled > pcsd: active/enabled Decrease size of `/dev/shm` : > [root@virt-153 ~]# mount -o remount,size=250M /dev/shm; df -h > Filesystem Size Used Avail Use% Mounted on > devtmpfs 990M 0 990M 0% /dev > tmpfs 250M 47M 204M 19% /dev/shm > tmpfs 1009M 97M 913M 10% /run > tmpfs 1009M 0 1009M 0% /sys/fs/cgroup > /dev/mapper/rhel_virt--153-root 6.2G 3.5G 2.8G 56% / > /dev/vda1 1014M 197M 818M 20% /boot > tmpfs 202M 0 202M 0% /run/user/0 On one of the cluster nodes kill corosync, wait and then restart pacemaker: > [root@virt-153 ~]# for ((i=0;i<15;i++)); do while ! systemctl is-active corosync; do sleep 30; systemctl start pacemaker; done; sleep 15; killall -9 corosync; done > active > failed > active > [...] After that /dev/shm gets full: > [root@virt-153 ~]# df -h > Filesystem Size Used Avail Use% Mounted on > devtmpfs 990M 0 990M 0% /dev > tmpfs 250M 250M 688K 100% /dev/shm > tmpfs 1009M 97M 913M 10% /run > tmpfs 1009M 0 1009M 0% /sys/fs/cgroup > /dev/mapper/rhel_virt--153-root 6.2G 3.5G 2.8G 56% / > /dev/vda1 1014M 197M 818M 20% /boot > tmpfs 202M 0 202M 0% /run/user/0 > [root@virt-153 ~]# ls -l /dev/shm | wc -l > 76 The cluster node stays offline: > [root@virt-153 ~]# pcs status > Error: error running crm_mon, is pacemaker running? 
> Could not connect to the CIB: Transport endpoint is not connected > crm_mon: Error: cluster is not available on this node > [root@virt-154 ~]# pcs status > Cluster name: test_cluster > Cluster Summary: > * Stack: corosync > * Current DC: virt-154 (version 2.0.4-6.el8-2deceaa3ae) - partition with quorum > * Last updated: Thu Feb 11 14:02:59 2021 > * Last change: Thu Feb 11 13:51:43 2021 by root via cibadmin on virt-153 > * 2 nodes configured > * 0 resource instances configured > Node List: > * Node virt-153: UNCLEAN (offline) > * Online: [ virt-154 ] > Full List of Resources: > * No resources > Daemon Status: > corosync: active/enabled > pacemaker: active/enabled > pcsd: active/enabled Corosync log excerpts: > [root@virt-153 ~]# tail -f /var/log/cluster/corosync.log | grep -i QB [...] > Feb 11 14:05:01 [762564] virt-153.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] couldn't create file for mmap > Feb 11 14:05:01 [762564] virt-153.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] qb_rb_open:/dev/shm/qb-762564-762644-34-EXcDw3/qb-request-cmap: No space left on device (28) > Feb 11 14:05:01 [762564] virt-153.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] shm connection FAILED: No space left on device (28) > Feb 11 14:05:01 [762564] virt-153.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] Error in connection setup (/dev/shm/qb-762564-762644-34-EXcDw3/qb): No space left on device (28) > Feb 11 14:05:03 [762564] virt-153.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] couldn't create file for mmap > Feb 11 14:05:03 [762564] virt-153.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] qb_rb_open:/dev/shm/qb-762564-762644-34-WOF4QM/qb-request-cmap: No space left on device (28) > Feb 11 14:05:03 [762564] virt-153.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] shm connection FAILED: No space left on device (28) > Feb 11 14:05:03 [762564] virt-153.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] Error in connection setup (/dev/shm/qb-762564-762644-34-WOF4QM/qb): No space left on device (28) after fix ---------- > [root@virt-175 ~]# rpm -q pacemaker > pacemaker-2.0.5-6.el8.x86_64 Setup two node cluster, disable fencing: > [root@virt-175 ~]# pcs host auth virt-1{75,76} -u hacluster -p password > virt-175: Authorized > virt-176: Authorized > [root@virt-175 ~]# pcs cluster setup test_cluster virt-175 virt-176 --start --wait > [...] > Cluster has been successfully set up. > Starting cluster on hosts: 'virt-175', 'virt-176'... > Waiting for node(s) to start: 'virt-175', 'virt-176'... 
> virt-175: Cluster started > virt-176: Cluster started > [root@virt-175 ~]# pcs cluster enable --all; pcs property set stonith-enabled=false > virt-175: Cluster Enabled > virt-176: Cluster Enabled > [root@virt-175 ~]# pcs status > Cluster name: test_cluster > Cluster Summary: > * Stack: corosync > * Current DC: virt-176 (version 2.0.5-6.el8-ba59be7122) - partition with quorum > * Last updated: Thu Feb 11 14:20:05 2021 > * Last change: Thu Feb 11 14:19:59 2021 by root via cibadmin on virt-175 > * 2 nodes configured > * 0 resource instances configured > Node List: > * Online: [ virt-175 virt-176 ] > Full List of Resources: > * No resources > Daemon Status: > corosync: active/enabled > pacemaker: active/enabled > pcsd: active/enabled Decrease size of `/dev/shm` : > [root@virt-175 ~]# mount -o remount,size=250M /dev/shm; df -h > Filesystem Size Used Avail Use% Mounted on > devtmpfs 990M 0 990M 0% /dev > tmpfs 250M 32M 219M 13% /dev/shm > tmpfs 1010M 22M 989M 3% /run > tmpfs 1010M 0 1010M 0% /sys/fs/cgroup > /dev/mapper/rhel_virt--175-root 6.2G 3.6G 2.7G 58% / > /dev/vda1 1014M 202M 813M 20% /boot > tmpfs 202M 0 202M 0% /run/user/0 On one of the cluster nodes kill corosync, wait and then restart pacemaker: > [root@virt-175 ~]# for ((i=0;i<15;i++)); do while ! systemctl is-active corosync; do sleep 30; systemctl start pacemaker; done; sleep 15; killall -9 corosync; done > active > failed > [...] Now after the fix, /dev/shm fills more slowly: > [root@virt-175 ~]# df -h > Filesystem Size Used Avail Use% Mounted on > devtmpfs 990M 0 990M 0% /dev > tmpfs 250M 182M 69M 73% /dev/shm > tmpfs 1010M 22M 989M 3% /run > tmpfs 1010M 0 1010M 0% /sys/fs/cgroup > /dev/mapper/rhel_virt--175-root 6.2G 3.5G 2.8G 56% / > /dev/vda1 1014M 202M 813M 20% /boot > tmpfs 202M 0 202M 0% /run/user/0 > [root@virt-175 ~]# ls -l /dev/shm | wc -l > 67 Verified as SanityOnly in pacemaker-2.0.5-6.el8. The fix helps with the issue, but doesn't fully solve it. The upcoming libqb patch (Comment 34) is expected to have a much bigger impact on fixing the issue. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2021:1782 |