Bug 1614166
Summary: Always close corosync IPC when dispatch function gets error

Product: Red Hat Enterprise Linux 8
Component: pacemaker
Version: 8.3
Target Milestone: rc
Target Release: 8.4
Hardware: x86_64
OS: Linux
Severity: low
Priority: medium
Status: CLOSED ERRATA
Reporter: haidong li <haili>
Assignee: Ken Gaillot <kgaillot>
QA Contact: cluster-qe <cluster-qe>
CC: ccaulfie, cluster-maint, ctrautma, fleitner, haili, jfriesse, jishi, kgaillot, mmichels, msmazova, qding
Fixed In Version: pacemaker-2.0.5-1.el8
Doc Type: No Doc Update
Doc Text: This is low-level enough to be below most users' visibility
Last Closed: 2021-05-18 15:26:41 UTC
Type: Bug
Bug Depends On: 1885645
Description    haidong li    2018-08-09 06:29:29 UTC

"No space left on device" implies the disk is full. Are there any directories with more entries than expected? Are there any unusually large files (perhaps logs)? I want to try to determine whether the issue is in OVN, OVS, corosync, pacemaker, or something else on the system.

(In reply to Mark Michelson from comment #2)
The log file isn't large:

[root@dell-per730-49 log]# pwd
/var/log
[root@dell-per730-49 log]# ll -h
total 78M
...

Is it caused by the cache? I found the cache can't be released:

[root@dell-per730-49 log]# free -g
              total        used        free      shared  buff/cache   available
Mem:             62           1          18          31          43          28
Swap:            27           0          27
[root@dell-per730-49 log]# cat /proc/sys/vm/drop_caches
0
[root@dell-per730-49 log]# echo 1 > /proc/sys/vm/drop_caches
[root@dell-per730-49 log]# free -g
              total        used        free      shared  buff/cache   available
Mem:             62           1          29          31          32          29
Swap:            27           0          27

You can log in to the machine dell-per730-49.rhts.eng.pek2.redhat.com with root/redhat if you want to check.

Thanks, I logged in and looked around a bit:

[root@dell-per730-49 ~]# df -h
Filesystem                              Size  Used Avail Use% Mounted on
/dev/mapper/rhel_dell--per730--49-root   50G   15G   36G  29% /
devtmpfs                                 32G     0   32G   0% /dev
tmpfs                                    32G   32G   15M 100% /dev/shm
tmpfs                                    32G  595M   31G   2% /run
tmpfs                                    32G     0   32G   0% /sys/fs/cgroup
/dev/sda1                              1014M  146M  869M  15% /boot
/dev/mapper/rhel_dell--per730--49-home  201G   34M  201G   1% /home
tmpfs                                   6.3G     0  6.3G   0% /run/user/0

/dev/shm is 100% used.

[root@dell-per730-49 ~]# ls -l /dev/shm | wc -l
63117

It looks like either libqb or corosync is not cleaning up temporary files.

Hi, it appears that the system is no longer up. Is it possible for you to reproduce this and leave the test system up for someone to look into? I found some people who are interested in having a look when the system gets into its broken state. Thanks!

(In reply to Mark Michelson from comment #6)
Hi Mark, I have reproduced it on another machine; you can log in and check it with root/redhat: dell-per730-20.rhts.eng.pek2.redhat.com

[root@dell-per730-20 ~]# df -h
Filesystem                                            Size  Used Avail Use% Mounted on
/dev/mapper/rhel_dell--per730--20-root                 50G   13G   38G  26% /
devtmpfs                                               32G     0   32G   0% /dev
tmpfs                                                  32G   32G  9.6M 100% /dev/shm
tmpfs                                                  32G  211M   32G   1% /run
tmpfs                                                  32G     0   32G   0% /sys/fs/cgroup
/dev/sda1                                            1014M  236M  779M  24% /boot
/dev/mapper/rhel_dell--per730--20-home                201G   33M  201G   1% /home
ibm-x3250m4-03.rhts.eng.pek2.redhat.com:/data/vmcore  1.4T  1.1T  326G  77% /var/crash
tmpfs                                                 6.3G     0  6.3G   0% /run/user/0

Looking into this more, the files on the filesystem seem to be the result of internal corosync operations. Even if there are errors in our scripts, they shouldn't result in leaks of these types of files on the file system. I'm changing the component of this issue to try to bring in someone who can provide input on it.

The problem you are seeing is really caused by a full /dev/shm; the question is why it becomes full. Another problem is ifdown: corosync doesn't handle ifdown well. It's a long-known problem, which is fixed in the soon-to-be-released corosync 3.x, and we have no plan to fix it properly in corosync 2.x. So the question is: are you able to reproduce the problem without using ifdown (you can use iptables to block UDP traffic)? Could you share your corosync.conf? Are you able to reproduce the problem without pacemaker (so just using corosync)? Also, could you please share the corosync, pacemaker and libqb versions?

Created attachment 1485405 [details]
corosync.conf
I have attached the corosync.conf and listed the versions of corosync, pacemaker and libqb. I will try to reproduce the issue using iptables to block UDP traffic. If it's necessary to reproduce it with corosync only, without pacemaker, can you give some explanation of how to do that, or some commands I should use? I have never used it before. Thanks!
[root@dell-per730-57 ~]# rpm -qa | grep pacemaker
pacemaker-cluster-libs-1.1.19-7.el7.x86_64
pacemaker-cli-1.1.19-7.el7.x86_64
pacemaker-libs-1.1.19-7.el7.x86_64
pacemaker-1.1.19-7.el7.x86_64
[root@dell-per730-57 ~]# rpm -qa | grep corosync
corosync-2.4.3-4.el7.x86_64
corosynclib-2.4.3-4.el7.x86_64
[root@dell-per730-57 ~]# rpm -qa | grep libqb
libqb-1.0.1-7.el7.x86_64
Ok, so corosync.conf is pretty standard and the libs are the newest possible versions, so there should be no problem. It's not entirely necessary to reproduce the issue without pacemaker, but it could reduce the problem space. By "corosync only" I mean just starting corosync: create the cluster as usual, then stop the pacemaker service on all nodes and retry your test. The bug can probably be reproduced on a single node, so you can try just one node.

Another possibility is to take a look at /dev/shm. First, please try to find out which files are using the most space. There should also be files like qb-*-[event|request|response]-PID1-PID2-random-[data|header]. PID1 is the server pid and should be the corosync one; please check that it really is. PID2 should be the client pid, so please take a look at which process is associated with that pid.

@haidong li: Thank you for the login credentials. I can now fully understand what is happening:
1. ifdown (as I wrote, ifdown is unsupported) + ifup results in a corosync crash.
2. The corosync crash is detected by pacemaker, which also terminates with an error code, but it apparently does not close all IPCs (call the *_finalize functions), so the /dev/shm files are not properly deleted.
3. systemd restarts pacemaker because the pacemaker unit is configured to restart on-failure. The pacemaker unit file depends on corosync, so corosync is started first.
4. goto 1

To conclude:
- This bug is not reproducible using the iptables method of failure.
- Virtually all clusters use power fencing (I believe this is the only approved configuration), so the node would be fenced (because corosync was killed) and the problem would never appear - that's the reason I changed the priority and severity of the bug to low.
- We have a partial fix for the ifdown problem for corosync 2.x upstream (https://github.com/corosync/corosync/commit/96354fba72b7e7065610f37df0c0547b1e93ad51), so it will land with the next rebase (if allowed).
-> There is really nothing more we can fix in corosync itself.

That said, I'm reassigning to pacemaker to consider enhancing pacemaker so that it closes the IPC (= calls the *_finalize functions) when corosync crashes (= a *_dispatch function returns an error other than CS_ERR_TRY_AGAIN). This will be considered for RHEL 8 only.

We had trouble reproducing the issue, but we believe it would be fixed by commit dc341923, which is now merged upstream. Can you please provide a reproducer for QE testing?

I succeeded in reproducing the issue on openvswitch2.10-ovn-common-2.10.0-0.20180724git1ac6908.el7fdp.x86_64 with the following steps:
1. install pacemaker on 3 nodes and start pcsd
2. install ovs and ovn on 3 nodes, start openvswitch, and disable selinux
3.
start pcs setenforce 0 systemctl start openvswitch ip_s=1.1.1.16 ip_c1=1.1.1.17 ip_c2=1.1.1.18 ip_v=1.1.1.100 (sleep 2;echo "hacluster"; sleep 2; echo "redhat" ) |pcs cluster auth $ip_c1 $ip_c2 $ip_s sleep 5 pcs cluster setup --force --start --name my_cluster $ip_c1 $ip_c2 $ip_s pcs cluster enable --all pcs property set stonith-enabled=false pcs property set no-quorum-policy=ignore pcs cluster cib tmp-cib.xml sleep 10 cp tmp-cib.xml tmp-cib.deltasrc pcs resource delete ip-$ip_v pcs resource delete ovndb_servers-master sleep 5 pcs status pcs -f tmp-cib.xml resource create ip-$ip_v ocf:heartbeat:IPaddr2 ip=$ip_v op monitor interval=30s sleep 5 pcs -f tmp-cib.xml resource create ovndb_servers ocf:ovn:ovndb-servers manage_northd=yes master_ip=$ip_v nb_master_port=6641 sb_master_port=6642 master sleep 5 pcs -f tmp-cib.xml resource meta ovndb_servers-master notify=true pcs -f tmp-cib.xml constraint order start ip-$ip_v then promote ovndb_servers-master pcs -f tmp-cib.xml constraint colocation add ip-$ip_v with master ovndb_servers-master pcs -f tmp-cib.xml constraint location ip-$ip_v prefers $ip_s=1500 pcs -f tmp-cib.xml constraint location ovndb_servers-master prefers $ip_s=1500 pcs -f tmp-cib.xml constraint location ip-$ip_v prefers $ip_c2=1000 pcs -f tmp-cib.xml constraint location ovndb_servers-master prefers $ip_c2=1000 pcs -f tmp-cib.xml constraint location ip-$ip_v prefers $ip_c1=500 pcs -f tmp-cib.xml constraint location ovndb_servers-master prefers $ip_c1=500 pcs cluster cib-push tmp-cib.xml diff-against=tmp-cib.deltasrc 4. after setup, master is 1.1.1.17 5. set interface down and up on 1.1.1.17 with following script for ((i=0;i<10000;i++));do ip link set p1p1 down;sleep 15;ip link set p1p1 up;sleep 15;done 6. after hours, the issue happens [root@wsfd-advnetlab18 ovn2.10.0]# pcs status Cluster name: my_cluster WARNINGS: Corosync and pacemaker node names do not match (IPs used in setup?) 
Stack: corosync Current DC: wsfd-advnetlab16.anl.lab.eng.bos.redhat.com (version 1.1.19-8.el7-c3c624ea3d) - partition with quorum Last updated: Wed Oct 21 21:52:29 2020 Last change: Wed Oct 21 16:14:36 2020 by root via crm_attribute on wsfd-advnetlab16.anl.lab.eng.bos.redhat.com 3 nodes configured 4 resources configured Online: [ wsfd-advnetlab16.anl.lab.eng.bos.redhat.com wsfd-advnetlab18.anl.lab.eng.bos.redhat.com ] OFFLINE: [ wsfd-advnetlab17.anl.lab.eng.bos.redhat.com ] Full list of resources: ip-1.1.1.100 (ocf::heartbeat:IPaddr2): Started wsfd-advnetlab16.anl.lab.eng.bos.redhat.com Master/Slave Set: ovndb_servers-master [ovndb_servers] Masters: [ wsfd-advnetlab16.anl.lab.eng.bos.redhat.com ] Slaves: [ wsfd-advnetlab18.anl.lab.eng.bos.redhat.com ] Stopped: [ wsfd-advnetlab17.anl.lab.eng.bos.redhat.com ] Daemon Status: corosync: active/enabled pacemaker: active/enabled pcsd: active/disabled [root@wsfd-advnetlab17 ~]# rpm -qa | grep -E "openvswitch|ovn" openvswitch-selinux-extra-policy-1.0-3.el7fdp.noarch openvswitch2.10-ovn-central-2.10.0-0.20180724git1ac6908.el7fdp.x86_64 openvswitch2.10-2.10.0-0.20180724git1ac6908.el7fdp.x86_64 openvswitch2.10-ovn-common-2.10.0-0.20180724git1ac6908.el7fdp.x86_64 openvswitch2.10-ovn-host-2.10.0-0.20180724git1ac6908.el7fdp.x86_64 [root@wsfd-advnetlab17 ~]# tail /var/log/cluster/corosync.log [133905] wsfd-advnetlab17.anl.lab.eng.bos.redhat.com corosyncnotice [TOTEM ] adding new UDPU member {1.1.1.18} [133905] wsfd-advnetlab17.anl.lab.eng.bos.redhat.com corosyncnotice [TOTEM ] adding new UDPU member {1.1.1.16} [133905] wsfd-advnetlab17.anl.lab.eng.bos.redhat.com corosyncnotice [TOTEM ] The network interface [1.1.1.17] is now up. [133905] wsfd-advnetlab17.anl.lab.eng.bos.redhat.com corosyncnotice [TOTEM ] adding new UDPU member {1.1.1.17} [133905] wsfd-advnetlab17.anl.lab.eng.bos.redhat.com corosyncnotice [TOTEM ] adding new UDPU member {1.1.1.18} [133905] wsfd-advnetlab17.anl.lab.eng.bos.redhat.com corosyncnotice [TOTEM ] adding new UDPU member {1.1.1.16} [133905] wsfd-advnetlab17.anl.lab.eng.bos.redhat.com corosyncerror [QB ] couldn't create file for mmap [133905] wsfd-advnetlab17.anl.lab.eng.bos.redhat.com corosyncerror [QB ] qb_rb_open:cmap-event-133906-133941-25: No space left on device (28) [133905] wsfd-advnetlab17.anl.lab.eng.bos.redhat.com corosyncerror [QB ] shm connection FAILED: No space left on device (28) [133905] wsfd-advnetlab17.anl.lab.eng.bos.redhat.com corosyncerror [QB ] Error in connection setup (133906-133941-25): No space left on device (28) @Jianlin Shi: Thank you for the reproducer (comment 19). I tried to use it to test the issue on 3-node pacemaker cluster with pacemaker-2.0.5-4.el8, on rhosp-openvswitch-2.13-8.el8ost.noarch and rhosp-ovn-2.13-8.el8ost.noarch, but I was not successful - the ovndb-servers resource failed to start. I tried several times to set up the databases and find out why the ovsdb-server failed to start, but since I haven't worked with either ovs or ovn before I did not succeed. Can you please help and verify if the issue is fixed in pacemaker-2.0.5-4.el8 ? 
[root@virt-138 ~]# rpm -q pacemaker pacemaker-2.0.5-4.el8.x86_64 [root@virt-138 ~]# rpm -qa | grep -E "openvswitch|ovn" openvswitch2.13-2.13.0-71.el8fdp.x86_64 openvswitch-selinux-extra-policy-1.0-22.el8fdp.noarch network-scripts-openvswitch2.13-2.13.0-71.el8fdp.x86_64 network-scripts-openvswitch2.11-2.11.3-74.el8fdp.x86_64 rhosp-ovn-2.13-8.el8ost.noarch rhosp-openvswitch-2.13-8.el8ost.noarch ovn2.13-20.09.0-17.el8fdp.x86_64 [root@virt-138 ~]# systemctl is-active pcsd active [root@virt-138 ~]# getenforce Permissive [root@virt-138 ~]# systemctl is-active openvswitch active > Cluster set-up as described in comment 19 [ ... ] > After cluster setup, the ovndb-servers failed to start. [root@virt-138 ~]# pcs status Cluster name: my_cluster Cluster Summary: * Stack: corosync * Current DC: virt-138 (version 2.0.5-4.el8-ba59be7122) - partition with quorum * Last updated: Fri Jan 8 19:00:15 2021 * Last change: Fri Jan 8 18:59:35 2021 by root via cibadmin on virt-138 * 3 nodes configured * 4 resource instances configured Node List: * Online: [ virt-138 virt-140 virt-150 ] Full List of Resources: * VirtualIP (ocf::heartbeat:IPaddr2): Stopped * Clone Set: ovndb_servers-clone [ovndb_servers] (promotable): * Stopped: [ virt-138 virt-140 virt-150 ] Failed Resource Actions: * ovndb_servers_start_0 on virt-150 'error' (1): call=11, status='Timed Out', exitreason='', last-rc-change='2021-01-08 18:59:36 +01:00', queued=0ms, exec=30010ms * ovndb_servers_start_0 on virt-138 'error' (1): call=11, status='Timed Out', exitreason='', last-rc-change='2021-01-08 18:59:36 +01:00', queued=0ms, exec=30005ms * ovndb_servers_start_0 on virt-140 'error' (1): call=11, status='Timed Out', exitreason='', last-rc-change='2021-01-08 18:59:36 +01:00', queued=0ms, exec=30007ms Daemon Status: corosync: active/enabled pacemaker: active/enabled pcsd: active/enabled [root@virt-138 ~]# pcs resource debug-start ovndb_servers Operation start for ovndb_servers (ocf:ovn:ovndb-servers) failed: 'Timed Out' (2) > stdout: Starting ovsdb-nb [FAILED] > stdout: Starting ovsdb-sb [FAILED] > stderr: ovsdb-server: "db:OVN_Northbound,NB_Global,connections": no database named OVN_Northbound > stderr: > stderr: ovn-nbctl: unix:/var/run/ovn/ovnnb_db.sock: database connection failed (No such file or directory) > stderr: ovsdb-server: "db:OVN_Southbound,SB_Global,connections": no database named OVN_Southbound > stderr: (In reply to Markéta Smazová from comment #23) > @Jianlin Shi: Thank you for the reproducer (comment 19). I tried to use it > to test the issue on 3-node pacemaker > cluster with pacemaker-2.0.5-4.el8, on > rhosp-openvswitch-2.13-8.el8ost.noarch and rhosp-ovn-2.13-8.el8ost.noarch, > but I was not successful - the ovndb-servers resource failed to start. I > tried several times to set up the databases > and find out why the ovsdb-server failed to start, but since I haven't > worked with either ovs or ovn before I did not succeed. > > Can you please help and verify if the issue is fixed in > pacemaker-2.0.5-4.el8 ? the issue in Description in reported on rhel7. I tried to reproduce on rhel8, but failed because of bz1915129. so I can't confirm if pacemaker-2.0.5-4.el8 works before bz1915129 is verified. (In reply to Jianlin Shi from comment #24) > (In reply to Markéta Smazová from comment #23) > > @Jianlin Shi: Thank you for the reproducer (comment 19). 
I tried to use it > > to test the issue on 3-node pacemaker > > cluster with pacemaker-2.0.5-4.el8, on > > rhosp-openvswitch-2.13-8.el8ost.noarch and rhosp-ovn-2.13-8.el8ost.noarch, > > but I was not successful - the ovndb-servers resource failed to start. I > > tried several times to set up the databases > > and find out why the ovsdb-server failed to start, but since I haven't > > worked with either ovs or ovn before I did not succeed. > > > > Can you please help and verify if the issue is fixed in > > pacemaker-2.0.5-4.el8 ? > > the issue in Description in reported on rhel7. > I tried to reproduce on rhel8, but failed because of bz1915129. > so I can't confirm if pacemaker-2.0.5-4.el8 works before bz1915129 is > verified. The issue of corosync crashing after an ifdown was fixed as of RHEL 8.0, so that exact issue can't occur in RHEL 8. However anything causing corosync to crash should trigger it. I haven't verified, but I would expect a reproducer would be: 1. Configure a cluster of two nodes where the nodes are VMs with small amounts of memory, and fencing disabled. 2. On one node, kill -9 corosync. Pacemaker should exit, then systemd should start corosync and pacemaker again. Without the fix, repeating that a large number of times should show /dev/shm getting fuller and eventually full. As an aside, with power fencing, this couldn't happen because the node would reboot, which would clear /dev/shm. (In reply to Ken Gaillot from comment #25) > (In reply to Jianlin Shi from comment #24) > > (In reply to Markéta Smazová from comment #23) > > > @Jianlin Shi: Thank you for the reproducer (comment 19). I tried to use it > > > to test the issue on 3-node pacemaker > > > cluster with pacemaker-2.0.5-4.el8, on > > > rhosp-openvswitch-2.13-8.el8ost.noarch and rhosp-ovn-2.13-8.el8ost.noarch, > > > but I was not successful - the ovndb-servers resource failed to start. I > > > tried several times to set up the databases > > > and find out why the ovsdb-server failed to start, but since I haven't > > > worked with either ovs or ovn before I did not succeed. > > > > > > Can you please help and verify if the issue is fixed in > > > pacemaker-2.0.5-4.el8 ? > > > > the issue in Description in reported on rhel7. > > I tried to reproduce on rhel8, but failed because of bz1915129. > > so I can't confirm if pacemaker-2.0.5-4.el8 works before bz1915129 is > > verified. > > The issue of corosync crashing after an ifdown was fixed as of RHEL 8.0, so > that exact issue can't occur in RHEL 8. However anything causing corosync to > crash should trigger it. I haven't verified, but I would expect a reproducer > would be: > > 1. Configure a cluster of two nodes where the nodes are VMs with small > amounts of memory, and fencing disabled. > > 2. On one node, kill -9 corosync. Pacemaker should exit, then systemd should > start corosync and pacemaker again. Without the fix, repeating that a large > number of times should show /dev/shm getting fuller and eventually full. > > As an aside, with power fencing, this couldn't happen because the node would > reboot, which would clear /dev/shm. so with the reproducer, ovn is not mandatory. Markéta, could you follow the steps to reproduce? I tried the reproducer in comment 25 on RHEL8. With the previous version of pacemaker-2.0.4-6.el8 I was able to reproduce the initial issue of /dev/shm getting full after corosync crash. However while testing it on pacemaker-2.0.5-4.el8 I ran into some issues and could not verify the fix. 
After crashing corosync few times pacemaker stopped and would not restart. Details are below "after fix", I also attached a pacemaker.log. @Ken: Could you please help with verification of the fix in pacemaker-2.0.5-4.el8 ? before fix ----------- > [root@virt-016 ~]# rpm -q pacemaker > pacemaker-2.0.4-6.el8.x86_64 Setup two node cluster, disable fencing: > [root@virt-016 ~]# pcs host auth virt-0{16,17} -u hacluster -p password > virt-016: Authorized > virt-017: Authorized > [root@virt-016 ~]# pcs cluster setup test_cluster virt-016 virt-017 --start --wait > [...] > Cluster has been successfully set up. > Starting cluster on hosts: 'virt-016', 'virt-017'... > Waiting for node(s) to start: 'virt-016', 'virt-017'... > virt-016: Cluster started > virt-017: Cluster started > [root@virt-016 ~]# pcs cluster enable --all; pcs property set stonith-enabled=false > virt-016: Cluster Enabled > virt-017: Cluster Enabled > [root@virt-016 ~]# pcs status > Cluster name: test_cluster > Cluster Summary: > * Stack: corosync > * Current DC: virt-017 (version 2.0.4-6.el8-2deceaa3ae) - partition with quorum > * Last updated: Thu Jan 14 15:28:21 2021 > * Last change: Thu Jan 14 15:28:14 2021 by root via cibadmin on virt-016 > * 2 nodes configured > * 0 resource instances configured > Node List: > * Online: [ virt-016 virt-017 ] > Full List of Resources: > * No resources > Daemon Status: > corosync: active/enabled > pacemaker: active/enabled > pcsd: active/enabled Decrease size of `/dev/shm` : > [root@virt-016 15:28:21 ~]# mount -o remount,size=250M /dev/shm; df -h > Filesystem Size Used Avail Use% Mounted on > devtmpfs 989M 0 989M 0% /dev > tmpfs 250M 32M 219M 13% /dev/shm > tmpfs 1009M 8.5M 1000M 1% /run > tmpfs 1009M 0 1009M 0% /sys/fs/cgroup > /dev/mapper/rhel_virt--016-root 6.2G 3.3G 3.0G 52% / > /dev/vda1 1014M 197M 818M 20% /boot > tmpfs 202M 0 202M 0% /run/user/0 Disable "SuccessExitStatus" variable in /usr/lib/systemd/system/pacemaker.service and reload pacemaker. This step is done just for the purpose of this reproducer (the corosync crash results in pacemaker crash and exit with status 100 and the exit status is preventing systemd from restarting pacemaker). > [root@virt-016 ~]# sed -i 's/SuccessExitStatus=100/#SuccessExitStatus=100/' /usr/lib/systemd/system/pacemaker.service > [root@virt-016 ~]# grep SuccessExitStatus=100 /usr/lib/systemd/system/pacemaker.service > #SuccessExitStatus=100 > [root@virt-016 ~]# systemctl daemon-reload > [root@virt-016 ~]# systemctl is-active pacemaker > active On one of the cluster nodes kill corosync, wait until it is restarted and repeat that 10 times: > [root@virt-016 15:30:17 ~]# for ((i=0;i<10;i++)); do while ! systemctl is-active corosync; do sleep 20; done; sleep 15; killall -9 corosync; done > active > failed > [...] After that /dev/shm gets full: > [root@virt-016 15:49:55 ~]# df -h > Filesystem Size Used Avail Use% Mounted on > devtmpfs 989M 0 989M 0% /dev > tmpfs 250M 245M 5.4M 98% /dev/shm > tmpfs 1009M 8.5M 1000M 1% /run > tmpfs 1009M 0 1009M 0% /sys/fs/cgroup > /dev/mapper/rhel_virt--016-root 6.2G 3.3G 3.0G 53% / > /dev/vda1 1014M 197M 818M 20% /boot > tmpfs 202M 0 202M 0% /run/user/0 The cluster node stays offline: > [root@virt-016 15:49:09 ~]# pcs status > Error: error running crm_mon, is pacemaker running? 
> Could not connect to the CIB: Transport endpoint is not connected > crm_mon: Error: cluster is not available on this node > [root@virt-017 15:49:09 ~]# pcs status > Cluster name: test_cluster > Cluster Summary: > * Stack: corosync > * Current DC: virt-017 (version 2.0.4-6.el8-2deceaa3ae) - partition with quorum > * Last updated: Thu Jan 14 16:33:28 2021 > * Last change: Thu Jan 14 15:28:14 2021 by root via cibadmin on virt-016 > * 2 nodes configured > * 0 resource instances configured > Node List: > * Online: [ virt-017 ] > * OFFLINE: [ virt-016 ] > Full List of Resources: > * No resources > Daemon Status: > corosync: active/enabled > pacemaker: active/enabled > pcsd: active/enabled Corosync log excerpts: > [root@virt-016 ~]# tail -f /var/log/cluster/corosync.log | grep -i QB > [...] > Jan 14 15:48:43 [8320] virt-016.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] couldn't create file for mmap > Jan 14 15:48:43 [8320] virt-016.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] qb_rb_open:/dev/shm/qb-8320-8331-35-qXmZoc/qb-event-cmap: No space left on device (28) > Jan 14 15:48:43 [8320] virt-016.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] shm connection FAILED: No space left on device (28) > Jan 14 15:48:43 [8320] virt-016.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] Error in connection setup (/dev/shm/qb-8320-8331-35-qXmZoc/qb): No space left on device (28) > Jan 14 15:48:44 [8320] virt-016.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] couldn't create file for mmap > Jan 14 15:48:44 [8320] virt-016.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] qb_rb_open:/dev/shm/qb-8320-8333-35-4yvZ8r/qb-event-cmap: No space left on device (28) > Jan 14 15:48:44 [8320] virt-016.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] shm connection FAILED: No space left on device (28) > Jan 14 15:48:44 [8320] virt-016.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] Error in connection setup (/dev/shm/qb-8320-8333-35-4yvZ8r/qb): No space left on device (28) after fix ---------- > [root@virt-242 ~]# rpm -q pacemaker > pacemaker-2.0.5-4.el8.x86_64 > [root@virt-242 ~]# pcs host auth virt-2{42,43} -u hacluster -p password > virt-243: Authorized > virt-242: Authorized > [root@virt-242 ~]# pcs cluster setup test_cluster virt-2{42,43} --start --wait > [...] > Cluster has been successfully set up. > Starting cluster on hosts: 'virt-242', 'virt-243'... > Waiting for node(s) to start: 'virt-242', 'virt-243'... 
> virt-242: Cluster started > virt-243: Cluster started > [root@virt-242 ~]# pcs cluster enable --all; pcs property set stonith-enabled=false > virt-242: Cluster Enabled > virt-243: Cluster Enabled > [root@virt-242 ~]# pcs status > Cluster name: test_cluster > Cluster Summary: > * Stack: corosync > * Current DC: virt-243 (version 2.0.5-4.el8-ba59be7122) - partition with quorum > * Last updated: Fri Jan 15 13:40:36 2021 > * Last change: Fri Jan 15 13:40:25 2021 by root via cibadmin on virt-242 > * 2 nodes configured > * 0 resource instances configured > Node List: > * Online: [ virt-242 virt-243 ] > Full List of Resources: > * No resources > Daemon Status: > corosync: active/enabled > pacemaker: active/enabled > pcsd: active/enabled Decrease size of `/dev/shm` : > [root@virt-242 13:40:36 ~]# mount -o remount,size=250M /dev/shm; df -h > Filesystem Size Used Avail Use% Mounted on > devtmpfs 2.0G 0 2.0G 0% /dev > tmpfs 250M 32M 219M 13% /dev/shm > tmpfs 2.0G 65M 2.0G 4% /run > tmpfs 2.0G 0 2.0G 0% /sys/fs/cgroup > /dev/mapper/rhel_virt--242-root 6.2G 3.3G 3.0G 53% / > /dev/vda1 1014M 196M 819M 20% /boot > tmpfs 404M 0 404M 0% /run/user/0 > [root@virt-242 ~]# sed -i 's/SuccessExitStatus=100/#SuccessExitStatus=100/' /usr/lib/systemd/system/pacemaker.service > [root@virt-242 ~]# grep SuccessExitStatus=100 /usr/lib/systemd/system/pacemaker.service > #SuccessExitStatus=100 > [root@virt-242 ~]# systemctl daemon-reload I tried the same script as previously, but after few corosync crashes pacemaker didn't restart, so corosync remained stopped. > [root@virt-242 13:41:12 ~]# for ((i=0;i<10;i++)); do while ! systemctl is-active corosync; do sleep 20; done; sleep 15; killall -9 corosync; done > active > failed > failed > failed > failed > failed > active > failed > failed > failed > active > active > corosync: no process found > failed > failed > [...] > ^C Also /dev/shm got fuller. > [root@virt-242 13:50:16 ~]# df -h > Filesystem Size Used Avail Use% Mounted on > devtmpfs 2.0G 0 2.0G 0% /dev > tmpfs 250M 49M 202M 20% /dev/shm > tmpfs 2.0G 65M 2.0G 4% /run > tmpfs 2.0G 0 2.0G 0% /sys/fs/cgroup > /dev/mapper/rhel_virt--242-root 6.2G 3.3G 3.0G 53% / > /dev/vda1 1014M 196M 819M 20% /boot > tmpfs 404M 0 404M 0% /run/user/0 Pacemaker log excerpt (full pacemaker.log is in attachment): [...] 
> Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [781558] (crmd_fast_exit) error: Could not recover from internal error > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [781558] (crm_xml_cleanup) info: Cleaning up memory from libxml2 > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemaker-controld [781558] (crm_exit) info: Exiting pacemaker-controld | with status 1 > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemakerd [781531] (pcmk_child_exit) error: pacemaker-controld[781558] exited with status 1 (Error occurred) > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemakerd [781531] (stop_child) notice: Stopping pacemaker-schedulerd | sent signal 15 to process 781536 > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemaker-schedulerd[781536] (crm_signal_dispatch) notice: Caught 'Terminated' signal | 15 (invoking handler) > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemaker-schedulerd[781536] (qb_ipcs_us_withdraw) info: withdrawing server sockets > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemaker-schedulerd[781536] (crm_xml_cleanup) info: Cleaning up memory from libxml2 > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemaker-schedulerd[781536] (crm_exit) info: Exiting pacemaker-schedulerd | with status 0 > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemakerd [781531] (pcmk_child_exit) info: pacemaker-schedulerd[781536] exited with status 0 (OK) > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemakerd [781531] (stop_child) notice: Stopping pacemaker-execd | sent signal 15 to process 781534 > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemaker-execd [781534] (crm_signal_dispatch) notice: Caught 'Terminated' signal | 15 (invoking handler) > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemaker-execd [781534] (lrmd_exit) info: Terminating with 0 clients > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemaker-execd [781534] (qb_ipcs_us_withdraw) info: withdrawing server sockets > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemaker-execd [781534] (crm_xml_cleanup) info: Cleaning up memory from libxml2 > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemaker-execd [781534] (crm_exit) info: Exiting pacemaker-execd | with status 0 > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemakerd [781531] (pcmk_child_exit) info: pacemaker-execd[781534] exited with status 0 (OK) > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemakerd [781531] (pcmk_shutdown_worker) notice: Shutdown complete > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemakerd [781531] (qb_ipcs_us_withdraw) info: withdrawing server sockets > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemakerd [781531] (crm_xml_cleanup) info: Cleaning up memory from libxml2 > Jan 15 13:47:00 virt-242.cluster-qe.lab.eng.brq.redhat.com pacemakerd [781531] (crm_exit) info: Exiting pacemakerd | with status 0 Created attachment 1748466 [details] pacemaker.log I am attaching pacemaker.log from after fix reproducer (comment 27). I'm not sure what's going wrong in the "after" case. We might have missed some places where the corosync shutdown functions should be called, or something might be exiting abnormally. Now that we have a reliable reproducer, maybe we can find it easier. Chrissie, would you have time to look at this? 
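The actual pacemaker change (upstream commit dc341923) is not quoted anywhere in this bug, but the pattern proposed earlier - finalize the corosync handle whenever a dispatch call returns anything other than CS_ERR_TRY_AGAIN - looks roughly like the following sketch against the corosync CPG API. The dispatch_cpg() wrapper and its error handling are illustrative only, not pacemaker's real code.

/* Minimal sketch of "close the corosync IPC when dispatch fails".
 * In a real daemon this would be called from the main loop whenever the
 * fd obtained from cpg_fd_get() becomes readable. */
#include <stdbool.h>
#include <stdio.h>
#include <corosync/corotypes.h>
#include <corosync/cpg.h>

/* Returns false when the connection is gone and the caller should stop
 * polling this fd (and eventually exit, letting systemd restart things). */
static bool
dispatch_cpg(cpg_handle_t *handle)
{
    cs_error_t rc = cpg_dispatch(*handle, CS_DISPATCH_ALL);

    if (rc == CS_OK || rc == CS_ERR_TRY_AGAIN) {
        return true;    /* OK or transient: keep the connection open */
    }

    fprintf(stderr, "CPG dispatch failed: %d, closing corosync IPC\n",
            (int) rc);

    /* Finalize on any hard dispatch error. Without this, the client side
     * of the IPC is never torn down and the per-connection qb-* buffers
     * stay in /dev/shm until the node reboots or the tmpfs fills up. */
    cpg_finalize(*handle);
    *handle = 0;
    return false;
}

Finalizing the handle is what allows libqb to unlink the per-connection buffers under /dev/shm; simply exiting after a failed dispatch leaves them behind, which is the leak seen in the outputs above.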
If you kill -9 corosync then it's guaranteed that files will be left in /dev/shm as there is nothing left to tidy up after it. If it's a bug anywhere for this then it's in libqb. There is already a lot of code to try and tidy up after failed processes but ultimately it's almost impossible to fix it completely for the kill -9 case. (In reply to Christine Caulfield from comment #30) > If you kill -9 corosync then it's guaranteed that files will be left in > /dev/shm as there is nothing left to tidy up after it. If it's a bug > anywhere for this then it's in libqb. There is already a lot of code to try > and tidy up after failed processes but ultimately it's almost impossible to > fix it completely for the kill -9 case. Is there some way QA can confirm this fix is improving the situation? Maybe no /dev/shm files owned by hacluster? Do you have any thoughts why Pacemaker didn't restart in some cases? I'm thinking maybe systemd is throttling respawns at that point. @ccaulfie I was trying to reproduce this issue and I found quite interesting results (not saying where is problem, maybe it is in testcpg) which you may look to: Starting with empty /dev/shm and running testcpg on one terminal and ` for i in `seq 1 10`;do corosync; sleep 10; killall -9 corosync;done` on second terminal. Results are quite interesting: ls -la total 90288 drwxrwxrwt 13 root root 700 Jan 21 08:28 . drwxr-xr-x 20 root root 3160 Nov 20 12:25 .. drwxrwx--- 2 root root 160 Jan 21 08:27 qb-362779-362777-31-0z4a8o drwxrwx--- 2 root root 160 Jan 21 08:27 qb-362796-362777-31-hr9UUw drwxrwx--- 2 root root 160 Jan 21 08:27 qb-362811-362777-31-hjt1mD drwxrwx--- 2 root root 160 Jan 21 08:27 qb-362825-362777-31-jbbhD1 drwxrwx--- 2 root root 160 Jan 21 08:27 qb-362839-362777-31-aljMzi drwxrwx--- 2 root root 160 Jan 21 08:28 qb-362853-362777-31-QPTgmI drwxrwx--- 2 root root 160 Jan 21 08:28 qb-362867-362777-31-8VMKQ6 drwxrwx--- 2 root root 160 Jan 21 08:28 qb-362881-362777-31-Adjmvv drwxrwx--- 2 root root 40 Jan 21 08:28 qb-362895-362777-31-cmN52U drwxrwx--- 2 root root 160 Jan 21 08:28 qb-362909-362777-31-cSy1yo drwxrwx--- 2 root root 160 Jan 21 08:28 qb-362923-362777-31-yDw5fN -rw------- 1 root root 8392704 Jan 21 08:27 qb-corosync-362779-blackbox-data -rw------- 1 root root 8248 Jan 21 08:27 qb-corosync-362779-blackbox-header -rw------- 1 root root 8392704 Jan 21 08:27 qb-corosync-362796-blackbox-data -rw------- 1 root root 8248 Jan 21 08:27 qb-corosync-362796-blackbox-header -rw------- 1 root root 8392704 Jan 21 08:27 qb-corosync-362811-blackbox-data -rw------- 1 root root 8248 Jan 21 08:27 qb-corosync-362811-blackbox-header -rw------- 1 root root 8392704 Jan 21 08:27 qb-corosync-362825-blackbox-data -rw------- 1 root root 8248 Jan 21 08:27 qb-corosync-362825-blackbox-header -rw------- 1 root root 8392704 Jan 21 08:27 qb-corosync-362839-blackbox-data -rw------- 1 root root 8248 Jan 21 08:27 qb-corosync-362839-blackbox-header -rw------- 1 root root 8392704 Jan 21 08:28 qb-corosync-362853-blackbox-data -rw------- 1 root root 8248 Jan 21 08:28 qb-corosync-362853-blackbox-header -rw------- 1 root root 8392704 Jan 21 08:28 qb-corosync-362867-blackbox-data -rw------- 1 root root 8248 Jan 21 08:28 qb-corosync-362867-blackbox-header -rw------- 1 root root 8392704 Jan 21 08:28 qb-corosync-362881-blackbox-data -rw------- 1 root root 8248 Jan 21 08:28 qb-corosync-362881-blackbox-header -rw------- 1 root root 8392704 Jan 21 08:28 qb-corosync-362895-blackbox-data -rw------- 1 root root 8248 Jan 21 08:28 qb-corosync-362895-blackbox-header 
-rw------- 1 root root 8392704 Jan 21 08:28 qb-corosync-362909-blackbox-data -rw------- 1 root root 8248 Jan 21 08:28 qb-corosync-362909-blackbox-header -rw------- 1 root root 8392704 Jan 21 08:28 qb-corosync-362923-blackbox-data -rw------- 1 root root 8248 Jan 21 08:28 qb-corosync-362923-blackbox-header The blackbox data are expected (there is really nobody who can delete them) but what is not expected is difference between content of qb-362895-362777-31-cmN52U and qb-362909-362777-31-cSy1yo. One is empty: # ls -la /dev/shm/qb-362895-362777-31-cmN52U total 0 drwxrwx--- 2 root root 40 Jan 21 08:28 . drwxrwxrwt 13 root root 700 Jan 21 08:28 .. And second one is full of files: ls -la /dev/shm/qb-362909-362777-31-cSy1yo total 3120 drwxrwx--- 2 root root 160 Jan 21 08:28 . drwxrwxrwt 13 root root 700 Jan 21 08:28 .. -rw------- 1 root root 1052672 Jan 21 08:28 qb-event-cpg-data -rw------- 1 root root 8248 Jan 21 08:28 qb-event-cpg-header -rw------- 1 root root 1052672 Jan 21 08:28 qb-request-cpg-data -rw------- 1 root root 8252 Jan 21 08:28 qb-request-cpg-header -rw------- 1 root root 1052672 Jan 21 08:28 qb-response-cpg-data -rw------- 1 root root 8248 Jan 21 08:28 qb-response-cpg-header So I was thinking who is actually responsible for deleting client files in /dev/shm/ and if my memory doesn't fool me it is client. So I've tried running testcpg in strace: ... munmap(0x7f66f3384000, 2105344) = 0 munmap(0x7f66f6651000, 8248) = 0 openat(AT_FDCWD, "/dev/shm/qb-362962-362959-31-Q1aMpj", O_RDONLY|O_PATH|O_DIRECTORY) = 3 unlinkat(3, "qb-event-cpg-data", 0) = 0 unlinkat(3, "qb-event-cpg-header", 0) = 0 close(3) = 0 munmap(0x7f66f3182000, 2105344) = 0 munmap(0x7f66f664e000, 8248) = 0 munmap(0x7f66f653e000, 1052672) = 0 write(1, "Finalize+restart result is 2 (sh"..., 43Finalize+restart result is 2 (should be 1) ... so it seems to really be client. Running same cycle (but now testcpg running under strace) and final result is: drwxrwx--- 2 root root 40 Jan 21 08:33 qb-362962-362959-31-Q1aMpj drwxrwx--- 2 root root 40 Jan 21 08:34 qb-362976-362959-31-mg0dOq drwxrwx--- 2 root root 160 Jan 21 08:34 qb-362990-362959-31-UabRiQ drwxrwx--- 2 root root 40 Jan 21 08:34 qb-363004-362959-31-6BT7Te drwxrwx--- 2 root root 40 Jan 21 08:34 qb-363018-362959-31-WozfnE drwxrwx--- 2 root root 40 Jan 21 08:34 qb-363032-362959-31-KuUA52 drwxrwx--- 2 root root 40 Jan 21 08:34 qb-363046-362959-31-uqgbow drwxrwx--- 2 root root 160 Jan 21 08:34 qb-363060-362959-31-WS13kU drwxrwx--- 2 root root 160 Jan 21 08:35 qb-363074-362959-31-4zCUBj drwxrwx--- 2 root root 40 Jan 21 08:35 qb-363088-362959-31-OwIjJI ... So you can see most of the directories were emptied but not all of them so maybe there is some race? Honza is right, it's a libqb race. What happens is that libqb client will clean up after a server that unexpected exits IFF it is sure that the server is definitely no longer running. (normally it's the server's responsibility to tidy up after clients). To do this it tries a kill(pid, 0) and only cleans up the server parts if it returns ESRCH. If the server takes slightly longer to exit than libqb does to get to that bit of code then the files will be left lying around because libqb thinks the server was still running and doesn't clean up - because it expects that the server will do it's own tidying. This will always be a race. 
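For illustration (this is not libqb source code), the client-side check described above boils down to something like the sketch below: the client only removes the server's ring-buffer files when kill(pid, 0) proves the server is already gone, so a server that is still mid-exit wins the race and its files stay in /dev/shm. The helper remove_shm_files_for() is a hypothetical stand-in for libqb's real cleanup.

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical stand-in for libqb's cleanup: real code would unlink the
 * qb-*-data / qb-*-header files belonging to this connection. */
static void
remove_shm_files_for(pid_t server_pid)
{
    printf("would remove /dev/shm buffers left by dead server %ld\n",
           (long) server_pid);
}

static void
client_disconnect_cleanup(pid_t server_pid)
{
    if (kill(server_pid, 0) == -1 && errno == ESRCH) {
        /* The server is provably gone, so nobody else will tidy up:
         * the client removes the server-side buffers itself. */
        remove_shm_files_for(server_pid);
    } else {
        /* The server still appears to exist - possibly only because it
         * has not finished dying after SIGKILL. The client assumes the
         * server will clean up after itself, but a killed corosync never
         * will, so the /dev/shm files are leaked. This is the race. */
    }
}

int
main(void)
{
    client_disconnect_cleanup(1);   /* pid 1 exists: no cleanup happens */
    return 0;
}

This is consistent with the observation above that some qb-* directories end up empty while others keep their data files: which happens depends on which side of the race the client lands on.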
I'm going to look into giving libqb a few tries at this (it knows if the server has closed the connection on the client, so it won't delay normal shutdown - I hope) and see if that's a reasonable solution. It won't close the race fully (that's not really possible, server shutdowns might take ages under some circumstances),, but it might mitigate them a bit more. The patch to pacemaker (the actual subject of this BZ) will help mitigate the problem, but doesn't fully solve it. libqb patch here: https://github.com/ClusterLabs/libqb/pull/434 before fix ----------- > [root@virt-153 ~]# rpm -q pacemaker > pacemaker-2.0.4-6.el8.x86_64 Setup two node cluster, disable fencing: > [root@virt-153 ~]# pcs host auth virt-1{53,54} -u hacluster -p password > virt-154: Authorized > virt-153: Authorized > [root@virt-153 ~]# pcs cluster setup test_cluster virt-153 virt-154 --start --wait > [...] > Cluster has been successfully set up. > Starting cluster on hosts: 'virt-153', 'virt-154'... > Waiting for node(s) to start: 'virt-153', 'virt-154'... > virt-154: Cluster started > virt-153: Cluster started > [root@virt-153 ~]# pcs cluster enable --all; pcs property set stonith-enabled=false > virt-153: Cluster Enabled > virt-154: Cluster Enabled > [root@virt-153 ~]# pcs status > Cluster name: test_cluster > Cluster Summary: > * Stack: corosync > * Current DC: virt-153 (version 2.0.4-6.el8-2deceaa3ae) - partition with quorum > * Last updated: Thu Feb 11 13:51:50 2021 > * Last change: Thu Feb 11 13:51:43 2021 by root via cibadmin on virt-153 > * 2 nodes configured > * 0 resource instances configured > Node List: > * Online: [ virt-153 virt-154 ] > Full List of Resources: > * No resources > Daemon Status: > corosync: active/enabled > pacemaker: active/enabled > pcsd: active/enabled Decrease size of `/dev/shm` : > [root@virt-153 ~]# mount -o remount,size=250M /dev/shm; df -h > Filesystem Size Used Avail Use% Mounted on > devtmpfs 990M 0 990M 0% /dev > tmpfs 250M 47M 204M 19% /dev/shm > tmpfs 1009M 97M 913M 10% /run > tmpfs 1009M 0 1009M 0% /sys/fs/cgroup > /dev/mapper/rhel_virt--153-root 6.2G 3.5G 2.8G 56% / > /dev/vda1 1014M 197M 818M 20% /boot > tmpfs 202M 0 202M 0% /run/user/0 On one of the cluster nodes kill corosync, wait and then restart pacemaker: > [root@virt-153 ~]# for ((i=0;i<15;i++)); do while ! systemctl is-active corosync; do sleep 30; systemctl start pacemaker; done; sleep 15; killall -9 corosync; done > active > failed > active > [...] After that /dev/shm gets full: > [root@virt-153 ~]# df -h > Filesystem Size Used Avail Use% Mounted on > devtmpfs 990M 0 990M 0% /dev > tmpfs 250M 250M 688K 100% /dev/shm > tmpfs 1009M 97M 913M 10% /run > tmpfs 1009M 0 1009M 0% /sys/fs/cgroup > /dev/mapper/rhel_virt--153-root 6.2G 3.5G 2.8G 56% / > /dev/vda1 1014M 197M 818M 20% /boot > tmpfs 202M 0 202M 0% /run/user/0 > [root@virt-153 ~]# ls -l /dev/shm | wc -l > 76 The cluster node stays offline: > [root@virt-153 ~]# pcs status > Error: error running crm_mon, is pacemaker running? 
> Could not connect to the CIB: Transport endpoint is not connected > crm_mon: Error: cluster is not available on this node > [root@virt-154 ~]# pcs status > Cluster name: test_cluster > Cluster Summary: > * Stack: corosync > * Current DC: virt-154 (version 2.0.4-6.el8-2deceaa3ae) - partition with quorum > * Last updated: Thu Feb 11 14:02:59 2021 > * Last change: Thu Feb 11 13:51:43 2021 by root via cibadmin on virt-153 > * 2 nodes configured > * 0 resource instances configured > Node List: > * Node virt-153: UNCLEAN (offline) > * Online: [ virt-154 ] > Full List of Resources: > * No resources > Daemon Status: > corosync: active/enabled > pacemaker: active/enabled > pcsd: active/enabled Corosync log excerpts: > [root@virt-153 ~]# tail -f /var/log/cluster/corosync.log | grep -i QB [...] > Feb 11 14:05:01 [762564] virt-153.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] couldn't create file for mmap > Feb 11 14:05:01 [762564] virt-153.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] qb_rb_open:/dev/shm/qb-762564-762644-34-EXcDw3/qb-request-cmap: No space left on device (28) > Feb 11 14:05:01 [762564] virt-153.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] shm connection FAILED: No space left on device (28) > Feb 11 14:05:01 [762564] virt-153.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] Error in connection setup (/dev/shm/qb-762564-762644-34-EXcDw3/qb): No space left on device (28) > Feb 11 14:05:03 [762564] virt-153.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] couldn't create file for mmap > Feb 11 14:05:03 [762564] virt-153.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] qb_rb_open:/dev/shm/qb-762564-762644-34-WOF4QM/qb-request-cmap: No space left on device (28) > Feb 11 14:05:03 [762564] virt-153.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] shm connection FAILED: No space left on device (28) > Feb 11 14:05:03 [762564] virt-153.cluster-qe.lab.eng.brq.redhat.com corosync error [QB ] Error in connection setup (/dev/shm/qb-762564-762644-34-WOF4QM/qb): No space left on device (28) after fix ---------- > [root@virt-175 ~]# rpm -q pacemaker > pacemaker-2.0.5-6.el8.x86_64 Setup two node cluster, disable fencing: > [root@virt-175 ~]# pcs host auth virt-1{75,76} -u hacluster -p password > virt-175: Authorized > virt-176: Authorized > [root@virt-175 ~]# pcs cluster setup test_cluster virt-175 virt-176 --start --wait > [...] > Cluster has been successfully set up. > Starting cluster on hosts: 'virt-175', 'virt-176'... > Waiting for node(s) to start: 'virt-175', 'virt-176'... 
> virt-175: Cluster started > virt-176: Cluster started > [root@virt-175 ~]# pcs cluster enable --all; pcs property set stonith-enabled=false > virt-175: Cluster Enabled > virt-176: Cluster Enabled > [root@virt-175 ~]# pcs status > Cluster name: test_cluster > Cluster Summary: > * Stack: corosync > * Current DC: virt-176 (version 2.0.5-6.el8-ba59be7122) - partition with quorum > * Last updated: Thu Feb 11 14:20:05 2021 > * Last change: Thu Feb 11 14:19:59 2021 by root via cibadmin on virt-175 > * 2 nodes configured > * 0 resource instances configured > Node List: > * Online: [ virt-175 virt-176 ] > Full List of Resources: > * No resources > Daemon Status: > corosync: active/enabled > pacemaker: active/enabled > pcsd: active/enabled Decrease size of `/dev/shm` : > [root@virt-175 ~]# mount -o remount,size=250M /dev/shm; df -h > Filesystem Size Used Avail Use% Mounted on > devtmpfs 990M 0 990M 0% /dev > tmpfs 250M 32M 219M 13% /dev/shm > tmpfs 1010M 22M 989M 3% /run > tmpfs 1010M 0 1010M 0% /sys/fs/cgroup > /dev/mapper/rhel_virt--175-root 6.2G 3.6G 2.7G 58% / > /dev/vda1 1014M 202M 813M 20% /boot > tmpfs 202M 0 202M 0% /run/user/0 On one of the cluster nodes kill corosync, wait and then restart pacemaker: > [root@virt-175 ~]# for ((i=0;i<15;i++)); do while ! systemctl is-active corosync; do sleep 30; systemctl start pacemaker; done; sleep 15; killall -9 corosync; done > active > failed > [...] Now after the fix, /dev/shm fills more slowly: > [root@virt-175 ~]# df -h > Filesystem Size Used Avail Use% Mounted on > devtmpfs 990M 0 990M 0% /dev > tmpfs 250M 182M 69M 73% /dev/shm > tmpfs 1010M 22M 989M 3% /run > tmpfs 1010M 0 1010M 0% /sys/fs/cgroup > /dev/mapper/rhel_virt--175-root 6.2G 3.5G 2.8G 56% / > /dev/vda1 1014M 202M 813M 20% /boot > tmpfs 202M 0 202M 0% /run/user/0 > [root@virt-175 ~]# ls -l /dev/shm | wc -l > 67 Verified as SanityOnly in pacemaker-2.0.5-6.el8. The fix helps with the issue, but doesn't fully solve it. The upcoming libqb patch (Comment 34) is expected to have a much bigger impact on fixing the issue. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2021:1782 |