1466875 – [pacemaker/libqb integration] "crm_mon -d" cannot outlive particular pacemaker session, reconnecting to new one

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 8
Classification:	Red Hat
Component:	pacemaker
Sub Component:
Version:	8.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	pre-dev-freeze
Target Release:	8.4
Assignee:	Chris Lumens
QA Contact:	cluster-qe@redhat.com
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-06-30 15:39 UTC by Jan Pokorný [poki]
Modified:	2021-05-18 15:28 UTC
CC List:	4 users
Fixed In Version:	pacemaker-2.0.5-6.el8
Doc Type:	No Doc Update
Doc Text:	This affects few enough users that documentation is not needed, especially since there is no pcs interface.
Clone Of:
Environment:
Last Closed:	2021-05-18 15:26:41 UTC
Type:	Enhancement
Target Upstream Version:
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Cluster Labs	5461	0	None	None	None	2020-12-01 17:11:43 UTC

Description Jan Pokorný [poki] 2017-06-30 15:39:11 UTC

# pcs cluster start --wait
# crm_mon -d --as-html /tmp/mon.html
# pcs cluster stop --wait

# pcs cluster start --wait

# strace -p $(pidof crm_mon)
> [...]
> poll([{fd=6, events=POLLIN}, {fd=7, events=POLLIN}], 2, 500) = 0 (Timeout)
> poll([{fd=6, events=POLLIN}, {fd=7, events=POLLIN}], 2, 500) = 0 (Timeout)
> poll([{fd=6, events=POLLIN}, {fd=7, events=POLLIN}], 2, 500) = 0 (Timeout)
> poll([{fd=6, events=POLLIN}, {fd=7, events=POLLIN}], 2, 500) = 0 (Timeout)
> [...]

# gcore $(pidof crm_mon)
# gdb $(which crm_mon) core.$(pidof crm_mon)

(gdb) bt
> #0  0x00007fa5c5c1fa20 in __poll_nocancel () at ../sysdeps/unix/syscall-template.S:81
> #1  0x00007fa5c61497ac in g_main_context_poll (priority=2147483647, n_fds=2, fds=0x1a9f5a0, timeout=500, context=0x1ab3080) at gmain.c:4226
> #2  g_main_context_iterate (context=0x1ab3080, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3922
> #3  0x00007fa5c6149aea in g_main_loop_run (loop=0x1a9f750) at gmain.c:4123
> #4  0x000000000040381f in main (argc=<optimized out>, argv=<optimized out>) at crm_mon.c:812

(gdb) info proc mappings
>          Start Addr           End Addr       Size     Offset objfile
[...]
>      0x7fa5c9c57000     0x7fa5c9c58000     0x1000        0x0 /dev/shm/qb-cib_ro-control-30389-30758-20 (deleted)
[...]

# ls -l /dev/shm/qb-cib_ro-control-30389-30758-20
> ls: cannot access /dev/shm/qb-cib_ro-control-30389-30758-20: No such file or directory

crm_mon itself seems to suggest that the connection is expected to be
re-established just fine in such a scenario.
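
The "(deleted)" mapping above can also be spotted without taking a core dump. The following is a minimal sketch, assuming a Linux /proc filesystem; the function name `deleted_qb_mappings` is made up for illustration, and the `/dev/shm/qb-` prefix is simply taken from the paths shown in this report.

```shell
#!/bin/bash
# Print libqb shared-memory files that are still mapped although they have
# been unlinked. Expects /proc/<pid>/maps-formatted text on standard input.
deleted_qb_mappings() {
    # In /proc/<pid>/maps, field 6 is the pathname; the kernel appends
    # " (deleted)" as a trailing field once the file has been unlinked.
    awk '$6 ~ /^\/dev\/shm\/qb-/ && $NF == "(deleted)" { print $6 }' | sort -u
}

# Usage (assumption: exactly one crm_mon process is running):
#   deleted_qb_mappings < "/proc/$(pidof crm_mon)/maps"
```

Any output after the pacemaker restart would suggest that the process is still holding on to the IPC buffers of the previous session.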

Comment 1 Jan Pokorný [poki] 2017-06-30 15:44:15 UTC

Furthermore:

# lsof -p $(pidof crm_mon)
> COMMAND   PID USER   FD      TYPE             DEVICE SIZE/OFF      NODE NAME
> [...]
> crm_mon 30758 root  DEL       REG               0,17             599553 /dev/shm/qb-cib_ro-control-30389-30758-20
> [...]
> crm_mon 30758 root    5u     unix 0xffff88008ae0bc00      0t0    600577 @qb-cib_ro-30389-30758-20-response
> crm_mon 30758 root    6u     unix 0xffff88008ae0e000      0t0    600578 @qb-cib_ro-30389-30758-20-event

# kill -0 30389
> -bash: kill: (30389) - No such process

Looks like the UNIX socket should be long gone at this point, and it is not.
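
The same check can be scripted. This is a rough sketch under one stated assumption: that the first of the three trailing numbers in a libqb IPC name is the PID of the serving daemon. That reading is inferred only from the output in this report (30758 is crm_mon, 30389 no longer exists), not from libqb documentation. The function names `qb_server_pid` and `pid_alive` are hypothetical.

```shell
#!/bin/bash
# Extract the assumed server PID from a libqb IPC name such as
#   qb-cib_ro-control-30389-30758-20   or   @qb-cib_ro-30389-30758-20-event
# ASSUMPTION: the first of the three trailing numbers is the server's PID.
qb_server_pid() {
    printf '%s\n' "$1" |
        sed -nE 's/^.*-([0-9]+)-[0-9]+-[0-9]+(-[a-z]+)?$/\1/p'
}

# Succeed if a process with the given PID exists. Signal 0 delivers nothing;
# note that a non-root caller also gets a failure for processes it may not
# signal, so run this as root for a reliable answer.
pid_alive() {
    kill -0 "$1" 2>/dev/null
}

# Usage:
#   pid=$(qb_server_pid "@qb-cib_ro-30389-30758-20-event")
#   pid_alive "$pid" || echo "server $pid is gone, connection is stale"
```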

Comment 2 Jan Pokorný [poki] 2017-07-04 17:31:17 UTC

Observation:

- it actually worked correctly for me several times, but most of the
  time it does not

- in failing cases, mon_cib_connection_destroy never gets called

- the previous means that client->destroy_fn from the mainloop's
  perspective won't get called

- the previous means that mainloop_gio_destroy won't get called


Need to find a link between a graceful cib daemon shutdown and an
attempt to run the connection at least partially through the
above-mentioned call stack.  Hopefully there is one.

Comment 4 Ken Gaillot 2017-10-09 17:32:24 UTC

Due to time constraints, this will not make 7.5

Comment 5 Ken Gaillot 2019-03-27 19:12:08 UTC

Reproducer will be added once solution is settled

Comment 8 RHEL Program Management 2020-12-01 07:29:12 UTC

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Comment 9 Ken Gaillot 2020-12-01 17:12:10 UTC

This is still a priority and is being tracked by an upstream bz. This bz will be reopened once developer time becomes available to address it.

Comment 10 Ken Gaillot 2021-01-19 23:02:59 UTC

Fix has been merged upstream as of commit 8c51b49

Comment 11 Ken Gaillot 2021-01-21 22:20:16 UTC

QA: Reproducer:

1. Start a cluster.
2. Run on any node:
   crm_mon --output-to=$ANY_FILE --daemonize
3. That should create $ANY_FILE and update it with the cluster status every 5 seconds. (You can cause various changes to see a different status.)
4. Restart pacemaker on the same node. Before the fix, it would no longer update the file with new events; after the fix, it will.
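
Step 4 can be checked mechanically rather than by eye. Below is a minimal sketch that compares a checksum of the output file before and after a wait; the function name `file_content_changes` and the 10-second default are illustrative choices (the default only needs to exceed the 5-second refresh mentioned in step 3). It assumes the rendered file always differs between refreshes, which should hold if the status includes a timestamp.

```shell
#!/bin/bash
# Return 0 if the file's content changes within the given number of seconds,
# 1 if it stays the same, 2 if the file cannot be read.
file_content_changes() {
    local file=$1 wait_secs=${2:-10} before after
    before=$(cksum < "$file") || return 2
    sleep "$wait_secs"
    after=$(cksum < "$file") || return 2
    [ "$before" != "$after" ]
}

# Usage, after restarting pacemaker on the node:
#   file_content_changes "$ANY_FILE" && echo "still updating" || echo "stale"
```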

Comment 16 Markéta Smazová 2021-02-15 21:50:27 UTC

before fix
-----------

>   [root@virt-153 ~]# rpm -q pacemaker
>   pacemaker-2.0.4-6.el8.x86_64

>   [root@virt-153 ~]# pcs status
>   Cluster name: STSRHTS19672
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-154 (version 2.0.4-6.el8-2deceaa3ae) - partition with quorum
>     * Last updated: Mon Feb 15 21:32:00 2021
>     * Last change:  Mon Feb 15 18:42:42 2021 by root via cibadmin on virt-153
>     * 2 nodes configured
>     * 2 resource instances configured

>   Node List:
>     * Online: [ virt-153 virt-154 ]

>   Full List of Resources:
>     * fence-virt-153      (stonith:fence_xvm):     Started virt-153
>     * fence-virt-154      (stonith:fence_xvm):     Started virt-154

>   Daemon Status:
>     corosync: active/disabled
>     pacemaker: active/disabled
>     pcsd: active/enabled

run `crm_mon --daemonize`:
>   [root@virt-153 ~]# crm_mon --output-to=out_before.html --daemonize --output-as=html

create resource:
>   [root@virt-153 ~]# pcs resource create dummy ocf:pacemaker:Dummy

backup output file:
>   [root@virt-153 ~]# cp out_before.html out_before-old.html

stop cluster:
>   [root@virt-153 ~]# pcs cluster stop --all
>   virt-153: Stopping Cluster (pacemaker)...
>   virt-154: Stopping Cluster (pacemaker)...
>   virt-154: Stopping Cluster (corosync)...
>   virt-153: Stopping Cluster (corosync)...

start cluster again:
>   [root@virt-153 ~]# pcs cluster start --all --wait
>   virt-153: Starting Cluster...
>   virt-154: Starting Cluster...
>   Waiting for node(s) to start...
>   virt-154: Started
>   virt-153: Started

create new resource:
>   [root@virt-153 ~]# pcs resource create dummy1 ocf:pacemaker:Dummy

backup output file again:
>   [root@virt-153 ~]# cp out_before.html out_before-new.html

see diff:
>   [root@virt-153 ~]# diff out_before.html out_before-new.html
>   26c26
>   < <span class="bold">Current DC: </span><span>virt-154 (version 2.0.4-6.el8-2deceaa3ae) - partition with quorum</span>
>   ---
>   > <span class="bold">Current DC: </span><span>virt-153 (version 2.0.4-6.el8-2deceaa3ae) - partition with quorum</span>
>   29c29
>   < <span class="bold">Last updated: </span><span>Mon Feb 15 21:32:01 2021</span>
>   ---
>   > <span class="bold">Last updated: </span><span>Mon Feb 15 21:32:41 2021</span>
>   32c32
>   < <span class="bold">Last change: </span><span>Mon Feb 15 18:42:42 2021 by root via cibadmin on virt-153</span>
>   ---
>   > <span class="bold">Last change: </span><span>Mon Feb 15 21:32:41 2021 by root via cibadmin on virt-153</span>
>   35c35
>   < <li><span>2 resource instances configured</span></li>
>   ---
>   > <li><span>4 resource instances configured</span></li>
>   49a50
>   > <li><span class="rsc-ok">dummy	(ocf::pacemaker:Dummy):	 Started virt-153</span></li>



after fix
----------

>   [root@virt-175 ~]# rpm -q pacemaker
>   pacemaker-2.0.5-6.el8.x86_64

>   [root@virt-175 ~]# pcs status
>   Cluster name: STSRHTS6310
>   Cluster Summary:
>     * Stack: corosync
>     * Current DC: virt-175 (version 2.0.5-6.el8-ba59be7122) - partition with quorum
>     * Last updated: Mon Feb 15 17:42:19 2021
>     * Last change:  Mon Feb 15 17:38:19 2021 by root via cibadmin on virt-175
>     * 2 nodes configured
>     * 2 resource instances configured

>   Node List:
>     * Online: [ virt-175 virt-176 ]

>   Full List of Resources:
>     * fence-virt-175	(stonith:fence_xvm):	 Started virt-175
>     * fence-virt-176	(stonith:fence_xvm):	 Started virt-176

>   Daemon Status:
>     corosync: active/disabled
>     pacemaker: active/disabled
>     pcsd: active/enabled

run `crm_mon --daemonize`:
>   [root@virt-175 ~]# crm_mon --output-to=out_after.html --daemonize --output-as=html

create resource:
>   [root@virt-175 ~]# pcs resource create dummy ocf:pacemaker:Dummy

backup output file:
>   [root@virt-175 ~]# cp out_after.html out_after-old.html

stop cluster:
>   [root@virt-175 ~]# pcs cluster stop --all
>   virt-176: Stopping Cluster (pacemaker)...
>   virt-175: Stopping Cluster (pacemaker)...
>   virt-176: Stopping Cluster (corosync)...
>   virt-175: Stopping Cluster (corosync)...

start cluster again:
>   [root@virt-175 ~]# pcs cluster start --all --wait
>   virt-176: Starting Cluster...
>   virt-175: Starting Cluster...
>   Waiting for node(s) to start...
>   virt-176: Started
>   virt-175: Started

create new resource:
>   [root@virt-175 ~]# pcs resource create dummy1 ocf:pacemaker:Dummy

backup output file:
>   [root@virt-175 ~]# cp out_after.html out_after-new.html

see diff:
>   [root@virt-175 ~]# diff out_after-old.html out_after-new.html
>   29c29
>   < <span class="bold">Last updated: </span><span>Mon Feb 15 17:42:23 2021</span>
>   ---
>   > <span class="bold">Last updated: </span><span>Mon Feb 15 17:43:44 2021</span>
>   32c32
>   < <span class="bold">Last change: </span><span>Mon Feb 15 17:42:20 2021 by root via cibadmin on virt-175</span>
>   ---
>   > <span class="bold">Last change: </span><span>Mon Feb 15 17:43:41 2021 by root via cibadmin on virt-175</span>
>   35c35
>   < <li><span>3 resource instances configured</span></li>
>   ---
>   > <li><span>4 resource instances configured</span></li>
>   50a51
>   > <li><span class="rsc-ok">dummy1	(ocf::pacemaker:Dummy):	 Started virt-176</span></li>


I tried several times to reproduce the original issue, but was not successful.

Verified as SanityOnly in pacemaker-2.0.5-6.el8
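
For repeated runs, the "Last updated" value can be pulled out of the HTML directly instead of reading the diff. This is a small sketch that relies on the exact span markup visible in the diffs above; other pacemaker versions may render it differently, and the function name `last_updated` is made up for illustration.

```shell
#!/bin/bash
# Print the "Last updated" timestamp from a crm_mon HTML output file,
# matching the markup seen in the diffs of this comment.
last_updated() {
    sed -nE 's/.*Last updated: <\/span><span>([^<]*)<\/span>.*/\1/p' "$1" |
        head -n 1
}

# Usage: the two values should differ if crm_mon survived the restart.
#   last_updated out_after-old.html
#   last_updated out_after-new.html
```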

Comment 18 errata-xmlrpc 2021-05-18 15:26:41 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (pacemaker bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:1782
