Bug 1660592 - Regression: scheduler crash during bundle connection recovery
Summary: Regression: scheduler crash during bundle connection recovery
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: pacemaker
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Priority: urgent
Target Milestone: rc
Target Release: 8.0
Assignee: Ken Gaillot
QA Contact: pkomarov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-12-18 18:39 UTC by Ken Gaillot
Modified: 2019-06-14 01:45 UTC
CC List: 8 users

Fixed In Version: pacemaker-2.0.1-3.el8
Doc Type: No Doc Update
Doc Text:
Issue was not in a released version
Clone Of:
Environment:
Last Closed: 2019-06-14 01:45:55 UTC
Type: Bug
Target Upstream Version:




Links:
Cluster Labs 5374 (last updated 2018-12-18 18:39:34 UTC)

Description Ken Gaillot 2018-12-18 18:39:34 UTC
Description of problem: Pacemaker's pacemaker-schedulerd daemon segfaults when pacemaker-remoted is killed inside a bundle.


Version-Release number of selected component (if applicable): 2.0.1-2


How reproducible: Reliably


Steps to Reproduce:
1. Configure a cluster with at least one node and a bundle resource (a minimal sketch follows these steps).
2. Run "killall -9 pacemaker-remoted" on any node hosting a bundle replica.

Actual results: pacemaker-schedulerd segfaults on the DC, showing a log message like:

error: Managed process 4936 (pacemaker-schedulerd) dumped core
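
The crash can also be confirmed on the DC, for example as follows (assuming systemd-coredump is collecting cores; otherwise the "dumped core" message above can be found in /var/log/pacemaker/pacemaker.log):

coredumpctl list pacemaker-schedulerd
journalctl -u pacemaker | grep 'dumped core'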


Expected results: Recovery proceeds without any daemon crash

Comment 1 Ken Gaillot 2018-12-18 22:20:19 UTC
Fixed in upstream 2.0 branch by commit 5f75a663
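
For anyone working from source, one way to check whether a given checkout already contains the fix (assuming a local clone of the upstream pacemaker repository):

git merge-base --is-ancestor 5f75a663 HEAD && echo "fix present"

Packaged builds from pacemaker-2.0.1-3.el8 onward include it (see Fixed In Version above).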

Comment 7 pkomarov 2019-04-27 21:58:23 UTC
Verified.

[root@controller-2 cluster]# pcs status|grep dc
 stonith-fence_ipmilan-525400fdc32c	(stonith:fence_ipmilan):	Started controller-0
[root@controller-2 cluster]# pcs status|grep DC
Current DC: controller-2 (version 2.0.1-4.el8-0eb7991564) - partition with quorum

[root@controller-2 cluster]# rpm -qa|grep pacemaker
pacemaker-cli-2.0.1-4.el8.x86_64
ansible-pacemaker-1.0.4-0.20190418190349.0e4d7c0.el8ost.noarch
pacemaker-libs-2.0.1-4.el8.x86_64
pacemaker-remote-2.0.1-4.el8.x86_64
pacemaker-schemas-2.0.1-4.el8.noarch
pacemaker-cluster-libs-2.0.1-4.el8.x86_64
puppet-pacemaker-0.7.3-0.20190420103227.0bfb86a.el8ost.noarch
pacemaker-2.0.1-4.el8.x86_64


[root@controller-2 ~]# podman exec -it galera-bundle-podman-2 bash
()[root@controller-2 /]# ps -ef      
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 20:47 ?        00:00:00 dumb-init -- /bin/bash /usr/local/bin/kolla
root           8       1  0 20:47 ?        00:00:01 /usr/sbin/pacemaker_remoted
root         649       1  0 20:49 ?        00:00:00 /bin/sh /usr/bin/mysqld_safe --defaults-fil
mysql        957     649  0 20:49 ?        00:00:22 /usr/libexec/mysqld --defaults-file=/etc/my
root       21503       0  0 21:44 pts/0    00:00:00 bash
root       21703   21503  0 21:45 pts/0    00:00:00 ps -ef
()[root@controller-2 /]# killall -9 pacemaker_remoted
()[root@controller-2 /]# exit status 137
[root@controller-2 ~]# pcs status |grep -i galera
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-1 galera-bundle-2@controller-2 ovn-dbs-bundle-0@controller-0 ovn-dbs-bundle-1@controller-1 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-2 redis-bundle-0@controller-0 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]
 podman container set: galera-bundle [192.168.24.1:8787/rhosp15/openstack-mariadb:pcmklatest]
   galera-bundle-0	(ocf::heartbeat:galera):	Master controller-0
   galera-bundle-1	(ocf::heartbeat:galera):	Master controller-1
   galera-bundle-2	(ocf::heartbeat:galera):	Promoting controller-2
* galera-bundle-2_monitor_30000 on controller-2 'unknown error' (1): call=17, status=Error, exitreason='',
[root@controller-2 ~]# pcs status |grep -i galera
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-1 galera-bundle-2@controller-2 ovn-dbs-bundle-0@controller-0 ovn-dbs-bundle-1@controller-1 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-2 redis-bundle-0@controller-0 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]
 podman container set: galera-bundle [192.168.24.1:8787/rhosp15/openstack-mariadb:pcmklatest]
   galera-bundle-0	(ocf::heartbeat:galera):	Master controller-0
   galera-bundle-1	(ocf::heartbeat:galera):	Master controller-1
   galera-bundle-2	(ocf::heartbeat:galera):	Master controller-2
* galera-bundle-2_monitor_30000 on controller-2 'unknown error' (1): call=17, status=Error, exitreason='',
[root@controller-2 ~]# pcs status|grep dc
 stonith-fence_ipmilan-525400fdc32c	(stonith:fence_ipmilan):	Started controller-2
[root@controller-2 ~]# pcs status|grep DC
Current DC: controller-1 (version 2.0.1-4.el8-0eb7991564) - partition with quorum


# cluster logs on the DC are clean:
[root@controller-1 cluster]# tail -F corosync.log
Apr 27 20:47:14 [92700] controller-1 corosync info    [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 470 to 1366
Apr 27 20:47:14 [92700] controller-1 corosync info    [KNET  ] pmtud: Global data MTU changed to: 1366
Apr 27 20:47:14 [92700] controller-1 corosync info    [KNET  ] rx: host: 1 link: 0 is up
Apr 27 20:47:14 [92700] controller-1 corosync notice  [TOTEM ] A new membership (1:24) was formed. Members joined: 1
Apr 27 20:47:14 [92700] controller-1 corosync warning [CPG   ] downlist left_list: 0 received
Apr 27 20:47:14 [92700] controller-1 corosync warning [CPG   ] downlist left_list: 0 received
Apr 27 20:47:14 [92700] controller-1 corosync warning [CPG   ] downlist left_list: 0 received
Apr 27 20:47:14 [92700] controller-1 corosync notice  [QUORUM] Members[3]: 1 2 3
Apr 27 20:47:14 [92700] controller-1 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
Apr 27 20:47:14 [92700] controller-1 corosync info    [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 470 to 1366



# check that no other errors were logged since the galera failure and recovery:
[root@controller-1 cluster]# grep "galera-bundle-2_monitor_.*unknown error: failed" /var/log/pacemaker/pacemaker.log
Apr 27 21:45:28 controller-1 pacemaker-controld  [93015] (process_graph_event) 	info: Detected action (2.50) galera-bundle-2_monitor_30000.17=unknown error: failed
[root@controller-1 cluster]# 
[root@controller-1 cluster]# 
[root@controller-1 cluster]# grep -A 99999 "galera-bundle-2_monitor_.*unknown error: failed" /var/log/pacemaker/pacemaker.log|grep error
Apr 27 21:45:28 controller-1 pacemaker-controld  [93015] (process_graph_event) 	info: Detected action (2.50) galera-bundle-2_monitor_30000.17=unknown error: failed
Apr 27 21:45:28 controller-1 pacemaker-schedulerd[93014] (unpack_rsc_op_failure) 	warning: Processing failed monitor of galera-bundle-2 on controller-2: unknown error | rc=1
Apr 27 21:45:28 controller-1 pacemaker-schedulerd[93014] (unpack_rsc_op_failure) 	warning: Processing failed monitor of galera-bundle-2 on controller-2: unknown error | rc=1
Apr 27 21:45:34 controller-1 pacemaker-schedulerd[93014] (unpack_rsc_op_failure) 	warning: Processing failed monitor of galera-bundle-2 on controller-2: unknown error | rc=1
Apr 27 21:45:42 controller-1 pacemaker-schedulerd[93014] (unpack_rsc_op_failure) 	warning: Processing failed monitor of galera-bundle-2 on controller-2: unknown error | rc=1
Apr 27 21:45:54 controller-1 pacemaker-schedulerd[93014] (unpack_rsc_op_failure) 	warning: Processing failed monitor of galera-bundle-2 on controller-2: unknown error | rc=1

# pacemaker-schedulerd is still alive:
[root@controller-1 cluster]# ps -ef |grep pacemaker-schedulerd
haclust+   93014   93005  0 20:47 ?        00:00:01 /usr/libexec/pacemaker/pacemaker-schedulerd

