Description of problem:
Pacemaker's pacemaker-schedulerd daemon segfaults when pacemaker-remoted is killed inside a bundle.

Version-Release number of selected component (if applicable):
2.0.1-2

How reproducible:
Reliably

Steps to Reproduce:
1. Configure a cluster with at least one node and a bundle resource (a minimal example sketch follows below).
2. Run "killall -9 pacemaker-remoted" on any node hosting a bundle replica.

Actual results:
pacemaker-schedulerd segfaults on the DC, showing a log message like:

  error: Managed process 4936 (pacemaker-schedulerd) dumped core

Expected results:
Recovery proceeds without any daemon crash.
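For step 1, a minimal bundle configuration along the following lines should be enough to get a replica running pacemaker-remoted (a sketch only: the bundle name, image, replica count and control port are illustrative placeholders rather than values from this report, and the container image must itself ship pacemaker-remoted for the inner resource to be managed):

    # hypothetical podman-based bundle; adjust network/storage options to suit the image
    pcs resource bundle create test-bundle container podman \
        image=registry.example.com/test-image:latest replicas=3 \
        network control-port=3121
    # placing a trivial OCF resource inside the bundle makes Pacemaker start
    # pacemaker-remoted in each replica, which is the process killed in step 2
    pcs resource create test-rsc ocf:heartbeat:Dummy bundle test-bundle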
Fixed in upstream 2.0 branch by commit 5f75a663
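To check whether a particular build already contains that change, one option is to test commit ancestry in a clone of the upstream repository, https://github.com/ClusterLabs/pacemaker (a sketch; the tag below is only an example of a later release tag, not a claim about where the fix first shipped):

    # prints "fix present" if commit 5f75a663 is reachable from the given tag
    git merge-base --is-ancestor 5f75a663 Pacemaker-2.0.2 && echo "fix present" || echo "fix absent"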
Verified.

[root@controller-2 cluster]# pcs status|grep dc
  stonith-fence_ipmilan-525400fdc32c (stonith:fence_ipmilan): Started controller-0
[root@controller-2 cluster]# pcs status|grep DC
Current DC: controller-2 (version 2.0.1-4.el8-0eb7991564) - partition with quorum
[root@controller-2 cluster]# rpm -qa|grep pacemaker
pacemaker-cli-2.0.1-4.el8.x86_64
ansible-pacemaker-1.0.4-0.20190418190349.0e4d7c0.el8ost.noarch
pacemaker-libs-2.0.1-4.el8.x86_64
pacemaker-remote-2.0.1-4.el8.x86_64
pacemaker-schemas-2.0.1-4.el8.noarch
pacemaker-cluster-libs-2.0.1-4.el8.x86_64
puppet-pacemaker-0.7.3-0.20190420103227.0bfb86a.el8ost.noarch
pacemaker-2.0.1-4.el8.x86_64

# Kill pacemaker_remoted inside the galera bundle replica on controller-2:
[root@controller-2 ~]# podman exec -it galera-bundle-podman-2 bash
()[root@controller-2 /]# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 20:47 ?        00:00:00 dumb-init -- /bin/bash /usr/local/bin/kolla
root         8     1  0 20:47 ?        00:00:01 /usr/sbin/pacemaker_remoted
root       649     1  0 20:49 ?        00:00:00 /bin/sh /usr/bin/mysqld_safe --defaults-fil
mysql      957   649  0 20:49 ?        00:00:22 /usr/libexec/mysqld --defaults-file=/etc/my
root     21503     0  0 21:44 pts/0    00:00:00 bash
root     21703 21503  0 21:45 pts/0    00:00:00 ps -ef
()[root@controller-2 /]# killall -9 pacemaker_remoted
()[root@controller-2 /]# exit
status 137

# The bundle replica is recovered (Promoting, then Master again):
[root@controller-2 ~]# pcs status |grep -i galera
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-1 galera-bundle-2@controller-2 ovn-dbs-bundle-0@controller-0 ovn-dbs-bundle-1@controller-1 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-2 redis-bundle-0@controller-0 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]
 podman container set: galera-bundle [192.168.24.1:8787/rhosp15/openstack-mariadb:pcmklatest]
   galera-bundle-0 (ocf::heartbeat:galera): Master controller-0
   galera-bundle-1 (ocf::heartbeat:galera): Master controller-1
   galera-bundle-2 (ocf::heartbeat:galera): Promoting controller-2
* galera-bundle-2_monitor_30000 on controller-2 'unknown error' (1): call=17, status=Error, exitreason='',

[root@controller-2 ~]# pcs status |grep -i galera
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-1 galera-bundle-2@controller-2 ovn-dbs-bundle-0@controller-0 ovn-dbs-bundle-1@controller-1 ovn-dbs-bundle-2@controller-2 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-2 redis-bundle-0@controller-0 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]
 podman container set: galera-bundle [192.168.24.1:8787/rhosp15/openstack-mariadb:pcmklatest]
   galera-bundle-0 (ocf::heartbeat:galera): Master controller-0
   galera-bundle-1 (ocf::heartbeat:galera): Master controller-1
   galera-bundle-2 (ocf::heartbeat:galera): Master controller-2
* galera-bundle-2_monitor_30000 on controller-2 'unknown error' (1): call=17, status=Error, exitreason='',

[root@controller-2 ~]# pcs status|grep dc
  stonith-fence_ipmilan-525400fdc32c (stonith:fence_ipmilan): Started controller-2
[root@controller-2 ~]# pcs status|grep DC
Current DC: controller-1 (version 2.0.1-4.el8-0eb7991564) - partition with quorum

# Cluster logs on the DC are clean:
[root@controller-1 cluster]# tail -F corosync.log
Apr 27 20:47:14 [92700] controller-1 corosync info    [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 470 to 1366
Apr 27 20:47:14 [92700] controller-1 corosync info    [KNET  ] pmtud: Global data MTU changed to: 1366
Apr 27 20:47:14 [92700] controller-1 corosync info    [KNET  ] rx: host: 1 link: 0 is up
Apr 27 20:47:14 [92700] controller-1 corosync notice  [TOTEM ] A new membership (1:24) was formed. Members joined: 1
Apr 27 20:47:14 [92700] controller-1 corosync warning [CPG   ] downlist left_list: 0 received
Apr 27 20:47:14 [92700] controller-1 corosync warning [CPG   ] downlist left_list: 0 received
Apr 27 20:47:14 [92700] controller-1 corosync warning [CPG   ] downlist left_list: 0 received
Apr 27 20:47:14 [92700] controller-1 corosync notice  [QUORUM] Members[3]: 1 2 3
Apr 27 20:47:14 [92700] controller-1 corosync notice  [MAIN  ] Completed service synchronization, ready to provide service.
Apr 27 20:47:14 [92700] controller-1 corosync info    [KNET  ] pmtud: PMTUD link change for host: 1 link: 0 from 470 to 1366

# Check that no other errors were logged since the galera failure and recovery:
[root@controller-1 cluster]# grep "galera-bundle-2_monitor_.*unknown error: failed" /var/log/pacemaker/pacemaker.log
Apr 27 21:45:28 controller-1 pacemaker-controld  [93015] (process_graph_event) info: Detected action (2.50) galera-bundle-2_monitor_30000.17=unknown error: failed
[root@controller-1 cluster]# grep -A 99999 "galera-bundle-2_monitor_.*unknown error: failed" /var/log/pacemaker/pacemaker.log|grep error
Apr 27 21:45:28 controller-1 pacemaker-controld  [93015] (process_graph_event) info: Detected action (2.50) galera-bundle-2_monitor_30000.17=unknown error: failed
Apr 27 21:45:28 controller-1 pacemaker-schedulerd[93014] (unpack_rsc_op_failure) warning: Processing failed monitor of galera-bundle-2 on controller-2: unknown error | rc=1
Apr 27 21:45:28 controller-1 pacemaker-schedulerd[93014] (unpack_rsc_op_failure) warning: Processing failed monitor of galera-bundle-2 on controller-2: unknown error | rc=1
Apr 27 21:45:34 controller-1 pacemaker-schedulerd[93014] (unpack_rsc_op_failure) warning: Processing failed monitor of galera-bundle-2 on controller-2: unknown error | rc=1
Apr 27 21:45:42 controller-1 pacemaker-schedulerd[93014] (unpack_rsc_op_failure) warning: Processing failed monitor of galera-bundle-2 on controller-2: unknown error | rc=1
Apr 27 21:45:54 controller-1 pacemaker-schedulerd[93014] (unpack_rsc_op_failure) warning: Processing failed monitor of galera-bundle-2 on controller-2: unknown error | rc=1

# pacemaker-schedulerd is still alive:
[root@controller-1 cluster]# ps -ef |grep pacemaker-schedulerd
haclust+  93014  93005  0 20:47 ?        00:00:01 /usr/libexec/pacemaker/pacemaker-schedulerd
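As an extra sanity check beyond the output above, one could also confirm that no Pacemaker daemon dumped core on the DC during the test (a suggested check only, not part of the verification above; the log path assumes the default RHEL 8 layout, and the coredumpctl query only applies where systemd-coredump is in use):

    # should print nothing if no Pacemaker daemon crashed since the log was rotated
    grep -i "dumped core" /var/log/pacemaker/pacemaker.log
    # a crash would also leave a coredumpctl entry for pacemaker-schedulerd
    coredumpctl list pacemaker-schedulerd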