Description of problem:

While testing HA resource behavior on Newton we run the following sequence of operations:

1 - Stop Galera
2 - Stop the openstack-cinder-volume resource
3 - Start the openstack-cinder-volume resource

The problem is that at step 2 the cluster is sometimes unable to stop the resource, and the stop operation fails with a timeout:

 ip-172.17.0.14 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
 ip-172.20.0.13 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1
 ip-172.18.0.11 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2
 ip-172.19.0.18 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
 Clone Set: haproxy-clone [haproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 ip-192.0.2.8 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1
 Master/Slave Set: galera-master [galera]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Master/Slave Set: redis-master [redis]
     Masters: [ overcloud-controller-0 ]
     Slaves: [ overcloud-controller-1 overcloud-controller-2 ]
 ip-172.17.0.15 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2
 openstack-cinder-volume (systemd:openstack-cinder-volume): FAILED overcloud-controller-0 (unmanaged)

Failed Actions:
* openstack-cinder-volume_stop_0 on overcloud-controller-0 'OCF_TIMEOUT' (198): call=175, status=Timed Out, exitreason='none',
    last-rc-change='Wed Sep 28 13:46:11 2016', queued=0ms, exec=19988ms

This may be a race condition, since it does not happen every time, but digging into the logs should help us understand what happened this time.
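For reference, the sequence above can be driven through pcs. The sketch below only builds the commands (a real Pacemaker cluster is needed to actually execute them, and depending on the deployment the Galera resource may have to be addressed as galera-master rather than galera); pcs_cmd, REPRO_SEQUENCE and run_sequence are names made up here for illustration:

```python
import subprocess

def pcs_cmd(action, resource):
    """Build the pcs invocation that stops ('disable') or starts
    ('enable') a cluster-managed resource."""
    assert action in ("disable", "enable")
    return ["pcs", "resource", action, resource]

# The three steps from the reproduction sequence in this report.
REPRO_SEQUENCE = [
    pcs_cmd("disable", "galera"),                   # 1 - stop Galera
    pcs_cmd("disable", "openstack-cinder-volume"),  # 2 - stop cinder-volume
    pcs_cmd("enable", "openstack-cinder-volume"),   # 3 - start cinder-volume
]

def run_sequence(runner=subprocess.check_call):
    """Execute the sequence; 'runner' is injectable so the command
    construction can be exercised without a cluster."""
    for cmd in REPRO_SEQUENCE:
        runner(cmd)
```

Between steps 2 and 3 one would normally wait for `pcs status` to report the resource as Stopped before enabling it again.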
sosreports, logs and status here: http://file.rdu.redhat.com/~rscarazz/BZ1380445/
So on controller-0 we have the following timeline:

1) cinder-volume is up and running, and so is galera:

Sep 28 13:27:14 localhost galera(galera)[54212]: INFO: Galera started
Sep 28 13:27:14 localhost crmd[42964]: notice: Operation galera_promote_0: ok (node=overcloud-controller-0, call=43, rc=0, cib-update=78, confirmed=true)
...
Sep 28 13:38:19 localhost systemd: Started Cluster Controlled openstack-cinder-volume.
Sep 28 13:38:21 localhost crmd[42964]: notice: Operation openstack-cinder-volume_start_0: ok (node=overcloud-controller-0, call=72, rc=0, cib-update=101, confirmed=true)
Sep 28 13:38:21 localhost cinder-volume: 2016-09-28 13:38:21.376 134083 INFO cinder.rpc [req-28d86c59-3c70-40cf-980f-9c2479667293 - - - - -] Automatically selected cinder-volume object...

2) galera is stopped:

Sep 28 13:43:58 localhost galera(galera)[14132]: INFO: MySQL stopped
Sep 28 13:43:58 localhost galera(galera)[14132]: INFO: attempting to detect last commit version by reading /var/lib/mysql/grastate.dat
Sep 28 13:43:58 localhost galera(galera)[14132]: INFO: Last commit version found: 4671
Sep 28 13:43:58 localhost crmd[42964]: notice: Operation galera_demote_0: ok (node=overcloud-controller-0, call=148, rc=0, cib-update=190, confirmed=true)
Sep 28 13:43:58 localhost crmd[42964]: notice: Transition aborted by status-1-galera-last-committed, galera-last-committed=4671: Transient attribute change (create cib=0.35.5, source=abort_unless_down:319, path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1'], 0)
Sep 28 13:43:58 localhost pengine[42963]: notice: Stop galera:0#011(overcloud-controller-0)
Sep 28 13:43:58 localhost pengine[42963]: notice: Stop galera:1#011(overcloud-controller-1)
Sep 28 13:43:58 localhost pengine[42963]: notice: Stop galera:2#011(overcloud-controller-2)

3) cinder-volume starts complaining:

Sep 28 13:43:32 localhost crmd[42964]: notice: Operation openstack-cinder-volume_monitor_0: ok (node=overcloud-controller-0, call=139, rc=0, cib-update=168, confirmed=true)
Sep 28 13:43:32 localhost crmd[42964]: warning: Action 14 (openstack-cinder-volume_monitor_0) on overcloud-controller-0 failed (target: 7 vs. rc: 0): Error
Sep 28 13:43:32 localhost crmd[42964]: warning: Action 14 (openstack-cinder-volume_monitor_0) on overcloud-controller-0 failed (target: 7 vs. rc: 0): Error
Sep 28 13:43:34 localhost crmd[42964]: notice: Initiating action 112: monitor openstack-cinder-volume_monitor_60000 on overcloud-controller-0 (local)
Sep 28 13:44:02 localhost cinder-volume: 2016-09-28 13:44:02.905 134306 ERROR oslo_db.sqlalchemy.exc_filters [req-6b13fe49-3958-4a9a-8c87-b238e57117aa - - - - -] DBAPIError exception wrapped from (pymysql.err.InternalError) (1927, u'Connection was killed') [SQL: u'SELECT 1']

4) We stop cinder-volume successfully:

Sep 28 13:44:45 localhost pengine[42963]: notice: Stop openstack-cinder-volume#011(overcloud-controller-0)
Sep 28 13:44:45 localhost crmd[42964]: notice: Initiating action 111: stop openstack-cinder-volume_stop_0 on overcloud-controller-0 (local)
Sep 28 13:44:45 localhost cinder-volume: 2016-09-28 13:44:45.910 134083 INFO oslo_service.service [req-28d86c59-3c70-40cf-980f-9c2479667293 - - - - -] Caught SIGTERM, stopping children
Sep 28 13:44:45 localhost cinder-volume: 2016-09-28 13:44:45.913 134083 INFO oslo_service.service [req-28d86c59-3c70-40cf-980f-9c2479667293 - - - - -] Waiting on 1 children to exit
Sep 28 13:44:45 localhost cinder-volume: 2016-09-28 13:44:45.922 134083 INFO oslo_service.service [req-28d86c59-3c70-40cf-980f-9c2479667293 - - - - -] Child 134306 killed by signal 15
Sep 28 13:44:47 localhost crmd[42964]: notice: Operation openstack-cinder-volume_stop_0: ok (node=overcloud-controller-0, call=158, rc=0, cib-update=211, confirmed=true)

5) We start cinder-volume:

Sep 28 13:45:05 localhost pengine[42963]: notice: Start openstack-cinder-volume#011(overcloud-controller-0)
Sep 28 13:45:05 localhost crmd[42964]: notice: Initiating action 95: start openstack-cinder-volume_start_0 on overcloud-controller-0 (local)
Sep 28 13:45:05 localhost systemd: Configuration file /run/systemd/system/openstack-cinder-volume.service.d/50-pacemaker.conf is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
Sep 28 13:45:05 localhost systemd: Started Cluster Controlled openstack-cinder-volume.
Sep 28 13:45:05 localhost systemd: Starting Cluster Controlled openstack-cinder-volume...
Sep 28 13:45:07 localhost cinder-volume: Option "logdir" from group "DEFAULT" is deprecated. Use option "log-dir" from group "DEFAULT".
Sep 28 13:45:07 localhost crmd[42964]: notice: Operation openstack-cinder-volume_start_0: ok (node=overcloud-controller-0, call=162, rc=0, cib-update=222, confirmed=true)
Sep 28 13:45:07 localhost crmd[42964]: notice: Initiating action 96: monitor openstack-cinder-volume_monitor_60000 on overcloud-controller-0 (local)

6) Then we stop it again, and here we fail:

Sep 28 13:46:11 localhost pengine[42963]: notice: Stop openstack-cinder-volume#011(overcloud-controller-0)
Sep 28 13:46:11 localhost crmd[42964]: notice: Initiating action 96: stop openstack-cinder-volume_stop_0 on overcloud-controller-0 (local)
Sep 28 13:46:12 localhost cinder-volume: 2016-09-28 13:46:12.010 23277 ERROR oslo_db.sqlalchemy.exc_filters [req-8f4a09ff-bed5-4bc4-8947-85e6b6b36fb0 - - - - -] DBAPIError exception wrapped from (pymysql.err.InternalError) (1927, u'Connection was killed') [SQL: u'SELECT 1']
...
Sep 28 13:46:12 localhost cinder-volume: 2016-09-28 13:46:12.015 23277 ERROR cinder.service
Sep 28 13:46:31 localhost lrmd[42961]: notice: Giving up on openstack-cinder-volume stop (rc=196): timeout (elapsed=19987ms, remaining=13ms)
Sep 28 13:46:31 localhost crmd[42964]: error: Operation openstack-cinder-volume_stop_0: Timed Out (node=overcloud-controller-0, call=175, timeout=20000ms)
Sep 28 13:46:31 localhost crmd[42964]: warning: Action 96 (openstack-cinder-volume_stop_0) on overcloud-controller-0 failed (target: 0 vs. rc: 198): Error
Sep 28 13:46:31 localhost crmd[42964]: notice: Transition aborted by openstack-cinder-volume_stop_0 'modify' on overcloud-controller-0: Event failed (magic=2:198;96:67:0:c63a51fa-bb3a-43fe-b017-3d281390b096, cib=0.42.1, source=match_graph_event:381, 0)
Sep 28 13:46:31 localhost crmd[42964]: warning: Action 96 (openstack-cinder-volume_stop_0) on overcloud-controller-0 failed (target: 0 vs. rc: 198): Error
Sep 28 13:46:31 localhost pengine[42963]: warning: Processing failed op stop for openstack-cinder-volume on overcloud-controller-0: OCF_TIMEOUT (198)
Sep 28 13:46:31 localhost pengine[42963]: warning: Processing failed op stop for openstack-cinder-volume on overcloud-controller-0: OCF_TIMEOUT (198)

So it seems to me that the first stop is successful; it's the second one that runs into trouble. What puzzles me is that we did not wait long enough for systemd to SIGTERM/SIGKILL the service. I am pretty sure we bumped the systemd timeouts in the past.
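The mechanism that times out here can be sketched roughly like this (an illustration of the general SIGTERM-then-SIGKILL escalation, not the actual lrmd/systemd code; stop_with_timeout is a name made up for the example). A child that ignores SIGTERM stands in for cinder-volume wedged in its shutdown path, e.g. blocked on a killed Galera connection, and it outlives the whole stop timeout before being killed:

```python
import subprocess
import sys
import time

def stop_with_timeout(proc, stop_timeout):
    """Send SIGTERM, poll until the stop timeout elapses, and only then
    escalate to SIGKILL.  Returns how the process ultimately exited."""
    proc.terminate()  # SIGTERM, like "Caught SIGTERM, stopping children"
    deadline = time.monotonic() + stop_timeout
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return "terminated"  # clean stop, like the first stop at 13:44:47
        time.sleep(0.05)
    proc.kill()  # stop timeout exceeded -> SIGKILL (OCF_TIMEOUT territory)
    proc.wait()
    return "killed"

# Child that installs a SIGTERM-ignoring handler, then hangs.  It prints
# "ready" once the handler is in place so we don't race the signal.
stubborn = subprocess.Popen(
    [sys.executable, "-c",
     "import signal, time\n"
     "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
     "print('ready', flush=True)\n"
     "time.sleep(60)"],
    stdout=subprocess.PIPE)
stubborn.stdout.readline()
print(stop_with_timeout(stubborn, stop_timeout=1.0))  # prints "killed"
```

With a 20000ms Pacemaker stop timeout, the escalation to SIGKILL never gets a chance if lrmd gives up at 19987ms first, which matches the "Giving up on openstack-cinder-volume stop" line above.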
Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-2948.html