1380445 – cinder-volume stop timeout when galera is not running

Bug 1380445 - cinder-volume stop timeout when galera is not running

Summary: cinder-volume stop timeout when galera is not running

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	puppet-tripleo
Sub Component:
Version:	10.0 (Newton)
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	10.0 (Newton)
Assignee:	Michele Baldessari
QA Contact:	nlevinki
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2016-09-29 15:40 UTC by Raoul Scarazzini
Modified:	2016-12-14 16:06 UTC (History)
CC List:	12 users (show)
Fixed In Version:	puppet-tripleo-5.2.0-2.el7ost
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-12-14 16:06:06 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Launchpad	1629366	None	None	None	2016-09-30 15:28:20 UTC
OpenStack gerrit	380665	None	MERGED	Fix the timeout for pacemaker systemd resources	2020-10-05 19:40:53 UTC
Red Hat Product Errata	RHEA-2016:2948	normal	SHIPPED_LIVE	Red Hat OpenStack Platform 10 enhancement update	2016-12-14 19:55:27 UTC

Description Raoul Scarazzini 2016-09-29 15:40:54 UTC

Description of problem:

While testing HA resource behavior inside Newton we do this sequence of operations:

1 - Stop Galera;
2 - Stop openstack-cinder-volume resource;
3 - Start openstack-cinder-volume resource;

Problem is that while doing step 2 the cluster is not able to stop the resource due to a timeout:

 ip-172.17.0.14	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-0
 ip-172.20.0.13	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-1
 ip-172.18.0.11	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-2
 ip-172.19.0.18	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-0
 Clone Set: haproxy-clone [haproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 ip-192.0.2.8	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-1
 Master/Slave Set: galera-master [galera]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Master/Slave Set: redis-master [redis]
     Masters: [ overcloud-controller-0 ]
     Slaves: [ overcloud-controller-1 overcloud-controller-2 ]
 ip-172.17.0.15	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-2
 openstack-cinder-volume	(systemd:openstack-cinder-volume):	FAILED overcloud-controller-0 (unmanaged)

Failed Actions:
* openstack-cinder-volume_stop_0 on overcloud-controller-0 'OCF_TIMEOUT' (198): call=175, status=Timed Out, exitreason='none',
    last-rc-change='Wed Sep 28 13:46:11 2016', queued=0ms, exec=19988ms

Now, this can be a race condition, since it does not happen every time, but looking inside the logs could be useful to understand what happened this time.

Comment 1 Raoul Scarazzini 2016-09-29 15:46:48 UTC

sosreports, logs and stauts here: http://file.rdu.redhat.com/~rscarazz/BZ1380445/

Comment 2 Michele Baldessari 2016-09-30 10:44:28 UTC

So on controller-0 we have the following timeline:
1) cinder-volume is up and running and so is galera:
Sep 28 13:27:14 localhost galera(galera)[54212]: INFO: Galera started
Sep 28 13:27:14 localhost crmd[42964]:  notice: Operation galera_promote_0: ok (node=overcloud-controller-0, call=43, rc=0, cib-update=78, confirmed=true)

...
Sep 28 13:38:19 localhost systemd: Started Cluster Controlled openstack-cinder-volume.
Sep 28 13:38:21 localhost crmd[42964]:  notice: Operation openstack-cinder-volume_start_0: ok (node=overcloud-controller-0, call=72, rc=0, cib-update=101, confirmed=true)
Sep 28 13:38:21 localhost cinder-volume: 2016-09-28 13:38:21.376 134083 INFO cinder.rpc [req-28d86c59-3c70-40cf-980f-9c2479667293 - - - - -] Automatically selected cinder-volume object...

2) galera is stopped:
Sep 28 13:43:58 localhost galera(galera)[14132]: INFO: MySQL stopped
Sep 28 13:43:58 localhost galera(galera)[14132]: INFO: attempting to detect last commit version by reading /var/lib/mysql/grastate.dat
Sep 28 13:43:58 localhost galera(galera)[14132]: INFO: Last commit version found:  4671
Sep 28 13:43:58 localhost crmd[42964]:  notice: Operation galera_demote_0: ok (node=overcloud-controller-0, call=148, rc=0, cib-update=190, confirmed=true)
Sep 28 13:43:58 localhost crmd[42964]:  notice: Transition aborted by status-1-galera-last-committed, galera-last-committed=4671: Transient attribute change (create cib=0.35.5, source=abort_unless_down:319, path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1'], 0)
Sep 28 13:43:58 localhost pengine[42963]:  notice: Stop    galera:0#011(overcloud-controller-0)
Sep 28 13:43:58 localhost pengine[42963]:  notice: Stop    galera:1#011(overcloud-controller-1)
Sep 28 13:43:58 localhost pengine[42963]:  notice: Stop    galera:2#011(overcloud-controller-2)

3) cinder-volume starts complaining:
Sep 28 13:43:32 localhost crmd[42964]:  notice: Operation openstack-cinder-volume_monitor_0: ok (node=overcloud-controller-0, call=139, rc=0, cib-update=168, confirmed=true)
Sep 28 13:43:32 localhost crmd[42964]: warning: Action 14 (openstack-cinder-volume_monitor_0) on overcloud-controller-0 failed (target: 7 vs. rc: 0): Error
Sep 28 13:43:32 localhost crmd[42964]: warning: Action 14 (openstack-cinder-volume_monitor_0) on overcloud-controller-0 failed (target: 7 vs. rc: 0): Error
Sep 28 13:43:34 localhost crmd[42964]:  notice: Initiating action 112: monitor openstack-cinder-volume_monitor_60000 on overcloud-controller-0 (local)
Sep 28 13:44:02 localhost cinder-volume: 2016-09-28 13:44:02.905 134306 ERROR oslo_db.sqlalchemy.exc_filters [req-6b13fe49-3958-4a9a-8c87-b238e57117aa - - - - -] DBAPIError exception wrapped from (pymysql.err.InternalError) (1927, u'Connection was killed') [SQL: u'SELECT 1']

4) We stop cinder-volume successfully:
Sep 28 13:44:45 localhost pengine[42963]:  notice: Stop    openstack-cinder-volume#011(overcloud-controller-0)
Sep 28 13:44:45 localhost crmd[42964]:  notice: Initiating action 111: stop openstack-cinder-volume_stop_0 on overcloud-controller-0 (local)
Sep 28 13:44:45 localhost cinder-volume: 2016-09-28 13:44:45.910 134083 INFO oslo_service.service [req-28d86c59-3c70-40cf-980f-9c2479667293 - - - - -] Caught SIGTERM, stopping children
Sep 28 13:44:45 localhost cinder-volume: 2016-09-28 13:44:45.913 134083 INFO oslo_service.service [req-28d86c59-3c70-40cf-980f-9c2479667293 - - - - -] Waiting on 1 children to exit
Sep 28 13:44:45 localhost cinder-volume: 2016-09-28 13:44:45.922 134083 INFO oslo_service.service [req-28d86c59-3c70-40cf-980f-9c2479667293 - - - - -] Child 134306 killed by signal 15
Sep 28 13:44:47 localhost crmd[42964]:  notice: Operation openstack-cinder-volume_stop_0: ok (node=overcloud-controller-0, call=158, rc=0, cib-update=211, confirmed=true)

5) We start cinder-volume:
Sep 28 13:45:05 localhost pengine[42963]:  notice: Start   openstack-cinder-volume#011(overcloud-controller-0)
Sep 28 13:45:05 localhost crmd[42964]:  notice: Initiating action 95: start openstack-cinder-volume_start_0 on overcloud-controller-0 (local)
Sep 28 13:45:05 localhost systemd: Configuration file /run/systemd/system/openstack-cinder-volume.service.d/50-pacemaker.conf is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
Sep 28 13:45:05 localhost systemd: Started Cluster Controlled openstack-cinder-volume.
Sep 28 13:45:05 localhost systemd: Starting Cluster Controlled openstack-cinder-volume...
Sep 28 13:45:07 localhost cinder-volume: Option "logdir" from group "DEFAULT" is deprecated. Use option "log-dir" from group "DEFAULT".
Sep 28 13:45:07 localhost crmd[42964]:  notice: Operation openstack-cinder-volume_start_0: ok (node=overcloud-controller-0, call=162, rc=0, cib-update=222, confirmed=true)
Sep 28 13:45:07 localhost crmd[42964]:  notice: Initiating action 96: monitor openstack-cinder-volume_monitor_60000 on overcloud-controller-0 (local)

6) Then we stop it again and here we fail:
Sep 28 13:46:11 localhost pengine[42963]:  notice: Stop    openstack-cinder-volume#011(overcloud-controller-0)
Sep 28 13:46:11 localhost crmd[42964]:  notice: Initiating action 96: stop openstack-cinder-volume_stop_0 on overcloud-controller-0 (local)
Sep 28 13:46:12 localhost cinder-volume: 2016-09-28 13:46:12.010 23277 ERROR oslo_db.sqlalchemy.exc_filters [req-8f4a09ff-bed5-4bc4-8947-85e6b6b36fb0 - - - - -] DBAPIError exception wrapped from (pymysql.err.InternalError) (1927, u'Connection was killed') [SQL: u'SELECT 1']
...
Sep 28 13:46:12 localhost cinder-volume: 2016-09-28 13:46:12.015 23277 ERROR cinder.service
Sep 28 13:46:31 localhost lrmd[42961]:  notice: Giving up on openstack-cinder-volume stop (rc=196): timeout (elapsed=19987ms, remaining=13ms)
Sep 28 13:46:31 localhost crmd[42964]:   error: Operation openstack-cinder-volume_stop_0: Timed Out (node=overcloud-controller-0, call=175, timeout=20000ms)
Sep 28 13:46:31 localhost crmd[42964]: warning: Action 96 (openstack-cinder-volume_stop_0) on overcloud-controller-0 failed (target: 0 vs. rc: 198): Error
Sep 28 13:46:31 localhost crmd[42964]:  notice: Transition aborted by openstack-cinder-volume_stop_0 'modify' on overcloud-controller-0: Event failed (magic=2:198;96:67:0:c63a51fa-bb3a-43fe-b017-3d281390b096, cib=0.42.1, source=match_graph_event:381, 0)
Sep 28 13:46:31 localhost crmd[42964]: warning: Action 96 (openstack-cinder-volume_stop_0) on overcloud-controller-0 failed (target: 0 vs. rc: 198): Error
Sep 28 13:46:31 localhost pengine[42963]: warning: Processing failed op stop for openstack-cinder-volume on overcloud-controller-0: OCF_TIMEOUT (198)
Sep 28 13:46:31 localhost pengine[42963]: warning: Processing failed op stop for openstack-cinder-volume on overcloud-controller-0: OCF_TIMEOUT (198)

So it seems to me the first stop is successful. It's the second one? that seems to have trouble. What puzzles me is that we did not wait enough for systemd to sigkill/sigterm the service. I am pretty sure we had bumped the systemd timeouts in the past

Comment 10 Marian Krcmarik 2016-10-20 09:13:45 UTC

Verified.

Comment 12 errata-xmlrpc 2016-12-14 16:06:06 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html

Note You need to log in before you can comment on or make changes to this bug.