Bug 1380445 - cinder-volume stop timeout when galera is not running
Summary: cinder-volume stop timeout when galera is not running
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-tripleo
Version: 10.0 (Newton)
Hardware: x86_64
OS: Linux
high
high
Target Milestone: rc
: 10.0 (Newton)
Assignee: Michele Baldessari
QA Contact: nlevinki
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-09-29 15:40 UTC by Raoul Scarazzini
Modified: 2016-12-14 16:06 UTC (History)
12 users (show)

Fixed In Version: puppet-tripleo-5.2.0-2.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-12-14 16:06:06 UTC


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2016:2948 normal SHIPPED_LIVE Red Hat OpenStack Platform 10 enhancement update 2016-12-14 19:55:27 UTC
OpenStack gerrit 380665 None None None 2016-09-30 22:20:06 UTC
Launchpad 1629366 None None None 2016-09-30 15:28:20 UTC

Description Raoul Scarazzini 2016-09-29 15:40:54 UTC
Description of problem:

While testing HA resource behavior inside Newton we do this sequence of operations:

1 - Stop Galera;
2 - Stop openstack-cinder-volume resource;
3 - Start openstack-cinder-volume resource;

Problem is that while doing step 2 the cluster is not able to stop the resource due to a timeout:

 ip-172.17.0.14	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-0
 ip-172.20.0.13	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-1
 ip-172.18.0.11	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-2
 ip-172.19.0.18	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-0
 Clone Set: haproxy-clone [haproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 ip-192.0.2.8	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-1
 Master/Slave Set: galera-master [galera]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Master/Slave Set: redis-master [redis]
     Masters: [ overcloud-controller-0 ]
     Slaves: [ overcloud-controller-1 overcloud-controller-2 ]
 ip-172.17.0.15	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-2
 openstack-cinder-volume	(systemd:openstack-cinder-volume):	FAILED overcloud-controller-0 (unmanaged)

Failed Actions:
* openstack-cinder-volume_stop_0 on overcloud-controller-0 'OCF_TIMEOUT' (198): call=175, status=Timed Out, exitreason='none',
    last-rc-change='Wed Sep 28 13:46:11 2016', queued=0ms, exec=19988ms

Now, this can be a race condition, since it does not happen every time, but looking inside the logs could be useful to understand what happened this time.

Comment 1 Raoul Scarazzini 2016-09-29 15:46:48 UTC
sosreports, logs and stauts here: http://file.rdu.redhat.com/~rscarazz/BZ1380445/

Comment 2 Michele Baldessari 2016-09-30 10:44:28 UTC
So on controller-0 we have the following timeline:
1) cinder-volume is up and running and so is galera:
Sep 28 13:27:14 localhost galera(galera)[54212]: INFO: Galera started
Sep 28 13:27:14 localhost crmd[42964]:  notice: Operation galera_promote_0: ok (node=overcloud-controller-0, call=43, rc=0, cib-update=78, confirmed=true)

...
Sep 28 13:38:19 localhost systemd: Started Cluster Controlled openstack-cinder-volume.
Sep 28 13:38:21 localhost crmd[42964]:  notice: Operation openstack-cinder-volume_start_0: ok (node=overcloud-controller-0, call=72, rc=0, cib-update=101, confirmed=true)
Sep 28 13:38:21 localhost cinder-volume: 2016-09-28 13:38:21.376 134083 INFO cinder.rpc [req-28d86c59-3c70-40cf-980f-9c2479667293 - - - - -] Automatically selected cinder-volume object...

2) galera is stopped:
Sep 28 13:43:58 localhost galera(galera)[14132]: INFO: MySQL stopped
Sep 28 13:43:58 localhost galera(galera)[14132]: INFO: attempting to detect last commit version by reading /var/lib/mysql/grastate.dat
Sep 28 13:43:58 localhost galera(galera)[14132]: INFO: Last commit version found:  4671
Sep 28 13:43:58 localhost crmd[42964]:  notice: Operation galera_demote_0: ok (node=overcloud-controller-0, call=148, rc=0, cib-update=190, confirmed=true)
Sep 28 13:43:58 localhost crmd[42964]:  notice: Transition aborted by status-1-galera-last-committed, galera-last-committed=4671: Transient attribute change (create cib=0.35.5, source=abort_unless_down:319, path=/cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1'], 0)
Sep 28 13:43:58 localhost pengine[42963]:  notice: Stop    galera:0#011(overcloud-controller-0)
Sep 28 13:43:58 localhost pengine[42963]:  notice: Stop    galera:1#011(overcloud-controller-1)
Sep 28 13:43:58 localhost pengine[42963]:  notice: Stop    galera:2#011(overcloud-controller-2)

3) cinder-volume starts complaining:
Sep 28 13:43:32 localhost crmd[42964]:  notice: Operation openstack-cinder-volume_monitor_0: ok (node=overcloud-controller-0, call=139, rc=0, cib-update=168, confirmed=true)
Sep 28 13:43:32 localhost crmd[42964]: warning: Action 14 (openstack-cinder-volume_monitor_0) on overcloud-controller-0 failed (target: 7 vs. rc: 0): Error
Sep 28 13:43:32 localhost crmd[42964]: warning: Action 14 (openstack-cinder-volume_monitor_0) on overcloud-controller-0 failed (target: 7 vs. rc: 0): Error
Sep 28 13:43:34 localhost crmd[42964]:  notice: Initiating action 112: monitor openstack-cinder-volume_monitor_60000 on overcloud-controller-0 (local)
Sep 28 13:44:02 localhost cinder-volume: 2016-09-28 13:44:02.905 134306 ERROR oslo_db.sqlalchemy.exc_filters [req-6b13fe49-3958-4a9a-8c87-b238e57117aa - - - - -] DBAPIError exception wrapped from (pymysql.err.InternalError) (1927, u'Connection was killed') [SQL: u'SELECT 1']

4) We stop cinder-volume successfully:
Sep 28 13:44:45 localhost pengine[42963]:  notice: Stop    openstack-cinder-volume#011(overcloud-controller-0)
Sep 28 13:44:45 localhost crmd[42964]:  notice: Initiating action 111: stop openstack-cinder-volume_stop_0 on overcloud-controller-0 (local)
Sep 28 13:44:45 localhost cinder-volume: 2016-09-28 13:44:45.910 134083 INFO oslo_service.service [req-28d86c59-3c70-40cf-980f-9c2479667293 - - - - -] Caught SIGTERM, stopping children
Sep 28 13:44:45 localhost cinder-volume: 2016-09-28 13:44:45.913 134083 INFO oslo_service.service [req-28d86c59-3c70-40cf-980f-9c2479667293 - - - - -] Waiting on 1 children to exit
Sep 28 13:44:45 localhost cinder-volume: 2016-09-28 13:44:45.922 134083 INFO oslo_service.service [req-28d86c59-3c70-40cf-980f-9c2479667293 - - - - -] Child 134306 killed by signal 15
Sep 28 13:44:47 localhost crmd[42964]:  notice: Operation openstack-cinder-volume_stop_0: ok (node=overcloud-controller-0, call=158, rc=0, cib-update=211, confirmed=true)

5) We start cinder-volume:
Sep 28 13:45:05 localhost pengine[42963]:  notice: Start   openstack-cinder-volume#011(overcloud-controller-0)
Sep 28 13:45:05 localhost crmd[42964]:  notice: Initiating action 95: start openstack-cinder-volume_start_0 on overcloud-controller-0 (local)
Sep 28 13:45:05 localhost systemd: Configuration file /run/systemd/system/openstack-cinder-volume.service.d/50-pacemaker.conf is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
Sep 28 13:45:05 localhost systemd: Started Cluster Controlled openstack-cinder-volume.
Sep 28 13:45:05 localhost systemd: Starting Cluster Controlled openstack-cinder-volume...
Sep 28 13:45:07 localhost cinder-volume: Option "logdir" from group "DEFAULT" is deprecated. Use option "log-dir" from group "DEFAULT".
Sep 28 13:45:07 localhost crmd[42964]:  notice: Operation openstack-cinder-volume_start_0: ok (node=overcloud-controller-0, call=162, rc=0, cib-update=222, confirmed=true)
Sep 28 13:45:07 localhost crmd[42964]:  notice: Initiating action 96: monitor openstack-cinder-volume_monitor_60000 on overcloud-controller-0 (local)

6) Then we stop it again and here we fail:
Sep 28 13:46:11 localhost pengine[42963]:  notice: Stop    openstack-cinder-volume#011(overcloud-controller-0)
Sep 28 13:46:11 localhost crmd[42964]:  notice: Initiating action 96: stop openstack-cinder-volume_stop_0 on overcloud-controller-0 (local)
Sep 28 13:46:12 localhost cinder-volume: 2016-09-28 13:46:12.010 23277 ERROR oslo_db.sqlalchemy.exc_filters [req-8f4a09ff-bed5-4bc4-8947-85e6b6b36fb0 - - - - -] DBAPIError exception wrapped from (pymysql.err.InternalError) (1927, u'Connection was killed') [SQL: u'SELECT 1']
...
Sep 28 13:46:12 localhost cinder-volume: 2016-09-28 13:46:12.015 23277 ERROR cinder.service
Sep 28 13:46:31 localhost lrmd[42961]:  notice: Giving up on openstack-cinder-volume stop (rc=196): timeout (elapsed=19987ms, remaining=13ms)
Sep 28 13:46:31 localhost crmd[42964]:   error: Operation openstack-cinder-volume_stop_0: Timed Out (node=overcloud-controller-0, call=175, timeout=20000ms)
Sep 28 13:46:31 localhost crmd[42964]: warning: Action 96 (openstack-cinder-volume_stop_0) on overcloud-controller-0 failed (target: 0 vs. rc: 198): Error
Sep 28 13:46:31 localhost crmd[42964]:  notice: Transition aborted by openstack-cinder-volume_stop_0 'modify' on overcloud-controller-0: Event failed (magic=2:198;96:67:0:c63a51fa-bb3a-43fe-b017-3d281390b096, cib=0.42.1, source=match_graph_event:381, 0)
Sep 28 13:46:31 localhost crmd[42964]: warning: Action 96 (openstack-cinder-volume_stop_0) on overcloud-controller-0 failed (target: 0 vs. rc: 198): Error
Sep 28 13:46:31 localhost pengine[42963]: warning: Processing failed op stop for openstack-cinder-volume on overcloud-controller-0: OCF_TIMEOUT (198)
Sep 28 13:46:31 localhost pengine[42963]: warning: Processing failed op stop for openstack-cinder-volume on overcloud-controller-0: OCF_TIMEOUT (198)

So it seems to me the first stop is successful. It's the second one? that seems to have trouble. What puzzles me is that we did not wait enough for systemd to sigkill/sigterm the service. I am pretty sure we had bumped the systemd timeouts in the past

Comment 10 Marian Krcmarik 2016-10-20 09:13:45 UTC
Verified.

Comment 12 errata-xmlrpc 2016-12-14 16:06:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html


Note You need to log in before you can comment on or make changes to this bug.