Bug 1247302

Summary: cinder-volume is not getting requests submitted to the API
Product: Red Hat OpenStack
Reporter: Jack Waterworth <jwaterwo>
Component: openstack-cinder
Assignee: Jon Bernard <jobernar>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: nlevinki <nlevinki>
Severity: high
Docs Contact:
Priority: unspecified
Version: 5.0 (RHEL 7)
CC: chhudson, dshetty, eharney, fpercoco, jbiao, jdurgin, jeckersb, jwaterwo, lyarwood, scohen, sgotliv, yeylon
Target Milestone: ---
Keywords: Unconfirmed, ZStream
Target Release: 5.0 (RHEL 7)
Hardware: All
OS: All
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-08-25 09:05:15 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Jack Waterworth 2015-07-27 18:08:52 UTC
Description of problem:
The customer has a script that periodically creates a 1 GB volume to verify that Cinder is working as expected. During one incident, no volumes were created between roughly 2pm and 11pm.
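
As a rough illustration, such a probe could look like the sketch below (assuming the v1 python-cinderclient shipped with OSP 5; the credentials, endpoint, and volume name are placeholders, not the customer's actual script):

  # Periodic health-check sketch: create a 1 GB volume, wait for it to
  # leave 'creating', then clean up. Credentials/endpoint are placeholders.
  import time
  from cinderclient.v1 import client

  cinder = client.Client('admin', 'PASSWORD', 'admin',
                         'http://keystone.example.com:5000/v2.0')

  vol = cinder.volumes.create(1, display_name='cinder-healthcheck')
  for _ in range(60):                      # wait up to ~5 minutes
      vol = cinder.volumes.get(vol.id)
      if vol.status in ('available', 'error'):
          break
      time.sleep(5)
  print('volume %s ended in status %s' % (vol.id, vol.status))
  cinder.volumes.delete(vol)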

How reproducible:
Intermittently, at seemingly random times.


Steps to Reproduce:
1. create cinder volumes periodically

Actual results:
cinder-volume never receives the request to create the volume.

Expected results:
cinder-volume should receive the request and create the volume.

Additional info:
During the affected window, we can see that no create requests are being processed in the cinder volume.log:

------------------
2015-07-14 13:52:24.870 4659 INFO cinder.volume.flows.manager.create_volume [req-80db52ee-b0d3-4057-ab0e-27d82005b913 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Volume adcbfcc3-99f1-496a-9688-7671cdd9d3d3: being created using CreateVolumeFromSpecTask._create_raw_volume with specification: {'status': u'creating', 'volume_size': 1, 'volume_name': u'volume-adcbfcc3-99f1-496a-9688-7671cdd9d3d3'}
2015-07-14 14:05:11.005 4659 INFO cinder.volume.flows.manager.create_volume [req-ed5bfe7d-ca77-42d6-a71d-8f01416f31c8 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Volume 2bc31323-51c9-4848-9bc3-523d57d81fce: being created using CreateVolumeFromSpecTask._create_from_snapshot with specification: {'status': u'creating', 'volume_size': 1, 'snapshot_id': u'fe418d91-88d7-4611-93b3-59b91448262b', 'volume_name': u'volume-2bc31323-51c9-4848-9bc3-523d57d81fce'}
2015-07-14 22:44:14.282 19866 INFO cinder.volume.flows.manager.create_volume [req-7094f2e5-3dcd-4b90-9532-19bb3f48c51f 13e80e03f324451d81c87a74c15fb264 3cfc4a2c3483486a9ae24bf877eaf2a8 - - -] Volume 43d9cf5b-b87c-4659-939e-b9139706f780: being created using CreateVolumeFromSpecTask._create_raw_volume with specification: {'status': u'creating', 'volume_size': 1, 'volume_name': u'volume-43d9cf5b-b87c-4659-939e-b9139706f780'}
------------------

However, during the same time period, api.log shows plenty of incoming create requests:

------------------
2015-07-14 15:17:25.041 4672 AUDIT cinder.api.v1.volumes [req-b787b954-403a-4898-98cd-200febc35ddb 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Create volume of 1 GB
2015-07-14 15:25:15.188 4672 AUDIT cinder.api.v1.volumes [req-b2824525-f5f5-4b63-b720-90c0f217d181 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Create volume of 1 GB
2015-07-14 15:33:24.870 4672 AUDIT cinder.api.v1.volumes [req-f7e814f9-d2c8-4ff7-aa3a-8139e4cffb19 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Create volume of 1 GB
2015-07-14 15:41:25.490 4672 AUDIT cinder.api.v1.volumes [req-24ab4eb0-5424-416f-bc60-c42e31f3ee57 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Create volume of 1 GB
2015-07-14 15:49:24.220 4672 AUDIT cinder.api.v1.volumes [req-af02b00f-bfef-4ebc-93b6-7691dcb1ef3d 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Create volume of 1 GB
2015-07-14 15:57:24.586 4672 AUDIT cinder.api.v1.volumes [req-1827efb8-7d22-4184-a24f-835dd289de5a 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Create volume of 1 GB
2015-07-14 16:05:25.778 4672 AUDIT cinder.api.v1.volumes [req-2ec60af8-88b6-4a3e-ad71-029e38f98013 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Create volume of 1 GB
2015-07-14 16:13:34.403 4672 AUDIT cinder.api.v1.volumes [req-4f49393e-2890-4323-926c-adfafc09c2bf 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Create volume of 1 GB
2015-07-14 16:21:30.293 4672 AUDIT cinder.api.v1.volumes [req-f76226f6-0616-43b9-b3a0-bde650065bae 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Create volume of 1 GB
2015-07-14 16:29:30.707 4672 AUDIT cinder.api.v1.volumes [req-00ca8da2-d2d4-4f97-b1dd-1d5ce65268ab 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Create volume of 1 GB
2015-07-14 16:37:30.158 4672 AUDIT cinder.api.v1.volumes [req-56619a55-410e-42a6-8809-b8fab6e8bc2e 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Create volume of 1 GB
2015-07-14 16:45:30.820 4672 AUDIT cinder.api.v1.volumes [req-cf6c1088-4740-41d9-8c87-2be5cb8ea2cb 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Create volume of 1 GB
2015-07-14 16:53:25.829 4672 AUDIT cinder.api.v1.volumes [req-7bc3ffaa-081b-442e-acbf-d7da172ca54d 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Create volume of 1 GB
2015-07-14 17:01:40.562 4672 AUDIT cinder.api.v1.volumes [req-46ff1ce4-99d6-4d89-8777-1ea29ae5b80d 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Create volume of 1 GB
------------------

These requests appear to be getting lost between the API and the volume service.

Additionally, the customer noticed that RabbitMQ has queues with "stuck" requests:

------------------
new finding -

reoccurred just now -

rabbit-queues for cinder-service were not being consumed.

[root@csx-a-rabbitmq-001 ~]# rabbitmqctl list_queues | sort -n -k2,2 |grep cinder
cinder-backup   0
cinder-backup_fanout_2709be9d09c84f968535eb57d5d3d855    0
cinder-backup_fanout_5d18501eefee4a588e42579cb027f9f2      0
cinder-backup_fanout_bcf68078844042d88f99c3b93ba4a448      0
cinder-backup.us-texas-3.cloud.cisco.com                   0
cinder-scheduler             0
cinder-scheduler_fanout_3aa8cdbbcd06488b8adfcd7d639e331b              0
cinder-scheduler_fanout_9770bab0700343958a2e2854a03a1d58               0
cinder-scheduler_fanout_f08a56f40643494bb323ac87d1be1c18                0
cinder-scheduler.us-texas-3.cloud.cisco.com                0
cinder-volume  0
cinder-volume_fanout_685543690f2c4559b5676393a2aced38     0
cinder-volume_fanout_8d6e1370802e45dfb278906b54bc1f6d    0
cinder-volume_fanout_92076bda2d03483ea86ee96acd4cf901    0
cinder-volume_fanout_c42167c07c5d43c2b36226c1406db26d     0
cinder-volume_fanout_c5218c51a0164a98b48838c7bd07a7a1     0
cinder-volume_fanout_c56606dda54c4233a6790b3e4277d0da   0
cinder-volume.us-texas-3.cloud.cisco.com@solidfire   0
cinder-volume_fanout_09bd4b1ecc7d4b748f29e5688b78b7df   6
cinder-volume_fanout_8ea9e5ed87c54fc4b3eabd3fed74e5df   6
cinder-volume_fanout_8fac0842bb6b49f89a0c29c500a42e5e     6
cinder-volume_fanout_938ec5849ec8441d9ed104497cbce638    6
cinder-volume_fanout_a225f33652f441c9b1fa037665353803       6
cinder-volume_fanout_de79aabd88f045559846b84fc7916ba5    6
cc_cinder.error 196
cinder-volume.us-texas-3.cloud.cisco.com@generic     196

The message count on cinder-volume.us-texas-3.cloud.cisco.com@generic would increase for each cinder create/delete command I performed.



After restarting the rabbit services on each node, the queues returned to normal:
cinder-volume.us-texas-3.cloud.cisco.com@generic     0
------------------

Restarting Cinder on all nodes cleans up this issue.

Additionally, I noticed that ALL requests appear to be sent to one node (node3).
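
If this recurs, one quick check before restarting anything would be whether the stuck queues still have any consumers attached. A minimal sketch using the RabbitMQ management HTTP API (assuming the rabbitmq_management plugin is enabled; the host and guest credentials are placeholders):

  # Lists cinder queues with message backlog and consumer count; a
  # growing backlog with zero consumers means cinder-volume is no
  # longer reading from that queue.
  import requests

  resp = requests.get('http://csx-a-rabbitmq-001:15672/api/queues',
                      auth=('guest', 'guest'))
  for q in resp.json():
      if q['name'].startswith('cinder'):
          print('%-60s messages=%-5s consumers=%s'
                % (q['name'], q.get('messages', '?'), q.get('consumers', '?')))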

Comment 7 Flavio Percoco 2015-07-30 12:24:34 UTC
One thing that I'm currently missing is how this issue was debugged in-house. That is to say, how was it determined that cinder-volume was not getting the requests and the volumes were not created?

Taking a random request from the ones listed in the description, I can see it going from cinder-api to cinder-scheduler and then to cinder-volume. The process ended with the volume being created successfully.

Request: req-00ca8da2-d2d4-4f97-b1dd-1d5ce65268ab

./cinderlogs.zip-0d5f7655-c275-455a-9847-4858a225d09c/cinderlogs-001/volume.log:2015-07-14 22:29:07.982 1970 INFO cinder.volume.flows.manager.create_volume [req-00ca8da2-d2d4-4f97-b1dd-1d5ce65268ab 00922c14ea054b
28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Volume volume-6576abea-d575-48b0-ac64-f5b06776fc23 (6576abea-d575-48b0-ac64-f5b06776fc23): created successfully

It'd be useful to have the list of 'stuck' volumes so we can debug the problematic cases rather than the whole system. If we don't have that list, knowing how it was determined that some volumes were not created would also help.

Comment 8 Jack Waterworth 2015-07-30 14:49:47 UTC
According to the API log, the request you referenced came in at 16:29:30, six hours before the volume was actually created.

2015-07-14 16:29:30.707 4672 AUDIT cinder.api.v1.volumes [req-00ca8da2-d2d4-4f97-b1dd-1d5ce65268ab 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Create volume of 1 GB

The customer reports that Cinder did not create volumes between 16:00 and 22:00. I suspect the request was stuck in the volume queue until the cinder-volume service was restarted, at which point everything was flushed and the volumes were created as expected.

Could you show me where you see this in the scheduler? I checked the logs but did not find anything relevant.

Comment 10 Flavio Percoco 2015-07-31 02:04:25 UTC
Jack,

Good observation about the 6-hour difference. However, I'm still quite confused. The bug description says that no volumes were created between 2pm and 10pm, yet I can see in cinder-volume.log, on the same volume node where the test volumes weren't created, that volumes were created during that time.

My guess right now is that there were indeed some communication issues between the rabbit nodes and cinder-volume. I'm going to ask eck to take another look at that while I keep debugging this issue on the Cinder side. I don't think the volume node ran out of capacity; otherwise the scheduler wouldn't have picked it.

./cinderctl-002-soscleaner-2012737258202097.tar.gz-e5968a23-4a9c-48f2-89c8-a4c7fcc03940/soscleaner-2012737258202097/var/log/cinder/volume.log-20150720:2015-07-14 15:49:24.600 6124 INFO cinder.volume.flows.manager
.create_volume [req-af02b00f-bfef-4ebc-93b6-7691dcb1ef3d 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Volume volume-b5029977-9347-456d-8403-154a51f359b3 (b5029977-9347-456d-8403-154a51
f359b3): created successfully
./cinderctl-002-soscleaner-2012737258202097.tar.gz-e5968a23-4a9c-48f2-89c8-a4c7fcc03940/soscleaner-2012737258202097/var/log/cinder/volume.log-20150720:2015-07-14 16:05:26.158 6124 INFO cinder.volume.flows.manager
.create_volume [req-2ec60af8-88b6-4a3e-ad71-029e38f98013 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Volume volume-221d5572-e984-4e47-bce8-ae895b7ddf6e (221d5572-e984-4e47-bce8-ae895b
7ddf6e): created successfully
./cinderctl-002-soscleaner-2012737258202097.tar.gz-e5968a23-4a9c-48f2-89c8-a4c7fcc03940/soscleaner-2012737258202097/var/log/cinder/volume.log-20150720:2015-07-14 16:21:30.640 6124 INFO cinder.volume.flows.manager
.create_volume [req-f76226f6-0616-43b9-b3a0-bde650065bae 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Volume volume-18632c7a-5cb5-42cc-b915-2e4b3359739e (18632c7a-5cb5-42cc-b915-2e4b33
59739e): created successfully
./cinderctl-002-soscleaner-2012737258202097.tar.gz-e5968a23-4a9c-48f2-89c8-a4c7fcc03940/soscleaner-2012737258202097/var/log/cinder/volume.log-20150720:2015-07-14 16:37:30.503 6124 INFO cinder.volume.flows.manager
.create_volume [req-56619a55-410e-42a6-8809-b8fab6e8bc2e 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Volume volume-94ff7284-a2cb-444c-9525-e406b96e49c5 (94ff7284-a2cb-444c-9525-e406b9
6e49c5): created successfully
./cinderctl-002-soscleaner-2012737258202097.tar.gz-e5968a23-4a9c-48f2-89c8-a4c7fcc03940/soscleaner-2012737258202097/var/log/cinder/volume.log-20150720:2015-07-14 16:53:26.178 6124 INFO cinder.volume.flows.manager
.create_volume [req-7bc3ffaa-081b-442e-acbf-d7da172ca54d 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Volume volume-14bb394a-0a33-4ffc-b464-1f740daf6a19 (14bb394a-0a33-4ffc-b464-1f740d
af6a19): created successfully
./cinderctl-002-soscleaner-2012737258202097.tar.gz-e5968a23-4a9c-48f2-89c8-a4c7fcc03940/soscleaner-2012737258202097/var/log/cinder/volume.log-20150720:2015-07-14 17:09:30.238 6124 INFO cinder.volume.flows.manager
.create_volume [req-5b8d6ac3-7f67-433a-a315-831904f520ba 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Volume volume-1cd90403-bed1-416e-96f3-1bf6bcc5cadf (1cd90403-bed1-416e-96f3-1bf6bc
c5cadf): created successfully
./cinderctl-002-soscleaner-2012737258202097.tar.gz-e5968a23-4a9c-48f2-89c8-a4c7fcc03940/soscleaner-2012737258202097/var/log/cinder/volume.log-20150720:2015-07-14 17:42:31.769 6124 INFO cinder.volume.flows.manager
.create_volume [req-cc35af77-fc65-48d6-a2bf-da5c72521956 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Volume volume-281bdc92-ffd5-47c4-93e6-422bee549a5e (281bdc92-ffd5-47c4-93e6-422bee
549a5e): created successfully
./cinderctl-002-soscleaner-2012737258202097.tar.gz-e5968a23-4a9c-48f2-89c8-a4c7fcc03940/soscleaner-2012737258202097/var/log/cinder/volume.log-20150720:2015-07-14 17:48:16.124 6124 INFO cinder.volume.manager [req-
d27335f6-360d-4fbc-bdb2-7f4a6eb14497 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] snapshot 7b61b1b6-0744-4d83-aafe-5df222d5f592: created successfully
./cinderctl-002-soscleaner-2012737258202097.tar.gz-e5968a23-4a9c-48f2-89c8-a4c7fcc03940/soscleaner-2012737258202097/var/log/cinder/volume.log-20150720:2015-07-14 17:58:39.199 6124 INFO cinder.volume.flows.manager
.create_volume [req-34025c04-536e-4152-82b2-e658e3a8d00b 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Volume volume-f8d75c93-f7fa-4627-917b-d35c21752dab (f8d75c93-f7fa-4627-917b-d35c21
752dab): created successfully

Taking a random request from the above log snippet, we can see it hit the scheduler on node 002 a couple of hundred milliseconds before it hit cinder-volume:

./cinderctl-002-soscleaner-2012737258202097.tar.gz-e5968a23-4a9c-48f2-89c8-a4c7fcc03940/soscleaner-2012737258202097/var/log/cinder/scheduler.log-20150720:2015-07-14 17:09:30.021 5983 DEBUG stevedore.extension [req-5b8d6ac3-7f67-433a-a315-831904f520ba 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] found extension EntryPoint.parse('default = taskflow.engines.action_engine.engine:SingleThreadedActionEngine') _load_plugins /usr/lib/python2.7/site-packages/stevedore/extension.py:156

The same request hit the cinder-api service at 17:09 as well:

./cinderctl-003-soscleaner-2990161794430296.tar.gz-68dfb6ab-5e85-4b1f-92b7-f092580cb02f/soscleaner-2990161794430296/var/log/cinder/api.log-20150719:2015-07-14 17:09:29.898 4672 DEBUG routes.middleware [req-5b8d6ac3-7f67-433a-a315-831904f520ba 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Matched POST /cad4174a384740318ecf5c456047a035/volumes __call__ /usr/lib/python2.7/site-packages/routes/middleware.py:100


I'll keep debugging.

Comment 13 Sergey Gotliv 2015-08-02 04:17:04 UTC
Jon,

You reproduced this issue a few months ago:

https://bugs.launchpad.net/cinder/+bug/1446682 - cinder-volume keeps opening Ceph clients until the maximum number of opened files reached

which is the upstream version of BZ#1219559 from the previous comments. Did Cinder stop consuming messages from the rabbit queue when that happened? Could LP#1446682 explain the problem described in this case?

Comment 14 Sergey Gotliv 2015-08-02 06:20:19 UTC
I see that node3 was trying to delete a few RBD volumes around 14:15, and I don't see any of those delete operations completing. That makes me think the RBD driver may be deleting volumes synchronously (Jon, do you remember that bug?): if the delete takes a long time, or gets stuck because RBD is slow or unresponsive, the service does not consume other messages from the queue in the meantime.

volume 7be4579b-a976-4c7f-8f81-678208b1b272: removing export delete_volume /usr/lib/python2.7/site-packages/cinder/volume/manager.py:399

volume 707b243f-9bc4-4597-968d-82a4a9fbb581: removing export delete_volume /usr/lib/python2.7/site-packages/cinder/volume/manager.py:399

volume 45694fbc-61a3-42a4-a6df-611d179ffc51: removing export delete_volume /usr/lib/python2.7/site-packages/cinder/volume/manager.py:399

volume cd9dd7ee-5ac0-4ddc-92a3-6aca92f2a869: removing export delete_volume /usr/lib/python2.7/site-packages/cinder/volume/manager.py:399

Comment 15 Sergey Gotliv 2015-08-02 09:30:30 UTC
I am wondering what the status and size of these volumes are. Is it possible that Cinder is stuck trying to delete them?

7be4579b-a976-4c7f-8f81-678208b1b272
707b243f-9bc4-4597-968d-82a4a9fbb581
45694fbc-61a3-42a4-a6df-611d179ffc51
cd9dd7ee-5ac0-4ddc-92a3-6aca92f2a869

Comment 16 Sergey Gotliv 2015-08-02 13:29:04 UTC
This reminds me a bit of this bug: https://launchpad.net/bugs/1401335

Comment 17 Jon Bernard 2015-08-03 17:06:14 UTC
I don't *think* leaking file handles is the issue here, given the description of what's happening, but you could always verify using 'lsof -np <pid>' on the child process of cinder-volume.
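
If lsof is awkward to run in their environment, a rough equivalent is to count the entries under /proc (a Linux-only sketch; the PID argument would be the cinder-volume child process):

  # Counts open file descriptors for a given PID via /proc (Linux only).
  # Compare the result against the process's nofile limit to judge
  # whether fd exhaustion is plausible.
  import os
  import sys

  pid = sys.argv[1]
  fds = os.listdir('/proc/%s/fd' % pid)
  print('%d file descriptors open by pid %s' % (len(fds), pid))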

It is possible that deleting large volumes could cause a cinder-volume process to become unresponsive. Large volume deletion can take a significant amount of time, and until recently this work was initiated from the main cinder-volume context, which would cause it to block until the deletion completed. That sounds consistent with the behaviour being described.
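
To illustrate the failure mode (a sketch only, not the actual Cinder code or its eventual fix): under eventlet, a long blocking call made directly in the worker greenthread stops the whole process, including its AMQP consumer, until the call returns; pushing the blocking work into a native thread with eventlet.tpool keeps the consumer alive.

  # Sketch only: _slow_backend_delete stands in for the blocking driver
  # call (e.g. removing a large rbd image), which can run for a long time.
  import eventlet
  eventlet.monkey_patch()

  from eventlet import tpool

  def _slow_backend_delete(volume_name):
      # placeholder for the real blocking backend call
      pass

  def delete_volume_blocking(volume_name):
      # runs in the service greenthread: nothing else in the process,
      # including the RabbitMQ consumer, makes progress until it returns
      _slow_backend_delete(volume_name)

  def delete_volume_offloaded(volume_name):
      # same work executed in a real OS thread; other greenthreads keep
      # consuming and processing messages while the delete runs
      tpool.execute(_slow_backend_delete, volume_name)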

Comment 18 Flavio Percoco 2015-08-04 08:43:38 UTC
I took one of the volumes mentioned by Sergey in his comments (7be4579b-a976-4c7f-8f81-678208b1b272) and looked at the volume logs to see whether the delete had completed. Surprisingly, I found several delete requests in the logs for this volume, and they are on different nodes.

Node 003 got a delete request at 14:15 and, despite logging that the deletion process had started, never reported completion. However, when cinderctl-002 got the request at 22:26, it did process it and reported success at 22:29:02.437. The things that seem odd here are:

1. Two different nodes got a delete request for the same volume, which, in theory, shouldn't happen.
2. Node 002 got a second delete operation and reported that the volume didn't exist.
3. Node 003 never reported the end of the operation.

Logs from cinderctl-003

./cinderctl-003-soscleaner-1150731481727834.tar.gz-69336f0b-e6c6-4198-97b1-960b7a19da7b/soscleaner-1150731481727834/var/log/cinder/volume.log-20150719:2015-07-14 14:15:41.081 4659 DEBUG cinder.openstack.common.lo
ckutils [req-94f052ab-1344-4ab6-9b24-601bab772e2a 34f31fc86bed406797298307fb8463d2 68e7a813ea7e456b8f94f3f2cd4d4ba3 - - -] Got semaphore "7be4579b-a976-4c7f-8f81-678208b1b272-delete_volume" for method "lvo_inner2
"... inner /usr/lib/python2.7/site-packages/cinder/openstack/common/lockutils.py:191
./cinderctl-003-soscleaner-1150731481727834.tar.gz-69336f0b-e6c6-4198-97b1-960b7a19da7b/soscleaner-1150731481727834/var/log/cinder/volume.log-20150719:2015-07-14 14:15:41.081 4659 DEBUG cinder.openstack.common.lo
ckutils [req-94f052ab-1344-4ab6-9b24-601bab772e2a 34f31fc86bed406797298307fb8463d2 68e7a813ea7e456b8f94f3f2cd4d4ba3 - - -] Attempting to grab file lock "7be4579b-a976-4c7f-8f81-678208b1b272-delete_volume" for met
hod "lvo_inner2"... inner /usr/lib/python2.7/site-packages/cinder/openstack/common/lockutils.py:202
./cinderctl-003-soscleaner-1150731481727834.tar.gz-69336f0b-e6c6-4198-97b1-960b7a19da7b/soscleaner-1150731481727834/var/log/cinder/volume.log-20150719:2015-07-14 14:15:41.081 4659 DEBUG cinder.openstack.common.lo
ckutils [req-94f052ab-1344-4ab6-9b24-601bab772e2a 34f31fc86bed406797298307fb8463d2 68e7a813ea7e456b8f94f3f2cd4d4ba3 - - -] Got file lock "7be4579b-a976-4c7f-8f81-678208b1b272-delete_volume" at /var/lib/cinder/tmp
/cinder-7be4579b-a976-4c7f-8f81-678208b1b272-delete_volume for method "lvo_inner2"... inner /usr/lib/python2.7/site-packages/cinder/openstack/common/lockutils.py:232
./cinderctl-003-soscleaner-1150731481727834.tar.gz-69336f0b-e6c6-4198-97b1-960b7a19da7b/soscleaner-1150731481727834/var/log/cinder/volume.log-20150719:2015-07-14 14:15:41.149 4659 INFO cinder.volume.manager [req-
94f052ab-1344-4ab6-9b24-601bab772e2a 34f31fc86bed406797298307fb8463d2 68e7a813ea7e456b8f94f3f2cd4d4ba3 - - -] volume 7be4579b-a976-4c7f-8f81-678208b1b272: deleting
./cinderctl-003-soscleaner-1150731481727834.tar.gz-69336f0b-e6c6-4198-97b1-960b7a19da7b/soscleaner-1150731481727834/var/log/cinder/volume.log-20150719:2015-07-14 14:15:41.245 4659 DEBUG cinder.volume.manager [req
-94f052ab-1344-4ab6-9b24-601bab772e2a 34f31fc86bed406797298307fb8463d2 68e7a813ea7e456b8f94f3f2cd4d4ba3 - - -] volume 7be4579b-a976-4c7f-8f81-678208b1b272: removing export delete_volume /usr/lib/python2.7/site-pa
ckages/cinder/volume/manager.py:399
./cinderctl-003-soscleaner-1150731481727834.tar.gz-69336f0b-e6c6-4198-97b1-960b7a19da7b/soscleaner-1150731481727834/var/log/cinder/volume.log-20150719:2015-07-14 14:15:41.245 4659 DEBUG cinder.volume.manager [req
-94f052ab-1344-4ab6-9b24-601bab772e2a 34f31fc86bed406797298307fb8463d2 68e7a813ea7e456b8f94f3f2cd4d4ba3 - - -] volume 7be4579b-a976-4c7f-8f81-678208b1b272: deleting delete_volume /usr/lib/python2.7/site-packages/
cinder/volume/manager.py:401
./cinderctl-003-soscleaner-1150731481727834.tar.gz-69336f0b-e6c6-4198-97b1-960b7a19da7b/soscleaner-1150731481727834/var/log/cinder/volume.log-20150719:2015-07-14 14:15:41.412 4659 DEBUG cinder.volume.drivers.rbd 
[req-94f052ab-1344-4ab6-9b24-601bab772e2a 34f31fc86bed406797298307fb8463d2 68e7a813ea7e456b8f94f3f2cd4d4ba3 - - -] volume volume-7be4579b-a976-4c7f-8f81-678208b1b272 is not a clone _get_clone_info /usr/lib/python
2.7/site-packages/cinder/volume/drivers/rbd.py:556
./cinderctl-003-soscleaner-1150731481727834.tar.gz-69336f0b-e6c6-4198-97b1-960b7a19da7b/soscleaner-1150731481727834/var/log/cinder/volume.log-20150719:2015-07-14 14:15:41.415 4659 DEBUG cinder.volume.drivers.rbd 
[req-94f052ab-1344-4ab6-9b24-601bab772e2a 34f31fc86bed406797298307fb8463d2 68e7a813ea7e456b8f94f3f2cd4d4ba3 - - -] deleting rbd volume volume-7be4579b-a976-4c7f-8f81-678208b1b272 delete_volume /usr/lib/python2.7/
site-packages/cinder/volume/drivers/rbd.py:633
./cinderctl-003-soscleaner-1150731481727834.tar.gz-69336f0b-e6c6-4198-97b1-960b7a19da7b/soscleaner-1150731481727834/var/log/cinder/volume.log-20150719:2015-07-14 22:28:49.735 19300 INFO cinder.volume.manager [req
-2b5f577e-d8b0-4651-903f-acece0d542e0 - - - - -] volume 7be4579b-a976-4c7f-8f81-678208b1b272: skipping export


Logs from cinderctl-002

./cinderctl-002-soscleaner-8542214154554249.tar.gz-629c22b7-ce14-408a-8da9-d6d00fcac003/soscleaner-8542214154554249/var/log/cinder/volume.log-20150720:2015-07-14 22:26:40.733 11716 INFO cinder.volume.manager [req
-b2d1fab1-bead-4dff-9a48-4d81abbc7dfa - - - - -] volume 7be4579b-a976-4c7f-8f81-678208b1b272: skipping export
./cinderctl-002-soscleaner-8542214154554249.tar.gz-629c22b7-ce14-408a-8da9-d6d00fcac003/soscleaner-8542214154554249/var/log/cinder/volume.log-20150720:2015-07-14 22:26:56.193 11870 INFO cinder.volume.manager [req
-e0e73134-2faf-4132-813f-88350d4eb72c - - - - -] volume 7be4579b-a976-4c7f-8f81-678208b1b272: skipping export
./cinderctl-002-soscleaner-8542214154554249.tar.gz-629c22b7-ce14-408a-8da9-d6d00fcac003/soscleaner-8542214154554249/var/log/cinder/volume.log-20150720:2015-07-14 22:28:47.776 11870 DEBUG cinder.openstack.common.l
ockutils [req-f05e1722-ecba-4500-9585-bb2b64810b58 7f84af160bec4db68ae60d702d97989f d9c4536810714f0db89d8998cc257d5a - - -] Got semaphore "7be4579b-a976-4c7f-8f81-678208b1b272-delete_volume" for method "lvo_inner
2"... inner /usr/lib/python2.7/site-packages/cinder/openstack/common/lockutils.py:191
./cinderctl-002-soscleaner-8542214154554249.tar.gz-629c22b7-ce14-408a-8da9-d6d00fcac003/soscleaner-8542214154554249/var/log/cinder/volume.log-20150720:2015-07-14 22:28:47.777 11870 DEBUG cinder.openstack.common.l
ockutils [req-f05e1722-ecba-4500-9585-bb2b64810b58 7f84af160bec4db68ae60d702d97989f d9c4536810714f0db89d8998cc257d5a - - -] Attempting to grab file lock "7be4579b-a976-4c7f-8f81-678208b1b272-delete_volume" for me
thod "lvo_inner2"... inner /usr/lib/python2.7/site-packages/cinder/openstack/common/lockutils.py:202
./cinderctl-002-soscleaner-8542214154554249.tar.gz-629c22b7-ce14-408a-8da9-d6d00fcac003/soscleaner-8542214154554249/var/log/cinder/volume.log-20150720:2015-07-14 22:28:47.777 11870 DEBUG cinder.openstack.common.l
ockutils [req-f05e1722-ecba-4500-9585-bb2b64810b58 7f84af160bec4db68ae60d702d97989f d9c4536810714f0db89d8998cc257d5a - - -] Got file lock "7be4579b-a976-4c7f-8f81-678208b1b272-delete_volume" at /var/lib/cinder/tm
p/cinder-7be4579b-a976-4c7f-8f81-678208b1b272-delete_volume for method "lvo_inner2"... inner /usr/lib/python2.7/site-packages/cinder/openstack/common/lockutils.py:232
./cinderctl-002-soscleaner-8542214154554249.tar.gz-629c22b7-ce14-408a-8da9-d6d00fcac003/soscleaner-8542214154554249/var/log/cinder/volume.log-20150720:2015-07-14 22:28:49.052 11870 INFO cinder.volume.manager [req
-f05e1722-ecba-4500-9585-bb2b64810b58 7f84af160bec4db68ae60d702d97989f d9c4536810714f0db89d8998cc257d5a - - -] volume 7be4579b-a976-4c7f-8f81-678208b1b272: deleting
./cinderctl-002-soscleaner-8542214154554249.tar.gz-629c22b7-ce14-408a-8da9-d6d00fcac003/soscleaner-8542214154554249/var/log/cinder/volume.log-20150720:2015-07-14 22:28:50.189 11870 DEBUG cinder.volume.manager [re
q-f05e1722-ecba-4500-9585-bb2b64810b58 7f84af160bec4db68ae60d702d97989f d9c4536810714f0db89d8998cc257d5a - - -] volume 7be4579b-a976-4c7f-8f81-678208b1b272: removing export delete_volume /usr/lib/python2.7/site-p
ackages/cinder/volume/manager.py:399
./cinderctl-002-soscleaner-8542214154554249.tar.gz-629c22b7-ce14-408a-8da9-d6d00fcac003/soscleaner-8542214154554249/var/log/cinder/volume.log-20150720:2015-07-14 22:28:50.189 11870 DEBUG cinder.volume.manager [re
q-f05e1722-ecba-4500-9585-bb2b64810b58 7f84af160bec4db68ae60d702d97989f d9c4536810714f0db89d8998cc257d5a - - -] volume 7be4579b-a976-4c7f-8f81-678208b1b272: deleting delete_volume /usr/lib/python2.7/site-packages
/cinder/volume/manager.py:401
./cinderctl-002-soscleaner-8542214154554249.tar.gz-629c22b7-ce14-408a-8da9-d6d00fcac003/soscleaner-8542214154554249/var/log/cinder/volume.log-20150720:2015-07-14 22:28:52.894 11870 INFO cinder.volume.drivers.rbd 
[req-f05e1722-ecba-4500-9585-bb2b64810b58 7f84af160bec4db68ae60d702d97989f d9c4536810714f0db89d8998cc257d5a - - -] volume volume-7be4579b-a976-4c7f-8f81-678208b1b272 no longer exists in backend
./cinderctl-002-soscleaner-8542214154554249.tar.gz-629c22b7-ce14-408a-8da9-d6d00fcac003/soscleaner-8542214154554249/var/log/cinder/volume.log-20150720:2015-07-14 22:29:02.437 11870 INFO cinder.volume.manager [req
-f05e1722-ecba-4500-9585-bb2b64810b58 7f84af160bec4db68ae60d702d97989f d9c4536810714f0db89d8998cc257d5a - - -] volume 7be4579b-a976-4c7f-8f81-678208b1b272: deleted successfully
./cinderctl-002-soscleaner-8542214154554249.tar.gz-629c22b7-ce14-408a-8da9-d6d00fcac003/soscleaner-8542214154554249/var/log/cinder/volume.log-20150720:2015-07-14 22:29:03.006 11870 DEBUG cinder.openstack.common.lockutils [req-f05e1722ecba-4500-9585-bb2b64810b58 7f84af160bec4db68ae60d702d97989f d9c4536810714f0db89d8998cc257d5a - - -] Released file lock "7be4579b-a976-4c7f-8f81-678208b1b272-delete_volume" at /var/lib/cind
er/tmp/cinder-7be4579b-a976-4c7f-8f81-678208b1b272-delete_volume for method "lvo_inner2"... inner /usr/lib/python2.7/site-packages/cinder/openstack/common/lockutils.py:239

Comment 19 Flavio Percoco 2015-08-04 09:19:48 UTC
Can we also get more information on what the HA architecture (if any) looks like in this deployment?

My last comment makes me think that, as John also mentioned in c#4, there could've been a partition in the rabbit cluster that caused this.

Comment 22 Jon Bernard 2015-08-04 21:15:06 UTC
Looking at the Cinder configuration files on each of the nodes in the sos report, they're all using the same 'host=' value, presumably to allow for a level of cinder-volume redundancy in the event of a failure. Unfortunately, this configuration opens them up to many race conditions, since this setup was not planned for and the code does not protect against such races.

This is something that really should be changed.  I don't yet have a clear understanding of why things are failing in the way described here, but I wanted to mention the 'host=' setting.
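
For reference, judging from the queue names earlier in this bug, the pattern in question would look roughly like this in cinder.conf on every controller (the enabled_backends line is inferred from the @generic/@solidfire queue suffixes and may not match their exact configuration):

  [DEFAULT]
  # every node advertises the same service host, so they all consume from
  # the same cinder-volume.us-texas-3.cloud.cisco.com@<backend> queues
  host = us-texas-3.cloud.cisco.com
  enabled_backends = generic,solidfire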

Comment 23 Flavio Percoco 2015-08-05 07:33:07 UTC
Josh, could you take a look at the Ceph logs? Ceph seems to be stuck (just a guess), but it's not clear why that would happen.

Comment 24 Josh Durgin 2015-08-06 01:00:14 UTC
There's nothing in the ceph logs to indicate that the cluster was unhealthy at any point, so the problem seems to be client-side.

The sos reports include some lsof output which shows very few unix sockets open (6 of 301 total held by cinder-volume), so the fd limit is likely not being hit at all, and this case is unrelated to BZ#1219559.

From the earlier comments about deletes in the logs, I'd agree that the most likely explanation is https://bugs.launchpad.net/cinder/+bug/1401335, as mentioned earlier. Deletes can take a long time, and with enough threads stuck deleting, messages do end up piling up in rabbit.

Comment 26 Sergey Gotliv 2015-08-06 18:47:26 UTC
Jack,

Josh and Jon are working on steps to collect additional information from the customer environment before the next restart. In the meantime, please ask the customer to provide an sos report. As you can see from the previous comments, we suspect that Cinder is stuck on deleting; let's see what happened this time.

Comment 27 Jon Bernard 2015-08-06 21:41:13 UTC
Jack,

To enable Ceph client logging, add the following to /etc/ceph/ceph.conf on the nodes
running cinder-volume, in the [client] section:

  [client]
  debug ms = 1
  debug rbd = 20
  log file = /var/log/cinder/rbd-$pid-$cctid.log

There's no need to restart anything. New requests for rbd volumes
should start producing log files under /var/log/cinder.

Josh looked at enabling eventlet logging, but suspects it would be too verbose to be useful unless we can find a more easily reproducible test case.

I also looked at using gdb to print the thread call traces, but that requires a recent gdb build with the Python extensions enabled, so we may want to consider that next; I'm not sure how much time the customer can put into this, though.

Anything you can do to further narrow the failure down into a test case that we could attempt to recreate would be most helpful - easier said than done :)

Hopefully the ceph client logging will help us determine if long-running deletes are involved, or if it's something different entirely.

Comment 28 Sergey Gotliv 2015-08-15 10:05:20 UTC
Jack,

There has been a NEEDINFO on you for more than a week. Any updates from the customer?

Comment 30 Jon Bernard 2015-08-17 13:36:31 UTC
Jack, it would be useful to see the logs from my previous comment.  If they are exhausting the file-max limit, we would see that reflected in the logs, and if something else were occurring, we have a reasonable chance of catching that too.  The behaviour described in this bug doesn't match what I would expect from file handle exhaustion - we'd see errors immediately and not 8-hour migration timeouts - so I have some doubts about the current theory; it could be that they're just getting lucky for the moment.

Comment 31 Sergey Gotliv 2015-08-25 09:05:15 UTC
Jack,

Please reopen this case if you get new information from the customer or if this issue happens again. We have two theories right now (see the previous comments), but without Ceph client debug logs we can't be sure.