Bug 1247302
| Summary: | cinder-volume is not getting requests submitted to the API | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Jack Waterworth <jwaterwo> |
| Component: | openstack-cinder | Assignee: | Jon Bernard <jobernar> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | nlevinki <nlevinki> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 5.0 (RHEL 7) | CC: | chhudson, dshetty, eharney, fpercoco, jbiao, jdurgin, jeckersb, jwaterwo, lyarwood, scohen, sgotliv, yeylon |
| Target Milestone: | --- | Keywords: | Unconfirmed, ZStream |
| Target Release: | 5.0 (RHEL 7) | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-08-25 09:05:15 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Jack Waterworth, 2015-07-27 18:08:52 UTC
One thing I'm currently missing is how this issue was debugged in-house. That is to say, how was it determined that cinder-volume was not getting the requests and that the volumes were not created? Taking a random request from the ones listed in the description, I can see it going from cinder-api to cinder-scheduler and then to cinder-volume, and the process ended with the volume being created successfully.

Request: req-00ca8da2-d2d4-4f97-b1dd-1d5ce65268ab

./cinderlogs.zip-0d5f7655-c275-455a-9847-4858a225d09c/cinderlogs-001/volume.log:2015-07-14 22:29:07.982 1970 INFO cinder.volume.flows.manager.create_volume [req-00ca8da2-d2d4-4f97-b1dd-1d5ce65268ab 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Volume volume-6576abea-d575-48b0-ac64-f5b06776fc23 (6576abea-d575-48b0-ac64-f5b06776fc23): created successfully

It'd be useful to know the list of 'stuck' volumes in order to debug the problematic cases rather than the whole system. If we don't have that list, it'd be useful to know how it was determined that some volumes were not created.

According to the API, the request you referenced came in at 16:29:30, six hours before the volume was actually created:

2015-07-14 16:29:30.707 4672 AUDIT cinder.api.v1.volumes [req-00ca8da2-d2d4-4f97-b1dd-1d5ce65268ab 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Create volume of 1 GB

The customer complains that Cinder did not create volumes between 16:00 and 22:00. I suspect the request was stuck in the volume queue until the cinder-volume service was restarted; at that point, everything was flushed and created as expected.

Could you show me where you see this in the scheduler? I checked the logs but did not see that information.

Jack, good observation about the 6-hour difference. However, I'm still quite confused: the bug description says that no volumes were created between 2pm and 10pm.
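The six-hour gap above was found by comparing the first appearance of the same request ID in api.log and volume.log. A minimal sketch of that check (a hypothetical helper, not part of cinder's tooling; the sample lines are shortened versions of the excerpts above):

```python
import re
from datetime import datetime

# Cinder log lines carry a "YYYY-MM-DD HH:MM:SS" timestamp near the start.
TS = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def first_seen(lines, request_id):
    """Return the timestamp of the first line mentioning request_id, or None."""
    for line in lines:
        if request_id in line:
            m = TS.search(line)
            if m:
                return datetime.strptime(m.group(0), "%Y-%m-%d %H:%M:%S")
    return None

# Shortened sample lines modeled on the log excerpts above.
api_log = ["2015-07-14 16:29:30.707 4672 AUDIT cinder.api.v1.volumes [req-00ca8da2 ...] Create volume of 1 GB"]
vol_log = ["2015-07-14 22:29:07.982 1970 INFO ...create_volume [req-00ca8da2 ...] created successfully"]

lag = first_seen(vol_log, "req-00ca8da2") - first_seen(api_log, "req-00ca8da2")
print(lag)  # just under six hours for this request
```

Run over the real sos-report logs, the same comparison would flag every request whose API-to-volume lag is abnormally large.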
However, I can see in cinder-volume.log - on the same node where the test volumes weren't created - that volumes were created during that time. My guess right now is that there were indeed some communication issues between the rabbit nodes and cinder-volume. I'm going to ask eck to take another look at that while I keep debugging this issue on the Cinder side. I don't think the volume node ran out of capacity; otherwise the scheduler wouldn't have picked it.

./cinderctl-002-soscleaner-2012737258202097.tar.gz-e5968a23-4a9c-48f2-89c8-a4c7fcc03940/soscleaner-2012737258202097/var/log/cinder/volume.log-20150720:2015-07-14 15:49:24.600 6124 INFO cinder.volume.flows.manager.create_volume [req-af02b00f-bfef-4ebc-93b6-7691dcb1ef3d 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Volume volume-b5029977-9347-456d-8403-154a51f359b3 (b5029977-9347-456d-8403-154a51f359b3): created successfully
./cinderctl-002-soscleaner-2012737258202097.tar.gz-e5968a23-4a9c-48f2-89c8-a4c7fcc03940/soscleaner-2012737258202097/var/log/cinder/volume.log-20150720:2015-07-14 16:05:26.158 6124 INFO cinder.volume.flows.manager.create_volume [req-2ec60af8-88b6-4a3e-ad71-029e38f98013 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Volume volume-221d5572-e984-4e47-bce8-ae895b7ddf6e (221d5572-e984-4e47-bce8-ae895b7ddf6e): created successfully
./cinderctl-002-soscleaner-2012737258202097.tar.gz-e5968a23-4a9c-48f2-89c8-a4c7fcc03940/soscleaner-2012737258202097/var/log/cinder/volume.log-20150720:2015-07-14 16:21:30.640 6124 INFO cinder.volume.flows.manager.create_volume [req-f76226f6-0616-43b9-b3a0-bde650065bae 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Volume volume-18632c7a-5cb5-42cc-b915-2e4b3359739e (18632c7a-5cb5-42cc-b915-2e4b3359739e): created successfully
./cinderctl-002-soscleaner-2012737258202097.tar.gz-e5968a23-4a9c-48f2-89c8-a4c7fcc03940/soscleaner-2012737258202097/var/log/cinder/volume.log-20150720:2015-07-14 16:37:30.503 6124 INFO cinder.volume.flows.manager.create_volume [req-56619a55-410e-42a6-8809-b8fab6e8bc2e 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Volume volume-94ff7284-a2cb-444c-9525-e406b96e49c5 (94ff7284-a2cb-444c-9525-e406b96e49c5): created successfully
./cinderctl-002-soscleaner-2012737258202097.tar.gz-e5968a23-4a9c-48f2-89c8-a4c7fcc03940/soscleaner-2012737258202097/var/log/cinder/volume.log-20150720:2015-07-14 16:53:26.178 6124 INFO cinder.volume.flows.manager.create_volume [req-7bc3ffaa-081b-442e-acbf-d7da172ca54d 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Volume volume-14bb394a-0a33-4ffc-b464-1f740daf6a19 (14bb394a-0a33-4ffc-b464-1f740daf6a19): created successfully
./cinderctl-002-soscleaner-2012737258202097.tar.gz-e5968a23-4a9c-48f2-89c8-a4c7fcc03940/soscleaner-2012737258202097/var/log/cinder/volume.log-20150720:2015-07-14 17:09:30.238 6124 INFO cinder.volume.flows.manager.create_volume [req-5b8d6ac3-7f67-433a-a315-831904f520ba 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Volume volume-1cd90403-bed1-416e-96f3-1bf6bcc5cadf (1cd90403-bed1-416e-96f3-1bf6bcc5cadf): created successfully
./cinderctl-002-soscleaner-2012737258202097.tar.gz-e5968a23-4a9c-48f2-89c8-a4c7fcc03940/soscleaner-2012737258202097/var/log/cinder/volume.log-20150720:2015-07-14 17:42:31.769 6124 INFO cinder.volume.flows.manager.create_volume [req-cc35af77-fc65-48d6-a2bf-da5c72521956 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Volume volume-281bdc92-ffd5-47c4-93e6-422bee549a5e (281bdc92-ffd5-47c4-93e6-422bee549a5e): created successfully
./cinderctl-002-soscleaner-2012737258202097.tar.gz-e5968a23-4a9c-48f2-89c8-a4c7fcc03940/soscleaner-2012737258202097/var/log/cinder/volume.log-20150720:2015-07-14 17:48:16.124 6124 INFO cinder.volume.manager [req-d27335f6-360d-4fbc-bdb2-7f4a6eb14497 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] snapshot 7b61b1b6-0744-4d83-aafe-5df222d5f592: created successfully
./cinderctl-002-soscleaner-2012737258202097.tar.gz-e5968a23-4a9c-48f2-89c8-a4c7fcc03940/soscleaner-2012737258202097/var/log/cinder/volume.log-20150720:2015-07-14 17:58:39.199 6124 INFO cinder.volume.flows.manager.create_volume [req-34025c04-536e-4152-82b2-e658e3a8d00b 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Volume volume-f8d75c93-f7fa-4627-917b-d35c21752dab (f8d75c93-f7fa-4627-917b-d35c21752dab): created successfully

Taking a random request from the above log snippet, we can see it hit the scheduler on node 002 a couple of ms before it hit cinder-volume:

./cinderctl-002-soscleaner-2012737258202097.tar.gz-e5968a23-4a9c-48f2-89c8-a4c7fcc03940/soscleaner-2012737258202097/var/log/cinder/scheduler.log-20150720:2015-07-14 17:09:30.021 5983 DEBUG stevedore.extension [req-5b8d6ac3-7f67-433a-a315-831904f520ba 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] found extension EntryPoint.parse('default = taskflow.engines.action_engine.engine:SingleThreadedActionEngine') _load_plugins /usr/lib/python2.7/site-packages/stevedore/extension.py:156

The same request hit the cinder-api service at 17:09 as well:

./cinderctl-003-soscleaner-2990161794430296.tar.gz-68dfb6ab-5e85-4b1f-92b7-f092580cb02f/soscleaner-2990161794430296/var/log/cinder/api.log-20150719:2015-07-14 17:09:29.898 4672 DEBUG routes.middleware [req-5b8d6ac3-7f67-433a-a315-831904f520ba 00922c14ea054b28b595255da3ed72cc cad4174a384740318ecf5c456047a035 - - -] Matched POST /cad4174a384740318ecf5c456047a035/volumes __call__ /usr/lib/python2.7/site-packages/routes/middleware.py:100

I'll keep debugging.

Jon, you reproduced this issue a few months ago: https://bugs.launchpad.net/cinder/+bug/1446682 - cinder-volume keeps opening Ceph clients until the maximum number of open files is reached - which is the upstream version of BZ#1219559 from the previous comments. Did Cinder stop consuming messages from the rabbit queue when that happened? Can LP#1446682 explain the problem described in this case?

I see that node 003 is trying to delete a few rbd volumes around 14:15, and I don't see that any of these delete operations completed, which makes me think that maybe the RBD driver is deleting volumes synchronously (Jon, do you remember that bug?). Either it takes time, or it's stuck because RBD is slow or non-responsive, so in the meantime it doesn't consume other messages from the queue.

volume 7be4579b-a976-4c7f-8f81-678208b1b272: removing export delete_volume /usr/lib/python2.7/site-packages/cinder/volume/manager.py:399
volume 707b243f-9bc4-4597-968d-82a4a9fbb581: removing export delete_volume /usr/lib/python2.7/site-packages/cinder/volume/manager.py:399
volume 45694fbc-61a3-42a4-a6df-611d179ffc51: removing export delete_volume /usr/lib/python2.7/site-packages/cinder/volume/manager.py:399
volume cd9dd7ee-5ac0-4ddc-92a3-6aca92f2a869: removing export delete_volume /usr/lib/python2.7/site-packages/cinder/volume/manager.py:399

I am wondering what the status and size of these volumes is - is it possible that Cinder is stuck trying to delete them?

7be4579b-a976-4c7f-8f81-678208b1b272
707b243f-9bc4-4597-968d-82a4a9fbb581
45694fbc-61a3-42a4-a6df-611d179ffc51
cd9dd7ee-5ac0-4ddc-92a3-6aca92f2a869

Remind me of that bug a little: https://launchpad.net/bugs/1401335

I don't *think* the leaking of file handles is the issue here, given the description of what's happening, but you could always verify using 'lsof -np <pid>' on the child process of cinder-volume. It is possible that the deletion of large volumes could cause a cinder-volume process to be unresponsive. Large-volume deletion can take a significant amount of time, and until recently this work was initiated from the main cinder-volume context, which would cause it to block until completed.
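The blocking behaviour described above can be sketched with a toy single-threaded consumer (illustrative only - this is not cinder's actual code): while a slow synchronous delete runs, nothing else is pulled off the queue, so creates queued behind it appear stuck.

```python
import queue
import time

def volume_service(msgs, log, slow_delete=0.2):
    """Single-threaded message loop: a synchronous delete blocks everything behind it."""
    while not msgs.empty():
        op, vol = msgs.get()
        if op == "delete":
            time.sleep(slow_delete)  # stand-in for a long-running RBD delete
        log.append((time.monotonic(), op, vol))

msgs = queue.Queue()
msgs.put(("delete", "vol-a"))  # large, slow delete arrives first
msgs.put(("create", "vol-b"))  # create request queued behind it
log = []
start = time.monotonic()
volume_service(msgs, log)
# vol-b is only created after the slow delete completes.
created_at = next(t for t, op, v in log if op == "create") - start
```

With the delete work moved off the main service context (the change the comment above says happened "until recently"), the create would no longer have to wait on the delete.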
This sounds consistent with the behaviour being described.

I took one of the cases mentioned by Sergey in his comments (7be4579b-a976-4c7f-8f81-678208b1b272) and looked at the volume logs to see whether the delete had completed or not. Surprisingly, I found that there are several delete requests in the logs for this volume, and they are on different nodes. Node 003 got a delete request at 14:15 and, despite logging that the deletion process started, it never reported completion. However, when cinderctl-002 got the request at 22:26, it did process it and reported the success of the operation at 22:29:02.437.

The things that seem weird here are:

1. Two different nodes got a delete request for the same volume, which in theory shouldn't happen.
2. Node 002 got a second delete operation and reported that the volume didn't exist.
3. Node 003 never reported the end of the operation.

Logs from cinderctl-003:

./cinderctl-003-soscleaner-1150731481727834.tar.gz-69336f0b-e6c6-4198-97b1-960b7a19da7b/soscleaner-1150731481727834/var/log/cinder/volume.log-20150719:2015-07-14 14:15:41.081 4659 DEBUG cinder.openstack.common.lockutils [req-94f052ab-1344-4ab6-9b24-601bab772e2a 34f31fc86bed406797298307fb8463d2 68e7a813ea7e456b8f94f3f2cd4d4ba3 - - -] Got semaphore "7be4579b-a976-4c7f-8f81-678208b1b272-delete_volume" for method "lvo_inner2"... inner /usr/lib/python2.7/site-packages/cinder/openstack/common/lockutils.py:191
./cinderctl-003-soscleaner-1150731481727834.tar.gz-69336f0b-e6c6-4198-97b1-960b7a19da7b/soscleaner-1150731481727834/var/log/cinder/volume.log-20150719:2015-07-14 14:15:41.081 4659 DEBUG cinder.openstack.common.lockutils [req-94f052ab-1344-4ab6-9b24-601bab772e2a 34f31fc86bed406797298307fb8463d2 68e7a813ea7e456b8f94f3f2cd4d4ba3 - - -] Attempting to grab file lock "7be4579b-a976-4c7f-8f81-678208b1b272-delete_volume" for method "lvo_inner2"... inner /usr/lib/python2.7/site-packages/cinder/openstack/common/lockutils.py:202
./cinderctl-003-soscleaner-1150731481727834.tar.gz-69336f0b-e6c6-4198-97b1-960b7a19da7b/soscleaner-1150731481727834/var/log/cinder/volume.log-20150719:2015-07-14 14:15:41.081 4659 DEBUG cinder.openstack.common.lockutils [req-94f052ab-1344-4ab6-9b24-601bab772e2a 34f31fc86bed406797298307fb8463d2 68e7a813ea7e456b8f94f3f2cd4d4ba3 - - -] Got file lock "7be4579b-a976-4c7f-8f81-678208b1b272-delete_volume" at /var/lib/cinder/tmp/cinder-7be4579b-a976-4c7f-8f81-678208b1b272-delete_volume for method "lvo_inner2"... inner /usr/lib/python2.7/site-packages/cinder/openstack/common/lockutils.py:232
./cinderctl-003-soscleaner-1150731481727834.tar.gz-69336f0b-e6c6-4198-97b1-960b7a19da7b/soscleaner-1150731481727834/var/log/cinder/volume.log-20150719:2015-07-14 14:15:41.149 4659 INFO cinder.volume.manager [req-94f052ab-1344-4ab6-9b24-601bab772e2a 34f31fc86bed406797298307fb8463d2 68e7a813ea7e456b8f94f3f2cd4d4ba3 - - -] volume 7be4579b-a976-4c7f-8f81-678208b1b272: deleting
./cinderctl-003-soscleaner-1150731481727834.tar.gz-69336f0b-e6c6-4198-97b1-960b7a19da7b/soscleaner-1150731481727834/var/log/cinder/volume.log-20150719:2015-07-14 14:15:41.245 4659 DEBUG cinder.volume.manager [req-94f052ab-1344-4ab6-9b24-601bab772e2a 34f31fc86bed406797298307fb8463d2 68e7a813ea7e456b8f94f3f2cd4d4ba3 - - -] volume 7be4579b-a976-4c7f-8f81-678208b1b272: removing export delete_volume /usr/lib/python2.7/site-packages/cinder/volume/manager.py:399
./cinderctl-003-soscleaner-1150731481727834.tar.gz-69336f0b-e6c6-4198-97b1-960b7a19da7b/soscleaner-1150731481727834/var/log/cinder/volume.log-20150719:2015-07-14 14:15:41.245 4659 DEBUG cinder.volume.manager [req-94f052ab-1344-4ab6-9b24-601bab772e2a 34f31fc86bed406797298307fb8463d2 68e7a813ea7e456b8f94f3f2cd4d4ba3 - - -] volume 7be4579b-a976-4c7f-8f81-678208b1b272: deleting delete_volume /usr/lib/python2.7/site-packages/cinder/volume/manager.py:401
./cinderctl-003-soscleaner-1150731481727834.tar.gz-69336f0b-e6c6-4198-97b1-960b7a19da7b/soscleaner-1150731481727834/var/log/cinder/volume.log-20150719:2015-07-14 14:15:41.412 4659 DEBUG cinder.volume.drivers.rbd [req-94f052ab-1344-4ab6-9b24-601bab772e2a 34f31fc86bed406797298307fb8463d2 68e7a813ea7e456b8f94f3f2cd4d4ba3 - - -] volume volume-7be4579b-a976-4c7f-8f81-678208b1b272 is not a clone _get_clone_info /usr/lib/python2.7/site-packages/cinder/volume/drivers/rbd.py:556
./cinderctl-003-soscleaner-1150731481727834.tar.gz-69336f0b-e6c6-4198-97b1-960b7a19da7b/soscleaner-1150731481727834/var/log/cinder/volume.log-20150719:2015-07-14 14:15:41.415 4659 DEBUG cinder.volume.drivers.rbd [req-94f052ab-1344-4ab6-9b24-601bab772e2a 34f31fc86bed406797298307fb8463d2 68e7a813ea7e456b8f94f3f2cd4d4ba3 - - -] deleting rbd volume volume-7be4579b-a976-4c7f-8f81-678208b1b272 delete_volume /usr/lib/python2.7/site-packages/cinder/volume/drivers/rbd.py:633
./cinderctl-003-soscleaner-1150731481727834.tar.gz-69336f0b-e6c6-4198-97b1-960b7a19da7b/soscleaner-1150731481727834/var/log/cinder/volume.log-20150719:2015-07-14 22:28:49.735 19300 INFO cinder.volume.manager [req-2b5f577e-d8b0-4651-903f-acece0d542e0 - - - - -] volume 7be4579b-a976-4c7f-8f81-678208b1b272: skipping export

Logs from cinderctl-002:

./cinderctl-002-soscleaner-8542214154554249.tar.gz-629c22b7-ce14-408a-8da9-d6d00fcac003/soscleaner-8542214154554249/var/log/cinder/volume.log-20150720:2015-07-14 22:26:40.733 11716 INFO cinder.volume.manager [req-b2d1fab1-bead-4dff-9a48-4d81abbc7dfa - - - - -] volume 7be4579b-a976-4c7f-8f81-678208b1b272: skipping export
./cinderctl-002-soscleaner-8542214154554249.tar.gz-629c22b7-ce14-408a-8da9-d6d00fcac003/soscleaner-8542214154554249/var/log/cinder/volume.log-20150720:2015-07-14 22:26:56.193 11870 INFO cinder.volume.manager [req-e0e73134-2faf-4132-813f-88350d4eb72c - - - - -] volume 7be4579b-a976-4c7f-8f81-678208b1b272: skipping export
./cinderctl-002-soscleaner-8542214154554249.tar.gz-629c22b7-ce14-408a-8da9-d6d00fcac003/soscleaner-8542214154554249/var/log/cinder/volume.log-20150720:2015-07-14 22:28:47.776 11870 DEBUG cinder.openstack.common.lockutils [req-f05e1722-ecba-4500-9585-bb2b64810b58 7f84af160bec4db68ae60d702d97989f d9c4536810714f0db89d8998cc257d5a - - -] Got semaphore "7be4579b-a976-4c7f-8f81-678208b1b272-delete_volume" for method "lvo_inner2"... inner /usr/lib/python2.7/site-packages/cinder/openstack/common/lockutils.py:191
./cinderctl-002-soscleaner-8542214154554249.tar.gz-629c22b7-ce14-408a-8da9-d6d00fcac003/soscleaner-8542214154554249/var/log/cinder/volume.log-20150720:2015-07-14 22:28:47.777 11870 DEBUG cinder.openstack.common.lockutils [req-f05e1722-ecba-4500-9585-bb2b64810b58 7f84af160bec4db68ae60d702d97989f d9c4536810714f0db89d8998cc257d5a - - -] Attempting to grab file lock "7be4579b-a976-4c7f-8f81-678208b1b272-delete_volume" for method "lvo_inner2"... inner /usr/lib/python2.7/site-packages/cinder/openstack/common/lockutils.py:202
./cinderctl-002-soscleaner-8542214154554249.tar.gz-629c22b7-ce14-408a-8da9-d6d00fcac003/soscleaner-8542214154554249/var/log/cinder/volume.log-20150720:2015-07-14 22:28:47.777 11870 DEBUG cinder.openstack.common.lockutils [req-f05e1722-ecba-4500-9585-bb2b64810b58 7f84af160bec4db68ae60d702d97989f d9c4536810714f0db89d8998cc257d5a - - -] Got file lock "7be4579b-a976-4c7f-8f81-678208b1b272-delete_volume" at /var/lib/cinder/tmp/cinder-7be4579b-a976-4c7f-8f81-678208b1b272-delete_volume for method "lvo_inner2"... inner /usr/lib/python2.7/site-packages/cinder/openstack/common/lockutils.py:232
./cinderctl-002-soscleaner-8542214154554249.tar.gz-629c22b7-ce14-408a-8da9-d6d00fcac003/soscleaner-8542214154554249/var/log/cinder/volume.log-20150720:2015-07-14 22:28:49.052 11870 INFO cinder.volume.manager [req-f05e1722-ecba-4500-9585-bb2b64810b58 7f84af160bec4db68ae60d702d97989f d9c4536810714f0db89d8998cc257d5a - - -] volume 7be4579b-a976-4c7f-8f81-678208b1b272: deleting
./cinderctl-002-soscleaner-8542214154554249.tar.gz-629c22b7-ce14-408a-8da9-d6d00fcac003/soscleaner-8542214154554249/var/log/cinder/volume.log-20150720:2015-07-14 22:28:50.189 11870 DEBUG cinder.volume.manager [req-f05e1722-ecba-4500-9585-bb2b64810b58 7f84af160bec4db68ae60d702d97989f d9c4536810714f0db89d8998cc257d5a - - -] volume 7be4579b-a976-4c7f-8f81-678208b1b272: removing export delete_volume /usr/lib/python2.7/site-packages/cinder/volume/manager.py:399
./cinderctl-002-soscleaner-8542214154554249.tar.gz-629c22b7-ce14-408a-8da9-d6d00fcac003/soscleaner-8542214154554249/var/log/cinder/volume.log-20150720:2015-07-14 22:28:50.189 11870 DEBUG cinder.volume.manager [req-f05e1722-ecba-4500-9585-bb2b64810b58 7f84af160bec4db68ae60d702d97989f d9c4536810714f0db89d8998cc257d5a - - -] volume 7be4579b-a976-4c7f-8f81-678208b1b272: deleting delete_volume /usr/lib/python2.7/site-packages/cinder/volume/manager.py:401
./cinderctl-002-soscleaner-8542214154554249.tar.gz-629c22b7-ce14-408a-8da9-d6d00fcac003/soscleaner-8542214154554249/var/log/cinder/volume.log-20150720:2015-07-14 22:28:52.894 11870 INFO cinder.volume.drivers.rbd [req-f05e1722-ecba-4500-9585-bb2b64810b58 7f84af160bec4db68ae60d702d97989f d9c4536810714f0db89d8998cc257d5a - - -] volume volume-7be4579b-a976-4c7f-8f81-678208b1b272 no longer exists in backend
./cinderctl-002-soscleaner-8542214154554249.tar.gz-629c22b7-ce14-408a-8da9-d6d00fcac003/soscleaner-8542214154554249/var/log/cinder/volume.log-20150720:2015-07-14 22:29:02.437 11870 INFO cinder.volume.manager [req-f05e1722-ecba-4500-9585-bb2b64810b58 7f84af160bec4db68ae60d702d97989f d9c4536810714f0db89d8998cc257d5a - - -] volume 7be4579b-a976-4c7f-8f81-678208b1b272: deleted successfully
./cinderctl-002-soscleaner-8542214154554249.tar.gz-629c22b7-ce14-408a-8da9-d6d00fcac003/soscleaner-8542214154554249/var/log/cinder/volume.log-20150720:2015-07-14 22:29:03.006 11870 DEBUG cinder.openstack.common.lockutils [req-f05e1722-ecba-4500-9585-bb2b64810b58 7f84af160bec4db68ae60d702d97989f d9c4536810714f0db89d8998cc257d5a - - -] Released file lock "7be4579b-a976-4c7f-8f81-678208b1b272-delete_volume" at /var/lib/cinder/tmp/cinder-7be4579b-a976-4c7f-8f81-678208b1b272-delete_volume for method "lvo_inner2"... inner /usr/lib/python2.7/site-packages/cinder/openstack/common/lockutils.py:239

Can we also get more information on what the HA architecture (if any) looks like in this deployment? My last comment makes me think that, as Jon also mentioned in comment 4, there could have been a partition in the rabbit cluster that caused this.

From looking at the cinder configuration files on each of the nodes in the sos report, they're using the same 'host=' value, presumably to allow for a level of cinder-volume redundancy in the event of a failure. Unfortunately, this configuration opens them up to many race conditions: this setup was not planned for, and the code does not protect against such races. This is something that really should be changed. I don't yet have a clear understanding of why things are failing in the way described here, but I wanted to mention the 'host=' setting.

Josh, could you take a look at the Ceph logs? Ceph seems to be stuck - just a guess - but it's not clear why that would happen.

There's nothing in the Ceph logs to indicate that the cluster was unhealthy at any point, so the problem seems to be client-side.
The sos reports include some lsof info, which shows very few unix sockets open (6 of 301 total for cinder-volume), so the fd limit is likely not being hit at all and this case is unrelated to BZ#1219559. From the earlier comments about deletes in the logs, I'd agree that the most likely explanation is https://bugs.launchpad.net/cinder/+bug/1401335, as mentioned earlier. Deletes can take a long time, and with enough threads stuck deleting, messages do end up left in rabbit.

Jack, Josh and Jon are working to provide steps to collect additional information from the customer environment before the restart. Meanwhile, please ask the customer to provide the sos report. As you see in the previous comments, we suspect that Cinder is stuck on deleting; let's see what happened this time.

Jack, to enable Ceph client logs, add this to /etc/ceph/ceph.conf on the nodes running cinder-volume, in the [client] section:

    [client]
    debug ms = 1
    debug rbd = 20
    log file = /var/log/cinder/rbd-$pid-$cctid.log

There's no need to restart anything; new requests for rbd volumes should start producing log files under /var/log/cinder.

Josh looked at enabling eventlet logging, but suspects it would be too verbose to make use of unless we can find a more easily reproducible test case. I looked at the possible use of gdb to print the thread call traces, but that requires the most recent version with Python extensions enabled - so we may want to consider that next, though I'm not sure how much time the customer can put into this. Anything you can do to further narrow the failure down into a test case that we could attempt to recreate would be most helpful - easier said than done :) Hopefully the Ceph client logging will help us determine whether long-running deletes are involved, or whether it's something different entirely.

Jack, there has been a NEEDINFO on you for more than a week. Any updates from the customer?

Jack, it would be useful to see the logs from my previous comment. If they are exhausting the file-max limit, we would see that reflected in the logs. And if something else were occurring, we have a reasonable chance of catching that too. The behaviour described in this bug doesn't match what I would expect from file-handle exhaustion - we'd see errors immediately, not 8-hour migration timeouts - so I have some doubts about the current theory; it could be that they're just getting lucky for the moment.

Jack, please reopen this case if you get new information from the customer or if this issue happens again. We have two theories right now (see the previous comments), but without Ceph client debug logs we can't be really sure.
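The file-handle theory above can also be checked without lsof: on Linux, a process's open descriptors are listed under /proc/<pid>/fd, and its soft limit under /proc/<pid>/limits. A quick sketch, equivalent in spirit to the 'lsof -np <pid>' check suggested earlier (the PID to inspect would be the cinder-volume child process):

```python
import os

def count_open_fds(pid):
    """Count file descriptors currently open by a process (Linux /proc)."""
    return len(os.listdir(f"/proc/{pid}/fd"))

def fd_soft_limit(pid):
    """Read the soft 'Max open files' limit from /proc/<pid>/limits."""
    with open(f"/proc/{pid}/limits") as f:
        for line in f:
            if line.startswith("Max open files"):
                return int(line.split()[3])  # 4th field is the soft limit
    return None

# Demonstrated on the current process; substitute the cinder-volume child PID.
pid = os.getpid()
print(count_open_fds(pid), "open fds, soft limit", fd_soft_limit(pid))
```

If the count sits near the limit, LP#1446682-style fd exhaustion becomes plausible; the lsof numbers quoted above (6 of 301) suggest it was not the case here.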