Description of problem:
Glance API can hang in registry notification.

Version-Release number of selected component (if applicable):
RDO Havana

How reproducible:
Usually

Steps to Reproduce:
1. openstack-config --set /etc/glance/glance-api.conf DEFAULT workers 1
2. service openstack-glance-api restart
3. Try several times:
   for ((a=0; a<16; a++)) do glance image-create --name image-$a --container-format bare --disk-format raw --file /etc/passwd & done

Actual results:
glance image-list # Times out
The API service hangs forever; a glance-api service restart is required to resume normal operation.

Expected results:
The service does not hang.

Additional info:
- Reproduced on F19 devstack Havana and RHEL RDO Havana with python-qpid 0.14 .. 0.24
- Tested on a 2-core VM
- With RabbitMQ the response is a 500, so the bug may be related to glance-registry.
- No issue seen with workers=2 yet.
- A glance image-create command without the --file argument can also cause hangs.
Created attachment 798638 [details] strace of one glance-registry process on F19
Created attachment 798642 [details] strace gdb wireshark lsof pstree

strace and wireshark/tshark were started before the 16 glance image-create commands. The ssh traffic was removed from the network capture. pstree was captured before the action. lsof was run before and after the issue. gdb was used just after the issue.
According to the qpidd manual the connection backlog is 10 by default. The AMQP server initiates a connection close (FIN) just after the connection is created, before the AMQP handshake or any data transfer. See the 'tcp.stream eq 134' expression with wireshark in the attached pcap file (tar.xz). There are multiple sockets with a similar story in the glance-api strace; glance keeps poll(2)-ing on the half-closed sockets. (Glance opens and closes AMQP TCP connections frequently; it does not try to reuse an existing AMQP connection.) https://github.com/openstack/glance/blob/master/glance/notifier/notify_qpid.py#L141
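The half-closed socket situation above can be sketched with the stdlib alone (this is an illustration, not glance code): once the peer sends FIN, poll(2) keeps reporting the socket readable, and a read returns zero bytes. Code that polls such sockets without handling the EOF case spins or appears to hang.

```python
import select
import socket

# Minimal sketch of a half-closed socket (not glance code): the peer
# closes its end, so poll() reports the fd readable forever, and a
# read returns zero bytes (EOF) instead of an error.
a, b = socket.socketpair()
b.close()                          # peer closes the connection (FIN)

poller = select.poll()
poller.register(a, select.POLLIN)
events = poller.poll(1000)         # returns immediately: FIN counts as readable
data = a.recv(4096)                # EOF: zero-byte read, not an exception
a.close()
```

A loop that re-polls without checking for the zero-byte read will never make progress on such a socket.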
qpid.16983:16995 1379324256.307588 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46895 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.317537 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46896 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.323928 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46897 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.331552 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46898 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.336658 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46899 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.345344 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46900 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.352342 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46901 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.359793 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46902 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.370614 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46903 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.374449 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46904 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.381316 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46905 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.389423 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46906 No protocol received closing\n", 111) = 111

The client needs to initiate the communication, but qpid closes the connection very early.
Several changes to timeout management have happened since qpid 0.22: https://issues.apache.org/jira/browse/QPID-4854. It is probably worth trying with qpid-cpp 0.24.
Upgrading to qpid 0.24 (both client and server) did not solve the issue. (All configuration is the default.)
qpidd closed the connection after 10 seconds, which was expected. The client never sent the AMQP header.
glance uses eventlet and monkey patches several methods, but does not patch, for instance, select(). After extending the set of monkey patched methods the service does not die. I see similar results as with RabbitMQ: in some cases I got 500 responses, but the service does not hang permanently.

2013-09-18 11:49:29.212 12234 ERROR glance.notifier.notify_qpid [5c709887-d5a6-4b3a-842e-467ce5f7867c 681888a5061442198bdd2f8501d68d45 b0b2342ce3ca4e939192c1256dda643c] Notification error. Priority: info Message: {'event_type': 'image.create', 'timestamp': '2013-09-18 11:49:29.166362', 'message_id': '75db5170-c1d0-494f-a248-8c30ecf7e12d', 'priority': 'INFO', 'publisher_id': 'devstack-1379181899.novalocal', 'payload': {u'status': u'queued', u'deleted': False, u'container_format': u'bare', u'min_ram': 0, u'updated_at': u'2013-09-18T11:49:29', u'owner': u'b0b2342ce3ca4e939192c1256dda643c', u'min_disk': 0, u'is_public': False, u'deleted_at': None, u'id': u'bda44585-539d-47f2-b162-de4f2940b1f0', u'size': 2030, u'name': u'image-12', u'checksum': None, u'created_at': u'2013-09-18T11:49:29', u'disk_format': u'raw', u'properties': {}, u'protected': False}}
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid Traceback (most recent call last):
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid   File "/opt/stack/new/glance/glance/notifier/notify_qpid.py", line 134, in _send
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid     sender.send(qpid_msg)
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid   File "<string>", line 6, in send
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 865, in send
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid     raise Detached()
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid Detached: None
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid
2013-09-18 11:49:29.229 12234 ERROR glance.notifier.notify_qpid [9449db2e-7690-4a86-af12-4f5576bce05e 681888a5061442198bdd2f8501d68d45 b0b2342ce3ca4e939192c1256dda643c] Notification error. Priority: info Message: {'event_type': 'image.create', 'timestamp': '2013-09-18 11:49:29.139054', 'message_id': 'b4eeb060-e70a-494c-8cd6-bb6b1770ba29', 'priority': 'INFO', 'publisher_id': 'devstack-1379181899.novalocal', 'payload': {u'status': u'queued', u'deleted': False, u'container_format': u'bare', u'min_ram': 0, u'updated_at': u'2013-09-18T11:49:29', u'owner': u'b0b2342ce3ca4e939192c1256dda643c', u'min_disk': 0, u'is_public': False, u'deleted_at': None, u'id': u'36bbd1e2-b2c7-4a0e-b7cb-7f2cbcc5f6dc', u'size': 2030, u'name': u'image-8', u'checksum': None, u'created_at': u'2013-09-18T11:49:29', u'disk_format': u'raw', u'properties': {}, u'protected': False}}
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid Traceback (most recent call last):
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid   File "/opt/stack/new/glance/glance/notifier/notify_qpid.py", line 134, in _send
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid     sender.send(qpid_msg)
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid   File "<string>", line 6, in send
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 867, in send
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid     self._ecwait(lambda: self.linked, timeout=timeout)
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 51, in _ecwait
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid     self.check_closed()
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 824, in check_closed
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid     raise LinkClosed()
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid LinkClosed: None
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid
The full monkey patch leads to pipe(2) and connection leakage. The AMQP connection closing was added by: https://bugs.launchpad.net/glance/+bug/1100317
A mail thread related to python-qpid eventlet monkey patching: http://qpid.2158936.n2.nabble.com/QPID-and-eventlet-monkey-patch-td7584370.html
Created attachment 800014 [details] single_auth_multi_image.py

Simple reproducer script.

keystone tenant-create --name demo
keystone user-create --name demo --tenant demo --pass secret
./single_auth_multi_image.py --thread 256
glance image-list
^C... terminating glance client

On a 2-core VM, above 256 'threads' it has a good chance of reproducing the issue even with multiple workers.
Created attachment 800019 [details] python stacks of one glance api worker

When the file is <string>, it may be a method created by qpid.concurrency.synchronized (thread locking).
python-qpid without this patch http://svn.apache.org/viewvc/qpid/trunk/qpid/python/qpid/util.py?r1=1406584&r2=1493873 can cause socket leakage. (It looks like this is not the root cause of the hang.)
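The pattern behind that fix can be sketched as follows (this is an illustration of the idea, not the actual util.py code from the patch): if connect() fails, the socket must be closed before the exception propagates, otherwise every failed connection attempt leaks a file descriptor.

```python
import socket

def connect(host, port):
    """Open a TCP connection, closing the socket on failure.

    Sketch of the leak-avoidance pattern (illustrative, not the
    patched python-qpid code): without the except/close, each failed
    connect() attempt leaves an open fd behind.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.connect((host, port))
    except socket.error:
        sock.close()   # release the fd before re-raising
        raise
    return sock
```

Under heavy reconnect churn (as in this bug) such leaks add up quickly against the per-process fd limit.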
The same glance.notifier.notify_qpid.QpidStrategy instance is inside the _open_connection() method multiple times in parallel; the method/class does not seem to be re-entrant or thread-safe. The socket is monkey patched by eventlet, so it is NON_BLOCKING. On a connection attempt the first response to the connect system call is EINPROGRESS, which allows other co-routines ('threads') to be scheduled. The connection that actually gets established is not used; several additional AMQP connections exist without activity. If the code is modified with more eventlet monkey patching (more scheduling points), the leaked connections do the heart-beating in the 'background'. Probably more Python methods (like select) need to be monkey patched for normal operation; if the PipeWaiter does not have an interruptible operation it can cause co-routine starvation. glance.notifier.notify_qpid.QpidStrategy must be used in a safer way.
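One safer usage pattern can be sketched like this (the class and names below are mine, not glance's actual code): serialize connection setup behind a lock so that callers racing through the EINPROGRESS scheduling point end up sharing one connection instead of each opening its own and leaking the rest.

```python
import threading

class SafeNotifier(object):
    """Hypothetical sketch (not glance's QpidStrategy): serialize
    connection setup so parallel callers share one connection."""

    def __init__(self, connect):
        self._connect = connect        # factory that opens the AMQP connection
        self._lock = threading.Lock()
        self._conn = None

    def connection(self):
        with self._lock:               # only one caller runs the setup
            if self._conn is None:
                self._conn = self._connect()
            return self._conn

# Demonstrate with a counting fake instead of a real AMQP connection.
opened = []
notifier = SafeNotifier(lambda: opened.append(object()) or opened[-1])
workers = [threading.Thread(target=notifier.connection) for _ in range(16)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

With eventlet the same idea applies with a green lock; the point is that _open_connection() must not be entered concurrently on one instance.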
Created attachment 800568 [details] glance-real-new-connection.patch

This makes the qpid notifier really reconnect on every notification attempt, as advertised in an earlier commit. The previous reproducer was able to hang the glance-api service with 3 threads when it was configured to start with one worker. After the patch it survives 128 concurrent connections/requests. The patch causes many pipes and co-routines to be used, so at higher concurrency levels you might need to increase the maximum file descriptors per process. At the 128 concurrency level, with the default qpid settings, in some cases one thread might not get scheduled for a long time, which causes a connection timeout; in that case the reconnection works. No connection leakage was experienced. The test system was a 2-core VM.

eventlet is now allowed to patch all methods; more methods need to be patched in order to avoid a hang. So the attached patch makes glance more stable based on the first tests; however, proper AMQP connection reuse could boost performance and lower resource usage. Maybe in the future the oslo 'rpc' code will be used in glance as well.
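The strategy the patch describes can be sketched as follows (the function and fake connection below are illustrative, not the actual patch code): open a fresh connection for every notification and always close it, so a stale or half-closed connection is never reused and cannot wedge a later send.

```python
def send_notification(connect, message):
    """Sketch of reconnect-per-notification (illustrative, not the
    actual glance patch): a fresh connection per send, always closed,
    even when send() raises."""
    conn = connect()
    try:
        conn.send(message)
    finally:
        conn.close()   # always released, so nothing stale survives

# Counting fake standing in for a real AMQP connection.
class FakeConn(object):
    sent = []
    closed = 0
    def send(self, msg):
        FakeConn.sent.append(msg)
    def close(self):
        FakeConn.closed += 1

for i in range(3):
    send_notification(FakeConn, "image.create-%d" % i)
```

The trade-off matches what is reported above: robustness against stale connections, at the cost of connection churn that proper reuse (e.g. the oslo messaging code) would avoid.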
Attila, is this the right upstream bug in Launchpad that I've linked to?