Bug 1008818
| Summary: | glance api hangs with low (1) workers on multiple parallel image creation requests | | |
|---|---|---|---|
| Product: | [Community] RDO | Reporter: | Attila Fazekas <afazekas> |
| Component: | openstack-glance | Assignee: | Attila Fazekas <afazekas> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Shai Revivo <srevivo> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | unspecified | CC: | afazekas, apevec, eglynn, fpercoco |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-glance-2013.1.4-2.fc19 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1014016 (view as bug list) | Environment: | |
| Last Closed: | 2016-03-30 23:07:08 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
Attila Fazekas
2013-09-17 06:51:54 UTC
Created attachment 798638 [details]
strace of one glance-registry process on F19
Created attachment 798642 [details]
strace gdb wireshark lsof pstree
strace and wireshark/tshark were started before the 16 glance image-create commands.
The ssh traffic is removed from the network capture.
pstree was run before the action.
lsof was run before and after the issue.
gdb was used just after the issue.
According to the qpidd manual the connection backlog is 10 by default. The AMQP server initiates a connection close (FIN) just after the connection is created, before the AMQP handshake or any data is transferred. See the 'tcp.stream eq 134' expression with wireshark in the attached pcap file (tar.xz). There are multiple sockets with a similar story in the glance-api strace; glance keeps poll(2)-ing on the half-closed sockets. (Glance opens and closes the AMQP tcp connection frequently, it does not try to reuse an existing AMQP connection.) https://github.com/openstack/glance/blob/master/glance/notifier/notify_qpid.py#L141

qpid.16983:16995 1379324256.307588 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46895 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.317537 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46896 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.323928 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46897 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.331552 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46898 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.336658 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46899 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.345344 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46900 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.352342 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46901 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.359793 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46902 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.370614 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46903 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.374449 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46904 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.381316 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46905 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.389423 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46906 No protocol received closing\n", 111) = 111

The client needs to initiate the communication, but qpid closes the connection very early. Several changes have happened to timeout management since qpid 0.22 (https://issues.apache.org/jira/browse/QPID-4854), so it is probably worth trying with qpid-cpp 0.24.

Upgrading to qpid 0.24 (both client and server) did not solve the issue. (All configuration is the default.) qpidd closed the connection after 10 sec, which was expected; the client did not send the AMQP header.

glance uses eventlet and monkey patches several methods, but does not patch, for instance, select(). After extending the monkey patched methods the service does not die. I see similar results as with RabbitMQ: in some cases I got 500 responses, but the service does not hang permanently.
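For illustration only (a minimal sketch, not the actual change I tested): eventlet's monkey patching is opt-in per module, and calling it with no arguments patches everything, including select, which is what "extending the monkey patched methods" means here:

```python
import eventlet

# The service patched only a subset of modules; notably select() was not
# patched. Patching everything before any other import is the "extended"
# variant referred to above.
eventlet.monkey_patch()
# or, explicitly:
# eventlet.monkey_patch(os=True, select=True, socket=True, thread=True, time=True)
```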
2013-09-18 11:49:29.212 12234 ERROR glance.notifier.notify_qpid [5c709887-d5a6-4b3a-842e-467ce5f7867c 681888a5061442198bdd2f8501d68d45 b0b2342ce3ca4e939192c1256dda643c] Notification error. Priority: info Message: {'event_type': 'image.create', 'timestamp': '2013-09-18 11:49:29.166362', 'message_id': '75db5170-c1d0-494f-a248-8c30ecf7e12d', 'priority': 'INFO', 'publisher_id': 'devstack-1379181899.novalocal', 'payload': {u'status': u'queued', u'deleted': False, u'container_format': u'bare', u'min_ram': 0, u'updated_at': u'2013-09-18T11:49:29', u'owner': u'b0b2342ce3ca4e939192c1256dda643c', u'min_disk': 0, u'is_public': False, u'deleted_at': None, u'id': u'bda44585-539d-47f2-b162-de4f2940b1f0', u'size': 2030, u'name': u'image-12', u'checksum': None, u'created_at': u'2013-09-18T11:49:29', u'disk_format': u'raw', u'properties': {}, u'protected': False}}
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid Traceback (most recent call last):
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid   File "/opt/stack/new/glance/glance/notifier/notify_qpid.py", line 134, in _send
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid     sender.send(qpid_msg)
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid   File "<string>", line 6, in send
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 865, in send
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid     raise Detached()
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid Detached: None
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid
2013-09-18 11:49:29.229 12234 ERROR glance.notifier.notify_qpid [9449db2e-7690-4a86-af12-4f5576bce05e 681888a5061442198bdd2f8501d68d45 b0b2342ce3ca4e939192c1256dda643c] Notification error.
Priority: info Message: {'event_type': 'image.create', 'timestamp': '2013-09-18 11:49:29.139054', 'message_id': 'b4eeb060-e70a-494c-8cd6-bb6b1770ba29', 'priority': 'INFO', 'publisher_id': 'devstack-1379181899.novalocal', 'payload': {u'status': u'queued', u'deleted': False, u'container_format': u'bare', u'min_ram': 0, u'updated_at': u'2013-09-18T11:49:29', u'owner': u'b0b2342ce3ca4e939192c1256dda643c', u'min_disk': 0, u'is_public': False, u'deleted_at': None, u'id': u'36bbd1e2-b2c7-4a0e-b7cb-7f2cbcc5f6dc', u'size': 2030, u'name': u'image-8', u'checksum': None, u'created_at': u'2013-09-18T11:49:29', u'disk_format': u'raw', u'properties': {}, u'protected': False}}
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid Traceback (most recent call last):
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid   File "/opt/stack/new/glance/glance/notifier/notify_qpid.py", line 134, in _send
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid     sender.send(qpid_msg)
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid   File "<string>", line 6, in send
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 867, in send
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid     self._ecwait(lambda: self.linked, timeout=timeout)
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 51, in _ecwait
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid     self.check_closed()
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 824, in check_closed
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid     raise LinkClosed()
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid LinkClosed: None
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid

The full monkey patch leads to pipe(2) and connection leakage.

The AMQP connection closing was added by: https://bugs.launchpad.net/glance/+bug/1100317

A mail thread related to python-qpid eventlet monkey patching: http://qpid.2158936.n2.nabble.com/QPID-and-eventlet-monkey-patch-td7584370.html

Created attachment 800014 [details]
single_auth_multi_image.py
Simple reproducer script.
keystone tenant-create --name demo
keystone user-create --name demo --tenant demo --pass secret
./single_auth_multi_image.py --thread 256
glance image-list
^C... terminating glance client
On a 2-core VM, above 256 'threads' there is a good chance of reproducing the issue even with multiple workers.
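The attached script is not reproduced here; purely as an illustration of the same idea (endpoints, ports and the tiny payload are my assumptions, not the attachment's actual code): authenticate once, then fire many parallel image-create requests against the Glance v1 API.

```python
#!/usr/bin/env python
# Hypothetical sketch of a "single auth, many parallel image creates" run.
import json
import threading
import requests

KEYSTONE = "http://127.0.0.1:5000/v2.0/tokens"   # assumed local devstack
GLANCE = "http://127.0.0.1:9292/v1/images"

def get_token():
    body = {"auth": {"tenantName": "demo",
                     "passwordCredentials": {"username": "demo",
                                             "password": "secret"}}}
    resp = requests.post(KEYSTONE, data=json.dumps(body),
                         headers={"Content-Type": "application/json"})
    return resp.json()["access"]["token"]["id"]

def create_image(token, i):
    headers = {"X-Auth-Token": token,
               "X-Image-Meta-Name": "image-%d" % i,
               "X-Image-Meta-Disk-Format": "raw",
               "X-Image-Meta-Container-Format": "bare"}
    # A tiny payload is enough; the hang is triggered by concurrency,
    # not by image size.
    requests.post(GLANCE, headers=headers, data=b"x" * 2030)

if __name__ == "__main__":
    token = get_token()          # authenticate once
    threads = [threading.Thread(target=create_image, args=(token, i))
               for i in range(256)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```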
Created attachment 800019 [details]
python stacks of one glance api worker
When the file is shown as <string>, it may be a method created by qpid.concurrency.synchronized (thread locking).
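As a side note (my own minimal illustration, not qpid's actual code): any wrapper whose body was compiled from a string reports <string> as its file, which matches the frames above.

```python
# A function compiled from a string (as decorators that generate wrappers
# via exec/compile do) has "<string>" as its co_filename, so tracebacks
# through it show File "<string>" instead of a real path.
src = '''
def send(self, msg):
    raise RuntimeError("boom")
'''
namespace = {}
exec(compile(src, "<string>", "exec"), namespace)
namespace["send"](None, "hello")   # traceback: File "<string>", line 3, in send
```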
python-qpid without this patch http://svn.apache.org/viewvc/qpid/trunk/qpid/python/qpid/util.py?r1=1406584&r2=1493873 could cause socket leakage. (This does not look like the root cause of the hang.)

The same glance.notifier.notify_qpid.QpidStrategy instance is in the _open_connection() method multiple times in parallel. The method/class does not seem to be re-entrant or thread-safe. The socket module is monkey patched by eventlet, so the socket is non-blocking. On a connection attempt the first response to the system call is EINPROGRESS, which allows other co-routines ('threads') to be scheduled. The connection which is really established is not used; several additional AMQP connections exist without activity. If the code is modified with more eventlet monkey patching (more scheduling points), the leaked connections do the heart-beating in the background.

Probably more python methods (like select) need to be monkey patched for normal operation; if the pipe wait does not have an interruptible operation it can cause co-routine starvation. glance.notifier.notify_qpid.QpidStrategy must be used in a safer way.

Created attachment 800568 [details]
glance-real-new-connection.patch
This makes the qpid notifier really reconnect on every notification attempt, as advertised in a previous commit.
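Roughly, the behaviour the patch aims for looks like this (a hedged sketch only, using the qpid.messaging API; it is not the attached patch itself):

```python
from qpid.messaging import Connection, Message

def send_notification(broker, address, payload):
    # Open a fresh AMQP connection for every notification instead of
    # reusing a sender whose connection qpidd may already have dropped.
    connection = Connection(broker)            # e.g. "localhost:5672"
    try:
        connection.open()
        session = connection.session()
        sender = session.sender(address)       # address name is an assumption
        sender.send(Message(content=payload))
    finally:
        try:
            connection.close()
        except Exception:
            pass                               # may never have been opened
```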
The previous reproducer was able to hang the glance-api service with 3 threads when it was configured to start with one worker.
After the patch it is able to survive 128 concurrent connections/requests.
It causes many pipes and co-routines to be used, so at higher concurrency levels you might need to increase the maximum number of file descriptors per process.
At the 128 concurrency level, with the default qpid settings, in some cases one thread might not get scheduled for a long time, which causes a connection timeout. In this case the reconnection works. No connection leakage was observed.
The test system was a 2 core VM.
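In case it helps reproduce this elsewhere: one way to raise the per-process file descriptor limit mentioned above from Python itself (an illustration, not part of the patch):

```python
import resource

# Raise the soft NOFILE limit up to the hard limit so hundreds of
# concurrent AMQP connections and eventlet pipes do not hit EMFILE.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```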
eventlet is now allowed to patch all modules. More methods need to be patched in order to avoid a hang.
So the attached patch makes glance more stable based on the first tests; however, proper AMQP connection reuse could boost performance and lower resource usage.
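For the connection-reuse direction, a minimal sketch (my assumption of how it could look with qpid.messaging and an eventlet semaphore; QpidStrategy itself is not re-entrant, so access has to be serialized):

```python
from eventlet.semaphore import Semaphore
from qpid.messaging import Connection, Message
from qpid.messaging.exceptions import MessagingError

class ReusingNotifier(object):
    """Keep one AMQP connection; serialize access across green threads."""

    def __init__(self, broker):
        self._broker = broker
        self._lock = Semaphore()
        self._connection = None

    def _ensure_connection(self):
        if self._connection is None:
            connection = Connection(self._broker)
            connection.open()
            self._connection = connection
        return self._connection

    def notify(self, address, payload):
        with self._lock:
            try:
                session = self._ensure_connection().session()
                session.sender(address).send(Message(content=payload))
                session.close()
            except MessagingError:
                # Drop the broken connection; the next call reconnects.
                try:
                    self._connection.close()
                except Exception:
                    pass
                self._connection = None
                raise
```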
Maybe in the future the oslo 'rpc' code will be used in glance as well.
Attila, is this the right upstream bug in Launchpad I've linked to?