RDO tickets are now tracked in Jira https://issues.redhat.com/projects/RDO/issues/
Bug 1008818 - glance api hangs with low (1) workers on multiple parallel image creation requests
Summary: glance api hangs with low (1) workers on multiple parallel image creation requests
Status: CLOSED CURRENTRELEASE
Alias: None
Product: RDO
Classification: Community
Component: openstack-glance
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Attila Fazekas
QA Contact: Shai Revivo
 
Reported: 2013-09-17 06:51 UTC by Attila Fazekas
Modified: 2016-04-26 16:47 UTC
CC List: 4 users

Fixed In Version: openstack-glance-2013.1.4-2.fc19
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1014016 (view as bug list)
Environment:
Last Closed: 2016-03-30 23:07:08 UTC


Attachments
strace of one glance-registry process on F19 (4.90 MB, text/plain) - 2013-09-17 06:53 UTC, Attila Fazekas
strace gdb wireshark lsof pstree (806.56 KB, application/x-xz) - 2013-09-17 06:59 UTC, Attila Fazekas
single_auth_multi_image.py (2.79 KB, text/x-python) - 2013-09-19 15:44 UTC, Attila Fazekas
python stacks of one glance api worker (4.06 KB, text/plain) - 2013-09-19 15:51 UTC, Attila Fazekas
glance-real-new-connection.patch (4.04 KB, patch) - 2013-09-20 16:31 UTC, Attila Fazekas


Links
Launchpad 1229042

Description Attila Fazekas 2013-09-17 06:51:54 UTC
Description of problem:
Glance api can hang in registry notification.

Version-Release number of selected component (if applicable):
RDO havana

How reproducible:
Usually

Steps to Reproduce:
1. openstack-config --set /etc/glance/glance-api.conf DEFAULT workers 1 
2. service openstack-glance-api restart
3. Try several times:
for ((a=0; a<16; a++)); do
        glance image-create --name image-$a --container-format bare --disk-format raw --file /etc/passwd &
done

Actual results:
glance image-list # Times out
The API service hangs forever.

A glance-api service restart is required to resume normal operation.

Expected results:
The service does not hang.


Additional info:
- On F19-devstack havana and RHEL-RDO-havana, with python-qpid 0.14 .. 0.24
- Tested on a 2-core VM
- With RabbitMQ the response is a 500, so the bug may be related to glance-registry.

- No issue seen with workers=2 yet.
- The glance image-create command without the --file argument can also cause hangs.

Comment 1 Attila Fazekas 2013-09-17 06:53:28 UTC
Created attachment 798638 [details]
strace of one glance-registry process on F19

Comment 2 Attila Fazekas 2013-09-17 06:59:38 UTC
Created attachment 798642 [details]
strace gdb wireshark lsof pstree

strace and wireshark/tshark were started before the 16 glance image-create commands.
The ssh traffic was removed from the network capture.
pstree output was captured before the action.
lsof output was collected before and after the issue.

gdb was used just after the issue.

Comment 3 Attila Fazekas 2013-09-17 10:56:11 UTC
According to the qpidd manual the connection backlog is 10 by default.

The AMQP server initiates a connection close (FIN) just after the connection is created, before the AMQP handshake or any data transfer.

See the 'tcp.stream eq 134' filter expression in Wireshark on the attached pcap file (tar.xz).

There are multiple sockets with a similar story in the glance-api strace;
glance keeps poll(2)-ing on the half-closed sockets.

(Glance opens and closes AMQP TCP connections frequently; it does not try to reuse an existing AMQP connection.)
https://github.com/openstack/glance/blob/master/glance/notifier/notify_qpid.py#L141
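
As a rough sketch only (not the actual notify_qpid code; the broker URL and address are made-up placeholders), the open-send-close-per-notification pattern described above looks roughly like this:

# Illustrative sketch of the "new AMQP connection per notification" pattern.
# Not the real glance notifier; broker and address values are assumptions.
from qpid.messaging import Connection, Message

def send_notification(payload, broker="localhost:5672",
                      address="glance/notifications.info"):
    conn = Connection(broker)   # brand-new TCP connection for every call
    try:
        conn.open()             # AMQP protocol header / handshake happens here
        session = conn.session()
        sender = session.sender(address)
        sender.send(Message(content=payload))
    finally:
        conn.close()            # torn down immediately, nothing is reused

With a burst of parallel image-create requests, each one sets up and tears down such a short-lived connection, which is consistent with the default backlog of 10 being exceeded.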

Comment 4 Attila Fazekas 2013-09-17 12:05:46 UTC
qpid.16983:16995 1379324256.307588 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46895 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.317537 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46896 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.323928 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46897 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.331552 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46898 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.336658 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46899 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.345344 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46900 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.352342 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46901 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.359793 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46902 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.370614 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46903 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.374449 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46904 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.381316 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46905 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.389423 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46906 No protocol received closing\n", 111) = 111

The client needs to initiate the communication, but qpidd closes the connection very early.

Comment 5 Attila Fazekas 2013-09-17 13:42:35 UTC
Several changes to timeout management have happened since qpid 0.22:
https://issues.apache.org/jira/browse/QPID-4854
It is probably worth trying with qpid-cpp 0.24.

Comment 6 Attila Fazekas 2013-09-17 17:39:56 UTC
Upgrading to qpid 0.24 (both client and server) did not solve the issue. (All configuration is the default.)

Comment 7 Attila Fazekas 2013-09-18 10:52:21 UTC
qpidd closed the connection after 10 seconds, which was expected.
The client did not send the AMQP header.

Comment 8 Attila Fazekas 2013-09-18 11:56:06 UTC
glance uses eventlet and monkey patches several modules, but does not patch, for instance, select().

After extending the monkey patched methods the service no longer dies.
I see results similar to RabbitMQ: in some cases I got 500 responses, but the service does not hang permanently.

2013-09-18 11:49:29.212 12234 ERROR glance.notifier.notify_qpid [5c709887-d5a6-4b3a-842e-467ce5f7867c 681888a5061442198bdd2f8501d68d45 b0b2342ce3ca4e939192c1256dda643c] Notification error.  Priority: info Message: {'event_type': 'image.create', 'timestamp': '2013-09-18 11:49:29.166362', 'message_id': '75db5170-c1d0-494f-a248-8c30ecf7e12d', 'priority': 'INFO', 'publisher_id': 'devstack-1379181899.novalocal', 'payload': {u'status': u'queued', u'deleted': False, u'container_format': u'bare', u'min_ram': 0, u'updated_at': u'2013-09-18T11:49:29', u'owner': u'b0b2342ce3ca4e939192c1256dda643c', u'min_disk': 0, u'is_public': False, u'deleted_at': None, u'id': u'bda44585-539d-47f2-b162-de4f2940b1f0', u'size': 2030, u'name': u'image-12', u'checksum': None, u'created_at': u'2013-09-18T11:49:29', u'disk_format': u'raw', u'properties': {}, u'protected': False}}
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid Traceback (most recent call last):
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid   File "/opt/stack/new/glance/glance/notifier/notify_qpid.py", line 134, in _send
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid     sender.send(qpid_msg)
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid   File "<string>", line 6, in send
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 865, in send
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid     raise Detached()
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid Detached: None
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid

2013-09-18 11:49:29.229 12234 ERROR glance.notifier.notify_qpid [9449db2e-7690-4a86-af12-4f5576bce05e 681888a5061442198bdd2f8501d68d45 b0b2342ce3ca4e939192c1256dda643c] Notification error.  Priority: info Message: {'event_type': 'image.create', 'timestamp': '2013-09-18 11:49:29.139054', 'message_id': 'b4eeb060-e70a-494c-8cd6-bb6b1770ba29', 'priority': 'INFO', 'publisher_id': 'devstack-1379181899.novalocal', 'payload': {u'status': u'queued', u'deleted': False, u'container_format': u'bare', u'min_ram': 0, u'updated_at': u'2013-09-18T11:49:29', u'owner': u'b0b2342ce3ca4e939192c1256dda643c', u'min_disk': 0, u'is_public': False, u'deleted_at': None, u'id': u'36bbd1e2-b2c7-4a0e-b7cb-7f2cbcc5f6dc', u'size': 2030, u'name': u'image-8', u'checksum': None, u'created_at': u'2013-09-18T11:49:29', u'disk_format': u'raw', u'properties': {}, u'protected': False}}
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid Traceback (most recent call last):
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid   File "/opt/stack/new/glance/glance/notifier/notify_qpid.py", line 134, in _send
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid     sender.send(qpid_msg)
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid   File "<string>", line 6, in send
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 867, in send
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid     self._ecwait(lambda: self.linked, timeout=timeout)
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 51, in _ecwait
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid     self.check_closed()
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 824, in check_closed
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid     raise LinkClosed()
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid LinkClosed: None
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid
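
For reference, a minimal sketch of what "extending the monkey patched methods" can mean (an assumption about the exact call, not necessarily the change tested here):

# Sketch only: let eventlet green-thread everything it supports, including
# select(), instead of the narrower set glance patches by default.
import eventlet

# Called with no arguments, monkey_patch() patches all supported modules
# (socket, select, thread, time, os, ...), so a blocking select() no longer
# stalls the single worker process.
eventlet.monkey_patch()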

Comment 9 Attila Fazekas 2013-09-19 05:19:49 UTC
The full monkey patch leads to pipe(2) and connection leakage.

The AMQP connection closing was added by: https://bugs.launchpad.net/glance/+bug/1100317

A mail thread related to python-qpid and eventlet monkey patching:
http://qpid.2158936.n2.nabble.com/QPID-and-eventlet-monkey-patch-td7584370.html

Comment 10 Attila Fazekas 2013-09-19 15:44:54 UTC
Created attachment 800014 [details]
single_auth_multi_image.py

Simple reproducer script.

keystone tenant-create --name demo
keystone user-create --name demo --tenant demo --pass secret
./single_auth_multi_image.py --thread 256
glance image-list
^C... terminating glance client

On a 2-core VM, above 256 'threads' it has a good chance of reproducing the issue even with multiple workers.
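
A rough outline (an assumption, not the attached single_auth_multi_image.py; endpoints and credentials are placeholders) of what such a reproducer does: authenticate once, then fire many concurrent image-create calls that all reuse the same token.

# Hypothetical reproducer outline, not the attached script.
import threading

from keystoneclient.v2_0 import client as ks_client
from glanceclient import Client as GlanceClient

ks = ks_client.Client(username="demo", password="secret", tenant_name="demo",
                      auth_url="http://127.0.0.1:5000/v2.0")
endpoint = ks.service_catalog.url_for(service_type="image")
token = ks.auth_token

def create_image(i):
    # one client per thread; no --file equivalent, so images stay "queued"
    glance = GlanceClient("1", endpoint=endpoint, token=token)
    glance.images.create(name="image-%d" % i,
                         container_format="bare", disk_format="raw")

threads = [threading.Thread(target=create_image, args=(i,)) for i in range(256)]
for t in threads:
    t.start()
for t in threads:
    t.join()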

Comment 11 Attila Fazekas 2013-09-19 15:51:26 UTC
Created attachment 800019 [details]
python stacks of one glance api worker

When the file is shown as <string>, it may be a method created by qpid.concurrency.synchronized (thread locking).

Comment 12 Attila Fazekas 2013-09-20 08:57:58 UTC
python-qpid without this patch could cause socket leakage:
http://svn.apache.org/viewvc/qpid/trunk/qpid/python/qpid/util.py?r1=1406584&r2=1493873
(This does not look like the root cause of the hang, though.)

Comment 13 Attila Fazekas 2013-09-20 14:21:43 UTC
The same glance.notifier.notify_qpid.QpidStrategy instance is inside the _open_connection() method multiple times in parallel.

The method/class does not seem to be re-entrant or thread-safe.

The socket module is monkey patched by eventlet, so the socket is non-blocking. On a connection attempt the first response to the connect() system call is EINPROGRESS.
This allows other co-routines ('threads') to be scheduled.

The connection which is actually established is not used; several additional AMQP connections exist without any activity.

If the code is modified with more eventlet monkey patching (more scheduling points), the leaked connections do their heartbeating in the 'background'.

Probably more Python modules (like select) need to be monkey patched for normal operation; if waiting on the pipe is not interruptible, it can cause co-routine starvation.

glance.notifier.notify_qpid.QpidStrategy must be used in a safer way.
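
A hypothetical illustration of the race described above (not the real QpidStrategy code): the non-blocking connect() yields, so without a lock two green threads can both run the "open a connection" path and each create its own connection, leaking the unused ones. Serializing the method with a semaphore is one safer pattern:

# Hypothetical sketch, not the actual glance code.
from eventlet import semaphore
from qpid.messaging import Connection

class NotifierSketch(object):
    def __init__(self, broker="localhost:5672"):
        self.broker = broker
        self.connection = None
        self._lock = semaphore.Semaphore()   # serialize _open_connection()

    def _open_connection(self):
        with self._lock:
            if self.connection is None:
                conn = Connection(self.broker)
                conn.open()        # may yield on EINPROGRESS under eventlet
                self.connection = conn
        return self.connection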

Comment 14 Attila Fazekas 2013-09-20 16:31:53 UTC
Created attachment 800568 [details]
glance-real-new-connection.patch

This makes the qpid notifier really reconnect on every notification attempt, as advertised in a previous commit.

The previous reproducer was able to hang the glance-api service with 3 threads when it was configured to start with one worker.

After the patch it is able to survive 128 concurrent connections/requests.
It causes many pipes and co-routines to be used, so at higher concurrency levels you might need to increase the maximum number of file descriptors per process.

At concurrency level 128 with the default qpid settings, in some cases one thread might not get scheduled for a long time, which causes a connection timeout. In that case the reconnection works. No connection leakage was observed.

The test system was a 2-core VM.

eventlet is now allowed to patch all modules. More modules need to be patched in order to avoid a hang.

So, based on the first tests, the attached patch makes glance more stable; however, proper AMQP connection reuse could boost performance and lower resource usage.

Maybe in the future the oslo 'rpc' code will be used in glance as well.

Comment 15 Alan Pevec 2013-09-30 23:36:28 UTC
Attila, is this the right upstream bug in Launchpad that I've linked to?

