Bug 1008818

Summary: glance api hangs with low (1) workers on multiple parallel image creation requests
Product: [Community] RDO
Component: openstack-glance
Version: unspecified
Status: CLOSED CURRENTRELEASE
Reporter: Attila Fazekas <afazekas>
Assignee: Attila Fazekas <afazekas>
QA Contact: Shai Revivo <srevivo>
CC: afazekas, apevec, eglynn, fpercoco
Severity: unspecified
Priority: unspecified
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-glance-2013.1.4-2.fc19
Doc Type: Bug Fix
Clones: 1014016 (view as bug list)
Type: Bug
Last Closed: 2016-03-30 23:07:08 UTC
Attachments:
- strace of one glance-registry process on F19
- strace gdb wireshark lsof pstree
- single_auth_multi_image.py
- python stacks of one glance api worker
- glance-real-new-connection.patch

Description Attila Fazekas 2013-09-17 06:51:54 UTC
Description of problem:
The Glance API can hang in the registry notification.

Version-Release number of selected component (if applicable):
RDO havana

How reproducible:
Usually

Steps to Reproduce:
1. openstack-config --set /etc/glance/glance-api.conf DEFAULT workers 1 
2. service openstack-glance-api restart
3. Try several times:
for ((a=0; a<16; a++)); do
        glance image-create --name image-$a --container-format bare --disk-format raw --file /etc/passwd &
done

Actual results:
glance image-list # Times out
The API service hangs forever.

A glance-api service restart is required to resume normal operation.

Expected results:
The service does not hang.


Additional info:
- On F19 devstack Havana and RHEL RDO Havana with python-qpid 0.14 .. 0.24
- Tested on a 2-core VM
- With RabbitMQ the response is a 500 error, so the bug may be related to glance-registry.

- No issue seen with workers=2 yet.
- The glance image-create command without the --file argument can also cause hangs.

Comment 1 Attila Fazekas 2013-09-17 06:53:28 UTC
Created attachment 798638 [details]
strace of one glance-registry process on F19

Comment 2 Attila Fazekas 2013-09-17 06:59:38 UTC
Created attachment 798642 [details]
strace gdb wireshark lsof pstree

strace and wireshark/tshark were started before the 16 glance image-create commands.
The SSH traffic was removed from the network capture.
The pstree output was captured before the action.
lsof listings were taken before and after the issue.

gdb was used just after the issue.

Comment 3 Attila Fazekas 2013-09-17 10:56:11 UTC
According to the qpidd manual the connection backlog is 10 by default.

The AMQP server initiates a connection close (FIN) just after the connection is created, before the AMQP handshake or any data transfer.

See the 'tcp.stream eq 134' expression with wireshark in the attached pcap file (tar.xz).

There are multiple sockets with a similar story in the glance-api strace; glance keeps poll(2)-ing on the half-closed sockets.

(Glance opens and closes AMQP TCP connections frequently; it does not try to reuse an existing AMQP connection.)
https://github.com/openstack/glance/blob/master/glance/notifier/notify_qpid.py#L141
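
For context, a minimal sketch of that per-notification pattern, assuming the qpid.messaging API (function name and parameters below are illustrative, not the actual glance code); each call produces one short-lived TCP connection to qpidd:

# Illustrative sketch of the per-notification connection churn described
# above (not the actual notify_qpid.py code): every notification opens a
# brand-new AMQP connection, sends one message, and closes the connection.
from qpid.messaging import Connection, Message


def send_notification(broker, address, payload):
    # One fresh TCP connection + AMQP handshake per notification.
    connection = Connection(broker)           # e.g. "localhost:5672"
    connection.open()
    try:
        session = connection.session()
        sender = session.sender(address)      # e.g. "glance/notifications.info"
        sender.send(Message(content=payload))
    finally:
        connection.close()                    # nothing is ever reused

Under load, many of these handshakes race against qpidd's default connection backlog of 10, which is consistent with the FIN-before-handshake behaviour seen in the capture.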

Comment 4 Attila Fazekas 2013-09-17 12:05:46 UTC
qpid.16983:16995 1379324256.307588 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46895 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.317537 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46896 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.323928 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46897 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.331552 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46898 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.336658 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46899 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.345344 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46900 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.352342 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46901 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.359793 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46902 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.370614 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46903 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.374449 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46904 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.381316 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46905 No protocol received closing\n", 111) = 111
qpid.16983:16995 1379324256.389423 write(2, "2013-09-16 09:37:36 [System] error Connection qpid.127.0.0.1:5672-127.0.0.1:46906 No protocol received closing\n", 111) = 111

The client needs to initiate the communication, but qpidd closes the connection very early.

Comment 5 Attila Fazekas 2013-09-17 13:42:35 UTC
Several changes to timeout management have happened since qpid 0.22:
https://issues.apache.org/jira/browse/QPID-4854
It is probably worth trying it with qpid-cpp 0.24.

Comment 6 Attila Fazekas 2013-09-17 17:39:56 UTC
Upgrading to qpid 0.24 (both client and server) did not solve the issue. (All configuration is at the defaults.)

Comment 7 Attila Fazekas 2013-09-18 10:52:21 UTC
qpidd closed the connection after 10 seconds, which was expected.
The client did not send the AMQP header.

Comment 8 Attila Fazekas 2013-09-18 11:56:06 UTC
Glance uses eventlet and monkey patches several modules, but does not patch, for instance, select().

After extending the monkey patching, the service does not die.
I see results similar to RabbitMQ: in some cases I got 500 responses, but the service does not hang permanently.

2013-09-18 11:49:29.212 12234 ERROR glance.notifier.notify_qpid [5c709887-d5a6-4b3a-842e-467ce5f7867c 681888a5061442198bdd2f8501d68d45 b0b2342ce3ca4e939192c1256dda643c] Notification error.  Priority: info Message: {'event_type': 'image.create', 'timestamp': '2013-09-18 11:49:29.166362', 'message_id': '75db5170-c1d0-494f-a248-8c30ecf7e12d', 'priority': 'INFO', 'publisher_id': 'devstack-1379181899.novalocal', 'payload': {u'status': u'queued', u'deleted': False, u'container_format': u'bare', u'min_ram': 0, u'updated_at': u'2013-09-18T11:49:29', u'owner': u'b0b2342ce3ca4e939192c1256dda643c', u'min_disk': 0, u'is_public': False, u'deleted_at': None, u'id': u'bda44585-539d-47f2-b162-de4f2940b1f0', u'size': 2030, u'name': u'image-12', u'checksum': None, u'created_at': u'2013-09-18T11:49:29', u'disk_format': u'raw', u'properties': {}, u'protected': False}}
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid Traceback (most recent call last):
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid   File "/opt/stack/new/glance/glance/notifier/notify_qpid.py", line 134, in _send
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid     sender.send(qpid_msg)
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid   File "<string>", line 6, in send
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 865, in send
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid     raise Detached()
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid Detached: None
2013-09-18 11:49:29.212 12234 TRACE glance.notifier.notify_qpid

2013-09-18 11:49:29.229 12234 ERROR glance.notifier.notify_qpid [9449db2e-7690-4a86-af12-4f5576bce05e 681888a5061442198bdd2f8501d68d45 b0b2342ce3ca4e939192c1256dda643c] Notification error.  Priority: info Message: {'event_type': 'image.create', 'timestamp': '2013-09-18 11:49:29.139054', 'message_id': 'b4eeb060-e70a-494c-8cd6-bb6b1770ba29', 'priority': 'INFO', 'publisher_id': 'devstack-1379181899.novalocal', 'payload': {u'status': u'queued', u'deleted': False, u'container_format': u'bare', u'min_ram': 0, u'updated_at': u'2013-09-18T11:49:29', u'owner': u'b0b2342ce3ca4e939192c1256dda643c', u'min_disk': 0, u'is_public': False, u'deleted_at': None, u'id': u'36bbd1e2-b2c7-4a0e-b7cb-7f2cbcc5f6dc', u'size': 2030, u'name': u'image-8', u'checksum': None, u'created_at': u'2013-09-18T11:49:29', u'disk_format': u'raw', u'properties': {}, u'protected': False}}
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid Traceback (most recent call last):
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid   File "/opt/stack/new/glance/glance/notifier/notify_qpid.py", line 134, in _send
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid     sender.send(qpid_msg)
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid   File "<string>", line 6, in send
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 867, in send
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid     self._ecwait(lambda: self.linked, timeout=timeout)
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 51, in _ecwait
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid     self.check_closed()
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 824, in check_closed
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid     raise LinkClosed()
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid LinkClosed: None
2013-09-18 11:49:29.229 12234 TRACE glance.notifier.notify_qpid
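
For illustration, the difference in monkey-patching scope looks roughly like the following (a sketch, not the actual glance startup code; the exact set of modules glance patched is not reproduced here):

import eventlet

# Selective patching, roughly the situation described above: select() is
# left unpatched, so code blocking in select() can stall the whole worker.
# eventlet.monkey_patch(socket=True, time=True, thread=True)

# Full patching ("extending the monkey patched methods"): with no arguments
# eventlet patches everything it supports, including os and select.
eventlet.monkey_patch()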

Comment 9 Attila Fazekas 2013-09-19 05:19:49 UTC
The full monkey patch leads to pipe(2) and connection leakage.

The AMQP connection closing was added by: https://bugs.launchpad.net/glance/+bug/1100317

A mailing list thread related to python-qpid and eventlet monkey patching:
http://qpid.2158936.n2.nabble.com/QPID-and-eventlet-monkey-patch-td7584370.html

Comment 10 Attila Fazekas 2013-09-19 15:44:54 UTC
Created attachment 800014 [details]
single_auth_multi_image.py

Simple reproducer script.

keystone tenant-create --name demo
keystone user-create --name demo --tenant demo --pass secret
./single_auth_multi_image.py --thread 256
glance image-list
^C... terminating glance client

On a 2-core VM, above 256 'threads' it has a good chance of reproducing the issue even with multiple workers.
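
The attachment itself is not reproduced here; the following is a hypothetical sketch of what such a reproducer might look like (endpoints, credentials, header names and the --thread handling are assumptions, not taken from the attachment):

#!/usr/bin/env python
# Hypothetical sketch of a reproducer like single_auth_multi_image.py:
# authenticate once against Keystone, then fire many parallel image-create
# requests at the Glance v1 API without uploading any data.
import sys
import threading

import requests

KEYSTONE = "http://127.0.0.1:5000/v2.0/tokens"   # assumed endpoint
GLANCE = "http://127.0.0.1:9292/v1/images"       # assumed endpoint


def get_token():
    body = {"auth": {"tenantName": "demo",
                     "passwordCredentials": {"username": "demo",
                                             "password": "secret"}}}
    resp = requests.post(KEYSTONE, json=body)
    return resp.json()["access"]["token"]["id"]


def create_image(token, name):
    headers = {"X-Auth-Token": token,
               "x-image-meta-name": name,
               "x-image-meta-disk_format": "raw",
               "x-image-meta-container_format": "bare"}
    requests.post(GLANCE, headers=headers)   # queued image, no data uploaded


if __name__ == "__main__":
    count = int(sys.argv[sys.argv.index("--thread") + 1]) if "--thread" in sys.argv else 16
    token = get_token()
    threads = [threading.Thread(target=create_image, args=(token, "image-%d" % i))
               for i in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()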

Comment 11 Attila Fazekas 2013-09-19 15:51:26 UTC
Created attachment 800019 [details]
python stacks of one glance api worker

When the file is <string>, it may be a method created by qpid.concurrency.synchronized (thread locking).

Comment 12 Attila Fazekas 2013-09-20 08:57:58 UTC
python-qpid without this patch
http://svn.apache.org/viewvc/qpid/trunk/qpid/python/qpid/util.py?r1=1406584&r2=1493873 can cause socket leakage. (This does not look like the root cause of the hanging.)

Comment 13 Attila Fazekas 2013-09-20 14:21:43 UTC
The same glance.notifier.notify_qpid.QpidStrategy instance is inside the _open_connection() method multiple times in parallel.

The method/class does not seem to be re-entrant or thread-safe.

The socket module is monkey patched by eventlet, so the socket is non-blocking. On a connection attempt the first response to the connect system call is EINPROGRESS.
This allows other coroutines ('threads') to be scheduled.

The connection which is really established is not used; several additional AMQP connections exist without any activity.

If the code is modified with more eventlet monkey patching (more scheduling points), the leaked connections do the heartbeating in the 'background'.

Probably more Python modules (like select) need to be monkey patched for normal operation; if the PipeWaiter does not have an interruptible operation, it can cause coroutine starvation.

glance.notifier.notify_qpid.QpidStrategy must be used in a safer way.
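
A sketch of the race and one possible guard, assuming a QpidStrategy-like class with a lazily opened shared connection (class and attribute names are illustrative, not the actual glance code):

from eventlet import semaphore
from qpid.messaging import Connection


class GuardedQpidStrategy(object):
    """Illustrative only: serialize the lazy connection setup so two
    greenthreads cannot both open (and then leak) an AMQP connection."""

    def __init__(self, broker):
        self.broker = broker
        self.connection = None
        self._lock = semaphore.Semaphore()

    def _open_connection(self):
        with self._lock:             # only one greenthread enters at a time
            if self.connection is None:
                conn = Connection(self.broker)
                conn.open()          # non-blocking connect yields (EINPROGRESS)
                self.connection = conn
        return self.connection

Without the lock, both greenthreads see self.connection as unset, both yield inside open(), and the connection assigned last wins while the other sits idle, which is consistent with the unused AMQP connections described above.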

Comment 14 Attila Fazekas 2013-09-20 16:31:53 UTC
Created attachment 800568 [details]
glance-real-new-connection.patch

This makes the qpid notifier really reconnect on every notification attempt, as advertised in a previous commit.

The previous reproducer was able to hang the glance-api service with 3 threads, when it was configured to start up with one worker.

After the patch it is able to survive 128 concurrent connections/requests.
This uses many pipes and coroutines, so at higher concurrency levels you might need to increase the maximum number of file descriptors per process.

At a concurrency level of 128, with the default qpid settings, in some cases one thread might not get scheduled for a long time, which causes a connection timeout. In this case the reconnection works. No connection leakage was observed.

The test system was a 2-core VM.

eventlet is now allowed to patch all modules. More modules need to be patched in order to avoid a hang.

So the attached patch makes glance more stable based on the first tests; however, proper AMQP connection reuse could boost performance and lower resource usage.

Maybe in the future the oslo 'rpc' code will be used in glance as well.
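
As a side note on the file descriptor remark above, the soft limit can also be raised from Python itself (a sketch; in a real deployment this is usually done with ulimit or limits.conf for the service user):

import resource

# Inspect and raise the per-process file descriptor limit. Raising the soft
# limit up to the hard limit needs no extra privileges; raising the hard
# limit does.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("fd limit: soft=%d hard=%d" % (soft, hard))
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))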

Comment 15 Alan Pevec 2013-09-30 23:36:28 UTC
Attila, is the upstream bug in Launchpad that I've linked to the right one?