Bug 1740681

Summary: heartbeats missed and connection timeout
Product: Red Hat OpenStack Reporter: Hervé Beraud <hberaud>
Component: python-amqp Assignee: RHOS Maint <rhos-maint>
Status: CLOSED ERRATA QA Contact: pkomarov
Severity: high Docs Contact:
Priority: high    
Version: 13.0 (Queens) CC: amcleod, apevec, bshephar, jmelvin, lhh, pkomarov
Target Milestone: z9 Keywords: Triaged, ZStream
Target Release: 13.0 (Queens)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: python-amqp-2.3.2-4.el7ost Doc Type: Bug Fix
Doc Text:
Previously, SSLError timeouts were not handled correctly, which caused connection issues that affected the oslo.messaging RabbitMQ driver and the services that use oslo.messaging. With this update, SSLError timeouts are treated as socket timeouts, which means that oslo.messaging and dependent services stop logging RabbitMQ heartbeat errors and the connection between services and the RabbitMQ server remains stable.
Story Points: ---
Clone Of:
: 1741267 (view as bug list) Environment:
Last Closed: 2019-11-07 14:04:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1741267    

Description Hervé Beraud 2019-08-13 13:22:55 UTC
Description of problem:

py-amqp does not properly handle SSLError timeouts.

If I'm right, without specific handling for this case (an SSLError timeout), socket.timeout was sometimes not raised, which could cause the connection to lock up.

This is related to a Python bug: https://bugs.python.org/issue10272
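
To illustrate the idea of treating SSLError timeouts as socket timeouts, here is a minimal sketch. It is not the py-amqp code itself: the function names `is_ssl_timeout` and `read_normalizing_timeouts` are hypothetical, and the check on the "timed out" message text is an assumption about how such a timeout can be recognised.

```python
import socket
import ssl


def is_ssl_timeout(exc):
    """Return True if exc looks like an SSL-level timeout (assumed heuristic)."""
    return isinstance(exc, ssl.SSLError) and 'timed out' in str(exc)


def read_normalizing_timeouts(read, nbytes):
    """Call read(nbytes) and re-raise SSL timeouts as socket.timeout.

    `read` is any callable taking a byte count, e.g. the recv method of a
    socket. Callers that only catch socket.timeout (for example, heartbeat
    logic) then see the same exception for plain and SSL connections.
    """
    try:
        return read(nbytes)
    except socket.timeout:
        # Already the expected exception type: pass it through unchanged.
        raise
    except ssl.SSLError as exc:
        if is_ssl_timeout(exc):
            # Convert the SSL-specific timeout into a regular socket timeout.
            raise socket.timeout(str(exc))
        # Any other SSL failure is a real error and must not be hidden.
        raise
```

With a wrapper like this, a timeout on a TLS connection reaches the caller as `socket.timeout`, while unrelated SSL errors still propagate as `ssl.SSLError`.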

The current version of py-amqp on my freshly deployed version of OSP13 is:

```
$ rpm -qa | grep amqp
python2-amqp-2.3.2-3.el7ost.noarch
```

py-amqp was patched for the SSL issue in version 2.4.1:

```
$ git log v2.4.0..v2.4.1 --no-merges --oneline
ba132f4 Bump version: 2.4.0 → 2.4.1
e669e83 Updated changelog.
2356f42 Treat EWOULDBLOCK as timeout (#253)
bf122a0 Always treat SSLError timeouts as socket timeouts (#247)
457b3ba Support float read_timeout/write_timeout (#246)
40e0ef5 Add unit test for SSLTransport _write function (#251)
e45ea3e read_frame python3 compatible for large payloads (#248)
734305d Add unit test for test_wrap_socket_sni (#250)
60acabc Fix crash in basic_publish when broker does not support connection.blocked capability (#244)
f507172 basic_consume() should return consumer tag instead of tuple (#240)
d09a0b0 Parametrize product_version in integration tests (#236)
0f7ffd2 Bump PyPy to 6.0. Add PyPy3 to the build process. (#238)
```

So the fix (bf122a0 Always treat SSLError timeouts as socket timeouts (#247)) is not included in our version (2.3.2).

The fix was released through this patch https://github.com/celery/py-amqp/pull/247

Possibly this BZ is also related to:
- https://bugzilla.redhat.com/show_bug.cgi?id=1725917
- https://bugzilla.redhat.com/show_bug.cgi?id=1734203
- https://bugzilla.redhat.com/show_bug.cgi?id=1733930
- https://bugs.launchpad.net/ubuntu/+source/oslo.messaging/+bug/1800957

Could you release a new py-amqp package for OSP13/14, based on 2.4.1 or higher, so that the patch is included?

Version-Release number of selected component (if applicable):

2.3.2-2


How reproducible:

Unknown


Steps to Reproduce:
1. 
2.
3.

Actual results:

In some circumstances, AMQP heartbeats from oslo.messaging can fail and the driver can be disconnected. Timeouts and related errors then appear in the service logs (nova-api, for example). See https://bugzilla.redhat.com/show_bug.cgi?id=1725917 for more details.

Expected results:

No error logs related to missed heartbeats or connection timeouts.

Comment 1 Hervé Beraud 2019-08-14 15:35:57 UTC
We can't bump the package version because of the OpenStack requirements constraints, so I'll only backport the needed fix:

bf122a0 Always treat SSLError timeouts as socket timeouts (#247)

Comment 2 Hervé Beraud 2019-08-14 15:58:14 UTC
Fixed in version python-amqp-2.1.4-3.el7ost

Comment 3 Hervé Beraud 2019-08-14 18:01:02 UTC
Cross tagged from OSP14 with python-amqp-2.3.2-4.el7ost

```
$ brew list-tag-history --build=python-amqp-2.3.2-4.el7ost
Wed Aug 14 19:39:34 2019: python-amqp-2.3.2-4.el7ost tagged into rhos-14.0-rhel-7-candidate by hberaud [still active]
Wed Aug 14 19:56:26 2019: python-amqp-2.3.2-4.el7ost tagged into rhos-13.0-rhel-7-candidate by hberaud [still active]
```

Comment 4 Hervé Beraud 2019-10-07 09:26:08 UTC
*** Bug 1725917 has been marked as a duplicate of this bug. ***

Comment 13 pkomarov 2019-10-23 23:38:19 UTC
Verified:

[stack@undercloud-0 ~]$ rhos-release -L
Installed repositories (rhel-7.7):
  13
  ceph-3
  ceph-osd-3
  rhel-7.7
[stack@undercloud-0 ~]$ cat core_puddle_version 
2019-10-23.1[stack@undercloud-0 ~]$ 
[stack@undercloud-0 ~]$ 
[stack@undercloud-0 ~]$  rpm -qa | grep amqp
python2-amqp-2.3.2-5.el7ost.noarch
[stack@undercloud-0 ~]$ 
[stack@undercloud-0 ~]$ rpm -q --changelog python2-amqp-2.3.2-5.el7ost.noarch|grep SSL
- Always treat SSLError timeouts as socket timeouts (#247) (rhbz#1741267)

Comment 15 errata-xmlrpc 2019-11-07 14:04:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3791