Bug 1741267

Summary: heartbeats missed and connection timeout
Product: Red Hat OpenStack Reporter: Hervé Beraud <hberaud>
Component: python-amqp Assignee: RHOS Maint <rhos-maint>
Status: CLOSED ERRATA QA Contact: pkomarov
Severity: high Docs Contact:
Priority: high    
Version: 14.0 (Rocky) CC: apevec, lhh, nlevinki, pkomarov, rheslop, rhos-maint
Target Milestone: z8 Keywords: Triaged, ZStream
Target Release: 14.0 (Rocky)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: python-amqp-2.3.2-4.el7ost Doc Type: Bug Fix
Doc Text:
Previously, SSLError timeouts were not handled properly: socket.timeout() was not raised, which could cause rabbitmq driver connections to lock up. This patch ensures that SSLError timeouts are treated as socket timeouts, so that oslo.messaging and services log errors related to the rabbitmq heartbeat and so that the connection between the service and the rabbitmq server remains stable.
Story Points: ---
Clone Of: 1740681 Environment:
Last Closed: 2019-11-06 16:53:57 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1740681    
Bug Blocks:    

Description Hervé Beraud 2019-08-14 15:59:55 UTC
+++ This bug was initially created as a clone of Bug #1740681 +++

Description of problem:

py-amqp does not properly handle SSLError timeouts.

If I'm right, without specific handling of this case (SSLError timeout), socket.timeout() was sometimes not raised, causing the connection to lock up.

This is related to a Python bug: https://bugs.python.org/issue10272
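
To illustrate the idea of treating an SSLError timeout as a socket timeout, here is a minimal sketch. It is not the upstream py-amqp patch: the helper names `is_ssl_timeout` and `read_normalizing_timeouts` are hypothetical, and matching on the text "timed out" in the error message is an assumption about how such timeouts are reported.

```python
import socket
import ssl


def is_ssl_timeout(exc):
    """Return True if exc is an SSLError whose message reports a timeout.

    Assumption: the timeout is recognizable by 'timed out' in the message.
    """
    return isinstance(exc, ssl.SSLError) and 'timed out' in str(exc)


def read_normalizing_timeouts(read_fn, nbytes):
    """Call read_fn(nbytes) and re-raise SSLError timeouts as socket.timeout.

    read_fn is any callable that reads nbytes from a (possibly TLS) socket.
    """
    try:
        return read_fn(nbytes)
    except ssl.SSLError as exc:
        if is_ssl_timeout(exc):
            # Callers that only catch socket.timeout now see this case too.
            raise socket.timeout(str(exc))
        # Any other SSL failure is left untouched.
        raise
```

With a wrapper of this kind, code that already handles `socket.timeout` (for example a heartbeat loop) would not need a separate branch for TLS connections.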

The current version of py-amqp on my freshly deployed version of OSP13 is:

```
$ rpm -qa | grep amqp
python2-amqp-2.3.2-3.el7ost.noarch
```

py-amqp was patched for the SSL issue in version 2.4.1

```
$ git log v2.4.0..v2.4.1 --no-merges --oneline
ba132f4 Bump version: 2.4.0 → 2.4.1
e669e83 Updated changelog.
2356f42 Treat EWOULDBLOCK as timeout (#253)
bf122a0 Always treat SSLError timeouts as socket timeouts (#247)
457b3ba Support float read_timeout/write_timeout (#246)
40e0ef5 Add unit test for SSLTransport _write function (#251)
e45ea3e read_frame python3 compatible for large payloads (#248)
734305d Add unit test for test_wrap_socket_sni (#250)
60acabc Fix crash in basic_publish when broker does not support connection.blocked capability (#244)
f507172 basic_consume() should return consumer tag instead of tuple (#240)
d09a0b0 Parametrize product_version in integration tests (#236)
0f7ffd2 Bump PyPy to 6.0. Add PyPy3 to the build process. (#238)
```

So our version (2.3.2) does not include the fix (bf122a0 Always treat SSLError timeouts as socket timeouts (#247)).

The fix was released through this patch https://github.com/celery/py-amqp/pull/247

Possibly this BZ is also related to:
- https://bugzilla.redhat.com/show_bug.cgi?id=1725917
- https://bugzilla.redhat.com/show_bug.cgi?id=1734203
- https://bugzilla.redhat.com/show_bug.cgi?id=1733930
- https://bugs.launchpad.net/ubuntu/+source/oslo.messaging/+bug/1800957

Can you release a new version of py-amqp based on 2.4.1 or higher, with the patch embedded, for OSP13/14?

Version-Release number of selected component (if applicable):

2.3.2-2


How reproducible:

Unknown


Steps to Reproduce:
1. 
2.
3.

Actual results:

In some circumstances, amqp heartbeats from oslo.messaging can fail and the driver can be disconnected; timeouts and related errors then appear in service logs (nova-api, for example). See https://bugzilla.redhat.com/show_bug.cgi?id=1725917 for more details.

Expected results:

No error logs related to missed heartbeat and connection timeout

--- Additional comment from Hervé Beraud on 2019-08-14 15:35:57 UTC ---

We can't bump the package version due to the OpenStack requirements constraints, so I'll backport only the needed fix:

bf122a0 Always treat SSLError timeouts as socket timeouts (#247)

--- Additional comment from Hervé Beraud on 2019-08-14 15:58:14 UTC ---

Fixed in version python-amqp-2.1.4-3.el7ost

Comment 1 Hervé Beraud 2019-08-14 17:41:34 UTC
python-amqp-2.3.2-4.el7ost

Comment 3 pkomarov 2019-10-23 22:02:25 UTC
Verified.

(undercloud) [stack@undercloud-0 ~]$ rhos-release -L
Installed repositories (rhel-7.7):
  14
  ceph-3
  ceph-osd-3
  rhel-7.7
(undercloud) [stack@undercloud-0 ~]$ cat core_puddle_version
2019-10-21.1
(undercloud) [stack@undercloud-0 ~]$ rpm -qa | grep amqp
python2-amqp-2.3.2-5.el7ost.noarch

(undercloud) [stack@undercloud-0 ~]$ rpm -q --changelog python2-amqp-2.3.2-5.el7ost.noarch|grep SSL
- Always treat SSLError timeouts as socket timeouts (#247) (rhbz#1741267)
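
The changelog check above could be scripted along these lines. This is only a sketch: `changelog_mentions_bug` is a hypothetical helper, and it assumes the changelog text has already been captured (for example from `rpm -q --changelog`) and that backports are tagged in the `rhbz#<id>` form shown above.

```python
import re


def changelog_mentions_bug(changelog_text, bug_id):
    """Return True if any changelog line references rhbz#<bug_id>."""
    # \b keeps rhbz#1741267 from matching a longer id such as rhbz#17412670.
    pattern = re.compile(r'rhbz#%d\b' % bug_id)
    return any(pattern.search(line) for line in changelog_text.splitlines())


sample = (
    "- Always treat SSLError timeouts as socket timeouts (#247) "
    "(rhbz#1741267)\n"
)
print(changelog_mentions_bug(sample, 1741267))
```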

Comment 5 errata-xmlrpc 2019-11-06 16:53:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3747