Bug 2187966

Summary: handshake_timeout,frame_header errors in RabbitMQ logs in RHOSP 16.1.8 deployment with internal TLS
Product: Red Hat OpenStack
Reporter: Alex Stupnikov <astupnik>
Component: python-amqp
Assignee: OSP Team <rhos-maint>
Status: NEW
QA Contact: Nobody <nobody>
Severity: medium
Priority: medium
Version: 16.1 (Train)
CC: apevec, lhh, lmiccini
Keywords: Triaged
Target Milestone: ---
Target Release: ---
Hardware: All
OS: All
Type: Bug

Description Alex Stupnikov 2023-04-19 09:30:26 UTC
Description of problem:
While investigating a RabbitMQ crash in a customer's deployment, I found numerous errors like:

    2023-04-17 15:07:50.755 [info] <0.22890.66> accepting AMQP connection <0.22890.66> (192.168.1.19:39862 -> 192.168.1.19:5672)
    2023-04-17 15:07:50.755 [error] <0.22890.66> closing AMQP connection <0.22890.66> (192.168.1.19:39862 -> 192.168.1.19:5672):
    {handshake_timeout,handshake}

OR

    2023-04-17 15:07:40.754 [info] <0.22495.66> accepting AMQP connection <0.22495.66> (192.168.1.19:39868 -> 192.168.1.19:5672)
    2023-04-17 15:07:50.753 [error] <0.22495.66> closing AMQP connection <0.22495.66> (192.168.1.19:39868 -> 192.168.1.19:5672):
    {handshake_timeout,frame_header}

I collected a tcpdump to understand the problem better. Compared with "good" connections, the client appears to stop participating in connection establishment after the initial exchange. Some time ago there was a known issue in python-amqp that affected environments where TLS was used to establish communications:
https://bugs.launchpad.net/oslo.messaging/+bug/1800957
https://review.opendev.org/c/openstack/oslo.messaging/+/638735/1/releasenotes/notes/amqp-tls-issue-57c7f6ea894e03d7.yaml

However, RHOSP 16.1 uses a newer version of python-amqp. I am reporting this as a bug to request a second look from engineering, and I will provide information about the collected data privately.

Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform release 16.1.8 (Train)

How reproducible:
Errors are generated sporadically in /var/log/containers/rabbitmq/rabbit

Actual results:
Occasional handshake_timeout errors in /var/log/containers/rabbitmq/rabbit

Expected results:
No handshake_timeout errors in RabbitMQ logs
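
To quantify how often these errors occur, the log can be scanned for connections closed with a handshake_timeout reason. The following is a minimal sketch, not a tool used in this investigation: it assumes log lines shaped like the excerpts above (the reason tuple may follow the "closing AMQP connection" line), and the function name find_handshake_timeouts is hypothetical.

```python
import re
from datetime import datetime

# Matches the "accepting"/"closing" lines as they appear in the excerpts above.
LINE_RE = re.compile(
    r"^\s*(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<level>\w+)\] (?P<pid><[\d.]+>) "
    r"(?P<action>accepting|closing) AMQP connection"
)
# Matches the reason tuple, e.g. {handshake_timeout,frame_header}.
REASON_RE = re.compile(r"\{handshake_timeout,(?P<stage>\w+)\}")


def find_handshake_timeouts(lines):
    """Return a list of (pid, stage, seconds_open) for handshake timeouts.

    seconds_open is None when the matching "accepting" line was not seen.
    """
    accepted = {}   # pid -> time the connection was accepted
    pending = None  # (pid, close_time) waiting for its reason tuple
    results = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S.%f")
            if m["action"] == "accepting":
                accepted[m["pid"]] = ts
                pending = None
            else:
                pending = (m["pid"], ts)
        reason = REASON_RE.search(line)
        if reason and pending:
            pid, closed = pending
            opened = accepted.pop(pid, None)
            elapsed = (closed - opened).total_seconds() if opened else None
            results.append((pid, reason["stage"], elapsed))
            pending = None
        elif not m:
            # The line after "closing" was not a handshake_timeout reason.
            pending = None
    return results
```

Passing an open file object for the RabbitMQ log to find_handshake_timeouts yields one tuple per timed-out connection; grouping the results by stage (handshake vs. frame_header) and by elapsed time may help separate the two error variants shown above.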