Bug 1574535 - Many rabbitmq inet_error,etimedout eventually resulting in rabbitmq failed actions in pacemaker [NEEDINFO]
Summary: Many rabbitmq inet_error,etimedout eventually resulting in rabbitmq failed ac...
Keywords:
Status: CLOSED EOL
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: ---
: ---
Assignee: Peter Lemenkov
QA Contact: Udi Shkalim
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-05-03 13:56 UTC by Alex Krzos
Modified: 2018-06-08 12:39 UTC (History)
9 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-08 12:39:51 UTC
Target Upstream Version:
abeekhof: needinfo? (plemenko)


Attachments (Terms of Use)
rabbitmq crash dump (2.25 MB, text/plain)
2018-05-03 15:10 UTC, Alex Krzos
no flags Details

Description Alex Krzos 2018-05-03 13:56:05 UTC
Description of problem:
We are scaling OpenShift on OpenStack and Rabbitmq has "bounced" between the three controllers and seems to eventually "fail".

Version-Release number of selected component (if applicable):

OpenStack Ocata (OSP11) Z4
RHEL 7.5 3.10.0-862.el7.x86_64

puppet-rabbitmq-5.6.0-3.03b8592git.el7ost.noarch
rabbitmq-server-3.6.5-5.el7ost.noarch
python-heat-agent-docker-cmd-1.0.1-1.el7ost.noarch
puppet-heat-10.3.1-3.el7ost.noarch
openstack-heat-api-cloudwatch-8.0.6-1.el7ost.noarch
python-heat-agent-ansible-1.0.1-1.el7ost.noarch
python-heatclient-1.8.2-1.el7ost.noarch
python-heat-agent-apply-config-1.0.1-1.el7ost.noarch
openstack-heat-api-8.0.6-1.el7ost.noarch
heat-cfntools-1.3.0-2.el7ost.noarch
python-heat-agent-1.0.1-1.el7ost.noarch
python-heat-agent-json-file-1.0.1-1.el7ost.noarch
openstack-heat-common-8.0.6-1.el7ost.noarch
openstack-heat-engine-8.0.6-1.el7ost.noarch
python-heat-agent-hiera-1.0.1-1.el7ost.noarch
openstack-heat-api-cfn-8.0.6-1.el7ost.noarch
python-heat-agent-puppet-1.0.1-1.el7ost.noarch


How reproducible:
At this scale ~800 app-nodes (~820vms) using heat to expand the openshift cluster

Steps to Reproduce:
1.
2.
3.

Actual results:
Can not seem to scale the infrastructure beyond ~820ish vms (800 app-nodes) through OpenStack heat

Expected results:
Rabbitmq to not quit or fail and for the cluster to scale much beyond (>2000 app-nodes) the numbers we currently have.

Additional info:
OpenStack telemetry was disabled on this cluster to reduce rabbitmq load, ceph load, and telemetry processes on the controllers.  In addition notifications was disabled on all but heat services via tripleo templates.

Snippets from Rabbitmq logs noticed at times of rabbitmq "swapping" to another controller
=ERROR REPORT==== 2-May-2018::15:56:35 ===
closing AMQP connection <0.7234.1> (172.16.0.8:40026 -> 172.16.0.8:5672 - heat-engine:14404:16fc267e-bc70-48bf-be40-34efa7f4dece):
{inet_error,etimedout}

=ERROR REPORT==== 2-May-2018::15:56:35 ===
closing AMQP connection <0.9576.2> (172.16.0.8:46976 -> 172.16.0.8:5672 - nova-conductor:5493:4a16f392-5f7b-48c3-aba6-fb8fa1a544e3):
{inet_error,etimedout}


Snippet form heat logs:
2018-05-02 15:56:26.904 14423 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Socket closed
2018-05-02 15:56:29.227 14408 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Socket closed
2018-05-02 15:56:32.760 14382 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Socket closed
2018-05-02 15:56:36.299 14403 ERROR oslo.messaging._drivers.impl_rabbit [-] [80646e74-26ab-4957-a7bc-942bdd52d3ea] AMQP server overcloud-controller-2.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed
2018-05-02 15:56:36.299 14390 ERROR oslo.messaging._drivers.impl_rabbit [-] [9472aaaa-b19e-4ffc-9d2f-44450ea23b3e] AMQP server overcloud-controller-2.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed
2018-05-02 15:56:36.427 14357 ERROR oslo.messaging._drivers.impl_rabbit [-] [300a97ed-fe71-491d-b936-fc99975745b7] AMQP server overcloud-controller-2.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed
2018-05-02 15:56:36.427 14414 ERROR oslo.messaging._drivers.impl_rabbit [-] [73a7cbc7-40ae-4b07-ac6d-edbd869eaf95] AMQP server overcloud-controller-2.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed
2018-05-02 15:56:36.453 14425 ERROR oslo.messaging._drivers.impl_rabbit [-] [795a3595-00cc-480f-9f86-f2e38df0cd81] AMQP server overcloud-controller-2.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed
2018-05-02 15:56:36.485 14387 ERROR oslo.messaging._drivers.impl_rabbit [-] [c1d73b2c-9687-4b1c-bcc9-89860fa6e95d] AMQP server overcloud-controller-2.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed
2018-05-02 15:56:36.748 14403 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer
2018-05-02 15:56:36.859 14323 ERROR oslo.messaging._drivers.impl_rabbit [-] [c453f7c1-ee61-4f59-ad31-8ac22ca60f3f] AMQP server overcloud-controller-2.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed

Eventually rabbitmq status would look like:
[root@overcloud-controller-0 rabbitmq]# rabbitmqctl status                                                                                                                                           
Status of node 'rabbit@overcloud-controller-0' ...
[{pid,181040},
 {running_applications,
     [{rabbitmq_management,"RabbitMQ Management Console","3.6.5"},
      {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.6.5"},
      {webmachine,"webmachine","1.10.3"},
      {mochiweb,"MochiMedia Web Server","2.13.1"},
      {rabbitmq_management_agent,"RabbitMQ Management Agent","3.6.5"},
      {rabbit,"RabbitMQ","3.6.5"},
      {os_mon,"CPO  CXC 138 46","2.4"},
      {amqp_client,"RabbitMQ AMQP Client","3.6.5"},
      {rabbit_common,[],"3.6.5"},
      {mnesia,"MNESIA  CXC 138 12","4.13.4"},
      {ssl,"Erlang/OTP SSL application","7.3.3.2"},
      {ranch,"Socket acceptor pool for TCP protocols.","1.2.1"},
      {xmerl,"XML parser","1.3.10"},
      {inets,"INETS  CXC 138 49","6.2.4.1"},
      {public_key,"Public key infrastructure","1.1.1"},
      {asn1,"The Erlang ASN1 compiler version 4.0.2","4.0.2"},
      {syntax_tools,"Syntax tools","1.7"},
      {compiler,"ERTS  CXC 138 10","6.0.3.1"},
      {crypto,"CRYPTO","3.6.3.1"},
      {sasl,"SASL  CXC 138 11","2.7"},
      {stdlib,"ERTS  CXC 138 10","2.8"},
      {kernel,"ERTS  CXC 138 10","4.2"}]},
 {os,{unix,linux}},
 {erlang_version,
     "Erlang/OTP 18 [erts-7.3.1.4] [source] [64-bit] [smp:64:64] [async-threads:1024] [hipe] [kernel-poll:true]\n"},
 {memory,
     [{total,210917848},
      {connection_readers,0},
      {connection_writers,0},
      {connection_channels,0},
      {connection_other,2808},
      {queue_procs,34957016},
      {queue_slave_procs,34580952},
      {plugins,7294808},
      {other_proc,22999640},
      {mnesia,14196216},
      {mgmt_db,7092616},
      {msg_index,1356472},
      {other_ets,4359320},
      {binary,2034424},
      {code,27641442},
      {atom,992409},
      {other_system,53409725}]},
 {alarms,[]},
 {listeners,[{clustering,25672,"::"},{amqp,5672,"172.16.0.13"}]},
 {vm_memory_high_watermark,0.4},
 {vm_memory_limit,107959038771},
 {disk_free_limit,50000000},
 {disk_free,446380048384},
 {file_descriptors,
     [{total_limit,65436},
      {total_used,2},
      {sockets_limit,58890},
      {sockets_used,0}]},
 {processes,[{limit,1048576},{used,10948}]},
 {run_queue,0},
 {uptime,30},
 {kernel,{net_ticktime,60}}]

^ Note sockets used down to 0

pcs resource restart rabbitmq-clone would seem to keep the service up for a short period of time before it would eventually "fail" again

restarting via pcs results in an in sequence restart across each controller and all connections to rabbitmq seem to land on controller-0, when it fails it doesn't seem to "rebalance" those connections to the other controllers.

Logs coming when I can pull all of them down.

Comment 1 Alex Krzos 2018-05-03 15:10:32 UTC
Created attachment 1430747 [details]
rabbitmq crash dump

Comment 3 John Eckersberg 2018-05-04 14:24:17 UTC
Can you paste the contents of /etc/rabbitmq/rabbitmq-env.conf?

My first thought is that the default configured TCP timeout is too short at this scale, we probably need to increase it or just remove it completely from the config.

For Queens and newer, we don't even set this anymore, see https://review.openstack.org/#/c/503788/

Comment 4 Alex Krzos 2018-05-04 16:27:21 UTC
(In reply to John Eckersberg from comment #3)
> Can you paste the contents of /etc/rabbitmq/rabbitmq-env.conf?
> 
> My first thought is that the default configured TCP timeout is too short at
> this scale, we probably need to increase it or just remove it completely
> from the config.
> 
> For Queens and newer, we don't even set this anymore, see
> https://review.openstack.org/#/c/503788/

The specific environment was rebuilt with the same process/templates so the contents of rabbitmq-env.conf should be *almost* the same.

[root@overcloud-controller-0 ~]# cat /etc/rabbitmq/rabbitmq-env.conf 
NODE_IP_ADDRESS=
NODE_PORT=
RABBITMQ_NODENAME=rabbit@overcloud-controller-0
RABBITMQ_SERVER_ERL_ARGS="+K true +P 1048576 -kernel inet_default_connect_options [{nodelay,true},{raw,6,18,<<5000:64/native>>}] -kernel inet_default_listen_options [{raw,6,18,<<5000:64/native>>}]"
export ERL_EPMD_ADDRESS=172.16.0.16
export ERL_INETRC=/etc/rabbitmq/inetrc

Comment 5 John Eckersberg 2018-05-04 16:43:58 UTC
Try changing the RABBITMQ_SERVER_ERL_ARGS line to:

RABBITMQ_SERVER_ERL_ARGS="+K true +P 1048576 -kernel inet_default_connect_options [{nodelay,true}]"

That will remove the TCP timeout from being set on the rabbitmq sockets

Comment 6 Andrew Beekhof 2018-05-21 09:27:32 UTC
Did John's suggestion have the desired effect?

Comment 7 Alex Krzos 2018-05-21 13:12:41 UTC
(In reply to Andrew Beekhof from comment #6)
> Did John's suggestion have the desired effect?

When we rebuilt the cluster, the same issue did not occur with rabbitmq starting/stopping due to pacemaker even after we scale the OpenShift cluster to >2x the size we witnessed this issue at.  We did not apply John's suggestion since we wanted it to reproduce before applying any changes.  

We will be moving to a newer version of OSP for future testing so at this point, I think the bug should be closed as non-reproduced.  Thanks for the time and efforts.

Comment 8 Andrew Beekhof 2018-05-21 22:13:28 UTC
Peter: Any changes or documentation we need based on this feedback or can we go ahead and close this one?

Comment 9 Chris Jones 2018-06-08 12:39:51 UTC
Thanks Alex. OSP11 is now EOL anyway, so please let us know if you have issues with later versions.


Note You need to log in before you can comment on or make changes to this bug.