Description of problem:
We are scaling OpenShift on OpenStack, and RabbitMQ has "bounced" between the three controllers and seems to eventually "fail".

Version-Release number of selected component (if applicable):
OpenStack Ocata (OSP11) Z4
RHEL 7.5
3.10.0-862.el7.x86_64
puppet-rabbitmq-5.6.0-3.03b8592git.el7ost.noarch
rabbitmq-server-3.6.5-5.el7ost.noarch
python-heat-agent-docker-cmd-1.0.1-1.el7ost.noarch
puppet-heat-10.3.1-3.el7ost.noarch
openstack-heat-api-cloudwatch-8.0.6-1.el7ost.noarch
python-heat-agent-ansible-1.0.1-1.el7ost.noarch
python-heatclient-1.8.2-1.el7ost.noarch
python-heat-agent-apply-config-1.0.1-1.el7ost.noarch
openstack-heat-api-8.0.6-1.el7ost.noarch
heat-cfntools-1.3.0-2.el7ost.noarch
python-heat-agent-1.0.1-1.el7ost.noarch
python-heat-agent-json-file-1.0.1-1.el7ost.noarch
openstack-heat-common-8.0.6-1.el7ost.noarch
openstack-heat-engine-8.0.6-1.el7ost.noarch
python-heat-agent-hiera-1.0.1-1.el7ost.noarch
openstack-heat-api-cfn-8.0.6-1.el7ost.noarch
python-heat-agent-puppet-1.0.1-1.el7ost.noarch

How reproducible:
At this scale: ~800 app-nodes (~820 VMs), using Heat to expand the OpenShift cluster.

Steps to Reproduce:
1.
2.
3.

Actual results:
We cannot scale the infrastructure beyond ~820 VMs (800 app-nodes) through OpenStack Heat.

Expected results:
RabbitMQ does not quit or fail, and the cluster scales well beyond the numbers we currently have (>2000 app-nodes).

Additional info:
OpenStack telemetry was disabled on this cluster to reduce RabbitMQ load, Ceph load, and telemetry processes on the controllers. In addition, notifications were disabled for all but the Heat services via the TripleO templates.

Snippets from the RabbitMQ logs at the times RabbitMQ "swapped" to another controller:

=ERROR REPORT==== 2-May-2018::15:56:35 ===
closing AMQP connection <0.7234.1> (172.16.0.8:40026 -> 172.16.0.8:5672 - heat-engine:14404:16fc267e-bc70-48bf-be40-34efa7f4dece):
{inet_error,etimedout}

=ERROR REPORT==== 2-May-2018::15:56:35 ===
closing AMQP connection <0.9576.2> (172.16.0.8:46976 -> 172.16.0.8:5672 - nova-conductor:5493:4a16f392-5f7b-48c3-aba6-fb8fa1a544e3):
{inet_error,etimedout}

Snippet from the Heat logs:

2018-05-02 15:56:26.904 14423 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Socket closed
2018-05-02 15:56:29.227 14408 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Socket closed
2018-05-02 15:56:32.760 14382 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Socket closed
2018-05-02 15:56:36.299 14403 ERROR oslo.messaging._drivers.impl_rabbit [-] [80646e74-26ab-4957-a7bc-942bdd52d3ea] AMQP server overcloud-controller-2.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed
2018-05-02 15:56:36.299 14390 ERROR oslo.messaging._drivers.impl_rabbit [-] [9472aaaa-b19e-4ffc-9d2f-44450ea23b3e] AMQP server overcloud-controller-2.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed
2018-05-02 15:56:36.427 14357 ERROR oslo.messaging._drivers.impl_rabbit [-] [300a97ed-fe71-491d-b936-fc99975745b7] AMQP server overcloud-controller-2.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed
2018-05-02 15:56:36.427 14414 ERROR oslo.messaging._drivers.impl_rabbit [-] [73a7cbc7-40ae-4b07-ac6d-edbd869eaf95] AMQP server overcloud-controller-2.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed
2018-05-02 15:56:36.453 14425 ERROR oslo.messaging._drivers.impl_rabbit [-] [795a3595-00cc-480f-9f86-f2e38df0cd81] AMQP server overcloud-controller-2.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed
2018-05-02 15:56:36.485 14387 ERROR oslo.messaging._drivers.impl_rabbit [-] [c1d73b2c-9687-4b1c-bcc9-89860fa6e95d] AMQP server overcloud-controller-2.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed
2018-05-02 15:56:36.748 14403 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer
2018-05-02 15:56:36.859 14323 ERROR oslo.messaging._drivers.impl_rabbit [-] [c453f7c1-ee61-4f59-ad31-8ac22ca60f3f] AMQP server overcloud-controller-2.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed

Eventually the RabbitMQ status would look like this:

[root@overcloud-controller-0 rabbitmq]# rabbitmqctl status
Status of node 'rabbit@overcloud-controller-0' ...
[{pid,181040},
 {running_applications,
     [{rabbitmq_management,"RabbitMQ Management Console","3.6.5"},
      {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.6.5"},
      {webmachine,"webmachine","1.10.3"},
      {mochiweb,"MochiMedia Web Server","2.13.1"},
      {rabbitmq_management_agent,"RabbitMQ Management Agent","3.6.5"},
      {rabbit,"RabbitMQ","3.6.5"},
      {os_mon,"CPO CXC 138 46","2.4"},
      {amqp_client,"RabbitMQ AMQP Client","3.6.5"},
      {rabbit_common,[],"3.6.5"},
      {mnesia,"MNESIA CXC 138 12","4.13.4"},
      {ssl,"Erlang/OTP SSL application","7.3.3.2"},
      {ranch,"Socket acceptor pool for TCP protocols.","1.2.1"},
      {xmerl,"XML parser","1.3.10"},
      {inets,"INETS CXC 138 49","6.2.4.1"},
      {public_key,"Public key infrastructure","1.1.1"},
      {asn1,"The Erlang ASN1 compiler version 4.0.2","4.0.2"},
      {syntax_tools,"Syntax tools","1.7"},
      {compiler,"ERTS CXC 138 10","6.0.3.1"},
      {crypto,"CRYPTO","3.6.3.1"},
      {sasl,"SASL CXC 138 11","2.7"},
      {stdlib,"ERTS CXC 138 10","2.8"},
      {kernel,"ERTS CXC 138 10","4.2"}]},
 {os,{unix,linux}},
 {erlang_version,
     "Erlang/OTP 18 [erts-7.3.1.4] [source] [64-bit] [smp:64:64] [async-threads:1024] [hipe] [kernel-poll:true]\n"},
 {memory,
     [{total,210917848},
      {connection_readers,0},
      {connection_writers,0},
      {connection_channels,0},
      {connection_other,2808},
      {queue_procs,34957016},
      {queue_slave_procs,34580952},
      {plugins,7294808},
      {other_proc,22999640},
      {mnesia,14196216},
      {mgmt_db,7092616},
      {msg_index,1356472},
      {other_ets,4359320},
      {binary,2034424},
      {code,27641442},
      {atom,992409},
      {other_system,53409725}]},
 {alarms,[]},
 {listeners,[{clustering,25672,"::"},{amqp,5672,"172.16.0.13"}]},
 {vm_memory_high_watermark,0.4},
 {vm_memory_limit,107959038771},
 {disk_free_limit,50000000},
 {disk_free,446380048384},
 {file_descriptors,
     [{total_limit,65436},
      {total_used,2},
      {sockets_limit,58890},
      {sockets_used,0}]},
 {processes,[{limit,1048576},{used,10948}]},
 {run_queue,0},
 {uptime,30},
 {kernel,{net_ticktime,60}}]

^ Note that sockets_used has dropped to 0.

"pcs resource restart rabbitmq-clone" seems to keep the service up for a short period of time before it eventually "fails" again. Restarting via pcs results in an in-sequence restart across the controllers, and all connections to RabbitMQ appear to land on controller-0; when it fails, the connections don't seem to "rebalance" to the other controllers.

Logs coming when I can pull all of them down.
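For reference, a minimal sketch (assuming the stock rabbitmqctl and pcs CLIs that ship with OSP11) of commands that can be run on each controller to see where client connections are actually landing after a pcs-driven restart:

# nodes currently joined to the RabbitMQ cluster
rabbitmqctl cluster_status

# client connections terminating on this node (column names as in rabbitmqctl 3.6.x)
rabbitmqctl list_connections user peer_host state

# pacemaker's view of the rabbitmq-clone resource
pcs status | grep -A 3 rabbitmq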
Created attachment 1430747 [details]
rabbitmq crash dump
Can you paste the contents of /etc/rabbitmq/rabbitmq-env.conf?

My first thought is that the default configured TCP timeout is too short at this scale, we probably need to increase it or just remove it completely from the config.

For Queens and newer, we don't even set this anymore, see https://review.openstack.org/#/c/503788/
(In reply to John Eckersberg from comment #3)
> Can you paste the contents of /etc/rabbitmq/rabbitmq-env.conf?
>
> My first thought is that the default configured TCP timeout is too short at
> this scale, we probably need to increase it or just remove it completely
> from the config.
>
> For Queens and newer, we don't even set this anymore, see
> https://review.openstack.org/#/c/503788/

The specific environment was rebuilt with the same process/templates, so the contents of rabbitmq-env.conf should be *almost* the same.

[root@overcloud-controller-0 ~]# cat /etc/rabbitmq/rabbitmq-env.conf
NODE_IP_ADDRESS=
NODE_PORT=
RABBITMQ_NODENAME=rabbit@overcloud-controller-0
RABBITMQ_SERVER_ERL_ARGS="+K true +P 1048576 -kernel inet_default_connect_options [{nodelay,true},{raw,6,18,<<5000:64/native>>}] -kernel inet_default_listen_options [{raw,6,18,<<5000:64/native>>}]"
export ERL_EPMD_ADDRESS=172.16.0.16
export ERL_INETRC=/etc/rabbitmq/inetrc
Try changing the RABBITMQ_SERVER_ERL_ARGS line to:

RABBITMQ_SERVER_ERL_ARGS="+K true +P 1048576 -kernel inet_default_connect_options [{nodelay,true}]"

That will remove the TCP timeout from being set on the rabbitmq sockets.
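For context (my reading, not something verified in this bug): on Linux, socket level 6 is IPPROTO_TCP and option 18 is TCP_USER_TIMEOUT, so {raw,6,18,<<5000:64/native>>} appears to set a 5000 ms TCP user timeout on every RabbitMQ socket, which is what the change above removes. A rough sketch of how one might sanity-check the constant and confirm the new arguments took effect after a restart (assumes kernel-headers is installed and the usual pcs/ps tooling is available):

# option 18 should show up as TCP_USER_TIMEOUT in the kernel uapi headers
grep TCP_USER_TIMEOUT /usr/include/linux/tcp.h

# restart the clone, then check the running beam.smp arguments for the trimmed option list
pcs resource restart rabbitmq-clone
ps -o args= -C beam.smp | tr ' ' '\n' | grep -A1 inet_default_connect_options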
Did John's suggestion have the desired effect?
(In reply to Andrew Beekhof from comment #6)
> Did John's suggestion have the desired effect?

When we rebuilt the cluster, the issue of pacemaker starting/stopping RabbitMQ did not recur, even after we scaled the OpenShift cluster to more than twice the size at which we first saw it. We did not apply John's suggestion, since we wanted the issue to reproduce before applying any changes. We will be moving to a newer version of OSP for future testing, so at this point I think the bug should be closed as not reproduced. Thanks for the time and effort.
Peter: Any changes or documentation we need based on this feedback or can we go ahead and close this one?
Thanks Alex. OSP11 is now EOL anyway, so please let us know if you have issues with later versions.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days