Description of problem:
CU experienced several fencing events on the Pacemaker controller nodes due to failures monitoring the rabbitmq cluster resource. The following was logged during the fault:
~~~
2024-05-16 00:14:37.945764+02:00 [error] <0.9405.0> ** Node 'rabbit.domain.com' not responding **
2024-05-16 00:14:37.945764+02:00 [error] <0.9405.0> ** Removing (timedout) connection **
2024-05-16 00:14:37.945764+02:00 [error] <0.9405.0>
2024-05-16 00:14:37.945976+02:00 [notice] <0.9404.0> TLS server: In state connection at tls_connection_1_3.erl:633 generated SERVER ALERT: Fatal - Internal Error
2024-05-16 00:14:37.945976+02:00 [notice] <0.9404.0> - closed
~~~
It turned out that disabling SSL for RabbitMQ inter-node (Erlang distribution) traffic by changing rabbitmq-env.conf in the following way resolved the issue:
from:
RABBITMQ_CTL_ERL_ARGS="+sbwt none +sbwtdcpu none +sbwtdio none -ssl_dist_optfile /etc/rabbitmq/ssl-dist.conf -crypto fips_mode false -pa /usr/lib64/erlang/lib/ssl-10.7.3.2/ebin -proto_dist inet_tls"
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+sbwt none +sbwtdcpu none +sbwtdio none -ssl_dist_optfile /etc/rabbitmq/ssl-dist.conf -crypto fips_mode false -pa /usr/lib64/erlang/lib/ssl-10.7.3.2/ebin -proto_dist inet_tls"
to:
RABBITMQ_CTL_ERL_ARGS="+sbwt none +sbwtdcpu none +sbwtdio none"
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+sbwt none +sbwtdcpu none +sbwtdio none"
Version-Release number of selected component (if applicable):
OSP 17.1.2
Since we haven't been able to reproduce this in our lab (and it is unlikely that we will be able to update or rebase rabbitmq in OSP 17.1 in the future), we decided to take the safest path and enforce TLSv1.2 as the default for rabbitmq.
The value is set to 'tlsv1.2' via the hieradata key 'rabbitmq::ssl_versions' and can be customized as follows:
ExtraConfig:
  rabbitmq::ssl_versions: ['XXX', 'YYY']
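As a minimal sketch of such an override, assuming it is placed in a custom environment file passed to the overcloud deployment (the file name below is hypothetical, and nesting under parameter_defaults is the usual TripleO convention rather than something stated in this report), pinning the value explicitly to the new default could look like this:
~~~yaml
# rabbitmq-tls-versions.yaml (hypothetical file name)
parameter_defaults:
  ExtraConfig:
    # 'tlsv1.2' is the default enforced by this fix. Adding other versions
    # to the list may bring back the behaviour described in this report.
    rabbitmq::ssl_versions: ['tlsv1.2']
~~~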