Bug 1970358

Summary: rabbitmq-server crashes intermittently with "Slogan: no more index entries in atom_tab (max=1048576)"
Product: Red Hat OpenStack Reporter: Takashi Kajinami <tkajinam>
Component: rabbitmq-server    Assignee: Peter Lemenkov <plemenko>
Status: CLOSED ERRATA QA Contact: dabarzil
Severity: high Docs Contact:
Priority: high    
Version: 16.1 (Train)    CC: apevec, dabarzil, dciabrin, jeckersb, lhh, lmiccini, vcojot
Target Milestone: z7    Keywords: Triaged
Target Release: 16.1 (Train on RHEL 8.2)   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: rabbitmq-server-3.7.23-8.el8ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-12-09 20:19:41 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Takashi Kajinami 2021-06-10 11:09:59 UTC
Description of problem:

A customer reported an issue with rabbitmq: rabbitmq-server crashed unexpectedly,
and pacemaker noticed that rabbitmq was no longer running.

Failed Resource Actions:
  rabbitmq_monitor_10000 on rabbitmq-bundle-2 'not running' (7): ...

The erl_crash.dump file left on the node indicates that rabbitmq-server crashed
because the atom table (atom_tab) was exhausted.
~~~
=erl_crash_dump:0.5
Fri Jun  1 01:23:45 2021
Slogan: no more index entries in atom_tab (max=1048576)
System version: Erlang/OTP 21 [erts-10.3.5.12] [source] [64-bit] [smp:48:48] [ds:48:48:10] [async-threads:768] [hipe]
~~~

Most of the entries in the atom table (about 1 million) match the following pattern:
 rabbitmqcli-<id>-rabbit@<hostname>
It is likely that such an atom is generated each time the rabbitmqctl command is
executed and is never purged (Erlang atoms are never garbage-collected).
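For reference, a quick way to check whether a node is approaching this condition. The live commands are shown as comments (they assume a reachable rabbitmq node); the crash-dump sample below is fabricated purely to illustrate the grep:

```shell
# On a live node, current atom usage can be compared against the limit with:
#   rabbitmqctl eval -q 'erlang:system_info(atom_count).'
#   rabbitmqctl eval -q 'erlang:system_info(atom_limit).'
# Offline, the leaked per-invocation atoms can be counted in the crash dump.
# The file below is a fabricated sample matching the pattern from this bug:
cat > /tmp/erl_crash_sample.txt <<'EOF'
=atoms
rabbitmqcli-1-rabbit@controller-0
rabbitmqcli-2-rabbit@controller-0
rabbit@controller-0
EOF
grep -c 'rabbitmqcli-' /tmp/erl_crash_sample.txt   # prints 2
```

A count from atom_count that keeps growing toward atom_limit after each rabbitmqctl run is consistent with the leak described above.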


Version-Release number of selected component (if applicable):


In the deployment where the issue was last observed, the following version of
rabbitmq-server is installed in the rabbitmq-server container:
 rabbitmq-server-3.7.23-2.el8ost.x86_64
This appears to be the latest package available in RHOSP 16.1.

How reproducible:
The customer has experienced rabbitmq crashes several times in multiple deployments.
We did not previously capture the details of the crashes, but the issue likely
reproduces after the system has been kept running for a while.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 5 Takashi Kajinami 2021-06-10 12:19:47 UTC
Let me put a note here for others' reference...

It seems the following commit is the cause.
 https://github.com/rabbitmq/rabbitmq-cli/pull/271/files

3.8.z is also affected, since this problematic change has been included since 3.8.0.
 https://github.com/rabbitmq/rabbitmq-cli/pull/271/commits/3c4facc46e17c4ae8d08618e6a19ab6025440875

The following pull request fixed the issue.
 https://github.com/rabbitmq/rabbitmq-cli/pull/461

Comment 6 Takashi Kajinami 2021-06-10 12:49:27 UTC
If I understand the issue correctly, there are two workarounds available until the fix is backported.

1. Regularly restart rabbitmq-server.

2. Set the +t option to increase the maximum number of atoms.
   1048576 (pid max) x 4 (> 3 controller nodes) = 4194304 would be enough.

/var/lib/config-data/puppet-generated/rabbitmq/etc/rabbitmq/rabbitmq-env.conf
~~~
...
RABBITMQ_SERVER_ERL_ARGS="+K true +P 1048576 +t 4194304 -kernel inet_default_connect_options [{nodelay,true}]"
...
~~~
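The arithmetic behind the suggested +t value can be sketched as follows (the fourth node's worth of headroom above the 3 controller nodes is an assumption, not a hard requirement):

```shell
# Rough sizing for +t: at worst one leaked atom per rabbitmqctl invocation,
# bounded by pid_max (+P) per node; multiply by node count plus headroom.
pid_max=1048576
nodes=4   # 3 controller nodes plus one node of headroom
echo $(( pid_max * nodes ))   # prints 4194304
```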

I'd appreciate any feedback or additional suggestions about workarounds as well.

Comment 7 Peter Lemenkov 2021-06-10 14:26:39 UTC
(In reply to Takashi Kajinami from comment #6)
> If I understand the issue correctly there are two workarounds available
> until the fix is backported.
> 
> 1. regularly restart rabbitmq-server
> 
> 2. Set +t option to increase maximum number of atoms.
>    1048576(pid max) x 4 (>3 controller nodes) = 4194304 would be enough.
> 
> /var/lib/config-data/puppet-generated/rabbitmq/etc/rabbitmq/rabbitmq-env.conf
> ~~~
> ...
> RABBITMQ_SERVER_ERL_ARGS="+K true +P 1048576 +t 4194304 -kernel
> inet_default_connect_options [{nodelay,true}]"
> ...
> ~~~
> 
> I'd appreciate any feedback or additional suggestion about workaround as
> well.

Upgrading to any version after 3.8.10 will fix that as well.

Comment 8 Takashi Kajinami 2021-06-10 14:41:16 UTC
(In reply to Peter Lemenkov from comment #7)
> (In reply to Takashi Kajinami from comment #6)
> > If I understand the issue correctly there are two workarounds available
> > until the fix is backported.
> > 
> > 1. regularly restart rabbitmq-server
> > 
> > 2. Set +t option to increase maximum number of atoms.
> >    1048576(pid max) x 4 (>3 controller nodes) = 4194304 would be enough.
> > 
> > /var/lib/config-data/puppet-generated/rabbitmq/etc/rabbitmq/rabbitmq-env.conf
> > ~~~
> > ...
> > RABBITMQ_SERVER_ERL_ARGS="+K true +P 1048576 +t 4194304 -kernel
> > inet_default_connect_options [{nodelay,true}]"
> > ...
> > ~~~
> > 
> > I'd appreciate any feedback or additional suggestion about workaround as
> > well.
> 
> Upgrade to any version after 3.8.10 will fix that as well.

Ah yes. I could not find which version first included the fix, but I confirmed that 3.8.10 has it.

However, 3.7.23-2 is currently the latest package available in the RHOSP 16.1 repo, so we
need a new version released in the RHOSP 16.1 repo.

Comment 9 Peter Lemenkov 2021-06-10 20:12:40 UTC
Should be fixed in rabbitmq-server-3.7.23-8.el8ost

Comment 11 Takashi Kajinami 2021-07-07 00:58:08 UTC
Just as a note...

The following change removed usage of RABBITMQ_SERVER_ERL_ARGS.
 https://review.opendev.org/c/openstack/tripleo-heat-templates/+/739750

Since that removal, the maximum number of atoms is set to 5000000,
which is the default defined in the rabbitmq-env script.
~~~
[heat-admin@controller-0 ~]$ cat /etc/rhosp-release 
Red Hat OpenStack Platform release 16.1.6 GA (Train)
[heat-admin@controller-0 ~]$ sudo cat /var/lib/config-data/puppet-generated/rabbitmq/etc/rabbitmq/rabbitmq-env.conf 
LANG=en_US.UTF-8
LC_ALL=en_US.UTF-8
NODE_IP_ADDRESS=
NODE_PORT=
RABBITMQ_CTL_DIST_PORT_MAX=25683
RABBITMQ_CTL_DIST_PORT_MIN=25673
RABBITMQ_CTL_ERL_ARGS="+sbwt none"
RABBITMQ_NODENAME=rabbit@controller-0
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+sbwt none"
export ERL_EPMD_ADDRESS=172.17.1.23
export ERL_INETRC=/etc/rabbitmq/inetrc
[heat-admin@controller-0 ~]$ ps aux | grep beam.smp | grep -v grep
42439      15788  0.5  0.3 5431420 104212 ?      Sl   Jul06   6:13 /usr/lib64/erlang/erts-10.3.5.15/bin/beam.smp -W w -A 128 -MBas ageffcbf -MHas ageffcbf -MBlmbcs 512 -MHlmbcs 512 -MMmcs 30 -P 1048576 -t 5000000 -stbt db -zdbbl 128000 -K true -sbwt none -B i -- -root /usr/lib64/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.7.23/ebin  -noshell -noinput -s rabbit boot -sname rabbit@controller-0 -boot start_sasl -config /etc/rabbitmq/rabbitmq -kernel inet_default_connect_options [{nodelay,true}] -sasl errlog_type error -sasl sasl_error_logger false -rabbit lager_log_root "/var/log/rabbitmq" -rabbit lager_default_file "/var/log/rabbitmq/rabbit" -rabbit lager_upgrade_file "/var/log/rabbitmq/rabbit" -rabbit feature_flags_file "/var/lib/rabbitmq/mnesia/rabbit@controller-0-feature_flags" -rabbit enabled_plugins_file "/etc/rabbitmq/enabled_plugins" -rabbit plugins_dir "/usr/lib/rabbitmq/plugins:/usr/lib/rabbitmq/lib/rabbitmq_server-3.7.23/plugins" -rabbit plugins_expand_dir "/var/lib/rabbitmq/mnesia/rabbit@controller-0-plugins-expand" -os_mon start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup false -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@controller-0"
[heat-admin@controller-0 ~]$  sudo podman exec -it $(sudo podman ps -q -f name=rabbitmq) rabbitmqctl eval -q " erlang:system_info(atom_limit)."
5000000
~~~

So the current limit of 5000000 is much larger than pid_max (1048576) * number of
controller nodes (3) = 3145728, so it is unlikely that we will hit this issue in
later z releases.

Comment 19 dabarzil 2021-09-09 07:24:05 UTC
The new atom limit is set:

()[root@controller-0 /]# rpm -qa|grep rabbitmq-server
rabbitmq-server-3.7.23-8.el8ost.x86_64
[root@controller-0 ~]# sudo podman exec -it $(sudo podman ps -q -f name=rabbitmq) rabbitmqctl eval -q " erlang:system_info(atom_limit)."
5000000

Comment 30 errata-xmlrpc 2021-12-09 20:19:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.7 (Train) bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3762