Bug 1970358 - rabbitmq-server crashes intermittently with "Slogan: no more index entries in atom_tab (max=1048576)"
Summary: rabbitmq-server crashes intermittently with "Slogan: no more index entries in ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 16.1 (Train)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: z7
: 16.1 (Train on RHEL 8.2)
Assignee: Peter Lemenkov
QA Contact: dabarzil
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-06-10 11:09 UTC by Takashi Kajinami
Modified: 2024-10-01 18:34 UTC (History)
7 users (show)

Fixed In Version: rabbitmq-server-3.7.23-8.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-12-09 20:19:41 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github rabbitmq rabbitmq-cli pull 461 0 None closed Limit the set of CLI node names that can be generated to prevent potential atom table exhaustion 2021-06-10 11:53:49 UTC
Github rabbitmq rabbitmq-server issues 549 0 None closed Atom table exhaustion due to rabbitmqctl node names 2021-06-10 11:53:49 UTC
Red Hat Issue Tracker OSP-4544 0 None None None 2021-11-18 11:33:45 UTC
Red Hat Knowledge Base (Solution) 6112701 0 None None None 2021-06-11 00:27:07 UTC
Red Hat Product Errata RHBA-2021:3762 0 None None None 2021-12-09 20:19:59 UTC

Description Takashi Kajinami 2021-06-10 11:09:59 UTC
Description of problem:

A customer has reported an issue where rabbitmq-server crashed unexpectedly,
and pacemaker noticed that rabbitmq is no longer running.

Failed Resource Actions:
  rabbitmq_monitor_10000 on rabbitmq-bundle-2 'not running' (7): ...

The erl_crash.dump file left on the node indicates that rabbitmq-server crashed
because atom_tab was exhausted.
~~~
=erl_crash_dump:0.5
Fri Jun  1 01:23:45 2021
Slogan: no more index entries in atom_tab (max=1048576)
System version: Erlang/OTP 21 [erts-10.3.5.12] [source] [64-bit] [smp:48:48] [ds:48:48:10] [async-threads:768] [hipe]
~~~

Most of the atoms (about 1 million) match the following pattern:
 rabbitmqcli-<id>-rabbit@<hostname>
It is likely that a new atom of this form is generated every time the rabbitmqctl
command is executed and is never purged.
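For illustration, here is a minimal Python sketch of the failure mode. This is a simplified model, not the real implementation: the `AtomTable` class, the `controller-0` hostname, the pool size, and both helper functions are hypothetical; the real atom table lives inside the Erlang VM, which interns atoms permanently and caps the table at a fixed size.

```python
import itertools

class AtomTable:
    """Simplified model of the Erlang atom table: interned names are
    never purged and the table is hard-capped at a fixed size."""
    def __init__(self, limit):
        self.limit = limit
        self.atoms = {}

    def intern(self, name):
        if name not in self.atoms:
            if len(self.atoms) >= self.limit:
                raise RuntimeError(
                    f"no more index entries in atom_tab (max={self.limit})")
            self.atoms[name] = len(self.atoms)
        return self.atoms[name]

# Before the fix: every rabbitmqctl run registered a fresh unique node
# name, so each run permanently added one more atom to the table.
_leaky_ids = itertools.count()
def leaky_cli_node_name():
    return f"rabbitmqcli-{next(_leaky_ids)}-rabbit@controller-0"

# After the fix (rabbitmq-cli PR 461): the id is drawn from a bounded
# pool, so the number of distinct atoms these names can create is capped.
_bounded_ids = itertools.count()
def bounded_cli_node_name(pool=1024):
    return f"rabbitmqcli-{next(_bounded_ids) % pool}-rabbit@controller-0"

table = AtomTable(limit=4096)  # tiny limit so the demo runs instantly
for _ in range(100_000):
    table.intern(bounded_cli_node_name())
print(len(table.atoms))  # 1024: bounded, no matter how many CLI runs
```

With the leaky naming scheme, the same loop would raise the "no more index entries in atom_tab" error as soon as the table filled up, which mirrors the crash slogan above.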


Version-Release number of selected component (if applicable):


In the deployment where the issue was last observed, the following version of rabbitmq-server
is installed in the rabbitmq-server container:
 rabbitmq-server-3.7.23-2.el8ost.x86_64
This appears to be the latest package available in RHOSP 16.1.

How reproducible:
The customer has experienced the rabbitmq crash several times in multiple deployments.
We did not capture the details of the earlier crashes, but it is likely that the issue
reproduces after keeping the system running for long enough.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 5 Takashi Kajinami 2021-06-10 12:19:47 UTC
Let me put a note here for others' reference...

It seems the following commit is the cause.
 https://github.com/rabbitmq/rabbitmq-cli/pull/271/files

3.8.z is also affected, since this problematic change has been included since 3.8.0.
 https://github.com/rabbitmq/rabbitmq-cli/pull/271/commits/3c4facc46e17c4ae8d08618e6a19ab6025440875

The following pull request fixed the issue.
 https://github.com/rabbitmq/rabbitmq-cli/pull/461

Comment 6 Takashi Kajinami 2021-06-10 12:49:27 UTC
If I understand the issue correctly, there are two workarounds available until the fix is backported.

1. Regularly restart rabbitmq-server.

2. Set the +t option to increase the maximum number of atoms.
   1048576 (pid max) x 4 (> 3 controller nodes) = 4194304 should be enough.

/var/lib/config-data/puppet-generated/rabbitmq/etc/rabbitmq/rabbitmq-env.conf
~~~
...
RABBITMQ_SERVER_ERL_ARGS="+K true +P 1048576 +t 4194304 -kernel inet_default_connect_options [{nodelay,true}]"
...
~~~
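As a back-of-the-envelope check of the suggested +t value (a sketch only; the factor of 4 is simply headroom above the 3 controller nodes, as stated in the workaround):

```python
PID_MAX = 1_048_576       # matches the +P value in rabbitmq-env.conf
CONTROLLER_NODES = 3      # controllers that may run rabbitmqctl
HEADROOM_FACTOR = 4       # round up above the node count for safety

atom_limit = PID_MAX * HEADROOM_FACTOR
print(atom_limit)  # 4194304, the value passed via +t
assert atom_limit > PID_MAX * CONTROLLER_NODES
```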

I'd appreciate any feedback or additional suggestions about the workarounds as well.

Comment 7 Peter Lemenkov 2021-06-10 14:26:39 UTC
(In reply to Takashi Kajinami from comment #6)
> If I understand the issue correctly there are two workarounds available
> until the fix is backported.
> 
> 1. regularly restart rabbitmq-server
> 
> 2. Set +t option to increase maximum number of atoms.
>    1048576(pid max) x 4 (>3 controller nodes) = 4194304 would be enough.
> 
> /var/lib/config-data/puppet-generated/rabbitmq/etc/rabbitmq/rabbitmq-env.conf
> ~~~
> ...
> RABBITMQ_SERVER_ERL_ARGS="+K true +P 1048576 +t 4194304 -kernel
> inet_default_connect_options [{nodelay,true}]"
> ...
> ~~~
> 
> I'd appreciate any feedback or additional suggestion about workaround as
> well.

Upgrading to 3.8.10 or any later version will fix that as well.

Comment 8 Takashi Kajinami 2021-06-10 14:41:16 UTC
(In reply to Peter Lemenkov from comment #7)
> (In reply to Takashi Kajinami from comment #6)
> > If I understand the issue correctly there are two workarounds available
> > until the fix is backported.
> > 
> > 1. regularly restart rabbitmq-server
> > 
> > 2. Set +t option to increase maximum number of atoms.
> >    1048576(pid max) x 4 (>3 controller nodes) = 4194304 would be enough.
> > 
> > /var/lib/config-data/puppet-generated/rabbitmq/etc/rabbitmq/rabbitmq-env.conf
> > ~~~
> > ...
> > RABBITMQ_SERVER_ERL_ARGS="+K true +P 1048576 +t 4194304 -kernel
> > inet_default_connect_options [{nodelay,true}]"
> > ...
> > ~~~
> > 
> > I'd appreciate any feedback or additional suggestion about workaround as
> > well.
> 
> Upgrade to any version after 3.8.10 will fix that as well.

Ah yes. I could not find which version first included the fix, but I confirmed that 3.8.10 has it.

However, 3.7.23-2 is currently the latest package available in the RHOSP 16.1 repo, so we
need a new version released there.

Comment 9 Peter Lemenkov 2021-06-10 20:12:40 UTC
Should be fixed in rabbitmq-server-3.7.23-8.el8ost

Comment 11 Takashi Kajinami 2021-07-07 00:58:08 UTC
Just as a note...

The following change removed the usage of RABBITMQ_SERVER_ERL_ARGS.
 https://review.opendev.org/c/openstack/tripleo-heat-templates/+/739750

Since that removal, the maximum number of atoms is set to 5000000,
which is the default defined in the rabbitmq-env script.
~~~
[heat-admin@controller-0 ~]$ cat /etc/rhosp-release 
Red Hat OpenStack Platform release 16.1.6 GA (Train)
[heat-admin@controller-0 ~]$ sudo cat /var/lib/config-data/puppet-generated/rabbitmq/etc/rabbitmq/rabbitmq-env.conf 
LANG=en_US.UTF-8
LC_ALL=en_US.UTF-8
NODE_IP_ADDRESS=
NODE_PORT=
RABBITMQ_CTL_DIST_PORT_MAX=25683
RABBITMQ_CTL_DIST_PORT_MIN=25673
RABBITMQ_CTL_ERL_ARGS="+sbwt none"
RABBITMQ_NODENAME=rabbit@controller-0
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+sbwt none"
export ERL_EPMD_ADDRESS=172.17.1.23
export ERL_INETRC=/etc/rabbitmq/inetrc
[heat-admin@controller-0 ~]$ ps aux | grep beam.smp | grep -v grep
42439      15788  0.5  0.3 5431420 104212 ?      Sl   Jul06   6:13 /usr/lib64/erlang/erts-10.3.5.15/bin/beam.smp -W w -A 128 -MBas ageffcbf -MHas ageffcbf -MBlmbcs 512 -MHlmbcs 512 -MMmcs 30 -P 1048576 -t 5000000 -stbt db -zdbbl 128000 -K true -sbwt none -B i -- -root /usr/lib64/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.7.23/ebin  -noshell -noinput -s rabbit boot -sname rabbit@controller-0 -boot start_sasl -config /etc/rabbitmq/rabbitmq -kernel inet_default_connect_options [{nodelay,true}] -sasl errlog_type error -sasl sasl_error_logger false -rabbit lager_log_root "/var/log/rabbitmq" -rabbit lager_default_file "/var/log/rabbitmq/rabbit" -rabbit lager_upgrade_file "/var/log/rabbitmq/rabbit" -rabbit feature_flags_file "/var/lib/rabbitmq/mnesia/rabbit@controller-0-feature_flags" -rabbit enabled_plugins_file "/etc/rabbitmq/enabled_plugins" -rabbit plugins_dir "/usr/lib/rabbitmq/plugins:/usr/lib/rabbitmq/lib/rabbitmq_server-3.7.23/plugins" -rabbit plugins_expand_dir "/var/lib/rabbitmq/mnesia/rabbit@controller-0-plugins-expand" -os_mon start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup false -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@controller-0"
[heat-admin@controller-0 ~]$  sudo podman exec -it $(sudo podman ps -q -f name=rabbitmq) rabbitmqctl eval -q " erlang:system_info(atom_limit)."
5000000
~~~

So now the limit of 5000000 is much bigger than pid_max (1048576) * number of controller nodes (3) = 3145728,
so it is unlikely that we hit this issue in later z releases.
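The headroom claim can be verified with one line of arithmetic (numbers taken from the output above):

```python
DEFAULT_ATOM_LIMIT = 5_000_000  # -t value set by the rabbitmq-env script
PID_MAX = 1_048_576             # -P value from the beam.smp command line
CONTROLLER_NODES = 3

worst_case = PID_MAX * CONTROLLER_NODES
print(worst_case)                       # 3145728
print(DEFAULT_ATOM_LIMIT > worst_case)  # True: comfortable headroom
```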

Comment 19 dabarzil 2021-09-09 07:24:05 UTC
The new atom limit is set:

()[root@controller-0 /]# rpm -qa|grep rabbitmq-server
rabbitmq-server-3.7.23-8.el8ost.x86_64
[root@controller-0 ~]# sudo podman exec -it $(sudo podman ps -q -f name=rabbitmq) rabbitmqctl eval -q " erlang:system_info(atom_limit)."
5000000

Comment 30 errata-xmlrpc 2021-12-09 20:19:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.7 (Train) bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3762

