Bug 1970358
| Summary: | rabbitmq-server crashes intermittently with "Slogan: no more index entries in atom_tab (max=1048576)" | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Takashi Kajinami <tkajinam> |
| Component: | rabbitmq-server | Assignee: | Peter Lemenkov <plemenko> |
| Status: | CLOSED ERRATA | QA Contact: | dabarzil |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 16.1 (Train) | CC: | apevec, dabarzil, dciabrin, jeckersb, lhh, lmiccini, vcojot |
| Target Milestone: | z7 | Keywords: | Triaged |
| Target Release: | 16.1 (Train on RHEL 8.2) | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | rabbitmq-server-3.7.23-8.el8ost | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2021-12-09 20:19:41 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Takashi Kajinami
2021-06-10 11:09:59 UTC
Let me put a note here for others' reference...

It seems the following commit is the cause.
https://github.com/rabbitmq/rabbitmq-cli/pull/271/files

3.8.z is affected, since this problematic change has been included since 3.8.0.
https://github.com/rabbitmq/rabbitmq-cli/pull/271/commits/3c4facc46e17c4ae8d08618e6a19ab6025440875

The following pull request fixed the issue.
https://github.com/rabbitmq/rabbitmq-cli/pull/461

If I understand the issue correctly, there are two workarounds available until the fix is backported.

1. Regularly restart rabbitmq-server.
2. Set the +t option to increase the maximum number of atoms.
1048576 (pid_max) x 4 (> 3 controller nodes) = 4194304 would be enough.
/var/lib/config-data/puppet-generated/rabbitmq/etc/rabbitmq/rabbitmq-env.conf
~~~
...
RABBITMQ_SERVER_ERL_ARGS="+K true +P 1048576 +t 4194304 -kernel inet_default_connect_options [{nodelay,true}]"
...
~~~
I'd appreciate any feedback or additional suggestions about workarounds as well.
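The sizing estimate in workaround 2 is a simple product, and a minimal sketch of it is shown here so that it can be recomputed for other cluster sizes. It assumes, as the comment does, that pid_max multiplied by a node count is a reasonable upper bound. The function name atom_limit_estimate is hypothetical and not part of RabbitMQ or Erlang.

```shell
#!/bin/sh
# Estimate a value for the Erlang +t (maximum atoms) flag using the rule of
# thumb from the comment above: pid_max multiplied by a node count.
# atom_limit_estimate is a hypothetical helper name used for illustration.
atom_limit_estimate() {
    pid_max=$1   # e.g. the value found in /proc/sys/kernel/pid_max
    nodes=$2     # number of nodes, optionally padded by one for headroom
    echo $((pid_max * nodes))
}

atom_limit_estimate 1048576 4   # prints 4194304 (the value used with +t above)
atom_limit_estimate 1048576 3   # prints 3145728 (exactly three controllers)
```

The result is the number passed after +t in RABBITMQ_SERVER_ERL_ARGS. On a real host the first argument could be read from /proc/sys/kernel/pid_max instead of being hard-coded.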
(In reply to Takashi Kajinami from comment #6)
> If I understand the issue correctly there are two workarounds available
> until the fix is backported.

Upgrading to any version after 3.8.10 will fix that as well.

(In reply to Peter Lemenkov from comment #7)
> Upgrading to any version after 3.8.10 will fix that as well.

Ah yes. I could not find the included version but confirmed that 3.8.10 has the fix. However, 3.7.23-2 is currently the latest package available in the RHOSP 16.1 repo, so we need a new version released in the RHOSP 16.1 repo.

Should be fixed in rabbitmq-server-3.7.23-8.el8ost

Just as a note...
The following change removed usage of RABBITMQ_SERVER_ERL_ARGS.
https://review.opendev.org/c/openstack/tripleo-heat-templates/+/739750

Since that removal, the maximum limit of atoms is set to 5000000, which is defined in the rabbitmq-env script.
~~~
[heat-admin@controller-0 ~]$ cat /etc/rhosp-release
Red Hat OpenStack Platform release 16.1.6 GA (Train)
[heat-admin@controller-0 ~]$ sudo cat /var/lib/config-data/puppet-generated/rabbitmq/etc/rabbitmq/rabbitmq-env.conf
LANG=en_US.UTF-8
LC_ALL=en_US.UTF-8
NODE_IP_ADDRESS=
NODE_PORT=
RABBITMQ_CTL_DIST_PORT_MAX=25683
RABBITMQ_CTL_DIST_PORT_MIN=25673
RABBITMQ_CTL_ERL_ARGS="+sbwt none"
RABBITMQ_NODENAME=rabbit@controller-0
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+sbwt none"
export ERL_EPMD_ADDRESS=172.17.1.23
export ERL_INETRC=/etc/rabbitmq/inetrc
[heat-admin@controller-0 ~]$ ps aux | grep beam.smp | grep -v grep
42439 15788 0.5 0.3 5431420 104212 ? Sl Jul06 6:13 /usr/lib64/erlang/erts-10.3.5.15/bin/beam.smp -W w -A 128 -MBas ageffcbf -MHas ageffcbf -MBlmbcs 512 -MHlmbcs 512 -MMmcs 30 -P 1048576 -t 5000000 -stbt db -zdbbl 128000 -K true -sbwt none -B i -- -root /usr/lib64/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.7.23/ebin -noshell -noinput -s rabbit boot -sname rabbit@controller-0 -boot start_sasl -config /etc/rabbitmq/rabbitmq -kernel inet_default_connect_options [{nodelay,true}] -sasl errlog_type error -sasl sasl_error_logger false -rabbit lager_log_root "/var/log/rabbitmq" -rabbit lager_default_file "/var/log/rabbitmq/rabbit" -rabbit lager_upgrade_file "/var/log/rabbitmq/rabbit" -rabbit feature_flags_file "/var/lib/rabbitmq/mnesia/rabbit@controller-0-feature_flags" -rabbit enabled_plugins_file "/etc/rabbitmq/enabled_plugins" -rabbit plugins_dir "/usr/lib/rabbitmq/plugins:/usr/lib/rabbitmq/lib/rabbitmq_server-3.7.23/plugins" -rabbit plugins_expand_dir "/var/lib/rabbitmq/mnesia/rabbit@controller-0-plugins-expand" -os_mon start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup false -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@controller-0"
[heat-admin@controller-0 ~]$ sudo podman exec -it $(sudo podman ps -q -f name=rabbitmq) rabbitmqctl eval -q " erlang:system_info(atom_limit)."
5000000
~~~

So the limit of 5000000 is now much bigger than pid_max (1048576) * number of controller nodes (3) = 3145728, so it seems unlikely that we will hit this issue in later z releases.

The new atom limit is set:

()[root@controller-0 /]# rpm -qa|grep rabbitmq-server
rabbitmq-server-3.7.23-8.el8ost.x86_64
[root@controller-0 ~]# sudo podman exec -it $(sudo podman ps -q -f name=rabbitmq) rabbitmqctl eval -q " erlang:system_info(atom_limit)."
5000000

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.7 (Train) bug fix and enhancement advisory), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3762
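To watch how close a node is to the limit discussed in this bug, the current atom count can be compared with the atom limit. The following is a minimal sketch, not part of the fix. atom_usage_percent is a hypothetical helper, and the commented-out commands assume the same podman and rabbitmqctl setup shown in the comments above, plus an Erlang/OTP release that supports erlang:system_info(atom_count).

```shell
#!/bin/sh
# Integer percentage of the atom table that is in use.
# atom_usage_percent is a hypothetical helper name used for illustration.
atom_usage_percent() {
    count=$1   # value of erlang:system_info(atom_count)
    limit=$2   # value of erlang:system_info(atom_limit)
    echo $((count * 100 / limit))
}

# Example with fixed numbers: an atom count equal to the old default limit,
# measured against the new limit of 5000000.
atom_usage_percent 1048576 5000000   # prints 20

# Assumed live usage on a controller (untested sketch, left commented out):
# c=$(sudo podman ps -q -f name=rabbitmq)
# count=$(sudo podman exec "$c" rabbitmqctl eval -q 'erlang:system_info(atom_count).')
# limit=$(sudo podman exec "$c" rabbitmqctl eval -q 'erlang:system_info(atom_limit).')
# atom_usage_percent "$count" "$limit"
```

A percentage that keeps climbing between restarts would suggest that atoms are still being leaked, while a stable value suggests the fixed package is in effect.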