Important bits from the case notes; Chris and I were looking at this last week, before the BZ was opened.
The crash slogan:
Slogan: Kernel pid terminated (application_controller) ({application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kernel,{'EXIT',nodistribution}}}}},{k
Systemtap was used to watch all exec()s and capture the cmdline of the crashing beam.smp. It looks like:
Fri Apr 5 17:41:51 2019 2013 74073 134463 beam.smp /usr/lib64/erlang/erts-7.3.1.6/bin/beam.smp -- -root /usr/lib64/erlang -progname erl -- -home /var/lib/rabbitmq -- -sname epmd-starter-144528111 -proto_dist "inet_tcp" -noshell -eval halt().
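To make the captured command line easier to read, here is a minimal Python sketch that splits it into the emulator flags and their values. The helper name `parse_beam_flags` is hypothetical, and the sketch assumes the argument list starts right after the first token ending in `/beam.smp`; the numeric columns before the process name are not interpreted.

```python
import shlex

# The line captured above, copied verbatim.
CAPTURED = (
    'Fri Apr 5 17:41:51 2019 2013 74073 134463 beam.smp '
    '/usr/lib64/erlang/erts-7.3.1.6/bin/beam.smp -- -root /usr/lib64/erlang '
    '-progname erl -- -home /var/lib/rabbitmq -- '
    '-sname epmd-starter-144528111 -proto_dist "inet_tcp" -noshell -eval halt().'
)


def parse_beam_flags(line):
    """Return {flag: [values]} for the arguments that follow the beam.smp path.

    Hypothetical helper for reading the capture; not part of RabbitMQ or Erlang.
    """
    tokens = shlex.split(line)  # also strips the quotes around "inet_tcp"
    # Assumption: the executable path is the first token ending in /beam.smp.
    start = next(i for i, tok in enumerate(tokens) if tok.endswith("/beam.smp"))
    flags = {}
    current = None
    for tok in tokens[start + 1:]:
        if tok == "--":
            current = None          # separator between argument groups
        elif tok.startswith("-"):
            current = tok           # a new flag begins
            flags[current] = []
        elif current is not None:
            flags[current].append(tok)
    return flags


if __name__ == "__main__":
    for flag, values in parse_beam_flags(CAPTURED).items():
        print(flag, values)
```

The flags relevant to this bug are `-sname` (a short node name, so distribution is started) and `-proto_dist` with the value `inet_tcp`.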
This gets called from rabbit_nodes_common:ensure_epmd here:
https://github.com/rabbitmq/rabbitmq-common/blob/v3.6.x/src/rabbit_nodes_common.erl#L37
Which is called from the rabbit_epmd_monitor process here:
https://github.com/rabbitmq/rabbitmq-server/blob/v3.6.x/src/rabbit_epmd_monitor.erl#L108
The epmd monitor fires the check timer every 60 seconds, thus explaining the regular period seen here.
What is not clear is why the epmd-starter exec fails to start distribution. The case files contain some initial debugging around possible IPv6 and/or hostname resolution issues, but it's not apparent that either is responsible.
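One way to narrow down the hostname resolution theory is to check how the host's short name resolves per address family. The following is only a sketch under assumptions: `resolve_by_family` is a hypothetical helper, it uses the system resolver through Python rather than Erlang's own resolver, and a clean result here does not prove that the Erlang VM resolves the name the same way.

```python
import socket


def resolve_by_family(hostname):
    """Return {"ipv4": [...], "ipv6": [...]} with the addresses the system
    resolver gives for hostname; an empty list means resolution failed.

    Hypothetical diagnostic helper, not part of RabbitMQ or Erlang.
    """
    results = {}
    for label, family in (("ipv4", socket.AF_INET), ("ipv6", socket.AF_INET6)):
        try:
            infos = socket.getaddrinfo(hostname, None, family, socket.SOCK_STREAM)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the textual address.
            results[label] = sorted({info[4][0] for info in infos})
        except OSError:
            results[label] = []
    return results


if __name__ == "__main__":
    # -sname uses the short host name, so that is the name checked here.
    short_name = socket.gethostname().split(".")[0]
    print(short_name, resolve_by_family(short_name))
```

If the short name resolves only over IPv6, or not at all, that would be worth comparing against the `-proto_dist "inet_tcp"` flag seen in the capture.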
Everything else seems to be functioning properly: the service is up and registered, epmd is running, and everything is clustered. Only the epmd-starter crashes. Since epmd is already running, the crash has no practical effect on the running system.
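For anyone who wants to confirm independently that epmd is reachable, a rough Python sketch follows. It assumes epmd listens on its default port 4369 (this can be overridden with ERL_EPMD_PORT) and only tests that a TCP connection succeeds; `tcp_port_open` is a hypothetical helper and does not speak the epmd protocol.

```python
import socket

EPMD_DEFAULT_PORT = 4369  # assumption: ERL_EPMD_PORT has not been changed


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds.

    Hypothetical helper; it checks reachability only, not epmd itself.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("epmd reachable:", tcp_port_open("127.0.0.1", EPMD_DEFAULT_PORT))
```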
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2019:2623