Bug 1324922

Summary: Log handler repeatedly crashes
Product: [Fedora] Fedora EPEL
Component: erlang
Version: epel7
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Reporter: John Eckersberg <jeckersb>
Assignee: Peter Lemenkov <lemenkov>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: apevec, binarin, draganHR, ealcaniz, erlang, fdinitto, jeckersb, jschluet, lemenkov, lhh, oblaut, rjones, s, steven.dake, ushkalim
Keywords: Regression, ZStream
Hardware: Unspecified
OS: Unspecified
Fixed In Version: erlang-R16B-03.17.el7
Doc Type: Bug Fix
Type: Bug
Clone Of: 1322609
Last Closed: 2016-07-29 06:50:15 UTC
Bug Depends On: 1322609
Bug Blocks: 1324185

Description John Eckersberg 2016-04-07 15:12:57 UTC
+++ This bug was initially created as a clone of Bug #1322609 +++

Starting with erlang-erts-R16B-03.10min.6.el7ost.x86_64, the log handler repeatedly crashes and fills up the rabbitmq startup_log with entries like:

Event crashed log handler:
{info_msg,<0.1719.0>,
          {<0.1832.0>,"Mirrored ~s: Adding mirror on node ~p: ~p~n",
           ["queue 'l3_agent_fanout_0f6bc20f4c54484f9de482cd6d83a15a' in vhost '/'",
            'rabbit@overcloud-controller-1',<6192.10668.1>]}}
function_clause

Meanwhile the rabbitmq log is empty.

Looks like a regression introduced in the "Enable error_logger depth fine tuning" patch.
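
To sanity check whether a given build is affected, one can count the crash entries in the startup log. This is a minimal Python sketch, assuming each crash starts with exactly the "Event crashed log handler:" line quoted above. The function name is hypothetical and not part of any RabbitMQ or Erlang tooling.

```python
CRASH_MARKER = "Event crashed log handler:"  # marker line quoted in this report


def count_handler_crashes(log_text):
    """Count the lines in log_text that start with the crash marker."""
    return sum(
        1
        for line in log_text.splitlines()
        if line.strip().startswith(CRASH_MARKER)
    )
```

Feed it the contents of the startup_log from an affected and a fixed build. A non-zero count alongside an empty rabbitmq log matches the symptom described here.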

--- Additional comment from Alexey Lebedeff on 2016-04-07 09:17:10 EDT ---

R16B-03.16.el7 is also affected.

Comment 1 Fedora Update System 2016-04-07 17:34:30 UTC
erlang-R16B-03.17.el7 has been submitted as an update to Fedora EPEL 7. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-e1035fad90

Comment 2 Fedora Update System 2016-04-08 21:49:42 UTC
erlang-R16B-03.17.el7 has been pushed to the Fedora EPEL 7 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-e1035fad90

Comment 3 Steven Dake 2016-04-10 18:05:04 UTC
I think your speculation that the depth logging change, whatever that was, introduced this regression is incorrect.  The problem was introduced in -16 (adding IPv6 support).  This fundamentally changes how epmd operates.  epmd binds to either IPv4 or IPv6, depending on config, but not both.

One workaround mentioned here:
https://github.com/openstack/kolla/blob/master/ansible/roles/rabbitmq/templates/rabbitmq-env.conf.j2#L8

in comment #6 works on -16, but triggers epmd to bind to 0.0.0.0 (all interfaces), which could interfere with neutron, the tenant network, etc.

If you're going to enable IPv6, you might as well fix epmd binding so it's handled properly.  btw, the otp-23 patch is a disaster.

I have yet to try -17 with removal of the EPMD binding, which would be a good short-term workaround but not a good long-term one.  Long term this will cause heisenbugs in neutron and other parts of the system that you just haven't discovered yet ;)

Comment 4 Steven Dake 2016-04-11 09:06:53 UTC
I have tried -17 and it suffers from this same binding problem consistently.

Comment 5 Steven Dake 2016-04-11 09:09:17 UTC
Removing the EPMD binding solves the "epmd: could not bind to any interface" error, which is followed by an Erlang crash.  Unfortunately, with this mode of operation, a wildcard bind is done to all interfaces on the control nodes in OpenStack.
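
To illustrate the difference between a wildcard bind and a bind to one address, here is a small Python sketch. It is not epmd's code and does not use epmd's port (4369 by default). The helper names are made up for illustration, and port 0 is used so the OS picks a free port.

```python
import socket


def bind_listener(host, port=0):
    """Bind a TCP listening socket to host; port 0 lets the OS pick a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen(1)
    return sock


def is_wildcard(sock):
    """Return True if the socket is bound to all IPv4 interfaces (0.0.0.0)."""
    return sock.getsockname()[0] == "0.0.0.0"
```

A listener created with bind_listener("0.0.0.0") accepts connections on every interface of the host, including tenant-facing ones, while bind_listener("127.0.0.1") is reachable only locally. That is the concern with the wildcard mode on the control nodes.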

Comment 6 Alexey Lebedeff 2016-04-11 10:24:15 UTC
Steven, is some part of the conversation missing, or have you posted your comments to the wrong bug? )
Because this one is only about broken logging - all other things should function just fine.

Comment 7 Alan Pevec 2016-04-11 10:39:09 UTC
Alexey, this is a follow-up to the Bodhi comment in the linked update https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-e1035fad90

Comment 8 John Eckersberg 2016-04-12 15:40:15 UTC
OK, let's regroup and clarify some things here before we get more confused.  Part of this is my fault because I directed you on IRC to post on the bodhi update about your crash.  I didn't realize at first that you were seeing an IPv6 crash and thought it was just the logging crash.

Anyway...

This particular bug is about broken logging.  The current released version (R16B-03.16.el7) has broken logging.  The only change[1] in the .17 release is to revert the patch that added the broken logging.

So I would ask two things.

(1) Ignore the IPv6 thing for this bug.  It would be a huge help if you could just sanity check that .16 has broken logging and that .17 is correct (and update karma on the update accordingly).  Then we can either ship that update ASAP or bundle it with an IPv6 fix (if we can get it quickly).

(2) We'll file another bug for the IPv6 issue.  We already fixed one crash bug in https://bugzilla.redhat.com/show_bug.cgi?id=1310808 (incidentally this is the update you said introduced your crash).  If you can get it to reproduce and capture a coredump of the crash that would be awesome.  I will try to reproduce as well by toying with ERL_EPMD_ADDRESS on my end.

[1] http://pkgs.fedoraproject.org/cgit/rpms/erlang.git/commit/?h=epel7&id=6515854c294bc6be60987407a54d9680fd8faf65
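
For anyone trying to reproduce the IPv6 crash by varying ERL_EPMD_ADDRESS: the epmd documentation describes this variable as a comma-separated list of IP addresses. Below is a small Python sketch for checking which address families a candidate value mixes before exporting it. The function name and the sample addresses are made up for illustration, and this does not model how epmd itself parses the value.

```python
import ipaddress


def split_epmd_addresses(value):
    """Group a comma-separated address list by IP family (4 or 6).

    Raises ValueError if an entry is not a valid IP address.
    """
    families = {4: [], 6: []}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # skip empty entries such as a trailing comma
        addr = ipaddress.ip_address(item)
        families[addr.version].append(str(addr))
    return families
```

For example, split_epmd_addresses("192.0.2.10, ::1") separates the IPv4 entry from the IPv6 one, which makes it easier to record exactly which combination was in use when a crash was captured.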

Comment 9 Alan Pevec 2016-05-06 13:28:06 UTC
Any updates?

Comment 10 Steven Dake 2016-05-06 14:05:21 UTC
I think what happened here is that I conflated .15 and .16 into one change, going by jeckersb's statement.

The issue (with .15, then) is that EPMD does a wildcard bind, which could result in some really weird behavior if anyone in a cloud environment uses that port while neutron is in use on the box.  I'm not sure if this is a legitimate situation, but no service in OpenStack should wildcard bind.

That said, .16 is totally bust with logging - you're right on that point.  I don't recall where the erlang repo is to test -17 with, but if you could provide that I'll test Kolla's current master with it.  It takes about 2 hours to test as soon as I have a repo to work with.

Thanks
-steve

Comment 11 Alan Pevec 2016-05-06 18:05:04 UTC
Steve, in RDO Mitaka testing repo
http://buildlogs.centos.org/centos/7/cloud/x86_64/openstack-mitaka/
we have:
erlang-R16B-03.17.el7
rabbitmq-server-3.3.5-17.el7

Comment 12 Fedora Update System 2016-07-29 06:50:07 UTC
erlang-R16B-03.17.el7 has been pushed to the Fedora EPEL 7 stable repository. If problems still persist, please make note of it in this bug report.