1336795 – RHOS/RHOP eventcatcher can cause a "reconnection storm"

Summary: RHOS/RHOP eventcatcher can cause a "reconnection storm"

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	Providers
Sub Component:
Version:	5.6.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	GA
Target Release:	5.7.0
Assignee:	Marek Aufart
QA Contact:	Omri Hochman
Docs Contact:
URL:
Whiteboard:	openstack:event
Duplicates (1):	1347296
Depends On:
Blocks:	1351333

Reported:	2016-05-17 13:09 UTC by Alex Krzos
Modified:	2019-08-06 20:06 UTC
CC List:	9 users
Fixed In Version:	5.7.0.0
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1351333
Environment:
Last Closed:	2017-01-11 20:07:21 UTC
Category:	---
Cloudforms Team:	---
Target Upstream Version:
Embargoed:

Description Alex Krzos 2016-05-17 13:09:57 UTC

Description of problem:
While setting up a RHOS6 provider on CFME 5.6, I encountered a situation where the event catchers were stuck attempting to reconnect to Ceilometer for events. This provider's events are actually emitted over AMQP, but once the provider was configured for Ceilometer events, I could not get either worker out of the reconnection storm until I restarted evmserverd.

The result of this is 1-2 workers constantly attempting to reconnect:
MIQ: Openstack::CloudManager::EventCatcher
MIQ: Openstack::NetworkManager::EventCatcher

They will both consume ~60-70% of a core and flood evm.log and fog.log with error messages:
evm.log:
[----] I, [2016-05-17T08:49:46.679169 #29880:1333938]  INFO -- : Reseting Openstack Ceilometer connection after Connection refused - connect(2) for x.x.x.x:8777 (Errno::ECONNREFUSED).
[----] I, [2016-05-17T08:49:46.679346 #29880:1333938]  INFO -- : Querying Openstack Ceilometer for events newer than 2016-05-17 11:35:11 UTC...
[----] E, [2016-05-17T08:49:46.681164 #29880:1333938] ERROR -- : <Fog> excon.error     #<Excon::Errors::SocketError: Connection refused - connect(2) for x.x.x.x:8777 (Errno::ECONNREFUSED)>

fog.log:
[----] E, [2016-05-17T08:57:57.284550 #29880:1333938] ERROR -- : excon.error     #<Excon::Errors::SocketError: Connection refused - connect(2) for 10.16.4.68:8777 (Errno::ECONNREFUSED)>

[----] E, [2016-05-17T08:57:57.286661 #29880:1333938] ERROR -- : excon.error     #<Excon::Errors::SocketError: Connection refused - connect(2) for 10.16.4.68:8777 (Errno::ECONNREFUSED)>

This ultimately fills the log partition in about 7 hours if not addressed.

More concerning is that turning off the event monitor role only stopped the NetworkManager event catcher; the CloudManager event catcher continued its reconnection storm.

Even after editing the provider and correctly configuring the events to be AMQP, the change does not take effect unless evmserverd is completely restarted.

Version-Release number of selected component (if applicable):
5.6.0.6-beta2.5

How reproducible:
On 5.6 with RHOS 6 provider

Steps to Reproduce:
1.  Add provider
2.  Set events to ceilometer (when events should be set to amqp)
3.  Observe logs and system performance and log partition size
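
To put a number on step 3, a small Ruby sketch along these lines could count connection-refused errors per second in a log. The timestamp pattern is inferred from the evm.log samples above, and `refused_per_second` is a hypothetical helper name, not part of CFME.

```ruby
# Hypothetical helper: counts ECONNREFUSED log lines per second.
# The timestamp pattern is an assumption inferred from the log samples
# quoted in this report.
TIMESTAMP = /\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d+ #\d+:\h+\]/

def refused_per_second(lines)
  counts = Hash.new(0)
  lines.each do |line|
    next unless line.include?("Errno::ECONNREFUSED")
    match = TIMESTAMP.match(line)
    # Key on the timestamp truncated to whole seconds.
    counts[match[1]] += 1 if match
  end
  counts
end
```

It could be run as `refused_per_second(File.foreach("/var/www/miq/vmdb/log/evm.log"))`, assuming that is where the appliance keeps evm.log.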

Actual results:


Expected results:


Additional info:

The biggest concern here is what happens if a provider is actually Ceilometer event driven and the service goes down. Do these two workers continue to dump error messages in a reconnection storm until the log partition fills, which ultimately shuts down evmserverd?

Comment 5 CFME Bot 2016-05-30 11:21:10 UTC

https://github.com/ManageIQ/manageiq/pull/9027

Comment 6 Marek Aufart 2016-06-16 14:30:58 UTC

*** Bug 1347296 has been marked as a duplicate of this bug. ***

Comment 7 CFME Bot 2016-06-21 14:50:52 UTC

New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/231c902b1d76f8b97762165aa2335a1c3467d12d

commit 231c902b1d76f8b97762165aa2335a1c3467d12d
Author:     Marek Aufart <maufart>
AuthorDate: Mon May 30 12:59:36 2016 +0200
Commit:     Marek Aufart <maufart>
CommitDate: Mon May 30 12:59:36 2016 +0200

    Fix openstack ceilometer reconnect-storm in log
    
    Automatic reconnect functionality in Openstack Ceilometer Event monitor
    was reconnecting too often when Ceilometer service went down. The result
    was high resource consumption and big log file.
    
    Initial purpose for immediate reconnection was quickly reconnect after keystone
    token expiration, but it can be done with default monitor restart too. Which is
    a bit slower - 15 seconds by default instead of immediately, but it should not
    be problem in this case.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1336795

 .../events/openstack_ceilometer_event_monitor.rb   | 24 ++++++++--------------
 1 file changed, 9 insertions(+), 15 deletions(-)

Comment 9 Ronnie Rasouli 2016-10-13 10:14:45 UTC

Verified on 5.7.0.4: no reconnection storm. Memory is stable and the error does not appear in the logs.