Description of problem:

While setting up a RHOS 6 provider on CFME 5.6 I encountered a situation where the event catchers get stuck attempting to reconnect to Ceilometer for events. Although this provider's events are emitted via AMQP, once the provider is configured for Ceilometer events I could not get either worker out of the reconnection storm until I restarted evmserverd.

The result is 1-2 workers constantly attempting to reconnect:

MIQ: Openstack::CloudManager::EventCatcher
MIQ: Openstack::NetworkManager::EventCatcher

They both consume ~60-70% of a core and flood evm.log and fog.log with error messages:

evm.log:
[----] I, [2016-05-17T08:49:46.679169 #29880:1333938]  INFO -- : Reseting Openstack Ceilometer connection after Connection refused - connect(2) for x.x.x.x:8777 (Errno::ECONNREFUSED).
[----] I, [2016-05-17T08:49:46.679346 #29880:1333938]  INFO -- : Querying Openstack Ceilometer for events newer than 2016-05-17 11:35:11 UTC...
[----] E, [2016-05-17T08:49:46.681164 #29880:1333938] ERROR -- : <Fog> excon.error #<Excon::Errors::SocketError: Connection refused - connect(2) for x.x.x.x:8777 (Errno::ECONNREFUSED)>

fog.log:
[----] E, [2016-05-17T08:57:57.284550 #29880:1333938] ERROR -- : excon.error #<Excon::Errors::SocketError: Connection refused - connect(2) for 10.16.4.68:8777 (Errno::ECONNREFUSED)>
[----] E, [2016-05-17T08:57:57.286661 #29880:1333938] ERROR -- : excon.error #<Excon::Errors::SocketError: Connection refused - connect(2) for 10.16.4.68:8777 (Errno::ECONNREFUSED)>

Left unaddressed, this fills the log partition in about 7 hours.

More concerning, turning off the event monitor role only stopped the NetworkManager event catcher; the CloudManager event catcher continued its reconnection storm. Even after editing the provider and correctly configuring the events as AMQP, the change does not take effect unless evmserverd is completely restarted.

Version-Release number of selected component (if applicable):
5.6.0.6-beta2.5

How reproducible:
On 5.6 with a RHOS 6 provider

Steps to Reproduce:
1. Add the provider
2. Set events to Ceilometer (when events should be set to AMQP)
3. Observe the logs, system performance, and log partition size

Actual results:


Expected results:


Additional info:
The biggest concern here is what happens if a provider really is Ceilometer event driven and the service goes down: do these two workers keep dumping error messages in a reconnection storm until the log partition fills, which ultimately shuts down evmserverd?
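For illustration only (not the actual ManageIQ code), the failure mode looks roughly like the Ruby sketch below: a poll loop that rescues the connection error and retries immediately with no backoff, so a downed Ceilometer endpoint pegs a core and writes several log lines per iteration. All names here are hypothetical stand-ins.

require 'logger'

log = Logger.new($stdout)

# Hypothetical stand-in for the Fog/Excon call; raises while the
# Ceilometer endpoint (port 8777) is unreachable.
def query_ceilometer_events(since)
  raise Errno::ECONNREFUSED, "connect(2) for x.x.x.x:8777"
end

last_event_time = Time.now.utc
3.times do                    # bounded here; the real worker loops indefinitely
  begin
    log.info("Querying Openstack Ceilometer for events newer than #{last_event_time}...")
    query_ceilometer_events(last_event_time)
  rescue Errno::ECONNREFUSED => err
    # No sleep or backoff before the next attempt -- this immediate retry
    # is what floods evm.log/fog.log and consumes CPU.
    log.info("Resetting Openstack Ceilometer connection after #{err.message}.")
  end
end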
https://github.com/ManageIQ/manageiq/pull/9027
*** Bug 1347296 has been marked as a duplicate of this bug. ***
New commit detected on ManageIQ/manageiq/master:
https://github.com/ManageIQ/manageiq/commit/231c902b1d76f8b97762165aa2335a1c3467d12d

commit 231c902b1d76f8b97762165aa2335a1c3467d12d
Author:     Marek Aufart <maufart>
AuthorDate: Mon May 30 12:59:36 2016 +0200
Commit:     Marek Aufart <maufart>
CommitDate: Mon May 30 12:59:36 2016 +0200

    Fix openstack ceilometer reconnect-storm in log

    The automatic reconnect functionality in the Openstack Ceilometer event
    monitor reconnected too often when the Ceilometer service went down,
    resulting in high resource consumption and a large log file. The original
    purpose of the immediate reconnection was to recover quickly after
    keystone token expiration, but the default monitor restart handles that
    too. It is a bit slower (15 seconds by default instead of immediate),
    which should not be a problem in this case.

    https://bugzilla.redhat.com/show_bug.cgi?id=1336795

 .../events/openstack_ceilometer_event_monitor.rb | 24 ++++++++--------------
 1 file changed, 9 insertions(+), 15 deletions(-)
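As a hedged sketch of the direction of the change (not the actual diff): rather than rescuing the connection failure and reconnecting immediately, the monitor lets the failure surface so the normal restart path retries on its default interval (about 15 seconds) instead of in a tight loop. The names and delay constant below are assumptions for illustration.

require 'logger'

log = Logger.new($stdout)

RESTART_DELAY = 15 # seconds; assumed default monitor restart interval

# Hypothetical stand-in for the Ceilometer query; still failing here.
def query_ceilometer_events(since)
  raise Errno::ECONNREFUSED, "connect(2) for x.x.x.x:8777"
end

attempts = 0
begin
  attempts += 1
  query_ceilometer_events(Time.now.utc)
rescue Errno::ECONNREFUSED => err
  # Log once per cycle and wait out the restart interval instead of
  # retrying in-place; the keystone token-expiration case is still
  # covered, just ~15 seconds later.
  log.error("Ceilometer connection failed (#{err.message}); retrying after #{RESTART_DELAY}s")
  if attempts < 3             # bounded for the sketch; the worker keeps retrying
    sleep(RESTART_DELAY)
    retry
  end
end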
Verified on 5.7.0.4: no reconnection storm, memory is stable, and the error does not appear in the logs.