Bug 1223976 - Not capturing events properly from RHOS (RabbitMQ)
Summary: Not capturing events properly from RHOS (RabbitMQ)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Providers
Version: 5.4.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: GA
Target Release: 5.5.0
Assignee: Greg Blomquist
QA Contact: Pete Savage
URL:
Whiteboard:
Depends On:
Blocks: 1225178
 
Reported: 2015-05-21 20:33 UTC by Pete Savage
Modified: 2015-12-08 13:11 UTC (History)
4 users

Fixed In Version: 5.5.0.1
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1225178 (view as bug list)
Environment:
Last Closed: 2015-12-08 13:11:29 UTC
Category: ---
Cloudforms Team: ---
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2015:2551 0 normal SHIPPED_LIVE Moderate: CFME 5.5.0 bug fixes and enhancement update 2015-12-08 17:58:09 UTC

Description Pete Savage 2015-05-21 20:33:47 UTC
Description of problem: Events not properly being picked up from RHOS


Version-Release number of selected component (if applicable): 5.4.0.2


How reproducible: Very


Steps to Reproduce:
1. Add a RHOS provider
2. Start/restart an instance
3. Try to find events

Actual results: Events are not collected


Expected results: Events should be collected


Additional info:
This has been tested against RHOS4 and RHOS6 GA versions. Both RHOS environments had extra ports opened up to allow connections to AMQP/RabbitMQ. Connections to port 5672 were verified, and even the filters for the scheduler were removed, but events were still not captured.

Comment 2 Ladislav Smola 2015-05-22 09:06:41 UTC
Some additional questions.

Have you tried subscribing to the notification channel from another place, to see whether messages are sent there but not consumed by CFME? Or are they not sent there at all?

If they are not sent there, we have seen the same issue with the configuration of each service; e.g., /etc/nova/nova.conf must be configured to send messages to the notification channel.

Do you see this in your conf?
/etc/nova/nova.conf
notification_driver = messaging
notification_topics = notifications


If it stops receiving events after some time, it might be a connection issue. We experienced problems with the Bunny gem, which lost its connection when we restarted the server running AMQP and was unable to reconnect. (We use a fairly old Bunny gem; an update might fix this.) mcornea could provide more info about this issue.

Comment 3 Ladislav Smola 2015-05-22 09:09:41 UTC
Note: with proper configuration we are collecting events for the OpenstackInfra provider (the same implementation as the Openstack provider) with RHOS6. We are testing RHOS7, which has had some issues.

Comment 4 Pete Savage 2015-05-22 09:16:05 UTC
I see notification_topics = notifications
but not notification_driver = messaging

Also, can I confirm: are we connecting to the notifications.* queue, or are we creating our own queue and binding it to a fanout exchange?

Comment 5 Ladislav Smola 2015-05-22 09:26:24 UTC
From my limited knowledge of AMQP and from observation, I think we are connecting to notifications.*

When I start the events worker, the number of connections on notifications.* rises by 6, corresponding to these 6 lines: https://github.com/ManageIQ/manageiq/blob/a1ed085996b42ace3e9498bdf2fe001de517b040/vmdb/config/vmdb.tmpl.yml#L301

Comment 6 Pete Savage 2015-05-22 18:30:18 UTC
Discovery work has been carried out, and the issue is with multiple appliances connecting to the same RHOS instance. Each appliance connects to the same queues, so they consume messages in a round-robin fashion, meaning that messages can be lost. Though a customer is less likely to run multiple appliances against the same RHOS instance, this fix is badly needed for QE to be able to test RHOS functionality in CFME.

We already have a solution to the problem: each appliance will use a randomized queue name, much like we currently do for QPID. Greg B has developed the fix, and it has already proved successful in early testing.
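The shared-queue round-robin behavior described above can be illustrated with a short simulation in plain Ruby (no broker required; the function names and data are illustrative, not part of the actual fix):

```ruby
# With a single shared queue, RabbitMQ round-robins messages across the
# consumers attached to it, so each appliance sees only a subset of events.
def deliver_shared(messages, consumer_count)
  consumers = Array.new(consumer_count) { [] }
  messages.each_with_index { |msg, i| consumers[i % consumer_count] << msg }
  consumers
end

# With one uniquely named queue per appliance, each bound to the same fanout
# exchange, every appliance receives every event.
def deliver_fanout(messages, consumer_count)
  Array.new(consumer_count) { messages.dup }
end

events = %w[instance.start instance.restart instance.stop instance.stop]
deliver_shared(events, 2)  # => [["instance.start", "instance.stop"], ["instance.restart", "instance.stop"]]
deliver_fanout(events, 2)  # both appliances get all four events
```

This is why the fix below moves from service-named shared queues to per-appliance queue names.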

Comment 7 Greg Blomquist 2015-05-26 17:54:21 UTC
https://github.com/ManageIQ/manageiq/pull/2995

Comment 8 CFME Bot 2015-05-26 18:06:03 UTC
New commit detected on manageiq/master:
https://github.com/ManageIQ/manageiq/commit/051cac62ebe60f720dd1844ab0d64b6880c42f98

commit 051cac62ebe60f720dd1844ab0d64b6880c42f98
Author:     Greg Blomquist <gblomqui>
AuthorDate: Tue May 26 12:38:30 2015 -0400
Commit:     Greg Blomquist <gblomqui>
CommitDate: Tue May 26 12:45:57 2015 -0400

    Create unique binding queue names for RabbitMQ
    
    When CFME connects to RabbitMQ to collect OpenStack events, it creates queues to
    bind to the OpenStack services' exchanges.  The queues were named after the
    services to which they were bound.  For example, binding to the "nova" service
    would result in a binding queue called "nova".
    
    If more than one appliance attempted to connect to RabbitMQ to collect OpenStack
    events, only the first appliance to create the binding queue would receive any
    events.
    
    Now, the binding queue is named after the appliance connecting to the RabbitMQ
    service.  The new binding queue name will look like "miq-<host|ip>-<exchange>"
    
     * e.g.: "miq-10.10.10.10-nova"
    
    This allows for two things:  individual appliances will get their own binding
    queue per service, and administrators will be able to tell which binding queues
    belong to which appliances.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1223976

 lib/openstack/amqp/openstack_rabbit_event_monitor.rb | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
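A minimal sketch of the naming scheme described in the commit message, "miq-&lt;host|ip&gt;-&lt;exchange&gt;" (the helper name is hypothetical; the real change lives in lib/openstack/amqp/openstack_rabbit_event_monitor.rb):

```ruby
# Hypothetical helper illustrating the unique binding-queue naming scheme:
# prefix + appliance host/IP + exchange (service) name.
def miq_binding_queue_name(appliance_host, exchange)
  "miq-#{appliance_host}-#{exchange}"
end

miq_binding_queue_name("10.10.10.10", "nova")  # => "miq-10.10.10.10-nova"
```

Because the appliance host/IP is part of the name, two appliances binding to the same "nova" exchange no longer collide on one queue, and an administrator can tell from the queue list which queue belongs to which appliance.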

Comment 9 CFME Bot 2015-06-04 17:46:15 UTC
New commit detected on manageiq/master:
https://github.com/ManageIQ/manageiq/commit/28b13d6ca787f7379882a39866ae8e4a39356d6a

commit 28b13d6ca787f7379882a39866ae8e4a39356d6a
Author:     Greg Blomquist <gblomqui>
AuthorDate: Wed Jun 3 16:16:05 2015 -0400
Commit:     Greg Blomquist <gblomqui>
CommitDate: Wed Jun 3 16:45:04 2015 -0400

    Fix custom naming for AMQP binding queues
    
    The original fixes for the bugs linked below used the OpenStack server's IP
    address as the hostname information in the binding queue name.  This meant that
    any two appliances that attempted to connect to a single OpenStack env would
    create the same named binding queues.
    
    This fix uses the appliance's ip address in the binding queue name, making the
    name of the binding queue unique per appliance.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1224389
    https://bugzilla.redhat.com/show_bug.cgi?id=1223976

 lib/openstack/amqp/openstack_qpid_event_monitor.rb             | 10 ++++++----
 lib/openstack/amqp/openstack_qpid_receiver.rb                  |  5 +++--
 lib/openstack/amqp/openstack_rabbit_event_monitor.rb           |  6 ++++--
 lib/spec/openstack/amqp/openstack_qpid_event_monitor_spec.rb   |  2 +-
 lib/spec/openstack/amqp/openstack_qpid_receiver_spec.rb        |  2 +-
 lib/spec/openstack/amqp/openstack_rabbit_event_monitor_spec.rb |  2 +-
 vmdb/lib/workers/mixins/event_catcher_openstack_mixin.rb       |  4 +++-
 7 files changed, 19 insertions(+), 12 deletions(-)

Comment 10 CFME Bot 2015-06-24 19:41:37 UTC
New commit detected on manageiq/master:
https://github.com/ManageIQ/manageiq/commit/3649eb07a5e8b1b9a5f56dab11eb205e66758ef5

commit 3649eb07a5e8b1b9a5f56dab11eb205e66758ef5
Author:     Greg Blomquist <gblomqui>
AuthorDate: Tue Jun 23 13:30:07 2015 -0400
Commit:     Greg Blomquist <gblomqui>
CommitDate: Tue Jun 23 16:00:12 2015 -0400

    Include miq_server when retrieving worker
    
    To try to make the way the OpenStack event catcher creates binding queues
    work a little better, the appliance's IP address was looked up and used as part
    of the binding queue's name.
    
    However, there were a couple of things working against this fix.  First, the
    appliance's IP address was not readily available to the worker process.  Second,
    ManageIQ has a DB connection pool with only one connection.  And, threads (i.e.,
    where event catcher workers do all their work) that attempt to run queries are
    opening a new DB connection.
    
    The original fix never actually tried opening a new connection.  Instead, it
    was perfectly happy to get back a nil value for the appliance and try to look
    up Nil#ipaddress.
    
    This fix gets around this problem by throwing the appliance record (miq_server,
    actually) into an ivar and making that available to the thread.  This keeps the
    thread from having to query for the miq_server, while still giving it access to
    the MiqServer#ipaddress.
    
    Original PR:
    https://github.com/ManageIQ/manageiq/pull/3050
    
    Fixes:
    https://bugzilla.redhat.com/show_bug.cgi?id=1232484
    
    References:
    https://bugzilla.redhat.com/show_bug.cgi?id=1224389
    https://bugzilla.redhat.com/show_bug.cgi?id=1223976

 vmdb/lib/workers/mixins/event_catcher_openstack_mixin.rb |  2 +-
 vmdb/lib/workers/worker_base.rb                          | 13 +++++++------
 2 files changed, 8 insertions(+), 7 deletions(-)
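The ivar approach in the commit above can be sketched in plain Ruby (class and struct names here are hypothetical stand-ins, not the actual ManageIQ classes): the server record is looked up once on the main thread, and the worker thread only reads the instance variable, so it never needs its own DB connection.

```ruby
# Stand-in for the MiqServer record that would normally come from the DB.
FakeMiqServer = Struct.new(:ipaddress)

class EventCatcherSketch
  def initialize(miq_server)
    # Fetched on the main thread, before any worker threads spawn.
    @miq_server = miq_server
  end

  def binding_queue_thread
    # The thread only reads the ivar; no query, no extra DB connection,
    # and no nil appliance record.
    Thread.new { "miq-#{@miq_server.ipaddress}-nova" }
  end
end

catcher = EventCatcherSketch.new(FakeMiqServer.new("10.10.10.10"))
catcher.binding_queue_thread.value  # => "miq-10.10.10.10-nova"
```

Thread#value joins the thread and returns its result, so the main thread can confirm the queue name was built without touching the database from inside the worker.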

Comment 11 Pete Savage 2015-10-05 11:58:59 UTC
Verified in 5.5.0.3

Comment 13 errata-xmlrpc 2015-12-08 13:11:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:2551

