Bug 1290256 - [RFE] We need a check for growing rabbitMQ queues
Summary: [RFE] We need a check for growing rabbitMQ queues
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: Service Telemetry Framework
Version: 8.0 (Liberty)
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ga
Target Release: ---
Assignee: OSP Team
QA Contact: Leonid Natapov
Docs Contact: mgeary
URL:
Whiteboard: docs-accepted
Duplicates: 1860915
Depends On: 1643497 1669093 1673181 1717975 1726191 2057627
Blocks: 1840081
Reported: 2015-12-09 23:12 UTC by Nick Barcet
Modified: 2023-05-25 17:34 UTC

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-05-25 17:34:59 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
OpenStack gerrit | 631264 | 0 | None | MERGED | [collectd] add collectd-rabbitmq-monitoring | 2021-01-19 16:36:20 UTC
OpenStack gerrit | 644841 | 0 | None | MERGED | Refactor collectd/gnocchi integration | 2021-01-19 16:36:20 UTC
OpenStack gerrit | 644929 | 0 | None | MERGED | Clean metrics related environments | 2021-01-19 16:36:19 UTC
OpenStack gerrit | 782633 | 0 | None | NEW | Update collectd-read-rabbitmq | 2022-10-19 20:27:26 UTC
RDO | 18089 | 0 | None | None | None | 2019-01-07 15:25:02 UTC
Red Hat Issue Tracker | OSP-434 | 0 | None | None | None | 2021-11-25 12:50:13 UTC

Description Nick Barcet 2015-12-09 23:12:19 UTC
Description of problem:
We need to provide an alert when RabbitMQ queues are growing in our availability monitoring solution.

Use case:
As an operator, if the number of messages in my RabbitMQ queue keeps growing, that is a clear indication that something is going wrong in my setup, and I need to be notified of it.

Satisfaction criteria:
- a check is implemented
- checks are deployed for each RabbitMQ queue
- alerts are sent when queues are growing
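
One possible reading of "growing" in the criteria above, as a minimal Python sketch: a queue counts as growing when each sample in a window is larger than the one before it. The function name `is_growing` and the `min_increase` threshold are hypothetical and are not part of any check shipped for this bug.

```python
from typing import Sequence


def is_growing(samples: Sequence[int], min_increase: int = 1) -> bool:
    """Return True if every sample exceeds the previous one by at least min_increase."""
    if len(samples) < 2:
        return False  # a single sample cannot show a trend
    # Compare each sample with its successor across the whole window.
    return all(b - a >= min_increase for a, b in zip(samples, samples[1:]))
```

For example, `is_growing([0, 5, 12, 30])` is True, while a window that dips once, such as `[0, 5, 3, 8]`, is not treated as growing. A real alert would probably also want a minimum queue length so that a rise from 0 to 3 messages does not page anyone.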

Comment 4 Matthias Runge 2017-07-19 08:40:18 UTC
A script to put rabbit queue lengths into collectd:

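# Exec-plugin script: report the total message count across all RabbitMQ queues.
HOSTNAME="${COLLECTD_HOSTNAME:-localhost}"
INTERVAL="${COLLECTD_INTERVAL:-10}"
while sleep "$INTERVAL"
do
  # +$2 coerces non-numeric fields (banner and header lines) to 0.
  sudo rabbitmqctl list_queues | awk -v host="$HOSTNAME" -v interval="$INTERVAL" \
    '{ s += +$2 } END { printf "PUTVAL \"%s/rabbitmq/gauge-queues\" interval=%s N:%d\n", host, interval, s }'
done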
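
For reference, collectd's exec plugin reads lines of the form `PUTVAL "host/plugin/type-type_instance" interval=N N:value` from the script's standard output. A small Python sketch of building such a line follows; the helper name `format_putval` and the identifier parts used in the example are illustrative assumptions, not names taken from this bug.

```python
def format_putval(host: str, plugin: str, type_: str, type_instance: str,
                  interval: int, value: int) -> str:
    """Build one PUTVAL line in the text format read by collectd's exec plugin."""
    # Identifier layout: host/plugin/type-type_instance
    identifier = f"{host}/{plugin}/{type_}-{type_instance}"
    # "N" as the timestamp tells collectd to use the current time.
    return f'PUTVAL "{identifier}" interval={interval} N:{value}'


print(format_putval("controller-0", "rabbitmq", "gauge", "queues", 10, 42))
```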

Comment 9 Mehdi ABAAKOUK 2017-12-21 14:14:59 UTC
Just "rabbitmqctl list_queues" will not show messages that have been taken by an application but not yet acknowledged by it.

Maybe you should pass what kind of message you want, for example: rabbitmqctl list_queues messages_ready messages_unacknowledged
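
A sketch of how the output of that command could be summed in Python, kept separate per column. It assumes tab-separated rows of queue name, ready count and unacknowledged count, with any banner or header lines skipped; the exact banner text differs between RabbitMQ versions, and the function name is hypothetical.

```python
def sum_queue_messages(output: str) -> dict:
    """Sum ready and unacknowledged counts from `rabbitmqctl list_queues` text output."""
    totals = {"ready": 0, "unacknowledged": 0}
    for line in output.splitlines():
        fields = line.split("\t")
        # Data rows are: <queue name> <ready> <unacknowledged>.
        # Banner and header lines fail this check and are ignored.
        if len(fields) != 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        totals["ready"] += int(fields[1])
        totals["unacknowledged"] += int(fields[2])
    return totals
```

Keeping the two counters apart makes it possible to tell a backlog nobody is consuming (ready) from consumers that are stuck mid-processing (unacknowledged).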

Comment 19 Leonid Natapov 2018-12-12 10:20:47 UTC
Hey, Matthias. Please provide testing instructions for QA. Thanks!

Comment 20 Matthias Runge 2018-12-12 10:31:57 UTC
Leonid, after this is configured and enabled, you should see the queue length of rabbitmq queues. Usually the length will be (near to) zero. If you'd like to trigger bigger queues, do LOTS of actions on your machine.

Comment 21 Leonid Natapov 2019-01-15 12:40:39 UTC
(In reply to Matthias Runge from comment #20)
> Leonid, after this is configured and enabled, you should see the queue
> length of rabbitmq queues. Usually the length will be (near to) zero. If
> you'd like to trigger bigger queues, do LOTS of actions on your machine.

How do I configure and enable it? Can you explain step by step, please?

Comment 22 Matthias Runge 2019-01-16 08:00:54 UTC
Ideally, at the end, the required packages will land in kolla containers, and it *just* needs to get enabled in the python plugin.

Details for .yaml files will follow.

Comment 23 Matthias Runge 2019-02-06 21:02:47 UTC
Moving this to OSP16, see linked bug https://bugzilla.redhat.com/1673181

Comment 26 Martin Magr 2019-05-06 14:52:25 UTC
We need one more patch upstream to have this feature done.

Comment 37 Leonid Natapov 2020-07-16 07:35:10 UTC
Hey Matthias! Could you please provide the testing instructions for this BZ?

Comment 38 Matthias Runge 2020-07-20 14:23:30 UTC
In theory, you should be able to use this by adding

-e environments/metrics/collectd-read-rabbitmq.yaml

in your overcloud deploy.

Comment 39 Matthias Runge 2020-07-27 13:18:24 UTC
*** Bug 1860915 has been marked as a duplicate of this bug. ***

Comment 40 spower 2020-11-02 12:39:36 UTC
not approved for 16.1.3 as an exception

Comment 42 Leonid Natapov 2021-01-19 18:25:05 UTC
Getting the following error in collectd.log after including the template in the deploy command. The overcloud deploy itself was successful.

[2021-01-19 17:52:40] Unhandled python exception in loading module: KeyError: 'interval'
[2021-01-19 17:52:40] Traceback (most recent call last):
[2021-01-19 17:52:40]   File "/usr/lib/python3.6/site-packages/collectd_rabbitmq_monitoring/__init__.py", line 32, in configure
    INTERVAL = config['interval'][0]
[2021-01-19 17:52:40] KeyError: 'interval'
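
The traceback suggests the plugin indexes its options directly, with each option stored as a sequence of values. One defensive alternative, sketched here in Python, is to fall back to a default when a key is absent. This is not the actual plugin code: the `get_option` helper and the `DEFAULT_INTERVAL` value are hypothetical, and the dict-of-sequences shape is only inferred from the `config['interval'][0]` line above.

```python
DEFAULT_INTERVAL = 10  # hypothetical fallback, not a value taken from the plugin


def get_option(config: dict, key: str, default=None):
    """Return the first value stored under key, or default if the key is missing or empty."""
    values = config.get(key)
    if not values:
        return default
    return values[0]


# A config without 'interval' no longer raises KeyError.
config = {"host": ("192.0.2.10",), "port": ("15672",)}
interval = get_option(config, "interval", DEFAULT_INTERVAL)
```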

Comment 43 Leif Madsen 2021-03-23 02:07:15 UTC
I'm testing with this configuration, which adds the `interval` parameter that the script in `/usr/lib/python3.6/site-packages/collectd_rabbitmq_monitoring` expects. The configuration is the same as what is in /usr/share/openstack-tripleo-heat-templates/environments/metrics/collectd-read-rabbitmq.yaml, but with the added `interval` parameter.

# This environment file serves for enabling python-collect-rabbitmq and configuring
# it to monitor overcloud RabbitMQ instance

parameter_defaults:
  ControllerExtraConfig:
    tripleo::profile::base::metrics::collectd::python_read_plugins:
      - python-collectd-rabbitmq
    collectd::plugin::python::modules:
      collectd_rabbitmq_monitoring:
        config:
          - host: "%{hiera('rabbitmq::interface')}"
            port: "%{hiera('rabbitmq::port')}"
            username: "%{hiera('rabbitmq::default_user')}"
            password: "%{hiera('rabbitmq::default_pass')}"
            interval: 30

Comment 44 Leif Madsen 2021-03-23 03:41:15 UTC
Gets a bit further, but fails on connection.

[2021-03-23 03:09:14] Traceback (most recent call last):
[2021-03-23 03:09:14]   File "/usr/lib/python3.6/site-packages/collectd_rabbitmq_monitoring/__init__.py", line 54, in read
    overview = cl.get_overview()
[2021-03-23 03:09:14]   File "/usr/lib/python3.6/site-packages/pyrabbit2/api.py", line 294, in get_overview
    overview = self._call(Client.urls['overview'], 'GET')
[2021-03-23 03:09:14]   File "/usr/lib/python3.6/site-packages/pyrabbit2/api.py", line 123, in _call
    resp = self.http.do_call(path, method, body, headers)
[2021-03-23 03:09:14]   File "/usr/lib/python3.6/site-packages/pyrabbit2/http.py", line 99, in do_call
    raise NetworkError("Error during request %s %s" % (type(err), err))
[2021-03-23 03:09:14] pyrabbit2.http.NetworkError: Error during request <class 'requests.exceptions.ConnectionError'> ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

~~~

Started reading documentation, saw that the API interface is actually on 15672 (not 5672, which is the AMQP interface).

https://www.rabbitmq.com/management.html#http-api

Verified I see it on netstat -tlnp via the controller-0. Only listening on 127.0.0.1 though.

~~~
Checked I could get data back:

# curl -i --user guest:A9hS0SvMAzsHFnb4O2LwMdvdG http://127.0.0.1:15672/api/vhosts
HTTP/1.1 200 OK
cache-control: no-cache
content-length: 432
content-security-policy: default-src 'self'
content-type: application/json
date: Tue, 23 Mar 2021 03:32:44 GMT
server: Cowboy
vary: accept, accept-encoding, origin

[{"cluster_state":{"rabbit@controller-0":"running","rabbit@controller-1":"running","rabbit@controller-2":"running"},"messages":0,"messages_details":{"rate":0.0},"messages_ready":0,"messages_ready_details":{"rate":0.0},"messages_unacknowledged":0,"messages_unacknowledged_details":{"rate":0.0},"name":"/","recv_oct":375263338,"recv_oct_details":{"rate":1256.4},"send_oct":413665913,"send_oct_details":{"rate":813.8},"tracing":false}]
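
The same response can be reduced to per-vhost message counts with a few lines of Python. This is a sketch that reads only the `name`, `messages_ready` and `messages_unacknowledged` fields visible in the output above; the function name is hypothetical.

```python
import json


def vhost_message_counts(body: str) -> dict:
    """Map each vhost name to its ready and unacknowledged message counts."""
    return {
        vhost["name"]: {
            # Fall back to 0 if a counter is absent from the response.
            "ready": vhost.get("messages_ready", 0),
            "unacknowledged": vhost.get("messages_unacknowledged", 0),
        }
        for vhost in json.loads(body)
    }
```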

~~~

Searched for rabbitmq settings in openstack-tht, found this:

deployment/rabbitmq/rabbitmq-container-puppet.yaml:            rabbitmq::management_ip_address: 127.0.0.1


Going to try setting the management IP address to listen on the rabbitmq_interface...

parameter_defaults:
  ControllerExtraConfig:
    rabbitmq::management_ip_address: "%{hiera('rabbitmq::interface')}"
...[rest of config]...

Comment 45 Leif Madsen 2021-03-23 12:08:10 UTC
I was able to get this working:  https://metrics-store-service-telemetry.apps.stf.cloudops.psi.redhat.com/graph?g0.range_input=1h&g0.expr=collectd_rabbitmq_monitoring_gauge&g0.tab=0

Working configuration below. I believe the 15672 is an ssl_management_port configuration, but I'll need to do another deployment test to see if that hiera data is available and the correct parameter.

Need to verify that something on the host itself isn't expecting to have RabbitMQ listening on 127.0.0.1. Not sure if this causes a regression on something.

~~~

$ cat virt/collect-read-rabbitmq.yaml 

# This environment file serves for enabling python-collect-rabbitmq and configuring
# it to monitor overcloud RabbitMQ instance

parameter_defaults:
  ControllerExtraConfig:
    rabbitmq::management_ip_address: "%{hiera('rabbitmq::interface')}"
    tripleo::profile::base::metrics::collectd::python_read_plugins:
      - python-collectd-rabbitmq
    collectd::plugin::python::modules:
      collectd_rabbitmq_monitoring:
        config:
          - host: "%{hiera('rabbitmq::interface')}"
            port: "15672"
            username: "%{hiera('rabbitmq::default_user')}"
            password: "%{hiera('rabbitmq::default_pass')}"
            interval: 30

Comment 46 Matthias Runge 2021-03-23 12:37:52 UTC
rabbitmq should not be listening exclusively on 127.0.0.1.

Not sure what you mean by "regression" here. Usually, rabbitmq is listening/communicating on port 5672.

[root@compute-0 qemu]# cat /etc/services | grep 5672
amqp            5672/tcp                # AMQP
amqp            5672/udp                # AMQP
amqp            5672/sctp               # AMQP

15672 is not reserved (according to  /etc/services)

Comment 47 Leif Madsen 2021-03-23 16:40:23 UTC
(In reply to Matthias Runge from comment #46)
> rabbitmq should not be listening exclusively on 127.0.0.1.
> 
> Not sure what you mean by "regression" here. Usually, rabbitmq is
> listening/communicating on port 5672.

That's the AMQP interface. We're working with the API management interface, which is HTTP and runs on 15672 (not 5672, which is the AMQP interface).

By default the management interface runs on 15672, and is bound to 127.0.0.1, available from the host/node (physical, e.g. controller-0).

I just want to verify that being bound to a different network interface (non-localhost) is ok, and that something isn't expecting the management API to be bound to that.
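
Whether a given address and port pair is reachable from where collectd runs can be checked with a plain TCP probe. The following is a minimal sketch using only the Python standard library; the function name `is_listening` is hypothetical, and the probe only shows that something accepts connections, not which protocol it speaks.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Probing 5672 and 15672 on both 127.0.0.1 and the address behind `rabbitmq::interface` would show which listener is bound where.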

Comment 48 Leif Madsen 2021-03-23 23:24:09 UTC
Moving this back to ON_DEV since some changes are required to make this work without issue. The current file is missing the `interval` value, and the default management API interface is 127.0.0.1, which isn't exposed to collectd, thereby rendering the template invalid. The port also needs to be changed to another hiera value since `rabbitmq::port` is invalid here: it contains the AMQP port, not the management API port (5672 vs 15672).
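
The three problems listed here can be expressed as a small check over the resolved settings. This Python sketch assumes the hiera lookups have already been resolved to plain values; the function name, the argument layout and the message wording are hypothetical.

```python
AMQP_PORT = 5672  # AMQP listener; the management API uses 15672 by default
LOOPBACK_ADDRESSES = {"127.0.0.1", "::1", "localhost"}


def find_template_problems(plugin_config: dict, management_ip: str) -> list:
    """Return human-readable problems found in already-resolved settings."""
    problems = []
    if "interval" not in plugin_config:
        problems.append("plugin config has no 'interval'")
    port = plugin_config.get("port")
    if port is not None and int(port) == AMQP_PORT:
        problems.append("port 5672 is the AMQP port, not the management API port")
    if management_ip in LOOPBACK_ADDRESSES:
        problems.append("management API is bound to loopback only")
    return problems
```

An empty list means none of the three known problems is present; it does not guarantee that the deployment works.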

Comment 50 Leif Madsen 2021-03-24 03:28:18 UTC
Added changes upstream based on testing.

Comment 61 Leif Madsen 2022-10-19 20:36:22 UTC
Still tracking this task; however, the implementation originally targeted to satisfy this request will change. Instead, the plan is to leverage the exporter interface available in newer releases of RabbitMQ (available as of Wallaby/RHOSP 17.0).

In RHOSP 18.0 we'll make use of https://bugzilla.redhat.com/show_bug.cgi?id=2057627 to collect telemetry and transport it off-cluster.

