Description of problem:
We need to provide an alert in our availability monitoring solution when RabbitMQ queues are growing.

Use case:
As an operator, if the number of messages in my RabbitMQ queue keeps growing, that is a clear indication that something is going wrong in my setup, and I need to be notified of it.

Satisfaction criteria:
- a check is implemented
- checks are deployed for each RabbitMQ queue
- alerts are sent when queues are growing
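To make "queues are growing" concrete, here is a minimal sketch of one possible rule: treat a queue as growing when its depth increased in every one of the last few samples. The function name `is_growing` and the default window of 5 samples are illustrative assumptions, not part of any shipped check.

```python
from typing import Sequence


def is_growing(samples: Sequence[int], window: int = 5) -> bool:
    """Return True if the last `window` queue-depth samples are strictly increasing.

    `samples` is ordered oldest to newest. Both the name and the default
    window size are hypothetical choices for illustration.
    """
    if window < 2 or len(samples) < window:
        # Not enough data to judge a trend.
        return False
    recent = samples[-window:]
    # Compare each sample with its successor.
    return all(older < newer for older, newer in zip(recent, recent[1:]))
```

A single dip inside the window resets the verdict, which keeps short bursts from raising an alert; a real check would probably also want a minimum depth threshold.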
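A script to put RabbitMQ queue lengths into collectd:

HOSTNAME="${COLLECTD_HOSTNAME:-localhost}"
INTERVAL="${COLLECTD_INTERVAL:-10}"
while sleep "$INTERVAL"; do
  # Sum the message-count column (skipping banner/header lines) and emit one PUTVAL line
  sudo rabbitmqctl list_queues | awk -v host="$HOSTNAME" -v interval="$INTERVAL" \
    '$2 ~ /^[0-9]+$/ { s += $2 } END { print "PUTVAL \"" host "/rabbitmq/gauge-queues\" interval=" interval " N:" s+0 }'
done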
Plain "rabbitmqctl list_queues" will not show messages that have been taken by an application but not yet acknowledged by it. Maybe you should specify which kinds of messages you want, for example:

rabbitmqctl list_queues messages_ready messages_unacknowledged
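As a rough sketch of how that output could be consumed, the function below sums both counters. It assumes the command is run as `rabbitmqctl list_queues name messages_ready messages_unacknowledged` (the `name` column is added here for readability), and that queue names contain no whitespace. The function name `sum_queue_depths` is hypothetical.

```python
def sum_queue_depths(output: str) -> dict:
    """Sum ready and unacknowledged message counts from list_queues output.

    Expects rows of the form: <name> <messages_ready> <messages_unacknowledged>.
    Banner and header lines (for example "Listing queues ...") are skipped
    because their second and third fields are not numeric.
    """
    totals = {"messages_ready": 0, "messages_unacknowledged": 0}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        if not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        totals["messages_ready"] += int(fields[1])
        totals["messages_unacknowledged"] += int(fields[2])
    return totals
```

Keeping the two counters separate lets an alert distinguish a backlog nobody is consuming from messages that consumers have taken but not acknowledged.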
Hey, Matthias. Please provide testing instructions for QA. Thanks!
Leonid, after this is configured and enabled, you should see the queue length of rabbitmq queues. Usually the length will be (near to) zero. If you'd like to trigger bigger queues, do LOTS of actions on your machine.
(In reply to Matthias Runge from comment #20)
> Leonid, after this is configured and enabled, you should see the queue
> length of rabbitmq queues. Usually the length will be (near to) zero. If
> you'd like to trigger bigger queues, do LOTS of actions on your machine.

How do I configure and enable it? Can you explain step by step, please?
Ideally, at the end, the required packages will land in kolla containers, and it *just* needs to get enabled in the python plugin. Details for .yaml files will follow.
Moving this to OSP16, see linked bug https://bugzilla.redhat.com/1673181
We need one more patch upstream to have this feature done.
Hey Matthias! Could you please provide the testing instructions for this BZ?
In theory, you should be able to use this by adding -e environments/metrics/collectd-read-rabbitmq.yaml in your overcloud deploy.
*** Bug 1860915 has been marked as a duplicate of this bug. ***
not approved for 16.1.3 as an exception
Getting the following error in collectd.log after including the template in deploy command. The overcloud deploy itself was successful.

[2021-01-19 17:52:40] Unhandled python exception in loading module: KeyError: 'interval'
[2021-01-19 17:52:40] Traceback (most recent call last):
[2021-01-19 17:52:40]   File "/usr/lib/python3.6/site-packages/collectd_rabbitmq_monitoring/__init__.py", line 32, in configure
[2021-01-19 17:52:40]     INTERVAL = config['interval'][0]
[2021-01-19 17:52:40] KeyError: 'interval'
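The traceback suggests the plugin keeps its options in a mapping from key to a list of values and indexes it directly. A minimal sketch of a more tolerant lookup follows; the mapping shape is inferred from `config['interval'][0]`, and the helper names and the 30-second fallback are assumptions for illustration, not the plugin's actual code.

```python
DEFAULT_INTERVAL = 30  # assumed fallback in seconds, not taken from the plugin


def get_option(config: dict, key: str, default=None):
    """Return the first value stored under `key`, or `default` if absent or empty."""
    values = config.get(key)
    if not values:
        return default
    return values[0]


def read_interval(config: dict) -> int:
    """Read the polling interval without raising KeyError when it is missing."""
    return int(get_option(config, "interval", DEFAULT_INTERVAL))
```

With a lookup like this, an environment file that omits `interval` would fall back to the default instead of failing while the module loads.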
I'm testing with this configuration, which adds the `interval` parameter per the script in `/usr/lib/python3.6/site-packages/collectd_rabbitmq_monitoring`. The configuration is the same as what is in /usr/share/openstack-tripleo-heat-templates/environments/metrics/collectd-read-rabbitmq.yaml, but with the added `interval` parameter.

# This environment file serves for enabling python-collect-rabbitmq and configuring
# it to monitor overcloud RabbitMQ instance
parameter_defaults:
  ControllerExtraConfig:
    tripleo::profile::base::metrics::collectd::python_read_plugins:
      - python-collectd-rabbitmq
    collectd::plugin::python::modules:
      collectd_rabbitmq_monitoring:
        config:
          - host: "%{hiera('rabbitmq::interface')}"
            port: "%{hiera('rabbitmq::port')}"
            username: "%{hiera('rabbitmq::default_user')}"
            password: "%{hiera('rabbitmq::default_pass')}"
            interval: 30
Gets a bit further, but fails on connection.

~~~
[2021-03-23 03:09:14] Traceback (most recent call last):
[2021-03-23 03:09:14]   File "/usr/lib/python3.6/site-packages/collectd_rabbitmq_monitoring/__init__.py", line 54, in read
[2021-03-23 03:09:14]     overview = cl.get_overview()
[2021-03-23 03:09:14]   File "/usr/lib/python3.6/site-packages/pyrabbit2/api.py", line 294, in get_overview
[2021-03-23 03:09:14]     overview = self._call(Client.urls['overview'], 'GET')
[2021-03-23 03:09:14]   File "/usr/lib/python3.6/site-packages/pyrabbit2/api.py", line 123, in _call
[2021-03-23 03:09:14]     resp = self.http.do_call(path, method, body, headers)
[2021-03-23 03:09:14]   File "/usr/lib/python3.6/site-packages/pyrabbit2/http.py", line 99, in do_call
[2021-03-23 03:09:14]     raise NetworkError("Error during request %s %s" % (type(err), err))
[2021-03-23 03:09:14] pyrabbit2.http.NetworkError: Error during request <class 'requests.exceptions.ConnectionError'> ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
~~~

Started reading documentation, saw that the API interface is actually on 15672 (not 5672, which is the AMQP interface).

https://www.rabbitmq.com/management.html#http-api

Verified I see it in netstat -tlnp on controller-0. Only listening on 127.0.0.1 though.
Checked I could get data back:

~~~
# curl -i --user guest:A9hS0SvMAzsHFnb4O2LwMdvdG http://127.0.0.1:15672/api/vhosts
HTTP/1.1 200 OK
cache-control: no-cache
content-length: 432
content-security-policy: default-src 'self'
content-type: application/json
date: Tue, 23 Mar 2021 03:32:44 GMT
server: Cowboy
vary: accept, accept-encoding, origin

[{"cluster_state":{"rabbit@controller-0":"running","rabbit@controller-1":"running","rabbit@controller-2":"running"},"messages":0,"messages_details":{"rate":0.0},"messages_ready":0,"messages_ready_details":{"rate":0.0},"messages_unacknowledged":0,"messages_unacknowledged_details":{"rate":0.0},"name":"/","recv_oct":375263338,"recv_oct_details":{"rate":1256.4},"send_oct":413665913,"send_oct_details":{"rate":813.8},"tracing":false}]
~~~

Searched for rabbitmq settings in openstack-tht, found this:

deployment/rabbitmq/rabbitmq-container-puppet.yaml: rabbitmq::management_ip_address: 127.0.0.1

Going to try setting the management IP address to listen on the rabbitmq_interface...

parameter_defaults:
  ControllerExtraConfig:
    rabbitmq::management_ip_address: "%{hiera('rabbitmq::interface')}"
    ...[rest of config]...
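The curl check above can be reproduced from Python. This is only a sketch using the standard library, not what the collectd plugin does (the traceback shows it going through pyrabbit2). It assumes the management API's `/api/queues` endpoint, HTTP basic auth, and plain HTTP; the function names are hypothetical.

```python
import base64
import json
import urllib.request


def total_messages(queues) -> int:
    """Sum the `messages` field over queue objects as returned by /api/queues."""
    return sum(queue.get("messages", 0) for queue in queues)


def fetch_queues(host, port, username, password, timeout=5.0):
    """Fetch the queue list from the RabbitMQ management HTTP API.

    The port must be the management port (15672 by default), not the AMQP
    port (5672), and the API must be bound to an address reachable from here.
    """
    url = f"http://{host}:{port}/api/queues"
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `total_messages(fetch_queues("127.0.0.1", 15672, "guest", password))` run on a controller would give the cluster-wide queue depth, assuming the management API is still bound to localhost there.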
I was able to get this working:

https://metrics-store-service-telemetry.apps.stf.cloudops.psi.redhat.com/graph?g0.range_input=1h&g0.expr=collectd_rabbitmq_monitoring_gauge&g0.tab=0

Working configuration below. I believe the 15672 is an ssl_management_port configuration, but I'll need to do another deployment test to see if that hiera data is available and the correct parameter.

Need to verify that something on the host itself isn't expecting to have RabbitMQ listening on 127.0.0.1. Not sure if this causes a regression on something.

~~~
$ cat virt/collect-read-rabbitmq.yaml
# This environment file serves for enabling python-collect-rabbitmq and configuring
# it to monitor overcloud RabbitMQ instance
parameter_defaults:
  ControllerExtraConfig:
    rabbitmq::management_ip_address: "%{hiera('rabbitmq::interface')}"
    tripleo::profile::base::metrics::collectd::python_read_plugins:
      - python-collectd-rabbitmq
    collectd::plugin::python::modules:
      collectd_rabbitmq_monitoring:
        config:
          - host: "%{hiera('rabbitmq::interface')}"
            port: "15672"
            username: "%{hiera('rabbitmq::default_user')}"
            password: "%{hiera('rabbitmq::default_pass')}"
            interval: 30
~~~
rabbitmq should not be listening exclusively on 127.0.0.1.

Not sure what you mean by "regression" here. Usually, rabbitmq is listening/communicating on port 5672.

[root@compute-0 qemu]# cat /etc/services | grep 5672
amqp 5672/tcp # AMQP
amqp 5672/udp # AMQP
amqp 5672/sctp # AMQP

15672 is not reserved (according to /etc/services)
(In reply to Matthias Runge from comment #46)
> rabbitmq should not be listening exclusively on 127.0.0.1.
>
> Not sure what you mean by "regression" here. Usually, rabbitmq is
> listening/communicating on port 5672.

That's the AMQP interface. We're working with the management API interface, which is HTTP and runs on 15672 (not 5672, which is the AMQP interface).

By default the management interface runs on 15672 and is bound to 127.0.0.1, so it is only available from the host/node itself (physical, e.g. controller-0). I just want to verify that binding it to a different (non-localhost) network interface is OK, and that nothing is expecting the management API to be bound to 127.0.0.1.
Moving this back to ON_DEV since some changes are required to make this work without issue. The current file is missing the `interval` value, and the default management API interface is 127.0.0.1, which isn't exposed to collectd, so the template does not work as shipped. The port also needs to be changed to another hiera value, since `rabbitmq::port` contains the AMQP port (5672), not the management API port (15672).
Added changes upstream based on testing.
Still tracking this task, but the implementation originally targeted for it will change. Instead, we will leverage the exporter interface available in newer releases of RabbitMQ (available as of Wallaby/RHOSP 17.0). In RHOSP 18.0 we'll make use of https://bugzilla.redhat.com/show_bug.cgi?id=2057627 to collect telemetry and transport it off-cluster.