Bug 1790928 - OSP16 | collectd blows up when amqp1 write target is not reachable
Summary: OSP16 | collectd blows up when amqp1 write target is not reachable
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: collectd
Version: 15.0 (Stein)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z2
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Matthias Runge
QA Contact: Leonid Natapov
URL:
Whiteboard:
Depends On: 1771994
Blocks: 1797436 1817124 1859630
 
Reported: 2020-01-14 14:53 UTC by Matthias Runge
Modified: 2020-10-28 15:37 UTC (History)

Fixed In Version: collectd-5.11.0-4.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1817124 1859630
Environment:
Last Closed: 2020-10-28 15:36:49 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | collectd collectd pull 3432 | 0 | None | closed | amqp1: Add options to limit send queue length | 2021-01-12 12:43:06 UTC
Red Hat Knowledge Base (Solution) | 4855731 | 0 | None | None | None | 2020-03-02 18:43:37 UTC
Red Hat Product Errata | RHEA-2020:4284 | 0 | None | None | None | 2020-10-28 15:37:21 UTC

Description Matthias Runge 2020-01-14 14:53:44 UTC
Description of problem:

collectd-5.8.x

Comment 1 Matthias Runge 2020-01-14 14:55:16 UTC
Leonid, could you please provide info on what has happened, which versions, configuration, involved plugins?
Thank you

Comment 2 Chris Sibbitt 2020-01-15 20:10:24 UTC
I haven't done any actual verification of this yet, but when testing HA QDR scenarios I ran across something in the code[1] that I'm thinking could be the source of this leak.

Deliveries are created for each message on L152, and if the message is pre-settled (as it is for metrics, but not for events[2]) then we immediately call pn_delivery_settle() which will free resources for that delivery. For events, the call to pn_delivery_settle() doesn't occur until L203 when (hopefully!) the peer ACKs the message.

My guess is that if the message is dropped (or perhaps "when amqp1 write target is not reachable" as in this BZ) then we will never get an ACK and never free the delivery.

Again, untested, but it's my first guess. I hope this helps!

[1] https://github.com/collectd/collectd/blob/master/src/amqp1.c#L152
[2] https://github.com/openstack/tripleo-heat-templates/blob/master/environments/metrics/collectd-write-qdr.yaml#L13
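
The guess above can be illustrated with a toy model. This is only a sketch of the accounting described in this comment, not the plugin's code and not the Qpid Proton API: `link_model_t`, `model_send()` and `model_ack()` are hypothetical names invented for the illustration. It assumes one delivery is created per message, that a pre-settled delivery is released immediately, and that any other delivery is released only when the peer acknowledges it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: counts deliveries that were created but never settled.
 * All names here are hypothetical and do not exist in amqp1.c or Proton. */
typedef struct {
  size_t unsettled; /* deliveries still holding resources */
} link_model_t;

/* Sending always creates a delivery. A pre-settled message (metrics)
 * releases it right away; otherwise (events) it waits for an ACK. */
void model_send(link_model_t *link, bool presettled) {
  assert(link != NULL);
  link->unsettled++;
  if (presettled)
    link->unsettled--;
}

/* The peer acknowledged one delivery, so its resources can be freed. */
void model_ack(link_model_t *link) {
  assert(link != NULL);
  if (link->unsettled > 0)
    link->unsettled--;
}
```

In this model, calling `model_send(&link, false)` repeatedly without any `model_ack()` makes `unsettled` grow without bound, which is the behaviour suspected when the write target is not reachable.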

Comment 3 Aaron Smith 2020-01-16 15:52:16 UTC
In addition to the above, I believe that there is a problem with how messages are queued for delivery by AMQP.  In the write callback, messages are accepted and put into a local queue (implemented by collectd's DEQ_* library).  This queue does not (I think) check for a size and will continuously add messages to the internal queue.  The queue only gets drained when AMQP credit messages are received.  So I believe this queue can be unbounded if there is a problem with the QDR connection.  A possible solution would be to check the size of the queue before adding and make the queue a fixed size.  When an overrun occurs, drop from the head.
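
A minimal sketch of the fixed-size queue suggested above, dropping from the head on overrun. It is not the upstream patch and does not use collectd's DEQ_* macros: the names `bounded_queue_t`, `bq_push()`, `bq_pop()` and the `QUEUE_LIMIT` value are assumptions made for illustration, and plain integers stand in for queued messages.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_LIMIT 4 /* hypothetical cap on queued messages */

typedef struct {
  int items[QUEUE_LIMIT]; /* stand-in for queued messages */
  size_t head;            /* index of the oldest item */
  size_t len;             /* number of items currently queued */
  size_t dropped;         /* items discarded because the queue was full */
} bounded_queue_t;

void bq_init(bounded_queue_t *q) {
  assert(q != NULL);
  q->head = 0;
  q->len = 0;
  q->dropped = 0;
}

/* Enqueue an item. If the queue is full, discard the oldest item first,
 * so memory use stays constant while the peer is unreachable. */
void bq_push(bounded_queue_t *q, int item) {
  assert(q != NULL);
  if (q->len == QUEUE_LIMIT) {
    q->head = (q->head + 1) % QUEUE_LIMIT;
    q->len--;
    q->dropped++;
  }
  q->items[(q->head + q->len) % QUEUE_LIMIT] = item;
  q->len++;
}

/* Dequeue the oldest item into *out. Returns 1 on success, 0 if empty. */
int bq_pop(bounded_queue_t *q, int *out) {
  assert(q != NULL && out != NULL);
  if (q->len == 0)
    return 0;
  *out = q->items[q->head];
  q->head = (q->head + 1) % QUEUE_LIMIT;
  q->len--;
  return 1;
}
```

Dropping from the head keeps the newest samples, which seems the more useful choice for metrics; the `dropped` counter is there only so that an overrun would be visible.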

Comment 4 Chris Sibbitt 2020-01-21 21:16:23 UTC
Is this a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1771994 ?

Comment 5 Matthias Runge 2020-01-22 06:59:24 UTC
IMHO it is not a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1771994. When you deploy collectd on OpenStack, there should be no possibility to end up with no write plugin installed, as that doesn't make any sense.

This bug here is especially to make sure that collectd won't use tons of memory if it can not write data via amqp1.

Comment 8 Matthias Runge 2020-02-19 16:10:54 UTC
It may be valuable to look at limiting the queue length for collectd write plugins: https://collectd.org/documentation/manpages/collectd.conf.5.shtml#writequeuelimithigh_highnum

Comment 9 Ryan McCabe 2020-03-25 15:43:46 UTC
Upstream patch: https://github.com/collectd/collectd/pull/3432/commits/e7dd149f6f8279d844d172663023a841aa032a93

This will also need reasonable configuration and documentation if/when merged.

Comment 12 Matthias Runge 2020-04-15 14:57:51 UTC
The upstream patch merged, needs a downstream build.

Comment 16 Matthias Runge 2020-06-02 13:33:56 UTC
16.1 had collectd 5.11, not collectd 5.8

Comment 18 Alex McLeod 2020-06-16 12:28:27 UTC
If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text.

If this bug does not require doc text, please set the 'requires_doc_text' flag to '-'.

Comment 30 Leonid Natapov 2020-07-16 07:36:41 UTC
Hey Matthias! Could you please provide testing instructions for this BZ? What should be changed, and in which files, in order to test it?

Comment 31 Matthias Runge 2020-07-20 17:11:29 UTC
Leonid, as stated last week in the team chat: you cannot, since the fix introduces a new parameter that is not included in puppet-collectd. You could test it manually.

Comment 32 Matthias Runge 2020-09-07 06:56:09 UTC
The missing bits landed in https://bugzilla.redhat.com/show_bug.cgi?id=1861715


How to test (you need both collectd-5.11.0-4.el8ost and puppet-collectd-12.0.0-1.20200626073420.4686e16.el8ost):


In an environment file add::

ExtraConfig:
    collectd::plugin::amqp1::send_queue_limit: 40

and check whether the SendQueueLimit parameter is added to /etc/collectd.d/10-amqp1.conf

You should also be able to shut off your STF (or stop the route) and collectd memory usage should not go up.

Comment 36 Leonid Natapov 2020-09-29 15:54:21 UTC
# Generated by Puppet
<LoadPlugin amqp1>
  Globals false
  Interval 5
</LoadPlugin>

<Plugin amqp1>
  <Transport "metrics">
    Host "172.17.1.134"
    Port "5666"
    User "guest"
    Password "guest"
    Address "collectd"
    RetryDelay 1
    SendQueueLimit 40
    <Instance "notify">
      Format "JSON"
      Notify true
      PreSettle false
    </Instance>
    <Instance "telemetry">
      Format "JSON"
      PreSettle false
    </Instance>
  </Transport>
</Plugin>

collectd memory usage is stable

Comment 41 errata-xmlrpc 2020-10-28 15:36:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4284

