Description of problem: collectd-5.8.x
Leonid, could you please provide info on what has happened, which versions, configuration, involved plugins? Thank you
I haven't done any actual verification of this yet, but when testing HA QDR scenarios I ran across something in the code [1] that I'm thinking could be the source of this leak. Deliveries are created for each message on L152, and if the message is pre-settled (as it is for metrics, but not for events [2]) then we immediately call pn_delivery_settle(), which frees the resources for that delivery. For events, the call to pn_delivery_settle() doesn't occur until L203, when (hopefully!) the peer ACKs the message.

My guess is that if the message is dropped (or perhaps "when the amqp1 write target is not reachable", as in this BZ) then we will never get an ACK and never free the delivery. Again, untested, but it's my first guess. I hope this helps!

[1] https://github.com/collectd/collectd/blob/master/src/amqp1.c#L152
[2] https://github.com/openstack/tripleo-heat-templates/blob/master/environments/metrics/collectd-write-qdr.yaml#L13
In addition to the above, I believe that there is a problem with how messages are queued for delivery by AMQP. In the write callback, messages are accepted and put into a local queue (implemented with collectd's DEQ_* macros). This queue does not (I think) check its size and will keep adding messages to the internal queue; the queue only gets drained when AMQP credit messages are received. So this queue can grow unbounded if there is a problem with the QDR connection.

A possible solution would be to check the size of the queue before adding and make the queue a fixed size. When an overrun occurs, drop from the head.
Is this a dup of https://bugzilla.redhat.com/show_bug.cgi?id=1771994 ?
IMHO it is not a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1771994. When you deploy collectd on OpenStack, there should be no possibility to end up with no write plugin installed, as that doesn't make any sense. This bug here is especially to make sure that collectd won't use tons of memory if it can not write data via amqp1.
It may be valuable to look at limiting the queue length for collectd write plugins: https://collectd.org/documentation/manpages/collectd.conf.5.shtml#writequeuelimithigh_highnum
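For reference, the global write-queue bounds from collectd.conf(5) look like this in the daemon's main config file (the values below are purely illustrative, not recommendations):

```
# Global options in collectd.conf: above the high watermark new values
# are dropped; between low and high they are dropped probabilistically.
WriteQueueLimitHigh 1000000
WriteQueueLimitLow   800000
```

Note these limits apply to collectd's global write queue, which is separate from the amqp1 plugin's internal DEQ_* queue discussed above, so they may not fully solve this bug on their own.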
Upstream patch: https://github.com/collectd/collectd/pull/3432/commits/e7dd149f6f8279d844d172663023a841aa032a93 This will also need reasonable configuration and documentation if/when merged.
The upstream patch merged, needs a downstream build.
16.1 had collectd 5.11, not collectd 5.8
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1201154
If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text. If this bug does not require doc text, please set the 'requires_doc_text' flag to '-'.
Hey Matthias! Could you please provide testing instructions for this BZ? What should be changed in which files in order to test it?
Leonid, as stated last week in the team chat: you cannot, since the fix introduces a new parameter which is not yet included in puppet-collectd. You could test it manually.
The missing bits landed in https://bugzilla.redhat.com/show_bug.cgi?id=1861715

How to test (you need both collectd-5.11.0-4.el8ost and puppet-collectd-12.0.0-1.20200626073420.4686e16.el8ost):

In an environment file add:

ExtraConfig:
  collectd::plugin::amqp1::send_queue_limit: 40

and observe that the parameter SendQueueLimit is added to /etc/collectd.d/10-amqp1.conf.

You should also be able to shut off your STF (or stop the route) and you should not see collectd memory usage going up.
# Generated by Puppet
<LoadPlugin amqp1>
  Globals false
  Interval 5
</LoadPlugin>

<Plugin amqp1>
  <Transport "metrics">
    Host "172.17.1.134"
    Port "5666"
    User "guest"
    Password "guest"
    Address "collectd"
    RetryDelay 1
    SendQueueLimit 40
    <Instance "notify">
      Format "JSON"
      Notify true
      PreSettle false
    </Instance>
    <Instance "telemetry">
      Format "JSON"
      PreSettle false
    </Instance>
  </Transport>
</Plugin>

collectd memory usage is stable.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:4284