Bug 1294911 - oslo_messaging._drivers.amqpdriver has no "backoff" for throttling logging of errors?
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-oslo-log
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 7.0 (Kilo)
Assignee: Victor Stinner
QA Contact: Udi Shkalim
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-12-30 22:47 UTC by Robin Cernin
Modified: 2020-04-15 14:27 UTC
CC: 15 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1301751 (view as bug list)
Environment:
Last Closed: 2016-12-20 11:39:34 UTC
Target Upstream Version:
Embargoed:




Links:
OpenStack gerrit 322263 (last updated 2016-05-27 16:44:23 UTC)

Description Robin Cernin 2015-12-30 22:47:31 UTC
Description of problem:

We think that nova-conductor.log and nova-api.log have no "backoff" to throttle the logging of errors.

If the OpenStack cluster is stressed, errors are written to nova-conductor.log every 1 ms. Here is an example:

2015-12-15 20:32:27.101 22733 ERROR oslo_messaging._drivers.amqpdriver [-] Failed to process incoming message, retrying...
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver Traceback (most recent call last):
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 197, in poll
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver     self.conn.consume(limit=1)
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_qpid.py", line 769, in consume
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver     six.next(it)
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_qpid.py", line 698, in iterconsume
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver     yield self.ensure(_error_callback, _consume)
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_qpid.py", line 614, in ensure
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver     return method()
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_qpid.py", line 682, in _consume
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver     timeout=poll_timeout)
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver   File "<string>", line 6, in next_receiver
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 685, in next_receiver
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver     if self._ecwait(lambda: self.incoming, timeout):
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver     result = self._ewait(lambda: self.closed or predicate(), timeout)
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 591, in _ewait
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver     result = self.connection._ewait(lambda: self.error or predicate(), timeout)
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 230, in _ewait
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver     self.check_error()
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 223, in check_error
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver     raise e
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver error: [Errno 32] Broken pipe

This eventually fills the disk ("No space left on device") and makes the OpenStack environment unusable.

Could you please check if there is any implementation for throttling error logs?

Thank you,
Regards,
Robin Černín
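
For illustration only, the kind of throttling being asked for can be expressed as a standard Python logging filter. This is a sketch, not anything that exists in oslo.messaging or oslo.log; the ThrottleFilter class and its burst/interval parameters are made up for this example.

import logging
import time


class ThrottleFilter(logging.Filter):
    """Hypothetical filter: allow at most `burst` records per `interval`
    seconds from the logger it is attached to, and drop the rest."""

    def __init__(self, burst=10, interval=30.0):
        logging.Filter.__init__(self)
        self.burst = burst        # maximum records allowed per window
        self.interval = interval  # window length in seconds
        self._window_start = 0.0
        self._count = 0

    def filter(self, record):
        now = time.time()
        if now - self._window_start > self.interval:
            # Start a new window and reset the budget.
            self._window_start = now
            self._count = 0
        self._count += 1
        return self._count <= self.burst


# Attach the filter to the noisy logger: at most 10 of its records
# every 30 seconds would reach the handlers.
logging.getLogger('oslo_messaging._drivers.amqpdriver').addFilter(ThrottleFilter())

A real implementation would also need to report how many records were dropped, similar to what journald does.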

Comment 2 Robin Cernin 2015-12-31 10:29:35 UTC
Hello,

It might be good to make this consistent with other messages, for example this one, which may also occur in nova-conductor.log:

[-] Unexpected exception occurred N time(s)... retrying.

So it would become something like:

[-] Failed to process incoming message occurred N time(s), retrying...

Thank you,
Regards,
Robin Černín

Comment 7 Flavio Percoco 2016-01-19 16:25:29 UTC
oslo.messaging is a library consumed by other OpenStack projects and has no control over the logging configuration. Each project provides a config file for logging (e.g., https://github.com/openstack/glance/blob/master/etc/glance-api.conf#L384 ) that can be tuned for this.

Projects use oslo.log[0] to handle most of these configurations and to instantiate the various loggers. Therefore, I'm moving this BZ under oslo.log to check whether there's a way to throttle logs using that library. I'm not aware of a way to do that right now; perhaps it could be done using syslog.

[0] https://github.com/openstack/oslo.log
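
As a purely illustrative example of the per-project tuning mentioned above: services that use oslo.log can point log_config_append at a standard Python logging configuration file, and a configuration like the following would drop everything below CRITICAL coming from the oslo_messaging loggers. The handler and formatter names are arbitrary placeholders, and silencing errors this way is a workaround, not a fix.

[loggers]
keys = root, oslo_messaging

[handlers]
keys = stderr

[formatters]
keys = default

[logger_root]
level = WARNING
handlers = stderr

[logger_oslo_messaging]
# Workaround sketch: only CRITICAL and above from oslo.messaging,
# including the AMQP driver, reach the handler.
level = CRITICAL
handlers = stderr
qualname = oslo_messaging
propagate = 0

[handler_stderr]
class = StreamHandler
args = (sys.stderr,)
formatter = default

[formatter_default]
format = %(asctime)s %(process)d %(levelname)s %(name)s %(message)s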

Comment 11 Lars Kellogg-Stedman 2016-01-19 21:55:54 UTC
What if we were to (a) stop logging to syslog, (b) stop logging to files, and (c) just log to stdout/stderr, so that all log messages would be handled by journald (and could eventually end up in syslog anyway)?

Then we could take advantage of the rate limiting support in journald:

RateLimitInterval=, RateLimitBurst=

    Configures the rate limiting that is applied to all messages generated on the system. If, in the time interval defined by RateLimitInterval=, more messages than specified in RateLimitBurst= are logged by a service, all further messages within the interval are dropped until the interval is over. A message about the number of dropped messages is generated. This rate limiting is applied per-service, so that two services which log do not interfere with each other's limits. Defaults to 1000 messages in 30s. The time specification for RateLimitInterval= may be specified in the following units: "s", "min", "h", "ms", "us". To turn off any kind of rate limiting, set either value to 0.
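
For completeness, the journald side of this proposal is plain configuration. Based on the defaults quoted above, a stricter limit could be set in /etc/systemd/journald.conf along these lines (values are examples only); the change takes effect after restarting systemd-journald.

[Journal]
# Allow at most 200 messages from a given service in any 30-second
# window; further messages in the window are dropped and counted.
RateLimitInterval=30s
RateLimitBurst=200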

Comment 16 Victor Stinner 2016-08-16 12:01:50 UTC
FYI, my change https://review.openstack.org/#/c/322263/ was approved but failed to merge. I'm still investigating why the tests failed.

Comment 17 Victor Stinner 2016-09-26 12:09:43 UTC
Because of the lack of feedback, I consider that the underlying issue has been resolved and that this side effect is no longer a problem.

Anyway, a generic solution to limit logs is to use journald, which supports log rate limiting.

Moreover, in parallel, a rate-limiting feature was added to the development version of OpenStack (in oslo.log).
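
For reference, assuming the feature referred to here is the rate limiting that later appeared in oslo.log, it is driven by configuration options in the consuming service's config file, roughly as follows. Option names and defaults should be verified against the installed oslo.log version; the values are examples.

[DEFAULT]
# Maximum number of logged messages per rate_limit_interval;
# 0 (the default) disables rate limiting.
rate_limit_burst = 200
# Length of the rate-limit window, in seconds.
rate_limit_interval = 30
# Messages at or above this level are never dropped.
rate_limit_except_level = CRITICAL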

