Description of problem: nova-conductor.log and nova-api.log appear to have no "backoff" for throttling repeated error messages. When the OpenStack cluster is under stress, errors are written to nova-conductor.log every 1 ms. Here is an example:

2015-12-15 20:32:27.101 22733 ERROR oslo_messaging._drivers.amqpdriver [-] Failed to process incoming message, retrying...
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver Traceback (most recent call last):
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 197, in poll
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver     self.conn.consume(limit=1)
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_qpid.py", line 769, in consume
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver     six.next(it)
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_qpid.py", line 698, in iterconsume
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver     yield self.ensure(_error_callback, _consume)
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_qpid.py", line 614, in ensure
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver     return method()
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_qpid.py", line 682, in _consume
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver     timeout=poll_timeout)
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver   File "<string>", line 6, in next_receiver
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 685, in next_receiver
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver     if self._ecwait(lambda: self.incoming, timeout):
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver     result = self._ewait(lambda: self.closed or predicate(), timeout)
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 591, in _ewait
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver     result = self.connection._ewait(lambda: self.error or predicate(), timeout)
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 230, in _ewait
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver     self.check_error()
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver   File "/usr/lib/python2.7/site-packages/qpid/messaging/endpoints.py", line 223, in check_error
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver     raise e
2015-12-15 20:32:27.101 22733 TRACE oslo_messaging._drivers.amqpdriver error: [Errno 32] Broken pipe

This eventually fills the disk ("No space left on device") and the OpenStack environment becomes unusable. Could you please check whether there is any implementation for throttling error logs?

Thank you,
Regards,
Robin Černín
Hello,

It might be good to apply the same treatment to similar messages, for example another error message that can appear in nova-conductor.log:

[-] Unexpected exception occurred N time(s)... retrying.

So the throttled version of the message above would read:

[-] Failed to process incoming message occurred N time(s), retrying...

Thank you,
Regards,
Robin Černín
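For illustration, the aggregation suggested above could be sketched with the standard Python logging module as a filter that suppresses identical repeated messages and periodically emits an "occurred N time(s)" summary instead. This is only a hypothetical sketch (class name and threshold are made up), not oslo.log code:

```python
import logging


class RepeatSuppressFilter(logging.Filter):
    """Suppress identical repeated log messages.

    The first occurrence is logged normally; subsequent identical
    messages are dropped, except that every `summary_every` repeats an
    aggregate "occurred N time(s)" summary is emitted.

    Hypothetical sketch -- not part of oslo.log.
    """

    def __init__(self, summary_every=100):
        super().__init__()
        self.summary_every = summary_every
        self.last_msg = None
        self.repeats = 0

    def filter(self, record):
        msg = record.getMessage()
        if msg == self.last_msg:
            self.repeats += 1
            if self.repeats % self.summary_every == 0:
                # Rewrite this record into an aggregate summary line.
                record.msg = "%s (occurred %d time(s))" % (msg, self.repeats)
                record.args = ()
                return True
            return False  # drop the repeat
        # A different message resets the counter and is logged normally.
        self.last_msg = msg
        self.repeats = 0
        return True
```

Attached to a logger (or handler), this turns a flood of identical tracebacks into one line plus periodic summaries, at the cost of losing exact timestamps of the dropped repeats.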
oslo.messaging is a library consumed by other OpenStack projects and has no control over the logging configuration. Each project provides a config file for logging (e.g. https://github.com/openstack/glance/blob/master/etc/glance-api.conf#L384 ) that can be tuned for this. Projects use oslo.log [0] to handle most of these settings and to instantiate the various loggers. I'm therefore moving this BZ to oslo.log to check whether there is a way to throttle logs using that library. I'm not aware of a way to do that right now; perhaps it could be done using syslog.

[0] https://github.com/openstack/oslo.log
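As a partial workaround while no throttling exists, oslo.log accepts a standard Python logging configuration file (via its log-config-append mechanism), so a deployer could at least raise the level of the single noisy logger. A sketch, with illustrative paths and handler setup:

```ini
[loggers]
keys = root, amqpdriver

[handlers]
keys = file

[formatters]
keys = default

[logger_root]
level = INFO
handlers = file

[logger_amqpdriver]
; Silence the repeated poll errors from this one logger only.
level = CRITICAL
handlers = file
qualname = oslo_messaging._drivers.amqpdriver
propagate = 0

[handler_file]
class = FileHandler
args = ('/var/log/nova/nova-conductor.log',)
formatter = default

[formatter_default]
format = %(asctime)s %(process)d %(levelname)s %(name)s %(message)s
```

This hides the errors rather than rate limiting them, so it is not a real fix for the problem reported here.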
What if we were to (a) stop logging to syslog, (b) stop logging to files, and (c) just log to stdout/stderr? All log messages would then be handled by journald (and could eventually end up in syslog anyway), and we could take advantage of the rate limiting support in journald:

    RateLimitInterval=, RateLimitBurst=
        Configures the rate limiting that is applied to all messages generated on the system. If, in the time interval defined by RateLimitInterval=, more messages than specified in RateLimitBurst= are logged by a service, all further messages within the interval are dropped until the interval is over. A message about the number of dropped messages is generated. This rate limiting is applied per-service, so that two services which log do not interfere with each other's limits. Defaults to 1000 messages in 30s. The time specification for RateLimitInterval= may be specified in the following units: "s", "min", "h", "ms", "us". To turn off any kind of rate limiting, set either value to 0.
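For instance, the limits quoted above could be tightened in /etc/systemd/journald.conf (the values here are illustrative, not a recommendation):

```ini
# /etc/systemd/journald.conf
[Journal]
# Drop further messages from a given service once it has logged more
# than 200 messages within any 10-second interval; journald then logs
# a single line reporting how many messages were suppressed.
RateLimitInterval=10s
RateLimitBurst=200
```

Since the limit is applied per service, a flood from nova-conductor would not affect logging from nova-api or other services.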
FYI, my change https://review.openstack.org/#/c/322263/ was approved but failed to merge. I'm still investigating why the tests failed.
Given the lack of feedback, I assume the underlying issue has since been reworked and that this side effect is no longer a problem. In any case, a generic way to limit logs is to use journald, which supports log rate limiting. In parallel, a rate limiting feature was added to the development version of OpenStack (in oslo.log).
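For reference, the oslo.log rate limiting mentioned above is driven by options in the consuming service's configuration file. To the best of my knowledge the options look like the following (values are illustrative; check the oslo.log release notes for the exact names in your version):

```ini
[DEFAULT]
# Interval (seconds) over which the burst limit applies; 0 disables
# rate limiting entirely.
rate_limit_interval = 1
# Maximum number of logged messages per interval.
rate_limit_burst = 100
# Log level at or above which messages are never rate limited.
rate_limit_except_level = CRITICAL
```

Unlike the journald approach, this limits log volume inside the service itself, before the messages ever reach a file or syslog.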