Bug 1301751

Summary: Move all logging to stdout/stderr to allow systemd to throttle error logging
Product: [Community] RDO
Reporter: Alan Pevec <apevec>
Component: distribution
Assignee: Alan Pevec <apevec>
Status: CLOSED WONTFIX
QA Contact: Shai Revivo <srevivo>
Severity: high
Priority: high
Version: trunk
CC: apevec, chris.brown, fdinitto, fpercoco, itamar.landsman, lars, lhh, markmc, michele, pablo.iranzo, rcernin, srevivo, vstinner
Target Milestone: Milestone3
Target Release: trunk
Hardware: Unspecified
OS: Linux
Doc Type: Bug Fix
Clone Of: 1294911
Last Closed: 2018-11-21 00:37:17 UTC
Type: Bug

Description Alan Pevec 2016-01-25 22:11:32 UTC
+++ This bug was initially created as a clone of Bug #1294911 +++

Description of problem:

We think that nova-conductor.log and nova-api.log have no "backoff" mechanism for throttling error logging; is that correct?

If the OpenStack cluster is stressed, errors are written to nova-conductor.log as often as every 1 ms; here is an example:

...snip, see the original BZ for the ugly, unreadable log examples...

This eventually ends in "No space on disk left." and the OpenStack environment becomes unusable.

Could you please check whether there is any mechanism for throttling error logs?


--- Additional comment from Lars Kellogg-Stedman on 2016-01-19 16:55:54 EST ---

What if we were to (a) stop logging to syslog, (b) stop logging to files, and (c) just log to stdout/stderr? All log messages would then be handled by journald, and could eventually end up in syslog anyway.

Then we could take advantage of the rate limiting support in journald:

RateLimitInterval=, RateLimitBurst=

    Configures the rate limiting that is applied to all messages generated on the system. If, in the time interval defined by RateLimitInterval=, more messages than specified in RateLimitBurst= are logged by a service, all further messages within the interval are dropped until the interval is over. A message about the number of dropped messages is generated. This rate limiting is applied per-service, so that two services which log do not interfere with each other's limits. Defaults to 1000 messages in 30s. The time specification for RateLimitInterval= may be specified in the following units: "s", "min", "h", "ms", "us". To turn off any kind of rate limiting, set either value to 0.
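
For illustration, based on the journald.conf documentation quoted above, a drop-in along these lines would tighten the limit (the path and numbers are only an example, not values taken from this bug):

    # /etc/systemd/journald.conf.d/ratelimit.conf  (illustrative drop-in)
    [Journal]
    # Allow at most 5000 messages per service in any 30 s window;
    # messages beyond the burst are dropped until the window ends.
    RateLimitInterval=30s
    RateLimitBurst=5000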

Comment 2 Christopher Brown 2017-06-17 18:51:03 UTC
This is something I might be interested in picking up, but my guess is that it needs _much_ wider input, as it will be a big change for a lot of people.

Oslo supports systemd logging integration:

https://docs.openstack.org/developer/oslo.log/journal.html
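
As a sketch, routing a service's logs to the journal would look roughly like this in its oslo.config file (option names are from oslo.log; the values and the idea of disabling the other outputs are just an illustration of Lars's proposal, not a tested recommendation):

    [DEFAULT]
    # Send log records to the systemd journal instead of per-service files,
    # so journald's rate limiting applies.
    use_journal = True
    use_syslog = False
    use_stderr = False
    # leave log_file / log_dir unset so no file-based logging happens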

There's also an argument that, now that we have integrated availability and performance monitoring, operators have no excuse not to monitor disk space, etc.

Comments?

Comment 3 Victor Stinner 2018-06-20 13:15:42 UTC
> We think that nova-conductor.log and nova-api.log have no "backoff" mechanism for throttling error logging; is that correct?

I implemented rate limiting in Oslo Log for bz#1294911:

https://docs.openstack.org/oslo.log/latest/configuration/index.html#DEFAULT.rate_limit_interval

rate_limit_interval
    Type:	integer
    Default:	0

    Interval, number of seconds, of log rate limiting.

rate_limit_burst
    Type:	integer
    Default:	0

    Maximum number of logged messages per rate_limit_interval.

rate_limit_except_level
    Type:	string
    Default:	CRITICAL

    Log level name used by rate limiting: CRITICAL, ERROR, INFO, WARNING, DEBUG or empty string. Logs with level greater or equal to rate_limit_except_level are not filtered. An empty string means that all levels are filtered.


Upstream issue (merged 1 year 9 months ago): https://review.openstack.org/#/c/322263/


Rate limiting means dropping logs, which has an impact on security and debugging, so it's disabled by default.
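
For reference, enabling it in a service's config would look roughly like this (the numbers are purely illustrative, using the options documented above):

    [DEFAULT]
    # Allow at most 100 log records per 30-second interval;
    # CRITICAL records are never dropped.
    rate_limit_interval = 30
    rate_limit_burst = 100
    rate_limit_except_level = CRITICAL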

Comment 4 Alan Pevec 2018-11-21 00:37:17 UTC
With the move to containerized deployments, logging is under container management control and cannot be solved in packaging, so further enhancements should be made in the deployment framework, TripleO.