Bug 1380458 - rsyslogd throttling too aggressive - critical messages lost when large cluster is in trouble
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Documentation
Version: 3.3.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Mike Fiedler
QA Contact: Vikram Goyal
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-09-29 16:15 UTC by Mike Fiedler
Modified: 2017-02-23 15:18 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-02-23 15:18:08 UTC
Target Upstream Version:



Description Mike Fiedler 2016-09-29 16:15:56 UTC
Description of problem:

For https://bugzilla.redhat.com/show_bug.cgi?id=1380455 we did not catch the critical messages that were being logged when the master controller failed over, so we don't know whether there was a panic, systemd activity, etc. All that was in the log was:

Sep 28 18:01:23 svt-m-1 atomic-openshift-master-controllers: I0928 18:01:23.450298   74228 priorities.go:39] Combined requested resources 2000 from existing pods exceeds capacity 1000 on node 192.1.2.45
Sep 28 18:01:23 svt-m-1 rsyslogd-2177: imjournal: begin to drop messages due to rate-limiting
Sep 28 18:08:52 svt-m-1 rsyslogd-2177: imjournal: 17351 messages lost due to rate-limiting
Sep 28 18:08:52 svt-m-1 atomic-openshift-node: I0928 18:08:52.873723   46320 roundrobin.go:273] LoadBalancerRR: Setting endpoints for default/kubernetes:https to [192.1.0.49:8443 192.1.0.51:8443]

The failover happened at ~18:02

We should set a wider window/less aggressive throttle for rsyslogd on masters/etcd/load balancers


Version-Release number of selected component (if applicable): 3.30.32


How reproducible: Always, if message bursts are high enough

Comment 1 Rich Megginson 2016-09-29 16:22:11 UTC
Peter, is this an rsyslog or journald configuration issue?

Comment 2 Timothy St. Clair 2016-09-29 16:35:01 UTC
PSA: This has happened to us on a number of occasions and has two causes.

1. We still have hot-loop logging in some places.  
2. Throttles on logs need to be opened up.

Comment 3 Peter Portante 2016-10-03 01:59:53 UTC
@Rich, this is specifically an rsyslog default ratelimit setting. We need to keep in mind that systemd rate-limits per-service, while rsyslog rate-limits all logs read from the journal.

So we need to bump up the imjournal settings, so that the aggregate rate that all services can log at via the journal is handled.  See ratelimit.interval, ratelimit.burst at http://www.rsyslog.com/doc/v7-stable/configuration/modules/imjournal.html
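
As a rough sketch of what raising the aggregate imjournal limit could look like in rsyslog's module() syntax: ratelimit.interval and ratelimit.burst are the parameters named in the imjournal documentation linked above. The numeric values and the state file name are illustrative assumptions, not settings recommended anywhere in this bug.

```
# /etc/rsyslog.conf (sketch; values are illustrative assumptions, not tested recommendations)
# imjournal can be loaded only once, so this would replace an existing load line
# rather than being added alongside it.
# StateFile: assumed state file name
# ratelimit.interval: length of the rate-limiting window, in seconds
# ratelimit.burst: messages accepted per window before imjournal starts dropping
module(load="imjournal"
       StateFile="imjournal.state"
       ratelimit.interval="600"
       ratelimit.burst="100000")
```

Because this limit applies to everything read from the journal, the burst value would need to cover the combined output of all services on the host during a failure, not just the noisiest one.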

Note that the historical reason for this rate limit was corrupted systemd journals causing rsyslog to see an unlimited stream of messages that were not there.

However, if one service is doing most of the messaging, we still need to be sure we raise the limits in the systemd configuration for journald.

That said, hopefully systemd is configured with persistent logs, meaning the /var/log/journal directory is present. Otherwise, under such a storm, it might be easy to hit the 4GB in-memory limit and lose logs.

Last I heard, OpenShift uses persistent logging so this might not be an issue.
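
For the journald side, a minimal sketch of the relevant options might look like the following. The option names come from journald.conf; the values are illustrative assumptions and would need tuning per host.

```
# /etc/systemd/journald.conf (sketch; values are illustrative assumptions)
[Journal]
# Keep the journal on disk under /var/log/journal instead of only in memory.
Storage=persistent
# Per-service rate limit: at most RateLimitBurst messages per RateLimitInterval.
# Newer systemd releases spell the first option RateLimitIntervalSec.
RateLimitInterval=30s
RateLimitBurst=10000
```

journald applies these limits per service, so they only matter when a single service produces most of the burst. The rsyslog imjournal limit still applies to the aggregate.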

If you need the proper settings, I can help track down the information and add it here.

Comment 4 Timothy St. Clair 2016-10-03 13:29:22 UTC
xref: https://github.com/kubernetes/kubernetes/issues/33935

Comment 5 Scott Dodson 2016-10-04 15:13:15 UTC
Yeah, let's get the suggested settings. RHEL7 by default doesn't persist the journal and we're not doing anything with that.

Comment 6 Mike Fiedler 2016-10-05 16:01:04 UTC
Who can provide suggested settings? Rich or Peter, do you have any resources?

I know we can disable the throttling with

$SystemLogRateLimitInterval 0
$SystemLogRateLimitBurst  0


But that seems like it might not be the right answer.
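
One caveat: the drop messages in the description are emitted by imjournal, while the $SystemLogRateLimit* directives are documented for the imuxsock module, so they may not affect this path at all. A hedged sketch using imjournal's own legacy directives, raising the limit rather than disabling it (the numbers are illustrative assumptions):

```
# /etc/rsyslog.conf, legacy directive syntax (sketch)
# The imjournal directives are expected to come after the module is loaded.
$ModLoad imjournal
# Illustrative values; setting the interval to 0 would turn rate limiting off entirely.
$imjournalRatelimitInterval 600
$imjournalRatelimitBurst 100000
```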

Comment 15 Mike Fiedler 2017-02-09 14:27:18 UTC
I (originator of this bz) am going to backtrack on this a bit. I, as a system admin, would not expect or want the OpenShift installer to modify any customization I've made in this area, and I certainly would not want throttling disabled unless I've done it myself.

I think this should be a documentation update or knowledge base article on OpenShift troubleshooting. I am happy to help provide content.

Comment 16 Scott Dodson 2017-02-09 14:41:04 UTC
Moving to documentation component based on comment 15.

Comment 17 Vikram Goyal 2017-02-09 23:19:23 UTC
(In reply to Mike Fiedler from comment #15)
> I (originator of this bz) am going to backtrack on this a bit.   I, as
> system admin, would not expect or want the OpenShift installer to modify any
> customization I've made in this area and certainly would not want throttling
> disabled unless I've done it myself.  
> 
> I think this should be a documentation update or knowledge base article on
> OpenShift trouble shooting.   I am happy to help provide content.

Mike, are you able to push a PR [1] and tag me? I can then get someone from the docs team to clean it up and find the right place for it.

If not a PR, some text to me via email would work as well.

[1] https://github.com/openshift/openshift-docs

Comment 18 Mike Fiedler 2017-02-10 20:07:55 UTC
I should be able to do that, likely next week. If you prefer, you can assign the bz to me until I have a PR for you.

Comment 19 Vikram Goyal 2017-02-23 08:10:22 UTC
Mike - did you create a PR?

Comment 20 Mike Fiedler 2017-02-23 15:18:08 UTC
The more I think about this one, the more I believe it is normal system admin activity. Documenting the Linux config for rsyslog/journald in the OpenShift documentation feels wrong. If no one objects, I am going to close this one out. If you feel strongly otherwise, please re-open.

