Bug 1450626
| Summary: | [RFE] Enhance journald to allow rate-limits to be applied per unit instead of just per server | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Peter Portante <pportant> |
| Component: | systemd | Assignee: | systemd-maint |
| Status: | CLOSED WONTFIX | QA Contact: | qe-baseos-daemons |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.5-Alt | CC: | aaron_wilk, aivaraslaimikis, dtardon, msekleta, pdwyer, systemd-maint-list |
| Target Milestone: | rc | Keywords: | FutureFeature |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1719577 (view as bug list) | Environment: | |
| Last Closed: | 2021-01-15 07:35:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Peter Portante
2017-05-14 04:43:24 UTC
(In reply to Peter Portante from comment #0)
> See https://bugzilla.redhat.com/show_bug.cgi?id=1445797
>
> For kubernetes, and potentially other sub-systems in a similar situation,
> having rate limits applied per-service does not work when most of the
> logging traffic sent to journald comes through one service.

kubernetes and others should then run containers as separate units. Kubernetes can create a scope unit for each container. IIRC docker used to do it by default, hence I am not sure how come you have this problem in the first place. Anyway, container processes then live in a separate cgroup and a malicious container can't mess up logging for other containers.

Also, as Linux containers are built on top of namespaces and cgroups, having all containers run in the same service (=~ cgroup) means you can't really manage system resources on a per-container basis.

Btw, is there any other kernel-based process group mechanism other than cgroups that we could leverage here?

(In reply to Michal Sekletar from comment #2)
> (In reply to Peter Portante from comment #0)
> > See https://bugzilla.redhat.com/show_bug.cgi?id=1445797
> >
> > For kubernetes, and potentially other sub-systems in a similar situation,
> > having rate limits applied per-service does not work when most of the
> > logging traffic sent to journald comes through one service.
>
> kubernetes and others should then run containers as separate units.
> Kubernetes can create a scope unit for each container. IIRC docker used to
> do it by default, hence I am not sure how come you have this problem in the
> first place. Anyway, container processes then live in a separate cgroup and
> a malicious container can't mess up logging for other containers.
>
> Also, as Linux containers are built on top of namespaces and cgroups, having
> all containers run in the same service (=~ cgroup) means you can't really
> manage system resources on a per-container basis.
>
> Btw, is there any other kernel-based process group mechanism other than
> cgroups that we could leverage here?

They don't all run in the same namespace/cgroup, but are in fact isolated. There are a few "best practices" recommended by K8s for collecting logs from containers:

- Use a node-level logging agent that runs on every node.
- Include a dedicated sidecar container for logging in an application pod.
  - The sidecar container streams application logs to its own stdout.
  - The sidecar container runs a logging agent, which is configured to pick up logs from an application container.
- Push logs directly to a backend from within an application.

https://kubernetes.io/docs/concepts/cluster-administration/logging/

The particular scenario that I believe Peter (as well as my team) is having trouble with is where the journald log driver is used to send container stdout/stderr to journald. For example, this could occur when using a node-level logging agent that reads events from journald and forwards them to, say, ELK.

The issue is that containers log to stdout/stderr, which is picked up by docker's journald log driver and written by docker.service to the journal for all containers on the node/host. When this happens, the journald log driver appends journald metadata to each event; the fields are listed in the link below. One solution could be to have journald rate limit by unit and, if it exists, CONTAINER_ID_FULL, instead of just service/unit.
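For concreteness, here is a minimal sketch of the metadata path described above; it is not taken from the bug itself, and the container name `myapp` and the image reference are hypothetical placeholders. With docker's journald log driver, every container's stdout/stderr enters the journal through docker.service (so today's per-service rate limit is shared), but each entry already carries per-container fields such as CONTAINER_NAME and CONTAINER_ID_FULL that could in principle key a finer-grained limit:

```sh
# Start a container whose stdout/stderr is captured by the journald log driver.
# ("myapp" and the image reference are hypothetical placeholders.)
docker run -d --name myapp --log-driver=journald registry.example.com/myapp:latest

# All of its output is submitted to journald by docker.service, but the entries
# carry per-container metadata fields, so they can be selected per container:
journalctl CONTAINER_NAME=myapp --since "1 hour ago"

# Show the attached fields, including CONTAINER_ID_FULL, for one entry:
journalctl -o verbose CONTAINER_NAME=myapp -n 1
```

The request in this bug is essentially for journald to take such fields into account when rate limiting, instead of charging everything against docker.service.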
Documentation on the journald log driver - https://docs.docker.com/config/containers/logging/journald/
General documentation on docker log drivers - https://docs.docker.com/config/containers/logging/configure/

To work around this issue and report to the application owners when their app is consuming too much of the rate limit enforced on docker.service, we detect when a journald "Suppressed" event occurs by cron'ing something like this:

    nice -n 18 sh -c 'journalctl --since "1 hour ago" --unit systemd-journald.service | grep -i "Suppressed" | wc -l'

...and if the returned count is >0 we then look deeper:

    journalctl -o json-pretty --since "1 hour ago" | jq -s '[.[] | { name: (if .CONTAINER_NAME then .CONTAINER_NAME else ._COMM end), cursor:.__CURSOR }] | group_by(.name) | map({name: .[0].name, length: [.[].cursor] | length}) | sort_by(.length)'

It's a bit messy (a consolidated sketch of both checks appears at the end of this report), but under normal circumstances the tenants/containers do not exceed the rate limit journald applies to docker.service. So when it does happen, we want to notify the application owners that they should probably have a look, if they aren't already doing so due to some alert they've received.

What I think Peter, and certainly myself, are requesting is that journald somehow rate limit on a per-container basis instead of on all of docker.service, which impacts other applications'/containers' ability to log to journald. The above comment about checking whether CONTAINER_ID_FULL exists was merely a suggestion; I'm sure there are other ways this could be done.

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.
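The two detection steps from the workaround above can be combined into a single cron'able check. This is a rough, hedged sketch rather than an official tool: the one-hour window and the Suppressed-message grep come from the comment above, while the script structure, variable names, and output handling are assumptions.

```sh
#!/bin/sh
# Sketch of the workaround described above: count journald suppression notices
# for the last hour and, if any occurred, summarize which containers/processes
# wrote the most journal entries in that window so application owners can be
# notified.

suppressed=$(journalctl --since "1 hour ago" --unit systemd-journald.service \
                | grep -ci "suppressed")

if [ "$suppressed" -gt 0 ]; then
    # Group the last hour of journal entries by CONTAINER_NAME (falling back to
    # _COMM for non-container processes) and sort by entry count, ascending.
    journalctl -o json --since "1 hour ago" | jq -s '
        [ .[] | { name: (if .CONTAINER_NAME then .CONTAINER_NAME else ._COMM end),
                  cursor: .__CURSOR } ]
        | group_by(.name)
        | map({ name: .[0].name, length: ([.[].cursor] | length) })
        | sort_by(.length)'
fi
```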