Bug 1450626 - [RFE] Enhance journald to allow rate-limits to be applied per unit instead of just per service
Summary: [RFE] Enhance journald to allow rate-limits to be applied per unit instead of just per service
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: systemd
Version: 7.5-Alt
Hardware: All
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: systemd-maint
QA Contact: qe-baseos-daemons
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-05-14 04:43 UTC by Peter Portante
Modified: 2021-01-15 07:35 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1719577
Environment:
Last Closed: 2021-01-15 07:35:56 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System: GitHub systemd/systemd issue 10230 (RFE: Per-unit journal settings)
Status: closed
Last Updated: 2020-07-12 05:28:38 UTC

Description Peter Portante 2017-05-14 04:43:24 UTC
See https://bugzilla.redhat.com/show_bug.cgi?id=1445797

For kubernetes, and potentially other sub-systems in a similar situation, having rate limits applied per-service does not work when most of the logging traffic sent to journald comes through one service.

Instead, if we had the ability to apply rate limiting on a per unit basis, we'd be able to effectively prevent one unit from starving out logs from all other units.
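
To make the request concrete: journald's existing rate-limit knobs are global defaults in /etc/systemd/journald.conf (named RateLimitInterval=/RateLimitBurst= on the systemd 219 shipped in RHEL 7) and key the limit on the sending service, roughly:

[Journal]
RateLimitIntervalSec=30s
RateLimitBurst=10000

What is being asked for here is a per-unit override. Upstream systemd later added settings along these lines (LogRateLimitIntervalSec=/LogRateLimitBurst=, systemd v240+, per the GitHub issue linked above); a sketch of how that would look as a drop-in for a hypothetical chatty unit (not available on the RHEL 7 systemd this bug is filed against):

# /etc/systemd/system/docker.service.d/log-ratelimit.conf (hypothetical unit and values)
[Service]
LogRateLimitIntervalSec=30s
LogRateLimitBurst=50000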

Comment 2 Michal Sekletar 2017-05-15 08:39:34 UTC
(In reply to Peter Portante from comment #0)
> See https://bugzilla.redhat.com/show_bug.cgi?id=1445797
> 
> For kubernetes, and potentially other sub-systems in a similar situation,
> having rate limits applied per-service does not work when most of the
> logging traffic sent to journald comes through one service.

Kubernetes and others should then run containers as separate units. Kubernetes can create a scope unit for each container. IIRC docker used to do this by default, hence I am not sure how you have this problem in the first place. Anyway, container processes then live in a separate cgroup, and a malicious container can't mess up logging for other containers.

Also, as Linux containers are built on top of namespaces and cgroups, having all containers run in the same service (=~ cgroup) means you can't really manage system resources on a per-container basis.
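
A minimal sketch of the scope-unit approach, assuming a hypothetical launcher script (a container engine using the systemd cgroup driver effectively does the same thing itself):

# Put one container's process tree in its own transient scope unit, so that
# journald's per-unit rate limiting and cgroup accounting see it separately.
systemd-run --scope --unit=container-webapp.scope /usr/local/bin/start-container.sh webapp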

Btw, is there any other kernel based process group mechanism other than cgroups that we could leverage here?

Comment 6 Aaron 2019-02-11 17:51:41 UTC
(In reply to Michal Sekletar from comment #2)
> (In reply to Peter Portante from comment #0)
> > See https://bugzilla.redhat.com/show_bug.cgi?id=1445797
> > 
> > For kubernetes, and potentially other sub-systems in a similar situation,
> > having rate limits applied per-service does not work when most of the
> > logging traffic sent to journald comes through one service.
> 
> Kubernetes and others should then run containers as separate units.
> Kubernetes can create a scope unit for each container. IIRC docker used to
> do this by default, hence I am not sure how you have this problem in the
> first place. Anyway, container processes then live in a separate cgroup,
> and a malicious container can't mess up logging for other containers.
> 
> Also, as Linux containers are built on top of namespaces and cgroups, having
> all containers run in the same service (=~ cgroup) means you can't really
> manage system resources on a per-container basis.
> 
> Btw, is there any other kernel based process group mechanism other than
> cgroups that we could leverage here?

They don't all run in the same namespace/cgroup, but are in fact isolated. There are a few "best practices" recommended by K8s for collecting logs from containers:
-Use a node-level logging agent that runs on every node.
-Include a dedicated sidecar container for logging in an application pod.
--The sidecar container streams application logs to its own stdout.
--The sidecar container runs a logging agent, which is configured to pick up logs from an application container.
-Push logs directly to a backend from within an application.
https://kubernetes.io/docs/concepts/cluster-administration/logging/

The particular scenario that I believe Peter (as well as my team) is having trouble with is one where the journald log driver is used to send containers' stdout/stderr to journald. For example, this could occur when using a node-level logging agent that reads events from journald and forwards them off to, say, ELK. The issue is that containers log to stdout/stderr, which is picked up by docker's journald log driver and written by docker.service to the journal for all containers on the node/host.

When this happens, the journald log driver appends some journald metadata fields to the event, which are listed in the links below. One solution could be to have journald rate-limit by unit and, where it exists, by CONTAINER_ID_FULL, instead of just by service/unit.

Documentation on journald log driver - https://docs.docker.com/config/containers/logging/journald/
General documentation on docker log drivers - https://docs.docker.com/config/containers/logging/configure/
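
For illustration (container name hypothetical): records written through the journald log driver carry those metadata fields, which journalctl can filter on and which the suggestion above would use as an extra rate-limit key:

# Run a container whose stdout/stderr goes to the journal via docker's journald log driver.
docker run -d --name=webapp --log-driver=journald nginx

# Read back only that container's records; the CONTAINER_* fields are visible in the JSON output.
journalctl CONTAINER_NAME=webapp -o json-pretty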

To work around this issue and report to the application owners when their app is consuming too much of the rate limit enforced on docker.service, we detect when a journald "Suppressed" event occurs via cron'ing something like this:
nice -n 18 sh -c 'journalctl --since "1 hour ago" --unit systemd-journald.service | grep -i "Suppressed" | wc -l'
...and if the returned count is >0 we then look deeper:
journalctl -o json-pretty --since "1 hour ago" | jq -s '[.[] | { name: (if .CONTAINER_NAME then .CONTAINER_NAME else ._COMM end), cursor:.__CURSOR }] | group_by(.name) | map({name: .[0].name, length: [.[].cursor] | length}) | sort_by(.length)'

It's a bit messy, but under normal circumstances the tenants/containers do not exceed the rate limit journald applies to docker.service. So when it does happen, we want to notify the application owners that they should probably take a look, if they aren't already doing so due to some alert they've received.
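
Roughly, the two steps above combine into a single cron-able sketch like this (the threshold and the notification command are placeholders):

#!/bin/sh
# Alert application owners when journald reports suppressed messages in the last hour.
suppressed=$(journalctl --since "1 hour ago" --unit systemd-journald.service | grep -ci "suppressed")
if [ "$suppressed" -gt 0 ]; then
    journalctl -o json --since "1 hour ago" | jq -s '
        [ .[] | { name: (if .CONTAINER_NAME then .CONTAINER_NAME else ._COMM end) } ]
        | group_by(.name)
        | map({ name: .[0].name, count: length })
        | sort_by(.count)' \
    | mail -s "journald rate limit hit on $(hostname)" app-owners@example.com
fi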

What I think Peter, and certainly I, am requesting is that journald somehow rate-limit on a per-container basis instead of across all of docker.service, which impacts other applications'/containers' ability to log to journald. The above comment about checking whether CONTAINER_ID_FULL exists was merely a suggestion; I'm sure there are other ways this could be done.

Comment 8 RHEL Program Management 2021-01-15 07:35:56 UTC
After evaluating this issue, we have no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

