1306341 – spinning rt tasks: hung of jbd2 kworkers

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	kernel-rt
Sub Component:
Version:	7.1
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	rc
Target Release:	7.3
Assignee:	Clark Williams
QA Contact:	Jiri Kastner
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1442258

Reported:	2016-02-10 15:40 UTC by Daniel Bristot de Oliveira
Modified:	2017-11-29 16:55 UTC (History)
CC List:	2 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-11-29 16:55:27 UTC
Target Upstream Version:
Embargoed:

Description Daniel Bristot de Oliveira 2016-02-10 15:40:53 UTC

Description of problem:

Even when a CPU is isolated, it is not possible to keep some kworker jobs, such as jbd2, off that CPU. Hence, if a spinning RT thread runs for a long time, it can cause I/O stalls and hung task messages.

One possible workaround is to increase the kworkers' priority, but as kworkers
are created on demand under SCHED_OTHER, there is no clean workaround. For
example, one may need to run a periodic script to check the kworkers'
priority.
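
A minimal sketch of such a periodic check, not taken from this report. It assumes Python 3 on Linux, root privileges, and that per-CPU kworkers appear in /proc with names of the form "kworker/<cpu>:<id>". The names kworker_cpu and boost_kworkers, the CPU set, and the priority value are hypothetical and chosen only for illustration. The boost helps only if the chosen priority is higher than that of the spinning RT task.

```python
import os
import re

# Per-CPU kworkers are named "kworker/<cpu>:<id>"; unbound ones use "kworker/u<pool>:<id>".
KWORKER_RE = re.compile(r"^kworker/(\d+):")


def kworker_cpu(comm):
    """Return the CPU a per-CPU kworker is bound to, or None for anything else."""
    match = KWORKER_RE.match(comm.strip())
    return int(match.group(1)) if match else None


def boost_kworkers(cpus, priority=1, proc_root="/proc"):
    """Move SCHED_OTHER kworkers bound to `cpus` to SCHED_FIFO at `priority`.

    Returns the list of PIDs that were changed. Intended to be run
    periodically (for example from cron), since kworkers are created on demand.
    """
    boosted = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                comm = f.read()
        except OSError:
            continue  # the thread exited while we were scanning
        if kworker_cpu(comm) not in cpus:
            continue
        pid = int(entry)
        try:
            if os.sched_getscheduler(pid) == os.SCHED_OTHER:
                os.sched_setscheduler(pid, os.SCHED_FIFO, os.sched_param(priority))
                boosted.append(pid)
        except OSError:
            continue  # thread vanished or insufficient privileges
    return boosted
```

For example, boost_kworkers({2, 3}, priority=2) would scan /proc once and raise any SCHED_OTHER kworkers bound to CPUs 2 and 3. Because new kworkers can appear at any time, a single run is not sufficient, which is why this remains an unclean workaround.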

For users running spinning RT tasks, a perfect fix would be the ability to keep
kworkers like jbd2 off isolated CPUs.

I will post a crash dump analysis of a report from a customer.

Version-Release number of selected component (if applicable):
last seen on kernel 3.10.0-229.rt56.147.el6rt.x86_64

But I have already seen it on many older kernels, and I have never seen a solution for it, even upstream.

How reproducible:
Not easily reproducible; for now, only with the customer's workload.

Steps to Reproduce:
1.
2.
3.

Actual results:
jbd2 hung tasks.

Expected results:
no hung tasks.

Additional info:
I am working on a vmcore RCA for this problem, reported by a customer.

Comment 2 Luiz Capitulino 2017-05-30 18:59:30 UTC

We're debugging a KVM-RT issue that looks similar:

Bug 1448770 - several tasks blocked for more than 600 seconds
(see stack trace in bug 1448770 comment 25)

However, we haven't been able to get a working vmcore yet, and I haven't been able to reproduce it myself.

Do you have a reproducer?

Comment 3 Daniel Bristot de Oliveira 2017-05-31 07:58:01 UTC

Unfortunately, we do not have a reproducer. Should we talk to storage/fs people?

Comment 4 Luiz Capitulino 2017-05-31 13:41:48 UTC

If they can help with getting a reproducer, yes. But I think it's possible that bug 1448770 is the same issue, and we have a reproducer for that one.

I also suspect that this issue is caused by workqueue numa scheduling, but I don't have enough data to confirm this yet (which would be very good news, since workqueue numa scheduling can be easily disabled).

Comment 5 Luiz Capitulino 2017-05-31 13:45:39 UTC

Never mind the workqueue numa scheduling hypothesis, at least for bug 1448770. The issue can be reproduced even when workqueue numa scheduling is disabled.

Comment 6 Clark Williams 2017-11-29 16:55:27 UTC

This bug has not been seen in months and can be worked around with the RT_RUNTIME_GREED feature. An actual fix to avoid starving kworker/ksoftirqd threads will require upstream RT architecture changes. Closing WONTFIX.
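
A minimal sketch of checking and enabling that feature, not taken from this report. It assumes that debugfs is mounted at /sys/kernel/debug, that the running kernel-rt exposes RT_RUNTIME_GREED in the sched_features file, and that the script runs as root. The helper names feature_enabled and enable_feature are hypothetical.

```python
# Assumed location of the scheduler feature flags (requires debugfs to be mounted).
SCHED_FEATURES = "/sys/kernel/debug/sched_features"


def feature_enabled(features_text, name):
    """Report a feature's state from the contents of sched_features.

    The file lists enabled features by name and disabled ones with a
    "NO_" prefix. Returns True, False, or None if the feature is unknown.
    """
    tokens = features_text.split()
    if name in tokens:
        return True
    if "NO_" + name in tokens:
        return False
    return None


def enable_feature(name, path=SCHED_FEATURES):
    """Enable a scheduler feature if the kernel knows it; return its prior state."""
    with open(path) as f:
        state = feature_enabled(f.read(), name)
    if state is None:
        raise RuntimeError("kernel does not expose sched feature " + name)
    if not state:
        with open(path, "w") as f:
            f.write(name)  # writing the bare name turns the feature on
    return state
```

Calling enable_feature("RT_RUNTIME_GREED") would leave the feature on and return whether it was already enabled. The setting does not persist across reboots, so it would need to be reapplied at boot.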
