Bug 2009585 - Systems (Fedora openQA worker hosts) on kernel 6.0+ wind up in a state where forking does not work correctly, breaking most things (even with mq-deadline scheduler)
Summary: Systems (Fedora openQA worker hosts) on kernel 6.0+ wind up in a state where ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 37
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-10-01 01:09 UTC by Adam Williamson
Modified: 2023-12-06 00:51 UTC
CC List: 25 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-12-05 21:02:09 UTC
Type: Bug
Embargoed:


Attachments
journal from an affected boot (387.15 KB, application/gzip) - 2021-10-01 01:22 UTC, Adam Williamson
dmesg output showing blocked tasks and held locks (31.69 KB, text/plain) - 2021-11-09 19:30 UTC, Adam Williamson
dmesg output showing blocked tasks and held locks (5.15.0-0.rc7 version) (82.38 KB, text/plain) - 2021-11-09 20:27 UTC, Adam Williamson
5.12.0 sysrq+w sysrq+t (6.25 MB, text/plain) - 2022-08-12 15:53 UTC, Chris Murphy

Description Adam Williamson 2021-10-01 01:09:00 UTC
This is a bit of a tricky issue to pin down, but here's my best info so far.

Recently, Fedora openQA worker host systems - these are fairly powerful bare metal systems that run dozens of VMs simultaneously, on which tests run - have been getting into stuck states. It has happened multiple times to two x86_64 worker hosts, and I think once to a Power9 host.

As best as I can tell, when the system gets "stuck", the problem is that processes can't fork successfully. Any time anything tries to fork, the child just winds up in a kind of zombie state where it doesn't do anything and can't be killed with either SIGTERM or SIGKILL. This obviously has all sorts of consequences. You can't ssh into the system any more. You can't run 'htop' or 'man'. The openQA worker processes get stuck and can't run jobs any more. And sssd keeps trying to retrieve a TGT from the KDC running on the FreeIPA server, which involves forking off a child process; the child goes zombie, sssd notices it couldn't kill it, so it forks off another child process and tries again, every ten seconds forever.

We (Kevin Fenzi has been helping out with trying to debug this, thanks Kevin) noticed the affected systems were on kernel 5.13, but another host which was not affected was on kernel 5.11. So today I downgraded two hosts that were repeatedly running into this issue - usually within a couple of hours of boot - to 5.11. They have both now survived over 4 hours (more like 8 in one case) without getting into the stuck state, so I'm inclined to think this is a kernel issue introduced somewhere between 5.11.21 and 5.13.15, approximately.

Significant attributes of these systems are, I guess, that they run these 'worker' processes I mentioned - 30 concurrently on each - which run automated tests in virtual machines (qemu VMs) that are, IIRC, forked off the worker processes. So there's quite a lot of that going on. They also have quite a lot of RAM (free -h reports 187Gi total).

I did notice this patch series from kernel 5.13 timeframe and wonder if it might be related:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5fc7a5f6fd04bc18f309d9f979b32ef7d1d0a997

so I'm CCing Peter to see if he has any thoughts. If anyone has ideas on how to debug this further, I'd be happy to try them. Note strace is one of the things that doesn't work once the system reaches the stuck state.

Comment 1 Adam Williamson 2021-10-01 01:22:15 UTC
Created attachment 1827839 [details]
journal from an affected boot

Here's the journal from an affected boot. It's stuffed with openQA messages you might want to filter out. I can't see anything much relevant in there, but maybe someone else will be able to. I can't pinpoint the time at which the issue started happening, unfortunately. It's not easy to spot exactly, though I guess I could set up a loop that just tries to run 'man something' every second or whatever.
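
A minimal sketch of that loop idea (hypothetical, not something we're running yet; the log path and interval are arbitrary), where the last timestamp in the log would mark roughly when fork() stopped working:

# fork-watchdog.sh - hypothetical sketch
while true; do
    # 'date' is an external command, so the shell has to fork to run it;
    # once fork breaks, the log simply stops growing
    date +'%F %T fork ok' >> /var/tmp/fork-watchdog.log
    sleep 5
done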

Comment 2 Adam Williamson 2021-10-01 15:35:14 UTC
Sorry, I should have specified that this is not *just* 5.13, it also happens on recent F34 kernels. The log I attached is actually from 5.14.7-200.fc34. The only way we can have the affected workers run without hitting this issue is to downgrade them to 5.11.

Comment 3 Peter Xu 2021-10-01 17:03:55 UTC
Hi Adam, I don't see why 5fc7a5f6fd04bc18f309d9f979b32ef7d1d0a997 would have made that difference; it doesn't touch the fork() path, it only unifies set_pte_at() for the THP zero page.  So far it's still not clear to me where fork() is failing.

Would it make sense to try to generate a coredump?  If ssh is not possible, maybe we could keep a persistent ssh session open (so nothing new has to be forked) and run "echo c > /proc/sysrq-trigger" after enabling sysrq (I thought it was disabled by default on Fedora 34)?

Comment 4 Chris Murphy 2021-10-01 18:15:34 UTC
$ cat /proc/sys/kernel/sysrq
16

And both sysrq+t and sysrq+w work, not sure about sysrq+c.
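
For reference, that sysctl is a bitmask: 16 allows only the "sync" function from the keyboard, but root writes to /proc/sysrq-trigger are always allowed regardless of the mask. A small sketch of what should work from an existing root shell:

sysctl kernel.sysrq=1          # optional: enable all sysrq functions from the keyboard too
echo w > /proc/sysrq-trigger   # dump blocked (uninterruptible) tasks
echo t > /proc/sysrq-trigger   # dump all task states
dmesg | tail -n 200            # the output lands in the kernel ring buffer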

Comment 5 Adam Williamson 2021-10-01 18:22:42 UTC
Thanks Peter! Justin asked me to try with 5.14.9 as 5.14.7 apparently had a *different* bug that would cause systems to hang soon after boot. It's possible we hit that and not the fork issue on 5.14.7, I don't recall for sure whether I confirmed the sssd processes (which are the easiest indicator of this specific bug) on 5.14.7 or if we just saw the systems got stuck and assumed it was the same bug.

I've got one of the affected systems booted to 5.14.9 now and am keeping an eye on it, I'll report back if it gets stuck or if it doesn't I'll check back in at the end of the day.

The commit was only a vague guess on my part, I'm no expert on this low-level kernel stuff :D I was just looking through 5.13 changelogs for things that looked at all possibly relevant.

Comment 6 Adam Williamson 2021-10-01 18:25:27 UTC
Oh, you can't ssh in *after* the system gets into the messed up state, but if you have a connection *already established* you can keep using it so long as you don't run anything else that triggers the problem, like htop or man (if you do it'll block the console and you cannot recover it). I do have a session established with the system this time, so if it does get stuck I can try and get a coredump somehow.

Comment 7 Adam Williamson 2021-10-01 20:55:22 UTC
So it just got stuck running 5.14.9. Unfortunately my ssh session hung when I ran ps aux, and I wanted Kevin to try logging in from the management console and check the state, but he misunderstood and power cycled the box instead :/ So we don't know if it was in the same forking-problem state, or something else. It definitely got into a broken state again, though. I'll update the next time it happens.

Comment 8 Adam Williamson 2021-10-01 22:03:58 UTC
Sigh. So it got stuck again, and this time I confirmed for sure it's in the same state as observed on kernel 5.13 - I saw the sssd ldap_child zombie processes piling up. But then I stupidly ran a 'less', which triggers the bug and locked out my session. We tried to get a coredump from the console via sideband management but couldn't. I'm now rebooting it again and we'll try one more time. Still, clearly the bug is still there with 5.14.9.

Comment 9 Justin M. Forbes 2021-10-01 22:22:21 UTC
That is good to know, though it would be nice to get some more actionable debug output from the system.  There is not much to go on with what has been posted.

Comment 10 Adam Williamson 2021-10-01 22:48:23 UTC
well, as I said, I'm open to suggestions about what you need :D we're trying to get a core dump now at least, though I don't know if that will be useful. is there anything else you can suggest?

Comment 11 Adam Williamson 2021-10-02 06:11:25 UTC
The system hit the bug another time, we did `echo c > /proc/sysrq-trigger`. The system went unresponsive, and, uh, that's it. No info appeared anywhere obvious. Was that what was supposed to happen? Did it dump a log file somewhere, or something?

We would like to help debug this, but we need instructions, we don't know what it is we have to do to get you the information you need. For now I've booted the system back to 5.11 so it can actually do some work.

Comment 12 Peter Xu 2021-10-04 18:30:38 UTC
Adam,

Did you enable kdump?  For example:

https://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes

When kdump is enabled and set up correctly, echoing 'c' to sysrq-trigger will trigger the crash, kdump will collect the core, and the dump will be placed under /var/crash.
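
A rough sketch of the setup steps (the wiki page above is authoritative; the 512M crashkernel reservation here is just a guess, not a tested value for a box like this):

dnf install -y kexec-tools
grubby --update-kernel=ALL --args="crashkernel=512M"   # reserve memory for the capture kernel
reboot
# after reboot:
systemctl enable --now kdump
kdumpctl status
echo c > /proc/sysrq-trigger   # test: should panic, write a vmcore under /var/crash, then reboot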

Thanks,
Peter

Comment 13 Adam Williamson 2021-10-04 18:40:33 UTC
Peter: no, because nobody said to do that. :D I will try that today.

Comment 14 Adam Williamson 2021-10-05 15:33:42 UTC
Still no dice :( I followed the instructions at https://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes . After `echo c > /proc/sysrq-trigger` the system never came back to life (as the page suggests it should), and after a reboot there is nothing in /var/crash.

Comment 15 Chris Murphy 2021-11-02 00:35:08 UTC
I'm not seeing any kernel traces or blocked-task complaints in the attached log at all. sysrq+t might be easier to capture; maybe go from there?

Comment 16 Adam Williamson 2021-11-02 01:01:38 UTC
This is probably because logging is affected by the bug. New messages stop appearing in the log right around the time the problem starts happening.

Does sysrq-t show anything that ps aux doesn't?

Comment 17 Chris Murphy 2021-11-02 01:36:19 UTC
This looks a bit suspiciously slow. Is it? 
>Sep 30 01:32:57 openqa-x86-worker04.iad2.fedoraproject.org worker[151049]: [info] disk_64bit_cockpit.qcow2: Processing chunk 3075/3381, avg. speed ~976.562 KiB/s

When the problem happens, what does `grep -R . /proc/pressure` show?

>This is probably because logging is affected by the bug. New messages stop appearing in the log right around the time the problem starts happening.

Slow flushing of the log to persistent storage too. Also suspicious. Might be a good time to learn more about the storage stack?
https://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F

>Does sysrq-t show anything that ps aux doesn't?

Yes, quite a lot. It dumps to dmesg and typically overflows the default log buffer. Normally it'll get captured by journald like other kernel messages, but if persistent flushing is a problem then work around it by booting with the kernel parameter log_buf_len=4M; the buffer should then be big enough to grab with just dmesg. An alternative is to switch to volatile journald logging by modifying /etc/systemd/journald.conf and adding a line:

#Storage=auto
+Storage=volatile
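
Something like this sketch would cover both workarounds (the drop-in file is just an alternative to editing journald.conf directly; adjust to taste):

grubby --update-kernel=ALL --args="log_buf_len=4M"    # bigger kernel ring buffer
# or keep the journal in RAM only, so logging doesn't depend on the (possibly stuck) storage stack:
mkdir -p /etc/systemd/journald.conf.d
printf '[Journal]\nStorage=volatile\n' > /etc/systemd/journald.conf.d/volatile.conf
systemctl restart systemd-journald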

Comment 18 Adam Williamson 2021-11-02 17:03:20 UTC
"This looks a bit suspiciously slow. Is it?"

No. It's, uh, hyper-normal: that exact value has been reported 221,187 times since Saturday, and other systems with different hardware almost always report the exact same value too. So there's probably a bug in the speed calculation or something.

"Slow flushing of the log to persistent storage too. Also suspicious."

It's not "slow flushing", it just *stops*. We've had systems in the broken state for several days and there are no log messages during that period.

I'll check the other things when I have another clear day to just sit here and wait for systems to break.

Comment 19 Chris Murphy 2021-11-03 16:04:32 UTC
Oh I see. So it's possible there's an oops or hard lockup or something that isn't recorded in the persistent journal. Is there nothing newer or interesting in the virtual console when you issue the dmesg command?

Let's try this:

# nano /etc/sysctl.d/99-custom.conf

kernel.hardlockup_panic = 1
kernel.hung_task_panic = 1
kernel.panic = 1
kernel.panic_on_oops = 1
kernel.panic_on_rcu_stall = 1
kernel.panic_on_warn = 1
kernel.softlockup_panic = 1

If any of those things happen, we'll get a panic. Also do this:

Complete steps 1, 2, 5 and 6.
https://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes

Only a panic will trigger kdump to capture a vmcore. A manual sysrq+c almost certainly comes too late; sysrq+c is really just for testing the kdump setup.

Comment 20 Chris Murphy 2021-11-03 16:31:11 UTC
If the virtual console or netconsole is working, this is easier to set up than kdump:

# nano /etc/sysctl.d/99-custom.conf

kernel.hung_task_timeout_secs = 2

That will cause blocked-task messages and traces to be dumped to dmesg whenever a task has been blocked for 2 seconds or more. The default is 120.
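
To apply without a reboot, something like this (note the hung_task knobs only exist on kernels built with hung-task detection):

sysctl --system                        # reload every /etc/sysctl.d/ fragment
sysctl kernel.hung_task_timeout_secs   # confirm the value actually took effect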

Comment 21 Adam Williamson 2021-11-08 17:23:21 UTC
I don't actually remember whether we checked dmesg at any point. Next time it happens and I have a console, I will.

I'm not clear on which bits of comment #19 and which bits of comment #20 would make sense together. kernel.hung_task_timeout_secs = 2 and kernel.hung_task_panic = 1 sound like they would conflict with each other?

Comment 22 Adam Williamson 2021-11-08 17:27:36 UTC
Trying to set kernel.hung_task_timeout_secs gives "sysctl: cannot stat /proc/sys/kernel/hung_task_timeout_secs: No such file or directory". From a bit of digging, it looks like several of those settings are tied to kernel config options we do not set, at least not for non-debug kernels. CONFIG_DETECT_HUNG_TASK is not set for any kernel I have installed.
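
A quick way to check this for any installed kernel (Fedora ships the build config next to each kernel image):

grep CONFIG_DETECT_HUNG_TASK /boot/config-$(uname -r)
# kernel-debug builds set this to =y; the standard builds I have do not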

Comment 23 Adam Williamson 2021-11-08 17:29:32 UTC
This happened again after upgrading to Fedora 35 and kernel 5.14.6-300.fc35, btw. I've now downgraded the box to 5.12.12, to give us some very broad triage (depending on whether the bug happens again, we can narrow it down to "between 5.11.21 and 5.12.12" or "between 5.12.12 and 5.13.15").

Comment 24 Adam Williamson 2021-11-08 19:42:44 UTC
Bug happened again on 5.12.12. dmesg doesn't show anything obviously illuminating, last message is:

[ 2764.251114] perf: interrupt took too long (2525 > 2500), lowering kernel.perf_event_max_sample_rate to 79000

previous message was 800 seconds before that.

Comment 25 Chris Murphy 2021-11-09 02:07:31 UTC
>I'm not clear on which bits of comment #19 and which bits of comment #20 would make sense?
I meant it as either #19 or #20. If you can run a debug kernel, only change kernel.hung_task_timeout_secs = 2

Comment 26 Adam Williamson 2021-11-09 19:29:29 UTC
Aha, so yes, booting with a debug kernel gets us somewhere. 2 actually seems to be the default value for that setting, and after the system has run for a bit we do get some stuff in dmesg. I'll attach it.

Comment 27 Adam Williamson 2021-11-09 19:30:43 UTC
Created attachment 1840939 [details]
dmesg output showing blocked tasks and held locks

This is from one of the affected openQA worker boxes, running kernel 5.16.0-0.rc0.20211104git7ddb58cb0eca.3.fc36.x86_64 (I figured I'd just go with a very recent debug kernel).

Comment 28 Adam Williamson 2021-11-09 20:26:33 UTC
Chris suggested CCing Eric, so hi, Eric! We've got boxes in Fedora infra that get into a kind of stuck state after running for a bit, started happening somewhere around 5.12.

Chris also suggested that using a 5.16.0 kernel might be muddying the waters a bit (as it's known to have various issues right now) and that the 2 second timeout might be too aggressive, so I tried again with the timeout at 10 seconds and kernel 5.15.0-0.rc7.20211027gitd25f27432f80.55.fc36.x86_64. We do still see stuck tasks in that setup. Attaching the dmesg output.

Comment 29 Adam Williamson 2021-11-09 20:27:24 UTC
Created attachment 1840941 [details]
dmesg output showing blocked tasks and held locks (5.15.0-0.rc7 version)

Comment 30 Chris Murphy 2021-11-09 21:47:10 UTC
This is the thread I thought of when I saw these hung tasks.
https://lore.kernel.org/linux-xfs/20210614234145.GU664593@dread.disaster.area/

But in the log in comment #27 there are fsync and fdatasync functions listed, so now I wonder whether what we're seeing is just the normal flushing of a busy system under load, and we're still not getting any hints about the actual hung task or whatever.

Comment 31 Adam Williamson 2021-11-09 22:07:15 UTC
After a while, this also showed up in dmesg:

[ 4053.313341] ------------[ cut here ]------------
[ 4053.313382] DMA-API: megaraid_sas 0000:18:00.0: cacheline tracking EEXIST, overlapping mappings aren't supported
[ 4053.313401] WARNING: CPU: 20 PID: 1120 at kernel/dma/debug.c:570 add_dma_entry+0x1c8/0x250
[ 4053.313421] Modules linked in: xt_CHECKSUM ip6table_mangle ip6table_nat iptable_mangle bridge stp llc rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache netfs binfmt_misc tun nfnetlink openvswitch nsh nf_conncount rfkill ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables xt_MASQUERADE iptable_nat nf_nat ipt_REJECT nf_reject_ipv4 xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter intel_rapl_msr intel_rapl_common isst_if_common skx_edac vfat fat nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt intel_pmc_bxt iTCO_vendor_support kvm_intel kvm dell_smbios irqbypass dcdbas dell_wmi_descriptor rapl intel_cstate wmi_bmof intel_uncore i2c_i801 mei_me lpc_ich joydev i2c_smbus mei intel_pch_thermal ipmi_ssif acpi_power_meter auth_rpcgss fuse sunrpc zram ip_tables xfs dm_crypt trusted asn1_encoder raid1 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel i40e
[ 4053.314320]  igb megaraid_sas mgag200 dca wmi ipmi_si ipmi_devintf ipmi_msghandler
[ 4053.314342] CPU: 20 PID: 1120 Comm: kworker/20:1H Kdump: loaded Tainted: G          I      --------- ---  5.15.0-0.rc7.20211027gitd25f27432f80.55.fc36.x86_64 #1
[ 4053.314349] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 2.12.2 07/09/2021
[ 4053.314353] Workqueue: kblockd blk_mq_run_work_fn
[ 4053.314364] RIP: 0010:add_dma_entry+0x1c8/0x250
[ 4053.314372] Code: ff 0f 84 97 00 00 00 4c 8b 67 50 4d 85 e4 75 03 4c 8b 27 e8 da 07 7d 00 48 89 c6 4c 89 e2 48 c7 c7 a8 4b 82 9a e8 76 89 c1 00 <0f> 0b 48 85 ed 0f 85 a4 35 c2 00 8b 05 c7 74 2f 02 85 c0 0f 85 f3
[ 4053.314377] RSP: 0018:ffffa7d74df67ac8 EFLAGS: 00010292
[ 4053.314383] RAX: 0000000000000064 RBX: 00000000ffffffff RCX: 0000000000000027
[ 4053.314388] RDX: ffff973a9e7daf68 RSI: 0000000000000001 RDI: ffff973a9e7daf60
[ 4053.314392] RBP: ffff973b0264d880 R08: 0000000000000000 R09: ffffa7d74df67908
[ 4053.314396] R10: ffffa7d74df67900 R11: ffff9752fff370e8 R12: ffff9723d1bc4ba0
[ 4053.314399] R13: 0000000000000001 R14: 0000000000000202 R15: 0000000065076540
[ 4053.314403] FS:  0000000000000000(0000) GS:ffff973a9e600000(0000) knlGS:0000000000000000
[ 4053.314407] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4053.314411] CR2: 00005618d747e110 CR3: 00000018f1046005 CR4: 00000000007726e0
[ 4053.314415] PKRU: 55555554
[ 4053.314418] Call Trace:
[ 4053.314431]  debug_dma_map_sg+0x24e/0x380
[ 4053.314451]  __dma_map_sg_attrs+0x91/0xe0
[ 4053.314462]  dma_map_sg_attrs+0xa/0x20
[ 4053.314468]  scsi_dma_map+0x35/0x40
[ 4053.314474]  megasas_build_and_issue_cmd_fusion+0x1e9/0x15c0 [megaraid_sas]
[ 4053.314495]  ? mark_held_locks+0x50/0x80
[ 4053.314521]  scsi_queue_rq+0x3c9/0xce0
[ 4053.314538]  blk_mq_dispatch_rq_list+0x13a/0x800
[ 4053.314576]  ? bfq_dispatch_request+0xd2/0x2520
[ 4053.314591]  ? sbitmap_get+0x86/0x190
[ 4053.314625]  __blk_mq_do_dispatch_sched+0x127/0x2e0
[ 4053.314647]  __blk_mq_sched_dispatch_requests+0xd6/0x130
[ 4053.314659]  blk_mq_sched_dispatch_requests+0x30/0x60
[ 4053.314666]  __blk_mq_run_hw_queue+0x39/0x70
[ 4053.314673]  process_one_work+0x2b5/0x5d0
[ 4053.314696]  worker_thread+0x55/0x3c0
[ 4053.314703]  ? process_one_work+0x5d0/0x5d0
[ 4053.314714]  kthread+0x149/0x170
[ 4053.314722]  ? set_kthread_struct+0x40/0x40
[ 4053.314732]  ret_from_fork+0x22/0x30
[ 4053.314763] irq event stamp: 2560191
[ 4053.314766] hardirqs last  enabled at (2560197): [<ffffffff991758f0>] __up_console_sem+0x60/0x70
[ 4053.314775] hardirqs last disabled at (2560202): [<ffffffff991758d5>] __up_console_sem+0x45/0x70
[ 4053.314782] softirqs last  enabled at (2560118): [<ffffffff990ef738>] __irq_exit_rcu+0xd8/0x110
[ 4053.314790] softirqs last disabled at (2560113): [<ffffffff990ef738>] __irq_exit_rcu+0xd8/0x110
[ 4053.314795] ---[ end trace f155bbe1c3163138 ]---
[ 4053.314799] DMA-API: Mapped at:
[ 4053.314802]  debug_dma_map_sg+0xcf/0x380
[ 4053.314808]  __dma_map_sg_attrs+0x91/0xe0
[ 4053.314813]  dma_map_sg_attrs+0xa/0x20
[ 4053.314818]  scsi_dma_map+0x35/0x40
[ 4053.314821]  megasas_build_and_issue_cmd_fusion+0x1e9/0x15c0 [megaraid_sas]

Comment 32 Adam Williamson 2021-11-11 01:35:09 UTC
OK, so the system did not reach the stuck state at the time of any of those messages I posted yesterday. It has now reached the stuck state, but no more messages in dmesg seem to be clearly associated with this. Since the last set of messages, only these have shown up:

[ 4277.880396] IPv6: ADDRCONF(NETDEV_CHANGE): tap22: link becomes ready
[ 4794.600717] qemu-system-x86 (123653) used greatest stack depth: 11088 bytes left
[ 4978.867522] IPv6: ADDRCONF(NETDEV_CHANGE): tap16: link becomes ready
[ 5578.245738] IPv6: ADDRCONF(NETDEV_CHANGE): tap0: link becomes ready
[ 7376.795616] IPv6: ADDRCONF(NETDEV_CHANGE): tap23: link becomes ready
[ 7394.968858] IPv6: ADDRCONF(NETDEV_CHANGE): tap24: link becomes ready
[ 7414.403118] IPv6: ADDRCONF(NETDEV_CHANGE): tap9: link becomes ready
[ 7467.510064] qemu-system-x86 (163334) used greatest stack depth: 11056 bytes left
[ 9397.121346] IPv6: ADDRCONF(NETDEV_CHANGE): tap2: link becomes ready
[10900.639014] IPv6: ADDRCONF(NETDEV_CHANGE): tap8: link becomes ready
[24705.899434] perf: interrupt took too long (4904 > 4893), lowering kernel.perf_event_max_sample_rate to 40000
[35602.551465] show_signal_msg: 12 callbacks suppressed
[35602.551474] sudo[764270]: segfault at 0 ip 00007f3290219a65 sp 00007fff6e239ed0 error 4 in sudoers.so[7f32901ee000+54000]
[35602.551767] Code: 89 f1 ba 10 03 00 00 48 89 ee 4c 89 ef c6 05 42 08 05 00 01 e8 cc 63 fd ff e9 63 fe ff ff 0f 1f 80 00 00 00 00 48 8b 7c 24 40 <8b> 07 85 c0 74 c9 48 8d 73 38 48 89 5c 24 10 48 89 74 24 18 44 89
[65179.537425] /usr/bin/isotov[1354435]: segfault at 594 ip 00007f2607c05290 sp 00007f2313ffe130 error 4 in libperl.so.5.34.0[7f2607b64000+18a000]
[65179.537452] Code: 8d 55 f9 83 e2 fb 74 05 83 fd 04 75 1d 48 8b 80 50 08 00 00 48 3b 05 37 d8 29 00 74 3d 89 ef 5d ff e0 0f 1f 84 00 00 00 00 00 <f6> 80 94 05 00 00 01 75 da 48 8b 90 28 06 00 00 48 85 d2 74 18 83
[80913.090798] kworker/dying (1540731) used greatest stack depth: 10992 bytes left
[84226.486875] kworker/dying (1696234) used greatest stack depth: 10944 bytes left

sudo segfaulted when I ctrl-c'ed it, I think. I'm not sure what caused the isotovideo segfault, whether it might be related to the bug, or what those kworker/dying messages mean.

Comment 33 Adam Williamson 2021-12-06 19:53:41 UTC
Can anyone suggest anything to help with this? It is really a big problem at this point. I'm running multiple systems in production on 5.11 kernels because it's all I know to do to avoid this problem. That's obviously not ideal, but I just don't have the ability to do anything else about it. I'm happy to do whatever else to provide debugging info, I've tried to do everything asked so far.

Comment 37 Eric Sandeen 2021-12-07 20:13:37 UTC
Ugh, I apologize for missing the cc: on this back in November. Let me read back and see if I can still be helpful ...

Comment 38 Eric Sandeen 2021-12-07 20:27:21 UTC
Are there no full dmesgs available? We do see XFS threads waiting in log code, but we don't know why at this point. I'd like to know, for example, whether there were any messages from the storage / block layer beforehand. Still no luck getting a crashdump?

Comment 39 Adam Williamson 2021-12-07 22:06:09 UTC
Thanks Eric!

I don't think I kept the full dmesg outputs, no, sorry. I can get them again, though, that shouldn't be too much trouble. No, we never did manage to get a crash dump to work. See around comments 11-15 for when we were attempting that. A hurdle is that when the system is in the stuck state, logging to the journal appears to break.

If you have ideas of how we can actually get the necessary data out, please let me know and I will try that. We can have access to a root console when the system is in the broken state (by logging in before the broken state is reached), and we can run anything that doesn't get stuck: e.g. ps is okay, htop is not; I think cat is okay, less is not. We can echo to /proc/sysrq-trigger, but it has to trigger something that will actually work; it didn't successfully dump anything to /var/crash when we tried that.
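
So, as a concrete sketch of what we *can* do from an already-open root shell once it's stuck (using only shell builtins plus commands that have survived so far, like ps, cat and dmesg):

echo w > /proc/sysrq-trigger      # dump blocked (uninterruptible) tasks to the ring buffer
echo t > /proc/sysrq-trigger      # dump all task states (large)
dmesg > /var/tmp/sysrq-dump.txt   # dmesg and cat still work; less does not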

Thanks again!

Comment 40 Eric Sandeen 2021-12-07 22:51:57 UTC
Honestly I'd focus on getting crashdump to work, unless you think the failure is related to whatever is going on.

Comment 41 Chris Murphy 2021-12-08 19:29:16 UTC
Oops I missed that Eric commented here before I posted to the XFS list
https://lore.kernel.org/linux-xfs/CAJCQCtQdXZEXC+4iDgG9h5ETmytfaU1+mzAQ+sA9TfQ1qo3Y_w@mail.gmail.com/T/#u

I sorta wonder about the comment 31 trace involving megaraid_sas and the ensuing block-layer calls, and whether that could be an early instigator of the problem.

As for crash dump, I'm not sure which if any of the comment 19 sysctl changes should be made to get the kernel to panic so a crash dump can get created. Since sysrq+c, before any hang happens, doesn't produce a core file, I think there's a misconfiguration. I discovered when revising https://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes#How_to_Use_Kdump that `crashkernel=auto` is not supported in RHEL or Fedora, despite so much documentation mentioning it. So that's a likely source of failure. What does `kdumpctl status` report? And if /var/crash is on XFS on this megaraid_sas, is it possible the hang prevents the collection of a core file? How to improve the chances of collecting one?

Comment 42 Eric Sandeen 2021-12-08 19:37:43 UTC
Failing to dump to a problematic filesystem or device is a legit concern. Dumps can also be sent over the network, via nfs, ssh, etc. 

Not sure if you have access to
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/kernel_administration_guide/kernel_crash_dump_guide#sect-kdump-config-cli-target-type

but the Red Hatters should, and it describes these options for RHEL 7 anyway; presumably they persist....
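
As a rough sketch (the host name and key path are placeholders; for ssh targets the core has to be sent in makedumpfile's flattened format):

cat >> /etc/kdump.conf <<'EOF'
ssh kdump@dumphost.example.org
sshkey /root/.ssh/kdump_id_rsa
core_collector makedumpfile -F -l --message-level 1 -d 31
EOF
kdumpctl propagate        # copy the ssh key to the dump target
systemctl restart kdump   # rebuild the kdump initramfs with the new config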

Comment 43 Adam Williamson 2022-01-07 07:24:11 UTC
So I think I finally got a dump, and it's, uh, 5.2GB?! Is that normal?

Comment 44 Adam Williamson 2022-01-07 18:50:45 UTC
OK, the crash dump can be grabbed at https://openqa.stg.fedoraproject.org/crashdump/vmcore . There's also https://openqa.stg.fedoraproject.org/crashdump/kexec-dmesg.log and https://openqa.stg.fedoraproject.org/crashdump/vmcore-dmesg.txt . These were generated by booting kernel 5.16.0-0.rc8.20220106git75acfdb6fd92.56.fc36.x86_64 with the kdump stuff configured and kernel.hung_task_timeout_secs set to 20, waiting for the system to get to the stuck state, and doing `echo c > /proc/sysrq-trigger`. I hope it's useful.

Comment 45 Chris Murphy 2022-01-09 20:04:42 UTC
The vmlinux file is in  kernel-debuginfo-5.16.0-0.rc8.20220106git75acfdb6fd92.56.fc36.x86_64.rpm found here
https://koji.fedoraproject.org/koji/buildinfo?buildID=1874703

I'm not sure why, but I needed crash-8.0.0; Fedora 35 has 7.3.0, but the 8.0 fc36 rpm installs on F35 OK:
https://koji.fedoraproject.org/koji/buildinfo?buildID=1865596

Maybe someone else will have more luck, but I can't determine the ultimate source of the hang in the storage stack. By the time we get to
>[15420.385541] DMA-API: megaraid_sas 0000:18:00.0: cacheline tracking EEXIST, overlapping mappings aren't supported
there have been 120+ mentions of locks held by qemu, and it just looks like a lot of confusion.
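
For anyone else who wants to poke at it, the rough incantation (the vmlinux path assumes the matching kernel-debuginfo package is installed):

crash /usr/lib/debug/lib/modules/5.16.0-0.rc8.20220106git75acfdb6fd92.56.fc36.x86_64/vmlinux vmcore
# useful commands inside crash:
#   log            - kernel ring buffer captured at panic time
#   ps -m          - tasks with the time since they last ran
#   foreach UN bt  - backtraces of every uninterruptible (blocked) task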

Comment 46 Chris Murphy 2022-01-09 20:17:03 UTC
Adam, can you run each of these commands for about 30s?
>    iostat -x -d -m 5
>    vmstat 5
It might be useful to run the two commands for about 30s during normal operation, and then again once the problem has happened, for comparison. If either of them hangs, use 'ps' to find the PID of the hanging command, and then `cat /proc/$PID/stack`.
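
For example, something along these lines (output paths are arbitrary; <PID> is whatever ps reports):

iostat -x -d -m 5 6 > /var/tmp/iostat-$(date +%s).txt   # six 5-second samples
vmstat 5 6 > /var/tmp/vmstat-$(date +%s).txt
# if either one hangs instead of exiting:
ps aux | grep '[i]ostat'   # find its PID
cat /proc/<PID>/stack      # kernel stack showing where it is blocked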

Comment 47 Adam Williamson 2022-03-17 22:27:26 UTC
Sorry, I have not had time to work on this lately. I just re-tested and the issue still happens with 5.17rc8. I'll try and get the commands Chris recommended.

Comment 48 Adam Williamson 2022-03-18 00:26:44 UTC
OK, well, I let iostat -x -d -m 5 run for a long time, from right after boot until we started hitting the problem. The numbers bounce around, more or less in line with how hard the tests are hitting the disk, as you'd expect. At peak times I saw %util for dm-0, dm-1, dm-2 and sda through sdj all go into the red, and also saw %wrqm spike into the red too. Here's a typical frame like that:

Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
dm-0            16.80      3.68     0.00   0.00   45.32   224.52  511.60     85.35     0.00   0.00 2080.56   170.83    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00 1065.17  84.92
dm-1            16.80      3.68     0.00   0.00   45.33   224.52  451.40     85.35     0.00   0.00 2146.85   193.61    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00  969.85  84.36
dm-2             0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
md0              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
md2             26.20      3.68     0.00   0.00   35.56   143.97  802.80    114.23     0.00   0.00  896.62   145.71    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00  720.74  91.60
sda            517.00      2.42     1.80   0.35    2.90     4.80  885.60     15.08  2975.20  77.06    1.48    17.43    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    2.81  52.68
sdb            530.40      2.52    10.60   1.96    2.74     4.86  867.20     15.05  2986.40  77.50    1.30    17.77    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    2.58  53.70
sdc            561.00      2.67    17.40   3.01    3.82     4.88  824.80     14.82  2969.60  78.26    1.19    18.40    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    3.12  51.84
sdd            491.40      2.31     1.40   0.28    3.89     4.82  802.20     14.76  2976.80  78.77    1.40    18.84    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    3.03  51.50
sde            507.60      2.26     0.60   0.12    3.99     4.55  763.60     14.53  2956.00  79.47    1.48    19.48    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    3.16  51.86
sdf            356.00      1.65    28.80   7.48    4.01     4.74  677.40     14.12  2937.20  81.26    1.46    21.34    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    2.42  47.96
sdg            533.00      2.59     8.00   1.48    2.92     4.98  836.40     14.86  2968.40  78.02    1.20    18.19    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    2.56  51.44
sdh            448.20      2.15     0.00   0.00    3.41     4.92  772.80     14.64  2975.80  79.38    1.27    19.40    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    2.51  50.84
sdi            399.00      1.97     1.60   0.40    3.35     5.05  731.20     14.41  2959.20  80.19    1.27    20.18    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    2.26  47.66
sdj            351.20      1.64     0.00   0.00    3.15     4.77  709.80     14.26  2941.60  80.56    1.26    20.57    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    2.00  46.14
zram0            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00

After the problem had happened, the frames for quite a while show low utilization, like this:

Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
dm-0             0.00      0.00     0.00   0.00    0.00     0.00    1.00      0.08     0.00   0.00    0.00    78.40    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.10
dm-1             0.00      0.00     0.00   0.00    0.00     0.00    1.00      0.08     0.00   0.00    0.00    78.40    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.10
dm-2             0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
md0              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
md2              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
sda              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
sdb              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
sdc              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
sdd              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
sde              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
sdf              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
sdg              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
sdh              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
sdi              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
sdj              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
zram0            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00

Here's the output of vmstat 5 after the problem had happened:

[root@openqa-x86-worker04 adamwill][PROD-IAD2]# vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1 111      0 100037712  14864 41644512    0    0    46   532   58   19 14  9 50 27  0
 4 111      0 100040304  14864 41644524    0    0     0    13 16824 27792  3  2 12 83  0
 2 111      0 100039536  14864 41645004    0    0     0    54 16793 27918  3  2 12 83  0
 5 111      0 100041680  14864 41645192    0    0     0    46 16678 27588  3  2 12 83  0
 2 111      0 100052592  14864 41645352    0    0     0    30 16808 27698  3  2 12 83  0
 3 111      0 100033872  14864 41645708    0    0     0    62 16808 27642  3  2 12 83  0
 0 111      0 100043040  14864 41645864    0    0     0    14 16961 27766  3  2 12 83  0

I didn't catch the 'before' output yet.

Comment 49 Adam Williamson 2022-03-18 00:42:44 UTC
Here's vmstat 5 output right after boot, not in the broken state:

[root@openqa-x86-worker04 adamwill][PROD-IAD2]# vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
45  0      0 180021760  13820 1496956    0    0   137    16  799  612  5  9 82  4  0
41  0      0 176351200  13820 1653084    0    0   745   319 81752 123278 21 40 39  0  0
44  0      0 172036992  13820 2048024    0    0  2557  3164 98147 131975 48 23 28  1  0
48  0      0 168369952  13820 2697768    0    0     5 30378 102427 133948 56 24 19  1  0
13  0      0 166657440  13820 2974200    0    0  4016 15391 71049 83305 31 16 52  0  0
16  1      0 166369328  13820 3051480    0    0  1638 25072 51200 48765 19 12 69  0  0
21  0      0 166222192  13820 3179000    0    0  2465 19998 60589 124160 21 12 67  0  0
24  0      0 166029552  13820 3370268    0    0  1615 26035 65264 166186 23 12 65  0  0
35  4      0 165817728  13820 3577056    0    0   832 54101 121187 440012 20 16 64  0  0
18  0      0 165572592  13820 3721528    0    0  3846 33415 130621 458309 19 17 64  0  0

note there may be some churn happening there as it picks up jobs.

Comment 50 Adam Williamson 2022-03-18 00:55:26 UTC
By comparison, here's some outputs from kernel 5.11. This is vmstat 5 just after boot:

root@openqa-x86-worker04 adamwill][PROD-IAD2]# vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
45  0      0 132460304  14864 14067816    0    0   181   465  944  734 12 15 72  1  0
35  1      0 131705136  14864 14316840    0    0 13259 121481 75971 132196  7 55 38  0  0
37  0      0 129166632  14864 14865360    0    0  7601 104631 73000 118689  7 53 40  0  0
30  0      0 128337072  14864 15365120    0    0  9729 157691 55345 68777  8 37 55  0  0
41  0      0 127064752  14864 16591356    0    0 12501 138117 83845 132187  9 48 42  1  0
54  3      0 124393864  14864 18705364    0    0 11442 252188 123997 201744 10 60 29  1  0
208  0      0 123978344  14864 19272244    0    0  9224 354289 88250 180602  8 55 36  0  0
30  0      0 123914768  14864 19636116    0    0 20658 122108 73494 141597  9 43 47  1  0

and this is iostat -x -d -m 5:

[root@openqa-x86-worker04 adamwill][PROD-IAD2]# iostat -x -d -m 5
Linux 5.11.21-300.fc34.x86_64 (openqa-x86-worker04.iad2.fedoraproject.org) 	2022-03-18 	_x86_64_	(64 CPU)

Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
dm-0            70.90     10.51     0.00   0.00   11.58   151.82  172.63     46.72     0.00   0.00  215.61   277.15    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00   38.04  26.45
dm-1            70.46     10.50     0.00   0.00   11.61   152.54  130.60     46.75     0.00   0.00  180.72   366.55    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00   24.42  25.97
dm-2             0.26      0.01     0.00   0.00    9.37    24.16    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.14
md0              2.31      0.15     0.00   0.00    1.44    65.82    0.04      0.01     0.00   0.00   36.27   138.97    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.24
md2             90.83     10.58     0.00   0.00   10.02   119.26  200.38     46.72     0.00   0.00   98.90   238.76    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00   20.73  26.61
sda            166.17     28.89  6935.09  97.66    6.20   178.03   71.97      6.27  1535.36  95.52    2.74    89.22    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.23  22.51
sdb            177.48     28.80  6928.93  97.50    6.36   166.18   70.01      6.23  1526.18  95.61    1.79    91.08    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.25  21.88
sdc            162.12     28.72  6928.77  97.71    6.60   181.42   69.39      6.24  1530.79  95.66    2.71    92.13    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.26  21.66
sdd            162.88     28.70  6924.70  97.70    6.87   180.46   70.65      6.30  1544.39  95.63    2.56    91.33    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.30  21.68
sde            164.72     28.80  6943.29  97.68    6.59   179.06   75.85      6.37  1557.20  95.36    2.15    86.02    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.25  21.99
sdf            167.12     28.90  6967.95  97.66    6.86   177.07   79.52      6.45  1574.62  95.19    2.01    83.11    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.31  22.06
sdg            167.48     28.77  6935.82  97.64    6.96   175.92   75.29      6.26  1528.87  95.31    2.35    85.12    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.34  21.95
sdh            163.31     28.77  6935.71  97.70    7.43   180.37   71.10      6.24  1527.89  95.55    2.43    89.84    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.39  21.90
sdi            162.12     28.76  6940.17  97.72    6.86   181.68   71.82      6.28  1538.67  95.54    2.22    89.59    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.27  21.50
sdj            166.49     28.87  6964.17  97.67    6.24   177.55   77.24      6.41  1566.13  95.30    2.09    85.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.20  21.56
zram0            0.79      0.00     0.00   0.00    0.00     4.00    0.00      0.00     0.00   0.00    0.00     4.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00


Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
dm-0            37.80      7.77     0.00   0.00   20.14   210.58  338.40    115.71     0.00   0.00  143.46   350.14    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00   49.31  44.82
dm-1            37.80      7.77     0.00   0.00   20.14   210.58  237.80    115.71     0.00   0.00  102.60   498.27    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00   25.16  44.52
dm-2             0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
md0              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
md2             41.40      7.77     0.00   0.00   20.72   192.27  431.00    115.71     0.00   0.00   96.49   274.92    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00   42.44  44.64
sda            105.00      5.30  1068.00  91.05   21.38    51.66  167.60     16.35  4021.40  96.00    1.26    99.86    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    2.46  23.12
sdb             94.20      5.45  1114.20  92.20   16.57    59.30  149.20     16.57  4096.80  96.49    1.49   113.71    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.78  19.48
sdc             97.20      5.11   973.80  90.92   18.34    53.88  159.00     16.31  4019.80  96.20    6.83   105.01    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    2.87  23.64
sdd            105.60      4.74   889.00  89.38   11.44    46.01  172.20     16.33  4013.40  95.89    1.31    97.12    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.43  19.34
sde             98.40      4.85   935.80  90.49   12.09    50.52  164.60     16.40  4039.00  96.08    3.92   102.04    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.83  20.98
sdf            104.40      4.80   912.00  89.73   10.50    47.07  170.80     16.46  4048.60  95.95    1.66    98.71    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.38  18.70
sdg            113.00      5.00  1008.80  89.93   16.51    45.30  172.80     15.89  3900.40  95.76    1.57    94.18    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    2.14  20.58
sdh            101.80      5.04  1016.20  90.89   14.15    50.65  157.00     15.76  3882.60  96.11    1.82   102.80    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.73  19.88
sdi             85.40      4.52   905.20  91.38   17.49    54.25  150.20     15.27  3763.00  96.16    3.15   104.09    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.97  20.28
sdj             90.20      4.36   852.80  90.43   16.22    49.50  153.40     15.27  3759.60  96.08    1.25   101.91    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.66  19.04
zram0            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00

The other frames are pretty similar. Looking back, it seems like we didn't see red rrqm at all on 5.17; that seems like a difference. I'll let it work for a while, then re-run the commands and see how they look after it's been up for a bit.

Comment 51 Adam Williamson 2022-03-18 02:01:03 UTC
Here are the numbers a couple of hours later on 5.11, with some tasks still running and the system working OK:

[root@openqa-x86-worker04 adamwill][PROD-IAD2]# vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 5  0      0 119383400  14864 32709164    0    0    97   672   44  143  8 17 75  0  0
 4  0      0 121403872  14864 32901996    0    0  2777  8450 18878 76178  6  5 90  0  0
 5  0      0 122674208  14864 31973156    0    0    38   259 15073 69337  5  4 90  0  0
 5  0      0 122486896  14864 32166596    0    0     0  1631 14479 77884  5  3 92  0  0
 1  0      0 124243376  14864 30410476    0    0   210   477 8492 23499  3  1 96  0  0
 4  0      0 124247080  14864 30410916    0    0    14 122827 9212 11548  3  3 94  0  0
 4  0      0 124254800  14864 30411592    0    0    10   305 6891 10757  3  1 96  0  0

[root@openqa-x86-worker04 adamwill][PROD-IAD2]# iostat -x -d -m 5
Linux 5.11.21-300.fc34.x86_64 (openqa-x86-worker04.iad2.fedoraproject.org) 	2022-03-18 	_x86_64_	(64 CPU)

Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
dm-0            34.19      5.93     0.00   0.00   12.78   177.45  218.13     41.48     0.00   0.00  231.83   194.75    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00   51.01  29.50
dm-1            34.16      5.92     0.00   0.00   12.79   177.60  188.35     41.50     0.00   0.00  211.19   225.64    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00   40.21  29.40
dm-2             0.02      0.00     0.00   0.00    9.37    24.16    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.01
md0              0.20      0.01     0.00   0.00    1.45    65.46    0.01      0.00     0.00   0.00   19.86    73.45    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.02
md2             41.28      5.93     0.00   0.00   12.21   147.12  247.58     41.48     0.00   0.00  138.45   171.58    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00   34.78  28.89
sda             71.94      4.32   886.01  92.49   16.09    61.54   87.85      5.82  1406.19  94.12    2.18    67.81    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.35  12.58
sdb             73.84      4.33   887.12  92.32   16.33    59.99   88.40      5.82  1406.46  94.09    1.78    67.42    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.36  12.03
sdc             73.42      4.33   889.02  92.37   15.40    60.36   89.41      5.82  1404.68  94.02    2.06    66.63    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.31  11.98
sdd             73.15      4.32   885.71  92.37   16.59    60.42   90.24      5.83  1408.10  93.98    2.07    66.20    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.40  11.98
sde             72.93      4.32   886.80  92.40   16.18    60.69   91.10      5.83  1404.89  93.91    1.99    65.48    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.36  12.12
sdf             72.93      4.33   889.51  92.42   16.59    60.84   90.45      5.83  1406.70  93.96    2.01    66.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.39  12.02
sdg             75.03      4.32   883.80  92.17   16.63    58.94   90.95      5.80  1398.44  93.89    1.96    65.30    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.43  11.95
sdh             72.37      4.31   884.51  92.44   16.99    60.98   88.19      5.77  1394.56  94.05    1.99    67.04    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.41  11.92
sdi             71.35      4.30   882.87  92.52   17.28    61.68   88.09      5.79  1399.95  94.08    1.93    67.36    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.40  11.99
sdj             71.80      4.31   887.10  92.51   16.36    61.54   88.94      5.82  1404.51  94.04    1.87    66.95    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    1.34  11.80
zram0            0.07      0.00     0.00   0.00    0.00     4.00    0.00      0.00     0.00   0.00    0.00     4.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00


Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
dm-0             0.00      0.00     0.00   0.00    0.00     0.00   17.20      0.51     0.00   0.00   16.19    30.60    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.28   2.94
dm-1             0.00      0.00     0.00   0.00    0.00     0.00   17.20      0.51     0.00   0.00   16.19    30.60    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.28   2.94
dm-2             0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
md0              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
md2              0.00      0.00     0.00   0.00    0.00     0.00   18.00      0.51     0.00   0.00   16.73    29.24    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.30   2.92
sda              6.80      0.19    43.00  86.35    1.76    29.29   13.40      0.20    43.00  76.24    0.16    15.41    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.01   1.04
sdb              4.00      0.04     6.80  62.96    2.25    10.80   10.40      0.05     7.00  40.23    0.13     4.86    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.01   0.80
sdc              4.80      0.09    19.20  80.00    2.79    20.00   11.20      0.10    19.40  63.40    0.16     9.22    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.02   1.22
sdd              6.60      0.28    64.20  90.68    5.27    42.91   12.80      0.28    64.60  83.46    0.22    22.70    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.04   1.76
sde              8.20      0.26    57.60  87.54    2.46    32.10   14.60      0.26    57.80  79.83    0.21    18.53    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.02   1.04
sdf              9.80      0.34    77.20  88.74    3.24    35.51   15.40      0.35    78.20  83.55    0.22    23.07    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.04   1.32
sdg              7.20      0.09    16.20  69.23    3.33    13.00   13.20      0.10    16.80  56.00    0.17     7.64    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.03   1.80
sdh              2.20      0.02     3.00  57.69    2.00     9.45    8.80      0.03     3.00  25.42    0.14     3.19    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.01   0.96
sdi              4.20      0.07    13.80  76.67    3.38    17.14   10.40      0.08    14.00  57.38    0.12     7.55    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.02   0.84
sdj              5.00      0.13    27.60  84.66    4.00    26.08   10.80      0.13    28.20  72.31    0.15    12.68    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.02   0.84
zram0            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00

Comment 52 Chris Murphy 2022-03-20 18:03:21 UTC
Well, md2 @ 91% util definitely means it's busy. I do not know why md2's %util is roughly twice that of any of its member drives.

Could Peter or Eric look at the crash dump Adam provided in comment 44? If it's not revealing, could you suggest how to get a better crash dump? I threw out some ideas in comment 19 to get kdump to trigger automatically, rather than manually triggering with sysrq+c well after the problem has happened.

Comment 53 Ben Cotton 2022-05-12 15:54:11 UTC
This message is a reminder that Fedora Linux 34 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 34 on 2022-06-07.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '34'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 34 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 54 Adam Williamson 2022-05-12 19:53:26 UTC
Still valid at least up to 35. I'll upgrade the workers to 36 soon and I expect this will still be there, sadly.

Comment 55 Chris Murphy 2022-05-13 16:01:09 UTC
I think there are only two remaining options:

A. Find the bug by getting *all* of the information in https://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F to the upstream XFS list. And doing a kernel bisect.

Whenever anyone feeds this to the devs piecemeal, they understandably get short-tempered and point to this URL, as they did to me when I first reported this bug (and I know better). I expect they will ask for a kernel bisect because the storage stack in this bug is complex: md raid, lvm, dmcrypt, xfs; that's four layers the bug could be in. The ultimate cause may not even be storage stack related, so it's a pretty sure bet it's going to require a bisect by someone who can reproduce the problem.

B. Reprovision

(1) use a kickstart specifying --chunksize 64 for software raid; and (2) avoid resizing XFS, i.e. make sure the / file system, or a separate /var, is the size you want at mkfs.xfs time, rather than growing the LV and the file system later, which is what was done with this file system (a rough sketch follows at the end of this comment). The upstream XFS conversation on this bug mentioned both of these things, so they're likely contributing factors in exposing the regression. XFS devs have never liked the mdadm 512 KiB default chunk size, especially for parity raid and especially for metadata-centric workloads (they consistently argue against combining parity raid with metadata-centric workloads, and favor raid10 instead). Note that xfs_growfs doesn't grow important fs structures, including the journal. This file system's journal is too small for the size of the file system, indicating it was grown too much.

I don't see any shortcuts.
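
For concreteness, here is a minimal shell sketch of what (B) boils down to when done by hand rather than via kickstart; the device names, raid level, and array name are placeholders, and in the real layout LVM and dm-crypt would sit between the array and the file system:

# create the array with a 64 KiB chunk size instead of the 512 KiB default
mdadm --create /dev/md2 --level=10 --chunk=64 --raid-devices=8 /dev/sd[b-i]
# make the file system at its final size up front, so it never needs xfs_growfs
mkfs.xfs /dev/md2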

Comment 56 Adam Williamson 2022-05-13 17:10:54 UTC
We can probably do B; reprovisioning the worker hosts is quite trivial. I'd just have to work with the infra folks to set them up as described.

Comment 57 Adam Williamson 2022-05-16 21:25:01 UTC
https://pagure.io/fedora-infrastructure/issue/10698 filed.

Comment 58 Adam Williamson 2022-05-27 23:36:47 UTC
So a quick update: we've been following Chris' idea, and the results are interesting yet frustrating.

First we re-deployed all the boxes as Fedora 36 with the root filesystem as btrfs on dm-crypt on mdraid level 10 (er, I think it goes that way around) with a 256k chunk size. They fell over again. The symptoms weren't quite *identical* - I was getting odd errors when trying to ssh in, and one of the boxes actually seemed to sort of recover itself after a while and let me back in - but really similar: fundamentally, after being up for a few hours the test jobs started failing due to timeouts and lack of response from the box to the server, and remote access wasn't working. Rebooting brought them back to a working state until the problem happened again.

Then we re-deployed two of the boxes again. On one of them we just used a single disk and left the others all idle. It has an EFI system partition, an xfs /boot, and a btrfs root, all on a single disk. No dm-crypt.

On the other one, we did EFI system partition, mdraid level 1 /boot, and native btrfs raid (level 10) root. No dm-crypt.

The one using btrfs RAID seized up again. The one with the single disk configuration has been up for nearly five hours now and seems to be still working fine.

It's still early, and I'll keep checking in on these systems, but so far the indication is that the bug affects three different multi-disk configs (xfs-on-dmcrypt-on-mdraid6/512K chunks, btrfs-on-dmcrypt-on-mdraid10/256K chunks, btrfs-raid10) but does not affect a single disk config.
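
(For anyone re-checking these boxes later, the layering on each config can be captured quickly; the md device below is just an example:)

# block device stack: raid / crypt / fs layering and sizes
lsblk -o NAME,TYPE,FSTYPE,SIZE,MOUNTPOINT
# raid level and chunk size on the multi-disk boxes
mdadm --detail /dev/md2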

Comment 59 Adam Williamson 2022-05-27 23:55:40 UTC
Oh, forgot to mention - I've put a 5.11 kernel on the btrfs-raid10 box, and am gonna see if it survives the weekend that way. I'm betting it will.

Comment 60 Adam Williamson 2022-07-05 01:32:54 UTC
Latest updates:

1. I need to qualify "does not affect a single disk config" a bit. The system running a current kernel with a single-disk config does not get into the same hung state as the multi-disk systems, but it does every so often seem to glitch out. I'll see a set of tests that failed with errors during disk image operations (I don't have the exact errors, as this hasn't happened for a few hours and I can't easily find cases older than that). So I suspect a single-disk config does actually suffer from some less-fatal form of the problem.

2. As expected, btrfs-raid10 with 5.11 kernel works fine.

3. I've now reproduced the bug all the way back to kernel-5.12.0-0.rc0.20210225gitc03c21ba6f4e.160.fc35 , which is the earliest still-available 5.12 snapshot build in Koji. There were a couple of earlier builds run, but they weren't officially tagged, so they were garbage collected long ago.

4. We have a host set up as a single-system openQA server/worker whose results are unimportant, so we can use that box to triage further.

Comment 61 Chris Murphy 2022-08-11 17:03:47 UTC
Update: The git bisect started out OK but halfway through started producing unbootable kernels. I gathered what I'd learned so far and posted to the upstream kernel, block, btrfs, and raid lists to seek advice. I got a reply from Josef Bacik, who looked at the sysrq+w output for the plain btrfs raid10 and the btrfs on dmcrypt on md raid configurations. He had me test with the cgroups IO controller disabled, but the problem still happens. He says this smells like some kind of driver or hardware issue. Since the problem happens on nearly identical production setups, I interpret this as one or more bugs in the megaraid_sas driver, the megaraid firmware, or the drive firmware. So it's pretty low level indeed, which likely explains why it's taking a while to be discovered: probably no one else is running this particular workload on this particular hardware setup.

Next steps: (a) I'll research the availability of firmware updates for any of the hardware components, and file infra tickets to apply them if possible; (b) finding the commit that causes the regression can probably result in a quirk being added to the kernel to restore the old behavior when it finds the older firmware, which means I need to continue the bisect. Josef offered his manual bisect strategy, which avoids git bisect's problems with the merge commits found in Linus's tree that I've been running into:

> git log --oneline --no-merges v5.11..v5.12-rc1 > bisect.txt
>
> Open the file, go to the middle line, check out that commit, run test.
> If it fails delete that line and all the lines above it, if it
> succeeds delete that line and all the lines below it.
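
A rough shell sketch of that loop, assuming bisect.txt was generated as above (the arithmetic for picking the middle line is just one way to do it):

# grab the middle commit of the remaining candidates and check it out
mid=$(( ( $(wc -l < bisect.txt) + 1 ) / 2 ))
commit=$(sed -n "${mid}p" bisect.txt | awk '{print $1}')
git checkout "$commit"
# build and boot this kernel, run the workload, then trim bisect.txt by hand:
#   bad  -> delete that line and everything above it
#   good -> delete that line and everything below it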

Comment 62 Adam Williamson 2022-08-11 17:12:08 UTC
The weird thing about that is I'm *pretty* sure we had problems on at least one other arch, and the other arch workers are obviously completely different hardware. I can put the stg worker hosts for aarch64 and ppc64le back on current kernels and see what happens, but it's kinda a bad time right now.

Comment 63 Chris Murphy 2022-08-11 17:32:34 UTC
Let's not just yet. I'm pretty sure I have hit this same bug in 5.19; it just takes one extra straw to break it. See bug 2117326.

Comment 64 Chris Murphy 2022-08-12 15:53:04 UTC
Created attachment 1905172 [details]
5.12.0 sysrq+w sysrq+t

On a whim, I had switched to the mq-deadline IO scheduler; uptime is ~15 hours. I then live-switched the 8 drives in the raid10 array back to bfq while the workload was in progress, and within 10 minutes it cratered: load average, io wait, and io pressure increasing, while actual IO decreased to zero.

Attached is dmesg with sysrq+w first at [62277.077208] and sysrq+t at [62296.988645]. I can't tell whether this is a bfq bug or some negative interaction between bfq and scsi or megaraid_sas, but the correlation is pretty clear at this point.
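
For anyone reproducing this, the scheduler can be switched live through sysfs, roughly like this (sdb..sdi are placeholders for the eight array members):

# the active scheduler is shown in brackets
for d in sdb sdc sdd sde sdf sdg sdh sdi; do
    cat /sys/block/$d/queue/scheduler
done
# switch one member on the fly (to bfq to reproduce, or back to mq-deadline)
echo bfq > /sys/block/sdb/queue/scheduler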

Comment 65 Chris Murphy 2022-08-12 16:10:58 UTC
>I can put the stg worker hosts for aarch64 and ppc64le back on current kernels and see what happens, but it's kinda a bad time right now.

If you want to move to 5.18 or 5.19 series, I strongly suggest also adding this drop-in override:

/etc/udev/rules.d/60-block-scheduler.rules 
#override the default bfq
ACTION=="add", SUBSYSTEM=="block", \
  KERNEL=="mmcblk*[0-9]|msblk*[0-9]|mspblk*[0-9]|sd*[!0-9]|sr*", \
  ATTR{queue/scheduler}="mq-deadline"
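
The rule only fires on "add" events, so after dropping it in you would either reboot or, as a sketch, re-trigger add events for the block devices:

udevadm control --reload
udevadm trigger --action=add --subsystem-match=block
# confirm it took effect (active scheduler is shown in brackets)
cat /sys/block/sda/queue/scheduler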

Comment 66 Chris Murphy 2022-08-18 15:17:21 UTC
I posted to the upstream kernel lists: kernel, block, raid, btrfs. Turns out bfq is just somehow "tickling" one or more issues in blk-mq in a way that mq-deadline is not. So there's a two-part fix:

* use mq-deadline as an immediate workaround; conveniently, it's also the preferred IO scheduler for infra/releng/qa;
* expect kernel block devs to post fixes for the underlying issue; I've been testing candidate patches and will continue to do so as long as openqa-worker05 is available.

Gory details are in the upstream thread:
https://lore.kernel.org/linux-block/2004c259-6ec7-76d9-cad6-7c381dbfcf0c@leemhuis.info/T/#m559bda01cc2c87b50ed853a98c39b9fc2da010f3

Comment 67 Adam Williamson 2022-08-18 17:31:30 UTC
Thanks a whole bunch for your work on this, Chris, it's been super helpful.

All openQA workers are now running 5.19 kernels with mq-deadline and seem to be working fine.

Comment 68 Adam Williamson 2022-11-26 16:46:44 UTC
So...I'm not 100% sure, but I think this is maybe happening again a little with 6.x kernels :( It doesn't seem as bad, but we're hitting similar symptoms again. For instance, one of the x86_64 workers was failing jobs overnight (e.g. https://openqa.fedoraproject.org/tests/1612775 ), Kevin couldn't ssh into it so he rebooted it, and on reboot nine hours of logs are missing:

Nov 26 07:14:53 openqa-x86-worker02.iad2.fedoraproject.org worker[175902]: [info] Uploading serial0.txt
Nov 26 07:14:53 openqa-x86-worker02.iad2.fedoraproject.org worker[175902]: [info] Uploading video_time.vtt
Nov 26 07:14:53 openqa-x86-worker02.iad2.fedoraproject.org worker[175902]: [info] Uploading serial_terminal.txt
Nov 26 07:14:53 openqa-x86-worker02.iad2.fedoraproject.org worker[175902]: [info] Uploading virtio_console1.log
-- Boot f0a90e4d80654e4882fd3ecb80339ec5 --
Nov 26 16:32:26 openqa-x86-worker02.iad2.fedoraproject.org kernel: Linux version 6.0.8-200.fc36.x86_64 (mockbuild.fedoraproject.org) (gcc (GCC) 12.2.1 20220819 (Red Hat 12.>
Nov 26 16:32:26 openqa-x86-worker02.iad2.fedoraproject.org kernel: Command line: BOOT_IMAGE=(md/0)/vmlinuz-6.0.8-200.fc36.x86_64 root=/dev/mapper/vg_guests-LogVol00 ro rd.auto=1 crashkern>
Nov 26 16:32:26 openqa-x86-worker02.iad2.fedoraproject.org kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Nov 26 16:32:26 openqa-x86-worker02.iad2.fedoraproject.org kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Nov 26 16:32:26 openqa-x86-worker02.iad2.fedoraproject.org kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'

which all feels a lot like the same symptoms from this bug. It doesn't feel like systems go unresponsive as *often* as was the case originally, but it does feel quite a lot like the same basic issue. All systems are now using mq-deadline scheduler.
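
The size of the gap is easy to confirm from the journal itself, e.g.:

# list recorded boots with first/last timestamps; the hole shows up as the
# gap between one boot's last entry and the next boot's first entry
journalctl --list-boots
# then look at the tail of the previous (affected) boot
journalctl -b -1 -e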

Comment 69 Ben Cotton 2022-11-29 17:06:22 UTC
This message is a reminder that Fedora Linux 35 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 35 on 2022-12-13.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '35'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 35 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 70 Adam Williamson 2022-11-30 18:25:03 UTC
Per above, we seem to be seeing something similar again unfortunately, so bumping to F37.

Comment 71 Adam Williamson 2022-12-06 20:46:23 UTC
So yeah, this is *definitely* a problem again with 6.0 kernels, even with mq-deadline. 5.19 kernels seem to be fine. I have had to versionlock the openQA workers to 5.19 kernels or else they keep seizing up.
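
(For reference, one way to pin them is the dnf versionlock plugin; the exact NVR below is just an example:)

dnf install python3-dnf-plugin-versionlock
dnf versionlock add kernel-5.19.17-200.fc36 kernel-core-5.19.17-200.fc36 kernel-modules-5.19.17-200.fc36
dnf versionlock list   # confirm the lock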

Comment 72 Chris Murphy 2022-12-07 01:05:38 UTC
Issue these two commands as root when the problem starts to happen (spiking load, but the system hasn't face-planted yet). The terminal instance should have substantial scrollback; I use unlimited, but 10K lines might be sufficient.

echo w > /proc/sysrq-trigger

(cd /sys/kernel/debug/block/$disk && find . -type f -exec grep -aH . {} \;)

Copy/paste the output into separate files whose names incorporate the kernel version, attach them to this bug report, and then I'll (re)start the discussion upstream.
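
A minimal sketch of that capture, writing straight to files named after the running kernel instead of relying on scrollback; the disk names are placeholders for the raid members:

#!/bin/bash
# run as root while load is spiking but before the box locks up completely
ver=$(uname -r)
# dump blocked-task state into the kernel log, then save the log
echo w > /proc/sysrq-trigger
dmesg > "sysrq-w-${ver}.txt"
# dump blk-mq debugfs state for each raid member
for disk in sdb sdc sdd sde sdf sdg sdh sdi; do
    (cd /sys/kernel/debug/block/$disk && find . -type f -exec grep -aH . {} \;) \
        > "blk-debug-${disk}-${ver}.txt"
done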

Comment 73 Aoife Moloney 2023-11-23 00:06:28 UTC
This message is a reminder that Fedora Linux 37 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 37 on 2023-12-05.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '37'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version. Note that the version field may be hidden.
Click the "Show advanced fields" button if you do not see it.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 37 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 74 Adam Williamson 2023-11-23 20:30:01 UTC
sorry I never got to cmurf's last debugging steps :(

I *still* have 5.19.17-200.fc36 pinned on all the openQA x86_64 workers. I'm currently upgrading the lab workers to a recent kernel to see if this is still an issue.

Comment 75 Adam Williamson 2023-11-24 17:25:27 UTC
Well, initial good news - all three stg openQA worker hosts have been on 6.5.12-300.fc39 for nearly a day without going unresponsive. It's been a relatively quiet day for openQA testing, though (thanksgiving and all). I'll see if they make it through a week.

Comment 76 Adam Williamson 2023-11-27 20:36:09 UTC
they all seem to have made it through the weekend, so I've updated the prod workers and we'll see if they make it unscathed to Friday.

Comment 77 Aoife Moloney 2023-12-05 21:02:09 UTC
Fedora Linux 37 entered end-of-life (EOL) status on 2023-12-05.

Fedora Linux 37 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux
please feel free to reopen this bug against that version. Note that the version
field may be hidden. Click the "Show advanced fields" button if you do not see
the version field.

If you are unable to reopen this bug, please file a new report against an
active release.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 78 Adam Williamson 2023-12-06 00:51:01 UTC
Just to close this out, the prod workers made it happily through a week, so I'm calling this fixed again (for now).

