Bug 1843410
| Summary: | RT guest get Call Trace "INFO: task ld:274531 blocked for more than 600 seconds." |
|---|---|
| Product: | Red Hat Enterprise Linux 8 |
| Component: | kernel-rt |
| kernel-rt sub component: | Scheduler |
| Status: | CLOSED ERRATA |
| Severity: | high |
| Priority: | high |
| Version: | 8.3 |
| Keywords: | ZStream |
| Target Milestone: | rc |
| Target Release: | 8.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Fixed In Version: | kernel-rt-4.18.0-220.rt7.32.el8 |
| Doc Type: | If docs needed, set a value |
| Reporter: | Pei Zhang <pezhang> |
| Assignee: | Marcelo Tosatti <mtosatti> |
| QA Contact: | Pei Zhang <pezhang> |
| CC: | bhu, chayang, jinzhao, jlelli, juzhang, kcarcia, knoel, lcapitulino, mtosatti, nilal, peterx, virt-maint, williams |
| Flags: | pm-rhel: mirror+ |
| Cloned to: | 1855779 (view as bug list) |
| Bug Blocks: | 1817732, 1823810, 1855779 |
| Last Closed: | 2020-11-04 02:25:41 UTC |
| Type: | Bug |
| Attachments: | 1694764 (VM Call Trace info), 1695007 (Another Call Trace info in guest) |
Clark,

By looking at the trace, my initial guess for this issue would be that drain_all_pages() or flush_work() is scheduling work on the CPU where oslat is running (which is isolated). In other words, work that originated on a housekeeping CPU is being scheduled on an isolated CPU. That will cause a stall, since oslat won't let the kworker run.

Now, I might be completely wrong, but it rings a bell that we used to have a patch in RHEL7 RT that prevented either drain_all_pages() or flush_work() from scheduling work on isolated CPUs. I hope I'm not just creating noise, but if this is correct then this issue should be considered serious, since it will affect DPDK workloads on RHEL 8.2 the same way.

Created attachment 1695007 [details]
Another Call Trace info in guest

The "(3) Multiple VMs, each with 1 RT vCPU" testing has finished. There is a Call Trace in one of the four guests. It looks a bit different from the one above. I'm not sure whether they are the same issue, so I also paste it here; the full dmesg is attached.
Luiz, Clark,

Please let me know if these are different issues; if so, I can file a new bug to track it. Thanks.
[ 4363.236900] INFO: task cc1:77565 blocked for more than 600 seconds.
[ 4363.238025] Not tainted 4.18.0-208.rt5.20.el8.x86_64 #1
[ 4363.239458] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4363.240796] cc1 D 0 77565 77560 0x00084080
[ 4363.241784] Call Trace:
[ 4363.242363] __schedule+0x342/0x850
[ 4363.243059] schedule+0x39/0xd0
[ 4363.243609] schedule_timeout+0x20e/0x410
[ 4363.245349] ? __schedule+0x34a/0x850
[ 4363.247042] ? ___preempt_schedule+0x16/0x18
[ 4363.247757] wait_for_completion+0x85/0xe0
[ 4363.249505] flush_work+0x11a/0x1c0
[ 4363.251160] ? flush_workqueue_prep_pwqs+0x130/0x130
[ 4363.253032] drain_all_pages+0x140/0x190
[ 4363.253703] __alloc_pages_slowpath+0x3f8/0xe20
[ 4363.255508] __alloc_pages_nodemask+0x1ca/0x2b0
[ 4363.257310] alloc_pages_vma+0xc5/0x170
[ 4363.260023] ? __switch_to_asm+0x35/0x70
[ 4363.260700] __handle_mm_fault+0x5a5/0xa70
[ 4363.262451] ? __switch_to_asm+0x35/0x70
[ 4363.264165] ? __switch_to_asm+0x41/0x70
[ 4363.264827] handle_mm_fault+0xd2/0x1e0
[ 4363.266536] __do_page_fault+0x28e/0x5d0
[ 4363.268246] do_page_fault+0x47/0x1b0
[ 4363.269925] ? async_page_fault+0x8/0x30
[ 4363.270598] async_page_fault+0x1e/0x30
[ 4363.272298] RIP: 0033:0x56115b27ffb0
[ 4363.274020] Code: Bad RIP value.
[ 4363.274580] RSP: 002b:00007ffe5c70d8d0 EFLAGS: 00010202
[ 4363.276479] RAX: 00007feb25025000 RBX: 00007feb24f01370 RCX: 000000000000000c
[ 4363.278653] RDX: 000056115beb4c40 RSI: 00007feb25213cb0 RDI: 0000000000002227
[ 4363.281835] RBP: 00007feb26692c60 R08: 0000000000000001 R09: 0000000000000002
[ 4363.285054] R10: 00007feb25022400 R11: 0000000000000006 R12: 0000000000000000
[ 4363.287225] R13: 00007feb24f013b8 R14: 00007ffe5c70d8f0 R15: 00007feb25025008
[ 4985.829739] INFO: task cc1:77565 blocked for more than 600 seconds.
[ 4985.830822] Not tainted 4.18.0-208.rt5.20.el8.x86_64 #1
[ 4985.832039] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4985.833382] cc1 D 0 77565 77560 0x00084080
[ 4985.834375] Call Trace:
[ 4985.834930] __schedule+0x342/0x850
[ 4985.835534] schedule+0x39/0xd0
[ 4985.836225] schedule_timeout+0x20e/0x410
[ 4985.838968] ? __schedule+0x34a/0x850
[ 4985.839595] ? ___preempt_schedule+0x16/0x18
[ 4985.842383] wait_for_completion+0x85/0xe0
[ 4985.845132] flush_work+0x11a/0x1c0
[ 4985.847752] ? flush_workqueue_prep_pwqs+0x130/0x130
[ 4985.848572] drain_all_pages+0x140/0x190
[ 4985.852288] __alloc_pages_slowpath+0x3f8/0xe20
[ 4985.855085] __alloc_pages_nodemask+0x1ca/0x2b0
[ 4985.857889] alloc_pages_vma+0xc5/0x170
[ 4985.858541] ? __switch_to_asm+0x35/0x70
[ 4985.860546] __handle_mm_fault+0x5a5/0xa70
[ 4985.862838] ? __switch_to_asm+0x35/0x70
[ 4985.863498] ? __switch_to_asm+0x41/0x70
[ 4985.866215] handle_mm_fault+0xd2/0x1e0
[ 4985.867911] __do_page_fault+0x28e/0x5d0
[ 4985.868575] do_page_fault+0x47/0x1b0
[ 4985.870238] ? async_page_fault+0x8/0x30
[ 4985.871946] async_page_fault+0x1e/0x30
[ 4985.872601] RIP: 0033:0x56115b27ffb0
[ 4985.874263] Code: Bad RIP value.
[ 4985.875868] RSP: 002b:00007ffe5c70d8d0 EFLAGS: 00010202
[ 4985.877749] RAX: 00007feb25025000 RBX: 00007feb24f01370 RCX: 000000000000000c
[ 4985.879910] RDX: 000056115beb4c40 RSI: 00007feb25213cb0 RDI: 0000000000002227
[ 4985.882072] RBP: 00007feb26692c60 R08: 0000000000000001 R09: 0000000000000002
[ 4985.885242] R10: 00007feb25022400 R11: 0000000000000006 R12: 0000000000000000
[ 4985.887400] R13: 00007feb24f013b8 R14: 00007ffe5c70d8f0 R15: 00007feb25025008
drain_all_pages() does:

```c
	for_each_cpu(cpu, &cpus_with_pcps) {
		struct pcpu_drain *drain = per_cpu_ptr(&pcpu_drain, cpu);

		drain->zone = zone;
		INIT_WORK(&drain->work, drain_local_pages_wq);
		queue_work_on(cpu, mm_percpu_wq, &drain->work);
	}
	for_each_cpu(cpu, &cpus_with_pcps)
		flush_work(&per_cpu_ptr(&pcpu_drain, cpu)->work);
```
And the worker thread can't run on an isolated CPU while oslat (or DPDK with a SCHED_FIFO thread) is running there, so the flush_work() above never returns for those CPUs.
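To make the failure mode concrete, here is a hypothetical, self-contained illustration (not from this report): a toy module that queues a work item on a chosen CPU and then flushes it. If `target_cpu` is an isolated CPU monopolized by a SCHED_FIFO task such as oslat, the flush blocks indefinitely, which is exactly the shape of the hung-task traces above.

```c
/*
 * Illustration only -- do not load on a machine you care about.
 * If target_cpu's kworker is starved by a SCHED_FIFO spinner,
 * flush_work() below never completes and insmod hangs, eventually
 * triggering the same hung-task report seen in this bug.
 */
#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/smp.h>

static int target_cpu;
module_param(target_cpu, int, 0444);
MODULE_PARM_DESC(target_cpu, "CPU to queue the demo work item on");

static struct work_struct demo_work;

static void demo_fn(struct work_struct *work)
{
	pr_info("demo work ran on CPU %d\n", raw_smp_processor_id());
}

static int __init demo_init(void)
{
	INIT_WORK(&demo_work, demo_fn);
	queue_work_on(target_cpu, system_wq, &demo_work);
	flush_work(&demo_work);	/* blocks until kworker/<target_cpu> runs */
	return 0;
}

static void __exit demo_exit(void)
{
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
```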
One way to fix it would be to make drain_all_pages() aware of isolated CPUs, probably by deriving the set of housekeeping CPUs via housekeeping_cpumask().
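A minimal sketch of that idea, assuming the housekeeping_cpumask()/HK_FLAG_WQ interface from <linux/sched/isolation.h> is the right one to consult here (an assumption; this is not the patch that eventually shipped):

```c
#include <linux/sched/isolation.h>

	/*
	 * Sketch: mask out isolated CPUs before queueing drain work,
	 * so no kworker is ever scheduled on them by this path.
	 */
	cpumask_and(&cpus_with_pcps, &cpus_with_pcps,
		    housekeeping_cpumask(HK_FLAG_WQ));

	for_each_cpu(cpu, &cpus_with_pcps) {
		struct pcpu_drain *drain = per_cpu_ptr(&pcpu_drain, cpu);

		drain->zone = zone;
		INIT_WORK(&drain->work, drain_local_pages_wq);
		queue_work_on(cpu, mm_percpu_wq, &drain->work);
	}
	for_each_cpu(cpu, &cpus_with_pcps)
		flush_work(&per_cpu_ptr(&pcpu_drain, cpu)->work);
```

Note the trade-off: with this sketch, the isolated CPUs' per-CPU free lists are simply never drained by this path and would need some other mechanism.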
If so, then there are other callers of queue_work_on() that may not respect CPU isolation either, and perhaps we should look into them as well.
e.g. mm/swap.c (lru_add_drain_all()):

```c
	for_each_online_cpu(cpu) {
		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
		...
		INIT_WORK(work, lru_add_drain_per_cpu);
		queue_work_on(cpu, mm_percpu_wq, work);
	}
```
However, we have to test this change extensively to make sure that we are not breaking anything.
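For illustration only, the same housekeeping filter applied to the lru_add_drain_all() loop above could look like this sketch (again assuming HK_FLAG_WQ is the appropriate flag, and using housekeeping_cpu() from <linux/sched/isolation.h>):

```c
	for_each_online_cpu(cpu) {
		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

		/* Sketch: skip isolated CPUs instead of queueing work there. */
		if (!housekeeping_cpu(cpu, HK_FLAG_WQ))
			continue;

		INIT_WORK(work, lru_add_drain_per_cpu);
		queue_work_on(cpu, mm_percpu_wq, work);
	}
```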
Options:

1) Do a first pass on non-isolated CPUs, trying to drain their per-CPU queues. Not guaranteed to work (but perhaps a workable intermediate fix).

2) Reduce the limit of pages in the per-CPU free queue to a value that is not very significant (I don't think it's a good idea: if there is a free page somewhere, we should be able to use it before OOM'ing).

3) Drain pages when entering "task isolated" mode (a sketch of this idea follows at the end of these comments). From the task isolation patchset:

```
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
```

(In reply to Marcelo Tosatti from comment #6)
> Options:
>
> 1) Do a first pass on non-isolated CPUs trying to drain their per-CPU
> queues. Not guaranteed to work (but perhaps a workable intermediate fix).

Yes, I agree there is a fair chance that this will fail. Since you mentioned it, I don't think I favor this option now.

> 2) Reduce the limit of pages in per-CPU free queue to a value that is
> not very significant (don't think its a good idea: if there is a free page
> somewhere, should be able to use it before OOM'ing).

+1, we should avoid doing this.

> 3) Drain pages when entering "task isolated" mode.

It sounds like a decent idea, but I will also go through the code to understand whether there are any downsides to it.

Verified with 4.18.0-221.rt7.33.el8.x86_64: no Call Trace in the RT guest.

(2) Single VM with 8 RT vCPUs:
Max Latency: 9 8 8 8 8 8 8 8 (us)

==Versions==
kernel-rt-4.18.0-221.rt7.33.el8.x86_64
qemu-kvm-5.0.0-0.scrmod+el8.3.0+7241+ccee766e.wrb200701.x86_64
tuned-2.14.0-2.el8.noarch
libvirt-6.4.0-1.scrmod+el8.3.0+7241+ccee766e.x86_64
python3-libvirt-6.3.0-1.module+el8.3.0+6485+695fc960.x86_64
microcode_ctl-20200609-2.el8.x86_64
rt-tests-1.8-3.el8.x86_64

So this bug has been fixed very well. Moving to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: kernel-rt security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4609
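As promised above, a sketch of option 3: drain the CPU's deferred per-CPU state once, when a task enters isolated mode, so nothing needs to queue drain work there later. task_isolation_enter() is a hypothetical hook modeled on the never-merged task isolation patchset; lru_add_drain() and drain_local_pages() are existing kernel functions.

```c
/*
 * Hypothetical hook (the task isolation patchset never merged
 * upstream): drain this CPU's per-CPU caches when a task enters
 * isolated mode, so later allocations on other CPUs find nothing
 * to flush here.
 */
static void task_isolation_enter(void)
{
	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
	lru_add_drain();

	/* Also drain this CPU's per-CPU free-page lists (all zones). */
	drain_local_pages(NULL);
}
```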
Created attachment 1694764 [details]
VM Call Trace info

Description of problem:
During KVM-RT oslat testing, there is Call Trace info in the RT guest.

Version-Release number of selected component (if applicable):
4.18.0-208.rt5.20.el8.x86_64
qemu-img-5.0.0-0.scrmod+el8.3.0+6791+fbd26444.wrb200527.x86_64
libvirt-6.3.0-1.scrmod+el8.3.0+6791+fbd26444.x86_64
tuned-2.13.0-6.el8.noarch

How reproducible:
1/1

Steps to Reproduce:
1. Set up the RT host.
2. Boot the RT guest.
3. Start 12h oslat testing in the guest. Meanwhile, compile the kernel on both host and guest housekeeping cores.
# /home/nfv-virt-rt-kvm/tools/oslat --cpu-list 2,3,4,5,6,7,8,9 --rtprio 1 --runtime 12h
4. Check the guest dmesg log; there are Call Traces. The full log is attached.

```
[ 7475.821066] INFO: task ld:274531 blocked for more than 600 seconds.
[ 7475.822157] Not tainted 4.18.0-208.rt5.20.el8.x86_64 #1
[ 7475.823094] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7475.824392] ld D 0 274531 274530 0x00084080
[ 7475.825307] Call Trace:
[ 7475.825761] __schedule+0x342/0x850
[ 7475.826377] schedule+0x39/0xd0
[ 7475.826923] schedule_timeout+0x20e/0x410
[ 7475.827610] ? __schedule+0x34a/0x850
[ 7475.828247] ? ___preempt_schedule+0x16/0x18
[ 7475.828953] wait_for_completion+0x85/0xe0
[ 7475.829653] flush_work+0x11a/0x1c0
[ 7475.830313] ? flush_workqueue_prep_pwqs+0x130/0x130
[ 7475.831148] drain_all_pages+0x140/0x190
[ 7475.831803] __alloc_pages_slowpath+0x3f8/0xe20
[ 7475.832571] ? mem_cgroup_commit_charge+0xcb/0x510
[ 7475.833371] __alloc_pages_nodemask+0x1ca/0x2b0
[ 7475.834134] pagecache_get_page+0xb5/0x2d0
[ 7475.834814] ? account_page_dirtied+0x11a/0x220
[ 7475.835579] grab_cache_page_write_begin+0x1f/0x40
[ 7475.836379] iomap_write_begin.constprop.44+0x1c1/0x370
[ 7475.837241] ? iomap_write_end+0x91/0x290
[ 7475.837911] iomap_write_actor+0x92/0x170
[ 7475.838603] ? iomap_write_end+0x290/0x290
[ 7475.839310] iomap_apply+0xba/0x130
[ 7475.839905] ? iomap_write_end+0x290/0x290
[ 7475.840610] iomap_file_buffered_write+0x62/0x90
[ 7475.841385] ? iomap_write_end+0x290/0x290
[ 7475.842132] xfs_file_buffered_aio_write+0xd6/0x390 [xfs]
[ 7475.843057] ? xfs_file_buffered_aio_read+0x5b/0x120 [xfs]
[ 7475.843941] new_sync_write+0x124/0x170
[ 7475.844610] vfs_write+0xa5/0x1a0
[ 7475.845201] ksys_write+0x52/0xc0
[ 7475.845770] do_syscall_64+0x87/0x1a0
[ 7475.846411] entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 7475.847256] RIP: 0033:0x7f74c8a0caa8
[ 7475.847865] Code: Bad RIP value.
[ 7475.848447] RSP: 002b:00007ffeb5854a28 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 7475.849644] RAX: ffffffffffffffda RBX: 00000000016fb000 RCX: 00007f74c8a0caa8
[ 7475.850780] RDX: 00000000016fb000 RSI: 00007f74b0bfb478 RDI: 0000000000000006
[ 7475.851921] RBP: 00007f74b0bfb478 R08: fffffffffffffff8 R09: 00000000006fbd04
[ 7475.853069] R10: 00005593590ba700 R11: 0000000000000246 R12: 0000559358308fa0
[ 7475.854206] R13: 00000000016fb000 R14: 00007f74c8cd8880 R15: 00000000016fb000
[ 8098.413957] INFO: task ld:274531 blocked for more than 600 seconds.
[ 8098.415126] Not tainted 4.18.0-208.rt5.20.el8.x86_64 #1
[ 8098.416079] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 8098.417319] ld D 0 274531 274530 0x00084080
[ 8098.418236] Call Trace:
[ 8098.418690] __schedule+0x342/0x850
[ 8098.419312] schedule+0x39/0xd0
[ 8098.419860] schedule_timeout+0x20e/0x410
[ 8098.420558] ? __schedule+0x34a/0x850
[ 8098.421205] ? ___preempt_schedule+0x16/0x18
[ 8098.421932] wait_for_completion+0x85/0xe0
[ 8098.422618] flush_work+0x11a/0x1c0
[ 8098.423235] ? flush_workqueue_prep_pwqs+0x130/0x130
[ 8098.424067] drain_all_pages+0x140/0x190
[ 8098.424725] __alloc_pages_slowpath+0x3f8/0xe20
[ 8098.425492] ? mem_cgroup_commit_charge+0xcb/0x510
[ 8098.426297] __alloc_pages_nodemask+0x1ca/0x2b0
[ 8098.427064] pagecache_get_page+0xb5/0x2d0
[ 8098.427747] ? account_page_dirtied+0x11a/0x220
[ 8098.428517] grab_cache_page_write_begin+0x1f/0x40
[ 8098.429324] iomap_write_begin.constprop.44+0x1c1/0x370
[ 8098.430187] ? iomap_write_end+0x91/0x290
[ 8098.430857] iomap_write_actor+0x92/0x170
[ 8098.431551] ? iomap_write_end+0x290/0x290
[ 8098.432255] iomap_apply+0xba/0x130
[ 8098.432851] ? iomap_write_end+0x290/0x290
[ 8098.433555] iomap_file_buffered_write+0x62/0x90
[ 8098.434334] ? iomap_write_end+0x290/0x290
[ 8098.435085] xfs_file_buffered_aio_write+0xd6/0x390 [xfs]
[ 8098.436051] ? xfs_file_buffered_aio_read+0x5b/0x120 [xfs]
[ 8098.436958] new_sync_write+0x124/0x170
[ 8098.437610] vfs_write+0xa5/0x1a0
[ 8098.438206] ksys_write+0x52/0xc0
[ 8098.438780] do_syscall_64+0x87/0x1a0
[ 8098.439426] entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 8098.440271] RIP: 0033:0x7f74c8a0caa8
[ 8098.440903] Code: Bad RIP value.
[ 8098.441464] RSP: 002b:00007ffeb5854a28 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 8098.442668] RAX: ffffffffffffffda RBX: 00000000016fb000 RCX: 00007f74c8a0caa8
[ 8098.443808] RDX: 00000000016fb000 RSI: 00007f74b0bfb478 RDI: 0000000000000006
[ 8098.444967] RBP: 00007f74b0bfb478 R08: fffffffffffffff8 R09: 00000000006fbd04
[ 8098.446108] R10: 00005593590ba700 R11: 0000000000000246 R12: 0000559358308fa0
[ 8098.447247] R13: 00000000016fb000 R14: 00007f74c8cd8880 R15: 00000000016fb000
```

Actual results:
There are Call Traces in the RT guest dmesg.

Expected results:
There should be no error info in the RT guest dmesg.

Additional info:
1. This Call Trace only shows in the RT guest; the RT host works well.