Bug 1843410
| Summary: | RT guest get Call Trace "INFO: task ld:274531 blocked for more than 600 seconds." |
|---|---|
| Product: | Red Hat Enterprise Linux 8 |
| Component: | kernel-rt |
| kernel-rt sub component: | Scheduler |
| Status: | CLOSED ERRATA |
| Severity: | high |
| Priority: | high |
| Version: | 8.3 |
| Keywords: | ZStream |
| Target Milestone: | rc |
| Target Release: | 8.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Fixed In Version: | kernel-rt-4.18.0-220.rt7.32.el8 |
| Doc Type: | If docs needed, set a value |
| Reporter: | Pei Zhang <pezhang> |
| Assignee: | Marcelo Tosatti <mtosatti> |
| QA Contact: | Pei Zhang <pezhang> |
| CC: | bhu, chayang, jinzhao, jlelli, juzhang, kcarcia, knoel, lcapitulino, mtosatti, nilal, peterx, virt-maint, williams |
| Flags: | pm-rhel: mirror+ |
| Cloned to: | 1855779 (view as bug list) |
| Bug Blocks: | 1817732, 1823810, 1855779 |
| Last Closed: | 2020-11-04 02:25:41 UTC |
| Type: | Bug |
| Attachments: | 1694764 (VM Call Trace info), 1695007 (Another Call Trace info in guest) |
Clark,

By looking at the trace, my initial guess for this issue would be that drain_all_pages() or flush_work() is scheduling work on the CPU where oslat is running (which is isolated). In other words, work that originated on a housekeeping CPU is being scheduled on an isolated CPU. That will cause a stall, since oslat won't let the kworker run.

Now, I might be completely wrong, but it rings a bell that we used to have a patch in RHEL7 RT that prevented either drain_all_pages() or flush_work() from scheduling work on isolated CPUs. I hope I'm not just creating noise, but if this is correct then this issue should be considered serious, since it will affect DPDK workloads on RHEL 8.2 the same way.

Created attachment 1695007 [details]
Another Call Trace info in guest

The "(3) Multiple VMs, each with 1 RT vCPU" testing has finished. There is a Call Trace in one of the four guests. It looks a bit different from the one above. I'm not sure whether they are the same issue, so I also paste it here; the full dmesg is attached.
Luiz, Clark,

Please let me know if these are different issues; if so, I can file a new bug to track it. Thanks.
[ 4363.236900] INFO: task cc1:77565 blocked for more than 600 seconds.
[ 4363.238025] Not tainted 4.18.0-208.rt5.20.el8.x86_64 #1
[ 4363.239458] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4363.240796] cc1 D 0 77565 77560 0x00084080
[ 4363.241784] Call Trace:
[ 4363.242363] __schedule+0x342/0x850
[ 4363.243059] schedule+0x39/0xd0
[ 4363.243609] schedule_timeout+0x20e/0x410
[ 4363.245349] ? __schedule+0x34a/0x850
[ 4363.247042] ? ___preempt_schedule+0x16/0x18
[ 4363.247757] wait_for_completion+0x85/0xe0
[ 4363.249505] flush_work+0x11a/0x1c0
[ 4363.251160] ? flush_workqueue_prep_pwqs+0x130/0x130
[ 4363.253032] drain_all_pages+0x140/0x190
[ 4363.253703] __alloc_pages_slowpath+0x3f8/0xe20
[ 4363.255508] __alloc_pages_nodemask+0x1ca/0x2b0
[ 4363.257310] alloc_pages_vma+0xc5/0x170
[ 4363.260023] ? __switch_to_asm+0x35/0x70
[ 4363.260700] __handle_mm_fault+0x5a5/0xa70
[ 4363.262451] ? __switch_to_asm+0x35/0x70
[ 4363.264165] ? __switch_to_asm+0x41/0x70
[ 4363.264827] handle_mm_fault+0xd2/0x1e0
[ 4363.266536] __do_page_fault+0x28e/0x5d0
[ 4363.268246] do_page_fault+0x47/0x1b0
[ 4363.269925] ? async_page_fault+0x8/0x30
[ 4363.270598] async_page_fault+0x1e/0x30
[ 4363.272298] RIP: 0033:0x56115b27ffb0
[ 4363.274020] Code: Bad RIP value.
[ 4363.274580] RSP: 002b:00007ffe5c70d8d0 EFLAGS: 00010202
[ 4363.276479] RAX: 00007feb25025000 RBX: 00007feb24f01370 RCX: 000000000000000c
[ 4363.278653] RDX: 000056115beb4c40 RSI: 00007feb25213cb0 RDI: 0000000000002227
[ 4363.281835] RBP: 00007feb26692c60 R08: 0000000000000001 R09: 0000000000000002
[ 4363.285054] R10: 00007feb25022400 R11: 0000000000000006 R12: 0000000000000000
[ 4363.287225] R13: 00007feb24f013b8 R14: 00007ffe5c70d8f0 R15: 00007feb25025008
[ 4985.829739] INFO: task cc1:77565 blocked for more than 600 seconds.
[ 4985.830822] Not tainted 4.18.0-208.rt5.20.el8.x86_64 #1
[ 4985.832039] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4985.833382] cc1 D 0 77565 77560 0x00084080
[ 4985.834375] Call Trace:
[ 4985.834930] __schedule+0x342/0x850
[ 4985.835534] schedule+0x39/0xd0
[ 4985.836225] schedule_timeout+0x20e/0x410
[ 4985.838968] ? __schedule+0x34a/0x850
[ 4985.839595] ? ___preempt_schedule+0x16/0x18
[ 4985.842383] wait_for_completion+0x85/0xe0
[ 4985.845132] flush_work+0x11a/0x1c0
[ 4985.847752] ? flush_workqueue_prep_pwqs+0x130/0x130
[ 4985.848572] drain_all_pages+0x140/0x190
[ 4985.852288] __alloc_pages_slowpath+0x3f8/0xe20
[ 4985.855085] __alloc_pages_nodemask+0x1ca/0x2b0
[ 4985.857889] alloc_pages_vma+0xc5/0x170
[ 4985.858541] ? __switch_to_asm+0x35/0x70
[ 4985.860546] __handle_mm_fault+0x5a5/0xa70
[ 4985.862838] ? __switch_to_asm+0x35/0x70
[ 4985.863498] ? __switch_to_asm+0x41/0x70
[ 4985.866215] handle_mm_fault+0xd2/0x1e0
[ 4985.867911] __do_page_fault+0x28e/0x5d0
[ 4985.868575] do_page_fault+0x47/0x1b0
[ 4985.870238] ? async_page_fault+0x8/0x30
[ 4985.871946] async_page_fault+0x1e/0x30
[ 4985.872601] RIP: 0033:0x56115b27ffb0
[ 4985.874263] Code: Bad RIP value.
[ 4985.875868] RSP: 002b:00007ffe5c70d8d0 EFLAGS: 00010202
[ 4985.877749] RAX: 00007feb25025000 RBX: 00007feb24f01370 RCX: 000000000000000c
[ 4985.879910] RDX: 000056115beb4c40 RSI: 00007feb25213cb0 RDI: 0000000000002227
[ 4985.882072] RBP: 00007feb26692c60 R08: 0000000000000001 R09: 0000000000000002
[ 4985.885242] R10: 00007feb25022400 R11: 0000000000000006 R12: 0000000000000000
[ 4985.887400] R13: 00007feb24f013b8 R14: 00007ffe5c70d8f0 R15: 00007feb25025008
drain_all_pages() does:

```c
	for_each_cpu(cpu, &cpus_with_pcps) {
		struct pcpu_drain *drain = per_cpu_ptr(&pcpu_drain, cpu);

		drain->zone = zone;
		INIT_WORK(&drain->work, drain_local_pages_wq);
		queue_work_on(cpu, mm_percpu_wq, &drain->work);
	}
	for_each_cpu(cpu, &cpus_with_pcps)
		flush_work(&per_cpu_ptr(&pcpu_drain, cpu)->work);
```
And the worker thread can't run on an isolated CPU while oslat (or DPDK with a SCHED_FIFO thread) is running there, so the flush_work() above never returns for those CPUs.
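To make the failure mode concrete, here is a hypothetical, self-contained illustration (not from this report): a toy module that queues a work item on a chosen CPU and then flushes it. If `target_cpu` is an isolated CPU monopolized by a SCHED_FIFO task such as oslat, the flush blocks indefinitely, which is exactly the shape of the hung-task traces above.

```c
/*
 * Illustration only -- do not load on a machine you care about.
 * If target_cpu's kworker is starved by a SCHED_FIFO spinner,
 * flush_work() below never completes and insmod hangs, eventually
 * triggering the same hung-task report seen in this bug.
 */
#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/smp.h>

static int target_cpu;
module_param(target_cpu, int, 0444);
MODULE_PARM_DESC(target_cpu, "CPU to queue the demo work item on");

static struct work_struct demo_work;

static void demo_fn(struct work_struct *work)
{
	pr_info("demo work ran on CPU %d\n", raw_smp_processor_id());
}

static int __init demo_init(void)
{
	INIT_WORK(&demo_work, demo_fn);
	queue_work_on(target_cpu, system_wq, &demo_work);
	flush_work(&demo_work);	/* blocks until kworker/<target_cpu> runs */
	return 0;
}

static void __exit demo_exit(void)
{
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
```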
One way to fix it would be to make drain_all_pages() aware of isolated CPUs, probably by deriving the set of housekeeping CPUs via housekeeping_cpumask().
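A minimal sketch of that idea, assuming the housekeeping_cpumask()/HK_FLAG_WQ interface from <linux/sched/isolation.h> is the right one to consult here (an assumption; this is not the patch that eventually shipped):

```c
#include <linux/sched/isolation.h>

	/*
	 * Sketch: mask out isolated CPUs before queueing drain work,
	 * so no kworker is ever scheduled on them by this path.
	 */
	cpumask_and(&cpus_with_pcps, &cpus_with_pcps,
		    housekeeping_cpumask(HK_FLAG_WQ));

	for_each_cpu(cpu, &cpus_with_pcps) {
		struct pcpu_drain *drain = per_cpu_ptr(&pcpu_drain, cpu);

		drain->zone = zone;
		INIT_WORK(&drain->work, drain_local_pages_wq);
		queue_work_on(cpu, mm_percpu_wq, &drain->work);
	}
	for_each_cpu(cpu, &cpus_with_pcps)
		flush_work(&per_cpu_ptr(&pcpu_drain, cpu)->work);
```

Note the trade-off: with this sketch, the isolated CPUs' per-CPU free lists are simply never drained by this path and would need some other mechanism.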
If so, then there are other callers of queue_work_on() that may not respect CPU isolation either, and perhaps we should look into them as well.
e.g. mm/swap.c (lru_add_drain_all()):

```c
	for_each_online_cpu(cpu) {
		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
		...
		INIT_WORK(work, lru_add_drain_per_cpu);
		queue_work_on(cpu, mm_percpu_wq, work);
	}
```
However, we have to test this change extensively to make sure that we are not breaking anything.
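For illustration only, the same housekeeping filter applied to the lru_add_drain_all() loop above could look like this sketch (again assuming HK_FLAG_WQ is the appropriate flag, and using housekeeping_cpu() from <linux/sched/isolation.h>):

```c
	for_each_online_cpu(cpu) {
		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

		/* Sketch: skip isolated CPUs instead of queueing work there. */
		if (!housekeeping_cpu(cpu, HK_FLAG_WQ))
			continue;

		INIT_WORK(work, lru_add_drain_per_cpu);
		queue_work_on(cpu, mm_percpu_wq, work);
	}
```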
Options:

1) Do a first pass on non-isolated CPUs, trying to drain their per-CPU queues. Not guaranteed to work (but perhaps a workable intermediate fix).

2) Reduce the limit of pages in the per-CPU free queue to a value that is not very significant (I don't think it's a good idea: if there is a free page somewhere, we should be able to use it before OOM'ing).

3) Drain pages when entering "task isolated" mode (a sketch of this idea follows at the end of these comments). From the task isolation patchset:

```
+	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
+	lru_add_drain();
```

(In reply to Marcelo Tosatti from comment #6)
> Options:
>
> 1) Do a first pass on non-isolated CPUs trying to drain their per-CPU
> queues. Not guaranteed to work (but perhaps a workable intermediate fix).

Yes, I agree there is a fair chance that this will fail. Since you mentioned it, I don't think I favor this option now.

> 2) Reduce the limit of pages in per-CPU free queue to a value that is
> not very significant (don't think its a good idea: if there is a free page
> somewhere, should be able to use it before OOM'ing).

+1, we should avoid doing this.

> 3) Drain pages when entering "task isolated" mode.

It sounds like a decent idea, but I will also go through the code to understand whether there are any downsides to it.

Verified with 4.18.0-221.rt7.33.el8.x86_64: no Call Trace in the RT guest.

(2) Single VM with 8 RT vCPUs:
Max Latency: 9 8 8 8 8 8 8 8 (us)

==Versions==
kernel-rt-4.18.0-221.rt7.33.el8.x86_64
qemu-kvm-5.0.0-0.scrmod+el8.3.0+7241+ccee766e.wrb200701.x86_64
tuned-2.14.0-2.el8.noarch
libvirt-6.4.0-1.scrmod+el8.3.0+7241+ccee766e.x86_64
python3-libvirt-6.3.0-1.module+el8.3.0+6485+695fc960.x86_64
microcode_ctl-20200609-2.el8.x86_64
rt-tests-1.8-3.el8.x86_64

So this bug has been fixed very well. Moving to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: kernel-rt security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4609
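As promised above, a sketch of option 3: drain the CPU's deferred per-CPU state once, when a task enters isolated mode, so nothing needs to queue drain work there later. task_isolation_enter() is a hypothetical hook modeled on the never-merged task isolation patchset; lru_add_drain() and drain_local_pages() are existing kernel functions.

```c
/*
 * Hypothetical hook (the task isolation patchset never merged
 * upstream): drain this CPU's per-CPU caches when a task enters
 * isolated mode, so later allocations on other CPUs find nothing
 * to flush here.
 */
static void task_isolation_enter(void)
{
	/* Drain the pagevecs to avoid unnecessary IPI flushes later. */
	lru_add_drain();

	/* Also drain this CPU's per-CPU free-page lists (all zones). */
	drain_local_pages(NULL);
}
```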
Created attachment 1694764 [details]
VM Call Trace info

Description of problem:
During KVM-RT oslat testing, there is Call Trace info in the RT guest.

Version-Release number of selected component (if applicable):
4.18.0-208.rt5.20.el8.x86_64
qemu-img-5.0.0-0.scrmod+el8.3.0+6791+fbd26444.wrb200527.x86_64
libvirt-6.3.0-1.scrmod+el8.3.0+6791+fbd26444.x86_64
tuned-2.13.0-6.el8.noarch

How reproducible:
1/1

Steps to Reproduce:
1. Set up the RT host.
2. Boot the RT guest.
3. Start 12h oslat testing in the guest. Meanwhile, compile the kernel on both host and guest housekeeping cores.
# /home/nfv-virt-rt-kvm/tools/oslat --cpu-list 2,3,4,5,6,7,8,9 --rtprio 1 --runtime 12h
4. Check the guest dmesg log; there are Call Traces. The full log is attached.

```
[ 7475.821066] INFO: task ld:274531 blocked for more than 600 seconds.
[ 7475.822157] Not tainted 4.18.0-208.rt5.20.el8.x86_64 #1
[ 7475.823094] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7475.824392] ld D 0 274531 274530 0x00084080
[ 7475.825307] Call Trace:
[ 7475.825761] __schedule+0x342/0x850
[ 7475.826377] schedule+0x39/0xd0
[ 7475.826923] schedule_timeout+0x20e/0x410
[ 7475.827610] ? __schedule+0x34a/0x850
[ 7475.828247] ? ___preempt_schedule+0x16/0x18
[ 7475.828953] wait_for_completion+0x85/0xe0
[ 7475.829653] flush_work+0x11a/0x1c0
[ 7475.830313] ? flush_workqueue_prep_pwqs+0x130/0x130
[ 7475.831148] drain_all_pages+0x140/0x190
[ 7475.831803] __alloc_pages_slowpath+0x3f8/0xe20
[ 7475.832571] ? mem_cgroup_commit_charge+0xcb/0x510
[ 7475.833371] __alloc_pages_nodemask+0x1ca/0x2b0
[ 7475.834134] pagecache_get_page+0xb5/0x2d0
[ 7475.834814] ? account_page_dirtied+0x11a/0x220
[ 7475.835579] grab_cache_page_write_begin+0x1f/0x40
[ 7475.836379] iomap_write_begin.constprop.44+0x1c1/0x370
[ 7475.837241] ? iomap_write_end+0x91/0x290
[ 7475.837911] iomap_write_actor+0x92/0x170
[ 7475.838603] ? iomap_write_end+0x290/0x290
[ 7475.839310] iomap_apply+0xba/0x130
[ 7475.839905] ? iomap_write_end+0x290/0x290
[ 7475.840610] iomap_file_buffered_write+0x62/0x90
[ 7475.841385] ? iomap_write_end+0x290/0x290
[ 7475.842132] xfs_file_buffered_aio_write+0xd6/0x390 [xfs]
[ 7475.843057] ? xfs_file_buffered_aio_read+0x5b/0x120 [xfs]
[ 7475.843941] new_sync_write+0x124/0x170
[ 7475.844610] vfs_write+0xa5/0x1a0
[ 7475.845201] ksys_write+0x52/0xc0
[ 7475.845770] do_syscall_64+0x87/0x1a0
[ 7475.846411] entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 7475.847256] RIP: 0033:0x7f74c8a0caa8
[ 7475.847865] Code: Bad RIP value.
[ 7475.848447] RSP: 002b:00007ffeb5854a28 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 7475.849644] RAX: ffffffffffffffda RBX: 00000000016fb000 RCX: 00007f74c8a0caa8
[ 7475.850780] RDX: 00000000016fb000 RSI: 00007f74b0bfb478 RDI: 0000000000000006
[ 7475.851921] RBP: 00007f74b0bfb478 R08: fffffffffffffff8 R09: 00000000006fbd04
[ 7475.853069] R10: 00005593590ba700 R11: 0000000000000246 R12: 0000559358308fa0
[ 7475.854206] R13: 00000000016fb000 R14: 00007f74c8cd8880 R15: 00000000016fb000
[ 8098.413957] INFO: task ld:274531 blocked for more than 600 seconds.
[ 8098.415126] Not tainted 4.18.0-208.rt5.20.el8.x86_64 #1
[ 8098.416079] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 8098.417319] ld D 0 274531 274530 0x00084080
[ 8098.418236] Call Trace:
[ 8098.418690] __schedule+0x342/0x850
[ 8098.419312] schedule+0x39/0xd0
[ 8098.419860] schedule_timeout+0x20e/0x410
[ 8098.420558] ? __schedule+0x34a/0x850
[ 8098.421205] ? ___preempt_schedule+0x16/0x18
[ 8098.421932] wait_for_completion+0x85/0xe0
[ 8098.422618] flush_work+0x11a/0x1c0
[ 8098.423235] ? flush_workqueue_prep_pwqs+0x130/0x130
[ 8098.424067] drain_all_pages+0x140/0x190
[ 8098.424725] __alloc_pages_slowpath+0x3f8/0xe20
[ 8098.425492] ? mem_cgroup_commit_charge+0xcb/0x510
[ 8098.426297] __alloc_pages_nodemask+0x1ca/0x2b0
[ 8098.427064] pagecache_get_page+0xb5/0x2d0
[ 8098.427747] ? account_page_dirtied+0x11a/0x220
[ 8098.428517] grab_cache_page_write_begin+0x1f/0x40
[ 8098.429324] iomap_write_begin.constprop.44+0x1c1/0x370
[ 8098.430187] ? iomap_write_end+0x91/0x290
[ 8098.430857] iomap_write_actor+0x92/0x170
[ 8098.431551] ? iomap_write_end+0x290/0x290
[ 8098.432255] iomap_apply+0xba/0x130
[ 8098.432851] ? iomap_write_end+0x290/0x290
[ 8098.433555] iomap_file_buffered_write+0x62/0x90
[ 8098.434334] ? iomap_write_end+0x290/0x290
[ 8098.435085] xfs_file_buffered_aio_write+0xd6/0x390 [xfs]
[ 8098.436051] ? xfs_file_buffered_aio_read+0x5b/0x120 [xfs]
[ 8098.436958] new_sync_write+0x124/0x170
[ 8098.437610] vfs_write+0xa5/0x1a0
[ 8098.438206] ksys_write+0x52/0xc0
[ 8098.438780] do_syscall_64+0x87/0x1a0
[ 8098.439426] entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 8098.440271] RIP: 0033:0x7f74c8a0caa8
[ 8098.440903] Code: Bad RIP value.
[ 8098.441464] RSP: 002b:00007ffeb5854a28 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 8098.442668] RAX: ffffffffffffffda RBX: 00000000016fb000 RCX: 00007f74c8a0caa8
[ 8098.443808] RDX: 00000000016fb000 RSI: 00007f74b0bfb478 RDI: 0000000000000006
[ 8098.444967] RBP: 00007f74b0bfb478 R08: fffffffffffffff8 R09: 00000000006fbd04
[ 8098.446108] R10: 00005593590ba700 R11: 0000000000000246 R12: 0000559358308fa0
[ 8098.447247] R13: 00000000016fb000 R14: 00007f74c8cd8880 R15: 00000000016fb000
```

Actual results:
There are Call Traces in the RT guest dmesg.

Expected results:
There should be no error info in the RT guest dmesg.

Additional info:
1. This Call Trace only shows in the RT guest; the RT host works well.