Bug 1897869 - Running oslat in RT guest, guest kernel shows Call Trace: INFO: task kcompactd0:35 blocked for more than 600 seconds.
Summary: Running oslat in RT guest, guest kernel shows Call Trace: INFO: task kcompact...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: kernel-rt
Version: 8.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 8.4
Assignee: Marcelo Tosatti
QA Contact: Pei Zhang
URL:
Whiteboard:
Depends On:
Blocks: 1883636 1898189
Reported: 2020-11-15 05:28 UTC by Pei Zhang
Modified: 2023-08-08 03:01 UTC
CC List: 15 users

Fixed In Version: kernel-rt-4.18.0-288.rt7.53.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-18 15:12:53 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
  Full dmesg log of RT guest (51.21 KB, text/plain), 2020-11-15 05:28 UTC, Pei Zhang
  Full Call Trace dmesg log with 4.18.0-272.rt7.37.el8.x86_64 (49.84 KB, text/plain), 2021-01-12 07:07 UTC, Pei Zhang
  drain_all_pages + local_irq_save fix (6.14 KB, patch), 2021-01-27 11:21 UTC, Marcelo Tosatti

Description Pei Zhang 2020-11-15 05:28:37 UTC
Created attachment 1729461 [details]
Full dmesg log of RT guest

Description of problem:
This issue is triggered during regular kvm-rt testing with oslat. When oslat runs in the RT guest, the RT guest kernel shows a Call Trace error.

Version-Release number of selected component (if applicable):
kernel-rt-4.18.0-249.rt4.14.el8.x86_64
qemu-kvm-5.1.0-14.module+el8.3.0+8438+644aff69.x86_64
tuned-2.14.0-3.el8.noarch
python3-libvirt-6.6.0-1.module+el8.3.0+7572+bcbf6b90.x86_64
microcode_ctl-20201027-1.el8.x86_64
rt-tests-1.8-11.el8.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Setup RT host

2. Setup RT guest

3. In the RT guest, run oslat on the RT vCPUs while compiling the kernel on the housekeeping vCPUs in parallel.

# oslat --cpu-list 1 --rtprio 1 --runtime 12h 

# while true; do cd /home/nfv-virt-rt-kvm/src/kernel-rt/linux-4.18.0-249.rt4.14.el8; make -j2>/dev/null; make clean>/dev/null; done


4. Check dmesg in the RT guest; Call Trace messages show up.

# dmesg
...

[ 4362.264644] INFO: task kcompactd0:35 blocked for more than 600 seconds.
[ 4362.265735]       Not tainted 4.18.0-249.rt4.14.el8.x86_64 #1
[ 4362.266649] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4362.267857] kcompactd0      D    0    35      2 0x80084000
[ 4362.268739] Call Trace:
[ 4362.269195]  __schedule+0x34d/0x8a0
[ 4362.269796]  ? migrate_enable+0x118/0x3a0
[ 4362.270456]  schedule+0x56/0xe0
[ 4362.271003]  __rt_mutex_slowlock+0xbe/0x130
[ 4362.275672]  rt_mutex_slowlock_locked+0xbc/0x260
[ 4362.276451]  rt_mutex_slowlock.constprop.31+0x5f/0x90
[ 4362.277271]  drain_all_pages+0x184/0x190
[ 4362.277927]  kcompactd_do_work+0x26f/0x330
[ 4362.278603]  ? kcompactd_do_work+0x330/0x330
[ 4362.285985]  kcompactd+0x8b/0x200
[ 4362.286555]  ? finish_wait+0x70/0x70
[ 4362.287420]  kthread+0x15d/0x180
[ 4362.287999]  ? __kthread_parkme+0xa0/0xa0
[ 4362.288674]  ret_from_fork+0x35/0x40
[ 4362.289302] INFO: task cc1:83995 blocked for more than 600 seconds.
[ 4362.290286]       Not tainted 4.18.0-249.rt4.14.el8.x86_64 #1
[ 4362.295064] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4362.296273] cc1             D    0 83995  83993 0x00084080
[ 4362.300874] Call Trace:
[ 4362.301353]  __schedule+0x34d/0x8a0
[ 4362.301967]  schedule+0x56/0xe0
[ 4362.302505]  schedule_timeout+0x250/0x400
[ 4362.306974]  ? preempt_count_add+0x68/0xa0
[ 4362.307666]  ? _raw_spin_unlock_irqrestore+0x20/0x60
[ 4362.308470]  wait_for_completion+0x85/0xe0
[ 4362.309154]  flush_work+0x123/0x1d0
[ 4362.309752]  ? flush_workqueue_prep_pwqs+0x130/0x130
[ 4362.310546]  ? __queue_work+0x1aa/0x590
[ 4362.323271]  drain_all_pages+0x140/0x190
[ 4362.324052]  __alloc_pages_slowpath+0x44c/0xe40
[ 4362.324862]  __alloc_pages_nodemask+0x208/0x2f0
[ 4362.325598]  alloc_pages_vma+0xc5/0x170
[ 4362.326247]  __handle_mm_fault+0x618/0xb50
[ 4362.328971]  ? __switch_to_asm+0x35/0x70
[ 4362.329636]  handle_mm_fault+0xd2/0x1e0
[ 4362.330275]  __do_page_fault+0x28e/0x5d0
[ 4362.334994]  do_page_fault+0x47/0x1b0
[ 4362.335626]  ? async_page_fault+0x8/0x30
[ 4362.336275]  async_page_fault+0x1e/0x30
[ 4362.336921] RIP: 0033:0x56128bd8e938
[ 4362.337523] Code: Bad RIP value.
[ 4362.338080] RSP: 002b:00007fffb9217fd0 EFLAGS: 00010216
[ 4362.350981] RAX: 00007feeed5cd000 RBX: 0000000000000090 RCX: 000000000000000c
[ 4362.352106] RDX: 00007feeed5cd000 RSI: 000000000000001b RDI: 0000000000002424
[ 4362.355258] RBP: 0000000000000003 R08: 0000000000000001 R09: 0000000010000000
[ 4362.356403] R10: 0000000010000001 R11: 0000000000000006 R12: 000000001f955ef3
[ 4362.357513] R13: 00007feeed5bffa0 R14: 00007feeeffa45e8 R15: 0000000000000003
[ 4984.857254] INFO: task kcompactd0:35 blocked for more than 600 seconds.
[ 4984.858350]       Not tainted 4.18.0-249.rt4.14.el8.x86_64 #1
[ 4984.859265] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4984.862392] kcompactd0      D    0    35      2 0x80084000
[ 4984.863271] Call Trace:
[ 4984.863718]  __schedule+0x34d/0x8a0
[ 4984.864575]  ? migrate_enable+0x118/0x3a0
[ 4984.865253]  schedule+0x56/0xe0
[ 4984.865793]  __rt_mutex_slowlock+0xbe/0x130
[ 4984.866491]  rt_mutex_slowlock_locked+0xbc/0x260
[ 4984.867247]  rt_mutex_slowlock.constprop.31+0x5f/0x90
[ 4984.868057]  drain_all_pages+0x184/0x190
[ 4984.872802]  kcompactd_do_work+0x26f/0x330
[ 4984.876563]  ? kcompactd_do_work+0x330/0x330
[ 4984.877280]  kcompactd+0x8b/0x200
[ 4984.877850]  ? finish_wait+0x70/0x70
[ 4984.878465]  kthread+0x15d/0x180
[ 4984.879019]  ? __kthread_parkme+0xa0/0xa0
[ 4984.879693]  ret_from_fork+0x35/0x40
[ 4984.884423] INFO: task cc1:83995 blocked for more than 600 seconds.
[ 4984.885424]       Not tainted 4.18.0-249.rt4.14.el8.x86_64 #1
[ 4984.886339] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4984.887552] cc1             D    0 83995  83993 0x00084080
[ 4984.892510] Call Trace:
[ 4984.892962]  __schedule+0x34d/0x8a0
[ 4984.893568]  schedule+0x56/0xe0
[ 4984.894107]  schedule_timeout+0x250/0x400
[ 4984.894781]  ? preempt_count_add+0x68/0xa0
[ 4984.895476]  ? _raw_spin_unlock_irqrestore+0x20/0x60
[ 4984.900353]  wait_for_completion+0x85/0xe0
[ 4984.901043]  flush_work+0x123/0x1d0
[ 4984.901652]  ? flush_workqueue_prep_pwqs+0x130/0x130
[ 4984.902460]  ? __queue_work+0x1aa/0x590
[ 4984.903101]  drain_all_pages+0x140/0x190
[ 4984.903761]  __alloc_pages_slowpath+0x44c/0xe40
[ 4984.908589]  __alloc_pages_nodemask+0x208/0x2f0
[ 4984.909344]  alloc_pages_vma+0xc5/0x170
[ 4984.909984]  __handle_mm_fault+0x618/0xb50
[ 4984.910670]  ? __switch_to_asm+0x35/0x70
[ 4984.911332]  handle_mm_fault+0xd2/0x1e0
[ 4984.911970]  __do_page_fault+0x28e/0x5d0
[ 4984.916697]  do_page_fault+0x47/0x1b0
[ 4984.917330]  ? async_page_fault+0x8/0x30
[ 4984.917981]  async_page_fault+0x1e/0x30
[ 4984.918632] RIP: 0033:0x56128bd8e938
[ 4984.919249] Code: Bad RIP value.
[ 4984.919800] RSP: 002b:00007fffb9217fd0 EFLAGS: 00010216
[ 4984.924701] RAX: 00007feeed5cd000 RBX: 0000000000000090 RCX: 000000000000000c
[ 4984.925818] RDX: 00007feeed5cd000 RSI: 000000000000001b RDI: 0000000000002424
[ 4984.926930] RBP: 0000000000000003 R08: 0000000000000001 R09: 0000000010000000
[ 4984.928049] R10: 0000000010000001 R11: 0000000000000006 R12: 000000001f955ef3
[ 4984.933230] R13: 00007feeed5bffa0 R14: 00007feeeffa45e8 R15: 0000000000000003
[ 5607.449833] INFO: task kcompactd0:35 blocked for more than 600 seconds.
[ 5607.450967]       Not tainted 4.18.0-249.rt4.14.el8.x86_64 #1
[ 5607.451887] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5607.453095] kcompactd0      D    0    35      2 0x80084000
[ 5607.454541] Call Trace:
[ 5607.455005]  __schedule+0x34d/0x8a0
[ 5607.455593]  ? migrate_enable+0x118/0x3a0
[ 5607.456261]  schedule+0x56/0xe0
[ 5607.456807]  __rt_mutex_slowlock+0xbe/0x130
[ 5607.457491]  rt_mutex_slowlock_locked+0xbc/0x260
[ 5607.458539]  rt_mutex_slowlock.constprop.31+0x5f/0x90
[ 5607.459373]  drain_all_pages+0x184/0x190
[ 5607.460031]  kcompactd_do_work+0x26f/0x330
[ 5607.460704]  ? kcompactd_do_work+0x330/0x330
[ 5607.461449]  kcompactd+0x8b/0x200
[ 5607.462300]  ? finish_wait+0x70/0x70
[ 5607.462913]  kthread+0x15d/0x180
[ 5607.463464]  ? __kthread_parkme+0xa0/0xa0
[ 5607.464132]  ret_from_fork+0x35/0x40
[ 5607.464761] INFO: task cc1:83995 blocked for more than 600 seconds.
[ 5607.465750]       Not tainted 4.18.0-249.rt4.14.el8.x86_64 #1
[ 5607.466998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5607.468201] cc1             D    0 83995  83993 0x00084080
[ 5607.469078] Call Trace:
[ 5607.469521]  __schedule+0x34d/0x8a0
[ 5607.470338]  schedule+0x56/0xe0
[ 5607.470889]  schedule_timeout+0x250/0x400
[ 5607.471551]  ? preempt_count_add+0x68/0xa0
[ 5607.472237]  ? _raw_spin_unlock_irqrestore+0x20/0x60
[ 5607.473286]  wait_for_completion+0x85/0xe0
[ 5607.476072]  flush_work+0x123/0x1d0
[ 5607.482434]  ? flush_workqueue_prep_pwqs+0x130/0x130
[ 5607.483249]  ? __queue_work+0x1aa/0x590
[ 5607.483897]  drain_all_pages+0x140/0x190
[ 5607.484548]  __alloc_pages_slowpath+0x44c/0xe40
[ 5607.485293]  __alloc_pages_nodemask+0x208/0x2f0
[ 5607.490100]  alloc_pages_vma+0xc5/0x170
[ 5607.490741]  __handle_mm_fault+0x618/0xb50
[ 5607.491429]  ? __switch_to_asm+0x35/0x70
[ 5607.492091]  handle_mm_fault+0xd2/0x1e0
[ 5607.492728]  __do_page_fault+0x28e/0x5d0
[ 5607.493387]  do_page_fault+0x47/0x1b0
[ 5607.498087]  ? async_page_fault+0x8/0x30
[ 5607.498740]  async_page_fault+0x1e/0x30
[ 5607.499394] RIP: 0033:0x56128bd8e938
[ 5607.500008] Code: Bad RIP value.
[ 5607.500557] RSP: 002b:00007fffb9217fd0 EFLAGS: 00010216
[ 5607.501399] RAX: 00007feeed5cd000 RBX: 0000000000000090 RCX: 000000000000000c
[ 5607.506562] RDX: 00007feeed5cd000 RSI: 000000000000001b RDI: 0000000000002424
[ 5607.507674] RBP: 0000000000000003 R08: 0000000000000001 R09: 0000000010000000
[ 5607.508783] R10: 0000000010000001 R11: 0000000000000006 R12: 000000001f955ef3
[ 5607.517970] R13: 00007feeed5bffa0 R14: 00007feeeffa45e8 R15: 0000000000000003
[ 6230.042436] INFO: task kcompactd0:35 blocked for more than 600 seconds.
[ 6230.043529]       Not tainted 4.18.0-249.rt4.14.el8.x86_64 #1
[ 6230.044705] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6230.045938] kcompactd0      D    0    35      2 0x80084000
[ 6230.046817] Call Trace:
[ 6230.047265]  __schedule+0x34d/0x8a0
[ 6230.047867]  ? migrate_enable+0x118/0x3a0
[ 6230.048805]  schedule+0x56/0xe0
[ 6230.049349]  __rt_mutex_slowlock+0xbe/0x130
[ 6230.050053]  rt_mutex_slowlock_locked+0xbc/0x260
[ 6230.050806]  rt_mutex_slowlock.constprop.31+0x5f/0x90
[ 6230.051624]  drain_all_pages+0x184/0x190
[ 6230.052272]  kcompactd_do_work+0x26f/0x330
[ 6230.057014]  ? kcompactd_do_work+0x330/0x330
[ 6230.057740]  kcompactd+0x8b/0x200
[ 6230.058309]  ? finish_wait+0x70/0x70
[ 6230.058920]  kthread+0x15d/0x180
[ 6230.059481]  ? __kthread_parkme+0xa0/0xa0
[ 6230.060140]  ret_from_fork+0x35/0x40
[ 6230.068854] INFO: task cc1:83995 blocked for more than 600 seconds.
[ 6230.069853]       Not tainted 4.18.0-249.rt4.14.el8.x86_64 #1
[ 6230.070766] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6230.071973] cc1             D    0 83995  83993 0x00084080
[ 6230.073157] Call Trace:
[ 6230.073619]  __schedule+0x34d/0x8a0
[ 6230.074208]  schedule+0x56/0xe0
[ 6230.076804]  schedule_timeout+0x250/0x400
[ 6230.077491]  ? preempt_count_add+0x68/0xa0
[ 6230.078166]  ? _raw_spin_unlock_irqrestore+0x20/0x60
[ 6230.084698]  wait_for_completion+0x85/0xe0
[ 6230.085384]  flush_work+0x123/0x1d0
[ 6230.085983]  ? flush_workqueue_prep_pwqs+0x130/0x130
[ 6230.086801]  ? __queue_work+0x1aa/0x590
[ 6230.087460]  drain_all_pages+0x140/0x190
[ 6230.088115]  __alloc_pages_slowpath+0x44c/0xe40
[ 6230.090916]  __alloc_pages_nodemask+0x208/0x2f0
[ 6230.091680]  alloc_pages_vma+0xc5/0x170
[ 6230.092319]  __handle_mm_fault+0x618/0xb50
[ 6230.097059]  ? __switch_to_asm+0x35/0x70
[ 6230.097727]  handle_mm_fault+0xd2/0x1e0
[ 6230.098366]  __do_page_fault+0x28e/0x5d0
[ 6230.099024]  do_page_fault+0x47/0x1b0
[ 6230.099648]  ? async_page_fault+0x8/0x30
[ 6230.100296]  async_page_fault+0x1e/0x30
[ 6230.105005] RIP: 0033:0x56128bd8e938
[ 6230.105638] Code: Bad RIP value.
[ 6230.106190] RSP: 002b:00007fffb9217fd0 EFLAGS: 00010216
[ 6230.107293] RAX: 00007feeed5cd000 RBX: 0000000000000090 RCX: 000000000000000c
[ 6230.112461] RDX: 00007feeed5cd000 RSI: 000000000000001b RDI: 0000000000002424
[ 6230.113596] RBP: 0000000000000003 R08: 0000000000000001 R09: 0000000010000000
[ 6230.114707] R10: 0000000010000001 R11: 0000000000000006 R12: 000000001f955ef3
[ 6230.115815] R13: 00007feeed5bffa0 R14: 00007feeeffa45e8 R15: 0000000000000003
[ 6852.634992] INFO: task kcompactd0:35 blocked for more than 600 seconds.
[ 6852.636091]       Not tainted 4.18.0-249.rt4.14.el8.x86_64 #1
[ 6852.641392] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6852.642621] kcompactd0      D    0    35      2 0x80084000
[ 6852.643500] Call Trace:
[ 6852.643948]  __schedule+0x34d/0x8a0
[ 6852.644552]  ? migrate_enable+0x118/0x3a0
[ 6852.645510]  schedule+0x56/0xe0
[ 6852.646064]  __rt_mutex_slowlock+0xbe/0x130
[ 6852.646750]  rt_mutex_slowlock_locked+0xbc/0x260
[ 6852.647511]  rt_mutex_slowlock.constprop.31+0x5f/0x90
[ 6852.648330]  drain_all_pages+0x184/0x190
[ 6852.654084]  kcompactd_do_work+0x26f/0x330
[ 6852.654797]  ? kcompactd_do_work+0x330/0x330
[ 6852.655507]  kcompactd+0x8b/0x200
[ 6852.656081]  ? finish_wait+0x70/0x70
[ 6852.656681]  kthread+0x15d/0x180
[ 6852.666891]  ? __kthread_parkme+0xa0/0xa0
[ 6852.667571]  ret_from_fork+0x35/0x40
....


Actual results:
Call Trace messages appear in the RT guest kernel dmesg.

Expected results:
No Call Trace messages should be shown.

Additional info:
 1. Only oslat testing hits this issue; cyclictest testing looks good and does not hit this problem.

Comment 1 Pei Zhang 2020-11-15 09:27:34 UTC
Update:

This issue doesn't affect the oslat latency values; we still get the expected results.

==Results==
(1)Single VM with 1 rt vCPU:
 Max Latency:         8 (us)

(2)Single VM with 8 rt vCPUs:
 Max Latency:         10 10 10 9 10 9 9 9 (us)

(3)Multiple VMs each with 1 rt vCPU:
- VM1
 Max Latency:         8 (us)

- VM2
 Max Latency:         8 (us)

- VM3
 Max Latency:         8 (us)

- VM4
 Max Latency:         10 (us)

Comment 2 Juri Lelli 2020-11-16 06:20:32 UTC
I believe this is caused by the fact that we needed to revert Marcelo's
change because of https://bugzilla.redhat.com/show_bug.cgi?id=1894793#c7

I just quickly looked at the stack trace, but my suspicion is that
drain_all_pages queues its per-CPU drain work on isolated cores that are
running oslat busy threads, and thus gets starved by those.

I guess this is a sign we'd need to adapt Marcelo's change to 8.4.
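
To illustrate the suspected starvation, below is a rough user-space analogy. It is not
kernel code and every name in it is illustrative; it only assumes the pattern described
above: a SCHED_FIFO busy loop pinned to a CPU (standing in for oslat on an isolated
vCPU) keeps an ordinary worker pinned to the same CPU from running, so whoever waits
for that worker to finish (like the flush_work() in the cc1 trace) blocks.

/* starve.c: build with gcc -O2 -pthread starve.c -o starve, run as root so
 * SCHED_FIFO can be set. With default RT throttling the worker still gets
 * ~5% of the CPU and eventually finishes; on a system tuned for RT
 * (throttling disabled) the wait in main() never completes, which mirrors
 * the "blocked for more than 600 seconds" messages in dmesg. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define TARGET_CPU 1

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int work_requested, work_done;

static void pin_to_cpu(int cpu)
{
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        int err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        if (err)
                fprintf(stderr, "setaffinity: %s\n", strerror(err));
}

/* stands in for the oslat busy-poll thread on the isolated CPU */
static void *rt_spinner(void *arg)
{
        (void)arg;
        struct sched_param sp = { .sched_priority = 1 };
        pin_to_cpu(TARGET_CPU);
        int err = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
        if (err)
                fprintf(stderr, "setschedparam: %s\n", strerror(err));
        for (;;)
                ;               /* never yields the CPU */
}

/* stands in for the per-CPU drain work that the allocator waits for */
static void *drain_worker(void *arg)
{
        (void)arg;
        pin_to_cpu(TARGET_CPU);
        pthread_mutex_lock(&lock);
        while (!work_requested)
                pthread_cond_wait(&cond, &lock);
        work_done = 1;          /* the "drain" itself is trivial */
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_t spinner, worker;

        pthread_create(&worker, NULL, drain_worker, NULL);
        pthread_create(&spinner, NULL, rt_spinner, NULL);
        sleep(1);               /* let both threads settle on TARGET_CPU */

        /* the flush_work() analogue: request the work and wait for it */
        pthread_mutex_lock(&lock);
        work_requested = 1;
        pthread_cond_broadcast(&cond);
        while (!work_done)
                pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);

        printf("drain completed\n");
        pthread_join(worker, NULL);
        /* the spinner is never joined; stop the demo with Ctrl-C */
        return 0;
}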

Comment 9 Pei Zhang 2021-01-12 07:07:30 UTC
Created attachment 1746548 [details]
Full Call Trace dmesg log with 4.18.0-272.rt7.37.el8.x86_64

Comment 15 Pei Zhang 2021-01-19 03:43:07 UTC
(In reply to Pei Zhang from comment #1)
> Update:
> 
> This issue doesn't affect the oslat latency value, we still can get expected
> results.

In the latest 2 testing runs, the oslat max latency exceeds 15us (16us in run 1, 21us in run 2). So this issue also affects performance.

== Run 1:

Results:
(1)Single VM with 1 rt vCPU:
     Maximum:	 8 (us)

(2)Single VM with 8 rt vCPUs:
     Maximum:	 10 9 9 9 9 8 8 8 (us)

(3)Multiple VMs each with 1 rt vCPU:
- VM1
     Maximum:	 15 (us)

- VM2
     Maximum:	 16 (us)

- VM3
     Maximum:	 15 (us)

- VM4
     Maximum:	 14 (us)

Versions:
kernel-rt-4.18.0-269.rt7.34.el8.x86_64
qemu-kvm-5.2.0-2.scrmod+el8.4.0+9296+986afc72.wrb210106.x86_64
tuned-2.15.0-1.el8.noarch
libvirt-6.10.0-1.scrmod+el8.4.0+9296+986afc72.x86_64
python3-libvirt-6.6.0-1.module+el8.3.0+7572+bcbf6b90.x86_64
microcode_ctl-20201112-1.el8.x86_64
rt-tests-1.9-2.el8.x86_64


== Run 2:

Results:
(1)Single VM with 1 rt vCPU:
     Maximum:	 10 (us)

(2)Single VM with 8 rt vCPUs:
     Maximum:	 10 10 10 10 10 10 10 10 (us)

(3)Multiple VMs each with 1 rt vCPU:
- VM1
     Maximum:	 16 (us)

- VM2
     Maximum:	 17 (us)

- VM3
     Maximum:	 18 (us)

- VM4
     Maximum:	 21 (us)

Versions:
kernel-rt-4.18.0-272.rt7.37.el8.x86_64
qemu-kvm-5.2.0-2.scrmod+el8.4.0+9296+986afc72.wrb210106.x86_64
tuned-2.15.0-1.el8.noarch
libvirt-6.10.0-1.scrmod+el8.4.0+9296+986afc72.x86_64
microcode_ctl-20201112-1.el8.x86_64
rt-tests-1.9-2.el8.x86_64

Comment 19 Marcelo Tosatti 2021-01-26 20:55:27 UTC
Hi Pei, 

Can you please confirm this kernel-rt build fixes the problem?

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=34531073

Comment 20 Nitesh Narayan Lal 2021-01-26 21:11:27 UTC
Hi Marcelo,

I was looking through your last upstream posting of "re-enable remote
per-cpu-pages draining".

In that thread, Sebastian mentioned [1] that Joe Korty reported that the
remote draining patch-set somehow broke a workload for him.
I haven't been able to find the thread where Joe reported this; maybe
you will know.

Do you know if that issue still exists?

[1] https://www.spinics.net/lists/linux-rt-users/msg22238.html

Thanks
Nitesh

Comment 21 Pei Zhang 2021-01-27 10:31:04 UTC
(In reply to Marcelo Tosatti from comment #19)
> Hi Pei, 
> 
> Can you please confirm this kernel-rt build fixes the problem?
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=34531073

Hi Marcelo,

Testing with this build is ongoing now. Results will be updated tomorrow. Thanks.

Best regards,

Pei

Comment 22 Marcelo Tosatti 2021-01-27 11:20:23 UTC
(In reply to Nitesh Narayan Lal from comment #20)
> Hi Marcelo,
> 
> I was looking through your last upstream posting of "re-enable remote
> per-cpu-pages draining".
> 
> In that thread, Sebastian mentioned [1] that Joe Korty reported that the
> remote draining patch-set somehow broke a workload for him.
> I haven't been able to find out this thread where Joe reported this, maybe
> you will know.
> 
> Do you know if that issue still exists?
> 
> [1] https://www.spinics.net/lists/linux-rt-users/msg22238.html
> 
> Thanks
> Nitesh

Nitesh,

Haven't got access to Joe Korty's report, but for 
https://bugzilla.redhat.com/show_bug.cgi?id=1894793#c7

the issue was that for the new local lock interface, 
"local_lock_irqsave" was not disabling interrupts 
(attaching the patch).
Maybe it's the same issue.
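
To make the idea concrete, below is a rough user-space analogy of the "remote
per-cpu-pages draining" approach discussed above. It is not the kernel patch and all
names in it are illustrative; it only assumes the general idea that each per-CPU page
list is protected by a lock a remote CPU may take, so the CPU that needs the pages
drains the remote lists itself instead of queueing work on every CPU and waiting for
it, which is the step that starves on the isolated oslat CPUs.

/* remote_drain.c: build with gcc -O2 -pthread remote_drain.c */
#include <pthread.h>
#include <stdio.h>

#define NR_CPUS 4

/* analogue of one CPU's per-cpu free-page list plus the lock guarding it */
struct pcp_lists {
        pthread_mutex_t lock;
        int nr_free_pages;
};

static struct pcp_lists pcp[NR_CPUS] = {
        { PTHREAD_MUTEX_INITIALIZER, 0 },
        { PTHREAD_MUTEX_INITIALIZER, 0 },
        { PTHREAD_MUTEX_INITIALIZER, 0 },
        { PTHREAD_MUTEX_INITIALIZER, 0 },
};

/* fast path: a CPU frees a page into its own per-CPU list */
static void free_page_local(int cpu)
{
        pthread_mutex_lock(&pcp[cpu].lock);
        pcp[cpu].nr_free_pages++;
        pthread_mutex_unlock(&pcp[cpu].lock);
}

/*
 * Remote drain: the CPU that needs memory empties every per-CPU list
 * directly, under each list's lock. Nothing is queued on the other CPUs,
 * so a busy SCHED_FIFO thread on an isolated CPU cannot stall this call,
 * unlike the queue-work-and-flush_work() scheme seen in the traces above.
 */
static int drain_all_pages_remote(void)
{
        int cpu, drained = 0;

        for (cpu = 0; cpu < NR_CPUS; cpu++) {
                pthread_mutex_lock(&pcp[cpu].lock);
                drained += pcp[cpu].nr_free_pages;
                pcp[cpu].nr_free_pages = 0;
                pthread_mutex_unlock(&pcp[cpu].lock);
        }
        return drained;
}

int main(void)
{
        free_page_local(1);
        free_page_local(1);
        free_page_local(3);
        printf("drained %d pages back to the shared pool\n",
               drain_all_pages_remote());
        return 0;
}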

Comment 23 Marcelo Tosatti 2021-01-27 11:21:06 UTC
Created attachment 1751221 [details]
drain_all_pages + local_irq_save fix

Comment 24 Pei Zhang 2021-01-29 03:34:18 UTC
(In reply to Pei Zhang from comment #21)
> (In reply to Marcelo Tosatti from comment #19)
> > Hi Pei, 
> > 
> > Can you please confirm this kernel-rt build fixes the problem?
> > 
> > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=34531073
> 
> Hi Marcelo,
> 
> Testing with this build is ongoing now. Results will be updated tomorrow.
> Thanks.

Hello Marcelo,

4.18.0-275.rt7.40.el8rdrain2.x86_64 doesn't work very well.

Here are the testing results from our latest several runs:

1. The Call Trace issue is gone. There are no Call Traces in the RT guest any more.

2. There is an oslat latency spike (19us) in the 1 hour testing. This run was tested with "-w memmove -m 4K".

(1)Single VM with 1 rt vCPU:
     Maximum:	 10 (us)

(2)Single VM with 8 rt vCPUs:
     Maximum:	 10 10 19 10 10 10 10 10 (us)    
                       ^^

(3)Multiple VMs each with 1 rt vCPU:
- VM1
     Maximum:	 10 (us)

- VM2
     Maximum:	 10 (us)

- VM3
     Maximum:	 10 (us)

- VM4
     Maximum:	 10 (us)

==Versions==
kernel-rt-4.18.0-275.rt7.40.el8rdrain2.x86_64
qemu-kvm-5.2.0-3.scrmod+el8.4.0+9533+d0097f55.wrb210120.x86_64
tuned-2.15.0-1.el8.noarch
libvirt-7.0.0-2.scrmod+el8.4.0+9533+d0097f55.x86_64
python3-libvirt-6.10.0-1.module+el8.4.0+8948+a39b3f3a.x86_64
microcode_ctl-20201112-1.el8.x86_64
rt-tests-1.10-1.el8.x86_64


beaker job: https://beaker.engineering.redhat.com/jobs/5039943

3. We hit an RT guest hang during the oslat testing. This issue is not 100% reproducible; I hit it only once. As I cannot get access to the guest, there is no other info that can be provided now.


Best regards,

Pei

> 
> Best regards,
> 
> Pei

Comment 25 Marcelo Tosatti 2021-01-29 10:07:40 UTC
(In reply to Pei Zhang from comment #24)
> (In reply to Pei Zhang from comment #21)
> > (In reply to Marcelo Tosatti from comment #19)
> > > Hi Pei, 
> > > 
> > > Can you please confirm this kernel-rt build fixes the problem?
> > > 
> > > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=34531073
> > 
> > Hi Marcelo,
> > 
> > Testing with this build is ongoing now. Results will be updated tomorrow.
> > Thanks.
> 
> Hello Marcelo,
> 
> 4.18.0-275.rt7.40.el8rdrain2.x86_64 doesn't work very well.

Pei,

The patch is very broken. Should have a new brew build today.

Comment 26 Marcelo Tosatti 2021-01-29 18:37:15 UTC
(In reply to Marcelo Tosatti from comment #25)
> (In reply to Pei Zhang from comment #24)
> > (In reply to Pei Zhang from comment #21)
> > > (In reply to Marcelo Tosatti from comment #19)
> > > > Hi Pei, 
> > > > 
> > > > Can you please confirm this kernel-rt build fixes the problem?
> > > > 
> > > > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=34531073
> > > 
> > > Hi Marcelo,
> > > 
> > > Testing with this build is ongoing now. Results will be updated tomorrow.
> > > Thanks.
> > 
> > Hello Marcelo,
> > 
> > 4.18.0-275.rt7.40.el8rdrain2.x86_64 doesn't work very well.
> 
> Pei,
> 
> The patch is very broken. Should have a new brew build today.

Pei,

This should work:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=34606898

Comment 27 Pei Zhang 2021-02-01 07:55:46 UTC
(In reply to Marcelo Tosatti from comment #26)
[...]
> 
> Pei,
> 
> This should work:
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=34606898

Thanks Marcelo. Testing with this build is ongoing now. Results will be updated soon (should be before Wednesday).

Best regards,

Pei

Comment 46 Pei Zhang 2021-02-22 09:13:29 UTC
Verification:

Versions:
kernel-rt-4.18.0-288.rt7.53.el8.x86_64
qemu-kvm-5.2.0-7.module+el8.4.0+9943+d64b3717.x86_64
tuned-2.15.0-1.el8.noarch
libvirt-7.0.0-3.scrmod+el8.4.0+9884+244fb11b.x86_64
microcode_ctl-20201112-2.el8.x86_64
rt-tests-1.10-1.el8.x86_64

== 12 hours oslat: max latency is 6us. No errors show.
(1)Single VM with 1 rt vCPU:
     Maximum:	 6 (us)

(2)Single VM with 8 rt vCPUs:
     Maximum:	 6 6 6 6 6 6 6 6 (us)

(3)Multiple VMs each with 1 rt vCPU:
- VM1
     Maximum:	 6 (us)

- VM2
     Maximum:	 6 (us)

- VM3
     Maximum:	 6 (us)

- VM4
     Maximum:	 6 (us)


== 12 hours cyclictest: max latency is 29us. No errors show.

(1)Single VM with 1 rt vCPU:
# Min Latencies: 00007
# Avg Latencies: 00010
# Max Latencies: 00024

(2)Single VM with 8 rt vCPUs:
# Min Latencies: 00007 00012 00012 00012 00012 00012 00012 00012
# Avg Latencies: 00009 00012 00012 00012 00012 00012 00012 00012
# Max Latencies: 00021 00022 00023 00022 00022 00022 00022 00022

(3)Multiple VMs each with 1 rt vCPU:
- VM1
# Min Latencies: 00007
# Avg Latencies: 00009
# Max Latencies: 00022

- VM2
# Min Latencies: 00007
# Avg Latencies: 00009
# Max Latencies: 00020

- VM3
# Min Latencies: 00007
# Avg Latencies: 00009
# Max Latencies: 00022

- VM4
# Min Latencies: 00007
# Avg Latencies: 00009
# Max Latencies: 00029


So this bug has been fixed. Moving to 'VERIFIED'.

Comment 48 errata-xmlrpc 2021-05-18 15:12:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: kernel-rt security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1739

