Bug 1448770 - INFO: task xfsaild/dm-2:1175 blocked for more than 600 seconds
Summary: INFO: task xfsaild/dm-2:1175 blocked for more than 600 seconds
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel-rt
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Assignee: Luis Claudio R. Goncalves
QA Contact: Pei Zhang
Keywords: Reopened
Depends On:
Blocks: 1175461 1442258
 
Reported: 2017-05-08 03:34 UTC by Pei Zhang
Modified: 2018-04-10 09:09 UTC (History)
CC: 20 users
Clone Of: None
Clones: 1539394
Last Closed: 2018-04-10 09:07:09 UTC


Attachments (Terms of Use)
Call trace info after Open vSwitch setup. (140.26 KB, text/plain)
2017-05-10 14:26 UTC, Pei Zhang
no flags Details
Pure rt testing hit hang issue (28.10 KB, text/plain)
2017-05-17 10:15 UTC, Pei Zhang
no flags Details
Script we use for cyclictest in rt environment. (4.52 KB, application/x-shellscript)
2017-05-31 09:54 UTC, Pei Zhang
no flags Details
Call trace info with 4G memory (27.04 KB, text/plain)
2017-06-02 14:59 UTC, Pei Zhang
no flags Details
Call trace info with 10G memory (159.12 KB, text/plain)
2017-06-04 09:25 UTC, Pei Zhang
no flags Details
Call trace info with 3.10.0-679.rt56.602.el7.x86_64 (161.42 KB, text/plain)
2017-06-09 01:11 UTC, Pei Zhang
no flags Details
Call trace info with 3.10.0-732.rt56.660.el7.x86_64 (33.90 KB, text/plain)
2017-10-16 10:02 UTC, Pei Zhang
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2018:0676 None None None 2018-04-10 09:09 UTC

Description Pei Zhang 2017-05-08 03:34:01 UTC
Description of problem:
When installing a VM on the rt host, both the host and the VM installation work, but the call trace "INFO: task rcub/1:18 blocked for more than 600 seconds." shows up repeatedly on the host during VM installation.


Version-Release number of selected component (if applicable):
3.10.0-663.rt56.582.el7.x86_64
qemu-kvm-rhev-2.9.0-3.el7.x86_64
libvirt-3.2.0-4.el7.x86_64


How reproducible:
Hit once so far (single run)


Steps to Reproduce:
1. Prepare rt host
# cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-3.10.0-663.rt56.582.el7.x86_64 root=/dev/mapper/rhel_dell--per730--27-root ro crashkernel=auto rd.lvm.lv=rhel_dell-per730-27/root rd.lvm.lv=rhel_dell-per730-27/swap console=ttyS0,115200n81 default_hugepagesz=1G iommu=pt intel_iommu=on isolcpus=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,30,28,26,24,22,20,18,16 nohz=on nohz_full=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,30,28,26,24,22,20,18,16 rcu_nocbs=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,30,28,26,24,22,20,18,16 intel_pstate=disable nosoftlockup
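The long isolcpus=/nohz_full= lists above are easy to get wrong. A minimal sketch (the helper names are mine, not part of any tool) that parses the isolcpus= list out of a /proc/cmdline string so it can be compared against what the kernel actually applied in /sys/devices/system/cpu/isolated:

```python
def parse_cpu_list(spec):
    """Parse a kernel-style CPU list such as "1,3,16-18" into sorted ids."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

def isolated_cpus(cmdline):
    """Extract the isolcpus= list from a /proc/cmdline string."""
    for token in cmdline.split():
        if token.startswith("isolcpus="):
            return parse_cpu_list(token.split("=", 1)[1])
    return []

# On a live host, compare against what the kernel applied:
#   open("/sys/devices/system/cpu/isolated").read()
```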


2. Install the VM (see Reference [1]). Call traces appear on the host.


Actual results:
The host keeps working and the VM installation finishes, but call traces appear on the host.


Expected results:
No call trace info should appear in dmesg.


Additional info:
Call trace info:
[ 1818.763069] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 8, t=60002 jiffies, g=18839, c=18838, q=5121)
[ 1818.763071] All QSes seen, last rcu_preempt kthread activity 1 (4296486586-4296486585), jiffies_till_next_fqs=3
[ 1818.763073] swapper/8       R  running task        0     0      1 0x00080000
[ 1818.763075]  ffff94175abe8000 21df5d4421df5489 ffff941cde703dd0 ffffffff8a0be8d6
[ 1818.763076]  ffff941cde712080 ffffffff8aa44940 ffff941cde703e38 ffffffff8a13b3cd
[ 1818.763077]  0000000000000000 ffff941cde712080 0000000000001401 0000000000000000
[ 1818.763077] Call Trace:
[ 1818.763084]  <IRQ>  [<ffffffff8a0be8d6>] sched_show_task+0xb6/0x120
[ 1818.763088]  [<ffffffff8a13b3cd>] rcu_check_callbacks+0x83d/0x860
[ 1818.763091]  [<ffffffff8a091ec1>] update_process_times+0x41/0x70
[ 1818.763094]  [<ffffffff8a0ee815>] tick_sched_handle.isra.18+0x25/0x60
[ 1818.763096]  [<ffffffff8a0eebb3>] tick_sched_timer+0x43/0x70
[ 1818.763098]  [<ffffffff8a0adea4>] __run_hrtimer+0xc4/0x2c0
[ 1818.763100]  [<ffffffff8a0eeb70>] ? tick_sched_do_timer+0x50/0x50
[ 1818.763101]  [<ffffffff8a0aee20>] hrtimer_interrupt+0x130/0x350
[ 1818.763105]  [<ffffffff8a0473a5>] local_apic_timer_interrupt+0x35/0x60
[ 1818.763108]  [<ffffffff8a6ba4dd>] smp_apic_timer_interrupt+0x3d/0x50
[ 1818.763111]  [<ffffffff8a6b8c5d>] apic_timer_interrupt+0x6d/0x80
[ 1818.763114]  <EOI>  [<ffffffff8a6aec46>] ? native_safe_halt+0x6/0x10
[ 1818.763116]  [<ffffffff8a6aea5c>] default_idle+0x2c/0x140
[ 1818.763119]  [<ffffffff8a02707e>] arch_cpu_idle+0x2e/0x40
[ 1818.763122]  [<ffffffff8a0e2ecc>] cpu_startup_entry+0x14c/0x1d0
[ 1818.763124]  [<ffffffff8a045454>] start_secondary+0x1b4/0x220
[ 2434.270593] INFO: task rcub/1:18 blocked for more than 600 seconds.
[ 2434.444825] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2434.453564] rcub/1          D ffff94175abc3090     0    18      2 0x00000000
[ 2434.461446]  ffff94175abdbcc0 0000000000000002 ffff94175abdbfd8 ffff94175abdbfd8
[ 2434.469736]  ffff94175abdbfd8 ffff94175abdbfd8 ffffffff8a9fe480 ffff94175abc3090
[ 2434.478021]  ffff94175abc3090 0000000000000000 0000000000000000 ffff94175abc3090
[ 2434.486310] Call Trace:
[ 2434.489041]  [<ffffffff8a6acea0>] schedule+0x30/0x96
[ 2434.494580]  [<ffffffff8a6adacd>] __rt_mutex_slowlock+0xdd/0x160
[ 2434.501282]  [<ffffffff8a6adf5f>] rt_mutex_slowlock_locked+0xbf/0x1a0
[ 2434.508468]  [<ffffffff8a6ae0bc>] rt_mutex_slowlock+0x7c/0xc0
[ 2434.514877]  [<ffffffff8a6ae171>] rt_mutex_lock+0x31/0x40
[ 2434.520901]  [<ffffffff8a137588>] rcu_boost_kthread+0x148/0x390
[ 2434.527504]  [<ffffffff8a137440>] ? rcu_report_unblock_qs_rnp+0xe0/0xe0
[ 2434.534883]  [<ffffffff8a0a9b4f>] kthread+0xcf/0xe0
[ 2434.540325]  [<ffffffff8a0a9a80>] ? kthread_worker_fn+0x170/0x170
[ 2434.547123]  [<ffffffff8a6b7f18>] ret_from_fork+0x58/0x90
[ 2434.553146]  [<ffffffff8a0a9a80>] ? kthread_worker_fn+0x170/0x170
[ 3034.498594] INFO: task rcub/1:18 blocked for more than 600 seconds.
[ 3034.505588] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3034.514325] rcub/1          D ffff94175abc3090     0    18      2 0x00000000
[ 3034.522209]  ffff94175abdbcc0 0000000000000002 ffff94175abdbfd8 ffff94175abdbfd8
[ 3034.530496]  ffff94175abdbfd8 ffff94175abdbfd8 ffffffff8a9fe480 ffff94175abc3090
[ 3034.538781]  ffff94175abc3090 0000000000000000 0000000000000000 ffff94175abc3090
[ 3034.547064] Call Trace:
[ 3034.549796]  [<ffffffff8a6acea0>] schedule+0x30/0x96
[ 3034.555333]  [<ffffffff8a6adacd>] __rt_mutex_slowlock+0xdd/0x160
[ 3034.562034]  [<ffffffff8a6adf5f>] rt_mutex_slowlock_locked+0xbf/0x1a0
[ 3034.569221]  [<ffffffff8a6ae0bc>] rt_mutex_slowlock+0x7c/0xc0
[ 3034.575633]  [<ffffffff8a6ae171>] rt_mutex_lock+0x31/0x40
[ 3034.581656]  [<ffffffff8a137588>] rcu_boost_kthread+0x148/0x390
[ 3034.588260]  [<ffffffff8a137440>] ? rcu_report_unblock_qs_rnp+0xe0/0xe0
[ 3034.595642]  [<ffffffff8a0a9b4f>] kthread+0xcf/0xe0
[ 3034.601082]  [<ffffffff8a0a9a80>] ? kthread_worker_fn+0x170/0x170
[ 3034.607882]  [<ffffffff8a6b7f18>] ret_from_fork+0x58/0x90
[ 3034.613904]  [<ffffffff8a0a9a80>] ? kthread_worker_fn+0x170/0x170
[ 3634.559834] INFO: task rcub/1:18 blocked for more than 600 seconds.
[ 3634.566826] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3634.575562] rcub/1          D ffff94175abc3090     0    18      2 0x00000000
[ 3634.583445]  ffff94175abdbcc0 0000000000000002 ffff94175abdbfd8 ffff94175abdbfd8
[ 3634.591733]  ffff94175abdbfd8 ffff94175abdbfd8 ffffffff8a9fe480 ffff94175abc3090
[ 3634.600021]  ffff94175abc3090 0000000000000000 0000000000000000 ffff94175abc3090
[ 3634.608307] Call Trace:
[ 3634.611039]  [<ffffffff8a6acea0>] schedule+0x30/0x96
[ 3634.616578]  [<ffffffff8a6adacd>] __rt_mutex_slowlock+0xdd/0x160
[ 3634.623279]  [<ffffffff8a6adf5f>] rt_mutex_slowlock_locked+0xbf/0x1a0
[ 3634.630464]  [<ffffffff8a6ae0bc>] rt_mutex_slowlock+0x7c/0xc0
[ 3634.636875]  [<ffffffff8a6ae171>] rt_mutex_lock+0x31/0x40
[ 3634.642899]  [<ffffffff8a137588>] rcu_boost_kthread+0x148/0x390
[ 3634.649504]  [<ffffffff8a137440>] ? rcu_report_unblock_qs_rnp+0xe0/0xe0
[ 3634.656884]  [<ffffffff8a0a9b4f>] kthread+0xcf/0xe0
[ 3634.662325]  [<ffffffff8a0a9a80>] ? kthread_worker_fn+0x170/0x170
[ 3634.669124]  [<ffffffff8a6b7f18>] ret_from_fork+0x58/0x90
[ 3634.675146]  [<ffffffff8a0a9a80>] ? kthread_worker_fn+0x170/0x170
[ 4234.620883] INFO: task rcub/1:18 blocked for more than 600 seconds.
[ 4234.627879] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4234.636616] rcub/1          D ffff94175abc3090     0    18      2 0x00000000
[ 4234.644499]  ffff94175abdbcc0 0000000000000002 ffff94175abdbfd8 ffff94175abdbfd8
[ 4234.652787]  ffff94175abdbfd8 ffff94175abdbfd8 ffffffff8a9fe480 ffff94175abc3090
[ 4234.661075]  ffff94175abc3090 0000000000000000 0000000000000000 ffff94175abc3090
[ 4234.669362] Call Trace:
[ 4234.672093]  [<ffffffff8a6acea0>] schedule+0x30/0x96
[ 4234.677621]  [<ffffffff8a6adacd>] __rt_mutex_slowlock+0xdd/0x160
[ 4234.684324]  [<ffffffff8a6adf5f>] rt_mutex_slowlock_locked+0xbf/0x1a0
[ 4234.691509]  [<ffffffff8a6ae0bc>] rt_mutex_slowlock+0x7c/0xc0
[ 4234.697919]  [<ffffffff8a6ae171>] rt_mutex_lock+0x31/0x40
[ 4234.703942]  [<ffffffff8a137588>] rcu_boost_kthread+0x148/0x390
[ 4234.710547]  [<ffffffff8a137440>] ? rcu_report_unblock_qs_rnp+0xe0/0xe0
[ 4234.717928]  [<ffffffff8a0a9b4f>] kthread+0xcf/0xe0
[ 4234.723368]  [<ffffffff8a0a9a80>] ? kthread_worker_fn+0x170/0x170
[ 4234.730165]  [<ffffffff8a6b7f18>] ret_from_fork+0x58/0x90
[ 4234.736187]  [<ffffffff8a0a9a80>] ? kthread_worker_fn+0x170/0x170
[ 4845.097908] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 4, t=60002 jiffies, g=52543, c=52542, q=5291)
[ 4845.097909] All QSes seen, last rcu_preempt kthread activity 1 (4299513228-4299513227), jiffies_till_next_fqs=3
[ 4845.097911] swapper/4       R  running task        0     0      1 0x00080000
[ 4845.097913]  ffff94175abc40c0 18f2b20e18f2a953 ffff941cde683dd0 ffffffff8a0be8d6
[ 4845.097914]  ffff941cde692080 ffffffff8aa44940 ffff941cde683e38 ffffffff8a13b3cd
[ 4845.097915]  0000000000000000 ffff941cde692080 00000000000014ab 0000000000000000
[ 4845.097916] Call Trace:
[ 4845.097922]  <IRQ>  [<ffffffff8a0be8d6>] sched_show_task+0xb6/0x120
[ 4845.097926]  [<ffffffff8a13b3cd>] rcu_check_callbacks+0x83d/0x860
[ 4845.097929]  [<ffffffff8a091ec1>] update_process_times+0x41/0x70
[ 4845.097932]  [<ffffffff8a0ee815>] tick_sched_handle.isra.18+0x25/0x60
[ 4845.097934]  [<ffffffff8a0eebb3>] tick_sched_timer+0x43/0x70
[ 4845.097936]  [<ffffffff8a0adea4>] __run_hrtimer+0xc4/0x2c0
[ 4845.097938]  [<ffffffff8a0eeb70>] ? tick_sched_do_timer+0x50/0x50
[ 4845.097939]  [<ffffffff8a0aee20>] hrtimer_interrupt+0x130/0x350
[ 4845.097943]  [<ffffffff8a0473a5>] local_apic_timer_interrupt+0x35/0x60
[ 4845.097946]  [<ffffffff8a6ba4dd>] smp_apic_timer_interrupt+0x3d/0x50
[ 4845.097949]  [<ffffffff8a6b8c5d>] apic_timer_interrupt+0x6d/0x80
[ 4845.097952]  <EOI>  [<ffffffff8a6aec46>] ? native_safe_halt+0x6/0x10
[ 4845.097954]  [<ffffffff8a6aea5c>] default_idle+0x2c/0x140
[ 4845.097957]  [<ffffffff8a02707e>] arch_cpu_idle+0x2e/0x40
[ 4845.097959]  [<ffffffff8a0e2ecc>] cpu_startup_entry+0x14c/0x1d0
[ 4845.097961]  [<ffffffff8a045454>] start_secondary+0x1b4/0x220



Reference: 
[1]
virt-install --os-variant=rhel7 \
--noautoconsole \
--graphics type=vnc,listen=0.0.0.0,port=5931 \
--name=rhel7.4_rt \
--memory=8192,hugepages=yes \
--memorybacking hugepages=yes,size=1,unit=G,nodeset=1,locked=yes \
--numatune=1 --vcpus=6,cpuset=30,31,29,27,25,23 \
--disk path=/home/images_nfv-virt-rt-kvm/rhel7.4_rt.qcow2,bus=virtio,cache=none,format=qcow2,io=threads,size=20 \
-l http://download.eng.pek2.redhat.com//nightly/RHEL-7.4-20170505.n.0/compose/Server/x86_64/os/ -x ks=http://10.66.9.128/kickstart-rhel7.cfg  \
--network bridge=switch,model=virtio,mac=88:66:da:5f:dd:01

Comment 2 Pei Zhang 2017-05-09 04:36:05 UTC
Ran the same test with 3.10.0-664.rt56.583.el7.x86_64; still hit the same issue.

Comment 3 Luiz Capitulino 2017-05-09 14:10:01 UTC
Pei, is the problem 100% reproducible or do you have to try several times?

I'm not able to reproduce it, so maybe you could give me access to your machine (in private) so that I can debug it?

Comment 4 Pei Zhang 2017-05-10 00:37:08 UTC
(In reply to Luiz Capitulino from comment #3)
> Pei, is the problem 100% reproducible or do you have to try several times?

Hi Luiz, this problem is 100% reproducible. Each time I tested, I re-installed the host and prepared the rt environment, then started installing the guest; once the guest installation starts (the image size begins increasing), the call traces show up.

> I'm not able to reproduce it, so maybe you could give me access to your
> machine (in private) so that I can debug it?

Sure. I'll let you know once the environment is ready. (These hosts are being used for other testing now.)

Comment 5 Pei Zhang 2017-05-10 14:26 UTC
Created attachment 1277629 [details]
Call trace info after Open vSwitch setup.

Sorry, I missed a step [1]. Since the call traces show during guest installation, I wrongly left out the Open vSwitch step. Sorry for the confusion.

Step [1]: Before installing the guest, I also set up Open vSwitch, bound all PMD threads to 8 isolated cores (cores 1,3,5,7,9,11,13,15), and assigned them priority 95.

[cmd]:/usr/bin/ovs-vsctl set Open_vSwitch . other_config={}
[cmd]:/usr/bin/ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0xAAAA
[cmd]:/usr/bin/ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x2
[cmd]:/usr/bin/ovs-vsctl set Open_vSwitch . other_config:n-dpdk-rxqs=2
[cmd]:/usr/bin/ovs-vsctl set Interface dpdk0 options:n_rxq=2
[cmd]:/usr/bin/ovs-vsctl set Interface dpdk0 options:n_txq=2
[cmd]:/usr/bin/ovs-vsctl set Interface dpdk1 options:n_rxq=2
[cmd]:/usr/bin/ovs-vsctl set Interface dpdk1 options:n_txq=2

# chrt -fp 95 $pmd_process_id
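The pmd-cpu-mask and dpdk-lcore-mask values above are CPU bitmasks; decoding them confirms the pinning matches the isolated cores. A minimal sketch (the helper is mine, not an OVS API):

```python
def mask_to_cpus(mask):
    """Return the CPU ids whose bits are set in an OVS-style CPU mask."""
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]

# pmd-cpu-mask=0xAAAA -> PMDs on cores 1,3,5,7,9,11,13,15 (the 8 isolated
# cores mentioned above); dpdk-lcore-mask=0x2 -> lcore thread on core 1.
print(mask_to_cpus(0xAAAA))
print(mask_to_cpus(0x2))
```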


Update: the call trace info has changed.

Today when I re-installed this environment, I found that the call trace info changed: call traces now show up after the Open vSwitch setup, but not during guest installation.

The behavior may have changed because I was testing with different Open vSwitch versions: today I tested with the latest openvswitch-2.6.1-18.git20161206.el7fdp, while previously I tested with openvswitch-2.6.1-17.git20161206.el7fdp.

I need to confirm the problem and find the exact steps to reproduce.

Comment 7 Luiz Capitulino 2017-05-10 15:27:09 UTC
(In reply to Pei Zhang from comment #5)

> I guess I need to confirm the problem and find the exact steps to reproduce.

Yes, please do so. Also, please double check the following:

1. The PMD threads are really running on isolated cores. You can check this simply by running top -d1: since the PMD threads take 100% of a CPU, confirm that the CPUs running at 100% are all isolated (it's a good idea to double-check /proc/cmdline too)

2. When assigning FIFO priority, make sure you're not including extra non-PMD threads in the chrt command line
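A hedged way to act on both points at once: list the ovs-vswitchd threads with the CPU they last ran on, and apply chrt only to threads whose name marks them as PMDs. Thread names like "pmd61" are the usual ovs-dpdk form; verify against your own ps output first.

```shell
PRIO=95   # the priority used in this report's setup; adjust as needed
if pid=$(pidof ovs-vswitchd); then
    # spid = thread id, psr = CPU the thread last ran on, comm = thread name
    ps -T -o spid=,psr=,comm= -p "$pid" |
        awk '$3 ~ /^pmd/ { print $1 }' |     # keep only the PMD threads
        xargs -r -n1 chrt -fp "$PRIO"        # set fifo priority per thread
else
    echo "ovs-vswitchd is not running"
fi
```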

Comment 9 Luiz Capitulino 2017-05-15 17:29:50 UTC
Pei Zhang,

This issue looks very similar to bug 1438120, especially what you and Marcelo have been discussing in bug 1438120 comment 46 and bug 1438120 comment 48.

I'm closing this as a dupe. If you disagree, please re-open. Also, it would be nice if you added the reproducer for this bug to bug 1438120 too.

*** This bug has been marked as a duplicate of bug 1438120 ***

Comment 10 Pei Zhang 2017-05-17 10:15 UTC
Created attachment 1279613 [details]
Pure rt testing hit hang issue

Hi Luiz,

I hit a host hang issue with pure kvm-rt testing.

Versions:
3.10.0-667.rt56.586.el7.x86_64
tuned-2.8.0-3.el7.noarch
qemu-kvm-rhev-2.9.0-5.el7.x86_64
libvirt-3.2.0-5.el7.x86_64

Host dell-per430-09.lab.eng.pek2.redhat.com is used only for kvm-rt testing with automation, so only kvm-rt related packages are installed and used (no Open vSwitch, no DPDK).


I'd like to share some info about our automation.
We test 3 scenarios in order:
(1) Single guest with 1 rt vCPU
(2) Single guest with 8 rt vCPUs
(3) Four guests each with 1 rt vCPU

For each scenario, we have 3 main steps:
 - First, install/update the latest rt packages (kernel-rt, rt-tests, tuned, qemu-kvm-rhev, libvirt), set up 1G hugepages, and isolate cores.
 - Then, install a new guest: define the VM per the rt requirements, install the latest rt packages (kernel-rt, rt-tests, tuned), set up 1G hugepages, and isolate cores.
 - Finally, run the latency testing, normally for 12h, sometimes 1h.

I hit the hang issue when rebooting the guest after "/usr/sbin/tuned-adm profile realtime-virtual-guest" in scenario 2.

Currently I am not sure whether this issue is reproducible. I'll keep testing.

Once it's reproducible, perhaps I'll need to re-open this bug.


Thanks,
Pei

Comment 11 Luiz Capitulino 2017-05-17 16:09:20 UTC
I'm re-opening, as it looks different now. Before starting your tests could you set:

# sysctl kernel.hung_task_panic=1

And then make sure the system is set up to generate a vmcore, and make it available once you get it?
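A hedged sketch of that setup on a RHEL 7 host (paths and service names are the RHEL defaults; all of this needs root):

```shell
# Panic on the next hung-task report instead of just logging it.
sysctl kernel.hung_task_panic=1

# Verify kdump is armed so the panic actually produces a vmcore.
systemctl is-active kdump            # expect "active"
cat /sys/kernel/kexec_crash_loaded   # expect "1"
```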

Comment 12 Marcelo Tosatti 2017-05-18 03:22:51 UTC
(In reply to Pei Zhang from comment #10)
> Created attachment 1279613 [details]
> Pure rt testing hit hang issue
> 
> Hi Luiz,
> 
> I hit host hang issue with pure kvm-rt testing.
> 
> Versions:
> 3.10.0-667.rt56.586.el7.x86_64
> tuned-2.8.0-3.el7.noarch
> tuned-2.8.0-3.el7.noarch
> qemu-kvm-rhev-2.9.0-5.el7.x86_64
> libvirt-3.2.0-5.el7.x86_64
> 
> Host dell-per430-09.lab.eng.pek2.redhat.com is only used for kvm-rt testing
> with automation. So only kvm-rt related packages are installed and used. (no
> Open vSwitch, DPDK).
> 
> 
> I'd like to share some info about our automation.
> We test 3 scenarios orderly:
> (1) Single guest with 1 rt vCPU
> (2) Single guest with 8 rt vCPUs
> (3) Four guests each with 1 rt vCPU
> 
> For each scenario, we totally have 3 main steps:
>  - First install/update the latest rt packages, including kernel-rt,
> rt-tests, tuned, qemu-kvm-rhev, libvirt. Also set 1G hugepage and isolate
> cores.
>  - Then install a new guest. Define the VM as rt requirement and install the
> latest rt packages kernel-rt, rt-tests, tuned. Also set 1G hugepage and
> isolate cores.
>  - Finally do latency testing, normally 12h. Sometimes 1h.
> 
> I hit hang issue when the reboot guest after"/usr/sbin/tuned-adm profile
> realtime-virtual-guest" of scenario 2. 
> 
> Currently I am not sure if this issue is reproducible. I'll keep testing.
> 
> Once it's reproducible, perhaps I'll need to re-open this bug.
> 
> 
> Thanks,
> Pei

Consider the following scenario:

1) qemu-vcpu with FIFO 1 priority on isolated core, HLT's for some reason,
waiting for APIC timer interrupt.
2) Lower priority thread executes, grabs a spinlock (err, mutex).
3) APIC timer interrupt fires, qemu-vcpu enters region where it won't yield.
4) Host system hangs because spinlock that lower priority thread grabbed at 2) is never released.

So I think DPDK must be changed to yield for short periods of time.

Comment 13 Luiz Capitulino 2017-05-18 14:31:02 UTC
If I understood correctly, this issue is hit while the KVM-RT guest is being configured, so there is nothing taking 100% of the CPU all the time.

Comment 14 Marcelo Tosatti 2017-05-19 15:01:50 UTC
(In reply to Luiz Capitulino from comment #13)
> If I understood correctly, this issue is hit when the KVM-RT guest is being
> configured so there is nothing taking 100% of the CPU all the time.

That is right. This is just another data point showing that the assumption "it's OK to have lower-priority threads hang forever behind busy-spinning DPDK on the host, or busy-spinning DPDK inside a vCPU" is wrong.

Comment 15 Luiz Capitulino 2017-05-19 17:52:57 UTC
I disagree, but I don't think this is the right place to discuss this.

Comment 17 Pei Zhang 2017-05-23 01:43:22 UTC
(In reply to Luiz Capitulino from comment #11)
> I'm re-opening, as it looks different now. Before starting your tests could
> you set:
> 
> # sysctl kernel.hung_task_panic=1
> 
> And then make sure the system is setup to generate a vmcore and make it
> available once you get it?

OK. I have added this command to my testing.

Once call traces show, I'll generate the vmcore with
"echo c > /proc/sysrq-trigger".


Thanks,
Pei

Comment 22 Pei Zhang 2017-05-26 10:50:29 UTC
This bug mentions 2 scenarios; both hit call trace issues:
(1) With booting ovs

Setting ovs pmds fifo:95     call trace shows, see [1] if needed
Setting ovs pmds fifo:1      works well, no call trace

Since PMDs should be assigned fifo:1, these are the expected results.

I have a question: should we remind users somewhere, or print a readable warning, when they assign an invalid FIFO priority like fifo:95? The raw call trace looks a bit unfriendly.

[1]http://fileshare.englab.nay.redhat.com/pub/section2/coredump/var/crash/pezhang/bug1448770/ 



(2) Pure rt testing
I'm still trying to reproduce this scenario. Will update my results asap.



Thanks,
Pei

Comment 24 Pei Zhang 2017-05-27 08:20:04 UTC
(In reply to Luiz Capitulino from comment #11)
> I'm re-opening, as it looks different now. Before starting your tests could
> you set:
> 
> # sysctl kernel.hung_task_panic=1
> 
> And then make sure the system is setup to generate a vmcore and make it
> available once you get it?

Luiz, for the vmcore, please refer to [1]:

[1]http://fileshare.englab.nay.redhat.com/pub/section2/coredump/var/crash/pezhang/bug1448770/hang_task/


Note: this was tested in the pure rt environment.


Best Regards,
Pei

Comment 25 Pei Zhang 2017-05-27 14:04:05 UTC
Another vmcore during single VM with 8vCPUs latency testing, please refer to: 

http://fileshare.englab.nay.redhat.com/pub/section2/coredump/var/crash/pezhang/bug1448770/hang_task2/



Best Regards,
Pei

Comment 26 Luiz Capitulino 2017-05-29 14:51:51 UTC
Pei,

I'm getting this error when trying to run crash with the vmcore from comment 25:

crash: page excluded: kernel virtual address: ffffffffb75c6370  type: "timekeeper xtime_sec"
crash: cannot determine base kernel version
crash: /usr/lib/debug/lib/modules/3.10.0-672.rt56.593.el7.x86_64.debug/vmlinux and ./vmcore do not match!

I got the same error with the vmcore from comment 24, although I of course tried with debug kernel 3.10.0-671.rt56.592.el7.x86_64.

Another question: can you describe what "pure rt testing" is? If you're just running cyclictest, can you describe how you do it?

Comment 27 Luiz Capitulino 2017-05-30 21:13:00 UTC
Pei,

Once you get a working vmcore, would you please pass workqueue.disable_numa=1 on the kernel command line and try to reproduce again, if you have cycles?

Thanks.

Comment 28 Pei Zhang 2017-05-31 09:54 UTC
Created attachment 1283686 [details]
Script we use for cyclictest in rt environment.

(In reply to Luiz Capitulino from comment #26)
> Pei,
> 
> I'm getting this error when try to run crash with the vmcore from comment 25:
> 
> crash: page excluded: kernel virtual address: ffffffffb75c6370  type:
> "timekeeper xtime_sec"
> crash: cannot determine base kernel version
> crash:
> /usr/lib/debug/lib/modules/3.10.0-672.rt56.593.el7.x86_64.debug/vmlinux and
> ./vmcore do not match!
> 
> I got the same error with the vmcore from comment 24, although I of course
> tried with debug kernel 3.10.0-671.rt56.592.el7.x86_64.

Hi Luiz, 

The vmcore can be opened with "crash" on my testing machine:
# crash  /usr/lib/debug/lib/modules/3.10.0-672.rt56.593.el7.x86_64/vmlinux vmcore

I am not sure whether you also use "crash"; if not, could you share your method? Then next time I can validate the vmcores before uploading them.

Also, you can access these vmcores on my testing machine (perhaps the download process damaged the files). I still have the testing environment; I'll add the host info in the next comment.

> Another question, can you describe what's "pure rt testing"? If you're just
> running cyclictest, can you describe how you do it?

Sorry for the confusion :)

Pure rt testing means only testing latency with cyclictest in the rt environment; no other components are involved. Only the rt/rt-kvm related packages (kernel-rt, kernel-rt-kvm, tuned, qemu-kvm-rhev, libvirt, rt-tests) are installed.

We also test L2 network latency/migration with Open vSwitch, DPDK, and vhost-user in the rt environment. That scenario involves components beyond rt, which is why we call the cyclictest-only scenario "pure rt testing".

For cyclictest, we use the script you developed (we made a small change to the log output to meet our automation requirements). Please check the attachment.

# /usr/bin/sh /home/nfv-virt-rt-kvm/tests/run-cyclictest.sh -d 1h -c 1 -r

In automation, we access the VM over ssh from another host using the paramiko Python library, then execute run-cyclictest.sh in the VM. Besides this, nothing else runs in the VM.

(In reply to Luiz Capitulino from comment #27)
> Pei,
> 
> Once you get a working vmcore, would you please pass
> workqueue.disable_numa=1 in the kernel command-line and try to reproduce
> again if you have cycles?

OK. I have added workqueue.disable_numa=1 to the kernel command line. I'll update my testing results here tomorrow.


Thanks,
Pei

Comment 30 Pei Zhang 2017-05-31 10:08:47 UTC
(In reply to Luiz Capitulino from comment #27)
> Pei,
> 
> Once you get a working vmcore, would you please pass
> workqueue.disable_numa=1 in the kernel command-line and try to reproduce
> again if you have cycles?

Luiz, I reproduced with workqueue.disable_numa=1 just now. Please see the vmcore info on 10.73.73.63:
 
/var/crash/127.0.0.1-2017-05-31-05:59:47

Thanks,
Pei

Comment 31 Luiz Capitulino 2017-05-31 17:52:56 UTC
Pei,

After comparing the differences between our systems, I found that I wasn't using the correct debuginfo package: I was using kernel-rt-debug-debuginfo instead of kernel-rt-debuginfo. Now I'm able to use all the vmcores you provided.

I have two updates:

- The vmcore from comment 25 is not an issue: you don't have the host profile applied, which can be seen by checking the kernel command line in the dmesg file. Not having the profile applied means you don't have isolation, which can cause the issue you saw

- The vmcores from comment 24 and comment 30 look similar and point to a real issue. I'm still investigating and should have more information later today or tomorrow

I'll update the summary of the BZ to better match the issue we can reproduce, and I'll assign it to myself (which I forgot to do before).
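For reference, the combination that ended up working can be sketched as follows (the kernel version is the one from this thread; adjust to whatever kernel the vmcore was captured on):

```shell
# Install the debuginfo that matches the *running* rt kernel of the vmcore
# (kernel-rt-debuginfo, not kernel-rt-debug-debuginfo).
yum install kernel-rt-debuginfo-3.10.0-672.rt56.593.el7

# Open the vmcore against the matching vmlinux.
crash /usr/lib/debug/lib/modules/3.10.0-672.rt56.593.el7.x86_64/vmlinux vmcore
```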

Comment 32 Luiz Capitulino 2017-06-01 20:22:24 UTC
Pei,

I have questions and things I'd like you to try.

Questions:

- Are you able to reproduce the issue if you start a VM by hand and then execute run-cyclictest.sh also by hand? Or do you only reproduce it when the test case is executed by your automation scripts?

- In comment 28 the command line for run-cyclictest.sh has "-c 1", but the guest from comment 25 (your last vmcore) has 10 vCPUs. Did you use the same command line for this guest?

- How much memory did you assign to your guest? Does it have swap space?

Now, here's the things I'd like you to try:

1. Assign at least 1GB of memory per vCPU to the guest. Remove any swap space. Try to reproduce

2. If you still get the problem and it is only reproducible via your automation scripts: add a "sleep 10m" before starting the first guest in your automation script. This is just a brute-force guarantee that starting the guest does not race with the tuned profile being applied during host boot

Comment 33 Pei Zhang 2017-06-02 01:11:04 UTC
(In reply to Luiz Capitulino from comment #32)
> Pei,
> 
> I have questions and things I'd like you to try.
> 
> Questions:
> 
> - Are you able to reproduce the issue if you start a VM by hand and then
> execute run-cyclictest.sh also be hand? Or do you only reproduce when the
> test-case is executed by your automation scripts?

I only reproduced it with the automation scripts.

> - In comment 28 the command-line for run-cyclictest.sh has "-c 1", however
> the guest from comment 25 (you last vmcore) has 10 vCPUs. Did you use the
> same command-line for for this guest?

In comment 28, "-c 1" is for the single VM, single rt vCPU test.
In comment 25, it's a single VM with 8 rt vCPUs (plus 2 housekeeping vCPUs), so the cyclictest command is:
# /usr/bin/sh /home/nfv-virt-rt-kvm/tests/run-cyclictest.sh -d 1h -c 1,2,3,4,5,6,7,8 -r
 
> - How much memory did you assign to you guest? Does it have swap space?

I assigned 4G of memory to the guest. Yes, both host and guest have swap space.

> Now, here's the things I'd like you to try:
> 
> 1. Assign at least 1GB of memory per vCPU to the guest. Remove any swap
> space. Try to reproduce

ok.

I'll assign 2G for the single VM with 1 rt vCPU (plus 1 housekeeping vCPU), and 10G for the single VM with 8 rt vCPUs (plus 2 housekeeping vCPUs).

And I'll remove all swap space in both host and guest by: 
# swapoff -a

> 2. If you still get the problem and the problem is only reproducible via
> your automation scripts: add a "sleep 10m" before starting the first guest
> in your automation script. This is just a brute-force guarantee that the
> starting of the guest is not racing with the tuned profile being applied
> during boot in the host

ok. I'll try to reproduce this issue manually first.


Thanks,
Pei

Comment 34 Luiz Capitulino 2017-06-02 03:23:31 UTC
I think I'm very close to reproducing this issue; my reproducer is:

1. Start a guest with 2GB and 8 vCPUs (where 6 vCPUs are isolated)
2. Start a kernel build in the guest with make -j2
3. In a few minutes the guest will run out of memory and the OOM killer will run
4. Wait several minutes and check blocked tasks in the host (with echo w > /proc/sysrq-trigger)

I see the same tasks blocked as in your vmcore from comment 25 for several minutes, but my host recovers at some point. I'll try with 10 vCPUs. How many vCPUs are isolated in the guest?

What you could do to help me confirm this reproducer is:

1. Try to reproduce manually with 4GB. It should reproduce if I'm on the right track

2. Try to reproduce with a guest with 10GB or more. It should not reproduce if I'm on the right track
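The reproducer above can be sketched end to end as a few commands (the guest hostname and kernel-tree path are placeholders, not from this report; "echo w" dumps the blocked tasks as in step 4):

```shell
# In the guest: saturate memory with a parallel kernel build (step 2 above).
# "rt-guest" and the source-tree path are hypothetical; substitute your own.
ssh rt-guest 'cd ~/linux && make -j2'

# On the host, after a few minutes (steps 3-4): dump blocked (D-state) tasks
# to the kernel log and look for the hung-task reports.
echo w > /proc/sysrq-trigger
dmesg | grep -A 5 'blocked for more than'
```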

Comment 35 Pei Zhang 2017-06-02 11:22:15 UTC
(In reply to Luiz Capitulino from comment #34)
> I think I'm very close to reproduce this issue, my reproducer is:
> 
> 1. Start a guest with 2GB and 8 vCPUs (where 6 vCPUs are isolated)
> 2. Start a kernel build in the guest with make -j2
> 3. In a few minutes the guest will run out of memory and the OOM killer will
> run
> 4. Wait several minutes and check blocked tasks in the host (with echo w >
> /proc/sysrq-trigger)
> 
> I see the same tasks blocked as your vmcore in comment 25 for several
> minutes, but my host recovers at some point. I'll try with 10 vCPUs. How
> many vCPUs are isolated in the guest?

With 10 vCPUs, I isolated all 8 rt vCPUs.

> 
> What you could do to help me confirm this reproducer is:
> 1. Try to reproduce manually with 4GB. It should reproduce if I'm on the
> right track

I am testing this scenario now but haven't reproduced it yet. I'll keep testing.

> 2. Try to reproduce with a guest with 10GB or more. It should not reproduce
> if I'm on the right track

Thanks,
Pei

Comment 36 Pei Zhang 2017-06-02 11:25:35 UTC
Luiz, I have a question:

For 8 rt vCPUs, how many housekeeping vCPUs should be used? Are there rules to follow when choosing the housekeeping vCPUs? (Currently I am using 2 housekeeping cores.)

Thanks,
Pei

Comment 37 Pei Zhang 2017-06-02 14:59 UTC
Created attachment 1284457 [details]
Call trace info with 4G memory

(In reply to Pei Zhang from comment #35)
...
> > 
> > What you could do to help me confirm this reproducer is:
> > 1. Try to reproduce manually with 4GB. It should reproduce if I'm on the
> > right track
> 
> I am testing this scenario now. Currently didn't reproduce yet. I'll keep
> testing.

It is reproducible now; please see the attachment for the call trace info.

Note:
- Guest with 4G memory, 10 vCPUs (8 rt vCPUs and 2 housekeeping vCPUs)
- In the guest, build the kernel with make -j2

Comment 38 Luiz Capitulino 2017-06-02 15:25:10 UTC
That's excellent news! Let's see if I can reproduce it now.

What I'd like to ask now is to try older RT kernels in the host so that we can determine if this is a regression or a problem that has always existed.

Comment 39 Pei Zhang 2017-06-04 09:25 UTC
Created attachment 1284723 [details]
Call trace info with 10G memory

Update:
A guest with 10G memory also hits the call trace issue; please refer to the attachment.

Note:
- Guest with 10G memory, 10 vCPUs (8 rt vCPUs and 2 housekeeping vCPUs)
- In the guest, build the kernel with make -j2

Comment 40 Pei Zhang 2017-06-04 09:31:59 UTC
(In reply to Luiz Capitulino from comment #38)
> That's excellent news! Let's see if I can reproduce it now.
> 
> What I'd like to ask now is to try with older RT kernels in the host so that
> we can determine if this is a regression or a problem that has always
> existed.

OK, I'll try old kernels to confirm whether it's a regression.


Best Regards,
Pei

Comment 41 Luiz Capitulino 2017-06-05 17:34:10 UTC
If you can reproduce with 10G then the root cause is not what I was thinking it could be and the issue may be more severe.

To summarize, here's what we know about the issue:

1. It happens on the host
2. The guest is doing high I/O load from the housekeeping CPUs
3. xfsaild and several dozen of kworkers are blocked in D state
4. All blocked tasks are running on housekeeping cores in the host

When I try to reproduce the issue, I see xfsaild and lots of kworkers blocked, but they get unblocked before the hung task message triggers. So, everything works fine although the initial state for the bug sets in.

I'll keep trying to reproduce. Pei, determining if this is a regression is probably the most important thing right now.

Comment 42 Pei Zhang 2017-06-06 08:58:10 UTC
(In reply to Luiz Capitulino from comment #41)
> If you can reproduce with 10G then the root cause is not what I was thinking
> it could be and the issue may be more severe.
> 
> To summarize, here's what we know about the issue:
> 
> 1. It happens on the host
> 2. The guest is doing high I/O load from the housekeeping CPUs
> 3. xfsaild and several dozen of kworkers are blocked in D state
> 4. All blocked tasks are running on housekeeping cores in the host
> 
> When I try to reproduce the issue, I see xfsaild and lots of kworkers
> blocked, but they get unblocked before the hung task message triggers. 

Luiz, if you keep running for a longer time (for example, about 24h), you will probably hit this hung task and the host will hang.

> So,
> everything works fine although the initial state for the bug sets in.
> 
> I'll keep trying to reproduce. Pei, determining if this is a regression is
> probably the most important right now.

OK. I'm testing with 3.10.0-514.rt56.420.el7.x86_64 now; I haven't reproduced it yet.


Thanks,
Pei

Comment 43 Pei Zhang 2017-06-07 01:06:41 UTC
Update: 
3.10.0-514.rt56.420.el7.x86_64 works well (kept running for about 18 hours): both host and guest work well, with no blocked-task or hang info in #dmesg.

Comment 44 Pei Zhang 2017-06-07 10:04:33 UTC
Update:
3.10.0-600.rt56.512.el7.x86_64 works well so far (kept running for about 10 hours).

Comment 45 Luiz Capitulino 2017-06-07 13:32:09 UTC
Moving to 7.5. This is a bit of a tough call to make as there's an indication that this could be a regression, but now we're only considering blocker-rc issues and I don't think this one qualifies, as it takes very long to reproduce and I haven't been able to reproduce it myself. Also, I don't think we'll get to the root cause in time for the 7.4 release.

Finally, the RT team thinks this could be a manifestation of a very old bug (although, in that case, this wouldn't be a regression):

Bug 1292927 - NETDEV WATCHDOG: p3p1 (sfc): transmit queue 4 timed out

Worst case, we release a fix in a z-stream kernel.

Comment 47 Luiz Capitulino 2017-06-07 13:38:48 UTC
Pei,

Have you reproduced this issue with kernel-rt-3.10.0-675.rt56.597.el7 or more recent? Otherwise, you should give it a try when you have time as this could be Bug 1452675 - unbound kworkers being activated on masked CPUs.

Comment 48 Pei Zhang 2017-06-08 04:00:01 UTC
(In reply to Luiz Capitulino from comment #47)
> Pei,
> 
> Have you reproduced this issue with kernel-rt-3.10.0-675.rt56.597.el7 or
> more recent? Otherwise, you should give it a try when you have time as this
> could be Bug 1452675 - unbound kworkers being activated on masked CPUs.

I didn't test recent kernels before. I am trying to reproduce with 3.10.0-679.rt56.602.el7.x86_64 now. 

Thanks,
Pei

Comment 49 Pei Zhang 2017-06-08 12:19:16 UTC
Update:

Version: 3.10.0-679.rt56.602.el7.x86_64

- Guest with 4G memory, 10 vCPUs (8 rt vCPUs and 2 housekeeping vCPUs)
- In the guest, build the kernel with make -j2

No blocked or hung task info shows in the rt host (running for about 8 hours).


Luiz, I'm now confirming with automation (it needs about 2 days). If there is still no error info, you are probably right that bug [1] has solved this problem.

Bug 1452675 - unbound kworkers being activated on masked CPUs

Comment 50 Luiz Capitulino 2017-06-08 13:42:12 UTC
OK, it's important to be sure the issue really doesn't happen before closing the BZ. So, please take your time.

I think this might be the reason why I haven't been able to reproduce, but I haven't tried with an older kernel (and won't try for now due to more important things to do).

Comment 51 Pei Zhang 2017-06-09 01:11 UTC
Created attachment 1286275 [details]
Call trace info with 3.10.0-679.rt56.602.el7.x86_64

(In reply to Luiz Capitulino from comment #50)
> OK, it's important to be sure the issue really doesn't happen before closing
> the BZ. So, please take your time.
> 
> I think this might be the reason why I haven't been able to reproduce, but I
> haven't tried with an older kernel (and won't try for now due to more
> important things to do).

With automation testing, I can still hit this problem.

Versions I tested with:
3.10.0-679.rt56.602.el7.x86_64
qemu-kvm-rhev-2.9.0-9.el7.x86_64
tuned-2.8.0-4.el7.noarch
libvirt-3.2.0-9.el7.x86_64


I noticed this in automation:

After running "/usr/sbin/tuned-adm profile realtime-virtual-guest" in the guest and rebooting, # ping $guest_ip works, but #ssh doesn't. As there is no console/spice/vnc device added in the rt guest, #ssh is the only way to access the guest, so I can't know what happened in the guest this time.

Several minutes (probably 15m) later, a Call Trace shows in the host.

I'll add a console to the rt guest for the next reproduction attempt.


Thanks,
Pei

Comment 52 Luiz Capitulino 2017-06-12 13:13:57 UTC
Pei,

We have three possibilites for this issue:

1. It's a regression
2. It's bug 1292927
3. We don't know what it is but it has always existed

How long does it take for you to reproduce? If it is less than one hour, I'd like to ask you to keep investigating whether it's a regression. I know that in comment 43 we have an indication that this could be a regression, but did you test it with your automation?

It will be hard to tell if this is bug 1292927, but I will give it a try soon.

Comment 53 Pei Zhang 2017-06-14 00:37:20 UTC
(In reply to Luiz Capitulino from comment #52)
> Pei,
> 
> We have three possibilites for this issue:
> 
> 1. It's a regression
> 2. It's bug 1292927
> 3. We don't know what it is but it has always existed
> 
> How long does it take for you to reproduce? If it is less than one hour, I'd
> like to ask you to keep investigating if it's a regression. 

Luiz, the time to reproduce is uncertain. Sometimes it takes 1 hour, but sometimes 3 hours or more are needed.

In the latest testing with 3.10.0-680.rt56.604.el7.x86_64, the blocked-task Call Trace info showed up after 3.2 hours.

> I know that on
> comment 43 we have an indication that this could be a regression, but did
> you test it with your automation?

No, I didn't try with automation yet. Next I'll try old versions with automation to confirm whether it's a regression.

> 
> It will be hard to tell if this is bug 1292927, but I will give it a try
> soon.


Thanks,
Pei

Comment 54 Pei Zhang 2017-06-16 01:54:45 UTC
Update:

While working on bug [1] with 3.10.0-680.rt56.604.el7.x86_64, doing 12 hours of latency testing on a single VM with 8 vCPUs, besides the Call Trace info in the host, the guest hung (I cannot access the guest by #ssh).

I'm testing manually, not with automation.

I hit the guest hang issue in earlier testing too; see Comment 10.

[1]Bug 1447938 - Boot single VM with 8 rt vCPUs, the latency value is high


I'm reproducing with old versions now.

Comment 55 Luiz Capitulino 2017-08-14 16:04:18 UTC
I think I might have reproduced this issue on a machine set up for zero-loss testing (that is, it has DPDK PMD threads in the host and guest). The issue triggers sometimes when I try to install an rpm package or copy a large file to another machine using rsync. I'll be investigating this later this week (as the machine is continuously running some tests now). Here's the full backtrace:

[ 1653.856334] SysRq : HELP : loglevel(0-9) reboot(b) crash(c) terminate-all-tasks(e) memory-full-oom-kill(f) kill-all-tasks(i) thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l) show-memory-usage(m) nice-all-RT-tasks(n) poweroff(o) show-registers(p) show-all-timers(q) unraw(r) sync(s) show-task-states(t) unmount(u) force-fb(V) show-blocked-tasks(w) dump-ftrace-buffer(z)
[ 1672.958914] SysRq : Show Blocked State
[ 1672.963104]   task                        PC stack   pid father
[ 1672.969726] xfsaild/dm-0    D ffff88086a8e1050     0   661      2 0x00000000
[ 1672.977610]  ffff8808693f3d28 0000000000000002 ffff8808693f3fd8 ffff8808693f3fd8
[ 1672.985898]  ffff8808693f3fd8 ffff8808693f3fd8 ffff88066d595190 ffff88086a8e1050
[ 1672.994187]  0000000000000000 ffff88086a8e1050 ffff88046b316d50 ffff88046b332000
[ 1673.002476] Call Trace:
[ 1673.005208]  [<ffffffff816aefd0>] schedule+0x30/0x96
[ 1673.010785]  [<ffffffffc038ea04>] _xfs_log_force+0x1e4/0x2e0 [xfs]
[ 1673.017677]  [<ffffffff810bf620>] ? wake_up_state+0x20/0x20
[ 1673.023909]  [<ffffffffc039b514>] ? xfsaild+0x184/0x820 [xfs]
[ 1673.030334]  [<ffffffffc038eb2c>] xfs_log_force+0x2c/0xa0 [xfs]
[ 1673.036951]  [<ffffffffc039b390>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
[ 1673.045023]  [<ffffffffc039b514>] xfsaild+0x184/0x820 [xfs]
[ 1673.051251]  [<ffffffffc039b390>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
[ 1673.059321]  [<ffffffffc039b390>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
[ 1673.067382]  [<ffffffff810a9b9f>] kthread+0xcf/0xe0
[ 1673.072823]  [<ffffffff810a9ad0>] ? kthread_worker_fn+0x170/0x170
[ 1673.079623]  [<ffffffff816ba058>] ret_from_fork+0x58/0x90 
[ 1673.085647]  [<ffffffff810a9ad0>] ? kthread_worker_fn+0x170/0x170
[ 1673.092454] kworker/3:1     D ffff8808633b61e0     0  3089      2 0x00000080
[ 1673.100351] Workqueue: xfs-cil/dm-0 xlog_cil_push_work [xfs]
[ 1673.106671]  ffff88085fa4fbe0 0000000000000002 ffff88085fa4ffd8 ffff88085fa4ffd8
[ 1673.114961]  ffff88085fa4ffd8 ffff88085fa4ffd8 ffff88066d5961e0 ffff8808633b61e0
[ 1673.123247]  ffff88046b316d50 ffff8807cbecaac8 ffff8808633b61e0 ffff88046b316d00
[ 1673.131535] Call Trace:
[ 1673.134263]  [<ffffffff816aefd0>] schedule+0x30/0x96
[ 1673.139814]  [<ffffffffc038dc66>] xlog_state_get_iclog_space+0x116/0x340 [xfs]
[ 1673.147885]  [<ffffffffc038d924>] ? xlog_sync+0x314/0x440 [xfs]
[ 1673.154491]  [<ffffffff810bf620>] ? wake_up_state+0x20/0x20
[ 1673.160717]  [<ffffffffc038e189>] xlog_write+0x1a9/0x750 [xfs]
[ 1673.167237]  [<ffffffffc03900d6>] xlog_cil_push+0x276/0x480 [xfs]
[ 1673.174046]  [<ffffffffc03902f5>] xlog_cil_push_work+0x15/0x20 [xfs]
[ 1673.181137]  [<ffffffff810a16d6>] process_one_work+0x176/0x4a0
[ 1673.187645]  [<ffffffff810a2585>] worker_thread+0x265/0x3f0
[ 1673.193861]  [<ffffffff810a2320>] ? manage_workers.isra.36+0x2b0/0x2b0
[ 1673.201145]  [<ffffffff810a9b9f>] kthread+0xcf/0xe0
[ 1673.206587]  [<ffffffff810a9ad0>] ? kthread_worker_fn+0x170/0x170
[ 1673.213385]  [<ffffffff816ba058>] ret_from_fork+0x58/0x90
[ 1673.219409]  [<ffffffff810a9ad0>] ? kthread_worker_fn+0x170/0x170
[ 1673.226209] kworker/3:0     D ffff880861a6d190     0  3831      2 0x00000080
[ 1673.234105] Workqueue: xfs-sync/dm-0 xfs_log_worker [xfs]
[ 1673.240135]  ffff8807f922bb28 0000000000000002 ffff8807f922bfd8 ffff8807f922bfd8
[ 1673.248422]  ffff8807f922bfd8 ffff8807f922bfd8 ffff8808633b61e0 ffff880861a6d190
[ 1673.256713]  7fffffffffffffff ffff8807f922bc70 ffff880861a6d190 ffffffffffffffff
[ 1673.265003] Call Trace:
[ 1673.267729]  [<ffffffff816aefd0>] schedule+0x30/0x96
[ 1673.273267]  [<ffffffff816ad161>] schedule_timeout+0x2e1/0x370
[ 1673.279776]  [<ffffffff812fa608>] ? blk_flush_plug_list+0x1d8/0x220
[ 1673.286770]  [<ffffffff816adf54>] wait_for_completion+0xc4/0x100
[ 1673.293474]  [<ffffffff8109f2f8>] flush_work+0xf8/0x180
[ 1673.299304]  [<ffffffff8109e020>] ? flush_workqueue_prep_pwqs+0x1d0/0x1d0
[ 1673.306890]  [<ffffffffc0390aaa>] xlog_cil_force_lsn+0x8a/0x240 [xfs]
[ 1673.314078]  [<ffffffff810c7cac>] ? dequeue_entity+0x11c/0x620
[ 1673.320587]  [<ffffffff811a4a6c>] ? next_zone+0x2c/0x30
[ 1673.326427]  [<ffffffffc038e8a5>] _xfs_log_force+0x85/0x2e0 [xfs]
[ 1673.333226]  [<ffffffff810ba38b>] ? migrate_enable+0xdb/0x210
[ 1673.339648]  [<ffffffffc038ebc4>] ? xfs_log_worker+0x24/0x50 [xfs]
[ 1673.346553]  [<ffffffffc038eb2c>] xfs_log_force+0x2c/0xa0 [xfs]
[ 1673.353169]  [<ffffffffc038ebc4>] xfs_log_worker+0x24/0x50 [xfs]
[ 1673.359871]  [<ffffffff810a16d6>] process_one_work+0x176/0x4a0
[ 1673.366379]  [<ffffffff810a248c>] worker_thread+0x16c/0x3f0
[ 1673.372596]  [<ffffffff810a2320>] ? manage_workers.isra.36+0x2b0/0x2b0
[ 1673.379880]  [<ffffffff810a9b9f>] kthread+0xcf/0xe0
[ 1673.385321]  [<ffffffff810a9ad0>] ? kthread_worker_fn+0x170/0x170
[ 1673.392120]  [<ffffffff816ba058>] ret_from_fork+0x58/0x90
[ 1673.398144]  [<ffffffff810a9ad0>] ? kthread_worker_fn+0x170/0x170
[ 1673.404943] rpm             D ffff8806406ca0a0     0  7465   3063 0x00000080
[ 1673.412825]  ffff880859b8f708 0000000000000002 ffff880859b8ffd8 ffff880859b8ffd8
[ 1673.421112]  ffff880859b8ffd8 ffff880859b8ffd8 ffffffff81a02480 ffff8806406ca0a0
[ 1673.429401]  ffff88079422b9f0 0000000000000002 0000000000000000 ffff8806406ca0a0
[ 1673.437690] Call Trace:
[ 1673.440417]  [<ffffffff816aefd0>] schedule+0x30/0x96
[ 1673.445956]  [<ffffffff816ad161>] schedule_timeout+0x2e1/0x370
[ 1673.452477]  [<ffffffffc038b19e>] ? xlog_bdstrat+0x2e/0x60 [xfs]
[ 1673.459181]  [<ffffffffc038d924>] ? xlog_sync+0x314/0x440 [xfs]
[ 1673.465788]  [<ffffffff816adc4a>] __down_common+0xa1/0xfb
[ 1673.471822]  [<ffffffffc0368eeb>] ? _xfs_buf_find+0x17b/0x380 [xfs]
[ 1673.478816]  [<ffffffff816adcc1>] __down+0x1d/0x1f
[ 1673.484161]  [<ffffffff810b0651>] down+0x41/0x50
[ 1673.489320]  [<ffffffffc0368c6c>] xfs_buf_lock+0x3c/0x140 [xfs]
[ 1673.495934]  [<ffffffffc0368eeb>] _xfs_buf_find+0x17b/0x380 [xfs]
[ 1673.502741]  [<ffffffffc036911a>] xfs_buf_get_map+0x2a/0x280 [xfs]
[ 1673.509651]  [<ffffffffc039c7a1>] xfs_trans_get_buf_map+0x131/0x1b0 [xfs]
[ 1673.517235]  [<ffffffffc0346842>] xfs_da_get_buf+0xc2/0x100 [xfs]
[ 1673.524043]  [<ffffffffc034d95f>] xfs_dir3_data_init+0x5f/0x280 [xfs]
[ 1673.531238]  [<ffffffffc034c708>] xfs_dir2_sf_to_block+0xc8/0x620 [xfs]
[ 1673.538619]  [<ffffffff810ba38b>] ? migrate_enable+0xdb/0x210
[ 1673.545041]  [<ffffffffc037adfa>] ? xfs_setup_inode+0xaa/0x180 [xfs]
[ 1673.552142]  [<ffffffffc037d87d>] ? xfs_ialloc+0x2fd/0x530 [xfs]
[ 1673.558853]  [<ffffffffc035473e>] xfs_dir2_sf_addname+0xce/0x5c0 [xfs]
[ 1673.566150]  [<ffffffffc038a867>] ? kmem_alloc+0x97/0x130 [xfs]
[ 1673.572755]  [<ffffffffc034a5f8>] xfs_dir_createname+0x198/0x1f0 [xfs]
[ 1673.580051]  [<ffffffffc037e5ad>] xfs_create+0x5fd/0x730 [xfs]
[ 1673.586570]  [<ffffffffc037b099>] xfs_vn_mknod+0xb9/0x260 [xfs]
[ 1673.593175]  [<ffffffff810ba38b>] ? migrate_enable+0xdb/0x210
[ 1673.599594]  [<ffffffffc037b273>] xfs_vn_create+0x13/0x20 [xfs]
[ 1673.606200]  [<ffffffff812081cd>] vfs_create+0xcd/0x130
[ 1673.612029]  [<ffffffff8120be72>] do_last+0x1122/0x12f0
[ 1673.617858]  [<ffffffff8120c102>] path_openat+0xc2/0x490
[ 1673.623784]  [<ffffffff8120d93b>] do_filp_open+0x4b/0xb0
[ 1673.629710]  [<ffffffff8121b854>] ? __alloc_fd+0xb4/0x140
[ 1673.635732]  [<ffffffff811fa073>] do_sys_open+0xf3/0x1f0
[ 1673.641660]  [<ffffffff8102b473>] ? syscall_trace_enter+0x173/0x240
[ 1673.648652]  [<ffffffff811fa18e>] SyS_open+0x1e/0x20
[ 1673.654188]  [<ffffffff816ba314>] tracesys+0xdd/0xe2

Comment 56 Luiz Capitulino 2017-08-14 17:39:57 UTC
Pei,

Could you try this with kernel 3.10.0-695.rt56.620:

"""
This feature is enabled with the following command:
  # echo RT_RUNTIME_GREED > /sys/kernel/debug/sched_features

The user needs also to disable NO_RT_RUNTIME_SHARE logic,
to keep all CPUs with the same rt_runtime.
  # echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features

Then, bound the throttling to 10 us:

sysctl -w kernel.sched_rt_runtime_us=999990
sysctl -w kernel.sched_rt_period_us=1000000
"""

You only have to do that in the host. What we want to observe is the following:

1. The hang issue is solved
2. There's no spike in cyclictest
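Put together, the host-side setup above can be wrapped in one small script. This is a sketch under assumptions: the knob paths and values are the ones quoted above, applying them needs root on a kernel that supports RT_RUNTIME_GREED, and the arithmetic merely double-checks that the runtime/period pair leaves the intended 10us window per period for throttled tasks:

```shell
#!/bin/bash
# The pair below leaves period - runtime = 10us per 1ms period during
# which non-RT (throttled) tasks are allowed to run.
runtime_us=999990
period_us=1000000
window_us=$(( period_us - runtime_us ))
echo "non-RT window per period: ${window_us}us"

# Applying the settings (not called here; needs root and an
# RT_RUNTIME_GREED-capable kernel, 3.10.0-695.rt56.620 or later):
apply_rt_runtime_greed() {
    echo RT_RUNTIME_GREED    > /sys/kernel/debug/sched_features
    echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features
    sysctl -w kernel.sched_rt_runtime_us="$runtime_us"
    sysctl -w kernel.sched_rt_period_us="$period_us"
}
```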

Comment 57 Pei Zhang 2017-08-15 09:11:10 UTC
(In reply to Luiz Capitulino from comment #56)
> Pei,
> 
> Could you try this with kernel 3.10.0-695.rt56.620:
> 
> """
> This feature is enabled with the following command:
>   # echo RT_RUNTIME_GREED > /sys/kernel/debug/sched_features

Hi Luiz, 

It seems 3.10.0-695.rt56.620.el7.x86_64 doesn't support the RT_RUNTIME_GREED option, as it fails as below:

# echo RT_RUNTIME_GREED > /sys/kernel/debug/sched_features
-bash: echo: write error: Invalid argument

However with 3.10.0-697.rt56.623.el7.x86_64, it works. So testing with 3.10.0-697.rt56.623.el7.x86_64 is also OK, right?


> The user needs also to disable NO_RT_RUNTIME_SHARE logic,
> to keep all CPUs with the same rt_runtime.
>   # echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features
> 
> Then, bound the throttling to to 10 us:
> 
> sysctl -w kernel.sched_rt_runtime_us=999990
> sysctl -w kernel.sched_rt_period_us=1000000
> """
> 
> You only have to do that in the host. What we want do observe is the
> following:
> 
> 1. The hang issue is solved
> 2. There's no spike in cyclictest

I guess I should check the hang issue first, as in previous testing this hang issue could also show up before running cyclictest, and also because cyclictest testing costs more time. Is this OK?


Thanks,
Pei

Comment 58 Luiz Capitulino 2017-08-15 13:03:46 UTC
Absolutely, the answer is yes for both questions.

There's one detail though: if you confirm that kernel 3.10.0-697.rt56.623.el7.x86_64 fixes the hang, we can't consider this BZ fixed before we test cyclictest because there's a chance the fix could generate a spike.

However, I think it may be very hard to determine whether or not the fix causes a spike, because to do that we'd have to reproduce the problem precisely when cyclictest wakes up. This is not so hard to do for issues that we can reproduce several times an hour, but for things that trigger once a day it may be impossible.

But as you said, let's start by checking if the hang is fixed.

Comment 59 Marcelo Tosatti 2017-08-15 23:54:05 UTC
(In reply to Luiz Capitulino from comment #58)
> Absolutely, the answer is yes for both questions.
> 
> There's one detail though: if you confirm that kernel
> 3.10.0-697.rt56.623.el7.x86_64 fixes the hang, we can't consider this BZ
> fixed before we test cyclictest because there's a chance the fix could
> generate a spike.
> 
> However, I think it may be very hard to determine whether or not the fix
> causes a spike because to do that, we'd have to reproduce the problem
> precisely when cyclictest wakes up. This is not so hard to do for issues
> that we can reproduce several times an hour, but for things that trig once a
> day it may be impossible.
> 
> But as you said, let's start by checking if the hang is fixed.

Luiz, with queuelat _any_ spike is detected (because it uses the CPU constantly). We can try that.

Comment 60 Luiz Capitulino 2017-08-16 12:44:52 UTC
We should probably do that. I think Pei still has to test with cyclictest to see if there are any obvious regressions, and then give queuelat a try.

Comment 61 Luiz Capitulino 2017-08-16 20:52:31 UTC
Just when I was getting excited about (probably) having reproduced this issue, it won't happen again. I've tried to reproduce it for several hours today, but it just won't happen.

Based on the trace from comment 55, my hypothesis for this issue is:

1. xfs decides it's time to sync its file-system log
2. The following code path executes from a kworker

process_one_work()
  xfs_log_worker()
    xfs_log_force()
      _xfs_log_force()
        xlog_cil_force()
          xlog_cil_force_lsn()
            xlog_cil_push_now()
              flush_work()

3. flush_work() is probably trying to insert work on a kworker running on an isolated core (or maybe on all cores in the system?). That kworker will never run on isolated cores, so the kworker from item 2 hangs forever

4. xfs threads and anything doing I/O hang in a row, probably because the kworker from item 2 is holding locks

If this is our issue, then the proposed fix in comment 56 is the best solution we have. There are two important details here. First, we'll need support for this in the tuned profiles. Secondly, we have to check if this causes a spike as discussed in previous comments (if this causes a spike, then we'll have to live with it for now I'm afraid).

A simple way to check whether the issue from comment 55 is the same issue as this BZ would be to take a look at the vmcore of this issue and confirm that the same sequence of events is happening. Pei, could you provide an up-to-date vmcore for me to check this?
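A userspace hint for step 3 of the hypothesis is to look at where unbound workqueue items are allowed to run. A hedged sketch (the cpumask file is the standard sysfs knob for unbound workqueues on kernels that expose it; the fallback is only so the snippet degrades gracefully where it is absent):

```shell
#!/bin/bash
# Show the cpumask unbound workqueues may use; if it overlaps the isolated
# cores, work can be queued to kworkers that never get to run there.
wq_mask=$(cat /sys/devices/virtual/workqueue/cpumask 2>/dev/null || echo unknown)
echo "unbound workqueue cpumask: $wq_mask"

# Compare against the isolated cores from the boot command line, if any.
isol=$(tr ' ' '\n' < /proc/cmdline | sed -n 's/^isolcpus=//p')
echo "isolcpus: ${isol:-none}"
```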

Comment 62 Pei Zhang 2017-08-17 02:20:06 UTC
(In reply to Luiz Capitulino from comment #61)
> Just when I was getting excited for (probably) having reproduced this issue,
> it won't happen again. I've tried to reproduce it for several hours today,
> but it just won't happen.
> 
> Based on the trace from comment 55, my hypothesis for this issue is:
> 
> 1. xfs decides it's time to sync its file-system log
> 2. The following code path executes from a kworker
> 
> process_one_work()
>   xfs_log_worker()
>     xfs_log_force()
>       _xfs_log_force()
>         xlog_cil_force()
>           xlog_cil_force_lsn()
>             xlog_cil_push_now()
>               flush_work()
> 
> 3. flush_work() is probably trying to insert work on a kworker running on an
> isolated core (or maybe on all cores in the system?). That kworker will
> never run on isolated cores, so the kworker from item 2 hangs forever
> 
> 4. xfs threads and anything doing I/O hangs in a row, probably because the
> kworker from item 2 is holding locks
> 
> If this is our issue, then the proposed fix in comment 56 is the best
> solution we have. There are two important details here. First, we'll need
> support for this in the tuned profiles. Secondly, we have to check if this
> causes a spike as discussed in previous comments (if this causes a spike,
> then we'll have to live with it for now I'm afraid).
> 
> A simple way to check if the issue from comment 55 is the same issue as this
> BZ, would be to take a look at the vmcore of this issue and confirm that the
> same sequence of events is happening. Pei, could you provide an up to date
> vmcore for me to check this?

Sure. I need to re-run to get this vmcore, I'll do this work today.

I have finished 2 automation runs with Comment 56 so far and didn't hit the hang issue. I'll run another 3 to confirm this hang issue more accurately. I will finish it by today.

Comment 63 Pei Zhang 2017-08-17 16:09:44 UTC
Update:

1. With Comment 56, all 5 automation test runs got PASS results (1 run with 3 test cases, and 4 runs with 1 test case); none of them hit the hang issue. So Comment 56 also solves the hang issue in our automation testing.

2. About the vmcore, the testing is still ongoing. I'll upload it once I get it.

Comment 64 Luiz Capitulino 2017-08-17 19:52:45 UTC
Pei,

Sorry for not saying this before, but there are two things we could do to tell whether or not the fix kicked in and avoided the hang:

1. Grep dmesg:

# dmesg | grep throttling
sched: RT throttling activated

2. Use trace-cmd:

# trace-cmd record -esched_sched -Mcpumask_of_isolated_cpus_running_rt_vcpus

Item 2 is better because we'll know the duration of the interruption and also whether or not it happened more than once.

Comment 65 Pei Zhang 2017-08-17 23:28:20 UTC
(In reply to Luiz Capitulino from comment #64)
> Pei,
> 
> Sorry for not saying this before, but there's something we could do to tell
> whether or not the fix kicked in and avoided the hang. There are two things
> we could do:
> 
> 1. Grep dmesg:
> 
> # dmesg | grep throttling
> sched: RT throttling activated
> 
> 2. Use trace-cmd:
> 
> # trace-cmd record -esched_sched -Mcpumask_of_isolated_cpus_running_rt_vcpus
> 
> Item 2 is better because we'll know the duration of the interruption and
> also whether or not it happned more than once.

Luiz, I'd like to confirm: we should do these for generating the hang vmcore, right? I mean, testing without the steps from Comment 56.

> # dmesg | grep throttling
> sched: RT throttling activated

Do you know how to activate RT throttling? I got nothing with "#dmesg | grep throttling"


Thanks,
Pei

Comment 66 Luiz Capitulino 2017-08-18 00:02:59 UTC
(In reply to Pei Zhang from comment #65)
> (In reply to Luiz Capitulino from comment #64)
> > Pei,
> > 
> > Sorry for not saying this before, but there's something we could do to tell
> > whether or not the fix kicked in and avoided the hang. There are two things
> > we could do:
> > 
> > 1. Grep dmesg:
> > 
> > # dmesg | grep throttling
> > sched: RT throttling activated
> > 
> > 2. Use trace-cmd:
> > 
> > # trace-cmd record -esched_sched -Mcpumask_of_isolated_cpus_running_rt_vcpus
> > 
> > Item 2 is better because we'll know the duration of the interruption and
> > also whether or not it happned more than once.
> 
> Luiz, I'd like to confirm: we should do these for generating the hang
> vmcore, right? I mean testing without steps of Comment 56.

No, it's the opposite: you should do these with the steps from comment 56, not to generate the vmcore. What we want to confirm is that the hang would trigger but the fix from comment 56 kicked in and avoided it. If we confirm that, then we don't even need the vmcore anymore.

I'm sorry for not having said this in comment 56, but only today I had the time to study the fix and figure out how to show that it's avoiding the bug.

> > # dmesg | grep throttling
> > sched: RT throttling activated
> 
> Do you know how to activate RT throttling? I got nothing with "#dmesg | grep
> throttling"

I guess it means the hang would not occur then, right Daniel?

Comment 67 Pei Zhang 2017-08-18 00:22:45 UTC
Thank you Luiz for explaining; I understand now. I'll do testing with Comment 56 + Comment 64 today.

Comment 68 Pei Zhang 2017-08-18 09:40:35 UTC
(In reply to Luiz Capitulino from comment #64)
> Pei,
> 
> Sorry for not saying this before, but there's something we could do to tell
> whether or not the fix kicked in and avoided the hang. There are two things
> we could do:
> 
> 1. Grep dmesg:
> 
> # dmesg | grep throttling
> sched: RT throttling activated
> 
> 2. Use trace-cmd:
> 
> # trace-cmd record -esched_sched -Mcpumask_of_isolated_cpus_running_rt_vcpus

Luiz, should this be -esched instead of -esched_sched? -esched_sched fails to start; or did I miss some options?

- -esched_sched fails to start
# trace-cmd record -esched_sched -M40000
trace-cmd: No such file or directory
  No events enabled with sched_sched

- -esched works. 
# trace-cmd record -esched -M40000
Hit Ctrl^C to stop recording


Thanks,
Pei
 
> Item 2 is better because we'll know the duration of the interruption and
> also whether or not it happned more than once.

Comment 69 Luiz Capitulino 2017-08-18 12:40:38 UTC
Ouch, sorry. It's -esched_switch. I don't know why I wrote -esched_sched.

Comment 70 Daniel Bristot de Oliveira 2017-08-21 08:53:51 UTC
> > 
> > Do you know how to activate RT throttling? I got nothing with "#dmesg | grep
> > throttling"
> 
> I guess it means the hang would not occur then, right Daniel?

Right, Sir!

-- Daniel

Comment 71 Pei Zhang 2017-08-21 14:28:49 UTC
Thanks Luiz and Daniel for confirming.

Versions:
3.10.0-701.rt56.627.el7.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.4.x86_64
tuned-2.8.0-5.el7.noarch
libvirt-3.2.0-14.el7.x86_64

Steps:
1. In Host
# echo RT_RUNTIME_GREED > /sys/kernel/debug/sched_features
# echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features
# sysctl -w kernel.sched_rt_runtime_us=999990
# sysctl -w kernel.sched_rt_period_us=1000000
# trace-cmd record -esched_switch -M40000

2. Install a new rt guest
...
  <vcpu placement='static'>2</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='19'/>
    <vcpupin vcpu='1' cpuset='18'/>
    <emulatorpin cpuset='1,3,5,7,9'/>
    <vcpusched vcpus='0-1' scheduler='fifo' priority='1'/>
  </cputune>


3. Start latency testing (5 minutes)

4. Check throttling
# dmesg | grep throttling
[ 6777.817775] sched: RT throttling activated

5. Ctrl+C to stop trace-cmd; trace.data for reference:
http://fileshare.englab.nay.redhat.com/pub/section2/coredump/var/crash/pezhang/bug1448770/trace/

Comment 72 Pei Zhang 2017-08-21 14:54:45 UTC
Another update:

When testing with 3.10.0-701.rt56.627.el7.x86_64 without the steps in [1], the hang issue is also gone.

I ran about 7 automation jobs and they all got PASS. (I was trying to get a hang vmcore, but all of them worked well; there is no hang any more.)


[1]
# echo RT_RUNTIME_GREED > /sys/kernel/debug/sched_features
# echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features
# sysctl -w kernel.sched_rt_runtime_us=999990
# sysctl -w kernel.sched_rt_period_us=1000000

Comment 73 Luiz Capitulino 2017-08-21 19:38:09 UTC
OK, so there are lots of things to comment on.

First, regarding comment 71, we can see the throttling message in dmesg. This means that the fix did kick in and avoided the hang. The trace output was less useful than I thought it would be; there are lots of context switches there, probably because the automation scripts are doing lots of stuff. But that doesn't matter much, I guess.

Regarding comment 72, as Clark told us on the phone, maybe you're just being lucky, or kernel -701 has changed the race window somehow. The best way to know would be to bisect the kernel and look for the change that may have altered this behavior, but I'm not sure it's worth it in your case because this is going to be very time consuming.

Regarding my hang from comment 61: as I wasn't able to get it with kernel -701 (like you above), I went back to kernel -693 and I'm still unable to reproduce. It looks like this one went away :((

A similar issue that's 100% reproducible is starting tracing when there's a fifo task taking 100% of the CPU. Daniel's fix from comment 56 fixes this one for me. The only detail is that I didn't change the sched_rt_runtime_us and sched_rt_period_us parameters, I only did:

# echo RT_RUNTIME_GREED > /sys/kernel/debug/sched_features
# echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features

Which means that we are allowing the task that preempts the RT task to execute for as long as it wants. The alternative is not much better, which is to set the rt_runtime_us and rt_period_us parameters, which only enforce the execution time at the tick, which fires at millisecond intervals.

As a next step, we have two options:

1. Enable the feature without an execution time limit, as I did in my testing above. Pros: fixes the hang. Cons: the worst case scenario is several spikes in the ms range

2. Improve the feature to guarantee that the RT task is preempted for only a very short period of time (e.g. 10us). Daniel is looking into this, but I'm afraid this may be difficult and not backportable to 7.4.z.

Comment 74 Pei Zhang 2017-08-22 10:01:11 UTC
(In reply to Luiz Capitulino from comment #73)
> OK, so there lots of things to comment on.
> 
> First, regarding comment 71 we can see the throttling message in dmesg. This
> means that the fix did kick in and avoided the hang. The trace output was
> less useful than I thought it would be, there's lots of context switches
> there probably because the automation scripts are doing lots of stuff. But
> that doesn't matter much I guess.
> 
> Regarding comment 72, as Clark told us on the phone, maybe you're just being
> lucky or kernel -701 has changed the race window somehow. Best way to know
> would be to bisect the kernel and look for the change that may have changed
> this behavior, but I'm not sure it's worth it in your case because this is
> going to be very time consuming.

The latest hang happened when testing with kernel-rt-3.10.0-695.rt56.621.el7 on Aug 8, but unfortunately I didn't save the vmcore from that hang.

Yes, it's indeed time-consuming, so I can probably try bisect testing after finishing the latency testing below.

> Regarding my hang from comment 61. As I wasn't being able to get it with
> kernel -701 (like you above), I went back to kernel -693 and I'm still
> unable to reproduce. It looks like this one went away :((
> 
> A similar issue that's 100% reproducible is starting tracing when there's a
> fifo task taking 100% of the CPU. Daniel's fix from comment 56 fix this one
> for me. The only detail is that I didn't change the sched_rt_runtime_us and
> sched_rt_period_us parameters, I only did:
> 
> # echo RT_RUNTIME_GREED > /sys/kernel/debug/sched_features
> # echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features
> 
> Which means that we are allowing the task preempting off the RT task to
> execute for how long it wants. The alternative is not much better, which is
> to set the rt_runtime_us and rt_period_us parameters, which only enforce the
> execution time from the tick which is in miliseconds interval.
> 
> As a next step, we have two options:
> 
> 1. Enable the feature without a execution time limit, as I did in my testing
> above. Pros: fix the hang. Cons: worst case scenario is several spikes in
> the ms range

OK, and I am doing this testing this week.


Thanks,
Pei

> 2. Improve the feature to guarantee that the RT task is preempted for very
> short period of time (eg. 10us). Daniel is looking into this, but I'm afraid
> this may be difficult and not backportable to 7.4.z.

Comment 75 Pei Zhang 2017-08-23 02:18:44 UTC
There is a spike when testing with [1]; the max latency is 00026. This spike showed up quickly after starting the latency testing.


[1]
# echo RT_RUNTIME_GREED > /sys/kernel/debug/sched_features
# echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features


==Versions==
3.10.0-703.rt56.630.el7.x86_64
tuned-2.8.0-5.el7.noarch
qemu-kvm-rhev-2.9.0-16.el7_4.4.x86_64
libvirt-3.2.0-14.el7.x86_64


==Latency testing results==
Test started at Tue Aug 22 21:57:26 CST 2017

Test duration:    24h
Run rteval:       n
Run stress:       y
Isolated CPUs:    1
Kernel:           3.10.0-703.rt56.630.el7.x86_64
Kernel cmd-line:  BOOT_IMAGE=/vmlinuz-3.10.0-703.rt56.630.el7.x86_64 root=/dev/mapper/rhel_bootp--73--75--245-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto rd.lvm.lv=rhel_bootp-73-75-245/root rd.lvm.lv=rhel_bootp-73-75-245/swap rhgb quiet default_hugepagesz=1G iommu=pt intel_iommu=on isolcpus=1 intel_pstate=disable nosoftlockup skew_tick=1 nohz=on nohz_full=1 rcu_nocbs=1
Machine:          bootp-73-75-245.lab.eng.pek2.redhat.com
CPU:              Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
Results dir:      /home/log_nfv-virt-rt-kvm
running stress
   taskset -c 1 /home/nfv-virt-rt-kvm/tests/stress --cpu 1
starting Tue Aug 22 21:57:27 CST 2017
   taskset -c 1 cyclictest -m -n -q -p95 -D 24h -h60 -i 200 -t 1 -a 1 -b25 --notrace
ended Tue Aug 22 21:57:27 CST 2017

Test ended at Tue Aug 22 21:57:27 CST 2017

Latency testing results:
# Min Latencies: 00026
# Avg Latencies: 00026
# Max Latencies: 00026


==More detail testing info reference==
http://lab-01.rhts.eng.pek2.redhat.com/beaker/logs/tasks/60249+/60249114/TESTOUT.log

Comment 76 Pei Zhang 2017-08-23 08:33:08 UTC
This hang can be reproduced with the latest kernel-rt, so the issue still exists.


Versions:
3.10.0-703.rt56.630.el7.x86_64
tuned-2.8.0-5.el7.noarch
qemu-kvm-rhev-2.9.0-16.el7_4.4.x86_64
libvirt-3.2.0-14.el7.x86_64


vmcore reference:
http://fileshare.englab.nay.redhat.com/pub/section2/coredump/var/crash/pezhang/bug1448770/aug23_vmcore/


Note:
1. I hit this issue while debugging the automation code (download a .src.rpm, unpack it, make -j2; these steps are executed more than once) in the RT guest.

2. I didn't execute the commands below on the host.
# echo RT_RUNTIME_GREED > /sys/kernel/debug/sched_features
# echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features

Comment 79 Marcelo Tosatti 2017-08-25 01:04:10 UTC
Luiz, Daniel,

As Daniel suggested in the KVM-RT threads, nanosleep(X) should be
enough to fix the problem (in DPDK). With X between 5us and 8us,
that should be sufficient to let the XFS thread run.

I suppose that one nanosleep every 1000us, which is less than 1%
of total time, should be enough.

I'll give that a try here and see the effects on latency.
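The duty-cycle figure above can be sanity-checked with a quick calculation (a sketch; the 1000us period and the 5us-8us sleep lengths are the values proposed in this comment):

```shell
# Fraction of total CPU time spent in nanosleep if the busy-poll loop
# sleeps for X microseconds once in every 1000us period.
for X in 5 8; do
    awk -v x="$X" \
        'BEGIN { printf "sleep %dus per 1000us -> %.1f%% of total time\n",
                 x, x / 1000 * 100 }'
done
```

Both values stay well under the 1% figure mentioned above.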

Comment 80 Luiz Capitulino 2017-08-25 13:28:23 UTC
That may work, the only two details that come to my mind are:

1. I'd expect this solution to be a very hard sell on upstream. We're doing two things that DPDK was designed to avoid: entering the kernel and blocking

2. Remember that we may have several PMD threads running. In my testbed for example, I have 6. We'd have to study the effects of having several PMD threads blocking for X microseconds (maybe they should block at the same time? How to coordinate this with PMD threads running in the guest?)

Also, if Pei confirms that the kernel I provided fixes the issue and the xfs guys agree to change xfs, then this BZ wouldn't depend on that solution. So, I'll focus on that route for now, although we know that this issue could theoretically happen anywhere in the kernel.

Comment 83 Marcelo Tosatti 2017-08-28 22:28:56 UTC
(In reply to Luiz Capitulino from comment #80)
> That may work, the only two details that come to my mind are:
> 
> 1. I'd expect this solution to be a very hard sell on upstream. We're doing
> two things that DPDK was designed to avoid: entering the kernel and blocking

It's designed not to enter the kernel and not to block for a reason: to avoid excessive latencies. If nanosleep does not cause excessive latencies, it's fine.


> 2. Remember that we may have several PMD threads running. In my testbed for
> example, I have 6. We'd have to study the effects of having several PMD
> threads blocking for X microseconds (maybe they should block at the same
> time? How to coordinate this with PMD threads running in the guest?)
> 
> Also, if Pei confirms that the kernel I provided fixes the issue and xfs
> guys agree in changing xfs, then this BZ wouldn't depend on that solution.
> So, I'll focus on that route for now. Although we know that this issue is
> theoretically possible to happen anywhere in the kernel.

Sure, but it's still fixing a wide-ranging problem in one particular site,
while the nanosleep fix addresses the wide-ranging problem itself, which is
the same issue as in

https://bugzilla.redhat.com/show_bug.cgi?id=1455677

Comment 84 Marcelo Tosatti 2017-08-28 22:31:11 UTC
(In reply to Luiz Capitulino from comment #80)
> That may work, the only two details that come to my mind are:
> 
> 1. I'd expect this solution to be a very hard sell on upstream. We're doing
> two things that DPDK was designed to avoid: entering the kernel and blocking
> 
> 2. Remember that we may have several PMD threads running. In my testbed for
> example, I have 6. We'd have to study the effects of having several PMD
> threads blocking for X microseconds (maybe they should block at the same
> time? How to coordinate this with PMD threads running in the guest?)

Are you worried about locks, or about the processor entering a deep C-state?

> 
> Also, if Pei confirms that the kernel I provided fixes the issue and xfs
> guys agree in changing xfs, then this BZ wouldn't depend on that solution.
> So, I'll focus on that route for now. Although we know that this issue is
> theoretically possible to happen anywhere in the kernel.

Comment 85 Luiz Capitulino 2017-08-29 13:06:35 UTC
Locks can be a problem too, but what I was thinking is: suppose you have two PMD threads A and B, each entering the kernel at different times for 10us. If thread B enters the kernel at the same time thread A is leaving it, wouldn't the total sleep duration be 20us if we consider the entire host as a single packet processing unit? I'm not sure if this makes sense, but in any case, the best place to discuss this solution is in the DPDK ML.

Comment 86 Luiz Capitulino 2017-08-29 13:17:48 UTC
Thinking about this again, I don't think my comment makes sense. We probably want the opposite of what I'm suggesting (e.g. something like skew_tick=1 behavior). But anyway, I think this discussion has to be moved to the DPDK ML.

Comment 91 Luiz Capitulino 2017-08-31 19:09:11 UTC
We have found out that we can't reproduce the issue with the following commit reverted:

commit b829c6b75e5bde74e9b6cb1e8141d605e5cad9f2
Author: Brian Foster <bfoster@redhat.com>
Date:   Mon Apr 3 14:22:35 2017 -0400

    [fs] xfs: use dedicated log worker wq to avoid deadlock with cil wq

That commit is a backport of upstream commit 696a562072. This commit introduces a new workqueue to xfs called xfs-sync and moves some work previously performed by the xfs-log workqueue to the new workqueue. The xfs-sync workqueue doesn't have the WQ_MEM_RECLAIM and WQ_HIGHPRI flags (which are present in the xfs-log workqueue).

My _hypothesis_ for this issue is that the WQ_MEM_RECLAIM flag prevents this issue. This flag creates a rescuer thread associated with the workqueue. I guess that the workqueue subsystem may queue work to the rescuer thread instead of picking up a kworker under certain conditions. The rescuer thread respects isolcpus so it will never run on an isolated CPU. However, without WQ_MEM_RECLAIM work queued for the workqueue is always queued for a kworker, which can end up executing on an isolated CPU. When this happens with KVM-RT, the kworker will starve, as there are RT tasks taking 100% of isolated CPUs all the time.
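One way to eyeball this hypothesis on a live system is to compare the allowed-CPU masks of ordinary kworkers and rescuer threads (a diagnostic sketch; the thread name patterns and the presence of these /proc fields are assumptions about the target system):

```shell
# Print each kworker/xfs kernel thread together with its allowed-CPU
# list (Cpus_allowed_list from /proc/<pid>/status). On a host booted
# with isolcpus=, a WQ_MEM_RECLAIM rescuer thread (e.g. xfs-log)
# should exclude the isolated CPUs, while ordinary kworkers may not.
for pid in $(pgrep 'kworker|xfs'); do
    comm=$(cat "/proc/$pid/comm" 2>/dev/null) || continue
    cpus=$(awk '/^Cpus_allowed_list/ {print $2}' "/proc/$pid/status" 2>/dev/null)
    printf '%-20s %s\n' "$comm" "$cpus"
done
```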

If my hypothesis is correct, then there are two important things to observe:

1. This is a regression affecting the 7.4 RT kernel

2. The fix may be simple: we may just need to add the WQ_MEM_RECLAIM flag to xfs-sync. However, we haven't checked if this will actually fix the issue

I'll get xfs people involved in this discussion.

Comment 94 Luiz Capitulino 2017-09-05 13:01:15 UTC
As it turns out, Pei is still able to reproduce this issue even with the commit from comment 91 reverted. I'll continue to look into this.

Comment 95 Luiz Capitulino 2017-10-02 14:05:20 UTC
I was lent a machine that was able to reproduce this in a matter of hours. I say "was" because the issue just went away in the past week, and I haven't been able to reproduce it for several days now.

In any case, I'll post the debugging information I was able to collect while the problem was occurring.

The xfsaild thread is never woken up from this code path:

xfsaild()
  xfsaild_push()
    xfs_log_force()
      _xfs_log_force()
        xlog_wait()
          add_wait_queue_exclusive(iclog->ic_force_wait)
            schedule()

I believe this is the code path that's supposed to wake it up:

xfs_buf_ioend_async() (from ksoftirqd or megasas irq handler, queues
                       xfs_buf_ioend_work() to bp->b_ioend_work)
  xfs_buf_ioend_work() (from kworker starting here)
    xfs_buf_ioend()
      xlog_iodone()
        xlog_state_do_callback()
          wake_up_all(iclog->ic_force_wait)

So, our problem is: either the wakeup code path is not executing, or it is executing but not reaching wake_up_all(). Once we find out why this is so, we'll know what's wrong. Here are the hypotheses I came up with:

1. The wake up is missed. That is, the wake up never happens due
   to a race condition

2. The wake up function is queued to a kworker bound to an isolated
   CPU, so it will never run since the fifo:1 task executing there
   won't let it

3. xfsaild is being woken up, but it wakes up on an isolated core
   so it will never run

We have some evidence that (2) or (3) may be happening:

- For (2): when enabling RT throttling, I see kworkers preempting vCPU threads and executing on isolated cores. However, for some reason tracing gets automagically disabled and I haven't been able to confirm what the kworker is doing (it could be vmstat for example)

- For (3): when the wakeup does happen, xfsaild almost always wakes up on a different core from where it was queued and woken. Of course this is expected, but I find it curious that the blocking and the wakeup happen on the same core while xfsaild wakes up on a different one

Lastly, I have several messages like this in dmesg:

CPU9: NOHZ: local_softirq_pending: SCHED

This seems to indicate that the ksoftirqd thread is getting blocked while a softirq is raised. I haven't investigated if this has any connection to our bug.

Comment 96 Pei Zhang 2017-10-16 10:02 UTC
Created attachment 1339148 [details]
Call trace info with 3.10.0-732.rt56.660.el7.x86_64

Uploading the call trace info from the latest testing versions.

Comment 100 Luiz Capitulino 2017-11-13 18:14:07 UTC
We have some very important news on this issue. Pei has a machine that can reproduce this issue in a few hours. She was able to bisect kernel versions and found the following:

"""
versions                       latency_running_time  results
3.10.0-660.rt56.578.el7.x86_64    6h                 work
3.10.0-663.rt56.582.el7.x86_64    <2h                fail
"""

This means that RT kernel 663.rt56.582 is the first bad kernel. This kernel contains the (RT-specific) series that lifts the single-reader restriction on rwsems (bug 1438512). Reverting this series makes the xfs hang go away!

We have two possibilities:

1. This series has a bug
2. This series is uncovering a locking issue in xfs or somewhere else

Given that this series is very important for the RT kernel, we're going to try to fix the issue (whatever it may be). If we don't get it fixed in time for the 7.5 beta deadline, we're going to revert it.

Comment 101 Clark Williams 2017-11-27 16:59:35 UTC
Testing shows that reverting the rwsem commits causes this problem to go away. The commits are:

5d6a5f7fe - rtmutex: Make lock_killable work
354133ae7 - rtmutex: Provide rt_mutex_lock_state()
8909336e9 - rtmutex: Provide locked slowpath
e384d264a - rwsem/rt: Lift single reader restriction
e87c7cfda - rwsem/rt: get rid of referernces to rtmutex_chainwalk enums from upstream

Not sure at this point whether this is a race window being moved around or a bug being fixed. For now we'll revert the rwsem commits above and regroup for 7.6. The plan for 7.6 is to pull in not just the rwsem patches but the rwlock patches as well.

Comment 102 Luiz Capitulino 2017-11-27 17:16:06 UTC
Btw, because of bug 1516984, you'll have to provide a -794 kernel, otherwise we won't be able to test it.

Comment 103 Luiz Capitulino 2017-11-28 13:57:58 UTC
As we've decided that the solution to this bug is to revert the series from comment 101, I'm reassigning this to Luis Claudio.

Comment 104 Luis Claudio R. Goncalves 2017-11-28 14:40:16 UTC
I have performed the revert in kernel-rt-3.10.0-794.rt56.729.el7__bz1448770.el7, which is a scratch build on the way. The source code can be found here:

  Repository: git://git.engineering.redhat.com/users/lgoncalv/mrg-rt-v2-devel.git
  Branch: kernel-rt-3.10.0-794.rt56.729.el7__bz1448770
  URL: http://git.engineering.redhat.com/git/users/lgoncalv/mrg-rt-v2-devel.git/log/?h=kernel-rt-3.10.0-794.rt56.729.el7__bz1448770

This build has basically two commits:

  d9be3d53d656 [rt] rwsem/rt: revert the "rwsem/rt: lift single reader restriction" series
  6d2a1d604a59 [rt] net: sk_busy_loop: do not disable preemption on RT

The second commit is a fix present in -799 that may be helpful to your tests.

As soon as the scratch build is done, I will post the related URLs here.

Comment 105 Luis Claudio R. Goncalves 2017-11-28 16:48:21 UTC
The files for this kernel can be found here:

   http://rtserver.farm.hsv.redhat.com/bz1448770/

Information about this kernel build:

    https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=14637131

Comment 106 Pei Zhang 2017-11-29 01:38:10 UTC
(In reply to Luis Claudio R. Goncalves from comment #105)
> The files for this kernel can be found here:
> 
>    http://rtserver.farm.hsv.redhat.com/bz1448770/
> 
> Information about this kernel build:
> 
>     https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=14637131

Got it. I'm testing this build now.


Thanks,
Pei

Comment 107 Pei Zhang 2017-12-04 10:18:29 UTC
(In reply to Luis Claudio R. Goncalves from comment #105)
> The files for this kernel can be found here:
> 
>    http://rtserver.farm.hsv.redhat.com/bz1448770/
> 
> Information about this kernel build:
> 
>     https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=14637131

I tested this build 4 times, each run lasting 24 hours, and all got PASS results. The xfs issue is gone with this build.

And the latency results look good too; the max latency value is 00020.

run 1:
# Min Latencies: 00004 00006 00005 00006 00006 00006 00004 00006
# Avg Latencies: 00006 00006 00006 00006 00006 00006 00006 00006
# Max Latencies: 00017 00017 00016 00017 00016 00017 00016 00017

run 2:
# Min Latencies: 00004 00006 00006 00006 00006 00006 00006 00006
# Avg Latencies: 00006 00006 00006 00006 00006 00006 00006 00006
# Max Latencies: 00016 00017 00015 00016 00015 00015 00015 00015

run 3:
# Min Latencies: 00004 00005 00006 00005 00006 00006 00006 00006
# Avg Latencies: 00006 00006 00006 00006 00006 00006 00006 00006
# Max Latencies: 00017 00015 00015 00015 00016 00016 00016 00015

run 4:
# Min Latencies: 00004 00006 00006 00006 00006 00006 00006 00006
# Avg Latencies: 00006 00006 00007 00006 00006 00006 00006 00006
# Max Latencies: 00017 00016 00020 00016 00016 00016 00017 00017


Thanks,
Pei

Comment 108 Luis Claudio R. Goncalves 2017-12-07 17:38:58 UTC
The required revert has been applied to kernel-rt-3.10.0-812.rt56.750.el7, as the following commit:

   a3958191652c rwsem/rt: revert the "rwsem/rt: lift single reader restriction" series

Comment 109 Pei Zhang 2017-12-11 08:12:23 UTC
Testing kernel-rt-3.10.0-817.rt56.755.el7.x86_64 on host [1], this xfs issue was gone during 24-hour latency testing.

Versions:
kernel-rt-3.10.0-817.rt56.755.el7.x86_64
libvirt-3.9.0-5.el7.x86_64
tuned-2.9.0-1.el7.noarch
qemu-kvm-rhev-2.10.0-11.el7.x86_64


Results:
# Min Latencies: 00004 00006 00005 00006 00006 00006 00006 00006
# Avg Latencies: 00006 00006 00006 00006 00006 00006 00006 00006
# Max Latencies: 00016 00016 00015 00015 00015 00015 00015 00015


[1]dell-per430-09.lab.eng.pek2.redhat.com


Thanks,
Pei

Comment 111 Pei Zhang 2017-12-31 02:11:08 UTC
==Versions==
tuned-2.9.0-1.el7.noarch
kernel-rt-3.10.0-824.rt56.762.el7.x86_64
libvirt-3.9.0-6.el7.x86_64
qemu-kvm-rhev-2.10.0-13.el7.x86_64

24 hours latency testing results:
(1)Single VM with 1 rt vCPU:
# Min Latencies: 00006
# Avg Latencies: 00006
# Max Latencies: 00017

(2)Single VM with 8 rt vCPUs:
# Min Latencies: 00004 00006 00006 00006 00006 00006 00006 00006
# Avg Latencies: 00006 00006 00006 00006 00006 00006 00006 00006
# Max Latencies: 00016 00018 00019 00018 00016 00016 00018 00018

(3)Multiple VMs each with 1 rt vCPU:
- VM1
# Min Latencies: 00006
# Avg Latencies: 00006
# Max Latencies: 00016

- VM2
# Min Latencies: 00006
# Avg Latencies: 00006
# Max Latencies: 00019

- VM3
# Min Latencies: 00006
# Avg Latencies: 00007
# Max Latencies: 00018

- VM4
# Min Latencies: 00006
# Avg Latencies: 00006
# Max Latencies: 00017


All 3 scenarios above got PASS results on 2 separate hosts [1][2].

[1]dell-per430-09.lab.eng.pek2.redhat.com
[2]dell-per430-11.lab.eng.pek2.redhat.com


So this bug has been fixed. Thank you all.


Moving the status of this bug to 'VERIFIED'.

Comment 114 errata-xmlrpc 2018-04-10 09:07:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:0676

