Bug 1438120
Summary: | Real Time host becomes hung when doing ping-pong migration over dpdk+openvswitch with OVS PMD threads set to priority 95 | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Pei Zhang <pezhang> |
Component: | kernel-rt | Assignee: | Marcelo Tosatti <mtosatti> |
kernel-rt sub component: | KVM | QA Contact: | Pei Zhang <pezhang> |
Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
Severity: | high | ||
Priority: | high | CC: | bhu, chayang, hhuang, juzhang, knoel, lcapitulino, michen, pagupta, peterx, pezhang, virt-maint, xfu |
Version: | 7.4 | ||
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2017-07-11 01:05:03 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1452357 | ||
Bug Blocks: | |||
Attachments: |
Description
Pei Zhang
2017-04-01 01:56:24 UTC
More additional info: 1. This issue can occur about within 20 ping-pong migrations. 2. Non rt didn't hit this issue. 3. Remove step "# chrt -fp 95 $ovs_pmd_process_id" in host1 and host2, and launch testpmd using "testpmd ..." by replacing"chrt -f 95 testpmd ...", migration will work well. Also test with memory locked(Others keeps same with above [3]), still hit above problem. <memory unit='KiB'>8388608</memory> <currentMemory unit='KiB'>8388608</currentMemory> <memtune> <hard_limit unit='KiB'>10485760</hard_limit> </memtune> <memoryBacking> <hugepages> <page size='1048576' unit='KiB' nodeset='0'/> </hugepages> <locked/> </memoryBacking> Besides host hang problem, we also found other problems(They are not occur at same time): ==Issue 1==: 1. After several(20 or more) ping-pong migration, host1 or host2 didn't hang, however migration fail with below error. # virsh migrate --verbose --persistent --live rhel7.4_rt qemu+ssh://192.168.1.1/system error: Cannot open log file: '/var/log/libvirt/qemu/rhel7.4_rt.log': Device or resource busy 2. Call trace occurs in host1 or host2 [ 7233.924060] INFO: task kworker/8:0:83 blocked for more t. [ 7233.931531] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this m. [ 7233.940268] kworker/8:0 D ffff8802ec813090 0 83 2 0x00000000 [ 7233.948157] ffff8802ec82bd98 0000000000000002 ffff8802ec82bfd8 ffff8802ec82bf8 [ 7233.956445] ffff8802ec82bfd8 ffff8802ec82bfd8 ffff8802ed2c3090 ffff8802ec81300 [ 7233.964733] ffff8802ec854000 ffff8802ec82bdd8 ffff8802ec813090 ffff8802ec85408 [ 7233.973022] Call Trace: [ 7233.975753] [<ffffffff9d69ce90>] schedule+0x30/0x96 [ 7233.981293] [<ffffffff9d08ed1d>] del_timer_sync+0x9d/0x110 [ 7233.987511] [<ffffffff9d0a8850>] ? wake_up_atomic_t+0x30/0x30 [ 7233.994020] [<ffffffff9d0a01ec>] manage_workers.isra.36+0x20c/0x2b0 [ 7234.001108] [<ffffffff9d0a0667>] worker_thread+0x3d7/0x3f0 [ 7234.007325] [<ffffffff9d0a0290>] ? manage_workers.isra.36+0x2b0/0x2b0 [ 7234.014608] [<ffffffff9d0a7abf>] kthread+0xcf/0xe0 [ 7234.020047] [<ffffffff9d0a79f0>] ? kthread_worker_fn+0x170/0x170 [ 7234.026847] [<ffffffff9d6a7ad8>] ret_from_fork+0x58/0x90 [ 7234.032869] [<ffffffff9d0a79f0>] ? kthread_worker_fn+0x170/0x170 [ 7234.039699] INFO: task kworker/8:1:6288 blocked for more than 600 seconds. [ 7234.047369] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this m. [ 7234.056108] kworker/8:1 D ffff880467309030 0 6288 2 0x00000080 [ 7234.063991] ffff88042b733ea8 0000000000000002 ffff88042b733fd8 ffff88042b733f8 [ 7234.072276] ffff88042b733fd8 ffff88042b733fd8 ffff88066d51b090 ffff88046730900 [ 7234.080563] ffff88046e4bb200 ffffffff9d0a0290 0000000000000000 000000000000000 [ 7234.088854] Call Trace: [ 7234.091582] [<ffffffff9d0a0290>] ? manage_workers.isra.36+0x2b0/0x2b0 [ 7234.098866] [<ffffffff9d69ce90>] schedule+0x30/0x96 [ 7234.104402] [<ffffffff9d0a7a9c>] kthread+0xac/0xe0 [ 7234.109842] [<ffffffff9d0a79f0>] ? kthread_worker_fn+0x170/0x170 [ 7234.116640] [<ffffffff9d6a7ad8>] ret_from_fork+0x58/0x90 [ 7234.122662] [<ffffffff9d0a79f0>] ? kthread_worker_fn+0x170/0x170 ==Issue 2==: 1. VM become hang, and call trace occurs in host1 or host2, host1 and host2 didn't hang. 
[ 371.625648] CPU0: NOHZ: local_softirq_pending: SCHED [ 371.625657] CPU0: NOHZ: local_softirq_pending: SCHED [ 371.625668] CPU0: NOHZ: local_softirq_pending: SCHED [ 371.625692] CPU0: NOHZ: local_softirq_pending: SCHED [ 371.625699] CPU0: NOHZ: local_softirq_pending: SCHED [ 371.625717] CPU0: NOHZ: local_softirq_pending: SCHED [ 371.625727] CPU0: NOHZ: local_softirq_pending: SCHED [ 371.625737] CPU0: NOHZ: local_softirq_pending: SCHED [ 371.625762] CPU0: NOHZ: local_softirq_pending: SCHED [ 371.625797] CPU0: NOHZ: local_softirq_pending: SCHED [ 1234.528484] INFO: task worker:4477 blocked for more than. [ 1234.535673] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this m. [ 1234.544412] worker D ffff880858bc50f0 0 4477 1 0x00000080 [ 1234.552295] ffff880461f33c88 0000000000000002 ffff880461f33fd8 ffff880461f33f8 [ 1234.560584] ffff880461f33fd8 ffff880461f33fd8 ffff88066d60d0f0 ffff880858bc500 [ 1234.568870] ffff880461f33fd8 ffff88086639fb48 ffff880461f33cd0 000000000000026 [ 1234.577158] Call Trace: [ 1234.579888] [<ffffffff8d69ce90>] schedule+0x30/0x96 [ 1234.585426] [<ffffffff8d69dc4d>] rt_spin_lock_slowlock+0x13d/0x350 [ 1234.592420] [<ffffffff8d69ede5>] rt_spin_lock+0x25/0x30 [ 1234.598364] [<ffffffffc05aec53>] kvm_mmu_notifier_invalidate_range_end+0x23/0] [ 1234.607299] [<ffffffff8d1d5af4>] __mmu_notifier_invalidate_range_end+0x64/0xe0 [ 1234.615456] [<ffffffff8d1afd75>] unmap_vmas+0x85/0x90 [ 1234.621190] [<ffffffff8d1b4c5e>] unmap_region+0xbe/0x140 [ 1234.627214] [<ffffffff8d1b70d2>] do_munmap+0x292/0x470 [ 1234.633043] [<ffffffff8d1b7318>] vm_munmap+0x68/0xb0 [ 1234.638677] [<ffffffff8d1b83b2>] SyS_munmap+0x22/0x30 [ 1234.644409] [<ffffffff8d6a7d94>] tracesys+0xdd/0xe2 [ 1234.649945] INFO: task live_migration:4494 blocked for more than 600 seconds. [ 1234.657906] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this m. [ 1234.666641] live_migration D ffff88087fa16120 0 4494 1 0x00000080 [ 1234.674523] ffff880445fefb60 0000000000000002 ffff880445feffd8 ffff880445feff8 [ 1234.682808] ffff880445feffd8 ffff880445feffd8 ffff88066d60c0c0 ffff88087fa1610 [ 1234.691091] ffff880445feffd8 ffff88086639fb48 ffff880445fefba8 000000000000026 [ 1234.699378] Call Trace: [ 1234.702103] [<ffffffff8d69ce90>] schedule+0x30/0x96 [ 1234.707641] [<ffffffff8d69dc4d>] rt_spin_lock_slowlock+0x13d/0x350 [ 1234.714633] [<ffffffff8d69ede5>] rt_spin_lock+0x25/0x30 [ 1234.720570] [<ffffffffc05dd542>] kvm_mmu_change_mmu_pages+0x42/0xd0 [kvm] [ 1234.728252] [<ffffffffc05cd007>] kvm_arch_commit_memory_region+0xd7/0xe0 [kvm] [ 1234.736414] [<ffffffffc05b3781>] __kvm_set_memory_region+0x661/0x9e0 [kvm] [ 1234.744188] [<ffffffffc05b3b2b>] kvm_set_memory_region+0x2b/0x40 [kvm] [ 1234.751573] [<ffffffffc05b3f94>] kvm_vm_ioctl+0x454/0x7e0 [kvm] [ 1234.758277] [<ffffffff8d0ee473>] ? futex_wake+0x93/0x190 [ 1234.764301] [<ffffffff8d0f12d2>] ? do_futex+0x122/0x6b0 [ 1234.770230] [<ffffffff8d20e50d>] do_vfs_ioctl+0x33d/0x550 [ 1234.776350] [<ffffffff8d2b0dcf>] ? file_has_perm+0x9f/0xb0 [ 1234.782568] [<ffffffff8d20e7c1>] SyS_ioctl+0xa1/0xc0 [ 1234.788203] [<ffffffff8d6a7d94>] tracesys+0xdd/0xe2 [ 1234.793742] INFO: task pgrep:4505 blocked for more than 600 seconds. [ 1234.800829] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this m. 
[ 1234.809567] pgrep D ffff88085b29d0f0 0 4505 4504 0x00000080 [ 1234.817450] ffff88066d7dfd20 0000000000000002 ffff88066d7dffd8 ffff88066d7dff8 [ 1234.825737] ffff88066d7dffd8 ffff88066d7dffd8 ffff88066d60e120 ffff88085b29d00 [ 1234.834024] ffff88085b29d0f0 0000000000000000 ffff88085b29d0f0 000000000000002 [ 1234.842312] Call Trace: [ 1234.845038] [<ffffffff8d69ce90>] schedule+0x30/0x96 [ 1234.850574] [<ffffffff8d69da85>] __rt_mutex_slowlock+0xc5/0x150 [ 1234.857276] [<ffffffff8d69df8e>] rt_mutex_slowlock+0x12e/0x210 [ 1234.863880] [<ffffffff8d69e121>] rt_mutex_lock+0x31/0x40 [ 1234.869903] [<ffffffff8d0f3328>] __rt_down_read.isra.1+0x28/0x30 [ 1234.876703] [<ffffffff8d0f333e>] rt_down_read+0xe/0x10 [ 1234.882531] [<ffffffff8d26f7a0>] proc_pid_cmdline_read+0xc0/0x570 [ 1234.889426] [<ffffffff8d2ab684>] ? security_file_permission+0x84/0xa0 [ 1234.896711] [<ffffffff8d1f9acc>] vfs_read+0x9c/0x170 [ 1234.902345] [<ffffffff8d1fa98f>] SyS_read+0x7f/0xe0 [ 1234.907883] [<ffffffff8d6a7d94>] tracesys+0xdd/0xe2 Hello, It looks like issue comes when 'testpmd' thread is set as 'FIFO' with priority '95'. Can we try to pin each host 'PMD' threads to individual isolated cores using 'taskset' and check if issue comes? just wanted to know if its tried before. Thanks, Pankaj (In reply to pagupta from comment #7) > Hello, > > It looks like issue comes when 'testpmd' thread is set as 'FIFO' with > priority '95'. Can we try to pin each host 'PMD' threads to individual > isolated cores using 'taskset' and check if issue comes? > > just wanted to know if its tried before. > > Thanks, > Pankaj Hi Pankaj, All pmd threads have been pinned to individual isolated cores already. Details: When setting openvswitch using "# ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x1554(0x1554 is cores 2,4,6,8,10,12)", 6 ovs pmd threads will automatically pin to these individual isolated cores. # top -d1 -H top - 23:17:07 up 25 min, 1 user, load average: 9.00, 8.48, 5.48 Threads: 517 total, 9 running, 508 sleeping, 0 stopped, 0 zombie %Cpu(s): 41.6 us, 0.2 sy, 0.0 ni, 58.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0. KiB Mem : 32448520 total, 8148228 free, 23754976 used, 545316 buff/cache KiB Swap: 16449532 total, 16449532 free, 0 used. 8414008 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ 4092 root 20 0 37.292g 33572 2528 R 96.2 0.1 14:41.54 4093 root 20 0 37.292g 33572 2528 R 96.2 0.1 14:41.66 4094 root 20 0 37.292g 33572 2528 R 96.2 0.1 14:41.33 4095 root 20 0 37.292g 33572 2528 R 96.2 0.1 14:41.69 4096 root 20 0 37.292g 33572 2528 R 96.2 0.1 14:41.67 4097 root 20 0 37.292g 33572 2528 R 96.2 0.1 14:41.33 4268 root -2 0 8981944 370112 51332 R 96.2 1.1 11:37.99 4269 root -2 0 8981944 370112 51332 R 96.2 1.1 11:37.99 4673 root 20 0 158120 2628 1556 R 3.8 0.0 0:00.10 ... # taskset -c -p 4092 pid 4092's current affinity list: 2 # taskset -c -p 4093 pid 4093's current affinity list: 4 # taskset -c -p 4094 pid 4094's current affinity list: 6 # taskset -c -p 4095 pid 4095's current affinity list: 8 # taskset -c -p 4096 pid 4096's current affinity list: 10 # taskset -c -p 4097 pid 4097's current affinity list: 12 In VM, when start testpmd, 2 pmd threads are automatically pin to individual cores. # taskset -c -p 4268 pid 4268's current affinity list: 16 # taskset -c -p 4269 pid 4269's current affinity list: 14 Best Regards, Pei Hi, I also looked at the logs and I could find some pointers. It looks like interrupts are disabled using 'irq_disable', logs tell us 'irqs_disabled(): 1' for cores: 1, 9, 11, 13. 
As we might be sleeping from 'irq_disabled' context, looks like cause of this issue. [1] static int device_dma_allocations(struct device *dev, struct dma_debug_entry **out_entry) { struct dma_debug_entry *entry; unsigned long flags; int count = 0, i; local_irq_save(flags); for (i = 0; i < HASH_SIZE; ++i) { spin_lock(&dma_entry_hash[i].lock); list_for_each_entry(entry, &dma_entry_hash[i].list, list) { if (entry->dev == dev) { count += 1; *out_entry = entry; } } spin_unlock(&dma_entry_hash[i].lock); } local_irq_restore(flags); return count; } [ 2430.841001] BUG: sleeping function called from invalid context at kernel/rtmutex.c8 [ 2430.841002] in_atomic(): 0, irqs_disabled(): 1, pid: 4041, name: dpdk-devbind [ 2430.841003] 6 locks held by dpdk-devbind/4041: [ 2430.841011] #0: (sb_writers#3){.+.+.+}, at: [<ffffffff9e23b35b>] vfs_write+0x1bb0 [ 2430.841017] #1: (&buffer->mutex){+.+.+.}, at: [<ffffffff9e2cbb3a>] sysfs_write_f0 [ 2430.841019] #2: (s_active#15){++++.+}, at: [<ffffffff9e2cbbc1>] sysfs_write_file0 [ 2430.841025] #3: (&__lockdep_no_validate__){......}, at: [<ffffffff9e494da5>] dri0 [ 2430.841028] #4: (&__lockdep_no_validate__){......}, at: [<ffffffff9e49658b>] dev0 [ 2430.841034] #5: (&(&priv->bus_notifier)->rwsem){+.+.+.}, at: [<ffffffff9e117f50>0 [ 2430.841035] irq event stamp: 1043290 [ 2430.841038] hardirqs last enabled at (1043289): [<ffffffff9e760385>] _raw_spin_un0 [ 2430.841043] hardirqs last disabled at (1043290): [<ffffffff9e3a8384>] dma_debug_de0---------> hard irqs disabled here [ 2430.841046] softirqs last enabled at (0): [<ffffffff9e08184a>] copy_process+0x82a0 [ 2430.841047] softirqs last disabled at (0): [< (null)>] (null) [ 2430.841049] CPU: 9 PID: 4041 Comm: dpdk-devbind Not tainted 3.10.0-644.rt56.560.el1 [ 2430.841049] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015 [ 2430.841052] ffff8da10f334000 00000000dd19f330 ffff8da0fcfe3cc8 ffffffff9e757413 [ 2430.841053] ffff8da0fcfe3cf0 ffffffff9e0ccb3d ffffffffa0617b50 0000000000000000 [ 2430.841055] 0000000000000000 ffff8da0fcfe3d10 ffffffff9e75f694 0000000000000000 [ 2430.841055] Call Trace: [ 2430.841060] [<ffffffff9e757413>] dump_stack+0x19/0x1b [ 2430.841064] [<ffffffff9e0ccb3d>] __might_sleep+0x12d/0x1f0 [ 2430.841067] [<ffffffff9e75f694>] rt_spin_lock+0x24/0x60 [ 2430.841070] [<ffffffff9e3a8399>] dma_debug_device_change+0x79/0x250 [ 2430.841072] [<ffffffff9e764d78>] notifier_call_chain+0x58/0xb0 [ 2430.841076] [<ffffffff9e0c179a>] __blocking_notifier_call_chain+0x5a/0x80 [ 2430.841080] [<ffffffff9e0c17d6>] blocking_notifier_call_chain+0x16/0x20 [ 2430.841082] [<ffffffff9e496553>] __device_release_driver+0xd3/0xf0 [ 2430.841084] [<ffffffff9e496593>] device_release_driver+0x23/0x30 [ 2430.841086] [<ffffffff9e494db5>] driver_unbind+0xc5/0xf0 [ 2430.841088] [<ffffffff9e4942b4>] drv_attr_store+0x24/0x40 [ 2430.841090] [<ffffffff9e2cbbd9>] sysfs_write_file+0xd9/0x150 [ 2430.841092] [<ffffffff9e23b260>] vfs_write+0xc0/0x1f0 [ 2430.841095] [<ffffffff9e25e692>] ? fget_light+0x382/0x500 [ 2430.841097] [<ffffffff9e23c0d8>] SyS_write+0x88/0xf0 [ 2430.841101] [<ffffffff9e76962d>] tracesys+0xdd/0xe2 [ 2430.841102] --------------------------- [ 2430.841102] | preempt count: 00000000 ] [ 2430.841103] | 0-level deep critical section nesting: I think below patch should fix this. If this looks okay I will post it upstream and downstream build for testing. 
diff --git a/lib/dma-debug.c b/lib/dma-debug.c index b157b46..fe4d50c 100644 --- a/lib/dma-debug.c +++ b/lib/dma-debug.c @@ -942,21 +942,17 @@ static int device_dma_allocations(struct device *dev, struct dma_debug_entry **o unsigned long flags; int count = 0, i; - local_irq_save(flags); - for (i = 0; i < HASH_SIZE; ++i) { - spin_lock(&dma_entry_hash[i].lock); + spin_lock_irqsave(&dma_entry_hash[i].lock, flags); list_for_each_entry(entry, &dma_entry_hash[i].list, list) { if (entry->dev == dev) { count += 1; *out_entry = entry; } } - spin_unlock(&dma_entry_hash[i].lock); + spin_unlock_irqrestore(&dma_entry_hash[i].lock, flags); } - local_irq_restore(flags); - return count; } ================================================================= [2] Here, we are seeing this trace and lockdep is getting disabled: Looks like we might need to handle this lock in lockdep? As a result it is disabling lock validator. [ 2578.465834] ffff8d9d17b3fbf8 000000019e0cfced 0000000000000000 ffff8d9d00000000 [ 2578.465835] Call Trace: [ 2578.465843] [<ffffffff9e757413>] dump_stack+0x19/0x1b [ 2578.465848] [<ffffffff9e750de0>] register_lock_class.part.27+0x38/0x3c [ 2578.465852] [<ffffffff9e10ea05>] __lock_acquire+0xcc5/0xdc0 [ 2578.465854] [<ffffffff9e10f452>] lock_acquire+0xb2/0x230 [ 2578.465858] [<ffffffff9e75dbbd>] ? rt_spin_lock_slowlock+0x6d/0x3c0 [ 2578.465860] [<ffffffff9e7604e0>] _raw_spin_lock_irqsave+0x70/0xc0 [ 2578.465862] [<ffffffff9e75dbbd>] ? rt_spin_lock_slowlock+0x6d/0x3c0 [ 2578.465864] [<ffffffff9e75dbbd>] rt_spin_lock_slowlock+0x6d/0x3c0 [ 2578.465867] [<ffffffff9e75f69c>] rt_spin_lock+0x2c/0x60 [ 2578.465884] [<ffffffffc08c590f>] kvm_page_track_register_notifier+0x1f/0x60 [kvm] [ 2578.465896] [<ffffffffc08ac9eb>] kvm_mmu_init_vm+0x2b/0x30 [kvm] [ 2578.465907] [<ffffffffc0899e44>] kvm_arch_init_vm+0x264/0x290 [kvm] [ 2578.465914] [<ffffffffc087b4de>] kvm_dev_ioctl+0xde/0x740 [kvm] [ 2578.465918] [<ffffffff9e0cfced>] ? get_parent_ip+0xd/0x50 [ 2578.465922] [<ffffffff9e2514c5>] do_vfs_ioctl+0x365/0x5b0 [ 2578.465924] [<ffffffff9e25e3ff>] ? fget_light+0xef/0x500 [ 2578.465926] [<ffffffff9e2517b1>] SyS_ioctl+0xa1/0xc0 [ 2578.465929] [<ffffffff9e76962d>] tracesys+0xdd/0xe2 [ 2578.465930] --------------------------- [ 2578.465931] | preempt count: 00000001 ] [ 2578.465931] | 1-level deep critical section nesting: [ 2578.465931] ---------------------------------------- [ 2578.465933] .. [<ffffffff9e7604bd>] .... _raw_spin_lock_irqsave+0x4d/0xc0 [ 2578.465935] .....[<ffffffff9e75dbbd>] .. ( <= rt_spin_lock_slowlock+0x6d/0x3c0) [ 2578.465935] static struct lock_class * register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force) { struct lockdep_subclass_key *key; struct hlist_head *hash_head; struct lock_class *class; DEBUG_LOCKS_WARN_ON(!irqs_disabled()); class = look_up_lock_class(lock, subclass); if (likely(class)) goto out_set_class_cache; /* * Debug-check: all keys must be persistent! */ if (!static_obj(lock->key)) { debug_locks_off();----------------> disable log debugging printk("INFO: trying to register non-static key.\n"); printk("the code is fine but needs lockdep annotation.\n"); printk("turning off the locking correctness validator.\n"); dump_stack(); return NULL; } ================ [3] As irqs_disable:1 already for cpu '9' that's why we are seeing this. Actual issue is [1] and above patch should fix this. 
[ 2580.261273] ---------------------------------------- [ 2580.261274] [ 2592.563812] BUG: sleeping function called from invalid context at kernel/rtmutex.c8 ---> still irqs_disable [ 2592.563813] in_atomic(): 0, irqs_disabled(): 1, pid: 4748, name: qemu-kvm [ 2592.563814] INFO: lockdep is turned off. [ 2592.563815] irq event stamp: 301754 [ 2592.563822] hardirqs last enabled at (301753): [<ffffffff9e760385>] _raw_spin_unl0 [ 2592.563823] hardirqs last disabled at (301754): [<ffffffff9e7604b3>] _raw_spin_loc0 [ 2592.563827] softirqs last enabled at (0): [<ffffffff9e08184a>] copy_process+0x82a0 [ 2592.563828] softirqs last disabled at (0): [< (null)>] (null) [ 2592.563830] CPU: 9 PID: 4748 Comm: qemu-kvm Not tainted 3.10.0-644.rt56.560.el7.x81 [ 2592.563831] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015 [ 2592.563833] ffff8d9938bf0000 000000000688bb3b ffff8d9d17b3fc68 ffffffff9e757413 [ 2592.563834] ffff8d9d17b3fc90 ffffffff9e0ccb3d ffff8da111b5b610 ffff8da111b5b610 [ 2592.563836] 00007ffcb8a137b0 ffff8d9d17b3fcb0 ffffffff9e75f694 ffff8da111b5b610 [ 2592.563836] Call Trace: [ 2592.563842] [<ffffffff9e757413>] dump_stack+0x19/0x1b [ 2592.563846] [<ffffffff9e0ccb3d>] __might_sleep+0x12d/0x1f0 [ 2592.563850] [<ffffffff9e75f694>] rt_spin_lock+0x24/0x60 [ 2592.563865] [<ffffffffc0888996>] __get_kvmclock_ns+0x36/0x110 [kvm] [ 2592.563877] [<ffffffffc0895937>] kvm_arch_vm_ioctl+0x257/0xd70 [kvm] [ 2592.563880] [<ffffffff9e75dbbd>] ? rt_spin_lock_slowlock+0x6d/0x3c0 [ 2592.563890] [<ffffffffc087f114>] kvm_vm_ioctl+0xa4/0x850 [kvm] [ 2592.563892] [<ffffffff9e75d173>] ? rt_spin_lock_slowunlock+0xd3/0xe0 [ 2592.563900] [<ffffffffc0878dae>] ? kvm_vm_ioctl_check_extension_generic+0x4e/0x90] [ 2592.563907] [<ffffffffc087b5ca>] ? kvm_dev_ioctl+0x1ca/0x740 [kvm] [ 2592.563910] [<ffffffff9e0cfced>] ? get_parent_ip+0xd/0x50 [ 2592.563912] [<ffffffff9e76509f>] ? add_preempt_count+0x1f/0x60 [ 2592.563916] [<ffffffff9e2514c5>] do_vfs_ioctl+0x365/0x5b0 [ 2592.563918] [<ffffffff9e25e3ff>] ? fget_light+0xef/0x500 [ 2592.563920] [<ffffffff9e2517b1>] SyS_ioctl+0xa1/0xc0 [ 2592.563923] [<ffffffff9e76962d>] tracesys+0xdd/0xe2 Thanks, Pankaj (In reply to pagupta from comment #14) > Hi, > > I also looked at the logs and I could find some pointers. > It looks like interrupts are disabled using 'irq_disable', > logs tell us 'irqs_disabled(): 1' for cores: 1, 9, 11, 13. > > As we might be sleeping from 'irq_disabled' context, looks > like cause of this issue. 
> > [1] > > static int device_dma_allocations(struct device *dev, struct dma_debug_entry > **out_entry) > { > struct dma_debug_entry *entry; > unsigned long flags; > int count = 0, i; > > local_irq_save(flags); > > for (i = 0; i < HASH_SIZE; ++i) { > spin_lock(&dma_entry_hash[i].lock); > list_for_each_entry(entry, &dma_entry_hash[i].list, list) { > if (entry->dev == dev) { > count += 1; > *out_entry = entry; > } > } > spin_unlock(&dma_entry_hash[i].lock); > } > > local_irq_restore(flags); > > return count; > } > > > [ 2430.841001] BUG: sleeping function called from invalid context at > kernel/rtmutex.c8 > [ 2430.841002] in_atomic(): 0, irqs_disabled(): 1, pid: 4041, name: > dpdk-devbind > [ 2430.841003] 6 locks held by dpdk-devbind/4041: > [ 2430.841011] #0: (sb_writers#3){.+.+.+}, at: [<ffffffff9e23b35b>] > vfs_write+0x1bb0 > [ 2430.841017] #1: (&buffer->mutex){+.+.+.}, at: [<ffffffff9e2cbb3a>] > sysfs_write_f0 > [ 2430.841019] #2: (s_active#15){++++.+}, at: [<ffffffff9e2cbbc1>] > sysfs_write_file0 > [ 2430.841025] #3: (&__lockdep_no_validate__){......}, at: > [<ffffffff9e494da5>] dri0 > [ 2430.841028] #4: (&__lockdep_no_validate__){......}, at: > [<ffffffff9e49658b>] dev0 > [ 2430.841034] #5: (&(&priv->bus_notifier)->rwsem){+.+.+.}, at: > [<ffffffff9e117f50>0 > [ 2430.841035] irq event stamp: 1043290 > [ 2430.841038] hardirqs last enabled at (1043289): [<ffffffff9e760385>] > _raw_spin_un0 > [ 2430.841043] hardirqs last disabled at (1043290): [<ffffffff9e3a8384>] > dma_debug_de0---------> hard irqs disabled here > [ 2430.841046] softirqs last enabled at (0): [<ffffffff9e08184a>] > copy_process+0x82a0 > [ 2430.841047] softirqs last disabled at (0): [< (null)>] > (null) > [ 2430.841049] CPU: 9 PID: 4041 Comm: dpdk-devbind Not tainted > 3.10.0-644.rt56.560.el1 > [ 2430.841049] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 > 10/002/2015 > [ 2430.841052] ffff8da10f334000 00000000dd19f330 ffff8da0fcfe3cc8 > ffffffff9e757413 > [ 2430.841053] ffff8da0fcfe3cf0 ffffffff9e0ccb3d ffffffffa0617b50 > 0000000000000000 > [ 2430.841055] 0000000000000000 ffff8da0fcfe3d10 ffffffff9e75f694 > 0000000000000000 > [ 2430.841055] Call Trace: > [ 2430.841060] [<ffffffff9e757413>] dump_stack+0x19/0x1b > [ 2430.841064] [<ffffffff9e0ccb3d>] __might_sleep+0x12d/0x1f0 > [ 2430.841067] [<ffffffff9e75f694>] rt_spin_lock+0x24/0x60 > [ 2430.841070] [<ffffffff9e3a8399>] dma_debug_device_change+0x79/0x250 > [ 2430.841072] [<ffffffff9e764d78>] notifier_call_chain+0x58/0xb0 > [ 2430.841076] [<ffffffff9e0c179a>] __blocking_notifier_call_chain+0x5a/0x80 > [ 2430.841080] [<ffffffff9e0c17d6>] blocking_notifier_call_chain+0x16/0x20 > [ 2430.841082] [<ffffffff9e496553>] __device_release_driver+0xd3/0xf0 > [ 2430.841084] [<ffffffff9e496593>] device_release_driver+0x23/0x30 > [ 2430.841086] [<ffffffff9e494db5>] driver_unbind+0xc5/0xf0 > [ 2430.841088] [<ffffffff9e4942b4>] drv_attr_store+0x24/0x40 > [ 2430.841090] [<ffffffff9e2cbbd9>] sysfs_write_file+0xd9/0x150 > [ 2430.841092] [<ffffffff9e23b260>] vfs_write+0xc0/0x1f0 > [ 2430.841095] [<ffffffff9e25e692>] ? fget_light+0x382/0x500 > [ 2430.841097] [<ffffffff9e23c0d8>] SyS_write+0x88/0xf0 > [ 2430.841101] [<ffffffff9e76962d>] tracesys+0xdd/0xe2 > [ 2430.841102] --------------------------- > [ 2430.841102] | preempt count: 00000000 ] > [ 2430.841103] | 0-level deep critical section nesting: > > > I think below patch should fix this. If this looks okay > I will post it upstream and downstream build for testing. 
> > diff --git a/lib/dma-debug.c b/lib/dma-debug.c > index b157b46..fe4d50c 100644 > --- a/lib/dma-debug.c > +++ b/lib/dma-debug.c > @@ -942,21 +942,17 @@ static int device_dma_allocations(struct device *dev, > struct dma_debug_entry **o > unsigned long flags; > int count = 0, i; > > - local_irq_save(flags); > - > for (i = 0; i < HASH_SIZE; ++i) { > - spin_lock(&dma_entry_hash[i].lock); > + spin_lock_irqsave(&dma_entry_hash[i].lock, flags); > list_for_each_entry(entry, &dma_entry_hash[i].list, list) { > if (entry->dev == dev) { > count += 1; > *out_entry = entry; > } > } > - spin_unlock(&dma_entry_hash[i].lock); > + spin_unlock_irqrestore(&dma_entry_hash[i].lock, flags); > } > > - local_irq_restore(flags); > - > return count; > } Yes it does. > > > ================================================================= > > [2] > Here, we are seeing this trace and lockdep is getting disabled: > Looks like we might need to handle this lock in lockdep? As a result it is > disabling lock validator. > > > [ 2578.465834] ffff8d9d17b3fbf8 000000019e0cfced 0000000000000000 > ffff8d9d00000000 > [ 2578.465835] Call Trace: > [ 2578.465843] [<ffffffff9e757413>] dump_stack+0x19/0x1b > [ 2578.465848] [<ffffffff9e750de0>] register_lock_class.part.27+0x38/0x3c > [ 2578.465852] [<ffffffff9e10ea05>] __lock_acquire+0xcc5/0xdc0 > [ 2578.465854] [<ffffffff9e10f452>] lock_acquire+0xb2/0x230 > [ 2578.465858] [<ffffffff9e75dbbd>] ? rt_spin_lock_slowlock+0x6d/0x3c0 > [ 2578.465860] [<ffffffff9e7604e0>] _raw_spin_lock_irqsave+0x70/0xc0 > [ 2578.465862] [<ffffffff9e75dbbd>] ? rt_spin_lock_slowlock+0x6d/0x3c0 > [ 2578.465864] [<ffffffff9e75dbbd>] rt_spin_lock_slowlock+0x6d/0x3c0 > [ 2578.465867] [<ffffffff9e75f69c>] rt_spin_lock+0x2c/0x60 > [ 2578.465884] [<ffffffffc08c590f>] > kvm_page_track_register_notifier+0x1f/0x60 [kvm] > [ 2578.465896] [<ffffffffc08ac9eb>] kvm_mmu_init_vm+0x2b/0x30 [kvm] > [ 2578.465907] [<ffffffffc0899e44>] kvm_arch_init_vm+0x264/0x290 [kvm] > [ 2578.465914] [<ffffffffc087b4de>] kvm_dev_ioctl+0xde/0x740 [kvm] > [ 2578.465918] [<ffffffff9e0cfced>] ? get_parent_ip+0xd/0x50 > [ 2578.465922] [<ffffffff9e2514c5>] do_vfs_ioctl+0x365/0x5b0 > [ 2578.465924] [<ffffffff9e25e3ff>] ? fget_light+0xef/0x500 > [ 2578.465926] [<ffffffff9e2517b1>] SyS_ioctl+0xa1/0xc0 > [ 2578.465929] [<ffffffff9e76962d>] tracesys+0xdd/0xe2 > [ 2578.465930] --------------------------- > [ 2578.465931] | preempt count: 00000001 ] > [ 2578.465931] | 1-level deep critical section nesting: > [ 2578.465931] ---------------------------------------- > [ 2578.465933] .. [<ffffffff9e7604bd>] .... _raw_spin_lock_irqsave+0x4d/0xc0 > [ 2578.465935] .....[<ffffffff9e75dbbd>] .. ( <= > rt_spin_lock_slowlock+0x6d/0x3c0) > [ 2578.465935] > > > static struct lock_class * > register_lock_class(struct lockdep_map *lock, unsigned int subclass, int > force) > { > struct lockdep_subclass_key *key; > struct hlist_head *hash_head; > struct lock_class *class; > > DEBUG_LOCKS_WARN_ON(!irqs_disabled()); > > class = look_up_lock_class(lock, subclass); > if (likely(class)) > goto out_set_class_cache; > > /* > * Debug-check: all keys must be persistent! 
> */ > if (!static_obj(lock->key)) { > debug_locks_off();----------------> disable log debugging > printk("INFO: trying to register non-static key.\n"); > printk("the code is fine but needs lockdep annotation.\n"); > printk("turning off the locking correctness validator.\n"); > dump_stack(); > > return NULL; > } Yes, don't know how to do that however because the object (kvm structure) is not static and can't see how that can be fixed easily. Created a debug kernel with "hand made" debugging, hopefully that helps. > > > ================ > > [3] > > As irqs_disable:1 already for cpu '9' that's why we are seeing this. > Actual issue is [1] and above patch should fix this. Not sure this analysis is correct: dma debugging should be separate from [3]. > > [ 2580.261273] ---------------------------------------- > [ 2580.261274] > [ 2592.563812] BUG: sleeping function called from invalid context at > kernel/rtmutex.c8 ---> still irqs_disable > [ 2592.563813] in_atomic(): 0, irqs_disabled(): 1, pid: 4748, name: qemu-kvm > [ 2592.563814] INFO: lockdep is turned off. > [ 2592.563815] irq event stamp: 301754 > [ 2592.563822] hardirqs last enabled at (301753): [<ffffffff9e760385>] > _raw_spin_unl0 > [ 2592.563823] hardirqs last disabled at (301754): [<ffffffff9e7604b3>] > _raw_spin_loc0 > [ 2592.563827] softirqs last enabled at (0): [<ffffffff9e08184a>] > copy_process+0x82a0 > [ 2592.563828] softirqs last disabled at (0): [< (null)>] > (null) > [ 2592.563830] CPU: 9 PID: 4748 Comm: qemu-kvm Not tainted > 3.10.0-644.rt56.560.el7.x81 > [ 2592.563831] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 > 10/002/2015 > [ 2592.563833] ffff8d9938bf0000 000000000688bb3b ffff8d9d17b3fc68 > ffffffff9e757413 > [ 2592.563834] ffff8d9d17b3fc90 ffffffff9e0ccb3d ffff8da111b5b610 > ffff8da111b5b610 > [ 2592.563836] 00007ffcb8a137b0 ffff8d9d17b3fcb0 ffffffff9e75f694 > ffff8da111b5b610 > [ 2592.563836] Call Trace: > [ 2592.563842] [<ffffffff9e757413>] dump_stack+0x19/0x1b > [ 2592.563846] [<ffffffff9e0ccb3d>] __might_sleep+0x12d/0x1f0 > [ 2592.563850] [<ffffffff9e75f694>] rt_spin_lock+0x24/0x60 > [ 2592.563865] [<ffffffffc0888996>] __get_kvmclock_ns+0x36/0x110 [kvm] > [ 2592.563877] [<ffffffffc0895937>] kvm_arch_vm_ioctl+0x257/0xd70 [kvm] > [ 2592.563880] [<ffffffff9e75dbbd>] ? rt_spin_lock_slowlock+0x6d/0x3c0 > [ 2592.563890] [<ffffffffc087f114>] kvm_vm_ioctl+0xa4/0x850 [kvm] > [ 2592.563892] [<ffffffff9e75d173>] ? rt_spin_lock_slowunlock+0xd3/0xe0 > [ 2592.563900] [<ffffffffc0878dae>] ? > kvm_vm_ioctl_check_extension_generic+0x4e/0x90] > [ 2592.563907] [<ffffffffc087b5ca>] ? kvm_dev_ioctl+0x1ca/0x740 [kvm] > [ 2592.563910] [<ffffffff9e0cfced>] ? get_parent_ip+0xd/0x50 > [ 2592.563912] [<ffffffff9e76509f>] ? add_preempt_count+0x1f/0x60 > [ 2592.563916] [<ffffffff9e2514c5>] do_vfs_ioctl+0x365/0x5b0 > [ 2592.563918] [<ffffffff9e25e3ff>] ? fget_light+0xef/0x500 > [ 2592.563920] [<ffffffff9e2517b1>] SyS_ioctl+0xa1/0xc0 > [ 2592.563923] [<ffffffff9e76962d>] tracesys+0xdd/0xe2 > > Thanks, > Pankaj Created attachment 1271105 [details] Console log when host is hang - 2 (In reply to Marcelo Tosatti from comment #15) > [...] > Pei Zhang, > > Please try the -kvm kernel of this SRPM: > > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12983510 > > Which prints additional information. Marcelo, Please see this attachment. Testing with 3.10.0-644.rt56.560.el7.x86_64.debug. Using -kvm packages from this link. 
Best Regards, Pei (In reply to Pei Zhang from comment #18) > Created attachment 1271105 [details] > Console log when host is hang - 2 > > (In reply to Marcelo Tosatti from comment #15) > > [...] > > Pei Zhang, > > > > Please try the -kvm kernel of this SRPM: > > > > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12983510 > > > > Which prints additional information. > > Marcelo, > > Please see this attachment. > > Testing with 3.10.0-644.rt56.560.el7.x86_64.debug. Using -kvm packages from > this link. > > Best Regards, > Pei Pei, Need both kernel-rt and kernel-rt-kvm from the link. Did you install both of them? Sorry for not being clear before. [ 1845.970458] INFO: task modprobe:15242 blocked for more than 600 s. [ 1849.760102] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" di. [ 1849.768841] modprobe D ffff8c17a71da440 12528 15242 152020 [ 1849.776727] ffff8c1b8362baf0 0000000000000002 ffff8c1b8362bfd8 f8 [ 1849.785017] ffff8c1b8362bfd8 ffff8c1b8362bfd8 ffff8c1980c1c000 f0 [ 1849.793308] ffff8c1b97654000 ffff8c1b976549b8 0000000000000001 f0 [ 1849.801599] Call Trace: [ 1849.804330] [<ffffffff9375c890>] schedule+0x30/0xa0 [ 1849.809869] [<ffffffff9375dc55>] rt_spin_lock_slowlock+0x105/0x30 [ 1849.816864] [<ffffffff9375dcca>] ? rt_spin_lock_slowlock+0x17a/00 [ 1849.824052] [<ffffffff9375f69c>] rt_spin_lock+0x2c/0x60 [ 1849.829980] [<ffffffff930ad4a0>] pwq_adjust_max_active+0x50/0x110 [ 1849.836876] [<ffffffff930ad5d2>] link_pwq+0x72/0x90 [ 1849.842415] [<ffffffff930b2b56>] __alloc_workqueue_key+0x4f6/0x50 [ 1849.849409] [<ffffffff931b7880>] ? mempool_kfree+0x10/0x10 [ 1849.855628] [<ffffffff931b7c73>] ? mempool_create_node+0x133/0x10 [ 1849.862621] [<ffffffff931b78a0>] ? mempool_alloc_slab+0x20/0x20 [ 1849.869325] [<ffffffff9327e4ea>] bioset_create+0x26a/0x310 [ 1849.875559] [<ffffffffc0ba20b8>] ? ftrace_define_fields_btrfs_fl] [ 1849.885182] [<ffffffffc0ba232f>] extent_io_init+0x65/0xd5 [btrfs] [ 1849.892090] [<ffffffffc0ba20f7>] init_btrfs_fs+0x3f/0x129 [btrfs] [ 1849.898987] [<ffffffff930003e8>] do_one_initcall+0xb8/0x260 [ 1849.905309] [<ffffffff9312096f>] load_module+0x168f/0x1d60 [ 1849.911529] [<ffffffff933a6070>] ? ddebug_proc_write+0xf0/0xf0 [ 1849.918134] [<ffffffff9311b8c3>] ? copy_module_from_fd.isra.42+00 [ 1849.925806] [<ffffffff93121216>] SyS_finit_module+0xa6/0xd0 [ 1849.932122] [<ffffffff9376962d>] tracesys+0xdd/0xe2 Its not making progress to grab this spinlock. spin_lock_irq(&pwq->pool->lock); Not making progress to grab: [ 2449.978453] INFO: task kworker/u288:0:7 blocked for more than 600 secon. [ 2453.725915] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables. [ 2453.734652] kworker/u288:0 D ffff8c1bb7011820 12800 7 2 0x0000 [ 2453.742548] Workqueue: edac-poller edac_mc_workq_function [edac_core] [ 2453.749743] ffff8c16008e3c00 0000000000000002 ffff8c16008e3fd8 ffff8c18 [ 2453.758034] ffff8c16008e3fd8 ffff8c16008e3fd8 ffff8c1980c08000 ffff8c10 [ 2453.766324] ffff8c16008dc000 ffff8c16008dc9b8 0000000000000292 ffff8c10 [ 2453.774615] Call Trace: [ 2453.777363] [<ffffffff9375c890>] schedule+0x30/0xa0 [ 2453.782903] [<ffffffff9375dc55>] rt_spin_lock_slowlock+0x105/0x3c0 [ 2453.789897] [<ffffffff9375f69c>] rt_spin_lock+0x2c/0x60 [ 2453.795824] [<ffffffff930b03d7>] queue_delayed_work_on+0x67/0x1d0 [ 2453.802724] [<ffffffffc03ae452>] edac_mc_workq_function+0x72/0xa0 [eda] [ 2453.810775] [<ffffffff930af350>] process_one_work+0x230/0x770 [ 2453.817282] [<ffffffff930af2df>] ? 
process_one_work+0x1bf/0x770 [ 2453.823988] [<ffffffff930afa10>] worker_thread+0x180/0x410 [ 2453.830205] [<ffffffff930af890>] ? process_one_work+0x770/0x770 [ 2453.836909] [<ffffffff930b94dd>] kthread+0xed/0x100 [ 2453.842447] [<ffffffff9376509f>] ? add_preempt_count+0x1f/0x60 [ 2453.849054] [<ffffffff930b93f0>] ? flush_kthread_work+0x230/0x230 [ 2453.855951] [<ffffffff93769358>] ret_from_fork+0x58/0x90 [ 2453.861974] [<ffffffff930b93f0>] ? flush_kthread_work+0x230/0x230 This one is waiting on pendingb_lock. Places that grab both locks are: 1. put_pwq_unlocked 2. try_to_grab_pending 3. queue_work_on 4. queue_delayed_work_on 5. mod_delayed_work_on 6. cancel_work_timer (via try_to_grab_pending) 7. flush_delayed_work So perhaps its scheduling work on an isolated CPU, with pmd or qemu-kvm-vcpu with priority higher than kworker thread, and then it hangs (because the pmd or qemu-kvm do not allow kworker to execute). To dump the trace to some location: 1) Have a process writing to disk continuously: cat /sys/kernel/debug/tracing/trace_pipe > /root/trace.txt or 2) You can also trigger a dump of the Ftrace buffer to the console with sysrq-z. Can you please also enable the following tracepoints: trace_workqueue_activate_work(work); trace_workqueue_queue_work(req_cpu, pwq, work); trace_workqueue_activate_work(work); trace_workqueue_execute_start(work); trace_workqueue_execute_end(work); Created attachment 1271343 [details] Call trace info when testing build from pagupta (In reply to Pei Zhang from comment #24) > (In reply to pagupta from comment #17) > > Hello Pei, > > > > I have build the test kernel with the patch in comment #14. Can you please > > also test this and check if it solves the issue. Please test this after > > you are able to gather logs with the test kernel provided by 'Marcelo'. > > > > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12999750 > > > > Thanks, > > Pankaj > > Hi Pankaj, this build still hit issues. > > 1. Host hang. Console didn't print any call trace info and error info after > waiting about 10minutes, just hang. Call trace occurs after waiting a little longer. Please see the attachment. > 2. Migration fail with error below and hosts didn't hang. Even when I didn't > set 95 priority, migration also fail. > # /bin/virsh migrate --verbose --persistent --live rhel7.4_rt > qemu+ssh://192.168.1.1/system > error: operation failed: migration job: is not active > > > Best Regards, > Pei (In reply to Marcelo Tosatti from comment #19) > (In reply to Pei Zhang from comment #18) > > Created attachment 1271105 [details] > > Console log when host is hang - 2 > > > > (In reply to Marcelo Tosatti from comment #15) > > > [...] > > > Pei Zhang, > > > > > > Please try the -kvm kernel of this SRPM: > > > > > > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12983510 > > > > > > Which prints additional information. > > > > Marcelo, > > > > Please see this attachment. > > > > Testing with 3.10.0-644.rt56.560.el7.x86_64.debug. Using -kvm packages from > > this link. > > > > Best Regards, > > Pei > > Pei, > > Need both kernel-rt and kernel-rt-kvm from the link. Did you install both of > them? Sorry for not being clear before. 
Hi Marcelo, When testing with "3.10.0-644.rt56.560.el7.x86_64.debug", I download and installed below packages from your link: kernel-rt-3.10.0-644.rt56.560.el7.x86_64.rpm kernel-rt-debuginfo-3.10.0-644.rt56.560.el7.x86_64.rpm kernel-rt-debuginfo-common-x86_64-3.10.0-644.rt56.560.el7.x86_64.rpm kernel-rt-devel-3.10.0-644.rt56.560.el7.x86_64.rpm kernel-rt-kvm-3.10.0-644.rt56.560.el7.x86_64.rpm kernel-rt-kvm-debuginfo-3.10.0-644.rt56.560.el7.x86_64.rpm kernel-rt-trace-3.10.0-644.rt56.560.el7.x86_64.rpm kernel-rt-trace-debuginfo-3.10.0-644.rt56.560.el7.x86_64.rpm kernel-rt-trace-devel-3.10.0-644.rt56.560.el7.x86_64.rpm kernel-rt-trace-kvm-3.10.0-644.rt56.560.el7.x86_64.rpm kernel-rt-trace-kvm-debuginfo-3.10.0-644.rt56.560.el7.x86_64.rpm As I was not quite sure if debug kernel is still needed in order to get more debug info, and there is no kernel-rt-debug-* packages in above link, so I download below packages from brewweb. kernel-rt-debug-3.10.0-644.rt56.560.el7.x86_64.rpm kernel-rt-debug-debuginfo-3.10.0-644.rt56.560.el7.x86_64.rpm kernel-rt-debug-devel-3.10.0-644.rt56.560.el7.x86_64.rpm Is this OK? Today, I also tested with 3.10.0-644.rt56.560.el7.x86_64, the host hang and I got call trace: [ 532.144591] CPU0: NOHZ: local_softirq_pending: SCHED [ 532.145590] CPU0: NOHZ: local_softirq_pending: SCHED [ 532.146590] CPU0: NOHZ: local_softirq_pending: SCHED [ 532.147589] CPU0: NOHZ: local_softirq_pending: SCHED [ 532.148589] CPU0: NOHZ: local_softirq_pending: SCHED [ 532.149589] CPU0: NOHZ: local_softirq_pending: SCHED [ 532.150589] CPU0: NOHZ: local_softirq_pending: SCHED [ 532.151589] CPU0: NOHZ: local_softirq_pending: SCHED [ 532.152589] CPU0: NOHZ: local_softirq_pending: SCHED [ 532.153589] CPU0: NOHZ: local_softirq_pending: SCHED [ 713.501844] sd 10:0:0:0: [sda] tag#0 task abort called for scmd(ffff9e49ee) [ 713.501848] sd 10:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 42 d7 cf 38 00 000 [ 713.501849] sd 10:0:0:0: task abort: FAILED scmd(ffff9e49ee52b100) [ 797.469315] sd 10:0:0:0: [sda] tag#1 task abort called for scmd(ffff9e4bec) [ 797.469317] sd 10:0:0:0: [sda] tag#1 CDB: Write(10) 2a 00 43 66 cd 90 00 00 [ 797.469319] sd 10:0:0:0: task abort: FAILED scmd(ffff9e4becd0ddc0) [ 797.469322] sd 10:0:0:0: target reset called for scmd(ffff9e49ee52b100) [ 797.469324] sd 10:0:0:0: [sda] tag#0 megasas: target reset FAILED!! [ 797.469327] sd 10:0:0:0: [sda] tag#23 Controller reset is requested due tot [ 797.469327] SCSI command pointer: (ffff9e4dd075cc40) SCSI host state: 5 I [ 797.469328] IO request frame: 0000000a 00000000 00000000 ffcfeb80 00600002 00000020 00000000 00000a 00000000 0000400a 00000000 00000000 00000000 00000000 00000000 010000 bd42002a 00002854 00000005 00000000 00000000 00000000 00000000 000000 005a0000 00000000 00000000 00000000 00000000 00000000 00000110 000000 fffa2000 00000000 00000a00 40000000 (In reply to Marcelo Tosatti from comment #22) > Places that grab both locks are: > > 1. put_pwq_unlocked > 2. try_to_grab_pending > 3. queue_work_on > 4. queue_delayed_work_on > 5. mod_delayed_work_on > 6. cancel_work_timer (via try_to_grab_pending) > 7. flush_delayed_work > > So perhaps its scheduling work on an isolated CPU, > with pmd or qemu-kvm-vcpu with priority higher than > kworker thread, and then it hangs (because the pmd or > qemu-kvm do not allow kworker to execute). 
> To dump the trace to some location:
>
> 1) Have a process writing to disk continuously:
> cat /sys/kernel/debug/tracing/trace_pipe > /root/trace.txt

Before testing, I executed this command on both hosts; after the hang I rebooted and checked the log file, and it showed nothing.

# cat /sys/kernel/debug/tracing/trace_pipe > /root/trace.txt
# cat trace.txt
(empty)

> or
>
> 2)
> You can also trigger a dump of the Ftrace buffer to the console with sysrq-z.

Could you share more details about this method? Then I can have a try.

(In reply to Marcelo Tosatti from comment #23)
> Can you please also enable the following tracepoints:
>
> trace_workqueue_activate_work(work);
> trace_workqueue_queue_work(req_cpu, pwq, work);
> trace_workqueue_activate_work(work);
> trace_workqueue_execute_start(work);
> trace_workqueue_execute_end(work);

Could you please explain a bit more about how to enable these tracepoints? I have no experience with this yet. Thanks.

Best Regards,
Pei

Created attachment 1271602 [details]
Call trace info when testing build from marcelo
Marcelo, please check the log info, which covers about 13 hours. Is it expected? I still got an empty /root/trace.txt.
Note: Tested with 3.10.0-644.rt56.560.el7.x86_64; all packages were downloaded from your link.
Best Regards,
Pei
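For reference, a minimal sketch of one way to do both of these things, assuming debugfs is mounted at /sys/kernel/debug and the magic SysRq interface is enabled (pinning the cat process to CPU 0 is only an illustrative choice of a housekeeping CPU, not something specified in this bug):

Enable just the workqueue tracepoints:
# echo workqueue_activate_work workqueue_queue_work workqueue_execute_start workqueue_execute_end > /sys/kernel/debug/tracing/set_event

Stream the events to disk from a housekeeping CPU:
# taskset -c 0 cat /sys/kernel/debug/tracing/trace_pipe > /root/trace.txt &

Allow all SysRq functions, then dump the Ftrace ring buffer to the console (sysrq-z):
# echo 1 > /proc/sys/kernel/sysrq
# echo z > /proc/sysrq-trigger

Note that trace_pipe produces nothing until at least one event is enabled in set_event, which would also explain an empty /root/trace.txt.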
(In reply to Pei Zhang from comment #26) > (In reply to Marcelo Tosatti from comment #19) > > (In reply to Pei Zhang from comment #18) > > > Created attachment 1271105 [details] > > > Console log when host is hang - 2 > > > > > > (In reply to Marcelo Tosatti from comment #15) > > > > [...] > > > > Pei Zhang, > > > > > > > > Please try the -kvm kernel of this SRPM: > > > > > > > > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12983510 > > > > > > > > Which prints additional information. > > > > > > Marcelo, > > > > > > Please see this attachment. > > > > > > Testing with 3.10.0-644.rt56.560.el7.x86_64.debug. Using -kvm packages from > > > this link. > > > > > > Best Regards, > > > Pei > > > > Pei, > > > > Need both kernel-rt and kernel-rt-kvm from the link. Did you install both of > > them? Sorry for not being clear before. > > Hi Marcelo, > > When testing with "3.10.0-644.rt56.560.el7.x86_64.debug", I download and > installed below packages from your link: > > kernel-rt-3.10.0-644.rt56.560.el7.x86_64.rpm > kernel-rt-debuginfo-3.10.0-644.rt56.560.el7.x86_64.rpm > kernel-rt-debuginfo-common-x86_64-3.10.0-644.rt56.560.el7.x86_64.rpm > kernel-rt-devel-3.10.0-644.rt56.560.el7.x86_64.rpm > kernel-rt-kvm-3.10.0-644.rt56.560.el7.x86_64.rpm > kernel-rt-kvm-debuginfo-3.10.0-644.rt56.560.el7.x86_64.rpm > kernel-rt-trace-3.10.0-644.rt56.560.el7.x86_64.rpm > kernel-rt-trace-debuginfo-3.10.0-644.rt56.560.el7.x86_64.rpm > kernel-rt-trace-devel-3.10.0-644.rt56.560.el7.x86_64.rpm > kernel-rt-trace-kvm-3.10.0-644.rt56.560.el7.x86_64.rpm > kernel-rt-trace-kvm-debuginfo-3.10.0-644.rt56.560.el7.x86_64.rpm > > > As I was not quite sure if debug kernel is still needed in order to get more > debug info, and there is no kernel-rt-debug-* packages in above link, so I > download below packages from brewweb. > > kernel-rt-debug-3.10.0-644.rt56.560.el7.x86_64.rpm > kernel-rt-debug-debuginfo-3.10.0-644.rt56.560.el7.x86_64.rpm > kernel-rt-debug-devel-3.10.0-644.rt56.560.el7.x86_64.rpm > > > Is this OK? > > > > Today, I also tested with 3.10.0-644.rt56.560.el7.x86_64, the host hang and > I got call trace: > > [ 532.144591] CPU0: NOHZ: local_softirq_pending: SCHED > [ 532.145590] CPU0: NOHZ: local_softirq_pending: SCHED > [ 532.146590] CPU0: NOHZ: local_softirq_pending: SCHED > [ 532.147589] CPU0: NOHZ: local_softirq_pending: SCHED > [ 532.148589] CPU0: NOHZ: local_softirq_pending: SCHED > [ 532.149589] CPU0: NOHZ: local_softirq_pending: SCHED > [ 532.150589] CPU0: NOHZ: local_softirq_pending: SCHED > [ 532.151589] CPU0: NOHZ: local_softirq_pending: SCHED > [ 532.152589] CPU0: NOHZ: local_softirq_pending: SCHED > [ 532.153589] CPU0: NOHZ: local_softirq_pending: SCHED > [ 713.501844] sd 10:0:0:0: [sda] tag#0 task abort called for > scmd(ffff9e49ee) > [ 713.501848] sd 10:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 42 d7 cf 38 00 > 000 > [ 713.501849] sd 10:0:0:0: task abort: FAILED scmd(ffff9e49ee52b100) > [ 797.469315] sd 10:0:0:0: [sda] tag#1 task abort called for > scmd(ffff9e4bec) > [ 797.469317] sd 10:0:0:0: [sda] tag#1 CDB: Write(10) 2a 00 43 66 cd 90 00 > 00 > [ 797.469319] sd 10:0:0:0: task abort: FAILED scmd(ffff9e4becd0ddc0) > [ 797.469322] sd 10:0:0:0: target reset called for scmd(ffff9e49ee52b100) > [ 797.469324] sd 10:0:0:0: [sda] tag#0 megasas: target reset FAILED!! 
> [ 797.469327] sd 10:0:0:0: [sda] tag#23 Controller reset is requested due > tot > [ 797.469327] SCSI command pointer: (ffff9e4dd075cc40) SCSI host state: 5 I > [ 797.469328] IO request frame: > 0000000a 00000000 00000000 ffcfeb80 00600002 00000020 00000000 00000a > 00000000 0000400a 00000000 00000000 00000000 00000000 00000000 010000 > bd42002a 00002854 00000005 00000000 00000000 00000000 00000000 000000 > 005a0000 00000000 00000000 00000000 00000000 00000000 00000110 000000 > fffa2000 00000000 00000a00 40000000 > > > (In reply to Marcelo Tosatti from comment #22) > > Places that grab both locks are: > > > > 1. put_pwq_unlocked > > 2. try_to_grab_pending > > 3. queue_work_on > > 4. queue_delayed_work_on > > 5. mod_delayed_work_on > > 6. cancel_work_timer (via try_to_grab_pending) > > 7. flush_delayed_work > > > > So perhaps its scheduling work on an isolated CPU, > > with pmd or qemu-kvm-vcpu with priority higher than > > kworker thread, and then it hangs (because the pmd or > > qemu-kvm do not allow kworker to execute). > > > > To dump the trace to some location: > > > > 1) Have a process writing to disk continuously: > > cat /sys/kernel/debug/tracing/trace_pipe > /root/trace.txt > > Before testing, I executed this command in both 2 hosts, reboot after hang > to check the log file, nothing show. > > # cat /sys/kernel/debug/tracing/trace_pipe > /root/trace.txt > > # cat trace.txt > (empty) > > > or > > > > 2) > > You can also trigger a dump of the Ftrace buffer to the console with sysrq-z. > > Could you share more details about this method? Then I can have a try. > > > (In reply to Marcelo Tosatti from comment #23) > > Can you please also enable the following tracepoints: > > > > trace_workqueue_activate_work(work); > > trace_workqueue_queue_work(req_cpu, pwq, work); > > trace_workqueue_activate_work(work); > > trace_workqueue_execute_start(work); > > trace_workqueue_execute_end(work); > > Could you please explain a bit more about how to enable these tracepoints? > As currently I have no such experiences. Thanks. Do the following to enable the tracepoints: # echo workqueue_activate_work workqueue_queue_work workqueue_execute_start workqueue_execute_end > /sys/kernel/debug/tracing/set_event # cat /sys/kernel/debug/tracing/trace_pipe > /root/trace.txt > > > Best Regards, > Pei (In reply to Pei Zhang from comment #27) > Created attachment 1271602 [details] > Call trace info when testing build from marcelo > > Marcelo, please check the log info about 13 hours. Are they expected? Still, > I got empty from /root/trace.txt. > > Note: Test with 3.10.0-644.rt56.560.el7.x86_64, download all the packages > from your link. > > > Best Regards, > Pei Pei, Great, thats what we need. Unfortunately there have been no traces from KVM this time... can you please run again (with the testing build i provided) until the following is seen on the trace (will add to the next comment). (In reply to Marcelo Tosatti from comment #29) > (In reply to Pei Zhang from comment #27) > > Created attachment 1271602 [details] > > Call trace info when testing build from marcelo > > > > Marcelo, please check the log info about 13 hours. Are they expected? Still, > > I got empty from /root/trace.txt. > > > > Note: Test with 3.10.0-644.rt56.560.el7.x86_64, download all the packages > > from your link. > > > > > > Best Regards, > > Pei > > Pei, > > Great, thats what we need. Unfortunately there have been no traces from > KVM this time... 
can you please run again (with the testing build i > provided) until the following is seen on the trace (will add to the next > comment). This one: [ 1234.528484] INFO: task worker:4477 blocked for more than. [ 1234.535673] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this m. [ 1234.544412] worker D ffff880858bc50f0 0 4477 1 0x00000080 [ 1234.552295] ffff880461f33c88 0000000000000002 ffff880461f33fd8 ffff880461f33f8 [ 1234.560584] ffff880461f33fd8 ffff880461f33fd8 ffff88066d60d0f0 ffff880858bc500 [ 1234.568870] ffff880461f33fd8 ffff88086639fb48 ffff880461f33cd0 000000000000026 [ 1234.577158] Call Trace: [ 1234.579888] [<ffffffff8d69ce90>] schedule+0x30/0x96 [ 1234.585426] [<ffffffff8d69dc4d>] rt_spin_lock_slowlock+0x13d/0x350 [ 1234.592420] [<ffffffff8d69ede5>] rt_spin_lock+0x25/0x30 [ 1234.598364] [<ffffffffc05aec53>] kvm_mmu_notifier_invalidate_range_end+0x23/0] [ 1234.607299] [<ffffffff8d1d5af4>] __mmu_notifier_invalidate_range_end+0x64/0xe0 [ 1234.615456] [<ffffffff8d1afd75>] unmap_vmas+0x85/0x90 [ 1234.621190] [<ffffffff8d1b4c5e>] unmap_region+0xbe/0x140 [ 1234.627214] [<ffffffff8d1b70d2>] do_munmap+0x292/0x470 [ 1234.633043] [<ffffffff8d1b7318>] vm_munmap+0x68/0xb0 [ 1234.638677] [<ffffffff8d1b83b2>] SyS_munmap+0x22/0x30 [ 1234.644409] [<ffffffff8d6a7d94>] tracesys+0xdd/0xe2 Created attachment 1272008 [details] trace.txt from /sys/kernel/debug/tracing/trace_pipe (In reply to Marcelo Tosatti from comment #29) > (In reply to Pei Zhang from comment #27) > > Created attachment 1271602 [details] > > Call trace info when testing build from marcelo > > > > Marcelo, please check the log info about 13 hours. Are they expected? Still, > > I got empty from /root/trace.txt. > > > > Note: Test with 3.10.0-644.rt56.560.el7.x86_64, download all the packages > > from your link. > > > > > > Best Regards, > > Pei > > Pei, > > Great, thats what we need. Unfortunately there have been no traces from > KVM this time... can you please run again (with the testing build i > provided) until the following is seen on the trace (will add to the next > comment). Hi Marcelo, The trace info is attached to this comment. I'd like to high light below info: 1. With these below commands, in my many times testing(>10 times), the hosts never hang. Even though the call trace occurs, but the host didn't hang. [1]# echo workqueue_activate_work workqueue_queue_work workqueue_execute_start [2]# cat /sys/kernel/debug/tracing/trace_pipe > /root/trace.txt 2. I tried to bind process of [2] to CPU0, and also tried CPU13, the hang issue still can not be reproduced. The original values of process of [2]are " pid 6379's current affinity list: 0,1,3,5,7,9,11,13"(These are non isolated cores of host). Best Regards, Pei Created attachment 1272010 [details] Call trace info of Comment 31 Created attachment 1272021 [details]
trace.txt from /sys/kernel/debug/tracing/trace_pipe - update
This trace log contains the info from when "INFO: task qemu-kvm:4471 blocked for more than 600 seconds." was printed on the console.
Created attachment 1272022 [details] Call trace info of Comment 33 Update info: The host keeping printing call trace info, however the host still didn't hang. Created attachment 1272023 [details] Call trace info of Comment 33 - update (In reply to Pei Zhang from comment #33) > Created attachment 1272021 [details] > trace.txt from /sys/kernel/debug/tracing/trace_pipe - update > > This trace log contains the info when "INFO: task qemu-kvm:4471 blocked for > more than 600 seconds." prints in console. Sorry, please ignore Comment 34. Re-upload the correct console info. Best Regards, Pei (In reply to Pei Zhang from comment #31) > Created attachment 1272008 [details] > trace.txt from /sys/kernel/debug/tracing/trace_pipe > > (In reply to Marcelo Tosatti from comment #29) > > (In reply to Pei Zhang from comment #27) > > > Created attachment 1271602 [details] > > > Call trace info when testing build from marcelo > > > > > > Marcelo, please check the log info about 13 hours. Are they expected? Still, > > > I got empty from /root/trace.txt. > > > > > > Note: Test with 3.10.0-644.rt56.560.el7.x86_64, download all the packages > > > from your link. > > > > > > > > > Best Regards, > > > Pei > > > > Pei, > > > > Great, thats what we need. Unfortunately there have been no traces from > > KVM this time... can you please run again (with the testing build i > > provided) until the following is seen on the trace (will add to the next > > comment). > > Hi Marcelo, > > The trace info is attached to this comment. > > I'd like to high light below info: > > 1. With these below commands, in my many times testing(>10 times), the hosts > never hang. Even though the call trace occurs, but the host didn't hang. > > [1]# echo workqueue_activate_work workqueue_queue_work > workqueue_execute_start > [2]# cat /sys/kernel/debug/tracing/trace_pipe > /root/trace.txt > > 2. I tried to bind process of [2] to CPU0, and also tried CPU13, the hang > issue still can not be reproduced. > > The original values of process of [2]are > " pid 6379's current affinity list: 0,1,3,5,7,9,11,13"(These are non > isolated cores of host). > > > > Best Regards, > Pei cat 1272008.txt | cut -f 2 -d "[" | cut -f 1 -d "]" | sort | uniq | while read z; do echo -n " $z, "; done 000, 001, 002, 003, 004, 005, 006, 007, 008, 009, 010, 011, 012, 013, 014, 015, 016, 017, 018, 019 So workitems are being executed on the isolated cpus 2, 4, 6, 8, 10, 12, 14, 16 and 18. 
They are: <...>-3763 1 workqueue_execute_end: work struct ffffa0cb6ee13e20 <...>-3763 1 workqueue_execute_start: work struct ffffa0cb6ee13e20: function vmstat_update ktimersoftd/14-131 1 workqueue_activate_work: work struct ffffffffc07a3c60 ktimersoftd/14-131 1 workqueue_queue_work: work struct=ffffffffc07a3c60 function=jump_label_update_timeout workqueue=ffffa0c9ffc08e00 req_cpu=5120 cpu=14 ktimersoftd/8-80 [008] workqueue_activate_work: work struct ffffa0cb6ed13e20 ktimersoftd/8-80 [008] workqueue_queue_work: work struct=ffffa0cb6ed13e20 function=vmstat_update workqueue=ffffa0c9ffc08e00 req_cpu=5120 cpu=8 kworker/10:1-3758 1 workqueue_execute_end: work struct ffffa0cb6ed53e20 kworker/10:1-3758 1 workqueue_execute_start: work struct ffffa0cb6ed53e20: function vmstat_update kworker/12:1-3760 1 653.453440: workqueue_execute_start: work struct ffffa0cb6ed93e20: function vmstat_update kworker/12:1-3760 1 653.453443: workqueue_execute_end: work struct ffffa0cb6ed93e20 kworker/12:1-3760 1 926.440905: workqueue_execute_start: work struct ffffa0cb6ed93e20: function vmstat_update kworker/14:1-3761 1 workqueue_execute_end: work struct ffffa0cb6edd3e20 kworker/14:1-3761 1 workqueue_execute_end: work struct ffffffffc07a3c60 kworker/14:1-3761 1 workqueue_execute_start: work struct ffffa0cb6edd3e20: function vmstat_update kworker/14:1-3761 1 workqueue_execute_start: work struct ffffffffc07a3c60: function jump_label_update_timeout kworker/16:1-3763 1 workqueue_execute_end: work struct ffffa0cb6ee13e20 kworker/16:1-3763 1 workqueue_execute_start: work struct ffffa0cb6ee13e20: function vmstat_update kworker/18:1-3765 1 workqueue_execute_end: work struct ffffa0cb6ee53e20 kworker/18:1-3765 1 workqueue_execute_start: work struct ffffa0cb6ee53e20: function vmstat_update kworker/2:1-3755 2 workqueue_execute_end: work struct ffffa0cb6ec53e20 kworker/2:1-3755 2 workqueue_execute_start: work struct ffffa0cb6ec53e20: function vmstat_update kworker/4:1-3756 [004] workqueue_execute_end: work struct ffffa0cb6ec93e20 kworker/4:1-3756 [004] workqueue_execute_start: work struct ffffa0cb6ec93e20: function vmstat_update kworker/6:1-3757 [006] workqueue_execute_end: work struct ffffa0cb6ecd3e20 kworker/6:1-3757 [006] workqueue_execute_start: work struct ffffa0cb6ecd3e20: function vmstat_update kworker/8:1-3759 [008] workqueue_execute_end: work struct ffffa0cb6ed13e20 kworker/8:1-3759 [008] workqueue_execute_start: work struct ffffa0cb6ed13e20: function vmstat_update Luiz IIRC you fixed jump_label_update_timeout ? I'm working on fixing vmstat_update. Note this is unrelated to the hang. (In reply to Pei Zhang from comment #32) > Created attachment 1272010 [details] > Call trace info of Comment 31 This one is waiting a RCU grace period to proceed, but fails to do so for more than 600 seconds. What is blocking RCU grace periods? Can a long lived process in userspace which does not exit to kernel mode cause such a thing? [ 4235.640305] INFO: task qemu-kvm:4471 blocked for more than 600 seconds. [ 4235.647686] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables. 
[ 4235.656423] qemu-kvm D ffffa0cb6c6350f0 0 4471 1 0x0006
[ 4235.664307] ffffa0c777cf3cd8 0000000000000002 ffffa0c777cf3fd8 ffffa0c8
[ 4235.672597] ffffa0c777cf3fd8 ffffa0c777cf3fd8 ffffa0cd6d57d0f0 ffffa0c0
[ 4235.680885] 7fffffffffffffff ffffa0c777cf3e10 ffffa0cb6c6350f0 ffffa0c0
[ 4235.689175] Call Trace:
[ 4235.691905] [<ffffffffa569fc30>] schedule+0x30/0x96
[ 4235.697435] [<ffffffffa569de21>] schedule_timeout+0x2e1/0x370
[ 4235.703944] [<ffffffffa569de21>] ? schedule_timeout+0x2e1/0x370
[ 4235.710645] [<ffffffffa51554bb>] ? probe_sched_wakeup+0x2b/0xc0
[ 4235.717348] [<ffffffffa50b9752>] ? ttwu_do_wakeup+0xc2/0x110
[ 4235.723759] [<ffffffffa569ec14>] wait_for_completion+0xc4/0x100
[ 4235.730461] [<ffffffffa5136b60>] ? call_rcu_sched+0x20/0x20
[ 4235.736775] [<ffffffffa50a3fab>] wait_rcu_gp+0x5b/0x80
[ 4235.742602] [<ffffffffa50a3f30>] ? __rcu_read_unlock+0x60/0x60
[ 4235.749207] [<ffffffffa5134e89>] synchronize_rcu+0x29/0x40
[ 4235.755425] [<ffffffffc09f68bf>] vhost_net_release+0x7f/0xc0 [vhost_ne]
[ 4235.763000] [<ffffffffa51fc423>] __fput+0xe3/0x260
[ 4235.768442] [<ffffffffa51fc68e>] ____fput+0xe/0x10
[ 4235.773882] [<ffffffffa50a3c07>] task_work_run+0xa7/0xf0
[ 4235.779908] [<ffffffffa501bdfe>] do_notify_resume+0xae/0x100
[ 4235.786320] [<ffffffffa56aac01>] int_signal+0x12/0x17
[ 4235.792052] debug_kvm_hung
[ 4235.795069] arch/x86/kvm/../../../virt/kvm/kvm_main.c: 385

(In reply to Marcelo Tosatti from comment #36)
> Luiz IIRC you fixed jump_label_update_timeout ?

I fixed a different jump_label issue.

jump_label_update_timeout looks like a timer. But to understand why it's executing we'd to understand what the process running on that CPU is doing and why this work item was queued. Btw, the fact it executes is not necessarily a problem.

(In reply to Luiz Capitulino from comment #38)
> (In reply to Marcelo Tosatti from comment #36)
> > Luiz IIRC you fixed jump_label_update_timeout ?
>
> I fixed a different jump_label issue.

Ah, OK, i'll take a look at that if no one beats me to it.

> jump_label_update_timeout looks like a timer. But to understand why it's executing we'd to understand what the process running on that CPU is doing and why this work item was queued. Btw, the fact it executes is not necessarily a problem.

Because the priority of qemu-kvm-vcpu is higher, you mean?

And: if someone else does flush_work() on that work item, its a problem then, correct?

(In reply to Marcelo Tosatti from comment #39)
> (In reply to Luiz Capitulino from comment #38)
> > (In reply to Marcelo Tosatti from comment #36)
> > > Luiz IIRC you fixed jump_label_update_timeout ?
> >
> > I fixed a different jump_label issue.
>
> Ah, OK, i'll take a look at that if no one beats me to it.
>
> > jump_label_update_timeout looks like a timer. But to understand why it's executing we'd to understand what the process running on that CPU is doing and why this work item was queued. Btw, the fact it executes is not necessarily a problem.
>
> Because the priority of qemu-kvm-vcpu is higher, you mean?

That and whether or not the condition that caused that timer to expire can also happen during real-time operation.

> And: if someone else does flush_work() on that work item, its a problem then, correct?

I think my definition of a problem would be: provided the system is correctly configured for KVM-RT, any condition on the system that can happen during real-time operation and which causes anything to be scheduled to run on an isolated CPU.
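One quick way to spot that condition during a test run is to check whether the per-CPU kworkers on the isolated cores have accumulated any CPU time. A minimal sketch, assuming the isolated CPU list 2, 4, 6, 8, 10, 12, 14, 16, 18 from this setup (the loop is illustrative, not a command used in this bug):

for cpu in 2 4 6 8 10 12 14 16 18; do            # assumption: isolated CPUs in this setup
    for pid in $(pgrep "^kworker/${cpu}:"); do
        # psr = CPU the kworker is bound to, time = accumulated CPU time;
        # any growth here during the run means work items executed on an isolated core
        ps -o pid=,comm=,psr=,time= -p "$pid"
    done
done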
IMHO, Work items that starve but don't generate any serious consequences may be tolerated for the immediate term, but should be fixed in the medium to long term.

(In reply to Luiz Capitulino from comment #40)
> (In reply to Marcelo Tosatti from comment #39)
> > (In reply to Luiz Capitulino from comment #38)
> > > (In reply to Marcelo Tosatti from comment #36)
> > > > Luiz IIRC you fixed jump_label_update_timeout ?
> > >
> > > I fixed a different jump_label issue.
> >
> > Ah, OK, i'll take a look at that if no one beats me to it.
> >
> > > jump_label_update_timeout looks like a timer. But to understand why it's executing we'd to understand what the process running on that CPU is doing and why this work item was queued. Btw, the fact it executes is not necessarily a problem.
> >
> > Because the priority of qemu-kvm-vcpu is higher, you mean?
>
> That and whether or not the condition that caused that timer to expire can also happen during real-time operation.
>
> > And: if someone else does flush_work() on that work item, its a problem then, correct?
>
> I think my definition of a problem would be: provided the system is correctly configured for KVM-RT, any condition on the system that can happen during real-time operation and which causes anything to be scheduled to run on an isolated CPU.
>
> IMHO, Work items that starve but don't generate any serious consequences may be tolerated for the immediate term, but should be fixed in the medium to long term.

Well, its going to hang the task which does flush_work(). This is going to generate a "task hung for more than 600 seconds" error which:

1) You don't know what else depends on the progress of that hung task.
2) Its going to generate support calls.

So it can't be tolerated, IMO.

Should we add a WARN_ON() to any work item that is scheduled on an isolated CPU? That's easy and will catch the offenders.

Ah, I see what you mean.

We've fixed one instance of that problem in the past:

commit 8c95bc87c79668483a6542ccadc73958fb865367
Author: Luiz Capitulino <lcapitulino>
Date:   Fri May 27 15:03:28 2016 +0200

    mm: perform lru_add_drain_all() remotely

Have you found others?

PS: The commit is from the upstream RT tree.

(In reply to Luiz Capitulino from comment #42)
> Ah, I see what you mean.
>
> We've fixed one instance of that problem in the past:
>
> commit 8c95bc87c79668483a6542ccadc73958fb865367
> Author: Luiz Capitulino <lcapitulino>
> Date:   Fri May 27 15:03:28 2016 +0200
>
>     mm: perform lru_add_drain_all() remotely
>
> Have you found others?
>
> PS: The commit is from the upstream RT tree.

Yes, that. But rather than debug it and trace it down, better to have a warning up front. I'll submit a patch. Thanks.

Scheduling work on an isolated CPU may not be always wrong. It will cause problems for KVM-RT when running DPDK-based applications because the PMD threads execute 100% of the time in user-space with fifo priority so it never requests any kind of kernel service and won't let kworker run on the isolated CPU.

Regular applications that requests kernel services etc, may have valid reasons to get work queued on the isolated CPUs. It really depends on what the application is doing.

(In reply to Luiz Capitulino from comment #44)
> Scheduling work on an isolated CPU may not be always wrong.
> It will cause problems for KVM-RT when running DPDK-based applications because the PMD threads execute 100% of the time in user-space with fifo priority so it never requests any kind of kernel service and won't let kworker run on the isolated CPU.
>
> Regular applications that requests kernel services etc, may have valid reasons to get work queued on the isolated CPUs. It really depends on what the application is doing.

Ok, so WARN_ON() if:

1. Isolated CPU with DPDK poll mode (or similar app), and FIFO priority of thread is higher than kworker.

Opened a separate BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1443951

o Anything that prevents RCU's grace-period kthreads from running. This can result in the "All QSes seen" console-log message. This message will include information on when the kthread last ran and how often it should be expected to run.

[ 2037.529094] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 10, t)
[ 2037.529095] All QSes seen, last rcu_preempt kthread activity 0 (4296705199-42967053
[ 2037.529096] pmd91 R running task 0 3892 1 0x00080080
[ 2037.529098] ffff88046cf40000 000000005617b23e ffff88046ed43dd0 ffffffff990bc396
[ 2037.529099] ffff88046ed520c0 ffffffff99a2bc00 ffff88046ed43e38 ffffffff9913867d
[ 2037.529100] ffff8802ec8650e0 ffff88046ed520c0 0000000000009467 0000000000000000
[ 2037.529101] Call Trace:
[ 2037.529105] <IRQ> [<ffffffff990bc396>] sched_show_task+0xb6/0x120
[ 2037.529108] [<ffffffff9913867d>] rcu_check_callbacks+0x83d/0x860
[ 2037.529110] [<ffffffff9908fef1>] update_process_times+0x41/0x70
[ 2037.529113] [<ffffffff990ebed5>] tick_sched_handle.isra.18+0x25/0x60
[ 2037.529114] [<ffffffff990ec273>] tick_sched_timer+0x43/0x70
[ 2037.529117] [<ffffffff990abed4>] __run_hrtimer+0xc4/0x2c0
[ 2037.529119] [<ffffffff990ec230>] ? tick_sched_do_timer+0x50/0x50
[ 2037.529120] [<ffffffff990ace50>] hrtimer_interrupt+0x130/0x350
[ 2037.529123] [<ffffffff99044b85>] local_apic_timer_interrupt+0x35/0x60
[ 2037.529125] [<ffffffff996aa09d>] smp_apic_timer_interrupt+0x3d/0x50
[ 2037.529127] [<ffffffff996a881d>] apic_timer_interrupt+0x6d/0x80
[ 2217.528187] <EOI>

On my machine the rcu_sched task (grace period kthread) is executing on a non-isolated CPU.

However in your case it seems pmd91 is executing on a CPU on which rcu_preempt can execute. Check rcu_preempt's mask:

[root@localhost RHEL7]# ps axuw | egrep "preempt|bh|sched"
root        12  0.0  0.0      0     0 ?        S    07:41   0:00 [rcu_preempt]
root        13  0.0  0.0      0     0 ?        S    07:41   0:00 [rcu_bh]
root        14  0.0  0.0      0     0 ?        S    07:41   0:00 [rcu_sched]
root      3197  0.0  0.0 112652   952 pts/0    S+   07:43   0:00 grep -E --color=auto preempt|bh|sched
[root@localhost RHEL7]# taskset -p 12
pid 12's current affinity mask: 3
[root@localhost RHEL7]# taskset -p 13
pid 13's current affinity mask: 3
[root@localhost RHEL7]# taskset -p 14
pid 14's current affinity mask: 3

(The mask should be the mask of non-isolated CPUs.)

Then confirm that pmd91 is executing on an isolated CPU.

(In reply to Marcelo Tosatti from comment #45)
> (In reply to Luiz Capitulino from comment #44)
> > Scheduling work on an isolated CPU may not be always wrong. It will cause problems for KVM-RT when running DPDK-based applications because the PMD threads execute 100% of the time in user-space with fifo priority so it never requests any kind of kernel service and won't let kworker run on the isolated CPU.
> >
> > Regular applications that requests kernel services etc, may have valid reasons to get work queued on the isolated CPUs.
> > It really depends on what the application is doing.
>
> Ok, so WARN_ON() if:
>
> 1. Isolated CPU with DPDK poll mode (or similar app), and FIFO priority of thread is higher than kworker.
>
> Opened a separate BZ:
> https://bugzilla.redhat.com/show_bug.cgi?id=1443951

Yeah, that's perfect!

(In reply to Marcelo Tosatti from comment #46)
> o Anything that prevents RCU's grace-period kthreads from running. This can result in the "All QSes seen" console-log message. This message will include information on when the kthread last ran and how often it should be expected to run.
>
> [ 2037.529094] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 10, t)
> [ 2037.529095] All QSes seen, last rcu_preempt kthread activity 0 (4296705199-42967053
> [ 2037.529096] pmd91 R running task 0 3892 1 0x00080080
> [ 2037.529098] ffff88046cf40000 000000005617b23e ffff88046ed43dd0 ffffffff990bc396
> [ 2037.529099] ffff88046ed520c0 ffffffff99a2bc00 ffff88046ed43e38 ffffffff9913867d
> [ 2037.529100] ffff8802ec8650e0 ffff88046ed520c0 0000000000009467 0000000000000000
> [ 2037.529101] Call Trace:
> [ 2037.529105] <IRQ> [<ffffffff990bc396>] sched_show_task+0xb6/0x120
> [ 2037.529108] [<ffffffff9913867d>] rcu_check_callbacks+0x83d/0x860
> [ 2037.529110] [<ffffffff9908fef1>] update_process_times+0x41/0x70
> [ 2037.529113] [<ffffffff990ebed5>] tick_sched_handle.isra.18+0x25/0x60
> [ 2037.529114] [<ffffffff990ec273>] tick_sched_timer+0x43/0x70
> [ 2037.529117] [<ffffffff990abed4>] __run_hrtimer+0xc4/0x2c0
> [ 2037.529119] [<ffffffff990ec230>] ? tick_sched_do_timer+0x50/0x50
> [ 2037.529120] [<ffffffff990ace50>] hrtimer_interrupt+0x130/0x350
> [ 2037.529123] [<ffffffff99044b85>] local_apic_timer_interrupt+0x35/0x60
> [ 2037.529125] [<ffffffff996aa09d>] smp_apic_timer_interrupt+0x3d/0x50
> [ 2037.529127] [<ffffffff996a881d>] apic_timer_interrupt+0x6d/0x80
> [ 2217.528187] <EOI>
>
> On my machine the rcu_sched task (grace period kthread) is executing on a non-isolated CPU.
>
> However in your case it seems pmd91 is executing on a CPU on which rcu_preempt can execute. Check rcu_preempt's mask:
>
> [root@localhost RHEL7]# ps axuw | egrep "preempt|bh|sched"
> root        12  0.0  0.0      0     0 ?        S    07:41   0:00 [rcu_preempt]
> root        13  0.0  0.0      0     0 ?        S    07:41   0:00 [rcu_bh]
> root        14  0.0  0.0      0     0 ?        S    07:41   0:00 [rcu_sched]
> root      3197  0.0  0.0 112652   952 pts/0    S+   07:43   0:00 grep -E --color=auto preempt|bh|sched
> [root@localhost RHEL7]# taskset -p 12
> pid 12's current affinity mask: 3
> [root@localhost RHEL7]# taskset -p 13
> pid 13's current affinity mask: 3
> [root@localhost RHEL7]# taskset -p 14
> pid 14's current affinity mask: 3
>
> (The mask should be the mask of non-isolated CPUs.)
>
> Then confirm that pmd91 is executing on an isolated CPU.

Marcelo, sorry for the late response.

- ovs pmds are working on separate isolated cores:

[root@dell-per730-11 ~]# top -d1 -H
top - 14:32:06 up 2:03, 1 user, load average: 8.00, 8.01, 8.05
Threads: 512 total, 9 running, 503 sleeping, 0 stopped, 0 zombie
%Cpu(s): 28.0 us, 0.1 sy, 0.0 ni, 71.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 32727380 total, 10220968 free, 21948196 used, 558216 buff/cache
KiB Swap: 16449532 total, 16449532 free, 0 used. 10538744 avail Mem

  PID USER  PR  NI    VIRT   RES  SHR S %CPU %MEM     TIME+ COMMAND
 4200 root -96   0 4500568 33504 2352 R 90.1  0.1 117:36.47 pmd88
 4201 root -96   0 4500568 33504 2352 R 89.1  0.1 117:36.46 pmd89
 4204 root -96   0 4500568 33504 2352 R 89.1  0.1 117:36.46 pmd87
 4199 root -96   0 4500568 33504 2352 R 88.1  0.1 117:36.45 pmd86
 4202 root -96   0 4500568 33504 2352 R 86.1  0.1 117:36.43 pmd90
 4203 root -96   0 4500568 33504 2352 R 66.3  0.1 117:36.23 pmd91
 8679 root  20   0  158120  2620 1552 R  1.0  0.0   0:00.03 top
    1 root  20   0  191032  4128 2496 S  0.0  0.0   0:09.43 systemd
    2 root  20   0       0     0    0 S  0.0  0.0   0:00.00 kthreadd
    3 root  -3   0       0     0    0 S  0.0  0.0   0:00.23 ksoftirqd/0
    4 root  -4   0       0     0    0 S  0.0  0.0   0:08.19 ktimersoft+
    9 root  rt   0       0     0    0 S  0.0  0.0   0:00.00 posixcputm+
   10 root  20   0       0     0    0 S  0.0  0.0   0:00.00 kswork
   11 root  20   0       0     0    0 S  0.0  0.0   0:00.00 kcmosdelayd

[root@dell-per730-11 ~]# taskset -p 4200
pid 4200's current affinity mask: 10
[root@dell-per730-11 ~]# taskset -p 4201
pid 4201's current affinity mask: 40
[root@dell-per730-11 ~]# taskset -p 4204
pid 4204's current affinity mask: 1000
[root@dell-per730-11 ~]# taskset -p 4199
pid 4199's current affinity mask: 4
[root@dell-per730-11 ~]# taskset -p 4202
pid 4202's current affinity mask: 100
[root@dell-per730-11 ~]# taskset -p 4202
pid 4202's current affinity mask: 100
[root@dell-per730-11 ~]# taskset -p 4203
pid 4203's current affinity mask: 400

- rcu_preempt is working on non-isolated cores.

[root@dell-per730-11 ~]# ps axuw | egrep "preempt|bh|sched"
root        13  0.0  0.0      0     0 ?        S    12:29   0:00 [rcu_preempt]
root        14  0.0  0.0      0     0 ?        S    12:29   0:00 [rcu_bh]
root        15  0.0  0.0      0     0 ?        S    12:29   0:00 [rcu_sched]
root      8265  0.0  0.0 112668   960 pts/0    S+   13:47   0:00 grep -E --color=auto preempt|bh|sched
[root@dell-per730-11 ~]# taskset -p 13
pid 13's current affinity mask: 2aab
[root@dell-per730-11 ~]# taskset -p 14
pid 14's current affinity mask: 2aab
[root@dell-per730-11 ~]# taskset -p 15
pid 15's current affinity mask: 2aab

Thanks,
Pei

*** Bug 1448770 has been marked as a duplicate of this bug. ***

Pei Zhang, can you please try the following:

1. Make sure you're using tuned-2.8.0-3.el7 in host and guest
2. In the host assign FIFO:1 for PMD threads (and not FIFO:95)
3. In the guest you can assign FIFO:95 to testpmd if you're running it

I'm afraid the host NFV profile is not configured to run a PMD thread with FIFO:95 as we have kernel threads such as softirq and ktimersoftd that may need to execute in some situations.

Would be nice to try what I'm suggesting with the reproducer for bug 1448770, which is simpler.

Oops, possibly wrong information in item 3. Please, try with fifo:1 in the guest too.

(In reply to Luiz Capitulino from comment #50)
> Pei Zhang, can you please try the following:
>
> 1. Make sure you're using tuned-2.8.0-3.el7 in host and guest
> 2. In the host assign FIFO:1 for PMD threads (and not FIFO:95)
> 3. In the guest you can assign FIFO:95 to testpmd if you're running it
>
> I'm afraid the host NFV profile is not configured to run a PMD thread with FIFO:95 as we have kernel threads such as softirq and ktimersoftd that may need to execute in some situations.

Test with:
3.10.0-666.rt56.585.el7.x86_64
tuned-2.8.0-3.el7.noarch

Set ovs pmd threads with:
# chrt -fp 1 15859

Boot testpmd in guest with:
# chrt -f 1 /usr/bin/testpmd ...

I tried 4 times; after several (about 2 to 5) migrations, the guest always hits a kernel panic.
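For reference, the guest vmcores from such a panic can be opened with the crash utility; a hedged sketch (the kernel version and vmcore path here are assumptions, and the matching kernel-debuginfo package must be installed first):

KVER=3.10.0-666.rt56.585.el7.x86_64              # assumption: guest kernel under test
VMCORE=vmcore                                    # placeholder: one of the dumps linked below
crash /usr/lib/debug/lib/modules/$KVER/vmlinux "$VMCORE"
# inside crash:
#   log   - kernel ring buffer leading up to the panic
#   bt    - backtrace of the panicking task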
Please refer to:
http://fileshare.englab.nay.redhat.com/pub/section2/coredump/var/crash/pezhang/bug1438120/

But both src and des hosts work well, no call trace, no error.

> Would be nice to try what I'm suggesting with the reproducer for bug 1448770, which is simpler.

I'll test and confirm if "chrt -fp 1 $ovs_pmd_process" can solve the host hang/call trace issue.

Best Regards,
Pei

The panic from comment 53 is bug 1441552. What I suggest you do for this BZ is to try to reproduce the original reported issue by assigning fifo:1 to the PMD and testpmd threads. Then:

1. If you keep hitting 1441552 then maybe set this BZ as a dependency of bug 1441552 and wait until bug 1441552 is fixed. Once bug 1441552 is fixed, go back and try to reproduce again

2. If you hit latency spikes with fifo:1, open BZs for them

3. If you can't reproduce the original rcu stall issue with fifo:1, then close this as NOTABUG (independently if you hit unrelated issues along the way)

(In reply to Luiz Capitulino from comment #56)
> The panic from comment 53 is bug 1441552. What I suggest you do for this BZ is to try to reproduce the original reported issue by assigning fifo:1 to the PMD and testpmd threads. Then:
>
> 1. If you keep hitting 1441552 then maybe set this BZ as a dependency of bug 1441552 and wait until bug 1441552 is fixed. Once bug 1441552 is fixed, go back and try to reproduce again
>
> 2. If you hit latency spikes with fifo:1, open BZs for them

Yes, or just reopen https://bugzilla.redhat.com/show_bug.cgi?id=1384374 please.

Luiz, your tool can probably help track down the issue very quickly.

> 3. If you can't reproduce the original rcu stall issue with fifo:1, then close this as NOTABUG (independently if you hit unrelated issues along the way)

(In reply to Marcelo Tosatti from comment #57)
> (In reply to Luiz Capitulino from comment #56)
> > The panic from comment 53 is bug 1441552. What I suggest you do for this BZ is to try to reproduce the original reported issue by assigning fifo:1 to the PMD and testpmd threads. Then:
> >
> > 1. If you keep hitting 1441552 then maybe set this BZ as a dependency of bug 1441552 and wait until bug 1441552 is fixed. Once bug 1441552 is fixed, go back and try to reproduce again
> >
> > 2. If you hit latency spikes with fifo:1, open BZs for them
>
> Yes, or just reopen https://bugzilla.redhat.com/show_bug.cgi?id=1384374 please.
>
> Luiz, your tool can probably help track down the issue very quickly.

Pei Zhang, you can also start by enabling the tracepoints we discussed on this bug and attaching the log to that BZ.

Thanks.

> > 3. If you can't reproduce the original rcu stall issue with fifo:1, then close this as NOTABUG (independently if you hit unrelated issues along the way)

(In reply to Luiz Capitulino from comment #56)
> The panic from comment 53 is bug 1441552. What I suggest you do for this BZ is to try to reproduce the original reported issue by assigning fifo:1 to the PMD and testpmd threads. Then:

Hi Luiz, could you please explain a little more about the difference between fifo:1 and fifo:95? In our next rt related testing, should we also test with fifo:1?

> 1. If you keep hitting 1441552 then maybe set this BZ as a dependency of bug 1441552 and wait until bug 1441552 is fixed. Once bug 1441552 is fixed, go back and try to reproduce again

Yes, I keep hitting 1441552. I'll set the dependency.

> 2. If you hit latency spikes with fifo:1, open BZs for them

OK, I'll test the L2 network latency.

> 3. If you can't reproduce the original rcu stall issue with fifo:1, then close this as NOTABUG (independently if you hit unrelated issues along the way)

OK.

Thanks,
Pei

(In reply to Marcelo Tosatti from comment #58)
> Pei Zhang, you can also start by enabling the tracepoints we discussed on this bug and attaching the log to that BZ.
>
> Thanks.

OK. Thanks for the reminder.

Best Regards,
Pei

(In reply to Pei Zhang from comment #59)
> (In reply to Luiz Capitulino from comment #56)
> > The panic from comment 53 is bug 1441552. What I suggest you do for this BZ is to try to reproduce the original reported issue by assigning fifo:1 to the PMD and testpmd threads. Then:
>
> Hi Luiz, could you please explain a little more about the difference between fifo:1 and fifo:95? In our next rt related testing, should we also test with fifo:1?

In the RT kernel, there are some kernel threads that must execute when they are runnable: rcu, migration, ksoftirqd and ktimersoftd. It's not expected that those threads will ever become runnable, but if they do and don't execute the system may crash as you've been observing.

We set the priority of those threads higher than fifo:1, so that threads that take 100% of the CPU can be set at fifo:1 without problems. If you use fifo:95, that's higher than the kernel threads' priority and they will starve if they become runnable.

Now, there are two important details:

- This is only a problem for threads taking 100% of the CPU with fifo priority. Threads that sleep and wake up (like cyclictest) can run at fifo:95 because it will yield the CPU soon

- I've just found out that kernel priorities in the guest are wrong! This could cause problems in this test-case even if you set testpmd to fifo:1 in the guest. I'll open a BZ for this

(In reply to Luiz Capitulino from comment #61)
> (In reply to Pei Zhang from comment #59)
> > (In reply to Luiz Capitulino from comment #56)
> > > The panic from comment 53 is bug 1441552. What I suggest you do for this BZ is to try to reproduce the original reported issue by assigning fifo:1 to the PMD and testpmd threads. Then:
> >
> > Hi Luiz, could you please explain a little more about the difference between fifo:1 and fifo:95? In our next rt related testing, should we also test with fifo:1?
>
> In the RT kernel, there are some kernel threads that must execute when they are runnable: rcu, migration, ksoftirqd and ktimersoftd. It's not expected that those threads will ever become runnable, but if they do and don't execute the system may crash as you've been observing.
>
> We set the priority of those threads higher than fifo:1, so that threads that take 100% of the CPU can be set at fifo:1 without problems. If you use fifo:95, that's higher than the kernel threads' priority and they will starve if they become runnable.
>
> Now, there are two important details:
>
> - This is only a problem for threads taking 100% of the CPU with fifo priority. Threads that sleep and wake up (like cyclictest) can run at fifo:95 because it will yield the CPU soon

Thanks very much Luiz for the detailed explanation. It really makes sense to me.

So on any CPU, these kernel threads should execute when they are runnable. And in our testing, all DPDK pmd threads should be set with fifo:1 as they all take 100% of the CPU.

Thanks,
Pei

> - I've just found out that kernel priorities in the guest are wrong! This could cause problems in this test-case even if you set testpmd to fifo:1 in the guest. I'll open a BZ for this

Exactly!
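To see the relationship Luiz describes on a live host, the RT priorities of those kernel threads can be compared against one PMD thread directly. A minimal sketch, assuming isolated CPU 8 and the thread names seen earlier in this thread (not a command sequence taken from the bug):

for name in migration/8 ksoftirqd/8 ktimersoftd/8 rcu_preempt; do
    pid=$(pgrep -x "$name") && chrt -p "$pid"    # per the explanation above, these are kept above fifo:1
done
# one OVS PMD thread for comparison; with the recommended setting it reports SCHED_FIFO priority 1
pmd_tid=$(ps -To tid=,comm= -p "$(pidof ovs-vswitchd)" | awk '$2 ~ /^pmd/ {print $1; exit}')
chrt -p "$pmd_tid"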
But note that we have a bug in the guest that has to be fixed before you can do this testing:

Bug 1452357 - realtime-virtual-guest: not safe to run PMD thread with fifo:1

(In reply to Marcelo Tosatti from comment #36)
> (In reply to Pei Zhang from comment #31)
> > Created attachment 1272008 [details]
> > trace.txt from /sys/kernel/debug/tracing/trace_pipe
> >
> > (In reply to Marcelo Tosatti from comment #29)
> > > (In reply to Pei Zhang from comment #27)
> > > > Created attachment 1271602 [details]
> > > > Call trace info when testing build from marcelo
> > > >
> > > > Marcelo, please check the log info about 13 hours. Are they expected? Still, I got empty from /root/trace.txt.
> > > >
> > > > Note: Test with 3.10.0-644.rt56.560.el7.x86_64, download all the packages from your link.
> > > >
> > > > Best Regards,
> > > > Pei
> > >
> > > Pei,
> > >
> > > Great, thats what we need. Unfortunately there have been no traces from KVM this time... can you please run again (with the testing build i provided) until the following is seen on the trace (will add to the next comment).
> >
> > Hi Marcelo,
> >
> > The trace info is attached to this comment.
> >
> > I'd like to high light below info:
> >
> > 1. With these below commands, in my many times testing(>10 times), the hosts never hang. Even though the call trace occurs, but the host didn't hang.
> >
> > [1]# echo workqueue_activate_work workqueue_queue_work > workqueue_execute_start
> > [2]# cat /sys/kernel/debug/tracing/trace_pipe > /root/trace.txt
> >
> > 2. I tried to bind process of [2] to CPU0, and also tried CPU13, the hang issue still can not be reproduced.
> >
> > The original values of process of [2] are " pid 6379's current affinity list: 0,1,3,5,7,9,11,13" (These are non isolated cores of host).
> >
> > Best Regards,
> > Pei
>
> cat 1272008.txt | cut -f 2 -d "[" | cut -f 1 -d "]" | sort | uniq | while read z; do echo -n " $z, "; done
> 000, 001, 002, 003, 004, 005, 006, 007, 008, 009, 010, 011, 012, 013, 014, 015, 016, 017, 018, 019
>
> So workitems are being executed on the isolated cpus 2, 4, 6, 8, 10, 12, 14, 16 and 18. They are:
> <...>-3763 1 workqueue_execute_end: work struct ffffa0cb6ee13e20
> <...>-3763 1 workqueue_execute_start: work struct ffffa0cb6ee13e20: function vmstat_update
> ktimersoftd/14-131 1 workqueue_activate_work: work struct ffffffffc07a3c60
> ktimersoftd/14-131 1 workqueue_queue_work: work struct=ffffffffc07a3c60 function=jump_label_update_timeout workqueue=ffffa0c9ffc08e00 req_cpu=5120 cpu=14
> ktimersoftd/8-80 [008] workqueue_activate_work: work struct ffffa0cb6ed13e20
> ktimersoftd/8-80 [008] workqueue_queue_work: work struct=ffffa0cb6ed13e20 function=vmstat_update workqueue=ffffa0c9ffc08e00 req_cpu=5120 cpu=8

Pei Zhang, CPU8 was an isolated CPU on which OVS-DPDK was running on the host, is that correct?

(In reply to Marcelo Tosatti from comment #64)
> Pei Zhang, CPU8 was an isolated CPU on which OVS-DPDK was running on the host, is that correct?

Yes, CPU8 was an isolated CPU on which OVS-DPDK was running on the host. CPUs 2, 4, 6, 8, 10 and 12 were all isolated and used for OVS-DPDK.

Bug [1] has been fixed.
[1] Bug 1452357 - realtime-virtual-guest: not safe to run PMD thread with fifo:1

Now, ping-pong live migration in the rt environment works well with:

(1) In both src and des hosts, set fifo:1 for all pmd threads:
# chrt -fp 1 $ovs_pmd_process_id

(2) In the guest, set testpmd to fifo:1:
# chrt -f 1 /usr/bin/testpmd ...
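The ping-pong runs summarized below can be driven by a small loop; a sketch with placeholder names (GUEST, HOST_A and HOST_B are assumptions, not values taken from this bug):

GUEST=rhel7_rt_guest                             # placeholder domain name
HOST_A=qemu+ssh://hostA.example.com/system       # placeholder source host URI
HOST_B=qemu+ssh://hostB.example.com/system       # placeholder destination host URI
for run in $(seq 1 45); do                       # each run is 2 migrations: A -> B, then B -> A
    virsh migrate --verbose --persistent --live "$GUEST" "$HOST_B" || break
    virsh -c "$HOST_B" migrate --verbose --persistent --live "$GUEST" "$HOST_A" || break
done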
Testing details:

Versions:
3.10.0-680.rt56.604.el7.x86_64
tuned-2.8.0-5.el7.noarch
libvirt-3.2.0-9.el7.x86_64
dpdk-17.05-2.el7fdb.x86_64
openvswitch-2.7.0-8.git20170530.el7fdb.x86_64

Testing 2 scenarios, each with 90 migrations:
(1) Open vSwitch acts as vhost-user client and VM acts as vhost-user server, do ping-pong migration 45 runs, works well.
(2) Open vSwitch acts as vhost-user server and VM acts as vhost-user client, do ping-pong migration 45 runs, works well.

Note: Each run means 2 migrations: migrate from the src to the des host, then migrate from the des back to the src host, so 45 runs means 90 migrations.

Automation results: PASS
https://beaker.engineering.redhat.com/recipes/3942241#tasks

Adding the qemu version: qemu-kvm-rhev-2.9.0-9.el7.x86_64

(In reply to Pei Zhang from comment #67)
> Adding the qemu version: qemu-kvm-rhev-2.9.0-9.el7.x86_64

OK, can we close the bug, then?

Hi Marcelo, I agree this bug can be closed. Thank you.

Moving the status of this bug to CLOSED CURRENTRELEASE.