Bug 1438120
Summary: | Real Time host becomes hung when doing ping-pong migration over dpdk+openvswitch with OVS PMD threads set to priority 95 | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Pei Zhang <pezhang> |
Component: | kernel-rt | Assignee: | Marcelo Tosatti <mtosatti> |
kernel-rt sub component: | KVM | QA Contact: | Pei Zhang <pezhang> |
Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
Severity: | high | ||
Priority: | high | CC: | bhu, chayang, hhuang, juzhang, knoel, lcapitulino, michen, pagupta, peterx, pezhang, virt-maint, xfu |
Version: | 7.4 | ||
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2017-07-11 01:05:03 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1452357 | ||
Bug Blocks: | |||
Attachments: |
Description
Pei Zhang
2017-04-01 01:56:24 UTC
More additional info: 1. This issue can occur about within 20 ping-pong migrations. 2. Non rt didn't hit this issue. 3. Remove step "# chrt -fp 95 $ovs_pmd_process_id" in host1 and host2, and launch testpmd using "testpmd ..." by replacing"chrt -f 95 testpmd ...", migration will work well. Also test with memory locked(Others keeps same with above [3]), still hit above problem. <memory unit='KiB'>8388608</memory> <currentMemory unit='KiB'>8388608</currentMemory> <memtune> <hard_limit unit='KiB'>10485760</hard_limit> </memtune> <memoryBacking> <hugepages> <page size='1048576' unit='KiB' nodeset='0'/> </hugepages> <locked/> </memoryBacking> Besides host hang problem, we also found other problems(They are not occur at same time): ==Issue 1==: 1. After several(20 or more) ping-pong migration, host1 or host2 didn't hang, however migration fail with below error. # virsh migrate --verbose --persistent --live rhel7.4_rt qemu+ssh://192.168.1.1/system error: Cannot open log file: '/var/log/libvirt/qemu/rhel7.4_rt.log': Device or resource busy 2. Call trace occurs in host1 or host2 [ 7233.924060] INFO: task kworker/8:0:83 blocked for more t. [ 7233.931531] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this m. [ 7233.940268] kworker/8:0 D ffff8802ec813090 0 83 2 0x00000000 [ 7233.948157] ffff8802ec82bd98 0000000000000002 ffff8802ec82bfd8 ffff8802ec82bf8 [ 7233.956445] ffff8802ec82bfd8 ffff8802ec82bfd8 ffff8802ed2c3090 ffff8802ec81300 [ 7233.964733] ffff8802ec854000 ffff8802ec82bdd8 ffff8802ec813090 ffff8802ec85408 [ 7233.973022] Call Trace: [ 7233.975753] [<ffffffff9d69ce90>] schedule+0x30/0x96 [ 7233.981293] [<ffffffff9d08ed1d>] del_timer_sync+0x9d/0x110 [ 7233.987511] [<ffffffff9d0a8850>] ? wake_up_atomic_t+0x30/0x30 [ 7233.994020] [<ffffffff9d0a01ec>] manage_workers.isra.36+0x20c/0x2b0 [ 7234.001108] [<ffffffff9d0a0667>] worker_thread+0x3d7/0x3f0 [ 7234.007325] [<ffffffff9d0a0290>] ? manage_workers.isra.36+0x2b0/0x2b0 [ 7234.014608] [<ffffffff9d0a7abf>] kthread+0xcf/0xe0 [ 7234.020047] [<ffffffff9d0a79f0>] ? kthread_worker_fn+0x170/0x170 [ 7234.026847] [<ffffffff9d6a7ad8>] ret_from_fork+0x58/0x90 [ 7234.032869] [<ffffffff9d0a79f0>] ? kthread_worker_fn+0x170/0x170 [ 7234.039699] INFO: task kworker/8:1:6288 blocked for more than 600 seconds. [ 7234.047369] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this m. [ 7234.056108] kworker/8:1 D ffff880467309030 0 6288 2 0x00000080 [ 7234.063991] ffff88042b733ea8 0000000000000002 ffff88042b733fd8 ffff88042b733f8 [ 7234.072276] ffff88042b733fd8 ffff88042b733fd8 ffff88066d51b090 ffff88046730900 [ 7234.080563] ffff88046e4bb200 ffffffff9d0a0290 0000000000000000 000000000000000 [ 7234.088854] Call Trace: [ 7234.091582] [<ffffffff9d0a0290>] ? manage_workers.isra.36+0x2b0/0x2b0 [ 7234.098866] [<ffffffff9d69ce90>] schedule+0x30/0x96 [ 7234.104402] [<ffffffff9d0a7a9c>] kthread+0xac/0xe0 [ 7234.109842] [<ffffffff9d0a79f0>] ? kthread_worker_fn+0x170/0x170 [ 7234.116640] [<ffffffff9d6a7ad8>] ret_from_fork+0x58/0x90 [ 7234.122662] [<ffffffff9d0a79f0>] ? kthread_worker_fn+0x170/0x170 ==Issue 2==: 1. VM become hang, and call trace occurs in host1 or host2, host1 and host2 didn't hang. 
[ 371.625648] CPU0: NOHZ: local_softirq_pending: SCHED [ 371.625657] CPU0: NOHZ: local_softirq_pending: SCHED [ 371.625668] CPU0: NOHZ: local_softirq_pending: SCHED [ 371.625692] CPU0: NOHZ: local_softirq_pending: SCHED [ 371.625699] CPU0: NOHZ: local_softirq_pending: SCHED [ 371.625717] CPU0: NOHZ: local_softirq_pending: SCHED [ 371.625727] CPU0: NOHZ: local_softirq_pending: SCHED [ 371.625737] CPU0: NOHZ: local_softirq_pending: SCHED [ 371.625762] CPU0: NOHZ: local_softirq_pending: SCHED [ 371.625797] CPU0: NOHZ: local_softirq_pending: SCHED [ 1234.528484] INFO: task worker:4477 blocked for more than. [ 1234.535673] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this m. [ 1234.544412] worker D ffff880858bc50f0 0 4477 1 0x00000080 [ 1234.552295] ffff880461f33c88 0000000000000002 ffff880461f33fd8 ffff880461f33f8 [ 1234.560584] ffff880461f33fd8 ffff880461f33fd8 ffff88066d60d0f0 ffff880858bc500 [ 1234.568870] ffff880461f33fd8 ffff88086639fb48 ffff880461f33cd0 000000000000026 [ 1234.577158] Call Trace: [ 1234.579888] [<ffffffff8d69ce90>] schedule+0x30/0x96 [ 1234.585426] [<ffffffff8d69dc4d>] rt_spin_lock_slowlock+0x13d/0x350 [ 1234.592420] [<ffffffff8d69ede5>] rt_spin_lock+0x25/0x30 [ 1234.598364] [<ffffffffc05aec53>] kvm_mmu_notifier_invalidate_range_end+0x23/0] [ 1234.607299] [<ffffffff8d1d5af4>] __mmu_notifier_invalidate_range_end+0x64/0xe0 [ 1234.615456] [<ffffffff8d1afd75>] unmap_vmas+0x85/0x90 [ 1234.621190] [<ffffffff8d1b4c5e>] unmap_region+0xbe/0x140 [ 1234.627214] [<ffffffff8d1b70d2>] do_munmap+0x292/0x470 [ 1234.633043] [<ffffffff8d1b7318>] vm_munmap+0x68/0xb0 [ 1234.638677] [<ffffffff8d1b83b2>] SyS_munmap+0x22/0x30 [ 1234.644409] [<ffffffff8d6a7d94>] tracesys+0xdd/0xe2 [ 1234.649945] INFO: task live_migration:4494 blocked for more than 600 seconds. [ 1234.657906] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this m. [ 1234.666641] live_migration D ffff88087fa16120 0 4494 1 0x00000080 [ 1234.674523] ffff880445fefb60 0000000000000002 ffff880445feffd8 ffff880445feff8 [ 1234.682808] ffff880445feffd8 ffff880445feffd8 ffff88066d60c0c0 ffff88087fa1610 [ 1234.691091] ffff880445feffd8 ffff88086639fb48 ffff880445fefba8 000000000000026 [ 1234.699378] Call Trace: [ 1234.702103] [<ffffffff8d69ce90>] schedule+0x30/0x96 [ 1234.707641] [<ffffffff8d69dc4d>] rt_spin_lock_slowlock+0x13d/0x350 [ 1234.714633] [<ffffffff8d69ede5>] rt_spin_lock+0x25/0x30 [ 1234.720570] [<ffffffffc05dd542>] kvm_mmu_change_mmu_pages+0x42/0xd0 [kvm] [ 1234.728252] [<ffffffffc05cd007>] kvm_arch_commit_memory_region+0xd7/0xe0 [kvm] [ 1234.736414] [<ffffffffc05b3781>] __kvm_set_memory_region+0x661/0x9e0 [kvm] [ 1234.744188] [<ffffffffc05b3b2b>] kvm_set_memory_region+0x2b/0x40 [kvm] [ 1234.751573] [<ffffffffc05b3f94>] kvm_vm_ioctl+0x454/0x7e0 [kvm] [ 1234.758277] [<ffffffff8d0ee473>] ? futex_wake+0x93/0x190 [ 1234.764301] [<ffffffff8d0f12d2>] ? do_futex+0x122/0x6b0 [ 1234.770230] [<ffffffff8d20e50d>] do_vfs_ioctl+0x33d/0x550 [ 1234.776350] [<ffffffff8d2b0dcf>] ? file_has_perm+0x9f/0xb0 [ 1234.782568] [<ffffffff8d20e7c1>] SyS_ioctl+0xa1/0xc0 [ 1234.788203] [<ffffffff8d6a7d94>] tracesys+0xdd/0xe2 [ 1234.793742] INFO: task pgrep:4505 blocked for more than 600 seconds. [ 1234.800829] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this m. 
[ 1234.809567] pgrep D ffff88085b29d0f0 0 4505 4504 0x00000080 [ 1234.817450] ffff88066d7dfd20 0000000000000002 ffff88066d7dffd8 ffff88066d7dff8 [ 1234.825737] ffff88066d7dffd8 ffff88066d7dffd8 ffff88066d60e120 ffff88085b29d00 [ 1234.834024] ffff88085b29d0f0 0000000000000000 ffff88085b29d0f0 000000000000002 [ 1234.842312] Call Trace: [ 1234.845038] [<ffffffff8d69ce90>] schedule+0x30/0x96 [ 1234.850574] [<ffffffff8d69da85>] __rt_mutex_slowlock+0xc5/0x150 [ 1234.857276] [<ffffffff8d69df8e>] rt_mutex_slowlock+0x12e/0x210 [ 1234.863880] [<ffffffff8d69e121>] rt_mutex_lock+0x31/0x40 [ 1234.869903] [<ffffffff8d0f3328>] __rt_down_read.isra.1+0x28/0x30 [ 1234.876703] [<ffffffff8d0f333e>] rt_down_read+0xe/0x10 [ 1234.882531] [<ffffffff8d26f7a0>] proc_pid_cmdline_read+0xc0/0x570 [ 1234.889426] [<ffffffff8d2ab684>] ? security_file_permission+0x84/0xa0 [ 1234.896711] [<ffffffff8d1f9acc>] vfs_read+0x9c/0x170 [ 1234.902345] [<ffffffff8d1fa98f>] SyS_read+0x7f/0xe0 [ 1234.907883] [<ffffffff8d6a7d94>] tracesys+0xdd/0xe2 Hello, It looks like issue comes when 'testpmd' thread is set as 'FIFO' with priority '95'. Can we try to pin each host 'PMD' threads to individual isolated cores using 'taskset' and check if issue comes? just wanted to know if its tried before. Thanks, Pankaj (In reply to pagupta from comment #7) > Hello, > > It looks like issue comes when 'testpmd' thread is set as 'FIFO' with > priority '95'. Can we try to pin each host 'PMD' threads to individual > isolated cores using 'taskset' and check if issue comes? > > just wanted to know if its tried before. > > Thanks, > Pankaj Hi Pankaj, All pmd threads have been pinned to individual isolated cores already. Details: When setting openvswitch using "# ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x1554(0x1554 is cores 2,4,6,8,10,12)", 6 ovs pmd threads will automatically pin to these individual isolated cores. # top -d1 -H top - 23:17:07 up 25 min, 1 user, load average: 9.00, 8.48, 5.48 Threads: 517 total, 9 running, 508 sleeping, 0 stopped, 0 zombie %Cpu(s): 41.6 us, 0.2 sy, 0.0 ni, 58.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0. KiB Mem : 32448520 total, 8148228 free, 23754976 used, 545316 buff/cache KiB Swap: 16449532 total, 16449532 free, 0 used. 8414008 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ 4092 root 20 0 37.292g 33572 2528 R 96.2 0.1 14:41.54 4093 root 20 0 37.292g 33572 2528 R 96.2 0.1 14:41.66 4094 root 20 0 37.292g 33572 2528 R 96.2 0.1 14:41.33 4095 root 20 0 37.292g 33572 2528 R 96.2 0.1 14:41.69 4096 root 20 0 37.292g 33572 2528 R 96.2 0.1 14:41.67 4097 root 20 0 37.292g 33572 2528 R 96.2 0.1 14:41.33 4268 root -2 0 8981944 370112 51332 R 96.2 1.1 11:37.99 4269 root -2 0 8981944 370112 51332 R 96.2 1.1 11:37.99 4673 root 20 0 158120 2628 1556 R 3.8 0.0 0:00.10 ... # taskset -c -p 4092 pid 4092's current affinity list: 2 # taskset -c -p 4093 pid 4093's current affinity list: 4 # taskset -c -p 4094 pid 4094's current affinity list: 6 # taskset -c -p 4095 pid 4095's current affinity list: 8 # taskset -c -p 4096 pid 4096's current affinity list: 10 # taskset -c -p 4097 pid 4097's current affinity list: 12 In VM, when start testpmd, 2 pmd threads are automatically pin to individual cores. # taskset -c -p 4268 pid 4268's current affinity list: 16 # taskset -c -p 4269 pid 4269's current affinity list: 14 Best Regards, Pei Hi, I also looked at the logs and I could find some pointers. It looks like interrupts are disabled using 'irq_disable', logs tell us 'irqs_disabled(): 1' for cores: 1, 9, 11, 13. 
As we might be sleeping from 'irq_disabled' context, looks like cause of this issue. [1] static int device_dma_allocations(struct device *dev, struct dma_debug_entry **out_entry) { struct dma_debug_entry *entry; unsigned long flags; int count = 0, i; local_irq_save(flags); for (i = 0; i < HASH_SIZE; ++i) { spin_lock(&dma_entry_hash[i].lock); list_for_each_entry(entry, &dma_entry_hash[i].list, list) { if (entry->dev == dev) { count += 1; *out_entry = entry; } } spin_unlock(&dma_entry_hash[i].lock); } local_irq_restore(flags); return count; } [ 2430.841001] BUG: sleeping function called from invalid context at kernel/rtmutex.c8 [ 2430.841002] in_atomic(): 0, irqs_disabled(): 1, pid: 4041, name: dpdk-devbind [ 2430.841003] 6 locks held by dpdk-devbind/4041: [ 2430.841011] #0: (sb_writers#3){.+.+.+}, at: [<ffffffff9e23b35b>] vfs_write+0x1bb0 [ 2430.841017] #1: (&buffer->mutex){+.+.+.}, at: [<ffffffff9e2cbb3a>] sysfs_write_f0 [ 2430.841019] #2: (s_active#15){++++.+}, at: [<ffffffff9e2cbbc1>] sysfs_write_file0 [ 2430.841025] #3: (&__lockdep_no_validate__){......}, at: [<ffffffff9e494da5>] dri0 [ 2430.841028] #4: (&__lockdep_no_validate__){......}, at: [<ffffffff9e49658b>] dev0 [ 2430.841034] #5: (&(&priv->bus_notifier)->rwsem){+.+.+.}, at: [<ffffffff9e117f50>0 [ 2430.841035] irq event stamp: 1043290 [ 2430.841038] hardirqs last enabled at (1043289): [<ffffffff9e760385>] _raw_spin_un0 [ 2430.841043] hardirqs last disabled at (1043290): [<ffffffff9e3a8384>] dma_debug_de0---------> hard irqs disabled here [ 2430.841046] softirqs last enabled at (0): [<ffffffff9e08184a>] copy_process+0x82a0 [ 2430.841047] softirqs last disabled at (0): [< (null)>] (null) [ 2430.841049] CPU: 9 PID: 4041 Comm: dpdk-devbind Not tainted 3.10.0-644.rt56.560.el1 [ 2430.841049] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015 [ 2430.841052] ffff8da10f334000 00000000dd19f330 ffff8da0fcfe3cc8 ffffffff9e757413 [ 2430.841053] ffff8da0fcfe3cf0 ffffffff9e0ccb3d ffffffffa0617b50 0000000000000000 [ 2430.841055] 0000000000000000 ffff8da0fcfe3d10 ffffffff9e75f694 0000000000000000 [ 2430.841055] Call Trace: [ 2430.841060] [<ffffffff9e757413>] dump_stack+0x19/0x1b [ 2430.841064] [<ffffffff9e0ccb3d>] __might_sleep+0x12d/0x1f0 [ 2430.841067] [<ffffffff9e75f694>] rt_spin_lock+0x24/0x60 [ 2430.841070] [<ffffffff9e3a8399>] dma_debug_device_change+0x79/0x250 [ 2430.841072] [<ffffffff9e764d78>] notifier_call_chain+0x58/0xb0 [ 2430.841076] [<ffffffff9e0c179a>] __blocking_notifier_call_chain+0x5a/0x80 [ 2430.841080] [<ffffffff9e0c17d6>] blocking_notifier_call_chain+0x16/0x20 [ 2430.841082] [<ffffffff9e496553>] __device_release_driver+0xd3/0xf0 [ 2430.841084] [<ffffffff9e496593>] device_release_driver+0x23/0x30 [ 2430.841086] [<ffffffff9e494db5>] driver_unbind+0xc5/0xf0 [ 2430.841088] [<ffffffff9e4942b4>] drv_attr_store+0x24/0x40 [ 2430.841090] [<ffffffff9e2cbbd9>] sysfs_write_file+0xd9/0x150 [ 2430.841092] [<ffffffff9e23b260>] vfs_write+0xc0/0x1f0 [ 2430.841095] [<ffffffff9e25e692>] ? fget_light+0x382/0x500 [ 2430.841097] [<ffffffff9e23c0d8>] SyS_write+0x88/0xf0 [ 2430.841101] [<ffffffff9e76962d>] tracesys+0xdd/0xe2 [ 2430.841102] --------------------------- [ 2430.841102] | preempt count: 00000000 ] [ 2430.841103] | 0-level deep critical section nesting: I think below patch should fix this. If this looks okay I will post it upstream and downstream build for testing. 
diff --git a/lib/dma-debug.c b/lib/dma-debug.c index b157b46..fe4d50c 100644 --- a/lib/dma-debug.c +++ b/lib/dma-debug.c @@ -942,21 +942,17 @@ static int device_dma_allocations(struct device *dev, struct dma_debug_entry **o unsigned long flags; int count = 0, i; - local_irq_save(flags); - for (i = 0; i < HASH_SIZE; ++i) { - spin_lock(&dma_entry_hash[i].lock); + spin_lock_irqsave(&dma_entry_hash[i].lock, flags); list_for_each_entry(entry, &dma_entry_hash[i].list, list) { if (entry->dev == dev) { count += 1; *out_entry = entry; } } - spin_unlock(&dma_entry_hash[i].lock); + spin_unlock_irqrestore(&dma_entry_hash[i].lock, flags); } - local_irq_restore(flags); - return count; } ================================================================= [2] Here, we are seeing this trace and lockdep is getting disabled: Looks like we might need to handle this lock in lockdep? As a result it is disabling lock validator. [ 2578.465834] ffff8d9d17b3fbf8 000000019e0cfced 0000000000000000 ffff8d9d00000000 [ 2578.465835] Call Trace: [ 2578.465843] [<ffffffff9e757413>] dump_stack+0x19/0x1b [ 2578.465848] [<ffffffff9e750de0>] register_lock_class.part.27+0x38/0x3c [ 2578.465852] [<ffffffff9e10ea05>] __lock_acquire+0xcc5/0xdc0 [ 2578.465854] [<ffffffff9e10f452>] lock_acquire+0xb2/0x230 [ 2578.465858] [<ffffffff9e75dbbd>] ? rt_spin_lock_slowlock+0x6d/0x3c0 [ 2578.465860] [<ffffffff9e7604e0>] _raw_spin_lock_irqsave+0x70/0xc0 [ 2578.465862] [<ffffffff9e75dbbd>] ? rt_spin_lock_slowlock+0x6d/0x3c0 [ 2578.465864] [<ffffffff9e75dbbd>] rt_spin_lock_slowlock+0x6d/0x3c0 [ 2578.465867] [<ffffffff9e75f69c>] rt_spin_lock+0x2c/0x60 [ 2578.465884] [<ffffffffc08c590f>] kvm_page_track_register_notifier+0x1f/0x60 [kvm] [ 2578.465896] [<ffffffffc08ac9eb>] kvm_mmu_init_vm+0x2b/0x30 [kvm] [ 2578.465907] [<ffffffffc0899e44>] kvm_arch_init_vm+0x264/0x290 [kvm] [ 2578.465914] [<ffffffffc087b4de>] kvm_dev_ioctl+0xde/0x740 [kvm] [ 2578.465918] [<ffffffff9e0cfced>] ? get_parent_ip+0xd/0x50 [ 2578.465922] [<ffffffff9e2514c5>] do_vfs_ioctl+0x365/0x5b0 [ 2578.465924] [<ffffffff9e25e3ff>] ? fget_light+0xef/0x500 [ 2578.465926] [<ffffffff9e2517b1>] SyS_ioctl+0xa1/0xc0 [ 2578.465929] [<ffffffff9e76962d>] tracesys+0xdd/0xe2 [ 2578.465930] --------------------------- [ 2578.465931] | preempt count: 00000001 ] [ 2578.465931] | 1-level deep critical section nesting: [ 2578.465931] ---------------------------------------- [ 2578.465933] .. [<ffffffff9e7604bd>] .... _raw_spin_lock_irqsave+0x4d/0xc0 [ 2578.465935] .....[<ffffffff9e75dbbd>] .. ( <= rt_spin_lock_slowlock+0x6d/0x3c0) [ 2578.465935] static struct lock_class * register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force) { struct lockdep_subclass_key *key; struct hlist_head *hash_head; struct lock_class *class; DEBUG_LOCKS_WARN_ON(!irqs_disabled()); class = look_up_lock_class(lock, subclass); if (likely(class)) goto out_set_class_cache; /* * Debug-check: all keys must be persistent! */ if (!static_obj(lock->key)) { debug_locks_off();----------------> disable log debugging printk("INFO: trying to register non-static key.\n"); printk("the code is fine but needs lockdep annotation.\n"); printk("turning off the locking correctness validator.\n"); dump_stack(); return NULL; } ================ [3] As irqs_disable:1 already for cpu '9' that's why we are seeing this. Actual issue is [1] and above patch should fix this. 
[ 2580.261273] ---------------------------------------- [ 2580.261274] [ 2592.563812] BUG: sleeping function called from invalid context at kernel/rtmutex.c8 ---> still irqs_disable [ 2592.563813] in_atomic(): 0, irqs_disabled(): 1, pid: 4748, name: qemu-kvm [ 2592.563814] INFO: lockdep is turned off. [ 2592.563815] irq event stamp: 301754 [ 2592.563822] hardirqs last enabled at (301753): [<ffffffff9e760385>] _raw_spin_unl0 [ 2592.563823] hardirqs last disabled at (301754): [<ffffffff9e7604b3>] _raw_spin_loc0 [ 2592.563827] softirqs last enabled at (0): [<ffffffff9e08184a>] copy_process+0x82a0 [ 2592.563828] softirqs last disabled at (0): [< (null)>] (null) [ 2592.563830] CPU: 9 PID: 4748 Comm: qemu-kvm Not tainted 3.10.0-644.rt56.560.el7.x81 [ 2592.563831] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015 [ 2592.563833] ffff8d9938bf0000 000000000688bb3b ffff8d9d17b3fc68 ffffffff9e757413 [ 2592.563834] ffff8d9d17b3fc90 ffffffff9e0ccb3d ffff8da111b5b610 ffff8da111b5b610 [ 2592.563836] 00007ffcb8a137b0 ffff8d9d17b3fcb0 ffffffff9e75f694 ffff8da111b5b610 [ 2592.563836] Call Trace: [ 2592.563842] [<ffffffff9e757413>] dump_stack+0x19/0x1b [ 2592.563846] [<ffffffff9e0ccb3d>] __might_sleep+0x12d/0x1f0 [ 2592.563850] [<ffffffff9e75f694>] rt_spin_lock+0x24/0x60 [ 2592.563865] [<ffffffffc0888996>] __get_kvmclock_ns+0x36/0x110 [kvm] [ 2592.563877] [<ffffffffc0895937>] kvm_arch_vm_ioctl+0x257/0xd70 [kvm] [ 2592.563880] [<ffffffff9e75dbbd>] ? rt_spin_lock_slowlock+0x6d/0x3c0 [ 2592.563890] [<ffffffffc087f114>] kvm_vm_ioctl+0xa4/0x850 [kvm] [ 2592.563892] [<ffffffff9e75d173>] ? rt_spin_lock_slowunlock+0xd3/0xe0 [ 2592.563900] [<ffffffffc0878dae>] ? kvm_vm_ioctl_check_extension_generic+0x4e/0x90] [ 2592.563907] [<ffffffffc087b5ca>] ? kvm_dev_ioctl+0x1ca/0x740 [kvm] [ 2592.563910] [<ffffffff9e0cfced>] ? get_parent_ip+0xd/0x50 [ 2592.563912] [<ffffffff9e76509f>] ? add_preempt_count+0x1f/0x60 [ 2592.563916] [<ffffffff9e2514c5>] do_vfs_ioctl+0x365/0x5b0 [ 2592.563918] [<ffffffff9e25e3ff>] ? fget_light+0xef/0x500 [ 2592.563920] [<ffffffff9e2517b1>] SyS_ioctl+0xa1/0xc0 [ 2592.563923] [<ffffffff9e76962d>] tracesys+0xdd/0xe2 Thanks, Pankaj (In reply to pagupta from comment #14) > Hi, > > I also looked at the logs and I could find some pointers. > It looks like interrupts are disabled using 'irq_disable', > logs tell us 'irqs_disabled(): 1' for cores: 1, 9, 11, 13. > > As we might be sleeping from 'irq_disabled' context, looks > like cause of this issue. 
> > [1] > > static int device_dma_allocations(struct device *dev, struct dma_debug_entry > **out_entry) > { > struct dma_debug_entry *entry; > unsigned long flags; > int count = 0, i; > > local_irq_save(flags); > > for (i = 0; i < HASH_SIZE; ++i) { > spin_lock(&dma_entry_hash[i].lock); > list_for_each_entry(entry, &dma_entry_hash[i].list, list) { > if (entry->dev == dev) { > count += 1; > *out_entry = entry; > } > } > spin_unlock(&dma_entry_hash[i].lock); > } > > local_irq_restore(flags); > > return count; > } > > > [ 2430.841001] BUG: sleeping function called from invalid context at > kernel/rtmutex.c8 > [ 2430.841002] in_atomic(): 0, irqs_disabled(): 1, pid: 4041, name: > dpdk-devbind > [ 2430.841003] 6 locks held by dpdk-devbind/4041: > [ 2430.841011] #0: (sb_writers#3){.+.+.+}, at: [<ffffffff9e23b35b>] > vfs_write+0x1bb0 > [ 2430.841017] #1: (&buffer->mutex){+.+.+.}, at: [<ffffffff9e2cbb3a>] > sysfs_write_f0 > [ 2430.841019] #2: (s_active#15){++++.+}, at: [<ffffffff9e2cbbc1>] > sysfs_write_file0 > [ 2430.841025] #3: (&__lockdep_no_validate__){......}, at: > [<ffffffff9e494da5>] dri0 > [ 2430.841028] #4: (&__lockdep_no_validate__){......}, at: > [<ffffffff9e49658b>] dev0 > [ 2430.841034] #5: (&(&priv->bus_notifier)->rwsem){+.+.+.}, at: > [<ffffffff9e117f50>0 > [ 2430.841035] irq event stamp: 1043290 > [ 2430.841038] hardirqs last enabled at (1043289): [<ffffffff9e760385>] > _raw_spin_un0 > [ 2430.841043] hardirqs last disabled at (1043290): [<ffffffff9e3a8384>] > dma_debug_de0---------> hard irqs disabled here > [ 2430.841046] softirqs last enabled at (0): [<ffffffff9e08184a>] > copy_process+0x82a0 > [ 2430.841047] softirqs last disabled at (0): [< (null)>] > (null) > [ 2430.841049] CPU: 9 PID: 4041 Comm: dpdk-devbind Not tainted > 3.10.0-644.rt56.560.el1 > [ 2430.841049] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 > 10/002/2015 > [ 2430.841052] ffff8da10f334000 00000000dd19f330 ffff8da0fcfe3cc8 > ffffffff9e757413 > [ 2430.841053] ffff8da0fcfe3cf0 ffffffff9e0ccb3d ffffffffa0617b50 > 0000000000000000 > [ 2430.841055] 0000000000000000 ffff8da0fcfe3d10 ffffffff9e75f694 > 0000000000000000 > [ 2430.841055] Call Trace: > [ 2430.841060] [<ffffffff9e757413>] dump_stack+0x19/0x1b > [ 2430.841064] [<ffffffff9e0ccb3d>] __might_sleep+0x12d/0x1f0 > [ 2430.841067] [<ffffffff9e75f694>] rt_spin_lock+0x24/0x60 > [ 2430.841070] [<ffffffff9e3a8399>] dma_debug_device_change+0x79/0x250 > [ 2430.841072] [<ffffffff9e764d78>] notifier_call_chain+0x58/0xb0 > [ 2430.841076] [<ffffffff9e0c179a>] __blocking_notifier_call_chain+0x5a/0x80 > [ 2430.841080] [<ffffffff9e0c17d6>] blocking_notifier_call_chain+0x16/0x20 > [ 2430.841082] [<ffffffff9e496553>] __device_release_driver+0xd3/0xf0 > [ 2430.841084] [<ffffffff9e496593>] device_release_driver+0x23/0x30 > [ 2430.841086] [<ffffffff9e494db5>] driver_unbind+0xc5/0xf0 > [ 2430.841088] [<ffffffff9e4942b4>] drv_attr_store+0x24/0x40 > [ 2430.841090] [<ffffffff9e2cbbd9>] sysfs_write_file+0xd9/0x150 > [ 2430.841092] [<ffffffff9e23b260>] vfs_write+0xc0/0x1f0 > [ 2430.841095] [<ffffffff9e25e692>] ? fget_light+0x382/0x500 > [ 2430.841097] [<ffffffff9e23c0d8>] SyS_write+0x88/0xf0 > [ 2430.841101] [<ffffffff9e76962d>] tracesys+0xdd/0xe2 > [ 2430.841102] --------------------------- > [ 2430.841102] | preempt count: 00000000 ] > [ 2430.841103] | 0-level deep critical section nesting: > > > I think below patch should fix this. If this looks okay > I will post it upstream and downstream build for testing. 
> > diff --git a/lib/dma-debug.c b/lib/dma-debug.c > index b157b46..fe4d50c 100644 > --- a/lib/dma-debug.c > +++ b/lib/dma-debug.c > @@ -942,21 +942,17 @@ static int device_dma_allocations(struct device *dev, > struct dma_debug_entry **o > unsigned long flags; > int count = 0, i; > > - local_irq_save(flags); > - > for (i = 0; i < HASH_SIZE; ++i) { > - spin_lock(&dma_entry_hash[i].lock); > + spin_lock_irqsave(&dma_entry_hash[i].lock, flags); > list_for_each_entry(entry, &dma_entry_hash[i].list, list) { > if (entry->dev == dev) { > count += 1; > *out_entry = entry; > } > } > - spin_unlock(&dma_entry_hash[i].lock); > + spin_unlock_irqrestore(&dma_entry_hash[i].lock, flags); > } > > - local_irq_restore(flags); > - > return count; > } Yes it does. > > > ================================================================= > > [2] > Here, we are seeing this trace and lockdep is getting disabled: > Looks like we might need to handle this lock in lockdep? As a result it is > disabling lock validator. > > > [ 2578.465834] ffff8d9d17b3fbf8 000000019e0cfced 0000000000000000 > ffff8d9d00000000 > [ 2578.465835] Call Trace: > [ 2578.465843] [<ffffffff9e757413>] dump_stack+0x19/0x1b > [ 2578.465848] [<ffffffff9e750de0>] register_lock_class.part.27+0x38/0x3c > [ 2578.465852] [<ffffffff9e10ea05>] __lock_acquire+0xcc5/0xdc0 > [ 2578.465854] [<ffffffff9e10f452>] lock_acquire+0xb2/0x230 > [ 2578.465858] [<ffffffff9e75dbbd>] ? rt_spin_lock_slowlock+0x6d/0x3c0 > [ 2578.465860] [<ffffffff9e7604e0>] _raw_spin_lock_irqsave+0x70/0xc0 > [ 2578.465862] [<ffffffff9e75dbbd>] ? rt_spin_lock_slowlock+0x6d/0x3c0 > [ 2578.465864] [<ffffffff9e75dbbd>] rt_spin_lock_slowlock+0x6d/0x3c0 > [ 2578.465867] [<ffffffff9e75f69c>] rt_spin_lock+0x2c/0x60 > [ 2578.465884] [<ffffffffc08c590f>] > kvm_page_track_register_notifier+0x1f/0x60 [kvm] > [ 2578.465896] [<ffffffffc08ac9eb>] kvm_mmu_init_vm+0x2b/0x30 [kvm] > [ 2578.465907] [<ffffffffc0899e44>] kvm_arch_init_vm+0x264/0x290 [kvm] > [ 2578.465914] [<ffffffffc087b4de>] kvm_dev_ioctl+0xde/0x740 [kvm] > [ 2578.465918] [<ffffffff9e0cfced>] ? get_parent_ip+0xd/0x50 > [ 2578.465922] [<ffffffff9e2514c5>] do_vfs_ioctl+0x365/0x5b0 > [ 2578.465924] [<ffffffff9e25e3ff>] ? fget_light+0xef/0x500 > [ 2578.465926] [<ffffffff9e2517b1>] SyS_ioctl+0xa1/0xc0 > [ 2578.465929] [<ffffffff9e76962d>] tracesys+0xdd/0xe2 > [ 2578.465930] --------------------------- > [ 2578.465931] | preempt count: 00000001 ] > [ 2578.465931] | 1-level deep critical section nesting: > [ 2578.465931] ---------------------------------------- > [ 2578.465933] .. [<ffffffff9e7604bd>] .... _raw_spin_lock_irqsave+0x4d/0xc0 > [ 2578.465935] .....[<ffffffff9e75dbbd>] .. ( <= > rt_spin_lock_slowlock+0x6d/0x3c0) > [ 2578.465935] > > > static struct lock_class * > register_lock_class(struct lockdep_map *lock, unsigned int subclass, int > force) > { > struct lockdep_subclass_key *key; > struct hlist_head *hash_head; > struct lock_class *class; > > DEBUG_LOCKS_WARN_ON(!irqs_disabled()); > > class = look_up_lock_class(lock, subclass); > if (likely(class)) > goto out_set_class_cache; > > /* > * Debug-check: all keys must be persistent! 
> */ > if (!static_obj(lock->key)) { > debug_locks_off();----------------> disable log debugging > printk("INFO: trying to register non-static key.\n"); > printk("the code is fine but needs lockdep annotation.\n"); > printk("turning off the locking correctness validator.\n"); > dump_stack(); > > return NULL; > } Yes, don't know how to do that however because the object (kvm structure) is not static and can't see how that can be fixed easily. Created a debug kernel with "hand made" debugging, hopefully that helps. > > > ================ > > [3] > > As irqs_disable:1 already for cpu '9' that's why we are seeing this. > Actual issue is [1] and above patch should fix this. Not sure this analysis is correct: dma debugging should be separate from [3]. > > [ 2580.261273] ---------------------------------------- > [ 2580.261274] > [ 2592.563812] BUG: sleeping function called from invalid context at > kernel/rtmutex.c8 ---> still irqs_disable > [ 2592.563813] in_atomic(): 0, irqs_disabled(): 1, pid: 4748, name: qemu-kvm > [ 2592.563814] INFO: lockdep is turned off. > [ 2592.563815] irq event stamp: 301754 > [ 2592.563822] hardirqs last enabled at (301753): [<ffffffff9e760385>] > _raw_spin_unl0 > [ 2592.563823] hardirqs last disabled at (301754): [<ffffffff9e7604b3>] > _raw_spin_loc0 > [ 2592.563827] softirqs last enabled at (0): [<ffffffff9e08184a>] > copy_process+0x82a0 > [ 2592.563828] softirqs last disabled at (0): [< (null)>] > (null) > [ 2592.563830] CPU: 9 PID: 4748 Comm: qemu-kvm Not tainted > 3.10.0-644.rt56.560.el7.x81 > [ 2592.563831] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 > 10/002/2015 > [ 2592.563833] ffff8d9938bf0000 000000000688bb3b ffff8d9d17b3fc68 > ffffffff9e757413 > [ 2592.563834] ffff8d9d17b3fc90 ffffffff9e0ccb3d ffff8da111b5b610 > ffff8da111b5b610 > [ 2592.563836] 00007ffcb8a137b0 ffff8d9d17b3fcb0 ffffffff9e75f694 > ffff8da111b5b610 > [ 2592.563836] Call Trace: > [ 2592.563842] [<ffffffff9e757413>] dump_stack+0x19/0x1b > [ 2592.563846] [<ffffffff9e0ccb3d>] __might_sleep+0x12d/0x1f0 > [ 2592.563850] [<ffffffff9e75f694>] rt_spin_lock+0x24/0x60 > [ 2592.563865] [<ffffffffc0888996>] __get_kvmclock_ns+0x36/0x110 [kvm] > [ 2592.563877] [<ffffffffc0895937>] kvm_arch_vm_ioctl+0x257/0xd70 [kvm] > [ 2592.563880] [<ffffffff9e75dbbd>] ? rt_spin_lock_slowlock+0x6d/0x3c0 > [ 2592.563890] [<ffffffffc087f114>] kvm_vm_ioctl+0xa4/0x850 [kvm] > [ 2592.563892] [<ffffffff9e75d173>] ? rt_spin_lock_slowunlock+0xd3/0xe0 > [ 2592.563900] [<ffffffffc0878dae>] ? > kvm_vm_ioctl_check_extension_generic+0x4e/0x90] > [ 2592.563907] [<ffffffffc087b5ca>] ? kvm_dev_ioctl+0x1ca/0x740 [kvm] > [ 2592.563910] [<ffffffff9e0cfced>] ? get_parent_ip+0xd/0x50 > [ 2592.563912] [<ffffffff9e76509f>] ? add_preempt_count+0x1f/0x60 > [ 2592.563916] [<ffffffff9e2514c5>] do_vfs_ioctl+0x365/0x5b0 > [ 2592.563918] [<ffffffff9e25e3ff>] ? fget_light+0xef/0x500 > [ 2592.563920] [<ffffffff9e2517b1>] SyS_ioctl+0xa1/0xc0 > [ 2592.563923] [<ffffffff9e76962d>] tracesys+0xdd/0xe2 > > Thanks, > Pankaj Created attachment 1271105 [details] Console log when host is hang - 2 (In reply to Marcelo Tosatti from comment #15) > [...] > Pei Zhang, > > Please try the -kvm kernel of this SRPM: > > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12983510 > > Which prints additional information. Marcelo, Please see this attachment. Testing with 3.10.0-644.rt56.560.el7.x86_64.debug. Using -kvm packages from this link. 
Best Regards, Pei (In reply to Pei Zhang from comment #18) > Created attachment 1271105 [details] > Console log when host is hang - 2 > > (In reply to Marcelo Tosatti from comment #15) > > [...] > > Pei Zhang, > > > > Please try the -kvm kernel of this SRPM: > > > > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12983510 > > > > Which prints additional information. > > Marcelo, > > Please see this attachment. > > Testing with 3.10.0-644.rt56.560.el7.x86_64.debug. Using -kvm packages from > this link. > > Best Regards, > Pei Pei, Need both kernel-rt and kernel-rt-kvm from the link. Did you install both of them? Sorry for not being clear before. [ 1845.970458] INFO: task modprobe:15242 blocked for more than 600 s. [ 1849.760102] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" di. [ 1849.768841] modprobe D ffff8c17a71da440 12528 15242 152020 [ 1849.776727] ffff8c1b8362baf0 0000000000000002 ffff8c1b8362bfd8 f8 [ 1849.785017] ffff8c1b8362bfd8 ffff8c1b8362bfd8 ffff8c1980c1c000 f0 [ 1849.793308] ffff8c1b97654000 ffff8c1b976549b8 0000000000000001 f0 [ 1849.801599] Call Trace: [ 1849.804330] [<ffffffff9375c890>] schedule+0x30/0xa0 [ 1849.809869] [<ffffffff9375dc55>] rt_spin_lock_slowlock+0x105/0x30 [ 1849.816864] [<ffffffff9375dcca>] ? rt_spin_lock_slowlock+0x17a/00 [ 1849.824052] [<ffffffff9375f69c>] rt_spin_lock+0x2c/0x60 [ 1849.829980] [<ffffffff930ad4a0>] pwq_adjust_max_active+0x50/0x110 [ 1849.836876] [<ffffffff930ad5d2>] link_pwq+0x72/0x90 [ 1849.842415] [<ffffffff930b2b56>] __alloc_workqueue_key+0x4f6/0x50 [ 1849.849409] [<ffffffff931b7880>] ? mempool_kfree+0x10/0x10 [ 1849.855628] [<ffffffff931b7c73>] ? mempool_create_node+0x133/0x10 [ 1849.862621] [<ffffffff931b78a0>] ? mempool_alloc_slab+0x20/0x20 [ 1849.869325] [<ffffffff9327e4ea>] bioset_create+0x26a/0x310 [ 1849.875559] [<ffffffffc0ba20b8>] ? ftrace_define_fields_btrfs_fl] [ 1849.885182] [<ffffffffc0ba232f>] extent_io_init+0x65/0xd5 [btrfs] [ 1849.892090] [<ffffffffc0ba20f7>] init_btrfs_fs+0x3f/0x129 [btrfs] [ 1849.898987] [<ffffffff930003e8>] do_one_initcall+0xb8/0x260 [ 1849.905309] [<ffffffff9312096f>] load_module+0x168f/0x1d60 [ 1849.911529] [<ffffffff933a6070>] ? ddebug_proc_write+0xf0/0xf0 [ 1849.918134] [<ffffffff9311b8c3>] ? copy_module_from_fd.isra.42+00 [ 1849.925806] [<ffffffff93121216>] SyS_finit_module+0xa6/0xd0 [ 1849.932122] [<ffffffff9376962d>] tracesys+0xdd/0xe2 Its not making progress to grab this spinlock. spin_lock_irq(&pwq->pool->lock); Not making progress to grab: [ 2449.978453] INFO: task kworker/u288:0:7 blocked for more than 600 secon. [ 2453.725915] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables. [ 2453.734652] kworker/u288:0 D ffff8c1bb7011820 12800 7 2 0x0000 [ 2453.742548] Workqueue: edac-poller edac_mc_workq_function [edac_core] [ 2453.749743] ffff8c16008e3c00 0000000000000002 ffff8c16008e3fd8 ffff8c18 [ 2453.758034] ffff8c16008e3fd8 ffff8c16008e3fd8 ffff8c1980c08000 ffff8c10 [ 2453.766324] ffff8c16008dc000 ffff8c16008dc9b8 0000000000000292 ffff8c10 [ 2453.774615] Call Trace: [ 2453.777363] [<ffffffff9375c890>] schedule+0x30/0xa0 [ 2453.782903] [<ffffffff9375dc55>] rt_spin_lock_slowlock+0x105/0x3c0 [ 2453.789897] [<ffffffff9375f69c>] rt_spin_lock+0x2c/0x60 [ 2453.795824] [<ffffffff930b03d7>] queue_delayed_work_on+0x67/0x1d0 [ 2453.802724] [<ffffffffc03ae452>] edac_mc_workq_function+0x72/0xa0 [eda] [ 2453.810775] [<ffffffff930af350>] process_one_work+0x230/0x770 [ 2453.817282] [<ffffffff930af2df>] ? 
process_one_work+0x1bf/0x770 [ 2453.823988] [<ffffffff930afa10>] worker_thread+0x180/0x410 [ 2453.830205] [<ffffffff930af890>] ? process_one_work+0x770/0x770 [ 2453.836909] [<ffffffff930b94dd>] kthread+0xed/0x100 [ 2453.842447] [<ffffffff9376509f>] ? add_preempt_count+0x1f/0x60 [ 2453.849054] [<ffffffff930b93f0>] ? flush_kthread_work+0x230/0x230 [ 2453.855951] [<ffffffff93769358>] ret_from_fork+0x58/0x90 [ 2453.861974] [<ffffffff930b93f0>] ? flush_kthread_work+0x230/0x230 This one is waiting on pendingb_lock. Places that grab both locks are: 1. put_pwq_unlocked 2. try_to_grab_pending 3. queue_work_on 4. queue_delayed_work_on 5. mod_delayed_work_on 6. cancel_work_timer (via try_to_grab_pending) 7. flush_delayed_work So perhaps its scheduling work on an isolated CPU, with pmd or qemu-kvm-vcpu with priority higher than kworker thread, and then it hangs (because the pmd or qemu-kvm do not allow kworker to execute). To dump the trace to some location: 1) Have a process writing to disk continuously: cat /sys/kernel/debug/tracing/trace_pipe > /root/trace.txt or 2) You can also trigger a dump of the Ftrace buffer to the console with sysrq-z. Can you please also enable the following tracepoints: trace_workqueue_activate_work(work); trace_workqueue_queue_work(req_cpu, pwq, work); trace_workqueue_activate_work(work); trace_workqueue_execute_start(work); trace_workqueue_execute_end(work); Created attachment 1271343 [details] Call trace info when testing build from pagupta (In reply to Pei Zhang from comment #24) > (In reply to pagupta from comment #17) > > Hello Pei, > > > > I have build the test kernel with the patch in comment #14. Can you please > > also test this and check if it solves the issue. Please test this after > > you are able to gather logs with the test kernel provided by 'Marcelo'. > > > > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12999750 > > > > Thanks, > > Pankaj > > Hi Pankaj, this build still hit issues. > > 1. Host hang. Console didn't print any call trace info and error info after > waiting about 10minutes, just hang. Call trace occurs after waiting a little longer. Please see the attachment. > 2. Migration fail with error below and hosts didn't hang. Even when I didn't > set 95 priority, migration also fail. > # /bin/virsh migrate --verbose --persistent --live rhel7.4_rt > qemu+ssh://192.168.1.1/system > error: operation failed: migration job: is not active > > > Best Regards, > Pei (In reply to Marcelo Tosatti from comment #19) > (In reply to Pei Zhang from comment #18) > > Created attachment 1271105 [details] > > Console log when host is hang - 2 > > > > (In reply to Marcelo Tosatti from comment #15) > > > [...] > > > Pei Zhang, > > > > > > Please try the -kvm kernel of this SRPM: > > > > > > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12983510 > > > > > > Which prints additional information. > > > > Marcelo, > > > > Please see this attachment. > > > > Testing with 3.10.0-644.rt56.560.el7.x86_64.debug. Using -kvm packages from > > this link. > > > > Best Regards, > > Pei > > Pei, > > Need both kernel-rt and kernel-rt-kvm from the link. Did you install both of > them? Sorry for not being clear before. 
Hi Marcelo, When testing with "3.10.0-644.rt56.560.el7.x86_64.debug", I download and installed below packages from your link: kernel-rt-3.10.0-644.rt56.560.el7.x86_64.rpm kernel-rt-debuginfo-3.10.0-644.rt56.560.el7.x86_64.rpm kernel-rt-debuginfo-common-x86_64-3.10.0-644.rt56.560.el7.x86_64.rpm kernel-rt-devel-3.10.0-644.rt56.560.el7.x86_64.rpm kernel-rt-kvm-3.10.0-644.rt56.560.el7.x86_64.rpm kernel-rt-kvm-debuginfo-3.10.0-644.rt56.560.el7.x86_64.rpm kernel-rt-trace-3.10.0-644.rt56.560.el7.x86_64.rpm kernel-rt-trace-debuginfo-3.10.0-644.rt56.560.el7.x86_64.rpm kernel-rt-trace-devel-3.10.0-644.rt56.560.el7.x86_64.rpm kernel-rt-trace-kvm-3.10.0-644.rt56.560.el7.x86_64.rpm kernel-rt-trace-kvm-debuginfo-3.10.0-644.rt56.560.el7.x86_64.rpm As I was not quite sure if debug kernel is still needed in order to get more debug info, and there is no kernel-rt-debug-* packages in above link, so I download below packages from brewweb. kernel-rt-debug-3.10.0-644.rt56.560.el7.x86_64.rpm kernel-rt-debug-debuginfo-3.10.0-644.rt56.560.el7.x86_64.rpm kernel-rt-debug-devel-3.10.0-644.rt56.560.el7.x86_64.rpm Is this OK? Today, I also tested with 3.10.0-644.rt56.560.el7.x86_64, the host hang and I got call trace: [ 532.144591] CPU0: NOHZ: local_softirq_pending: SCHED [ 532.145590] CPU0: NOHZ: local_softirq_pending: SCHED [ 532.146590] CPU0: NOHZ: local_softirq_pending: SCHED [ 532.147589] CPU0: NOHZ: local_softirq_pending: SCHED [ 532.148589] CPU0: NOHZ: local_softirq_pending: SCHED [ 532.149589] CPU0: NOHZ: local_softirq_pending: SCHED [ 532.150589] CPU0: NOHZ: local_softirq_pending: SCHED [ 532.151589] CPU0: NOHZ: local_softirq_pending: SCHED [ 532.152589] CPU0: NOHZ: local_softirq_pending: SCHED [ 532.153589] CPU0: NOHZ: local_softirq_pending: SCHED [ 713.501844] sd 10:0:0:0: [sda] tag#0 task abort called for scmd(ffff9e49ee) [ 713.501848] sd 10:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 42 d7 cf 38 00 000 [ 713.501849] sd 10:0:0:0: task abort: FAILED scmd(ffff9e49ee52b100) [ 797.469315] sd 10:0:0:0: [sda] tag#1 task abort called for scmd(ffff9e4bec) [ 797.469317] sd 10:0:0:0: [sda] tag#1 CDB: Write(10) 2a 00 43 66 cd 90 00 00 [ 797.469319] sd 10:0:0:0: task abort: FAILED scmd(ffff9e4becd0ddc0) [ 797.469322] sd 10:0:0:0: target reset called for scmd(ffff9e49ee52b100) [ 797.469324] sd 10:0:0:0: [sda] tag#0 megasas: target reset FAILED!! [ 797.469327] sd 10:0:0:0: [sda] tag#23 Controller reset is requested due tot [ 797.469327] SCSI command pointer: (ffff9e4dd075cc40) SCSI host state: 5 I [ 797.469328] IO request frame: 0000000a 00000000 00000000 ffcfeb80 00600002 00000020 00000000 00000a 00000000 0000400a 00000000 00000000 00000000 00000000 00000000 010000 bd42002a 00002854 00000005 00000000 00000000 00000000 00000000 000000 005a0000 00000000 00000000 00000000 00000000 00000000 00000110 000000 fffa2000 00000000 00000a00 40000000 (In reply to Marcelo Tosatti from comment #22) > Places that grab both locks are: > > 1. put_pwq_unlocked > 2. try_to_grab_pending > 3. queue_work_on > 4. queue_delayed_work_on > 5. mod_delayed_work_on > 6. cancel_work_timer (via try_to_grab_pending) > 7. flush_delayed_work > > So perhaps its scheduling work on an isolated CPU, > with pmd or qemu-kvm-vcpu with priority higher than > kworker thread, and then it hangs (because the pmd or > qemu-kvm do not allow kworker to execute). 
> To dump the trace to some location:
>
> 1) Have a process writing to disk continuously:
> cat /sys/kernel/debug/tracing/trace_pipe > /root/trace.txt

Before testing, I executed this command on both hosts; after the hang I rebooted and checked the log file, and it showed nothing.

# cat /sys/kernel/debug/tracing/trace_pipe > /root/trace.txt
# cat trace.txt
(empty)

> or
>
> 2)
> You can also trigger a dump of the Ftrace buffer to the console with sysrq-z.

Could you share more details about this method? Then I can have a try.

(In reply to Marcelo Tosatti from comment #23)
> Can you please also enable the following tracepoints:
>
> trace_workqueue_activate_work(work);
> trace_workqueue_queue_work(req_cpu, pwq, work);
> trace_workqueue_activate_work(work);
> trace_workqueue_execute_start(work);
> trace_workqueue_execute_end(work);

Could you please explain a bit more about how to enable these tracepoints? I have no experience with this yet. Thanks.

Best Regards,
Pei

Created attachment 1271602 [details]
Call trace info when testing build from marcelo
Marcelo, please check the log info, which covers about 13 hours. Is it expected? I still got an empty /root/trace.txt.
Note: Tested with 3.10.0-644.rt56.560.el7.x86_64; all packages were downloaded from your link.
Best Regards,
Pei
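For reference, a minimal sketch of one way to do both of these things, assuming debugfs is mounted at /sys/kernel/debug and the magic SysRq interface is enabled (pinning the cat process to CPU 0 is only an illustrative choice of a housekeeping CPU, not something specified in this bug):

Enable just the workqueue tracepoints:
# echo workqueue_activate_work workqueue_queue_work workqueue_execute_start workqueue_execute_end > /sys/kernel/debug/tracing/set_event

Stream the events to disk from a housekeeping CPU:
# taskset -c 0 cat /sys/kernel/debug/tracing/trace_pipe > /root/trace.txt &

Allow all SysRq functions, then dump the Ftrace ring buffer to the console (sysrq-z):
# echo 1 > /proc/sys/kernel/sysrq
# echo z > /proc/sysrq-trigger

Note that trace_pipe produces nothing until at least one event is enabled in set_event, which would also explain an empty /root/trace.txt.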
(In reply to Pei Zhang from comment #26) > (In reply to Marcelo Tosatti from comment #19) > > (In reply to Pei Zhang from comment #18) > > > Created attachment 1271105 [details] > > > Console log when host is hang - 2 > > > > > > (In reply to Marcelo Tosatti from comment #15) > > > > [...] > > > > Pei Zhang, > > > > > > > > Please try the -kvm kernel of this SRPM: > > > > > > > > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12983510 > > > > > > > > Which prints additional information. > > > > > > Marcelo, > > > > > > Please see this attachment. > > > > > > Testing with 3.10.0-644.rt56.560.el7.x86_64.debug. Using -kvm packages from > > > this link. > > > > > > Best Regards, > > > Pei > > > > Pei, > > > > Need both kernel-rt and kernel-rt-kvm from the link. Did you install both of > > them? Sorry for not being clear before. > > Hi Marcelo, > > When testing with "3.10.0-644.rt56.560.el7.x86_64.debug", I download and > installed below packages from your link: > > kernel-rt-3.10.0-644.rt56.560.el7.x86_64.rpm > kernel-rt-debuginfo-3.10.0-644.rt56.560.el7.x86_64.rpm > kernel-rt-debuginfo-common-x86_64-3.10.0-644.rt56.560.el7.x86_64.rpm > kernel-rt-devel-3.10.0-644.rt56.560.el7.x86_64.rpm > kernel-rt-kvm-3.10.0-644.rt56.560.el7.x86_64.rpm > kernel-rt-kvm-debuginfo-3.10.0-644.rt56.560.el7.x86_64.rpm > kernel-rt-trace-3.10.0-644.rt56.560.el7.x86_64.rpm > kernel-rt-trace-debuginfo-3.10.0-644.rt56.560.el7.x86_64.rpm > kernel-rt-trace-devel-3.10.0-644.rt56.560.el7.x86_64.rpm > kernel-rt-trace-kvm-3.10.0-644.rt56.560.el7.x86_64.rpm > kernel-rt-trace-kvm-debuginfo-3.10.0-644.rt56.560.el7.x86_64.rpm > > > As I was not quite sure if debug kernel is still needed in order to get more > debug info, and there is no kernel-rt-debug-* packages in above link, so I > download below packages from brewweb. > > kernel-rt-debug-3.10.0-644.rt56.560.el7.x86_64.rpm > kernel-rt-debug-debuginfo-3.10.0-644.rt56.560.el7.x86_64.rpm > kernel-rt-debug-devel-3.10.0-644.rt56.560.el7.x86_64.rpm > > > Is this OK? > > > > Today, I also tested with 3.10.0-644.rt56.560.el7.x86_64, the host hang and > I got call trace: > > [ 532.144591] CPU0: NOHZ: local_softirq_pending: SCHED > [ 532.145590] CPU0: NOHZ: local_softirq_pending: SCHED > [ 532.146590] CPU0: NOHZ: local_softirq_pending: SCHED > [ 532.147589] CPU0: NOHZ: local_softirq_pending: SCHED > [ 532.148589] CPU0: NOHZ: local_softirq_pending: SCHED > [ 532.149589] CPU0: NOHZ: local_softirq_pending: SCHED > [ 532.150589] CPU0: NOHZ: local_softirq_pending: SCHED > [ 532.151589] CPU0: NOHZ: local_softirq_pending: SCHED > [ 532.152589] CPU0: NOHZ: local_softirq_pending: SCHED > [ 532.153589] CPU0: NOHZ: local_softirq_pending: SCHED > [ 713.501844] sd 10:0:0:0: [sda] tag#0 task abort called for > scmd(ffff9e49ee) > [ 713.501848] sd 10:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 42 d7 cf 38 00 > 000 > [ 713.501849] sd 10:0:0:0: task abort: FAILED scmd(ffff9e49ee52b100) > [ 797.469315] sd 10:0:0:0: [sda] tag#1 task abort called for > scmd(ffff9e4bec) > [ 797.469317] sd 10:0:0:0: [sda] tag#1 CDB: Write(10) 2a 00 43 66 cd 90 00 > 00 > [ 797.469319] sd 10:0:0:0: task abort: FAILED scmd(ffff9e4becd0ddc0) > [ 797.469322] sd 10:0:0:0: target reset called for scmd(ffff9e49ee52b100) > [ 797.469324] sd 10:0:0:0: [sda] tag#0 megasas: target reset FAILED!! 
> [ 797.469327] sd 10:0:0:0: [sda] tag#23 Controller reset is requested due > tot > [ 797.469327] SCSI command pointer: (ffff9e4dd075cc40) SCSI host state: 5 I > [ 797.469328] IO request frame: > 0000000a 00000000 00000000 ffcfeb80 00600002 00000020 00000000 00000a > 00000000 0000400a 00000000 00000000 00000000 00000000 00000000 010000 > bd42002a 00002854 00000005 00000000 00000000 00000000 00000000 000000 > 005a0000 00000000 00000000 00000000 00000000 00000000 00000110 000000 > fffa2000 00000000 00000a00 40000000 > > > (In reply to Marcelo Tosatti from comment #22) > > Places that grab both locks are: > > > > 1. put_pwq_unlocked > > 2. try_to_grab_pending > > 3. queue_work_on > > 4. queue_delayed_work_on > > 5. mod_delayed_work_on > > 6. cancel_work_timer (via try_to_grab_pending) > > 7. flush_delayed_work > > > > So perhaps its scheduling work on an isolated CPU, > > with pmd or qemu-kvm-vcpu with priority higher than > > kworker thread, and then it hangs (because the pmd or > > qemu-kvm do not allow kworker to execute). > > > > To dump the trace to some location: > > > > 1) Have a process writing to disk continuously: > > cat /sys/kernel/debug/tracing/trace_pipe > /root/trace.txt > > Before testing, I executed this command in both 2 hosts, reboot after hang > to check the log file, nothing show. > > # cat /sys/kernel/debug/tracing/trace_pipe > /root/trace.txt > > # cat trace.txt > (empty) > > > or > > > > 2) > > You can also trigger a dump of the Ftrace buffer to the console with sysrq-z. > > Could you share more details about this method? Then I can have a try. > > > (In reply to Marcelo Tosatti from comment #23) > > Can you please also enable the following tracepoints: > > > > trace_workqueue_activate_work(work); > > trace_workqueue_queue_work(req_cpu, pwq, work); > > trace_workqueue_activate_work(work); > > trace_workqueue_execute_start(work); > > trace_workqueue_execute_end(work); > > Could you please explain a bit more about how to enable these tracepoints? > As currently I have no such experiences. Thanks. Do the following to enable the tracepoints: # echo workqueue_activate_work workqueue_queue_work workqueue_execute_start workqueue_execute_end > /sys/kernel/debug/tracing/set_event # cat /sys/kernel/debug/tracing/trace_pipe > /root/trace.txt > > > Best Regards, > Pei (In reply to Pei Zhang from comment #27) > Created attachment 1271602 [details] > Call trace info when testing build from marcelo > > Marcelo, please check the log info about 13 hours. Are they expected? Still, > I got empty from /root/trace.txt. > > Note: Test with 3.10.0-644.rt56.560.el7.x86_64, download all the packages > from your link. > > > Best Regards, > Pei Pei, Great, thats what we need. Unfortunately there have been no traces from KVM this time... can you please run again (with the testing build i provided) until the following is seen on the trace (will add to the next comment). (In reply to Marcelo Tosatti from comment #29) > (In reply to Pei Zhang from comment #27) > > Created attachment 1271602 [details] > > Call trace info when testing build from marcelo > > > > Marcelo, please check the log info about 13 hours. Are they expected? Still, > > I got empty from /root/trace.txt. > > > > Note: Test with 3.10.0-644.rt56.560.el7.x86_64, download all the packages > > from your link. > > > > > > Best Regards, > > Pei > > Pei, > > Great, thats what we need. Unfortunately there have been no traces from > KVM this time... 
can you please run again (with the testing build i > provided) until the following is seen on the trace (will add to the next > comment). This one: [ 1234.528484] INFO: task worker:4477 blocked for more than. [ 1234.535673] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this m. [ 1234.544412] worker D ffff880858bc50f0 0 4477 1 0x00000080 [ 1234.552295] ffff880461f33c88 0000000000000002 ffff880461f33fd8 ffff880461f33f8 [ 1234.560584] ffff880461f33fd8 ffff880461f33fd8 ffff88066d60d0f0 ffff880858bc500 [ 1234.568870] ffff880461f33fd8 ffff88086639fb48 ffff880461f33cd0 000000000000026 [ 1234.577158] Call Trace: [ 1234.579888] [<ffffffff8d69ce90>] schedule+0x30/0x96 [ 1234.585426] [<ffffffff8d69dc4d>] rt_spin_lock_slowlock+0x13d/0x350 [ 1234.592420] [<ffffffff8d69ede5>] rt_spin_lock+0x25/0x30 [ 1234.598364] [<ffffffffc05aec53>] kvm_mmu_notifier_invalidate_range_end+0x23/0] [ 1234.607299] [<ffffffff8d1d5af4>] __mmu_notifier_invalidate_range_end+0x64/0xe0 [ 1234.615456] [<ffffffff8d1afd75>] unmap_vmas+0x85/0x90 [ 1234.621190] [<ffffffff8d1b4c5e>] unmap_region+0xbe/0x140 [ 1234.627214] [<ffffffff8d1b70d2>] do_munmap+0x292/0x470 [ 1234.633043] [<ffffffff8d1b7318>] vm_munmap+0x68/0xb0 [ 1234.638677] [<ffffffff8d1b83b2>] SyS_munmap+0x22/0x30 [ 1234.644409] [<ffffffff8d6a7d94>] tracesys+0xdd/0xe2 Created attachment 1272008 [details] trace.txt from /sys/kernel/debug/tracing/trace_pipe (In reply to Marcelo Tosatti from comment #29) > (In reply to Pei Zhang from comment #27) > > Created attachment 1271602 [details] > > Call trace info when testing build from marcelo > > > > Marcelo, please check the log info about 13 hours. Are they expected? Still, > > I got empty from /root/trace.txt. > > > > Note: Test with 3.10.0-644.rt56.560.el7.x86_64, download all the packages > > from your link. > > > > > > Best Regards, > > Pei > > Pei, > > Great, thats what we need. Unfortunately there have been no traces from > KVM this time... can you please run again (with the testing build i > provided) until the following is seen on the trace (will add to the next > comment). Hi Marcelo, The trace info is attached to this comment. I'd like to high light below info: 1. With these below commands, in my many times testing(>10 times), the hosts never hang. Even though the call trace occurs, but the host didn't hang. [1]# echo workqueue_activate_work workqueue_queue_work workqueue_execute_start [2]# cat /sys/kernel/debug/tracing/trace_pipe > /root/trace.txt 2. I tried to bind process of [2] to CPU0, and also tried CPU13, the hang issue still can not be reproduced. The original values of process of [2]are " pid 6379's current affinity list: 0,1,3,5,7,9,11,13"(These are non isolated cores of host). Best Regards, Pei Created attachment 1272010 [details] Call trace info of Comment 31 Created attachment 1272021 [details]
trace.txt from /sys/kernel/debug/tracing/trace_pipe - update
This trace log contains the info from when "INFO: task qemu-kvm:4471 blocked for more than 600 seconds." was printed on the console.
Created attachment 1272022 [details] Call trace info of Comment 33 Update info: The host keeping printing call trace info, however the host still didn't hang. Created attachment 1272023 [details] Call trace info of Comment 33 - update (In reply to Pei Zhang from comment #33) > Created attachment 1272021 [details] > trace.txt from /sys/kernel/debug/tracing/trace_pipe - update > > This trace log contains the info when "INFO: task qemu-kvm:4471 blocked for > more than 600 seconds." prints in console. Sorry, please ignore Comment 34. Re-upload the correct console info. Best Regards, Pei (In reply to Pei Zhang from comment #31) > Created attachment 1272008 [details] > trace.txt from /sys/kernel/debug/tracing/trace_pipe > > (In reply to Marcelo Tosatti from comment #29) > > (In reply to Pei Zhang from comment #27) > > > Created attachment 1271602 [details] > > > Call trace info when testing build from marcelo > > > > > > Marcelo, please check the log info about 13 hours. Are they expected? Still, > > > I got empty from /root/trace.txt. > > > > > > Note: Test with 3.10.0-644.rt56.560.el7.x86_64, download all the packages > > > from your link. > > > > > > > > > Best Regards, > > > Pei > > > > Pei, > > > > Great, thats what we need. Unfortunately there have been no traces from > > KVM this time... can you please run again (with the testing build i > > provided) until the following is seen on the trace (will add to the next > > comment). > > Hi Marcelo, > > The trace info is attached to this comment. > > I'd like to high light below info: > > 1. With these below commands, in my many times testing(>10 times), the hosts > never hang. Even though the call trace occurs, but the host didn't hang. > > [1]# echo workqueue_activate_work workqueue_queue_work > workqueue_execute_start > [2]# cat /sys/kernel/debug/tracing/trace_pipe > /root/trace.txt > > 2. I tried to bind process of [2] to CPU0, and also tried CPU13, the hang > issue still can not be reproduced. > > The original values of process of [2]are > " pid 6379's current affinity list: 0,1,3,5,7,9,11,13"(These are non > isolated cores of host). > > > > Best Regards, > Pei cat 1272008.txt | cut -f 2 -d "[" | cut -f 1 -d "]" | sort | uniq | while read z; do echo -n " $z, "; done 000, 001, 002, 003, 004, 005, 006, 007, 008, 009, 010, 011, 012, 013, 014, 015, 016, 017, 018, 019 So workitems are being executed on the isolated cpus 2, 4, 6, 8, 10, 12, 14, 16 and 18. 
They are: <...>-3763 1 workqueue_execute_end: work struct ffffa0cb6ee13e20 <...>-3763 1 workqueue_execute_start: work struct ffffa0cb6ee13e20: function vmstat_update ktimersoftd/14-131 1 workqueue_activate_work: work struct ffffffffc07a3c60 ktimersoftd/14-131 1 workqueue_queue_work: work struct=ffffffffc07a3c60 function=jump_label_update_timeout workqueue=ffffa0c9ffc08e00 req_cpu=5120 cpu=14 ktimersoftd/8-80 [008] workqueue_activate_work: work struct ffffa0cb6ed13e20 ktimersoftd/8-80 [008] workqueue_queue_work: work struct=ffffa0cb6ed13e20 function=vmstat_update workqueue=ffffa0c9ffc08e00 req_cpu=5120 cpu=8 kworker/10:1-3758 1 workqueue_execute_end: work struct ffffa0cb6ed53e20 kworker/10:1-3758 1 workqueue_execute_start: work struct ffffa0cb6ed53e20: function vmstat_update kworker/12:1-3760 1 653.453440: workqueue_execute_start: work struct ffffa0cb6ed93e20: function vmstat_update kworker/12:1-3760 1 653.453443: workqueue_execute_end: work struct ffffa0cb6ed93e20 kworker/12:1-3760 1 926.440905: workqueue_execute_start: work struct ffffa0cb6ed93e20: function vmstat_update kworker/14:1-3761 1 workqueue_execute_end: work struct ffffa0cb6edd3e20 kworker/14:1-3761 1 workqueue_execute_end: work struct ffffffffc07a3c60 kworker/14:1-3761 1 workqueue_execute_start: work struct ffffa0cb6edd3e20: function vmstat_update kworker/14:1-3761 1 workqueue_execute_start: work struct ffffffffc07a3c60: function jump_label_update_timeout kworker/16:1-3763 1 workqueue_execute_end: work struct ffffa0cb6ee13e20 kworker/16:1-3763 1 workqueue_execute_start: work struct ffffa0cb6ee13e20: function vmstat_update kworker/18:1-3765 1 workqueue_execute_end: work struct ffffa0cb6ee53e20 kworker/18:1-3765 1 workqueue_execute_start: work struct ffffa0cb6ee53e20: function vmstat_update kworker/2:1-3755 2 workqueue_execute_end: work struct ffffa0cb6ec53e20 kworker/2:1-3755 2 workqueue_execute_start: work struct ffffa0cb6ec53e20: function vmstat_update kworker/4:1-3756 [004] workqueue_execute_end: work struct ffffa0cb6ec93e20 kworker/4:1-3756 [004] workqueue_execute_start: work struct ffffa0cb6ec93e20: function vmstat_update kworker/6:1-3757 [006] workqueue_execute_end: work struct ffffa0cb6ecd3e20 kworker/6:1-3757 [006] workqueue_execute_start: work struct ffffa0cb6ecd3e20: function vmstat_update kworker/8:1-3759 [008] workqueue_execute_end: work struct ffffa0cb6ed13e20 kworker/8:1-3759 [008] workqueue_execute_start: work struct ffffa0cb6ed13e20: function vmstat_update Luiz IIRC you fixed jump_label_update_timeout ? I'm working on fixing vmstat_update. Note this is unrelated to the hang. (In reply to Pei Zhang from comment #32) > Created attachment 1272010 [details] > Call trace info of Comment 31 This one is waiting a RCU grace period to proceed, but fails to do so for more than 600 seconds. What is blocking RCU grace periods? Can a long lived process in userspace which does not exit to kernel mode cause such a thing? [ 4235.640305] INFO: task qemu-kvm:4471 blocked for more than 600 seconds. [ 4235.647686] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables. 
[ 4235.656423] qemu-kvm D ffffa0cb6c6350f0 0 4471 1 0x0006
[ 4235.664307] ffffa0c777cf3cd8 0000000000000002 ffffa0c777cf3fd8 ffffa0c8
[ 4235.672597] ffffa0c777cf3fd8 ffffa0c777cf3fd8 ffffa0cd6d57d0f0 ffffa0c0
[ 4235.680885] 7fffffffffffffff ffffa0c777cf3e10 ffffa0cb6c6350f0 ffffa0c0
[ 4235.689175] Call Trace:
[ 4235.691905] [<ffffffffa569fc30>] schedule+0x30/0x96
[ 4235.697435] [<ffffffffa569de21>] schedule_timeout+0x2e1/0x370
[ 4235.703944] [<ffffffffa569de21>] ? schedule_timeout+0x2e1/0x370
[ 4235.710645] [<ffffffffa51554bb>] ? probe_sched_wakeup+0x2b/0xc0
[ 4235.717348] [<ffffffffa50b9752>] ? ttwu_do_wakeup+0xc2/0x110
[ 4235.723759] [<ffffffffa569ec14>] wait_for_completion+0xc4/0x100
[ 4235.730461] [<ffffffffa5136b60>] ? call_rcu_sched+0x20/0x20
[ 4235.736775] [<ffffffffa50a3fab>] wait_rcu_gp+0x5b/0x80
[ 4235.742602] [<ffffffffa50a3f30>] ? __rcu_read_unlock+0x60/0x60
[ 4235.749207] [<ffffffffa5134e89>] synchronize_rcu+0x29/0x40
[ 4235.755425] [<ffffffffc09f68bf>] vhost_net_release+0x7f/0xc0 [vhost_ne]
[ 4235.763000] [<ffffffffa51fc423>] __fput+0xe3/0x260
[ 4235.768442] [<ffffffffa51fc68e>] ____fput+0xe/0x10
[ 4235.773882] [<ffffffffa50a3c07>] task_work_run+0xa7/0xf0
[ 4235.779908] [<ffffffffa501bdfe>] do_notify_resume+0xae/0x100
[ 4235.786320] [<ffffffffa56aac01>] int_signal+0x12/0x17
[ 4235.792052] debug_kvm_hung
[ 4235.795069] arch/x86/kvm/../../../virt/kvm/kvm_main.c: 385

(In reply to Marcelo Tosatti from comment #36)
> Luiz IIRC you fixed jump_label_update_timeout ?

I fixed a different jump_label issue.

jump_label_update_timeout looks like a timer. But to understand why it's executing we'd to understand what the process running on that CPU is doing and why this work item was queued. Btw, the fact it executes is not necessarily a problem.

(In reply to Luiz Capitulino from comment #38)
> (In reply to Marcelo Tosatti from comment #36)
> > Luiz IIRC you fixed jump_label_update_timeout ?
>
> I fixed a different jump_label issue.

Ah, OK, i'll take a look at that if no one beats me to it.

> jump_label_update_timeout looks like a timer. But to understand why it's executing we'd to understand what the process running on that CPU is doing and why this work item was queued. Btw, the fact it executes is not necessarily a problem.

Because the priority of qemu-kvm-vcpu is higher, you mean?

And: if someone else does flush_work() on that work item, its a problem then, correct?

(In reply to Marcelo Tosatti from comment #39)
> (In reply to Luiz Capitulino from comment #38)
> > (In reply to Marcelo Tosatti from comment #36)
> > > Luiz IIRC you fixed jump_label_update_timeout ?
> >
> > I fixed a different jump_label issue.
>
> Ah, OK, i'll take a look at that if no one beats me to it.
>
> > jump_label_update_timeout looks like a timer. But to understand why it's executing we'd to understand what the process running on that CPU is doing and why this work item was queued. Btw, the fact it executes is not necessarily a problem.
>
> Because the priority of qemu-kvm-vcpu is higher, you mean?

That and whether or not the condition that caused that timer to expire can also happen during real-time operation.

> And: if someone else does flush_work() on that work item, its a problem then, correct?

I think my definition of a problem would be: provided the system is correctly configured for KVM-RT, any condition on the system that can happen during real-time operation and which causes anything to be scheduled to run on an isolated CPU.
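One quick way to spot that condition during a test run is to check whether the per-CPU kworkers on the isolated cores have accumulated any CPU time. A minimal sketch, assuming the isolated CPU list 2, 4, 6, 8, 10, 12, 14, 16, 18 from this setup (the loop is illustrative, not a command used in this bug):

for cpu in 2 4 6 8 10 12 14 16 18; do            # assumption: isolated CPUs in this setup
    for pid in $(pgrep "^kworker/${cpu}:"); do
        # psr = CPU the kworker is bound to, time = accumulated CPU time;
        # any growth here during the run means work items executed on an isolated core
        ps -o pid=,comm=,psr=,time= -p "$pid"
    done
done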
IMHO, Work items that starve but don't generate any serious consequences may be tolerated for the immediate term, but should be fixed in the medium to long term.

(In reply to Luiz Capitulino from comment #40)
> (In reply to Marcelo Tosatti from comment #39)
> > (In reply to Luiz Capitulino from comment #38)
> > > (In reply to Marcelo Tosatti from comment #36)
> > > > Luiz IIRC you fixed jump_label_update_timeout ?
> > >
> > > I fixed a different jump_label issue.
> >
> > Ah, OK, i'll take a look at that if no one beats me to it.
> >
> > > jump_label_update_timeout looks like a timer. But to understand why it's executing we'd to understand what the process running on that CPU is doing and why this work item was queued. Btw, the fact it executes is not necessarily a problem.
> >
> > Because the priority of qemu-kvm-vcpu is higher, you mean?
>
> That and whether or not the condition that caused that timer to expire can also happen during real-time operation.
>
> > And: if someone else does flush_work() on that work item, its a problem then, correct?
>
> I think my definition of a problem would be: provided the system is correctly configured for KVM-RT, any condition on the system that can happen during real-time operation and which causes anything to be scheduled to run on an isolated CPU.
>
> IMHO, Work items that starve but don't generate any serious consequences may be tolerated for the immediate term, but should be fixed in the medium to long term.

Well, its going to hang the task which does flush_work(). This is going to generate a "task hung for more than 600 seconds" error which:

1) You don't know what else depends on the progress of that hung task.
2) Its going to generate support calls.

So it can't be tolerated, IMO.

Should we add a WARN_ON() to any work item that is scheduled on an isolated CPU? That's easy and will catch the offenders.

Ah, I see what you mean.

We've fixed one instance of that problem in the past:

commit 8c95bc87c79668483a6542ccadc73958fb865367
Author: Luiz Capitulino <lcapitulino>
Date:   Fri May 27 15:03:28 2016 +0200

    mm: perform lru_add_drain_all() remotely

Have you found others?

PS: The commit is from the upstream RT tree.

(In reply to Luiz Capitulino from comment #42)
> Ah, I see what you mean.
>
> We've fixed one instance of that problem in the past:
>
> commit 8c95bc87c79668483a6542ccadc73958fb865367
> Author: Luiz Capitulino <lcapitulino>
> Date:   Fri May 27 15:03:28 2016 +0200
>
>     mm: perform lru_add_drain_all() remotely
>
> Have you found others?
>
> PS: The commit is from the upstream RT tree.

Yes, that. But rather than debug it and trace it down, better to have a warning up front. I'll submit a patch. Thanks.

Scheduling work on an isolated CPU may not be always wrong. It will cause problems for KVM-RT when running DPDK-based applications because the PMD threads execute 100% of the time in user-space with fifo priority so it never requests any kind of kernel service and won't let kworker run on the isolated CPU.

Regular applications that requests kernel services etc, may have valid reasons to get work queued on the isolated CPUs. It really depends on what the application is doing.

(In reply to Luiz Capitulino from comment #44)
> Scheduling work on an isolated CPU may not be always wrong.
> It will cause problems for KVM-RT when running DPDK-based applications because the PMD threads execute 100% of the time in user-space with fifo priority so it never requests any kind of kernel service and won't let kworker run on the isolated CPU.
>
> Regular applications that requests kernel services etc, may have valid reasons to get work queued on the isolated CPUs. It really depends on what the application is doing.

Ok, so WARN_ON() if:

1. Isolated CPU with DPDK poll mode (or similar app), and FIFO priority of thread is higher than kworker.

Opened a separate BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1443951

o Anything that prevents RCU's grace-period kthreads from running. This can result in the "All QSes seen" console-log message. This message will include information on when the kthread last ran and how often it should be expected to run.

[ 2037.529094] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 10, t)
[ 2037.529095] All QSes seen, last rcu_preempt kthread activity 0 (4296705199-42967053
[ 2037.529096] pmd91 R running task 0 3892 1 0x00080080
[ 2037.529098] ffff88046cf40000 000000005617b23e ffff88046ed43dd0 ffffffff990bc396
[ 2037.529099] ffff88046ed520c0 ffffffff99a2bc00 ffff88046ed43e38 ffffffff9913867d
[ 2037.529100] ffff8802ec8650e0 ffff88046ed520c0 0000000000009467 0000000000000000
[ 2037.529101] Call Trace:
[ 2037.529105] <IRQ> [<ffffffff990bc396>] sched_show_task+0xb6/0x120
[ 2037.529108] [<ffffffff9913867d>] rcu_check_callbacks+0x83d/0x860
[ 2037.529110] [<ffffffff9908fef1>] update_process_times+0x41/0x70
[ 2037.529113] [<ffffffff990ebed5>] tick_sched_handle.isra.18+0x25/0x60
[ 2037.529114] [<ffffffff990ec273>] tick_sched_timer+0x43/0x70
[ 2037.529117] [<ffffffff990abed4>] __run_hrtimer+0xc4/0x2c0
[ 2037.529119] [<ffffffff990ec230>] ? tick_sched_do_timer+0x50/0x50
[ 2037.529120] [<ffffffff990ace50>] hrtimer_interrupt+0x130/0x350
[ 2037.529123] [<ffffffff99044b85>] local_apic_timer_interrupt+0x35/0x60
[ 2037.529125] [<ffffffff996aa09d>] smp_apic_timer_interrupt+0x3d/0x50
[ 2037.529127] [<ffffffff996a881d>] apic_timer_interrupt+0x6d/0x80
[ 2217.528187] <EOI>

On my machine the rcu_sched task (grace period kthread) is executing on a non-isolated CPU.

However in your case it seems pmd91 is executing on a CPU on which rcu_preempt can execute. Check rcu_preempt's mask:

[root@localhost RHEL7]# ps axuw | egrep "preempt|bh|sched"
root        12  0.0  0.0      0     0 ?        S    07:41   0:00 [rcu_preempt]
root        13  0.0  0.0      0     0 ?        S    07:41   0:00 [rcu_bh]
root        14  0.0  0.0      0     0 ?        S    07:41   0:00 [rcu_sched]
root      3197  0.0  0.0 112652   952 pts/0    S+   07:43   0:00 grep -E --color=auto preempt|bh|sched
[root@localhost RHEL7]# taskset -p 12
pid 12's current affinity mask: 3
[root@localhost RHEL7]# taskset -p 13
pid 13's current affinity mask: 3
[root@localhost RHEL7]# taskset -p 14
pid 14's current affinity mask: 3

(The mask should be the mask of non-isolated CPUs.)

Then confirm that pmd91 is executing on an isolated CPU.

(In reply to Marcelo Tosatti from comment #45)
> (In reply to Luiz Capitulino from comment #44)
> > Scheduling work on an isolated CPU may not be always wrong. It will cause problems for KVM-RT when running DPDK-based applications because the PMD threads execute 100% of the time in user-space with fifo priority so it never requests any kind of kernel service and won't let kworker run on the isolated CPU.
> >
> > Regular applications that requests kernel services etc, may have valid reasons to get work queued on the isolated CPUs.
> > It really depends on what the application is doing.
>
> Ok, so WARN_ON() if:
>
> 1. Isolated CPU with DPDK poll mode (or similar app), and FIFO priority of thread is higher than kworker.
>
> Opened a separate BZ:
> https://bugzilla.redhat.com/show_bug.cgi?id=1443951

Yeah, that's perfect!

(In reply to Marcelo Tosatti from comment #46)
> o Anything that prevents RCU's grace-period kthreads from running. This can result in the "All QSes seen" console-log message. This message will include information on when the kthread last ran and how often it should be expected to run.
>
> [ 2037.529094] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 10, t)
> [ 2037.529095] All QSes seen, last rcu_preempt kthread activity 0 (4296705199-42967053
> [ 2037.529096] pmd91 R running task 0 3892 1 0x00080080
> [ 2037.529098] ffff88046cf40000 000000005617b23e ffff88046ed43dd0 ffffffff990bc396
> [ 2037.529099] ffff88046ed520c0 ffffffff99a2bc00 ffff88046ed43e38 ffffffff9913867d
> [ 2037.529100] ffff8802ec8650e0 ffff88046ed520c0 0000000000009467 0000000000000000
> [ 2037.529101] Call Trace:
> [ 2037.529105] <IRQ> [<ffffffff990bc396>] sched_show_task+0xb6/0x120
> [ 2037.529108] [<ffffffff9913867d>] rcu_check_callbacks+0x83d/0x860
> [ 2037.529110] [<ffffffff9908fef1>] update_process_times+0x41/0x70
> [ 2037.529113] [<ffffffff990ebed5>] tick_sched_handle.isra.18+0x25/0x60
> [ 2037.529114] [<ffffffff990ec273>] tick_sched_timer+0x43/0x70
> [ 2037.529117] [<ffffffff990abed4>] __run_hrtimer+0xc4/0x2c0
> [ 2037.529119] [<ffffffff990ec230>] ? tick_sched_do_timer+0x50/0x50
> [ 2037.529120] [<ffffffff990ace50>] hrtimer_interrupt+0x130/0x350
> [ 2037.529123] [<ffffffff99044b85>] local_apic_timer_interrupt+0x35/0x60
> [ 2037.529125] [<ffffffff996aa09d>] smp_apic_timer_interrupt+0x3d/0x50
> [ 2037.529127] [<ffffffff996a881d>] apic_timer_interrupt+0x6d/0x80
> [ 2217.528187] <EOI>
>
> On my machine the rcu_sched task (grace period kthread) is executing on a non-isolated CPU.
>
> However in your case it seems pmd91 is executing on a CPU on which rcu_preempt can execute. Check rcu_preempt's mask:
>
> [root@localhost RHEL7]# ps axuw | egrep "preempt|bh|sched"
> root        12  0.0  0.0      0     0 ?        S    07:41   0:00 [rcu_preempt]
> root        13  0.0  0.0      0     0 ?        S    07:41   0:00 [rcu_bh]
> root        14  0.0  0.0      0     0 ?        S    07:41   0:00 [rcu_sched]
> root      3197  0.0  0.0 112652   952 pts/0    S+   07:43   0:00 grep -E --color=auto preempt|bh|sched
> [root@localhost RHEL7]# taskset -p 12
> pid 12's current affinity mask: 3
> [root@localhost RHEL7]# taskset -p 13
> pid 13's current affinity mask: 3
> [root@localhost RHEL7]# taskset -p 14
> pid 14's current affinity mask: 3
>
> (The mask should be the mask of non-isolated CPUs.)
>
> Then confirm that pmd91 is executing on an isolated CPU.

Marcelo, sorry for the late response.

- ovs pmds are working on separate isolated cores:

[root@dell-per730-11 ~]# top -d1 -H
top - 14:32:06 up 2:03, 1 user, load average: 8.00, 8.01, 8.05
Threads: 512 total, 9 running, 503 sleeping, 0 stopped, 0 zombie
%Cpu(s): 28.0 us, 0.1 sy, 0.0 ni, 71.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 32727380 total, 10220968 free, 21948196 used, 558216 buff/cache
KiB Swap: 16449532 total, 16449532 free, 0 used. 10538744 avail Mem

  PID USER  PR  NI    VIRT   RES  SHR S %CPU %MEM     TIME+ COMMAND
 4200 root -96   0 4500568 33504 2352 R 90.1  0.1 117:36.47 pmd88
 4201 root -96   0 4500568 33504 2352 R 89.1  0.1 117:36.46 pmd89
 4204 root -96   0 4500568 33504 2352 R 89.1  0.1 117:36.46 pmd87
 4199 root -96   0 4500568 33504 2352 R 88.1  0.1 117:36.45 pmd86
 4202 root -96   0 4500568 33504 2352 R 86.1  0.1 117:36.43 pmd90
 4203 root -96   0 4500568 33504 2352 R 66.3  0.1 117:36.23 pmd91
 8679 root  20   0  158120  2620 1552 R  1.0  0.0   0:00.03 top
    1 root  20   0  191032  4128 2496 S  0.0  0.0   0:09.43 systemd
    2 root  20   0       0     0    0 S  0.0  0.0   0:00.00 kthreadd
    3 root  -3   0       0     0    0 S  0.0  0.0   0:00.23 ksoftirqd/0
    4 root  -4   0       0     0    0 S  0.0  0.0   0:08.19 ktimersoft+
    9 root  rt   0       0     0    0 S  0.0  0.0   0:00.00 posixcputm+
   10 root  20   0       0     0    0 S  0.0  0.0   0:00.00 kswork
   11 root  20   0       0     0    0 S  0.0  0.0   0:00.00 kcmosdelayd

[root@dell-per730-11 ~]# taskset -p 4200
pid 4200's current affinity mask: 10
[root@dell-per730-11 ~]# taskset -p 4201
pid 4201's current affinity mask: 40
[root@dell-per730-11 ~]# taskset -p 4204
pid 4204's current affinity mask: 1000
[root@dell-per730-11 ~]# taskset -p 4199
pid 4199's current affinity mask: 4
[root@dell-per730-11 ~]# taskset -p 4202
pid 4202's current affinity mask: 100
[root@dell-per730-11 ~]# taskset -p 4202
pid 4202's current affinity mask: 100
[root@dell-per730-11 ~]# taskset -p 4203
pid 4203's current affinity mask: 400

- rcu_preempt is working on non-isolated cores.

[root@dell-per730-11 ~]# ps axuw | egrep "preempt|bh|sched"
root        13  0.0  0.0      0     0 ?        S    12:29   0:00 [rcu_preempt]
root        14  0.0  0.0      0     0 ?        S    12:29   0:00 [rcu_bh]
root        15  0.0  0.0      0     0 ?        S    12:29   0:00 [rcu_sched]
root      8265  0.0  0.0 112668   960 pts/0    S+   13:47   0:00 grep -E --color=auto preempt|bh|sched
[root@dell-per730-11 ~]# taskset -p 13
pid 13's current affinity mask: 2aab
[root@dell-per730-11 ~]# taskset -p 14
pid 14's current affinity mask: 2aab
[root@dell-per730-11 ~]# taskset -p 15
pid 15's current affinity mask: 2aab

Thanks,
Pei

*** Bug 1448770 has been marked as a duplicate of this bug. ***

Pei Zhang, can you please try the following:

1. Make sure you're using tuned-2.8.0-3.el7 in host and guest
2. In the host assign FIFO:1 for PMD threads (and not FIFO:95)
3. In the guest you can assign FIFO:95 to testpmd if you're running it

I'm afraid the host NFV profile is not configured to run a PMD thread with FIFO:95 as we have kernel threads such as softirq and ktimersoftd that may need to execute in some situations.

Would be nice to try what I'm suggesting with the reproducer for bug 1448770, which is simpler.

Oops, possibly wrong information in item 3. Please, try with fifo:1 in the guest too.

(In reply to Luiz Capitulino from comment #50)
> Pei Zhang, can you please try the following:
>
> 1. Make sure you're using tuned-2.8.0-3.el7 in host and guest
> 2. In the host assign FIFO:1 for PMD threads (and not FIFO:95)
> 3. In the guest you can assign FIFO:95 to testpmd if you're running it
>
> I'm afraid the host NFV profile is not configured to run a PMD thread with FIFO:95 as we have kernel threads such as softirq and ktimersoftd that may need to execute in some situations.

Test with:
3.10.0-666.rt56.585.el7.x86_64
tuned-2.8.0-3.el7.noarch

Set ovs pmd threads with:
# chrt -fp 1 15859

Boot testpmd in guest with:
# chrt -f 1 /usr/bin/testpmd ...

I tried 4 times; after several (about 2 to 5) migrations, the guest always hits a kernel panic.
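For reference, the guest vmcores from such a panic can be opened with the crash utility; a hedged sketch (the kernel version and vmcore path here are assumptions, and the matching kernel-debuginfo package must be installed first):

KVER=3.10.0-666.rt56.585.el7.x86_64              # assumption: guest kernel under test
VMCORE=vmcore                                    # placeholder: one of the dumps linked below
crash /usr/lib/debug/lib/modules/$KVER/vmlinux "$VMCORE"
# inside crash:
#   log   - kernel ring buffer leading up to the panic
#   bt    - backtrace of the panicking task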
Please refer to:
http://fileshare.englab.nay.redhat.com/pub/section2/coredump/var/crash/pezhang/bug1438120/

But both src and des hosts work well, no call trace, no error.

> Would be nice to try what I'm suggesting with the reproducer for bug 1448770, which is simpler.

I'll test and confirm if "chrt -fp 1 $ovs_pmd_process" can solve the host hang/call trace issue.

Best Regards,
Pei

The panic from comment 53 is bug 1441552. What I suggest you do for this BZ is to try to reproduce the original reported issue by assigning fifo:1 to the PMD and testpmd threads. Then:

1. If you keep hitting 1441552 then maybe set this BZ as a dependency of bug 1441552 and wait until bug 1441552 is fixed. Once bug 1441552 is fixed, go back and try to reproduce again

2. If you hit latency spikes with fifo:1, open BZs for them

3. If you can't reproduce the original rcu stall issue with fifo:1, then close this as NOTABUG (independently if you hit unrelated issues along the way)

(In reply to Luiz Capitulino from comment #56)
> The panic from comment 53 is bug 1441552. What I suggest you do for this BZ is to try to reproduce the original reported issue by assigning fifo:1 to the PMD and testpmd threads. Then:
>
> 1. If you keep hitting 1441552 then maybe set this BZ as a dependency of bug 1441552 and wait until bug 1441552 is fixed. Once bug 1441552 is fixed, go back and try to reproduce again
>
> 2. If you hit latency spikes with fifo:1, open BZs for them

Yes, or just reopen https://bugzilla.redhat.com/show_bug.cgi?id=1384374 please.

Luiz, your tool can probably help track down the issue very quickly.

> 3. If you can't reproduce the original rcu stall issue with fifo:1, then close this as NOTABUG (independently if you hit unrelated issues along the way)

(In reply to Marcelo Tosatti from comment #57)
> (In reply to Luiz Capitulino from comment #56)
> > The panic from comment 53 is bug 1441552. What I suggest you do for this BZ is to try to reproduce the original reported issue by assigning fifo:1 to the PMD and testpmd threads. Then:
> >
> > 1. If you keep hitting 1441552 then maybe set this BZ as a dependency of bug 1441552 and wait until bug 1441552 is fixed. Once bug 1441552 is fixed, go back and try to reproduce again
> >
> > 2. If you hit latency spikes with fifo:1, open BZs for them
>
> Yes, or just reopen https://bugzilla.redhat.com/show_bug.cgi?id=1384374 please.
>
> Luiz, your tool can probably help track down the issue very quickly.

Pei Zhang, you can also start by enabling the tracepoints we discussed on this bug and attaching the log to that BZ.

Thanks.

> > 3. If you can't reproduce the original rcu stall issue with fifo:1, then close this as NOTABUG (independently if you hit unrelated issues along the way)

(In reply to Luiz Capitulino from comment #56)
> The panic from comment 53 is bug 1441552. What I suggest you do for this BZ is to try to reproduce the original reported issue by assigning fifo:1 to the PMD and testpmd threads. Then:

Hi Luiz, could you please explain a little more about the difference between fifo:1 and fifo:95? In our next rt related testing, should we also test with fifo:1?

> 1. If you keep hitting 1441552 then maybe set this BZ as a dependency of bug 1441552 and wait until bug 1441552 is fixed. Once bug 1441552 is fixed, go back and try to reproduce again

Yes, I keep hitting 1441552. I'll set the dependency.

> 2. If you hit latency spikes with fifo:1, open BZs for them

OK, I'll test the L2 network latency.

> 3. If you can't reproduce the original rcu stall issue with fifo:1, then close this as NOTABUG (independently if you hit unrelated issues along the way)

OK.

Thanks,
Pei

(In reply to Marcelo Tosatti from comment #58)
> Pei Zhang, you can also start by enabling the tracepoints we discussed on this bug and attaching the log to that BZ.
>
> Thanks.

OK. Thanks for the reminder.

Best Regards,
Pei

(In reply to Pei Zhang from comment #59)
> (In reply to Luiz Capitulino from comment #56)
> > The panic from comment 53 is bug 1441552. What I suggest you do for this BZ is to try to reproduce the original reported issue by assigning fifo:1 to the PMD and testpmd threads. Then:
>
> Hi Luiz, could you please explain a little more about the difference between fifo:1 and fifo:95? In our next rt related testing, should we also test with fifo:1?

In the RT kernel, there are some kernel threads that must execute when they are runnable: rcu, migration, ksoftirqd and ktimersoftd. It's not expected that those threads will ever become runnable, but if they do and don't execute the system may crash as you've been observing.

We set the priority of those threads higher than fifo:1, so that threads that take 100% of the CPU can be set at fifo:1 without problems. If you use fifo:95, that's higher than the kernel threads' priority and they will starve if they become runnable.

Now, there are two important details:

- This is only a problem for threads taking 100% of the CPU with fifo priority. Threads that sleep and wake up (like cyclictest) can run at fifo:95 because it will yield the CPU soon

- I've just found out that kernel priorities in the guest are wrong! This could cause problems in this test-case even if you set testpmd to fifo:1 in the guest. I'll open a BZ for this

(In reply to Luiz Capitulino from comment #61)
> (In reply to Pei Zhang from comment #59)
> > (In reply to Luiz Capitulino from comment #56)
> > > The panic from comment 53 is bug 1441552. What I suggest you do for this BZ is to try to reproduce the original reported issue by assigning fifo:1 to the PMD and testpmd threads. Then:
> >
> > Hi Luiz, could you please explain a little more about the difference between fifo:1 and fifo:95? In our next rt related testing, should we also test with fifo:1?
>
> In the RT kernel, there are some kernel threads that must execute when they are runnable: rcu, migration, ksoftirqd and ktimersoftd. It's not expected that those threads will ever become runnable, but if they do and don't execute the system may crash as you've been observing.
>
> We set the priority of those threads higher than fifo:1, so that threads that take 100% of the CPU can be set at fifo:1 without problems. If you use fifo:95, that's higher than the kernel threads' priority and they will starve if they become runnable.
>
> Now, there are two important details:
>
> - This is only a problem for threads taking 100% of the CPU with fifo priority. Threads that sleep and wake up (like cyclictest) can run at fifo:95 because it will yield the CPU soon

Thanks very much Luiz for the detailed explanation. It really makes sense to me.

So on any CPU, these kernel threads should execute when they are runnable. And in our testing, all DPDK pmd threads should be set with fifo:1 as they all take 100% of the CPU.

Thanks,
Pei

> - I've just found out that kernel priorities in the guest are wrong! This could cause problems in this test-case even if you set testpmd to fifo:1 in the guest. I'll open a BZ for this

Exactly!
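To see the relationship Luiz describes on a live host, the RT priorities of those kernel threads can be compared against one PMD thread directly. A minimal sketch, assuming isolated CPU 8 and the thread names seen earlier in this thread (not a command sequence taken from the bug):

for name in migration/8 ksoftirqd/8 ktimersoftd/8 rcu_preempt; do
    pid=$(pgrep -x "$name") && chrt -p "$pid"    # per the explanation above, these are kept above fifo:1
done
# one OVS PMD thread for comparison; with the recommended setting it reports SCHED_FIFO priority 1
pmd_tid=$(ps -To tid=,comm= -p "$(pidof ovs-vswitchd)" | awk '$2 ~ /^pmd/ {print $1; exit}')
chrt -p "$pmd_tid"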
But note that we have a bug in the guest that has to be fixed before you can do this testing:

Bug 1452357 - realtime-virtual-guest: not safe to run PMD thread with fifo:1

(In reply to Marcelo Tosatti from comment #36)
> (In reply to Pei Zhang from comment #31)
> > Created attachment 1272008 [details]
> > trace.txt from /sys/kernel/debug/tracing/trace_pipe
> >
> > (In reply to Marcelo Tosatti from comment #29)
> > > (In reply to Pei Zhang from comment #27)
> > > > Created attachment 1271602 [details]
> > > > Call trace info when testing build from marcelo
> > > >
> > > > Marcelo, please check the log info about 13 hours. Are they expected? Still, I got empty from /root/trace.txt.
> > > >
> > > > Note: Test with 3.10.0-644.rt56.560.el7.x86_64, download all the packages from your link.
> > > >
> > > > Best Regards,
> > > > Pei
> > >
> > > Pei,
> > >
> > > Great, thats what we need. Unfortunately there have been no traces from KVM this time... can you please run again (with the testing build i provided) until the following is seen on the trace (will add to the next comment).
> >
> > Hi Marcelo,
> >
> > The trace info is attached to this comment.
> >
> > I'd like to high light below info:
> >
> > 1. With these below commands, in my many times testing(>10 times), the hosts never hang. Even though the call trace occurs, but the host didn't hang.
> >
> > [1]# echo workqueue_activate_work workqueue_queue_work > workqueue_execute_start
> > [2]# cat /sys/kernel/debug/tracing/trace_pipe > /root/trace.txt
> >
> > 2. I tried to bind process of [2] to CPU0, and also tried CPU13, the hang issue still can not be reproduced.
> >
> > The original values of process of [2] are " pid 6379's current affinity list: 0,1,3,5,7,9,11,13" (These are non isolated cores of host).
> >
> > Best Regards,
> > Pei
>
> cat 1272008.txt | cut -f 2 -d "[" | cut -f 1 -d "]" | sort | uniq | while read z; do echo -n " $z, "; done
> 000, 001, 002, 003, 004, 005, 006, 007, 008, 009, 010, 011, 012, 013, 014, 015, 016, 017, 018, 019
>
> So workitems are being executed on the isolated cpus 2, 4, 6, 8, 10, 12, 14, 16 and 18. They are:
> <...>-3763 1 workqueue_execute_end: work struct ffffa0cb6ee13e20
> <...>-3763 1 workqueue_execute_start: work struct ffffa0cb6ee13e20: function vmstat_update
> ktimersoftd/14-131 1 workqueue_activate_work: work struct ffffffffc07a3c60
> ktimersoftd/14-131 1 workqueue_queue_work: work struct=ffffffffc07a3c60 function=jump_label_update_timeout workqueue=ffffa0c9ffc08e00 req_cpu=5120 cpu=14
> ktimersoftd/8-80 [008] workqueue_activate_work: work struct ffffa0cb6ed13e20
> ktimersoftd/8-80 [008] workqueue_queue_work: work struct=ffffa0cb6ed13e20 function=vmstat_update workqueue=ffffa0c9ffc08e00 req_cpu=5120 cpu=8

Pei Zhang, CPU8 was an isolated CPU on which OVS-DPDK was running on the host, is that correct?

(In reply to Marcelo Tosatti from comment #64)
> Pei Zhang, CPU8 was an isolated CPU on which OVS-DPDK was running on the host, is that correct?

Yes, CPU8 was an isolated CPU on which OVS-DPDK was running on the host. CPUs 2, 4, 6, 8, 10 and 12 were all isolated and used for OVS-DPDK.

Bug [1] has been fixed.
[1] Bug 1452357 - realtime-virtual-guest: not safe to run PMD thread with fifo:1

Now, ping-pong live migration in the rt environment works well with:

(1) In both src and des hosts, set fifo:1 for all pmd threads:
# chrt -fp 1 $ovs_pmd_process_id

(2) In the guest, set testpmd to fifo:1:
# chrt -f 1 /usr/bin/testpmd ...
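The ping-pong runs summarized below can be driven by a small loop; a sketch with placeholder names (GUEST, HOST_A and HOST_B are assumptions, not values taken from this bug):

GUEST=rhel7_rt_guest                             # placeholder domain name
HOST_A=qemu+ssh://hostA.example.com/system       # placeholder source host URI
HOST_B=qemu+ssh://hostB.example.com/system       # placeholder destination host URI
for run in $(seq 1 45); do                       # each run is 2 migrations: A -> B, then B -> A
    virsh migrate --verbose --persistent --live "$GUEST" "$HOST_B" || break
    virsh -c "$HOST_B" migrate --verbose --persistent --live "$GUEST" "$HOST_A" || break
done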
Testing details:

Versions:
3.10.0-680.rt56.604.el7.x86_64
tuned-2.8.0-5.el7.noarch
libvirt-3.2.0-9.el7.x86_64
dpdk-17.05-2.el7fdb.x86_64
openvswitch-2.7.0-8.git20170530.el7fdb.x86_64

Testing 2 scenarios, each with 90 migrations:
(1) Open vSwitch acts as vhost-user client and VM acts as vhost-user server, do ping-pong migration 45 runs, works well.
(2) Open vSwitch acts as vhost-user server and VM acts as vhost-user client, do ping-pong migration 45 runs, works well.

Note: Each run means 2 migrations: migrate from the src to the des host, then migrate from the des back to the src host, so 45 runs means 90 migrations.

Automation results: PASS
https://beaker.engineering.redhat.com/recipes/3942241#tasks

Adding the qemu version: qemu-kvm-rhev-2.9.0-9.el7.x86_64

(In reply to Pei Zhang from comment #67)
> Adding the qemu version: qemu-kvm-rhev-2.9.0-9.el7.x86_64

OK, can we close the bug, then?

Hi Marcelo, I agree this bug can be closed. Thank you.

Moving the status of this bug to CLOSED CURRENTRELEASE.