Bug 1412810
| Summary: | NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [openshift:31011] | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Andy Goldstein <agoldste> |
| Component: | kernel | Assignee: | Larry Woodman <lwoodman> |
| kernel sub component: | Control Groups | QA Contact: | Chao Ye <cye> |
| Status: | CLOSED INSUFFICIENT_DATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | aquini, bbreard, cye, decarr, eparis, jnovy, julio.valcarcel, juriarte, lwoodman, mifiedle, qcai, rickatnight11, sjenning, srelf |
| Version: | 7.3 | Flags: | lwoodman: needinfo+ |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-01-21 20:29:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1469551 | | |
Description (Andy Goldstein, 2017-01-12 21:20:19 UTC)
Here is some of the output about the lockups: Jan 12 15:37:14 rhel73 kernel: NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [openshift:31011] Jan 12 15:37:14 rhel73 kernel: Modules linked in: ceph libceph dns_resolver fuse dummy xt_statistic xt_multiport veth xt_nat xt_recent ipt_REJECT nf_reject_ipv4 xt_mark xt_comment xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter nf_nat nf_conntrack br_netfilter bridge stp llc prl_fs_freeze(POE) prl_fs(POE) prl_eth(POE) dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio ext4 mbcache jbd2 intel_powerclamp coretemp iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper snd_intel8x0 ablk_helper cryptd snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer sg ppdev gpio_ich snd shpchp virtio_balloon i2c_i801 soundcore lpc_ich i2c_core pcspkr parport_pc sbs sbshc parport nfsd auth_rpcgss nfs_acl lockd Jan 12 15:37:14 rhel73 kernel: grace sunrpc ip_tables xfs libcrc32c sd_mod sr_mod crc_t10dif cdrom crct10dif_generic ata_generic pata_acpi virtio_net ahci crct10dif_pclmul crct10dif_common libahci crc32c_intel ata_piix serio_raw libata virtio_pci virtio_ring virtio fjes prl_tg(POE) dm_mirror dm_region_hash dm_log dm_mod Jan 12 15:37:14 rhel73 kernel: CPU: 5 PID: 31011 Comm: openshift Tainted: P OEL ------------ 3.10.0-514.el7.x86_64 #1 Jan 12 15:37:14 rhel73 kernel: Hardware name: Parallels Software International Inc. Parallels Virtual Platform/Parallels Virtual Platform, BIOS 12.1.1 (41491) 11/15/2016 Jan 12 15:37:14 rhel73 kernel: task: ffff88023c3f6dd0 ti: ffff8800360c8000 task.ti: ffff8800360c8000 Jan 12 15:37:14 rhel73 kernel: RIP: 0010:[<ffffffff8168d812>] [<ffffffff8168d812>] _raw_spin_lock+0x32/0x50 Jan 12 15:37:14 rhel73 kernel: RSP: 0000:ffff8800360cbd90 EFLAGS: 00000212 Jan 12 15:37:14 rhel73 kernel: RAX: 000000000000039f RBX: ffff8800360cbd98 RCX: 0000000000007cee Jan 12 15:37:14 rhel73 kernel: RDX: 0000000000007cfc RSI: 0000000000007cfc RDI: ffffea0007b6fcb0 Jan 12 15:37:14 rhel73 kernel: RBP: ffff8800360cbd90 R08: 800000014e2008e5 R09: 0000000000000001 Jan 12 15:37:14 rhel73 kernel: R10: 0000000000000000 R11: 000000c42ea19f20 R12: 0000000000000040 Jan 12 15:37:14 rhel73 kernel: R13: ffff8800360cbe40 R14: ffff8800360cbd98 R15: 000000004a11f8b8 Jan 12 15:37:14 rhel73 kernel: FS: 00007f68b1d4f700(0000) GS:ffff880246f40000(0000) knlGS:0000000000000000 Jan 12 15:37:14 rhel73 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Jan 12 15:37:14 rhel73 kernel: CR2: 000000c43f1502d8 CR3: 000000021efa3000 CR4: 00000000001406e0 Jan 12 15:37:14 rhel73 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jan 12 15:37:14 rhel73 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Jan 12 15:37:14 rhel73 kernel: Stack: Jan 12 15:37:14 rhel73 kernel: ffff8800360cbe20 ffffffff811ec5f4 ffff88021f6d5878 ffff8800360cbe20 Jan 12 15:37:14 rhel73 kernel: ffffffff811aec3b 00003ffffffff000 ffffea0000000000 000000c43f1502d8 Jan 12 15:37:14 rhel73 kernel: 800000014e2008e5 ffff8802442c12c0 000000c43f000000 ffff8800360cbf48 Jan 12 15:37:14 rhel73 kernel: Call Trace: Jan 12 15:37:14 rhel73 kernel: [<ffffffff811ec5f4>] do_huge_pmd_wp_page+0x5a4/0xb80 Jan 12 15:37:14 rhel73 kernel: [<ffffffff811aec3b>] ? 
do_wp_page+0x17b/0x530 Jan 12 15:37:14 rhel73 kernel: [<ffffffff811b0e15>] handle_mm_fault+0x705/0xfe0 Jan 12 15:37:14 rhel73 kernel: [<ffffffff81691994>] __do_page_fault+0x154/0x450 Jan 12 15:37:14 rhel73 kernel: [<ffffffff81691cc5>] do_page_fault+0x35/0x90 Jan 12 15:37:14 rhel73 kernel: [<ffffffff8168df88>] page_fault+0x28/0x30 Jan 12 15:37:14 rhel73 kernel: Code: 00 02 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 01 c3 55 83 e2 fe 0f b7 f2 48 89 e5 b8 00 80 00 00 eb 0d 66 0f 1f 44 00 00 f3 90 <83> e8 01 74 0a 0f b7 0f 66 39 ca 75 f1 5d c3 0f 1f 80 00 00 00 Jan 12 15:37:18 rhel73 kernel: INFO: rcu_sched self-detected stall on CPU Jan 12 15:37:18 rhel73 kernel: INFO: rcu_sched self-detected stall on CPU Jan 12 15:37:18 rhel73 kernel: INFO: rcu_sched self-detected stall on CPU Jan 12 15:37:18 rhel73 kernel: { Jan 12 15:37:18 rhel73 kernel: { Jan 12 15:37:18 rhel73 kernel: 1 Jan 12 15:37:18 rhel73 kernel: 4 Jan 12 15:37:18 rhel73 kernel: } Jan 12 15:37:18 rhel73 kernel: } (t=60000 jiffies g=454316 c=454315 q=50798) Jan 12 15:37:18 rhel73 kernel: (t=60000 jiffies g=454316 c=454315 q=50798) Jan 12 15:37:18 rhel73 kernel: Task dump for CPU 0: Jan 12 15:37:18 rhel73 kernel: openshift R running task 0 2994 30953 0x00000088 Jan 12 15:37:18 rhel73 kernel: ffff8801c8e238e8 ffffffff8168af70 ffff880090305e20 ffff8801c8e23fd8 Jan 12 15:37:18 rhel73 kernel: ffff8801c8e23fd8 ffff8801c8e23fd8 ffff880090305e20 ffff8801c8e23ac8 Jan 12 15:37:18 rhel73 kernel: 000000000000c350 0000000000000000 0000000000000000 0000000000000000 Jan 12 15:37:18 rhel73 kernel: Call Trace: Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168af70>] ? __schedule+0x3b0/0x990 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168a60e>] ? schedule_hrtimeout_range_clock+0xbe/0x150 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810b4120>] ? hrtimer_get_res+0x50/0x50 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168a5f6>] ? schedule_hrtimeout_range_clock+0xa6/0x150 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168a6b3>] ? schedule_hrtimeout_range+0x13/0x20 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81212ba7>] ? poll_schedule_timeout+0x67/0xb0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8121357e>] ? do_select+0x73e/0x7c0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff815b5418>] ? ip_finish_output+0x268/0x750 Jan 12 15:37:18 rhel73 kernel: [<ffffffff815b6613>] ? ip_output+0x73/0xe0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81096748>] ? __internal_add_timer+0xc8/0x130 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810967e2>] ? internal_add_timer+0x32/0x70 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81098dcb>] ? mod_timer+0x14b/0x230 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810cde74>] ? update_curr+0x104/0x190 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810ca86e>] ? account_entity_dequeue+0xae/0xd0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810ce35c>] ? dequeue_entity+0x11c/0x5d0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810cec2e>] ? dequeue_task_fair+0x41e/0x660 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810cbcdc>] ? set_next_entity+0x3c/0xe0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810295ec>] ? __switch_to+0x15c/0x4c0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168af70>] ? __schedule+0x3b0/0x990 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168b579>] ? schedule+0x29/0x70 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810f4de3>] ? futex_wait_queue_me+0xd3/0x120 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168d817>] ? _raw_spin_lock+0x37/0x50 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ec5f4>] ? do_huge_pmd_wp_page+0x5a4/0xb80 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811aec3b>] ? 
do_wp_page+0x17b/0x530 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811b0e15>] ? handle_mm_fault+0x705/0xfe0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81060aef>] ? kvm_clock_get_cycles+0x1f/0x30 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81691994>] ? __do_page_fault+0x154/0x450 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81691cc5>] ? do_page_fault+0x35/0x90 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168df88>] ? page_fault+0x28/0x30 Jan 12 15:37:18 rhel73 kernel: Task dump for CPU 1: Jan 12 15:37:18 rhel73 kernel: openshift R running task 0 31540 30953 0x00000088 Jan 12 15:37:18 rhel73 kernel: 0000000000000007 0000000000000000 0000000000000022 0000000000000000 Jan 12 15:37:18 rhel73 kernel: ffff88021b70ff48 ffffffff81691cc5 ffff88023c3f1f60 0000000000000000 Jan 12 15:37:18 rhel73 kernel: 0000000000000070 0000000000000620 000000c4237d4b88 ffffffff8168df88 Jan 12 15:37:18 rhel73 kernel: Call Trace: Jan 12 15:37:18 rhel73 kernel: [<ffffffff81691cc5>] ? do_page_fault+0x35/0x90 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168df88>] ? page_fault+0x28/0x30 Jan 12 15:37:18 rhel73 kernel: Task dump for CPU 3: Jan 12 15:37:18 rhel73 kernel: openshift R running task 0 31004 30953 0x00000088 Jan 12 15:37:18 rhel73 kernel: ffff880215e4b8e8 ffffffff8168af70 ffff88023c3f5e20 ffff880215e4bfd8 Jan 12 15:37:18 rhel73 kernel: ffff880215e4bfd8 ffff880215e4bfd8 ffff88023c3f5e20 ffff880215e4bac8 Jan 12 15:37:18 rhel73 kernel: 000000000000c350 0000000000000000 0000000000000000 0000000000000000 Jan 12 15:37:18 rhel73 kernel: Call Trace: Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168af70>] ? __schedule+0x3b0/0x990 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168a60e>] ? schedule_hrtimeout_range_clock+0xbe/0x150 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810b4120>] ? hrtimer_get_res+0x50/0x50 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168a5f6>] ? schedule_hrtimeout_range_clock+0xa6/0x150 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168a6b3>] ? schedule_hrtimeout_range+0x13/0x20 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81212ba7>] ? poll_schedule_timeout+0x67/0xb0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8121357e>] ? do_select+0x73e/0x7c0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff815b5418>] ? ip_finish_output+0x268/0x750 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810c4cf8>] ? try_to_wake_up+0x1c8/0x330 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810c4e83>] ? wake_up_process+0x23/0x40 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168a1ee>] ? __mutex_unlock_slowpath+0x3e/0x50 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef4d1>] ? mem_cgroup_iter+0x2a1/0x2d0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef54b>] ? tree_stat+0x4b/0x70 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef63a>] ? __mem_cgroup_threshold+0xca/0x140 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef76e>] ? memcg_check_events+0xbe/0x200 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168d817>] ? _raw_spin_lock+0x37/0x50 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ec1d7>] ? do_huge_pmd_wp_page+0x187/0xb80 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811aeba5>] ? do_wp_page+0xe5/0x530 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811b0e15>] ? handle_mm_fault+0x705/0xfe0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168d6ab>] ? _raw_spin_unlock_irqrestore+0x1b/0x40 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81060aef>] ? kvm_clock_get_cycles+0x1f/0x30 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81691994>] ? __do_page_fault+0x154/0x450 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81691cc5>] ? do_page_fault+0x35/0x90 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168df88>] ? 
page_fault+0x28/0x30 Jan 12 15:37:18 rhel73 kernel: Task dump for CPU 4: Jan 12 15:37:18 rhel73 kernel: openshift R running task 0 31001 30953 0x00000088 Jan 12 15:37:18 rhel73 kernel: ffff880246f68fb0 000000009a5f6012 ffff880246f03db0 ffffffff810c41d8 Jan 12 15:37:18 rhel73 kernel: 0000000000000004 ffffffff81a1a680 ffff880246f03dc8 ffffffff810c7a79 Jan 12 15:37:18 rhel73 kernel: 0000000000000005 ffff880246f03df8 ffffffff811372a0 ffff880246f101c0 Jan 12 15:37:18 rhel73 kernel: Call Trace: Jan 12 15:37:18 rhel73 kernel: <IRQ> [<ffffffff810c41d8>] sched_show_task+0xa8/0x110 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810c7a79>] dump_cpu_task+0x39/0x70 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811372a0>] rcu_dump_cpu_stacks+0x90/0xd0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8113a9f2>] rcu_check_callbacks+0x442/0x720 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810f2f80>] ? tick_sched_handle.isra.13+0x60/0x60 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81099177>] update_process_times+0x47/0x80 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810f2f45>] tick_sched_handle.isra.13+0x25/0x60 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810f2fc1>] tick_sched_timer+0x41/0x70 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810b4862>] __hrtimer_run_queues+0xd2/0x260 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810b4e00>] hrtimer_interrupt+0xb0/0x1e0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81697f5c>] ? call_softirq+0x1c/0x30 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810510d7>] local_apic_timer_interrupt+0x37/0x60 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81698bcf>] smp_apic_timer_interrupt+0x3f/0x60 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8169711d>] apic_timer_interrupt+0x6d/0x80 Jan 12 15:37:18 rhel73 kernel: <EOI> [<ffffffff8168d81a>] ? _raw_spin_lock+0x3a/0x50 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ec1d7>] do_huge_pmd_wp_page+0x187/0xb80 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810f4e96>] ? wake_futex+0x66/0x80 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811b0e15>] handle_mm_fault+0x705/0xfe0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81060aef>] ? kvm_clock_get_cycles+0x1f/0x30 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81691994>] __do_page_fault+0x154/0x450 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81691cc5>] do_page_fault+0x35/0x90 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168df88>] page_fault+0x28/0x30 Jan 12 15:37:18 rhel73 kernel: Task dump for CPU 6: Jan 12 15:37:18 rhel73 kernel: openshift R running task 0 31010 30953 0x00000088 Jan 12 15:37:18 rhel73 kernel: ffff88021a6dfb70 ffffffff8168af70 ffff880090520fb0 ffff88021a6dffd8 Jan 12 15:37:18 rhel73 kernel: ffff88021a6dffd8 ffff88021a6dffd8 ffff880090520fb0 ffffffff819ebbe8 Jan 12 15:37:18 rhel73 kernel: ffffffff819ebbec ffff880090520fb0 00000000ffffffff ffff88021a6dfc38 Jan 12 15:37:18 rhel73 kernel: Call Trace: Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168af70>] ? __schedule+0x3b0/0x990 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef3ab>] ? mem_cgroup_iter+0x17b/0x2d0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef4d1>] ? mem_cgroup_iter+0x2a1/0x2d0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef54b>] ? tree_stat+0x4b/0x70 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef63a>] ? __mem_cgroup_threshold+0xca/0x140 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef76e>] ? memcg_check_events+0xbe/0x200 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811f0a97>] ? __mem_cgroup_commit_charge+0xe7/0x370 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168d812>] ? _raw_spin_lock+0x32/0x50 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ec1d7>] ? 
do_huge_pmd_wp_page+0x187/0xb80 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811aeba5>] ? do_wp_page+0xe5/0x530 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810f4e96>] ? wake_futex+0x66/0x80 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811b0e15>] ? handle_mm_fault+0x705/0xfe0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81691994>] ? __do_page_fault+0x154/0x450 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81691cc5>] ? do_page_fault+0x35/0x90 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168df88>] ? page_fault+0x28/0x30 Jan 12 15:37:18 rhel73 kernel: Task dump for CPU 0: Jan 12 15:37:18 rhel73 kernel: openshift R running task 0 2994 30953 0x00000088 Jan 12 15:37:18 rhel73 kernel: ffff8801c8e238e8 ffffffff8168af70 ffff880090305e20 ffff8801c8e23fd8 Jan 12 15:37:18 rhel73 kernel: ffff8801c8e23fd8 ffff8801c8e23fd8 ffff880090305e20 ffff8801c8e23ac8 Jan 12 15:37:18 rhel73 kernel: 000000000000c350 0000000000000000 0000000000000000 0000000000000000 Jan 12 15:37:18 rhel73 kernel: Call Trace: Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168af70>] ? __schedule+0x3b0/0x990 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168a60e>] ? schedule_hrtimeout_range_clock+0xbe/0x150 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810b4120>] ? hrtimer_get_res+0x50/0x50 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168a5f6>] ? schedule_hrtimeout_range_clock+0xa6/0x150 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168a6b3>] ? schedule_hrtimeout_range+0x13/0x20 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81212ba7>] ? poll_schedule_timeout+0x67/0xb0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8121357e>] ? do_select+0x73e/0x7c0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff815b5418>] ? ip_finish_output+0x268/0x750 Jan 12 15:37:18 rhel73 kernel: [<ffffffff815b6613>] ? ip_output+0x73/0xe0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81096748>] ? __internal_add_timer+0xc8/0x130 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810967e2>] ? internal_add_timer+0x32/0x70 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81098dcb>] ? mod_timer+0x14b/0x230 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810cde74>] ? update_curr+0x104/0x190 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810ca86e>] ? account_entity_dequeue+0xae/0xd0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810ce35c>] ? dequeue_entity+0x11c/0x5d0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810cec2e>] ? dequeue_task_fair+0x41e/0x660 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810cbcdc>] ? set_next_entity+0x3c/0xe0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810295ec>] ? __switch_to+0x15c/0x4c0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168af70>] ? __schedule+0x3b0/0x990 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168b579>] ? schedule+0x29/0x70 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810f4de3>] ? futex_wait_queue_me+0xd3/0x120 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168d817>] ? _raw_spin_lock+0x37/0x50 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ec5f4>] ? do_huge_pmd_wp_page+0x5a4/0xb80 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811aec3b>] ? do_wp_page+0x17b/0x530 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811b0e15>] ? handle_mm_fault+0x705/0xfe0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81060aef>] ? kvm_clock_get_cycles+0x1f/0x30 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81691994>] ? __do_page_fault+0x154/0x450 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81691cc5>] ? do_page_fault+0x35/0x90 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168df88>] ? 
page_fault+0x28/0x30 Jan 12 15:37:18 rhel73 kernel: Task dump for CPU 1: Jan 12 15:37:18 rhel73 kernel: openshift R running task 0 31540 30953 0x00000088 Jan 12 15:37:18 rhel73 kernel: ffff88023c3f1f60 0000000088f30397 ffff880246e43db0 ffffffff810c41d8 Jan 12 15:37:18 rhel73 kernel: 0000000000000001 ffffffff81a1a680 ffff880246e43dc8 ffffffff810c7a79 Jan 12 15:37:18 rhel73 kernel: 0000000000000002 ffff880246e43df8 ffffffff811372a0 ffff880246e501c0 Jan 12 15:37:18 rhel73 kernel: Call Trace: Jan 12 15:37:18 rhel73 kernel: <IRQ> [<ffffffff810c41d8>] sched_show_task+0xa8/0x110 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810c7a79>] dump_cpu_task+0x39/0x70 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811372a0>] rcu_dump_cpu_stacks+0x90/0xd0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8113a9f2>] rcu_check_callbacks+0x442/0x720 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810f2f80>] ? tick_sched_handle.isra.13+0x60/0x60 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81099177>] update_process_times+0x47/0x80 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810f2f45>] tick_sched_handle.isra.13+0x25/0x60 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810f2fc1>] tick_sched_timer+0x41/0x70 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810b4862>] __hrtimer_run_queues+0xd2/0x260 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810b4e00>] hrtimer_interrupt+0xb0/0x1e0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81697f5c>] ? call_softirq+0x1c/0x30 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810510d7>] local_apic_timer_interrupt+0x37/0x60 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81698bcf>] smp_apic_timer_interrupt+0x3f/0x60 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8169711d>] apic_timer_interrupt+0x6d/0x80 Jan 12 15:37:18 rhel73 kernel: <EOI> [<ffffffff8168d817>] ? _raw_spin_lock+0x37/0x50 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ec1d7>] do_huge_pmd_wp_page+0x187/0xb80 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811aeba5>] ? do_wp_page+0xe5/0x530 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811b0e15>] handle_mm_fault+0x705/0xfe0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81691994>] __do_page_fault+0x154/0x450 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81691cc5>] do_page_fault+0x35/0x90 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168df88>] page_fault+0x28/0x30 Jan 12 15:37:18 rhel73 kernel: Task dump for CPU 3: Jan 12 15:37:18 rhel73 kernel: openshift R running task 0 31004 30953 0x00000088 Jan 12 15:37:18 rhel73 kernel: ffff880215e4b8e8 ffffffff8168af70 ffff88023c3f5e20 ffff880215e4bfd8 Jan 12 15:37:18 rhel73 kernel: ffff880215e4bfd8 ffff880215e4bfd8 ffff88023c3f5e20 ffff880215e4bac8 Jan 12 15:37:18 rhel73 kernel: 000000000000c350 0000000000000000 0000000000000000 0000000000000000 Jan 12 15:37:18 rhel73 kernel: Call Trace: Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168af70>] ? __schedule+0x3b0/0x990 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168a60e>] ? schedule_hrtimeout_range_clock+0xbe/0x150 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810b4120>] ? hrtimer_get_res+0x50/0x50 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168a5f6>] ? schedule_hrtimeout_range_clock+0xa6/0x150 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168a6b3>] ? schedule_hrtimeout_range+0x13/0x20 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81212ba7>] ? poll_schedule_timeout+0x67/0xb0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8121357e>] ? do_select+0x73e/0x7c0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff815b5418>] ? ip_finish_output+0x268/0x750 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810c4cf8>] ? try_to_wake_up+0x1c8/0x330 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810c4e83>] ? 
wake_up_process+0x23/0x40 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168a1ee>] ? __mutex_unlock_slowpath+0x3e/0x50 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef4d1>] ? mem_cgroup_iter+0x2a1/0x2d0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef54b>] ? tree_stat+0x4b/0x70 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef63a>] ? __mem_cgroup_threshold+0xca/0x140 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef76e>] ? memcg_check_events+0xbe/0x200 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168d817>] ? _raw_spin_lock+0x37/0x50 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ec1d7>] ? do_huge_pmd_wp_page+0x187/0xb80 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811aeba5>] ? do_wp_page+0xe5/0x530 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811b0e15>] ? handle_mm_fault+0x705/0xfe0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168d6ab>] ? _raw_spin_unlock_irqrestore+0x1b/0x40 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81060aef>] ? kvm_clock_get_cycles+0x1f/0x30 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81691994>] ? __do_page_fault+0x154/0x450 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81691cc5>] ? do_page_fault+0x35/0x90 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168df88>] ? page_fault+0x28/0x30 Jan 12 15:37:18 rhel73 kernel: Task dump for CPU 4: Jan 12 15:37:18 rhel73 kernel: openshift R running task 0 31001 30953 0x00000088 Jan 12 15:37:18 rhel73 kernel: ffff88021af878e8 ffffffff8168af70 ffff880246f68fb0 ffff88021af87fd8 Jan 12 15:37:18 rhel73 kernel: ffff88021af87fd8 ffff88021af87fd8 ffff880246f68fb0 ffff88021af87ac8 Jan 12 15:37:18 rhel73 kernel: 000000000000c350 0000000000000000 0000000000000000 0000000000000000 Jan 12 15:37:18 rhel73 kernel: Call Trace: Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168af70>] ? __schedule+0x3b0/0x990 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168a60e>] ? schedule_hrtimeout_range_clock+0xbe/0x150 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810b4120>] ? hrtimer_get_res+0x50/0x50 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168a5f6>] ? schedule_hrtimeout_range_clock+0xa6/0x150 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168a6b3>] ? schedule_hrtimeout_range+0x13/0x20 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81212ba7>] ? poll_schedule_timeout+0x67/0xb0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8121357e>] ? do_select+0x73e/0x7c0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff815718fb>] ? __dev_queue_xmit+0x27b/0x570 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810c4cf8>] ? try_to_wake_up+0x1c8/0x330 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810c4e83>] ? wake_up_process+0x23/0x40 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168a1ee>] ? __mutex_unlock_slowpath+0x3e/0x50 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef4d1>] ? mem_cgroup_iter+0x2a1/0x2d0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef54b>] ? tree_stat+0x4b/0x70 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef63a>] ? __mem_cgroup_threshold+0xca/0x140 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef76e>] ? memcg_check_events+0xbe/0x200 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168d81a>] ? _raw_spin_lock+0x3a/0x50 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ec1d7>] ? do_huge_pmd_wp_page+0x187/0xb80 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810f4e96>] ? wake_futex+0x66/0x80 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811b0e15>] ? handle_mm_fault+0x705/0xfe0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81060aef>] ? kvm_clock_get_cycles+0x1f/0x30 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81691994>] ? __do_page_fault+0x154/0x450 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81691cc5>] ? do_page_fault+0x35/0x90 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168df88>] ? 
page_fault+0x28/0x30 Jan 12 15:37:18 rhel73 kernel: Task dump for CPU 6: Jan 12 15:37:18 rhel73 kernel: openshift R running task 0 31010 30953 0x00000088 Jan 12 15:37:18 rhel73 kernel: ffff88021a6dfb70 ffffffff8168af70 ffff880090520fb0 ffff88021a6dffd8 Jan 12 15:37:18 rhel73 kernel: ffff88021a6dffd8 ffff88021a6dffd8 ffff880090520fb0 ffffffff819ebbe8 Jan 12 15:37:18 rhel73 kernel: ffffffff819ebbec ffff880090520fb0 00000000ffffffff ffff88021a6dfc38 Jan 12 15:37:18 rhel73 kernel: Call Trace: Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168af70>] ? __schedule+0x3b0/0x990 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef3ab>] ? mem_cgroup_iter+0x17b/0x2d0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef4d1>] ? mem_cgroup_iter+0x2a1/0x2d0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef54b>] ? tree_stat+0x4b/0x70 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef63a>] ? __mem_cgroup_threshold+0xca/0x140 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef76e>] ? memcg_check_events+0xbe/0x200 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811f0a97>] ? __mem_cgroup_commit_charge+0xe7/0x370 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168d812>] ? _raw_spin_lock+0x32/0x50 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ec1d7>] ? do_huge_pmd_wp_page+0x187/0xb80 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811aeba5>] ? do_wp_page+0xe5/0x530 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810f4e96>] ? wake_futex+0x66/0x80 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811b0e15>] ? handle_mm_fault+0x705/0xfe0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81691994>] ? __do_page_fault+0x154/0x450 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81691cc5>] ? do_page_fault+0x35/0x90 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168df88>] ? page_fault+0x28/0x30 Jan 12 15:37:18 rhel73 kernel: { 3} (t=60014 jiffies g=454316 c=454315 q=50798) Jan 12 15:37:18 rhel73 kernel: Task dump for CPU 0: Jan 12 15:37:18 rhel73 kernel: openshift R running task 0 2994 30953 0x00000088 Jan 12 15:37:18 rhel73 kernel: ffff8801c8e238e8 ffffffff8168af70 ffff880090305e20 ffff8801c8e23fd8 Jan 12 15:37:18 rhel73 kernel: ffff8801c8e23fd8 ffff8801c8e23fd8 ffff880090305e20 ffff8801c8e23ac8 Jan 12 15:37:18 rhel73 kernel: 000000000000c350 0000000000000000 0000000000000000 0000000000000000 Jan 12 15:37:18 rhel73 kernel: Call Trace: Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168af70>] ? __schedule+0x3b0/0x990 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168a60e>] ? schedule_hrtimeout_range_clock+0xbe/0x150 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810b4120>] ? hrtimer_get_res+0x50/0x50 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168a5f6>] ? schedule_hrtimeout_range_clock+0xa6/0x150 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168a6b3>] ? schedule_hrtimeout_range+0x13/0x20 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81212ba7>] ? poll_schedule_timeout+0x67/0xb0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8121357e>] ? do_select+0x73e/0x7c0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff815b5418>] ? ip_finish_output+0x268/0x750 Jan 12 15:37:18 rhel73 kernel: [<ffffffff815b6613>] ? ip_output+0x73/0xe0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81096748>] ? __internal_add_timer+0xc8/0x130 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810967e2>] ? internal_add_timer+0x32/0x70 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81098dcb>] ? mod_timer+0x14b/0x230 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810cde74>] ? update_curr+0x104/0x190 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810ca86e>] ? account_entity_dequeue+0xae/0xd0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810ce35c>] ? 
dequeue_entity+0x11c/0x5d0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810cec2e>] ? dequeue_task_fair+0x41e/0x660 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810cbcdc>] ? set_next_entity+0x3c/0xe0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810295ec>] ? __switch_to+0x15c/0x4c0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168af70>] ? __schedule+0x3b0/0x990 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168b579>] ? schedule+0x29/0x70 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810f4de3>] ? futex_wait_queue_me+0xd3/0x120 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168d817>] ? _raw_spin_lock+0x37/0x50 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ec5f4>] ? do_huge_pmd_wp_page+0x5a4/0xb80 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811aec3b>] ? do_wp_page+0x17b/0x530 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811b0e15>] ? handle_mm_fault+0x705/0xfe0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81060aef>] ? kvm_clock_get_cycles+0x1f/0x30 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81691994>] ? __do_page_fault+0x154/0x450 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81691cc5>] ? do_page_fault+0x35/0x90 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168df88>] ? page_fault+0x28/0x30 Jan 12 15:37:18 rhel73 kernel: Task dump for CPU 1: Jan 12 15:37:18 rhel73 kernel: openshift R running task 0 31540 30953 0x00000088 Jan 12 15:37:18 rhel73 kernel: 0000000000000007 0000000000000000 0000000000000022 0000000000000000 Jan 12 15:37:18 rhel73 kernel: ffff88021b70ff48 ffffffff81691cc5 ffff88023c3f1f60 0000000000000000 Jan 12 15:37:18 rhel73 kernel: 0000000000000070 0000000000000620 000000c4237d4b88 ffffffff8168df88 Jan 12 15:37:18 rhel73 kernel: Call Trace: Jan 12 15:37:18 rhel73 kernel: [<ffffffff81691cc5>] ? do_page_fault+0x35/0x90 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168df88>] ? page_fault+0x28/0x30 Jan 12 15:37:18 rhel73 kernel: Task dump for CPU 3: Jan 12 15:37:18 rhel73 kernel: openshift R running task 0 31004 30953 0x00000088 Jan 12 15:37:18 rhel73 kernel: ffff88023c3f5e20 0000000059214541 ffff880246ec3db0 ffffffff810c41d8 Jan 12 15:37:18 rhel73 kernel: 0000000000000003 ffffffff81a1a680 ffff880246ec3dc8 ffffffff810c7a79 Jan 12 15:37:18 rhel73 kernel: 0000000000000004 ffff880246ec3df8 ffffffff811372a0 ffff880246ed01c0 Jan 12 15:37:18 rhel73 kernel: Call Trace: Jan 12 15:37:18 rhel73 kernel: <IRQ> [<ffffffff810c41d8>] sched_show_task+0xa8/0x110 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810c7a79>] dump_cpu_task+0x39/0x70 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811372a0>] rcu_dump_cpu_stacks+0x90/0xd0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8113a9f2>] rcu_check_callbacks+0x442/0x720 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810f2f80>] ? tick_sched_handle.isra.13+0x60/0x60 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81099177>] update_process_times+0x47/0x80 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810f2f45>] tick_sched_handle.isra.13+0x25/0x60 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810f2fc1>] tick_sched_timer+0x41/0x70 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810b4862>] __hrtimer_run_queues+0xd2/0x260 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810b4e00>] hrtimer_interrupt+0xb0/0x1e0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81697f5c>] ? call_softirq+0x1c/0x30 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810510d7>] local_apic_timer_interrupt+0x37/0x60 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81698bcf>] smp_apic_timer_interrupt+0x3f/0x60 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8169711d>] apic_timer_interrupt+0x6d/0x80 Jan 12 15:37:18 rhel73 kernel: <EOI> [<ffffffff8168d817>] ? 
_raw_spin_lock+0x37/0x50 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ec1d7>] do_huge_pmd_wp_page+0x187/0xb80 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811aeba5>] ? do_wp_page+0xe5/0x530 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811b0e15>] handle_mm_fault+0x705/0xfe0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168d6ab>] ? _raw_spin_unlock_irqrestore+0x1b/0x40 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81060aef>] ? kvm_clock_get_cycles+0x1f/0x30 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81691994>] __do_page_fault+0x154/0x450 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81691cc5>] do_page_fault+0x35/0x90 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168df88>] page_fault+0x28/0x30 Jan 12 15:37:18 rhel73 kernel: Task dump for CPU 4: Jan 12 15:37:18 rhel73 kernel: openshift R running task 0 31001 30953 0x00000088 Jan 12 15:37:18 rhel73 kernel: ffff88021af878e8 ffffffff8168af70 ffff880246f68fb0 ffff88021af87fd8 Jan 12 15:37:18 rhel73 kernel: ffff88021af87fd8 ffff88021af87fd8 ffff880246f68fb0 ffff88021af87ac8 Jan 12 15:37:18 rhel73 kernel: 000000000000c350 0000000000000000 0000000000000000 0000000000000000 Jan 12 15:37:18 rhel73 kernel: Call Trace: Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168af70>] ? __schedule+0x3b0/0x990 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168a60e>] ? schedule_hrtimeout_range_clock+0xbe/0x150 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810b4120>] ? hrtimer_get_res+0x50/0x50 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168a5f6>] ? schedule_hrtimeout_range_clock+0xa6/0x150 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168a6b3>] ? schedule_hrtimeout_range+0x13/0x20 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81212ba7>] ? poll_schedule_timeout+0x67/0xb0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8121357e>] ? do_select+0x73e/0x7c0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff815718fb>] ? __dev_queue_xmit+0x27b/0x570 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810c4cf8>] ? try_to_wake_up+0x1c8/0x330 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810c4e83>] ? wake_up_process+0x23/0x40 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168a1ee>] ? __mutex_unlock_slowpath+0x3e/0x50 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef4d1>] ? mem_cgroup_iter+0x2a1/0x2d0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef54b>] ? tree_stat+0x4b/0x70 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef63a>] ? __mem_cgroup_threshold+0xca/0x140 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef76e>] ? memcg_check_events+0xbe/0x200 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168d817>] ? _raw_spin_lock+0x37/0x50 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ec1d7>] ? do_huge_pmd_wp_page+0x187/0xb80 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810f4e96>] ? wake_futex+0x66/0x80 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811b0e15>] ? handle_mm_fault+0x705/0xfe0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81060aef>] ? kvm_clock_get_cycles+0x1f/0x30 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81691994>] ? __do_page_fault+0x154/0x450 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81691cc5>] ? do_page_fault+0x35/0x90 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168df88>] ? 
page_fault+0x28/0x30 Jan 12 15:37:18 rhel73 kernel: Task dump for CPU 6: Jan 12 15:37:18 rhel73 kernel: openshift R running task 0 31010 30953 0x00000088 Jan 12 15:37:18 rhel73 kernel: ffff88021a6dfb70 ffffffff8168af70 ffff880090520fb0 ffff88021a6dffd8 Jan 12 15:37:18 rhel73 kernel: ffff88021a6dffd8 ffff88021a6dffd8 ffff880090520fb0 ffffffff819ebbe8 Jan 12 15:37:18 rhel73 kernel: ffffffff819ebbec ffff880090520fb0 00000000ffffffff ffff88021a6dfc38 Jan 12 15:37:18 rhel73 kernel: Call Trace: Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168af70>] ? __schedule+0x3b0/0x990 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef3ab>] ? mem_cgroup_iter+0x17b/0x2d0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef4d1>] ? mem_cgroup_iter+0x2a1/0x2d0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef54b>] ? tree_stat+0x4b/0x70 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef63a>] ? __mem_cgroup_threshold+0xca/0x140 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ef76e>] ? memcg_check_events+0xbe/0x200 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811f0a97>] ? __mem_cgroup_commit_charge+0xe7/0x370 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168d817>] ? _raw_spin_lock+0x37/0x50 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811ec1d7>] ? do_huge_pmd_wp_page+0x187/0xb80 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811aeba5>] ? do_wp_page+0xe5/0x530 Jan 12 15:37:18 rhel73 kernel: [<ffffffff810f4e96>] ? wake_futex+0x66/0x80 Jan 12 15:37:18 rhel73 kernel: [<ffffffff811b0e15>] ? handle_mm_fault+0x705/0xfe0 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81691994>] ? __do_page_fault+0x154/0x450 Jan 12 15:37:18 rhel73 kernel: [<ffffffff81691cc5>] ? do_page_fault+0x35/0x90 Jan 12 15:37:18 rhel73 kernel: [<ffffffff8168df88>] ? page_fault+0x28/0x30

A similar issue was seen upstream around iptables updates: https://github.com/kubernetes/kubernetes/issues/37853

This almost certainly has little to nothing to do with iptables. The lockups go away entirely when we stop using the kernel memcg threshold notifier. It could certainly be some odd interaction between the threshold notifier and other subsystems, such as iptables, but iptables is very unlikely to be near the core of the bug.

Do we know if the system ever recovers from this, or does it hang completely? Larry

I've never seen it recover. Our Jenkins jobs have a 6-hour timeout, and when we run into this issue, the jobs remain frozen until they are aborted after 6 hours. Also, this is happening on the RHEL 7.3 GA kernel, kernel-3.10.0-514.el7.x86_64.

Do we know if it worked OK in an earlier version of RHEL 7? Larry

The pmd page table lock seems to be what the task cannot get here:

RIP: 0010:[<ffffffff8168d812>]  [<ffffffff8168d812>] _raw_spin_lock+0x32/0x50
Jan 12 15:37:14 rhel73 kernel: Call Trace:
 [<ffffffff811ec5f4>] do_huge_pmd_wp_page+0x5a4/0xb80
 [<ffffffff811aec3b>] ? do_wp_page+0x17b/0x530
 [<ffffffff811b0e15>] handle_mm_fault+0x705/0xfe0
 [<ffffffff81691994>] __do_page_fault+0x154/0x450
 [<ffffffff81691cc5>] do_page_fault+0x35/0x90
 [<ffffffff8168df88>] page_fault+0x28/0x30

int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
                        unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
{
        spinlock_t *ptl;
        int ret = 0;
        struct page *page = NULL, *new_page;
        unsigned long haddr;
        unsigned long mmun_start;       /* For mmu_notifiers */
        unsigned long mmun_end;         /* For mmu_notifiers */

        ptl = pmd_lockptr(mm, pmd);
        VM_BUG_ON(!vma->anon_vma);
        haddr = address & HPAGE_PMD_MASK;
        if (is_huge_zero_pmd(orig_pmd))
                goto alloc;
>>>     spin_lock(ptl);

Can you tell me how you are using the kernel memcg threshold notifier, as was stated in Comment #3, so I can try to determine whether it is related? Larry

Larry, I believe we also saw it in 3.10.0-327.22.2.el7.x86_64. This is the code that uses the memcg threshold notifier: https://github.com/kubernetes/kubernetes/pull/32577/files. See also https://github.com/kubernetes/kubernetes/issues/37853 (lots of data and comments, but the lockups went away after we disabled the memcg threshold notifier by default).
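As an editorial aside (not part of the original report): the __mem_cgroup_threshold/memcg_check_events frames in the task dumps above are the kernel side of the cgroup-v1 memory threshold notification interface that a user-space notifier of this kind relies on (documented in Documentation/cgroup-v1/memory.txt under "Memory thresholds"). Below is a minimal sketch of registering such a notification; the cgroup path and threshold are placeholders, and this is not the kubelet's actual implementation.

/* Sketch: arm a cgroup-v1 memcg threshold notification via eventfd.
 * Placeholders: the cgroup mount path and the 1.5 GiB threshold. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
        const char *memcg = "/sys/fs/cgroup/memory";          /* placeholder cgroup */
        const unsigned long long threshold = 1536ULL << 20;   /* 1.5 GiB, placeholder */
        char path[256], line[64];
        uint64_t count;
        int n;

        int efd = eventfd(0, 0);                               /* notification endpoint */
        snprintf(path, sizeof(path), "%s/memory.usage_in_bytes", memcg);
        int ufd = open(path, O_RDONLY);
        snprintf(path, sizeof(path), "%s/cgroup.event_control", memcg);
        int cfd = open(path, O_WRONLY);
        if (efd < 0 || ufd < 0 || cfd < 0) {
                perror("setup");
                return 1;
        }

        /* Writing "<event_fd> <usage_fd> <threshold>" arms the threshold notifier. */
        n = snprintf(line, sizeof(line), "%d %d %llu", efd, ufd, threshold);
        if (write(cfd, line, n) != n) {
                perror("cgroup.event_control");
                return 1;
        }

        /* The eventfd read returns each time usage crosses the threshold
         * (in either direction). */
        while (read(efd, &count, sizeof(count)) == sizeof(count))
                printf("memcg threshold crossed (%llu event(s))\n",
                       (unsigned long long)count);
        return 0;
}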
Andy, does the notifier actually get invoked when/before this hang happens? Larry

I don't think we have a definitive answer on that. Some people upstream claimed they were certain they never got over the threshold and that merely having it enabled was causing the problem. But another person upstream said they could soft-lock very quickly if they caused the system to go above and below the threshold repeatedly.

Eric, does this also happen with the upstream kernel, or is it just RHEL 7? Larry

The newest kernel that I am aware of that definitely exhibited the problem was a Debian 3.16-based kernel. I'll see if anyone can confirm anything newer.

Thanks Eric; actually, you could install a pegas x86_64 kernel on your system and try that. Also, did this ever work in the past, or has it always hung like this? Larry

I believe that as soon as we added the memcg threshold notifier code to Kubernetes, people started seeing soft lockups. Hopefully Andy can get it running some time today. Since the reproducer is so unreliable, it would be impossible for us to ever say the pegas kernel is definitely fixed, but maybe we'll be able to say that it is broken.

I installed the pegas kernel and ran both the core and conformance test suites for OpenShift in an endless loop and was unable to reproduce the soft lockup. I ran the tests for over 24 hours. While I can't say so with 100% certainty, I believe that whatever the bug is, it has been fixed in the pegas kernel.

Andy, thanks for doing this; I don't even know how to reproduce this myself. Now can I have you reboot into the RHEL 7.3 kernel and disable THP? The code path for the 2 MB huge pages is different from the one for the smaller 4 KB pages, and I suspect the problem is in the 2 MB page COW fault code. Thanks, Larry

I have rolled back to the 3.10 kernel and disabled THP. I'll let the test run in a loop for a while and report back later.

So far so good (no lockups). I'll let it keep running for the rest of the day. If we don't see any lockups with THP disabled, what is the next step?

Andy, did you ever see a lockup with THP disabled? Were you able to confirm that your reproducer was still able to break with THP enabled? What's our next step here?

I ran it for somewhere between 24 and 48 hours and it never locked up with THP disabled. I attempted to reproduce with THP enabled but wasn't able to. I can try again, but who knows whether it will lock up and, if it does, how long it will take to observe.
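As a point of reference (not taken from this report): "disabling THP" on RHEL 7 normally means booting with transparent_hugepage=never or writing never to /sys/kernel/mm/transparent_hugepage/enabled. The small sketch below only illustrates that sysfs switch; the --disable flag name is invented for the example.

/* Sketch: print the current THP mode and, with "--disable", switch it to
 * "never" (requires root; the setting lasts only until reboot). */
#include <stdio.h>
#include <string.h>

#define THP_ENABLED "/sys/kernel/mm/transparent_hugepage/enabled"

int main(int argc, char **argv)
{
        char buf[128] = "";
        FILE *f = fopen(THP_ENABLED, "r");

        if (!f) { perror(THP_ENABLED); return 1; }
        if (fgets(buf, sizeof(buf), f))
                printf("current: %s", buf);   /* e.g. "[always] madvise never" */
        fclose(f);

        if (argc > 1 && strcmp(argv[1], "--disable") == 0) {
                f = fopen(THP_ENABLED, "w");
                if (!f) { perror(THP_ENABLED); return 1; }
                fputs("never", f);
                fclose(f);
        }
        return 0;
}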
What's the status of the bug?

When 'experimental-kernel-memcg-notification' is enabled to test memory eviction on AH-7.3.3, the node hangs once memory eviction is triggered and the pods on that node become Unknown.

Version-Release number of selected component (if applicable):

[root@qe-dma358-master-1 ~]# openshift version
openshift v3.5.0.45
kubernetes v1.5.2+43a9be4
etcd 3.1.0
AH-7.3.3

How reproducible: 80%

Steps to Reproduce:

1. Enable 'experimental-kernel-memcg-notification' in /etc/origin/node/node-config.yaml:

kubeletArguments:
  experimental-kernel-memcg-notification:
  - 'true'
  eviction-soft:
  - "memory.available<1.5Gi"
  eviction-soft-grace-period:
  - "memory.available=30s"
  eviction-max-pod-grace-period:
  - "10"
  eviction-pressure-transition-period:
  - "1m0s"

2. Create some pods on the node to trigger memory eviction:

for i in {1..5}; do kubectl create -f https://raw.githubusercontent.com/derekwaynecarr/kubernetes/examples-eviction/demo/kubelet-eviction/besteffort-pod.yaml -n dma ; done
for i in {1..5}; do kubectl create -f https://raw.githubusercontent.com/derekwaynecarr/kubernetes/examples-eviction/demo/kubelet-eviction/burstable-pod.yaml -n dma ; done
for i in {1..5}; do kubectl create -f https://raw.githubusercontent.com/derekwaynecarr/kubernetes/examples-eviction/demo/kubelet-eviction/guaranteed-pod.yaml -n dma ; done

3. Wait and check node and pod status:

[root@qe-dma358-master-1 ~]# oc get pod -n dma
NAME               READY   STATUS    RESTARTS   AGE
besteffort-0f878   1/1     Unknown   0          12m
besteffort-6f1w7   1/1     Unknown   0          28m
besteffort-71g0w   1/1     Unknown   0          28m
besteffort-8kk19   1/1     Unknown   0          12m
besteffort-b9d9q   1/1     Unknown   0          12m
besteffort-lvrkz   1/1     Unknown   0          12m
burstable-1v0d9    1/1     Unknown   0          12m
burstable-73bc5    1/1     Unknown   0          28m
burstable-8zwc8    1/1     Unknown   0          12m
burstable-fv4wb    1/1     Unknown   0          28m
burstable-jqq4p    1/1     Unknown   0          12m
burstable-k9j2t    1/1     Unknown   0          12m
burstable-lg9c5    1/1     Unknown   0          28m
burstable-q0jkh    1/1     Unknown   0          12m
burstable-wqqrm    1/1     Unknown   0          28m
burstable-xqb0c    1/1     Unknown   0          28m
guaranteed-139p4   0/1     Pending   0          10m
guaranteed-2ql85   1/1     Unknown   0          12m
guaranteed-3txss   0/1     Pending   0          10m
guaranteed-95wsd   1/1     Unknown   0          12m
guaranteed-961bc   1/1     Unknown   0          12m
guaranteed-cv2l3   1/1     Unknown   0          27m
guaranteed-d3nmp   1/1     Unknown   0          12m
guaranteed-ddpqc   0/1     Pending   0          10m
guaranteed-gmr2w   1/1     Unknown   0          27m
guaranteed-jrnp8   1/1     Unknown   0          27m
guaranteed-r09nt   0/1     Pending   0          10m
guaranteed-rffbt   1/1     Unknown   0          27m
guaranteed-t5240   1/1     Unknown   0          27m
guaranteed-vrxdq   1/1     Unknown   0          12m
guaranteed-vzk2j   0/1     Pending   0          10m

[root@qe-dma358-master-1 ~]# oc get node
NAME                               STATUS                     AGE
qe-dma358-master-1                 Ready,SchedulingDisabled   10h
qe-dma358-node-registry-router-1   Ready,SchedulingDisabled   10h
qe-dma358-node-registry-router-2   NotReady                   10h

4. Once the node becomes NotReady, it is no longer possible to ssh to it from another terminal.

I could have sworn I posted a needinfo for Larry, but I guess I didn't :-( Larry, where do we go from here? What are the next steps?

FYI, I was not able to reproduce this with a reproducer similar to the one in comment #27 on a single-host plain RHEL 7.3.3 + k8s setup. I triggered the pod memory evictions repeatedly, but everything kept running normally afterwards, except that kube-apiserver might get evicted since it runs as a pod. I am not sure whether that is because some OpenShift code (logging, SDN, etc.) is needed to trigger it, or because it requires master and node on different hosts so that some networking code is exercised first.

One thing I noticed in the vmcore from comment #5 is that it is missing the user pages, so it cannot tell us which extra command arguments were in use on the system at the time. If anyone is able to capture a vmcore again with core_collector makedumpfile -d 23 in kdump.conf, that might help solve the puzzle.
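For anyone capturing that vmcore: the request amounts to lowering the makedumpfile dump level in /etc/kdump.conf so that user pages are kept (-d 23 rather than the commonly used -d 31) and re-arming the kdump service. A sketch; any additional makedumpfile flags should follow whatever the local default line already uses:

# /etc/kdump.conf (sketch): keep user pages in the dump
core_collector makedumpfile -d 23

# then reload the kdump configuration
systemctl restart kdump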
From what I can gather, we are recommending disabling THP, but it's not clear to me whether, with THP disabled, we would still see the problem the moment OpenShift nodes start exercising the hugetlb cgroup, as we hope to do later this year...

This is clearly a leak in releasing the pmd lock in some path of the THP handling code. We have looked for any missing RHEL 7 code that is upstream but have not found it yet. Andrea and I are still looking at the code to see if we can spot it and fix it in RHEL 7. Larry

Any update?

What's the status of the bug?

Has anyone observed this causing an OpenShift guest to go into an "in shutdown" state on a KVM hypervisor?