Red Hat Bugzilla – Bug 671477
[RHEL6.1] possible vmalloc_sync_all() bug
Last modified: 2013-07-03 03:27:56 EDT
Description of problem:
Multiple BUGs reporting CPUs being stuck:

BUG: soft lockup - CPU#5 stuck for 61s! [stapio:8604]
Modules linked in: stap_850fe20eb529fd0f6b13f9a95a4cdd61_882(U) cryptd aes_x86_64 aes_generic ts_kmp nls_koi8_u nls_cp932 sunrpc cpufreq_ondemand acpi_cpufreq freq_table ipv6 dm_mirror dm_region_hash dm_log i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support e1000e ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix dm_mod [last unloaded: stap_02b9727bb984f2da043cd88e66031bd0_878]
irq event stamp: 1099684
hardirqs last enabled at (1099683): [<ffffffff8100bc10>] restore_args+0x0/0x30
hardirqs last disabled at (1099684): [<ffffffff8100afea>] save_args+0x6a/0x70
softirqs last enabled at (1099680): [<ffffffff8107095d>] __do_softirq+0x14d/0x220
softirqs last disabled at (1099667): [<ffffffff8100c3cc>] call_softirq+0x1c/0x30
CPU 5:
Modules linked in: stap_850fe20eb529fd0f6b13f9a95a4cdd61_882(U) cryptd aes_x86_64 aes_generic ts_kmp nls_koi8_u nls_cp932 sunrpc cpufreq_ondemand acpi_cpufreq freq_table ipv6 dm_mirror dm_region_hash dm_log i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support e1000e ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix dm_mod [last unloaded: stap_02b9727bb984f2da043cd88e66031bd0_878]
Pid: 8604, comm: stapio Not tainted 2.6.32-102.el6scratch.x86_64.debug #1 Express5800/T110b [N8100-1589]
RIP: 0010:[<ffffffff81044e38>]  [<ffffffff81044e38>] flush_tlb_others_ipi+0x118/0x130
RSP: 0000:ffff88003a181828  EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff88003a181868 RCX: 0000000000000008
RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffffffff82009550
RBP: ffffffff8100bd8e R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
R13: 0000000000000000 R14: ffff88003a180000 R15: ffffffff81791100
FS:  00007f4b4f00c710(0000) GS:ffff880004800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f4b4efebd2c CR3: 000000003a486000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff81044e48>] ? flush_tlb_others_ipi+0x128/0x130
 [<ffffffff81044ec6>] ? native_flush_tlb_others+0x76/0x90
 [<ffffffff81044fee>] ? flush_tlb_page+0x5e/0xb0
 [<ffffffff81043d50>] ? ptep_clear_flush_young+0x50/0x70
 [<ffffffff8114f02c>] ? page_referenced_one+0x9c/0x1d0
 [<ffffffff8114e829>] ? page_lock_anon_vma+0x69/0xb0
 [<ffffffff8114e7c0>] ? page_lock_anon_vma+0x0/0xb0
 [<ffffffff8114fd92>] ? page_referenced+0x2f2/0x3f0
 [<ffffffff814faa30>] ? _spin_unlock_irq+0x30/0x40
 [<ffffffff810a76cd>] ? trace_hardirqs_on_caller+0x14d/0x190
 [<ffffffff81134824>] ? shrink_active_list+0x1c4/0x370
 [<ffffffff81096a8d>] ? sched_clock_cpu+0xcd/0x110
 [<ffffffff81135f6d>] ? shrink_zone+0x34d/0x510
 [<ffffffff8109b8b9>] ? ktime_get_ts+0xa9/0xe0
 [<ffffffff8113624e>] ? do_try_to_free_pages+0x11e/0x520
 [<ffffffff8113684d>] ? try_to_free_pages+0x9d/0x130
 [<ffffffff81133bb0>] ? isolate_pages_global+0x0/0x3a0
 [<ffffffff8112ddc0>] ? __alloc_pages_nodemask+0x4a0/0x910
 [<ffffffff81013233>] ? native_sched_clock+0x13/0x60
 [<ffffffff81162ee3>] ? alloc_pages_vma+0x93/0x150
 [<ffffffff8117f8f5>] ? do_huge_pmd_anonymous_page+0x135/0x310
 [<ffffffff814fdd77>] ? do_page_fault+0xc7/0x3c0
 [<ffffffff811467d5>] ? handle_mm_fault+0x245/0x2b0
 [<ffffffff814fddee>] ? do_page_fault+0x13e/0x3c0
 [<ffffffff814fb625>] ? page_fault+0x25/0x30

Full console log: http://rhts.redhat.com/testlogs/2011/01/182206/474344/3992197/console.txt

When the NMI watchdog kicks in:

sending NMI to all CPUs:
NMI backtrace for cpu 4
CPU 4:
Modules linked in: stap_850fe20eb529fd0f6b13f9a95a4cdd61_882(U) cryptd aes_x86_64 aes_generic ts_kmp nls_koi8_u nls_cp932 sunrpc cpufreq_ondemand acpi_cpufreq freq_table ipv6 dm_mirror dm_region_hash dm_log i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support e1000e ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix dm_mod [last unloaded: stap_02b9727bb984f2da043cd88e66031bd0_878]
Pid: 8593, comm: stapio Not tainted 2.6.32-102.el6scratch.x86_64.debug #1 Express5800/T110b [N8100-1589]
RIP: 0010:[<ffffffff81283e11>]  [<ffffffff81283e11>] delay_tsc+0x61/0x80
RSP: 0018:ffff88003a587d60  EFLAGS: 00000093
RAX: 000000000bbb6f15 RBX: ffff88003b21ef40 RCX: 000000000bbb6f15
RDX: 0000000000000072 RSI: ffff880004612340 RDI: 0000000000000001
RBP: ffff88003a587d68 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000001 R12: 00000000a6726d88
R13: ffff88003bf287c0 R14: ffff88003bf28eb8 R15: 00000000a12d7b43
FS:  00007f4b501ba700(0000) GS:ffff880004600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f4b4f00bff8 CR3: 000000003a486000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 <#DB[1]> <<EOE>>
Pid: 8593, comm: stapio Not tainted 2.6.32-102.el6scratch.x86_64.debug #1
Call Trace:
 <NMI> [<ffffffff81009d59>] ? show_regs+0x49/0x50
 [<ffffffff814fcca8>] nmi_watchdog_tick+0x1d8/0x200
 [<ffffffff814fbde3>] do_nmi+0x1d3/0x300
 [<ffffffff814fb940>] nmi+0x20/0x39
 [<ffffffff81283e11>] ? delay_tsc+0x61/0x80
 <<EOE>> [<ffffffff81283d4f>] ? __delay+0xf/0x20
 [<ffffffff81289570>] _raw_spin_lock+0x110/0x180
 [<ffffffff814facd6>] _spin_lock+0x56/0x70
 [<ffffffff8103fcee>] ? vmalloc_sync_all+0x10e/0x180
 [<ffffffff814faaeb>] ? _spin_unlock+0x2b/0x40
 [<ffffffff8103fcee>] vmalloc_sync_all+0x10e/0x180
 [<ffffffff81153c3b>] alloc_vm_area+0x4b/0x70
 [<ffffffffa05aaf1e>] _stp_ctl_write_cmd+0x19e/0x440 [stap_850fe20eb529fd0f6b13f9a95a4cdd61_882]
 [<ffffffff8122722b>] ? selinux_file_permission+0xfb/0x150
 [<ffffffff8121b946>] ? security_file_permission+0x16/0x20
 [<ffffffff81184d98>] vfs_write+0xb8/0x1a0
 [<ffffffff81186156>] ? fget_light+0x66/0x100
 [<ffffffff811857d1>] sys_write+0x51/0x90
 [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b

Note: on the list of modules, e1000e is one of the drivers using the vzalloc function backports.
The problem is with THP. The page reclaim code calls page_referenced_one(), which takes the mm->page_table_lock on one CPU before sending an IPI to other CPU(s).

On CPU1 we take the mm->page_table_lock, send IPIs and wait for a response:

page_referenced_one(...)
        if (unlikely(PageTransHuge(page))) {
                pmd_t *pmd;

                spin_lock(&mm->page_table_lock);
                pmd = page_check_address_pmd(page, mm, address,
                                             PAGE_CHECK_ADDRESS_PMD_FLAG);
                if (pmd && !pmd_trans_splitting(*pmd) &&
                    pmdp_clear_flush_young_notify(vma, address, pmd))
                        referenced++;
                spin_unlock(&mm->page_table_lock);
        } else {

CPU2 can race in vmalloc_sync_all() because it disables interrupts (preventing a response to the IPI from CPU1), takes the pgd_lock, and then spins on the mm->page_table_lock which is already held on CPU1:

        spin_lock_irqsave(&pgd_lock, flags);
        list_for_each_entry(page, &pgd_list, lru) {
                pgd_t *pgd;
                spinlock_t *pgt_lock;

                pgd = (pgd_t *)page_address(page) + pgd_index(address);
                pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
                spin_lock(pgt_lock);

At this point the system is deadlocked. pmdp_clear_flush_young_notify() needs to do its PMD business with the page_table_lock held, then release that lock before sending the IPIs to the other CPUs.

Larry
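For readers who want to see the interleaving concretely, below is a minimal userspace sketch of the same lock/IPI interaction, written with pthreads and C11 atomics. The names page_table_lock, pgd_lock and ipi_pending mirror the kernel objects, but spin_lock_model(), the irqs_disabled flag and the timings are inventions for illustration only; this is not RHEL kernel code and not the posted fix.

/* Userspace model of the deadlock described above (gcc -pthread deadlock.c). */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static atomic_bool page_table_lock;   /* plays mm->page_table_lock */
static atomic_bool pgd_lock;          /* plays pgd_lock */
static atomic_bool ipi_pending;       /* a TLB-flush IPI waiting to be acked */
static _Thread_local bool irqs_disabled;

/* A CPU spinning on a lock normally still takes interrupts, so it can
 * acknowledge a pending flush IPI while it waits - unless irqs are disabled. */
static void spin_lock_model(atomic_bool *lock)
{
        bool expected = false;
        while (!atomic_compare_exchange_weak(lock, &expected, true)) {
                expected = false;
                if (!irqs_disabled && atomic_load(&ipi_pending))
                        atomic_store(&ipi_pending, false);   /* "ack" the IPI */
        }
}

static void spin_unlock_model(atomic_bool *lock)
{
        atomic_store(lock, false);
}

static void *cpu1(void *arg)   /* page_referenced_one() side */
{
        (void)arg;
        spin_lock_model(&page_table_lock);
        atomic_store(&ipi_pending, true);    /* flush_tlb_others_ipi() */
        while (atomic_load(&ipi_pending))    /* wait for the ack from cpu2... */
                ;                            /* ...which never arrives */
        spin_unlock_model(&page_table_lock);
        puts("cpu1 finished");
        return NULL;
}

static void *cpu2(void *arg)   /* vmalloc_sync_all() side */
{
        (void)arg;
        irqs_disabled = true;                /* spin_lock_irqsave(&pgd_lock, flags) */
        spin_lock_model(&pgd_lock);
        spin_lock_model(&page_table_lock);   /* spins forever: cpu1 holds it */
        spin_unlock_model(&page_table_lock);
        spin_unlock_model(&pgd_lock);
        irqs_disabled = false;
        puts("cpu2 finished");
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, cpu1, NULL);
        usleep(100000);                      /* let cpu1 take page_table_lock first */
        pthread_create(&t2, NULL, cpu2, NULL);

        sleep(3);                            /* our stand-in for the soft-lockup watchdog */
        printf("still deadlocked: ipi_pending=%d\n", (int)atomic_load(&ipi_pending));
        return 0;
}

If cpu2 is changed to leave irqs_disabled false (the analogue of taking pgd_lock without _irqsave), it acknowledges the pending IPI while it spins, cpu1 gets its ack and releases page_table_lock, and both threads complete.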
This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux. If you would like it considered as an exception in the current release, please ask your support representative.
This request was erroneously denied for the current release of Red Hat Enterprise Linux. The error has been fixed and this request has been re-proposed for the current release.
https://brewweb.devel.redhat.com/taskinfo?taskID=3096320

This has the fix I posted upstream. I'm waiting for upstream comment before submitting the fix to rhkernel-list. I couldn't see anywhere that takes pgd_lock from irq context and so would need the irqsave around it, although it seems almost too easy that I can just remove the _irqsave and be done with it. But until I see something that takes it from irq context, I choose to believe it is safe.
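As a rough illustration of what dropping the _irqsave means for the vmalloc_sync_all() excerpt quoted in the analysis above (a sketch of the idea only, not the patch that was actually posted, and not compilable outside a kernel tree):

        /* Sketch: nothing takes pgd_lock from interrupt context, so a plain
         * spin_lock() is enough.  With interrupts left enabled, this CPU can
         * still service a TLB-flush IPI while it spins on pgt_lock below,
         * so the CPU sending that IPI never gets stuck waiting for us. */
        spin_lock(&pgd_lock);
        list_for_each_entry(page, &pgd_list, lru) {
                spinlock_t *pgt_lock;

                pgt_lock = &pgd_page_get_mm(page)->page_table_lock;
                spin_lock(pgt_lock);
                /* ... sync the vmalloc pgd entry as before ... */
                spin_unlock(pgt_lock);
        }
        spin_unlock(&pgd_lock);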
Fix posted to rhkernel-list, Message-ID: <20110215184909.GK5935@random.random>
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Posted a second approach to the fix in Message-ID: <20110228222138.GP22700@random.random>. The old fix should work too, but this one is more obviously safe (for non-Xen users). The old fix remains a good idea, but with this applied it becomes only a cleanup, so it is fine for upstream but not worth the risk for RHEL if this new patch is applied.
Patch(es) available on kernel-2.6.32-122.el6
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0542.html