Bug 1305620

Summary: kernel panic/crash at page_lock_anon_vma_read() (which then calls down_read_trylock(), which takes a memory fault at instruction offset 0x9)
Product: Red Hat Enterprise Linux 7 Reporter: Sumeet Keswani <sumeet.keswani>
Component: kernel    Assignee: Larry Woodman <lwoodman>
kernel sub component: Memory Management QA Contact: Li Wang <liwan>
Status: CLOSED CURRENTRELEASE Docs Contact:
Severity: urgent    
Priority: urgent CC: 246alexey, aarcange, aburch, agordeev, am0616, amit, apverma, aquini, atomlin, chorn, dhoward, dkwon, ikulkarn, jaeshin, jkachuck, jmarchan, jobacik, joseph.szczypek, jsiddle, karen.skweres, kcleveng, kelvint, knoha, lilu, linda.knippers, lwoodman, masanari.iida, mreznik, naoko.yoshida, nbansal, nigel.croxon, qiuxishi, rblakley, rmarigny, rsussman, shane.seymour, stanislav.moravec, sumeet.keswani, sunzhuofeng, surkumar, tom.vaden, trinh.dao
Version: 7.1    Keywords: Reopened, ZStream
Target Milestone: rc   
Target Release: 7.3   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1496375 1496377 1496378 (view as bug list) Environment:
Last Closed: 2018-02-01 03:30:49 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1298243, 1394638, 1420851, 1438583, 1445812, 1469549, 1469551, 1473733, 1496375, 1496377, 1496378    

Description Sumeet Keswani 2016-02-08 18:27:21 UTC
Description of problem:




 
We have several kernel cores.  The common theme is that a variety of code paths reach page_lock_anon_vma_read() which then calls down_read_trylock() which takes a memory fault at instruction offset 0x9.
		
These are all from kernel version kernel-3.10.0-229.4.2.el7

This happens with THP enabled and with it disabled.

It's not just a NULL pointer dereference; we have also seen the bug manifest as a general protection fault (GPF).

Digging into the details of the crashes, the key thing is that the struct page's mapping pointer has the bit set to say it points to an anon_vma, but the memory it points to was not allocated from the anon_vma kmem cache.  The crash is somewhat random because it depends on what happens to be on the page that the mapping pointer refers to.
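For readers less familiar with the encoding: the kernel reuses the low bits of page->mapping as a type tag, and PAGE_MAPPING_ANON marks the field as a tagged anon_vma pointer rather than an address_space pointer. A minimal user-space sketch of the idea, with stand-in structs and the flag values we believe RHEL 7 uses (illustrative only, not the kernel source):

/* Illustrative sketch, not the RHEL 7 source: simplified stand-in types
 * showing how the low bit of page->mapping tags an anon_vma pointer. */
#include <stdio.h>

#define PAGE_MAPPING_ANON   1UL
#define PAGE_MAPPING_FLAGS  3UL   /* low bits are free because both targets are word-aligned */

struct anon_vma { void *root; };                /* stand-in */
struct page     { void *mapping; };             /* stand-in */

int main(void)
{
        struct anon_vma av = { 0 };
        struct page p;

        /* Store: tag the anon_vma pointer and drop it into page->mapping. */
        p.mapping = (void *)((unsigned long)&av + PAGE_MAPPING_ANON);

        /* Read back, as page_lock_anon_vma_read() does. */
        unsigned long anon_mapping = (unsigned long)p.mapping;
        if ((anon_mapping & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_ANON) {
                struct anon_vma *anon_vma =
                        (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);
                printf("decoded anon_vma = %p (expected %p)\n",
                       (void *)anon_vma, (void *)&av);
        }
        return 0;
}

In the crashes above, the tag bit is set but the decoded pointer does not lead to a valid anon_vma, which is why down_read_trylock() then faults.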

The bug reported here (https://bugzilla.redhat.com/show_bug.cgi?id=1091830) has an instance of the page_lock_anon_vma_read()-calling-down_read_trylock() theme.  We suspect they are the same problem, though we note that bug 1091830 seems to cover a lot of territory, including virtualization host/guest.  We are seeing our problems without virtualization and would not be surprised if this shows up in virtualized environments.

Searching the web, this page (http://comments.gmane.org/gmane.linux.kernel.mm/140642) seems to report the same bug, and from our reading of the thread, the author of the code acknowledges the bug in his response.

Below are 5 backtraces from various kernel crashes.

We're hoping to learn whether Red Hat already knows of this issue and/or concurs with what we have found on the web.  And, of course, we'd like to know when a fix will be seen in the field, as this is already impacting our customers somewhat regularly.

Crash 1
crash64> bt
PID: 300 TASK: ffff883f25e26660 CPU: 26 COMMAND: "kswapd0"
 #0 [ffff883f242eb810] machine_kexec at ffffffff8104c6a1
 #1 [ffff883f242eb868] crash_kexec at ffffffff810e2252
 #2 [ffff883f242eb938] oops_end at ffffffff8160d548
 #3 [ffff883f242eb960] no_context at ffffffff815fdf52
 #4 [ffff883f242eb9b0] __bad_area_nosemaphore at ffffffff815fdfe8
 #5 [ffff883f242eb9f8] bad_area_nosemaphore at ffffffff815fe152
 #6 [ffff883f242eba08] __do_page_fault at ffffffff816103ae
 #7 [ffff883f242ebb08] do_page_fault at ffffffff816105ca
 #8 [ffff883f242ebb30] page_fault at ffffffff8160c7c8
    [exception RIP: down_read_trylock+9]
    RIP: ffffffff8109c389 RSP: ffff883f242ebbe0 RFLAGS: 00010202
    RAX: 0000000000000000 RBX: ffff880b32303680 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000008
    RBP: ffff883f242ebbe0 R8: ffffea0028238520 R9: ffff887eb6d31320
    R10: 000000000005f55d R11: ffffea01037d0600 R12: ffff880b32303681
    R13: ffffea0028238500 R14: 0000000000000008 R15: ffffea0028238500
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
 #9 [ffff883f242ebbe8] page_lock_anon_vma_read at ffffffff8118e245
#10 [ffff883f242ebc18] page_referenced at ffffffff8118e4c7
#11 [ffff883f242ebc90] shrink_active_list at ffffffff8116b1cc
#12 [ffff883f242ebd48] balance_pgdat at ffffffff8116cb68
#13 [ffff883f242ebe20] kswapd at ffffffff8116d0f3
#14 [ffff883f242ebec8] kthread at ffffffff8109739f
#15 [ffff883f242ebf50] ret_from_fork at ffffffff81614d3c


Crash 2:


crash64> bt
PID: 15951 TASK: ffff887ece808b60 CPU: 30 COMMAND: "vertica"
 #0 [ffff887eda68b4a0] machine_kexec at ffffffff8104c6a1
 #1 [ffff887eda68b4f8] crash_kexec at ffffffff810e2252
 #2 [ffff887eda68b5c8] oops_end at ffffffff8160d548
 #3 [ffff887eda68b5f0] no_context at ffffffff815fdf52
 #4 [ffff887eda68b640] __bad_area_nosemaphore at ffffffff815fdfe8
 #5 [ffff887eda68b688] bad_area_nosemaphore at ffffffff815fe152
 #6 [ffff887eda68b698] __do_page_fault at ffffffff816103ae
 #7 [ffff887eda68b798] do_page_fault at ffffffff816105ca
 #8 [ffff887eda68b7c0] page_fault at ffffffff8160c7c8
    [exception RIP: down_read_trylock+9]
    RIP: ffffffff8109c389 RSP: ffff887eda68b870 RFLAGS: 00010282
    RAX: 0000000000000000 RBX: ffff8821bd6906c0 RCX: ffff8821bd6906c0
    RDX: 0000000000000001 RSI: 0000000000000301 RDI: fffffffffffffe08
    RBP: ffff887eda68b870 R8: 00000000fffffe7f R9: ffff8821bd6906c0
    R10: ffff88807ffd6000 R11: 0000000000000017 R12: ffff8821bd6906c1
    R13: ffffea0153e09ec0 R14: fffffffffffffe08 R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
 #9 [ffff887eda68b878] page_lock_anon_vma_read at ffffffff8118e245
#10 [ffff887eda68b8a8] try_to_unmap_anon at ffffffff8118e671
#11 [ffff887eda68b8f8] try_to_unmap at ffffffff8118e7bd
#12 [ffff887eda68b910] migrate_pages at ffffffff811b1e2b
#13 [ffff887eda68b9b0] compact_zone at ffffffff8117aff9
#14 [ffff887eda68ba00] compact_zone_order at ffffffff8117b1fc
#15 [ffff887eda68baa8] try_to_compact_pages at ffffffff8117b5b1
#16 [ffff887eda68bb08] __alloc_pages_direct_compact at ffffffff81600286
#17 [ffff887eda68bb68] __alloc_pages_nodemask at ffffffff81160b98
#18 [ffff887eda68bca0] alloc_pages_vma at ffffffff811a2a2a
#19 [ffff887eda68bd08] do_huge_pmd_wp_page at ffffffff811b77d8
#20 [ffff887eda68bd98] handle_mm_fault at ffffffff81182b64
#21 [ffff887eda68be28] __do_page_fault at ffffffff816101c6
#22 [ffff887eda68bf28] do_page_fault at ffffffff816105ca
#23 [ffff887eda68bf50] page_fault at ffffffff8160c7c8
    RIP: 0000000000c97926 RSP: 00007f5421151fb0 RFLAGS: 00010246
    RAX: 000000000029087a RBX: 000000000000086a RCX: 0000000000000000
    RDX: 00007ec2f89caba4 RSI: 0000000000003f8a RDI: 40000a514420ef4a
    RBP: 00007f5421152140 R8: 00007ec8eb2da260 R9: 00000000003fffff
    R10: 000000000000fbe8 R11: 0000000000003f90 R12: 00007f3d9fe318a0
    R13: 00007eb8c81fd010 R14: 00000000000007f1 R15: 00007f5421154aa0
    ORIG_RAX: ffffffffffffffff CS: 0033 SS: 002b


Crash 3:
PID: 16939 TASK: ffff8843570b5b00 CPU: 3 COMMAND: "vertica"
 #0 [ffff8849f6f1f610] machine_kexec at ffffffff8104c6a1
 #1 [ffff8849f6f1f668] crash_kexec at ffffffff810e2252
 #2 [ffff8849f6f1f738] oops_end at ffffffff8160d548
 #3 [ffff8849f6f1f760] die at ffffffff810173eb
 #4 [ffff8849f6f1f790] do_general_protection at ffffffff8160ce4e
 #5 [ffff8849f6f1f7c0] general_protection at ffffffff8160c768
    [exception RIP: down_read_trylock+9]
    RIP: ffffffff8109c389 RSP: ffff8849f6f1f870 RFLAGS: 00010206
    RAX: 0000000000000000 RBX: ffff8843a01c0ac0 RCX: ffff8843a01c0ac0
    RDX: 0000000000000001 RSI: 0000000000000301 RDI: 353338353931633f
    RBP: ffff8849f6f1f870 R8: 0000000033356461 R9: ffff8843a01c0ac0
    R10: ffff88807ffd6000 R11: 0000000000000017 R12: ffff8843a01c0ac1
    R13: ffffea0105a09680 R14: 353338353931633f R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
 #6 [ffff8849f6f1f878] page_lock_anon_vma_read at ffffffff8118e245
 #7 [ffff8849f6f1f8a8] try_to_unmap_anon at ffffffff8118e671
 #8 [ffff8849f6f1f8f8] try_to_unmap at ffffffff8118e7bd
 #9 [ffff8849f6f1f910] migrate_pages at ffffffff811b1e2b
#10 [ffff8849f6f1f9b0] compact_zone at ffffffff8117aff9
#11 [ffff8849f6f1fa00] compact_zone_order at ffffffff8117b1fc
#12 [ffff8849f6f1faa8] try_to_compact_pages at ffffffff8117b5b1
#13 [ffff8849f6f1fb08] __alloc_pages_direct_compact at ffffffff81600286
#14 [ffff8849f6f1fb68] __alloc_pages_nodemask at ffffffff81160b98
#15 [ffff8849f6f1fca0] alloc_pages_vma at ffffffff811a2a2a
#16 [ffff8849f6f1fd08] do_huge_pmd_wp_page at ffffffff811b77d8
#17 [ffff8849f6f1fd98] handle_mm_fault at ffffffff81182b64
#18 [ffff8849f6f1fe28] __do_page_fault at ffffffff816101c6
#19 [ffff8849f6f1ff28] do_page_fault at ffffffff816105ca
#20 [ffff8849f6f1ff50] page_fault at ffffffff8160c7c8
    RIP: 0000000000c97926 RSP: 00007fa8aa6503a0 RFLAGS: 00010246
    RAX: 00000000002c48f3 RBX: 00000000000086bd RCX: 0000000000000000
    RDX: 00007f77207547a6 RSI: 0000000000003f8a RDI: 40000685fd6c52e4
    RBP: 00007fa8aa650530 R8: 00007f7eec166ba8 R9: 00000000003fffff
    R10: 00000000000bad6c R11: 0000000000003f90 R12: 00007fcd4b5f2630
    R13: 00007f6c213fe010 R14: 00000000000007f1 R15: 00007fa8aa652e90
    ORIG_RAX: ffffffffffffffff CS: 0033 SS: 002b


Crash 4:
crash64> bt
PID: 15199 TASK: ffff88572574a220 CPU: 1 COMMAND: "vertica"
 #0 [ffff88725b71f400] machine_kexec at ffffffff8104c6a1
 #1 [ffff88725b71f458] crash_kexec at ffffffff810e2252
 #2 [ffff88725b71f528] oops_end at ffffffff8160d548
 #3 [ffff88725b71f550] no_context at ffffffff815fdf52
 #4 [ffff88725b71f5a0] __bad_area_nosemaphore at ffffffff815fdfe8
 #5 [ffff88725b71f5e8] bad_area_nosemaphore at ffffffff815fe152
 #6 [ffff88725b71f5f8] __do_page_fault at ffffffff816103ae
 #7 [ffff88725b71f6f8] do_page_fault at ffffffff816105ca
 #8 [ffff88725b71f720] page_fault at ffffffff8160c7c8
    [exception RIP: down_read_trylock+9]
    RIP: ffffffff8109c389 RSP: ffff88725b71f7d0 RFLAGS: 00010202
    RAX: 0000000000000000 RBX: ffff881e84f50ec0 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000008
    RBP: ffff88725b71f7d0 R8: ffffea0191f314e0 R9: ffff883f24ca9098
    R10: ffffea00fc038800 R11: ffffffff812d4e39 R12: ffff881e84f50ec1
    R13: ffffea0191f314c0 R14: 0000000000000008 R15: ffffea0191f314c0
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
 #9 [ffff88725b71f7d8] page_lock_anon_vma_read at ffffffff8118e245
#10 [ffff88725b71f808] page_referenced at ffffffff8118e4c7
#11 [ffff88725b71f880] shrink_active_list at ffffffff8116b1cc
#12 [ffff88725b71f938] shrink_lruvec at ffffffff8116b889
#13 [ffff88725b71fa38] shrink_zone at ffffffff8116bb76
#14 [ffff88725b71fa90] do_try_to_free_pages at ffffffff8116c080
#15 [ffff88725b71fb08] try_to_free_pages at ffffffff8116c56c
#16 [ffff88725b71fba0] __alloc_pages_nodemask at ffffffff81160c0d
#17 [ffff88725b71fcd8] alloc_pages_vma at ffffffff811a2a2a
#18 [ffff88725b71fd40] do_huge_pmd_anonymous_page at ffffffff811b6deb
#19 [ffff88725b71fd98] handle_mm_fault at ffffffff81182794
#20 [ffff88725b71fe28] __do_page_fault at ffffffff816101c6
#21 [ffff88725b71ff28] do_page_fault at ffffffff816105ca
#22 [ffff88725b71ff50] page_fault at ffffffff8160c7c8
    RIP: 000000000229a610 RSP: 00007ee38d57e030 RFLAGS: 00010206
    RAX: 00007eb93481c330 RBX: 0000000000010000 RCX: 00007eb93481c330
    RDX: 0000000000000005 RSI: 0000000011fbda48 RDI: 0000000000000137
    RBP: 00007ee38d57e050 R8: 0000000000000000 R9: 000000000045f2d8
    R10: 0000000000000000 R11: 00007eb93481c330 R12: 00007eb93481c330
    R13: 000000000000002a R14: 00007ef11cda5e40 R15: 00007ef7876325e0
    ORIG_RAX: ffffffffffffffff CS: 0033 SS: 002b


Crash 5:
crash64> bt
PID: 27682 TASK: ffff8806b07f0000 CPU: 37 COMMAND: "vertica"
 #0 [ffff885f34e933b8] machine_kexec at ffffffff8104c6a1
 #1 [ffff885f34e93410] crash_kexec at ffffffff810e2252
 #2 [ffff885f34e934e0] oops_end at ffffffff8160d548
 #3 [ffff885f34e93508] no_context at ffffffff815fdf52
 #4 [ffff885f34e93558] __bad_area_nosemaphore at ffffffff815fdfe8
 #5 [ffff885f34e935a0] bad_area at ffffffff815fe366
 #6 [ffff885f34e935c8] __do_page_fault at ffffffff816104ec
 #7 [ffff885f34e936c8] do_page_fault at ffffffff816105ca
 #8 [ffff885f34e936f0] page_fault at ffffffff8160c7c8
    [exception RIP: down_read_trylock+9]
    RIP: ffffffff8109c389 RSP: ffff885f34e937a0 RFLAGS: 00010213
    RAX: 0000000000000000 RBX: ffff886b86e0adc0 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000007
    RBP: ffff885f34e937a0 R8: ffffea0112dc0c60 R9: ffff88006d2f3068
    R10: 0000000000000088 R11: 0000000000000000 R12: ffff886b86e0adc1
    R13: ffffea0112dc0c40 R14: 0000000000000007 R15: ffffea0112dc0c40
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
 #9 [ffff885f34e937a8] page_lock_anon_vma_read at ffffffff8118e245
#10 [ffff885f34e937d8] page_referenced at ffffffff8118e4c7
#11 [ffff885f34e93850] shrink_active_list at ffffffff8116b1cc
#12 [ffff885f34e93908] shrink_lruvec at ffffffff8116b889
#13 [ffff885f34e93a08] shrink_zone at ffffffff8116bb76
#14 [ffff885f34e93a60] do_try_to_free_pages at ffffffff8116c080
#15 [ffff885f34e93ad8] try_to_free_pages at ffffffff8116c56c
#16 [ffff885f34e93b70] __alloc_pages_nodemask at ffffffff81160c0d
#17 [ffff885f34e93ca8] alloc_pages_vma at ffffffff811a2a2a
#18 [ffff885f34e93d10] do_wp_page at ffffffff811807ba
#19 [ffff885f34e93d98] handle_mm_fault at ffffffff81182b94
#20 [ffff885f34e93e28] __do_page_fault at ffffffff816101c6
#21 [ffff885f34e93f28] do_page_fault at ffffffff816105ca
#22 [ffff885f34e93f50] page_fault at ffffffff8160c7c8
    RIP: 0000000000c99820 RSP: 00007eface91d1d0 RFLAGS: 00010246
    RAX: 000000000188d28e RBX: 000000000000415c RCX: 0000000000000000
    RDX: 00007ede36c1bc70 RSI: 0000000000000048 RDI: 4b5b1bfb4f4e2c5d
    RBP: 00007eface91d460 R8: 00007f062229f050 R9: 0000000003ffffff
    R10: 00000000002d5334 R11: 0000000000000050 R12: 00007ee26b3b8070
    R13: 00007eddbbfff010 R14: 0000000000000009 R15: 00007eface91dac0
    ORIG_RAX: ffffffffffffffff CS: 0033 SS: 002b




Version-Release number of selected component (if applicable):
RHEL 7.1 / kernel-3.10.0-229.4.2.el7


How reproducible:
Happens randomly....


Steps to Reproduce:
none

Comment 2 Sumeet Keswani 2016-02-09 01:50:50 UTC
Can we make this BZ public?
(I do not seem to have access to do so.)

Comment 3 Sumeet Keswani 2016-02-10 23:12:27 UTC
Can we add the HPE confidential group to the BZ?

Comment 5 Joseph Kachuck 2016-02-11 16:58:18 UTC
Hello Sumeet,
Please confirm your HPE email address.

Please confirm if you can recreate this issue with the latest RHEL 7.x kernel.
kernel-3.10.0-327.4.5.el7 or above.

Please confirm any steps to recreate the issue. Please also confirm whether you are able to recreate this issue on more than one physical system.

Please also attach a sosreport and vmcore directly after this issue occurs.

Thank You
Joe Kachuck

Comment 6 Sumeet Keswani 2016-02-11 19:42:04 UTC
1. We have not attempted to reproduce this in kernel-3.10.0-327.4.5.el7. - so we don't really know.

2. Yes, we are able to recreate this issue on many physical systems (3 machines at least, maybe more).

3. I have saved vmcores from crashes mentioned above, attaching now.

Comment 7 Sumeet Keswani 2016-02-11 21:27:01 UTC
The files are too big to attach (45GB compressed).
Can I share an SFTP URL for the files?
I will need to share the password for the download out-of-band.

Comment 8 Joseph Kachuck 2016-02-11 21:37:53 UTC
Hello,
Please recreate the issue on the latest kernel as soon as you are able.

Thank You
Joe Kachuck

Comment 9 Sumeet Keswani 2016-02-12 15:28:09 UTC
1. This is not reliably reproducible, happens randomly. 

2. There are no specific steps to reproduce this either. 

3. There is no plan to upgrade production systems on the chance of reproducing this; i.e., if the boxes are upgraded to the latest kernel and the problem still exists, the customer will be further annoyed.

I can share the cores of previous crashes, maybe that can shed some light.

Comment 10 Sumeet Keswani 2016-02-12 15:31:06 UTC
We are unable to reproduce this on development systems because of 
 a) the random nature of the bug 
 b) and/or differences between production and development

Comment 11 Sumeet Keswani 2016-02-16 20:30:25 UTC
Can you do a first pass and look at the vmcores?
This is not trivial to reproduce.

I can pass the vmcores to you via secure ftp.

Comment 12 Sumeet Keswani 2016-02-16 23:04:07 UTC
Here is a link to a report/analysis.
https://bugs.centos.org/view.php?id=10242

Comment 13 Linda Wang 2016-02-17 18:23:09 UTC
Based on the BZ description:

"Searching the web, this page (http://comments.gmane.org/gmane.linux.kernel.mm/140642) seems to report the bug, and from our reading the response acknowledges the bug from the author of the code." 

Based on the upstream developer's comment in the thread,
he believes the issue was introduced by the compound refcounting rework patchset
that went upstream recently. However, the issue seen here is
in a 3.10 kernel, and we don't believe we have the THP refcounting
rework patches in.

However, this does point to a possible issue with THP reference
counting... any chance to turn THP off to see if the problem goes away?

Comment 14 Sumeet Keswani 2016-02-17 18:25:38 UTC
Yes, in our case we have seen this both with THP enabled and with it disabled.

Comment 15 Sumeet Keswani 2016-02-17 18:37:50 UTC
I do not understand the kernel or the deeper workings of the system.
I suspect it's related to page_migration or movement, which may happen during local_migration or transparent_huge_pages, and we are seeing different manifestations of this.

Comment 16 Sumeet Keswani 2016-02-17 19:10:36 UTC
and possibly even when swapping

Comment 17 Linda Wang 2016-03-01 05:33:55 UTC
*** Bug 1305728 has been marked as a duplicate of this bug. ***

Comment 20 Larry Woodman 2016-03-28 17:35:55 UTC
The anon_vma->rwsem is null!!!

struct anon_vma *page_lock_anon_vma_read(struct page *page)
{
        struct anon_vma *anon_vma = NULL;
        struct anon_vma *root_anon_vma;
        unsigned long anon_mapping;

        rcu_read_lock();
        anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
        if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
                goto out;
        if (!page_mapped(page))
                goto out;

        anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
        root_anon_vma = ACCESS_ONCE(anon_vma->root);
>>>>>>  if (down_read_trylock(&root_anon_vma->rwsem)) {

struct anon_vma {
        struct anon_vma *root;          /* Root of this anon_vma tree */
>>>>>>  struct rw_semaphore rwsem;      /* W: modification, R: walking the list */


Some sort of memory corruption going on here.
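As an aside, the faulting address is consistent with that layout: root sits at offset 0 and rwsem right behind the 8-byte pointer, so down_read_trylock(&root_anon_vma->rwsem) with a NULL (or zero-filled) root_anon_vma is handed the bogus address 0x8, which matches the RDI value in several of the backtraces above. A small user-space sketch, assuming simplified stand-in types rather than the real RHEL 7 headers:

/* Hypothetical, simplified layout just to illustrate the fault address;
 * the real structs carry more fields after rwsem. */
#include <stdio.h>
#include <stddef.h>

struct rw_semaphore { long count; };            /* stand-in */

struct anon_vma {
        struct anon_vma *root;                  /* offset 0 */
        struct rw_semaphore rwsem;              /* offset 8 on x86_64 */
};

int main(void)
{
        /* down_read_trylock(&root_anon_vma->rwsem) with root_anon_vma == NULL
         * dereferences address 0 + this offset, i.e. 0x8. */
        printf("offsetof(struct anon_vma, rwsem) = %zu\n",
               offsetof(struct anon_vma, rwsem));
        return 0;
}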


Larry

Comment 21 Shane Seymour 2016-04-06 00:07:50 UTC
(In reply to Larry Woodman from comment #20)
> The anon_vma->rwsem is null!!!
> 

What I found in every case was that for the struct page in question the mapping member pointed to something invalid each and every time the system fell over. Sometimes it would point to an area that is all NUL bytes and other times it would point to bad data (like part of an ASCII string).

The key seems to be working out why the mapping member is bad rather than why what it points to looks bad. I did all of the analysis work here (there are some mistakes in it):

https://bugs.centos.org/view.php?id=10242

But the conclusion in each of the cases was the value we currently had in the mapping member of the struct page was somehow wrong.

Comment 23 Michal Reznik 2016-05-10 10:41:16 UTC
Hello,

Is there any news regarding this ticket?

Thanks...

Comment 24 Sumeet Keswani 2016-05-11 15:07:15 UTC
Our customer who ran into this downgraded their production systems to RHEL 6.7 and has not had issues since. So the critical element of this bug, bringing down production, has been removed.

We are working on the side to get this to happen on a test cluster using RHEL 7.2, but that project has been on the back burner. It's unfortunate we cannot figure out the root cause by looking at the cores, of which there are a few.

Comment 26 Ashlee Burch 2016-05-26 18:56:26 UTC
I have a customer requesting that this bug be made public so that they may keep track of it for their internal purposes.

Customer's request:

We need to be able to review the following Bugzilla cases, as our product is dependent on another product that is dependent on these bug fixes. We need to determine what versions of Red Hat we will support with our next product release, which is being decided now.

Case# https://c.na7.visual.force.com/apex/Case_View?id=500A000000UDB1u&sfdc.override=1

Comment 27 Shane Seymour 2016-05-26 23:40:17 UTC
(In reply to Ashlee Burch from comment #26)
> I have a customer requesting for this bug to be public so that they may keep
> track for their internal purposes. 
> 

You will need to get permission from Sumeet to make the BZ public since it's a partner-filed BZ.

Comment 28 Sumeet Keswani 2016-05-27 00:07:04 UTC
Please make it public (I have no objection).
I tried to make it public but don't have the privileges to do so.

Also, if you can share the circumstances in which you see this bug, it may be helpful to us.

Comment 29 Joseph Kachuck 2016-05-27 12:32:18 UTC
Per comment 28 this is now a public BZ.

Comment 30 Larry Woodman 2016-06-02 15:45:50 UTC
Is there any special system setup here?  We have LOTS of RHEL7.1 installations and this is the only corruption that looks like this that we have seen.  Any hints as to whatever is different about this system will be really helpful in getting to the bottom of this problem.

Larry Woodman

Comment 32 Sumeet Keswani 2016-06-13 18:48:19 UTC
I cannot provide more info, as our customer moved off this version of RHEL
(to an older version, which has been stable).
There is no plan from them to try this again; consequently, I won't be able to provide this information.

Comment 34 Trinh Dao 2016-07-05 16:32:25 UTC
JoeK, please close this bug per comment 32 and mark HPE verified.

Comment 35 Christian Horn 2016-09-20 07:11:23 UTC
Reopening this; we have a new occurrence of the issue, with a customer of the same partner.  The customer's system has rebooted frequently due to the issue; the crash occurs with 5-25% probability when a workload is started, to my understanding a CPU-bound computation using many processes on this system with 48 logical CPUs.

Posting access details to the vmcore in a private update.

Comment 37 masanari iida 2016-09-28 01:55:31 UTC
I am the reporter of the panic from HPE-Japan.
Add me to cc.
Thanks

Comment 38 Sumeet Keswani 2016-09-29 16:11:39 UTC
Can I get access to BZ 1341497?
We have reason to believe it may have some changes related to this issue.

Comment 39 Christian Horn 2016-09-30 00:34:54 UTC
(In reply to Sumeet Keswani from comment #38)
> Can I get access to BZ 1341497. 
I have requested the access, but there are just some comments, and they are private.  bz1151823 was solved with RHEL6.8GA, and a clone from bz1341497, I have requested access there too.

Comment 40 Christian Horn 2016-09-30 00:37:25 UTC
(In reply to Christian Horn from comment #39)
> bz1151823 was solved with RHEL6.8GA, and a clone from bz1341497, I
> have requested access there too.
That should be "bz1341497 is a clone of bz1151823".

Comment 41 Sumeet Keswani 2016-09-30 00:42:29 UTC
(In reply to Christian Horn from comment #40)
> (In reply to Christian Horn from comment #39)
> > bz1151823 was solved with RHEL6.8GA, and a clone from bz1341497, I
> > have requested access there too.
> That should be "bz1341497 is a clone of bz1151823".

Thank you.

Comment 42 Sumeet Keswani 2016-10-03 21:02:22 UTC
One of our customers who ran into it was told it's fixed in RHEL 7.3.
Based on Target Release = 7.4, I take it it's not fixed in RHEL 7.3 and will be fixed in RHEL 7.4.

We will wait for access to the above BZ before we recommend an upgrade.

Comment 43 Trinh Dao 2016-10-12 15:47:14 UTC
JoeK, we already marked this bug HPE verified and fixed in RHEL7.3. Please change the target back to 7.3 so this bug can be closed.

Comment 44 Stan Moravec 2016-10-14 07:09:07 UTC
Neither I nor Sumeet verified that 7.3 is indeed a fix; this seems to be a misunderstanding of some sort. Reopening.

Comment 45 Trinh Dao 2016-10-14 13:27:33 UTC
Sorry Stan, I misread some comments and thought it was fixed in 7.3. Thank you for re-opening it.  Sorry!!!

JoeK, I cleared the verified field and added the hpe7.4bugs tracker to the bug.

Comment 52 Kelvin Tseng 2016-11-09 03:32:48 UTC
Hi,
The initiator experienced this issue on a Haswell CPU.
We are experiencing a similar issue on a Broadwell CPU.
Not sure; do you also experience a similar issue on Broadwell CPUs?

Thanks.

Comment 53 Stan Moravec 2016-11-09 08:55:29 UTC
We have seen it on Haswell, but I do not think CPU microarchitecture
is relevant here. If you are duplicating it reliably, try RHEL7.3. It's still unclear if the submittals into 7.3 fixed the problem or not.

Comment 54 Kelvin Tseng 2016-11-09 09:20:02 UTC
Thank you for the info and suggestion Stan.

Personally, I agree with you that "I do not think CPU microarchitecture
is relevant here".
Our customer's feedback to us is that they have a batch of HW where Haswell + LSI 2208 works fine but Broadwell + LSI 3108 fails, both running exactly the same Vertica and kernel version. We are still clarifying this.

Thanks.

Comment 55 Sumeet Keswani 2016-11-29 13:57:09 UTC
I am unable to access solution 2779851. Can I get the gist of it, or access to it? We have a few customers hitting this, and it would help to have a workaround.

Comment 56 Joseph Kachuck 2016-11-29 14:48:51 UTC
Hello Sumeet,
Solution 2779851 is an unpublished solution. It says the current workaround is to disable THP. It is a summary of the issue, and states that the issue is being worked on in this BZ.

Thank You
Joe Kachuck

Comment 58 Joseph Kachuck 2017-01-31 19:30:07 UTC
Hello,
Is there a response to comment 30?

Is there any special system setup here?  We have LOTS of RHEL7.1 installations and this is the only corruption that looks like this that we have seen.  Any hints as to whatever is different about this system will be really helpful in getting to the bottom of this problem.


Thank You
Joe Kachuck

Comment 59 Shane Seymour 2017-02-02 05:54:10 UTC
I now have a customer case on this issue (they have RHEL support but it's unfortunately a RHEL 7.0 GA kernel). When will Vertica be tested and supported on RHEL 7.3? Then at least I can ask the customer to upgrade and see if the issue happens again.

As an aside, even though it's for RHEL 7.0, would Red Hat like me to open a case and provide the vmcore? I haven't looked into the dump, but the stack trace looks like it's the same issue:

crash64> bt
PID: 356    TASK: ffff881fd220c440  CPU: 0   COMMAND: "kswapd0"
 #0 [ffff881fd05156c0] machine_kexec at ffffffff81041181
 #1 [ffff881fd0515718] crash_kexec at ffffffff810cf0e2
 #2 [ffff881fd05157e8] oops_end at ffffffff815ea548
 #3 [ffff881fd0515810] no_context at ffffffff815daf63
 #4 [ffff881fd0515860] __bad_area_nosemaphore at ffffffff815daff9
 #5 [ffff881fd05158a8] bad_area_nosemaphore at ffffffff815db163
 #6 [ffff881fd05158b8] __do_page_fault at ffffffff815ed36e
 #7 [ffff881fd05159b8] do_page_fault at ffffffff815ed58a
 #8 [ffff881fd05159e0] page_fault at ffffffff815e97c8
    [exception RIP: down_read_trylock+9]
    RIP: ffffffff8108a919  RSP: ffff881fd0515a90  RFLAGS: 00010202
    RAX: 0000000000000000  RBX: ffff88146e782000  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000008
    RBP: ffff881fd0515a90   R8: ffffea006f6f76a0   R9: ffff881fffa173a0
    R10: ffffea006baca200  R11: ffffffff812b8739  R12: ffff88146e782001
    R13: ffffea006f6f7680  R14: 0000000000000008  R15: ffffea006f6f7680
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff881fd0515a98] page_lock_anon_vma_read at ffffffff81177df5
#10 [ffff881fd0515ac8] page_referenced at ffffffff81178077
#11 [ffff881fd0515b40] shrink_active_list at ffffffff81155844
#12 [ffff881fd0515bf8] shrink_lruvec at ffffffff81155e34
#13 [ffff881fd0515cf8] shrink_zone at ffffffff811561a6
#14 [ffff881fd0515d50] balance_pgdat at ffffffff8115744c
#15 [ffff881fd0515e28] kswapd at ffffffff8115770b
#16 [ffff881fd0515ec8] kthread at ffffffff81085aef
#17 [ffff881fd0515f50] ret_from_fork at ffffffff815f206c

Comment 60 Sumeet Keswani 2017-02-02 14:28:16 UTC
(In reply to Shane Seymour from comment #59)
> I now have a customer case on this issue (they have RHEL support but it's
> unfortunately a RHEL 7.0 GA kernel). When will Vertica be tested and
> supported on RHEL 7.3? Then at least I can ask the customer to upgrade and
> see if the issue happens again.
> 
> As an aside even though it's for RHEL 7.0 would Redhat like me to open a
> case and provide the vmcore? I haven't looked into the dump but the stack
> trace looks like it's the same issue:
> 
> crash64> bt
> PID: 356    TASK: ffff881fd220c440  CPU: 0   COMMAND: "kswapd0"
>  #0 [ffff881fd05156c0] machine_kexec at ffffffff81041181
>  #1 [ffff881fd0515718] crash_kexec at ffffffff810cf0e2
>  #2 [ffff881fd05157e8] oops_end at ffffffff815ea548
>  #3 [ffff881fd0515810] no_context at ffffffff815daf63
>  #4 [ffff881fd0515860] __bad_area_nosemaphore at ffffffff815daff9
>  #5 [ffff881fd05158a8] bad_area_nosemaphore at ffffffff815db163
>  #6 [ffff881fd05158b8] __do_page_fault at ffffffff815ed36e
>  #7 [ffff881fd05159b8] do_page_fault at ffffffff815ed58a
>  #8 [ffff881fd05159e0] page_fault at ffffffff815e97c8
>     [exception RIP: down_read_trylock+9]
>     RIP: ffffffff8108a919  RSP: ffff881fd0515a90  RFLAGS: 00010202
>     RAX: 0000000000000000  RBX: ffff88146e782000  RCX: 0000000000000000
>     RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000008
>     RBP: ffff881fd0515a90   R8: ffffea006f6f76a0   R9: ffff881fffa173a0
>     R10: ffffea006baca200  R11: ffffffff812b8739  R12: ffff88146e782001
>     R13: ffffea006f6f7680  R14: 0000000000000008  R15: ffffea006f6f7680
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>  #9 [ffff881fd0515a98] page_lock_anon_vma_read at ffffffff81177df5
> #10 [ffff881fd0515ac8] page_referenced at ffffffff81178077
> #11 [ffff881fd0515b40] shrink_active_list at ffffffff81155844
> #12 [ffff881fd0515bf8] shrink_lruvec at ffffffff81155e34
> #13 [ffff881fd0515cf8] shrink_zone at ffffffff811561a6
> #14 [ffff881fd0515d50] balance_pgdat at ffffffff8115744c
> #15 [ffff881fd0515e28] kswapd at ffffffff8115770b
> #16 [ffff881fd0515ec8] kthread at ffffffff81085aef
> #17 [ffff881fd0515f50] ret_from_fork at ffffffff815f206c

We are working on RHEL 7.3 support. We have anecdotal reports that upgrading to RHEL 7.3 gets around this problem. Unfortunately, this crash is rare and hard to replicate, so a reliable reproducer is infeasible.

Comment 62 Trinh Dao 2017-03-01 16:44:21 UTC
RH, what info do you need from Sumeet?

Comment 64 qiuxishi2 2017-03-16 02:40:58 UTC
Hi, I find that upstream commit 414e2fb8ce5a999571c21eb2ca4d66e53ddce800 fixes a bug that may be the same as the one discussed above, but I'm not sure.

    rmap: fix theoretical race between do_wp_page and shrink_active_list

    As noted by Paul the compiler is free to store a temporary result in a
    variable on stack, heap or global unless it is explicitly marked as
    volatile, see:

      http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html#sample-optimizations

    This can result in a race between do_wp_page() and shrink_active_list()
    as follows.

    In do_wp_page() we can call page_move_anon_rmap(), which sets
    page->mapping as follows:

      anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
      page->mapping = (struct address_space *) anon_vma;

    The page in question may be on an LRU list, because nowhere in
    do_wp_page() we remove it from the list, neither do we take any LRU
    related locks.  Although the page is locked, shrink_active_list() can
    still call page_referenced() on it concurrently, because the latter does
    not require an anonymous page to be locked:

      CPU0                          CPU1
      ----                          ----
      do_wp_page                    shrink_active_list
       lock_page                     page_referenced
                                      PageAnon->yes, so skip trylock_page
       page_move_anon_rmap
        page->mapping = anon_vma
                                      rmap_walk
                                       PageAnon->no
                                       rmap_walk_file
                                        BUG
        page->mapping += PAGE_MAPPING_ANON

    This patch fixes this race by explicitly forbidding the compiler to split
    page->mapping store in page_move_anon_rmap() with the aid of WRITE_ONCE.

    [akpm: tweak comment, per Minchan]
    Signed-off-by: Vladimir Davydov <vdavydov>
    Cc: "Paul E. McKenney" <paulmck.ibm.com>
    Acked-by: Kirill A. Shutemov <kirill.shutemov.com>
    Acked-by: Rik van Riel <riel>
    Cc: Hugh Dickins <hughd>
    Acked-by: Minchan Kim <minchan>
    Signed-off-by: Andrew Morton <akpm>
    Signed-off-by: Linus Torvalds <torvalds>

diff --git a/mm/rmap.c b/mm/rmap.c
index 24dd3f9..9f47f15 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -950,7 +950,12 @@ void page_move_anon_rmap(struct page *page,
        VM_BUG_ON_PAGE(page->index != linear_page_index(vma, address), page);

        anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
-       page->mapping = (struct address_space *) anon_vma;
+       /*
+        * Ensure that anon_vma and the PAGE_MAPPING_ANON bit are written
+        * simultaneously, so a concurrent reader (eg page_referenced()'s
+        * PageAnon()) will not see one without the other.
+        */
+       WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
 }

 /**
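For context, WRITE_ONCE() boils down to a volatile store. A rough user-space stand-in, assuming a simplified form of the macro rather than the exact kernel definition, shows why the compiler can no longer split or re-issue the page->mapping store:

/* Rough user-space stand-in for WRITE_ONCE(); the kernel macro differs in
 * detail, but the volatile access is the point: the store must be emitted
 * as a single write that cannot be torn or duplicated by the compiler. */
#include <stdio.h>

#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

struct address_space;                            /* opaque stand-in */
struct page { struct address_space *mapping; };  /* stand-in */

int main(void)
{
        struct page p = { 0 };
        unsigned long tagged = 0x1000UL + 1;     /* anon_vma | PAGE_MAPPING_ANON */

        WRITE_ONCE(p.mapping, (struct address_space *)tagged);
        printf("p.mapping = %p\n", (void *)p.mapping);
        return 0;
}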

Comment 65 masanari iida 2017-03-16 02:50:03 UTC
FYI
As of 3.10.0-514.10.2
mm/rmap.c looks like the following.

1115 void page_move_anon_rmap(struct page *page,
1116         struct vm_area_struct *vma, unsigned long address)
1117 {
1118         struct anon_vma *anon_vma = vma->anon_vma;
1119 
1120         VM_BUG_ON(!PageLocked(page));
1121         VM_BUG_ON(!anon_vma);
1122         VM_BUG_ON(page->index != linear_page_index(vma, address));
1123 
1124         anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
1125         page->mapping = (struct address_space *) anon_vma;
1126 }
1127

Comment 66 qiuxishi2 2017-03-17 10:41:12 UTC
(In reply to masanari iida from comment #65)
> FYI
> As of 3.10.0-514.10.2
> mm/rmap.c looks like following.
> 
> 1115 void page_move_anon_rmap(struct page *page,
> 1116         struct vm_area_struct *vma, unsigned long address)
> 1117 {
> 1118         struct anon_vma *anon_vma = vma->anon_vma;
> 1119 
> 1120         VM_BUG_ON(!PageLocked(page));
> 1121         VM_BUG_ON(!anon_vma);
> 1122         VM_BUG_ON(page->index != linear_page_index(vma, address));
> 1123 
> 1124         anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
> 1125         page->mapping = (struct address_space *) anon_vma;
> 1126 }
> 1127
Hi, does Red Hat 7.3 fix this problem?
If yes, could you show me the patch?

Thanks,
Xishi Qiu

Comment 67 masanari iida 2017-03-18 08:29:23 UTC
(In reply to qiuxishi2 from comment #66)
> (In reply to masanari iida from comment #65)
(snip)
> hi, does redhat 7.3 fix this problem?
> if yes, could you show me the patch?
> 
> Thanks,
> Xishi Qiu

No.

My customer encountered the issue before the RHEL7.3 release.
They didn't update to the RHEL7.3 kernel.
So the customer is waiting for an official statement
from either RH or HPE (Vertica) that the problem is fixed.

Comment 68 qiuxishi2 2017-03-22 02:05:20 UTC
(In reply to masanari iida from comment #67)
> (In reply to qiuxishi2 from comment #66)
> > (In reply to masanari iida from comment #65)
> (snip)
> > hi, does redhat 7.3 fix this problem?
> > if yes, could you show me the patch?
> > 
> > Thanks,
> > Xishi Qiu
> 
> NO.
> 
> My customer encountered the issue before RHEL7.3 release.
> They didn't update to RHEL7.3 kernel.
> So the customer waiting for an official statement that
> the problem is fixed from either RH or HPE(Vertica).

Hi, has this problem been triggered on a KVM guest OS?

Comment 69 Stan Moravec 2017-03-22 08:58:36 UTC
The dump we have (it's Masanari's) is from a physical system.

Comment 70 Christian Horn 2017-04-06 01:56:54 UTC
(In reply to masanari iida from comment #67)
> My customer encountered the issue before RHEL7.3 release.
> They didn't update to RHEL7.3 kernel.
> So the customer waiting for an official statement that
> the problem is fixed from either RH or HPE(Vertica).

I worked through this and will try to sum up the state:

- No reproducer exists.
- Our data here around the issue seems not good enough to pinpoint the issue.
- We have mentioned some commits in this bz, but porting them (assuming they
  are small enough that they could eventually make it into a 7.2.z kernel; this
  verification has not been done) would then result in a test kernel, which
  someone would have to run and try out.
  The issue occurs rarely, so one would have to run it long enough to be sure
  it's fixed (e.g. having observed the frequency of the previous panics, say
  bimonthly, and then concluding after 6 months that the issue is fixed).
- We have multiple hints that rhel7.3 based kernels fix the issue, plus the
  GA and z-stream kernels from rhel7.3 have gone through full QA.

This issue leads to a panic, so when experiencing the issue the system becomes unusable and has to be rebooted.
Considering the above, if no 3rd-party vendor applications force staying on 7.2.z, booting a 7.3 kernel seems like the best option.

Comment 71 Jerome Marchand 2017-04-06 09:14:25 UTC
(In reply to Christian Horn from comment #70)
> I worked through this and try to sum up the state:

I'd like to add a couple of comments.

> 
> - No reproducer exists.
> - Our data here around the issue seems not good enough to pinpoint the issue.

This is a memory corruption and, as usual in such a case, the symptoms appear after the corruption has already happened, and it's very hard to pinpoint the source of the corruption. If a reproducer were available, however, we could run it on a kernel with KASAN enabled and likely catch the corruption as it happens.

> - We have mentioned some commits in this bz, but porting them (assuming they
>   are small enough that they could eventually make it in a 7.2.z kernel -
> this
>   verification has not been done) would then result in a testkernel, which
> some-
>   one would have to run an try out.

I don't see how the commits mentioned above would fix the issue we're seeing here.

>   The issue occurs rarely, so one would have to run it long enough to be sure
>   it's fixed (i.e. having observed the frequency of the previous panics, i.e.
>   bimonthly, and then concluding after 6 months that the issue is fixed).
> - We have multiple hints that rhel7.3 based kernels fix the issue, plus the
>   GA and z-stream kernels from rhel7.3 have gone through full QA.
> 
> This issue is leading to a panic, so when experiencing the issue the system
> becomes unusable and has to be rebooted.  
> Considering above, if no 3rd party vendor applications are enforcing to stay
> on 7.2.z, booting a 7.3 kernel seems like the best option.

Comment 72 Stan Moravec 2017-04-06 11:21:09 UTC
The problem is that Vertica (the only reproducer we know about, still very rare)
is (was?, any news there Sumeet) not officially qualified on 7.3. 


And just for the completeness of the BZ notes: while the WRITE_ONCE() submittal
mentioned above looks promising and theoretically applicable,
in the case of our 7.0 dump, the compiler did not split the store; see below:

0xffffffff81176940 page_move_anon_rmap:
nopl   0x0(%rax,%rax,1)
movq   0x88(%rsi),%rax
pushq  %rbp
movq   %rsp,%rbp
addq   $0x1,%rax      // anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
movq   %rax,0x8(%rdi) // page->mapping = (struct address_space *) anon_vma;
popq   %rbp
retq
0xffffffff81176959 end of page_move_anon_rmap+0x19: -------------

Comment 73 Jerome Marchand 2017-04-07 09:34:34 UTC
(In reply to Jerome Marchand from comment #71)
> I don't see how the commits mentioned above would fix the issue we're seeing
> here.

This comment apply to commits related to bz1341497. I missed the WRITE_ONCE() which at a first glance seems like a possible fix.

Comment 74 qiuxishi2 2017-04-10 01:44:17 UTC
(In reply to Jerome Marchand from comment #73)
> (In reply to Jerome Marchand from comment #71)
> > I don't see how the commits mentioned above would fix the issue we're seeing
> > here.
> 
> This comment apply to commits related to bz1341497. I missed the
> WRITE_ONCE() which at a first glance seems like a possible fix.

Hi Jerome, do you mean the following three patches fix this problem?

- [mm] fix anon_vma->degree underflow in anon_vma endless growing prevention (Jerome Marchand) [1341497]
- [mm] fix corner case in anon_vma endless growing prevention (Jerome Marchand) [1341497]
- [mm] prevent endless growth of anon_vma hierarchy (Jerome Marchand) [1341497]

Comment 75 qiuxishi2 2017-04-10 07:46:59 UTC
hi, guys

   I have found commit 624483f3ea82598 ("mm: rmap: fix use-after-free in __put_anon_vma"); RHEL 7.3 includes it and 7.2 does not.

diff --git a/mm/rmap.c b/mm/rmap.c
index 9c3e773..83bfafa 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1564,10 +1564,9 @@ void __put_anon_vma(struct anon_vma *anon_vma)
 {
        struct anon_vma *root = anon_vma->root;

+       anon_vma_free(anon_vma);
        if (root != anon_vma && atomic_dec_and_test(&root->refcount))
                anon_vma_free(root);
-
-       anon_vma_free(anon_vma);
 }

I am not sure that it will resolve the issue; any comments are welcome.

Thanks

Comment 76 Jerome Marchand 2017-04-10 08:57:39 UTC
(In reply to qiuxishi2 from comment #74)
> (In reply to Jerome Marchand from comment #73)
> > (In reply to Jerome Marchand from comment #71)
> > > I don't see how the commits mentioned above would fix the issue we're seeing
> > > here.
> > 
> > This comment apply to commits related to bz1341497. I missed the
> > WRITE_ONCE() which at a first glance seems like a possible fix.
> 
> Hi Jerome, do you mean the following three patches fix this problem?

No, I mean that I don't see how these three patches could fix it.

The other mentioned patch, the one that uses WRITE_ONCE(), seems at first like it could be a fix, but as Stan pointed out, the gcc version we're using doesn't do the optimization that commit 414e2fb8ce5a99 protects against.

> 
> - [mm] fix anon_vma->degree underflow in anon_vma endless growing prevention
> (Jerome Marchand) [1341497]
> - [mm] fix corner case in anon_vma endless growing prevention (Jerome
> Marchand) [1341497]
> - [mm] prevent endless growth of anon_vma hierarchy (Jerome Marchand)
> [1341497]

Comment 77 Sumeet Keswani 2017-05-15 20:16:51 UTC
(In reply to Stan Moravec from comment #72)
> The problem is that Vertica (the only reproducer we know about, still very
> rare)
> is (was?, any news there Sumeet) not officially qualified on 7.3. 
> 
> 
> And just for the completeness of BZ notes - while the WRITE_ONCE() submittal
> mentioned above looks promising and theoretically applicable, 
> in the case of our 7.0 dump, the compiler did not split the save, see below:
> 
> 0xffffffff81176940 page_move_anon_rmap:
> nopl   0x0(%rax,%rax,1)
> movq   0x88(%rsi),%rax
> pushq  %rbp
> movq   %rsp,%rbp
> addq   $0x1,%rax      // anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
> movq   %rax,0x8(%rdi) // page->mapping = (struct address_space *) anon_vma;
> popq   %rbp
> retq
> 0xffffffff81176959 end of page_move_anon_rmap+0x19: -------------

We have not yet qualified RHEL 7.3. Note that our customer got this in production; we did not find it in QA. So it's unlikely we will find this via qualification of the release per se. We will probably only know after several customers go to production on RHEL 7.3, which may be many months later, assuming we qualify RHEL 7.3 anytime soon. Some customers will use RHEL 7.3 even if it's not qualified yet; I'll let you know if I hear from them or if they run into this issue.

Comment 78 Linda Wang 2017-05-18 13:35:21 UTC
@Jerome, @Sumeet, would it be helpful if we provide HPE a 7.2.z
test kernel with the patch identified in comment#75, to see if it helps
HPE's customer?

Thanks!

Comment 79 Jerome Marchand 2017-05-18 13:55:20 UTC
(In reply to Sumeet Keswani from comment #0)
> Digging into the details of the crashes, the key thing is that the struct
> page being used as a mapping pointer has the bit set to say it's an
> anon_vma, but the page it points to is not allocated from the anon_vma kmem
> cache.  The crash is somewhat random because it depends on what is on the
> page that the mapping pointer refers to.
> 

That might very well be caused by a use-after-free of an anon_vma, the kind that might be fixed by the patch suggested in comment#75 (commit 8270eeba01be in RHEL7).

(In reply to Linda Wang from comment #78)
> @Jerome, @Sumeet, will it be helpful if we provide HPE a 7.2.z 
> test kernel with the id patch in comment#75, to see if it helps 
> HPE's customer? 
> 
> Thanks!

Definitely.

Comment 82 masanari iida 2017-05-29 07:53:50 UTC
Update from my customer who is suffering from the panic with RHEL7.

The customer told HPE Japan that they are planning to
update the kernel to RHEL7.3 or a later version around Oct 2017 or later.
Thanks

Comment 83 Andrea Arcangeli 2017-06-08 15:15:04 UTC
Following up the upstream discussion this should be fixed by upstream commit ad33bb04b2a6cee6c1f99fabb15cddbf93ff0433 which was backported to RHEL6 in commit 43e0d4dd7c717c6cc2aa9d45527d8d443da05ed2 and to RHEL7 in commit dc8b676fe65a66497941275b190e63a2c47d5319.

All RHEL6 kernels >= kernel-2.6.32-663.el6 and RHEL7 kernels >= kernel-3.10.0-367.el7 already include the fix.

The fix was committed to RHEL7 in March 2016, less than a month after it was committed upstream.

So this is already fixed in production RHEL7 >= 7.3 and RHEL6 >= 6.9 and only RHEL7.2 and earlier can be affected.

If this is confirmed it may be reasonable to do a zstream update to older RHEL7.

Comment 84 qiuxishi2 2017-07-18 11:15:58 UTC
Hi,

Unfortunately, this patch (mm: thp: fix SMP race condition between
THP page fault and MADV_DONTNEED) didn't help; I got the panic again.

And I found this error before the panic: "[468229.996610] BUG: Bad rss-counter state mm:ffff8806aebc2580 idx:1 val:1"

[468451.702807] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[468451.702861] IP: [<ffffffff810ac089>] down_read_trylock+0x9/0x30
[468451.702900] PGD 12445e067 PUD 11acaa067 PMD 0 
[468451.702931] Oops: 0000 [#1] SMP 
[468451.702953] kbox catch die event.
[468451.703003] collected_len = 1047419, LOG_BUF_LEN_LOCAL = 1048576
[468451.703003] kbox: notify die begin
[468451.703003] kbox: no notify die func register. no need to notify
[468451.703003] do nothing after die!
[468451.703003] Modules linked in: ipt_REJECT macvlan ip_set_hash_ipport vport_vxlan(OVE) xt_statistic xt_physdev xt_nat xt_recent xt_mark xt_comment veth ct_limit(OVE) bum_extract(OVE) policy(OVE) bum(OVE) ip_set nfnetlink openvswitch(OVE) nf_defrag_ipv6 gre ext3 jbd ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack bridge stp llc kboxdriver(O) kbox(O) dm_thin_pool dm_persistent_data crc32_pclmul dm_bio_prison dm_bufio ghash_clmulni_intel libcrc32c aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ppdev sg parport_pc cirrus virtio_console parport syscopyarea sysfillrect sysimgblt ttm drm_kms_helper drm i2c_piix4 i2c_core pcspkr ip_tables ext4 jbd2 mbcache sr_mod cdrom ata_generic pata_acpi
[468451.703003]  virtio_net virtio_blk crct10dif_pclmul crct10dif_common ata_piix virtio_pci libata serio_raw virtio_ring crc32c_intel virtio dm_mirror dm_region_hash dm_log dm_mod
[468451.703003] CPU: 6 PID: 21965 Comm: docker-containe Tainted: G           OE  ----V-------   3.10.0-327.53.58.73.x86_64 #1
[468451.703003] Hardware name: OpenStack Foundation OpenStack Nova, BIOS rel-1.8.1-0-g4adadbd-20170107_142945-9_64_246_229 04/01/2014
[468451.703003] task: ffff880692402e00 ti: ffff88018209c000 task.ti: ffff88018209c000
[468451.703003] RIP: 0010:[<ffffffff810ac089>]  [<ffffffff810ac089>] down_read_trylock+0x9/0x30
[468451.703003] RSP: 0018:ffff88018209f8f8  EFLAGS: 00010202
[468451.703003] RAX: 0000000000000000 RBX: ffff880720cd7740 RCX: ffff880720cd7740
[468451.703003] RDX: 0000000000000001 RSI: 0000000000000301 RDI: 0000000000000008
[468451.703003] RBP: ffff88018209f8f8 R08: 00000000c0e0f310 R09: ffff880720cd7740
[468451.703003] R10: ffff88083efd8000 R11: 0000000000000000 R12: ffff880720cd7741
[468451.703003] R13: ffffea000824d100 R14: 0000000000000008 R15: 0000000000000000
[468451.703003] FS:  00007fc0e2a85700(0000) GS:ffff88083ed80000(0000) knlGS:0000000000000000
[468451.703003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[468451.703003] CR2: 0000000000000008 CR3: 0000000661906000 CR4: 00000000001407e0
[468451.703003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[468451.703003] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[468451.703003] Stack:
[468451.703003]  ffff88018209f928 ffffffff811a7eb5 ffffea000824d100 ffff88018209fa90
[468451.703003]  ffffea00082f9680 0000000000000301 ffff88018209f978 ffffffff811a82e1
[468451.703003]  ffffea000824d100 ffff88018209fa00 0000000000000001 ffffea000824d100
[468451.703003] Call Trace:
[468451.703003]  [<ffffffff811a7eb5>] page_lock_anon_vma_read+0x55/0x110
[468451.703003]  [<ffffffff811a82e1>] try_to_unmap_anon+0x21/0x120
[468451.703003]  [<ffffffff811a842d>] try_to_unmap+0x4d/0x60
[468451.712006]  [<ffffffff811cc749>] migrate_pages+0x439/0x790
[468451.712006]  [<ffffffff81193280>] ? __reset_isolation_suitable+0xe0/0xe0
[468451.712006]  [<ffffffff811941f9>] compact_zone+0x299/0x400
[468451.712006]  [<ffffffff81059aff>] ? kvm_clock_get_cycles+0x1f/0x30
[468451.712006]  [<ffffffff811943fc>] compact_zone_order+0x9c/0xf0
[468451.712006]  [<ffffffff811947b1>] try_to_compact_pages+0x121/0x1a0
[468451.712006]  [<ffffffff8163ace6>] __alloc_pages_direct_compact+0xac/0x196
[468451.712006]  [<ffffffff811783e2>] __alloc_pages_nodemask+0xbc2/0xca0
[468451.712006]  [<ffffffff811bcb7a>] alloc_pages_vma+0x9a/0x150
[468451.712006]  [<ffffffff811d1573>] do_huge_pmd_anonymous_page+0x123/0x510
[468451.712006]  [<ffffffff8119bc58>] handle_mm_fault+0x1a8/0xf50
[468451.712006]  [<ffffffff8164b4d6>] __do_page_fault+0x166/0x470
[468451.712006]  [<ffffffff8164b8a3>] trace_do_page_fault+0x43/0x110
[468451.712006]  [<ffffffff8164af79>] do_async_page_fault+0x29/0xe0
[468451.712006]  [<ffffffff81647a38>] async_page_fault+0x28/0x30
[468451.712006] Code: 00 00 00 ba 01 00 00 00 48 89 de e8 12 fe ff ff eb ce 48 c7 c0 f2 ff ff ff eb c5 e8 42 ff fc ff 66 90 0f 1f 44 00 00 55 48 89 e5 <48> 8b 07 48 89 c2 48 83 c2 01 7e 07 f0 48 0f b1 17 75 f0 48 f7 
[468451.712006] RIP  [<ffffffff810ac089>] down_read_trylock+0x9/0x30
[468451.738667]  RSP <ffff88018209f8f8>
[468451.738667] CR2: 0000000000000008

Comment 85 Andrea Arcangeli 2017-07-20 16:11:38 UTC
Could you attach to the BZ a couple of `cat /proc/*/smaps` runs taken at
different times while the workload with which you can reproduce this is running?

I suggest enabling DEBUG_VM=y in your builds if you didn't
already; it won't risk impacting performance measurably, and it's a
supported config also enabled in the -debug kernel (but please keep DEBUG_VM_RB=n because that's expensive).

Comment 87 Larry Woodman 2017-08-04 15:46:52 UTC
We think this problem has been fixed by the commits listed in Comment #83.  If you are not running 7.3 or 6.9 and encounter this problem, can you update and try to reproduce it again?

Larry Woodman

Comment 90 Trinh Dao 2017-09-12 17:31:26 UTC
Sumeet, do you still see issue with RHEL6.9 or RHEL7.3?

Comment 91 Sumeet Keswani 2017-09-12 19:18:12 UTC
(In reply to Trinh Dao from comment #90)
> Sumeet, do you still see issue with RHEL6.9 or RHEL7.3?

I have not seen it yet on RHEL 7.3, perhaps because a majority of our customers don't stay on the leading edge of releases. I will update this BZ if it shows up on a more recent kernel.

Comment 97 Joseph Kachuck 2017-09-27 15:36:21 UTC
Hello,
This bug has been copied as 7.4 z-stream (EUS) bug #1496378 

Thank You
Joe Kachuck

Comment 99 Trinh Dao 2018-01-24 21:26:36 UTC
Sumeet, since you don't see the issue anymore in comment 91, can I close your bug and you can re-open if you see it again?

Comment 100 qiuxishi2 2018-01-25 02:09:27 UTC
(In reply to qiuxishi2 from comment #84)
> Hi,
> 
> Unfortunately, this patch(mm: thp: fix SMP race condition between
> THP page fault and MADV_DONTNEED) didn't help, I got the panic again.
> 
> And I find this error before panic, "[468229.996610] BUG: Bad rss-counter
> state mm:ffff8806aebc2580 idx:1 val:1"
> 
> [468451.702807] BUG: unable to handle kernel NULL pointer dereference at
> 0000000000000008
> [468451.702861] IP: [<ffffffff810ac089>] down_read_trylock+0x9/0x30
> [468451.702900] PGD 12445e067 PUD 11acaa067 PMD 0 
> [468451.702931] Oops: 0000 [#1] SMP 
> [468451.702953] kbox catch die event.
> [468451.703003] collected_len = 1047419, LOG_BUF_LEN_LOCAL = 1048576
> [468451.703003] kbox: notify die begin
> [468451.703003] kbox: no notify die func register. no need to notify
> [468451.703003] do nothing after die!
> [468451.703003] Modules linked in: ipt_REJECT macvlan ip_set_hash_ipport
> vport_vxlan(OVE) xt_statistic xt_physdev xt_nat xt_recent xt_mark xt_comment
> veth ct_limit(OVE) bum_extract(OVE) policy(OVE) bum(OVE) ip_set nfnetlink
> openvswitch(OVE) nf_defrag_ipv6 gre ext3 jbd ipt_MASQUERADE
> nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4
> nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack
> bridge stp llc kboxdriver(O) kbox(O) dm_thin_pool dm_persistent_data
> crc32_pclmul dm_bio_prison dm_bufio ghash_clmulni_intel libcrc32c
> aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ppdev sg parport_pc
> cirrus virtio_console parport syscopyarea sysfillrect sysimgblt ttm
> drm_kms_helper drm i2c_piix4 i2c_core pcspkr ip_tables ext4 jbd2 mbcache
> sr_mod cdrom ata_generic pata_acpi
> [468451.703003]  virtio_net virtio_blk crct10dif_pclmul crct10dif_common
> ata_piix virtio_pci libata serio_raw virtio_ring crc32c_intel virtio
> dm_mirror dm_region_hash dm_log dm_mod
> [468451.703003] CPU: 6 PID: 21965 Comm: docker-containe Tainted: G          
> OE  ----V-------   3.10.0-327.53.58.73.x86_64 #1
> [468451.703003] Hardware name: OpenStack Foundation OpenStack Nova, BIOS
> rel-1.8.1-0-g4adadbd-20170107_142945-9_64_246_229 04/01/2014
> [468451.703003] task: ffff880692402e00 ti: ffff88018209c000 task.ti:
> ffff88018209c000
> [468451.703003] RIP: 0010:[<ffffffff810ac089>]  [<ffffffff810ac089>]
> down_read_trylock+0x9/0x30
> [468451.703003] RSP: 0018:ffff88018209f8f8  EFLAGS: 00010202
> [468451.703003] RAX: 0000000000000000 RBX: ffff880720cd7740 RCX:
> ffff880720cd7740
> [468451.703003] RDX: 0000000000000001 RSI: 0000000000000301 RDI:
> 0000000000000008
> [468451.703003] RBP: ffff88018209f8f8 R08: 00000000c0e0f310 R09:
> ffff880720cd7740
> [468451.703003] R10: ffff88083efd8000 R11: 0000000000000000 R12:
> ffff880720cd7741
> [468451.703003] R13: ffffea000824d100 R14: 0000000000000008 R15:
> 0000000000000000
> [468451.703003] FS:  00007fc0e2a85700(0000) GS:ffff88083ed80000(0000)
> knlGS:0000000000000000
> [468451.703003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [468451.703003] CR2: 0000000000000008 CR3: 0000000661906000 CR4:
> 00000000001407e0
> [468451.703003] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [468451.703003] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [468451.703003] Stack:
> [468451.703003]  ffff88018209f928 ffffffff811a7eb5 ffffea000824d100
> ffff88018209fa90
> [468451.703003]  ffffea00082f9680 0000000000000301 ffff88018209f978
> ffffffff811a82e1
> [468451.703003]  ffffea000824d100 ffff88018209fa00 0000000000000001
> ffffea000824d100
> [468451.703003] Call Trace:
> [468451.703003]  [<ffffffff811a7eb5>] page_lock_anon_vma_read+0x55/0x110
> [468451.703003]  [<ffffffff811a82e1>] try_to_unmap_anon+0x21/0x120
> [468451.703003]  [<ffffffff811a842d>] try_to_unmap+0x4d/0x60
> [468451.712006]  [<ffffffff811cc749>] migrate_pages+0x439/0x790
> [468451.712006]  [<ffffffff81193280>] ? __reset_isolation_suitable+0xe0/0xe0
> [468451.712006]  [<ffffffff811941f9>] compact_zone+0x299/0x400
> [468451.712006]  [<ffffffff81059aff>] ? kvm_clock_get_cycles+0x1f/0x30
> [468451.712006]  [<ffffffff811943fc>] compact_zone_order+0x9c/0xf0
> [468451.712006]  [<ffffffff811947b1>] try_to_compact_pages+0x121/0x1a0
> [468451.712006]  [<ffffffff8163ace6>] __alloc_pages_direct_compact+0xac/0x196
> [468451.712006]  [<ffffffff811783e2>] __alloc_pages_nodemask+0xbc2/0xca0
> [468451.712006]  [<ffffffff811bcb7a>] alloc_pages_vma+0x9a/0x150
> [468451.712006]  [<ffffffff811d1573>] do_huge_pmd_anonymous_page+0x123/0x510
> [468451.712006]  [<ffffffff8119bc58>] handle_mm_fault+0x1a8/0xf50
> [468451.712006]  [<ffffffff8164b4d6>] __do_page_fault+0x166/0x470
> [468451.712006]  [<ffffffff8164b8a3>] trace_do_page_fault+0x43/0x110
> [468451.712006]  [<ffffffff8164af79>] do_async_page_fault+0x29/0xe0
> [468451.712006]  [<ffffffff81647a38>] async_page_fault+0x28/0x30
> [468451.712006] Code: 00 00 00 ba 01 00 00 00 48 89 de e8 12 fe ff ff eb ce
> 48 c7 c0 f2 ff ff ff eb c5 e8 42 ff fc ff 66 90 0f 1f 44 00 00 55 48 89 e5
> <48> 8b 07 48 89 c2 48 83 c2 01 7e 07 f0 48 0f b1 17 75 f0 48 f7 
> [468451.712006] RIP  [<ffffffff810ac089>] down_read_trylock+0x9/0x30
> [468451.738667]  RSP <ffff88018209f8f8>
> [468451.738667] CR2: 0000000000000008

Hi, I added these two patches from RHEL 7.3 ("introduce thp_mmu_gather to pin tail pages during MMU gather", "put_huge_zero_page() with MMU gather"), and since then I have not seen the issue. So I think this problem may be related to these patches too, which means we should add the following patches:
thp: put_huge_zero_page() with MMU gather
thp: introduce thp_mmu_gather to pin tail pages during MMU gather
mm: thp: fix SMP race condition between THP page fault and MADV_DONTNEED

Comment 101 Li Wang 2018-02-01 03:30:49 UTC
The patch for this issue has been in the kernel since 7.3 devel, so there is nothing to do in 7.5.

Comment 102 Trinh Dao 2018-02-15 15:30:58 UTC
Marking HPE verified since the bug is closed now.

Comment 105 Red Hat Bugzilla 2023-09-14 23:58:58 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days