Bug 1305620
Summary: kernel panic/crash at page_lock_anon_vma_read (which then calls down_read_trylock(), which takes a memory fault at instruction offset 0x9)

Product: Red Hat Enterprise Linux 7
Reporter: Sumeet Keswani <sumeet.keswani>
Component: kernel
Assignee: Larry Woodman <lwoodman>
Kernel sub component: Memory Management
QA Contact: Li Wang <liwan>
Status: CLOSED CURRENTRELEASE
Docs Contact:
Severity: urgent
Priority: urgent
CC: 246alexey, aarcange, aburch, agordeev, am0616, amit, apverma, aquini, atomlin, chorn, dhoward, dkwon, ikulkarn, jaeshin, jkachuck, jmarchan, jobacik, joseph.szczypek, jsiddle, karen.skweres, kcleveng, kelvint, knoha, lilu, linda.knippers, lwoodman, masanari.iida, mreznik, naoko.yoshida, nbansal, nigel.croxon, qiuxishi, rblakley, rmarigny, rsussman, shane.seymour, stanislav.moravec, sumeet.keswani, sunzhuofeng, surkumar, tom.vaden, trinh.dao
Version: 7.1
Keywords: Reopened, ZStream
Target Milestone: rc
Target Release: 7.3
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
1496375 1496377 1496378 (view as bug list)
Environment:
Last Closed: 2018-02-01 03:30:49 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1298243, 1394638, 1420851, 1438583, 1445812, 1469549, 1469551, 1473733, 1496375, 1496377, 1496378
Description
Sumeet Keswani
2016-02-08 18:27:21 UTC
Can we make this BZ public? (I do not seem to have the access to do so.) Can we add the HPE confidential group to the BZ?

Hello Sumeet, Please confirm your HPE email address. Please confirm if you can recreate this issue with the latest RHEL 7.x kernel, kernel-3.10.0-327.4.5.el7 or above. Please confirm any steps to recreate the issue. Please also confirm if you are able to recreate this issue on more than one physical system. Please also attach a sosreport and vmcore directly after this issue occurs. Thank You Joe Kachuck

1. We have not attempted to reproduce this on kernel-3.10.0-327.4.5.el7, so we don't really know. 2. Yes, we are able to recreate this issue on many physical systems (3 machines at least, maybe more). 3. I have saved vmcores from the crashes mentioned above and am attaching them now. The files are too big to attach (45 GB compressed). Can I share an sftp URL for the files? I will need to share the password for the download out-of-band.

Hello, Please recreate the issue on the latest kernel as soon as you are able. Thank You Joe Kachuck

1. This is not reliably reproducible; it happens randomly. 2. There are no specific steps to reproduce this either. 3. There is no plan to upgrade production systems on the chance of reproducing this, i.e. if boxes are upgraded to the latest kernel and this problem persists, the customer will be further annoyed. I can share the cores of previous crashes; maybe that can shed some light. We are unable to reproduce this on development systems because of a) the random nature of the bug and/or b) differences between production and development. Can you do a first pass and look at the vmcores? This is not trivial to reproduce. I can pass the vmcores to you via secure ftp. Here is a link to a report/analysis.
https://bugs.centos.org/view.php?id=10242

Based on the BZ description: "Searching the web, this page (http://comments.gmane.org/gmane.linux.kernel.mm/140642) seems to report the bug, and from our reading the response acknowledges the bug from the author of the code." Based on the upstream developer's comment in the thread, he believes the issue was introduced by the compound refcounting rework patchset that went upstream recently. However, the issue seen here is in a 3.10 kernel, and we don't believe we have the THP refcounting rework patches in. Still, this does point to a possible issue with THP reference counting. Any chance to turn THP off to see if the problem goes away?

Yes, in our case we have seen this with THP both enabled and disabled. I do not understand the kernel or the deeper workings of the system. I suspect it's related to page migration or movement, which may happen during local migration or transparent huge pages, and we are seeing different manifestations of this, possibly even when swapping.

*** Bug 1305728 has been marked as a duplicate of this bug. ***

The anon_vma->rwsem is null!!!

struct anon_vma *page_lock_anon_vma_read(struct page *page)
{
        struct anon_vma *anon_vma = NULL;
        struct anon_vma *root_anon_vma;
        unsigned long anon_mapping;

        rcu_read_lock();
        anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
        if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
                goto out;
        if (!page_mapped(page))
                goto out;

        anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
        root_anon_vma = ACCESS_ONCE(anon_vma->root);
>>>>>>  if (down_read_trylock(&root_anon_vma->rwsem)) {

struct anon_vma {
        struct anon_vma *root;          /* Root of this anon_vma tree */
>>>>>>  struct rw_semaphore rwsem;      /* W: modification, R: walking the list */

Some sort of memory corruption going on here. Larry

(In reply to Larry Woodman from comment #20)
> The anon_vma->rwsem is null!!!
> What I found in every case was that for the struct page in question, the mapping member pointed to something invalid each and every time the system fell over. Sometimes it would point to an area that is all NUL bytes and other times it would point to bad data (like part of an ASCII string). The key seems to be working out why the mapping member is bad rather than why what it points to looks bad. I did all of the analysis work here (there are some mistakes in it): https://bugs.centos.org/view.php?id=10242 But the conclusion in each of the cases was that the value we currently had in the mapping member of the struct page was somehow wrong.

Hello, Is there any news regarding this ticket? Thanks...

Our customer who ran into this downgraded their production systems to RHEL 6.7 and has not had issues since, so we removed the critical element of this bug bringing down production. We are working on the side to get this to happen on a test cluster using RHEL 7.2, but that project has been on the back burner. It's unfortunate we cannot figure out the root cause by looking at the cores, of which there are a few.

I have a customer requesting for this bug to be made public so that they may keep track of it for their internal purposes. Customer's request: we need to be able to review the following Bugzilla cases; as our product is dependent on another product that is dependent on these bug fixes, we need to determine what versions of Red Hat we will support with our next product release, which is being determined now. Case# https://c.na7.visual.force.com/apex/Case_View?id=500A000000UDB1u&sfdc.override=1

(In reply to Ashlee Burch from comment #26)
> I have a customer requesting for this bug to be public so that they may keep
> track for their internal purposes.

You will need to get permission from Sumeet to make the BZ public since it's a partner-filed BZ.

Please make it public (I have no objection). I tried to make it public but don't have the privileges to do so.
Also, if you can share the circumstances in which you see this bug, it may be helpful to us. Per comment 28 this is now a public BZ.

Is there any special system setup here? We have LOTS of RHEL 7.1 installations and this is the only corruption that looks like this we have seen. Any hints as to whatever is different with this system will be really helpful in getting to the bottom of this problem. Larry Woodman

I cannot provide more info, as our customer moved off this version of RHEL (to an older version, which has been stable). There is no plan from them to try this again; consequently, I won't be able to provide this information. JoeK, please close this bug per comment 32 and mark it HPE verified.

Reopening this: we have a new occurrence of the issue, with a customer of the same partner. The customer's system has rebooted frequently through the issue; the crash occurs with 5-25% probability when a workload is started, to my understanding CPU-bound computation using many processes on this 48-logical-CPU system. Posting access details to the vmcore in a private update.

I am the reporter of the panic from HPE-Japan. Add me to cc. Thanks

Can I get access to BZ 1341497? We have reason to believe it may have some changes related to this issue.

(In reply to Sumeet Keswani from comment #38)
> Can I get access to BZ 1341497.

I have requested the access, but there are just some comments, and they are private. bz1151823 was solved with RHEL6.8GA, and a clone from bz1341497; I have requested access there too.

(In reply to Christian Horn from comment #39)
> bz1151823 was solved with RHEL6.8GA, and a clone from bz1341497, I
> have requested access there too.

That should be "bz1341497 is a clone of bz1151823".

(In reply to Christian Horn from comment #40)
> (In reply to Christian Horn from comment #39)
> > bz1151823 was solved with RHEL6.8GA, and a clone from bz1341497, I
> > have requested access there too.
> That should be "bz1341497 is a clone of bz1151823".

Thank you.
One of our customers who ran into it was told it's fixed in RHEL 7.3. Based on Target Release = 7.4, I take it it's not fixed in RHEL 7.3 and will be fixed in RHEL 7.4. We will wait for access to the above BZ before we recommend an upgrade.

JoeK, we already marked this bug HPE verified and fixed in RHEL 7.3. Please change the target back to 7.3; this bug can then be closed.

Neither I nor Sumeet verified that 7.3 is indeed a fix; this seems to be a misunderstanding of some sort. Reopening.

Sorry Stan, I misread some comments and thought it was fixed in 7.3. Thank you for re-opening it. Sorry!!! JoeK, I cleaned the verified field and added the hpe7.4bugs tracker to the bug.

Hi, The initiator experienced this issue on a Haswell CPU. We are experiencing a similar issue on a Broadwell CPU. Not sure, do you also experience a similar issue on Broadwell CPUs? Thanks.

We have seen it on Haswell, but I do not think the CPU microarchitecture is relevant here. If you are duplicating it reliably, try RHEL 7.3. It's still unclear if the submittals into 7.3 fixed the problem or not.

Thank you for the info and suggestion, Stan. Personally, I agree with you that the CPU microarchitecture is probably not relevant here. Our customer's feedback to us is that they have a batch of HW where Haswell + LSI 2208 works fine but Broadwell + LSI 3108 failed, both running exactly the same Vertica and kernel version. We are still clarifying this. Thanks.

I am unable to access solution (2779851). Can I get the gist of it or access to it? We have a few customers hitting this and it would help to have a workaround.

Hello Sumeet, Solution 2779851 is an unpublished solution. It says the current workaround is to disable THP. It is a summary of the issue, and states the issue is being worked on in this BZ. Thank You Joe Kachuck

Hello, Is there a response to comment 30? Is there any special system setup here? We have LOTS of RHEL 7.1 installations and this is the only corruption that looks like this we have seen.
Any hints as to whatever is different with this system will be really helpful in getting to the bottom of this problem. Thank You Joe Kachuck

I now have a customer case on this issue (they have RHEL support but it's unfortunately a RHEL 7.0 GA kernel). When will Vertica be tested and supported on RHEL 7.3? Then at least I can ask the customer to upgrade and see if the issue happens again. As an aside, even though it's for RHEL 7.0, would Red Hat like me to open a case and provide the vmcore? I haven't looked into the dump but the stack trace looks like it's the same issue:

crash64> bt
PID: 356    TASK: ffff881fd220c440   CPU: 0   COMMAND: "kswapd0"
#0 [ffff881fd05156c0] machine_kexec at ffffffff81041181
#1 [ffff881fd0515718] crash_kexec at ffffffff810cf0e2
#2 [ffff881fd05157e8] oops_end at ffffffff815ea548
#3 [ffff881fd0515810] no_context at ffffffff815daf63
#4 [ffff881fd0515860] __bad_area_nosemaphore at ffffffff815daff9
#5 [ffff881fd05158a8] bad_area_nosemaphore at ffffffff815db163
#6 [ffff881fd05158b8] __do_page_fault at ffffffff815ed36e
#7 [ffff881fd05159b8] do_page_fault at ffffffff815ed58a
#8 [ffff881fd05159e0] page_fault at ffffffff815e97c8
    [exception RIP: down_read_trylock+9]
    RIP: ffffffff8108a919  RSP: ffff881fd0515a90  RFLAGS: 00010202
    RAX: 0000000000000000  RBX: ffff88146e782000  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000008
    RBP: ffff881fd0515a90   R8: ffffea006f6f76a0   R9: ffff881fffa173a0
    R10: ffffea006baca200  R11: ffffffff812b8739  R12: ffff88146e782001
    R13: ffffea006f6f7680  R14: 0000000000000008  R15: ffffea006f6f7680
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#9 [ffff881fd0515a98] page_lock_anon_vma_read at ffffffff81177df5
#10 [ffff881fd0515ac8] page_referenced at ffffffff81178077
#11 [ffff881fd0515b40] shrink_active_list at ffffffff81155844
#12 [ffff881fd0515bf8] shrink_lruvec at ffffffff81155e34
#13 [ffff881fd0515cf8] shrink_zone at ffffffff811561a6
#14 [ffff881fd0515d50] balance_pgdat at ffffffff8115744c
#15
[ffff881fd0515e28] kswapd at ffffffff8115770b #16 [ffff881fd0515ec8] kthread at ffffffff81085aef #17 [ffff881fd0515f50] ret_from_fork at ffffffff815f206c (In reply to Shane Seymour from comment #59) > I now have a customer case on this issue (they have RHEL support but it's > unfortunately a RHEL 7.0 GA kernel). When will Vertica be tested and > supported on RHEL 7.3? Then at least I can ask the customer to upgrade and > see if the issue happens again. > > As an aside even though it's for RHEL 7.0 would Redhat like me to open a > case and provide the vmcore? I haven't looked into the dump but the stack > trace looks like it's the same issue: > > crash64> bt > PID: 356 TASK: ffff881fd220c440 CPU: 0 COMMAND: "kswapd0" > #0 [ffff881fd05156c0] machine_kexec at ffffffff81041181 > #1 [ffff881fd0515718] crash_kexec at ffffffff810cf0e2 > #2 [ffff881fd05157e8] oops_end at ffffffff815ea548 > #3 [ffff881fd0515810] no_context at ffffffff815daf63 > #4 [ffff881fd0515860] __bad_area_nosemaphore at ffffffff815daff9 > #5 [ffff881fd05158a8] bad_area_nosemaphore at ffffffff815db163 > #6 [ffff881fd05158b8] __do_page_fault at ffffffff815ed36e > #7 [ffff881fd05159b8] do_page_fault at ffffffff815ed58a > #8 [ffff881fd05159e0] page_fault at ffffffff815e97c8 > [exception RIP: down_read_trylock+9] > RIP: ffffffff8108a919 RSP: ffff881fd0515a90 RFLAGS: 00010202 > RAX: 0000000000000000 RBX: ffff88146e782000 RCX: 0000000000000000 > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000008 > RBP: ffff881fd0515a90 R8: ffffea006f6f76a0 R9: ffff881fffa173a0 > R10: ffffea006baca200 R11: ffffffff812b8739 R12: ffff88146e782001 > R13: ffffea006f6f7680 R14: 0000000000000008 R15: ffffea006f6f7680 > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 > #9 [ffff881fd0515a98] page_lock_anon_vma_read at ffffffff81177df5 > #10 [ffff881fd0515ac8] page_referenced at ffffffff81178077 > #11 [ffff881fd0515b40] shrink_active_list at ffffffff81155844 > #12 [ffff881fd0515bf8] shrink_lruvec at ffffffff81155e34 > #13 
[ffff881fd0515cf8] shrink_zone at ffffffff811561a6
> #14 [ffff881fd0515d50] balance_pgdat at ffffffff8115744c
> #15 [ffff881fd0515e28] kswapd at ffffffff8115770b
> #16 [ffff881fd0515ec8] kthread at ffffffff81085aef
> #17 [ffff881fd0515f50] ret_from_fork at ffffffff815f206c

We are working on RHEL 7.3 support. We have anecdotal reports that upgrading to RHEL 7.3 gets around this problem. Unfortunately this crash is rare and hard to replicate, so a reliable reproducer is infeasible. RH, what info do you need from Sumeet?

Hi, I find that upstream commit 414e2fb8ce5a999571c21eb2ca4d66e53ddce800 may fix the bug; it is maybe the same as the discussion above, but I'm not sure.

    rmap: fix theoretical race between do_wp_page and shrink_active_list

    As noted by Paul the compiler is free to store a temporary result in a
    variable on stack, heap or global unless it is explicitly marked as
    volatile, see:
    http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html#sample-optimizations

    This can result in a race between do_wp_page() and shrink_active_list()
    as follows.

    In do_wp_page() we can call page_move_anon_rmap(), which sets
    page->mapping as follows:

      anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
      page->mapping = (struct address_space *) anon_vma;

    The page in question may be on an LRU list, because nowhere in
    do_wp_page() we remove it from the list, neither do we take any LRU
    related locks. Although the page is locked, shrink_active_list() can
    still call page_referenced() on it concurrently, because the latter
    does not require an anonymous page to be locked:

      CPU0                          CPU1
      ----                          ----
      do_wp_page                    shrink_active_list
       lock_page                     page_referenced
                                      PageAnon->yes, so skip trylock_page
       page_move_anon_rmap
        page->mapping = anon_vma
                                     rmap_walk
                                      PageAnon->no
                                      rmap_walk_file
                                       BUG
        page->mapping += PAGE_MAPPING_ANON

    This patch fixes this race by explicitly forbidding the compiler to
    split the page->mapping store in page_move_anon_rmap() with the aid of
    WRITE_ONCE.
[akpm: tweak comment, per Minchan]
Signed-off-by: Vladimir Davydov <vdavydov>
Cc: "Paul E. McKenney" <paulmck.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov.com>
Acked-by: Rik van Riel <riel>
Cc: Hugh Dickins <hughd>
Acked-by: Minchan Kim <minchan>
Signed-off-by: Andrew Morton <akpm>
Signed-off-by: Linus Torvalds <torvalds>

diff --git a/mm/rmap.c b/mm/rmap.c
index 24dd3f9..9f47f15 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -950,7 +950,12 @@ void page_move_anon_rmap(struct page *page,
 	VM_BUG_ON_PAGE(page->index != linear_page_index(vma, address), page);
 
 	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
-	page->mapping = (struct address_space *) anon_vma;
+	/*
+	 * Ensure that anon_vma and the PAGE_MAPPING_ANON bit are written
+	 * simultaneously, so a concurrent reader (eg page_referenced()'s
+	 * PageAnon()) will not see one without the other.
+	 */
+	WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
 }
 
 /**

FYI, as of 3.10.0-514.10.2, mm/rmap.c looks like the following:

1115 void page_move_anon_rmap(struct page *page,
1116 		struct vm_area_struct *vma, unsigned long address)
1117 {
1118 	struct anon_vma *anon_vma = vma->anon_vma;
1119 
1120 	VM_BUG_ON(!PageLocked(page));
1121 	VM_BUG_ON(!anon_vma);
1122 	VM_BUG_ON(page->index != linear_page_index(vma, address));
1123 
1124 	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
1125 	page->mapping = (struct address_space *) anon_vma;
1126 }
1127

Hi, does RHEL 7.3 fix this problem? If yes, could you show me the patch? Thanks, Xishi Qiu

(In reply to qiuxishi2 from comment #66)
> (In reply to masanari iida from comment #65)
> (snip)
> hi, does redhat 7.3 fix this problem?
> if yes, could you show me the patch?
>
> Thanks,
> Xishi Qiu

NO. My customer encountered the issue before the RHEL 7.3 release. They didn't update to a RHEL 7.3 kernel. So the customer is waiting for an official statement that the problem is fixed, from either RH or HPE (Vertica).

(In reply to masanari iida from comment #67)
> My customer encountered the issue before RHEL7.3 release.
> They didn't update to RHEL7.3 kernel.
> So the customer waiting for an official statement that
> the problem is fixed from either RH or HPE(Vertica).

Hi, is this problem triggered on a KVM guest OS?

The dump we have (it's Masanari's one) is from a physical system.

(In reply to masanari iida from comment #67)
> My customer encountered the issue before RHEL7.3 release.
> They didn't update to RHEL7.3 kernel.
> So the customer waiting for an official statement that
> the problem is fixed from either RH or HPE(Vertica).

I worked through this and will try to sum up the state:
- No reproducer exists.
- Our data here around the issue seems not good enough to pinpoint the issue.
- We have mentioned some commits in this bz, but porting them (assuming they
  are small enough that they could eventually make it into a 7.2.z kernel;
  this verification has not been done) would then result in a test kernel,
  which someone would have to run and try out. The issue occurs rarely, so
  one would have to run it long enough to be sure it's fixed (i.e. having
  observed the frequency of the previous panics, i.e. bimonthly, and then
  concluding after 6 months that the issue is fixed).
- We have multiple hints that rhel7.3-based kernels fix the issue, plus the
  GA and z-stream kernels from rhel7.3 have gone through full QA.

This issue leads to a panic, so when experiencing the issue the system becomes unusable and has to be rebooted. Considering the above, if no 3rd-party vendor applications force staying on 7.2.z, booting a 7.3 kernel seems like the best option.

(In reply to Christian Horn from comment #70)
> I worked through this and try to sum up the state:

I'd like to add a couple of comments.

> - No reproducer exists.
> - Our data here around the issue seems not good enough to pinpoint the issue.

This is a memory corruption and, as usual in such a case, the symptoms appear after the corruption has already happened, and it's very hard to pinpoint the source of the corruption. If a reproducer were available, however, we could run it on a kernel with kasan enabled and likely catch the corruption as it happens.

> - We have mentioned some commits in this bz, but porting them (assuming they
> are small enough that they could eventually make it in a 7.2.z kernel -
> this verification has not been done) would then result in a testkernel,
> which someone would have to run and try out. The issue occurs rarely, so
> one would have to run it long enough to be sure
> it's fixed (i.e. having observed the frequency of the previous panics, i.e.
> bimonthly, and then concluding after 6 months that the issue is fixed).
> - We have multiple hints that rhel7.3 based kernels fix the issue, plus the
> GA and z-stream kernels from rhel7.3 have gone through full QA.
>
> This issue is leading to a panic, so when experiencing the issue the system
> becomes unusable and has to be rebooted.
> Considering above, if no 3rd party vendor applications are enforcing to stay
> on 7.2.z, booting a 7.3 kernel seems like the best option.

The problem is that Vertica (the only reproducer we know about, still very rare) is (was? any news there, Sumeet?) not officially qualified on 7.3.

And just for the completeness of the BZ notes: while the WRITE_ONCE() submittal mentioned above looks promising and theoretically applicable, in the case of our 7.0 dump the compiler did not split the store, see below:

0xffffffff81176940 page_move_anon_rmap:
    nopl   0x0(%rax,%rax,1)
    movq   0x88(%rsi),%rax
    pushq  %rbp
    movq   %rsp,%rbp
    addq   $0x1,%rax       // anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
    movq   %rax,0x8(%rdi)  // page->mapping = (struct address_space *) anon_vma;
    popq   %rbp
    retq
0xffffffff81176959 end of page_move_anon_rmap+0x19: -------------

(In reply to Jerome Marchand from comment #71)
> I don't see how the commits mentioned above would fix the issue we're seeing
> here.

This comment applies to commits related to bz1341497. I missed the WRITE_ONCE(), which at a first glance seems like a possible fix.

(In reply to Jerome Marchand from comment #73)
> (In reply to Jerome Marchand from comment #71)
> > I don't see how the commits mentioned above would fix the issue we're seeing
> > here.
>
> This comment apply to commits related to bz1341497. I missed the
> WRITE_ONCE() which at a first glance seems like a possible fix.

Hi Jerome, do you mean the following three patches fix this problem?
- [mm] fix anon_vma->degree underflow in anon_vma endless growing prevention (Jerome Marchand) [1341497]
- [mm] fix corner case in anon_vma endless growing prevention (Jerome Marchand) [1341497]
- [mm] prevent endless growth of anon_vma hierarchy (Jerome Marchand) [1341497]

Hi guys, I have found commit 624483f3ea82598 ("mm: rmap: fix use-after-free in __put_anon_vma"); RHEL 7.3 includes it and 7.2 does not.

diff --git a/mm/rmap.c b/mm/rmap.c
index 9c3e773..83bfafa 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1564,10 +1564,9 @@ void __put_anon_vma(struct anon_vma *anon_vma)
 {
 	struct anon_vma *root = anon_vma->root;
 
+	anon_vma_free(anon_vma);
 	if (root != anon_vma && atomic_dec_and_test(&root->refcount))
 		anon_vma_free(root);
-
-	anon_vma_free(anon_vma);
 }

I am not sure that it will resolve the issue; any comments are welcome. Thanks

(In reply to qiuxishi2 from comment #74)
> (In reply to Jerome Marchand from comment #73)
> > (In reply to Jerome Marchand from comment #71)
> > > I don't see how the commits mentioned above would fix the issue we're seeing
> > > here.
> >
> > This comment apply to commits related to bz1341497. I missed the
> > WRITE_ONCE() which at a first glance seems like a possible fix.
>
> Hi Jerome, do you mean the following three patches fix this problem?

No, I mean that I don't see how these three patches could fix it. The other mentioned patch, the one that uses WRITE_ONCE(), seems at first like it could be a fix, but as Stan pointed out, the gcc version we're using doesn't do the optimization that commit 414e2fb8ce5a99 protects against.
> - [mm] fix anon_vma->degree underflow in anon_vma endless growing prevention
> (Jerome Marchand) [1341497]
> - [mm] fix corner case in anon_vma endless growing prevention (Jerome
> Marchand) [1341497]
> - [mm] prevent endless growth of anon_vma hierarchy (Jerome Marchand)
> [1341497]

(In reply to Stan Moravec from comment #72)
> The problem is that Vertica (the only reproducer we know about, still very
> rare) is (was?, any news there Sumeet) not officially qualified on 7.3.
>
> And just for the completeness of BZ notes - while the WRITE_ONCE() submittal
> mentioned above looks promising and theoretically applicable,
> in the case of our 7.0 dump, the compiler did not split the store, see below:
>
> 0xffffffff81176940 page_move_anon_rmap:
>     nopl 0x0(%rax,%rax,1)
>     movq 0x88(%rsi),%rax
>     pushq %rbp
>     movq %rsp,%rbp
>     addq $0x1,%rax       // anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
>     movq %rax,0x8(%rdi)  // page->mapping = (struct address_space *) anon_vma;
>     popq %rbp
>     retq
> 0xffffffff81176959 end of page_move_anon_rmap+0x19: -------------

We have not yet qualified RHEL 7.3. Note that our customer got this in production; we did not find it in QA. So it's unlikely we will find this via qualification of the release per se. We will probably only know after several customers go to production on RHEL 7.3, which may be many months later, assuming we qualify RHEL 7.3 anytime soon. Some customers will use RHEL 7.3 even if it's not qualified yet; I'll let you know if I hear from them or if they run into this issue.

@Jerome, @Sumeet, will it be helpful if we provide HPE a 7.2.z test kernel with the id patch in comment#75, to see if it helps HPE's customer? Thanks!

(In reply to Sumeet Keswani from comment #0)
> Digging into the details of the crashes, the key thing is that the struct
> page being used as a mapping pointer has the bit set to say it's an
> anon_vma, but the page it points to is not allocated from the anon_vma kmem
The crash is somewhat random because it depends on what is on the > page that the mapping pointer refers to. > That might very well be caused by an use-after-free of an anon_vma, the kind that might be fixed by the patch suggested in comment#75 (commit 8270eeba01be in RHEL7). (In reply to Linda Wang from comment #78) > @Jerome, @Sumeet, will it be helpful if we provide HPE a 7.2.z > test kernel with the id patch in comment#75, to see if it helps > HPE's customer? > > Thanks! Definitely. Update from my customer who is suffering from the panic with RHEL7. The customer answered to HPE Japan that they are planning to update the kernel to RHEL7.3 or later version around OCT/2017 or later. Thanks Following up the upstream discussion this should be fixed by upstream commit ad33bb04b2a6cee6c1f99fabb15cddbf93ff0433 which was backported to RHEL6 in commit 43e0d4dd7c717c6cc2aa9d45527d8d443da05ed2 and to RHEL7 in commit dc8b676fe65a66497941275b190e63a2c47d5319. All RHEL6 kernels >= kernel-2.6.32-663.el6 and RHEL7 kernels >= kernel-3.10.0-367.el7 already include the fix. The fix committed to RHEL7 in March 2016 less than a month after the bug was committed upstream. So this is already fixed in production RHEL7 >= 7.3 and RHEL6 >= 6.9 and only RHEL7.2 and earlier can be affected. If this is confirmed it may be reasonable to do a zstream update to older RHEL7. Hi, Unfortunately, this patch(mm: thp: fix SMP race condition between THP page fault and MADV_DONTNEED) didn't help, I got the panic again. And I find this error before panic, "[468229.996610] BUG: Bad rss-counter state mm:ffff8806aebc2580 idx:1 val:1" [468451.702807] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 [468451.702861] IP: [<ffffffff810ac089>] down_read_trylock+0x9/0x30 [468451.702900] PGD 12445e067 PUD 11acaa067 PMD 0 [468451.702931] Oops: 0000 [#1] SMP [468451.702953] kbox catch die event. 
[468451.703003] collected_len = 1047419, LOG_BUF_LEN_LOCAL = 1048576
[468451.703003] kbox: notify die begin
[468451.703003] kbox: no notify die func register. no need to notify
[468451.703003] do nothing after die!
[468451.703003] Modules linked in: ipt_REJECT macvlan ip_set_hash_ipport vport_vxlan(OVE) xt_statistic xt_physdev xt_nat xt_recent xt_mark xt_comment veth ct_limit(OVE) bum_extract(OVE) policy(OVE) bum(OVE) ip_set nfnetlink openvswitch(OVE) nf_defrag_ipv6 gre ext3 jbd ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack bridge stp llc kboxdriver(O) kbox(O) dm_thin_pool dm_persistent_data crc32_pclmul dm_bio_prison dm_bufio ghash_clmulni_intel libcrc32c aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ppdev sg parport_pc cirrus virtio_console parport syscopyarea sysfillrect sysimgblt ttm drm_kms_helper drm i2c_piix4 i2c_core pcspkr ip_tables ext4 jbd2 mbcache sr_mod cdrom ata_generic pata_acpi
[468451.703003] virtio_net virtio_blk crct10dif_pclmul crct10dif_common ata_piix virtio_pci libata serio_raw virtio_ring crc32c_intel virtio dm_mirror dm_region_hash dm_log dm_mod
[468451.703003] CPU: 6 PID: 21965 Comm: docker-containe Tainted: G OE ----V------- 3.10.0-327.53.58.73.x86_64 #1
[468451.703003] Hardware name: OpenStack Foundation OpenStack Nova, BIOS rel-1.8.1-0-g4adadbd-20170107_142945-9_64_246_229 04/01/2014
[468451.703003] task: ffff880692402e00 ti: ffff88018209c000 task.ti: ffff88018209c000
[468451.703003] RIP: 0010:[<ffffffff810ac089>]  [<ffffffff810ac089>] down_read_trylock+0x9/0x30
[468451.703003] RSP: 0018:ffff88018209f8f8  EFLAGS: 00010202
[468451.703003] RAX: 0000000000000000 RBX: ffff880720cd7740 RCX: ffff880720cd7740
[468451.703003] RDX: 0000000000000001 RSI: 0000000000000301 RDI: 0000000000000008
[468451.703003] RBP: ffff88018209f8f8 R08: 00000000c0e0f310 R09: ffff880720cd7740
[468451.703003] R10: ffff88083efd8000 R11: 0000000000000000 R12: ffff880720cd7741
[468451.703003] R13: ffffea000824d100 R14: 0000000000000008 R15: 0000000000000000
[468451.703003] FS:  00007fc0e2a85700(0000) GS:ffff88083ed80000(0000) knlGS:0000000000000000
[468451.703003] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[468451.703003] CR2: 0000000000000008 CR3: 0000000661906000 CR4: 00000000001407e0
[468451.703003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[468451.703003] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[468451.703003] Stack:
[468451.703003]  ffff88018209f928 ffffffff811a7eb5 ffffea000824d100 ffff88018209fa90
[468451.703003]  ffffea00082f9680 0000000000000301 ffff88018209f978 ffffffff811a82e1
[468451.703003]  ffffea000824d100 ffff88018209fa00 0000000000000001 ffffea000824d100
[468451.703003] Call Trace:
[468451.703003]  [<ffffffff811a7eb5>] page_lock_anon_vma_read+0x55/0x110
[468451.703003]  [<ffffffff811a82e1>] try_to_unmap_anon+0x21/0x120
[468451.703003]  [<ffffffff811a842d>] try_to_unmap+0x4d/0x60
[468451.712006]  [<ffffffff811cc749>] migrate_pages+0x439/0x790
[468451.712006]  [<ffffffff81193280>] ? __reset_isolation_suitable+0xe0/0xe0
[468451.712006]  [<ffffffff811941f9>] compact_zone+0x299/0x400
[468451.712006]  [<ffffffff81059aff>] ? kvm_clock_get_cycles+0x1f/0x30
[468451.712006]  [<ffffffff811943fc>] compact_zone_order+0x9c/0xf0
[468451.712006]  [<ffffffff811947b1>] try_to_compact_pages+0x121/0x1a0
[468451.712006]  [<ffffffff8163ace6>] __alloc_pages_direct_compact+0xac/0x196
[468451.712006]  [<ffffffff811783e2>] __alloc_pages_nodemask+0xbc2/0xca0
[468451.712006]  [<ffffffff811bcb7a>] alloc_pages_vma+0x9a/0x150
[468451.712006]  [<ffffffff811d1573>] do_huge_pmd_anonymous_page+0x123/0x510
[468451.712006]  [<ffffffff8119bc58>] handle_mm_fault+0x1a8/0xf50
[468451.712006]  [<ffffffff8164b4d6>] __do_page_fault+0x166/0x470
[468451.712006]  [<ffffffff8164b8a3>] trace_do_page_fault+0x43/0x110
[468451.712006]  [<ffffffff8164af79>] do_async_page_fault+0x29/0xe0
[468451.712006]  [<ffffffff81647a38>] async_page_fault+0x28/0x30
[468451.712006] Code: 00 00 00 ba 01 00 00 00 48 89 de e8 12 fe ff ff eb ce 48 c7 c0 f2 ff ff ff eb c5 e8 42 ff fc ff 66 90 0f 1f 44 00 00 55 48 89 e5 <48> 8b 07 48 89 c2 48 83 c2 01 7e 07 f0 48 0f b1 17 75 f0 48 f7
[468451.712006] RIP  [<ffffffff810ac089>] down_read_trylock+0x9/0x30
[468451.738667]  RSP <ffff88018209f8f8>
[468451.738667] CR2: 0000000000000008

Could you attach to the BZ a couple of `cat /proc/*/smaps` runs taken at different times while the workload that reproduces the problem is running?

I suggest enabling DEBUG_VM=y in your builds if you haven't already; it should not measurably impact performance, and it is a supported config that is also enabled in the -debug kernel (but please keep DEBUG_VM_RB=n because that one is expensive).

We think this problem has been fixed by the commits listed in Comment #83. If you are not running 7.3 or 6.9 and encounter this problem, can you update and try to reproduce it again?

Larry Woodman

Sumeet, do you still see the issue with RHEL 6.9 or RHEL 7.3?

(In reply to Trinh Dao from comment #90)
> Sumeet, do you still see the issue with RHEL 6.9 or RHEL 7.3?

I have not seen it yet on RHEL 7.3, perhaps because a majority of our customers don't stay on the leading edge of releases.
I will update this BZ if it shows up on a more recent kernel.

Hello,
This bug has been copied as 7.4 z-stream (EUS) bug #1496378.
Thank You
Joe Kachuck

Sumeet, since you don't see the issue anymore in comment 91, can I close your bug? You can re-open it if you see it again.

(In reply to qiuxishi2 from comment #84)
> Hi,
>
> Unfortunately, this patch (mm: thp: fix SMP race condition between
> THP page fault and MADV_DONTNEED) didn't help, I got the panic again.
>
> And I find this error before panic, "[468229.996610] BUG: Bad rss-counter
> state mm:ffff8806aebc2580 idx:1 val:1"
>
> [468451.702807] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> [468451.702861] IP: [<ffffffff810ac089>] down_read_trylock+0x9/0x30
> [468451.702900] PGD 12445e067 PUD 11acaa067 PMD 0
> [468451.702931] Oops: 0000 [#1] SMP
> [468451.702953] kbox catch die event.
> [468451.703003] collected_len = 1047419, LOG_BUF_LEN_LOCAL = 1048576
> [468451.703003] kbox: notify die begin
> [468451.703003] kbox: no notify die func register. no need to notify
> [468451.703003] do nothing after die!
> [remainder of the quoted oops trimmed — it duplicates the trace reproduced earlier in this bug]

Hi,

I applied these two patches from RHEL 7.3 ("introduce thp_mmu_gather to pin tail pages during MMU gather" and "put_huge_zero_page() with MMU gather"), and I have not seen the issue since. So I think this problem may be related to these patches too, which means we should add the following patches:

thp: put_huge_zero_page() with MMU gather
thp: introduce thp_mmu_gather to pin tail pages during MMU gather
mm: thp: fix SMP race condition between THP page fault and MADV_DONTNEED

The patch for this issue has been in the kernel since 7.3 devel, so there is nothing to do in 7.5.

Marking HPE verified since the bug is closed now.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.