Bug 1726896 - mm: fix race on soft-offlining free huge pages
Summary: mm: fix race on soft-offlining free huge pages
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.5
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: pre-dev-freeze
Target Release: 7.9
Assignee: Artem Savkov
QA Contact: Li Wang
URL:
Whiteboard:
Depends On: 1726983
Blocks: 1729246
 
Reported: 2019-07-04 03:38 UTC by Li Wang
Modified: 2023-08-08 02:45 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-19 00:34:45 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System: IBM Linux Technology Center    ID: 182884    Last Updated: 2019-12-11 09:49:07 UTC

Description Li Wang 2019-07-04 03:38:57 UTC
Description of problem:

commit 6bc9b56433b76e40d11099338d27fbc5cd2935ca
Author: Naoya Horiguchi <n-horiguchi.nec.com>
Date:   Thu Aug 23 17:00:38 2018 -0700

    mm: fix race on soft-offlining free huge pages
    
    Patch series "mm: soft-offline: fix race against page allocation".
    
    Xishi recently reported an issue about a race on reusing the target pages
    of soft offlining.  Discussion and analysis showed that we need to make
    sure that setting PG_hwpoison is done in the right place under
    zone->lock for soft offline.  1/2 handles the free hugepage case, and 2/2
    handles the free buddy page case.


Without the above patch, ltp/move_pages12 fails on RHEL7 (3.10.0-1059.el7.x86_64.debug) as follows:

# ./move_pages12 
tst_test.c:1100: INFO: Timeout per run is 0h 05m 00s
move_pages12.c:235: INFO: Free RAM 129844088 kB
move_pages12.c:253: INFO: Increasing 2048kB hugepages pool on node 0 to 12
move_pages12.c:263: INFO: Increasing 2048kB hugepages pool on node 1 to 12
move_pages12.c:179: INFO: Allocating and freeing 4 hugepages on node 0
move_pages12.c:179: INFO: Allocating and freeing 4 hugepages on node 1
move_pages12.c:169: PASS: Bug not reproduced
tst_test.c:1145: BROK: Test killed by SIGBUS!

reproducer: https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/move_pages/move_pages12.c

Comment 1 Rafael Aquini 2019-07-24 15:22:43 UTC
May I ask how you are sure that this particular issue is solved by the patch you pointed out?

Also, a couple more questions on this case:
 a) do you have the console logs for the registered failure;
 b) how consistent is your reproducer; and 
 c) when did it start happening? At kernel-3.10.0-1059.el7 or at an earlier build?

Thanks in advance!
-- Rafael

Comment 2 Li Wang 2019-07-25 05:27:41 UTC
(In reply to Rafael Aquini from comment #1)
> May I ask how you are sure that this particular issue is solved by the
> patch you pointed out?

Test #2 (in move_pages12) simulates the race condition where move_pages() and soft offline are called concurrently on a single hugetlb page. However, when testing on the upstream v5.2 kernel, soft-offlining the hugepage being moved returns EBUSY and the test reports FAIL.
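
To make the scenario concrete, here is a minimal sketch of that interleaving. It is not the actual LTP code; the 2048kB hugepage size, the node numbers, and the single-shot calls are assumptions for illustration, and the real test is more elaborate.

/*
 * Minimal sketch of the race described above -- NOT the actual LTP code.
 * One thread migrates a hugetlb page with move_pages(2) while the main
 * thread soft-offlines it with madvise(MADV_SOFT_OFFLINE).
 * Assumptions: 2048kB hugepages are available on nodes 0 and 1, the
 * process runs as root, and libnuma is linked in (-lnuma).
 */
#define _GNU_SOURCE
#include <errno.h>
#include <numaif.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_SOFT_OFFLINE
# define MADV_SOFT_OFFLINE 101          /* from <linux/mman.h> */
#endif

#define HPAGE_SIZE (2UL * 1024 * 1024)  /* assumed 2048kB hugepage size */

static void *hpage;

static void *mover(void *arg)
{
        void *pages[1] = { hpage };
        int node = 1, status = -1;

        (void)arg;
        /* Migrate the hugepage to node 1, racing with the soft offline. */
        if (move_pages(0, 1, pages, &node, &status, MPOL_MF_MOVE_ALL))
                perror("move_pages");
        return NULL;
}

int main(void)
{
        pthread_t t;

        hpage = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (hpage == MAP_FAILED) {
                perror("mmap(MAP_HUGETLB)");
                return 1;
        }
        memset(hpage, 0, HPAGE_SIZE);   /* fault the hugepage in */

        pthread_create(&t, NULL, mover, NULL);

        /* Racing soft offline on the same page. */
        if (madvise(hpage, HPAGE_SIZE, MADV_SOFT_OFFLINE))
                fprintf(stderr, "madvise(MADV_SOFT_OFFLINE): %s\n",
                        strerror(errno));

        pthread_join(t, NULL);
        munmap(hpage, HPAGE_SIZE);
        return 0;
}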

I confirmed with Naoya Horiguchi, and he pointed out that the new fix, commit 6bc9b56433b7 ("mm: fix race on soft-offlining free huge pages"), changes the return value of madvise(MADV_SOFT_OFFLINE): we now see -EBUSY when hugepage migration succeeded but error containment failed. The test treats this EBUSY as an error, but for an application it is actually useful feedback.
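
To illustrate that distinction (a hedged sketch only; the helper name try_soft_offline is made up and is not part of the test or the kernel), an application issuing the madvise call could treat -EBUSY as "the page was migrated but nothing was poisoned" rather than as a hard failure:

#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_SOFT_OFFLINE
# define MADV_SOFT_OFFLINE 101          /* from <linux/mman.h> */
#endif

/*
 * Hypothetical helper illustrating the return-value semantics described
 * above:
 *   0  - page soft-offlined and contained
 *   1  - lost the race (EBUSY): migration happened but containment did
 *        not, so nothing was poisoned; the caller may retry or ignore it
 *  -1  - real failure (EPERM, EIO, ...)
 */
int try_soft_offline(void *addr, size_t len)
{
        if (madvise(addr, len, MADV_SOFT_OFFLINE) == 0)
                return 0;
        if (errno == EBUSY)
                return 1;
        return -1;
}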

He also mentioned that this patch fixes another BZ: a race condition between soft offline and hugetlb_fault which causes unexpected SIGBUS killing of the process and/or hugetlb allocation failure. So I tried it on RHEL7 and got a failure like this:

err_log:
tst_test.c:1096: INFO: Timeout per run is 0h 05m 00s
move_pages12.c:236: INFO: Free RAM 119568 kB
move_pages12.c:254: INFO: Increasing 2048kB hugepages pool on node 0 to 83
move_pages12.c:264: INFO: Increasing 2048kB hugepages pool on node 1 to 94
move_pages12.c:180: INFO: Allocating and freeing 4 hugepages on node 0
move_pages12.c:180: INFO: Allocating and freeing 4 hugepages on node 1
move_pages12.c:170: PASS: Bug not reproduced
tst_test.c:1141: BROK: Test killed by SIGBUS!
move_pages12.c:114: FAIL: move_pages failed: ESRCH

dmesg
[ 9868.180669] MCE: Killing move_pages12:29616 due to hardware memory corruption fault at 2aaaaac00018
[ 9990.049875] Soft offlining page 50e00 at 2aaaaac00000
[ 9990.052218] Soft offlining page 50c00 at 2aaaaae00000
[ 9990.060395] Soft offlining page 51000 at 2aaaaac00000


This patch changes the soft offline semantics so that the PageHWPoison flag is set only after containment of the error page completes successfully.


-               if (PageHuge(page))
-                       dissolve_free_huge_page(page);
+               /*
+                * We set PG_hwpoison only when the migration source hugepage
+                * was successfully dissolved, because otherwise hwpoisoned
+                * hugepage remains on free hugepage list, then userspace will
+                * find it as SIGBUS by allocation failure. That's not expected
+                * in soft-offlining.
+                */
+               ret = dissolve_free_huge_page(page);
+               if (!ret) {
+                       if (set_hwpoison_free_buddy_page(page))
+                               num_poisoned_pages_inc();
+               }


> 
> Also, a couple more questions on this case:
>  a) do you have the console logs for the registered failure;

see above.

>  b) how consistent is your reproducer; and 
>  c) when did it start happening? At kernel-3.10.0-1059.el7 or at an earlier
> build?

Not sure; this reproducer is newly ported to LTP, and I have only run it on RHEL7.7 (kernel-3.10.0-1059.el7) and the mainline v5.2 kernel. I guess RHEL8 needs this fix as well.

If I got anything wrong, feel free to correct me.


Li Wang

Comment 3 Li Wang 2019-07-25 05:32:13 UTC
By the way, here is the original discussion on the LTP mailing list:
  http://lists.linux.it/pipermail/ltp/2019-June/012299.html

Comment 4 Rafael Aquini 2019-07-25 13:17:07 UTC
(In reply to Li Wang from comment #2)
> [...]
> If I got anything wrong, feel free to correct me.

Nope, nothing wrong. I'm just double-checking the facts to be sure that (a) we're really hitting the condition described in the patch fix; and (b) we know whether this is a regression or something that has always been there.

Thanks for the information, Li.

-- Rafael

Comment 7 Jan Stancek 2019-10-24 09:02:45 UTC
Also strange: I'm unable to release most of the hugepages. I ran the move_pages12 test a couple of times and the hugepage count just keeps increasing:

# cat /sys/devices/system/node/node{0,1}/hugepages/hugepages-2048kB/nr_hugepages
38
24

# echo 0 > /proc/sys/vm/nr_hugepages
# echo 0 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
# echo 0 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages

# cat /sys/devices/system/node/node{0,1}/hugepages/hugepages-2048kB/nr_hugepages
36
24

# cat /proc/meminfo  | grep Huge
AnonHugePages:      6144 kB
HugePages_Total:      60
HugePages_Free:       16
HugePages_Rsvd:        0
HugePages_Surp:       24
Hugepagesize:       2048 kB

This doesn't look like a recent issue; I see the same behavior with the 7.7 GA and 7.6 GA kernels.

Comment 8 IBM Bug Proxy 2019-12-11 16:40:31 UTC
------- Comment From mbringm.com 2019-12-11 11:32 EDT-------
Aneesh:
Please take a look at this.

Comment 9 IBM Bug Proxy 2020-01-20 10:20:46 UTC
------- Comment From sadas034.com 2020-01-20 05:12 EDT-------
(In reply to comment #4)
> Also strange: I'm unable to release most of the hugepages. I ran the
> move_pages12 test a couple of times and the hugepage count just keeps
> increasing:
> [...]
> This doesn't look like a recent issue; I see the same behavior with the
> 7.7 GA and 7.6 GA kernels.

With ppc64, this test triggers a kernel crash on 7.7 and older GA kernels as observed in BZ178206. The fix for that is now included but I don't see any issues with freeing huge pages even after running this test several times.

Comment 10 IBM Bug Proxy 2020-03-10 16:52:51 UTC
------- Comment From mbringm.com 2020-03-10 12:43 EDT-------
RedHat: Any updates on this one?

