Bug 1973556

Summary:	RHEL9-Beta: aarch64: BUG: Bad page state in process modprobe pfn:14000
Product:	Red Hat Enterprise Linux 9	Reporter:	Chunyu Hu <chuhu>
Component:	kernel	Assignee:	mm-maint-bot <mm-maint>
kernel sub component:	Memory Management	QA Contact:	Chunyu Hu <chuhu>
Status:	CLOSED CURRENTRELEASE	Docs Contact:
Severity:	low
Priority:	unspecified	CC:	arozansk, ddutile, drjones, eric.auger, gshan, lcapitulino, liwan, msalter, pbunyan, pifang
Version:	9.0	Keywords:	Triaged
Target Milestone:	beta	Flags:	pm-rhel: mirror+
Target Release:	9.0
Hardware:	aarch64
OS:	Linux
Whiteboard:
Fixed In Version:	kernel-5.14.0-32.el9	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-01-20 08:05:55 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1924294

Comment 1 Guowen Shan 2021-06-23 05:55:30 UTC

I found that the bug had been reassigned to the backlog before I could post
an update.

First of all, the BUG is triggered because the PG_arch_1 page flag isn't
cleared on the page, which is fetched from PCP or buddy. It might be
related to the following upstream commit, which was merged into v5.13-rc2.

588a513d34257 ("arm64: Fix race condition on PG_dcache_clean in __sync_icache_dcache()")

As the issue was found on 5.13.0-0.rc4, it's worth checking whether the same
issue exists on 5.13.0-0.rc1. Chunyu, could you help with this? I guess
the issue is hard for a developer to reproduce, as it seems to happen
randomly.

Thanks,
Gavin

Comment 2 Chunyu Hu 2021-06-23 12:27:52 UTC

(In reply to Guowen Shan from comment #1)
> I found the bug was reassigned back to the backlog before I'm going to
> update.
> 
> First of all, the BUG_ON is invoked because PG_arch_1 page flag isn't
> cleared on the page, which is fetched from PCP or buddy. It might be
> relevant to the following upstream commit, which is merged into v5.13-rc2.
> 
> 588a513d34257 ("arm64: Fix race condition on PG_dcache_clean in
> __sync_icache_dcache()")
> 
> As the issue is found on 5.13.0-0.rc4, it's worthy to check same issue
> exists on 5.13.0-0.rc1 or not. Chunyu, could you help on this? I guess

We don't have the rc1 version in brew; the first version for 5.13 in brew is rc2:
http://download-node-02.eng.bos.redhat.com/brewroot/packages/kernel/5.13.0/0.rc2.19.el9/

Is this version worth testing?

> the issue is hard to be reproduced by developer as it seems happening
> randomly.
> 
> Thanks,
> Gavin

Comment 3 Guowen Shan 2021-06-24 00:39:12 UTC

Chunyu, yes, please try rc2 as well. If I'm correct, the issue first
appears in rc2.

Comment 5 Guowen Shan 2021-07-02 09:58:59 UTC

First of all, thanks to Chuchu for helping me to reproduce the issue locally.
The issue can be reproduced with the latest RHEL9.0 source code and the debug
config. I used "memhog" to reproduce the issue on a VM with the following
configuration.

   vCPU:   24
   memory: 8GB

After the VM boots up completely, running "memhog 6G -t 10" triggers the error
that Chuchu reported:

[  410.010288] BUG: Bad page state in process memhog  pfn:08000
[  410.019481] page:0000000015c0a628 refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x8000
[  410.021521] flags: 0x7ffff8000000800(arch_1|node=0|zone=0|lastcpupid=0xfffff)
[  410.023100] raw: 07ffff8000000800 dead000000000100 dead000000000122 0000000000000000
[  410.024812] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
[  410.026527] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag(s) set
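
To make the "flags:" line above easier to read, here is a minimal sketch that decodes the raw flags word. The bit position of PG_arch_1 (bit 11) is an assumption inferred from this log, where the low bits 0x800 are printed as "arch_1"; the real layout depends on the kernel version and config. The helper name is hypothetical.

```python
# Assumption: PG_arch_1 is bit 11, inferred from the log where 0x800 prints as "arch_1".
PG_ARCH_1_BIT = 11


def has_arch_1(flags: int) -> bool:
    """Return True if the (assumed) PG_arch_1 bit is set in a raw page flags word."""
    return bool(flags & (1 << PG_ARCH_1_BIT))


flags = 0x7ffff8000000800   # value copied from the "flags:" line above
low_bits = flags & 0xFFFF   # keep only the low 16 bits, where the per-page flag bits live
print(hex(low_bits), has_arch_1(flags))
```

In this dump the only per-page flag left set is the one printed as arch_1, which is what the PAGE_FLAGS_CHECK_AT_PREP message complains about.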

The root cause is corruption of the page flags (PG_arch_1) by
mm/debug_vm_pgtable.c::debug_vm_pgtable() on pages that sit in the buddy
allocator's free area list. This function selects two addresses: the (userspace)
virtual address is a random one, derived from a random seed and TASK_SIZE, and the
physical address is the one corresponding to the kernel symbol (@start_kernel).
In addition, the page protection is fixed to (VM_READ|VM_WRITE|VM_EXEC).

During the PMD tests, the selected physical address is aligned to the huge page size,
which is 512MB in our case. After that, set_pmd_at() is called to populate the
PMD entry. Because VM_EXEC is used, PG_arch_1 is set on the head page of the compound
page. Unfortunately, this page is already on the buddy allocator's free area list.
Later, 'memhog' causes the page to be allocated from buddy, and the page flag check
fails. This is how the warning is raised.
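
The alignment step can be checked against the reported pfn. This is a minimal sketch, assuming a 64KB base page size (a page shift of 16), which is what a 512MB PMD-level huge page implies on arm64. The helper names are hypothetical.

```python
PAGE_SHIFT = 16                 # assumption: 64KB base pages
PMD_SIZE = 512 * 1024 * 1024    # 512MB huge page size, as stated in the comment above


def pfn_to_phys(pfn: int) -> int:
    """Convert a page frame number to a physical address."""
    return pfn << PAGE_SHIFT


def align_down(addr: int, size: int) -> int:
    """Round addr down to a multiple of size (size must be a power of two)."""
    return addr & ~(size - 1)


phys = pfn_to_phys(0x8000)      # pfn from the memhog report
print(hex(phys), phys == align_down(phys, PMD_SIZE))
```

Under this assumption, both the pfn in this report (0x8000) and the one in the bug summary (0x14000) map to 512MB-aligned physical addresses, which is consistent with the flag landing on the head page of a PMD-sized region.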

I already have a temporary fix that calls flush_dcache_page() after each call to
set_{pgd, pud, pmd, pte}_at() so that the unexpected page flag PG_arch_1
is cleared in time. With the temporary fix applied to RHEL9.0, I no longer see
the warning with 'memhog 6G -t 10'. I will post the patch to the community
for comments soon.
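
The failure and the temporary fix can be illustrated with a toy model. This is not kernel code: ToyPage, toy_set_pmd_at(), toy_flush_dcache_page() and toy_alloc_check() are hypothetical stand-ins for the behaviour described in this comment, and the bit position of PG_arch_1 is an assumption inferred from the log.

```python
from dataclasses import dataclass

PG_ARCH_1 = 1 << 11  # assumption: bit position inferred from the log in this comment


@dataclass
class ToyPage:
    flags: int = 0


def toy_set_pmd_at(page: ToyPage, executable: bool) -> None:
    # Models the side effect described above: an executable mapping marks the page.
    if executable:
        page.flags |= PG_ARCH_1


def toy_flush_dcache_page(page: ToyPage) -> None:
    # Models the temporary fix: the stray flag is cleared again.
    page.flags &= ~PG_ARCH_1


def toy_alloc_check(page: ToyPage) -> bool:
    # Models the allocation-time check: the page must not carry the stray flag.
    return (page.flags & PG_ARCH_1) == 0


buggy = ToyPage()
toy_set_pmd_at(buggy, executable=True)
print("without flush:", toy_alloc_check(buggy))

fixed = ToyPage()
toy_set_pmd_at(fixed, executable=True)
toy_flush_dcache_page(fixed)
print("with flush:", toy_alloc_check(fixed))
```

The model only captures the ordering: the flag is set as a side effect of the test mapping, and the check passes only if something clears it before the page is handed out again.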

Comment 6 Guowen Shan 2021-07-02 10:37:47 UTC

The upstream patch has been posted for comments. Let's see what feedback
it gets.

https://marc.info/?l=linux-kernel&m=162522196632075&w=2

Comment 7 Guowen Shan 2021-07-10 04:09:02 UTC

The previously posted patch has been replaced by a subsequent series that
enhances mm/debug_vm_pgtable, and Andrew Morton has started merging it
into the '-mm' tree. However, I haven't received any comments from the
maintainer yet.

   https://lkml.org/lkml/2021/7/6/41

Comment 8 Luiz Capitulino 2021-07-19 15:40:06 UTC

*** Bug 1983255 has been marked as a duplicate of this bug. ***

Comment 9 Guowen Shan 2021-07-28 06:36:14 UTC

The upstream v4 series has been posted for review, mostly to address comments
received from Anshuman (ARM).

https://lkml.org/lkml/2021/7/27/116

Comment 10 Guowen Shan 2021-09-05 23:42:34 UTC

Chunyu, could you please reassign this bug to a memory management developer
to backport the series to our downstream RHEL9.0.0?

The v6 series has been merged upstream (v5.15-rc1).

https://lkml.org/lkml/2021/8/13/227

8c5b3a8adad2 mm/debug_vm_pgtable: fix corrupted page flag
fda88cfda1ab mm/debug_vm_pgtable: remove unused code
2f87f8c39a91 mm/debug_vm_pgtable: use struct pgtable_debug_args in PGD and P4D modifying tests
4cbde03bdb0b mm/debug_vm_pgtable: use struct pgtable_debug_args in PUD modifying tests
c0fe07b0aa72 mm/debug_vm_pgtable: use struct pgtable_debug_args in PMD modifying tests
44966c4480f8 mm/debug_vm_pgtable: use struct pgtable_debug_args in PTE modifying tests
4878a888824b mm/debug_vm_pgtable: use struct pgtable_debug_args in migration and thp tests
5f447e8067fd mm/debug_vm_pgtable: use struct pgtable_debug_args in soft_dirty and swap tests
8cb183f2f2a0 mm/debug_vm_pgtable: use struct pgtable_debug_args in protnone and devmap tests
8983d231c7cc mm/debug_vm_pgtable: use struct pgtable_debug_args in leaf and savewrite tests
36b77d1e1592 mm/debug_vm_pgtable: use struct pgtable_debug_args in basic tests
3c9b84f044a9 mm/debug_vm_pgtable: introduce struct pgtable_debug_args

Comment 11 Chunyu Hu 2021-09-06 01:50:38 UTC

(In reply to Guowen Shan from comment #10)
> Chunyu, could you please reassign this bug to memory management developer
> to backport the series to our downstream RHEL9.0.0?

Reassigned to the kernel memory management team. Thanks for the work.

Comment 12 Don Dutile (Red Hat) 2022-01-18 05:44:29 UTC

Chunyu,
All of the patches in the series were pulled into RHEL-9.0 when Rafael did the v5.15 kernel-mm update.

Can you re-test with kernel-5.14.0-32.el9 or higher?
If all works, you can close this bz with "Fixed In Version" set to the above kernel version.

Thanks.

Comment 13 Chunyu Hu 2022-01-18 06:12:17 UTC

(In reply to Don Dutile (Red Hat) from comment #12)
> Chunyu,
> All of the patches related to the series was pulled into RHEL-9.0 when
> Rafael did the v5.15 kernel-mm update.
> 
> Can you re-test with kernel-5.14.0-32.el9 or higher?
> If all works, you can close this bz as with Fixed in version set to above
> kernel version.

Thanks for the info.  I'll do that in CTC2 test cycle. 

> 
> Thanks.

Comment 14 Chunyu Hu 2022-01-20 08:05:55 UTC

Retested with 5.14.0-44.el9.aarch64+debug; the issue is fixed. No 'Bad page' message during boot.
https://beaker.engineering.redhat.com/jobs/6201338