Bug 113522

Summary: Bigpages and sshd crash
Product: Red Hat Enterprise Linux 2.1
Reporter: Dave Anderson <anderson>
Component: kernel
Assignee: Dave Anderson <anderson>
Status: CLOSED ERRATA
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 2.1
CC: jbaron, kambiz, riel, tburke
Hardware: i686   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2005-09-28 16:13:56 UTC
Attachments:
downloadable version of posted patch (flags: none)

Description Dave Anderson 2004-01-14 21:55:19 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030611

Description of problem:

As reported by Joel Becker on oracle-list: 

> Folks,
>	Some time ago, a customer (NASA) started experiencing
> crashes.  The offending process was always sshd, and the trace
> always showed the BUG() in mm/page_alloc.c:__free_pages_ok():
>
>	if (BigPage(page))
>	        BUG();
>
> Some further examination shows that __free_pages_ok() is called
> down from zap_pte_range, which is called by zap_pmd_range().  In
> zap_pmd_range() there is this snippet:
>
>    if (pmd_bigpage(*pmd))
>	    pmd_clear(pmd);
>    else
>	    freed += zap_pte_range(tlb, pmd, address, end - address);
>
> Because zap_pte_range() eventually calls __free_pages_ok(), and
> that function explicitly asserts that BigPage() is false, it thus
> follows that if pmd_bigpage() is false, BigPage() MUST be false.
> However, when the BUG() is seen by the customer, it is obvious
> that pmd_bigpage() is false but  BigPage() is true.

The last sentence narrows down the problem: a 4k page from within
a 2MB bigpage is being mapped incorrectly.  Instead of
using a single PMD entry with the PSE (page size) bit set to map a
whole 2MB bigpage, a single 4k page from a 2MB bigpage is being allocated
and mapped through a third-level page table (PTE) entry.  As a result, when
the address space is freed, pmd_bigpage() is false, but BigPage() is true.
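
The mismatch can be illustrated with a small userspace model.  This is
only a sketch: the structs and functions below (toy_page, toy_pmd,
toy_free_page_ok, toy_zap_pmd) are hypothetical stand-ins, not kernel
data structures.  Only the control flow mirrors the zap_pmd_range()
snippet and the BigPage() check quoted above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; this is not kernel code. */
struct toy_page {
    bool bigpage;                 /* models BigPage(page) */
};

struct toy_pmd {
    bool maps_bigpage;            /* models pmd_bigpage(*pmd) */
    struct toy_page *pte_target;  /* page reached via a third-level PTE */
};

/* Models the assertion in __free_pages_ok(): returns false where the
 * kernel would hit BUG(). */
static bool toy_free_page_ok(const struct toy_page *page)
{
    return !page->bigpage;
}

/* Models zap_pmd_range(): a bigpage PMD is simply cleared, anything
 * else is torn down through the PTE path.  Returns false if the
 * teardown would have hit BUG(). */
static bool toy_zap_pmd(struct toy_pmd *pmd)
{
    bool ok;

    if (pmd->maps_bigpage) {
        pmd->maps_bigpage = false;    /* pmd_clear() */
        return true;
    }
    if (pmd->pte_target == NULL)
        return true;                  /* nothing mapped */

    ok = toy_free_page_ok(pmd->pte_target);
    pmd->pte_target = NULL;
    return ok;
}
```

In this model, a toy_pmd with maps_bigpage false whose pte_target has
bigpage true is the state described above, and it is the only case in
which toy_zap_pmd() returns false.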

As it turns out, this anomaly occurs when mapping anonymous shared
memory while /proc/sys/kernel/shm-use-bigpages is set to 1.  Using a
bigpage should only be acceptable when the shared memory segment is
exactly the size of a bigpage, as is the case when using bigpages to map
segments of a file.  However, the following mmap() call, for example,
requests only a 4k shared anonymous page, yet ends up
taking a single 4k page (the first one) from the next available
2MB bigpage:

  addr = mmap(0, 4096, PROT_WRITE|PROT_READ, MAP_ANON|MAP_SHARED,
              -1, 0);

It leaves a state such that, for the address at which the page is mapped,
pmd_bigpage() is false, but BigPage() is true.


Version-Release number of selected component (if applicable):
kernel-enterprise-2.4.9-e.34

How reproducible:
Always

Steps to Reproduce:
1. Set /proc/sys/kernel/shm-use-bigpages to 1.
2. Allocate some number of bigpages on the boot command line.
3. Write a C program with the mmap() call above.
4. Touch the address returned by the mmap() call.
5. Note the drop in nr_bigpages with the crash utility, or in the
   BigPagesFree value in /proc/meminfo.
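
Steps 3 through 5 might be sketched as follows.  This is a minimal
sketch, not the original test program: the helper names
(read_meminfo_field, map_and_touch_4k, repro_once) are made up for
illustration, and the assumed "Name: value" layout of the BigPagesFree
line should be checked against the running kernel.  On a kernel without
that field the reader simply returns -1.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANON under strict -std= modes */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Return the numeric value of a "Name: value" line in a
 * /proc/meminfo-style file, or -1 if the field is absent. */
static long read_meminfo_field(const char *path, const char *name)
{
    FILE *fp = fopen(path, "r");
    char line[256];
    long value = -1;
    size_t len = strlen(name);

    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof(line), fp) != NULL) {
        if (strncmp(line, name, len) == 0 && line[len] == ':') {
            if (sscanf(line + len + 1, "%ld", &value) != 1)
                value = -1;
            break;
        }
    }
    fclose(fp);
    return value;
}

/* Step 3 and 4: map one shared anonymous 4k page with the call from
 * the report, then touch it.  Returns the address, or NULL on error. */
static void *map_and_touch_4k(void)
{
    char *addr = mmap(0, 4096, PROT_WRITE|PROT_READ,
                      MAP_ANON|MAP_SHARED, -1, 0);

    if (addr == MAP_FAILED)
        return NULL;
    addr[0] = 1;
    return addr;
}

/* Step 5: print BigPagesFree before and after the mapping is touched.
 * The mapping is intentionally left in place; it is torn down when
 * the process exits.  Returns 0 on success, -1 if mmap() failed. */
static int repro_once(void)
{
    long before = read_meminfo_field("/proc/meminfo", "BigPagesFree");
    void *addr = map_and_touch_4k();
    long after = read_meminfo_field("/proc/meminfo", "BigPagesFree");

    if (addr == NULL) {
        perror("mmap");
        return -1;
    }
    printf("BigPagesFree before: %ld, after: %ld\n", before, after);
    return 0;
}
```

A main() that does nothing but call repro_once() completes the test
program.  Running several instances at once should make the drop easier
to see.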
    

Actual Results:  
A bigpage is allocated to each instance of the test program
until all of the bigpages are used up (after which subsequent mmap()
callers do the right thing).

Expected Results:  
Bigpages should not be used for mmaps unless the request is the
same size as a bigpage.

Additional info:

Comment 1 Dave Anderson 2004-01-14 22:03:17 UTC
The following patch has been sent to Oracle for verification.  For
shared anonymous memory map requests, it validates the request for
bigpage usage first via shmem_make_bigpage_mmap(), before allowing
shm_enable_bigpages() to be called:

--- linux/mm/shmem.c.orig	Tue Jan 13 15:16:21 2004
+++ linux/mm/shmem.c	Tue Jan 13 15:48:12 2004
@@ -52,6 +52,7 @@
 atomic_t shmem_nrpages = ATOMIC_INIT(0);
 
 int shm_use_bigpages;
+#define MAX_BIGPAGES (16384)
 
 #define BLOCKS_PER_PAGE (PAGE_CACHE_SIZE/512)
 
@@ -828,8 +829,9 @@
 int shmem_make_bigpage_mmap(struct file * file, struct vm_area_struct * vma)
 {
 	struct inode *inode = file->f_dentry->d_inode;
-	unsigned long pages;
+	unsigned long pages, max_bigpages;
 	shmem_info_t *info;
+	struct shmem_sb_info *sbinfo;
 	int bigpage;
 
 	/*
@@ -858,7 +860,11 @@
 
 	pages = (vma->vm_end - vma->vm_start) / PAGE_SIZE + vma->vm_pgoff;
 	pages >>= (BIGPAGE_SHIFT - PAGE_SHIFT);
-	if (pages > info->max_bigpages)
+        sbinfo = SHMEM_SB(inode->i_sb);
+        max_bigpages = sbinfo->max_blocks >> (BIGPAGE_SHIFT - PAGE_CACHE_SHIFT);
+        if (max_bigpages > MAX_BIGPAGES)
+                max_bigpages = MAX_BIGPAGES;
+        if (pages > max_bigpages)
 		return -ENOSPC;
 
 	vma->vm_flags |= VM_LOCKED;
@@ -1587,9 +1592,9 @@
 }
 
 /*
- * Limit kmalloc() size to a max of 64K. This covers 32 GB 2MB pages.
+ * MAX_BIGPAGES (16384) limits the kmalloc() size to a max of 64K. 
+ * This covers 32 GB 2MB pages.
  */
-#define MAX_BIGPAGES (16384)
 
 void shm_enable_bigpages(struct inode *inode, struct file *filp, size_t size, int prealloc)
 {
@@ -1823,12 +1828,18 @@
 {
 	struct file *file;
 	loff_t size = vma->vm_end - vma->vm_start;
+	unsigned long vm_flags = vma->vm_flags;
+	int error;
 	
 	file = shmem_file_setup("dev/zero", size);
 	if (IS_ERR(file))
 		return PTR_ERR(file);
-	if (shm_use_bigpages)
+	if (shm_use_bigpages && !(error = shmem_make_bigpage_mmap(file, vma)) 
+		&& (vma->vm_flags & VM_BIGPAGE)) {
 		shm_enable_bigpages(file->f_dentry->d_inode, file, BIGPAGE_SIZE, 0);
+		if (!I_BIGPAGE(file->f_dentry->d_inode))
+                	vma->vm_flags = vm_flags;
+	}
 
 	if (vma->vm_file)
 		fput (vma->vm_file);
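
The size arithmetic in the shmem_make_bigpage_mmap() hunk can be checked
in userspace.  The sketch below is not kernel code: the TOY_ constants
and function names are hypothetical, and the shift values assume 4k base
pages, 2MB bigpages, and PAGE_CACHE_SHIFT equal to PAGE_SHIFT.

```c
/* Assumed values: 4k base pages and 2MB bigpages. */
#define TOY_PAGE_SHIFT     12
#define TOY_BIGPAGE_SHIFT  21
#define TOY_MAX_BIGPAGES   16384UL

/* Bigpage count for a mapping, following the patch context:
 * (bytes / PAGE_SIZE + vm_pgoff) >> (BIGPAGE_SHIFT - PAGE_SHIFT). */
static unsigned long toy_bigpages_needed(unsigned long vm_start,
                                         unsigned long vm_end,
                                         unsigned long vm_pgoff)
{
    unsigned long pages = ((vm_end - vm_start) >> TOY_PAGE_SHIFT) + vm_pgoff;

    return pages >> (TOY_BIGPAGE_SHIFT - TOY_PAGE_SHIFT);
}

/* Bigpage limit derived from the filesystem block limit and clamped
 * to MAX_BIGPAGES, as the added lines of the hunk do. */
static unsigned long toy_max_bigpages(unsigned long max_blocks)
{
    unsigned long max = max_blocks >> (TOY_BIGPAGE_SHIFT - TOY_PAGE_SHIFT);

    return max > TOY_MAX_BIGPAGES ? TOY_MAX_BIGPAGES : max;
}

/* Bytes covered by n bigpages; 64-bit so that 32 GB does not overflow
 * on a 32-bit build. */
static unsigned long long toy_bigpage_bytes(unsigned long n)
{
    return (unsigned long long)n << TOY_BIGPAGE_SHIFT;
}
```

With these assumptions, toy_bigpage_bytes(16384) is 32 GB, matching the
comment in the patch, and a 4 MB mapping at offset 0 needs 2 bigpages.
A 4096-byte request rounds down to 0 bigpages in this calculation.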

Comment 2 Dave Anderson 2004-01-19 16:54:11 UTC
Patch was accepted/tested at Oracle.

Comment 3 Dave Anderson 2004-01-19 17:02:07 UTC
Created attachment 97103
downloadable version of posted patch