Bug 1219197 - Xen BUG at page_alloc.c:1738
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: gcc
Version: 22
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Jakub Jelinek
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2015-05-06 19:48 UTC by Major Hayden 🤠
Modified: 2015-06-18 14:19 UTC (History)

Fixed In Version: gcc-5.1.1-3.fc22
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-06-18 14:19:00 UTC
Type: Bug
Embargoed:


Attachments
preprocessed source (880.92 KB, text/plain), 2015-06-06 14:06 UTC, Michael Young


Links
System                   ID     Private  Priority  Status  Summary  Last Updated
GNU Compiler Collection  66444  0        None      None    None     Never
XenSource                1908   0        None      None    None     Never

Description Major Hayden 🤠 2015-05-06 19:48:07 UTC
I've installed the Xen hypervisor packages on Fedora 22 but I'm getting a panic early during the boot process:

(XEN) Xen call trace:
(XEN)    [<ffff82d08011d160>] free_domheap_pages+0x240/0x430
(XEN)    [<ffff82d08018c944>] mmio_ro_do_page_fault+0x114/0x160
(XEN)    [<ffff82d0801a4c10>] do_page_fault+0x1a0/0x4f0
(XEN)    [<ffff82d080239768>] handle_exception_saved+0x2e/0x6c
(XEN) 
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Xen BUG at page_alloc.c:1738
(XEN) ****************************************

Full output: https://gist.github.com/major/baa0e2eee7de51a2bcd1

Packages in use:

 * kernel-4.0.1-300.fc22.x86_64
 * xen-4.5.0-8.fc22.x86_64

I'm able to reproduce the failure on Dell/HP physical servers as well as within a KVM virtual machine (with nested virt enabled).  I can't tell if this is a bug in the Linux kernel or within Xen.  I'll be glad to reclassify the component in the bug if someone knows this better than I do.

Comment 1 Major Hayden 🤠 2015-05-06 19:55:59 UTC
FWIW, the error is identical with kernel-4.0.0-0.rc5.git4.1.fc22.x86_64.

Comment 2 Josh Boyer 2015-05-06 20:04:37 UTC
The output is from Xen, so we'll start there.

Comment 3 Major Hayden 🤠 2015-05-06 20:34:50 UTC
The same error appears when using these kernels as well:

 * kernel-3.19.5-200.fc21.x86_64
 * kernel-3.18.8-201.fc21.x86_64
 * kernel-3.17.8-300.fc21.x86_64

Comment 4 Michael Young 2015-05-07 23:09:34 UTC
The crash occurs at the line
                BUG_ON((pg[i].u.inuse.type_info & PGT_count_mask) != 0);
in xen/common/page_alloc.c.
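
For readers unfamiliar with the check, the following is a minimal sketch of what that BUG_ON appears to guard against: freeing a page whose type use count is still nonzero. The struct, the helper name, and the 16-bit mask value are hypothetical stand-ins chosen for illustration; they are not Xen's actual definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the low 16 bits of type_info hold the type use
 * count. Xen's real PGT_count_mask is defined differently. */
#define SKETCH_COUNT_MASK ((uint64_t)0xffff)

struct sketch_page_info {
    uint64_t type_info;   /* type bits combined with a use count */
};

/* Returns the index of the first page that still has type references,
 * or -1 if every page in the range is safe to free. The real code
 * calls BUG_ON() at this point instead of returning an index. */
long sketch_first_page_with_type_refs(const struct sketch_page_info *pg,
                                      size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if ((pg[i].type_info & SKETCH_COUNT_MASK) != 0)
            return (long)i;
    }
    return -1;
}
```

A page whose count bits are all zero passes the check regardless of its type bits; any leftover count makes the free path give up.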

Comment 5 Major Hayden 🤠 2015-05-20 17:07:41 UTC
Jan suggested on xen-devel that gcc 5.0.1 might be to blame[1].  Is Xen 4.5 working for anyone else on Fedora 22's latest package/kernel set?

[1] http://lists.xen.org/archives/html/xen-devel/2015-05/msg02604.html

Comment 6 Michael Young 2015-05-25 18:19:45 UTC
Yes, it looks like gcc (or something else in the build chain). My newly updated F22 system won't boot in xen (4.5.0-8 or 4.5.1-rc1) but will boot with the 4.5.1-rc1 xen.gz file built on F21.

Comment 7 Michael Young 2015-05-31 16:41:56 UTC
From the thread http://marc.info/?l=xen-devel&m=143292326301633&w=2 on the xen-devel list

GCC 5 is indeed miscompiling the code. Comparing the fc21 vs fc22 builds:

The C snippet from mmio_ro_do_page_fault():

struct page_info *page = mfn_to_page(mfn);
struct domain *owner = page_get_owner_and_reference(page);
if ( owner )
    put_page(page);

In fc21 is:

movabs $0xffff82e000000000,%rbp
shr    %cl,%rax
or     %rdx,%rax
shl    $0x5,%rax
add    %rax,%rbp
mov    %rbp,%rdi
callq  ffff82d080186900 <page_get_owner_and_reference>
test   %rax,%rax
mov    %rax,%r12
je     ffff82d080189c4e <mmio_ro_do_page_fault+0x11e>
mov    %rbp,%rdi
callq  ffff82d080188ec0 <put_page>

and in fc22 is:

movabs $0xffff82e000000000,%r8
shr    %cl,%rax
or     %rdx,%rax
shl    $0x5,%rax
lea    (%r8,%rax,1),%rdi
callq  ffff82d0801874f0 <page_get_owner_and_reference>
test   %rax,%rax
mov    %rax,%rbp
je     ffff82d08018ca14 <mmio_ro_do_page_fault+0x114>
mov    %r8,%rdi
callq  ffff82d080189a90 <put_page>

"lea (%r8,%rax,1),%rdi" in FC22 is slightly shorter than "add %rax,%rbp;
mov %rbp,%rdi" in FC21.  In both cases %rdi is now 'page' from the C
snippet.

In FC21, the result is stored in %rbp, then reloaded from %rbp into %rdi
for call to put_page().

However, in FC22, the result of the calculation is only held in %rdi,
and clobbered by the call to page_get_owner_and_reference().  When it
comes to call put_page(), %r8 is reloaded, which is still a pointer to
the base of the frametable, not the page we actually took a reference on.

FC22 is miscompiling the C to:

struct page_info *page = mfn_to_page(mfn);
struct domain *owner = page_get_owner_and_reference(page);
if ( owner )
    put_page(mfn_to_page(0));

which is wrong, and why free_domheap_pages() does legitimately complain
about the wonky refcount.
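
The effect of the substituted argument can be modelled in plain C. This is a simplified sketch, not Xen code: model_page, the model_* helpers, and the single signed count are assumptions made for illustration. It suggests why the counts end up inconsistent: the intended sequence leaves every count unchanged, while dropping the reference on entry 0 leaves the target page one too high and entry 0 one too low.

```c
#include <stddef.h>

#define MODEL_NR_PAGES 8

/* Simplified stand-in for a frame table entry: one signed counter. */
struct model_page {
    int count;
};

static struct model_page model_frametable[MODEL_NR_PAGES];

struct model_page *model_mfn_to_page(size_t mfn)
{
    return &model_frametable[mfn];
}

/* Take a reference; returns nonzero on success (always, in this model). */
int model_get_ref(struct model_page *pg)
{
    pg->count++;
    return 1;
}

/* Drop a reference. */
void model_put_ref(struct model_page *pg)
{
    pg->count--;
}

/* What the C source asks for: get and put on the same page. */
void model_balanced(size_t mfn)
{
    struct model_page *page = model_mfn_to_page(mfn);
    if (model_get_ref(page))
        model_put_ref(page);
}

/* What the F22 object code effectively does: put on the table base. */
void model_miscompiled(size_t mfn)
{
    struct model_page *page = model_mfn_to_page(mfn);
    if (model_get_ref(page))
        model_put_ref(model_mfn_to_page(0));
}
```

Calling model_miscompiled(5) on a zeroed table leaves entry 5 with a count of 1 and entry 0 with a count of -1, which is the kind of imbalance a later sanity check would trip over.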


Further testing links this to the -fcaller-saves option: if the file is built with -fno-caller-saves on F22, the generated code goes back to the F21 version. Possibly the mov %r8,%rdi instruction is the incorrect one.
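
When narrowing down a suspected miscompile like this, the flag can also be applied to a single function rather than a whole file. The sketch below assumes GCC's optimize function attribute, which accepts -f option names without the -f prefix and which the GCC manual recommends for debugging only. The probe_* names are hypothetical, and the function only mirrors the shape of the snippet (compute a pointer, call out, reuse the pointer); it is not a confirmed reproducer of the bug.

```c
#include <stddef.h>

#if defined(__GNUC__) && !defined(__clang__)
/* GCC only: per-function equivalent of building with -fno-caller-saves. */
#define PROBE_NO_CALLER_SAVES __attribute__((optimize("no-caller-saves")))
#else
#define PROBE_NO_CALLER_SAVES
#endif

struct probe_page {
    int count;
};

static struct probe_page probe_table[16];

/* noinline keeps real calls in place, so the pointer must survive them. */
__attribute__((noinline))
struct probe_page *probe_get(struct probe_page *pg)
{
    pg->count++;
    return pg;
}

__attribute__((noinline))
void probe_put(struct probe_page *pg)
{
    pg->count--;
}

/* Same shape as the snippet from mmio_ro_do_page_fault(): the pointer is
 * computed as base plus scaled index, passed to one call, then reused. */
PROBE_NO_CALLER_SAVES
int probe_touch(size_t idx)
{
    struct probe_page *page = &probe_table[idx];
    if (probe_get(page))
        probe_put(page);    /* must receive the same pointer as probe_get */
    return page->count;
}
```

Comparing the disassembly of such a function with and without the attribute is one way to see whether the register holding the pointer is preserved across the first call.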

Comment 8 Jakub Jelinek 2015-06-06 10:29:37 UTC
Please attach preprocessed source in which this happens and provide full gcc command line used to compile this file.

Comment 9 Michael Young 2015-06-06 14:06:46 UTC
Created attachment 1035629
preprocessed source

The full compile line (with some duplications removed) is
gcc -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches  -m64 -mtune=generic -fomit-frame-pointer -fno-strict-aliasing -std=gnu99 -Wstrict-prototypes -Wdeclaration-after-statement -Wno-unused-but-set-variable -Wno-unused-local-typedefs -DNDEBUG -I/home/michael/rpmbuild/BUILD/xen-4.5.0/xen/include  -I/home/michael/rpmbuild/BUILD/xen-4.5.0/xen/include/asm-x86/mach-generic -I/home/michael/rpmbuild/BUILD/xen-4.5.0/xen/include/asm-x86/mach-default -msoft-float -fno-stack-protector -fno-exceptions -Wnested-externs -DHAVE_GAS_VMX -DHAVE_GAS_EPT -DHAVE_GAS_FSGSBASE -mno-red-zone -mno-sse -fpic -fno-asynchronous-unwind-tables -DGCC_HAS_VISIBILITY_ATTRIBUTE -fno-builtin -fno-common -Werror -Wredundant-decls -Wno-pointer-arith -pipe -D__XEN__ -include /home/michael/rpmbuild/BUILD/xen-4.5.0/xen/include/xen/config.h -nostdinc -DXSM_ENABLE -DFLASK_ENABLE -DHAS_ACPI -DHAS_GDBSX -DHAS_PASSTHROUGH -DHAS_MEM_ACCESS -DHAS_MEM_PAGING -DHAS_MEM_SHARING -DHAS_PCI -DHAS_IOPORTS -DHAS_PDX -MMD -MF .xen.d -MF .built_in.o.d -MF .mm.o.d -c mm.c -o mm.o

Comment 10 Jakub Jelinek 2015-06-06 19:40:40 UTC
Thanks, filed upstream: PR66444.

Comment 11 Major Hayden 🤠 2015-06-18 14:12:02 UTC
It looks like the patch made it into upstream GCC if I am reading this ticket correctly:

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66444#c12

Comment 12 Jakub Jelinek 2015-06-18 14:19:00 UTC
Then it is already in the gcc-5.1.1-3.fc22 errata.

