Bug 703055

Summary: RHEL6.1 x86_64 HVM guest crashes on AMD host when guest memory size is larger than 8G
Product: Red Hat Enterprise Linux 6
Reporter: Yufang Zhang <yuzhang>
Component: kernel
Assignee: Igor Mammedov <imammedo>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Docs Contact:
Priority: urgent
Version: 6.1
CC: drjones, imammedo, jwest, jzheng, leiwang, mshao, qwan, wtogami, xavier.bru, xen-maint, yuzhang, yuzhou
Target Milestone: rc
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: kernel-2.6.32-160.el6
Doc Type: Bug Fix
Doc Text:
Prior to this update, Red Hat Enterprise Linux Xen (up to version 5.6) did not hide the 1 GB pages and RDTSCP CPUID features from HVM guests, causing guest soft lockups on AMD hosts when the guest's memory was greater than 8 GB. With this update, a Red Hat Enterprise Linux 6 HVM (Hardware Virtual Machine) guest is able to run on Red Hat Enterprise Linux Xen 5.6 and lower.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-12-06 13:27:06 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 653816, 711546
Attachments (description, flags):
  console of the hvm guest (none)
  xm dmesg log (none)
  config file of the guest (none)
  xm info log (none)
  Patch to mask out 1GBPAGES and RDTSCP cpuid features for xen hvm (none)
  Patch to mask out 1GBPAGES and RDTSCP cpuid features for xen hvm. V2 (none)

Description Yufang Zhang 2011-05-09 06:56:48 UTC
Created attachment 497717 [details]
console of the hvm guest

Description of problem:
A RHEL6.1 HVM guest crashes on an AMD host when the guest memory size is set to more than 8G. When the memory size is set below 8G, the guest boots successfully.

Version-Release number of selected component (if applicable):
host:
xen-3.0.3-130.el5
kernel-xen-2.6.18-259.el5

guest:
2.6.32-131.0.13.el6

How reproducible:
Always

Steps to Reproduce:
1. Create a rhel6.1 hvm guest with the memory size set to more than 8G.
2. The guest crashes during boot.
  
Actual results:
The guest crashes during boot.

Expected results:
Guest boots successfully.

Additional info:
1. A rhel5.6 hvm guest boots successfully on the same AMD host even when the guest memory size is set to 495G.
2. A rhel6.1 x86_64 pv guest boots successfully when the guest memory size is set to 32G.
3. A rhel6.1 i386 hvm guest boots successfully when the guest memory size is set to 16G.
4. Test results on an Intel machine will be submitted soon.

Comment 1 Yufang Zhang 2011-05-09 07:02:53 UTC
Created attachment 497718 [details]
xm dmesg log

Comment 2 Yufang Zhang 2011-05-09 07:04:02 UTC
Created attachment 497719 [details]
config file of the guest

Comment 3 Yufang Zhang 2011-05-09 07:05:11 UTC
Created attachment 497720 [details]
xm info log

Comment 4 Andrew Jones 2011-05-09 07:44:29 UTC
I see from the boot log that the guest had 3584M and it was using emulated devices (xen_emul_unplug=never). Were experiments with less memory and no unplug also done?

Comment 5 Yufang Zhang 2011-05-09 08:11:52 UTC
(In reply to comment #4)
> I see from the boot log that the guest had 3584M and it was using emulated
> devices (xen_emul_unplug=never). Were experiments with less memory and no
> unplug also done?

Yes, I tested all the scenarios. The guest crashes whenever the memory size is set to more than 8G, both with and without unplug; otherwise it does not crash, with or without unplug. Memory sizes of 1G, 4G, 8G, 9G, 10G, and 495G have been covered for this bug.

Comment 6 Yufang Zhang 2011-05-09 09:52:43 UTC
Tested on an Intel host: a rhel6.1 x86_64 hvm guest boots successfully when the memory size is set to more than 8G.

Comment 7 Igor Mammedov 2011-05-13 12:46:15 UTC
Kind of reproduced it on a 2-socket AMD host with 1 vcpu.
However, it doesn't crash; instead it hangs shortly after it starts accessing the disk.
After a while it spits out udev timeout messages and hangs again.
100% reproducible.

But booting with 'pv_on_hvm=off' helps; the OS boots successfully.

With 4 vcpus it hangs too, just a bit later,
and produces the following errors:
INFO: task plymouthd:156 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
plymouthd     D 0000000000000003     0   156      1 0x00000000
 ffff880280c578a8 0000000000000082 0000000000000000 ffff880281ba4800
 ffff880280c53540 00000000000003e8 ffff880280c57848 00000000fffbfdce
 ffff880280c2b0f8 ffff880280c57fd8 000000000000f598 ffff880280c2b0f8
Call Trace:
 [<ffffffff811a3c00>] ? sync_buffer+0x0/0x50
 [<ffffffff814dbad3>] io_schedule+0x73/0xc0
 [<ffffffff811a3c40>] sync_buffer+0x40/0x50
 [<ffffffff814dc1ea>] __wait_on_bit_lock+0x5a/0xc0
 [<ffffffff81057389>] ? enqueue_task+0x79/0x90
 [<ffffffff811a3c00>] ? sync_buffer+0x0/0x50
 [<ffffffff814dc2c8>] out_of_line_wait_on_bit_lock+0x78/0x90
 [<ffffffff8108e1a0>] ? wake_bit_function+0x0/0x50
 [<ffffffff811a3de6>] __lock_buffer+0x36/0x40
 [<ffffffffa00422d3>] do_get_write_access+0x493/0x520 [jbd2]
 [<ffffffffa00424b1>] jbd2_journal_get_write_access+0x31/0x50 [jbd2]
 [<ffffffffa0096368>] __ext4_journal_get_write_access+0x38/0x80 [ext4]
 [<ffffffffa0072863>] ext4_reserve_inode_write+0x73/0xa0 [ext4]
 [<ffffffffa00728dc>] ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
 [<ffffffffa0072bd0>] ext4_dirty_inode+0x40/0x60 [ext4]
 [<ffffffff8119b5ab>] __mark_inode_dirty+0x3b/0x160
 [<ffffffff8118be92>] file_update_time+0xf2/0x170
 [<ffffffff8110f5f0>] __generic_file_aio_write+0x220/0x480
 [<ffffffff814e0cb6>] ? notifier_call_chain+0x16/0x80
 [<ffffffff8110f8bf>] generic_file_aio_write+0x6f/0xe0
 [<ffffffffa006c2a1>] ext4_file_write+0x61/0x1e0 [ext4]
 [<ffffffff8117241a>] do_sync_write+0xfa/0x140
 [<ffffffff8104af29>] ? __wake_up_common+0x59/0x90
 [<ffffffff8104f843>] ? __wake_up+0x53/0x70
 [<ffffffff8108e160>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff81211d3b>] ? selinux_file_permission+0xfb/0x150
 [<ffffffff812051a6>] ? security_file_permission+0x16/0x20
 [<ffffffff81172718>] vfs_write+0xb8/0x1a0
 [<ffffffff81173151>] sys_write+0x51/0x90
 [<ffffffff8100b172>] system_call_fastpath+0x16/0x1b
INFO: task jbd2/xvda1-8:268 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/xvda1-8  D 0000000000000000     0   268      2 0x00000000
 ffff880281675d10 0000000000000046 0000000080dfae40 0000000000000008
 0000000000015f80 ffff880280d78a80 ffffffff81a2d020 ffffffff8160b060
 ffff880280d79038 ffff880281675fd8 000000000000f598 ffff880280d79038
Call Trace:
 [<ffffffff8108e44e>] ? prepare_to_wait+0x4e/0x80
 [<ffffffffa0042870>] jbd2_journal_commit_transaction+0x1c0/0x1490 [jbd2]
 [<ffffffff810096e0>] ? __switch_to+0xd0/0x320
 [<ffffffff810796cc>] ? lock_timer_base+0x3c/0x70
 [<ffffffff8108e160>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0048978>] kjournald2+0xb8/0x220 [jbd2]
 [<ffffffff8108e160>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa00488c0>] ? kjournald2+0x0/0x220 [jbd2]
 [<ffffffff8108ddf6>] kthread+0x96/0xa0
 [<ffffffff8100c1ca>] child_rip+0xa/0x20
 [<ffffffff8108dd60>] ? kthread+0x0/0xa0
 [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
INFO: task flush-202:0:281 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-202:0   D 0000000000000001     0   281      2 0x00000000
 ffff88027e9339b0 0000000000000046 ffff88027e933950 ffff88027e933a10
 ffff88027e933a88 0000000000000000 000000000000000e ffff88027e933a00
 ffff8802817b1078 ffff88027e933fd8 000000000000f598 ffff8802817b1078
Call Trace:
 [<ffffffff8108e44e>] ? prepare_to_wait+0x4e/0x80
 [<ffffffffa0040fa2>] start_this_handle+0x262/0x4e0 [jbd2]
 [<ffffffff8115a14b>] ? cache_alloc_refill+0x15b/0x240
 [<ffffffff8108e160>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0041405>] jbd2_journal_start+0xb5/0x100 [jbd2]
 [<ffffffffa00712b5>] ? ext4_meta_trans_blocks+0x75/0xf0 [ext4]
 [<ffffffffa008a868>] ext4_journal_start_sb+0x58/0x90 [ext4]
 [<ffffffffa00754bc>] ext4_da_writepages+0x27c/0x660 [ext4]
 [<ffffffff810537e4>] ? find_busiest_group+0x244/0xb20
 [<ffffffff81122521>] do_writepages+0x21/0x40
 [<ffffffff8119bbbd>] writeback_single_inode+0xdd/0x2c0
 [<ffffffff8119bfbe>] writeback_sb_inodes+0xce/0x180
 [<ffffffff8119c11b>] writeback_inodes_wb+0xab/0x1b0
 [<ffffffff8119c4bb>] wb_writeback+0x29b/0x3f0
 [<ffffffff814db337>] ? thread_return+0x4e/0x777
 [<ffffffff8107a202>] ? del_timer_sync+0x22/0x30
 [<ffffffff8119c7a9>] wb_do_writeback+0x199/0x240
 [<ffffffff8119c8b3>] bdi_writeback_task+0x63/0x1b0
 [<ffffffff8108e027>] ? bit_waitqueue+0x17/0xd0
 [<ffffffff81130b90>] ? bdi_start_fn+0x0/0x100
 [<ffffffff81130c16>] bdi_start_fn+0x86/0x100
 [<ffffffff81130b90>] ? bdi_start_fn+0x0/0x100
 [<ffffffff8108ddf6>] kthread+0x96/0xa0
 [<ffffffff8100c1ca>] child_rip+0xa/0x20
 [<ffffffff8108dd60>] ? kthread+0x0/0xa0
 [<ffffffff8100c1c0>] ? child_rip+0x0/0x20

Comment 8 Andrew Jones 2011-05-13 14:04:35 UTC
Comment 7 looks like a different issue. That appears to be a problem when using pv-on-hvm drivers, and likely something with the real disk accesses, since the machine being used is known to have disk issues. The original report was with the qemu device, although comment 5 says it also happens without xen_emul_unplug=never. If that's the case, then I would guess there should be a different backtrace that doesn't have ata-related functions in it.

Comment 9 Igor Mammedov 2011-05-16 09:09:54 UTC
I'll try with an iscsi backend to exclude the influence of a faulty hdd/controller. But I don't think that's the case, since without pv_on_hvm it works with the same backend disk.

Comment 10 Igor Mammedov 2011-05-17 20:32:23 UTC
Looks like the soft lockup in ata_sff_data_xfer32 without pv-on-hvm and the hang (comment 7) with pv-on-hvm are caused by a missing commit in the RHEL5 hypervisor (-259):

7725941 [xen] x86: Handle new AMD CPUID bits for HVM guests

which appeared only in -260.

Checked with the -260 and -261 kernel-xen with a 10G memory setting.
No lockups (xen_emul_unplug=never) on ibm-x3655-01.ovirt.rhts.eng.bos.redhat.com,
and no hangs on colossus (the lockups are not reproducible on it).
So 5.7 should be OK.

Left a test running overnight with a custom hypervisor (-259 + 7725941) to see if
the problem is fixed.

Comment 11 Igor Mammedov 2011-05-19 08:41:17 UTC
On ibm-x3655-01.ovirt.rhts.eng.bos.redhat.com with a -259 host and an hvm 6.0 (-71) guest, there is the same soft lockup at ata_sff_data_xfer32.

So it is not a regression.

Comment 12 Yufang Zhang 2011-05-20 08:11:19 UTC
On the same host (amd-6172-512-1.englab.nay.redhat.com), I could boot a rhel6.0 (-71 kernel) hvm x86_64 guest successfully, even with the guest memory size larger than 8G. I also submitted a job with the memory size set to 256G, and it seems that the guest boots successfully. I will post the results after the job finishes.

Comment 14 Igor Mammedov 2011-05-23 12:52:03 UTC
Yufang,

Can you retest 6.1 with hap_1gb=1 on the xen (-259) hypervisor command line?

Comment 16 Igor Mammedov 2011-05-23 14:08:04 UTC
Yufang,

Have you tested Windows guests in similar conditions on a -259 host?
Can you test a Windows guest too, please?

Comment 18 Qixiang Wan 2011-05-24 02:31:54 UTC
(In reply to comment #15)
> It appears that X86_FEATURE_PAGE1GB is a cause of problem.
> Even enabling 1gb page support with hap_1gb=1 in xen doesn't help, just made it
> more difficult to reproduce. 
> It is possible to mask that feature out in the guest, via a dirty hack in
> guest's common cpuid func, but I guess upstream won't accept it (I wouldn't).
> Masking this way allows RHEL6.1 guest to boot (30 reboots without soft lockup
> so far).

Hi Igor,

We have a similar bug for PV (bug 502826), so X86_FEATURE_PAGE1GB has already been masked out for PV guests, but not for HVM guests. It seems upstream's cpuid function differs from RHEL's, as its newer cpuid function hasn't been backported to RHEL.

Comment 24 Igor Mammedov 2011-05-29 22:38:06 UTC
Created attachment 501673 [details]
Patch to mask out 1GBPAGES and RDTSCP cpuid features for xen hvm
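
For context, a rough sketch of the masking approach discussed here, pieced together from this report (the filter_cpuid_features()/xen_cpuid_base() build error in comment 45 and the boot messages in comment 49). The helper name, header locations, and exact code are assumptions for illustration, not the attached patch itself; in the guest kernel the 1GB-pages feature is X86_FEATURE_GBPAGES (flag string "pdpe1gb"), which the thread refers to by Xen's name X86_FEATURE_PAGE1GB:

 /* Sketch only: hide CPUID features that pre-5.7 RHEL5 Xen hosts mishandle
  * for HVM guests.  Meant to be called from the guest kernel's CPUID
  * feature filtering path (arch/x86/kernel/cpu/common.c). */
 #include <linux/kernel.h>
 #include <asm/processor.h>

 static void xen_hvm_mask_broken_features(struct cpuinfo_x86 *c)
 {
         /* Only act when running as a guest on a Xen hypervisor. */
         if (!xen_cpuid_base())
                 return;

         clear_cpu_cap(c, X86_FEATURE_GBPAGES);   /* "pdpe1gb" */
         clear_cpu_cap(c, X86_FEATURE_RDTSCP);

         printk(KERN_INFO "CPU: CPU feature pdpe1gb disabled on xen guest\n");
         printk(KERN_INFO "CPU: CPU feature rdtscp disabled on xen guest\n");
 }

The key design point is that only the guest's own view of its features is filtered (see comment 40), so older hosts that advertise the bits incorrectly no longer trip the guest.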

Comment 26 RHEL Program Management 2011-05-30 15:49:41 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 28 Igor Mammedov 2011-05-31 06:47:43 UTC
Created attachment 501921 [details]
Patch to mask out 1GBPAGES and RDTSCP cpuid features for xen hvm. V2

Comment 31 RHEL Program Management 2011-05-31 07:30:37 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 37 Igor Mammedov 2011-06-01 13:57:18 UTC
Hi Qixiang,
You should use the RHEL5.6 kernel-xen or an earlier version (I guess anything no higher than -259 is good enough for testing). See comment 10.

Comment 38 Andrew Jones 2011-06-01 15:23:32 UTC
Qixiang,

in addition to Igor's comment that you need -259 or lower (because -260 and higher mask the bits in the HV as well), your test program is flawed because it uses the cpuid and pv-cpuid instructions rather than looking at /dev/cpu/0/cpuid, which is the kernel's view of the cpuid bits. The kernel's view is what you want. If you add code to your test program to check that file, you'll find all three differ, but the cpuid device file is the right one for your test.

Drew
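
For illustration, a minimal userspace sketch (an assumption, not the actual test program used here) of reading CPUID leaf 0x80000001 through /dev/cpu/0/cpuid; pdpe1gb is EDX bit 26 and rdtscp is EDX bit 27 in that leaf, and the device requires the cpuid driver (modprobe cpuid):

 #include <fcntl.h>
 #include <stdint.h>
 #include <stdio.h>
 #include <unistd.h>

 int main(void)
 {
         uint32_t regs[4];    /* eax, ebx, ecx, edx for the requested leaf */
         int fd = open("/dev/cpu/0/cpuid", O_RDONLY);

         /* The cpuid device returns 16 bytes per read; the file offset
          * selects the CPUID leaf. */
         if (fd < 0 || pread(fd, regs, sizeof(regs), 0x80000001) != sizeof(regs)) {
                 perror("/dev/cpu/0/cpuid");
                 return 1;
         }
         printf("pdpe1gb: %s\n", (regs[3] >> 26) & 1 ? "set" : "clear");
         printf("rdtscp : %s\n", (regs[3] >> 27) & 1 ? "set" : "clear");
         close(fd);
         return 0;
 }

(As comment 40 below notes, for this particular fix only /proc/cpuinfo and the boot messages reflect the masking, since the filter is applied to the kernel's feature cache rather than to raw CPUID reads.)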

Comment 39 Qixiang Wan 2011-06-01 15:36:11 UTC
Thanks for the update.
I downgraded the kernel and reproduced the error on the AMD 6168 CPU.
Testing with the updated kernel from comment 33 shows the fix is working properly: the guest didn't crash or hang after 10+ reboots (easily reproducible without the fix). pdpe1gb and rdtscp are not present in the cpu flags (/proc/cpuinfo) either, and the guest kernel boot messages show:
...
CPU: CPU feature pdpe1gb disabled
CPU: CPU feature rdtscp disabled
...

Same test result on ibm-x3655-01.ovirt.rhts.eng.bos.redhat.com.

Comment 40 Andrew Jones 2011-06-01 15:55:48 UTC
(In reply to comment #39)
> the fix). pdpe1gb and rdtscp are not present in cpu flags (/proc/cpuinfo)
> either, and in guest kernel boot message:
> ...
> CPU: CPU feature pdpe1gb disabled
> CPU: CPU feature rdtscp disabled

Yeah, actually this is the only way to verify it. /dev/cpu/0/cpuid, which I pointed to before, actually won't work because we decided to filter only the kernel's feature cache rather than all cpuid calls. So none of the three ways to get at the cpuid data mentioned in comment 38 will confirm these features are masked out; only /proc/cpuinfo (which uses the cache) and the boot messages can be used.

Comment 42 Aristeu Rozanski 2011-06-27 19:03:54 UTC
Patch(es) available on kernel-2.6.32-160.el6

Comment 44 Martin Prpič 2011-07-12 11:36:49 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Prior to this update, Red Hat Enterprise Linux Xen (up to version 5.6) did not hide the 1 GB pages and RDTSCP CPUID features from HVM guests, causing guest soft lockups on AMD hosts when the guest's memory was greater than 8 GB. With this update, a Red Hat Enterprise Linux 6 HVM (Hardware Virtual Machine) guest is able to run on Red Hat Enterprise Linux Xen 5.6 and lower.

Comment 45 Xavier Bru 2011-07-18 09:03:16 UTC
Building a kernel 2.6.32-131.6.1 with the BZ#703055 patch and with CONFIG_XEN not set fails:

arch/x86/kernel/cpu/common.c: In function 'filter_cpuid_features':
arch/x86/kernel/cpu/common.c:306: error: implicit declaration of function 'xen_cpuid_base'

Should the additional code be ifdef'ed by CONFIG_XEN?

Comment 46 Andrew Jones 2011-07-18 12:48:35 UTC
(In reply to comment #45)
> Building a kernel 2.6.32-131.6.1 with BZ#703055 and with CONFIG_XEN is not set:
> 
> arch/x86/kernel/cpu/common.c: In function 'filter_cpuid_features':
> arch/x86/kernel/cpu/common.c:306: error: implicit declaration of function
> 'xen_cpuid_base'
> 
> Additional code should be ifdefed by CONFIG_XEN ?

Supported RHEL6 kernels are always compiled with CONFIG_XEN on, so it's not really a RHEL issue. However, it would be nice to keep the kernel compiling under different configs for debug purposes, so a patch can be considered. Please open a new bug for it.

Thanks,
Drew
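
For illustration, one common shape such a compile fix could take (a sketch under the assumption that xen_cpuid_base() is only declared when CONFIG_XEN is set; the actual follow-up was filed separately, see comment 48, so this may not match what was done): provide a stub so filter_cpuid_features() still builds without Xen support.

 /* Sketch: let filter_cpuid_features() compile when CONFIG_XEN is not set. */
 #ifndef CONFIG_XEN
 static inline uint32_t xen_cpuid_base(void)
 {
         return 0;       /* never running on Xen in a !CONFIG_XEN build */
 }
 #endif

An equivalent alternative is to wrap the call site itself in #ifdef CONFIG_XEN.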

Comment 48 Warren Togami 2011-08-09 21:44:08 UTC
(In reply to comment #46)
> (In reply to comment #45)
> > Building a kernel 2.6.32-131.6.1 with BZ#703055 and with CONFIG_XEN is not set:
> > 
> > arch/x86/kernel/cpu/common.c: In function 'filter_cpuid_features':
> > arch/x86/kernel/cpu/common.c:306: error: implicit declaration of function
> > 'xen_cpuid_base'
> > 
> > Additional code should be ifdefed by CONFIG_XEN ?
> 
> Supported RHEL6 kernels are always compiled with CONFIG_XEN on, so it's not
> really a RHEL issue. However, it would be nice to keep the kernel compiling
> under different configs for debug purposes, so a patch can be considered.
> Please open a new bug for it.
> 
> Thanks,
> Drew

Filed as Bug #729488.

Comment 49 Qixiang Wan 2011-09-02 09:32:15 UTC
Verified with kernel-2.6.32-192.el6. 

With the RHEL6.1 GA kernel, the x86_64 hvm guest easily hangs during boot with 10G memory on an AMD 6168 processor host. After the fix, the guest boots/reboots successfully (tested more than 50 times, no issue found).

pdpe1gb and rdtscp are disabled in the guest after the fix:

# cat /proc/cpuinfo | grep -E '^flags'
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat clflush mmx fxsr sse sse2 syscall nx mmxext lm up rep_good extd_apicid unfair_spinlock pni cx16 hypervisor lahf_lm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt

# dmesg
...
CPU: CPU feature pdpe1gb disabled on xen guest
CPU: CPU feature rdtscp disabled on xen guest
...

Comment 50 errata-xmlrpc 2011-12-06 13:27:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2011-1530.html