Bug 681126

Summary: rhel6.32 guest installation causes B95 host reboot
Product: Red Hat Enterprise Linux 5
Component: kvm
Version: 5.7
Status: CLOSED WONTFIX
Severity: medium
Priority: medium
Reporter: Suqin Huang <shuang>
Assignee: Karen Noel <knoel>
QA Contact: Virtualization Bugs <virt-bugs>
CC: gcosta, gyue, juzhang, knoel, mkenneth, pmatouse, rhod, tburke, virt-maint
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2012-04-10 10:30:12 UTC
Bug Blocks: 580949
Attachments:
  Description            Flags
  serial info            none
  serial info            none
  AMD host lspci info    none

Description Suqin Huang 2011-03-01 07:50:44 UTC
Created attachment 481547
serial info

Description of problem:
rhel6.32 guest installation causes B95 host reboot

Version-Release number of selected component (if applicable):
kvm-83-226.el5

How reproducible:
100%

Steps to Reproduce:
1. Run the following command:
qemu-kvm -drive file='/usr/images/RHEL-Server-6.0-64-virtio.qcow2',index=0,if=virtio,media=disk,cache=none,format=qcow2 -net nic,vlan=0,model=virtio,macaddr='9a:42:40:18:c8:b2' -net tap,vlan=0,script='/usr/scripts/qemu-ifup-switch',downscript='no' -m 2048 -smp 2,cores=1,threads=1,sockets=2 -drive file='/usr/isos/linux/RHEL6.0-Server-x86_64.iso',media=cdrom,index=1 -drive file='/usr/images/rhel60-64/ks.iso',media=cdrom,index=2 -cpu qemu64,+sse2 -soundhw ac97 -kernel '/usr/images/rhel60-64/vmlinuz' -initrd '/usr/images/rhel60-64/initrd.img' -vnc :0 -rtc-td-hack -M rhel5.6.0 -boot n -usbdevice tablet -no-kvm-pit-reinjection --append 'ks=cdrom nicdelay=60 console=ttyS0,115200 console=tty0
  
Actual results:


Expected results:


Additional info:

1. host:
kernel: 2.6.18-238.el5

cpu:
processor	: 3
vendor_id	: AuthenticAMD
cpu family	: 16
model		: 4
model name	: AMD Phenom(tm) II X4 B95 Processor
stepping	: 2
cpu MHz		: 800.000
cache size	: 512 KB

flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc nonstop_tsc pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
bogomips	: 5984.92
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

2. A rhel6.32 guest installs successfully on other hosts.
3. winxp, win7, win2008, and rhel6.64 guests install successfully.

Comment 1 Suqin Huang 2011-03-01 08:10:43 UTC
(In reply to comment #0)
> 1. host:
> kernel: 2.6.18-238.el5
> 
Correction: the host kernel is 2.6.18-245.el5.

I can also reproduce this with 2.6.18-238.el5 and kvm-83-224.el5.

Comment 4 Gleb Natapov 2011-03-02 09:25:39 UTC
Serial output does not look complete. Also try to enable kdump.

Comment 5 Suqin Huang 2011-03-07 01:55:21 UTC
Created attachment 482580
serial info

No core file was produced with kdump enabled.

Comment 6 Avi Kivity 2011-03-07 09:40:24 UTC
At what stage does the host crash?  Immediately after the guest kernel boots, or while installing packages?

Comment 7 Gleb Natapov 2011-03-07 09:44:00 UTC
What other AMD CPUs have you tried to reproduce this on? Please provide cpuinfo.

Comment 8 Suqin Huang 2011-03-07 10:57:34 UTC
(In reply to comment #6)
> At what stage does the host crash?  Immediately after the guest kernel boots,
> or while installing packages?

At the "Starting installation process" step.

Comment 9 Suqin Huang 2011-03-07 10:59:00 UTC
The guest installs successfully on the following host:

processor	: 11
vendor_id	: AuthenticAMD
cpu family	: 16
model		: 8
model name	: Six-Core AMD Opteron(tm) Processor 2427
stepping	: 0
cpu MHz		: 800.000
cache size	: 512 KB
physical id	: 1
siblings	: 6
core id		: 5
cpu cores	: 6
apicid		: 13
fpu		: yes
fpu_exception	: yes
cpuid level	: 5
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc nonstop_tsc pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw

Comment 10 Gleb Natapov 2011-03-07 12:38:56 UTC
Can you check whether a BIOS update is available for this machine?

Comment 11 Suqin Huang 2011-05-16 08:38:37 UTC
Still reproducible after updating the host BIOS.

Comment 12 Avi Kivity 2011-08-11 09:03:16 UTC
Potential duplicate of Bug 713636 - both AMD without NPT.

Comment 14 Avi Kivity 2011-10-11 15:20:22 UTC
Does RHEL 6.latest show the same behaviour?

Comment 15 Suqin Huang 2011-10-20 10:21:39 UTC
Repeated the installation 150 times; cannot reproduce it on RHEL 6 with qemu-kvm-0.12.1.2-2.199.el6.x86_64.

Comment 16 Ronen Hod 2011-10-25 10:40:59 UTC
No time to fix it for RHEL5.8. Moving to 5.9.

The installed guest is RHEL6.0, and if it is the only problematic guest then we can close this bug (probably not the case, since it looks like a high-load issue).
Suqin, please test it with a RHEL5.8 host and a RHEL6.2 guest to see if the problem still exists.

Thanks.

Comment 17 Golita Yue 2011-11-02 10:44:52 UTC
I have submitted a job to test this bug and will post the result once the job finishes.

Comment 18 Golita Yue 2011-11-07 06:52:24 UTC
Tested with a RHEL5.8 host and a RHEL6.2 guest; the bug is reproducible.
The host rebooted automatically after installing the guest twice.

the host info:
kernel-2.6.18-290.el5
kvm-83-243.el5

the guest info:
kernel-2.6.32-216.el6

Comment 20 Avi Kivity 2011-11-07 19:01:39 UTC
Will look for errata in this area.

Comment 21 Avi Kivity 2011-11-07 19:37:32 UTC
Possibly relevant errata:


319 Inaccurate Temperature Measurement
Description
The internal thermal sensor used for CurTmp (F3xA4[31:21]), hardware thermal control (HTC), software thermal control (STC) thermal zone, and the sideband temperature sensor interface (SB-TSI) may report inconsistent values.
For CPUID Fn0000_0001_EAX[7:4] (Model) 4 and higher, this temperature inconsistency will occur only on AM2r2, Fr2, Fr5 and Fr6 package processors.
Potential Effect on System
HTC, STC thermal zone, and SB-TSI do not provide reliable thermal protection. This does not affect THERMTRIP or the use of the STC-active state using StcPstateLimit or StcPstateEn (F3x68[30:28, 5]).

-----------------------
346 System May Hang if Core Frequency is Even Divisor of Northbridge Clock
Description
When one processor core is operating at a clock frequency that is higher than the northbridge clock frequency, and another processor core is operating at a clock frequency that is an even divisor of the northbridge clock frequency, the northbridge may fail to complete a cache probe.
Potential Effect on System
System hang.
Suggested Workaround
System software should set F3x188[22] to 1b.
Fix Planned

Comment 22 Avi Kivity 2011-11-07 19:50:53 UTC
Please try retesting with reduced core frequency:

For each core:

  cd /sys/devices/system/cpu/cpuX/cpufreq
  echo -n userspace > scaling_governor
  cat scaling_min_freq > scaling_setspeed 

Run the test with this.  Please monitor scaling_cur_freq for all cores to make sure no silly daemon flips them back.
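
The monitoring step could be scripted along these lines. This is a minimal sketch and not part of the original comment: the function name check_pinned and the CPU_ROOT override are hypothetical, and it assumes the cpufreq sysfs files named above exist for every core.

```shell
#!/bin/sh
# Report any core whose current frequency differs from its minimum.
# CPU_ROOT is overridable only to make the sketch testable (assumption);
# on a real host it is /sys/devices/system/cpu.
CPU_ROOT="${CPU_ROOT:-/sys/devices/system/cpu}"

check_pinned() {
    status=0
    for dir in "$CPU_ROOT"/cpu[0-9]*/cpufreq; do
        # Skip cores without cpufreq support (or an unmatched glob).
        [ -r "$dir/scaling_cur_freq" ] || continue
        cur=$(cat "$dir/scaling_cur_freq")
        min=$(cat "$dir/scaling_min_freq")
        if [ "$cur" != "$min" ]; then
            echo "DRIFT ${dir%/cpufreq}: cur=$cur min=$min"
            status=1
        fi
    done
    # Returns 0 when every core is still at its minimum frequency.
    return $status
}
```

Calling check_pinned every few seconds while the installation job runs would show whether anything resets the governor or the frequency.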

Comment 23 Golita Yue 2011-11-10 11:27:01 UTC
Hi Avi,

Tested as described in comment #22. I can still reproduce this bug: the host reboots automatically during guest installation.

my steps:

# grep processor /proc/cpuinfo | wc -l
4
# cd /sys/devices/system/cpu/
# ls
cpu0  cpu1  cpu2  cpu3  sched_mc_power_savings
# cat cpu0/cpufreq/scaling_governor 
ondemand
# for i in 0 1 2 3; do echo -n userspace > cpu$i/cpufreq/scaling_governor; done
# for i in 0 1 2 3; do cat cpu$i/cpufreq/scaling_governor; done
userspace
userspace
userspace
userspace
# for i in 0 1 2 3; do cat cpu$i/cpufreq/scaling_min_freq > cpu$i/cpufreq/scaling_setspeed; done
# for i in 0 1 2 3; do cat cpu$i/cpufreq/scaling_setspeed; done
800000
800000
800000
800000
# for i in 0 1 2 3; do cat cpu$i/cpufreq/scaling_cur_freq ; done
800000
800000
800000
800000

Then I ran a job to install the guest in a loop.

If there is a problem with my steps, please correct me. Thanks.

Comment 24 Avi Kivity 2011-11-13 15:14:24 UTC
It looks okay.

Please provide the output of

  lspci -xxxx -s 00:18.3

(checking for erratum 346)

Comment 25 Avi Kivity 2011-11-13 15:21:15 UTC
Also, the output of plain 'lspci'.  Function 18 should be something like "Host bridge: Advanced Micro Devices [AMD] Family 10h Processor".

Comment 26 Golita Yue 2011-11-14 03:18:32 UTC
(In reply to comment #24)
> It looks okay.
> 
> Please provide the output of
> 
>   lspci -xxxx -s 00:18.3
> 
> (checking for erratum 346)

[root@amd-B95-8-2 ~]# lspci -xxxx -s 00:18.3
00:18.3 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor Miscellaneous Control
00: 22 10 03 12 00 00 10 00 00 00 00 06 00 00 80 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 f0 00 00 00 00 00 00 00 00 00 00 00
40: ff ff ff 3f 5c 00 b0 4a 00 00 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 04 00 3f 34 00 00 00 30 51 80 01 60
70: 51 11 32 60 01 01 98 00 14 0c 20 00 11 08 07 00
80: 81 e6 00 e6 e6 41 e6 01 08 00 00 00 00 60 58 00
90: 03 00 00 00 02 00 00 00 00 0d 1f 02 00 00 00 00
a0: 96 08 16 a0 80 18 0c 12 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 43 51 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 26 0f 81 c8 16 0f 2e 03 22 63 47 01
e0: 00 00 00 00 30 13 00 1e 59 7f 07 02 00 00 00 00
f0: 0f 00 10 00 00 00 00 00 00 00 00 00 42 0f 10 00

Comment 27 Avi Kivity 2011-11-14 17:44:08 UTC
Looks like a really old lspci.  Was that from RHEL 5?  Please try RHEL 6 or latest Fedora, should give a lot more output, in particular a line beginning with 180:.

Comment 28 Golita Yue 2011-11-15 07:49:14 UTC
(In reply to comment #27)
> Looks like a really old lspci.  Was that from RHEL 5?  

Yes, that came from a RHEL 5 host.
The host info is as follows:
kernel-2.6.18-290.el5
kvm-83-243.el5

Per comment #15, this bug can only be reproduced on RHEL5; it cannot be reproduced on RHEL6 (same host, different OS).

> Please try RHEL 6 or
> latest Fedora, should give a lot more output, in particular a line beginning
> with 180:.

Hi Avi,
Do you mean I should reinstall the above host with RHEL6 and then collect the lspci info?

Comment 29 Avi Kivity 2011-11-15 08:20:31 UTC
Yes.  Or you can try to build pciutils from source if that's easier.

Comment 30 Golita Yue 2011-11-17 07:02:00 UTC
Created attachment 534154
AMD host lspci info

Attached the host lspci info.

Comment 31 Avi Kivity 2011-11-17 10:57:51 UTC
It looks like 0x188[22] is set, so it's not erratum 346.
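
The check behind this conclusion can be sketched as follows. This is an illustration under stated assumptions, not the method actually used in this bug: the function name f3x188_bit22 is hypothetical, and it assumes the `lspci -xxxx` dump includes the extended config space row labelled 180: with 16 bytes per row in ascending offset order, so that the dword at offset 0x188 is stored little-endian in bytes 8 to 11 of that row.

```shell
#!/bin/sh
# Read `lspci -xxxx` output on stdin and print bit 22 of the dword at
# config-space offset 0x188 (bytes 0x188..0x18b, little-endian).
f3x188_bit22() {
    # The row labelled "180:" holds bytes 0x180..0x18f; awk field 2 is
    # byte 0x180, so bytes 0x188..0x18b are fields 10..13.
    dword=$(awk '$1 == "180:" { print $13 $12 $11 $10; exit }')
    if [ -z "$dword" ]; then
        echo "no 180: row (extended config space not dumped)" >&2
        return 2
    fi
    # Shift the assembled 32-bit value and mask out bit 22.
    echo $(( (0x$dword >> 22) & 1 ))
}
```

Usage would be `lspci -xxxx -s 00:18.3 | f3x188_bit22`; an output of 1 means the bit named in the erratum 346 workaround is set.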

Comment 33 RHEL Program Management 2012-04-02 10:27:02 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux release.  Product Management has
requested further review of this request by Red Hat Engineering, for
potential inclusion in a Red Hat Enterprise Linux release for currently
deployed products.  This request is not yet committed for inclusion in
a release.

Comment 34 Avi Kivity 2012-04-10 10:30:12 UTC
Affects specific, outdated, hardware.  Closing.