Bug 681126 - rhel6.32 guest installation cause B95 host reboot
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kvm
Version: 5.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Karen Noel
QA Contact: Virtualization Bugs
Depends On:
Blocks: Rhel5KvmTier1
Reported: 2011-03-01 02:50 EST by Suqin Huang
Modified: 2013-01-09 18:36 EST
CC List: 9 users

Doc Type: Bug Fix
Last Closed: 2012-04-10 06:30:12 EDT

Attachments
serial info (6.00 KB, text/plain)
2011-03-01 02:50 EST, Suqin Huang
serial info (12.38 KB, text/plain)
2011-03-06 20:55 EST, Suqin Huang
AMD host lspci info (13.55 KB, text/plain)
2011-11-17 02:02 EST, Golita Yue

Description Suqin Huang 2011-03-01 02:50:44 EST
Created attachment 481547
serial info

Description of problem:
rhel6.32 guest installation causes the B95 host to reboot.

Version-Release number of selected component (if applicable):
kvm-83-226.el5

How reproducible:
100%

Steps to Reproduce:
1. Start the guest installation with the following command:
qemu-kvm \
    -drive file='/usr/images/RHEL-Server-6.0-64-virtio.qcow2',index=0,if=virtio,media=disk,cache=none,format=qcow2 \
    -net nic,vlan=0,model=virtio,macaddr='9a:42:40:18:c8:b2' \
    -net tap,vlan=0,script='/usr/scripts/qemu-ifup-switch',downscript='no' \
    -m 2048 \
    -smp 2,cores=1,threads=1,sockets=2 \
    -drive file='/usr/isos/linux/RHEL6.0-Server-x86_64.iso',media=cdrom,index=1 \
    -drive file='/usr/images/rhel60-64/ks.iso',media=cdrom,index=2 \
    -cpu qemu64,+sse2 \
    -soundhw ac97 \
    -kernel '/usr/images/rhel60-64/vmlinuz' \
    -initrd '/usr/images/rhel60-64/initrd.img' \
    -vnc :0 \
    -rtc-td-hack \
    -M rhel5.6.0 \
    -boot n \
    -usbdevice tablet \
    -no-kvm-pit-reinjection \
    --append 'ks=cdrom nicdelay=60 console=ttyS0,115200 console=tty0'
  
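Actual results:
The host reboots during guest installation.

Expected results:
The guest installation completes and the host keeps running.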

Additional info:

1. host:
kernel: 2.6.18-238.el5

cpu:
processor	: 3
vendor_id	: AuthenticAMD
cpu family	: 16
model		: 4
model name	: AMD Phenom(tm) II X4 B95 Processor
stepping	: 2
cpu MHz		: 800.000
cache size	: 512 KB

flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc nonstop_tsc pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
bogomips	: 5984.92
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate [8]

2. rhel6.32 can be installed successfully on other hosts.
3. winxp, win7, win2008, and rhel6.64 guests can be installed successfully.
Comment 1 Suqin Huang 2011-03-01 03:10:43 EST
(In reply to comment #0)
> 1. host:
> kernel: 2.6.18-238.el5
> 
Correction: the host kernel is 2.6.18-245.el5.

I can also reproduce it with kernel 2.6.18-238.el5 and kvm-83-224.el5.
Comment 4 Gleb Natapov 2011-03-02 04:25:39 EST
Serial output does not look complete. Also try to enable kdump.
Comment 5 Suqin Huang 2011-03-06 20:55:21 EST
Created attachment 482580
serial info

No core file was generated even with kdump enabled.
Comment 6 Avi Kivity 2011-03-07 04:40:24 EST
At what stage does the host crash?  Immediately after the guest kernel boots, or while installing packages?
Comment 7 Gleb Natapov 2011-03-07 04:44:00 EST
Which other AMD CPUs have you tried to reproduce this on? Please provide cpuinfo.
Comment 8 Suqin Huang 2011-03-07 05:57:34 EST
(In reply to comment #6)
> At what stage does the host crash?  Immediately after the guest kernel boots,
> or while installing packages?

At the "Starting installation process" step.
Comment 9 Suqin Huang 2011-03-07 05:59:00 EST
The installation succeeds on the following host:

processor	: 11
vendor_id	: AuthenticAMD
cpu family	: 16
model		: 8
model name	: Six-Core AMD Opteron(tm) Processor 2427
stepping	: 0
cpu MHz		: 800.000
cache size	: 512 KB
physical id	: 1
siblings	: 6
core id		: 5
cpu cores	: 6
apicid		: 13
fpu		: yes
fpu_exception	: yes
cpuid level	: 5
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc nonstop_tsc pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
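
When comparing a failing host against a working one, it can help to diff the CPU flag sets mechanically instead of by eye. The following is only an illustrative sketch that was not part of the original report: the function names cpu_flags and flag_diff are made up here, and the inputs are assumed to be saved copies of /proc/cpuinfo from each host.

```shell
# cpu_flags FILE: print the CPU flags from the first "flags" line of a
# saved /proc/cpuinfo dump, one flag per line, sorted.
cpu_flags() {
    awk -F: '/^flags/ {
        n = split($2, f, " ")
        for (i = 1; i <= n; i++) print f[i]
        exit
    }' "$1" | sort -u
}

# flag_diff A B: print the flags that appear in only one of the two dumps.
# Lines starting with "<" are only in A, lines starting with ">" only in B.
# Prints nothing when both flag sets are identical.
flag_diff() {
    a=$(mktemp)
    b=$(mktemp)
    cpu_flags "$1" > "$a"
    cpu_flags "$2" > "$b"
    diff "$a" "$b" | grep '^[<>]' || true
    rm -f "$a" "$b"
}
```

For example, `flag_diff b95.cpuinfo opteron2427.cpuinfo` (hypothetical file names) would list any flag present on only one of the two machines. The flag lists quoted in this bug for the B95 and the Opteron 2427 appear to be the same, so the visible differences are in the model and stepping fields.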
Comment 10 Gleb Natapov 2011-03-07 07:38:56 EST
Can you check whether a BIOS update is available for this machine?
Comment 11 Suqin Huang 2011-05-16 04:38:37 EDT
Still reproducible after updating the host BIOS.
Comment 12 Avi Kivity 2011-08-11 05:03:16 EDT
Potential duplicate of Bug 713636 - both AMD without NPT.
Comment 14 Avi Kivity 2011-10-11 11:20:22 EDT
Does RHEL 6.latest show the same behaviour?
Comment 15 Suqin Huang 2011-10-20 06:21:39 EDT
Repeated the installation 150 times; cannot reproduce it on RHEL 6 with qemu-kvm-0.12.1.2-2.199.el6.x86_64.
Comment 16 Ronen Hod 2011-10-25 06:40:59 EDT
No time to fix it for RHEL5.8. Moving to 5.9.

The installed guest is RHEL6.0. If it is the only problematic guest, we can close this bug (probably not the case, since this looks like a high-load issue).
Suqin, please test with a RHEL5.8 host and a RHEL6.2 guest to see whether the problem still exists.

Thanks.
Comment 17 Golita Yue 2011-11-02 06:44:52 EDT
I have submitted a job to test this bug and will post the results once the job has finished.
Comment 18 Golita Yue 2011-11-07 01:52:24 EST
Tested with a RHEL5.8 host and a RHEL6.2 guest; the bug is reproducible.
The host rebooted automatically after two guest installations.

the host info:
kernel-2.6.18-290.el5
kvm-83-243.el5

the guest info:
kernel-2.6.32-216.el6
Comment 20 Avi Kivity 2011-11-07 14:01:39 EST
Will look for errata in this area.
Comment 21 Avi Kivity 2011-11-07 14:37:32 EST
Possibly relevant errata:


319 Inaccurate Temperature Measurement
Description
The internal thermal sensor used for CurTmp (F3xA4[31:21]), hardware thermal control (HTC), software thermal control (STC) thermal zone, and the sideband temperature sensor interface (SB-TSI) may report inconsistent values.
For CPUID Fn0000_0001_EAX[7:4] (Model) 4 and higher, this temperature inconsistency will occur only on AM2r2, Fr2, Fr5 and Fr6 package processors.
Potential Effect on System
HTC, STC thermal zone, and SB-TSI do not provide reliable thermal protection. This does not affect THERMTRIP or the use of the STC-active state using StcPstateLimit or StcPstateEn (F3x68[30:28, 5]).

-----------------------
346 System May Hang if Core Frequency is Even Divisor of Northbridge Clock
Description
When one processor core is operating at a clock frequency that is higher than the northbridge clock frequency, and another processor core is operating at a clock frequency that is an even divisor of the northbridge clock frequency, the northbridge may fail to complete a cache probe.
Potential Effect on System
System hang.
Suggested Workaround
System software should set F3x188[22] to 1b.
Fix Planned
Comment 22 Avi Kivity 2011-11-07 14:50:53 EST
Please try retesting with reduced core frequency:

For each core:

  cd /sys/devices/system/cpu/cpuX/cpufreq
  echo -n userspace > scaling_governor
  cat scaling_min_freq > scaling_setspeed 

Run the test with this.  Please monitor scaling_cur_freq for all cores to make sure no silly daemon flips them back.
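
One way to do the monitoring is a small helper that compares scaling_cur_freq with scaling_min_freq on every core. This is only a sketch: the function name check_freqs is made up here, and it assumes the standard cpufreq sysfs layout shown above (the directory can be overridden for testing).

```shell
# check_freqs [CPU_DIR]: compare scaling_cur_freq with scaling_min_freq for
# every core under CPU_DIR (default: /sys/devices/system/cpu).
# Prints one line per core that is not at its minimum frequency and
# returns non-zero if any such core was found.
check_freqs() {
    root=${1:-/sys/devices/system/cpu}
    rc=0
    for dir in "$root"/cpu[0-9]*/cpufreq; do
        # Skip cores without a readable cpufreq directory.
        [ -r "$dir/scaling_cur_freq" ] || continue
        cur=$(cat "$dir/scaling_cur_freq")
        min=$(cat "$dir/scaling_min_freq")
        if [ "$cur" != "$min" ]; then
            echo "$dir: cur=$cur min=$min"
            rc=1
        fi
    done
    return $rc
}
```

While the installation loop is running, something like `while sleep 10; do check_freqs || date; done` in a second terminal would show when, if ever, a core leaves the minimum frequency.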
Comment 23 Golita Yue 2011-11-10 06:27:01 EST
Hi Avi,

Tested as described in comment #22. The bug is still reproducible: the host reboots automatically during guest installation.

my steps:

# grep processor /proc/cpuinfo | wc -l
4
# cd /sys/devices/system/cpu/
# ls
cpu0  cpu1  cpu2  cpu3  sched_mc_power_savings
# cat cpu0/cpufreq/scaling_governor 
ondemand
# for i in 0 1 2 3; do echo -n userspace > cpu$i/cpufreq/scaling_governor; done
# for i in 0 1 2 3; do cat cpu$i/cpufreq/scaling_governor; done
userspace
userspace
userspace
userspace
# for i in 0 1 2 3; do cat cpu$i/cpufreq/scaling_min_freq > cpu$i/cpufreq/scaling_setspeed; done
# for i in 0 1 2 3; do cat cpu$i/cpufreq/scaling_setspeed; done
800000
800000
800000
800000
# for i in 0 1 2 3; do cat cpu$i/cpufreq/scaling_cur_freq ; done
800000
800000
800000
800000

Then I ran a job that installs the guest in a loop.

If there is a problem with my steps, please correct me. Thanks.
Comment 24 Avi Kivity 2011-11-13 10:14:24 EST
It looks okay.

Please provide the output of

  lspci -xxxx -s 00:18.3

(checking for erratum 346)
Comment 25 Avi Kivity 2011-11-13 10:21:15 EST
Also, the output of plain 'lspci'.  Device 18 should be something like "Host bridge: Advanced Micro Devices [AMD] Family 10h Processor".
Comment 26 Golita Yue 2011-11-13 22:18:32 EST
(In reply to comment #24)
> It looks okay.
> 
> Please provide the output of
> 
>   lspci -xxxx -s 00:18.3
> 
> (checking for erratum 346)

[root@amd-B95-8-2 ~]# lspci -xxxx -s 00:18.3
00:18.3 Host bridge: Advanced Micro Devices [AMD] Family 10h Processor Miscellaneous Control
00: 22 10 03 12 00 00 10 00 00 00 00 06 00 00 80 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 f0 00 00 00 00 00 00 00 00 00 00 00
40: ff ff ff 3f 5c 00 b0 4a 00 00 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 04 00 3f 34 00 00 00 30 51 80 01 60
70: 51 11 32 60 01 01 98 00 14 0c 20 00 11 08 07 00
80: 81 e6 00 e6 e6 41 e6 01 08 00 00 00 00 60 58 00
90: 03 00 00 00 02 00 00 00 00 0d 1f 02 00 00 00 00
a0: 96 08 16 a0 80 18 0c 12 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 43 51 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 26 0f 81 c8 16 0f 2e 03 22 63 47 01
e0: 00 00 00 00 30 13 00 1e 59 7f 07 02 00 00 00 00
f0: 0f 00 10 00 00 00 00 00 00 00 00 00 42 0f 10 00
Comment 27 Avi Kivity 2011-11-14 12:44:08 EST
Looks like a really old lspci.  Was that from RHEL 5?  Please try RHEL 6 or latest Fedora, should give a lot more output, in particular a line beginning with 180:.
Comment 28 Golita Yue 2011-11-15 02:49:14 EST
(In reply to comment #27)
> Looks like a really old lspci.  Was that from RHEL 5?  

Yes, that came from the RHEL 5 host.
The host info is as follows:
kernel-2.6.18-290.el5
kvm-83-243.el5

Per comment #15, this bug can only be reproduced on RHEL5, not on RHEL6 (same host, different OS).

> Please try RHEL 6 or
> latest Fedora, should give a lot more output, in particular a line beginning
> with 180:.

Hi Avi,
Do you mean I should reinstall the above host with RHEL6 and then collect the lspci info?
Comment 29 Avi Kivity 2011-11-15 03:20:31 EST
Yes.  Or you can try to build pciutils from source if that's easier.
Comment 30 Golita Yue 2011-11-17 02:02:00 EST
Created attachment 534154
AMD host lspci info

Attached the host lspci info.
Comment 31 Avi Kivity 2011-11-17 05:57:51 EST
It looks like 0x188[22] is set, so it's not erratum 346.
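
The check can also be done mechanically on an 'lspci -xxxx' dump. The sketch below is illustrative only: the helper names reg_dword and bit_is_set are made up, the offset is assumed to be dword-aligned and given as two or more hex digits, and the dump is assumed to contain the extended config space lines (such as the one beginning with 180:).

```shell
# reg_dword OFFSET: read an `lspci -xxxx` hex dump on stdin and print the
# little-endian 32-bit value at config-space offset OFFSET (hex, no 0x
# prefix, dword-aligned), for example "188".
reg_dword() {
    awk -v off="$1" '
        BEGIN {
            off = tolower(off)
            # Label of the 16-byte dump row holding the offset, e.g. "180:".
            row = substr(off, 1, length(off) - 1) "0:"
            # Byte position within that row, taken from the last hex digit.
            col = index("0123456789abcdef", substr(off, length(off))) - 1
        }
        # Field 1 is the row label, so byte N of the row is field N + 2.
        $1 == row {
            printf "0x%s%s%s%s\n", $(col + 5), $(col + 4), $(col + 3), $(col + 2)
            exit
        }
    '
}

# bit_is_set VALUE BIT: succeed if bit BIT of VALUE is 1.
bit_is_set() {
    [ $(( ($1 >> $2) & 1 )) -eq 1 ]
}
```

A possible use would be `val=$(lspci -xxxx -s 00:18.3 | reg_dword 188)` followed by `bit_is_set "$val" 22 && echo "F3x188[22] is set"`. Applied to the 60: row posted in comment 26, `reg_dword 68` yields 0x30000000.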
Comment 33 RHEL Product and Program Management 2012-04-02 06:27:02 EDT
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux release.  Product Management has
requested further review of this request by Red Hat Engineering, for
potential inclusion in a Red Hat Enterprise Linux release for currently
deployed products.  This request is not yet committed for inclusion in
a release.
Comment 34 Avi Kivity 2012-04-10 06:30:12 EDT
Affects specific, outdated, hardware.  Closing.
