Bug 594559
Summary: | Host Kernel panic at AMD machine when start a vm | | |
---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Joy Pu <ypu> |
Component: | kernel | Assignee: | Gleb Natapov <gleb> |
Status: | CLOSED NEXTRELEASE | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
Severity: | medium | Docs Contact: | |
Priority: | low | | |
Version: | 6.0 | CC: | aarcange, gleb, knoel, tburke |
Target Milestone: | rc | | |
Target Release: | --- | | |
Hardware: | All | | |
OS: | Linux | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2010-05-27 05:46:09 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Description
Joy Pu
2010-05-21 02:46:53 UTC
This looks like AMD erratum 383 to me:
1. This is an L1 TLB multimatch error.
2. MC0_STATUS equals f600000000010015 (this is B6000000_00010015h with bit 62 set).
The question is why a RHEL6 guest triggers it.

Do you know if transparent huge pages were used on the host and/or guest?

(In reply to comment #3)
> Do you know if transparent huge pages were used on the host and/or guest?
I think so. I was also running some transparent hugepage tests on the same machine that day, so transparent hugepage was set to "always" on the host. Both host and guest were using it at that time.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux major release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux major release. This request is not yet committed for inclusion.

corrupt: yes
Data Cache Error: L1 TLB multimatch.
This is not a software problem!
------
Looks like a buggy CPU to me... It could be that THP triggers hardware bugs. However, the THP code you were using is from an older kernel, so please try installing the latest -29 codebase on host and guest. First try the anon-vma-chain version, which is an unmodified -29 rhel6 build:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2468180
If that does not fix it, you can also try the -29 rhel6 build with the backout-anon-vma version and some additional debug code enabled:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2468172
This matters mostly on the host side, because a guest could in theory be malicious and do anything, and the host still should not machine-check.

I wonder if this could be an issue in the CPU. I've never seen this on my amd phenom x4 cpus with ntp enabled. Thanks!

The two RPMs I posted (or any -29 or more recent rhel6 build) include a workaround for an AMD erratum that can trigger exactly when THP is used on both host and guest.
So there's a chance the CPU bug is worked around by upgrading the host kernel to the -29 release I posted. Please try to reproduce with -29.

(In reply to comment #8)
> Please try to reproduce with -29.
I have hit this problem only once. It is not easy to reproduce.

With the patch on the host, I tried booting a VM 100 times using these commands:
for i in `seq 100`; do boot_up_cmd ; done
for i in `seq 100`; do sleep 120 && kill -9 `pidof qemu` && echo "kill number" $i ;done

I also booted a VM and ran the tsc test (the test that triggered this problem the first time) 100 times, and the system still seems OK. No kernel panic or warning on the host.

(In reply to comment #9)
> No kernel panic and warning in host.
Have you set transparent huge page to "always" in the host and guest before testing? If not, please retest. If yes, then let's close the bug.

(In reply to comment #10)
> Have you set transparent huge page to "always" in the host and guest before
> testing? If no please retest.
> If yes then lets close the bug.
Yes. It was already set to "always"; scan_sleep_millisecs and alloc_sleep_millisecs are set to "0", and defrag is set to "yes".

Then I am closing this bug. If the problem reappears it can be reopened.
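
For readers who want to check the MC0_STATUS comparison from the erratum 383 diagnosis earlier in this thread, here is a minimal Python sketch. It is not taken from any kernel source. The helper names (`differing_bits`, `decode_flags`) are hypothetical, and the flag table assumes the architectural x86 MCi_STATUS layout for the top bits. The two constants are the values quoted in the thread.

```python
# Top flag bits of an x86 MCi_STATUS register (assumed architectural layout).
MCI_STATUS_FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MCi_MISC holds valid data
    58: "ADDRV",  # MCi_ADDR holds a valid address
    57: "PCC",    # processor context corrupt
}

# Values quoted in the bug comments.
ERRATUM_383_STATUS = 0xB600000000010015
OBSERVED_MC0_STATUS = 0xF600000000010015


def differing_bits(a, b):
    """Return the bit positions (high to low) where a and b differ."""
    diff = a ^ b
    return [bit for bit in range(63, -1, -1) if (diff >> bit) & 1]


def decode_flags(status):
    """Return the names of the top flag bits that are set in status."""
    return [name for bit, name in MCI_STATUS_FLAGS.items() if (status >> bit) & 1]


if __name__ == "__main__":
    print("differing bits:", differing_bits(OBSERVED_MC0_STATUS, ERRATUM_383_STATUS))
    print("observed flags:", decode_flags(OBSERVED_MC0_STATUS))
```

Under these assumptions the two values differ only in bit 62 (the overflow flag), which matches the "bit 62 set" remark, and the PCC flag is set in the observed value, which is consistent with the "corrupt: yes" line in the decoded machine check.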