Bug 645043
Summary: | RHEL5.3 - xen host - crashing randomly | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Douglas Schilling Landgraf <dougsland> |
Component: | kernel-xen | Assignee: | Xen Maintainance List <xen-maint> |
Status: | CLOSED DUPLICATE | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
Severity: | high | Docs Contact: | |
Priority: | urgent | ||
Version: | 5.3 | CC: | drjones, ifloodmu, pbonzini, prc, russ+bugzilla-redhat, uobergfe, xen-maint |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2011-01-18 12:55:58 UTC | Type: | --- |
Description
Douglas Schilling Landgraf
2010-10-20 18:39:21 UTC
The core files are available at:

host ussp-pb20:
=============================
Machine:
--------------
megatron.gsslab.rdu.redhat.com
Login with Kerberos name/password

1st core available:
$ cd /cores/20101013074537/work
$ ./crash

2nd core available:
$ cd /cores/20101019105733/work
$ ./crash

host ussp-pb29:
=============================
Machine:
--------------
megatron.gsslab.rdu.redhat.com
Login with Kerberos name/password

1st core available:
$ cd /cores/20101018111357/work
$ ./crash

2nd core available:
$ cd /cores/20101018105514/work
$ ./crash

host ussp-pb07
================================
Machine:
--------------
megatron.gsslab.rdu.redhat.com
Login with Kerberos name/password

Core available:
$ cd /cores/20101014095555/work
$ ./crash

Thanks.

The evidence is quite strong; the only problem I have is that I don't see how update_va_mapping could return ENOMEM in either 5.3 or more recent hypervisors. I'll prepare a custom kernel that BUGs on errors from the individual hypercalls.

The three error messages at startup are always there on 5.3; I think they were fixed in 5.4.

> I am confused by the test kernel version with respect to the
> content of the RPMs, because ...
>
> - The most recent change log entry is only 2.6.18-8:
>
> - The list of patches that I see in the 'kernel-2.6.spec' file looks much
> different from what I see, for example in a spec file from a 2.6.18-128
> source RPM.
>
> Could you please clarify ?
There are two sources of these differences:
1) I used "make rh-srpm" on the kernel git repository to build the SRPM, not dist-cvs. I didn't know that it created such a different list of patches.
2) The hypervisor is 5.6-based even for the -128 kernel. This was not intended; if desired, the customer can keep using the stock -128 hypervisor, since there is no debug output there.
---
Thanks for double-checking the -ENOMEM vs. -EINVAL value. It really looks like some paging data structure is corrupted (I don't think it's the hypervisor's fault; the dom0 kernel seems the more likely culprit).
At this point, I suggest that the customer try the BUG_ON version of the -228 test kernel (which has a WARN_ON) on some machines, and the -128 BUG_ON test kernel on others. The former will tell us whether the bug has been fixed; the latter will hopefully provide some hints about the corruption earlier, though likely a bit after it has happened.
If the machines are attached to a serial console, it can be useful to capture the hypervisor's error output from there, since those messages are lost by the time the sosreport is generated. Add the following to the hypervisor boot options: "com1=115200,8n1 guest_loglvl=9".
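For illustration only, a GRUB legacy stanza with those hypervisor options might look roughly like the sketch below. Only the two options on the xen.gz line come from this comment; the title, file names, root device and partition are assumed placeholders and must be taken from the existing entry on the affected host.

```
# /boot/grub/grub.conf (sketch; every name except the two Xen options is a placeholder)
title Red Hat Enterprise Linux Server (2.6.18-128.el5xen)
        root (hd0,0)
        # Hypervisor options go on the xen.gz line, not on the dom0 kernel line
        kernel /xen.gz-2.6.18-128.el5 com1=115200,8n1 guest_loglvl=9
        module /vmlinuz-2.6.18-128.el5xen ro root=/dev/VolGroup00/LogVol00
        module /initrd-2.6.18-128.el5xen.img
```

The options take effect only after the host is rebooted into the edited entry.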
There are residual issues in bug 666453, but this part was a duplicate.
*** This bug has been marked as a duplicate of bug 479754 ***