Bug 561370

Summary:	KVM guest crashed during a multi guest database run
Product:	Red Hat Enterprise Linux 5	Reporter:	Sanjay Rao <srao>
Component:	kvm	Assignee:	Marcelo Tosatti <mtosatti>
Status:	CLOSED CANTFIX	QA Contact:	Virtualization Bugs <virt-bugs>
Severity:	high	Docs Contact:
Priority:	low
Version:	5.6	CC:	cpelland, llim, tburke, virt-maint, ykaul
Target Milestone:	rc
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Last Closed:	2010-05-27 18:22:17 UTC	Type:	---
Bug Blocks:	580948

Description Sanjay Rao 2010-02-03 14:45:00 UTC

Description of problem:

A KVM guest crashed during a multi-guest run of a database workload. The host is AMD (Six-Core AMD Opteron(tm) Processor 8431).


Version-Release number of selected component (if applicable):

Host and guests running 2.6.18-186.el5
File system used for the testing - ext4 (e4fsprogs-1.41.9-3.el5)


How reproducible:

Happened only once. I do not know whether it can be reproduced.


Steps to Reproduce:
1. Started 4 KVM guests (6 CPUs and 14G each)
2. Ran database workload
3. One of the guests crashed.
  
Actual results:

Messages logged in /var/log/messages in the guest at the time of the crash:


Feb  2 18:02:48 dhcp47-99 kernel: list_add corruption. prev->next should be ffff81038f7d3e28, but was 0000000000497000
Feb  2 18:02:48 dhcp47-99 kernel: ----------- [cut here ] --------- [please bite here ] ---------
Feb  2 18:02:48 dhcp47-99 kernel: Kernel BUG at lib/list_debug.c:31
Feb  3 08:37:19 dhcp47-99 syslogd 1.4.1: restart.



Expected results:

The guest should continue to run.

Additional info:

The screen shot of the console is attached.

Comment 1 Marcelo Tosatti 2010-02-19 04:54:08 UTC

Sanjay,

Can you please attempt to reproduce the bug and save the entire oops message? Also, there is no screenshot attached.

Will look for possible candidates in the meantime. Sorry for the late reply.

Thanks

Comment 2 Marcelo Tosatti 2010-02-19 04:56:53 UTC

Also, were hugepages being used?

Comment 3 Sanjay Rao 2010-02-19 12:59:31 UTC

I will try to reproduce the problem when I get a chance, but I am not sure that this issue is reproducible. That's why I captured everything that was reported, hoping that it might give some clues.

There is also an issue with running Oracle databases on ext4 (BZ 562219). I am not sure whether the two are related.

This test was not using huge pages.

Comment 6 Marcelo Tosatti 2010-02-24 20:41:30 UTC

Postponing to RHEL 5.6.

Comment 8 Marcelo Tosatti 2010-05-27 18:22:17 UTC

Closing the bug on the grounds that it's a one-time memory corruption report; there's not much that can be done without a reproducible case.

Please reopen if necessary.