>Description of problem:
The customer experienced a sudden high softirq load on a Xen host, which forced them to migrate the Xen guests away to another host. The softirq load decreased linearly as guests were migrated out.

The customer runs a Xen virtualization host, ux004, that hosts several Xen paravirtualized guests (vux0xx).
- On May 4th, around 10:00, ux004 suddenly experienced a massive softirq load that left both the host and the guests unusable.
- The customer started to shut down the guests; after stopping between three and five of them, the softirq load on the host dropped to nearly 0%.
- They still shut down the remaining guests, upgraded the host kernel from 2.6.18-238.5.1 to 2.6.18-238.9.1, and increased dom0 vcpus from 2 to 4.
- /var/log/xen/xend.log shows multiple entries like this one: "Cannot recreate information for dying domain 24.  Xend will ignore this domain from now on."
- BZ#695369 is not believed to be the cause of the softirq load.

>Supporting information:
sosreport-ux004.201105041307-772065-0e0750.tar.bz2
ux004_sar04.pdf (sar statistics for May 4th)

A different event, possibly unrelated to the first one, happened on May 19th on a different virtualization host, ux001, running 2.6.18-238.12 (hotfix for BZ#695369) and with 'hardvirt' enabled.
- The customer started live migrating Xen paravirtualized guests from ux002 to ux001 at 06:36.
- After around 20 guests had been live migrated to ux001, a sudden high softirq load occurred on ux001 at 06:55.
- The customer stopped the live migrations and started to live migrate the guests from ux001 back to ux002.
- dom0 experienced a higher than usual steal time (up to 7%).
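For context, the softirq and steal percentages above come from the attached sar/mpstat data; the same figures can be derived by hand from two /proc/stat "cpu" samples. A minimal sketch with synthetic counters (the helper name `pct_softirq_steal` and the sample values are illustrative, not from the sosreports):

```shell
# Compute softirq and steal percentages from two space-separated
# /proc/stat "cpu" samples (fields: user nice system idle iowait irq softirq steal).
pct_softirq_steal() {
  # $1 = earlier sample, $2 = later sample
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, s1); split(b, s2)
    total = 0
    for (i = 1; i <= n; i++) { d[i] = s2[i] - s1[i]; total += d[i] }
    # softirq is field 7, steal is field 8
    printf "softirq=%.1f%% steal=%.1f%%\n", 100*d[7]/total, 100*d[8]/total
  }'
}

# Synthetic example: 100 ticks elapsed, 40 in softirq, 7 stolen
pct_softirq_steal "0 0 0 0 0 0 0 0" "50 0 3 0 0 0 40 7"
# -> softirq=40.0% steal=7.0%
```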
>Supporting information:
sosreport-ux001.00469714-757550-73e9fb.tar.bz2
sosreport-ux002.00469714-514346-5fa7bf.tar.bz2
ux001_sar19.pdf (sar statistics for May 19th)
ux002_sar19.pdf (sar statistics for May 19th)
ux001-softirq-analysis.tar.bz2 containing:
- 3 captures of /proc/interrupts (1 total, 1 for blk irqs, 1 for net irqs): proc-interrupts*.txt
- 4 captures of /proc/irq/*/smp_affinity as we set it: irq.smp_affinity.*.txt
- 1 five-minute mpstat log showing softirq per CPU with the above pinning: mpstat.20110519-073316.log
- 1 vuxiostat capture also showing steal time: vuxiostat.ux001.20110519-073833.log

>Supporting information:
I've attached the oprofile report:
- oprofile_test_reports.tar.gz

>How reproducible:
We have a reproducer at the FAB lab.

Xen hosts:
10.33.8.90  r210xen.gsslab.fab.redhat.com
10.33.8.140 dhcp-140.gsslab.fab.redhat.com

iSCSI server for the shared storage:
10.33.8.75  pe1950-5.gsslab.fab.redhat.com

All guests named RHEL5LMx can be live migrated between both Xen hosts, and xenoprofile should be installed, at least, on r210xen.

>Actual results:
A sudden high softirq load was experienced on a Xen host.

>Expected results:
Identify why it is getting a high softirq load.
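For reference, the irq.smp_affinity.*.txt captures above hold hex CPU bitmasks (bit N set = IRQ allowed on CPU N). A minimal sketch of computing such a mask (the helper name `cpu_mask` is illustrative and not part of the attached scripts; writing to /proc/irq requires root):

```shell
# Build the hex smp_affinity mask for a list of CPU numbers.
cpu_mask() {
  local mask=0 cpu
  for cpu in "$@"; do
    mask=$(( mask | (1 << cpu) ))
  done
  printf '%x\n' "$mask"
}

cpu_mask 0      # -> 1   (CPU 0 only)
cpu_mask 2 3    # -> c   (CPUs 2 and 3)
# echo c > /proc/irq/<IRQ>/smp_affinity   # would pin <IRQ> to CPUs 2-3
```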
Created attachment 502289 [details] ux004_sar04.pdf
Created attachment 502291 [details] ux002_sar19.pdf
Created attachment 502292 [details] ux001-softirq-analysis.tar.bz2
Created attachment 502293 [details] ux001_sar19.pdf
Created attachment 502295 [details] sosreport-ux004.201105041307-772065-0e0750.tar.bz2
Created attachment 502300 [details] sosreport-ux001.00469714-757550-73e9fb.tar.bz2
Created attachment 502302 [details] sosreport-ux002.00469714-514346-5fa7bf.tar.bz2
This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.7, and Red Hat does not plan to fix this issue in the currently developed update. Contact your manager or support representative in case you need to escalate this bug.
Hello Julio, I'm going to close this BZ as INSU tomorrow. Thank you.