>Description of problem:
The customer experienced a sudden high softirq load on a Xen host, which forced them to migrate the Xen guests away to another host. The softirq load decreased linearly as guests were migrated out.

The customer runs a Xen virtualization host, ux004, that hosts several Xen paravirtualized guests (vux0xx).
- On May 4th, around 10:00, ux004 suddenly experienced a massive softirq load that left both the host and the guests unusable.
- The customer started to shut down the guests; after stopping between three and five of them, the softirq load on the host dropped to nearly 0%.
- They still shut down the remaining guests, upgraded the host kernel from 2.6.18-238.5.1 to 2.6.18-238.9.1, and increased dom0 vcpus from 2 to 4.
- /var/log/xen/xend.log shows multiple entries like this one: "Cannot recreate information for dying domain 24.  Xend will ignore this domain from now on."
- BZ#695369 is not believed to be the cause of the softirq load.

>Supporting information:
sosreport-ux004.201105041307-772065-0e0750.tar.bz2
ux004_sar04.pdf (sar statistics for May 4th)

A different event, possibly unrelated to the first one, happened on May 19th on a different virtualization host, ux001, running 2.6.18-238.12 (hotfix for BZ#695369) and with 'hardvirt' enabled.
- The customer started live migrating Xen paravirtualized guests from ux002 to ux001 at 06:36.
- After around 20 guests had been live migrated to ux001, a sudden high softirq load occurred on ux001 at 06:55.
- The customer stopped the live migrations and started to live migrate the guests from ux001 back to ux002.
- dom0 experienced a higher than usual steal time (up to 7%).
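For context, the softirq and steal percentages above come from the attached sar/mpstat data; the same figures can be derived by hand from two /proc/stat "cpu" samples. A minimal sketch with synthetic counters (the helper name `pct_softirq_steal` and the sample values are illustrative, not from the sosreports):

```shell
# Compute softirq and steal percentages from two space-separated
# /proc/stat "cpu" samples (fields: user nice system idle iowait irq softirq steal).
pct_softirq_steal() {
  # $1 = earlier sample, $2 = later sample
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, s1); split(b, s2)
    total = 0
    for (i = 1; i <= n; i++) { d[i] = s2[i] - s1[i]; total += d[i] }
    # softirq is field 7, steal is field 8
    printf "softirq=%.1f%% steal=%.1f%%\n", 100*d[7]/total, 100*d[8]/total
  }'
}

# Synthetic example: 100 ticks elapsed, 40 in softirq, 7 stolen
pct_softirq_steal "0 0 0 0 0 0 0 0" "50 0 3 0 0 0 40 7"
# -> softirq=40.0% steal=7.0%
```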
>Supporting information:
sosreport-ux001.00469714-757550-73e9fb.tar.bz2
sosreport-ux002.00469714-514346-5fa7bf.tar.bz2
ux001_sar19.pdf (sar statistics for May 19th)
ux002_sar19.pdf (sar statistics for May 19th)
ux001-softirq-analysis.tar.bz2 containing:
- 3 captures of /proc/interrupts (1 total, 1 for blk irqs, 1 for net irqs): proc-interrupts*.txt
- 4 captures of /proc/irq/*/smp_affinity as we set it: irq.smp_affinity.*.txt
- 1 five-minute mpstat log showing softirq per CPU with the above pinning: mpstat.20110519-073316.log
- 1 vuxiostat capture also showing steal time: vuxiostat.ux001.20110519-073833.log

>Supporting information:
I've attached the oprofile report:
- oprofile_test_reports.tar.gz

>How reproducible:
We have a reproducer at the FAB lab.

Xen hosts:
10.33.8.90  r210xen.gsslab.fab.redhat.com
10.33.8.140 dhcp-140.gsslab.fab.redhat.com

iSCSI server for the shared storage:
10.33.8.75  pe1950-5.gsslab.fab.redhat.com

All guests named RHEL5LMx can be live migrated between both Xen hosts, and xenoprofile should be installed, at least, on r210xen.

>Actual results:
A sudden high softirq load was experienced on a Xen host.

>Expected results:
Identify why it is getting a high softirq load.
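For reference, the irq.smp_affinity.*.txt captures above hold hex CPU bitmasks (bit N set = IRQ allowed on CPU N). A minimal sketch of computing such a mask (the helper name `cpu_mask` is illustrative and not part of the attached scripts; writing to /proc/irq requires root):

```shell
# Build the hex smp_affinity mask for a list of CPU numbers.
cpu_mask() {
  local mask=0 cpu
  for cpu in "$@"; do
    mask=$(( mask | (1 << cpu) ))
  done
  printf '%x\n' "$mask"
}

cpu_mask 0      # -> 1   (CPU 0 only)
cpu_mask 2 3    # -> c   (CPUs 2 and 3)
# echo c > /proc/irq/<IRQ>/smp_affinity   # would pin <IRQ> to CPUs 2-3
```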
Created attachment 502289 [details] ux004_sar04.pdf
Created attachment 502291 [details] ux002_sar19.pdf
Created attachment 502292 [details] ux001-softirq-analysis.tar.bz2
Created attachment 502293 [details] ux001_sar19.pdf
Created attachment 502295 [details] sosreport-ux004.201105041307-772065-0e0750.tar.bz2
Created attachment 502300 [details] sosreport-ux001.00469714-757550-73e9fb.tar.bz2
Created attachment 502302 [details] sosreport-ux002.00469714-514346-5fa7bf.tar.bz2
This request was evaluated by Red Hat Product Management for inclusion in Red Hat Enterprise Linux 5.7, and Red Hat does not plan to fix this issue in the currently developed update. Contact your manager or support representative in case you need to escalate this bug.
Hello Julio, I'm going to close this BZ as INSU tomorrow. Thank you.