Description of problem:
When running the RHEL3 (RHEL3U8) PV drivers for networking and disk, Cisco has seen a performance regression compared to the non-PV drivers.

Version-Release number of selected component (if applicable):
PV drivers for RHEL3U8 (2.4.21-47.0.1.ELsmp)
RHEL5.1 GA for the hypervisor

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:

Details from Cisco:

I wanted to share the results from the first round of Xen testing. The short of it is that the PV drivers (version 0.1-1) performed much worse than the standard drivers.

In all cases the host is a DL380G5 with 2 dual-core processors (4 CPU contexts) and 4 GB RAM, running the 32-bit version of RHEL5.1.

Guest config:
- OS is the 32-bit version of RHEL3U8 with the 2.4.21-47.0.1.ELsmp kernel
- 4 vcpus
- 3 GB RAM
- 2 disks:
  -- all files except trace are on the first disk, which is a 16 GB file
  -- the second disk is a 12 GB file and is used for ccm trace files (the most prolific IO generator)
- trace files are compressed by ccm before writing to disk (minimizes IO)

Results:

1. Version 0.1-1 of the PV drivers -- both network (xen-vnif) and disk (xen-vbd) for the trace partition
   --> max call rate is ~40,000 BHCC (busy-hour calls completed)

   What concerns me on this run is the interrupt rate: 66,000 interrupts/sec. In other runs I traced it to the network driver. One thought is that maybe the Xen PV driver does not adhere to the NAPI design, but 66,000 interrupts far surpasses the number of packets received, so that does not appear to explain it.

   Here's a representative sample from 'mpstat 60':
   user=12%, system=12%, iowait=0.22%, softirq=16%, 66,000 intr/sec

   softirq is high compared to all other Xen tests and vmware-esx tests (it's on par with what I see running kvm and vmware server).

2. Standard drivers (8139cp network and IDE disk)
   --> max call rate is ~52,000 BHCC

   Compared to the PV test, the standard drivers (i.e., non-PV) handle a 30% higher load.

   Here's a representative sample from 'mpstat 60':
   user=16.3%, system=10.5%, iowait=1.6%, irq=6.5%, softirq=3%, 1928 intr/sec

3. PV driver for the trace disk, standard network driver
   --> max call rate is ~52,000 BHCC

   I seem to have lost the mpstat sample for this case, but I recall the numbers being similar to case 2 with one exception -- iowait was much lower (i.e., using the PV disk driver lowers iowait, which is good).

As a comparison, a similar guest configuration running on ESX can sustain a 55,000 BHCC load, and a representative CPU usage sample is:
user=14.8%, sys=11%, iowait=0.13%, irq=0.265%, softirq=1.85%, 1710 intr/sec

Expected results:

Additional info:
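For reference, a minimal sketch of how guest-side samples like the ones quoted above can be collected; the 60-second interval matches the 'mpstat 60' runs in the comment, while the sample count, the /proc/interrupts snapshots, and the output file names are illustrative assumptions rather than part of the original report.

# per-CPU utilisation (user/system/iowait/irq/softirq) over 60-second windows
mpstat 60 5 > mpstat.out &

# interrupt counters before and after one 60-second window, to derive intr/sec
cat /proc/interrupts > irq.before
sleep 60
cat /proc/interrupts > irq.after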
More info from Cisco:

Cisco is using a "standard" config file (see below) tailored for PV drivers (VBD and VNIF). cucm61 uses the standard drivers; cucm61pv uses the PV drivers. Right now the latter is set up to use the PV disk and the standard network. Cisco has verified on the guest side that the PV drivers are used in the PV tests. Cisco gave up on virt-manager months ago; besides, they prefer the CLI. They use 'xm create' to launch the VMs (an example invocation is sketched after the config files below).

name = "cucm61"
uuid="97fb0bb6-93ee-4575-ada5-e85bc285edf1"
maxmem = 3072
memory = 3072
vcpus = 4
boot = "c"
disk = [ "file:/opt/xen/images/cucm-6.1.0.9901-9000.img,hda,w", "file:/opt/xen/images/cucm-trace.img,hdb,w", ]
vif = [ "mac=00:16:3e:65:e7:9e, bridge=virbr0, type=ioemu" ]
builder = "hvm"
kernel = "/usr/lib/xen/boot/hvmloader"
device_model = "/usr/lib/xen/bin/qemu-dm"
serial = "pty"
pae = 1
acpi = 1
apic = 1
on_poweroff = "destroy"
on_reboot = "destroy"
on_crash = "destroy"
vnc = 1
vncunused = 1
vnclisten="10.94.150.89"

name = "cucm61pv"
uuid="d90406ab-accd-45aa-86bf-26da1f7dda34"
maxmem = 3072
memory = 3072
vcpus = 4
boot = "c"
disk = [ "file:/opt/xen/images/cucm-6.1.0.9901-9000.img,hda,w", "tap:aio:/opt/xen/images/cucm-trace.img,xvda,w", ]
#vif = [ "mac=00:16:3e:65:e7:9e, bridge=virbr0" ]
vif = [ "mac=00:16:3e:65:e7:9e, bridge=virbr0, type=ioemu" ]
builder = "hvm"
kernel = "/usr/lib/xen/boot/hvmloader"
device_model = "/usr/lib/xen/bin/qemu-dm"
serial = "pty"
pae = 1
acpi = 1
apic = 1
on_poweroff = "destroy"
on_reboot = "destroy"
on_crash = "destroy"
vnc = 1
vncunused = 1
vnclisten="10.94.150.89"
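A rough example of how the two guests above would be launched from the CLI with 'xm create'; the /etc/xen config file paths are an assumption, since the actual locations used by Cisco are not stated in the report.

# assumed config file locations; adjust to wherever the files above are kept
xm create /etc/xen/cucm61      # guest with the standard (emulated) drivers
xm create /etc/xen/cucm61pv    # guest with the PV trace disk (xvda via tap:aio)

# confirm both domains are up and check memory/vcpu allocation
xm list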
We have been able to recreate the symptoms using netperf with UDP traffic. Here is a summary of some of the test results (a rough sketch of the kind of netperf command used is at the end of this comment).

Test Data
---------
This section lists a summary of the test data.

Notes:
------
1. In all cases, the numbers shown here reflect the receive side of a netperf test. The transmitter was kept constant (perf15).
2. The values shown are approximate averages with some rounding. Thus there is a *relatively* large margin of error, so small percentage changes should not be quoted.
3. I have been having trouble running a reliable set of tests with a FV guest w/o PV drivers. Thus, no data for those configurations is presented here.
4. In the data below, all runs were with the standard kernel except those indicated by a RHEL52 prefix. That set of data used a hypervisor with the Xen 3.1.2 changes incorporated. However, the respective guests for the RHEL52 tests were the same.

OS      Config      Flags    VCPU   int/sec   msg/sec(rcv)   Mb/sec   %soft   %sys   %irq
RHEL3   FVPV                   4      165K         5K           41     100%     0%     0%
RHEL3   FVPV        noapic     4       22K       114K          934      43%    11%    17%
RHEL3   FVPV                   1       22K       114K          933      51%    11%     0%
RHEL3   FVPV        noapic     1       22K       113K          931      42%    11%    17%
RHEL5   FVPV                   4      165K        77K          630      93%     5%     0%
RHEL5   FVPV        noapic     4       23K       114K          935      33%    15%    17%
RHEL5   FVPV                   1       44K       114K          935      50%    14%     2%
RHEL5   FVPV        noapic     1       21K       114K          935      30%    15%    18%
RHEL5   PV                     4     19.5K       114K          935      12%    16%     1%
RHEL5   Dom0                   8       23K       114K          935      37%    30%     8%
RHEL5   Baremetal              8       20K       114K          935      14%     9%     1%

Kernels:
RHEL3 guest: Linux 2.4.21-50.EL #1 SMP Tue May 8 17:10:00 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
RHEL5 guest: Linux 2.6.18-53.el5 #1 SMP Wed Oct 10 16:34:19 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
RHEL5 host:  Linux 2.6.18-53.el5xen #1 SMP Wed Oct 10 16:48:44 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

EXCEPT FOR the RHEL52 runs, which used the host kernel/hypervisor from BBurns:
Linux 2.6.18-58.el5bb64cpustratusxen #1 SMP Thu Dec 6 16:41:46 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
This kernel has many of the Xen 3.1.2 changes.

OS       Config   Flags    VCPU   int/sec   msg/sec(rcv)   Mb/sec   %soft   %sys   %irq
RHEL52   FVPV               4      215K        65K          529       1%     87%     0%
RHEL52   FVPV     noapic    4       23K       114K          935      30%     16%    14%
RHEL52   FVPV               1       44K       114K          934      44%     14%     2%
RHEL52   FVPV     noapic    1       22K       114K          935      30%     13%    16%
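For reference, a hedged sketch of a netperf UDP run of the kind summarized above. The comment only states that UDP traffic was used and that the transmitter (perf15) was kept constant; the test length, message size, and the guest IP placeholder below are illustrative assumptions, not the exact parameters used.

# on the receiver (the guest/host under test): start the netperf server
netserver

# on the transmitter (perf15), aimed at the receiver's address
netperf -H <guest-ip> -t UDP_STREAM -l 60 -- -m 1024

# the "noapic" rows in the tables above presumably correspond to booting the
# guest kernel with the 'noapic' parameter on its kernel command line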
Posted 4-patch fix for this performance problem; patches come from xen-3.1-testing. Performance / functionality fix verified by Mark Wagner w/netperf.
Fixed in kernel 2.6.18-71.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5
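For completeness, a hedged sketch of pulling in and installing such a test kernel on the RHEL5 Xen host; the RPM file name, package name, and architecture below are placeholders only, since the directory layout under that URL is not given in the comment.

# browse http://people.redhat.com/dzickus/el5 for the 2.6.18-71.el5 Xen kernel
# RPM matching the host architecture, then (placeholder file name):
wget http://people.redhat.com/dzickus/el5/kernel-xen-2.6.18-71.el5.i686.rpm
rpm -ivh kernel-xen-2.6.18-71.el5.i686.rpm   # install (not upgrade) so the old kernel stays bootable
reboot                                       # boot the test hypervisor/kernel and re-run the tests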
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0314.html