Bug 1095627 - missing vhost schedule causing thread starvation
Summary: missing vhost schedule causing thread starvation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.6
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Michael S. Tsirkin
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Duplicates: 1090938 (view as bug list)
Depends On:
Blocks:
 
Reported: 2014-05-08 08:52 UTC by Michael S. Tsirkin
Modified: 2018-12-06 16:27 UTC (History)
CC List: 12 users

Fixed In Version: kernel-2.6.32-465.el6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-10-14 06:08:13 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 890083 0 None None None 2018-12-06 16:20:44 UTC
Red Hat Product Errata RHSA-2014:1392 0 normal SHIPPED_LIVE Important: kernel security, bug fix, and enhancement update 2014-10-14 01:28:44 UTC

Description Michael S. Tsirkin 2014-05-08 08:52:50 UTC
Description of problem:
 A vhost thread can currently "hog" the core. If several vhost
 threads need to share the same core, typically one would get most of the
 CPU time (and its associated guest most of the performance), while the
 others hardly get any work done.

Version-Release number of selected component (if applicable):
kernel-2.6.32-447.el6

How reproducible:
often

Steps to Reproduce:
1. Start two VMs.

2. Force the two vhost threads to run on the same core.

3. Run stress from both VMs.

Actual results:
One vhost thread makes progress; the other hardly gets any work done.

Expected results:
Both threads make slow but comparable progress.

Additional info:

Fixed upstream and in RHEL 7 by commit d550dda192c1bd039afb774b99485e88b70d7cb8.
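
For illustration only, below is a schematic, kernel-style sketch of the pattern that fix adds: the per-device vhost worker voluntarily yields the CPU between work items, so a worker with a steady stream of virtqueue kicks no longer monopolizes its core. The struct and function names (sketch_dev, sketch_work, sketch_dequeue, sketch_worker) are simplified stand-ins, not the actual drivers/vhost/vhost.c code or the exact upstream patch.

#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/spinlock.h>
#include <linux/list.h>

/* Hypothetical, simplified stand-ins for the real vhost structures. */
struct sketch_work {
    struct list_head node;
    void (*fn)(struct sketch_work *work);
};

struct sketch_dev {
    spinlock_t work_lock;
    struct list_head work_list;
};

/* Pop one queued work item, or return NULL if the queue is empty. */
static struct sketch_work *sketch_dequeue(struct sketch_dev *dev)
{
    struct sketch_work *work = NULL;

    spin_lock_irq(&dev->work_lock);
    if (!list_empty(&dev->work_list)) {
        work = list_first_entry(&dev->work_list, struct sketch_work, node);
        list_del_init(&work->node);
    }
    spin_unlock_irq(&dev->work_lock);
    return work;
}

static int sketch_worker(void *data)
{
    struct sketch_dev *dev = data;
    struct sketch_work *work;

    for (;;) {
        /* Mark ourselves sleepy before checking the queue so a wakeup
         * that races with the check is not lost. */
        set_current_state(TASK_INTERRUPTIBLE);

        if (kthread_should_stop()) {
            __set_current_state(TASK_RUNNING);
            break;
        }

        work = sketch_dequeue(dev);
        if (work) {
            __set_current_state(TASK_RUNNING);
            work->fn(work);   /* handle one virtqueue kick's worth of work */

            /* The missing piece this bug is about: voluntarily yield the
             * CPU between work items if another task wants it. Without
             * this, a worker with a steady stream of work never leaves
             * the core on a non-preemptible kernel, starving a sibling
             * vhost worker pinned to the same CPU. */
            if (need_resched())
                schedule();
        } else {
            /* Queue empty: sleep until newly queued work wakes us. */
            schedule();
        }
    }
    return 0;
}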

Comment 1 Michael S. Tsirkin 2014-05-08 09:09:15 UTC
Build with the fix:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=7432666

Comment 2 RHEL Program Management 2014-05-08 09:31:51 UTC
This request was evaluated by Red Hat Product Management for
inclusion in a Red Hat Enterprise Linux release.  Product
Management has requested further review of this request by
Red Hat Engineering, for potential inclusion in a Red Hat
Enterprise Linux release for currently deployed products.
This request is not yet committed for inclusion in a release.

Comment 3 Qian Guo 2014-05-08 11:03:45 UTC
Hi Michael

I cannot reproduce this bug. Could you help check my test steps? Thanks very much.

Steps:
host builds:
# rpm -q qemu-kvm
qemu-kvm-0.12.1.2-2.425.el6.x86_64
# uname -r 
2.6.32-458.el6.x86_64

Guest kernel:
# uname -r
2.6.32-431.19.1.el6.x86_64

Steps:

1. Boot 2 guests with vhost:
guest1:
# /usr/libexec/qemu-kvm -cpu Penryn -m 4G -smp 4,sockets=1,cores=4,threads=1 -M pc -enable-kvm  -device piix3-usb-uhci,id=usb -name rhel7 -nodefaults -nodefconfig  -device virtio-balloon-pci,id=balloon0  -vnc :20 -vga std -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0   -monitor stdio     -drive file=/root/qzhang/rhel6.5-64-backup.qcow2,if=none,media=disk,format=qcow2,rerror=stop,werror=stop,aio=native,id=scsi-disk0 -device virtio-scsi-pci,id=bus2 -device scsi-hd,bus=bus2.0,drive=scsi-disk0,id=disk0 -netdev tap,id=netdev0,vhost=on,script=/etc/qemu-ifup -device virtio-net-pci,netdev=netdev0,id=vn1,mac=52:54:00:12:34:1a

guest2:
# /usr/libexec/qemu-kvm -cpu Penryn -m 4G -smp 4,sockets=1,cores=4,threads=1 -M pc -enable-kvm  -device piix3-usb-uhci,id=usb -name rhel7 -nodefaults -nodefconfig  -device virtio-balloon-pci,id=balloon0  -vnc :10 -vga std -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0   -monitor stdio     -drive file=/root/qzhang/rhel6.5-64-backupcp1.qcow2,if=none,media=disk,format=qcow2,rerror=stop,werror=stop,aio=native,id=scsi-disk0 -device virtio-scsi-pci,id=bus2 -device scsi-hd,bus=bus2.0,drive=scsi-disk0,id=disk0 -netdev tap,id=netdev0,vhost=on,script=/etc/qemu-ifup -device virtio-net-pci,netdev=netdev0,id=vn1,mac=52:54:00:12:34:0a

2. Pin the vhost threads to the same CPU:
# pgrep vhost
10633
10699

# taskset -p 01 10633
pid 10633's current affinity mask: ff
pid 10633's new affinity mask: 1

# taskset -p 01 10699
pid 10699's current affinity mask: ff
pid 10699's new affinity mask: 1

# taskset -pc 10633
pid 10633's current affinity list: 0

# taskset -pc 10699
pid 10699's current affinity list: 0


3. Run netserver in both guests, and launch several UDP_STREAM netperf instances on the host:
# for i in $(seq 15) ; do netperf -H 10.66.10.169 -l 172800 -t UDP_STREAM -- -m 65507 & done
# for i in $(seq 15) ; do netperf -H 10.66.11.129 -l 172800 -t UDP_STREAM -- -m 65507 & done

4. Monitor the resource usage via top:
# top -p 10633,10699
....
top - 19:00:03 up 1 day,  6:23,  8 users,  load average: 3.15, 2.85, 2.68
Tasks:   2 total,   0 running,   2 sleeping,   0 stopped,   0 zombie
Cpu(s): 13.4%us, 21.5%sy,  0.0%ni, 64.3%id,  0.3%wa,  0.0%hi,  0.5%si,  0.0%st
Mem:   8001996k total,  7819568k used,   182428k free,    54240k buffers
Swap:  8142840k total,      692k used,  8142148k free,  5798588k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND                                                        
10633 root      20   0     0    0    0 S 43.8  0.0  17:28.75 0 vhost-10620                                                    
10699 root      20   0     0    0    0 S 43.8  0.0  17:08.43 0 vhost-10690 

The two threads are consuming nearly the same amount of resources on the host.

So I did not reproduce this bug. Since the reporter says it is not 100% reproducible, I will wait for one day, check again tomorrow, and update the result here.

Hi Michael, is there something wrong with my steps, or do you have any suggestions that could help us reproduce it?

thanks,

Comment 4 Michael S. Tsirkin 2014-05-08 11:09:48 UTC
Yes; as you can see, together they don't reach 100% CPU, which is why it does not trigger.
I think guest-to-host traffic will reproduce this faster.
Also try TCP.
Also, maybe look at the bandwidth with -D.

Comment 5 Qian Guo 2014-05-08 11:15:22 UTC
(In reply to Michael S. Tsirkin from comment #4)
> Yes; as you can see, together they don't reach 100% CPU, which is why it
> does not trigger.
> I think guest-to-host traffic will reproduce this faster.
> Also try TCP.
> Also, maybe look at the bandwidth with -D.

Thank you for your quick response.

In fact, my first test was from guest to host, but I could not reproduce it there either. Anyway, I will cancel the current instances, go back to testing from guest to host with some TCP instances added, and update here after a longer run.

thanks

Comment 7 Rafael Aquini 2014-05-14 12:35:26 UTC
Patch(es) available on kernel-2.6.32-465.el6

Comment 10 Vlad Yasevich 2014-06-05 16:05:07 UTC
*** Bug 1090938 has been marked as a duplicate of this bug. ***

Comment 11 Vlad Yasevich 2014-06-05 16:07:25 UTC
Appears to also solve customer-reported issues from Bug 1090938. Please consider it for z-stream.

Comment 19 John Skeoch 2014-08-19 22:40:41 UTC
There are a number of interested customers who wish to follow the progress of this bug; this is amplified by the closed (as duplicate) public bug, which points to this restricted bug.

Can I ask that you review the contents and, if appropriate, reconsider the groups applied? Thank you.

John

Comment 20 Michael S. Tsirkin 2014-09-01 08:13:15 UTC
Go ahead and make it public as appropriate.

Comment 22 errata-xmlrpc 2014-10-14 06:08:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-1392.html

