Bug 1712781

Summary: KVM-RT guest fails boot with emulatorsched
Product: Red Hat Enterprise Linux 8 Reporter: Pei Zhang <pezhang>
Component: kernel-rt Assignee: Juri Lelli <juri.lelli>
kernel-rt sub component: KVM QA Contact: Pei Zhang <pezhang>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: bhu, chayang, chuhu, fiezzi, jinzhao, jlelli, juri.lelli, juzhang, lcapitulino, mkletzan, mtosatti, pauld, virt-maint, williams
Version: 8.1   
Target Milestone: rc   
Target Release: 8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: kernel-rt-4.18.0-176.rt13.33.el8 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-04-28 15:25:29 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1774652    
Bug Blocks: 1640832, 1722609    
Attachments:
Description Flags
excerpt from libvirtd.log when starting a domain with <emulatorsched/> fails none
excerpt from libvirtd.log when starting a domain without <emulatorsched/> works none

Description Pei Zhang 2019-05-22 08:59:53 UTC
Description of problem:
KVM-RT guest fails to boot with <emulatorsched scheduler='fifo' priority='1'/>.

Version-Release number of selected component (if applicable):
libvirt-5.3.0-1.module+el8.1.0+3164+94495c71.x86_64
kernel-rt-4.18.0-89.rt16.29.el8.x86_64
tuned-2.10.0-15.el8.noarch
qemu-kvm-4.0.0-0.module+el8.1.0+3169+3c501422.x86_64


How reproducible:
100%

Steps to Reproduce:
1. Set up the RT host.

2. Boot the KVM-RT guest with emulatorsched. The boot fails.

<vcpu placement='static'>10</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='16'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <vcpupin vcpu='2' cpuset='3'/>
    <vcpupin vcpu='3' cpuset='5'/>
    <vcpupin vcpu='4' cpuset='7'/>
    <vcpupin vcpu='5' cpuset='9'/>
    <vcpupin vcpu='6' cpuset='11'/>
    <vcpupin vcpu='7' cpuset='13'/>
    <vcpupin vcpu='8' cpuset='15'/>
    <vcpupin vcpu='9' cpuset='18'/>
    <emulatorpin cpuset='2,4,6,8'/>
    <emulatorsched scheduler='fifo' priority='1'/>
    <vcpusched vcpus='0' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='1' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='2' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='3' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='4' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='5' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='6' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='7' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='8' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='9' scheduler='fifo' priority='1'/>
  </cputune>

# virsh start rhel8.0_rt_8vcpu
error: Failed to start domain rhel8.0_rt_8vcpu
error: Invalid value '19120' for 'tasks': Invalid argument


Actual results:
KVM-RT guest fails to boot with <emulatorsched scheduler='fifo' priority='1'/>.


Expected results:
The KVM-RT guest should boot successfully.


Additional info:
1. This bug is related to: 
Bug 1580229 - KVM-RT guest with 10 vCPUs hangs on reboot

2. This bug is filed to track the kernel-rt part of the reboot issue.  Refer to:
Bug 1580229#c58 from Martin:
"One more weird thing, if I umount the cpu,cpuacct controller, everything works.  When I mount it back, I start getting the error when trying to start the machine.  I'm inclined to this being a kernel bug (or something that is not documented as I looked for that as well)."

Comment 1 Martin Kletzander 2019-05-22 09:25:22 UTC
I'll copy some more info from the previous BZ to keep everything in one place.

Comment 2 Martin Kletzander 2019-05-22 09:26:05 UTC
So what I found out is that if I set the scheduler for the emulator thread before resuming the VM (current state), I get EINVAL when trying to write the vcpu TID to /tasks in the cpu,cpuacct controller (probably because that is the first controller that is tried).  I cannot reproduce this with anything else and I don't know why this is happening; the only explanation that would make sense to me is a kernel bug.  I would love it if someone could find out.

But when I try booting without the <emulatorsched/> setting, I can then set the scheduler and even reboot and everything works.  So I am going to need to change the sequence in which libvirt is doing this (which is very inconvenient in the current state of things), but it will work.  Hence I'm switching this to ASSIGNED as this needs yet another fix.  In the meantime, if someone can figure out why this is happening, that would help me not to get a headache again =)

I am also attaching libvirtd logs for two starts of the domain, one with <emulatorsched/>, which fails, and one without it which works.  I hope that helps someone who is trying to figure out why this is happening.  Some nice things to search for in those logs are "virCgroupSetValue", "error" or any QMP commands like "query-cpus" and so on.
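The log search suggested above can be sketched as a small shell helper. The function name filter_libvirtd_log and the log path in the usage line are assumptions made for illustration; the search terms are the ones named in this comment.

```shell
#!/bin/sh
# Hypothetical helper: print the lines of a libvirtd log that mention
# cgroup writes, errors, or the query-cpus QMP command, prefixed with
# their line numbers.
filter_libvirtd_log() {
    # $1: path to the log file (assumed; adjust to the local setup)
    grep -nE 'virCgroupSetValue|error|query-cpus' "$1"
}
```

Usage, assuming the log lives in the usual place: filter_libvirtd_log /var/log/libvirt/libvirtd.log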

Comment 3 Martin Kletzander 2019-05-22 09:30:02 UTC
Created attachment 1571935
excerpt from libvirtd.log when starting a domain with <emulatorsched/> fails

Comment 4 Martin Kletzander 2019-05-22 09:30:35 UTC
Created attachment 1571936
excerpt from libvirtd.log when starting a domain without <emulatorsched/> works

Comment 5 Juri Lelli 2019-06-04 16:10:22 UTC
Hi,

I think the problem you were facing is related to

https://elixir.bootlin.com/linux/latest/source/kernel/sched/core.c#L6525

I spent some time understanding how libvirtd sets up emulator and vcpu(s)
properties and I believe a simpler reproducer of the very same problem is
the following:

# mkdir /sys/fs/cgroup/cpu,cpuacct/kvm
# mkdir /sys/fs/cgroup/cpu,cpuacct/kvm/emulator
# echo $$ > /sys/fs/cgroup/cpu,cpuacct/kvm/tasks
# chrt -fp 10 $$
# echo $$ > /sys/fs/cgroup/cpu,cpuacct/kvm/emulator/tasks
bash: echo: write error: Invalid argument

This is the EINVAL libvirtd gets if it first sets up the emulator's
scheduling properties (setting it to FIFO) and then tries to move it into
the emulator group.

As you noticed, doing it the other way around (first moving it into the
group and then setting up the scheduling properties) works OK. This does
not sound correct to me, so I'll try to see what upstream folks think about
it (and if there is indeed a plausible explanation for this seemingly odd
behavior).
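The ordering that works on the affected kernel can be sketched as follows. This is a minimal illustration, not what libvirt does internally: the function name move_then_set_fifo and the DRY_RUN switch are hypothetical, and the cgroup path in the comment is the one from the reproducer above.

```shell
#!/bin/sh
# Sketch: join the cgroup while the task is still in the default
# scheduling class, and only afterwards switch it to SCHED_FIFO.
# With DRY_RUN=1 the steps are printed instead of executed.
move_then_set_fifo() {
    cgroup_dir=$1   # e.g. /sys/fs/cgroup/cpu,cpuacct/kvm/emulator
    pid=$2          # task to move
    prio=$3         # FIFO priority to apply afterwards
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "echo $pid > $cgroup_dir/tasks"
        echo "chrt -f -p $prio $pid"
        return 0
    fi
    # Step 1: move the task into the group first.
    echo "$pid" > "$cgroup_dir/tasks" || return 1
    # Step 2: only now make it a FIFO task.
    chrt -f -p "$prio" "$pid"
}
```

Running it for real needs root and the cgroup v1 cpu,cpuacct hierarchy from the reproducer; DRY_RUN=1 move_then_set_fifo /sys/fs/cgroup/cpu,cpuacct/kvm/emulator $$ 10 only prints the two steps in order.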

Comment 6 Juri Lelli 2019-06-06 08:20:16 UTC
The issue seems to be fixed with the following kernel:

http://brew-task-repos.usersys.redhat.com/repos/scratch/jlelli/kernel-rt/4.18.0/100.rt16.40.el8.cpuctrl/

The related patch is currently under discussion upstream (positive
feedback from the cgroups maintainer so far):

https://lore.kernel.org/lkml/20190605114935.7683-1-juri.lelli@redhat.com/

libvirt master is probably fine as it is today, though.

Comment 7 Martin Kletzander 2019-06-06 09:47:59 UTC
(In reply to Juri Lelli from comment #6)
Thanks a lot, I'm glad this made sense, even though I could not reproduce it with just shell (no idea what I was doing differently).  This must've taken you an awful amount of time and that makes me appreciate it even more.  The workaround might actually make more sense for us anyway, but I'm really glad it got sorted out, even if it is not upstream yet.  Thank you again.

Comment 8 Juri Lelli 2019-11-20 13:19:54 UTC
The kernel fix has been in mainline for a while now and I
think it would be good if we could have this in RHEL as
well.

It doesn't affect RHEL (where RT_GROUP_SCHED is enabled), but
it might create problems (like the one this BZ is about) for
RHEL-RT.
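To see which of the two cases a given host falls into, one can inspect the kernel build configuration. This is a sketch: the function name rt_group_sched_state is hypothetical, and the config file location in the comment is an assumption that may not hold on every installation.

```shell
#!/bin/sh
# Hypothetical helper: report whether a kernel config file enables
# CONFIG_RT_GROUP_SCHED. Prints "enabled" or "disabled".
rt_group_sched_state() {
    # $1: kernel config file, e.g. /boot/config-$(uname -r) (assumed location)
    if grep -q '^CONFIG_RT_GROUP_SCHED=y' "$1"; then
        echo enabled
    else
        # Covers both "# CONFIG_RT_GROUP_SCHED is not set" and a missing entry.
        echo disabled
    fi
}
```

Per the comment above, a host reporting "disabled" is the configuration where this bug can show up.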

Since the fix is upstream, I think the correct procedure is to
bring it through RHEL.

Phil, do you see any problems with it and would you be up for
backporting it? If yes, please take the BZ and change component
to kernel.

Thanks a lot in any case!

Comment 10 Phil Auld 2019-11-20 14:07:39 UTC
Sure. I had that on my list of non-critical fixes anyway. I should be able to pull it into RHEL8.2, with sanity-only testing I think.

Thanks!

Comment 11 Phil Auld 2019-11-20 15:26:30 UTC
Juri, 
  Maybe it would be better to clone it to rhel8, then this one can get tested in RT for real.  I think that makes the process better and keeps the RT part of this from getting lost. What do you think?

Comment 12 Juri Lelli 2019-11-20 15:37:59 UTC
(In reply to Phil Auld from comment #11)
> Juri, 
>   Maybe it would be better to clone it to rhel8, then this one can get
> tested in RT for real.  I think that makes the process better and keeps the
> RT part of this from getting lost. What do you think?

Yes. Makes sense to me. Please feel free to do so.

Thanks!

Comment 13 Phil Auld 2019-11-21 20:18:00 UTC
I posted the other one for rhel8.2 (bz1774652).

Comment 20 Pei Zhang 2020-03-11 07:05:17 UTC
Moving to VERIFIED as per Comment 15.

Comment 22 errata-xmlrpc 2020-04-28 15:25:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:1567