Bug 1700390 - KVM-RT guest with 10 vCPUs hangs on reboot
Summary: KVM-RT guest with 10 vCPUs hangs on reboot
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 15.0 (Stein)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: OSP DFG:Compute
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On: 1580229
Blocks: 1932086
Reported: 2019-04-16 12:50 UTC by Jaroslav Suchanek
Modified: 2023-03-21 19:14 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1580229
Environment:
Last Closed: 2021-06-01 13:31:32 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1930706 0 None None None 2021-06-03 13:08:10 UTC
Red Hat Issue Tracker OSP-3121 0 None None None 2022-08-23 18:50:20 UTC

Description Jaroslav Suchanek 2019-04-16 12:50:37 UTC
+++ This bug was initially created as a clone of Bug #1580229 +++

Description of problem:
When booting a guest with 10 vCPUs, the guest usually fails to boot. With fewer vCPUs, everything works well.


Version-Release number of selected component (if applicable):
4.16.0-11.rt1.1.el8+6.x86_64
libvirt-3.10.0-2.el8+526+412dc3e0.x86_64
qemu-kvm-2.12.0-10.el8+526+412dc3e0.x86_64
tuned-2.9.0-2.20180430git5d0a9d91.el8+5.noarch


How reproducible:
5/6


Steps to Reproduce:
1. Install a rhel8 host

2. Setup RT host
BOOT_IMAGE=/vmlinuz-4.16.0-11.rt1.1.el8+6.x86_64 root=/dev/mapper/rhel_dell--per430--09-root ro crashkernel=auto rd.lvm.lv=rhel_dell-per430-09/root rd.lvm.lv=rhel_dell-per430-09/swap console=ttyS0,115200n81 default_hugepagesz=1G iommu=pt intel_iommu=on skew_tick=1 isolcpus=2,4,6,8,10,12,14,16,18,19,17,15,13 intel_pstate=disable nosoftlockup nohz=on nohz_full=2,4,6,8,10,12,14,16,18,19,17,15,13 rcu_nocbs=2,4,6,8,10,12,14,16,18,19,17,15,13

3. Boot a VM with 10 vCPUs; it fails to boot. The XML is attached.

2 vCPUs     works well (2/2 work)
4 vCPUs     works well (6/6 work)
6 vCPUs     works well (6/6 work)
8 vCPUs     works well (6/6 work)
10 vCPUs    fail       (1/6 work)
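
A quick sanity check on the host setup in step 2 is to confirm that `isolcpus`, `nohz_full` and `rcu_nocbs` name the same CPUs and that there are enough isolated CPUs for the guest. The following is a minimal sketch, not part of the original report; the helper names are hypothetical, and it assumes plain CPU lists (no `isolcpus` flags such as `domain,`).

```python
def parse_cpu_list(value):
    """Expand a kernel CPU list such as '2,4-6' into a set of CPU numbers."""
    cpus = set()
    for part in value.split(","):
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def isolation_sets(cmdline):
    """Return {parameter: set of CPUs} for the isolation-related parameters."""
    wanted = ("isolcpus", "nohz_full", "rcu_nocbs")
    result = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in wanted:
            result[key] = parse_cpu_list(value)
    return result


# Relevant part of the kernel command line quoted in step 2.
CMDLINE = (
    "skew_tick=1 isolcpus=2,4,6,8,10,12,14,16,18,19,17,15,13 "
    "intel_pstate=disable nosoftlockup nohz=on "
    "nohz_full=2,4,6,8,10,12,14,16,18,19,17,15,13 "
    "rcu_nocbs=2,4,6,8,10,12,14,16,18,19,17,15,13"
)

sets = isolation_sets(CMDLINE)
# All three parameters should describe one and the same CPU set.
consistent = len({frozenset(s) for s in sets.values()}) == 1
print(sorted(sets["isolcpus"]), "consistent:", consistent)
```

On a live host, the same function could be fed the contents of `/proc/cmdline` instead of the hard-coded string.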


Actual results:
VM can not boot up with 10 vCPUs.


Expected results:
VM should boot up with 10 vCPUs.


Additional info:
1. We are testing with q35, as q35 is the default chipset in rhel8.

2. Using the same XML, rhel7.6 works well, so this issue only happens with rhel8.


--- Additional comment from Marcelo Tosatti on 2019-04-11 14:51:32 CEST ---

1) I assume it is going to be consumed in some OpenStack product. 
 
Yes. 

2) it is not clear, what is needed to be done in libvirt and how it can be consumed by OpenStack,
can you please summarize it in the bz somehow?

Libvirt needs to add a new directive to control whether it should set special
priority class and priority value for the QEMU main thread (the PID of the
QEMU process), similarly to:

    <vcpusched vcpus='0' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='1' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='2' scheduler='fifo' priority='1'/>

In fact, when setting scheduler='fifo' priority='1' on guest vcpus,
it is _necessary_ to set the same priority on the QEMU main thread.
So perhaps the new element can be:

    <qemumainthread scheduler='fifo' priority='1'/>

How it is to be consumed by OpenStack: exactly the same way as the
<vcpusched> element is consumed.
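
To see whether the QEMU main thread and the vCPU threads actually share a scheduling class and priority, the threads of a process can be inspected from the host. The following is a rough, Linux-only sketch using Python's `os` module; `thread_policies` is a hypothetical helper, and it is run here against the current process only because a real QEMU PID is not available in this context.

```python
import os

# Human-readable names for the Linux scheduling policies exposed by Python.
POLICY_NAMES = {
    os.SCHED_OTHER: "other",
    os.SCHED_FIFO: "fifo",
    os.SCHED_RR: "rr",
    os.SCHED_BATCH: "batch",
    os.SCHED_IDLE: "idle",
}


def thread_policies(pid):
    """Map each thread ID of `pid` to a (policy name, priority) tuple."""
    policies = {}
    for entry in os.listdir(f"/proc/{pid}/task"):
        tid = int(entry)
        try:
            policy = os.sched_getscheduler(tid)
            priority = os.sched_getparam(tid).sched_priority
        except (ProcessLookupError, PermissionError):
            # The thread exited or is not accessible; skip it.
            continue
        policies[tid] = (POLICY_NAMES.get(policy, str(policy)), priority)
    return policies


# The main thread has TID == PID; for QEMU that is the thread discussed above.
for tid, (name, prio) in sorted(thread_policies(os.getpid()).items()):
    print(tid, name, prio)
```

For a real-time guest, one would pass the PID of the QEMU process and compare the entry whose TID equals that PID with the entries of the vCPU threads.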



--- Additional comment from Luiz Capitulino on 2019-04-15 19:22:57 CEST ---

Since OpenStack will have to use the new XML element, should we clone this BZ for Red Hat OpenStack
in openstack-nova component?


--- Additional comment from Martin Kletzander on 2019-04-16 13:51:32 CEST ---

I mistakenly changed this to POST when the patches were just sent and not actually pushed.  They are now, however, as commits v5.2.0-291-g0c010cd1039c..v5.2.0-295-g2b342cda7277:

commit 0c010cd1039c301059037d323376be0501f18048
Author: Martin Kletzander <mkletzan>
Date:   Mon Apr 15 10:56:03 2019 +0200

    conf: Parse common scheduler attributes in separate function

commit 3217bcc535b18a14453c2bc9ab5ee522478d28f0
Author: Martin Kletzander <mkletzan>
Date:   Mon Apr 15 10:48:07 2019 +0200

    conf: Format thread IDs optionally

commit c79a39e60cef50ab9ee5cdec51f46243e4202622
Author: Martin Kletzander <mkletzan>
Date:   Mon Apr 15 13:17:41 2019 +0200

    docs: Mention iothreadsched element in the docs and reword

commit 842bc56ad29f1aa72dce8071ba98c25b4fcbed16
Author: Martin Kletzander <mkletzan>
Date:   Mon Apr 15 10:45:38 2019 +0200

    conf: Add support for emulatorsched

commit 2b342cda7277646ee4a50fa87c3586578c4bfa7c
Author: Martin Kletzander <mkletzan>
Date:   Mon Apr 15 13:13:06 2019 +0200

    qemu: Add support for emulatorsched

Comment 2 Kashyap Chamarthy 2019-04-24 15:34:13 UTC
Nova currently doesn't allow CPU overcommit, so it is not a problem today to expose 'emulatorsched' option.

So closing the bug with this rationale.  (In future, if we decide we need this, we can always reopen the bug.)

Comment 3 Marcelo Tosatti 2019-04-29 14:45:08 UTC
(In reply to Kashyap Chamarthy from comment #2)
> Nova currently doesn't allow CPU overcommit, so it is not a problem today to
> expose 'emulatorsched' option.
> 
> So closing the bug with this rationale.  (In future, if we decide we need
> this, we can always reopen the bug.)

Kashyap,


Whether CPU overcommit is supported or not is not relevant to this option.

See comment #39 of 

https://bugzilla.redhat.com/show_bug.cgi?id=1580229

Comment 5 Kashyap Chamarthy 2019-05-17 13:04:55 UTC
(In reply to Marcelo Tosatti from comment #3)
> (In reply to Kashyap Chamarthy from comment #2)
> > Nova currently doesn't allow CPU overcommit, so it is not a problem today to
> > expose 'emulatorsched' option.
> > 
> > So closing the bug with this rationale.  (In future, if we decide we need
> > this, we can always reopen the bug.)
> 
> Kashyap,
> 
> 
> Whether CPU overcommit is supported or not is not relevant to this option.

Thanks for correcting, Marcelo.  I closed it based on a bug triage discussion; I was reminded by a colleague, Sean Mooney, that the reasons are more granular, as in, there are two things here:

  (1) Nova can expose 'emulatorsched' unconditionally whenever 'vcpusched' is exposed.
  (2) However, Nova should _not_ expose this to the "tenant" users (who are not admins)

> See comment #39 of 
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1580229

Given that this sounds important for real-time workloads, we can enable 'emulatorsched' whenever 'vcpusched' is exposed (i.e. the first point noted earlier).

This can be done as follows: whenever the property `hw:cpu_realtime=yes` is set on a "flavor" (which defines the compute, memory and storage capacity for guests), enable both 'emulatorsched' and 'vcpusched'.  Both can take their priority from the Nova config attribute 'realtime_scheduler_priority'[1], which is defined as follows:

    "In a realtime host context vCPUs for guest will run in that scheduling priority. 
     Priority depends on the host kernel (usually 1-99)"

(Sean, please correct me if I misparsed you.)


[1] https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.realtime_scheduler_priority
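
The idea in this comment can be sketched as a small function that emits both elements together whenever the flavor asks for real-time CPUs. This is an illustration only, not Nova's actual code; `cputune_for_flavor` and `realtime_cputune` are hypothetical names, and the `priority` argument stands in for the value of `realtime_scheduler_priority`.

```python
import xml.etree.ElementTree as ET


def realtime_cputune(realtime_vcpus, priority):
    """Build a <cputune> fragment with FIFO vCPUs and a FIFO emulator thread."""
    cputune = ET.Element("cputune")
    for vcpu in sorted(realtime_vcpus):
        ET.SubElement(cputune, "vcpusched", vcpus=str(vcpu),
                      scheduler="fifo", priority=str(priority))
    # Emit emulatorsched unconditionally alongside vcpusched.
    ET.SubElement(cputune, "emulatorsched",
                  scheduler="fifo", priority=str(priority))
    return cputune


def cputune_for_flavor(extra_specs, vcpu_count, priority):
    """Return the fragment for a real-time flavor, or None otherwise."""
    value = str(extra_specs.get("hw:cpu_realtime", "")).strip().lower()
    if value not in ("yes", "true", "1"):
        return None
    # Simplification: every vCPU is treated as real-time here.
    return realtime_cputune(range(vcpu_count), priority)


fragment = cputune_for_flavor({"hw:cpu_realtime": "yes"}, 3, 1)
print(ET.tostring(fragment, encoding="unicode"))
```

Tying the two elements to a single code path is what would keep the emulator thread from ever being left at a lower priority than the vCPUs.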

Comment 6 Kashyap Chamarthy 2019-05-17 13:34:05 UTC
Another point that Sean suggested is that, to _not_ hit this bug in OpenStack Nova, you can do the following.

If you set the flavor property `hw:cpu_realtime`, make sure to also set:

  - `hw:emulator_threads_policy=isolate` or `hw:emulator_threads_policy=share`; and

  - set `cpu_shared_set` (this defines which physical CPUs will be used for 
    best-effort guest vCPU resources) in `nova.conf`

The above is not enforced in the code, though.  So this feature request can protect those users who don't manually set the above.
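
Since the combination above is not enforced, a deployment could check it itself before booting real-time guests. The following is a minimal sketch of such a check, not an existing Nova API; `realtime_flavor_problems` is a hypothetical helper, and it assumes the extra spec key is spelled `hw:emulator_threads_policy`, as in comment 12.

```python
TRUE_VALUES = {"yes", "true", "1", "on"}


def is_realtime(extra_specs):
    """Return True when the flavor requests real-time vCPUs."""
    value = str(extra_specs.get("hw:cpu_realtime", "")).strip().lower()
    return value in TRUE_VALUES


def realtime_flavor_problems(extra_specs, cpu_shared_set=""):
    """List what is missing for a real-time flavor (empty list if nothing)."""
    problems = []
    if not is_realtime(extra_specs):
        return problems
    policy = extra_specs.get("hw:emulator_threads_policy")
    if policy not in ("isolate", "share"):
        problems.append(
            "hw:emulator_threads_policy should be 'isolate' or 'share'")
    # cpu_shared_set is the value configured in nova.conf, passed in as text.
    if not cpu_shared_set.strip():
        problems.append("cpu_shared_set is empty in nova.conf")
    return problems


flavor = {"hw:cpu_realtime": "yes", "hw:cpu_realtime_mask": "^0"}
for problem in realtime_flavor_problems(flavor, cpu_shared_set=""):
    print("warning:", problem)
```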

Comment 7 Luiz Capitulino 2020-04-15 20:58:24 UTC
What's the status of this BZ?

Comment 10 Kashyap Chamarthy 2021-06-01 13:31:32 UTC
Sorry for the late response.  I chatted with my colleague Sean Mooney and we both agree that this is a user error, i.e. the user must make sure to configure "hw:emulator_threads_policy" for a real-time guest.

See comment #6 for details on "hw:emulator_threads_policy".

On the above basis, I'm closing this bug.

Comment 11 smooney 2021-06-01 18:19:51 UTC
since this was fresh in our minds, i brought this up in our upstream team meeting today.

we still believe the statement above is correct: as implemented today, it is user error to use a realtime
instance without hw:emulator_threads_policy.

with that said, it also occurred to me that there may be a better default we can apply in this specific case, when the flavor/image combination is misconfigured.

hw:cpu_realtime_mask is also a required parameter when using realtime cpus in osp 15 and osp 16.
in osp 17 that is relaxed by https://github.com/openstack/nova/blob/50fdbc752a9ca9c31488140ef2997ed59d861a41/releasenotes/notes/bug-1884231-16acf297d88b122e.yaml
but only when hw:emulator_threads_policy is used.

what we might be able to do is further relax the requirements.
in the event that neither emulator_threads_policy nor cpu_realtime_mask is defined, we can return an error.
but when cpu_realtime_mask is defined and emulator_threads_policy is not defined, we can reduce the priority of the non-realtime vcpus
and then confine the emulator thread to float over the host cores of the non-realtime vCPUs, with the same elevated priority as the realtime vcpus.


what this would mean for a 2 core vm where guest cpu 0 is non-realtime and guest cpu 1 is realtime is that we would generate the xml as follows,
e.g. for hw:cpu_policy=dedicated hw:cpu_realtime=True hw:cpu_realtime_mask=^0

  <vcpupin vcpu="0" cpuset="0"/>
  <vcpupin vcpu="1" cpuset="1"/>
  <vcpusched vcpus='1' scheduler='fifo' priority='1'/>
  <emulatorpin cpuset="1"/>
  <emulatorsched scheduler='fifo' priority='1'/>


vs today 
  <vcpupin vcpu="0" cpuset="0"/>
  <vcpupin vcpu="1" cpuset="1"/>
  <vcpusched vcpus='1' scheduler='fifo' priority='1'/>
  <emulatorpin cpuset="1,2"/>

that should ensure that you can't get into a situation where the emulator thread is starved by the guest cpus.

this however would still not be our recommended configuration, as we would advise using an isolated emulator thread or, preferably, an emulator thread from the
cpu_shared_set instead.
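
The fallback described in this comment can be sketched as a small decision function. This is an illustration of the proposal, not Nova's implementation; `plan_emulator_thread` and `parse_realtime_mask` are hypothetical names, the extra spec key is assumed to be spelled `hw:emulator_threads_policy`, and only exclusion masks such as `^0` or `^0-1` are handled. The sketch follows the wording of the proposal (emulator thread on the host cores of the non-realtime vCPUs), so the exact cpuset it produces should be treated as an assumption.

```python
def parse_realtime_mask(mask, vcpu_count):
    """Return the set of realtime vCPUs for an exclusion mask like '^0'."""
    excluded = set()
    for part in mask.split(","):
        part = part.strip()
        if not part.startswith("^"):
            raise ValueError(f"unsupported mask entry in this sketch: {part!r}")
        first, _, last = part[1:].partition("-")
        excluded.update(range(int(first), int(last or first) + 1))
    return set(range(vcpu_count)) - excluded


def plan_emulator_thread(extra_specs, vcpu_pins, priority=1):
    """Decide emulator thread placement; vcpu_pins maps guest vCPU to host CPU."""
    policy = extra_specs.get("hw:emulator_threads_policy")
    mask = extra_specs.get("hw:cpu_realtime_mask")
    if policy is not None:
        # An explicit policy wins; the fallback is not needed.
        return {"policy": policy}
    if mask is None:
        raise ValueError("neither hw:emulator_threads_policy nor "
                         "hw:cpu_realtime_mask is defined")
    realtime = parse_realtime_mask(mask, len(vcpu_pins))
    # Confine the emulator thread to the host cores of non-realtime vCPUs
    # and give it the same FIFO priority as the realtime vCPUs.
    hosts = sorted(host for vcpu, host in vcpu_pins.items()
                   if vcpu not in realtime)
    return {"emulatorpin": hosts, "emulatorsched": ("fifo", priority)}


plan = plan_emulator_thread({"hw:cpu_realtime_mask": "^0"}, {0: 0, 1: 1})
print(plan)
```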

i will capture this in an upstream bug report, but to set expectations, this is a low priority wishlist enhancement.
it is not clear when we will have time to implement this change, but we may be able to include it in other work.

Comment 12 Pei Zhang 2021-06-03 10:00:23 UTC
(In reply to Kashyap Chamarthy from comment #10)
> Sorry for the late response.  I chatted with my colleague Sean Mooney and we
> both agree that this is a user-error.  I.e. the user must ensure to
> configure "hw:emulator_threads_policy" for a real-time guest.
> 
> See comment#6 for details on "hw:emulator_threads_policy".
> 
> On the above basis, I'm closing this bug.

Hello Kashyap,

It seems that "hw:emulator_threads_policy=share" is for setting <emulatorpin xx> on an RT VM; I got this info from Bug 1849469.

It also seems this bz is a new request to support <emulatorsched scheduler='fifo' priority='1'/> in OSP (RHEL has supported it since the fix for Bug 1580229). But currently RT VMs work well without <emulatorsched scheduler='fifo' priority='1'/> at both the RHEL layer and the OSP layer. From a functional perspective, I agree this bug can be closed now.

Thanks.

Best regards,

Pei

