Bug 1775917 - add customizer initramfs support
Summary: add customizer initramfs support
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.4.0
Assignee: Colin Walters
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks: 1771572
Reported: 2019-11-23 15:19 UTC by jianzzha
Modified: 2023-09-14 05:47 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-04 11:16:28 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2020:0581 | 0 | None | None | None | 2020-05-04 11:17:26 UTC

Internal Links: 1801787

Description jianzzha 2019-11-23 15:19:37 UTC
Description of problem:
Telco/NFV applications require low-latency tuning. In RHEL, some of these tunings must be applied as early as the initramfs. For example, when the cpu-partitioning tuned profile is applied, tuned creates a small initrd that carries an updated system.conf and adds a dracut/hooks/pre-udev script, like this:

Image: tuned-initrd.img: 4.0K
========================================================================
Version: 
Arguments: 
dracut modules:
========================================================================
drwx------   4 root     root            0 Nov 14 18:34 .
drwxr-xr-x   3 root     root            0 Nov 14 18:34 etc
drwxr-xr-x   2 root     root            0 Nov 14 18:34 etc/systemd
-rw-r--r--   1 root     root         1698 Nov 14 18:34 etc/systemd/system.conf
drwxr-xr-x   3 root     root            0 Nov 14 18:34 usr
drwxr-xr-x   3 root     root            0 Nov 14 18:34 usr/lib
drwxr-xr-x   3 root     root            0 Nov 14 18:34 usr/lib/dracut
drwxr-xr-x   3 root     root            0 Nov 14 18:34 usr/lib/dracut/hooks
drwxr-xr-x   2 root     root            0 Nov 14 18:34 usr/lib/dracut/hooks/pre-udev
-rwxr-xr-x   1 root     root          375 Nov 14 18:34 usr/lib/dracut/hooks/pre-udev/00-tuned-pre-udev.sh
========================================================================

etc/systemd/system.conf and usr/lib/dracut/hooks/pre-udev/00-tuned-pre-udev.sh are the key additions here for this tuning.

This small initrd is then chained to the existing initramfs by the bootloader.
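
For illustration, an overlay initrd of this kind is a cpio archive in newc format, which the kernel can unpack in addition to the main initramfs. The sketch below packs a staging directory into such an archive. The function name and the paths are hypothetical, and this is not necessarily how tuned builds its own image.

```shell
# build_overlay_initrd STAGING_DIR OUTPUT_IMAGE
# Packs everything under STAGING_DIR into a newc-format cpio archive.
# Assumption: OUTPUT_IMAGE lives outside STAGING_DIR, otherwise find
# would pick up the half-written image itself.
build_overlay_initrd() {
    staging=$1
    out=$2
    # The subshell keeps the caller's working directory unchanged;
    # the redirection is resolved relative to the caller's directory.
    ( cd "$staging" && find . -print | cpio -o -H newc ) > "$out"
}
```

With a staging tree that contains etc/systemd/system.conf and the pre-udev hook, calling `build_overlay_initrd ./stage ./overlay.img` would yield an image with a layout similar to the listing above.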

We need a similar initramfs customization capability from RHCOS, though not necessarily the same approach.

Operationally, we expect the OCP machine config operator to use this initramfs customization capability when provisioning the system during a day 1 install or a day 2 update.

Version-Release number of selected component (if applicable):
Expected in OCP 4.4

Comment 1 Colin Walters 2019-11-25 18:51:39 UTC
It looks like today, dracut already defaults to grabbing `/etc/systemd/system.conf` - but only in non-hostonly mode:
https://github.com/dracutdevs/dracut/blob/7d47d1c423cabfd125a2bf15c5d72732a6334024/modules.d/00systemd/module-setup.sh#L188

It's a bit tempting to introduce `rpm-ostree initramfs --hostonly`...but this conflicts with our longer term direction around being image-based by default and doing signing of the initramfs.

I think both of these things should probably be kernel arguments.  But, I also do think it's somewhat strange to use kernel arguments for things the kernel doesn't read, though we do this a lot.

See also https://github.com/fedora-silverblue/issue-tracker/issues/3# for related discussion.

Comment 4 Colin Walters 2019-11-25 22:13:08 UTC
One thing I quickly experimented with is:

`rpm-ostree initramfs --enable --arg=--hostonly`

This does result in my modified `/etc/systemd/system.conf` in the initramfs, as expected.

But, the resulting system doesn't boot because we don't have xfs.ko in the initramfs, because dracut's --hostonly codepath is confused by how we're running it inside a container (bubblewrap) - it doesn't see the rootfs as xfs.
Using `--hostonly` strikes me as very dangerous here because we'd be at the risk of omitting key NIC drivers, etc.

We went to a lot of work to "containerize" the dracut run with rpm-ostree to ensure it didn't affect the booted system and was transactional, etc.  Partially walking that back, or teaching e.g. rpm-ostree how to inject things based on the host would be nontrivial.

Probably the cleanest would be to add something like dracut --hostonly-mode=config.

But, for now...this works:

rpm-ostree initramfs --enable --arg=-I --arg=/etc/systemd/system.conf
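
Note the pattern in this command: each dracut token, including an option's value, is passed as its own `--arg=`. A small helper that prints (but does not run) such a command line can make this explicit. The function name is hypothetical, and the sketch does no shell quoting, so it only suits arguments without spaces.

```shell
# wrap_dracut_args ARG...
# Prints an rpm-ostree command line in which every dracut argument
# is wrapped in its own --arg=. Nothing is executed.
wrap_dracut_args() {
    cmd="rpm-ostree initramfs --enable"
    for a in "$@"; do
        cmd="$cmd --arg=$a"
    done
    printf '%s\n' "$cmd"
}
```

Calling `wrap_dracut_args -I /etc/systemd/system.conf` prints the command shown above.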

Comment 5 jianzzha 2019-11-26 12:10:23 UTC
> 
> But, for now...this works:
> 
> rpm-ostree initramfs --enable --arg=-I --arg=/etc/systemd/system.conf

@walters "--arg=-I --arg=/etc/systemd/system.conf" does work; however, I couldn't add a pre-udev hook script following this pattern:
rpm-ostree initramfs --enable --arg=-I --arg=/etc/systemd/system.conf --arg=-i --arg=/var/home/core/00-tuned-pre-udev.sh --arg=/usr/lib/dracut/hooks/pre-udev/00-tuned-pre-udev.sh

In the new initramfs image, /usr/lib/dracut/hooks/pre-udev/00-tuned-pre-udev.sh ends up as a directory.

Using dracut directly (dracut -i /var/home/core/00-tuned-pre-udev.sh /usr/lib/dracut/hooks/pre-udev/00-tuned-pre-udev.sh new-initramfs), 00-tuned-pre-udev.sh is a file in new-initramfs, as expected.
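
One way to check whether a path ended up as a file or a directory is to inspect an `ls -l` style listing of the image, such as the one lsinitrd prints in the description. The helper below is a sketch with a hypothetical name; it assumes the path is the last field on the line, so it does not handle symlink entries.

```shell
# initrd_entry_type PATH
# Reads an `ls -l` style listing on stdin and prints "file",
# "directory", "other" or "missing" for PATH.
initrd_entry_type() {
    awk -v want="$1" '
        $NF == want {
            c = substr($1, 1, 1)      # first character of the mode field
            if (c == "d") print "directory"
            else if (c == "-") print "file"
            else print "other"
            found = 1
            exit
        }
        END { if (!found) print "missing" }
    '
}
```

Assuming the image path is known, usage might look like `lsinitrd IMAGE | initrd_entry_type usr/lib/dracut/hooks/pre-udev/00-tuned-pre-udev.sh`, where IMAGE is a placeholder.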

Comment 6 Colin Walters 2019-11-26 13:03:17 UTC
>  however, I couldn't add a pre-udev hook script following this pattern:

Right, I'm going to work on a patch to bake that hook into RHCOS by default.

Comment 9 Yanir Quinn 2019-11-27 17:38:34 UTC
Is it necessary to use the initramfs for masking instead of using the sysfs plugin with the node tuning operator? E.g.:
  [sysfs]
      /sys/devices/virtual/workqueue/cpumask = ${not_isolated_cpumask}

Is that because it must be done at an early stage, so we cannot use this plugin?
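
For readers unfamiliar with `${not_isolated_cpumask}`: it is a hex bitmask of the CPUs that are not isolated. tuned derives this value itself; the sketch below only illustrates the arithmetic, with hypothetical function names, and it relies on shell integer arithmetic, so it is only suitable for machines with fewer than 63 CPUs.

```shell
# cpulist_to_dec LIST
# Prints the bitmask (in decimal) for a CPU list such as "0,2-3".
cpulist_to_dec() {
    mask=0
    old_ifs=$IFS
    IFS=,
    for range in $1; do
        first=${range%-*}     # "2-7" -> 2, "5" -> 5
        last=${range#*-}      # "2-7" -> 7, "5" -> 5
        cpu=$first
        while [ "$cpu" -le "$last" ]; do
            mask=$(( $mask | (1 << $cpu) ))
            cpu=$(( $cpu + 1 ))
        done
    done
    IFS=$old_ifs
    printf '%s\n' "$mask"
}

# non_isolated_mask NCPUS ISOLATED_LIST
# Prints the hex mask of the CPUs that are NOT in ISOLATED_LIST.
non_isolated_mask() {
    all=$(( (1 << $1) - 1 ))
    isolated=$(cpulist_to_dec "$2")
    printf '%x\n' $(( $all & ~$isolated ))
}
```

For example, on a 4-CPU machine with CPUs 1-3 isolated, `non_isolated_mask 4 1-3` prints `1`, meaning only CPU 0 remains in the mask.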

Comment 10 Jiří Mencák 2019-11-28 11:45:54 UTC
(In reply to Colin Walters from comment #7)
> https://gitlab.cee.redhat.com/coreos/redhat-coreos/merge_requests/727

That PR seems to address only one scenario -- for cpu-partitioning profile.  I believe we need a more flexible solution.
Consider https://bugzilla.redhat.com/show_bug.cgi?id=1775834 which has a slightly modified version of the 00-tuned-pre-udev.sh
and there might be different requirements in the future.  Why are initramfs overlays a problem?  The future intention of
signing of initramfs?

Comment 12 Colin Walters 2019-12-02 17:42:44 UTC
> Consider https://bugzilla.redhat.com/show_bug.cgi?id=1775834 which has a slightly modified version of the 00-tuned-pre-udev.sh
> and there might be different requirements in the future.

Do all of those variables need to be set in the initramfs?  Most of them offhand look to me like they could be set after the switch to the real root, possibly as late as having cluster-node-tuning-operator do it.

If after kubelet is too late, using MachineConfig to write files into `/etc/sysctl.d` would work too.

> Why are initramfs overlays a problem?  The future intention of signing of initramfs?

Right, supporting signing is a big one.  And in general, we want to avoid intermixing our *code* with user *configuration*.  This is the same philosophy backing OSTree, where /usr is *code* that comes from us and is read-only, and user *configuration* is cleanly separated in /etc.

Comment 13 Andrew Theurer 2019-12-03 15:07:28 UTC
(In reply to Colin Walters from comment #12)

> Do all of those variables need to be set in the initramfs? 

Only /sys/devices/virtual/workqueue/cpumask.  The others were included for the convenience of doing it all in one place.

Comment 14 Marcelo Tosatti 2019-12-19 11:10:39 UTC
(In reply to Andrew Theurer from comment #13)
> (In reply to Colin Walters from comment #12)
> 
> > Do all of those variables need to be set in the initramfs? 
> 
> Only /sys/devices/virtual/workqueue/cpumask.  The others were included out
> of convenience of doing it all on once place.

The workqueue files above can be changed at runtime, in the tuned machine config operator.

It needs to be able to support kernel command line changes, though, which the machine config operator already does, correct?

Comment 15 Marcelo Tosatti 2019-12-19 11:36:13 UTC
(In reply to Yanir Quinn from comment #9)
> Is it necessary to use initramfs for masking instead of using sysfs plugin
> with node tuning operator ? e.g. :
>   [sysfs]
>       /sys/devices/virtual/workqueue/cpumask = ${not_isolated_cpumask}
> 
> Is that because it should be done in an early stage only and we cannot use
> this plugin ?

There is no need to do this at an early stage, since the realtime-virtual-host profile
should be used for setting up realtime VNFs, which has to use isolcpus at the moment. (Andrew, there are other problems with sched balancing, such as 1G hugepage pre-allocation, which make the sched balancing goal unachievable at the moment for the vRAN use case; we need to think of a solution for future releases.)

For the vRAN use case, what might be needed is the following. We don't know yet, because in the end it depends on the access pattern of VNFs A and B: depending on whether they happen to be allocated on the same socket, they might interfere negatively with each other, which is known as the "noisy neighbor problem" (Google for "Intel noisy neighbor").

Abstraction of HW details might be an illusion when designing interfaces. It might be impossible to design an interface that allows abstraction and hardware-specific control at the same time if the HW interfaces of different hardware implement a feature that only seems similar to the interface designer.

In the case of CAT:

1) How to share cache between two VNFs.
2) What if the same feature is implemented in HW differently?

It seems the VNF specification prefers _not_ to expose HW details. Features important for customers might require exposing HW details.

So what is needed is an API which can be easily modified in the future. Can someone please
summarize Kubernetes API stability so that an informed, planned decision can be made?

Comment 16 Marcelo Tosatti 2019-12-19 11:39:50 UTC
(In reply to Marcelo Tosatti from comment #15)
> So what is needed is an API which can be easily modified in the future. Can
> please someone 
> make a summary of Kubernetes API stability so an informed, planned decision
> can be made?

According to

https://kubernetes.io/docs/concepts/overview/kubernetes-api/

An alpha level API for this issue seems appropriate. 
Is there any negative consequence in using an Alpha level API?

Comment 17 Marcelo Tosatti 2019-12-19 12:32:15 UTC
(In reply to Marcelo Tosatti from comment #15)

This is the wrong BZ for this discussion. Please do not reply.

Comment 19 Micah Abbott 2020-03-05 20:55:55 UTC
The changes requested for tuning have merged in RHCOS and builds have been made with the changes included.  Moving to MODIFIED.

Comment 22 Michael Nguyen 2020-03-13 01:16:39 UTC
Verified on 44.81.202003110830-0

$ sudo rpm-ostree kargs --append=tuned.non_isolcpus=1
Staging deployment... done
Kernel arguments updated.
Run "systemctl reboot" to start a reboot

[core@localhost ~]$ cat /proc/cmdline 
BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-8e33da004cf9c41250a7fb2f8f125e9767bfe7343252ae5d2488b08a858f8d8a/vmlinuz-4.18.0-147.5.1.el8_1.x86_64 rhcos.root=crypt_rootfs console=tty0 console=ttyS0,115200n8 ignition.platform.id=qemu rd.luks.options=discard ostree=/ostree/boot.0/rhcos/8e33da004cf9c41250a7fb2f8f125e9767bfe7343252ae5d2488b08a858f8d8a/0 console=/dev/ttys0 tuned.non_isolcpus=1

$ dmesg | grep -3 tuned
[    2.839636] In beiscsi_module_init, tt=000000000744f862
[    2.883054] systemd[1]: Started dracut cmdline hook.
[    2.885040] systemd[1]: Starting dracut pre-udev hook...
[    2.897198] tuned: setting workqueue CPU mask to 1
[    2.934653] device-mapper: uevent: version 1.0.3
[    2.935612] device-mapper: ioctl: 4.39.0-ioctl (2018-04-03) initialised: dm-devel
[    3.007992] systemd[1]: Started dracut pre-udev hook.
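
The dmesg output above is consistent with a pre-udev hook that reads tuned.non_isolcpus from the kernel command line and writes it to the workqueue cpumask files. A minimal sketch of such logic follows. It is not the script shipped in RHCOS: the function names and the RUN_HOOK guard are hypothetical, and a real dracut hook would more likely use dracut's own helpers to read kernel arguments.

```shell
# get_karg KEY CMDLINE
# Prints the value of KEY=value from a kernel command line string.
# If KEY appears more than once, the last occurrence wins.
get_karg() {
    key=$1
    value=
    set -f                      # no globbing while splitting the cmdline
    for word in $2; do
        case $word in
            "$key"=*) value=${word#"$key"=} ;;
        esac
    done
    set +f
    printf '%s\n' "$value"
}

# set_workqueue_mask MASK SYSFS_DIR [LOG]
# Writes MASK to SYSFS_DIR/cpumask and SYSFS_DIR/*/cpumask where present.
# LOG defaults to /dev/kmsg so the message shows up in dmesg.
set_workqueue_mask() {
    mask=$1
    dir=$2
    log=${3:-/dev/kmsg}
    echo "tuned: setting workqueue CPU mask to $mask" >> "$log"
    for f in "$dir"/cpumask "$dir"/*/cpumask; do
        if [ -f "$f" ]; then
            echo "$mask" > "$f"
        fi
    done
}

# Hypothetical guard: only touch the real system when explicitly asked.
if [ "${RUN_HOOK:-0}" = 1 ]; then
    m=$(get_karg tuned.non_isolcpus "$(cat /proc/cmdline)")
    if [ -n "$m" ]; then
        set_workqueue_mask "$m" /sys/devices/virtual/workqueue
    fi
fi
```

Keeping the sysfs directory and log target as parameters is a choice made here so the logic can be exercised against a scratch directory without root.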

Comment 24 errata-xmlrpc 2020-05-04 11:16:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

Comment 25 Red Hat Bugzilla 2023-09-14 05:47:29 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

