Bug 1865839 - Deleted kernel arguments still exist under the cmdline
Summary: Deleted kernel arguments still exist under the cmdline
Keywords:
Status: ASSIGNED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.6
Hardware: x86_64
OS: Linux
high
high
Target Milestone: ---
: 4.7.0
Assignee: Sinny Kumari
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-08-04 10:43 UTC by Artyom
Modified: 2020-09-22 08:40 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:


Attachments (Terms of Use)
mahine config daemon log (69.59 KB, text/plain)
2020-09-02 17:13 UTC, Denys Shchedrivyi
no flags Details


Links
System ID Priority Status Summary Last Updated
Github openshift machine-config-operator pull 2005 None closed Bug 1865839: daemon: cleanup pending deployment on an error 2020-09-21 17:05:59 UTC
Github openshift machine-config-operator pull 2067 None closed Bug 1865839: daemon/update: print error causing rollback 2020-09-21 17:05:58 UTC
Github openshift machine-config-operator pull 2068 None closed Bug 1865839: daemon: Set RPMOSTREE_CLIENT_ID again 2020-09-21 17:05:58 UTC
Github openshift machine-config-operator pull 2097 None open Bug 1865839: daemon: better error reporting for rpm-ostree operations 2020-09-21 17:06:02 UTC

Description Artyom 2020-08-04 10:43:39 UTC
Description of problem:
The performance-addon-operator updates kernel arguments and disable the RT kernel under the same update operation. But after reboot I still can see kernel arguments that should be deleted under the /proc/cmdline.

Version-Release number of selected component (if applicable):
master

How reproducible:
50% started to happen a lot under our CI.

Steps to Reproduce:
1. Create some custom machine-config that will related to custom MCP
2. Lable worker node to be part of the custom MCP
3. Enable RT kernel and add some test kernel arguments under the MC and wait for the update.
4. Disable the RT kernel and update kernel arguments(add and remove) and wait for the update.
5. Check the /proc/cmdline under the node

Actual results:
Kernel arguments that should be deleted still exist under the node

Expected results:
Kernel arguments on the node should not include deleted arguments under the MC.

Additional info:

output of the machine-config-daemon logs
I0804 06:53:36.346285    2116 update.go:1079] Writing systemd unit dropin "mco-disabled.conf"
I0804 06:53:36.348012    2116 update.go:901] Deleting stale data
I0804 06:53:36.348232    2116 update.go:1029] Removed stale systemd unit "/etc/systemd/system/hugepages-allocation-1048576kB-NUMA0.service"
I0804 06:53:36.361529    2116 update.go:1264] Writing SSHKeys at "/home/core/.ssh/authorized_keys"
I0804 06:53:36.385979    2116 run.go:18] Running: oc image extract --path /:/run/mco-machine-os-content/os-content-341274820 --registry-config /var/lib/kubelet/config.json registry.svc.ci.openshift.org/ci-op-52tsdbr3/stable@sha256:16a8dde4b893ff0b5b4aeb05474f2f5e2ce9cac45d5d3e98b40c4309e23215a7
I0804 06:53:55.402698    2116 update.go:1436] Running rpm-ostree [kargs --delete=rcu_nocbs=1-3 --delete=tuned.non_isolcpus=00000001 --delete=isolcpus=managed_irq,1-3 --delete=systemd.cpu_affinity=0 --delete=default_hugepagesz=1G --delete=nmi_watchdog=0 --append=rcu_nocbs=1-2 --append=tuned.non_isolcpus=00000009 --append=isolcpus=domain,managed_irq,1-2 --append=systemd.cpu_affinity=0,3 --append=default_hugepagesz=2M --append=hugepagesz=2M --append=hugepages=5 --append=new-argument=test]
I0804 06:56:44.773800    2116 update.go:1436] Initiating switch from kernel realtime to default
I0804 06:56:44.779164    2116 update.go:1436] Switching to kernelType=default, invoking rpm-ostree ["override" "reset" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--uninstall" "kernel-rt-core" "--uninstall" "kernel-rt-modules" "--uninstall" "kernel-rt-modules-extra" "--uninstall" "kernel-rt-kvm"]
I0804 06:56:52.504966    2116 update.go:1436] initiating reboot: Node will reboot into config rendered-worker-cnf-49e1d72c9785af8eca1b6dacc2b62fec
I0804 06:57:00.122841    2116 daemon.go:627] Shutting down MachineConfigDaemon
I0804 06:59:07.064449    2121 start.go:108] Version: machine-config-daemon-4.6.0-202006240615.p0-157-gcfea781f-dirty (cfea781f822a3c7516ce71ff7959f8ec772b6ec4)
I0804 06:59:07.094522    2121 start.go:118] Calling chroot("/rootfs")
I0804 06:59:07.096500    2121 rpm-ostree.go:196] Running captured: rpm-ostree status --json
I0804 06:59:07.607134    2121 daemon.go:222] Booted osImageURL: registry.svc.ci.openshift.org/ci-op-52tsdbr3/stable@sha256:16a8dde4b893ff0b5b4aeb05474f2f5e2ce9cac45d5d3e98b40c4309e23215a7 (46.82.202008032211-0)
I0804 06:59:07.758977    2121 daemon.go:229] Installed Ignition binary version: 2.5.0
I0804 06:59:07.928245    2121 start.go:97] Copied self to /run/bin/machine-config-daemon on host
I0804 06:59:07.930791    2121 metrics.go:106] Registering Prometheus metrics
I0804 06:59:07.930929    2121 metrics.go:111] Starting metrics listener on 127.0.0.1:8797
I0804 06:59:07.936244    2121 update.go:1436] Starting to manage node: ci-op-52tsdbr3-24cc7-46vcj-worker-b-h8jxx
I0804 06:59:07.959741    2121 rpm-ostree.go:196] Running captured: rpm-ostree status


After the reboot I still can see all deleted arguments

I0804 06:59:28.020514   17878 utils.go:23] run command 'oc [exec -i -n openshift-machine-config-operator -c machine-config-daemon --request-timeout 30 machine-config-daemon-q5j7p -- cat /proc/cmdline]':
  out=BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-5c3797397db1bafbc8c328137628f4008d4157b555477436f6ad87a10a00a388/vmlinuz-4.18.0-211.el8.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ostree=/ostree/boot.0/rhcos/5c3797397db1bafbc8c328137628f4008d4157b555477436f6ad87a10a00a388/0 ignition.platform.id=gcp skew_tick=1 nohz=on intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt audit=0 mce=off processor.max_cstate=1 idle=poll intel_idle.max_cstate=0 skew_tick=1 nohz=on rcu_nocbs=1-3 tuned.non_isolcpus=00000001 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,1-3 systemd.cpu_affinity=0 default_hugepagesz=1G nmi_watchdog=0 audit=0 mce=off processor.max_cstate=1 idle=poll intel_idle.max_cstate=0 skew_tick=1 nohz=on rcu_nocbs=1-3 tuned.non_isolcpus=00000001 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,1-3 systemd.cpu_affinity=0 default_hugepagesz=1G nmi_watchdog=0 audit=0 mce=off processor.max_cstate=1 idle=poll intel_idle.max_cstate=0 rcu_nocbs=1-2 tuned.non_isolcpus=00000009 isolcpus=domain,managed_irq,1-2 systemd.cpu_affinity=0,3 default_hugepagesz=2M hugepagesz=2M hugepages=5 new-argument=test

You can find all logs under the https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift-kni_performance-addon-operators/318/pull-ci-openshift-kni-performance-addon-operators-master-e2e-gcp/1290521035233300480/artifacts/e2e-gcp/pods/openshift-machine-config-operator_machine-config-daemon-q5j7p_machine-config-daemon.log

I unsure if this bug relates to the machine-config-daemon or to the rpm-ostree.

Comment 1 Denys Shchedrivyi 2020-08-10 20:35:16 UTC
@Artyom, did you check your /proc/cmdline before step 4 (removing rt kernel and some arguments?). Probably you had some arguments duplication in your cmdline.. At least that what I see on my cluster after switching to RT kernel and adding arguments:


> cat [core@worker-0 ~]$ cat /proc/cmdline 
> BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-5b17e85442f17aba6acf91310f2589910585fbcfa2596471c9d2d818471d7030/vmlinuz-4.18.0-211.rt5.23.el8.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ostree=/ostree/boot.1/rhcos/5b17e85442f17aba6acf91310f2589910585fbcfa2596471c9d2d818471d7030/0 ignition.platform.id=openstack skew_tick=1 nohz=on rcu_nocbs=1-3 tuned.non_isolcpus=00000001 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,1-3 systemd.cpu_affinity=0 default_hugepagesz=1G nmi_watchdog=0 audit=0 mce=off processor.max_cstate=1 idle=poll intel_idle.max_cstate=0 skew_tick=1 nohz=on rcu_nocbs=1-3 tuned.non_isolcpus=00000001 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,1-3 systemd.cpu_affinity=0 default_hugepagesz=1G nmi_watchdog=0 audit=0 mce=off processor.max_cstate=1 idle=poll intel_idle.max_cstate=0 skew_tick=1 nohz=on rcu_nocbs=1-3 tuned.non_isolcpus=00000001 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,1-3 systemd.cpu_affinity=0 default_hugepagesz=1G nmi_watchdog=0 audit=0 mce=off processor.max_cstate=1 idle=poll intel_idle.max_cstate=0

 You can see that almost all arguments triplicated (!).


Here the log from machine config daemon pod:

>.
>I0810 18:22:26.561149    3545 update.go:1436] Running rpm-ostree [kargs --append=skew_tick=1 --append=nohz=on --append=rcu_nocbs=1-3 --append=tuned.non_isolcpus=00000001 --append=intel_pstate=disable --append=nosoftlockup --append=tsc=nowatchdog --append=intel_iommu=on --append=iommu=pt --append=isolcpus=managed_irq,1-3 --append=systemd.cpu_affinity=0 --append=default_hugepagesz=1G --append=nmi_watchdog=0 --append=audit=0 --append=mce=off --append=processor.max_cstate=1 --append=idle=poll --append=intel_idle.max_cstate=0]
>I0810 18:22:41.652350    3545 update.go:1436] Initiating switch from kernel default to realtime
>I0810 18:22:41.659569    3545 update.go:1436] Switching to kernelType=realtime, invoking rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"]
>.
>E0810 18:22:42.655159    3545 writer.go:135] Marking Degraded due to: failed to execute rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"] : exit status 1
>I0810 18:22:42.686359    3545 update.go:324] Checking Reconcilable for config rendered-worker-e4af5d3dba1d64414516c4c5f235e054 to rendered-worker-cnf-3145023d5727e9f7b5478b734e43e0f4
>I0810 18:22:42.786650    3545 update.go:1436] Starting update from rendered-worker-e4af5d3dba1d64414516c4c5f235e054 to rendered-worker-cnf-3145023d5727e9f7b5478b734e43e0f4: &{osUpdate:false kargs:true fips:false passwd:false files:true units:true kernelType:true extensions:false}
>.
>I0810 18:23:11.736999    3545 update.go:1436] Running rpm-ostree [kargs --append=skew_tick=1 --append=nohz=on --append=rcu_nocbs=1-3 --append=tuned.non_isolcpus=00000001 --append=intel_pstate=disable --append=nosoftlockup --append=tsc=nowatchdog --append=intel_iommu=on --append=iommu=pt --append=isolcpus=managed_irq,1-3 --append=systemd.cpu_affinity=0 --append=default_hugepagesz=1G --append=nmi_watchdog=0 --append=audit=0 --append=mce=off --append=processor.max_cstate=1 --append=idle=poll --append=intel_idle.max_cstate=0]
>I0810 18:23:18.047962    3545 update.go:1436] Initiating switch from kernel default to realtime
>I0810 18:23:18.053693    3545 update.go:1436] Switching to kernelType=realtime, invoking rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"]
>.
>E0810 18:23:19.037960    3545 writer.go:135] Marking Degraded due to: failed to execute rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"] : exit status 1
>I0810 18:23:42.674926    3545 update.go:324] Checking Reconcilable for config rendered-worker-e4af5d3dba1d64414516c4c5f235e054 to rendered-worker-cnf-3145023d5727e9f7b5478b734e43e0f4
>I0810 18:23:42.756123    3545 update.go:1436] Starting update from rendered-worker-e4af5d3dba1d64414516c4c5f235e054 to rendered-worker-cnf-3145023d5727e9f7b5478b734e43e0f4: &{osUpdate:false kargs:true fips:false passwd:false files:true units:true kernelType:true extensions:false}
>.
>I0810 18:24:56.713134    3545 update.go:1436] Running rpm-ostree [kargs --append=skew_tick=1 --append=nohz=on --append=rcu_nocbs=1-3 --append=tuned.non_isolcpus=00000001 --append=intel_pstate=disable --append=nosoftlockup --append=tsc=nowatchdog --append=intel_iommu=on --append=iommu=pt --append=isolcpus=managed_irq,1-3 --append=systemd.cpu_affinity=0 --append=default_hugepagesz=1G --append=nmi_watchdog=0 --append=audit=0 --append=mce=off --append=processor.max_cstate=1 --append=idle=poll --append=intel_idle.max_cstate=0]
>I0810 18:25:02.634854    3545 update.go:1436] Initiating switch from kernel default to realtime
>I0810 18:25:02.640505    3545 update.go:1436] Switching to kernelType=realtime, invoking rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt->core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"]
>I0810 18:28:22.534336    3545 update.go:1436] initiating reboot: Node will reboot into config rendered-worker-cnf-3145023d5727e9f7b5478b734e43e0f4


What I see - MC added arguments -> failed with switching kernel -> added arguments for second time -> failed with kernel again -> added arguments for the third time -> successfully switched kernel and rebooted the node. As result my kernel arguments were added three times.

Comment 2 Artyom 2020-08-11 08:21:06 UTC
I did not check the cmdline before, because it happens under the CI run so I did not have a chance to check it, but do you know why do you have doubles and triples of kernel arguments?

Comment 3 Denys Shchedrivyi 2020-08-11 13:32:16 UTC
From the logs in comment #1 I see that MCO daemon added kernel arguments on every try to install RT kernel.. after third tries I got triplicated arguments

Comment 5 Sinny Kumari 2020-08-17 14:44:29 UTC
As Deny mentioned in comment #1, from the log we can see that kernelArguments applied successfully but switch to kernel-rt failed twice with error:

E0804 06:43:53.479144    2036 writer.go:135] Marking Degraded due to: failed to execute rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"] : exit status 1

This is a regression from recent rework on OS update where we missed handling rollback properly during failure of an rpm-ostree transaction.

Fix is on the way.

Comment 8 Sinny Kumari 2020-08-18 06:16:51 UTC
Fix should be available in nightlies now.

@Dany @Artyom Can you please let us know if kernelArguments are applied correctly in your CI setup?

Comment 9 Denys Shchedrivyi 2020-08-18 23:49:59 UTC
will keep an eye on it, unfortunately it was not happen on every run

Comment 10 Denys Shchedrivyi 2020-08-31 18:45:58 UTC
reproduced again, here the log message from the daemon pod:


I0831 17:27:55.988406    2155 update.go:1542] Running rpm-ostree [kargs --append=skew_tick=1 --append=nohz=on --append=rcu_nocbs=1-3 --append=tuned.non_isolcpus=00000001 --append=intel_pstate=disable --append=nosoftlockup --append=tsc=nowatchdog --append=intel_iommu=on --append=iommu=pt --append=isolcpus=managed_irq,1-3 --append=systemd.cpu_affinity=0 --append=default_hugepagesz=1G --append=nmi_watchdog=0 --append=audit=0 --append=mce=off --append=processor.max_cstate=1 --append=idle=poll --append=intel_idle.max_cstate=0]
I0831 17:28:07.054461    2155 update.go:1542] Initiating switch from kernel default to realtime
I0831 17:28:07.058866    2155 update.go:1542] Switching to kernelType=realtime, invoking rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"]
I0831 17:28:07.110342    2155 update.go:358] Rolling back applied changes to OS
I0831 17:28:07.110442    2155 rpm-ostree.go:261] Running captured: rpm-ostree cleanup -p
error: System transaction in progress
.
.
E0831 17:28:07.721421    2155 writer.go:135] Marking Degraded due to: error removing staged deployment: error running rpm-ostree cleanup -p: : exit status 1: failed to execute rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"] : exit status 1



@Sinny, tell me if you need more logs, I can try to get full artifacts from CI

Comment 11 Sinny Kumari 2020-09-02 11:49:18 UTC
By looking at the above logs, looks to me that MCO has applied kargs as expected but realtime switch fails (don't know why) and as a result MCO tried to rollback the applied changes but it fails maybe because another rpm-ostree transaction is still active.

yes, access to complete log will be needed to dig in further.

Comment 12 Denys Shchedrivyi 2020-09-02 17:13:44 UTC
Created attachment 1713486 [details]
mahine config daemon log

Attached full mc-daemon log

Comment 14 Sinny Kumari 2020-09-03 06:36:42 UTC
Thanks for providing the logs, didn't find the useful information I was looking for in the journal logs which is rpm-ostreed.service log during the time `rpm-ostree cleanup` failed. It seems ci keeps journal log for n and n-1 boot. I will try to figure out reproducing this behavior locally. Meanwhile, wondering if we can modify ci to capture journal logs from all the boots.

Comment 15 Sinny Kumari 2020-09-08 06:26:45 UTC
I can't reproduce rpm-ostree cleanup issue locally.

What I did to reproduce the problem is after kernel-rt switch, specified explicitly to do pending deployment cleanup. Diff of MCO changes:

```
diff --git a/pkg/daemon/update.go b/pkg/daemon/update.go
index a6cffbbc..f3ed0735 100644
--- a/pkg/daemon/update.go
+++ b/pkg/daemon/update.go
@@ -375,6 +375,11 @@ func (dn *Daemon) applyOSChanges(oldConfig, newConfig *mcfgv1.MachineConfig) (re
                return err
        }
 
+       if err := removePendingDeployment(); err != nil {
+               retErr = errors.Wrapf(retErr, "error removing staged deployment: %v", err)
+               return
+       }
+
        // Apply extensions
        if err := dn.applyExtensions(oldConfig, newConfig); err != nil {
                return err
```

Launched a cluster with day1 realtime MachineConifg, pending deployment did get cleanup fine during firstboot:

sh-4.4# journalctl -u rpm-ostreed |grep "rhcos successful"
Sep 08 05:37:41 sinny-asset-qmq2k-worker-a-jlgww.c.openshift-gce-devel.internal rpm-ostree[2046]: Txn UpdateDeployment on /org/projectatomic/rpmostree1/rhcos successful
Sep 08 05:39:41 sinny-asset-qmq2k-worker-a-jlgww.c.openshift-gce-devel.internal rpm-ostree[2046]: Txn UpdateDeployment on /org/projectatomic/rpmostree1/rhcos successful
Sep 08 05:39:42 sinny-asset-qmq2k-worker-a-jlgww.c.openshift-gce-devel.internal rpm-ostree[2046]: Txn Cleanup on /org/projectatomic/rpmostree1/rhcos successful
Sep 08 05:42:26 sinny-asset-qmq2k-worker-a-jlgww.c.openshift-gce-devel.internal rpm-ostree[9762]: Txn UpdateDeployment on /org/projectatomic/rpmostree1/rhcos successful


When it happen next time in CI, is it possible to give access to that cluster?

Comment 17 Sinny Kumari 2020-09-09 13:51:20 UTC
Thanks Denys for sharing reproducible cluster and logs.

What I observed is when we applied realtime and kernelarguments on the nodes, two of the nodes(worker-0, worker-2) had issues with applying realtime kernel. On failure worker-0 was successfully able to rollback the rpm-ostree related changes while worker-2 failed since as per rpm-ostree another transaction was already in progress.

Pasting below relevant logs from worker-0 and worker-2:

worker-0
--------

MCD log:
...
I0908 16:26:51.463443    3356 update.go:1542] Starting update from rendered-worker-ff2507f638ea359164e1d3b6386cadd3 to rendered-worker-cnf-79ee49acc24e9f1cc0be696c4d29ea2b: &{osUpdate:false kargs:true fips:false passwd:false files:true units:true kernelType:true extensions:false}
...
I0908 16:28:00.340382    3356 run.go:18] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-583474098 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:19ebd206b45fe385def2b15a576dab64d2ed58dccd4b0f74387110311388f031
I0908 16:28:27.868638    3356 update.go:1542] Running rpm-ostree [kargs --append=skew_tick=1 --append=nohz=on --append=rcu_nocbs=1-3 --append=tuned.non_isolcpus=00000001 --append=intel_pstate=disable --append=nosoftlockup --append=tsc=nowatchdog --append=intel_iommu=on --append=iommu=pt --append=isolcpus=managed_irq,1-3 --append=systemd.cpu_affinity=0 --append=default_hugepagesz=1G --append=hugepagesz=2M --append=hugepages=128 --append=nmi_watchdog=0 --append=audit=0 --append=mce=off --append=processor.max_cstate=1 --append=idle=poll --append=intel_idle.max_cstate=0]
I0908 16:28:41.705072    3356 update.go:1542] Initiating switch from kernel default to realtime
I0908 16:28:41.711720    3356 update.go:1542] Switching to kernelType=realtime, invoking rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"]
I0908 16:28:41.789103    3356 update.go:358] Rolling back applied changes to OS
I0908 16:28:41.789251    3356 rpm-ostree.go:261] Running captured: rpm-ostree cleanup -p
I0908 16:28:44.505318    3356 update.go:1373] Writing SSHKeys at "/home/core/.ssh/authorized_keys"
...

rpm-ostreed service log:
...
Sep 08 16:28:28 worker-0 rpm-ostree[39195]: Initiated txn KernelArgs for client(id:cli dbus:1.402 unit:crio-5be1e7088bb22198577df3814cc5525868648ddd724695eb3de17317a567dde1.scope uid:0): /org/projectatomic/rpmostree1/rhcos
 08 16:28:39 worker-0 rpm-ostree[39195]: Created new deployment /ostree/deploy/rhcos/deploy/a694d718a6c3f0fd0c367030f865ff2d85a2467853ef5de80be2d2eb77e98b56.1
Sep 08 16:28:41 worker-0 rpm-ostree[39195]: Txn KernelArgs on /org/projectatomic/rpmostree1/rhcos successful
Sep 08 16:28:41 worker-0 rpm-ostree[39195]: Initiated txn Cleanup for client(id:cli dbus:1.405 unit:crio-5be1e7088bb22198577df3814cc5525868648ddd724695eb3de17317a567dde1.scope uid:0): /org/projectatomic/rpmostree1/rhcos
Sep 08 16:28:43 worker-0 rpm-ostree[39195]: Txn Cleanup on /org/projectatomic/rpmostree1/rhcos successful


worker-2
---------

MCD log:
..
I0908 16:59:01.633512    3648 update.go:1542] Starting update from rendered-worker-ff2507f638ea359164e1d3b6386cadd3 to rendered-worker-cnf-79ee49acc24e9f1cc0be696c4d29ea2b: &{osUpdate:false kargs:true fips:false passwd:false files:true units:true kernelType:true extensions:false}
...
I0908 17:00:54.376819    3648 run.go:18] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-159144463 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:19ebd206b45fe385def2b15a576dab64d2ed58dccd4b0f74387110311388f031
I0908 17:01:42.046930    3648 update.go:1542] Running rpm-ostree [kargs --append=skew_tick=1 --append=nohz=on --append=rcu_nocbs=1-3 --append=tuned.non_isolcpus=00000001 --append=intel_pstate=disable --append=nosoftlockup --append=tsc=nowatchdog --append=intel_iommu=on --append=iommu=pt --append=isolcpus=managed_irq,1-3 --append=systemd.cpu_affinity=0 --append=default_hugepagesz=1G --append=hugepagesz=2M --append=hugepages=128 --append=nmi_watchdog=0 --append=audit=0 --append=mce=off --append=processor.max_cstate=1 --append=idle=poll --append=intel_idle.max_cstate=0]
I0908 17:01:55.033620    3648 update.go:1542] Initiating switch from kernel default to realtime
I0908 17:01:55.038773    3648 update.go:1542] Switching to kernelType=realtime, invoking rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"]
I0908 17:01:55.105279    3648 update.go:358] Rolling back applied changes to OS
I0908 17:01:55.105402    3648 rpm-ostree.go:261] Running captured: rpm-ostree cleanup -p
error: System transaction in progress
I0908 17:01:55.826785    3648 update.go:1373] Writing SSHKeys at "/home/core/.ssh/authorized_keys"
...



rpm-ostreed service log:
Sep 08 17:01:42 worker-2 rpm-ostree[107863]: Initiated txn KernelArgs for client(id:cli dbus:1.524 unit:crio-0ec57c7902750bb7aa343a39e489126a2371246b3a3ca376763ac2582d563385.scope uid:0): /org/projectatomic/rpmostree1/rhcos
Sep 08 17:01:53 worker-2 rpm-ostree[107863]: Created new deployment /ostree/deploy/rhcos/deploy/a694d718a6c3f0fd0c367030f865ff2d85a2467853ef5de80be2d2eb77e98b56.1
Sep 08 17:01:55 worker-2 rpm-ostree[107863]: Txn KernelArgs on /org/projectatomic/rpmostree1/rhcos successful
Sep 08 17:01:55 worker-2 rpm-ostree[107863]: In idle state; will auto-exit in 61 seconds
Sep 08 17:02:32 worker-2 rpm-ostree[107863]: Initiated txn KernelArgs for client(id:cli dbus:1.528 unit:crio-0ec57c7902750bb7aa343a39e489126a2371246b3a3ca376763ac2582d563385.scope uid:0): /org/projectatomic/rpmostree1/rhcos
Sep 08 17:02:36 worker-2 rpm-ostree[107863]: Created new deployment /ostree/deploy/rhcos/deploy/a694d718a6c3f0fd0c367030f865ff2d85a2467853ef5de80be2d2eb77e98b56.2
Sep 08 17:02:36 worker-2 rpm-ostree[107863]: sanitycheck(/usr/bin/true) successful
Sep 08 17:02:37 worker-2 rpm-ostree[107863]: Txn KernelArgs on /org/projectatomic/rpmostree1/rhcos successful


It seems rpm-ostreed service doesn't capture logs when a request is made to rpm-ostree while another system transaction is in progress.

We can solve this issue by keep retrying rpm-ostree cleanup operation until it succeeds.

@jonathan @colin MCD does all rpm-ostree related operations sequentially. Any idea what could be the reason for rpm-ostree saying "System transaction in progress" ?

Comment 18 Colin Walters 2020-09-09 14:16:04 UTC
My recollection is that before the MCO gained support for some things like kernel arguments that some other operator (outside of the base platform) started talking to rpm-ostree - and I pointed out it would introduce potential races exactly like this.

I don't recall where that code landed but it should be removed now.

We could enhance rpm-ostree to have support for automatic "wait until lock released" semantics which would be nice in some cases, but I think would mask the real underlying problem here.

Comment 19 Colin Walters 2020-09-09 14:22:12 UTC
The systemd journal will log which transaction and the client ID so that might help in narrowing down which other project is doing that.

Comment 20 Jonathan Lebon 2020-09-09 14:24:50 UTC
One thing I noticed here is that we seem to lose what the original error was. This might help debug why it seems like rpm-ostree is busy. Opened https://github.com/openshift/machine-config-operator/pull/2067.

Comment 21 Jonathan Lebon 2020-09-09 16:57:19 UTC
So there are four issues here:
1. `override remove` is presumably failing
2. we're swallowing the `override remove` error because `cleanup -p` also fails (this is fixed by https://github.com/openshift/machine-config-operator/pull/2067)
3. `cleanup -p` is failing because something is keeping the daemon busy
4. we don't have any logging of what else is keeping the daemon busy (this is fixed by https://github.com/openshift/machine-config-operator/pull/2068)

So there's still issues 1 and 3 present.

How about: let's merge those two PRs, keep this RHBZ open, and debug more once we have those since we should have more information by then?

Comment 22 Colin Walters 2020-09-09 18:51:25 UTC
Found it - does the operator have this change https://github.com/openshift-kni/performance-addon-operators/commit/a5341110f44ebe5cb5f0112434688a0cc2f4dd88 ?

Comment 24 Sinny Kumari 2020-09-10 07:00:19 UTC
Moving it to POST to include fixes from https://github.com/openshift/machine-config-operator/pull/2067

Comment 26 Sinny Kumari 2020-09-17 11:45:27 UTC
Moving this bug to assigned because issue is still not resolved.

Denis reported that CI is seeing the failure:

I0916 09:51:06.896151    1983 update.go:361] Rolling back applied changes to OS due to error: failed to execute rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"] : exit status 1
I0916 09:51:06.896252    1983 rpm-ostree.go:261] Running captured: rpm-ostree cleanup -p
error: System transaction in progress.

A new PR is on the way to capture complete rpm-ostree output during error. Hopefully that will help us to know why kernel-rt switch is failing.

Comment 27 Michelle Krejci 2020-09-21 17:12:56 UTC
Moving to 4.7. The team will continue to look for ways to debug this issue in hopes of being able to ship something in 4.6 but this is not a blocker.

Comment 28 Yu Qi Zhang 2020-09-21 18:34:41 UTC
As an aside: from Colin's comment https://bugzilla.redhat.com/show_bug.cgi?id=1865839#c22 could someone make sure that PR is in place? It sounds like there could be a race between multiple operators calling rpm-ostree still

Comment 29 Artyom 2020-09-22 07:22:50 UTC
Yes the performance operator has this change, but the change was merged before the issue start to happen and I also unsure how does it relevant to the problem, the performance-addon-operator never calls to the rpm-ostree. The performance-addon-operator creates bunch of objects like machine-config and tuned profile under the cluster and all future actions will be done by controllers responsible for these objects.

Jiri does the tuned controller calls to rpm-ostree commands?

Comment 30 jmencak 2020-09-22 08:40:44 UTC
(In reply to Artyom from comment #29)
> Yes the performance operator has this change, but the change was merged
> before the issue start to happen and I also unsure how does it relevant to
> the problem, the performance-addon-operator never calls to the rpm-ostree.
> The performance-addon-operator creates bunch of objects like machine-config
> and tuned profile under the cluster and all future actions will be done by
> controllers responsible for these objects.
> 
> Jiri does the tuned controller calls to rpm-ostree commands?

Artyom, Node Tuning Operator (NTO) that controlls the Tuned operands does not call
rpm-ostree commands.  It only creates MachineConfigs on request.


Note You need to log in before you can comment on or make changes to this bug.