Description of problem: The performance-addon-operator updates kernel arguments and disable the RT kernel under the same update operation. But after reboot I still can see kernel arguments that should be deleted under the /proc/cmdline. Version-Release number of selected component (if applicable): master How reproducible: 50% started to happen a lot under our CI. Steps to Reproduce: 1. Create some custom machine-config that will related to custom MCP 2. Lable worker node to be part of the custom MCP 3. Enable RT kernel and add some test kernel arguments under the MC and wait for the update. 4. Disable the RT kernel and update kernel arguments(add and remove) and wait for the update. 5. Check the /proc/cmdline under the node Actual results: Kernel arguments that should be deleted still exist under the node Expected results: Kernel arguments on the node should not include deleted arguments under the MC. Additional info: output of the machine-config-daemon logs I0804 06:53:36.346285 2116 update.go:1079] Writing systemd unit dropin "mco-disabled.conf" I0804 06:53:36.348012 2116 update.go:901] Deleting stale data I0804 06:53:36.348232 2116 update.go:1029] Removed stale systemd unit "/etc/systemd/system/hugepages-allocation-1048576kB-NUMA0.service" I0804 06:53:36.361529 2116 update.go:1264] Writing SSHKeys at "/home/core/.ssh/authorized_keys" I0804 06:53:36.385979 2116 run.go:18] Running: oc image extract --path /:/run/mco-machine-os-content/os-content-341274820 --registry-config /var/lib/kubelet/config.json registry.svc.ci.openshift.org/ci-op-52tsdbr3/stable@sha256:16a8dde4b893ff0b5b4aeb05474f2f5e2ce9cac45d5d3e98b40c4309e23215a7 I0804 06:53:55.402698 2116 update.go:1436] Running rpm-ostree [kargs --delete=rcu_nocbs=1-3 --delete=tuned.non_isolcpus=00000001 --delete=isolcpus=managed_irq,1-3 --delete=systemd.cpu_affinity=0 --delete=default_hugepagesz=1G --delete=nmi_watchdog=0 --append=rcu_nocbs=1-2 --append=tuned.non_isolcpus=00000009 --append=isolcpus=domain,managed_irq,1-2 --append=systemd.cpu_affinity=0,3 --append=default_hugepagesz=2M --append=hugepagesz=2M --append=hugepages=5 --append=new-argument=test] I0804 06:56:44.773800 2116 update.go:1436] Initiating switch from kernel realtime to default I0804 06:56:44.779164 2116 update.go:1436] Switching to kernelType=default, invoking rpm-ostree ["override" "reset" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--uninstall" "kernel-rt-core" "--uninstall" "kernel-rt-modules" "--uninstall" "kernel-rt-modules-extra" "--uninstall" "kernel-rt-kvm"] I0804 06:56:52.504966 2116 update.go:1436] initiating reboot: Node will reboot into config rendered-worker-cnf-49e1d72c9785af8eca1b6dacc2b62fec I0804 06:57:00.122841 2116 daemon.go:627] Shutting down MachineConfigDaemon I0804 06:59:07.064449 2121 start.go:108] Version: machine-config-daemon-4.6.0-202006240615.p0-157-gcfea781f-dirty (cfea781f822a3c7516ce71ff7959f8ec772b6ec4) I0804 06:59:07.094522 2121 start.go:118] Calling chroot("/rootfs") I0804 06:59:07.096500 2121 rpm-ostree.go:196] Running captured: rpm-ostree status --json I0804 06:59:07.607134 2121 daemon.go:222] Booted osImageURL: registry.svc.ci.openshift.org/ci-op-52tsdbr3/stable@sha256:16a8dde4b893ff0b5b4aeb05474f2f5e2ce9cac45d5d3e98b40c4309e23215a7 (46.82.202008032211-0) I0804 06:59:07.758977 2121 daemon.go:229] Installed Ignition binary version: 2.5.0 I0804 06:59:07.928245 2121 start.go:97] Copied self to /run/bin/machine-config-daemon on host I0804 06:59:07.930791 2121 metrics.go:106] Registering Prometheus metrics I0804 06:59:07.930929 2121 metrics.go:111] Starting metrics listener on 127.0.0.1:8797 I0804 06:59:07.936244 2121 update.go:1436] Starting to manage node: ci-op-52tsdbr3-24cc7-46vcj-worker-b-h8jxx I0804 06:59:07.959741 2121 rpm-ostree.go:196] Running captured: rpm-ostree status After the reboot I still can see all deleted arguments I0804 06:59:28.020514 17878 utils.go:23] run command 'oc [exec -i -n openshift-machine-config-operator -c machine-config-daemon --request-timeout 30 machine-config-daemon-q5j7p -- cat /proc/cmdline]': out=BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-5c3797397db1bafbc8c328137628f4008d4157b555477436f6ad87a10a00a388/vmlinuz-4.18.0-211.el8.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ostree=/ostree/boot.0/rhcos/5c3797397db1bafbc8c328137628f4008d4157b555477436f6ad87a10a00a388/0 ignition.platform.id=gcp skew_tick=1 nohz=on intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt audit=0 mce=off processor.max_cstate=1 idle=poll intel_idle.max_cstate=0 skew_tick=1 nohz=on rcu_nocbs=1-3 tuned.non_isolcpus=00000001 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,1-3 systemd.cpu_affinity=0 default_hugepagesz=1G nmi_watchdog=0 audit=0 mce=off processor.max_cstate=1 idle=poll intel_idle.max_cstate=0 skew_tick=1 nohz=on rcu_nocbs=1-3 tuned.non_isolcpus=00000001 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,1-3 systemd.cpu_affinity=0 default_hugepagesz=1G nmi_watchdog=0 audit=0 mce=off processor.max_cstate=1 idle=poll intel_idle.max_cstate=0 rcu_nocbs=1-2 tuned.non_isolcpus=00000009 isolcpus=domain,managed_irq,1-2 systemd.cpu_affinity=0,3 default_hugepagesz=2M hugepagesz=2M hugepages=5 new-argument=test You can find all logs under the https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift-kni_performance-addon-operators/318/pull-ci-openshift-kni-performance-addon-operators-master-e2e-gcp/1290521035233300480/artifacts/e2e-gcp/pods/openshift-machine-config-operator_machine-config-daemon-q5j7p_machine-config-daemon.log I unsure if this bug relates to the machine-config-daemon or to the rpm-ostree.
@Artyom, did you check your /proc/cmdline before step 4 (removing rt kernel and some arguments?). Probably you had some arguments duplication in your cmdline.. At least that what I see on my cluster after switching to RT kernel and adding arguments: > cat [core@worker-0 ~]$ cat /proc/cmdline > BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-5b17e85442f17aba6acf91310f2589910585fbcfa2596471c9d2d818471d7030/vmlinuz-4.18.0-211.rt5.23.el8.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ostree=/ostree/boot.1/rhcos/5b17e85442f17aba6acf91310f2589910585fbcfa2596471c9d2d818471d7030/0 ignition.platform.id=openstack skew_tick=1 nohz=on rcu_nocbs=1-3 tuned.non_isolcpus=00000001 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,1-3 systemd.cpu_affinity=0 default_hugepagesz=1G nmi_watchdog=0 audit=0 mce=off processor.max_cstate=1 idle=poll intel_idle.max_cstate=0 skew_tick=1 nohz=on rcu_nocbs=1-3 tuned.non_isolcpus=00000001 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,1-3 systemd.cpu_affinity=0 default_hugepagesz=1G nmi_watchdog=0 audit=0 mce=off processor.max_cstate=1 idle=poll intel_idle.max_cstate=0 skew_tick=1 nohz=on rcu_nocbs=1-3 tuned.non_isolcpus=00000001 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,1-3 systemd.cpu_affinity=0 default_hugepagesz=1G nmi_watchdog=0 audit=0 mce=off processor.max_cstate=1 idle=poll intel_idle.max_cstate=0 You can see that almost all arguments triplicated (!). Here the log from machine config daemon pod: >. >I0810 18:22:26.561149 3545 update.go:1436] Running rpm-ostree [kargs --append=skew_tick=1 --append=nohz=on --append=rcu_nocbs=1-3 --append=tuned.non_isolcpus=00000001 --append=intel_pstate=disable --append=nosoftlockup --append=tsc=nowatchdog --append=intel_iommu=on --append=iommu=pt --append=isolcpus=managed_irq,1-3 --append=systemd.cpu_affinity=0 --append=default_hugepagesz=1G --append=nmi_watchdog=0 --append=audit=0 --append=mce=off --append=processor.max_cstate=1 --append=idle=poll --append=intel_idle.max_cstate=0] >I0810 18:22:41.652350 3545 update.go:1436] Initiating switch from kernel default to realtime >I0810 18:22:41.659569 3545 update.go:1436] Switching to kernelType=realtime, invoking rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"] >. >E0810 18:22:42.655159 3545 writer.go:135] Marking Degraded due to: failed to execute rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"] : exit status 1 >I0810 18:22:42.686359 3545 update.go:324] Checking Reconcilable for config rendered-worker-e4af5d3dba1d64414516c4c5f235e054 to rendered-worker-cnf-3145023d5727e9f7b5478b734e43e0f4 >I0810 18:22:42.786650 3545 update.go:1436] Starting update from rendered-worker-e4af5d3dba1d64414516c4c5f235e054 to rendered-worker-cnf-3145023d5727e9f7b5478b734e43e0f4: &{osUpdate:false kargs:true fips:false passwd:false files:true units:true kernelType:true extensions:false} >. >I0810 18:23:11.736999 3545 update.go:1436] Running rpm-ostree [kargs --append=skew_tick=1 --append=nohz=on --append=rcu_nocbs=1-3 --append=tuned.non_isolcpus=00000001 --append=intel_pstate=disable --append=nosoftlockup --append=tsc=nowatchdog --append=intel_iommu=on --append=iommu=pt --append=isolcpus=managed_irq,1-3 --append=systemd.cpu_affinity=0 --append=default_hugepagesz=1G --append=nmi_watchdog=0 --append=audit=0 --append=mce=off --append=processor.max_cstate=1 --append=idle=poll --append=intel_idle.max_cstate=0] >I0810 18:23:18.047962 3545 update.go:1436] Initiating switch from kernel default to realtime >I0810 18:23:18.053693 3545 update.go:1436] Switching to kernelType=realtime, invoking rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"] >. >E0810 18:23:19.037960 3545 writer.go:135] Marking Degraded due to: failed to execute rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"] : exit status 1 >I0810 18:23:42.674926 3545 update.go:324] Checking Reconcilable for config rendered-worker-e4af5d3dba1d64414516c4c5f235e054 to rendered-worker-cnf-3145023d5727e9f7b5478b734e43e0f4 >I0810 18:23:42.756123 3545 update.go:1436] Starting update from rendered-worker-e4af5d3dba1d64414516c4c5f235e054 to rendered-worker-cnf-3145023d5727e9f7b5478b734e43e0f4: &{osUpdate:false kargs:true fips:false passwd:false files:true units:true kernelType:true extensions:false} >. >I0810 18:24:56.713134 3545 update.go:1436] Running rpm-ostree [kargs --append=skew_tick=1 --append=nohz=on --append=rcu_nocbs=1-3 --append=tuned.non_isolcpus=00000001 --append=intel_pstate=disable --append=nosoftlockup --append=tsc=nowatchdog --append=intel_iommu=on --append=iommu=pt --append=isolcpus=managed_irq,1-3 --append=systemd.cpu_affinity=0 --append=default_hugepagesz=1G --append=nmi_watchdog=0 --append=audit=0 --append=mce=off --append=processor.max_cstate=1 --append=idle=poll --append=intel_idle.max_cstate=0] >I0810 18:25:02.634854 3545 update.go:1436] Initiating switch from kernel default to realtime >I0810 18:25:02.640505 3545 update.go:1436] Switching to kernelType=realtime, invoking rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt->core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"] >I0810 18:28:22.534336 3545 update.go:1436] initiating reboot: Node will reboot into config rendered-worker-cnf-3145023d5727e9f7b5478b734e43e0f4 What I see - MC added arguments -> failed with switching kernel -> added arguments for second time -> failed with kernel again -> added arguments for the third time -> successfully switched kernel and rebooted the node. As result my kernel arguments were added three times.
I did not check the cmdline before, because it happens under the CI run so I did not have a chance to check it, but do you know why do you have doubles and triples of kernel arguments?
From the logs in comment #1 I see that MCO daemon added kernel arguments on every try to install RT kernel.. after third tries I got triplicated arguments
As Deny mentioned in comment #1, from the log we can see that kernelArguments applied successfully but switch to kernel-rt failed twice with error: E0804 06:43:53.479144 2036 writer.go:135] Marking Degraded due to: failed to execute rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"] : exit status 1 This is a regression from recent rework on OS update where we missed handling rollback properly during failure of an rpm-ostree transaction. Fix is on the way.
Fix should be available in nightlies now. @Dany @Artyom Can you please let us know if kernelArguments are applied correctly in your CI setup?
will keep an eye on it, unfortunately it was not happen on every run
reproduced again, here the log message from the daemon pod: I0831 17:27:55.988406 2155 update.go:1542] Running rpm-ostree [kargs --append=skew_tick=1 --append=nohz=on --append=rcu_nocbs=1-3 --append=tuned.non_isolcpus=00000001 --append=intel_pstate=disable --append=nosoftlockup --append=tsc=nowatchdog --append=intel_iommu=on --append=iommu=pt --append=isolcpus=managed_irq,1-3 --append=systemd.cpu_affinity=0 --append=default_hugepagesz=1G --append=nmi_watchdog=0 --append=audit=0 --append=mce=off --append=processor.max_cstate=1 --append=idle=poll --append=intel_idle.max_cstate=0] I0831 17:28:07.054461 2155 update.go:1542] Initiating switch from kernel default to realtime I0831 17:28:07.058866 2155 update.go:1542] Switching to kernelType=realtime, invoking rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"] I0831 17:28:07.110342 2155 update.go:358] Rolling back applied changes to OS I0831 17:28:07.110442 2155 rpm-ostree.go:261] Running captured: rpm-ostree cleanup -p error: System transaction in progress . . E0831 17:28:07.721421 2155 writer.go:135] Marking Degraded due to: error removing staged deployment: error running rpm-ostree cleanup -p: : exit status 1: failed to execute rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"] : exit status 1 @Sinny, tell me if you need more logs, I can try to get full artifacts from CI
By looking at the above logs, looks to me that MCO has applied kargs as expected but realtime switch fails (don't know why) and as a result MCO tried to rollback the applied changes but it fails maybe because another rpm-ostree transaction is still active. yes, access to complete log will be needed to dig in further.
Created attachment 1713486 [details] mahine config daemon log Attached full mc-daemon log
Thanks for providing the logs, didn't find the useful information I was looking for in the journal logs which is rpm-ostreed.service log during the time `rpm-ostree cleanup` failed. It seems ci keeps journal log for n and n-1 boot. I will try to figure out reproducing this behavior locally. Meanwhile, wondering if we can modify ci to capture journal logs from all the boots.
I can't reproduce rpm-ostree cleanup issue locally. What I did to reproduce the problem is after kernel-rt switch, specified explicitly to do pending deployment cleanup. Diff of MCO changes: ``` diff --git a/pkg/daemon/update.go b/pkg/daemon/update.go index a6cffbbc..f3ed0735 100644 --- a/pkg/daemon/update.go +++ b/pkg/daemon/update.go @@ -375,6 +375,11 @@ func (dn *Daemon) applyOSChanges(oldConfig, newConfig *mcfgv1.MachineConfig) (re return err } + if err := removePendingDeployment(); err != nil { + retErr = errors.Wrapf(retErr, "error removing staged deployment: %v", err) + return + } + // Apply extensions if err := dn.applyExtensions(oldConfig, newConfig); err != nil { return err ``` Launched a cluster with day1 realtime MachineConifg, pending deployment did get cleanup fine during firstboot: sh-4.4# journalctl -u rpm-ostreed |grep "rhcos successful" Sep 08 05:37:41 sinny-asset-qmq2k-worker-a-jlgww.c.openshift-gce-devel.internal rpm-ostree[2046]: Txn UpdateDeployment on /org/projectatomic/rpmostree1/rhcos successful Sep 08 05:39:41 sinny-asset-qmq2k-worker-a-jlgww.c.openshift-gce-devel.internal rpm-ostree[2046]: Txn UpdateDeployment on /org/projectatomic/rpmostree1/rhcos successful Sep 08 05:39:42 sinny-asset-qmq2k-worker-a-jlgww.c.openshift-gce-devel.internal rpm-ostree[2046]: Txn Cleanup on /org/projectatomic/rpmostree1/rhcos successful Sep 08 05:42:26 sinny-asset-qmq2k-worker-a-jlgww.c.openshift-gce-devel.internal rpm-ostree[9762]: Txn UpdateDeployment on /org/projectatomic/rpmostree1/rhcos successful When it happen next time in CI, is it possible to give access to that cluster?
Thanks Denys for sharing reproducible cluster and logs. What I observed is when we applied realtime and kernelarguments on the nodes, two of the nodes(worker-0, worker-2) had issues with applying realtime kernel. On failure worker-0 was successfully able to rollback the rpm-ostree related changes while worker-2 failed since as per rpm-ostree another transaction was already in progress. Pasting below relevant logs from worker-0 and worker-2: worker-0 -------- MCD log: ... I0908 16:26:51.463443 3356 update.go:1542] Starting update from rendered-worker-ff2507f638ea359164e1d3b6386cadd3 to rendered-worker-cnf-79ee49acc24e9f1cc0be696c4d29ea2b: &{osUpdate:false kargs:true fips:false passwd:false files:true units:true kernelType:true extensions:false} ... I0908 16:28:00.340382 3356 run.go:18] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-583474098 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:19ebd206b45fe385def2b15a576dab64d2ed58dccd4b0f74387110311388f031 I0908 16:28:27.868638 3356 update.go:1542] Running rpm-ostree [kargs --append=skew_tick=1 --append=nohz=on --append=rcu_nocbs=1-3 --append=tuned.non_isolcpus=00000001 --append=intel_pstate=disable --append=nosoftlockup --append=tsc=nowatchdog --append=intel_iommu=on --append=iommu=pt --append=isolcpus=managed_irq,1-3 --append=systemd.cpu_affinity=0 --append=default_hugepagesz=1G --append=hugepagesz=2M --append=hugepages=128 --append=nmi_watchdog=0 --append=audit=0 --append=mce=off --append=processor.max_cstate=1 --append=idle=poll --append=intel_idle.max_cstate=0] I0908 16:28:41.705072 3356 update.go:1542] Initiating switch from kernel default to realtime I0908 16:28:41.711720 3356 update.go:1542] Switching to kernelType=realtime, invoking rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"] I0908 16:28:41.789103 3356 update.go:358] Rolling back applied changes to OS I0908 16:28:41.789251 3356 rpm-ostree.go:261] Running captured: rpm-ostree cleanup -p I0908 16:28:44.505318 3356 update.go:1373] Writing SSHKeys at "/home/core/.ssh/authorized_keys" ... rpm-ostreed service log: ... Sep 08 16:28:28 worker-0 rpm-ostree[39195]: Initiated txn KernelArgs for client(id:cli dbus:1.402 unit:crio-5be1e7088bb22198577df3814cc5525868648ddd724695eb3de17317a567dde1.scope uid:0): /org/projectatomic/rpmostree1/rhcos 08 16:28:39 worker-0 rpm-ostree[39195]: Created new deployment /ostree/deploy/rhcos/deploy/a694d718a6c3f0fd0c367030f865ff2d85a2467853ef5de80be2d2eb77e98b56.1 Sep 08 16:28:41 worker-0 rpm-ostree[39195]: Txn KernelArgs on /org/projectatomic/rpmostree1/rhcos successful Sep 08 16:28:41 worker-0 rpm-ostree[39195]: Initiated txn Cleanup for client(id:cli dbus:1.405 unit:crio-5be1e7088bb22198577df3814cc5525868648ddd724695eb3de17317a567dde1.scope uid:0): /org/projectatomic/rpmostree1/rhcos Sep 08 16:28:43 worker-0 rpm-ostree[39195]: Txn Cleanup on /org/projectatomic/rpmostree1/rhcos successful worker-2 --------- MCD log: .. I0908 16:59:01.633512 3648 update.go:1542] Starting update from rendered-worker-ff2507f638ea359164e1d3b6386cadd3 to rendered-worker-cnf-79ee49acc24e9f1cc0be696c4d29ea2b: &{osUpdate:false kargs:true fips:false passwd:false files:true units:true kernelType:true extensions:false} ... I0908 17:00:54.376819 3648 run.go:18] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-159144463 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:19ebd206b45fe385def2b15a576dab64d2ed58dccd4b0f74387110311388f031 I0908 17:01:42.046930 3648 update.go:1542] Running rpm-ostree [kargs --append=skew_tick=1 --append=nohz=on --append=rcu_nocbs=1-3 --append=tuned.non_isolcpus=00000001 --append=intel_pstate=disable --append=nosoftlockup --append=tsc=nowatchdog --append=intel_iommu=on --append=iommu=pt --append=isolcpus=managed_irq,1-3 --append=systemd.cpu_affinity=0 --append=default_hugepagesz=1G --append=hugepagesz=2M --append=hugepages=128 --append=nmi_watchdog=0 --append=audit=0 --append=mce=off --append=processor.max_cstate=1 --append=idle=poll --append=intel_idle.max_cstate=0] I0908 17:01:55.033620 3648 update.go:1542] Initiating switch from kernel default to realtime I0908 17:01:55.038773 3648 update.go:1542] Switching to kernelType=realtime, invoking rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"] I0908 17:01:55.105279 3648 update.go:358] Rolling back applied changes to OS I0908 17:01:55.105402 3648 rpm-ostree.go:261] Running captured: rpm-ostree cleanup -p error: System transaction in progress I0908 17:01:55.826785 3648 update.go:1373] Writing SSHKeys at "/home/core/.ssh/authorized_keys" ... rpm-ostreed service log: Sep 08 17:01:42 worker-2 rpm-ostree[107863]: Initiated txn KernelArgs for client(id:cli dbus:1.524 unit:crio-0ec57c7902750bb7aa343a39e489126a2371246b3a3ca376763ac2582d563385.scope uid:0): /org/projectatomic/rpmostree1/rhcos Sep 08 17:01:53 worker-2 rpm-ostree[107863]: Created new deployment /ostree/deploy/rhcos/deploy/a694d718a6c3f0fd0c367030f865ff2d85a2467853ef5de80be2d2eb77e98b56.1 Sep 08 17:01:55 worker-2 rpm-ostree[107863]: Txn KernelArgs on /org/projectatomic/rpmostree1/rhcos successful Sep 08 17:01:55 worker-2 rpm-ostree[107863]: In idle state; will auto-exit in 61 seconds Sep 08 17:02:32 worker-2 rpm-ostree[107863]: Initiated txn KernelArgs for client(id:cli dbus:1.528 unit:crio-0ec57c7902750bb7aa343a39e489126a2371246b3a3ca376763ac2582d563385.scope uid:0): /org/projectatomic/rpmostree1/rhcos Sep 08 17:02:36 worker-2 rpm-ostree[107863]: Created new deployment /ostree/deploy/rhcos/deploy/a694d718a6c3f0fd0c367030f865ff2d85a2467853ef5de80be2d2eb77e98b56.2 Sep 08 17:02:36 worker-2 rpm-ostree[107863]: sanitycheck(/usr/bin/true) successful Sep 08 17:02:37 worker-2 rpm-ostree[107863]: Txn KernelArgs on /org/projectatomic/rpmostree1/rhcos successful It seems rpm-ostreed service doesn't capture logs when a request is made to rpm-ostree while another system transaction is in progress. We can solve this issue by keep retrying rpm-ostree cleanup operation until it succeeds. @jonathan @colin MCD does all rpm-ostree related operations sequentially. Any idea what could be the reason for rpm-ostree saying "System transaction in progress" ?
My recollection is that before the MCO gained support for some things like kernel arguments that some other operator (outside of the base platform) started talking to rpm-ostree - and I pointed out it would introduce potential races exactly like this. I don't recall where that code landed but it should be removed now. We could enhance rpm-ostree to have support for automatic "wait until lock released" semantics which would be nice in some cases, but I think would mask the real underlying problem here.
The systemd journal will log which transaction and the client ID so that might help in narrowing down which other project is doing that.
One thing I noticed here is that we seem to lose what the original error was. This might help debug why it seems like rpm-ostree is busy. Opened https://github.com/openshift/machine-config-operator/pull/2067.
So there are four issues here: 1. `override remove` is presumably failing 2. we're swallowing the `override remove` error because `cleanup -p` also fails (this is fixed by https://github.com/openshift/machine-config-operator/pull/2067) 3. `cleanup -p` is failing because something is keeping the daemon busy 4. we don't have any logging of what else is keeping the daemon busy (this is fixed by https://github.com/openshift/machine-config-operator/pull/2068) So there's still issues 1 and 3 present. How about: let's merge those two PRs, keep this RHBZ open, and debug more once we have those since we should have more information by then?
Found it - does the operator have this change https://github.com/openshift-kni/performance-addon-operators/commit/a5341110f44ebe5cb5f0112434688a0cc2f4dd88 ?
Moving it to POST to include fixes from https://github.com/openshift/machine-config-operator/pull/2067
Moving this bug to assigned because issue is still not resolved. Denis reported that CI is seeing the failure: I0916 09:51:06.896151 1983 update.go:361] Rolling back applied changes to OS due to error: failed to execute rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"] : exit status 1 I0916 09:51:06.896252 1983 rpm-ostree.go:261] Running captured: rpm-ostree cleanup -p error: System transaction in progress. A new PR is on the way to capture complete rpm-ostree output during error. Hopefully that will help us to know why kernel-rt switch is failing.
Moving to 4.7. The team will continue to look for ways to debug this issue in hopes of being able to ship something in 4.6 but this is not a blocker.
As an aside: from Colin's comment https://bugzilla.redhat.com/show_bug.cgi?id=1865839#c22 could someone make sure that PR is in place? It sounds like there could be a race between multiple operators calling rpm-ostree still
Yes the performance operator has this change, but the change was merged before the issue start to happen and I also unsure how does it relevant to the problem, the performance-addon-operator never calls to the rpm-ostree. The performance-addon-operator creates bunch of objects like machine-config and tuned profile under the cluster and all future actions will be done by controllers responsible for these objects. Jiri does the tuned controller calls to rpm-ostree commands?
(In reply to Artyom from comment #29) > Yes the performance operator has this change, but the change was merged > before the issue start to happen and I also unsure how does it relevant to > the problem, the performance-addon-operator never calls to the rpm-ostree. > The performance-addon-operator creates bunch of objects like machine-config > and tuned profile under the cluster and all future actions will be done by > controllers responsible for these objects. > > Jiri does the tuned controller calls to rpm-ostree commands? Artyom, Node Tuning Operator (NTO) that controlls the Tuned operands does not call rpm-ostree commands. It only creates MachineConfigs on request.
From another journal log provided by Denys, it appears that only machine-config-operator is calling rpm-ostree after the first boot(id:cli is from firstboot) # grep -i "rpm-ostree" j | egrep -o 'client\(id:[^ ]+ ' | uniq client(id:cli client(id:machine-config-operator Curious that why `rpm-ostree cleanup -p` are failing then.
Here the latest logs: I1009 16:30:24.431641 2447 update.go:1558] Initiating switch from kernel default to realtime I1009 16:30:24.436755 2447 update.go:1558] Switching to kernelType=realtime, invoking rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"] I1009 16:30:24.441612 2447 rpm-ostree.go:261] Running captured: rpm-ostree override remove kernel kernel-core kernel-modules kernel-modules-extra --install kernel-rt-core --install kernel-rt-modules --install kernel-rt-modules-extra --install kernel-rt-kvm I1009 16:30:24.505583 2447 update.go:361] Rolling back applied changes to OS due to error: error running rpm-ostree override remove kernel kernel-core kernel-modules kernel-modules-extra --install kernel-rt-core --install kernel-rt-modules --install kernel-rt-modules-extra --install kernel-rt-kvm: error: System transaction in progress : exit status 1 Also I've attached mco daemon log and journal from the node: logs.tgz
Created attachment 1720330 [details] mco daemon log and journal from the node
From the latest log provided, it is clear that both cleanup and kernel-rt switch are failing because System transaction in progress. From provided journal log, I don't see any clue of specious activity in rpm-ostreed service. Applying above rpm-ostree kargs and switching to kernel-rt usling cli works well. Also, applying kernel-rt and kargs on a local 4.7 cluster in same MC completes successfully for both worker and master. ind: MachineConfig metadata: labels: machineconfiguration.openshift.io/role: worker name: worker-1 spec: config: ignition: version: 3.0.0 kernelArguments: - skew_tick=1 - nohz=on - rcu_nocbs=1-3 - tuned.non_isolcpus=00000001 - intel_pstate=disable - nosoftlockup - tsc=nowatchdog - intel_iommu=on - iommu=pt - isolcpus=managed_irq,1-3 - systemd.cpu_affinity=0 - default_hugepagesz=1G - hugepagesz=2M - hugepages=128 - nmi_watchdog=0 - audit=0 - mce=off - processor.max_cstate=1 - idle=poll - intel_idle.max_cstate=0 kernelType: realtime @Colin, @Jonathan Can we do/add any additional debugging to know why rpm-ostree says another system transaction is in progress? Also, Wondering if there is a way we can wait here on finishing system transaction other than grepping the error message?
That error means that the ostree sysroot is locked which could be because something else is using ostree directly (not via rpm-ostree). One possibility is that something else is trying to reboot the node, which would cause ostree-finalize-staged.service to start which takes the write lock. It's also *maybe* possible that somehow the daemon isn't releasing the write lock before returning to the client...hm yeah this is messy, we release the write lock when the client's dbus connection is closed. But that should happen because the MCO waits synchronously for the client process to exit, and the bus connection should be closed I think. The fact that both the attempt to switch and the rollback failed points strongly towards something else on the system doing something w/ostree I'd say.
Each time I look into logs from different cluster where we hit the problem, don't find anything unusual. From one of the latest cluster provided by Denys, checked service logs of all other ostree and rpm-ostree related service(ostree-finalize-staged.service, ostree-remount.service, ostree-prepare-root.servicerpm-ostree-bootstatus.service, rpm-ostreed-automatic.service, rpm-ostreed.service) and there is nothing unusual. Came across related upstream bug https://github.com/coreos/rpm-ostree/issues/1692 , nothing in journal logs which indicates similarity. Moving this bug to RHCOS as there could be a bug in either ostree or rpm-ostree. If required, as a workaround we can add a fix in MCO to wait if rpm-ostree returns "System transaction is in progress"
> It's also *maybe* possible that somehow the daemon isn't releasing the write lock before returning to the client...hm yeah this is messy, we release the write lock when the client's dbus connection is closed. But that should happen because the MCO waits synchronously for the client process to exit, and the bus connection should be closed I think. I did look at this a few weeks back when investigating this bug and even tried adding a bunch of `sleep()`s in some of the places where I thought we were susceptible to this (notably, right before we `ostree_sysroot_unlock`), but couldn't reproduce it. Still, it's possible it's more subtle than that.
https://github.com/coreos/rpm-ostree/pull/2255/
This one is probably going to get into RHEL 8.3 / OpenShift 4.7 first and we'll see if these tests are reproducible there. That said if we did a scratch build for 4.6 would you be able to try it out?
Work continues on this issue; labeling UpcomingSprint
OK I pushed a test release payload with the latest rpm-ostree: registry.svc.ci.openshift.org/cgwalters/test-rpmostree-4.7@sha256:bf59834a03388f3aba6a3785fad66eab7ab6c0b5ea6ed45faeffb6db9b157042 Can you test that out?
Hm, I can't install RT kernel on your build: # oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.7.0-0.nightly-2020-11-11-150919 True False 14h Cluster version is 4.7.0-0.nightly-2020-11-11-150919 $ rpm-ostree --version rpm-ostree: Version: '2020.7' E1113 00:46:05.603931 3636 writer.go:135] Marking Degraded due to: error running rpm-ostree override remove kernel kernel-core kernel-modules kernel-modules-extra --install kernel-rt-core --install kernel-rt-modules --install kernel-rt-modules-extra --install kernel-rt-kvm: Checking out tree 859d427...done Enabled rpm-md repositories: coreos-extensions rpm-md repo 'coreos-extensions' (cached); generated: 2020-11-11T18:27:37Z Importing rpm-md...done Resolving dependencies...done error: Could not depsolve transaction; 4 problems detected: Problem 1: conflicting requests - nothing provides linux-firmware >= 20200619-99.git3890db36 needed by kernel-rt-core-4.18.0-240.1.1.rt7.55.el8_3.x86_64 Problem 2: package kernel-rt-modules-extra-4.18.0-240.1.1.rt7.55.el8_3.x86_64 requires kernel-rt-uname-r = 4.18.0-240.1.1.rt7.55.el8_3.x86_64, but none of the providers can be installed - conflicting requests - nothing provides linux-firmware >= 20200619-99.git3890db36 needed by kernel-rt-core-4.18.0-240.1.1.rt7.55.el8_3.x86_64 Problem 3: package kernel-rt-modules-4.18.0-240.1.1.rt7.55.el8_3.x86_64 requires kernel-rt-uname-r = 4.18.0-240.1.1.rt7.55.el8_3.x86_64, but none of the providers can be installed - conflicting requests - nothing provides linux-firmware >= 20200619-99.git3890db36 needed by kernel-rt-core-4.18.0-240.1.1.rt7.55.el8_3.x86_64 Problem 4: package kernel-rt-kvm-4.18.0-240.1.1.rt7.55.el8_3.x86_64 requires kernel-rt = 4.18.0-240.1.1.rt7.55.el8_3, but none of the providers can be installed - conflicting requests - nothing provides linux-firmware >= 20200619-99.git3890db36 needed by kernel-rt-core-4.18.0-240.1.1.rt7.55.el8_3.x86_64 : exit status 1
OK right it's actually worse because > cluster.go:141: rpm-ostree: symbol lookup error: /lib64/librpmostree-1.so.1: undefined symbol: archive_read_support_filter_zstd Which happens because it was built in the 8.4 buildroot which has a newer libarchive among other things. I think probably the sanest thing is to do an 8.3.z build and tag that.
(In reply to Artyom from comment #49) > I can see that both Jira specified under the comment > https://bugzilla.redhat.com/show_bug.cgi?id=1865839#c47 resolved, so what > else do we need to make it available under the RHCoS 4.7? See comment #48; we've encountered additional issues with the version of rpm-ostree that was originally tagged in. We are curently working to resolve those problems. See https://bugzilla.redhat.com/show_bug.cgi?id=1894061 and https://bugzilla.redhat.com/show_bug.cgi?id=1894062 to follow progress.
bumping priority because we are hitting this bug quite a lot recently while working with the performance-addon operator
(In reply to Francesco Romani from comment #51) > bumping priority because we are hitting this bug quite a lot recently while > working with the performance-addon operator I understand this is impacting multiple teams, but unfortunately bumping the priority does not make builds go faster or paperwork get done quicker. :) The team is actively trying to provide resolution to this issue and this BZ will be updated once there is progress to note.
This should be addressed with `rpm-ostree-2020.7-1.el8_3.x86_64` The package will be included in forthcoming RHCOS builds via: https://github.com/openshift/os/pull/475 https://gitlab.cee.redhat.com/coreos/redhat-coreos/-/merge_requests/1203
Higher priority work has prevented from this issue being solved; adding UpcomingSprint keyword
the RHCOS 4.7 boot image was recently bumped in the installer - https://github.com/openshift/installer/pull/4540 This includes the fix for this issue, so moving to MODIFIED
Looks good, tested it tens times and never seen this issue anymore.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633