1865839 – rpm-ostree fails with "System transaction in progress" when moving to kernel-rt

Bug 1865839 - rpm-ostree fails with "System transaction in progress" when moving to kernel-rt

Summary: rpm-ostree fails with "System transaction in progress" when moving to kernel-rt

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	RHCOS
Sub Component:
Version:	4.6
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	urgent
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Jonathan Lebon
QA Contact:	Michael Nguyen
Docs Contact:
URL:
Whiteboard:
Depends On:	1894062
Blocks:	1915617
TreeView+	depends on / blocked

Reported:	2020-08-04 10:43 UTC by Artyom
Modified:	2021-03-23 20:20 UTC (History)
CC List:	19 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: The MCO tries to roll back rpm-ostree changes after encountering errors using `rpm-ostree cleanup -p`. Consequence: `rpm-ostree cleanup -p` sometimes fails with "System transaction in progress". Fix: rpm-ostree code related to D-Bus handling has been reworked and strengthened. Result: `rpm-ostree cleanup -p` should no longer hit "System transaction in progress".
Clone Of:
Environment:
Last Closed:	2021-02-24 15:14:01 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
mahine config daemon log (69.59 KB, text/plain) 2020-09-02 17:13 UTC, Denys Shchedrivyi	no flags	Details
mco daemon log and journal from the node (10.87 MB, application/gzip) 2020-10-09 19:03 UTC, Denys Shchedrivyi	no flags	Details
View All

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift machine-config-operator pull 2005	None	closed	Bug 1865839: daemon: cleanup pending deployment on an error	2021-02-17 21:57:49 UTC
Github	openshift machine-config-operator pull 2067	None	closed	Bug 1865839: daemon/update: print error causing rollback	2021-02-17 21:57:49 UTC
Github	openshift machine-config-operator pull 2068	None	closed	Bug 1865839: daemon: Set RPMOSTREE_CLIENT_ID again	2021-02-17 21:57:54 UTC
Github	openshift machine-config-operator pull 2097	None	closed	Bug 1865839: daemon: better error reporting for rpm-ostree operations	2021-02-17 21:57:49 UTC
Red Hat Product Errata	RHSA-2020:5633	None	None	None	2021-02-24 15:14:42 UTC

Description Artyom 2020-08-04 10:43:39 UTC

Description of problem:
The performance-addon-operator updates kernel arguments and disable the RT kernel under the same update operation. But after reboot I still can see kernel arguments that should be deleted under the /proc/cmdline.

Version-Release number of selected component (if applicable):
master

How reproducible:
50% started to happen a lot under our CI.

Steps to Reproduce:
1. Create some custom machine-config that will related to custom MCP
2. Lable worker node to be part of the custom MCP
3. Enable RT kernel and add some test kernel arguments under the MC and wait for the update.
4. Disable the RT kernel and update kernel arguments(add and remove) and wait for the update.
5. Check the /proc/cmdline under the node

Actual results:
Kernel arguments that should be deleted still exist under the node

Expected results:
Kernel arguments on the node should not include deleted arguments under the MC.

Additional info:

output of the machine-config-daemon logs
I0804 06:53:36.346285    2116 update.go:1079] Writing systemd unit dropin "mco-disabled.conf"
I0804 06:53:36.348012    2116 update.go:901] Deleting stale data
I0804 06:53:36.348232    2116 update.go:1029] Removed stale systemd unit "/etc/systemd/system/hugepages-allocation-1048576kB-NUMA0.service"
I0804 06:53:36.361529    2116 update.go:1264] Writing SSHKeys at "/home/core/.ssh/authorized_keys"
I0804 06:53:36.385979    2116 run.go:18] Running: oc image extract --path /:/run/mco-machine-os-content/os-content-341274820 --registry-config /var/lib/kubelet/config.json registry.svc.ci.openshift.org/ci-op-52tsdbr3/stable@sha256:16a8dde4b893ff0b5b4aeb05474f2f5e2ce9cac45d5d3e98b40c4309e23215a7
I0804 06:53:55.402698    2116 update.go:1436] Running rpm-ostree [kargs --delete=rcu_nocbs=1-3 --delete=tuned.non_isolcpus=00000001 --delete=isolcpus=managed_irq,1-3 --delete=systemd.cpu_affinity=0 --delete=default_hugepagesz=1G --delete=nmi_watchdog=0 --append=rcu_nocbs=1-2 --append=tuned.non_isolcpus=00000009 --append=isolcpus=domain,managed_irq,1-2 --append=systemd.cpu_affinity=0,3 --append=default_hugepagesz=2M --append=hugepagesz=2M --append=hugepages=5 --append=new-argument=test]
I0804 06:56:44.773800    2116 update.go:1436] Initiating switch from kernel realtime to default
I0804 06:56:44.779164    2116 update.go:1436] Switching to kernelType=default, invoking rpm-ostree ["override" "reset" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--uninstall" "kernel-rt-core" "--uninstall" "kernel-rt-modules" "--uninstall" "kernel-rt-modules-extra" "--uninstall" "kernel-rt-kvm"]
I0804 06:56:52.504966    2116 update.go:1436] initiating reboot: Node will reboot into config rendered-worker-cnf-49e1d72c9785af8eca1b6dacc2b62fec
I0804 06:57:00.122841    2116 daemon.go:627] Shutting down MachineConfigDaemon
I0804 06:59:07.064449    2121 start.go:108] Version: machine-config-daemon-4.6.0-202006240615.p0-157-gcfea781f-dirty (cfea781f822a3c7516ce71ff7959f8ec772b6ec4)
I0804 06:59:07.094522    2121 start.go:118] Calling chroot("/rootfs")
I0804 06:59:07.096500    2121 rpm-ostree.go:196] Running captured: rpm-ostree status --json
I0804 06:59:07.607134    2121 daemon.go:222] Booted osImageURL: registry.svc.ci.openshift.org/ci-op-52tsdbr3/stable@sha256:16a8dde4b893ff0b5b4aeb05474f2f5e2ce9cac45d5d3e98b40c4309e23215a7 (46.82.202008032211-0)
I0804 06:59:07.758977    2121 daemon.go:229] Installed Ignition binary version: 2.5.0
I0804 06:59:07.928245    2121 start.go:97] Copied self to /run/bin/machine-config-daemon on host
I0804 06:59:07.930791    2121 metrics.go:106] Registering Prometheus metrics
I0804 06:59:07.930929    2121 metrics.go:111] Starting metrics listener on 127.0.0.1:8797
I0804 06:59:07.936244    2121 update.go:1436] Starting to manage node: ci-op-52tsdbr3-24cc7-46vcj-worker-b-h8jxx
I0804 06:59:07.959741    2121 rpm-ostree.go:196] Running captured: rpm-ostree status


After the reboot I still can see all deleted arguments

I0804 06:59:28.020514   17878 utils.go:23] run command 'oc [exec -i -n openshift-machine-config-operator -c machine-config-daemon --request-timeout 30 machine-config-daemon-q5j7p -- cat /proc/cmdline]':
  out=BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-5c3797397db1bafbc8c328137628f4008d4157b555477436f6ad87a10a00a388/vmlinuz-4.18.0-211.el8.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ostree=/ostree/boot.0/rhcos/5c3797397db1bafbc8c328137628f4008d4157b555477436f6ad87a10a00a388/0 ignition.platform.id=gcp skew_tick=1 nohz=on intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt audit=0 mce=off processor.max_cstate=1 idle=poll intel_idle.max_cstate=0 skew_tick=1 nohz=on rcu_nocbs=1-3 tuned.non_isolcpus=00000001 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,1-3 systemd.cpu_affinity=0 default_hugepagesz=1G nmi_watchdog=0 audit=0 mce=off processor.max_cstate=1 idle=poll intel_idle.max_cstate=0 skew_tick=1 nohz=on rcu_nocbs=1-3 tuned.non_isolcpus=00000001 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,1-3 systemd.cpu_affinity=0 default_hugepagesz=1G nmi_watchdog=0 audit=0 mce=off processor.max_cstate=1 idle=poll intel_idle.max_cstate=0 rcu_nocbs=1-2 tuned.non_isolcpus=00000009 isolcpus=domain,managed_irq,1-2 systemd.cpu_affinity=0,3 default_hugepagesz=2M hugepagesz=2M hugepages=5 new-argument=test

You can find all logs under the https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift-kni_performance-addon-operators/318/pull-ci-openshift-kni-performance-addon-operators-master-e2e-gcp/1290521035233300480/artifacts/e2e-gcp/pods/openshift-machine-config-operator_machine-config-daemon-q5j7p_machine-config-daemon.log

I unsure if this bug relates to the machine-config-daemon or to the rpm-ostree.

Comment 1 Denys Shchedrivyi 2020-08-10 20:35:16 UTC

@Artyom, did you check your /proc/cmdline before step 4 (removing rt kernel and some arguments?). Probably you had some arguments duplication in your cmdline.. At least that what I see on my cluster after switching to RT kernel and adding arguments:


> cat [core@worker-0 ~]$ cat /proc/cmdline 
> BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-5b17e85442f17aba6acf91310f2589910585fbcfa2596471c9d2d818471d7030/vmlinuz-4.18.0-211.rt5.23.el8.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ostree=/ostree/boot.1/rhcos/5b17e85442f17aba6acf91310f2589910585fbcfa2596471c9d2d818471d7030/0 ignition.platform.id=openstack skew_tick=1 nohz=on rcu_nocbs=1-3 tuned.non_isolcpus=00000001 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,1-3 systemd.cpu_affinity=0 default_hugepagesz=1G nmi_watchdog=0 audit=0 mce=off processor.max_cstate=1 idle=poll intel_idle.max_cstate=0 skew_tick=1 nohz=on rcu_nocbs=1-3 tuned.non_isolcpus=00000001 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,1-3 systemd.cpu_affinity=0 default_hugepagesz=1G nmi_watchdog=0 audit=0 mce=off processor.max_cstate=1 idle=poll intel_idle.max_cstate=0 skew_tick=1 nohz=on rcu_nocbs=1-3 tuned.non_isolcpus=00000001 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,1-3 systemd.cpu_affinity=0 default_hugepagesz=1G nmi_watchdog=0 audit=0 mce=off processor.max_cstate=1 idle=poll intel_idle.max_cstate=0

 You can see that almost all arguments triplicated (!).


Here the log from machine config daemon pod:

>.
>I0810 18:22:26.561149    3545 update.go:1436] Running rpm-ostree [kargs --append=skew_tick=1 --append=nohz=on --append=rcu_nocbs=1-3 --append=tuned.non_isolcpus=00000001 --append=intel_pstate=disable --append=nosoftlockup --append=tsc=nowatchdog --append=intel_iommu=on --append=iommu=pt --append=isolcpus=managed_irq,1-3 --append=systemd.cpu_affinity=0 --append=default_hugepagesz=1G --append=nmi_watchdog=0 --append=audit=0 --append=mce=off --append=processor.max_cstate=1 --append=idle=poll --append=intel_idle.max_cstate=0]
>I0810 18:22:41.652350    3545 update.go:1436] Initiating switch from kernel default to realtime
>I0810 18:22:41.659569    3545 update.go:1436] Switching to kernelType=realtime, invoking rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"]
>.
>E0810 18:22:42.655159    3545 writer.go:135] Marking Degraded due to: failed to execute rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"] : exit status 1
>I0810 18:22:42.686359    3545 update.go:324] Checking Reconcilable for config rendered-worker-e4af5d3dba1d64414516c4c5f235e054 to rendered-worker-cnf-3145023d5727e9f7b5478b734e43e0f4
>I0810 18:22:42.786650    3545 update.go:1436] Starting update from rendered-worker-e4af5d3dba1d64414516c4c5f235e054 to rendered-worker-cnf-3145023d5727e9f7b5478b734e43e0f4: &{osUpdate:false kargs:true fips:false passwd:false files:true units:true kernelType:true extensions:false}
>.
>I0810 18:23:11.736999    3545 update.go:1436] Running rpm-ostree [kargs --append=skew_tick=1 --append=nohz=on --append=rcu_nocbs=1-3 --append=tuned.non_isolcpus=00000001 --append=intel_pstate=disable --append=nosoftlockup --append=tsc=nowatchdog --append=intel_iommu=on --append=iommu=pt --append=isolcpus=managed_irq,1-3 --append=systemd.cpu_affinity=0 --append=default_hugepagesz=1G --append=nmi_watchdog=0 --append=audit=0 --append=mce=off --append=processor.max_cstate=1 --append=idle=poll --append=intel_idle.max_cstate=0]
>I0810 18:23:18.047962    3545 update.go:1436] Initiating switch from kernel default to realtime
>I0810 18:23:18.053693    3545 update.go:1436] Switching to kernelType=realtime, invoking rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"]
>.
>E0810 18:23:19.037960    3545 writer.go:135] Marking Degraded due to: failed to execute rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"] : exit status 1
>I0810 18:23:42.674926    3545 update.go:324] Checking Reconcilable for config rendered-worker-e4af5d3dba1d64414516c4c5f235e054 to rendered-worker-cnf-3145023d5727e9f7b5478b734e43e0f4
>I0810 18:23:42.756123    3545 update.go:1436] Starting update from rendered-worker-e4af5d3dba1d64414516c4c5f235e054 to rendered-worker-cnf-3145023d5727e9f7b5478b734e43e0f4: &{osUpdate:false kargs:true fips:false passwd:false files:true units:true kernelType:true extensions:false}
>.
>I0810 18:24:56.713134    3545 update.go:1436] Running rpm-ostree [kargs --append=skew_tick=1 --append=nohz=on --append=rcu_nocbs=1-3 --append=tuned.non_isolcpus=00000001 --append=intel_pstate=disable --append=nosoftlockup --append=tsc=nowatchdog --append=intel_iommu=on --append=iommu=pt --append=isolcpus=managed_irq,1-3 --append=systemd.cpu_affinity=0 --append=default_hugepagesz=1G --append=nmi_watchdog=0 --append=audit=0 --append=mce=off --append=processor.max_cstate=1 --append=idle=poll --append=intel_idle.max_cstate=0]
>I0810 18:25:02.634854    3545 update.go:1436] Initiating switch from kernel default to realtime
>I0810 18:25:02.640505    3545 update.go:1436] Switching to kernelType=realtime, invoking rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt->core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"]
>I0810 18:28:22.534336    3545 update.go:1436] initiating reboot: Node will reboot into config rendered-worker-cnf-3145023d5727e9f7b5478b734e43e0f4


What I see - MC added arguments -> failed with switching kernel -> added arguments for second time -> failed with kernel again -> added arguments for the third time -> successfully switched kernel and rebooted the node. As result my kernel arguments were added three times.

Comment 2 Artyom 2020-08-11 08:21:06 UTC

I did not check the cmdline before, because it happens under the CI run so I did not have a chance to check it, but do you know why do you have doubles and triples of kernel arguments?

Comment 3 Denys Shchedrivyi 2020-08-11 13:32:16 UTC

From the logs in comment #1 I see that MCO daemon added kernel arguments on every try to install RT kernel.. after third tries I got triplicated arguments

Comment 5 Sinny Kumari 2020-08-17 14:44:29 UTC

As Deny mentioned in comment #1, from the log we can see that kernelArguments applied successfully but switch to kernel-rt failed twice with error:

E0804 06:43:53.479144    2036 writer.go:135] Marking Degraded due to: failed to execute rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"] : exit status 1

This is a regression from recent rework on OS update where we missed handling rollback properly during failure of an rpm-ostree transaction.

Fix is on the way.

Comment 8 Sinny Kumari 2020-08-18 06:16:51 UTC

Fix should be available in nightlies now.

@Dany @Artyom Can you please let us know if kernelArguments are applied correctly in your CI setup?

Comment 9 Denys Shchedrivyi 2020-08-18 23:49:59 UTC

will keep an eye on it, unfortunately it was not happen on every run

Comment 10 Denys Shchedrivyi 2020-08-31 18:45:58 UTC

reproduced again, here the log message from the daemon pod:


I0831 17:27:55.988406    2155 update.go:1542] Running rpm-ostree [kargs --append=skew_tick=1 --append=nohz=on --append=rcu_nocbs=1-3 --append=tuned.non_isolcpus=00000001 --append=intel_pstate=disable --append=nosoftlockup --append=tsc=nowatchdog --append=intel_iommu=on --append=iommu=pt --append=isolcpus=managed_irq,1-3 --append=systemd.cpu_affinity=0 --append=default_hugepagesz=1G --append=nmi_watchdog=0 --append=audit=0 --append=mce=off --append=processor.max_cstate=1 --append=idle=poll --append=intel_idle.max_cstate=0]
I0831 17:28:07.054461    2155 update.go:1542] Initiating switch from kernel default to realtime
I0831 17:28:07.058866    2155 update.go:1542] Switching to kernelType=realtime, invoking rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"]
I0831 17:28:07.110342    2155 update.go:358] Rolling back applied changes to OS
I0831 17:28:07.110442    2155 rpm-ostree.go:261] Running captured: rpm-ostree cleanup -p
error: System transaction in progress
.
.
E0831 17:28:07.721421    2155 writer.go:135] Marking Degraded due to: error removing staged deployment: error running rpm-ostree cleanup -p: : exit status 1: failed to execute rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"] : exit status 1



@Sinny, tell me if you need more logs, I can try to get full artifacts from CI

Comment 11 Sinny Kumari 2020-09-02 11:49:18 UTC

By looking at the above logs, looks to me that MCO has applied kargs as expected but realtime switch fails (don't know why) and as a result MCO tried to rollback the applied changes but it fails maybe because another rpm-ostree transaction is still active.

yes, access to complete log will be needed to dig in further.

Comment 12 Denys Shchedrivyi 2020-09-02 17:13:44 UTC

Created attachment 1713486 [details]
mahine config daemon log

Attached full mc-daemon log

Comment 14 Sinny Kumari 2020-09-03 06:36:42 UTC

Thanks for providing the logs, didn't find the useful information I was looking for in the journal logs which is rpm-ostreed.service log during the time `rpm-ostree cleanup` failed. It seems ci keeps journal log for n and n-1 boot. I will try to figure out reproducing this behavior locally. Meanwhile, wondering if we can modify ci to capture journal logs from all the boots.

Comment 15 Sinny Kumari 2020-09-08 06:26:45 UTC

I can't reproduce rpm-ostree cleanup issue locally.

What I did to reproduce the problem is after kernel-rt switch, specified explicitly to do pending deployment cleanup. Diff of MCO changes:

```
diff --git a/pkg/daemon/update.go b/pkg/daemon/update.go
index a6cffbbc..f3ed0735 100644
--- a/pkg/daemon/update.go
+++ b/pkg/daemon/update.go
@@ -375,6 +375,11 @@ func (dn *Daemon) applyOSChanges(oldConfig, newConfig *mcfgv1.MachineConfig) (re
                return err
        }
 
+       if err := removePendingDeployment(); err != nil {
+               retErr = errors.Wrapf(retErr, "error removing staged deployment: %v", err)
+               return
+       }
+
        // Apply extensions
        if err := dn.applyExtensions(oldConfig, newConfig); err != nil {
                return err
```

Launched a cluster with day1 realtime MachineConifg, pending deployment did get cleanup fine during firstboot:

sh-4.4# journalctl -u rpm-ostreed |grep "rhcos successful"
Sep 08 05:37:41 sinny-asset-qmq2k-worker-a-jlgww.c.openshift-gce-devel.internal rpm-ostree[2046]: Txn UpdateDeployment on /org/projectatomic/rpmostree1/rhcos successful
Sep 08 05:39:41 sinny-asset-qmq2k-worker-a-jlgww.c.openshift-gce-devel.internal rpm-ostree[2046]: Txn UpdateDeployment on /org/projectatomic/rpmostree1/rhcos successful
Sep 08 05:39:42 sinny-asset-qmq2k-worker-a-jlgww.c.openshift-gce-devel.internal rpm-ostree[2046]: Txn Cleanup on /org/projectatomic/rpmostree1/rhcos successful
Sep 08 05:42:26 sinny-asset-qmq2k-worker-a-jlgww.c.openshift-gce-devel.internal rpm-ostree[9762]: Txn UpdateDeployment on /org/projectatomic/rpmostree1/rhcos successful


When it happen next time in CI, is it possible to give access to that cluster?

Comment 17 Sinny Kumari 2020-09-09 13:51:20 UTC

Thanks Denys for sharing reproducible cluster and logs.

What I observed is when we applied realtime and kernelarguments on the nodes, two of the nodes(worker-0, worker-2) had issues with applying realtime kernel. On failure worker-0 was successfully able to rollback the rpm-ostree related changes while worker-2 failed since as per rpm-ostree another transaction was already in progress.

Pasting below relevant logs from worker-0 and worker-2:

worker-0
--------

MCD log:
...
I0908 16:26:51.463443    3356 update.go:1542] Starting update from rendered-worker-ff2507f638ea359164e1d3b6386cadd3 to rendered-worker-cnf-79ee49acc24e9f1cc0be696c4d29ea2b: &{osUpdate:false kargs:true fips:false passwd:false files:true units:true kernelType:true extensions:false}
...
I0908 16:28:00.340382    3356 run.go:18] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-583474098 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:19ebd206b45fe385def2b15a576dab64d2ed58dccd4b0f74387110311388f031
I0908 16:28:27.868638    3356 update.go:1542] Running rpm-ostree [kargs --append=skew_tick=1 --append=nohz=on --append=rcu_nocbs=1-3 --append=tuned.non_isolcpus=00000001 --append=intel_pstate=disable --append=nosoftlockup --append=tsc=nowatchdog --append=intel_iommu=on --append=iommu=pt --append=isolcpus=managed_irq,1-3 --append=systemd.cpu_affinity=0 --append=default_hugepagesz=1G --append=hugepagesz=2M --append=hugepages=128 --append=nmi_watchdog=0 --append=audit=0 --append=mce=off --append=processor.max_cstate=1 --append=idle=poll --append=intel_idle.max_cstate=0]
I0908 16:28:41.705072    3356 update.go:1542] Initiating switch from kernel default to realtime
I0908 16:28:41.711720    3356 update.go:1542] Switching to kernelType=realtime, invoking rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"]
I0908 16:28:41.789103    3356 update.go:358] Rolling back applied changes to OS
I0908 16:28:41.789251    3356 rpm-ostree.go:261] Running captured: rpm-ostree cleanup -p
I0908 16:28:44.505318    3356 update.go:1373] Writing SSHKeys at "/home/core/.ssh/authorized_keys"
...

rpm-ostreed service log:
...
Sep 08 16:28:28 worker-0 rpm-ostree[39195]: Initiated txn KernelArgs for client(id:cli dbus:1.402 unit:crio-5be1e7088bb22198577df3814cc5525868648ddd724695eb3de17317a567dde1.scope uid:0): /org/projectatomic/rpmostree1/rhcos
 08 16:28:39 worker-0 rpm-ostree[39195]: Created new deployment /ostree/deploy/rhcos/deploy/a694d718a6c3f0fd0c367030f865ff2d85a2467853ef5de80be2d2eb77e98b56.1
Sep 08 16:28:41 worker-0 rpm-ostree[39195]: Txn KernelArgs on /org/projectatomic/rpmostree1/rhcos successful
Sep 08 16:28:41 worker-0 rpm-ostree[39195]: Initiated txn Cleanup for client(id:cli dbus:1.405 unit:crio-5be1e7088bb22198577df3814cc5525868648ddd724695eb3de17317a567dde1.scope uid:0): /org/projectatomic/rpmostree1/rhcos
Sep 08 16:28:43 worker-0 rpm-ostree[39195]: Txn Cleanup on /org/projectatomic/rpmostree1/rhcos successful


worker-2
---------

MCD log:
..
I0908 16:59:01.633512    3648 update.go:1542] Starting update from rendered-worker-ff2507f638ea359164e1d3b6386cadd3 to rendered-worker-cnf-79ee49acc24e9f1cc0be696c4d29ea2b: &{osUpdate:false kargs:true fips:false passwd:false files:true units:true kernelType:true extensions:false}
...
I0908 17:00:54.376819    3648 run.go:18] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-159144463 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:19ebd206b45fe385def2b15a576dab64d2ed58dccd4b0f74387110311388f031
I0908 17:01:42.046930    3648 update.go:1542] Running rpm-ostree [kargs --append=skew_tick=1 --append=nohz=on --append=rcu_nocbs=1-3 --append=tuned.non_isolcpus=00000001 --append=intel_pstate=disable --append=nosoftlockup --append=tsc=nowatchdog --append=intel_iommu=on --append=iommu=pt --append=isolcpus=managed_irq,1-3 --append=systemd.cpu_affinity=0 --append=default_hugepagesz=1G --append=hugepagesz=2M --append=hugepages=128 --append=nmi_watchdog=0 --append=audit=0 --append=mce=off --append=processor.max_cstate=1 --append=idle=poll --append=intel_idle.max_cstate=0]
I0908 17:01:55.033620    3648 update.go:1542] Initiating switch from kernel default to realtime
I0908 17:01:55.038773    3648 update.go:1542] Switching to kernelType=realtime, invoking rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"]
I0908 17:01:55.105279    3648 update.go:358] Rolling back applied changes to OS
I0908 17:01:55.105402    3648 rpm-ostree.go:261] Running captured: rpm-ostree cleanup -p
error: System transaction in progress
I0908 17:01:55.826785    3648 update.go:1373] Writing SSHKeys at "/home/core/.ssh/authorized_keys"
...



rpm-ostreed service log:
Sep 08 17:01:42 worker-2 rpm-ostree[107863]: Initiated txn KernelArgs for client(id:cli dbus:1.524 unit:crio-0ec57c7902750bb7aa343a39e489126a2371246b3a3ca376763ac2582d563385.scope uid:0): /org/projectatomic/rpmostree1/rhcos
Sep 08 17:01:53 worker-2 rpm-ostree[107863]: Created new deployment /ostree/deploy/rhcos/deploy/a694d718a6c3f0fd0c367030f865ff2d85a2467853ef5de80be2d2eb77e98b56.1
Sep 08 17:01:55 worker-2 rpm-ostree[107863]: Txn KernelArgs on /org/projectatomic/rpmostree1/rhcos successful
Sep 08 17:01:55 worker-2 rpm-ostree[107863]: In idle state; will auto-exit in 61 seconds
Sep 08 17:02:32 worker-2 rpm-ostree[107863]: Initiated txn KernelArgs for client(id:cli dbus:1.528 unit:crio-0ec57c7902750bb7aa343a39e489126a2371246b3a3ca376763ac2582d563385.scope uid:0): /org/projectatomic/rpmostree1/rhcos
Sep 08 17:02:36 worker-2 rpm-ostree[107863]: Created new deployment /ostree/deploy/rhcos/deploy/a694d718a6c3f0fd0c367030f865ff2d85a2467853ef5de80be2d2eb77e98b56.2
Sep 08 17:02:36 worker-2 rpm-ostree[107863]: sanitycheck(/usr/bin/true) successful
Sep 08 17:02:37 worker-2 rpm-ostree[107863]: Txn KernelArgs on /org/projectatomic/rpmostree1/rhcos successful


It seems rpm-ostreed service doesn't capture logs when a request is made to rpm-ostree while another system transaction is in progress.

We can solve this issue by keep retrying rpm-ostree cleanup operation until it succeeds.

@jonathan @colin MCD does all rpm-ostree related operations sequentially. Any idea what could be the reason for rpm-ostree saying "System transaction in progress" ?

Comment 18 Colin Walters 2020-09-09 14:16:04 UTC

My recollection is that before the MCO gained support for some things like kernel arguments that some other operator (outside of the base platform) started talking to rpm-ostree - and I pointed out it would introduce potential races exactly like this.

I don't recall where that code landed but it should be removed now.

We could enhance rpm-ostree to have support for automatic "wait until lock released" semantics which would be nice in some cases, but I think would mask the real underlying problem here.

Comment 19 Colin Walters 2020-09-09 14:22:12 UTC

The systemd journal will log which transaction and the client ID so that might help in narrowing down which other project is doing that.

Comment 20 Jonathan Lebon 2020-09-09 14:24:50 UTC

One thing I noticed here is that we seem to lose what the original error was. This might help debug why it seems like rpm-ostree is busy. Opened https://github.com/openshift/machine-config-operator/pull/2067.

Comment 21 Jonathan Lebon 2020-09-09 16:57:19 UTC

So there are four issues here:
1. `override remove` is presumably failing
2. we're swallowing the `override remove` error because `cleanup -p` also fails (this is fixed by https://github.com/openshift/machine-config-operator/pull/2067)
3. `cleanup -p` is failing because something is keeping the daemon busy
4. we don't have any logging of what else is keeping the daemon busy (this is fixed by https://github.com/openshift/machine-config-operator/pull/2068)

So there's still issues 1 and 3 present.

How about: let's merge those two PRs, keep this RHBZ open, and debug more once we have those since we should have more information by then?

Comment 22 Colin Walters 2020-09-09 18:51:25 UTC

Found it - does the operator have this change https://github.com/openshift-kni/performance-addon-operators/commit/a5341110f44ebe5cb5f0112434688a0cc2f4dd88 ?

Comment 24 Sinny Kumari 2020-09-10 07:00:19 UTC

Moving it to POST to include fixes from https://github.com/openshift/machine-config-operator/pull/2067

Comment 26 Sinny Kumari 2020-09-17 11:45:27 UTC

Moving this bug to assigned because issue is still not resolved.

Denis reported that CI is seeing the failure:

I0916 09:51:06.896151    1983 update.go:361] Rolling back applied changes to OS due to error: failed to execute rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"] : exit status 1
I0916 09:51:06.896252    1983 rpm-ostree.go:261] Running captured: rpm-ostree cleanup -p
error: System transaction in progress.

A new PR is on the way to capture complete rpm-ostree output during error. Hopefully that will help us to know why kernel-rt switch is failing.

Comment 27 Michelle Krejci 2020-09-21 17:12:56 UTC

Moving to 4.7. The team will continue to look for ways to debug this issue in hopes of being able to ship something in 4.6 but this is not a blocker.

Comment 28 Yu Qi Zhang 2020-09-21 18:34:41 UTC

As an aside: from Colin's comment https://bugzilla.redhat.com/show_bug.cgi?id=1865839#c22 could someone make sure that PR is in place? It sounds like there could be a race between multiple operators calling rpm-ostree still

Comment 29 Artyom 2020-09-22 07:22:50 UTC

Yes the performance operator has this change, but the change was merged before the issue start to happen and I also unsure how does it relevant to the problem, the performance-addon-operator never calls to the rpm-ostree. The performance-addon-operator creates bunch of objects like machine-config and tuned profile under the cluster and all future actions will be done by controllers responsible for these objects.

Jiri does the tuned controller calls to rpm-ostree commands?

Comment 30 Jiří Mencák 2020-09-22 08:40:44 UTC

(In reply to Artyom from comment #29)
> Yes the performance operator has this change, but the change was merged
> before the issue start to happen and I also unsure how does it relevant to
> the problem, the performance-addon-operator never calls to the rpm-ostree.
> The performance-addon-operator creates bunch of objects like machine-config
> and tuned profile under the cluster and all future actions will be done by
> controllers responsible for these objects.
> 
> Jiri does the tuned controller calls to rpm-ostree commands?

Artyom, Node Tuning Operator (NTO) that controlls the Tuned operands does not call
rpm-ostree commands.  It only creates MachineConfigs on request.

Comment 31 Sinny Kumari 2020-10-08 16:00:43 UTC

From another journal log provided by Denys, it appears that only machine-config-operator is calling rpm-ostree after the first boot(id:cli is from firstboot)

#  grep -i  "rpm-ostree" j | egrep -o 'client\(id:[^ ]+ '  |  uniq
client(id:cli 
client(id:machine-config-operator 

Curious that why `rpm-ostree cleanup -p` are failing then.

Comment 33 Denys Shchedrivyi 2020-10-09 19:02:49 UTC

 Here the latest logs:

I1009 16:30:24.431641    2447 update.go:1558] Initiating switch from kernel default to realtime
I1009 16:30:24.436755    2447 update.go:1558] Switching to kernelType=realtime, invoking rpm-ostree ["override" "remove" "kernel" "kernel-core" "kernel-modules" "kernel-modules-extra" "--install" "kernel-rt-core" "--install" "kernel-rt-modules" "--install" "kernel-rt-modules-extra" "--install" "kernel-rt-kvm"]
I1009 16:30:24.441612    2447 rpm-ostree.go:261] Running captured: rpm-ostree override remove kernel kernel-core kernel-modules kernel-modules-extra --install kernel-rt-core --install kernel-rt-modules --install kernel-rt-modules-extra --install kernel-rt-kvm
I1009 16:30:24.505583    2447 update.go:361] Rolling back applied changes to OS due to error: error running rpm-ostree override remove kernel kernel-core kernel-modules kernel-modules-extra --install kernel-rt-core --install kernel-rt-modules --install kernel-rt-modules-extra --install kernel-rt-kvm: error: System transaction in progress
: exit status 1


 Also I've attached mco daemon log and journal from the node: logs.tgz

Comment 34 Denys Shchedrivyi 2020-10-09 19:03:42 UTC

Created attachment 1720330 [details]
mco daemon log and journal from the node

Comment 36 Sinny Kumari 2020-10-12 14:28:55 UTC

From the latest log provided, it is clear that both cleanup and kernel-rt switch are failing because System transaction in progress. From provided journal log, I don't see any clue of specious activity in rpm-ostreed service.

Applying above rpm-ostree kargs and switching to kernel-rt usling cli works well. Also, applying kernel-rt and kargs on a local 4.7 cluster in same MC completes successfully for both worker and master. 

ind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: worker-1
spec:
  config:
    ignition:
      version: 3.0.0
  kernelArguments:
    - skew_tick=1
    - nohz=on
    - rcu_nocbs=1-3
    - tuned.non_isolcpus=00000001
    - intel_pstate=disable
    - nosoftlockup
    - tsc=nowatchdog
    - intel_iommu=on
    - iommu=pt
    - isolcpus=managed_irq,1-3
    - systemd.cpu_affinity=0
    - default_hugepagesz=1G
    - hugepagesz=2M
    - hugepages=128
    - nmi_watchdog=0
    - audit=0
    - mce=off
    - processor.max_cstate=1
    - idle=poll
    - intel_idle.max_cstate=0
  kernelType: realtime

@Colin, @Jonathan Can we do/add any additional debugging to know why rpm-ostree says another system transaction is in progress?

Also, Wondering if there is a way we can wait here on finishing system transaction other than grepping the error message?

Comment 37 Colin Walters 2020-10-12 16:46:50 UTC

That error means that the ostree sysroot is locked which could be because something else is using ostree directly (not via rpm-ostree).  

One possibility is that something else is trying to reboot the node, which would cause ostree-finalize-staged.service to start which takes the write lock.

It's also *maybe* possible that somehow the daemon isn't releasing the write lock before returning to the client...hm yeah this is messy, we release the write lock when the client's dbus connection is closed.  But that should happen because the MCO waits synchronously for the client process to exit, and the bus connection should be closed I think.

The fact that both the attempt to switch and the rollback failed points strongly towards something else on the system doing something w/ostree I'd say.

Comment 38 Sinny Kumari 2020-10-14 14:17:53 UTC

Each time I look into logs from different cluster where we hit the problem, don't find anything unusual.

From one of the latest cluster provided by Denys, checked service logs of all other ostree and rpm-ostree related service(ostree-finalize-staged.service, ostree-remount.service, ostree-prepare-root.servicerpm-ostree-bootstatus.service, rpm-ostreed-automatic.service, rpm-ostreed.service) and there is nothing unusual. Came across related upstream bug https://github.com/coreos/rpm-ostree/issues/1692 , nothing in journal logs which indicates similarity.

Moving this bug to RHCOS as there could be a bug in either ostree or rpm-ostree. If required, as  a workaround we can add a fix in MCO to wait if rpm-ostree returns "System transaction is in progress"

Comment 39 Jonathan Lebon 2020-10-14 16:32:21 UTC

> It's also *maybe* possible that somehow the daemon isn't releasing the write lock before returning to the client...hm yeah this is messy, we release the write lock when the client's dbus connection is closed.  But that should happen because the MCO waits synchronously for the client process to exit, and the bus connection should be closed I think.

I did look at this a few weeks back when investigating this bug and even tried adding a bunch of `sleep()`s in some of the places where I thought we were susceptible to this (notably, right before we `ostree_sysroot_unlock`), but couldn't reproduce it. Still, it's possible it's more subtle than that.

Comment 40 Colin Walters 2020-10-14 17:28:12 UTC

https://github.com/coreos/rpm-ostree/pull/2255/

Comment 41 Colin Walters 2020-10-22 22:18:02 UTC

This one is probably going to get into RHEL 8.3 / OpenShift 4.7 first and we'll see if these tests are reproducible there.

That said if we did a scratch build for 4.6 would you be able to try it out?

Comment 42 Micah Abbott 2020-10-25 18:39:36 UTC

Work continues on this issue; labeling UpcomingSprint

Comment 43 Colin Walters 2020-11-11 21:40:40 UTC

OK I pushed a test release payload with the latest rpm-ostree:

registry.svc.ci.openshift.org/cgwalters/test-rpmostree-4.7@sha256:bf59834a03388f3aba6a3785fad66eab7ab6c0b5ea6ed45faeffb6db9b157042

Can you test that out?

Comment 44 Denys Shchedrivyi 2020-11-13 15:53:45 UTC

 Hm, I can't install RT kernel on your build:

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-11-11-150919   True        False         14h     Cluster version is 4.7.0-0.nightly-2020-11-11-150919

$ rpm-ostree --version
rpm-ostree:
 Version: '2020.7'


E1113 00:46:05.603931    3636 writer.go:135] Marking Degraded due to: error running rpm-ostree override remove kernel kernel-core kernel-modules kernel-modules-extra --install kernel-rt-core --install kernel-rt-modules --install kernel-rt-modules-extra --install kernel-rt-kvm: Checking out tree 859d427...done
Enabled rpm-md repositories: coreos-extensions
rpm-md repo 'coreos-extensions' (cached); generated: 2020-11-11T18:27:37Z
Importing rpm-md...done
Resolving dependencies...done
error: Could not depsolve transaction; 4 problems detected:
 Problem 1: conflicting requests
  - nothing provides linux-firmware >= 20200619-99.git3890db36 needed by kernel-rt-core-4.18.0-240.1.1.rt7.55.el8_3.x86_64
 Problem 2: package kernel-rt-modules-extra-4.18.0-240.1.1.rt7.55.el8_3.x86_64 requires kernel-rt-uname-r = 4.18.0-240.1.1.rt7.55.el8_3.x86_64, but none of the providers can be installed
  - conflicting requests
  - nothing provides linux-firmware >= 20200619-99.git3890db36 needed by kernel-rt-core-4.18.0-240.1.1.rt7.55.el8_3.x86_64
 Problem 3: package kernel-rt-modules-4.18.0-240.1.1.rt7.55.el8_3.x86_64 requires kernel-rt-uname-r = 4.18.0-240.1.1.rt7.55.el8_3.x86_64, but none of the providers can be installed
  - conflicting requests
  - nothing provides linux-firmware >= 20200619-99.git3890db36 needed by kernel-rt-core-4.18.0-240.1.1.rt7.55.el8_3.x86_64
 Problem 4: package kernel-rt-kvm-4.18.0-240.1.1.rt7.55.el8_3.x86_64 requires kernel-rt = 4.18.0-240.1.1.rt7.55.el8_3, but none of the providers can be installed
  - conflicting requests
  - nothing provides linux-firmware >= 20200619-99.git3890db36 needed by kernel-rt-core-4.18.0-240.1.1.rt7.55.el8_3.x86_64
: exit status 1

Comment 48 Colin Walters 2020-12-08 13:57:38 UTC

OK right it's actually worse because 

> cluster.go:141: rpm-ostree: symbol lookup error: /lib64/librpmostree-1.so.1: undefined symbol: archive_read_support_filter_zstd

Which happens because it was built in the 8.4 buildroot which has a newer libarchive among other things.

I think probably the sanest thing is to do an 8.3.z build and tag that.

Comment 50 Micah Abbott 2020-12-09 15:24:28 UTC

(In reply to Artyom from comment #49)
> I can see that both Jira specified under the comment
> https://bugzilla.redhat.com/show_bug.cgi?id=1865839#c47 resolved, so what
> else do we need to make it available under the RHCoS 4.7?

See comment #48; we've encountered additional issues with the version of rpm-ostree that was originally tagged in.  We are curently working to resolve those problems.

See https://bugzilla.redhat.com/show_bug.cgi?id=1894061 and https://bugzilla.redhat.com/show_bug.cgi?id=1894062 to follow progress.

Comment 51 Francesco Romani 2020-12-10 10:00:13 UTC

bumping priority because we are hitting this bug quite a lot recently while working with the performance-addon operator

Comment 53 Micah Abbott 2020-12-10 15:39:14 UTC

(In reply to Francesco Romani from comment #51)
> bumping priority because we are hitting this bug quite a lot recently while
> working with the performance-addon operator

I understand this is impacting multiple teams, but unfortunately bumping the priority does not make builds go faster or paperwork get done quicker.  :)

The team is actively trying to provide resolution to this issue and this BZ will be updated once there is progress to note.

Comment 56 Micah Abbott 2020-12-18 20:52:33 UTC

This should be addressed with `rpm-ostree-2020.7-1.el8_3.x86_64`

The package will be included in forthcoming RHCOS builds via:

https://github.com/openshift/os/pull/475
https://gitlab.cee.redhat.com/coreos/redhat-coreos/-/merge_requests/1203

Comment 58 Micah Abbott 2021-01-15 20:36:30 UTC

Higher priority work has prevented from this issue being solved; adding UpcomingSprint keyword

Comment 59 Micah Abbott 2021-01-21 17:25:10 UTC

the RHCOS 4.7 boot image was recently bumped in the installer - https://github.com/openshift/installer/pull/4540

This includes the fix for this issue, so moving to MODIFIED

Comment 63 Denys Shchedrivyi 2021-01-28 17:24:49 UTC

Looks good, tested it tens times and never seen this issue anymore.

Comment 66 errata-xmlrpc 2021-02-24 15:14:01 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Note You need to log in before you can comment on or make changes to this bug.