Created attachment 1743225 [details]
Machine config daemon logs.

Description of problem:

When updating ocp-4.6 (4.6.9) to 4.7.0-fc.0-x86_64, the mcp fails to update the worker node when it tries to apply the machineconfig.

Snippet from command:

oc logs -f -n openshift-machine-config-operator machine-config-daemon-98nbw -c machine-config-daemon

I1230 12:41:41.537595   17256 update.go:1844] Running rpm-ostree [kargs --delete=skew_tick=1 --delete=nohz=on --delete=rcu_nocbs=1-11 --delete=tuned.non_isolcpus=fffff001 --delete=intel_pstate=disable --delete=nosoftlockup --delete=tsc=nowatchdog --delete=intel_iommu=on --delete=iommu=pt --delete=isolcpus=managed_irq,1-11 --delete=systemd.cpu_affinity=0,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31 --delete=default_hugepagesz=1G --delete=hugepagesz=2M --delete=hugepages=0 --delete=+ --append=skew_tick=1 --append=nohz=on --append=rcu_nocbs=1-11 --append=tuned.non_isolcpus=fffff001 --append=intel_pstate=disable --append=nosoftlockup --append=tsc=nowatchdog --append=intel_iommu=on --append=iommu=pt --append=isolcpus=managed_irq,1-11 --append=systemd.cpu_affinity=0,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31 --append=default_hugepagesz=1G --append=hugepagesz=2M --append=hugepages=0 --append=+]
I1230 12:41:41.543543   17256 rpm-ostree.go:261] Running captured: rpm-ostree kargs --delete=skew_tick=1 --delete=nohz=on --delete=rcu_nocbs=1-11 --delete=tuned.non_isolcpus=fffff001 --delete=intel_pstate=disable --delete=nosoftlockup --delete=tsc=nowatchdog --delete=intel_iommu=on --delete=iommu=pt --delete=isolcpus=managed_irq,1-11 --delete=systemd.cpu_affinity=0,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31 --delete=default_hugepagesz=1G --delete=hugepagesz=2M --delete=hugepages=0 --delete=+ --append=skew_tick=1 --append=nohz=on --append=rcu_nocbs=1-11 --append=tuned.non_isolcpus=fffff001 --append=intel_pstate=disable --append=nosoftlockup --append=tsc=nowatchdog --append=intel_iommu=on --append=iommu=pt --append=isolcpus=managed_irq,1-11 --append=systemd.cpu_affinity=0,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31 --append=default_hugepagesz=1G --append=hugepagesz=2M --append=hugepages=0 --append=+
I1230 12:43:37.967322   17256 update.go:1844] Initiating switch from kernel realtime to realtime
I1230 12:43:37.973068   17256 update.go:1844] Updating rt-kernel packages on host: []
I1230 12:43:37.978195   17256 rpm-ostree.go:261] Running captured: rpm-ostree
I1230 12:43:38.012525   17256 update.go:437] Rolling back applied changes to OS due to error: error running rpm-ostree : Usage:
  rpm-ostree [OPTION?] COMMAND

Builtin Commands:
  compose          Commands to compose a tree
  cleanup          Clear cached/pending data
  db               Commands to query the RPM database
  deploy           Deploy a specific commit
  rebase           Switch to a different tree
  rollback         Revert to the previously booted tree
  status           Get the version of the booted system
  upgrade          Perform a system upgrade
  reload           Reload configuration
  usroverlay       Apply a transient overlayfs to /usr
  cancel           Cancel an active transaction
  initramfs        Enable or disable local initramfs regeneration
  install          Overlay additional packages
  uninstall        Remove overlayed additional packages
  override         Manage base package overrides
  reset            Remove all mutations
  refresh-md       Generate rpm repo metadata
  kargs            Query or modify kernel arguments

Version-Release number of selected component (if applicable):
OCP4.7

How reproducible:

1. Set up ocp-4.6 (with 3 master and 3 worker nodes)

NAME                               STATUS                     ROLES               AGE   VERSION
ocp46-master-0.demo.lab.mniranja   Ready                      master              25h   v1.20.0+87544c5
ocp46-master-1.demo.lab.mniranja   Ready                      master              25h   v1.20.0+87544c5
ocp46-master-2.demo.lab.mniranja   Ready                      master              25h   v1.20.0+87544c5
ocp46-worker-0.demo.lab.mniranja   Ready,SchedulingDisabled   worker,worker-cnf   25h   v1.19.0+7070803
ocp46-worker-1.demo.lab.mniranja   Ready                      worker,worker-cnf   25h   v1.19.0+7070803
ocp46-worker-2.demo.lab.mniranja   Ready                      worker              25h   v1.20.0+87544c5

2. Set up the performance operator

[root@dell-r730-009 ~]# oc get performanceprofile
NAME        AGE
hugepages   23h

<profile>
apiVersion: performance.openshift.io/v1
kind: PerformanceProfile
metadata:
  name: hugepages
spec:
  cpu:
    reserved: "0"
    isolated: "1-11"
  hugepages:
    defaultHugepagesSize: "1G"
    pages:
    - size: "1G"
      count: 1
      node: 0
    - size: "2M"
      count: 2
      node: 1
  realTimeKernel:
    enabled: True
  numa:
    topologyPolicy: "single-numa-node"
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
</snip>

3. Update the ocp-4.6.9 cluster to 4.7 using the command below:

$ oc adm upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:2419f9cd3ea9bd114764855653012e305ade2527210d332bfdd6dbdae538bd66 --allow-explicit-upgrade --allow-upgrade-with-warnings --force

Actual results:

[root@dell-r730-009 installation]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.9     True        True          4h37m   Unable to apply 4.7.0-fc.0: the cluster operator ingress is degraded

Expected results:

The update should be successful.

Additional info:

Node status:

[root@dell-r730-009 installation]# oc describe nodes/ocp46-worker-0.demo.lab.mniranja
Name:               ocp46-worker-0.demo.lab.mniranja
Roles:              worker,worker-cnf
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ocp46-worker-0.demo.lab.mniranja
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node-role.kubernetes.io/worker-cnf=
                    node.openshift.io/os_id=rhcos
Annotations:        machineconfiguration.openshift.io/currentConfig: rendered-worker-cnf-6f46c1fcbab861dab73628e1a55792b6
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-cnf-d52bbbdfad54247dee6a1061a7522c2e
                    machineconfiguration.openshift.io/reason:
                      error running rpm-ostree : Usage:
                        rpm-ostree [OPTION?] COMMAND

                      Builtin Commands:
                        compose          Commands to compose a tree
                        cleanup          Clear cached/pending data
                        db               Commands to query the RPM database
                        deploy           Deploy a specific commit
                        rebase           Switch to a different tree
                        rollback         Revert to the previously booted tree
                        status           Get the version of the booted system
                        upgrade          Perform a system upgrade
                        reload           Reload configuration
                        usroverlay       Apply a transient overlayfs to /usr
                        cancel           Cancel an active transaction
                        initramfs        Enable or disable local initramfs regeneration
                        install          Overlay additional packages
                        uninstall        Remove overlayed additional packages
                        override         Manage base package overrides
                        reset            Remove all mutations
                        refresh-md       Generate rpm repo metadata
                        kargs            Query or modify kernel arguments

                      Help Options:
                        -h, --help       Show help options

                      Application Options:
                        --version        Print version information and exit

                      error: No command specified
                      : exit status 1
                    machineconfiguration.openshift.io/state: Degraded
                    volumes.kubernetes.io/controller-managed-attach-detach: true

Status of machineconfig pods:

[root@dell-r730-009 installation]# oc get pods
NAME                                      READY   STATUS    RESTARTS   AGE
machine-config-controller-5c97bd58-t5pz2  1/1     Running   0          125m
machine-config-daemon-45vf8               2/2     Running   0          140m
machine-config-daemon-674jq               2/2     Running   0          140m
machine-config-daemon-98nbw               2/2     Running   0          141m
machine-config-daemon-h6brw               2/2     Running   0          140m
machine-config-daemon-wqtwc               2/2     Running   0          140m
machine-config-daemon-xj272               2/2     Running   0          140m
machine-config-operator-9457979b9-4x574   1/1     Running   0          125m
machine-config-server-fhmzt               1/1     Running   0          137m
machine-config-server-wq6xh               1/1     Running   0          137m
machine-config-server-xlbxq               1/1     Running   0          137m
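For anyone triaging a similar failure, the node annotations in the `oc describe` output above carry the key facts: the current and desired rendered configs differ, and the state is Degraded. A minimal sketch of reading them, assuming the annotations have already been loaded into a plain dict (the helper name `summarize_mcd_state` is made up for illustration and is not part of any OpenShift tooling; the reason text is abbreviated with "..."):

```python
# Annotation prefix as it appears in the `oc describe node` output of this report.
PREFIX = "machineconfiguration.openshift.io/"


def summarize_mcd_state(annotations):
    """Summarize the machine-config-daemon annotations of one node.

    Hypothetical helper for illustration only.
    """
    current = annotations.get(PREFIX + "currentConfig", "")
    desired = annotations.get(PREFIX + "desiredConfig", "")
    state = annotations.get(PREFIX + "state", "")
    reason = annotations.get(PREFIX + "reason", "")
    return {
        # The node has not reached the rendered config it was asked to apply.
        "pending_update": current != desired,
        "degraded": state == "Degraded",
        # Keep only the first line; in this report the rest is usage text.
        "reason": reason.splitlines()[0] if reason else "",
    }


# Values copied from the report; the reason is shortened with "...".
annotations = {
    PREFIX + "currentConfig": "rendered-worker-cnf-6f46c1fcbab861dab73628e1a55792b6",
    PREFIX + "desiredConfig": "rendered-worker-cnf-d52bbbdfad54247dee6a1061a7522c2e",
    PREFIX + "reason": "error running rpm-ostree : Usage:\n...\nerror: No command specified\n: exit status 1",
    PREFIX + "state": "Degraded",
}

summary = summarize_mcd_state(annotations)
print(summary)
```

Here the summary flags both a pending update and a degraded state, which matches the SchedulingDisabled worker shown in the node list.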
Created attachment 1743227 [details] Machine config profile
Without a must-gather, I can't say for sure what the problem is. However, the MCO configuration is:

Kernel Arguments: skew_tick=1 nohz=on rcu_nocbs=1-11 tuned.non_isolcpus=fffff001 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,1-11 systemd.cpu_affinity=0,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31 default_hugepagesz=1G hugepagesz=2M hugepages=0 +
Kernel Type: realtime

That extra "+" is invalid. Please remove the "+" and confirm whether the issue still happens; if it does, please attach a must-gather.
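A quick way to spot a stray token like that "+" is to check each argument against a simple pattern. This is only a rough sketch: the rule used here (a name made of letters, digits, "_", "-" and ".", optionally followed by "=value") is an assumption for illustration, not the kernel's or the MCO's actual validation, and `suspicious_kargs` is a made-up helper name.

```python
import re

# Assumed shape of a kernel argument for this sketch: "name" or "name=value".
# This is an illustrative rule, not an official grammar.
_KARG_RE = re.compile(r"^[A-Za-z0-9_.\-]+(=.*)?$")


def suspicious_kargs(kargs):
    """Return the whitespace-separated tokens that do not look like kernel arguments."""
    return [arg for arg in kargs.split() if not _KARG_RE.match(arg)]


# Kernel arguments as listed in the MCO configuration above.
kargs = (
    "skew_tick=1 nohz=on rcu_nocbs=1-11 tuned.non_isolcpus=fffff001 "
    "intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt "
    "isolcpus=managed_irq,1-11 "
    "systemd.cpu_affinity=0,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31 "
    "default_hugepagesz=1G hugepagesz=2M hugepages=0 +"
)

print(suspicious_kargs(kargs))  # prints ['+']
```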
Adding keyword UpgradeBlocker and setting blocker? flag for triage.
Changing target release to 4.7.0 as the BZ is blocking upgrade.
Could you please add the blocker+ flag?
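This is a regression from https://github.com/openshift/machine-config-operator/commit/adb9e707b0d0170d741cc40dec2eaa73aa201415#diff-349c0748c3f52201852d3027c29daf618b45300b949482728df6666a3f9ba245R963 where the `update` subcommand got lost: `rpm-ostree update args...` became `rpm-ostree args...`. With an empty argument list, as in the realtime-to-realtime switch in the logs, that runs bare `rpm-ostree`, which fails with "No command specified".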
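To make the failure mode concrete, here is a small sketch of how a dropped subcommand turns into the bare `rpm-ostree` invocation seen in the daemon log. It is not the MCO code (which is Go); `build_command` is a hypothetical helper, and the argument list is treated as opaque so no rpm-ostree flags are assumed.

```python
def build_command(extra_args, subcommand=None):
    """Build an argv list for rpm-ostree.

    Hypothetical helper for illustration: `subcommand` is optional only to
    show what happens when it is left out.
    """
    argv = ["rpm-ostree"]
    if subcommand is not None:
        argv.append(subcommand)
    return argv + list(extra_args)


# The log shows "Updating rt-kernel packages on host: []", i.e. no extra args.
extra_args = []

# Subcommand dropped: only the binary name remains, so rpm-ostree prints its
# usage text and exits with "No command specified".
broken = build_command(extra_args)

# Subcommand kept: a well-formed invocation even with no extra args.
fixed = build_command(extra_args, "update")

print(" ".join(broken))
print(" ".join(fixed))
```

A cheap guard against this class of bug would be to refuse to run any command whose argv has fewer than two elements, though whether that fits the daemon's design is a separate question.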
Successfully upgraded from 4.6.9 to 4.7.0-0.nightly-2021-01-19-095812 with RT kernel.

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.9     True        False         42m     Cluster version is 4.6.9

$ oc get node
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-128-71.us-west-2.compute.internal    Ready    master   61m   v1.19.0+7070803
ip-10-0-149-13.us-west-2.compute.internal    Ready    worker   48m   v1.19.0+7070803
ip-10-0-175-147.us-west-2.compute.internal   Ready    master   57m   v1.19.0+7070803
ip-10-0-175-54.us-west-2.compute.internal    Ready    worker   52m   v1.19.0+7070803
ip-10-0-213-88.us-west-2.compute.internal    Ready    worker   48m   v1.19.0+7070803
ip-10-0-220-222.us-west-2.compute.internal   Ready    master   57m   v1.19.0+7070803

$ oc debug node/ip-10-0-175-54.us-west-2.compute.internal
Starting pod/ip-10-0-175-54us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# ls
bin  boot  dev  etc  home  lib  lib64  media  mnt  opt  ostree  proc  root  run  sbin  srv  sys  sysroot  tmp  usr  var
sh-4.4# ls
bin  boot  dev  etc  home  lib  lib64  media  mnt  opt  ostree  proc  root  run  sbin  srv  sys  sysroot  tmp  usr  var
sh-4.4#
sh-4.4#
sh-4.4# ls
bin  boot  dev  etc  home  lib  lib64  media  mnt  opt  ostree  proc  root  run  sbin  srv  sys  sysroot  tmp  usr  var
sh-4.4# uname -a
Linux ip-10-0-175-54 4.18.0-193.28.1.rt13.77.el8_2.x86_64 #1 SMP PREEMPT RT Fri Oct 16 14:11:07 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
sh-4.4# rpm -q | grep kernel*
rpm: no arguments given for query
sh-4.4# rpm -qa | grep kernel
kernel-rt-modules-extra-4.18.0-193.28.1.rt13.77.el8_2.x86_64
kernel-rt-modules-4.18.0-193.28.1.rt13.77.el8_2.x86_64
kernel-rt-kvm-4.18.0-193.28.1.rt13.77.el8_2.x86_64
kernel-rt-core-4.18.0-193.28.1.rt13.77.el8_2.x86_64
sh-4.4#
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...

$ watch oc get clusterversion

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-01-19-095812   True        False         10m     Cluster version is 4.7.0-0.nightly-2021-01-19-095812

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.0-0.nightly-2021-01-19-095812   True        False         False      4s
baremetal                                  4.7.0-0.nightly-2021-01-19-095812   True        False         False      54m
cloud-credential                           4.7.0-0.nightly-2021-01-19-095812   True        False         False      151m
cluster-autoscaler                         4.7.0-0.nightly-2021-01-19-095812   True        False         False      145m
config-operator                            4.7.0-0.nightly-2021-01-19-095812   True        False         False      147m
console                                    4.7.0-0.nightly-2021-01-19-095812   True        False         False      24m
csi-snapshot-controller                    4.7.0-0.nightly-2021-01-19-095812   True        False         False      18m
dns                                        4.7.0-0.nightly-2021-01-19-095812   True        False         False      146m
etcd                                       4.7.0-0.nightly-2021-01-19-095812   True        False         False      144m
image-registry                             4.7.0-0.nightly-2021-01-19-095812   True        False         False      138m
ingress                                    4.7.0-0.nightly-2021-01-19-095812   True        False         False      137m
insights                                   4.7.0-0.nightly-2021-01-19-095812   True        False         False      147m
kube-apiserver                             4.7.0-0.nightly-2021-01-19-095812   True        False         False      142m
kube-controller-manager                    4.7.0-0.nightly-2021-01-19-095812   True        False         False      145m
kube-scheduler                             4.7.0-0.nightly-2021-01-19-095812   True        False         False      143m
kube-storage-version-migrator              4.7.0-0.nightly-2021-01-19-095812   True        False         False      15m
machine-api                                4.7.0-0.nightly-2021-01-19-095812   True        False         False      143m
machine-approver                           4.7.0-0.nightly-2021-01-19-095812   True        False         False      146m
machine-config                             4.7.0-0.nightly-2021-01-19-095812   True        False         False      13m
marketplace                                4.7.0-0.nightly-2021-01-19-095812   True        False         False      17m
monitoring                                 4.7.0-0.nightly-2021-01-19-095812   True        False         False      13m
network                                    4.7.0-0.nightly-2021-01-19-095812   True        False         False      37m
node-tuning                                4.7.0-0.nightly-2021-01-19-095812   True        False         False      52m
openshift-apiserver                        4.7.0-0.nightly-2021-01-19-095812   True        False         False      22m
openshift-controller-manager               4.7.0-0.nightly-2021-01-19-095812   True        False         False      50m
openshift-samples                          4.7.0-0.nightly-2021-01-19-095812   True        False         False      52m
operator-lifecycle-manager                 4.7.0-0.nightly-2021-01-19-095812   True        False         False      146m
operator-lifecycle-manager-catalog         4.7.0-0.nightly-2021-01-19-095812   True        False         False      146m
operator-lifecycle-manager-packageserver   4.7.0-0.nightly-2021-01-19-095812   True        False         False      18m
service-ca                                 4.7.0-0.nightly-2021-01-19-095812   True        False         False      147m
storage                                    4.7.0-0.nightly-2021-01-19-095812   True        False         False      23m

$ oc debug node/ip-10-0-175-54.us-west-2.compute.internal
Starting pod/ip-10-0-175-54us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# uname -a
Linux ip-10-0-175-54 4.18.0-240.10.1.rt7.64.el8_3.x86_64 #1 SMP PREEMPT_RT Wed Dec 16 08:22:01 EST 2020 x86_64 x86_64 x86_64 GNU/Linux
sh-4.4# rpm -qa | grep kernel
kernel-rt-modules-extra-4.18.0-240.10.1.rt7.64.el8_3.x86_64
kernel-rt-modules-4.18.0-240.10.1.rt7.64.el8_3.x86_64
kernel-rt-kvm-4.18.0-240.10.1.rt7.64.el8_3.x86_64
kernel-rt-core-4.18.0-240.10.1.rt7.64.el8_3.x86_64
sh-4.4#
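The before/after check in this verification can be scripted. A minimal sketch, assuming the `uname -a` output has been captured as a string; the two heuristics (an ".rtN." component in the kernel release and a PREEMPT RT / PREEMPT_RT marker) are taken from the two outputs above and are not an official test for a realtime kernel. The helper names are made up.

```python
import re


def kernel_release(uname_output):
    """Return the kernel release, the third field of `uname -a` output."""
    return uname_output.split()[2]


def looks_realtime(uname_output):
    """Heuristic check based on the two uname outputs in this report."""
    release = kernel_release(uname_output)
    has_rt_release = re.search(r"\.rt\d+\.", release) is not None
    has_preempt_rt = "PREEMPT RT" in uname_output or "PREEMPT_RT" in uname_output
    return has_rt_release and has_preempt_rt


# uname -a before and after the upgrade, copied from the debug sessions above.
before = ("Linux ip-10-0-175-54 4.18.0-193.28.1.rt13.77.el8_2.x86_64 #1 SMP PREEMPT RT "
          "Fri Oct 16 14:11:07 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux")
after = ("Linux ip-10-0-175-54 4.18.0-240.10.1.rt7.64.el8_3.x86_64 #1 SMP PREEMPT_RT "
         "Wed Dec 16 08:22:01 EST 2020 x86_64 x86_64 x86_64 GNU/Linux")

for label, output in (("before", before), ("after", after)):
    print(label, kernel_release(output), looks_realtime(output))
```

Both outputs pass the heuristic, and the release string changes from el8_2 to el8_3, which is what the verification is meant to show: the RT kernel was updated rather than lost.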
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
Gowrishankar added UpgradeBlocker in comment 4, but this shipped with 4.7's GA per comment 15, and blocked no 4.6.z backport bugs. Clearing UpgradeBlocker to remove it from our suspect queue [1]. [1]: https://github.com/openshift/enhancements/pull/475