Bug 1911632 - rpm-ostree command fail due to wrong options when updating ocp-4.6 to 4.7 on worker nodes with rt-kernel
Summary: rpm-ostree command fail due to wrong options when updating ocp-4.6 to 4.7 on ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.7
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Ben Howard
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-12-30 13:26 UTC by Niranjan Mallapadi Raghavender
Modified: 2021-03-31 04:08 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:49:31 UTC
Target Upstream Version:
Embargoed:


Attachments
Machine config daemon logs. (407.75 KB, text/plain)
2020-12-30 13:26 UTC, Niranjan Mallapadi Raghavender
Machine config profile (200.69 KB, text/plain)
2020-12-30 13:28 UTC, Niranjan Mallapadi Raghavender


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 2329 0 None closed Bug 1911632: daemon/update: fix regression in realtime upgrades 2021-01-24 12:30:30 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:49:54 UTC

Description Niranjan Mallapadi Raghavender 2020-12-30 13:26:03 UTC
Created attachment 1743225 [details]
Machine config daemon logs.

Description of problem:
When updating ocp-4.6 (4.6.9) to 4.7.0-fc.0-x86_64, the MCP fails to update the worker node when it tries to apply the machineconfig.

snippet from command:

oc logs -f -n openshift-machine-config-operator machine-config-daemon-98nbw  -c machine-config-daemon

I1230 12:41:41.537595   17256 update.go:1844] Running rpm-ostree [kargs --delete=skew_tick=1 --delete=nohz=on --delete=rcu_nocbs=1-11 --delete=tuned.non_isolcpus=fffff001 --delete=intel_pstate=disable --delete=nosoftlockup --delete=tsc=nowatchdog --delete=intel_iommu=on --delete=iommu=pt --delete=isolcpus=managed_irq,1-11 --delete=systemd.cpu_affinity=0,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31 --delete=default_hugepagesz=1G --delete=hugepagesz=2M --delete=hugepages=0 --delete=+ --append=skew_tick=1 --append=nohz=on --append=rcu_nocbs=1-11 --append=tuned.non_isolcpus=fffff001 --append=intel_pstate=disable --append=nosoftlockup --append=tsc=nowatchdog --append=intel_iommu=on --append=iommu=pt --append=isolcpus=managed_irq,1-11 --append=systemd.cpu_affinity=0,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31 --append=default_hugepagesz=1G --append=hugepagesz=2M --append=hugepages=0 --append=+]
I1230 12:41:41.543543   17256 rpm-ostree.go:261] Running captured: rpm-ostree kargs --delete=skew_tick=1 --delete=nohz=on --delete=rcu_nocbs=1-11 --delete=tuned.non_isolcpus=fffff001 --delete=intel_pstate=disable --delete=nosoftlockup --delete=tsc=nowatchdog --delete=intel_iommu=on --delete=iommu=pt --delete=isolcpus=managed_irq,1-11 --delete=systemd.cpu_affinity=0,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31 --delete=default_hugepagesz=1G --delete=hugepagesz=2M --delete=hugepages=0 --delete=+ --append=skew_tick=1 --append=nohz=on --append=rcu_nocbs=1-11 --append=tuned.non_isolcpus=fffff001 --append=intel_pstate=disable --append=nosoftlockup --append=tsc=nowatchdog --append=intel_iommu=on --append=iommu=pt --append=isolcpus=managed_irq,1-11 --append=systemd.cpu_affinity=0,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31 --append=default_hugepagesz=1G --append=hugepagesz=2M --append=hugepages=0 --append=+
I1230 12:43:37.967322   17256 update.go:1844] Initiating switch from kernel realtime to realtime
I1230 12:43:37.973068   17256 update.go:1844] Updating rt-kernel packages on host: []
I1230 12:43:37.978195   17256 rpm-ostree.go:261] Running captured: rpm-ostree
I1230 12:43:38.012525   17256 update.go:437] Rolling back applied changes to OS due to error: error running rpm-ostree : Usage:
  rpm-ostree [OPTION?] COMMAND

Builtin Commands:
  compose          Commands to compose a tree
  cleanup          Clear cached/pending data
  db               Commands to query the RPM database
  deploy           Deploy a specific commit
  rebase           Switch to a different tree
  rollback         Revert to the previously booted tree
  status           Get the version of the booted system
  upgrade          Perform a system upgrade
  reload           Reload configuration
  usroverlay       Apply a transient overlayfs to /usr
  cancel           Cancel an active transaction
  initramfs        Enable or disable local initramfs regeneration
  install          Overlay additional packages
  uninstall        Remove overlayed additional packages
  override         Manage base package overrides
  reset            Remove all mutations
  refresh-md       Generate rpm repo metadata
  kargs            Query or modify kernel arguments


Version-Release number of selected component (if applicable):
OCP4.7

How reproducible:



Steps to Reproduce:
1. Set up ocp-4.6 (with 3 master and 3 worker nodes)
NAME                               STATUS                     ROLES               AGE   VERSION
ocp46-master-0.demo.lab.mniranja   Ready                      master              25h   v1.20.0+87544c5
ocp46-master-1.demo.lab.mniranja   Ready                      master              25h   v1.20.0+87544c5
ocp46-master-2.demo.lab.mniranja   Ready                      master              25h   v1.20.0+87544c5
ocp46-worker-0.demo.lab.mniranja   Ready,SchedulingDisabled   worker,worker-cnf   25h   v1.19.0+7070803
ocp46-worker-1.demo.lab.mniranja   Ready                      worker,worker-cnf   25h   v1.19.0+7070803
ocp46-worker-2.demo.lab.mniranja   Ready                      worker              25h   v1.20.0+87544c5

2. Set up the performance operator

[root@dell-r730-009 ~]# oc get performanceprofile
NAME        AGE
hugepages   23h

<snip>
apiVersion: performance.openshift.io/v1
kind: PerformanceProfile
metadata:
  name: hugepages
spec:
  cpu:
    reserved: "0"
    isolated:  "1-11"
  hugepages:
    defaultHugepagesSize: "1G"
    pages:
    - size: "1G"
      count: 1
      node: 0
    - size: "2M"
      count: 2
      node: 1
  realTimeKernel:
    enabled: True
  numa:
    topologyPolicy: "single-numa-node"
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
</snip>


3. Update ocp-4.6.9 cluster to 4.7 using the below command:


$ oc adm upgrade --to-image  quay.io/openshift-release-dev/ocp-release@sha256:2419f9cd3ea9bd114764855653012e305ade2527210d332bfdd6dbdae538bd66 --allow-explicit-upgrade --allow-upgrade-with-warnings  --force

Actual results:


[root@dell-r730-009 installation]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.9     True        True          4h37m   Unable to apply 4.7.0-fc.0: the cluster operator ingress is degraded




Expected results:

The update should succeed.

Additional info:

Node status:


[root@dell-r730-009 installation]# oc describe nodes/ocp46-worker-0.demo.lab.mniranja
Name:               ocp46-worker-0.demo.lab.mniranja
Roles:              worker,worker-cnf
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ocp46-worker-0.demo.lab.mniranja
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node-role.kubernetes.io/worker-cnf=
                    node.openshift.io/os_id=rhcos
Annotations:        machineconfiguration.openshift.io/currentConfig: rendered-worker-cnf-6f46c1fcbab861dab73628e1a55792b6
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-cnf-d52bbbdfad54247dee6a1061a7522c2e
                    machineconfiguration.openshift.io/reason:
                      error running rpm-ostree : Usage:
                        rpm-ostree [OPTION?] COMMAND

                      Builtin Commands:
                        compose          Commands to compose a tree
                        cleanup          Clear cached/pending data
                        db               Commands to query the RPM database
                        deploy           Deploy a specific commit
                        rebase           Switch to a different tree
                        rollback         Revert to the previously booted tree
                        status           Get the version of the booted system
                        upgrade          Perform a system upgrade
                        reload           Reload configuration
                        usroverlay       Apply a transient overlayfs to /usr
                        cancel           Cancel an active transaction
                        initramfs        Enable or disable local initramfs regeneration
                        install          Overlay additional packages
                        uninstall        Remove overlayed additional packages
                        override         Manage base package overrides
                        reset            Remove all mutations
                        refresh-md       Generate rpm repo metadata
                        kargs            Query or modify kernel arguments

                      Help Options:
                        -h, --help       Show help options

                      Application Options:
                        --version        Print version information and exit

                      error: No command specified
                      : exit status 1
                    machineconfiguration.openshift.io/state: Degraded
                    volumes.kubernetes.io/controller-managed-attach-detach: true
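
The annotations above are enough to tell that the node is stuck between two rendered configs. As a rough sketch only, the same check could be scripted as below. The function name `mc_status` and the returned dictionary layout are hypothetical; only the annotation keys and values come from the output above.

```python
# Hypothetical helper: summarize the machine-config annotations of one node.
# The annotation keys are taken from the `oc describe` output; everything
# else (function name, result layout) is an assumption for illustration.
PREFIX = "machineconfiguration.openshift.io/"


def mc_status(annotations):
    """Return the MCD state and whether the node still has a pending config."""
    current = annotations.get(PREFIX + "currentConfig")
    desired = annotations.get(PREFIX + "desiredConfig")
    state = annotations.get(PREFIX + "state")
    return {
        "state": state,
        # The node has not reached its target while current != desired.
        "pending": current != desired,
        "degraded": state == "Degraded",
    }


worker0 = {
    PREFIX + "currentConfig": "rendered-worker-cnf-6f46c1fcbab861dab73628e1a55792b6",
    PREFIX + "desiredConfig": "rendered-worker-cnf-d52bbbdfad54247dee6a1061a7522c2e",
    PREFIX + "state": "Degraded",
}
status = mc_status(worker0)
print(status)
```

For the node in this report the result is both pending and degraded, which matches the stalled upgrade.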


Status of machineconfig pods


[root@dell-r730-009 installation]# oc get pods
NAME                                       READY   STATUS    RESTARTS   AGE
machine-config-controller-5c97bd58-t5pz2   1/1     Running   0          125m
machine-config-daemon-45vf8                2/2     Running   0          140m
machine-config-daemon-674jq                2/2     Running   0          140m
machine-config-daemon-98nbw                2/2     Running   0          141m
machine-config-daemon-h6brw                2/2     Running   0          140m
machine-config-daemon-wqtwc                2/2     Running   0          140m
machine-config-daemon-xj272                2/2     Running   0          140m
machine-config-operator-9457979b9-4x574    1/1     Running   0          125m
machine-config-server-fhmzt                1/1     Running   0          137m
machine-config-server-wq6xh                1/1     Running   0          137m
machine-config-server-xlbxq                1/1     Running   0          137m

Comment 1 Niranjan Mallapadi Raghavender 2020-12-30 13:28:58 UTC
Created attachment 1743227 [details]
Machine config profile

Comment 2 Ben Howard 2021-01-04 17:06:37 UTC
Without a must-gather, I can't say for sure what the problem is, however the MCO configuration is:

  Kernel Arguments:
    skew_tick=1
    nohz=on
    rcu_nocbs=1-11
    tuned.non_isolcpus=fffff001
    intel_pstate=disable
    nosoftlockup
    tsc=nowatchdog
    intel_iommu=on
    iommu=pt
    isolcpus=managed_irq,1-11
    systemd.cpu_affinity=0,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
    default_hugepagesz=1G
    hugepagesz=2M
    hugepages=0
    +
  Kernel Type:   realtime

That extra "+" is invalid. Please confirm that the "+" is removed and if the issue still happens, please attach a must-gather.
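
For illustration, a stray entry such as the "+" could be caught before the list is handed to rpm-ostree. This is a minimal sketch, not anything the MCO does: the function name `find_invalid_kargs` is hypothetical, and the rule it applies (a kernel argument name must start with a letter, digit, or underscore) is an assumption chosen only to flag the "+" shown above.

```python
# Hypothetical pre-check for a kernel argument list. The validity rule
# (name must begin with a letter, digit, or underscore) is an assumption
# for illustration, not the kernel's or the MCO's actual rule.
def find_invalid_kargs(kargs):
    """Return the entries whose name part does not look like a kernel argument."""
    invalid = []
    for arg in kargs:
        name = arg.split("=", 1)[0]
        if not name or not (name[0].isalnum() or name[0] == "_"):
            invalid.append(arg)
    return invalid


# A subset of the kernel arguments listed in this comment.
kargs = ["skew_tick=1", "nohz=on", "nosoftlockup", "hugepages=0", "+"]
print(find_invalid_kargs(kargs))
```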

Comment 4 Gowrishankar Rajaiyan 2021-01-05 15:36:02 UTC
Adding keyword UpgradeBlocker and setting blocker? flag for triage.

Comment 7 Yaniv Joseph 2021-01-06 10:50:14 UTC
Changing target release to 4.7.0 as the BZ is blocking upgrade.

Comment 8 Gowrishankar Rajaiyan 2021-01-06 13:51:05 UTC
Could you please add the blocker+ flag?

Comment 9 Sinny Kumari 2021-01-11 11:32:03 UTC
This is a regression from https://github.com/openshift/machine-config-operator/commit/adb9e707b0d0170d741cc40dec2eaa73aa201415#diff-349c0748c3f52201852d3027c29daf618b45300b949482728df6666a3f9ba245R963 where the `update` subcommand got lost, so `rpm-ostree update args...` became `rpm-ostree args...`
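
To illustrate the failure mode, the sketch below builds the command line with the subcommand as a required, separate parameter, so that an empty package list can never degrade into a bare `rpm-ostree` call. It is not the code from the fix (the MCO is written in Go); `build_rpm_ostree_args` is a hypothetical helper, and only the `update` subcommand name is taken from this comment.

```python
# Hypothetical helper showing one way to guard against the regression:
# the subcommand is mandatory and is always placed before the other
# arguments, even when that list is empty.
def build_rpm_ostree_args(subcommand, extra_args):
    """Return the argv list for an rpm-ostree call, refusing an empty subcommand."""
    if not subcommand:
        raise ValueError("refusing to run rpm-ostree without a subcommand")
    return ["rpm-ostree", subcommand, *extra_args]


# With no packages to change (the "[]" in the daemon log), the call still
# names the subcommand instead of collapsing to "rpm-ostree".
argv = build_rpm_ostree_args("update", [])
print(argv)
```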

Comment 12 Michael Nguyen 2021-01-19 20:46:52 UTC
Successfully upgraded from 4.6.9 to 4.7.0-0.nightly-2021-01-19-095812 with RT kernel.

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.9     True        False         42m     Cluster version is 4.6.9

$ oc get node
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-128-71.us-west-2.compute.internal    Ready    master   61m   v1.19.0+7070803
ip-10-0-149-13.us-west-2.compute.internal    Ready    worker   48m   v1.19.0+7070803
ip-10-0-175-147.us-west-2.compute.internal   Ready    master   57m   v1.19.0+7070803
ip-10-0-175-54.us-west-2.compute.internal    Ready    worker   52m   v1.19.0+7070803
ip-10-0-213-88.us-west-2.compute.internal    Ready    worker   48m   v1.19.0+7070803
ip-10-0-220-222.us-west-2.compute.internal   Ready    master   57m   v1.19.0+7070803
$ oc debug node/ip-10-0-175-54.us-west-2.compute.internal
Starting pod/ip-10-0-175-54us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# ls
bin  boot  dev	etc  home  lib	lib64  media  mnt  opt	ostree	proc  root  run  sbin  srv  sys  sysroot  tmp  usr  var
sh-4.4# ls 
bin  boot  dev	etc  home  lib	lib64  media  mnt  opt	ostree	proc  root  run  sbin  srv  sys  sysroot  tmp  usr  var
sh-4.4# 
sh-4.4# 
sh-4.4# ls
bin  boot  dev	etc  home  lib	lib64  media  mnt  opt	ostree	proc  root  run  sbin  srv  sys  sysroot  tmp  usr  var
sh-4.4# uname -a
Linux ip-10-0-175-54 4.18.0-193.28.1.rt13.77.el8_2.x86_64 #1 SMP PREEMPT RT Fri Oct 16 14:11:07 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
sh-4.4# rpm -q | grep kernel*
rpm: no arguments given for query
sh-4.4# rpm -qa | grep kernel 
kernel-rt-modules-extra-4.18.0-193.28.1.rt13.77.el8_2.x86_64
kernel-rt-modules-4.18.0-193.28.1.rt13.77.el8_2.x86_64
kernel-rt-kvm-4.18.0-193.28.1.rt13.77.el8_2.x86_64
kernel-rt-core-4.18.0-193.28.1.rt13.77.el8_2.x86_64
sh-4.4# 

sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...
$ watch oc get clusterversion
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-01-19-095812   True        False         10m     Cluster version is 4.7.0-0.nightly-2021-01-19-095812
$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.0-0.nightly-2021-01-19-095812   True        False         False      4s
baremetal                                  4.7.0-0.nightly-2021-01-19-095812   True        False         False      54m
cloud-credential                           4.7.0-0.nightly-2021-01-19-095812   True        False         False      151m
cluster-autoscaler                         4.7.0-0.nightly-2021-01-19-095812   True        False         False      145m
config-operator                            4.7.0-0.nightly-2021-01-19-095812   True        False         False      147m
console                                    4.7.0-0.nightly-2021-01-19-095812   True        False         False      24m
csi-snapshot-controller                    4.7.0-0.nightly-2021-01-19-095812   True        False         False      18m
dns                                        4.7.0-0.nightly-2021-01-19-095812   True        False         False      146m
etcd                                       4.7.0-0.nightly-2021-01-19-095812   True        False         False      144m
image-registry                             4.7.0-0.nightly-2021-01-19-095812   True        False         False      138m
ingress                                    4.7.0-0.nightly-2021-01-19-095812   True        False         False      137m
insights                                   4.7.0-0.nightly-2021-01-19-095812   True        False         False      147m
kube-apiserver                             4.7.0-0.nightly-2021-01-19-095812   True        False         False      142m
kube-controller-manager                    4.7.0-0.nightly-2021-01-19-095812   True        False         False      145m
kube-scheduler                             4.7.0-0.nightly-2021-01-19-095812   True        False         False      143m
kube-storage-version-migrator              4.7.0-0.nightly-2021-01-19-095812   True        False         False      15m
machine-api                                4.7.0-0.nightly-2021-01-19-095812   True        False         False      143m
machine-approver                           4.7.0-0.nightly-2021-01-19-095812   True        False         False      146m
machine-config                             4.7.0-0.nightly-2021-01-19-095812   True        False         False      13m
marketplace                                4.7.0-0.nightly-2021-01-19-095812   True        False         False      17m
monitoring                                 4.7.0-0.nightly-2021-01-19-095812   True        False         False      13m
network                                    4.7.0-0.nightly-2021-01-19-095812   True        False         False      37m
node-tuning                                4.7.0-0.nightly-2021-01-19-095812   True        False         False      52m
openshift-apiserver                        4.7.0-0.nightly-2021-01-19-095812   True        False         False      22m
openshift-controller-manager               4.7.0-0.nightly-2021-01-19-095812   True        False         False      50m
openshift-samples                          4.7.0-0.nightly-2021-01-19-095812   True        False         False      52m
operator-lifecycle-manager                 4.7.0-0.nightly-2021-01-19-095812   True        False         False      146m
operator-lifecycle-manager-catalog         4.7.0-0.nightly-2021-01-19-095812   True        False         False      146m
operator-lifecycle-manager-packageserver   4.7.0-0.nightly-2021-01-19-095812   True        False         False      18m
service-ca                                 4.7.0-0.nightly-2021-01-19-095812   True        False         False      147m
storage                                    4.7.0-0.nightly-2021-01-19-095812   True        False         False      23m
$ oc debug node/ip-10-0-175-54.us-west-2.compute.internal
Starting pod/ip-10-0-175-54us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# uname -a
Linux ip-10-0-175-54 4.18.0-240.10.1.rt7.64.el8_3.x86_64 #1 SMP PREEMPT_RT Wed Dec 16 08:22:01 EST 2020 x86_64 x86_64 x86_64 GNU/Linux
sh-4.4# rpm -qa | grep kernel
kernel-rt-modules-extra-4.18.0-240.10.1.rt7.64.el8_3.x86_64
kernel-rt-modules-4.18.0-240.10.1.rt7.64.el8_3.x86_64
kernel-rt-kvm-4.18.0-240.10.1.rt7.64.el8_3.x86_64
kernel-rt-core-4.18.0-240.10.1.rt7.64.el8_3.x86_64
sh-4.4#
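
The manual check above (reading `uname -a` before and after the upgrade) could be automated along these lines. This is a sketch under assumptions: `is_rt_kernel` is a hypothetical name, and the pattern relies only on the `.rtN.` component visible in the two release strings above; the non-RT string in the example is illustrative and not from this report.

```python
import re


# Hypothetical check: treat a kernel release as realtime when it contains
# an ".rt<digits>." component, as both release strings in this comment do.
def is_rt_kernel(release):
    """Return True if the kernel release string looks like an RT kernel."""
    return re.search(r"\.rt\d+\.", release) is not None


before = "4.18.0-193.28.1.rt13.77.el8_2.x86_64"  # from the 4.6.9 node
after = "4.18.0-240.10.1.rt7.64.el8_3.x86_64"    # from the upgraded node
print(is_rt_kernel(before), is_rt_kernel(after))
```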

Comment 15 errata-xmlrpc 2021-02-24 15:49:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 16 W. Trevor King 2021-03-31 04:08:33 UTC
Gowrishankar added UpgradeBlocker in comment 4, but this shipped with 4.7's GA per comment 15, and blocked no 4.6.z backport bugs.  Clearing UpgradeBlocker to remove it from our suspect queue [1].

[1]: https://github.com/openshift/enhancements/pull/475

