Bug 1941932 - Upgrade from 4.7 to 4.8 stuck on machine-config
Keywords:
Status: CLOSED DUPLICATE of bug 1933772
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Yu Qi Zhang
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-03-23 08:19 UTC by huirwang
Modified: 2021-03-23 15:04 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-23 15:04:48 UTC
Target Upstream Version:
Embargoed:



Description huirwang 2021-03-23 08:19:27 UTC
Description of problem:
Upgrade from 4.7 to 4.8 stuck on machine-config

Version-Release number of selected component (if applicable):
4.7.3 -> 4.8.0-0.nightly-2021-03-22-104536

How reproducible:
So far 1 out of 1 attempt.

Steps to Reproduce:
This issue occurred in upgrade CI on OSP16 with scaled-up RHEL workers.

After the upgrade, the cluster is stuck on machine-config:
[2021-03-22T22:10:14.042Z] Post action: #oc get co:
                           NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
[2021-03-22T22:10:14.042Z] authentication                             4.8.0-0.nightly-2021-03-22-104536   True        False         False      3h27m
[2021-03-22T22:10:14.042Z] baremetal                                  4.8.0-0.nightly-2021-03-22-104536   True        False         False      5h2m
[2021-03-22T22:10:14.042Z] cloud-credential                           4.8.0-0.nightly-2021-03-22-104536   True        False         False      5h9m
[2021-03-22T22:10:14.042Z] cluster-autoscaler                         4.8.0-0.nightly-2021-03-22-104536   True        False         False      5h
[2021-03-22T22:10:14.042Z] config-operator                            4.8.0-0.nightly-2021-03-22-104536   True        False         False      5h2m
[2021-03-22T22:10:14.042Z] console                                    4.8.0-0.nightly-2021-03-22-104536   True        False         False      140m
[2021-03-22T22:10:14.042Z] csi-snapshot-controller                    4.8.0-0.nightly-2021-03-22-104536   True        False         False      3h8m
[2021-03-22T22:10:14.042Z] dns                                        4.8.0-0.nightly-2021-03-22-104536   True        False         False      5h
[2021-03-22T22:10:14.042Z] etcd                                       4.8.0-0.nightly-2021-03-22-104536   True        False         False      5h1m
[2021-03-22T22:10:14.042Z] image-registry                             4.8.0-0.nightly-2021-03-22-104536   True        False         False      4h51m
[2021-03-22T22:10:14.042Z] ingress                                    4.8.0-0.nightly-2021-03-22-104536   True        False         False      4h50m
[2021-03-22T22:10:14.042Z] insights                                   4.8.0-0.nightly-2021-03-22-104536   True        False         False      4h55m
[2021-03-22T22:10:14.042Z] kube-apiserver                             4.8.0-0.nightly-2021-03-22-104536   True        False         False      4h59m
[2021-03-22T22:10:14.042Z] kube-controller-manager                    4.8.0-0.nightly-2021-03-22-104536   True        False         False      4h58m
[2021-03-22T22:10:14.042Z] kube-scheduler                             4.8.0-0.nightly-2021-03-22-104536   True        False         False      4h58m
[2021-03-22T22:10:14.042Z] kube-storage-version-migrator              4.8.0-0.nightly-2021-03-22-104536   True        False         False      3h5m
[2021-03-22T22:10:14.042Z] machine-api                                4.8.0-0.nightly-2021-03-22-104536   True        False         False      4h57m
[2021-03-22T22:10:14.042Z] machine-approver                           4.8.0-0.nightly-2021-03-22-104536   True        False         False      5h1m
[2021-03-22T22:10:14.042Z] machine-config                             4.7.3                               False       True          True       91m
[2021-03-22T22:10:14.042Z] marketplace                                4.8.0-0.nightly-2021-03-22-104536   True        False         False      141m
[2021-03-22T22:10:14.042Z] monitoring                                 4.8.0-0.nightly-2021-03-22-104536   True        False         False      6m46s
[2021-03-22T22:10:14.042Z] network                                    4.8.0-0.nightly-2021-03-22-104536   True        False         False      4h59m
[2021-03-22T22:10:14.042Z] node-tuning                                4.8.0-0.nightly-2021-03-22-104536   True        False         False      142m
[2021-03-22T22:10:14.042Z] openshift-apiserver                        4.8.0-0.nightly-2021-03-22-104536   True        False         False      175m
[2021-03-22T22:10:14.042Z] openshift-controller-manager               4.8.0-0.nightly-2021-03-22-104536   True        False         False      141m
[2021-03-22T22:10:14.042Z] openshift-samples                          4.8.0-0.nightly-2021-03-22-104536   True        False         False      142m
[2021-03-22T22:10:14.042Z] operator-lifecycle-manager                 4.8.0-0.nightly-2021-03-22-104536   True        False         False      5h1m
[2021-03-22T22:10:14.042Z] operator-lifecycle-manager-catalog         4.8.0-0.nightly-2021-03-22-104536   True        False         False      5h1m
[2021-03-22T22:10:14.042Z] operator-lifecycle-manager-packageserver   4.8.0-0.nightly-2021-03-22-104536   True        False         False      140m
[2021-03-22T22:10:14.043Z] service-ca                                 4.8.0-0.nightly-2021-03-22-104536   True        False         False      5h2m
[2021-03-22T22:10:14.043Z] storage                                    4.8.0-0.nightly-2021-03-22-104536   True        False         False      3h6m

#oc get node:
                           NAME                       STATUS   ROLES    AGE     VERSION           INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
[2021-03-22T22:10:14.042Z] ugd-12708-7v9h4-master-0   Ready    master   5h8m    v1.20.0+551f7b2   192.168.1.56    <none>        Red Hat Enterprise Linux CoreOS 47.83.202103140039-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.20.1-5.rhaos4.7.git62f21aa.el8
[2021-03-22T22:10:14.042Z] ugd-12708-7v9h4-master-1   Ready    master   5h5m    v1.20.0+551f7b2   192.168.0.23    <none>        Red Hat Enterprise Linux CoreOS 47.83.202103140039-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.20.1-5.rhaos4.7.git62f21aa.el8
[2021-03-22T22:10:14.042Z] ugd-12708-7v9h4-master-2   Ready    master   5h4m    v1.20.0+551f7b2   192.168.0.157   <none>        Red Hat Enterprise Linux CoreOS 47.83.202103140039-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.20.1-5.rhaos4.7.git62f21aa.el8
[2021-03-22T22:10:14.042Z] ugd-12708-7v9h4-rhel-0     Ready    worker   4h6m    v1.20.0+bafe72f   192.168.0.165   10.0.97.72    Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.21.1.el7.x86_64    cri-o://1.20.2-3.rhaos4.7.gitfecc319.el7
[2021-03-22T22:10:14.042Z] ugd-12708-7v9h4-rhel-1     Ready    worker   4h6m    v1.20.0+bafe72f   192.168.0.131   10.0.97.159   Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.21.1.el7.x86_64    cri-o://1.20.2-3.rhaos4.7.gitfecc319.el7
[2021-03-22T22:10:14.042Z] ugd-12708-7v9h4-worker-0   Ready    worker   4h54m   v1.20.0+551f7b2   192.168.1.66    <none>        Red Hat Enterprise Linux CoreOS 47.83.202103140039-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.20.1-5.rhaos4.7.git62f21aa.el8
[2021-03-22T22:10:14.042Z] ugd-12708-7v9h4-worker-1   Ready    worker   4h55m   v1.20.0+551f7b2   192.168.3.147   <none>        Red Hat Enterprise Linux CoreOS 47.83.202103140039-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.20.1-5.rhaos4.7.git62f21aa.el8
[2021-03-22T22:10:14.042Z] ugd-12708-7v9h4-worker-2   Ready    worker   4h55m   v1.20.0+551f7b2   192.168.1.89    <none>        Red Hat Enterprise Linux CoreOS 47.83.202103140039-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.20.1-5.rhaos4.7.git62f21aa.el8
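
For triage, the operator table above can be filtered mechanically. The following is a minimal sketch, not part of the original CI job: it assumes the `oc get co` output is available as text with the same bracketed timestamp prefix, and the helper name `unhealthy_operators` is hypothetical.

```python
import re

# Strips the CI-style "[2021-03-22T22:10:14.042Z] " prefix seen in this report.
TS_PREFIX = re.compile(r"^\[[^\]]+\]\s*")


def unhealthy_operators(text, target_version):
    """Return (name, version, reasons) for operators not healthy at target_version."""
    problems = []
    for raw in text.splitlines():
        fields = TS_PREFIX.sub("", raw).split()
        # Data rows have six columns:
        # NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
        if len(fields) != 6 or fields[2] not in ("True", "False", "Unknown"):
            continue  # header, command echo, or unrelated line
        name, version, available, progressing, degraded, _since = fields
        reasons = []
        if version != target_version:
            reasons.append("version")
        if available != "True":
            reasons.append("unavailable")
        if progressing != "False":
            reasons.append("progressing")
        if degraded != "False":
            reasons.append("degraded")
        if reasons:
            problems.append((name, version, reasons))
    return problems


# Two rows copied from the table in this report.
sample = """\
[2021-03-22T22:10:14.042Z] machine-api                                4.8.0-0.nightly-2021-03-22-104536   True        False         False      4h57m
[2021-03-22T22:10:14.042Z] machine-config                             4.7.3                               False       True          True       91m
"""
print(unhealthy_operators(sample, "4.8.0-0.nightly-2021-03-22-104536"))
# [('machine-config', '4.7.3', ['version', 'unavailable', 'progressing', 'degraded'])]
```

On the table in this report, only machine-config is flagged: it is still at 4.7.3 and is unavailable, progressing, and degraded.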

Checked the must-gather logs; the machine-config-daemon pods report errors.

-----------------------------------------
cat machine-config-daemon/machine-config-daemon/logs/current.log 

2021-03-22T18:12:30.507057316-04:00 I0322 22:12:30.506621  150865 start.go:108] Version: v4.8.0-202103220526.p0-dirty (d6dabadeca05789363a2bbd56fbd3b16a3b21777)
2021-03-22T18:12:30.509752749-04:00 I0322 22:12:30.509683  150865 start.go:121] Calling chroot("/rootfs")
2021-03-22T18:12:30.562586722-04:00 I0322 22:12:30.562493  150865 start.go:97] Copied self to /run/bin/machine-config-daemon on host
2021-03-22T18:12:30.565547109-04:00 I0322 22:12:30.564695  150865 metrics.go:105] Registering Prometheus metrics
2021-03-22T18:12:30.565547109-04:00 I0322 22:12:30.564776  150865 metrics.go:110] Starting metrics listener on 127.0.0.1:8797
2021-03-22T18:12:30.567619353-04:00 I0322 22:12:30.567574  150865 update.go:1851] Starting to manage node: ugd-12708-7v9h4-rhel-1
2021-03-22T18:12:30.569712726-04:00 I0322 22:12:30.569621  150865 rpm-ostree.go:258] Running captured: journalctl --list-boots
2021-03-22T18:12:30.571688870-04:00 I0322 22:12:30.571533  150865 daemon.go:669] Detected a new login session: New session 1 of user cloud-user.
2021-03-22T18:12:30.571688870-04:00 I0322 22:12:30.571551  150865 daemon.go:670] Login access is discouraged! Applying annotation: machineconfiguration.openshift.io/ssh
2021-03-22T18:12:30.572547241-04:00 I0322 22:12:30.572377  150865 daemon.go:858] journalctl --list-boots:
2021-03-22T18:12:30.572547241-04:00  0 e2082f440ebd4603818fb63fe6a70f49 Mon 2021-03-22 14:03:39 EDT—Mon 2021-03-22 18:12:30 EDT
2021-03-22T18:12:30.572547241-04:00 I0322 22:12:30.572399  150865 rpm-ostree.go:258] Running captured: systemctl list-units --state=failed --no-legend
2021-03-22T18:12:30.577332744-04:00 I0322 22:12:30.577155  150865 daemon.go:871] systemctl --failed:
2021-03-22T18:12:30.577332744-04:00 afterburn-hostname.service loaded failed failed Afterburn Hostname
2021-03-22T18:12:30.577332744-04:00 ovirt-guest-agent.service  loaded failed failed oVirt Guest Agent
2021-03-22T18:12:30.577332744-04:00 I0322 22:12:30.577178  150865 daemon.go:607] Starting MachineConfigDaemon
2021-03-22T18:12:30.577332744-04:00 I0322 22:12:30.577186  150865 daemon.go:577] Guarding against sigterm signal
2021-03-22T18:12:30.577332744-04:00 I0322 22:12:30.577205  150865 daemon.go:614] Enabling Kubelet Healthz Monitor
2021-03-22T18:12:30.677649628-04:00 W0322 22:12:30.677546  150865 daemon.go:635] Got an error from auxiliary tools: error: cannot apply annotation for SSH access due to: unable to update node "nil": node "ugd-12708-7v9h4-rhel-1" not found
2021-03-22T18:12:30.677649628-04:00 I0322 22:12:30.677614  150865 daemon.go:636] Shutting down MachineConfigDaemon
2021-03-22T18:12:30.677760643-04:00 F0322 22:12:30.677698  150865 helpers.go:147] error: cannot apply annotation for SSH access due to: unable to update node "nil": node "ugd-12708-7v9h4-rhel-1" not found
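
To pick the relevant lines out of a longer current.log, the klog-style severity letter (I, W, E, F) at the start of each message can be used as a filter. This is a rough sketch under the assumption that every entry follows the layout shown above (severity and date, time, PID, source file and line, message); the names `KLOG` and `warnings_and_fatals` are illustrative only.

```python
import re

# Severity letter + MMDD, time, PID, "file.go:line]", message.
# An optional container-runtime timestamp may precede the entry.
KLOG = re.compile(
    r"(?:^|\s)(?P<sev>[IWEF])(?P<mmdd>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<pid>\d+) (?P<src>[^\s\]]+:\d+)\] (?P<msg>.*)$"
)


def warnings_and_fatals(log_text):
    """Return (severity, source, message) for every W, E, or F entry."""
    hits = []
    for line in log_text.splitlines():
        m = KLOG.search(line)
        if m and m.group("sev") in ("W", "E", "F"):
            hits.append((m.group("sev"), m.group("src"), m.group("msg")))
    return hits


# Last two lines of the log in this report.
sample = '''\
2021-03-22T18:12:30.677649628-04:00 I0322 22:12:30.677614  150865 daemon.go:636] Shutting down MachineConfigDaemon
2021-03-22T18:12:30.677760643-04:00 F0322 22:12:30.677698  150865 helpers.go:147] error: cannot apply annotation for SSH access due to: unable to update node "nil": node "ugd-12708-7v9h4-rhel-1" not found
'''
for sev, src, msg in warnings_and_fatals(sample):
    print(sev, src, msg)
```

Applied to the full log above, this leaves the warning at daemon.go:635 and the fatal at helpers.go:147, both reporting that the SSH annotation could not be applied because the node was not found.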


----------------------------
Snippet of machine-config-daemon-vhv2g.yaml:
  - containerID: cri-o://d67ce5f38b2f540e7bdaf522697929beca1d84c07e13b8475fae175d5d09989a
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c183d99756e887d90846b0dd379157173646feb5fe2fe7379f26932a7a591ae4
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c183d99756e887d90846b0dd379157173646feb5fe2fe7379f26932a7a591ae4
    lastState:
      terminated:
        containerID: cri-o://d67ce5f38b2f540e7bdaf522697929beca1d84c07e13b8475fae175d5d09989a
        exitCode: 255
        finishedAt: "2021-03-22T22:12:30Z"
        message: |
          I0322 22:12:30.506621  150865 start.go:108] Version: v4.8.0-202103220526.p0-dirty (d6dabadeca05789363a2bbd56fbd3b16a3b21777)
          I0322 22:12:30.509683  150865 start.go:121] Calling chroot("/rootfs")
          I0322 22:12:30.562493  150865 start.go:97] Copied self to /run/bin/machine-config-daemon on host
          I0322 22:12:30.564695  150865 metrics.go:105] Registering Prometheus metrics
          I0322 22:12:30.564776  150865 metrics.go:110] Starting metrics listener on 127.0.0.1:8797
          I0322 22:12:30.567574  150865 update.go:1851] Starting to manage node: ugd-12708-7v9h4-rhel-1
          I0322 22:12:30.569621  150865 rpm-ostree.go:258] Running captured: journalctl --list-boots
          I0322 22:12:30.571533  150865 daemon.go:669] Detected a new login session: New session 1 of user cloud-user.
          I0322 22:12:30.571551  150865 daemon.go:670] Login access is discouraged! Applying annotation: machineconfiguration.openshift.io/ssh
          I0322 22:12:30.572377  150865 daemon.go:858] journalctl --list-boots:
           0 e2082f440ebd4603818fb63fe6a70f49 Mon 2021-03-22 14:03:39 EDT—Mon 2021-03-22 18:12:30 EDT
          I0322 22:12:30.572399  150865 rpm-ostree.go:258] Running captured: systemctl list-units --state=failed --no-legend
          I0322 22:12:30.577155  150865 daemon.go:871] systemctl --failed:
          afterburn-hostname.service loaded failed failed Afterburn Hostname
          ovirt-guest-agent.service  loaded failed failed oVirt Guest Agent
          I0322 22:12:30.577178  150865 daemon.go:607] Starting MachineConfigDaemon
          I0322 22:12:30.577186  150865 daemon.go:577] Guarding against sigterm signal
          I0322 22:12:30.577205  150865 daemon.go:614] Enabling Kubelet Healthz Monitor
          W0322 22:12:30.677546  150865 daemon.go:635] Got an error from auxiliary tools: error: cannot apply annotation for SSH access due to: unable to update node "nil": node "ugd-12708-7v9h4-rhel-1" not found
          I0322 22:12:30.677614  150865 daemon.go:636] Shutting down MachineConfigDaemon
          F0322 22:12:30.677698  150865 helpers.go:147] error: cannot apply annotation for SSH access due to: unable to update node "nil": node "ugd-12708-7v9h4-rhel-1" not found
        reason: Error
        startedAt: "2021-03-22T22:12:30Z"
    name: machine-config-daemon
    ready: false
    restartCount: 7
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=machine-config-daemon pod=machine-config-daemon-vhv2g_openshift-machine-config-operator(96b6e38f-b5f7-4b84-84fa-11a68cc5c095)
        reason: CrashLoopBackOff
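
The same check can be scripted against a pod object. The sketch below assumes the pod has already been loaded into a Python dict (for example from `oc get pod -o json` output via `json.loads`) and that the snippet above sits under `status.containerStatuses`, as in the standard Pod API; the function name `crash_looping` is hypothetical.

```python
def crash_looping(pod):
    """List containers of a pod dict that are waiting in CrashLoopBackOff."""
    found = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting") or {}
        if waiting.get("reason") != "CrashLoopBackOff":
            continue
        terminated = cs.get("lastState", {}).get("terminated") or {}
        found.append({
            "container": cs.get("name"),
            "restarts": cs.get("restartCount"),
            "exit_code": terminated.get("exitCode"),
        })
    return found


# Values taken from the machine-config-daemon-vhv2g snippet in this report.
pod = {
    "status": {
        "containerStatuses": [{
            "name": "machine-config-daemon",
            "ready": False,
            "restartCount": 7,
            "lastState": {"terminated": {"exitCode": 255, "reason": "Error"}},
            "state": {"waiting": {"reason": "CrashLoopBackOff"}},
        }]
    }
}
print(crash_looping(pod))
# [{'container': 'machine-config-daemon', 'restarts': 7, 'exit_code': 255}]
```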

---------------------------------
Details of the abnormal cluster operator (machine-config):
[2021-03-22T22:11:23.775Z]   Conditions:
[2021-03-22T22:11:23.775Z]     Last Transition Time:  2021-03-22T20:40:47Z
[2021-03-22T22:11:23.775Z]     Message:               Working towards 4.8.0-0.nightly-2021-03-22-104536
[2021-03-22T22:11:23.775Z]     Status:                True
[2021-03-22T22:11:23.775Z]     Type:                  Progressing
[2021-03-22T22:11:23.775Z]     Last Transition Time:  2021-03-22T20:50:49Z
[2021-03-22T22:11:23.775Z]     Message:               Unable to apply 4.8.0-0.nightly-2021-03-22-104536: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 8, updated: 5, ready: 6, unavailable: 2)
[2021-03-22T22:11:23.775Z]     Reason:                MachineConfigDaemonFailed
[2021-03-22T22:11:23.775Z]     Status:                True
[2021-03-22T22:11:23.775Z]     Type:                  Degraded
[2021-03-22T22:11:23.775Z]     Last Transition Time:  2021-03-22T20:38:13Z
[2021-03-22T22:11:23.775Z]     Message:               Cluster not available for 4.8.0-0.nightly-2021-03-22-104536
[2021-03-22T22:11:23.775Z]     Status:                False
[2021-03-22T22:11:23.775Z]     Type:                  Available
[2021-03-22T22:11:23.775Z]     Last Transition Time:  2021-03-22T17:09:10Z
[2021-03-22T22:11:23.775Z]     Reason:                AsExpected
[2021-03-22T22:11:23.775Z]     Status:                True
[2021-03-22T22:11:23.775Z]     Type:                  Upgradeable
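
The daemonset counters embedded in the Degraded message can be pulled out for comparison against the node list. This is a small sketch that assumes the message keeps the `status: (key: N, ...)` shape shown above; `rollout_counts` is an illustrative name, not an operator API.

```python
import re


def rollout_counts(message):
    """Parse 'status: (desired: 8, updated: 5, ...)' into a dict of ints."""
    m = re.search(r"status: \(([^)]*)\)", message)
    if not m:
        return {}
    counts = {}
    for pair in m.group(1).split(","):
        key, value = pair.split(":")
        counts[key.strip()] = int(value)
    return counts


# Degraded condition message from this report.
msg = (
    "Unable to apply 4.8.0-0.nightly-2021-03-22-104536: timed out waiting for "
    "the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon "
    "is not ready. status: (desired: 8, updated: 5, ready: 6, unavailable: 2)"
)
print(rollout_counts(msg))
# {'desired': 8, 'updated': 5, 'ready': 6, 'unavailable': 2}
```

The desired count of 8 matches the eight nodes listed earlier. Two daemon pods are unavailable; the log in this report confirms the crash only for the pod on ugd-12708-7v9h4-rhel-1.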

Actual results:
Upgrade failed

Expected results:
Upgrade succeeds.

Additional info:

