Bug 1874323 - Failed to upgrade to 4.6 from 4.5 due to MachineConfigDaemonFailed
Summary: Failed to upgrade to 4.6 from 4.5 due to MachineConfigDaemonFailed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Aniket Bhat
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-01 03:36 UTC by Jian Zhang
Modified: 2021-04-05 17:46 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:36:25 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-network-operator pull 801 0 None closed Bug 1881979: Fixes gateway mode parameters for OVN 2021-02-10 18:31:43 UTC
Github openshift ovn-kubernetes pull 269 0 None closed Bug 1872470: Upstream merge 9-14-2020 2021-02-10 18:31:44 UTC
Github openshift ovn-kubernetes pull 281 0 None closed Bug 1880591: Allow local no bridge 2021-02-10 18:31:44 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:36:47 UTC

Description Jian Zhang 2020-09-01 03:36:54 UTC
Description of problem:

The machine-config cluster operator has been stuck in this status for hours:
status:
  conditions:
  - lastTransitionTime: "2020-09-01T01:39:18Z"
    message: Working towards 4.6.0-0.nightly-2020-08-31-194600
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-09-01T01:53:58Z"
    message: 'Unable to apply 4.6.0-0.nightly-2020-08-31-194600: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)'
    reason: MachineConfigDaemonFailed
    status: "True"
    type: Degraded
  - lastTransitionTime: "2020-09-01T01:37:12Z"
    message: Cluster not available for 4.6.0-0.nightly-2020-08-31-194600
    status: "False"
    type: Available
  - lastTransitionTime: "2020-08-31T01:51:00Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
  extension:
    master: all 3 nodes are at latest configuration rendered-master-d6f41577113d3cd74dda97f9527109fb
    worker: 2 (ready 2) out of 3 nodes are updating to latest configuration rendered-worker-3ac113826d1787d04061a8d40f63eab4
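
As a minimal triage sketch (assuming cluster-admin access; the namespace and daemonset names are taken from the Degraded message above), the node behind the one unavailable machine-config-daemon replica can be located like this:

# Which replica of the daemonset is not ready, and on which node is it scheduled?
oc get daemonset machine-config-daemon -n openshift-machine-config-operator
oc get pods -n openshift-machine-config-operator -o wide
# The pod running on a NotReady node is the one blocking waitForDaemonsetRollout.
oc get nodes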


Version-Release number of selected component (if applicable):
upgrade 4.5.0-0.nightly-2020-08-29-080432 to 4.6.0-0.nightly-2020-08-31-194600


How reproducible:
I tried once

Steps to Reproduce:
1. Install OCP 4.5.0-0.nightly-2020-08-29-080432(IPI OVN FIPS) on Azure.
2. Run some tests, then upgrade it to 4.6.
oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release@sha256:2bc7f7acaf336e1279daa796be32913c9137deb528840e5c2985d750f8a0e4c1 --force --allow-explicit-upgrade
warning: The requested upgrade image is not one of the available updates.  You have used --allow-explicit-upgrade to the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.svc.ci.openshift.org/ocp/release@sha256:2bc7f7acaf336e1279daa796be32913c9137deb528840e5c2985d750f8a0e4c1
[root@preserve-olm-env data]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-08-29-080432   True        True          39s     Working towards registry.svc.ci.openshift.org/ocp/release@sha256:2bc7f7acaf336e1279daa796be32913c9137deb528840e5c2985d750f8a0e4c1: downloading update
[root@preserve-olm-env data]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-08-29-080432   True        True          15m     Working towards 4.6.0-0.nightly-2020-08-31-194600: 18% complete
...


Actual results:
Failed to upgrade.
[root@preserve-olm-env data]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-08-29-080432   True        True          146m    Unable to apply 4.6.0-0.nightly-2020-08-31-194600: the cluster operator monitoring is degraded
[root@preserve-olm-env data]# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
...
machine-config                             4.5.0-0.nightly-2020-08-29-080432   False       True          True       105m
...
monitoring                                 4.6.0-0.nightly-2020-08-31-194600   False       True          True       80m


Expected results:
The upgrade succeeds.

Here is the cluster for your debugging: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/109204/artifact/workdir/install-dir/auth/kubeconfig/*view*/

Additional info:
[root@preserve-olm-env data]# oc get co machine-config  -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
...
spec: {}
status:
  conditions:
  - lastTransitionTime: "2020-09-01T01:39:18Z"
    message: Working towards 4.6.0-0.nightly-2020-08-31-194600
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-09-01T01:53:58Z"
    message: 'Unable to apply 4.6.0-0.nightly-2020-08-31-194600: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)'
    reason: MachineConfigDaemonFailed
    status: "True"
    type: Degraded
  - lastTransitionTime: "2020-09-01T01:37:12Z"
    message: Cluster not available for 4.6.0-0.nightly-2020-08-31-194600
    status: "False"
    type: Available
  - lastTransitionTime: "2020-08-31T01:51:00Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
  extension:
    master: all 3 nodes are at latest configuration rendered-master-d6f41577113d3cd74dda97f9527109fb
    worker: 2 (ready 2) out of 3 nodes are updating to latest configuration rendered-worker-3ac113826d1787d04061a8d40f63eab4
  relatedObjects:
  - group: ""
    name: openshift-machine-config-operator
    resource: namespaces
  - group: machineconfiguration.openshift.io
    name: ""
    resource: machineconfigpools
  - group: machineconfiguration.openshift.io
    name: ""
    resource: controllerconfigs
  - group: machineconfiguration.openshift.io
    name: ""
    resource: kubeletconfigs
  - group: machineconfiguration.openshift.io
    name: ""
    resource: containerruntimeconfigs
  - group: machineconfiguration.openshift.io
    name: ""
    resource: machineconfigs
  versions:
  - name: operator
    version: 4.5.0-0.nightly-2020-08-29-080432
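
The waitForDaemonsetRollout condition the operator reports is essentially a wait for the daemonset rollout to complete; the same wait can be mirrored from the CLI (a sketch, assuming cluster-admin access):

# Mirror the operator's wait on the machine-config-daemon rollout (sketch)
oc rollout status daemonset/machine-config-daemon -n openshift-machine-config-operator --timeout=5m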


[root@preserve-olm-env data]# oc get nodes
NAME                                        STATUS                        ROLES    AGE   VERSION
jiazha45-up-n9zw6-master-0                  Ready                         master   25h   v1.19.0-rc.2+f71a7ab-dirty
jiazha45-up-n9zw6-master-1                  Ready                         master   25h   v1.19.0-rc.2+f71a7ab-dirty
jiazha45-up-n9zw6-master-2                  Ready                         master   25h   v1.19.0-rc.2+f71a7ab-dirty
jiazha45-up-n9zw6-worker-centralus1-qnrbm   Ready                         worker   25h   v1.19.0-rc.2+f71a7ab-dirty
jiazha45-up-n9zw6-worker-centralus2-scdb2   Ready                         worker   25h   v1.19.0-rc.2+f71a7ab-dirty
jiazha45-up-n9zw6-worker-centralus3-xs4h6   NotReady,SchedulingDisabled   worker   25h   v1.18.3+6c42de8

[root@preserve-olm-env data]# oc get pods -n openshift-machine-config-operator -o wide
NAME                                         READY   STATUS    RESTARTS   AGE    IP            NODE                                        NOMINATED NODE   READINESS GATES
machine-config-controller-6c758bf7d6-pr76k   1/1     Running   0          99m    10.130.0.74   jiazha45-up-n9zw6-master-0                  <none>           <none>
machine-config-daemon-8r84w                  2/2     Running   0          110m   10.0.32.5     jiazha45-up-n9zw6-worker-centralus3-xs4h6   <none>           <none>
machine-config-daemon-g7v8x                  2/2     Running   0          109m   10.0.32.4     jiazha45-up-n9zw6-worker-centralus2-scdb2   <none>           <none>
machine-config-daemon-msxzr                  2/2     Running   0          108m   10.0.32.6     jiazha45-up-n9zw6-worker-centralus1-qnrbm   <none>           <none>
machine-config-daemon-x8vvz                  2/2     Running   0          109m   10.0.0.8      jiazha45-up-n9zw6-master-0                  <none>           <none>
machine-config-daemon-xljdv                  2/2     Running   0          110m   10.0.0.5      jiazha45-up-n9zw6-master-2                  <none>           <none>
machine-config-daemon-xr92n                  2/2     Running   0          110m   10.0.0.7      jiazha45-up-n9zw6-master-1                  <none>           <none>
machine-config-operator-86b665fb84-7zv45     1/1     Running   0          93m    10.130.0.86   jiazha45-up-n9zw6-master-0                  <none>           <none>
machine-config-server-9jr9z                  1/1     Running   0          106m   10.0.0.5      jiazha45-up-n9zw6-master-2                  <none>           <none>
machine-config-server-m6q85                  1/1     Running   0          106m   10.0.0.7      jiazha45-up-n9zw6-master-1                  <none>           <none>
machine-config-server-q2fbh                  1/1     Running   0          105m   10.0.0.8      jiazha45-up-n9zw6-master-0                  <none>           <none>

[root@preserve-olm-env data]# oc -n openshift-machine-config-operator logs machine-config-daemon-8r84w -c machine-config-daemon
Error from server: Get "https://10.0.32.5:10250/containerLogs/openshift-machine-config-operator/machine-config-daemon-8r84w/machine-config-daemon": net/http: TLS handshake timeout
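
The TLS handshake timeout means the API server cannot reach the kubelet on 10.0.32.5:10250, so `oc logs` cannot stream anything from that node. A fallback sketch, assuming SSH or console access to the node (the container ID below is a placeholder), is to read the logs on the host itself:

# On the node itself (assumes SSH/console access)
ssh core@10.0.32.5
sudo journalctl -u kubelet --no-pager | tail -n 200    # is the kubelet running at all?
sudo crictl ps -a | grep machine-config-daemon         # find the daemon container ID
sudo crictl logs <container-id>                        # read its logs directly from CRI-O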

[root@preserve-olm-env data]# oc describe nodes jiazha45-up-n9zw6-worker-centralus3-xs4h6 
Name:               jiazha45-up-n9zw6-worker-centralus3-xs4h6
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=Standard_D4s_v3
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=centralus
                    failure-domain.beta.kubernetes.io/zone=centralus-3
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=jiazha45-up-n9zw6-worker-centralus3-xs4h6
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=Standard_D4s_v3
                    node.openshift.io/os_id=rhcos
                    topology.kubernetes.io/region=centralus
                    topology.kubernetes.io/zone=centralus-3
Annotations:        k8s.ovn.org/l3-gateway-config:
                      {"default":{"mode":"local","interface-id":"br-local_jiazha45-up-n9zw6-worker-centralus3-xs4h6","mac-address":"00:00:a9:fe:21:02","ip-addre...
                    k8s.ovn.org/node-chassis-id: 252de6a9-fed6-4462-af22-d2f7ea5235c8
                    k8s.ovn.org/node-join-subnets: {"default":"100.64.4.0/29"}
                    k8s.ovn.org/node-mgmt-port-mac-address: 92:2c:97:18:64:4f
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.32.5/19"}
                    k8s.ovn.org/node-subnets: {"default":"10.128.2.0/23"}
                    machine.openshift.io/machine: openshift-machine-api/jiazha45-up-n9zw6-worker-centralus3-xs4h6
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-454a549a60f78f18a3046963bebcd881
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-3ac113826d1787d04061a8d40f63eab4
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Working
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Sun, 30 Aug 2020 22:04:42 -0400
Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule
                    node.kubernetes.io/unschedulable:NoSchedule
Unschedulable:      true
Lease:
  HolderIdentity:  jiazha45-up-n9zw6-worker-centralus3-xs4h6
  AcquireTime:     <unset>
  RenewTime:       Mon, 31 Aug 2020 21:57:06 -0400
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Mon, 31 Aug 2020 21:56:44 -0400   Mon, 31 Aug 2020 21:57:46 -0400   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Mon, 31 Aug 2020 21:56:44 -0400   Mon, 31 Aug 2020 21:57:46 -0400   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Mon, 31 Aug 2020 21:56:44 -0400   Mon, 31 Aug 2020 21:57:46 -0400   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Mon, 31 Aug 2020 21:56:44 -0400   Mon, 31 Aug 2020 21:57:46 -0400   NodeStatusUnknown   Kubelet stopped posting node status.
Addresses:
  Hostname:    jiazha45-up-n9zw6-worker-centralus3-xs4h6
  InternalIP:  10.0.32.5
Capacity:
  attachable-volumes-azure-disk:  8
  cpu:                            4
  ephemeral-storage:              133665772Ki
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  memory:                         16392876Ki
  pods:                           250
Allocatable:
  attachable-volumes-azure-disk:  8
  cpu:                            3500m
  ephemeral-storage:              122112633448
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  memory:                         15241900Ki
  pods:                           250
System Info:
  Machine ID:                             f0d71024b5de4fd2bd1235b26fe1f168
  System UUID:                            c36ca4c3-4552-7d4e-8d69-b24cf2f6929e
  Boot ID:                                6fdbb306-b75b-498b-a23f-4bfb1d678746
  Kernel Version:                         4.18.0-193.14.3.el8_2.x86_64
  OS Image:                               Red Hat Enterprise Linux CoreOS 45.82.202008290529-0 (Ootpa)
  Operating System:                       linux
  Architecture:                           amd64
  Container Runtime Version:              cri-o://1.18.3-11.rhaos4.5.gite5bcc71.el8
  Kubelet Version:                        v1.18.3+6c42de8
  Kube-Proxy Version:                     v1.18.3+6c42de8
PodCIDR:                                  10.128.4.0/24
PodCIDRs:                                 10.128.4.0/24
ProviderID:                               azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jiazha45-up-n9zw6-rg/providers/Microsoft.Compute/virtualMachines/jiazha45-up-n9zw6-worker-centralus3-xs4h6
Non-terminated Pods:                      (16 in total)
  Namespace                               Name                               CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                               ----                               ------------  ----------  ---------------  -------------  ---
  openshift-cluster-node-tuning-operator  tuned-jkcgg                        10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         127m
  openshift-dns                           dns-default-sht5p                  65m (1%)      0 (0%)      110Mi (0%)       512Mi (3%)     107m
  openshift-image-registry                node-ca-9bklr                      10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         126m
  openshift-ingress                       router-default-68d9f9646b-qhtbx    100m (2%)     0 (0%)      256Mi (1%)       0 (0%)         103m
  openshift-local-storage                 example-local-diskmaker-wprn8      0 (0%)        0 (0%)      0 (0%)           0 (0%)         21h
  openshift-local-storage                 example-local-provisioner-hcf2z    0 (0%)        0 (0%)      0 (0%)           0 (0%)         21h
  openshift-logging                       fluentd-jxx5x                      100m (2%)     0 (0%)      736Mi (4%)       736Mi (4%)     96m
  openshift-machine-config-operator       machine-config-daemon-8r84w        40m (1%)      0 (0%)      100Mi (0%)       0 (0%)         107m
  openshift-monitoring                    node-exporter-zztm9                9m (0%)       0 (0%)      210Mi (1%)       0 (0%)         128m
  openshift-monitoring                    prometheus-k8s-1                   76m (2%)      0 (0%)      1184Mi (7%)      0 (0%)         126m
  openshift-monitoring                    thanos-querier-555c7dc77-v7c84     9m (0%)       0 (0%)      92Mi (0%)        0 (0%)         103m
  openshift-multus                        multus-xtkzh                       10m (0%)      0 (0%)      150Mi (1%)       0 (0%)         115m
  openshift-multus                        network-metrics-daemon-rvldf       20m (0%)      0 (0%)      120Mi (0%)       0 (0%)         118m
  openshift-ovn-kubernetes                ovnkube-node-jmftr                 20m (0%)      0 (0%)      600Mi (4%)       0 (0%)         118m
  openshift-ovn-kubernetes                ovnkube-node-metrics-88p4g         10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         118m
  openshift-ovn-kubernetes                ovs-node-vqj8b                     100m (2%)     0 (0%)      300Mi (2%)       0 (0%)         116m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                       Requests      Limits
  --------                       --------      ------
  cpu                            579m (16%)    0 (0%)
  memory                         3938Mi (26%)  1248Mi (8%)
  ephemeral-storage              0 (0%)        0 (0%)
  hugepages-1Gi                  0 (0%)        0 (0%)
  hugepages-2Mi                  0 (0%)        0 (0%)
  attachable-volumes-azure-disk  0             0
Events:
  Type    Reason              Age   From                                                Message
  ----    ------              ----  ----                                                -------
  Normal  NodeNotSchedulable  91m   kubelet, jiazha45-up-n9zw6-worker-centralus3-xs4h6  Node jiazha45-up-n9zw6-worker-centralus3-xs4h6 status is now: NodeNotSchedulable

Comment 2 Jian Zhang 2020-09-07 12:39:12 UTC
Encountered the same issue on AWS (UPI, FIPS) while upgrading from 4.5.8 to 4.6.
mac:~ jianzhang$ oc get co machine-config -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2020-09-07T03:02:46Z"
  generation: 1
  name: machine-config
  resourceVersion: "356022"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/machine-config
  uid: b0c52888-28ec-4ab9-97b5-f2440594d581
spec: {}
status:
  conditions:
  - lastTransitionTime: "2020-09-07T11:27:45Z"
    message: Cluster not available for 4.5.8
    status: "False"
    type: Available
  - lastTransitionTime: "2020-09-07T09:54:26Z"
    message: Cluster version is 4.5.8
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-09-07T11:27:45Z"
    message: 'Failed to resync 4.5.8 because: timed out waiting for the condition
      during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready.
      status: (desired: 7, updated: 7, ready: 5, unavailable: 2)'
    reason: MachineConfigDaemonFailed
    status: "True"
    type: Degraded
  - lastTransitionTime: "2020-09-07T03:03:40Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
  extension:
    master: all 3 nodes are at latest configuration rendered-master-e75ebfd7ae12d861b4475dcd26a844b4
    worker: 4 (ready 2) out of 4 nodes are updating to latest configuration rendered-worker-7801f86209954803c143bd06ad5e5cd8
  relatedObjects:
  - group: ""
    name: openshift-machine-config-operator
    resource: namespaces
  - group: machineconfiguration.openshift.io
    name: master
    resource: machineconfigpools
  - group: machineconfiguration.openshift.io
    name: worker
    resource: machineconfigpools
  - group: machineconfiguration.openshift.io
    name: machine-config-controller
    resource: controllerconfigs
  versions:
  - name: operator
    version: 4.5.8
mac:~ jianzhang$ oc get nodes
NAME                                        STATUS     ROLES    AGE     VERSION
ip-10-0-49-177.us-east-2.compute.internal   NotReady   worker   4h26m   v1.18.3+6c42de8
ip-10-0-50-216.us-east-2.compute.internal   Ready      master   9h      v1.18.3+6c42de8
ip-10-0-52-35.us-east-2.compute.internal    Ready      master   9h      v1.18.3+6c42de8
ip-10-0-56-124.us-east-2.compute.internal   Ready      worker   9h      v1.18.3+6c42de8
ip-10-0-63-76.us-east-2.compute.internal    Ready      worker   9h      v1.18.3+6c42de8
ip-10-0-64-229.us-east-2.compute.internal   Ready      master   9h      v1.18.3+6c42de8
ip-10-0-66-153.us-east-2.compute.internal   NotReady   worker   9h      v1.18.3+6c42de8
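
To see which worker nodes are stuck mid-update, the MCO's per-node annotations (currentConfig, desiredConfig, state, visible in the node describe output in comment 0) can be compared across nodes; a rough sketch:

# Compare the current vs. desired rendered config and update state per node (sketch)
oc describe nodes | grep -E 'Name:|machineconfiguration.openshift.io/(currentConfig|desiredConfig|state)'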

Here is the cluster for your debugging: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/110545/artifact/workdir/install-dir/auth/kubeconfig/*view*/

Comment 3 Sinny Kumari 2020-09-09 09:46:39 UTC
So far I am not sure what is causing this bug. I haven't found any indication that the MCO is at fault. See the detailed analysis below.

Issue described in comment #0 :
--------------------------------

- When I started looking at this bug, the cluster credentials provided ( https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/109204/artifact/workdir/install-dir/auth/kubeconfig/*view*/ ) were no longer accessible.
- Looking at the provided must-gather, there is no indication in the MCO logs of why worker node jiazha45-up-n9zw6-worker-centralus3-xs4h6 is not available. Since the node is not available, the must-gather doesn't have any logs from the corresponding daemon pod machine-config-daemon-8r84w.
- To reproduce this issue locally, I created a 4.5 cluster using clusterbot on Azure with OVN in a FIPS environment and upgraded to the available 4.6 nightlies. But it seems the network connection gets lost during the upgrade and I never regain access to the cluster. I am seeing the same behavior with all three nightlies that I tried: registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-09-07-224533, registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-09-03-063148, registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-09-07-162735.


Issue described in comment #2:
-------------------------------

This doesn't look like a straight upgrade from 4.5 to 4.6. I logged into the cluster and, while looking at the logs, noticed that this cluster was first upgraded from 4.4 to 4.5 and then to 4.6.

rpm-ostree status from one of the available worker nodes, ip-10-0-63-76.us-east-2.compute.internal:
sh-4.4# rpm-ostree status
State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9bcf0cf0009cceac80285b3dbca0b2cb3b9cf1f2c0f6a6bc642d2109a82501e0
              CustomOrigin: Managed by machine-config-operator
                   Version: 45.82.202008290529-0 (2020-08-29T05:33:25Z)

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cb56ec5c38333c5aa68c06521aaa044aa200bcd8d2c601034237edb142631dde
              CustomOrigin: Managed by machine-config-operator
                   Version: 44.82.202008250531-0 (2020-08-25T05:37:37Z)

Do we know whether the cluster, including all worker nodes, was fully upgraded when the first upgrade from 4.4 to 4.5 took place?

Can you please also share how frequently we are seeing this issue?
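
One way to answer the first question is to check the worker MachineConfigPool counters, which record whether the previous rollout actually finished (a sketch; the pool name "worker" is the default):

# Did the worker pool fully roll out during the previous upgrade? (sketch)
oc get machineconfigpool worker
oc get machineconfigpool worker -o jsonpath='{.status.machineCount} {.status.updatedMachineCount} {.status.readyMachineCount} {.status.degradedMachineCount}{"\n"}'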

Comment 4 Antonio Murdaca 2020-09-09 15:19:57 UTC
I agree with Sinny; nothing really points to something MCO-related. The fact that the MCP notices that some nodes aren't ready and degrades isn't a symptom of something wrong in the MCO (unless we caused it and have proof/logs of that).

Failing to schedule the MCD isn't a direct MCO problem; if the node isn't ready, we can't do anything but report that.

Comment 5 Jian Zhang 2020-09-10 11:45:31 UTC
Hi Sinny,

Thanks for your analysis!

For the first analysis:

There is a cluster (IPI Azure OVN FIPS etcd_encryption) where the MCO failed to upgrade to 4.6. But it seems like it's not the same issue. Hope it helps.
https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/111181/artifact/workdir/install-dir/auth/kubeconfig/*view*/

For the second analysis:
> Do we know whether the cluster, including all worker nodes, was fully upgraded when the first upgrade from 4.4 to 4.5 took place?

I guess so; no errors were reported for the 4.4 to 4.5 upgrade.

> Can you please also share how frequently we are seeing this issue?

I'm not sure; we created one cluster and hit this issue. Now I have created two clusters (4.4.20 -> 4.5.9 -> 4.6). One cluster (UPI AWS OVN FIPS etcd_encryption): https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/111273/artifact/workdir/install-dir/auth/kubeconfig/*view*/ worked well for 4.4 to 4.5 and is now upgrading to 4.6.

The other one (UPI AWS FIPS) is now upgrading to 4.5 and will upgrade to 4.6 soon.
https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/111316/artifact/workdir/install-dir/auth/kubeconfig/*view*/

Comment 6 Jian Zhang 2020-09-10 11:52:52 UTC
I also created a cluster with the same profile (IPI Azure OVN FIPS) as comment 0:
[root@preserve-olm-env data]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-09-10-073112   True        False         19m     Cluster version is 4.5.0-0.nightly-2020-09-10-073112

[root@preserve-olm-env data]# oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release@sha256:1ca476cde944c29e370bbb3759df256fce191fa5849f726d8db3304040175505 --force  --allow-explicit-upgrade

[root@preserve-olm-env data]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-09-10-073112   True        True          2m41s   Working towards 4.6.0-0.nightly-2020-09-10-054902: 11% complete

Now, it's upgrading. https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/111331/artifact/workdir/install-dir/auth/kubeconfig/*view*/

Comment 7 Sinny Kumari 2020-09-10 16:16:50 UTC
Thank you, Jian, for running the extra set of upgrade tests. Based on your new upgrade results, it seems the MCO is working fine and is not causing the upgrade issue. If you agree, can we close this bug?

Comment 8 Colin Walters 2020-09-10 21:55:22 UTC
I think there are OVN issues in general and I wouldn't be surprised if we somehow broke 4.5 -> 4.6 upgrades w/ovn.

Comment 9 Jian Zhang 2020-09-11 01:13:24 UTC
Hi Sinny,

As you can see, for https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/111331/artifact/workdir/install-dir/auth/kubeconfig/*view*/,
the MCO is still at the 4.5 version. And we did hit the "MachineConfigDaemonFailed" error twice before. I will remove the OVN profile and try to reproduce this bug.

[root@preserve-olm-env data]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-09-10-073112   True        True          13h     Unable to apply 4.6.0-0.nightly-2020-09-10-054902: an unknown error has occurred: MultipleErrors
[root@preserve-olm-env data]# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-09-10-054902   False       True          True       28m
machine-config                             4.5.0-0.nightly-2020-09-10-073112   True        False         False      13h
image-registry                             4.6.0-0.nightly-2020-09-10-054902   True        True          True       13h
monitoring                                 4.6.0-0.nightly-2020-09-10-054902   False       False         True       4h52m
openshift-apiserver                        4.6.0-0.nightly-2020-09-10-054902   False       False         False      2m22s
dns                                        4.5.0-0.nightly-2020-09-10-073112   True        False         False      13h
...

[root@preserve-olm-env data]# oc get co openshift-apiserver -o yaml
...
  - lastTransitionTime: "2020-09-11T00:57:40Z"
    message: 'APIServicesAvailable: "project.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)'
    reason: APIServices_Error

[root@preserve-olm-env data]# oc get co authentication -o yaml
...
status:
  conditions:
  - lastTransitionTime: "2020-09-10T12:30:36Z"
    message: |-
      OAuthRouteCheckEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps.jiazha0910.qe.azure.devcluster.openshift.com/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
      OAuthServiceEndpointsCheckEndpointAccessibleControllerDegraded: Get "https://10.128.0.9:6443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
      OAuthServiceCheckEndpointAccessibleControllerDegraded: Get "https://172.30.117.188:443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
      OAuthServerDeploymentDegraded: Unable to get "openshift-browser-client" bootstrapped OAuth client: the server is currently unable to handle the request (post oauthclients.oauth.openshift.io)
    reason: OAuthRouteCheckEndpointAccessibleController_SyncError::OAuthServerDeployment_GetFailed::OAuthServiceCheckEndpointAccessibleController_SyncError::OAuthServiceEndpointsCheckEndpointAccessibleController_SyncError
...

For https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/111273/artifact/workdir/install-dir/auth/kubeconfig/*view*/, it is still upgrading.
[root@preserve-olm-env data]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.9     True        True          12h     Working towards 4.6.0-0.nightly-2020-09-10-011413: 15% complete

Comment 10 Jian Zhang 2020-09-11 08:45:50 UTC
I had a try without the OVN profile (IPI Azure FIPS): https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/111444/artifact/workdir/install-dir/auth/kubeconfig/*view*/
It upgraded well; I didn't hit the "MachineConfigDaemonFailed" error.
[root@preserve-olm-env data]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-09-10-073112   True        True          20s     Working towards 4.6.0-0.nightly-2020-09-10-195619: 0% complete
[root@preserve-olm-env data]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-10-195619   True        False         75m     Cluster version is 4.6.0-0.nightly-2020-09-10-195619


I am moving this bug to the OVN team first; the two clusters listed in comment 9 are available for your debugging. Please transfer it to the appropriate component if you know a better fit, thanks!

Comment 14 Jian Zhang 2020-09-18 02:17:22 UTC
Hi Aniket,

Sure, and I have linked PR 269 here.

Comment 15 Jian Zhang 2020-09-18 08:17:11 UTC
Hi Aniket,

I tested 2 clusters with OVN; both upgrades failed. Details:

Starting with this payload (4.6.0-0.nightly-2020-09-17-195238), PR 269 is merged in:
[root@preserve-olm-env data]# oc adm release info registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-09-17-195238 --commits |grep ovn
  ovn-kubernetes                                 https://github.com/openshift/ovn-kubernetes                                 efa6de93497d3fdd81ba5706669c54529176691c


1) IPI on AWS & FIPS on & OVN: Upgrade 4.5.0-0.nightly-2020-09-17-145245 to 4.6.0-0.nightly-2020-09-17-195238
https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/upgrade_CI/5229/console

Still failed; here is the cluster for your debugging:
https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/112933/artifact/workdir/install-dir/auth/kubeconfig/*view*/

[root@preserve-olm-env data]# oc get clusterversion

NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-09-17-145245   True        True          3h23m   Unable to apply 4.6.0-0.nightly-2020-09-17-195238: the control plane is reporting an internal error


[root@preserve-olm-env data]# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
dns                                        4.5.0-0.nightly-2020-09-17-145245   True        True          False      5h18m
...
kube-apiserver                             4.6.0-0.nightly-2020-09-17-195238   True        True          False      5h16m
kube-controller-manager                    4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h17m
kube-scheduler                             4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h16m
kube-storage-version-migrator              4.6.0-0.nightly-2020-09-17-195238   False       False         False      171m
machine-api                                4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h10m
machine-approver                           4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h16m
machine-config                             4.5.0-0.nightly-2020-09-17-145245   True        False         False      5h18m
marketplace                                4.6.0-0.nightly-2020-09-17-195238   True        False         False      3h2m
monitoring                                 4.6.0-0.nightly-2020-09-17-195238   True        False         False      172m
network                                    4.5.0-0.nightly-2020-09-17-145245   True        True          True       5h20m
...
storage                                    4.6.0-0.nightly-2020-09-17-195238   True        True          False      174m

2) IPI on Azure & FIPS on & OVN & Etcd Encryption on: Upgrade 4.5.0-0.nightly-2020-09-17-145245 to 4.6.0-0.nightly-2020-09-17-195238
https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/upgrade_CI/5230/console

Still failed; here is the cluster for your debugging:
https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/112935/artifact/workdir/install-dir/auth/kubeconfig/*view*/

[root@preserve-olm-env data]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-09-17-145245   True        True          3h2m    Working towards 4.6.0-0.nightly-2020-09-17-195238: 1% complete

[root@preserve-olm-env data]# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-09-17-195238   True        False         False      143m
cloud-credential                           4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h23m
cluster-autoscaler                         4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h10m
config-operator                            4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h9m
console                                    4.6.0-0.nightly-2020-09-17-195238   True        False         False      153m
csi-snapshot-controller                    4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h1m
dns                                        4.5.0-0.nightly-2020-09-17-145245   True        True          False      5h15m
etcd                                       4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h15m
image-registry                             4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h2m
ingress                                    4.6.0-0.nightly-2020-09-17-195238   True        False         False      155m
insights                                   4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h11m
kube-apiserver                             4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h15m
kube-controller-manager                    4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h15m
kube-scheduler                             4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h14m
kube-storage-version-migrator              4.6.0-0.nightly-2020-09-17-195238   False       False         False      144m
machine-api                                4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h7m
machine-approver                           4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h10m
machine-config                             4.5.0-0.nightly-2020-09-17-145245   True        False         False      5h8m
...
network                                    4.5.0-0.nightly-2020-09-17-145245   True        True          True       5h17m
...

Comment 16 Aniket Bhat 2020-09-22 13:05:18 UTC
Jian,

Looks like the cluster got deprovisioned over the weekend. I started an upgrade job on Azure with OVN using a downstream PR that has more fixes on the ovn-k side, but that seems to have failed as well. I will investigate. Meanwhile, can you create a reproducer cluster again?

Comment 17 Aniket Bhat 2020-09-23 15:43:33 UTC
Turns out the job failures were Terraform-related. My latest run of the upgrade job, https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/1308770921846149120, seems to have passed with CNO PR 801 and downstream ovn-kubernetes PR 281. Once those PRs land, this can be verified.

@Jian: Meanwhile, can you try to run the upgrade with the above two PRs?

Comment 18 Jian Zhang 2020-09-24 08:11:59 UTC
Hi Aniket,

> Meanwhile, can you create a reproducer cluster again?

Sure, but in fact we provided two clusters early last Friday. This time, could you help debug them as soon as possible?
You know, the clusters consume cloud resources; thanks for your understanding!

IPI on AWS & FIPS on & OVN: Upgrade 4.5.0-0.nightly-2020-09-20-185910 to 4.6.0-0.nightly-2020-09-24-030538
The cluster is being created: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/114283/

> @Jian: Meanwhile, can you try to run the upgrade with the above two PRs?

Sorry, as you can see, this PR (https://github.com/openshift/ovn-kubernetes/pull/281) hasn't been merged yet! That means no release image contains it. I couldn't find a way to update an OCP 4.5 cluster to OCP 4.6 without a release image.
And I don't think the cluster-bot can do that, since there is no update template for the OVN profiles; correct me if I'm wrong, thanks!
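
For reference, unmerged PRs can sometimes be tested by assembling a custom release payload that swaps in PR-built images; this is only a sketch, where the image pullspecs are placeholders and the component tag names (cluster-network-operator, ovn-kubernetes) are assumed to match the tags in the release payload:

# Build a throwaway release image with the PR-built components swapped in (sketch)
oc adm release new \
  --from-release registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-09-24-030538 \
  cluster-network-operator=<image-built-from-CNO-PR-801> \
  ovn-kubernetes=<image-built-from-ovn-kubernetes-PR-281> \
  --to-image <your-registry>/ocp-release:4.6-ovn-test
# Then upgrade to it, as in comment 0:
oc adm upgrade --to-image=<your-registry>/ocp-release:4.6-ovn-test --force --allow-explicit-upgrade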

Comment 19 Jian Zhang 2020-09-24 11:05:04 UTC
Upgrade failed; here is the cluster for your debugging: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/114283/artifact/workdir/install-dir/auth/kubeconfig/*view*/

[root@preserve-olm-env data]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-09-20-185910   True        True          133m    Unable to apply 4.6.0-0.nightly-2020-09-24-030538: the control plane is reporting an internal error
[root@preserve-olm-env data]# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-09-24-030538   False       True          False      99m
cloud-credential                           4.6.0-0.nightly-2020-09-24-030538   True        False         False      172m
cluster-autoscaler                         4.6.0-0.nightly-2020-09-24-030538   True        False         False      157m
config-operator                            4.6.0-0.nightly-2020-09-24-030538   True        False         False      157m
console                                    4.6.0-0.nightly-2020-09-24-030538   True        False         False      110m
csi-snapshot-controller                    4.6.0-0.nightly-2020-09-24-030538   True        False         False      149m
dns                                        4.5.0-0.nightly-2020-09-20-185910   True        True          False      161m
etcd                                       4.6.0-0.nightly-2020-09-24-030538   True        False         False      161m
image-registry                             4.6.0-0.nightly-2020-09-24-030538   True        False         False      150m
ingress                                    4.6.0-0.nightly-2020-09-24-030538   True        False         False      112m
insights                                   4.6.0-0.nightly-2020-09-24-030538   True        False         False      158m
kube-apiserver                             4.6.0-0.nightly-2020-09-24-030538   True        True          False      160m
kube-controller-manager                    4.6.0-0.nightly-2020-09-24-030538   True        False         False      161m
kube-scheduler                             4.6.0-0.nightly-2020-09-24-030538   True        False         False      159m
kube-storage-version-migrator              4.6.0-0.nightly-2020-09-24-030538   True        False         False      130m
machine-api                                4.6.0-0.nightly-2020-09-24-030538   True        False         False      154m
machine-approver                           4.6.0-0.nightly-2020-09-24-030538   True        False         False      160m
machine-config                             4.5.0-0.nightly-2020-09-20-185910   True        False         False      130m
marketplace                                4.6.0-0.nightly-2020-09-24-030538   True        False         False      111m
monitoring                                 4.6.0-0.nightly-2020-09-24-030538   True        False         False      101m
network                                    4.5.0-0.nightly-2020-09-20-185910   True        True          True       163m
node-tuning                                4.6.0-0.nightly-2020-09-24-030538   True        False         False      111m
openshift-apiserver                        4.6.0-0.nightly-2020-09-24-030538   True        False         False      99m
openshift-controller-manager               4.6.0-0.nightly-2020-09-24-030538   True        False         False      126m
openshift-samples                          4.6.0-0.nightly-2020-09-24-030538   True        False         False      111m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-09-24-030538   True        False         False      161m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-09-24-030538   True        False         False      162m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-09-24-030538   True        False         False      110m
service-ca                                 4.6.0-0.nightly-2020-09-24-030538   True        False         False      162m
storage                                    4.6.0-0.nightly-2020-09-24-030538   True        False         False      103m

Comment 22 zhaozhanqi 2020-09-27 12:21:14 UTC
Assigning this issue according to https://bugzilla.redhat.com/show_bug.cgi?id=1880591#c25.

Comment 23 zhaozhanqi 2020-09-28 03:27:55 UTC
Since the original issue in this bug has already been fixed, and we have another bug, https://bugzilla.redhat.com/show_bug.cgi?id=1880591, to track the upgrade issue,
I'd like to move this bug to 'verified'. Please reopen it if this issue still happens. Thanks.

Comment 26 errata-xmlrpc 2020-10-27 16:36:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

Comment 27 W. Trevor King 2021-04-05 17:46:02 UTC
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475

