Description of problem:

The cluster has been stuck in this status for hours:

status:
  conditions:
  - lastTransitionTime: "2020-09-01T01:39:18Z"
    message: Working towards 4.6.0-0.nightly-2020-08-31-194600
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-09-01T01:53:58Z"
    message: 'Unable to apply 4.6.0-0.nightly-2020-08-31-194600: timed out waiting
      for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon
      is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)'
    reason: MachineConfigDaemonFailed
    status: "True"
    type: Degraded
  - lastTransitionTime: "2020-09-01T01:37:12Z"
    message: Cluster not available for 4.6.0-0.nightly-2020-08-31-194600
    status: "False"
    type: Available
  - lastTransitionTime: "2020-08-31T01:51:00Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
  extension:
    master: all 3 nodes are at latest configuration rendered-master-d6f41577113d3cd74dda97f9527109fb
    worker: 2 (ready 2) out of 3 nodes are updating to latest configuration rendered-worker-3ac113826d1787d04061a8d40f63eab4

Version-Release number of selected component (if applicable):
Upgrade from 4.5.0-0.nightly-2020-08-29-080432 to 4.6.0-0.nightly-2020-08-31-194600.

How reproducible:
Tried once.

Steps to Reproduce:
1. Install OCP 4.5.0-0.nightly-2020-08-29-080432 (IPI OVN FIPS) on Azure.
2. Run some tests, then upgrade it to 4.6:

oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release@sha256:2bc7f7acaf336e1279daa796be32913c9137deb528840e5c2985d750f8a0e4c1 --force --allow-explicit-upgrade
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade to the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.svc.ci.openshift.org/ocp/release@sha256:2bc7f7acaf336e1279daa796be32913c9137deb528840e5c2985d750f8a0e4c1

[root@preserve-olm-env data]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-08-29-080432   True        True          39s     Working towards registry.svc.ci.openshift.org/ocp/release@sha256:2bc7f7acaf336e1279daa796be32913c9137deb528840e5c2985d750f8a0e4c1: downloading update

[root@preserve-olm-env data]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-08-29-080432   True        True          15m     Working towards 4.6.0-0.nightly-2020-08-31-194600: 18% complete
...

Actual results:
The upgrade failed.

[root@preserve-olm-env data]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-08-29-080432   True        True          146m    Unable to apply 4.6.0-0.nightly-2020-08-31-194600: the cluster operator monitoring is degraded

[root@preserve-olm-env data]# oc get co
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
...
machine-config   4.5.0-0.nightly-2020-08-29-080432   False       True          True       105m
...
monitoring       4.6.0-0.nightly-2020-08-31-194600   False       True          True       80m

Expected results:
The upgrade succeeds.

Here is the cluster for your debugging:
https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/109204/artifact/workdir/install-dir/auth/kubeconfig/*view*/

Additional info:

[root@preserve-olm-env data]# oc get co machine-config -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
...
spec: {}
status:
  conditions:
  - lastTransitionTime: "2020-09-01T01:39:18Z"
    message: Working towards 4.6.0-0.nightly-2020-08-31-194600
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-09-01T01:53:58Z"
    message: 'Unable to apply 4.6.0-0.nightly-2020-08-31-194600: timed out waiting
      for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon
      is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)'
    reason: MachineConfigDaemonFailed
    status: "True"
    type: Degraded
  - lastTransitionTime: "2020-09-01T01:37:12Z"
    message: Cluster not available for 4.6.0-0.nightly-2020-08-31-194600
    status: "False"
    type: Available
  - lastTransitionTime: "2020-08-31T01:51:00Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
  extension:
    master: all 3 nodes are at latest configuration rendered-master-d6f41577113d3cd74dda97f9527109fb
    worker: 2 (ready 2) out of 3 nodes are updating to latest configuration rendered-worker-3ac113826d1787d04061a8d40f63eab4
  relatedObjects:
  - group: ""
    name: openshift-machine-config-operator
    resource: namespaces
  - group: machineconfiguration.openshift.io
    name: ""
    resource: machineconfigpools
  - group: machineconfiguration.openshift.io
    name: ""
    resource: controllerconfigs
  - group: machineconfiguration.openshift.io
    name: ""
    resource: kubeletconfigs
  - group: machineconfiguration.openshift.io
    name: ""
    resource: containerruntimeconfigs
  - group: machineconfiguration.openshift.io
    name: ""
    resource: machineconfigs
  versions:
  - name: operator
    version: 4.5.0-0.nightly-2020-08-29-080432

[root@preserve-olm-env data]# oc get nodes
NAME                                        STATUS                        ROLES    AGE   VERSION
jiazha45-up-n9zw6-master-0                  Ready                         master   25h   v1.19.0-rc.2+f71a7ab-dirty
jiazha45-up-n9zw6-master-1                  Ready                         master   25h   v1.19.0-rc.2+f71a7ab-dirty
jiazha45-up-n9zw6-master-2                  Ready                         master   25h   v1.19.0-rc.2+f71a7ab-dirty
jiazha45-up-n9zw6-worker-centralus1-qnrbm   Ready                         worker   25h   v1.19.0-rc.2+f71a7ab-dirty
jiazha45-up-n9zw6-worker-centralus2-scdb2   Ready                         worker   25h   v1.19.0-rc.2+f71a7ab-dirty
jiazha45-up-n9zw6-worker-centralus3-xs4h6   NotReady,SchedulingDisabled   worker   25h   v1.18.3+6c42de8

[root@preserve-olm-env data]# oc get pods -n openshift-machine-config-operator -o wide
NAME                                         READY   STATUS    RESTARTS   AGE    IP            NODE                                        NOMINATED NODE   READINESS GATES
machine-config-controller-6c758bf7d6-pr76k   1/1     Running   0          99m    10.130.0.74   jiazha45-up-n9zw6-master-0                  <none>           <none>
machine-config-daemon-8r84w                  2/2     Running   0          110m   10.0.32.5     jiazha45-up-n9zw6-worker-centralus3-xs4h6   <none>           <none>
machine-config-daemon-g7v8x                  2/2     Running   0          109m   10.0.32.4     jiazha45-up-n9zw6-worker-centralus2-scdb2   <none>           <none>
machine-config-daemon-msxzr                  2/2     Running   0          108m   10.0.32.6     jiazha45-up-n9zw6-worker-centralus1-qnrbm   <none>           <none>
machine-config-daemon-x8vvz                  2/2     Running   0          109m   10.0.0.8      jiazha45-up-n9zw6-master-0                  <none>           <none>
machine-config-daemon-xljdv                  2/2     Running   0          110m   10.0.0.5      jiazha45-up-n9zw6-master-2                  <none>           <none>
machine-config-daemon-xr92n                  2/2     Running   0          110m   10.0.0.7      jiazha45-up-n9zw6-master-1                  <none>           <none>
machine-config-operator-86b665fb84-7zv45     1/1     Running   0          93m    10.130.0.86   jiazha45-up-n9zw6-master-0                  <none>           <none>
machine-config-server-9jr9z                  1/1     Running   0          106m   10.0.0.5      jiazha45-up-n9zw6-master-2                  <none>           <none>
machine-config-server-m6q85                  1/1     Running   0          106m   10.0.0.7      jiazha45-up-n9zw6-master-1                  <none>           <none>
machine-config-server-q2fbh                  1/1     Running   0          105m   10.0.0.8      jiazha45-up-n9zw6-master-0                  <none>           <none>

[root@preserve-olm-env data]# oc -n openshift-machine-config-operator logs machine-config-daemon-8r84w -c machine-config-daemon
"https://10.0.32.5:10250/containerLogs/openshift-machine-config-operator/machine-config-daemon-8r84w/machine-config-daemon": net/http: TLS handshake timeout [root@preserve-olm-env data]# oc describe nodes jiazha45-up-n9zw6-worker-centralus3-xs4h6 Name: jiazha45-up-n9zw6-worker-centralus3-xs4h6 Roles: worker Labels: beta.kubernetes.io/arch=amd64 beta.kubernetes.io/instance-type=Standard_D4s_v3 beta.kubernetes.io/os=linux failure-domain.beta.kubernetes.io/region=centralus failure-domain.beta.kubernetes.io/zone=centralus-3 kubernetes.io/arch=amd64 kubernetes.io/hostname=jiazha45-up-n9zw6-worker-centralus3-xs4h6 kubernetes.io/os=linux node-role.kubernetes.io/worker= node.kubernetes.io/instance-type=Standard_D4s_v3 node.openshift.io/os_id=rhcos topology.kubernetes.io/region=centralus topology.kubernetes.io/zone=centralus-3 Annotations: k8s.ovn.org/l3-gateway-config: {"default":{"mode":"local","interface-id":"br-local_jiazha45-up-n9zw6-worker-centralus3-xs4h6","mac-address":"00:00:a9:fe:21:02","ip-addre... k8s.ovn.org/node-chassis-id: 252de6a9-fed6-4462-af22-d2f7ea5235c8 k8s.ovn.org/node-join-subnets: {"default":"100.64.4.0/29"} k8s.ovn.org/node-mgmt-port-mac-address: 92:2c:97:18:64:4f k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.32.5/19"} k8s.ovn.org/node-subnets: {"default":"10.128.2.0/23"} machine.openshift.io/machine: openshift-machine-api/jiazha45-up-n9zw6-worker-centralus3-xs4h6 machineconfiguration.openshift.io/currentConfig: rendered-worker-454a549a60f78f18a3046963bebcd881 machineconfiguration.openshift.io/desiredConfig: rendered-worker-3ac113826d1787d04061a8d40f63eab4 machineconfiguration.openshift.io/reason: machineconfiguration.openshift.io/state: Working volumes.kubernetes.io/controller-managed-attach-detach: true CreationTimestamp: Sun, 30 Aug 2020 22:04:42 -0400 Taints: node.kubernetes.io/unreachable:NoExecute node.kubernetes.io/unreachable:NoSchedule node.kubernetes.io/unschedulable:NoSchedule Unschedulable: true Lease: HolderIdentity: jiazha45-up-n9zw6-worker-centralus3-xs4h6 AcquireTime: <unset> RenewTime: Mon, 31 Aug 2020 21:57:06 -0400 Conditions: Type Status LastHeartbeatTime LastTransitionTime Reason Message ---- ------ ----------------- ------------------ ------ ------- MemoryPressure Unknown Mon, 31 Aug 2020 21:56:44 -0400 Mon, 31 Aug 2020 21:57:46 -0400 NodeStatusUnknown Kubelet stopped posting node status. DiskPressure Unknown Mon, 31 Aug 2020 21:56:44 -0400 Mon, 31 Aug 2020 21:57:46 -0400 NodeStatusUnknown Kubelet stopped posting node status. PIDPressure Unknown Mon, 31 Aug 2020 21:56:44 -0400 Mon, 31 Aug 2020 21:57:46 -0400 NodeStatusUnknown Kubelet stopped posting node status. Ready Unknown Mon, 31 Aug 2020 21:56:44 -0400 Mon, 31 Aug 2020 21:57:46 -0400 NodeStatusUnknown Kubelet stopped posting node status. 
Addresses:
  Hostname:    jiazha45-up-n9zw6-worker-centralus3-xs4h6
  InternalIP:  10.0.32.5
Capacity:
  attachable-volumes-azure-disk:  8
  cpu:                            4
  ephemeral-storage:              133665772Ki
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  memory:                         16392876Ki
  pods:                           250
Allocatable:
  attachable-volumes-azure-disk:  8
  cpu:                            3500m
  ephemeral-storage:              122112633448
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  memory:                         15241900Ki
  pods:                           250
System Info:
  Machine ID:                 f0d71024b5de4fd2bd1235b26fe1f168
  System UUID:                c36ca4c3-4552-7d4e-8d69-b24cf2f6929e
  Boot ID:                    6fdbb306-b75b-498b-a23f-4bfb1d678746
  Kernel Version:             4.18.0-193.14.3.el8_2.x86_64
  OS Image:                   Red Hat Enterprise Linux CoreOS 45.82.202008290529-0 (Ootpa)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.18.3-11.rhaos4.5.gite5bcc71.el8
  Kubelet Version:            v1.18.3+6c42de8
  Kube-Proxy Version:         v1.18.3+6c42de8
PodCIDR:                      10.128.4.0/24
PodCIDRs:                     10.128.4.0/24
ProviderID:                   azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jiazha45-up-n9zw6-rg/providers/Microsoft.Compute/virtualMachines/jiazha45-up-n9zw6-worker-centralus3-xs4h6
Non-terminated Pods:          (16 in total)
  Namespace                                 Name                              CPU Requests   CPU Limits   Memory Requests   Memory Limits   AGE
  ---------                                 ----                              ------------   ----------   ---------------   -------------   ---
  openshift-cluster-node-tuning-operator    tuned-jkcgg                       10m (0%)       0 (0%)       50Mi (0%)         0 (0%)          127m
  openshift-dns                             dns-default-sht5p                 65m (1%)       0 (0%)       110Mi (0%)        512Mi (3%)      107m
  openshift-image-registry                  node-ca-9bklr                     10m (0%)       0 (0%)       10Mi (0%)         0 (0%)          126m
  openshift-ingress                         router-default-68d9f9646b-qhtbx   100m (2%)      0 (0%)       256Mi (1%)        0 (0%)          103m
  openshift-local-storage                   example-local-diskmaker-wprn8     0 (0%)         0 (0%)       0 (0%)            0 (0%)          21h
  openshift-local-storage                   example-local-provisioner-hcf2z   0 (0%)         0 (0%)       0 (0%)            0 (0%)          21h
  openshift-logging                         fluentd-jxx5x                     100m (2%)      0 (0%)       736Mi (4%)        736Mi (4%)      96m
  openshift-machine-config-operator         machine-config-daemon-8r84w       40m (1%)       0 (0%)       100Mi (0%)        0 (0%)          107m
  openshift-monitoring                      node-exporter-zztm9               9m (0%)        0 (0%)       210Mi (1%)        0 (0%)          128m
  openshift-monitoring                      prometheus-k8s-1                  76m (2%)       0 (0%)       1184Mi (7%)       0 (0%)          126m
  openshift-monitoring                      thanos-querier-555c7dc77-v7c84    9m (0%)        0 (0%)       92Mi (0%)         0 (0%)          103m
  openshift-multus                          multus-xtkzh                      10m (0%)       0 (0%)       150Mi (1%)        0 (0%)          115m
  openshift-multus                          network-metrics-daemon-rvldf      20m (0%)       0 (0%)       120Mi (0%)        0 (0%)          118m
  openshift-ovn-kubernetes                  ovnkube-node-jmftr                20m (0%)       0 (0%)       600Mi (4%)        0 (0%)          118m
  openshift-ovn-kubernetes                  ovnkube-node-metrics-88p4g        10m (0%)       0 (0%)       20Mi (0%)         0 (0%)          118m
  openshift-ovn-kubernetes                  ovs-node-vqj8b                    100m (2%)      0 (0%)       300Mi (2%)        0 (0%)          116m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                       Requests      Limits
  --------                       --------      ------
  cpu                            579m (16%)    0 (0%)
  memory                         3938Mi (26%)  1248Mi (8%)
  ephemeral-storage              0 (0%)        0 (0%)
  hugepages-1Gi                  0 (0%)        0 (0%)
  hugepages-2Mi                  0 (0%)        0 (0%)
  attachable-volumes-azure-disk  0             0
Events:
  Type    Reason               Age   From                                                Message
  ----    ------               ----  ----                                                -------
  Normal  NodeNotSchedulable   91m   kubelet, jiazha45-up-n9zw6-worker-centralus3-xs4h6  Node jiazha45-up-n9zw6-worker-centralus3-xs4h6 status is now: NodeNotSchedulable
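For anyone triaging the same MachineConfigDaemonFailed condition, here is a minimal sketch of plain oc commands to confirm which daemon pod is blocking the rollout before digging further (the node name below is the one from this report; the debug step only works once the node is reachable again):

# Which DaemonSet is the MCO waiting on, and how many of its pods are ready?
oc -n openshift-machine-config-operator get daemonset machine-config-daemon

# Map the daemon pods to nodes to find the one stuck on the NotReady node.
oc -n openshift-machine-config-operator get pods -o wide

# While the kubelet is unreachable (TLS handshake timeout as above), a debug pod
# cannot be scheduled on the node; use the cloud console / VM serial log instead.
# Once the node recovers, the kubelet journal usually explains the outage:
oc debug node/jiazha45-up-n9zw6-worker-centralus3-xs4h6 -- chroot /host journalctl -u kubelet --no-pager | tail -n 50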
Encountered the same issue on AWS (UPI, FIPS) during an upgrade from 4.5.8 to 4.6.

mac:~ jianzhang$ oc get co machine-config -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  creationTimestamp: "2020-09-07T03:02:46Z"
  generation: 1
  name: machine-config
  resourceVersion: "356022"
  selfLink: /apis/config.openshift.io/v1/clusteroperators/machine-config
  uid: b0c52888-28ec-4ab9-97b5-f2440594d581
spec: {}
status:
  conditions:
  - lastTransitionTime: "2020-09-07T11:27:45Z"
    message: Cluster not available for 4.5.8
    status: "False"
    type: Available
  - lastTransitionTime: "2020-09-07T09:54:26Z"
    message: Cluster version is 4.5.8
    status: "False"
    type: Progressing
  - lastTransitionTime: "2020-09-07T11:27:45Z"
    message: 'Failed to resync 4.5.8 because: timed out waiting for the condition
      during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready.
      status: (desired: 7, updated: 7, ready: 5, unavailable: 2)'
    reason: MachineConfigDaemonFailed
    status: "True"
    type: Degraded
  - lastTransitionTime: "2020-09-07T03:03:40Z"
    reason: AsExpected
    status: "True"
    type: Upgradeable
  extension:
    master: all 3 nodes are at latest configuration rendered-master-e75ebfd7ae12d861b4475dcd26a844b4
    worker: 4 (ready 2) out of 4 nodes are updating to latest configuration rendered-worker-7801f86209954803c143bd06ad5e5cd8
  relatedObjects:
  - group: ""
    name: openshift-machine-config-operator
    resource: namespaces
  - group: machineconfiguration.openshift.io
    name: master
    resource: machineconfigpools
  - group: machineconfiguration.openshift.io
    name: worker
    resource: machineconfigpools
  - group: machineconfiguration.openshift.io
    name: machine-config-controller
    resource: controllerconfigs
  versions:
  - name: operator
    version: 4.5.8

mac:~ jianzhang$ oc get nodes
NAME                                        STATUS     ROLES    AGE     VERSION
ip-10-0-49-177.us-east-2.compute.internal   NotReady   worker   4h26m   v1.18.3+6c42de8
ip-10-0-50-216.us-east-2.compute.internal   Ready      master   9h      v1.18.3+6c42de8
ip-10-0-52-35.us-east-2.compute.internal    Ready      master   9h      v1.18.3+6c42de8
ip-10-0-56-124.us-east-2.compute.internal   Ready      worker   9h      v1.18.3+6c42de8
ip-10-0-63-76.us-east-2.compute.internal    Ready      worker   9h      v1.18.3+6c42de8
ip-10-0-64-229.us-east-2.compute.internal   Ready      master   9h      v1.18.3+6c42de8
ip-10-0-66-153.us-east-2.compute.internal   NotReady   worker   9h      v1.18.3+6c42de8

Here is the cluster for your debugging:
https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/110545/artifact/workdir/install-dir/auth/kubeconfig/*view*/
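Side note: since the Degraded message only reports pod counts, a quick way to see which nodes are stuck between rendered configs is to compare the machine-config-daemon annotations on each node. A minimal sketch, assuming cluster-admin access:

# Nodes whose CURRENT differs from DESIRED are still mid-update; STATE shows
# what the machine-config-daemon on that node is doing (e.g. Working, Done).
oc get nodes -o custom-columns='NAME:.metadata.name,STATE:.metadata.annotations.machineconfiguration\.openshift\.io/state,CURRENT:.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig,DESIRED:.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig'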
So far I am not sure what is causing this bug; I haven't found any indication that the MCO is buggy. See the detailed analysis below.

Issue described in comment #0:
--------------------------------
- When I started looking at this bug, the cluster credentials provided ( https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/109204/artifact/workdir/install-dir/auth/kubeconfig/*view*/ ) were no longer accessible.
- Looking at the provided must-gather, there is no indication in the MCO log of why worker node jiazha45-up-n9zw6-worker-centralus3-xs4h6 is not available. Since the node is not available, the must-gather doesn't have any log from the corresponding daemon pod machine-config-daemon-8r84w.
- To reproduce this issue locally, I created a 4.5 cluster using cluster-bot on Azure with OVN and FIPS, and upgraded to the available 4.6 nightlies. But the network connection seems to get lost during the upgrade and I never regain access to the cluster. I'm seeing the same behavior with all three nightlies that I tried: registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-09-07-224533, registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-09-03-063148, and registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-09-07-162735.

Issue described in comment #2:
-------------------------------
This doesn't look like a straight upgrade from 4.5 to 4.6. I logged into the cluster and, while looking at the logs, noticed that this cluster was first upgraded from 4.4 to 4.5 and then to 4.6.

Status from one of the available worker nodes, ip-10-0-63-76.us-east-2.compute.internal:

sh-4.4# rpm-ostree status
State: idle
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9bcf0cf0009cceac80285b3dbca0b2cb3b9cf1f2c0f6a6bc642d2109a82501e0
              CustomOrigin: Managed by machine-config-operator
                   Version: 45.82.202008290529-0 (2020-08-29T05:33:25Z)

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cb56ec5c38333c5aa68c06521aaa044aa200bcd8d2c601034237edb142631dde
              CustomOrigin: Managed by machine-config-operator
                   Version: 44.82.202008250531-0 (2020-08-25T05:37:37Z)

Do we know that the cluster was fully upgraded, including all worker nodes, when the 4.4 to 4.5 upgrade took place? Can you please also share how frequently we are seeing this issue?
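For completeness, the pool-level view used in this kind of analysis can be reproduced on a live cluster with:

# Per-pool progress towards the target rendered config (UPDATED/UPDATING/DEGRADED).
oc get machineconfigpools

# The controller log often says why a particular node's drain or update is stuck.
oc -n openshift-machine-config-operator logs deployment/machine-config-controller --tail=50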
I agree with Sinny: nothing really points to something MCO-related. The fact that the MCP notices that some nodes aren't ready and degrades isn't a symptom of something wrong in the MCO (unless we caused it and have proof/logs of that). Failing to schedule the MCD isn't a direct MCO problem either: if the node isn't ready, we can't do anything but report that.
Hi Sinny,

Thanks for your analysis!

For the first analysis: there is a cluster (IPI Azure OVN FIPS etcd_encryption) where the MCO failed to upgrade to 4.6, but it doesn't seem to be the same issue. Hope it helps.
https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/111181/artifact/workdir/install-dir/auth/kubeconfig/*view*/

For the second analysis:

> Did we know that cluster was fully upgraded including all worker nodes when 4.4 to 4.5 upgrade took place first?

I guess so; there was no error reported for the 4.4 to 4.5 upgrade.

> Can you please also share how frequently are we seeing this issue?

I'm not sure. We created one cluster and hit this issue. Now I have created two clusters (4.4.20 -> 4.5.9 -> 4.6):

One cluster (UPI AWS OVN FIPS etcd_encryption): https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/111273/artifact/workdir/install-dir/auth/kubeconfig/*view*/
It upgraded from 4.4 to 4.5 without problems and is now upgrading to 4.6.

Another one (UPI AWS FIPS) is now upgrading to 4.5 and will be upgraded to 4.6 soon: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/111316/artifact/workdir/install-dir/auth/kubeconfig/*view*/
I also created a cluster with the same configuration (IPI Azure OVN FIPS) as comment 0:

[root@preserve-olm-env data]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-09-10-073112   True        False         19m     Cluster version is 4.5.0-0.nightly-2020-09-10-073112
[root@preserve-olm-env data]# oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release@sha256:1ca476cde944c29e370bbb3759df256fce191fa5849f726d8db3304040175505 --force --allow-explicit-upgrade
[root@preserve-olm-env data]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-09-10-073112   True        True          2m41s   Working towards 4.6.0-0.nightly-2020-09-10-054902: 11% complete

Now it's upgrading.
https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/111331/artifact/workdir/install-dir/auth/kubeconfig/*view*/
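A small watch loop (plain bash; assumes KUBECONFIG points at the cluster above) can record the upgrade conditions over time while it runs:

# Print the ClusterVersion conditions once a minute; Ctrl-C to stop.
while true; do
  date
  oc get clusterversion version -o jsonpath='{range .status.conditions[*]}{.type}={.status} {end}'
  echo
  sleep 60
done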
Thank you, Jian, for running the extra set of upgrade tests. Based on your new upgrade results, it seems the MCO is working fine and is not causing the upgrade issue. If you agree, can we close this bug?
I think there are OVN issues in general and I wouldn't be surprised if we somehow broke 4.5 -> 4.6 upgrades w/ovn.
Hi Sinny,

As you can see, for https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/111331/artifact/workdir/install-dir/auth/kubeconfig/*view*/, the MCO is still at the 4.5 version. And we did hit the "MachineConfigDaemonFailed" error twice before. I will remove the OVN setting and try to reproduce this bug.

[root@preserve-olm-env data]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-09-10-073112   True        True          13h     Unable to apply 4.6.0-0.nightly-2020-09-10-054902: an unknown error has occurred: MultipleErrors

[root@preserve-olm-env data]# oc get co
NAME                  VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication        4.6.0-0.nightly-2020-09-10-054902   False       True          True       28m
machine-config        4.5.0-0.nightly-2020-09-10-073112   True        False         False      13h
image-registry        4.6.0-0.nightly-2020-09-10-054902   True        True          True       13h
monitoring            4.6.0-0.nightly-2020-09-10-054902   False       False         True       4h52m
openshift-apiserver   4.6.0-0.nightly-2020-09-10-054902   False       False         False      2m22s
dns                   4.5.0-0.nightly-2020-09-10-073112   True        False         False      13h
...

[root@preserve-olm-env data]# oc get co openshift-apiserver -o yaml
...
  - lastTransitionTime: "2020-09-11T00:57:40Z"
    message: 'APIServicesAvailable: "project.openshift.io.v1" is not ready: 503 (the
      server is currently unable to handle the request)'
    reason: APIServices_Error

[root@preserve-olm-env data]# oc get co authentication -o yaml
...
status:
  conditions:
  - lastTransitionTime: "2020-09-10T12:30:36Z"
    message: |-
      OAuthRouteCheckEndpointAccessibleControllerDegraded: Get "https://oauth-openshift.apps.jiazha0910.qe.azure.devcluster.openshift.com/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
      OAuthServiceEndpointsCheckEndpointAccessibleControllerDegraded: Get "https://10.128.0.9:6443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
      OAuthServiceCheckEndpointAccessibleControllerDegraded: Get "https://172.30.117.188:443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
      OAuthServerDeploymentDegraded: Unable to get "openshift-browser-client" bootstrapped OAuth client: the server is currently unable to handle the request (post oauthclients.oauth.openshift.io)
    reason: OAuthRouteCheckEndpointAccessibleController_SyncError::OAuthServerDeployment_GetFailed::OAuthServiceCheckEndpointAccessibleController_SyncError::OAuthServiceEndpointsCheckEndpointAccessibleController_SyncError
...

For https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/111273/artifact/workdir/install-dir/auth/kubeconfig/*view*/, it is still upgrading.

[root@preserve-olm-env data]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.9     True        True          12h     Working towards 4.6.0-0.nightly-2020-09-10-011413: 15% complete
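Aside: with this many operators in mixed states, a quick filter (assuming jq is installed locally) pulls out just the operators that are currently Degraded, with their messages:

# List Degraded=True operators and the message of the matching condition.
oc get clusteroperators -o json \
  | jq -r '.items[]
           | . as $co
           | .status.conditions[]
           | select(.type == "Degraded" and .status == "True")
           | "\($co.metadata.name): \(.message // .reason)"'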
I had a try without the OVN setting (IPI Azure FIPS):
https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/111444/artifact/workdir/install-dir/auth/kubeconfig/*view*/

It upgraded well; I did not hit the "MachineConfigDaemonFailed" error.

[root@preserve-olm-env data]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-09-10-073112   True        True          20s     Working towards 4.6.0-0.nightly-2020-09-10-195619: 0% complete
[root@preserve-olm-env data]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-10-195619   True        False         75m     Cluster version is 4.6.0-0.nightly-2020-09-10-195619

I'm moving this bug to the OVN team first; the two clusters listed in comment 9 are available for your debugging. Please transfer it to the appropriate component if you know it, thanks!
Hi Aniket,

Sure, and I linked PR 269 here.
Hi Aniket,

I tested 2 clusters with OVN; both upgrades failed. Details:

Starting from payload 4.6.0-0.nightly-2020-09-17-195238, PR 269 is merged in:

[root@preserve-olm-env data]# oc adm release info registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-09-17-195238 --commits | grep ovn
  ovn-kubernetes   https://github.com/openshift/ovn-kubernetes   efa6de93497d3fdd81ba5706669c54529176691c

1) IPI on AWS & FIPS on & OVN: upgrade 4.5.0-0.nightly-2020-09-17-145245 to 4.6.0-0.nightly-2020-09-17-195238
https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/upgrade_CI/5229/console
Still failed. This cluster is available for your debugging:
https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/112933/artifact/workdir/install-dir/auth/kubeconfig/*view*/

[root@preserve-olm-env data]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-09-17-145245   True        True          3h23m   Unable to apply 4.6.0-0.nightly-2020-09-17-195238: the control plane is reporting an internal error

[root@preserve-olm-env data]# oc get co
NAME                            VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
dns                             4.5.0-0.nightly-2020-09-17-145245   True        True          False      5h18m
...
kube-apiserver                  4.6.0-0.nightly-2020-09-17-195238   True        True          False      5h16m
kube-controller-manager         4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h17m
kube-scheduler                  4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h16m
kube-storage-version-migrator   4.6.0-0.nightly-2020-09-17-195238   False       False         False      171m
machine-api                     4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h10m
machine-approver                4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h16m
machine-config                  4.5.0-0.nightly-2020-09-17-145245   True        False         False      5h18m
marketplace                     4.6.0-0.nightly-2020-09-17-195238   True        False         False      3h2m
monitoring                      4.6.0-0.nightly-2020-09-17-195238   True        False         False      172m
network                         4.5.0-0.nightly-2020-09-17-145245   True        True          True       5h20m
...
storage                         4.6.0-0.nightly-2020-09-17-195238   True        True          False      174m

2) IPI on Azure & FIPS on & OVN & etcd encryption on: upgrade 4.5.0-0.nightly-2020-09-17-145245 to 4.6.0-0.nightly-2020-09-17-195238
https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/upgrade_CI/5230/console
Still failed. This cluster is available for your debugging:
https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/112935/artifact/workdir/install-dir/auth/kubeconfig/*view*/

[root@preserve-olm-env data]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-09-17-145245   True        True          3h2m    Working towards 4.6.0-0.nightly-2020-09-17-195238: 1% complete

[root@preserve-olm-env data]# oc get co
NAME                            VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                  4.6.0-0.nightly-2020-09-17-195238   True        False         False      143m
cloud-credential                4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h23m
cluster-autoscaler              4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h10m
config-operator                 4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h9m
console                         4.6.0-0.nightly-2020-09-17-195238   True        False         False      153m
csi-snapshot-controller         4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h1m
dns                             4.5.0-0.nightly-2020-09-17-145245   True        True          False      5h15m
etcd                            4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h15m
image-registry                  4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h2m
ingress                         4.6.0-0.nightly-2020-09-17-195238   True        False         False      155m
insights                        4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h11m
kube-apiserver                  4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h15m
kube-controller-manager         4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h15m
kube-scheduler                  4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h14m
kube-storage-version-migrator   4.6.0-0.nightly-2020-09-17-195238   False       False         False      144m
machine-api                     4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h7m
machine-approver                4.6.0-0.nightly-2020-09-17-195238   True        False         False      5h10m
machine-config                  4.5.0-0.nightly-2020-09-17-145245   True        False         False      5h8m
...
network                         4.5.0-0.nightly-2020-09-17-145245   True        True          True       5h17m
...
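As a sanity check that a given fix really is in a payload, the commit printed by oc adm release info --commits can be compared against the PR's merge commit in a local clone. A sketch; <PR_MERGE_SHA> is a placeholder for whatever SHA GitHub shows as the PR's merge commit:

# Assumes a local clone of openshift/ovn-kubernetes in ./ovn-kubernetes.
git -C ovn-kubernetes fetch origin
git -C ovn-kubernetes merge-base --is-ancestor <PR_MERGE_SHA> efa6de93497d3fdd81ba5706669c54529176691c \
  && echo "PR is included in the payload commit" \
  || echo "PR is NOT included"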
Jian,

Looks like the cluster got deprovisioned over the weekend. I started an upgrade job on Azure with OVN using a downstream PR that has more fixes on the ovn-k side, but that seems to have failed as well. I will investigate. Meanwhile, can you create a reproducer cluster again?
Turns out the job failures were terraform-related. My latest run of the upgrade job, https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/1308770921846149120, seems to have passed with CNO PR 801 and downstream ovn-kubernetes PR 281. Once those PRs land, this can be verified.

@Jian: Meanwhile, can you try to run the upgrade with the above two PRs?
Hi Aniket,

> Meanwhile, can you create a reproducer cluster again?

Sure, but in fact we provided two clusters early last Friday. This time, could you help debug it as soon as possible? As you know, the clusters consume cloud resources; thanks for your understanding!

IPI on AWS & FIPS on & OVN: upgrade 4.5.0-0.nightly-2020-09-20-185910 to 4.6.0-0.nightly-2020-09-24-030538
The cluster is being created: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/114283/

> @Jian: Meanwhile, can you try to run the upgrade with the above two PRs?

Sorry, as you can see, https://github.com/openshift/ovn-kubernetes/pull/281 hasn't been merged yet! That means no release image contains it, and I couldn't find a way to update an OCP 4.5 cluster to OCP 4.6 without a release image. I also don't think cluster-bot can do that, since there is no update template for the OVN configurations; correct me if I'm wrong, thanks!
Upgrade failed. Here is the cluster for your debugging:
https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/114283/artifact/workdir/install-dir/auth/kubeconfig/*view*/

[root@preserve-olm-env data]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-09-20-185910   True        True          133m    Unable to apply 4.6.0-0.nightly-2020-09-24-030538: the control plane is reporting an internal error

[root@preserve-olm-env data]# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-09-24-030538   False       True          False      99m
cloud-credential                           4.6.0-0.nightly-2020-09-24-030538   True        False         False      172m
cluster-autoscaler                         4.6.0-0.nightly-2020-09-24-030538   True        False         False      157m
config-operator                            4.6.0-0.nightly-2020-09-24-030538   True        False         False      157m
console                                    4.6.0-0.nightly-2020-09-24-030538   True        False         False      110m
csi-snapshot-controller                    4.6.0-0.nightly-2020-09-24-030538   True        False         False      149m
dns                                        4.5.0-0.nightly-2020-09-20-185910   True        True          False      161m
etcd                                       4.6.0-0.nightly-2020-09-24-030538   True        False         False      161m
image-registry                             4.6.0-0.nightly-2020-09-24-030538   True        False         False      150m
ingress                                    4.6.0-0.nightly-2020-09-24-030538   True        False         False      112m
insights                                   4.6.0-0.nightly-2020-09-24-030538   True        False         False      158m
kube-apiserver                             4.6.0-0.nightly-2020-09-24-030538   True        True          False      160m
kube-controller-manager                    4.6.0-0.nightly-2020-09-24-030538   True        False         False      161m
kube-scheduler                             4.6.0-0.nightly-2020-09-24-030538   True        False         False      159m
kube-storage-version-migrator              4.6.0-0.nightly-2020-09-24-030538   True        False         False      130m
machine-api                                4.6.0-0.nightly-2020-09-24-030538   True        False         False      154m
machine-approver                           4.6.0-0.nightly-2020-09-24-030538   True        False         False      160m
machine-config                             4.5.0-0.nightly-2020-09-20-185910   True        False         False      130m
marketplace                                4.6.0-0.nightly-2020-09-24-030538   True        False         False      111m
monitoring                                 4.6.0-0.nightly-2020-09-24-030538   True        False         False      101m
network                                    4.5.0-0.nightly-2020-09-20-185910   True        True          True       163m
node-tuning                                4.6.0-0.nightly-2020-09-24-030538   True        False         False      111m
openshift-apiserver                        4.6.0-0.nightly-2020-09-24-030538   True        False         False      99m
openshift-controller-manager               4.6.0-0.nightly-2020-09-24-030538   True        False         False      126m
openshift-samples                          4.6.0-0.nightly-2020-09-24-030538   True        False         False      111m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-09-24-030538   True        False         False      161m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-09-24-030538   True        False         False      162m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-09-24-030538   True        False         False      110m
service-ca                                 4.6.0-0.nightly-2020-09-24-030538   True        False         False      162m
storage                                    4.6.0-0.nightly-2020-09-24-030538   True        False         False      103m
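Since these reproducer clusters get deprovisioned quickly, it may be worth capturing state while they are still up. A standard must-gather, run against the kubeconfig above, preserves the operator and node logs for offline debugging:

# Capture cluster state before the cluster is reclaimed.
oc adm must-gather --dest-dir=./must-gather-output

# Compress it for attaching to the bug.
tar czf must-gather-output.tar.gz must-gather-output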
Assigning this issue according to https://bugzilla.redhat.com/show_bug.cgi?id=1880591#c25.
Since the original issue in this bug has already been fixed, and we have another bug, https://bugzilla.redhat.com/show_bug.cgi?id=1880591, to track the remaining upgrade issue, I'd like to move this bug to 'verified'. Please reopen it if this issue happens again. Thanks.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475