Description of problem:
----------------------------
On an OCP 4.2.18 setup (with OCS 4.2.2-rc6 configured), initiated an OCP upgrade to OCP 4.3.2. The upgrade failed to complete, and the following issues were seen:

a) One of the worker nodes is left behind in NotReady,SchedulingDisabled state
b) The network operator failed to complete the upgrade
c) The monitoring operator is also in a Degraded state
d) Must-gather is timing out, hence unable to collect OCP must-gather
e) Unable to perform oc debug node on the affected node

Note:
1. Before the OCP upgrade was started, I was able to collect OCP and OCS must-gather without any issue.
2. Once the upgrade failed, logs were collected with "oc cluster-info dump", since OCP must-gather timed out:

for i in $(oc get project --no-headers | awk '{print $1}'); do echo "$i"; oc cluster-info dump -n "$i" --output-directory="$i-logs"; done

Setup Details:
=================
1. OCP cluster = 3 worker nodes, 3 master nodes
2. Worker node OS = RHEL 7.7
3. Platform: AWS + UPI + RHEL; worker node instance type = m5.4xlarge
4. OCS installed on the 3 worker nodes
5. Worker node CPU = 16
6. Worker node memory = 64299320Ki

Version-Release number of selected component (if applicable):
----------------------------
OCP version before upgrade: GA'd OCP 4.2.18
OCP version after upgrade: GA'd OCP 4.3.2
OCS version: image: quay.io/rhceph-dev/ocs-olm-operator:4.2.2-rc6

How reproducible:
----------------------------
Tested once on a RHEL-based UPI cluster on AWS

Steps to Reproduce:
----------------------------
1. Create an OCP 4.2.18 cluster with 3 masters and 3 workers (workers = m5.4xlarge). Config: UPI + RHEL + AWS.
2. Install OCS 4.2.2-rc6 (or equivalent) on the 3 worker nodes. Add 3 more OSDs.
3. Configure openshift-monitoring, openshift-logging and openshift-registry to be backed by OCS by following [1].
4. Start some Fedora pods running FIO, and one PGSQL pod.
5. With IO running on the app pods, start the OCP upgrade:

   date --utc; oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.3.2 --force

6. Keep a check on the CO upgrade status (Cluster Settings -> Cluster Operators). Two COs were left in a Degraded state:

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.18    True        True          3h9m    Unable to apply 4.3.2: the cluster operator monitoring is degraded

7. Check the node status to see whether any node is left in NotReady,SchedulingDisabled state.

[1] Doc link: https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.2/html-single/managing_openshift_container_storage/index?lb_target=stage#configure-storage-for-openshift-container-platform-services_rhocs

Actual results:
----------------------------
Some COs failed to get upgraded; the overall OCP upgrade failed:

Cluster update in progress. Unable to apply 4.3.2: the cluster operator monitoring is degraded

Expected results:
----------------------------
The upgrade should succeed and no CO should be left behind in a Degraded state.

Additional info:
----------------------------
>> a) The node in NotReady state after the OCP upgrade

ip-10-0-52-118.us-east-2.compute.internal   NotReady,SchedulingDisabled   worker   6h22m   v1.14.6+9b1ffa798

>> b) Network Cluster Operator state

network      Degraded   4.3.2   DaemonSet "openshift-multus/multus" rollout is not making progress - last change 2020-02-24T13:09:33Z
                                DaemonSet "openshift-sdn/ovs" rollout is not making progress - last change 2020-02-24T13:09:34Z
                                DaemonSet "openshift-sdn/sdn" rollout is not making...
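For anyone triaging, the stuck rollouts reported by the network CO can be inspected directly with standard oc commands; a minimal sketch, using the DaemonSet names from the messages above (output will vary per cluster):

$ oc describe co network
$ oc get ds -n openshift-multus multus
$ oc get ds -n openshift-sdn ovs sdn
$ oc get pods -n openshift-sdn -o wide | grep ip-10-0-52-118

Pods of these DaemonSets pinned to the NotReady node would explain why the rollouts are not progressing.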
>> c) Monitoring CO state

monitoring   Degraded   4.3.2   Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status:...

>> d) oc debug node against the affected node also fails

> Some logs from the MCO

[nberry@localhost openshift-machine-config-operator-logs]$ cat openshift-machine-config-operator/machine-config-daemon-7hwg6/logs.txt
==== START logs for container machine-config-daemon of pod openshift-machine-config-operator/machine-config-daemon-7hwg6 ====
Request log error: an error on the server ("unknown") has prevented the request from succeeding (get pods machine-config-daemon-7hwg6)
==== END logs for container machine-config-daemon of pod openshift-machine-config-operator/machine-config-daemon-7hwg6 ====
==== START logs for container oauth-proxy of pod openshift-machine-config-operator/machine-config-daemon-7hwg6 ====
Request log error: an error on the server ("unknown") has prevented the request from succeeding (get pods machine-config-daemon-7hwg6)
==== END logs for container oauth-proxy of pod openshift-machine-config-operator/machine-config-daemon-7hwg6 ====

>> The machine-config CO also took a long time to upgrade.
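The machine-config rollout can be tracked per pool; a sketch using standard oc commands (node name as reported above):

$ oc get mcp
$ oc get co machine-config
$ oc -n openshift-machine-config-operator get pods -o wide
$ oc describe node ip-10-0-52-118.us-east-2.compute.internal

A worker pool stuck with UPDATED=False, together with a node held in SchedulingDisabled, would match the behaviour seen here.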
The AWS instance is reporting no issues and its Status Checks are passing.
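Since oc debug node fails but the instance itself is healthy, the kubelet on the node can still be checked over SSH; a sketch, assuming SSH access to the RHEL worker (the ec2-user login and any bastion hop are environment-specific assumptions):

$ ssh ec2-user@ip-10-0-52-118.us-east-2.compute.internal
$ sudo systemctl status kubelet
$ sudo journalctl -u kubelet --since "2020-02-24 13:00:00" | tail -n 200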
This should be reproduced using a supported upgrade. There was a known issue with 4.3.1 -> 4.3.2 which likely also affects 4.2.z -> 4.3.z; 4.3.5 should have this corrected. Marking as a duplicate for now.

*** This bug has been marked as a duplicate of bug 1805444 ***
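For the retest on a supported path, the upgrade would be driven through the published channel rather than forcing a CI release image; a sketch (channel and target version are illustrative):

$ oc patch clusterversion version --type merge -p '{"spec":{"channel":"stable-4.3"}}'
$ oc adm upgrade            # lists the available updates in the channel
$ oc adm upgrade --to 4.3.5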