Bug 1806603 - OCP upgrade 4.2.18->4.3.2 failed: 1 node in NotReady state, 2 CO in Degraded state in AWS+ UPI+RHEL cluster
Summary: OCP upgrade 4.2.18->4.3.2 failed: 1 node in NotReady state, 2 CO in Degraded ...
Keywords:
Status: CLOSED DUPLICATE of bug 1805444
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-02-24 15:42 UTC by Neha Berry
Modified: 2020-03-06 13:00 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-06 13:00:54 UTC
Target Upstream Version:
Embargoed:



Description Neha Berry 2020-02-24 15:42:56 UTC
Description of problem:
----------------------------
On an OCP 4.2.18 setup (with OCS 4.2.2-rc6 configured), an OCP upgrade to OCP 4.3.2 was initiated.
The upgrade failed to complete, and the following issues were seen:
a) One of the worker nodes is left behind in NotReady,SchedulingDisabled state

b) The network operator failed to complete the upgrade

c) The monitoring operator is also in Degraded state

d) Must-gather times out, hence we are unable to collect OCP must-gather

e) Unable to perform oc debug node on the affected node
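For reference, a hedged sketch of the standard node-debug flow that item e) refers to (the affected node's name appears in the Additional info section below):

oc debug node/ip-10-0-52-118.us-east-2.compute.internal
# inside the resulting debug pod, the host filesystem would normally be entered with:
chroot /host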

Note: 

1. Before we started the OCP upgrade, I was able to collect OCP and OCS must-gather without any issue.
2. Once the upgrade failed, logs were collected in the form of "oc cluster-info dump", since OCP must-gather timed out:


for i in $(oc get projects --no-headers | awk '{print $1}'); do echo "$i"; oc cluster-info dump -n "$i" --output-directory="$i-logs"; done
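As a hedged alternative, a single invocation of cluster-info dump can cover every namespace in one pass (the --all-namespaces flag is standard for this subcommand; the output directory name below is illustrative, and the per-namespace loop above is kept as originally run):

oc cluster-info dump --all-namespaces --output-directory=cluster-dump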


Setup Details:
=================
1. OCP cluster = 3 W, 3 M
2. Worker node OS = RHEL 7.7
3. Platform: AWS + UPI + RHEL, worker node instance type = m5.4xlarge
4. OCS - installed on 3 W nodes
5. Worker node CPU = 16
6. Worker node memory: 64299320Ki



Version-Release number of selected component (if applicable):
----------------------------
OCP version before upgrade: GA'd OCP 4.2.18

OCP version after upgrade: GA'd OCP 4.3.2

OCS version: image: quay.io/rhceph-dev/ocs-olm-operator:4.2.2-rc6


How reproducible:
----------------------------
Tested once on a RHEL-based UPI cluster on AWS

Steps to Reproduce:
----------------------------
1. Create an OCP 4.2.18 cluster with 3 M and 3 W nodes (W = m5.4xlarge). Config: UPI + RHEL + AWS
2. Install OCS 4.2.2-rc6 (or equivalent) on the 3 worker nodes. Add 3 more OSDs
3. Configure openshift-monitoring, openshift-logging and openshift-registry to be backed by OCS by following [1] (a minimal sketch of the monitoring piece appears after the doc link below)
4. Start some Fedora pods running FIO and one PGSQL pod
5. With I/O running on the app pods, start the OCP upgrade:

date --utc; oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.3.2 --force

6. Keep a check on the CO upgrade status (console: Cluster Settings -> Cluster Operators). It was seen that two COs were left in Degraded state (CLI equivalents are sketched after step 7).

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.18    True        True          3h9m    Unable to apply 4.3.2: the cluster operator monitoring is degraded

7. Check the node status to see whether any node is left in NotReady,SchedulingDisabled state.
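A hedged CLI sketch for steps 6 and 7 (standard oc commands; the exact output will differ per cluster):

oc get clusterversion
oc get clusteroperators
oc get nodes -o wide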



[1] Doc link: https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.2/html-single/managing_openshift_container_storage/index?lb_target=stage#configure-storage-for-openshift-container-platform-services_rhocs
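For context on step 3, a minimal sketch of the monitoring piece only, assuming the default OCS RBD storage class name (ocs-storagecluster-ceph-rbd) and an illustrative 40Gi request; the linked doc [1] is the authoritative procedure:

cat <<EOF | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    # back the Prometheus instances with PVCs from the OCS RBD storage class
    prometheusK8s:
      volumeClaimTemplate:
        spec:
          storageClassName: ocs-storagecluster-ceph-rbd
          resources:
            requests:
              storage: 40Gi
EOF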

Actual results:
----------------------------
Some COs failed to upgrade, and the overall OCP upgrade failed:

Cluster update in progress.
Unable to apply 4.3.2: the cluster operator monitoring is degraded



Expected results:
----------------------------
The upgrade should succeed, and no CO should be left behind in Degraded state

Additional info:
----------------------------
>> a) The node in NotReady state after the OCP upgrade:

ip-10-0-52-118.us-east-2.compute.internal   NotReady,SchedulingDisabled   worker   6h22m   v1.14.6+9b1ffa798
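A hedged way to inspect why the node is NotReady (standard oc commands; the output was not captured here):

oc describe node ip-10-0-52-118.us-east-2.compute.internal
oc get node ip-10-0-52-118.us-east-2.compute.internal -o jsonpath='{range .status.conditions[*]}{.type}={.status} {.reason}{"\n"}{end}'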

>> b) Network Cluster Operator state:

network  Degraded	4.3.2	DaemonSet "openshift-multus/multus" rollout is not making progress - last change 2020-02-24T13:09:33Z
DaemonSet "openshift-sdn/ovs" rollout is not making progress - last change 2020-02-24T13:09:34Z
DaemonSet "openshift-sdn/sdn" rollout is not making...

>> c) Monitoring CO state:

monitoring  Degraded	4.3.2	Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status:...
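Both degraded conditions point at DaemonSet rollouts, which is consistent with pods being stuck on the NotReady node. A hedged way to check which DaemonSet pods are pinned to that node (standard oc commands):

oc -n openshift-multus get pods -o wide | grep ip-10-0-52-118
oc -n openshift-sdn get pods -o wide | grep ip-10-0-52-118
oc -n openshift-monitoring get pods -o wide | grep ip-10-0-52-118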

>> d) oc debug node is also unable to succeed on the affected node

> Some logs from MCO

[nberry@localhost openshift-machine-config-operator-logs]$ cat openshift-machine-config-operator/machine-config-daemon-7hwg6/logs.txt 
==== START logs for container machine-config-daemon of pod openshift-machine-config-operator/machine-config-daemon-7hwg6 ====
Request log error: an error on the server ("unknown") has prevented the request from succeeding (get pods machine-config-daemon-7hwg6)
==== END logs for container machine-config-daemon of pod openshift-machine-config-operator/machine-config-daemon-7hwg6 ====
==== START logs for container oauth-proxy of pod openshift-machine-config-operator/machine-config-daemon-7hwg6 ====
Request log error: an error on the server ("unknown") has prevented the request from succeeding (get pods machine-config-daemon-7hwg6)
==== END logs for container oauth-proxy of pod openshift-machine-config-operator/machine-config-daemon-7hwg6 ====
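The repeated 'an error on the server ("unknown")' responses when fetching container logs are consistent with the kubelet on the hosting node being unreachable. A hedged check to confirm that these machine-config-daemon pods sit on the NotReady node:

oc -n openshift-machine-config-operator get pods -o wide | grep ip-10-0-52-118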


>> The machine-config CO also took a very long time to upgrade.

Comment 2 Neha Berry 2020-02-24 15:53:07 UTC
The AWS instance reports no issues, and its status checks are fine

Comment 6 Stephen Cuppett 2020-03-06 13:00:54 UTC
This should be reproduced using a supported upgrade path. There was a known issue with 4.3.1 -> 4.3.2 which likely also affects 4.2.z -> 4.3.z; 4.3.5 should have this corrected. Marking as a duplicate for now.

*** This bug has been marked as a duplicate of bug 1805444 ***
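As a hedged aside on reproducing over a supported path: running oc adm upgrade with no arguments lists the updates recommended for the currently installed version, whereas the original reproduction used --force, which bypasses those recommendations.

oc adm upgrade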

