Bug 1821369
Summary: | Can't upgrade to 4.4.0-rc.6 because the openshift-apiserver operator is degraded | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Simon <skordas>
Component: | Machine Config Operator | Assignee: | Antonio Murdaca <amurdaca>
Status: | CLOSED DUPLICATE | QA Contact: | Michael Nguyen <mnguyen>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 4.4 | CC: | aos-bugs, mfojtik, wking
Target Milestone: | --- | Keywords: | Upgrades
Target Release: | 4.4.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-04-07 20:14:54 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Simon
2020-04-06 17:18:42 UTC
From the private must-gather, cluster-scoped-resources/config.openshift.io/clusteroperators/openshift-apiserver.yaml:

```
- lastTransitionTime: "2020-04-03T20:36:36Z"
  message: 'APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable'
  reason: APIServerDeployment_UnavailablePod
  status: "True"
  type: Degraded
```

So degraded because a control-plane node is down?

From comment 0's ClusterOperator dump, you can see that machine-config is stuck on 4.3.8 and unavailable. From cluster-scoped-resources/config.openshift.io/clusteroperators/machine-config.yaml:

```
- lastTransitionTime: "2020-04-03T20:44:54Z"
  message: 'Unable to apply 4.4.0-rc.6: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-05ae789544906d872f30dd4d3304b6ca expected a7b13759061f645a76f03c04d385d275bbbd0c02 has ab4d62a3bf3774b77b6f9b04a2028faec1568aca, retrying'
  reason: RequiredPoolsFailed
  status: "True"
  type: Degraded
```

But:

```
$ for X in cluster-scoped-resources/core/nodes/*.yaml; do yaml2json <"${X}" | jq -r '.ready = (.status.conditions[] | select(.type == "Ready")) | .metadata.name + " " + .metadata.annotations["machineconfiguration.openshift.io/currentConfig"] + " " + .ready.status + " " + .ready.lastTransitionTime + " " + .ready.message'; done | grep -v worker
ip-10-0-142-202.us-east-2.compute.internal rendered-master-05ae789544906d872f30dd4d3304b6ca True 2020-04-03T19:30:06Z kubelet is posting ready status
ip-10-0-157-62.us-east-2.compute.internal rendered-master-05ae789544906d872f30dd4d3304b6ca True 2020-04-03T19:25:12Z kubelet is posting ready status
ip-10-0-173-119.us-east-2.compute.internal rendered-master-05ae789544906d872f30dd4d3304b6ca True 2020-04-03T19:34:32Z kubelet is posting ready status
```

So the nodes all seem to be ready, even if they happen to be stuck on 4.3.8.

Finding the stuck pod:

```
$ grep nodeName namespaces/openshift-apiserver/pods/apiserver-85b7c7855d-*/*.yaml
namespaces/openshift-apiserver/pods/apiserver-85b7c7855d-6drpj/apiserver-85b7c7855d-6drpj.yaml:   nodeName: ip-10-0-173-119.us-east-2.compute.internal
namespaces/openshift-apiserver/pods/apiserver-85b7c7855d-6g86z/apiserver-85b7c7855d-6g86z.yaml:   nodeName: ip-10-0-142-202.us-east-2.compute.internal
```

From the pod with no node, namespaces/openshift-apiserver/pods/apiserver-85b7c7855d-mnkcn/apiserver-85b7c7855d-mnkcn.yaml:

```
- lastProbeTime: null
  lastTransitionTime: "2020-04-03T20:34:31Z"
  message: '0/13 nodes are available: 1 node(s) were unschedulable, 10 node(s) didn''t match node selector, 2 node(s) didn''t match pod affinity/anti-affinity.'
  reason: Unschedulable
  status: "False"
  type: PodScheduled
```

Back to the node YAML, cluster-scoped-resources/core/nodes/ip-10-0-157-62.us-east-2.compute.internal.yaml:

```
taints:
- effect: NoSchedule
  key: node-role.kubernetes.io/master
- effect: NoSchedule
  key: node.kubernetes.io/unschedulable
  timeAdded: "2020-04-03T20:34:31Z"
unschedulable: true
```

So possibly this is still cordoned while the machine-config operator attempts to drain it. Moving to them to weigh in, but might be a dup of bug 1814241 / bug 1814282.

Checking for any csi-* stuff to see if this is a dup of bug 1814282:

```
$ grep -r csi- host_service_logs
...no hits...
```

And I don't expect there to have been an e2e run to leak storage resources in the history of this cluster. So this is probably a distinct issue.
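For reference, the same cordon/config spot check can be run against a live cluster instead of a must-gather. The sketch below is not part of the original triage: the label selector, output columns, and the assumption that `oc` and `jq` are available are mine.

```
# Hedged sketch: for each control-plane node, print its name, its current
# rendered MachineConfig, whether it is cordoned, and its taint keys.
# Assumes a reachable cluster, a logged-in `oc`, and jq installed.
$ oc get nodes -l node-role.kubernetes.io/master -o json | jq -r '
    .items[]
    | [ .metadata.name,
        .metadata.annotations["machineconfiguration.openshift.io/currentConfig"],
        (.spec.unschedulable // false | tostring),
        ([ .spec.taints[]?.key ] | join(",")) ]
    | @tsv'
```

A node that stays cordoned (unschedulable: true plus the node.kubernetes.io/unschedulable taint) long after the machine-config daemon started draining it is the same symptom seen in the must-gather above.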
Looking for drain details:

```
$ grep 'nodeName: ip-10-0-157-62\.' namespaces/openshift-machine-config-operator/pods/machine-config-daemon-*/*.yaml
namespaces/openshift-machine-config-operator/pods/machine-config-daemon-xzzsq/machine-config-daemon-xzzsq.yaml:   nodeName: ip-10-0-157-62.us-east-2.compute.internal
$ tail -n2 namespaces/openshift-machine-config-operator/pods/machine-config-daemon-xzzsq/machine-config-daemon/machine-config-daemon/logs/current.log
2020-04-03T21:10:27.309942395Z I0403 21:10:27.309939  159222 update.go:811] Removed stale file "/etc/kubernetes/manifests/etcd-member.yaml"
2020-04-03T21:10:27.310397419Z E0403 21:10:27.309986  159222 writer.go:135] Marking Degraded due to: rename /etc/machine-config-daemon/orig/usr/local/bin/etcd-member-add.sh.mcdorig /usr/local/bin/etcd-member-add.sh: invalid cross-device link
```

Ah, so maybe this one is a dup of bug 1814397, which should have been fixed for 4.4 via mco#1609 and bug 1817455, and which is still open for 4.3 with bug 1817458.

Checking rc.6:

```
$ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.4.0-rc.6-x86_64 | grep machine-config
  machine-config-operator   https://github.com/openshift/machine-config-operator   a7b13759061f645a76f03c04d385d275bbbd0c02
$ git --no-pager log --format='%h %s' --first-parent -10 origin/release-4.4
b3578068 Merge pull request #1609 from runcom/upbug-4.4
a7b13759 Merge pull request #1583 from openshift-cherrypick-robot/cherry-pick-1580-to-release-4.4
```

So yeah, just missed picking up that fix. Closing this one as a dup.

*** This bug has been marked as a duplicate of bug 1817455 ***
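Background on the failing step above: "invalid cross-device link" is rename(2) returning EXDEV, which happens whenever the source and destination of a rename sit on different filesystems, since a plain rename cannot cross mount points. The sketch below only illustrates that errno, not the MCO code path; it assumes /dev/shm (a tmpfs) and /tmp are separate mounts on the host running it, and the output is trimmed to the final traceback line.

```
# Hedged sketch: trigger EXDEV ("invalid cross-device link") by asking
# rename(2) to move a file across mount points. Assumes /dev/shm is a tmpfs
# and /tmp lives on a different filesystem, which may not hold on every host.
$ touch /dev/shm/exdev-demo
$ python3 -c 'import os; os.rename("/dev/shm/exdev-demo", "/tmp/exdev-demo")'
OSError: [Errno 18] Invalid cross-device link: '/dev/shm/exdev-demo' -> '/tmp/exdev-demo'
```

Tools like `mv` avoid this by falling back to copy-and-unlink when rename(2) fails with EXDEV; a bare rename call, as in the MCD log above, gets no such fallback.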