Bug 1913620 - 4.7 to 4.6 downgrade stuck at openshift-apiserver with "No route to host" networking error
Summary: 4.7 to 4.6 downgrade stuck at openshift-apiserver with "No route to host" networking error
Keywords:
Status: CLOSED DUPLICATE of bug 1906936
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Douglas Smith
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-01-07 09:23 UTC by Xingxing Xia
Modified: 2021-01-12 15:18 UTC
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-12 15:18:40 UTC
Target Upstream Version:
Embargoed:



Description Xingxing Xia 2021-01-07 09:23:04 UTC
Description of problem:
The 4.7 to 4.6 downgrade is stuck at openshift-apiserver with various errors in the logs.
This blocks testing of https://issues.redhat.com/browse/MSTR-1055, so the TestBlocker keyword is being added.

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2021-01-05-203053, upgraded to 4.7.0-0.nightly-2021-01-06-222035, then downgraded back to 4.6.0-0.nightly-2021-01-05-203053

How reproducible:
Not sure

Steps to Reproduce:
1. Install a 4.6.0-0.nightly-2021-01-05-203053 UPI GCP env successfully. Run:
oc patch apiserver/cluster -p '{"spec":{"encryption": {"type":"aescbc"}}}' --type merge
Wait for 20 minutes. Check all pods/nodes/COs; all are well.
2. Upgrade to 4.7.0-0.nightly-2021-01-06-222035 successfully. Check all pods/nodes/COs again; all are still well.
3. Downgrade back to 4.6.0-0.nightly-2021-01-05-203053 (see the command sketch below).
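For reference, the encryption check in step 1 and the downgrade trigger in step 3 look roughly like the commands below; the release image pullspec is an assumption here, not copied from this report:
$ oc get kubeapiserver -o=jsonpath='{range .items[0].status.conditions[?(@.type=="Encrypted")]}{.reason}{"\n"}{.message}{"\n"}{end}'   # wait for EncryptionCompleted
$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.6.0-0.nightly-2021-01-05-203053 --allow-explicit-upgrade --force   # --force because a downgrade is not a recommended update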

Actual results:
3. The downgrade failed; it is stuck in the state below. Some debugging output:
[xxia@pres 2021-01-07 14:52:44 CST my]$ ogpcn # my script that gets abnormal projects/pods/COs
openshift-multus                                   Terminating   4h49m
openshift-network-diagnostics                      Terminating   139m
openshift-sdn                                      Terminating   4h49m

openshift-multus                network-metrics-daemon-xm84l    0/2   Terminating  0   138m   10.128.2.3    xxia07story-f4zvs-worker-a-pdxts.c.openshift-qe.internal
...
openshift-network-diagnostics   network-check-target-77gct      0/1   Terminating  0   139m   10.128.2.2    xxia07story-f4zvs-worker-a-pdxts.c.openshift-qe.internal
...
openshift-oauth-apiserver       apiserver-5d44b68d87-qt74h      0/1   Terminating  0   104m   10.129.0.37   xxia07story-f4zvs-m-1.c.openshift-qe.internal
...

ClusterOperators that are not at 4.6.0-0.nightly-2021-01-05-203053 with True/False/False:
authentication                             4.6.0-0.nightly-2021-01-05-203053   False   True    True    52m
baremetal                                  4.7.0-0.nightly-2021-01-06-222035   True    False   False   144m
console                                    4.6.0-0.nightly-2021-01-05-203053   True    False   True    54m
dns                                        4.7.0-0.nightly-2021-01-06-222035   True    False   False   4h46m
machine-config                             4.7.0-0.nightly-2021-01-06-222035   True    False   False   110m
monitoring                                 4.6.0-0.nightly-2021-01-05-203053   False   False   True    50m
network                                    4.7.0-0.nightly-2021-01-06-222035   True    True    True    134m
openshift-apiserver                        4.6.0-0.nightly-2021-01-05-203053   False   False   False   52m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2021-01-05-203053   False   True    False   52m
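(The ogpcn helper is not included in this report; its output above can be approximated with ordinary commands such as the following:)
$ oc get ns | grep -v Active
$ oc get pods -A -o wide | grep -Ev 'Running|Completed'
$ oc get co | grep -v '4.6.0-0.nightly-2021-01-05-203053   True        False         False'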

# all nodes are Ready and v1.20.0+b1e9f0d
NAME                                                       STATUS   ROLES    AGE     VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
xxia07story-f4zvs-m-0.c.openshift-qe.internal              Ready    master   4h49m   v1.20.0+b1e9f0d   10.0.0.5      <none>        Red Hat Enterprise Linux CoreOS 47.83.202101060443-0 (Ootpa)   4.18.0-240.10.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
..
[xxia@pres 2021-01-07 14:54:16 CST my]$ oc describe co openshift-apiserver
...
    Message:               APIServicesAvailable: "build.openshift.io.v1" is not ready: 503 (the server is currently unable to handle the request)
[xxia@pres 2021-01-07 14:55:51 CST my]$ oc get ns openshift-sdn -o yaml |& tee openshift-sdn-ns.yaml
...
  - lastTransitionTime: "2021-01-07T06:00:19Z"
    message: 'Discovery failed for some groups, 12 failing: unable to retrieve the complete list of server APIs: apps.openshift.io/v1: the server is currently unable to handle the request, ...
    reason: DiscoveryFailed
    status: "True"
    type: NamespaceDeletionDiscoveryFailure

Checking the kube-apiserver logs shows many errors like the one below:
2021-01-07T06:57:43.586271409Z E0107 06:57:43.586000      18 available_controller.go:490] v1.quota.openshift.io failed with: failing or missing response from https://10.129.0.50:8443/apis/quota.openshift.io/v1: Get "https://10.129.0.50:8443/apis/quota.openshift.io/v1": dial tcp 10.129.0.50:8443: connect: no route to host
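The 10.129.0.50 address in these failures can be mapped back to a pod with something like the commands below; as the next listing shows, it belongs to one of the openshift-apiserver pods:
$ oc get pods -A -o wide | grep 10.129.0.50
$ oc get endpoints -n openshift-apiserver api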

[xxia@pres 2021-01-07 15:14:52 CST downgrade]$ oloas # my script that gets OAS pods and logs
apiserver-697545cc9c-cv64d   2/2   Running   1     81m   10.130.0.38   xxia07story-f4zvs-m-0.c.openshift-qe.internal   <none>   <none>   apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=697545cc9c,revision=5
apiserver-697545cc9c-dhkbc   2/2   Running   1     83m   10.128.0.32   xxia07story-f4zvs-m-2.c.openshift-qe.internal   <none>   <none>   apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=697545cc9c,revision=5
apiserver-697545cc9c-lml9m   2/2   Running   1     82m   10.129.0.50   xxia07story-f4zvs-m-1.c.openshift-qe.internal   <none>   <none>   apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=697545cc9c,revision=5

Checking the openshift-apiserver container logs shows many errors like the one below:
2021-01-07T07:15:27.332243189Z E0107 07:15:27.332172       1 cacher.go:416] cacher (*oauth.OAuthAccessToken): unexpected ListAndWatch error: failed to list *oauth.OAuthAccessToken: illegal base64 data at input byte 3; reinitializing...
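Because the failing list is against an encrypted resource type (aescbc was enabled in step 1), a hedged follow-up is to confirm the list failure directly and look at the apiserver encryption state; the encryption-config secret naming below is an assumption:
$ oc get oauthaccesstokens.oauth.openshift.io
$ oc get secrets -n openshift-config-managed | grep encryption-config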

Checking `oc get po -n openshift-apiserver -o yaml` shows "restartCount: 1" for the openshift-apiserver-check-endpoints container. The last lines of its previous logs are below:
[xxia@pres 2021-01-07 15:18:53 CST downgrade]$ oc logs -p -c openshift-apiserver-check-endpoints -n openshift-apiserver apiserver-697545cc9c-cv64d
I0107 05:55:44.320250       1 base_controller.go:113] Shutting down worker of CheckEndpointsTimeToStart controller ...
I0107 05:55:44.320297       1 base_controller.go:103] All CheckEndpointsTimeToStart workers have been terminated
...
I0107 05:55:44.421010       1 base_controller.go:109] Starting #1 worker of check-endpoints controller ...
I0107 06:01:27.841861       1 start_stop_controllers.go:70] The server doesn't have a resource type "podnetworkconnectivitychecks.controlplane.operator.openshift.io".
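The missing resource type in the last line could be confirmed with a direct CRD lookup, for example:
$ oc get crd podnetworkconnectivitychecks.controlplane.operator.openshift.io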

[xxia@pres 2021-01-07 15:19:39 CST downgrade]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-01-06-222035   True        True          106m    Unable to apply 4.6.0-0.nightly-2021-01-05-203053: the cluster operator openshift-apiserver has not yet successfully rolled out

[xxia@pres 2021-01-07 15:52:36 CST downgrade]$ oc rsh -n openshift-kube-apiserver kube-apiserver-xxia07story-f4zvs-m-0.c.openshift-qe.internal
sh-4.4# curl -k https://10.129.0.50:8443
curl: (7) Failed to connect to 10.129.0.50 port 8443: No route to host
[xxia@pres 2021-01-07 16:03:37 CST downgrade]$ oc rsh -n openshift-apiserver apiserver-697545cc9c-lml9m
sh-4.4# curl -k https://10.129.0.50:8443
...
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "code": 403
...

[xxia@pres 2021-01-07 16:22:28 CST downgrade]$ oc get pods -n openshift-sdn
No resources found in openshift-sdn namespace.
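The curl results above (the OAS pod is reachable from the pod network but gives "No route to host" from the host-networked kube-apiserver) together with the empty openshift-sdn namespace point at the node-level SDN datapath. One way to confirm directly that the network operator has not recreated its operands is to list the DaemonSets it owns:
$ oc get ds -n openshift-sdn
$ oc get ds -n openshift-multus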
[xxia@pres 2021-01-07 16:23:37 CST downgrade]$ oc describe co network
    Last Transition Time:  2021-01-07T06:00:09Z
    Message:               Waiting for DaemonSet "openshift-multus/multus" to be created
Waiting for DaemonSet "openshift-multus/network-metrics-daemon" to be created
Waiting for DaemonSet "openshift-multus/multus-admission-controller" to be created
Waiting for DaemonSet "openshift-sdn/sdn-controller" to be created
Waiting for DaemonSet "openshift-sdn/ovs" to be created
Waiting for DaemonSet "openshift-sdn/sdn" to be created
    Reason:                Deploying
    Status:                True
    Type:                  Progressing

[xxia@pres 2021-01-07 16:23:43 CST downgrade]$ oc get po -n openshift-network-operator
NAME                                READY   STATUS    RESTARTS   AGE
network-operator-674b58cd88-pmtbv   1/1     Running   0          166m
[xxia@pres 2021-01-07 16:46:48 CST downgrade]$ oc logs network-operator-674b58cd88-pmtbv -n openshift-network-operator
2021/01/07 05:59:55 Go Version: go1.15.5
...
2021/01/07 06:00:05 Became the leader.
I0107 06:00:06.745945       1 request.go:621] Throttling request took 1.04468026s, request: GET:https://api-int...com:6443/apis/metal3.io/v1alpha1?timeout=32s
2021/01/07 06:00:08 Registering Components.
...
2021/01/07 06:00:08 ConfigMap "openshift-service-ca" not found
2021/01/07 06:00:08 ERROR ConfigMap "openshift-service-ca" not found - Reconciler error
...
2021/01/07 08:35:12 Reconciling update to openshift-multus/multus-admission-controller
2021/01/07 08:35:12 Error getting DaemonSet "openshift-multus/multus": DaemonSet.apps "multus" not found
2021/01/07 08:35:12 Error getting DaemonSet "openshift-multus/network-metrics-daemon": DaemonSet.apps "network-metrics-daemon" not found
2021/01/07 08:35:12 Error getting DaemonSet "openshift-multus/multus-admission-controller": DaemonSet.apps "multus-admission-controller" not found
2021/01/07 08:35:12 Error getting DaemonSet "openshift-sdn/sdn-controller": DaemonSet.apps "sdn-controller" not found
2021/01/07 08:35:12 Error getting DaemonSet "openshift-sdn/ovs": DaemonSet.apps "ovs" not found
2021/01/07 08:35:12 Error getting DaemonSet "openshift-sdn/sdn": DaemonSet.apps "sdn" not found
2021/01/07 08:35:26 Reconciling update for openshift-service-ca from /cluster
2021/01/07 08:35:26 ConfigMap "openshift-service-ca" not found
2021/01/07 08:35:26 ERROR ConfigMap "openshift-service-ca" not found - Reconciler error
2021/01/07 08:40:10 Reconciling update to openshift-multus/multus
2021/01/07 08:40:10 Error getting DaemonSet "openshift-multus/multus": DaemonSet.apps "multus" not found
2021/01/07 08:40:10 Error getting DaemonSet "openshift-multus/network-metrics-daemon": DaemonSet.apps "network-metrics-daemon" not found
2021/01/07 08:40:10 Error getting DaemonSet "openshift-multus/multus-admission-controller": DaemonSet.apps "multus-admission-controller" not found
2021/01/07 08:40:10 Error getting DaemonSet "openshift-sdn/sdn-controller": DaemonSet.apps "sdn-controller" not found
2021/01/07 08:40:10 Error getting DaemonSet "openshift-sdn/ovs": DaemonSet.apps "ovs" not found
2021/01/07 08:40:10 Error getting DaemonSet "openshift-sdn/sdn": DaemonSet.apps "sdn" not found
2021/01/07 08:40:10 Reconciling update to openshift-multus/network-metrics-daemon
...
[xxia@pres 2021-01-07 16:48:58 CST downgrade]$ oc get cm -A | grep " openshift-service-ca "
openshift-controller-manager                       openshift-service-ca                                    1      6h44m
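For completeness, the service-ca side (assuming that is where this ConfigMap would normally come from) could be checked with:
$ oc get co service-ca
$ oc get pods -n openshift-service-ca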

Expected results:
3. The downgrade should succeed.

Additional info:
must-gather failed:
oc adm must-gather --dest-dir must-gather-xxia07story-130119
error: gather did not start for pod must-gather-7zdw2: timed out waiting for the condition
[xxia 2021-01-07 15:24:06 CST my]$ du -sh must-gather-xxia07story-130119
12K     must-gather-xxia07story-130119
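When must-gather cannot even schedule its pod, a lighter-weight fallback is to inspect the affected operators directly, for example (the destination directory name is arbitrary):
$ oc adm inspect clusteroperator/network clusteroperator/openshift-apiserver --dest-dir=inspect-xxia07story
$ oc adm must-gather --timeout=30m   # assuming this oc version supports --timeout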

Comment 2 Juan Luis de Sousa-Valadas 2021-01-12 15:18:40 UTC

*** This bug has been marked as a duplicate of bug 1906936 ***

