Bug 1754125
Summary: During upgrade: unable to retrieve cluster version during upgrade: the server could not find the requested resource (get clusterversions.config.openshift.io version)

| Field | Value | Field | Value |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | Networking | Assignee: | Casey Callendrello <cdc> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | unspecified | CC: | aos-bugs, ccoleman, cdc, jcallen, mfojtik, mifiedle, wking |
| Version: | 4.2.0 | Keywords: | Reopened |
| Target Milestone: | --- | | |
| Target Release: | 4.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-10-03 15:42:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
W. Trevor King
2019-09-20 22:26:54 UTC
Upgrades from 4.1 to 4.2 are failing about 30% of the time due to this error. This blocks customers upgrading from 4.1 to 4.2. Needs high-priority triage to determine whether we have to slip the release to get this fixed.

```
F0920 12:49:17.437946       1 controller.go:157] Unable to perform initial IP allocation check: unable to refresh the service IP block: Get https://localhost:6443/api/v1/services: dial tcp 127.0.0.1:6443: i/o timeout
```

Initial artifacts triage: the .212 node seems to be very unhappy:

```
openshift-kube-apiserver/kube-apiserver-ip-10-0-128-212.ec2.internal                     | Running | 15 restarts
openshift-kube-controller-manager/kube-controller-manager-ip-10-0-128-212.ec2.internal   | Running | 16 restarts
openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-128-212.ec2.internal           | Running | 16 restarts
```

From the kubelet log on that node:

```
Sep 20 12:51:36 ip-10-0-128-212 hyperkube[1863]: I0920 12:51:36.279342    1863 kuberuntime_manager.go:808] checking backoff for container "sdn" in pod "sdn-25dhj_openshift-sdn(dedfb659-db9c-11e9-a707-0a154cfece42)"
Sep 20 12:51:36 ip-10-0-128-212 hyperkube[1863]: I0920 12:51:36.279598    1863 kuberuntime_manager.go:818] Back-off 5m0s restarting failed container=sdn pod=sdn-25dhj_openshift-sdn(dedfb659-db9c-11e9-a707-0a154cfece42)
Sep 20 12:51:36 ip-10-0-128-212 hyperkube[1863]: E0920 12:51:36.279638    1863 pod_workers.go:190] Error syncing pod dedfb659-db9c-11e9-a707-0a154cfece42 ("sdn-25dhj_openshift-sdn(dedfb659-db9c-11e9-a707-0a154cfece42)"), skipping: failed to "StartContainer" for "sdn" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=sdn pod=sdn-25dhj_openshift-sdn(dedfb659-db9c-11e9-a707-0a154cfece42)"
```

Which I think causes the "Unable to perform initial IP allocation check: unable to refresh the service IP block: Get https://localhost:6443/api/v1/services: dial tcp 127.0.0.1:6443: i/o timeout" in the apiserver logs?
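For anyone repeating this triage against a saved kubelet log: the back-off events above can be pulled out with a short script. This is a minimal sketch; the regex is inferred from the one excerpt quoted above and is not a documented kubelet log format.

```python
import re

# Pattern inferred from the kubelet excerpt above (an assumption, not a
# documented format): "Back-off 5m0s restarting failed container=sdn pod=..."
BACKOFF_RE = re.compile(
    r"Back-off (?P<delay>\S+) restarting failed "
    r"container=(?P<container>\S+) pod=(?P<pod>[^(\s]+)"
)

def find_crashloops(log_lines):
    """Return (pod, container, delay) for each back-off event in the log."""
    return [
        (m.group("pod"), m.group("container"), m.group("delay"))
        for line in log_lines
        if (m := BACKOFF_RE.search(line))
    ]

sample = [
    'Sep 20 12:51:36 ip-10-0-128-212 hyperkube[1863]: I0920 12:51:36.279598 '
    '1863 kuberuntime_manager.go:818] Back-off 5m0s restarting failed '
    'container=sdn pod=sdn-25dhj_openshift-sdn(dedfb659-db9c-11e9-a707-0a154cfece42)',
]
print(find_crashloops(sample))  # [('sdn-25dhj_openshift-sdn', 'sdn', '5m0s')]
```

Grouping by pod name over a whole log quickly surfaces which containers are stuck in a crash loop, which is how the sdn pod stands out here.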
Yup, this is the nasty loopback bug, fixed a week ago: https://github.com/openshift/containernetworking-plugins/pull/15

*** This bug has been marked as a duplicate of bug 1754638 ***

*** Bug 1754133 has been marked as a duplicate of this bug. ***

How can a bug fixed a week ago be the cause of current upgrade failures? This has failed twice in the last 24h:

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/417/
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/418/

Looking at https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/417/artifacts/e2e-aws-upgrade/pods.json:

- ip-10-0-143-142.ec2.internal is NotReady because of no CNI configuration
- multus-5xg2x is NotReady
- sdn-vrh9v is NotReady because the init container install-cni-plugins is failing

I don't think this one is the loopback issue. And we have no pod logs, so this is difficult to debug further. Moving on to the next failure.

Digging further, it looks like the SDN failure is because of this:

```
Sep 29 05:20:50 ip-10-0-143-142 hyperkube[1174]: I0929 05:20:50.342049    1174 event.go:221] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-sdn", Name:"sdn-vrh9v", UID:"52236c99-e277-11e9-8aee-0a8df94ebcc2", APIVersion:"v1", ResourceVersion:"33238", FieldPath:""}): type: 'Warning' reason: 'FailedMount' MountVolume.SetUp failed for volume "sdn-token-8465p" : secrets "sdn-token-8465p" is forbidden: User "system:node:ip-10-0-143-142.ec2.internal" cannot get resource "secrets" in API group "" in the namespace "openshift-sdn": no relationship found between node "ip-10-0-143-142.ec2.internal" and this object
```

Yeah, something weird happened where the node lost access to things after it rebooted. But it rebooted into rhcos 4.1, which is expected.
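The FailedMount denial quoted above has enough structure to extract the actors (secret, node, namespace) programmatically when scanning many events. A minimal sketch, assuming the message format matches that single quoted event; the regex is not derived from any Kubernetes API contract:

```python
import re

# Best-effort pattern for the node-identity denial quoted above; the exact
# wording is an assumption based on that one event, not a stable format.
FORBIDDEN_RE = re.compile(
    r'secrets "(?P<secret>[^"]+)" is forbidden: '
    r'User "system:node:(?P<node>[^"]+)" cannot get resource "secrets" '
    r'in API group "" in the namespace "(?P<ns>[^"]+)"'
)

def parse_denial(line):
    """Return {'secret': ..., 'node': ..., 'ns': ...} or None if no match."""
    m = FORBIDDEN_RE.search(line)
    return m.groupdict() if m else None

event = (
    'MountVolume.SetUp failed for volume "sdn-token-8465p" : '
    'secrets "sdn-token-8465p" is forbidden: '
    'User "system:node:ip-10-0-143-142.ec2.internal" cannot get resource '
    '"secrets" in API group "" in the namespace "openshift-sdn": '
    'no relationship found between node "ip-10-0-143-142.ec2.internal" and this object'
)
print(parse_denial(event))
```

The "no relationship found between node ... and this object" tail is the interesting part: the API server no longer believed any pod on that node needed the secret, which is why the mount was refused after the reboot.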
(In reply to Casey Callendrello from comment #10)

> Digging further, it looks like the SDN failure is because of this:
>
> Sep 29 05:20:50 ip-10-0-143-142 hyperkube[1174]: I0929 05:20:50.342049 1174 event.go:221] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-sdn", Name:"sdn-vrh9v", UID:"52236c99-e277-11e9-8aee-0a8df94ebcc2", APIVersion:"v1", ResourceVersion:"33238", FieldPath:""}): type: 'Warning' reason: 'FailedMount' MountVolume.SetUp failed for volume "sdn-token-8465p" : secrets "sdn-token-8465p" is forbidden: User "system:node:ip-10-0-143-142.ec2.internal" cannot get resource "secrets" in API group "" in the namespace "openshift-sdn": no relationship found between node "ip-10-0-143-142.ec2.internal" and this object

Witnessing this error in vSphere: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_release/5195/rehearse-5195-pull-ci-openshift-installer-master-e2e-vsphere/13/artifacts/e2e-vsphere/nodes/

pods/kube-apiserver-ip-10-0-128-212.ec2.internal container="kube-apiserver-7" is waiting: "CrashLoopBackOff":

```
o/informers/factory.go:133: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: i/o timeout
I0920 12:49:17.436008       1 trace.go:81] Trace[2025714792]: "Reflector k8s.io/client-go/informers/factory.go:133 ListAndWatch" (started: 2019-09-20 12:48:47.435470515 +0000 UTC m=+4.945813564) (total time: 30.00051517s):
Trace[2025714792]: [30.00051517s] [30.00051517s] END
E0920 12:49:17.436033       1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.RoleBinding: Get https://localhost:6443/apis/rbac.authorization.k8s.io/v1/rolebindings?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: i/o timeout
I0920 12:49:17.436251       1 trace.go:81] Trace[1248233046]: "Reflector k8s.io/client-go/informers/factory.go:133 ListAndWatch" (started: 2019-09-20 12:48:47.435819255 +0000 UTC m=+4.946162474) (total time: 30.000411262s):
Trace[1248233046]: [30.000411262s] [30.000411262s] END
E0920 12:49:17.436317       1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.StorageClass: Get https://localhost:6443/apis/storage.k8s.io/v1/storageclasses?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: i/o timeout
I0920 12:49:17.436256       1 trace.go:81] Trace[1001861287]: "Reflector k8s.io/client-go/informers/factory.go:133 ListAndWatch" (started: 2019-09-20 12:48:47.435870405 +0000 UTC m=+4.946213407) (total time: 30.000368976s):
Trace[1001861287]: [30.000368976s] [30.000368976s] END
E0920 12:49:17.436373       1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.ValidatingWebhookConfiguration: Get https://localhost:6443/apis/admissionregistration.k8s.io/v1beta1/validatingwebhookconfigurations?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: i/o timeout
F0920 12:49:17.437946       1 controller.go:157] Unable to perform initial IP allocation check: unable to refresh the service IP block: Get https://localhost:6443/api/v1/services: dial tcp 127.0.0.1:6443: i/o timeout
```

This means the API server cannot connect to itself via localhost, which is why I closed this BZ as a duplicate: https://bugzilla.redhat.com/show_bug.cgi?id=1754125#c6

```
openshift-kube-apiserver/kube-apiserver-ip-10-0-128-212.ec2.internal                     | Running | 15 restarts
openshift-kube-controller-manager/kube-controller-manager-ip-10-0-128-212.ec2.internal   | Running | 16 restarts
openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-128-212.ec2.internal           | Running | 16 restarts
```

shows that one particular node has a problem here.

Right, but that was from before the cri-o fix was merged. So that's expected. I suspect we can just close this.

Closing this. I scanned the last few upgrade failures, and none of them have been networking issues. We can file new bugs as appropriate.
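The "dial tcp 127.0.0.1:6443: i/o timeout" failure mode is just a TCP connect that never completes. For illustration only, roughly what the failing dial amounts to can be sketched as a socket connect under a deadline; the function name and return strings here are invented for the sketch, not part of any tooling from this bug:

```python
import socket

def probe_localhost_api(host="127.0.0.1", port=6443, timeout=5.0):
    """TCP-connect with a deadline, mimicking the failing dial above."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.timeout:
        # The symptom in the log: the connection attempt goes unanswered
        # until the deadline expires.
        return "i/o timeout"
    except OSError as exc:
        # e.g. connection refused, when nothing listens on the port at all
        return "error: " + type(exc).__name__
```

The distinction matters for triage: a refused connection fails fast and points at the apiserver process being down, while an i/o timeout means packets were silently dropped, which is consistent with the loopback CNI misconfiguration ("the nasty loopback bug") discussed earlier in this thread.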
*** This bug has been marked as a duplicate of bug 1754638 ***