Bug 1754125
Summary: During upgrade: unable to retrieve cluster version during upgrade: the server could not find the requested resource (get clusterversions.config.openshift.io version)

| Field | Value | Field | Value |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | Networking | Assignee: | Casey Callendrello <cdc> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | unspecified | CC: | aos-bugs, ccoleman, cdc, jcallen, mfojtik, mifiedle, wking |
| Version: | 4.2.0 | Keywords: | Reopened |
| Target Milestone: | --- | | |
| Target Release: | 4.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-10-03 15:42:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
W. Trevor King
2019-09-20 22:26:54 UTC
Upgrades from 4.1 to 4.2 are failing about 30% of the time due to this error. This blocks customers upgrading from 4.1 to 4.2. Needs high-priority triage to determine whether we have to slip the release to get this fixed.

```
F0920 12:49:17.437946       1 controller.go:157] Unable to perform initial IP allocation check: unable to refresh the service IP block: Get https://localhost:6443/api/v1/services: dial tcp 127.0.0.1:6443: i/o timeout
```

Initial artifacts triage: the .212 node seems to be very unhappy:

```
openshift-kube-apiserver/kube-apiserver-ip-10-0-128-212.ec2.internal                     | Running | 15 restarts
openshift-kube-controller-manager/kube-controller-manager-ip-10-0-128-212.ec2.internal   | Running | 16 restarts
openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-128-212.ec2.internal           | Running | 16 restarts
```

From the kubelet log on that node:

```
Sep 20 12:51:36 ip-10-0-128-212 hyperkube[1863]: I0920 12:51:36.279342    1863 kuberuntime_manager.go:808] checking backoff for container "sdn" in pod "sdn-25dhj_openshift-sdn(dedfb659-db9c-11e9-a707-0a154cfece42)"
Sep 20 12:51:36 ip-10-0-128-212 hyperkube[1863]: I0920 12:51:36.279598    1863 kuberuntime_manager.go:818] Back-off 5m0s restarting failed container=sdn pod=sdn-25dhj_openshift-sdn(dedfb659-db9c-11e9-a707-0a154cfece42)
Sep 20 12:51:36 ip-10-0-128-212 hyperkube[1863]: E0920 12:51:36.279638    1863 pod_workers.go:190] Error syncing pod dedfb659-db9c-11e9-a707-0a154cfece42 ("sdn-25dhj_openshift-sdn(dedfb659-db9c-11e9-a707-0a154cfece42)"), skipping: failed to "StartContainer" for "sdn" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=sdn pod=sdn-25dhj_openshift-sdn(dedfb659-db9c-11e9-a707-0a154cfece42)"
```

Which I think causes the "Unable to perform initial IP allocation check: unable to refresh the service IP block: Get https://localhost:6443/api/v1/services: dial tcp 127.0.0.1:6443: i/o timeout" in the apiserver logs?
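For anyone repeating this triage against a saved kubelet log: the back-off events above can be pulled out with a short script. This is a minimal sketch; the regex is inferred from the one excerpt quoted above and is not a documented kubelet log format.

```python
import re

# Pattern inferred from the kubelet excerpt above (an assumption, not a
# documented format): "Back-off 5m0s restarting failed container=sdn pod=..."
BACKOFF_RE = re.compile(
    r"Back-off (?P<delay>\S+) restarting failed "
    r"container=(?P<container>\S+) pod=(?P<pod>[^(\s]+)"
)

def find_crashloops(log_lines):
    """Return (pod, container, delay) for each back-off event in the log."""
    return [
        (m.group("pod"), m.group("container"), m.group("delay"))
        for line in log_lines
        if (m := BACKOFF_RE.search(line))
    ]

sample = [
    'Sep 20 12:51:36 ip-10-0-128-212 hyperkube[1863]: I0920 12:51:36.279598 '
    '1863 kuberuntime_manager.go:818] Back-off 5m0s restarting failed '
    'container=sdn pod=sdn-25dhj_openshift-sdn(dedfb659-db9c-11e9-a707-0a154cfece42)',
]
print(find_crashloops(sample))  # [('sdn-25dhj_openshift-sdn', 'sdn', '5m0s')]
```

Grouping by pod name over a whole log quickly surfaces which containers are stuck in a crash loop, which is how the sdn pod stands out here.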
Yup, this is the nasty loopback bug, fixed a week ago: https://github.com/openshift/containernetworking-plugins/pull/15

*** This bug has been marked as a duplicate of bug 1754638 ***

*** Bug 1754133 has been marked as a duplicate of this bug. ***

How can a bug fixed a week ago be the cause of current upgrade failures? This has failed twice in the last 24h:

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/417/
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/418/

Looking at https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/417/artifacts/e2e-aws-upgrade/pods.json:

- ip-10-0-143-142.ec2.internal is NotReady because of no CNI configuration
- multus-5xg2x is NotReady
- sdn-vrh9v is NotReady because the init container install-cni-plugins is failing

I don't think this one is the loopback issue. And we have no pod logs, so this is difficult to debug further. Moving on to the next failure.

Digging further, it looks like the SDN failure is because of this:

```
Sep 29 05:20:50 ip-10-0-143-142 hyperkube[1174]: I0929 05:20:50.342049    1174 event.go:221] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-sdn", Name:"sdn-vrh9v", UID:"52236c99-e277-11e9-8aee-0a8df94ebcc2", APIVersion:"v1", ResourceVersion:"33238", FieldPath:""}): type: 'Warning' reason: 'FailedMount' MountVolume.SetUp failed for volume "sdn-token-8465p" : secrets "sdn-token-8465p" is forbidden: User "system:node:ip-10-0-143-142.ec2.internal" cannot get resource "secrets" in API group "" in the namespace "openshift-sdn": no relationship found between node "ip-10-0-143-142.ec2.internal" and this object
```

Yeah, something weird happened where the node lost access to things after it rebooted. But it rebooted into rhcos 4.1, which is expected.
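The FailedMount denial quoted above has enough structure to extract the actors (secret, node, namespace) programmatically when scanning many events. A minimal sketch, assuming the message format matches that single quoted event; the regex is not derived from any Kubernetes API contract:

```python
import re

# Best-effort pattern for the node-identity denial quoted above; the exact
# wording is an assumption based on that one event, not a stable format.
FORBIDDEN_RE = re.compile(
    r'secrets "(?P<secret>[^"]+)" is forbidden: '
    r'User "system:node:(?P<node>[^"]+)" cannot get resource "secrets" '
    r'in API group "" in the namespace "(?P<ns>[^"]+)"'
)

def parse_denial(line):
    """Return {'secret': ..., 'node': ..., 'ns': ...} or None if no match."""
    m = FORBIDDEN_RE.search(line)
    return m.groupdict() if m else None

event = (
    'MountVolume.SetUp failed for volume "sdn-token-8465p" : '
    'secrets "sdn-token-8465p" is forbidden: '
    'User "system:node:ip-10-0-143-142.ec2.internal" cannot get resource '
    '"secrets" in API group "" in the namespace "openshift-sdn": '
    'no relationship found between node "ip-10-0-143-142.ec2.internal" and this object'
)
print(parse_denial(event))
```

The "no relationship found between node ... and this object" tail is the interesting part: the API server no longer believed any pod on that node needed the secret, which is why the mount was refused after the reboot.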
(In reply to Casey Callendrello from comment #10)

> Digging further, it looks like the SDN failure is because of this:
>
> Sep 29 05:20:50 ip-10-0-143-142 hyperkube[1174]: I0929 05:20:50.342049 1174 event.go:221] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-sdn", Name:"sdn-vrh9v", UID:"52236c99-e277-11e9-8aee-0a8df94ebcc2", APIVersion:"v1", ResourceVersion:"33238", FieldPath:""}): type: 'Warning' reason: 'FailedMount' MountVolume.SetUp failed for volume "sdn-token-8465p" : secrets "sdn-token-8465p" is forbidden: User "system:node:ip-10-0-143-142.ec2.internal" cannot get resource "secrets" in API group "" in the namespace "openshift-sdn": no relationship found between node "ip-10-0-143-142.ec2.internal" and this object

Witnessing this error in vSphere: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_release/5195/rehearse-5195-pull-ci-openshift-installer-master-e2e-vsphere/13/artifacts/e2e-vsphere/nodes/

pods/kube-apiserver-ip-10-0-128-212.ec2.internal container="kube-apiserver-7" is waiting: "CrashLoopBackOff":

```
o/informers/factory.go:133: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: i/o timeout
I0920 12:49:17.436008       1 trace.go:81] Trace[2025714792]: "Reflector k8s.io/client-go/informers/factory.go:133 ListAndWatch" (started: 2019-09-20 12:48:47.435470515 +0000 UTC m=+4.945813564) (total time: 30.00051517s):
Trace[2025714792]: [30.00051517s] [30.00051517s] END
E0920 12:49:17.436033       1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.RoleBinding: Get https://localhost:6443/apis/rbac.authorization.k8s.io/v1/rolebindings?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: i/o timeout
I0920 12:49:17.436251       1 trace.go:81] Trace[1248233046]: "Reflector k8s.io/client-go/informers/factory.go:133 ListAndWatch" (started: 2019-09-20 12:48:47.435819255 +0000 UTC m=+4.946162474) (total time: 30.000411262s):
Trace[1248233046]: [30.000411262s] [30.000411262s] END
E0920 12:49:17.436317       1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.StorageClass: Get https://localhost:6443/apis/storage.k8s.io/v1/storageclasses?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: i/o timeout
I0920 12:49:17.436256       1 trace.go:81] Trace[1001861287]: "Reflector k8s.io/client-go/informers/factory.go:133 ListAndWatch" (started: 2019-09-20 12:48:47.435870405 +0000 UTC m=+4.946213407) (total time: 30.000368976s):
Trace[1001861287]: [30.000368976s] [30.000368976s] END
E0920 12:49:17.436373       1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.ValidatingWebhookConfiguration: Get https://localhost:6443/apis/admissionregistration.k8s.io/v1beta1/validatingwebhookconfigurations?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: i/o timeout
F0920 12:49:17.437946       1 controller.go:157] Unable to perform initial IP allocation check: unable to refresh the service IP block: Get https://localhost:6443/api/v1/services: dial tcp 127.0.0.1:6443: i/o timeout
```

This means the API server cannot connect to itself via localhost, which is why I closed this BZ as a duplicate: https://bugzilla.redhat.com/show_bug.cgi?id=1754125#c6

```
openshift-kube-apiserver/kube-apiserver-ip-10-0-128-212.ec2.internal                     | Running | 15 restarts
openshift-kube-controller-manager/kube-controller-manager-ip-10-0-128-212.ec2.internal   | Running | 16 restarts
openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-128-212.ec2.internal           | Running | 16 restarts
```

shows that one particular node has a problem here.

Right, but that was from before the cri-o fix was merged. So that's expected. I suspect we can just close this.

Closing this. I scanned the last few upgrade failures, and none of them have been networking issues. We can file new bugs as appropriate.
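The "dial tcp 127.0.0.1:6443: i/o timeout" failure mode is just a TCP connect that never completes. For illustration only, roughly what the failing dial amounts to can be sketched as a socket connect under a deadline; the function name and return strings here are invented for the sketch, not part of any tooling from this bug:

```python
import socket

def probe_localhost_api(host="127.0.0.1", port=6443, timeout=5.0):
    """TCP-connect with a deadline, mimicking the failing dial above."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.timeout:
        # The symptom in the log: the connection attempt goes unanswered
        # until the deadline expires.
        return "i/o timeout"
    except OSError as exc:
        # e.g. connection refused, when nothing listens on the port at all
        return "error: " + type(exc).__name__
```

The distinction matters for triage: a refused connection fails fast and points at the apiserver process being down, while an i/o timeout means packets were silently dropped, which is consistent with the loopback CNI misconfiguration ("the nasty loopback bug") discussed earlier in this thread.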
*** This bug has been marked as a duplicate of bug 1754638 ***