In an upgrade from 4.2.0-0.ci-2019-09-20-072742 to 4.2.0-0.ci-2019-09-20-103359 [1]:

Sep 20 11:36:17.601: INFO: cluster upgrade is Progressing: Working towards 4.2.0-0.ci-2019-09-20-103359: 25% complete
Sep 20 11:36:28.454: INFO: unable to retrieve cluster version during upgrade: the server could not find the requested resource (get clusterversions.config.openshift.io version)
Sep 20 11:36:37.614: INFO: cluster upgrade is Progressing: Working towards 4.2.0-0.ci-2019-09-20-103359: 25% complete

Bracketing error events:

Sep 20 11:33:22.832 E ns/openshift-kube-apiserver-operator pod/kube-apiserver-operator-5c6cdfd454-ntw28 node/ip-10-0-130-255.ec2.internal container=kube-apiserver-operator container exited with code 255 (Error): 1229f60d4836", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-apiserver changed: Degraded message changed from "NodeControllerDegraded: All master node(s) are ready\nStaticPodsDegraded: nodes/ip-10-0-128-212.ec2.internal pods/kube-apiserver-ip-10-0-128-212.ec2.internal container=\"kube-apiserver-6\" is not ready" to "NodeControllerDegraded: All master node(s) are ready"\nI0920 11:20:46.085767 1 event.go:209] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"bb7595b3-db96-11e9-9481-1229f60d4836", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'PodCreated' Created Pod/revision-pruner-6-ip-10-0-128-212.ec2.internal -n openshift-kube-apiserver because it was missing\nW0920 11:24:36.422033 1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 14863 (16521)\nW0920 11:25:52.244738 1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 14842 (16862)\nW0920 11:26:27.258144 1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 15671 (17031)\nW0920 11:27:57.643811 1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 14858 (17429)\nW0920 11:30:51.428601 1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 16657 (18716)\nW0920 11:32:10.269798 1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 17202 (19063)\nI0920 11:33:21.780841 1 cmd.go:78] Received SIGTERM or SIGINT signal, shutting down controller.\nF0920 11:33:21.780928 1 leaderelection.go:66] leaderelection lost\n

Sep 20 11:34:47.591 E ns/openshift-kube-controller-manager-operator pod/kube-controller-manager-operator-796b57f544-4twgc node/ip-10-0-128-212.ec2.internal container=kube-controller-manager-operator container exited with code 255 (Error): b96-11e9-9481-1229f60d4836", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-controller-manager changed: Degraded message changed from "NodeControllerDegraded: All master node(s) are ready\nStaticPodsDegraded: nodes/ip-10-0-128-212.ec2.internal pods/kube-controller-manager-ip-10-0-128-212.ec2.internal container=\"kube-controller-manager-7\" is not ready" to "NodeControllerDegraded: All master node(s) are ready"\nI0920 11:19:24.515767 1 event.go:209] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-controller-manager-operator", Name:"kube-controller-manager-operator", UID:"bb75197e-db96-11e9-9481-1229f60d4836", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'PodCreated' Created Pod/revision-pruner-7-ip-10-0-128-212.ec2.internal -n openshift-kube-controller-manager because it was missing\nW0920 11:23:56.833830 1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 14852 (16365)\nW0920 11:24:58.232367 1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 14852 (16614)\nW0920 11:26:11.631933 1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 15273 (16960)\nW0920 11:32:24.838814 1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 16502 (19131)\nW0920 11:33:29.238779 1 reflector.go:289] k8s.io/client-go/informers/factory.go:133: watch of *v1.ConfigMap ended with: too old resource version: 16769 (19430)\nI0920 11:34:44.893055 1 cmd.go:79] Received SIGTERM or SIGINT signal, shutting down controller.\nF0920 11:34:44.893218 1 leaderelection.go:66] leaderelection lost\nI0920 11:34:44.901307 1 targetconfigcontroller.go:461] Shutting down TargetConfigController\n

Sep 20 11:34:58.162 E ns/openshift-kube-scheduler-operator pod/openshift-kube-scheduler-operator-59b5c549db-kxklj node/ip-10-0-130-255.ec2.internal container=kube-scheduler-operator-container container exited with code 137 (ContainerStatusUnknown): The container could not be located when the pod was terminated

Sep 20 11:38:34.761 E ns/openshift-machine-api pod/machine-api-controllers-7fb6b86d8f-5smk4 node/ip-10-0-157-157.ec2.internal container=controller-manager container exited with code 1 (Error):

We've seen this seven times today on e2e jobs [2].

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/7375
[2]: https://ci-search-ci-search-next.svc.ci.openshift.org/chart?search=unable%20to%20retrieve%20cluster%20version%20during%20upgrade:%20the%20server%20could%20not%20find%20the%20requested%20resource
Upgrades from 4.1 to 4.2 are failing about 30% of the time due to this error, which blocks customers from upgrading from 4.1 to 4.2. This needs high-priority triage to determine whether we have to slip the release to get it fixed.
F0920 12:49:17.437946 1 controller.go:157] Unable to perform initial IP allocation check: unable to refresh the service IP block: Get https://localhost:6443/api/v1/services: dial tcp 127.0.0.1:6443: i/o timeout
Initial artifacts triage: the .212 node seems to be very unhappy:

openshift-kube-apiserver/kube-apiserver-ip-10-0-128-212.ec2.internal | Running | 15 restarts
openshift-kube-controller-manager/kube-controller-manager-ip-10-0-128-212.ec2.internal | Running | 16 restarts
openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-128-212.ec2.internal | Running | 16 restarts

From the kubelet log on that node:

Sep 20 12:51:36 ip-10-0-128-212 hyperkube[1863]: I0920 12:51:36.279342 1863 kuberuntime_manager.go:808] checking backoff for container "sdn" in pod "sdn-25dhj_openshift-sdn(dedfb659-db9c-11e9-a707-0a154cfece42)"
Sep 20 12:51:36 ip-10-0-128-212 hyperkube[1863]: I0920 12:51:36.279598 1863 kuberuntime_manager.go:818] Back-off 5m0s restarting failed container=sdn pod=sdn-25dhj_openshift-sdn(dedfb659-db9c-11e9-a707-0a154cfece42)
Sep 20 12:51:36 ip-10-0-128-212 hyperkube[1863]: E0920 12:51:36.279638 1863 pod_workers.go:190] Error syncing pod dedfb659-db9c-11e9-a707-0a154cfece42 ("sdn-25dhj_openshift-sdn(dedfb659-db9c-11e9-a707-0a154cfece42)"), skipping: failed to "StartContainer" for "sdn" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=sdn pod=sdn-25dhj_openshift-sdn(dedfb659-db9c-11e9-a707-0a154cfece42)"

I think this is what causes "Unable to perform initial IP allocation check: unable to refresh the service IP block: Get https://localhost:6443/api/v1/services: dial tcp 127.0.0.1:6443: i/o timeout" in the apiserver logs?
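For context on the "Back-off 5m0s" above: the kubelet doubles a crash-looping container's restart delay and caps it. A rough sketch of that progression, assuming the kubelet defaults of a 10s initial delay and a 300s (5m0s) cap (these constants are not taken from the logs here):

```shell
# Sketch of the kubelet's crash-loop restart back-off: start at 10s,
# double after each failed restart, cap at 300s (5m0s). A pod that has
# been failing for a while, like the sdn pod above, sits at the cap.
backoff=10
for restart in 1 2 3 4 5 6 7; do
  echo "restart $restart: ${backoff}s"
  if [ $(( backoff * 2 )) -gt 300 ]; then
    backoff=300
  else
    backoff=$(( backoff * 2 ))
  fi
done
```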
Yup, this is the nasty loopback bug - fixed a week ago - https://github.com/openshift/containernetworking-plugins/pull/15
*** This bug has been marked as a duplicate of bug 1754638 ***
*** Bug 1754133 has been marked as a duplicate of this bug. ***
How can a bug fixed a week ago be the cause of current upgrade failures? This has failed twice in the last 24h:

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/417/
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/418/
Looking at https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2/417/artifacts/e2e-aws-upgrade/pods.json:

- ip-10-0-143-142.ec2.internal is NotReady because of no CNI configuration
- multus-5xg2x is NotReady
- sdn-vrh9v is NotReady because the init container install-cni-plugins is failing

I don't think this one is the loopback issue. And we have no pod logs, so this is difficult to debug further. Moving on to the next failure.
Digging further, it looks like the SDN failure is because of this:

Sep 29 05:20:50 ip-10-0-143-142 hyperkube[1174]: I0929 05:20:50.342049 1174 event.go:221] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-sdn", Name:"sdn-vrh9v", UID:"52236c99-e277-11e9-8aee-0a8df94ebcc2", APIVersion:"v1", ResourceVersion:"33238", FieldPath:""}): type: 'Warning' reason: 'FailedMount' MountVolume.SetUp failed for volume "sdn-token-8465p" : secrets "sdn-token-8465p" is forbidden: User "system:node:ip-10-0-143-142.ec2.internal" cannot get resource "secrets" in API group "" in the namespace "openshift-sdn": no relationship found between node "ip-10-0-143-142.ec2.internal" and this object
Yeah, something weird happened where the node lost access to things after it rebooted. But it rebooted into RHCOS 4.1, which is expected.
(In reply to Casey Callendrello from comment #10)
> Digging further, it looks like the SDN failure is because of this:
>
> Sep 29 05:20:50 ip-10-0-143-142 hyperkube[1174]: I0929 05:20:50.342049
> 1174 event.go:221] Event(v1.ObjectReference{Kind:"Pod",
> Namespace:"openshift-sdn", Name:"sdn-vrh9v",
> UID:"52236c99-e277-11e9-8aee-0a8df94ebcc2", APIVersion:"v1",
> ResourceVersion:"33238", FieldPath:""}): type: 'Warning' reason:
> 'FailedMount' MountVolume.SetUp failed for volume "sdn-token-8465p" :
> secrets "sdn-token-8465p" is forbidden: User
> "system:node:ip-10-0-143-142.ec2.internal" cannot get resource "secrets" in
> API group "" in the namespace "openshift-sdn": no relationship found between
> node "ip-10-0-143-142.ec2.internal" and this object

We are seeing this error in vSphere as well: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_release/5195/rehearse-5195-pull-ci-openshift-installer-master-e2e-vsphere/13/artifacts/e2e-vsphere/nodes/
pods/kube-apiserver-ip-10-0-128-212.ec2.internal container=\"kube-apiserver-7\" is waiting: \"CrashLoopBackOff\":

```
"message": "o/informers/factory.go:133: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500\u0026resourceVersion=0: dial tcp 127.0.0.1:6443: i/o timeout\nI0920 12:49:17.436008 1 trace.go:81] Trace[2025714792]: \"Reflector k8s.io/client-go/informers/factory.go:133 ListAndWatch\" (started: 2019-09-20 12:48:47.435470515 +0000 UTC m=+4.945813564) (total time: 30.00051517s):\nTrace[2025714792]: [30.00051517s] [30.00051517s] END\nE0920 12:49:17.436033 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.RoleBinding: Get https://localhost:6443/apis/rbac.authorization.k8s.io/v1/rolebindings?limit=500\u0026resourceVersion=0: dial tcp 127.0.0.1:6443: i/o timeout\nI0920 12:49:17.436251 1 trace.go:81] Trace[1248233046]: \"Reflector k8s.io/client-go/informers/factory.go:133 ListAndWatch\" (started: 2019-09-20 12:48:47.435819255 +0000 UTC m=+4.946162474) (total time: 30.000411262s):\nTrace[1248233046]: [30.000411262s] [30.000411262s] END\nE0920 12:49:17.436317 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.StorageClass: Get https://localhost:6443/apis/storage.k8s.io/v1/storageclasses?limit=500\u0026resourceVersion=0: dial tcp 127.0.0.1:6443: i/o timeout\nI0920 12:49:17.436256 1 trace.go:81] Trace[1001861287]: \"Reflector k8s.io/client-go/informers/factory.go:133 ListAndWatch\" (started: 2019-09-20 12:48:47.435870405 +0000 UTC m=+4.946213407) (total time: 30.000368976s):\nTrace[1001861287]: [30.000368976s] [30.000368976s] END\nE0920 12:49:17.436373 1 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1beta1.ValidatingWebhookConfiguration: Get https://localhost:6443/apis/admissionregistration.k8s.io/v1beta1/validatingwebhookconfigurations?limit=500\u0026resourceVersion=0: dial tcp 127.0.0.1:6443: i/o timeout\nF0920 12:49:17.437946 1 controller.go:157] Unable to perform initial IP allocation check: unable to refresh the service IP block: Get https://localhost:6443/api/v1/services: dial tcp 127.0.0.1:6443: i/o timeout\n",
```

This means the API server cannot connect to itself via localhost, which is why I closed this BZ as a duplicate: https://bugzilla.redhat.com/show_bug.cgi?id=1754125#c6

openshift-kube-apiserver/kube-apiserver-ip-10-0-128-212.ec2.internal | Running | 15 restarts
openshift-kube-controller-manager/kube-controller-manager-ip-10-0-128-212.ec2.internal | Running | 16 restarts
openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-128-212.ec2.internal | Running | 16 restarts

This shows that one particular node has a problem here.
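A quick way to confirm this failure mode from a shell on the node is to check whether a connection to 127.0.0.1:6443 times out rather than being refused. A timeout means packets to loopback are silently dropped (consistent with the broken-loopback bug), while "connection refused" would just mean nothing is listening. This is a minimal diagnostic sketch, not part of the CI tooling:

```shell
# Classify why a TCP connect to the apiserver's secure port fails.
# On the failing node, the apiserver logs show "i/o timeout" rather
# than "connection refused", i.e. loopback traffic is being dropped.
probe() {
  # /dev/tcp is a bash feature; timeout(1) exits 124 on a timeout.
  if timeout 2 bash -c "echo > /dev/tcp/127.0.0.1/$1" 2>/dev/null; then
    echo open
  elif [ $? -eq 124 ]; then
    echo timeout            # connect hung: packets dropped on loopback
  else
    echo refused-or-other   # immediate failure: nothing listening
  fi
}

probe 6443
```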
Right, but that was from before the cri-o fix was merged. So that's expected. I suspect we can just close this.
Closing this. I scanned the last few upgrade failures, and none of them have been networking issues. We can file new bugs as appropriate. *** This bug has been marked as a duplicate of bug 1754638 ***