Recently (since ~Jan 2nd) our aws-ovn-upgrade job has been failing almost every time on the upgrade test "[sig-network] pods should successfully create sandboxes by other"; example here [0]. After the upgrade is complete, there are failures along these lines:

ns/openshift-kube-controller-manager pod/kube-controller-manager-guard-ip-10-0-149-222.us-west-2.compute.internal node/ip-10-0-149-222.us-west-2.compute.internal - 376.91 seconds after deletion - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kube-controller-manager-guard-ip-10-0-149-222.us-west-2.compute.internal_openshift-kube-controller-manager_3a2d0426-d4df-4faa-9b7f-47acb5466fda_0(5e0f557c05f0dcb455b548b1213756c0300b814e9961c692b7e9211fc323e4ee): error adding pod openshift-kube-controller-manager_kube-controller-manager-guard-ip-10-0-149-222.us-west-2.compute.internal to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): Multus: [openshift-kube-controller-manager/kube-controller-manager-guard-ip-10-0-149-222.us-west-2.compute.internal/3a2d0426-d4df-4faa-9b7f-47acb5466fda]: have you checked that your default network is ready? still waiting for readinessindicatorfile @ /var/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition

I did not see this happening on the GCP version of this job, which may be a clue. The test code [1] looks at all the events and parses the "Failed to create pod sandbox" messages; it allows these events to occur up to 5 seconds after the pod was deleted. As you can see in the message above, this one came 376 seconds after the deletion.

[0] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade/1479445783269871616
[1] https://github.com/openshift/origin/blob/0e6a62416ffcc8d2189a0243977006c5b2f9fa2c/pkg/synthetictests/networking.go
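For illustration only (this is my own minimal Go sketch, not the actual code in [1]), the tolerance check works roughly like this: a FailedCreatePodSandBox event is only excused if it lands within a few seconds of the pod's deletion, which is why the 376-second gap above gets flagged. The type and function names here are hypothetical.

    package main

    import (
    	"fmt"
    	"time"
    )

    // sandboxEvent is a simplified stand-in for the monitor events the real test iterates over.
    type sandboxEvent struct {
    	pod       string
    	eventTime time.Time
    }

    // deletionTolerance mirrors the ~5s window the test allows after a pod deletion.
    const deletionTolerance = 5 * time.Second

    // isToleratedFailure reports whether a sandbox-creation failure happened close
    // enough to the pod's deletion to be expected rather than treated as a bug.
    func isToleratedFailure(ev sandboxEvent, deletedAt time.Time) bool {
    	return ev.eventTime.Sub(deletedAt) <= deletionTolerance
    }

    func main() {
    	deletedAt := time.Now()
    	ev := sandboxEvent{
    		pod:       "kube-controller-manager-guard",
    		eventTime: deletedAt.Add(376 * time.Second),
    	}
    	fmt.Println("tolerated:", isToleratedFailure(ev, deletedAt)) // false: 376s >> 5s
    }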
@dosmith, I spent some time trying to dig into what is going on here, but I didn't know what to look for. Please let me know if there is anything I can try to gather to help figure this one out; the job(s) have a lot of artifacts to dig through. I also looked for any recent changes in our test code, multus, and OVN, but nothing seemed relevant or lined up with when this went south (Jan 2nd-ish).
This error is indicative of OVN-K failing to produce a configuration file, and Multus has given the unfortunate news. Assigning to OVN-K component for triage.
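For reference, a minimal sketch (my own illustration, not the Multus source) of the readiness-indicator wait that produces the "pollimmediate ... timed out" message above: Multus polls for the CNI config file that the default network (OVN-K) is expected to write, and gives up after a timeout. The interval and timeout values below are placeholders, not Multus's actual settings.

    package main

    import (
    	"fmt"
    	"os"
    	"time"

    	"k8s.io/apimachinery/pkg/util/wait"
    )

    func main() {
    	// Path taken from the error message above.
    	readinessFile := "/var/run/multus/cni/net.d/10-ovn-kubernetes.conf"

    	// Poll until the default network has written its config file, or time out.
    	err := wait.PollImmediate(1*time.Second, 30*time.Second, func() (bool, error) {
    		_, statErr := os.Stat(readinessFile)
    		if statErr == nil {
    			return true, nil // config file exists; default network is ready
    		}
    		if os.IsNotExist(statErr) {
    			return false, nil // keep polling
    		}
    		return false, statErr // unexpected error; stop polling
    	})
    	if err != nil {
    		fmt.Println("still waiting for readinessindicatorfile:", err)
    	}
    }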
*** Bug 2038386 has been marked as a duplicate of this bug. ***
*** Bug 1961772 has been marked as a duplicate of this bug. ***
Update: what's happening is that two new pods were introduced in 4.10 (kube-controller-manager-guard and openshift-kube-scheduler-guard) by these two PRs [0][1], which were built on top of this library-go PR [2]. They only run on master nodes and are evicted when a node is about to be rebooted. The eviction deletes the -guard pods, but they are restarted within ~5s and come back fully before the node actually reboots. When the node comes back up after the reboot, these pods initially time out waiting for the ovnk config file and the sandbox error is reported. The openshift-tests code keeps a record of the original pod deletion that happened with the eviction, so from the test's point of view this sandbox error arrives much later (~5m) and it reports a failure [3].

The new guard-pod PRs indicate that a PodDisruptionBudget should be configured along with them, but I don't see one in a running 4.10 cluster (a quick way to check is sketched after the references below). I don't know if that matters, or whether I'm just missing where it lives. I also can't yet explain why the guard pods are deleted by the eviction but started right back up again while the node is still cordoned. If this is proper behavior, I can look into fixing the test code to account for it and not report a failure. If it is not proper behavior, I wonder whether the pod should stay evicted and only be started again after the node is rebooted and finally uncordoned; the ovnk config file should be in place by then and no sandbox error would be reported.

[0] https://github.com/openshift/cluster-kube-scheduler-operator/pull/373
[1] https://github.com/openshift/cluster-kube-controller-manager-operator/pull/568
[2] https://github.com/openshift/library-go/pull/1238
[3] https://github.com/openshift/origin/blob/b58b70a0d0084e3be2b2faf7c030d06c6df3f569/pkg/synthetictests/networking.go#L79-L81
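On the missing PodDisruptionBudget: a hypothetical client-go sketch of how one might list PDBs in the two guard-pod namespaces, assuming a standard kubeconfig at the default location. None of this comes from the PRs themselves; it is just one way to confirm whether a PDB exists.

    package main

    import (
    	"context"
    	"fmt"

    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/client-go/kubernetes"
    	"k8s.io/client-go/tools/clientcmd"
    )

    func main() {
    	// Assumption: ~/.kube/config points at the cluster under test.
    	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    	if err != nil {
    		panic(err)
    	}
    	client, err := kubernetes.NewForConfig(config)
    	if err != nil {
    		panic(err)
    	}
    	// The guard pods live in these namespaces; list any PodDisruptionBudgets there.
    	for _, ns := range []string{"openshift-kube-controller-manager", "openshift-kube-scheduler"} {
    		pdbs, err := client.PolicyV1().PodDisruptionBudgets(ns).List(context.TODO(), metav1.ListOptions{})
    		if err != nil {
    			panic(err)
    		}
    		fmt.Printf("%s: %d PodDisruptionBudget(s)\n", ns, len(pdbs.Items))
    		for _, pdb := range pdbs.Items {
    			fmt.Println("  -", pdb.Name)
    		}
    	}
    }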
slack thread with some confirmation that this belongs with kube-scheduler: https://coreos.slack.com/archives/CKJR6200N/p1642096272047700
*** Bug 2042956 has been marked as a duplicate of this bug. ***
The real fixes for this are still a work in progress, but in the meantime we have merged this commit [0], which turns this very specific case (a guard pod hitting this sandbox failure after a reboot) into a flake. The job is no longer perma-failing [1].

[0] https://github.com/openshift/origin/commit/333d91371a1835f499073208dbb712467921aea5
[1] https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade
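For context, a rough illustration of the shape of that workaround (this is not the merged commit itself, and the regexp below is hypothetical): recognize the very specific guard-pod sandbox-failure message and classify it as a flake rather than a failure.

    package main

    import (
    	"fmt"
    	"regexp"
    )

    // guardPodSandboxFailure is a hypothetical matcher for the specific message shape:
    // a *-guard pod whose sandbox creation failed some time after its deletion.
    var guardPodSandboxFailure = regexp.MustCompile(
    	`pod/(kube-controller-manager|openshift-kube-scheduler)-guard-\S+ .*reason/FailedCreatePodSandBox`)

    // classify returns "flake" for the known guard-pod case and "failure" otherwise.
    func classify(eventMessage string) string {
    	if guardPodSandboxFailure.MatchString(eventMessage) {
    		return "flake"
    	}
    	return "failure"
    }

    func main() {
    	msg := "pod/kube-controller-manager-guard-ip-10-0-149-222.us-west-2.compute.internal " +
    		"node/ip-10-0-149-222.us-west-2.compute.internal - 376.91 seconds after deletion - " +
    		"reason/FailedCreatePodSandBox Failed to create pod sandbox"
    	fmt.Println(classify(msg)) // flake
    }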
(In reply to jamo luhrsen from comment #9)
> The real fixes for this are still a work in progress, but in the meantime we
> have merged this commit [0], which turns this very specific case (a guard pod
> hitting this sandbox failure after a reboot) into a flake. The job is no
> longer perma-failing [1].
>
> [0] https://github.com/openshift/origin/commit/333d91371a1835f499073208dbb712467921aea5
> [1] https://testgrid.k8s.io/redhat-openshift-ocp-release-4.10-informing#periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade

Also, I have a Jira item for myself to revert the test workaround once all the real fixes make it in. No hurry, but I didn't want to lose sight of it and forget to revert: https://issues.redhat.com/browse/SDN-2636
Moving back to POST as there's still one PR left for merging in the KCM component.
Verified the bug with the nightly build below: I see that the guard pods for KS/KCM/KAS are not being deleted and restarted. However, I do see an issue where one of the installer pods for kube-apiserver is in Error state; a describe on the pod shows the output below. I will raise a separate bug for that.

Status:       Failed
IP:           10.128.0.65
IPs:
  IP:  10.128.0.65
Containers:
  installer:
    Container ID:  cri-o://ca4f50f6f6776cbca88811b00f9666c6f14c176b22e3b19292afaae3dcc8f11d
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2f3e877035668bafd5d8e87fd106dcb010973638b7993411eb5df074c5cffe3c
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2f3e877035668bafd5d8e87fd106dcb010973638b7993411eb5df074c5cffe3c
    Port:          <none>
    Host Port:     <none>
    Command:
      cluster-kube-apiserver-operator
      installer
    Args:  -v=2 --revision=6 --namespace=openshift-kube-apiserver --pod=kube-apiserver-pod --resource-dir=/etc/kubernetes/static-pod-resources --pod-manifest-dir=/etc/kubernetes/manifests --configmaps=kube-apiserver-pod --configmaps=config --configmaps=kube-apiserver-cert-syncer-kubeconfig --optional-configmaps=oauth-metadata --optional-configmaps=cloud-config --configmaps=bound-sa-token-signing-certs --configmaps=etcd-serving-ca --optional-configmaps=kube-apiserver-server-ca --configmaps=kubelet-serving-ca --configmaps=sa-token-signing-certs --configmaps=kube-apiserver-audit-policies --secrets=etcd-client --optional-secrets=encryption-config --secrets=localhost-recovery-serving-certkey --secrets=localhost-recovery-client-token --optional-secrets=webhook-authenticator --cert-dir=/etc/kubernetes/static-pod-resources/kube-apiserver-certs --cert-configmaps=aggregator-client-ca --cert-configmaps=client-ca --optional-cert-configmaps=trusted-ca-bundle --cert-configmaps=control-plane-node-kubeconfig --cert-configmaps=check-endpoints-kubeconfig --cert-secrets=aggregator-client --cert-secrets=localhost-serving-cert-certkey --cert-secrets=service-network-serving-certkey --cert-secrets=external-loadbalancer-serving-certkey --cert-secrets=internal-loadbalancer-serving-certkey --cert-secrets=bound-service-account-signing-key --cert-secrets=control-plane-node-admin-client-cert-key --cert-secrets=check-endpoints-client-cert-key --cert-secrets=kubelet-client --cert-secrets=node-kubeconfigs --optional-cert-secrets=user-serving-cert --optional-cert-secrets=user-serving-cert-000 --optional-cert-secrets=user-serving-cert-001 --optional-cert-secrets=user-serving-cert-002 --optional-cert-secrets=user-serving-cert-003 --optional-cert-secrets=user-serving-cert-004 --optional-cert-secrets=user-serving-cert-005 --optional-cert-secrets=user-serving-cert-006 --optional-cert-secrets=user-serving-cert-007 --optional-cert-secrets=user-serving-cert-008 --optional-cert-secrets=user-serving-cert-009
    State:       Terminated
      Reason:    Error
      Message:   01", (string) (len=21) "user-serving-cert-002", (string) (len=21) "user-serving-cert-003", (string) (len=21) "user-serving-cert-004", (string) (len=21) "user-serving-cert-005", (string) (len=21) "user-serving-cert-006", (string) (len=21) "user-serving-cert-007", (string) (len=21) "user-serving-cert-008", (string) (len=21) "user-serving-cert-009" }, CertConfigMapNamePrefixes: ([]string) (len=4 cap=4) { (string) (len=20) "aggregator-client-ca", (string) (len=9) "client-ca", (string) (len=29) "control-plane-node-kubeconfig", (string) (len=26) "check-endpoints-kubeconfig" }, OptionalCertConfigMapNamePrefixes: ([]string) (len=1 cap=1) { (string) (len=17) "trusted-ca-bundle" }, CertDir: (string) (len=57) "/etc/kubernetes/static-pod-resources/kube-apiserver-certs", ResourceDir: (string) (len=36) "/etc/kubernetes/static-pod-resources", PodManifestDir: (string) (len=25) "/etc/kubernetes/manifests", Timeout: (time.Duration) 2m0s, StaticPodManifestsLockFile: (string) "", PodMutationFns: ([]installerpod.PodMutationFunc) <nil>, KubeletVersion: (string) "" })
W0131 07:21:52.106709       1 cmd.go:413] unable to get owner reference (falling back to namespace): Get "https://172.30.0.1:443/api/v1/namespaces/openshift-kube-apiserver/pods/installer-6-ip-10-0-178-236.us-east-2.compute.internal": dial tcp 172.30.0.1:443: i/o timeout
W0131 07:22:11.786877       1 cmd.go:426] unable to get kubelet version for node "ip-10-0-178-236.us-east-2.compute.internal": Get "https://172.30.0.1:443/api/v1/nodes/ip-10-0-178-236.us-east-2.compute.internal": context deadline exceeded
I0131 07:22:11.786930       1 cmd.go:284] Creating target resource directory "/etc/kubernetes/static-pod-resources/kube-apiserver-pod-6" ...
I0131 07:22:11.787032       1 cmd.go:212] Creating target resource directory "/etc/kubernetes/static-pod-resources/kube-apiserver-pod-6" ...
I0131 07:22:11.787042       1 cmd.go:220] Getting secrets ...
F0131 07:22:11.796879       1 cmd.go:101] failed to copy: timed out waiting for the condition

Below are the steps I followed to verify the bug:
=====================================================
1) Install the latest 4.10 cluster.
2) Drain a master node using the command `oc adm drain ip-10-0-140-122.us-east-2.compute.internal --force --ignore-daemonsets --delete-emptydir-data`.
3) I see that no guard pods for kube-scheduler, KCM, or KAS are present on the drained node.

[knarra@knarra openshift-client-linux-4.9.0-0.nightly-2022-01-28-192738]$ oc get pods -n openshift-kube-scheduler
NAME                                                                        READY   STATUS      RESTARTS   AGE
installer-2-ip-10-0-178-236.us-east-2.compute.internal                      0/1     Completed   0          8h
installer-3-ip-10-0-178-236.us-east-2.compute.internal                      0/1     Completed   0          8h
installer-4-ip-10-0-178-236.us-east-2.compute.internal                      0/1     Completed   0          8h
installer-5-ip-10-0-178-236.us-east-2.compute.internal                      0/1     Completed   0          8h
installer-5-ip-10-0-192-9.us-east-2.compute.internal                        0/1     Completed   0          8h
installer-6-ip-10-0-178-236.us-east-2.compute.internal                      0/1     Completed   0          8h
installer-6-ip-10-0-192-9.us-east-2.compute.internal                        0/1     Completed   0          8h
openshift-kube-scheduler-guard-ip-10-0-178-236.us-east-2.compute.internal   1/1     Running     0          8h
openshift-kube-scheduler-guard-ip-10-0-192-9.us-east-2.compute.internal     1/1     Running     0          6h12m
openshift-kube-scheduler-ip-10-0-140-122.us-east-2.compute.internal         3/3     Running     0          8h
openshift-kube-scheduler-ip-10-0-178-236.us-east-2.compute.internal         3/3     Running     0          8h
openshift-kube-scheduler-ip-10-0-192-9.us-east-2.compute.internal           3/3     Running     0          8h

[knarra@knarra openshift-client-linux-4.9.0-0.nightly-2022-01-28-192738]$ oc get pods -n openshift-kube-controller-manager
NAME                                                                        READY   STATUS      RESTARTS      AGE
installer-4-ip-10-0-178-236.us-east-2.compute.internal                      0/1     Completed   0             8h
installer-5-ip-10-0-178-236.us-east-2.compute.internal                      0/1     Completed   0             8h
installer-5-ip-10-0-192-9.us-east-2.compute.internal                        0/1     Completed   0             8h
installer-6-ip-10-0-178-236.us-east-2.compute.internal                      0/1     Completed   0             8h
installer-6-ip-10-0-192-9.us-east-2.compute.internal                        0/1     Completed   0             8h
installer-7-ip-10-0-178-236.us-east-2.compute.internal                      0/1     Completed   0             8h
installer-7-ip-10-0-192-9.us-east-2.compute.internal                        0/1     Completed   0             8h
kube-controller-manager-guard-ip-10-0-178-236.us-east-2.compute.internal    1/1     Running     0             8h
kube-controller-manager-guard-ip-10-0-192-9.us-east-2.compute.internal      1/1     Running     0             8h
kube-controller-manager-ip-10-0-140-122.us-east-2.compute.internal          4/4     Running     0             8h
kube-controller-manager-ip-10-0-178-236.us-east-2.compute.internal          4/4     Running     0             8h
kube-controller-manager-ip-10-0-192-9.us-east-2.compute.internal            4/4     Running     1 (8h ago)    8h

[knarra@knarra openshift-client-linux-4.9.0-0.nightly-2022-01-28-192738]$ oc get pods -n openshift-kube-apiserver
NAME                                                                        READY   STATUS      RESTARTS      AGE
installer-2-ip-10-0-178-236.us-east-2.compute.internal                      0/1     Completed   0             8h
installer-3-ip-10-0-178-236.us-east-2.compute.internal                      0/1     Completed   0             8h
installer-4-ip-10-0-178-236.us-east-2.compute.internal                      0/1     Completed   0             8h
installer-5-ip-10-0-178-236.us-east-2.compute.internal                      0/1     Completed   0             8h
installer-6-ip-10-0-178-236.us-east-2.compute.internal                      0/1     Error       0             8h
installer-7-ip-10-0-178-236.us-east-2.compute.internal                      0/1     Completed   0             8h
installer-8-ip-10-0-178-236.us-east-2.compute.internal                      0/1     Completed   0             8h
installer-8-ip-10-0-192-9.us-east-2.compute.internal                        0/1     Completed   0             8h
kube-apiserver-guard-ip-10-0-178-236.us-east-2.compute.internal             1/1     Running     0             8h
kube-apiserver-guard-ip-10-0-192-9.us-east-2.compute.internal               1/1     Running     0             8h
kube-apiserver-ip-10-0-140-122.us-east-2.compute.internal                   5/5     Running     0             8h
kube-apiserver-ip-10-0-178-236.us-east-2.compute.internal                   5/5     Running     1 (8h ago)    8h
kube-apiserver-ip-10-0-192-9.us-east-2.compute.internal                     5/5     Running     0             8h

Based on the above, moving the bug to verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056
*** Bug 2040263 has been marked as a duplicate of this bug. ***