The test titled:

[sig-auth] ServiceAccounts should allow opting out of API token automount [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]

failed in the job: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.2/1571

```
fail [k8s.io/kubernetes/test/e2e/framework/framework.go:338]: Jul 4 10:26:05.143: Couldn't delete ns: "svcaccounts-8171": namespace svcaccounts-8171 was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed (&errors.errorString{s:"namespace svcaccounts-8171 was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed"})
```

Since this is a timeout while deleting the test namespace, the failure is unlikely to be in the test itself. It would be useful for next time if the framework printed whatever finalizers are still left on the namespace.
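For anyone hitting this on a live cluster, a rough way to see what is holding a namespace in Terminating (using the namespace name from the log above; these commands are just an illustration of how to inspect it manually, not what the e2e framework runs):

```
# Any finalizers still set on the namespace (spec and metadata).
oc get namespace svcaccounts-8171 -o jsonpath='{.spec.finalizers}{"\n"}{.metadata.finalizers}{"\n"}'

# On newer clusters the namespace status conditions name what is blocking deletion.
oc get namespace svcaccounts-8171 -o yaml | grep -A20 'conditions:'

# List any namespaced resources still present in the namespace.
oc api-resources --verbs=list --namespaced -o name \
  | xargs -n1 oc get --show-kind --ignore-not-found -n svcaccounts-8171 2>/dev/null
```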
*** Bug 1734427 has been marked as a duplicate of this bug. ***
The real question is why it only happens on this test. Every test creates a new test namespace and deletes it, so why is it only this one where pod deletion takes a long time? Asking myself and anyone else who might know.
Also seems to happen occasionally on [sig-cli] Kubectl client [k8s.io] Simple pod should contain last line of the log [Suite:openshift/conformance/parallel] [Suite:k8s]
*** Bug 1713135 has been marked as a duplicate of this bug. ***
Created attachment 1604194 [details]
e2e-profile.png

I ran this against my local cluster with `openshift run openshift/conformance/parallel` and the cluster got super unstable. Prometheus died in the middle, so I lost my metrics; the only metrics I have are from the hypervisor running the VMs that are the cluster nodes.

- test start at 2:06
- test end at 3:06, but there is a huge backlog of terminating namespaces (~30)
- load stays at 100% until 3:22 (+16m)
- load at about 75% until 3:32, when it goes back to normal (+26m)
- only the e2e-provisioning namespace is left stuck terminating

This was on 3 masters (4 vCPU/8 GB) and 2 workers (4 vCPU/6 GB). I didn't see any evidence of memory thrashing. So I am able to reproduce this; the trick will be keeping monitoring up so I can figure out what is going on.
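As an aside, a quick way to watch that backlog of terminating namespaces from the CLI (assuming the default `oc get ns` column layout; this is just a convenience one-liner, not part of the test suite):

```
# Names of namespaces currently stuck in Terminating.
oc get namespaces --no-headers | awk '$2 == "Terminating" {print $1}'

# Count them, to see whether the backlog is draining over time.
oc get namespaces --no-headers | awk '$2 == "Terminating"' | wc -l
```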
Just to clarify, my monitoring stack was evicted when the worker nodes went NotReady, afaict:

$ oc get event | grep Evict
72m   Normal   TaintManagerEviction   pod/alertmanager-main-0                        Marking for deletion Pod openshift-monitoring/alertmanager-main-0
81m   Normal   TaintManagerEviction   pod/alertmanager-main-1                        Marking for deletion Pod openshift-monitoring/alertmanager-main-1
72m   Normal   TaintManagerEviction   pod/alertmanager-main-2                        Marking for deletion Pod openshift-monitoring/alertmanager-main-2
72m   Normal   TaintManagerEviction   pod/grafana-768f56b5c5-v8mh8                   Marking for deletion Pod openshift-monitoring/grafana-768f56b5c5-v8mh8
72m   Normal   TaintManagerEviction   pod/kube-state-metrics-6c9974679c-jzg75        Marking for deletion Pod openshift-monitoring/kube-state-metrics-6c9974679c-jzg75
81m   Normal   TaintManagerEviction   pod/kube-state-metrics-6c9974679c-xrrgb        Marking for deletion Pod openshift-monitoring/kube-state-metrics-6c9974679c-xrrgb
72m   Normal   TaintManagerEviction   pod/openshift-state-metrics-84fd7f5c9c-gfrqj   Marking for deletion Pod openshift-monitoring/openshift-state-metrics-84fd7f5c9c-gfrqj
81m   Normal   TaintManagerEviction   pod/openshift-state-metrics-84fd7f5c9c-nsnd4   Marking for deletion Pod openshift-monitoring/openshift-state-metrics-84fd7f5c9c-nsnd4
72m   Normal   TaintManagerEviction   pod/prometheus-adapter-596cf8794c-4htc9        Marking for deletion Pod openshift-monitoring/prometheus-adapter-596cf8794c-4htc9
81m   Normal   TaintManagerEviction   pod/prometheus-adapter-596cf8794c-8gsz7        Marking for deletion Pod openshift-monitoring/prometheus-adapter-596cf8794c-8gsz7
72m   Normal   TaintManagerEviction   pod/prometheus-adapter-596cf8794c-t22cn        Marking for deletion Pod openshift-monitoring/prometheus-adapter-596cf8794c-t22cn
81m   Normal   TaintManagerEviction   pod/prometheus-k8s-0                           Marking for deletion Pod openshift-monitoring/prometheus-k8s-0
72m   Normal   TaintManagerEviction   pod/prometheus-k8s-1                           Marking for deletion Pod openshift-monitoring/prometheus-k8s-1
72m   Normal   TaintManagerEviction   pod/prometheus-operator-764cbd99f9-xhxl8       Marking for deletion Pod openshift-monitoring/prometheus-operator-764cbd99f9-xhxl8
81m   Normal   TaintManagerEviction   pod/telemeter-client-798455f9fd-58875          Marking for deletion Pod openshift-monitoring/telemeter-client-798455f9fd-58875
72m   Normal   TaintManagerEviction   pod/telemeter-client-798455f9fd-xk9l8          Marking for deletion Pod openshift-monitoring/telemeter-client-798455f9fd-xk9l8

$ oc get event --all-namespaces | grep Node
default   48m   Normal   NodeHasSufficientMemory   node/worker-0   Node worker-0 status is now: NodeHasSufficientMemory
default   48m   Normal   NodeHasNoDiskPressure     node/worker-0   Node worker-0 status is now: NodeHasNoDiskPressure
default   48m   Normal   NodeHasSufficientPID      node/worker-0   Node worker-0 status is now: NodeHasSufficientPID
default   48m   Normal   NodeReady                 node/worker-0   Node worker-0 status is now: NodeReady
default   88m   Normal   NodeNotReady              node/worker-0   Node worker-0 status is now: NodeNotReady
default   56m   Normal   NodeNotReady              node/worker-0   Node worker-0 status is now: NodeNotReady
default   40m   Normal   NodeHasSufficientMemory   node/worker-1   Node worker-1 status is now: NodeHasSufficientMemory
default   40m   Normal   NodeHasNoDiskPressure     node/worker-1   Node worker-1 status is now: NodeHasNoDiskPressure
default   40m   Normal   NodeHasSufficientPID      node/worker-1   Node worker-1 status is now: NodeHasSufficientPID
default   39m   Normal   NodeReady                 node/worker-1   Node worker-1 status is now: NodeReady
default   79m   Normal   NodeNotReady              node/worker-1   Node worker-1 status is now: NodeNotReady
default   40m   Normal   NodeNotReady              node/worker-1   Node worker-1 status is now: NodeNotReady

Going to try with larger workers, though this should not happen :-/
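If this reproduces, a few commands worth running while the workers are flapping (worker-0/worker-1 are the node names from the output above; `oc adm top nodes` needs the metrics pipeline to still be alive, which it may not be here):

```
# Current node readiness at a glance.
oc get nodes -o wide

# Conditions and recent kubelet pressure flags for a flapping worker.
oc describe node worker-0 | sed -n '/^Conditions:/,/^Addresses:/p'

# Per-node CPU/memory usage, if prometheus-adapter is still serving metrics.
oc adm top nodes
```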
*** Bug 1726802 has been marked as a duplicate of this bug. ***
https://storage.googleapis.com/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/49/build-log.txt
https://storage.googleapis.com/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/42/build-log.txt
https://storage.googleapis.com/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/30/build-log.txt

In the above jobs from recent days, the test "[Feature:Builds][webhook] TestWebhook [Suite:openshift/conformance/parallel]" also hits:

namespace e2e-test-build-webhooks-7g7vj was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed (&errors.errorString{s:"namespace e2e-test-build-webhooks-7g7vj was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed
*** Bug 1743929 has been marked as a duplicate of this bug. ***
*** Bug 1696470 has been marked as a duplicate of this bug. ***
The GCP CI is seeing similar timeout errors in the "ServiceAccounts" test from comment #1 and also in the "Image change build triggers" tests:
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/177
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/180
Another one https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-aws-upi-4.2/115
*** Bug 1752591 has been marked as a duplicate of this bug. ***
*** Bug 1752811 has been marked as a duplicate of this bug. ***
Opened https://github.com/openshift/origin/pull/23813. It looks like pods are slow to be cleaned up by the kubelet, and we back off for 5 minutes, which causes us to miss the check window in e2e.
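For illustration only (this is not the framework's actual Go code, and the 5-minute window and 5-second sleep are placeholder values): the cleanup check is effectively a bounded poll like the sketch below, so a single long back-off between checks can consume the whole window even though the namespace does eventually go away.

```
# Hypothetical sketch of a bounded wait for namespace deletion.
ns=svcaccounts-8171            # example namespace name from the log above
deadline=$((SECONDS + 300))    # 5-minute check window (placeholder value)

while oc get namespace "$ns" >/dev/null 2>&1; do
  if (( SECONDS >= deadline )); then
    echo "timed out waiting for namespace $ns to be deleted" >&2
    break
  fi
  sleep 5                      # a much longer back-off here would blow the window
done
```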
*** This bug has been marked as a duplicate of bug 1752982 ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days