Bug 1727090
| Summary: | e2e flake: namespace was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed | ||||||
|---|---|---|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Frederic Branczyk <fbranczy> | ||||
| Component: | Node | Assignee: | Robert Krawitz <rkrawitz> | ||||
| Status: | CLOSED DUPLICATE | QA Contact: | Sunil Choudhary <schoudha> | ||||
| Severity: | medium | Docs Contact: | |||||
| Priority: | medium | ||||||
| Version: | 4.2.0 | CC: | adahiya, aos-bugs, ccoleman, deads, jcallen, jcantril, jokerman, lsm5, mfojtik, miabbott, mmccomas, nagrawal, shlao, sjenning, tnozicka, wking, yinzhou | ||||
| Target Milestone: | --- | ||||||
| Target Release: | 4.3.0 | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2019-09-18 14:15:37 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
Description
Frederic Branczyk
2019-07-04 12:30:37 UTC
*** Bug 1734427 has been marked as a duplicate of this bug. ***
The real question is why does it only happen on this test? Every test creates a new test namespace and deletes it. Why is it only this test where the pod deletion takes a long time? Asking myself and anyone else that might know.
Also seems to happen occasionally on [sig-cli] Kubectl client [k8s.io] Simple pod should contain last line of the log [Suite:openshift/conformance/parallel] [Suite:k8s]
*** Bug 1713135 has been marked as a duplicate of this bug. ***
Created attachment 1604194 [details]
e2e-profile.png
I ran this against my local cluster with `openshift run openshift/conformance/parallel` and the cluster got super unstable.
Prometheus died in the middle so I lost my metrics. Only metrics I have are from the hypervisor running the VMs that are the cluster nodes.
test start at 2:06
test end at 3:06 but there is a huge backlog of terminating namespaces (~30)
load stays at 100% until 3:22 (+16m)
load at about 75% until 3:32 when load goes back to normal (+26m)
only e2e-provisioning namespace stuck terminating
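As a quick check of the offsets in the timeline above, the elapsed minutes can be recomputed from the clock times. This is only an illustrative sketch: the helper name `minutes_between` is hypothetical, and the times are assumed to fall on the same day.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end, both given as H:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Clock times copied from the timeline in the comment.
TEST_START = "2:06"
TEST_END = "3:06"
LOAD_LEAVES_100 = "3:22"
LOAD_NORMAL = "3:32"

print(minutes_between(TEST_START, TEST_END))       # 60 minute test run
print(minutes_between(TEST_END, LOAD_LEAVES_100))  # 16 minutes at 100% after the run
print(minutes_between(TEST_END, LOAD_NORMAL))      # 26 minutes until load is normal
```

So the cluster stayed loaded for roughly a quarter to a half hour after the suite itself had finished.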
This was on 3 4vcpu/8GB masters and 2 4vcpu/6GB workers. I didn't see any evidence of memory thrashing.
So I am able to reproduce this. The trick will be keeping monitoring up so I can figure out what is going on.
Just to clarify, my monitoring stack was evicted when the worker nodes went NotReady afaict
$ oc get event | grep Evict
72m Normal TaintManagerEviction pod/alertmanager-main-0 Marking for deletion Pod openshift-monitoring/alertmanager-main-0
81m Normal TaintManagerEviction pod/alertmanager-main-1 Marking for deletion Pod openshift-monitoring/alertmanager-main-1
72m Normal TaintManagerEviction pod/alertmanager-main-2 Marking for deletion Pod openshift-monitoring/alertmanager-main-2
72m Normal TaintManagerEviction pod/grafana-768f56b5c5-v8mh8 Marking for deletion Pod openshift-monitoring/grafana-768f56b5c5-v8mh8
72m Normal TaintManagerEviction pod/kube-state-metrics-6c9974679c-jzg75 Marking for deletion Pod openshift-monitoring/kube-state-metrics-6c9974679c-jzg75
81m Normal TaintManagerEviction pod/kube-state-metrics-6c9974679c-xrrgb Marking for deletion Pod openshift-monitoring/kube-state-metrics-6c9974679c-xrrgb
72m Normal TaintManagerEviction pod/openshift-state-metrics-84fd7f5c9c-gfrqj Marking for deletion Pod openshift-monitoring/openshift-state-metrics-84fd7f5c9c-gfrqj
81m Normal TaintManagerEviction pod/openshift-state-metrics-84fd7f5c9c-nsnd4 Marking for deletion Pod openshift-monitoring/openshift-state-metrics-84fd7f5c9c-nsnd4
72m Normal TaintManagerEviction pod/prometheus-adapter-596cf8794c-4htc9 Marking for deletion Pod openshift-monitoring/prometheus-adapter-596cf8794c-4htc9
81m Normal TaintManagerEviction pod/prometheus-adapter-596cf8794c-8gsz7 Marking for deletion Pod openshift-monitoring/prometheus-adapter-596cf8794c-8gsz7
72m Normal TaintManagerEviction pod/prometheus-adapter-596cf8794c-t22cn Marking for deletion Pod openshift-monitoring/prometheus-adapter-596cf8794c-t22cn
81m Normal TaintManagerEviction pod/prometheus-k8s-0 Marking for deletion Pod openshift-monitoring/prometheus-k8s-0
72m Normal TaintManagerEviction pod/prometheus-k8s-1 Marking for deletion Pod openshift-monitoring/prometheus-k8s-1
72m Normal TaintManagerEviction pod/prometheus-operator-764cbd99f9-xhxl8 Marking for deletion Pod openshift-monitoring/prometheus-operator-764cbd99f9-xhxl8
81m Normal TaintManagerEviction pod/telemeter-client-798455f9fd-58875 Marking for deletion Pod openshift-monitoring/telemeter-client-798455f9fd-58875
72m Normal TaintManagerEviction pod/telemeter-client-798455f9fd-xk9l8 Marking for deletion Pod openshift-monitoring/telemeter-client-798455f9fd-xk9l8
$ oc get event --all-namespaces | grep Node
default 48m Normal NodeHasSufficientMemory node/worker-0 Node worker-0 status is now: NodeHasSufficientMemory
default 48m Normal NodeHasNoDiskPressure node/worker-0 Node worker-0 status is now: NodeHasNoDiskPressure
default 48m Normal NodeHasSufficientPID node/worker-0 Node worker-0 status is now: NodeHasSufficientPID
default 48m Normal NodeReady node/worker-0 Node worker-0 status is now: NodeReady
default 88m Normal NodeNotReady node/worker-0 Node worker-0 status is now: NodeNotReady
default 56m Normal NodeNotReady node/worker-0 Node worker-0 status is now: NodeNotReady
default 40m Normal NodeHasSufficientMemory node/worker-1 Node worker-1 status is now: NodeHasSufficientMemory
default 40m Normal NodeHasNoDiskPressure node/worker-1 Node worker-1 status is now: NodeHasNoDiskPressure
default 40m Normal NodeHasSufficientPID node/worker-1 Node worker-1 status is now: NodeHasSufficientPID
default 39m Normal NodeReady node/worker-1 Node worker-1 status is now: NodeReady
default 79m Normal NodeNotReady node/worker-1 Node worker-1 status is now: NodeNotReady
default 40m Normal NodeNotReady node/worker-1 Node worker-1 status is now: NodeNotReady
Going to try with larger workers. Though this should not happen :-/
*** Bug 1726802 has been marked as a duplicate of this bug. ***
https://storage.googleapis.com/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/49/build-log.txt
https://storage.googleapis.com/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/42/build-log.txt
https://storage.googleapis.com/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/30/build-log.txt
In the jobs above, from the last few days, the case "[Feature:Builds][webhook] TestWebhook [Suite:openshift/conformance/parallel]" also hits:
namespace e2e-test-build-webhooks-7g7vj was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed (&errors.errorString{s:"namespace e2e-test-build-webhooks-7g7vj was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed
*** Bug 1743929 has been marked as a duplicate of this bug. ***
*** Bug 1696470 has been marked as a duplicate of this bug. ***
The GCP CI is seeing similar timeout errors in the "ServiceAccounts" test from comment #1 and also in the "Image change build triggers" tests:
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/177
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/180
Another one: https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-aws-upi-4.2/115
*** Bug 1752591 has been marked as a duplicate of this bug. ***
*** Bug 1752811 has been marked as a duplicate of this bug. ***
Opened https://github.com/openshift/origin/pull/23813 . It looks like pods are slow getting cleaned up by kubelet and we back off for 5 minutes, which causes us to miss the check window in e2e.
*** This bug has been marked as a duplicate of bug 1752982 ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days
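The backoff explanation given with the pull request can be illustrated with a small simulation. This is a rough sketch under stated assumptions, not the actual controller or e2e code: only the 5 minute figure comes from the comment (treated here as the maximum retry delay), while the initial delay, the doubling factor, the 10 minute check window, and all function names are hypothetical.

```python
BACKOFF_CAP_S = 300.0    # the 5 minute backoff mentioned in the comment
INITIAL_DELAY_S = 5.0    # assumption: first retry delay
FACTOR = 2.0             # assumption: delay doubles after each retry
CHECK_WINDOW_S = 600.0   # assumption: how long the e2e check waits


def retry_times(initial, factor, cap, horizon):
    """Times (in seconds) at which a capped exponential backoff loop retries."""
    times = []
    t, delay = 0.0, initial
    while t <= horizon:
        times.append(t)
        t += delay
        delay = min(delay * factor, cap)
    return times


def removal_time(pods_gone_at, retries):
    """The namespace is only removed on the first retry after its pods are gone."""
    for t in retries:
        if t >= pods_gone_at:
            return t
    return None


def misses_window(pods_gone_at, window=CHECK_WINDOW_S):
    """True if the namespace is removed only after the check window has closed."""
    retries = retry_times(INITIAL_DELAY_S, FACTOR, BACKOFF_CAP_S, horizon=2 * window)
    removed = removal_time(pods_gone_at, retries)
    return removed is None or removed > window


print(misses_window(100.0))  # pods cleaned up quickly
print(misses_window(320.0))  # pods cleaned up slowly, but still inside the window
```

With these made-up numbers, pods that finish terminating at 320 seconds land just after the retry at 315 seconds, so the next attempt is a full 5 minutes later at 615 seconds. The namespace is already empty but is removed after the 600 second window closes, which matches the "namespace is empty but is not yet removed" message.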