Created attachment 1141866 [details]
syslog with atomic-openshift-master and node errors

Description of problem:

This is a 3-node (1 master, 2 workers) reliability test in which projects are added, deleted, and updated, users are created, apps are created and removed, builds are run, etc. It is not high stress, just a steady stream of activity; I can provide details on the exact activity.

After running for 2 hours, a project delete for project ruby-hello-world-usr-2323-1-2356-4 triggered many errors similar to this in the log (on the master):

Mar 29 13:08:39 ip-172-31-58-185 atomic-openshift-node: W0329 13:08:39.535890 29134 registry.go:694] Service 'dancer-mysql-example' in namespace 'dancer-mysql-example-usr-2323-1-2340-4' has an Endpoint pointing to non-existent pod (172.20.2.5)
Mar 29 13:08:39 ip-172-31-58-185 atomic-openshift-node: W0329 13:08:39.535952 29134 registry.go:698] Service 'cakephp-mysql-example' in namespace 'cakephp-mysql-example-usr-2323-2-2340-5' has an Endpoint pointing to pod 172.20.2.4 in namespace 'ruby-hello-world-usr-2323-1-2356-4'
Mar 29 13:08:39 ip-172-31-58-185 atomic-openshift-node: I0329 13:08:39.536079 29134 proxier.go:494] Removing endpoints for "ruby-hello-world-usr-2323-1-2323-5/frontend:web"
Mar 29 13:08:39 ip-172-31-58-185 atomic-openshift-node: W0329 13:08:39.594961 29134 registry.go:698] Service 'cakephp-mysql-example' in namespace 'cakephp-mysql-example-usr-2323-2-2340-5' has an Endpoint pointing to pod 172.20.2.4 in namespace 'ruby-hello-world-usr-2323-1-2356-4'

Note that some of these messages refer to the deleted namespace. Next there are several messages about not being able to create objects in a namespace that is being terminated; that is expected. But the "has an Endpoint pointing to pod" messages for the deleted namespace continue.

At 23:12:07, project ruby-hello-world-usr-2323-1-2323-5 is deleted. It is one of the projects in the recurring "has an Endpoint pointing to a pod in a deleted namespace" messages.

At 23:12:09, things seem to go haywire. The logs start filling with:

Mar 29 23:12:42 ip-172-31-58-185 atomic-openshift-master: E0329 23:12:42.138919 16076 namespace_controller.go:139] unexpected items still remain in namespace: ruby-hello-world-usr-2340-1-2441-5 for gvr: { v1 pods}
Mar 29 23:12:42 ip-172-31-58-185 atomic-openshift-master: E0329 23:12:42.210320 16076 namespace_controller.go:139] unexpected items still remain in namespace: ruby-hello-world-usr-2340-1-2441-5 for gvr: { v1 pods}
Mar 29 23:12:42 ip-172-31-58-185 atomic-openshift-master: E0329 23:12:42.280941 16076 namespace_controller.go:139] unexpected items still remain in namespace: ruby-hello-world-usr-2340-1-2441-5 for gvr: { v1 pods}

As more projects are deleted, even more of these start showing up for other namespaces (see end of full logs).

Version-Release number of selected component (if applicable):
3.2.0.8

How reproducible:
Unknown

Steps to Reproduce:
1. Set up 3 nodes - 1 master, 2 nodes
2. Clone the openshift/svt repository
3. Contact vlaad for fastpath assistance setting the tests up.
Actual results:

The cluster ends up in a state with pods stuck in Pending or Error:

root@ip-172-31-58-185: ~ # oc get pods --all-namespaces | wc -l
98
root@ip-172-31-58-185: ~ # oc get pods --all-namespaces | grep Pending | wc -l
19
root@ip-172-31-58-185: ~ # oc get pods --all-namespaces | grep Error | wc -l
27
root@ip-172-31-58-185: ~ # oc get pods --all-namespaces | grep Terminating | wc -l
16
root@ip-172-31-58-185: ~ # oc get pods --all-namespaces | grep Complete | wc -l
12

Expected results:
The reliability test continues to run all builds, pods, etc. normally.

Additional info:
Full logs attached.
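In case it helps with triage on a similar cluster, here is a rough diagnostic sketch. The namespace and service names are simply the ones from the log excerpts above; substitute whichever namespace is stuck in your run.

# Summarize pod phases in one pass (STATUS is column 4 of 'oc get pods --all-namespaces').
oc get pods --all-namespaces --no-headers | awk '{print $4}' | sort | uniq -c

# Check whether the namespace from the namespace_controller errors is still Terminating,
# and what finalizers remain on it.
oc get namespace ruby-hello-world-usr-2340-1-2441-5 -o yaml

# List the pods the namespace controller claims still remain in it.
oc get pods -n ruby-hello-world-usr-2340-1-2441-5 -o wide

# Inspect the endpoints that the node logs say point at pods in a deleted namespace.
oc get endpoints cakephp-mysql-example -n cakephp-mysql-example-usr-2323-2-2340-5 -o yaml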
Are the 16 pods in "Terminating" state on the same node or multiple nodes?
Only one of them is on a different node:

root@ip-172-31-58-185: ~ # oc get pods --all-namespaces -o wide | grep Terminating
cakephp-mysql-example-usr-2356-1-2624-4     cakephp-mysql-example-5-build   0/1   Terminating   0   4h    ip-172-31-58-186.us-west-2.compute.internal
cakephp-mysql-example-usr-2356-3-2642-5     cakephp-mysql-example-3-build   0/1   Terminating   0   5h    ip-172-31-58-186.us-west-2.compute.internal
cakephp-mysql-example-usr-2356-3-2642-5     cakephp-mysql-example-6-build   0/1   Terminating   0   2h    ip-172-31-58-186.us-west-2.compute.internal
cakephp-mysql-example-usr-2374-1-2758-4     mysql-1-erzd0                   0/1   Terminating   0   4h    ip-172-31-58-186.us-west-2.compute.internal
eap64-mysql-s2i-usr-2356-0-2591-4           eap-app-1-1w9b7                 1/1   Terminating   0   8h    ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-0-2575-4          database-1-ffz07                1/1   Terminating   0   16h   ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-0-2575-4          frontend-1-3k239                0/1   Terminating   0   7h    ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-0-2575-4          frontend-1-jpofk                0/1   Terminating   0   6h    ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-0-2575-4          frontend-1-kdsvn                1/1   Terminating   0   9h    ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-1-2575-5          database-1-mvht3                1/1   Terminating   0   16h   ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-1-2575-5          frontend-1-gfd5s                1/1   Terminating   0   10h   ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-1-2575-5          frontend-1-kakh2                0/1   Terminating   0   7h    ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-1-2591-5          frontend-1-sh0mi                0/1   Terminating   0   7h    ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-3-2658-5          frontend-1-7zqni                1/1   Terminating   0   9h    ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-3-2675-4          frontend-3-v56cj                0/1   Terminating   0   42m   ip-172-31-58-187.us-west-2.compute.internal
ruby-hello-world-usr-2374-1-2725-5          database-1-b8s5t                0/1   Terminating   0   6h    ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2374-2-2792-4          database-1-6zu88                0/1   Terminating   0   2h    ip-172-31-58-186.us-west-2.compute.internal
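For completeness, a quick way to get the same per-node distribution without reading the listing by hand. This assumes the node name is the last column of 'oc get pods -o wide' output, which it is on this version:

oc get pods --all-namespaces -o wide --no-headers | grep Terminating | awk '{print $NF}' | sort | uniq -c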
This looks like a duplicate of https://github.com/openshift/origin/issues/8176
Cherry pick of upstream fix for pods stuck in terminating state: https://github.com/openshift/origin/pull/8378
Required cherry-pick has merged.
Started the test run again with the build:

openshift v3.2.0.15
kubernetes v1.2.0-36-g4a3f9c5
etcd 2.2.5
Verified this by running the reliability long-run, which previously reproduced this reliably with regular project creation and deletion. The problem has not been seen since moving to the build with the fix.
I ran into this issue again today, logs are attached.
Created attachment 1151182 [details]
logs

Output of journalctl -u atomic-openshift-master and journalctl -u atomic-openshift-node.
I started the OSE reliability tests again on 04/21, and they were running fine until 04/26. There is one pod that has been stuck in the Terminating state for a day:

dancer-mysql-example-user-384-91-1373-329   dancer-mysql-example-1-5jbh5   0/1   Terminating   0   1d   ip-172-31-25-19.us-west-2.compute.internal

Here is the version of openshift on the cluster:

openshift v3.2.0.18
kubernetes v1.2.0-36-g4a3f9c5
etcd 2.2.5

I still have the cluster available in case we need anything else.
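For reference, a rough set of commands to capture state for the stuck pod while the cluster is still up. The pod, namespace, and node names are taken from the line above; the journalctl and docker commands run on the node itself (ip-172-31-25-19):

# deletionTimestamp, grace period, and finalizers on the stuck pod, plus its events.
oc get pod dancer-mysql-example-1-5jbh5 -n dancer-mysql-example-user-384-91-1373-329 -o yaml
oc describe pod dancer-mysql-example-1-5jbh5 -n dancer-mysql-example-user-384-91-1373-329

# On the node: kubelet log entries and any lingering container for the pod.
journalctl -u atomic-openshift-node | grep dancer-mysql-example-1-5jbh5
docker ps -a | grep dancer-mysql-example-1-5jbh5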
The errors encountered on 04/26 are related to the following bug: https://bugzilla.redhat.com/show_bug.cgi?id=1329603. The errors seen in the logs were the same as in this bug, but the root cause this time was different and is being tracked in the bug mentioned above. Changing this bug back to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2016:1064