Bug 1322538 - Project delete leads to unexpected items in namespace and causes reliability cluster to eventually go unusable
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.2.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Derek Carr
QA Contact: Mike Fiedler
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-03-30 16:54 UTC by Mike Fiedler
Modified: 2016-05-12 16:35 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-12 16:35:01 UTC
Target Upstream Version:
Embargoed:


Attachments
syslog with atomic-openshift-master and node errors (3.20 MB, application/x-gzip), 2016-03-30 16:54 UTC, Mike Fiedler
logs (4.83 MB, application/zip), 2016-04-27 02:09 UTC, Vikas Laad


Links
Red Hat Product Errata RHSA-2016:1064 (normal, SHIPPED_LIVE): Important: Red Hat OpenShift Enterprise 3.2 security, bug fix, and enhancement update (last updated 2016-05-12 20:19:17 UTC)

Description Mike Fiedler 2016-03-30 16:54:32 UTC
Created attachment 1141866 [details]
syslog with atomic-openshift-master and node errors

Description of problem:

This is a 3-node (1 master, 2 workers) reliability test in which projects are added, deleted, and updated; users are created; apps are created and removed; builds are run; etc. It is not high stress, just a steady stream of activity. I can provide details on the exact activity.

After running for 2 hours, a delete of project ruby-hello-world-usr-2323-1-2356-4 triggered many errors similar to the following in the log (on the master):

Mar 29 13:08:39 ip-172-31-58-185 atomic-openshift-node: W0329 13:08:39.535890   29134 registry.go:694] Service 'dancer-mysql-example' in namespace 'dancer-mysql-example-usr-2323-1-2340-4' has an Endpoint pointing to non-existent pod (172.20.2.5)
Mar 29 13:08:39 ip-172-31-58-185 atomic-openshift-node: W0329 13:08:39.535952   29134 registry.go:698] Service 'cakephp-mysql-example' in namespace 'cakephp-mysql-example-usr-2323-2-2340-5' has an Endpoint pointing to pod 172.20.2.4 in namespace 'ruby-hello-world-usr-2323-1-2356-4'
Mar 29 13:08:39 ip-172-31-58-185 atomic-openshift-node: I0329 13:08:39.536079   29134 proxier.go:494] Removing endpoints for "ruby-hello-world-usr-2323-1-2323-5/frontend:web"
Mar 29 13:08:39 ip-172-31-58-185 atomic-openshift-node: W0329 13:08:39.594961   29134 registry.go:698] Service 'cakephp-mysql-example' in namespace 'cakephp-mysql-example-usr-2323-2-2340-5' has an Endpoint pointing to pod 172.20.2.4 in namespace 'ruby-hello-world-usr-2323-1-2356-4'

Note that some of the messages refer to the deleted namespace.

Next there are several messages about not being able to create objects in a namespace that is being terminated. That is expected. But the "has an Endpoint pointing to pod" messages for the deleted namespace continue.
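
(Not part of the original report: a minimal sketch of how such stale endpoint references can be checked, using the service, namespace, and pod IP from the first warning in the log excerpt above.)

# Inspect the endpoints object the warning refers to and compare its
# addresses against the pods that actually exist.
oc get endpoints dancer-mysql-example -n dancer-mysql-example-usr-2323-1-2340-4 -o yaml

# Check whether the pod IP named in the warning still belongs to any pod.
oc get pods --all-namespaces -o wide | grep 172.20.2.5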

At 23:12:07, project ruby-hello-world-usr-2323-1-2323-5 is deleted. It is one of the projects appearing in the recurring "has an Endpoint pointing to a pod in a deleted namespace" messages.

At 23:12:09, things seem to go haywire. The logs start filling with:

Mar 29 23:12:42 ip-172-31-58-185 atomic-openshift-master: E0329 23:12:42.138919   16076 namespace_controller.go:139] unexpected items still remain in namespace: ruby-hello-world-usr-2340-1-2441-5 for gvr: { v1 pods}
Mar 29 23:12:42 ip-172-31-58-185 atomic-openshift-master: E0329 23:12:42.210320   16076 namespace_controller.go:139] unexpected items still remain in namespace: ruby-hello-world-usr-2340-1-2441-5 for gvr: { v1 pods}
Mar 29 23:12:42 ip-172-31-58-185 atomic-openshift-master: E0329 23:12:42.280941   16076 namespace_controller.go:139] unexpected items still remain in namespace: ruby-hello-world-usr-2340-1-2441-5 for gvr: { v1 pods}

As more projects are deleted, even more of these messages start showing up for other namespaces (see the end of the full logs).
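
(As a side note, not something done in the report: these namespace_controller errors are typically investigated by looking at the terminating namespace and whatever objects remain in it. A rough sketch, using the namespace from the log lines above as an example:)

# Show the namespace status; a namespace in this state sits in phase Terminating.
oc get namespace ruby-hello-world-usr-2340-1-2441-5 -o yaml

# List the pods the controller claims still remain in the namespace.
oc get pods -n ruby-hello-world-usr-2340-1-2441-5 -o wide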



Version-Release number of selected component (if applicable): 3.2.0.8


How reproducible: Unknown


Steps to Reproduce:
1. Set up a 3-node cluster: 1 master, 2 worker nodes.
2. Clone the openshift/svt repository (see the sketch after these steps).
3. Contact vlaad for fastpath assistance setting up the tests.
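
(For reference, a hedged sketch of step 2 only; the exact test driver and configuration are not given in this report, hence step 3.)

# Clone the test repository; how the reliability scenario is launched is not covered here.
git clone https://github.com/openshift/svt.git
cd svt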

Actual results:
The cluster ends up in a state with pods stuck in Pending or Error

root@ip-172-31-58-185: ~ # oc get pods --all-namespaces | wc -l
98
root@ip-172-31-58-185: ~ # oc get pods --all-namespaces | grep Pending | wc -l
19
root@ip-172-31-58-185: ~ # oc get pods --all-namespaces | grep Error | wc -l
27
root@ip-172-31-58-185: ~ # oc get pods --all-namespaces | grep Terminating | wc -l
16
root@ip-172-31-58-185: ~ # oc get pods --all-namespaces | grep Complete | wc -l
12

Expected results:
The reliability test continues to run all builds, pods, etc. normally.

Additional info:

full logs attached

Comment 1 Derek Carr 2016-03-30 17:26:10 UTC
Are the 16 pods in "Terminating" state on the same node or multiple nodes?

Comment 2 Vikas Laad 2016-03-30 17:31:55 UTC
Only one of them is on a different node:

root@ip-172-31-58-185: ~ # oc get pods --all-namespaces -o wide| grep Terminating
cakephp-mysql-example-usr-2356-1-2624-4   cakephp-mysql-example-5-build      0/1       Terminating        0          4h        ip-172-31-58-186.us-west-2.compute.internal
cakephp-mysql-example-usr-2356-3-2642-5   cakephp-mysql-example-3-build      0/1       Terminating        0          5h        ip-172-31-58-186.us-west-2.compute.internal
cakephp-mysql-example-usr-2356-3-2642-5   cakephp-mysql-example-6-build      0/1       Terminating        0          2h        ip-172-31-58-186.us-west-2.compute.internal
cakephp-mysql-example-usr-2374-1-2758-4   mysql-1-erzd0                      0/1       Terminating        0          4h        ip-172-31-58-186.us-west-2.compute.internal
eap64-mysql-s2i-usr-2356-0-2591-4         eap-app-1-1w9b7                    1/1       Terminating        0          8h        ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-0-2575-4        database-1-ffz07                   1/1       Terminating        0          16h       ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-0-2575-4        frontend-1-3k239                   0/1       Terminating        0          7h        ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-0-2575-4        frontend-1-jpofk                   0/1       Terminating        0          6h        ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-0-2575-4        frontend-1-kdsvn                   1/1       Terminating        0          9h        ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-1-2575-5        database-1-mvht3                   1/1       Terminating        0          16h       ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-1-2575-5        frontend-1-gfd5s                   1/1       Terminating        0          10h       ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-1-2575-5        frontend-1-kakh2                   0/1       Terminating        0          7h        ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-1-2591-5        frontend-1-sh0mi                   0/1       Terminating        0          7h        ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-3-2658-5        frontend-1-7zqni                   1/1       Terminating        0          9h        ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-3-2675-4        frontend-3-v56cj                   0/1       Terminating        0          42m       ip-172-31-58-187.us-west-2.compute.internal
ruby-hello-world-usr-2374-1-2725-5        database-1-b8s5t                   0/1       Terminating        0          6h        ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2374-2-2792-4        database-1-6zu88                   0/1       Terminating        0          2h        ip-172-31-58-186.us-west-2.compute.internal
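
(As an aside, not something done in this comment: a rough sketch of commands commonly used to investigate one of the stuck pods, using the pod and namespace names from the list above; the journalctl check has to be run on the node the pod was scheduled to.)

# Inspect the pod's status and events from the master.
oc describe pod frontend-1-3k239 -n ruby-hello-world-usr-2356-0-2575-4

# On the node, check the node service logs around the deletion.
journalctl -u atomic-openshift-node --since "1 hour ago"

# Last-resort workaround only: remove the API object immediately
# (this does not clean up anything the node may have left behind).
oc delete pod frontend-1-3k239 -n ruby-hello-world-usr-2356-0-2575-4 --grace-period=0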

Comment 3 Derek Carr 2016-03-30 17:36:33 UTC
This looks like a duplicate of https://github.com/openshift/origin/issues/8176

Comment 4 Derek Carr 2016-04-06 17:28:36 UTC
Cherry pick of upstream fix for pods stuck in terminating state:

https://github.com/openshift/origin/pull/8378

Comment 5 Derek Carr 2016-04-14 14:23:27 UTC
Required cherry-pick has merged.

Comment 6 Vikas Laad 2016-04-14 14:29:27 UTC
Started the test run again with the following build:
openshift v3.2.0.15
kubernetes v1.2.0-36-g4a3f9c5
etcd 2.2.5

Comment 7 Mike Fiedler 2016-04-25 23:42:19 UTC
Verified this by running the reliability long-run, which had reliably reproduced the problem with regular project creation and deletion. The problem has not been seen since the build with the fix.

Comment 8 Vikas Laad 2016-04-27 02:05:33 UTC
I ran into this issue again today; logs are attached.

Comment 9 Vikas Laad 2016-04-27 02:09:56 UTC
Created attachment 1151182 [details]
logs

output of journalctl -u atomic-openshift-master and journalctl -u atomic-openshift-node

Comment 10 Vikas Laad 2016-04-27 13:19:41 UTC
I started the OSE reliability tests again on 04/21, and the tests were running fine until 04/26. There is one pod that has been stuck in Terminating state for a day:

dancer-mysql-example-user-384-91-1373-329      dancer-mysql-example-1-5jbh5    0/1       Terminating   0          1d        ip-172-31-25-19.us-west-2.compute.internal

Here is the version of openshift on the cluster
openshift v3.2.0.18
kubernetes v1.2.0-36-g4a3f9c5
etcd 2.2.5

I still have the cluster available in case we need anything else.

Comment 11 Vikas Laad 2016-04-27 14:00:57 UTC
The errors encountered on 04/26 were related to the following bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1329603

The errors seen in the logs were the same as in this bug, but the root cause this time was different and is being tracked in the above-mentioned bug. Changing this bug back to VERIFIED.

Comment 13 errata-xmlrpc 2016-05-12 16:35:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1064

