Bug 1322538

Summary: Project delete leads to unexpected items in namespace and causes reliability cluster to eventually go unusable
Product: OpenShift Container Platform
Reporter: Mike Fiedler <mifiedle>
Component: Node
Assignee: Derek Carr <decarr>
Status: CLOSED ERRATA
QA Contact: Mike Fiedler <mifiedle>
Severity: high
Priority: unspecified
Version: 3.2.0
CC: aos-bugs, decarr, jokerman, mifiedle, mmccomas, tdawson, vlaad, wsun
Target Milestone: ---
Keywords: Regression
Target Release: ---
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2016-05-12 16:35:01 UTC
Type: Bug
Attachments:
  - syslog with atomic-openshift-master and node errors
  - logs

Description Mike Fiedler 2016-03-30 16:54:32 UTC
Created attachment 1141866 [details]
syslog with atomic-openshift-master and node errors

Description of problem:

This occurred in a 3-node (1 master, 2 workers) reliability test in which projects are added, deleted, and updated, users are created, applications are created and removed, builds are run, and so on. The workload is not high stress, just a steady stream of activity. I can provide details on the exact activity.

After running for 2 hours, a delete of project ruby-hello-world-usr-2323-1-2356-4 triggered many errors similar to these in the log on the master:

Mar 29 13:08:39 ip-172-31-58-185 atomic-openshift-node: W0329 13:08:39.535890   29134 registry.go:694] Service 'dancer-mysql-example' in namespace 'dancer-mysql-example-usr-2323-1-2340-4' has an Endpoint pointing to non-existent pod (172.20.2.5)
Mar 29 13:08:39 ip-172-31-58-185 atomic-openshift-node: W0329 13:08:39.535952   29134 registry.go:698] Service 'cakephp-mysql-example' in namespace 'cakephp-mysql-example-usr-2323-2-2340-5' has an Endpoint pointing to pod 172.20.2.4 in namespace 'ruby-hello-world-usr-2323-1-2356-4'
Mar 29 13:08:39 ip-172-31-58-185 atomic-openshift-node: I0329 13:08:39.536079   29134 proxier.go:494] Removing endpoints for "ruby-hello-world-usr-2323-1-2323-5/frontend:web"
Mar 29 13:08:39 ip-172-31-58-185 atomic-openshift-node: W0329 13:08:39.594961   29134 registry.go:698] Service 'cakephp-mysql-example' in namespace 'cakephp-mysql-example-usr-2323-2-2340-5' has an Endpoint pointing to pod 172.20.2.4 in namespace 'ruby-hello-world-usr-2323-1-2356-4'

Note that some of the messages refer to the deleted namespace.

Next there are several messages about not being able to create objects in a namespace that is being terminated. That is expected. But the "has an Endpoint pointing to pod" messages for the deleted namespace continue.

At 23:12:07, project ruby-hello-world-usr-2323-1-2323-5 is deleted. It is one of the projects in the recurring "has an Endpoint pointing to a pod in a deleted namespace" messages.

At 23:12:09, things seem to go haywire. The logs start filling with:

Mar 29 23:12:42 ip-172-31-58-185 atomic-openshift-master: E0329 23:12:42.138919   16076 namespace_controller.go:139] unexpected items still remain in namespace: ruby-hello-world-usr-2340-1-2441-5 for gvr: { v1 pods}
Mar 29 23:12:42 ip-172-31-58-185 atomic-openshift-master: E0329 23:12:42.210320   16076 namespace_controller.go:139] unexpected items still remain in namespace: ruby-hello-world-usr-2340-1-2441-5 for gvr: { v1 pods}
Mar 29 23:12:42 ip-172-31-58-185 atomic-openshift-master: E0329 23:12:42.280941   16076 namespace_controller.go:139] unexpected items still remain in namespace: ruby-hello-world-usr-2340-1-2441-5 for gvr: { v1 pods}

As more projects are deleted, even more of these messages start showing up for other namespaces (see the end of the full logs).
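
A quick way to see which namespaces the controller is still stuck on (a minimal sketch, assuming cluster-admin access; the namespace name is taken from the messages above):

# namespaces stuck in the Terminating phase
oc get namespaces --no-headers | grep Terminating

# pods the namespace controller still sees for one of the stuck namespaces
oc get pods -n ruby-hello-world-usr-2340-1-2441-5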



Version-Release number of selected component (if applicable): 3.2.0.8


How reproducible: Unknown


Steps to Reproduce:
1. Set up a 3-node cluster: 1 master and 2 nodes.
2. Clone the openshift/svt repository (a rough clone sketch is shown below).
3. Contact vlaad for fast-path assistance in setting up the tests.
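
For reference, a rough starting point for step 2 (a sketch only; the exact reliability test configuration lives in the svt repository and is best set up with vlaad's help):

git clone https://github.com/openshift/svt.git
cd svt
# follow the reliability test documentation in the repository for configuration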

Actual results:
The cluster ends up in a state with pods stuck in Pending or Error (a one-line state summary is sketched after the output below):

root@ip-172-31-58-185: ~ # oc get pods --all-namespaces | wc -l
98
root@ip-172-31-58-185: ~ # oc get pods --all-namespaces | grep Pending | wc -l
19
root@ip-172-31-58-185: ~ # oc get pods --all-namespaces | grep Error | wc -l
27
root@ip-172-31-58-185: ~ # oc get pods --all-namespaces | grep Terminating | wc -l
16
root@ip-172-31-58-185: ~ # oc get pods --all-namespaces | grep Complete | wc -l
12

Expected results:
The reliability test continues to run all builds, pods, etc. normally.

Additional info:

Full logs are attached.

Comment 1 Derek Carr 2016-03-30 17:26:10 UTC
Are the 16 pods in "Terminating" state on the same node or multiple nodes?

Comment 2 Vikas Laad 2016-03-30 17:31:55 UTC
Only one of them is on a different node (a per-node count is sketched after the listing below).

root@ip-172-31-58-185: ~ # oc get pods --all-namespaces -o wide| grep Terminating
cakephp-mysql-example-usr-2356-1-2624-4   cakephp-mysql-example-5-build      0/1       Terminating        0          4h        ip-172-31-58-186.us-west-2.compute.internal
cakephp-mysql-example-usr-2356-3-2642-5   cakephp-mysql-example-3-build      0/1       Terminating        0          5h        ip-172-31-58-186.us-west-2.compute.internal
cakephp-mysql-example-usr-2356-3-2642-5   cakephp-mysql-example-6-build      0/1       Terminating        0          2h        ip-172-31-58-186.us-west-2.compute.internal
cakephp-mysql-example-usr-2374-1-2758-4   mysql-1-erzd0                      0/1       Terminating        0          4h        ip-172-31-58-186.us-west-2.compute.internal
eap64-mysql-s2i-usr-2356-0-2591-4         eap-app-1-1w9b7                    1/1       Terminating        0          8h        ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-0-2575-4        database-1-ffz07                   1/1       Terminating        0          16h       ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-0-2575-4        frontend-1-3k239                   0/1       Terminating        0          7h        ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-0-2575-4        frontend-1-jpofk                   0/1       Terminating        0          6h        ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-0-2575-4        frontend-1-kdsvn                   1/1       Terminating        0          9h        ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-1-2575-5        database-1-mvht3                   1/1       Terminating        0          16h       ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-1-2575-5        frontend-1-gfd5s                   1/1       Terminating        0          10h       ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-1-2575-5        frontend-1-kakh2                   0/1       Terminating        0          7h        ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-1-2591-5        frontend-1-sh0mi                   0/1       Terminating        0          7h        ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-3-2658-5        frontend-1-7zqni                   1/1       Terminating        0          9h        ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2356-3-2675-4        frontend-3-v56cj                   0/1       Terminating        0          42m       ip-172-31-58-187.us-west-2.compute.internal
ruby-hello-world-usr-2374-1-2725-5        database-1-b8s5t                   0/1       Terminating        0          6h        ip-172-31-58-186.us-west-2.compute.internal
ruby-hello-world-usr-2374-2-2792-4        database-1-6zu88                   0/1       Terminating        0          2h        ip-172-31-58-186.us-west-2.compute.internal
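
To count the Terminating pods per node in one pass (a minimal sketch; with -o wide the node name is the last column of each row, as in the listing above):

oc get pods --all-namespaces -o wide --no-headers | grep Terminating | awk '{print $NF}' | sort | uniq -c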

Comment 3 Derek Carr 2016-03-30 17:36:33 UTC
This looks like a duplicate of https://github.com/openshift/origin/issues/8176

Comment 4 Derek Carr 2016-04-06 17:28:36 UTC
Cherry-pick of the upstream fix for pods stuck in the Terminating state:

https://github.com/openshift/origin/pull/8378

Comment 5 Derek Carr 2016-04-14 14:23:27 UTC
Required cherry-pick has merged.

Comment 6 Vikas Laad 2016-04-14 14:29:27 UTC
Started the test run again with the following build:
openshift v3.2.0.15
kubernetes v1.2.0-36-g4a3f9c5
etcd 2.2.5

Comment 7 Mike Fiedler 2016-04-25 23:42:19 UTC
Verified this by running the reliability long-run, which reliably reproduced the problem with regular project creation and deletion. The problem has not been seen since the build with the fix.

Comment 8 Vikas Laad 2016-04-27 02:05:33 UTC
I ran into this issue again today; logs are attached.

Comment 9 Vikas Laad 2016-04-27 02:09:56 UTC
Created attachment 1151182 [details]
logs

Output of journalctl -u atomic-openshift-master and journalctl -u atomic-openshift-node.

Comment 10 Vikas Laad 2016-04-27 13:19:41 UTC
I started the OSE reliability tests again on 04/21, and they were running fine until 04/26. There is one pod that has been stuck in the Terminating state for a day.

dancer-mysql-example-user-384-91-1373-329      dancer-mysql-example-1-5jbh5    0/1       Terminating   0          1d        ip-172-31-25-19.us-west-2.compute.internal

Here is the version of OpenShift on the cluster:
openshift v3.2.0.18
kubernetes v1.2.0-36-g4a3f9c5
etcd 2.2.5

I still have the cluster available in case we need anything else.
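
A few diagnostics that might help while the cluster is still up (a sketch only, not a requested procedure; the pod, namespace, and node names are taken from the output above):

# does the pod object still carry a deletionTimestamp?
oc get pod dancer-mysql-example-1-5jbh5 -n dancer-mysql-example-user-384-91-1373-329 -o yaml | grep -i deletion

# event history for the stuck pod
oc describe pod dancer-mysql-example-1-5jbh5 -n dancer-mysql-example-user-384-91-1373-329

# on the node ip-172-31-25-19, check whether docker still holds containers for it
docker ps -a | grep dancer-mysql-example-1-5jbh5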

Comment 11 Vikas Laad 2016-04-27 14:00:57 UTC
The errors encountered on 04/26 were related to the following bug:

https://bugzilla.redhat.com/show_bug.cgi?id=1329603

The errors seen in the logs were the same as in this bug, but the root cause this time was different and is being tracked in the above-mentioned bug. Changing this bug back to VERIFIED.

Comment 13 errata-xmlrpc 2016-05-12 16:35:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1064