Bug 1559987

Summary: Unable to delete deploymentconfig
Product: OpenShift Container Platform    Reporter: Robert Bost <rbost>
Component: Master    Assignee: Michal Fojtik <mfojtik>
Status: CLOSED ERRATA QA Contact: Wang Haoran <haowang>
Severity: high Docs Contact:
Priority: high    
Version: 3.7.0    CC: acomabon, aos-bugs, bfurtado, deads, dsafford, fshaikh, glamb, jdesousa, jkaur, jmalde, jokerman, kmendez, maszulik, mfojtik, mmccomas, openshift-bugs-escalate, rbost, smunilla, sthangav, stwalter, suchaudh
Target Milestone: ---    Keywords: Reopened
Target Release: 3.7.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: v3.7.49-1 Doc Type: Bug Fix
Doc Text:
Cause: In some cases the shared informer cache was not initialized properly or failed to initialize. Consequence: Controllers such as the garbage collector became stuck waiting for the caches to initialize. Fix: If the cache is stuck, do not wait for it to initialize; instead, forward the request directly to storage (etcd) to unblock the controllers. Result: Controllers can reach the resources without being blocked on cache initialization.
Story Points: ---
Clone Of:
: 1678028 (view as bug list) Environment:
Last Closed: 2018-07-11 09:57:57 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1267746, 1678028    

Description Robert Bost 2018-03-23 16:29:20 UTC
Description of problem: 

Unable to delete a deploymentconfig resource. The deploymentconfig has a finalizer, but we are unable to find the resource that is blocking deletion:

# oc get dc -o yaml NAME_OF_DC
apiVersion: v1
kind: DeploymentConfig
metadata:
  creationTimestamp: 2018-03-12T00:49:16Z
  deletionGracePeriodSeconds: 0
  deletionTimestamp: 2018-03-21T18:41:29Z
  finalizers:
  - foregroundDeletion
  generation: 29
  labels:
    app: firsttestgateway
  name: firsttestgateway
  namespace: first-dt
  resourceVersion: "125930908"
  selfLink: /oapi/v1/namespaces/first-dt/deploymentconfigs/firsttestgateway
  ...

Version-Release number of selected component (if applicable): atomic-openshift-3.7.23-1.git.0.8edc154.el7.x86_64


How reproducible: Reproducer steps unclear

Actual results: Running `oc delete dc/firsttestgateway` returned a success message, but `oc get dc` afterward still showed the deploymentconfig. The deploymentconfig hung around for days until we manually deleted the finalizer from the dc YAML and ran `oc delete` again.
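The manual workaround described above can be sketched as follows (namespace and resource name taken from the YAML in the description; note that clearing finalizers bypasses foreground garbage collection and may leave dependent objects behind):

```shell
# Remove the foregroundDeletion finalizer so the stuck object can be deleted.
# WARNING: this bypasses garbage collection and may orphan dependent
# replication controllers and pods.
oc patch dc/firsttestgateway -n first-dt --type merge \
  -p '{"metadata":{"finalizers":null}}'

# Retry the delete; with no finalizer left the object should be removed.
oc delete dc/firsttestgateway -n first-dt
```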

Comment 1 Michal Fojtik 2018-03-26 08:22:32 UTC
Can you please provide a dump of the pods/replication controllers associated with this DC? Additionally, the API server and controllers journals would be helpful for analysis.

Comment 2 Robert Bost 2018-03-26 15:40:15 UTC
We do not currently have those details but are requesting them now from another instance of the issue. Leaving needinfo set.

Comment 4 Maciej Szulik 2018-03-28 10:19:50 UTC
We would need to see controller logs from the time this removal was being invoked (at least for +1h after the initial oc delete invocation). It looks like there were some problems removing the dependent objects (either replication controllers or pods) the DC owned. Without the dependents being properly removed, the DC itself won't be removed either. I'd like to investigate the logs to further confirm that theory and examine what might be causing this problem.
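One way to inspect those dependents (a sketch; the namespace and resource kinds are taken from the description above) is to list objects whose ownerReferences point back at the DC:

```shell
# List replication controllers and pods in the namespace together with
# the names of their owners, to see which objects the DC still owns.
oc get rc,pods -n first-dt \
  -o jsonpath='{range .items[*]}{.kind}/{.metadata.name}{"\t"}{.metadata.ownerReferences[*].name}{"\n"}{end}'
```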

Comment 6 Maciej Szulik 2018-04-05 07:44:17 UTC
I've reviewed the attached logs and unfortunately I can't figure out exactly what's going on. The logs suggest everything is working as expected (by which I mean I don't see any errors), but I can't verify any theory without the full YAML of the dependent resources or garbage collector logs at a higher log level.

I'd suggest the next time this situation happens, before applying the workaround, please gather the following data:

- controller logs with loglevel at least 2 (the level at which the garbage collector produces valuable output)
- full YAMLs for all the resources involved; in a case similar to the one described in comment 1, that would be: deployment config, replication controllers, and pods.
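The data collection above might look roughly like this on a 3.7 master (the controllers service name is an assumption for an HA install; adjust the unit name and the loglevel setting to match your installation):

```shell
# Capture the controllers journal from around the time of the oc delete
# (loglevel >= 2 is needed for useful garbage collector output).
journalctl -u atomic-openshift-master-controllers --since "-2 hours" > controllers.log

# Dump the full YAML of the DC and its dependents for analysis.
oc get dc/firsttestgateway -n first-dt -o yaml > dc.yaml
oc get rc,pods -n first-dt -o yaml > dependents.yaml
```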

Comment 29 Robert Bost 2018-05-21 13:44:47 UTC
Resetting needinfo

Comment 43 errata-xmlrpc 2018-06-26 06:43:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1798