Bug 1749620

Summary: [operator-sdk] Evicted pod does not release controller ConfigMap lock
Product: OpenShift Container Platform
Reporter: Jian Zhang <jiazha>
Component: Operator SDK
Assignee: amacdona <austin>
Status: CLOSED WORKSFORME
QA Contact: Inbar Rose <irose>
Severity: high
Priority: high
Version: 4.2.0
CC: abeekhof, aos-bugs, austin, awels, chezhang, dageoffr, irose, jiazha, ksimon, mhernon, ncredi, phoracek, schoudha, shurley, stirabos, xiuwang, xtian
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2020-11-20 12:14:12 UTC
Type: Bug
Bug Depends On: 1805019    
Bug Blocks: 1878603    

Description Jian Zhang 2019-09-06 02:32:50 UTC
Description of problem:
This bug is cloned from the upstream issue https://github.com/operator-framework/operator-sdk/issues/1874. It blocks CNV operators. Details:

The controller pod becomes the leader by acquiring a lock on a ConfigMap resource.
Due to a bug in my code which caused a memory leak, the pod was evicted after a few hours.
However, the evicted pod's lock was not released, so the new pod could not become the leader.
The only way to fix that was to manually delete both the evicted pod, which was holding the lock, and the new pod, which could not acquire the lock and gave up after a few tries.
According to the Kubernetes documentation, locks held by evicted pods are deleted when the pods are evicted.


How reproducible:
always

Expected results:
The evicted pod should release the controller ConfigMap lock.

Comment 1 Joe Lanford 2019-09-09 18:08:31 UTC
Hi Jian,

I followed up in the GitHub issue with some comments about the leader election approaches and some possible things to investigate to improve the reliability of leaders giving up their lock when they're evicted. However, I think there is still a common case where the leader-for-life approach can result in a deadlock: a network partition that causes the operator's leader pod to lose contact with the API and the API server to lose contact with the kubelet on the node where the leader pod is running. For simple examples, think power outage to that node or network cable being unplugged.

- https://github.com/operator-framework/operator-sdk/issues/1874#issuecomment-528842246
- https://github.com/operator-framework/operator-sdk/issues/1874#issuecomment-529529313

Are CNV operators able to use a different election approach (e.g. leader with lease) or fix the resource consumption bug?

Comment 2 amacdona@redhat.com 2019-11-20 15:48:57 UTC
>Due to a bug in my code which caused a memory leak, the pod was evicted after a few hours.
>However, the evicted pod's lock was not released, so the new pod could not become the leader.
>The only way to fix that was to manually delete both the evicted pod, which was holding the lock, and the new pod, which could not acquire the lock and gave up after a few tries.
>According to the Kubernetes documentation, locks held by evicted pods are deleted when the pods are evicted.

Soft evictions result in deleted pods, but some circumstances (including memory pressure) can result in a hard eviction, which does not delete the pod.

The fix for this is to detect the evicted leader and delete the pod, allowing garbage collection to clean up the lock. 
https://github.com/operator-framework/operator-sdk/pull/2210/
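
The approach described in this comment can be sketched roughly as follows. This is a simplified in-memory model and not the code from the linked pull request: the `Pod`, `Lock`, and `Cluster` types and the `tryBecomeLeader` function are hypothetical stand-ins, and garbage collection is modeled as happening immediately when the owning pod is deleted. The eviction check itself relies on how Kubernetes reports a kubelet eviction, namely a pod in phase `Failed` with reason `Evicted`.

```go
package main

import "fmt"

// Pod is a hypothetical, simplified stand-in for a Kubernetes pod.
type Pod struct {
	Name   string
	Phase  string // e.g. "Running" or "Failed"
	Reason string // e.g. "Evicted"
}

// Lock stands in for the leader ConfigMap, which is owned by the leader pod.
type Lock struct {
	OwnerPod string
}

// Cluster is an in-memory stand-in for the API server state.
type Cluster struct {
	Pods map[string]Pod
	Lock *Lock
}

// isPodEvicted reports whether the pod failed because it was evicted.
func isPodEvicted(p Pod) bool {
	return p.Phase == "Failed" && p.Reason == "Evicted"
}

// deletePod removes the pod and mimics garbage collection of a lock it owns.
func (c *Cluster) deletePod(name string) {
	delete(c.Pods, name)
	if c.Lock != nil && c.Lock.OwnerPod == name {
		c.Lock = nil
	}
}

// tryBecomeLeader makes one attempt to take the lock. If the lock is held
// by an evicted pod, that pod is deleted so a later attempt can succeed.
func (c *Cluster) tryBecomeLeader(candidate string) bool {
	if c.Lock == nil {
		c.Lock = &Lock{OwnerPod: candidate}
		return true
	}
	if c.Lock.OwnerPod == candidate {
		return true
	}
	owner, found := c.Pods[c.Lock.OwnerPod]
	if found && isPodEvicted(owner) {
		c.deletePod(owner.Name)
	}
	return false
}

func main() {
	c := &Cluster{
		Pods: map[string]Pod{
			"operator-old": {Name: "operator-old", Phase: "Failed", Reason: "Evicted"},
			"operator-new": {Name: "operator-new", Phase: "Running"},
		},
		Lock: &Lock{OwnerPod: "operator-old"},
	}
	// First attempt finds the evicted leader and deletes it.
	fmt.Println(c.tryBecomeLeader("operator-new"))
	// Second attempt succeeds because the stale lock is gone.
	fmt.Println(c.tryBecomeLeader("operator-new"))
}
```

In a real cluster the lock would disappear only after garbage collection runs, so the candidate would need to keep retrying rather than give up after a few attempts.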

Comment 5 Jian Zhang 2020-02-20 07:14:39 UTC
Cluster version is 4.4.0-0.nightly-2020-02-18-042756

Comment 17 Zhang Cheng 2020-03-27 06:54:23 UTC
Removed 1811212 from 'Depends On' since the associated bug was verified.

Comment 24 Inbar Rose 2020-10-15 12:35:39 UTC
We are not able to reproduce this. We have not seen this happen during any of our tests.

We are not sure how we can verify this, apart from checking that the components are using the fixed and updated version of operator-sdk (and at the moment it seems that not all of them are).

We are okay with pushing this back (yet) again.

Comment 36 Sunil Choudhary 2020-11-20 12:14:12 UTC
As per comment #35 and after discussing with Austin Macdonald, I am closing this bug.