Bug 1749620 - [operator-sdk] Evicted pod do not release controller ConfigMap lock
Summary: [operator-sdk] Evicted pod do not release controller ConfigMap lock
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Operator SDK
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: amacdona@redhat.com
QA Contact: Inbar Rose
URL:
Whiteboard:
Depends On: 1805019
Blocks: 1878603
Reported: 2019-09-06 02:32 UTC by Jian Zhang
Modified: 2023-03-24 15:24 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1878603 (view as bug list)
Environment:
Last Closed: 2020-11-20 12:14:12 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | kubevirt hostpath-provisioner-operator pull 59 | 0 | None | closed | Update operator sdk to 0.16. | 2020-11-30 11:02:37 UTC

Description Jian Zhang 2019-09-06 02:32:50 UTC
Description of problem:
This bug is cloned from upstream: https://github.com/operator-framework/operator-sdk/issues/1874. It blocks CNV operators. Details:

The controller pod becomes the leader by acquiring a lock on a ConfigMap resource.
Due to a bug in my code that caused a memory leak, the pod was evicted after a few hours.
However, the evicted pod's lock was not released, so the new pod could not become the leader.
The only way to recover was to manually delete both the evicted pod, which still held the lock, and the new pod, which could not acquire the lock and gave up after a few tries.
According to the Kubernetes documentation, an evicted pod's locks are deleted when the pod is evicted.


Version-Release number of selected component (if applicable):


How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:
The evicted pod should release controller ConfigMap lock.

Additional info:

Comment 1 Joe Lanford 2019-09-09 18:08:31 UTC
Hi Jian,

I followed up in the GitHub issue with some comments about the leader election approaches and some things to investigate that could make leaders more reliably give up their lock when they are evicted. However, I think there is still a common case where the leader-for-life approach can deadlock: a network partition that causes the operator's leader pod to lose contact with the API server, and the API server to lose contact with the kubelet on the node where the leader pod is running. For simple examples, think of a power outage on that node or a network cable being unplugged.

- https://github.com/operator-framework/operator-sdk/issues/1874#issuecomment-528842246
- https://github.com/operator-framework/operator-sdk/issues/1874#issuecomment-529529313

Are CNV operators able to use a different election approach (e.g. leader with lease) or fix the resource consumption bug?

Comment 2 amacdona@redhat.com 2019-11-20 15:48:57 UTC
> Due to a bug in my code that caused a memory leak, the pod was evicted after a few hours.
> However, the evicted pod's lock was not released, so the new pod could not become the leader.
> The only way to recover was to manually delete both the evicted pod, which still held the lock, and the new pod, which could not acquire the lock and gave up after a few tries.
> According to the Kubernetes documentation, an evicted pod's locks are deleted when the pod is evicted.

Soft evictions result in deleted pods, but some circumstances (including memory pressure) can result in a hard eviction, which does not delete the pod.

The fix for this is to detect the evicted leader and delete the pod, allowing garbage collection to clean up the lock. 
https://github.com/operator-framework/operator-sdk/pull/2210/
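
The approach described above can be sketched roughly as follows. This is a simplified in-memory model, not the code from the linked pull request: `Pod`, `Cluster`, `isEvicted`, `DeletePod`, and `TryBecomeLeader` are hypothetical names, and the garbage collection of the lock is imitated by hand. The one detail taken from Kubernetes itself is that an evicted pod is reported with phase `Failed` and reason `Evicted`.

```go
package main

import "fmt"

// Pod is a simplified, hypothetical stand-in for a Kubernetes pod object.
type Pod struct {
	Name   string
	Phase  string // e.g. "Running" or "Failed"
	Reason string // e.g. "Evicted"
}

// Cluster is a hypothetical in-memory model of the API server state.
type Cluster struct {
	Pods      map[string]Pod
	LockOwner string // pod that owns the lock ConfigMap, "" if there is no lock
}

// isEvicted reports whether the pod failed because it was evicted.
func isEvicted(p Pod) bool {
	return p.Phase == "Failed" && p.Reason == "Evicted"
}

// DeletePod removes the pod and, imitating owner-reference garbage
// collection, releases the lock if that pod owned it.
func (c *Cluster) DeletePod(name string) {
	delete(c.Pods, name)
	if c.LockOwner == name {
		c.LockOwner = ""
	}
}

// TryBecomeLeader deletes an evicted lock owner before attempting to take
// the lock, and reports whether candidate is the leader afterwards.
func (c *Cluster) TryBecomeLeader(candidate string) bool {
	if c.LockOwner != "" {
		if owner, ok := c.Pods[c.LockOwner]; ok && isEvicted(owner) {
			c.DeletePod(owner.Name)
		}
	}
	if c.LockOwner == "" {
		c.LockOwner = candidate
	}
	return c.LockOwner == candidate
}

func main() {
	c := &Cluster{
		Pods: map[string]Pod{
			"operator-old": {Name: "operator-old", Phase: "Failed", Reason: "Evicted"},
			"operator-new": {Name: "operator-new", Phase: "Running"},
		},
		LockOwner: "operator-old",
	}
	fmt.Println(c.TryBecomeLeader("operator-new"))
	fmt.Println(c.LockOwner)
}
```

In this model a running lock owner is left alone, so a healthy leader keeps its lock; only an owner that is still present but marked as evicted is deleted by the candidate.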

Comment 5 Jian Zhang 2020-02-20 07:14:39 UTC
Cluster version is 4.4.0-0.nightly-2020-02-18-042756

Comment 17 Zhang Cheng 2020-03-27 06:54:23 UTC
Removed 1811212 from 'Depends On' since the associated bug was verified.

Comment 24 Inbar Rose 2020-10-15 12:35:39 UTC
We are not able to reproduce this. We have not seen this happen during any of our tests.

We are not sure how to verify this, apart from checking that the components are using the fixed and updated version of operator-sdk (at the moment it seems that not all of them are).

We are okay with pushing this back (yet) again.

Comment 36 Sunil Choudhary 2020-11-20 12:14:12 UTC
As per comment #35 and after discussing with Austin Macdonald, I am closing this bug.

