Description of problem:

This bug is cloned from upstream: https://github.com/operator-framework/operator-sdk/issues/1874. It blocks CNV operators.

Details: The controller pod becomes the leader by acquiring a lock on a ConfigMap resource. Due to a bug in my code which caused a memory leak, the pod was evicted after a few hours. However, the evicted pod's lock was not released, so the new pod could not become the leader. The only way to fix that was to manually delete the evicted pod, which was holding the lock on the resource, and to delete the new pod, which could not acquire the lock and gave up after a few tries. According to the Kubernetes documentation, evicted pods' locks are deleted when the pods are evicted.

Version-Release number of selected component (if applicable):

How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:
The evicted pod should release the controller ConfigMap lock.

Additional info:
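For context, a minimal sketch (assumed names only; the lock name below is a placeholder, not the affected operators' actual code) of how an operator-sdk based operator takes the leader-for-life lock at startup:

package main

import (
	"context"
	"log"

	"github.com/operator-framework/operator-sdk/pkg/leader"
)

func main() {
	// leader.Become tries to create a ConfigMap named after the lock,
	// owned by this pod. Creation only succeeds when no other pod owns
	// it, so the call retries until this pod is the leader. The lock is
	// released only when the owning pod object is deleted and the
	// ConfigMap is garbage-collected, which is why a hard-evicted (but
	// not deleted) leader keeps the lock indefinitely.
	if err := leader.Become(context.TODO(), "cnv-operator-lock"); err != nil {
		log.Fatalf("failed to become leader: %v", err)
	}
	// ... set up and start the controller manager here ...
}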
Hi Jian, I followed up in the GitHub issue with some comments about the leader election approaches and some possible things to investigate to improve the reliability of leaders giving up their lock when they're evicted. However, I think there is still a common case where the leader-for-life approach can result in a deadlock: a network partition that causes the operator's leader pod to lose contact with the API server and the API server to lose contact with the kubelet on the node where the leader pod is running. For simple examples, think of a power outage on that node or a network cable being unplugged.

- https://github.com/operator-framework/operator-sdk/issues/1874#issuecomment-528842246
- https://github.com/operator-framework/operator-sdk/issues/1874#issuecomment-529529313

Are CNV operators able to use a different election approach (e.g. leader with lease) or fix the resource consumption bug?
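To make the lease-based alternative concrete, here is a rough sketch of enabling it through controller-runtime (the election ID and namespace are placeholders, not taken from any CNV operator). With a lease, the leader must keep renewing its lock, so an evicted or partitioned leader loses it automatically:

package main

import (
	"log"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Lease-based election: if this pod stops renewing (eviction, node
	// outage, network partition), the lease expires and another replica
	// can take over without manual cleanup.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "cnv-operator-leader-election",
		LeaderElectionNamespace: "openshift-cnv",
	})
	if err != nil {
		log.Fatalf("failed to create manager: %v", err)
	}
	// ... register controllers with mgr ...
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		log.Fatalf("manager exited: %v", err)
	}
}

The trade-off, as noted in the linked comments, is that lease-based election allows a brief window with two leaders, while leader-for-life guarantees a single leader but can deadlock if the old leader's pod is never deleted.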
> Due to a bug in my code which caused a memory leak, the pod was evicted after a few hours.
> However, the evicted pod's lock was not released, so the new pod could not become the leader.
> The only way to fix that was to manually delete the evicted pod, which was holding the lock on the resource, and to delete the new pod, which could not acquire the lock and gave up after a few tries.
> According to the Kubernetes documentation, evicted pods' locks are deleted when the pods are evicted.

Soft evictions result in deleted pods, but some circumstances (including memory pressure) can result in a hard eviction, which does not delete the pod. The fix for this is to detect the evicted leader and delete its pod, allowing garbage collection to clean up the lock.

https://github.com/operator-framework/operator-sdk/pull/2210/
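A rough sketch of the idea behind that fix (simplified and with hypothetical helper/parameter names; see the linked PR for the actual change):

package leaderutil

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteEvictedLeader checks whether the pod that owns the lock ConfigMap
// was hard-evicted (phase Failed, reason "Evicted"). If so, it deletes the
// pod so the owned ConfigMap is garbage-collected and the caller can retry
// acquiring the lock.
func deleteEvictedLeader(ctx context.Context, c client.Client, ownerName, namespace string) error {
	pod := &corev1.Pod{}
	if err := c.Get(ctx, types.NamespacedName{Name: ownerName, Namespace: namespace}, pod); err != nil {
		return err
	}
	if pod.Status.Phase == corev1.PodFailed && pod.Status.Reason == "Evicted" {
		return c.Delete(ctx, pod)
	}
	return nil
}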
Fixed upstream issues:
https://github.com/operator-framework/operator-sdk/issues/1874
https://github.com/operator-framework/operator-sdk/issues/1305
Cluster version is 4.4.0-0.nightly-2020-02-18-042756
Removed 1811212 from 'Depends On' since the associated bug was verified.
We are not able to reproduce this. We have not seen this happen during any of our tests. We are not sure how we can verify this apart from checking that the components are using the fixed and updated version of operator-sdk (and at the moment it seems that not all of them are). We are okay with pushing this back (yet) again.
As per comment #35 and after discussing with Austin Macdonald, I am closing this bug.