Bug 1749620

Summary:	[operator-sdk] Evicted pods do not release the controller ConfigMap lock
Product:	OpenShift Container Platform	Reporter:	Jian Zhang <jiazha>
Component:	Operator SDK	Assignee:	amacdona <austin>
Status:	CLOSED WORKSFORME	QA Contact:	Inbar Rose <irose>
Severity:	high	Docs Contact:
Priority:	high
Version:	4.2.0	CC:	abeekhof, aos-bugs, austin, awels, chezhang, dageoffr, irose, jiazha, ksimon, mhernon, ncredi, phoracek, schoudha, shurley, stirabos, xiuwang, xtian
Target Milestone:	---
Target Release:	4.6.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1878603		Environment:
Last Closed:	2020-11-20 12:14:12 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1805019
Bug Blocks:	1878603

Description Jian Zhang 2019-09-06 02:32:50 UTC

Description of problem:
This bug is a clone of the upstream issue https://github.com/operator-framework/operator-sdk/issues/1874, which blocks CNV operators. Details:

The controller pod becomes a leader by acquiring a lock on a configmap resource.
Due to a bug in my code which caused a memory leak, the pod was evicted after a few hours.
However, the evicted pod's lock was not released, so the new pod could not become the leader.
The only fix was to manually delete both the evicted pod, which was still holding the lock, and the new pod, which had given up after a few failed attempts to acquire it.
According to the Kubernetes documentation, an evicted pod's locks are deleted when the pod is evicted.


Version-Release number of selected component (if applicable):


How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:
The evicted pod should release controller ConfigMap lock.

Additional info:

Comment 1 Joe Lanford 2019-09-09 18:08:31 UTC

Hi Jian,

I followed up in the GitHub issue with some comments about the leader election approaches and some possible things to investigate to improve the reliability of leaders giving up their lock when they're evicted. However, I think there is still a common case where the leader-for-life approach can result in a deadlock: a network partition that causes the operator's leader pod to lose contact with the API and the API server to lose contact with the kubelet on the node where the leader pod is running. For simple examples, think power outage to that node or network cable being unplugged.

- https://github.com/operator-framework/operator-sdk/issues/1874#issuecomment-528842246
- https://github.com/operator-framework/operator-sdk/issues/1874#issuecomment-529529313

Are CNV operators able to use a different election approach (e.g. leader with lease) or fix the resource consumption bug?
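
To illustrate why a lease avoids the deadlock described above: with leader-for-life the lock is held until the owning pod is deleted, whereas a lease-based lock lapses if the holder stops renewing it. A minimal sketch in Python follows. The `Lease` class, its fields, and the 15-second duration are hypothetical and do not mirror any specific library's API.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    holder: str             # identity of the current leader
    renew_time: float       # timestamp (seconds) of the last renewal
    duration: float = 15.0  # assumed validity window without renewal

    def expired(self, now: float) -> bool:
        """A lease is expired once it has gone unrenewed for longer than its duration."""
        return now - self.renew_time > self.duration

    def try_acquire(self, candidate: str, now: float) -> bool:
        """Renew if the candidate already holds the lease; take over only if it expired."""
        if self.holder == candidate or self.expired(now):
            self.holder = candidate
            self.renew_time = now
            return True
        return False
```

In this sketch a partitioned or evicted leader simply stops renewing, so a new pod can take over after the lease duration without anyone deleting the old pod. A real implementation would also need an atomic compare-and-update against the API server, which is omitted here.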

Comment 2 amacdona@redhat.com 2019-11-20 15:48:57 UTC

>Due to a bug in my code which caused a memory leak, the pod was evicted after a few hours.
>However, the evicted pod's lock was not released, so the new pod could not become the leader.
>The only fix was to manually delete both the evicted pod, which was still holding the lock, and the new pod, which had given up after a few failed attempts to acquire it.
>According to the Kubernetes documentation, an evicted pod's locks are deleted when the pod is evicted.

Soft evictions result in deleted pods, but some circumstances (including memory pressure) can result in a hard eviction, which does not delete the pod.

The fix for this is to detect the evicted leader and delete the pod, allowing garbage collection to clean up the lock. 
https://github.com/operator-framework/operator-sdk/pull/2210/
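
The check described above can be sketched roughly as follows. This is a minimal illustration in Python over plain dicts shaped like Kubernetes Pod and ConfigMap objects, not the code from the linked pull request. The function names `is_pod_evicted`, `lock_owner_name`, and `should_delete_leader` are hypothetical. It assumes that the lock ConfigMap carries an owner reference to the leader pod, and that an evicted pod reports `status.phase: Failed` with `status.reason: Evicted`.

```python
from typing import Optional


def is_pod_evicted(pod: dict) -> bool:
    """Return True if the pod's status marks it as evicted."""
    status = pod.get("status", {})
    return status.get("phase") == "Failed" and status.get("reason") == "Evicted"


def lock_owner_name(configmap: dict) -> Optional[str]:
    """Return the name of the Pod that owns the lock ConfigMap, if any."""
    for ref in configmap.get("metadata", {}).get("ownerReferences", []):
        if ref.get("kind") == "Pod":
            return ref.get("name")
    return None


def should_delete_leader(configmap: dict, pods_by_name: dict) -> bool:
    """Decide whether a waiting candidate should delete the current leader pod.

    Deleting an evicted leader lets garbage collection remove the
    ConfigMap it owns, which frees the lock for the next candidate.
    """
    owner = lock_owner_name(configmap)
    if owner is None or owner not in pods_by_name:
        return False
    return is_pod_evicted(pods_by_name[owner])
```

A candidate that fails to acquire the lock could run a check like this on each retry and delete the old leader only when it returns true, leaving healthy leaders untouched.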

Comment 3 amacdona@redhat.com 2019-11-20 16:48:51 UTC

Fixed upstream issues:
https://github.com/operator-framework/operator-sdk/issues/1874
https://github.com/operator-framework/operator-sdk/issues/1305

Comment 5 Jian Zhang 2020-02-20 07:14:39 UTC

Cluster version is 4.4.0-0.nightly-2020-02-18-042756

Comment 17 Zhang Cheng 2020-03-27 06:54:23 UTC

Removed 1811212 from 'Depends On' since the associated bug was verified.

Comment 24 Inbar Rose 2020-10-15 12:35:39 UTC

We are not able to reproduce this. We have not seen this happen during any of our tests.

We are not sure how we can verify this apart from checking that the components are using the fixed and updated version of operator-sdk (and at the moment it seems that not all of them are).

We are okay with pushing this back (yet) again.

Comment 36 Sunil Choudhary 2020-11-20 12:14:12 UTC

As per comment #35 and after discussing with Austin Macdonald, I am closing this bug.