Description of problem:

This bug is cloned from upstream: https://github.com/operator-framework/operator-sdk/issues/1874. It blocks CNV operators.

Details: The controller pod becomes the leader by acquiring a lock on a ConfigMap resource. Due to a bug in my code which caused a memory leak, the pod was evicted after a few hours. However, the evicted pod's lock was not released, so the new pod could not become the leader. The only way to fix that was to manually delete the evicted pod, which was holding the lock on the resource, and to delete the new pod, which could not acquire the lock and gave up after a few tries. According to the Kubernetes documentation, evicted pods' locks are deleted when the pods are evicted.

Version-Release number of selected component (if applicable):

How reproducible:
always

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:
The evicted pod should release the controller ConfigMap lock.

Additional info:
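For context, a minimal sketch (assumed names only; the lock name below is a placeholder, not the affected operators' actual code) of how an operator-sdk based operator takes the leader-for-life lock at startup:

package main

import (
	"context"
	"log"

	"github.com/operator-framework/operator-sdk/pkg/leader"
)

func main() {
	// leader.Become tries to create a ConfigMap named after the lock,
	// owned by this pod. Creation only succeeds when no other pod owns
	// it, so the call retries until this pod is the leader. The lock is
	// released only when the owning pod object is deleted and the
	// ConfigMap is garbage-collected, which is why a hard-evicted (but
	// not deleted) leader keeps the lock indefinitely.
	if err := leader.Become(context.TODO(), "cnv-operator-lock"); err != nil {
		log.Fatalf("failed to become leader: %v", err)
	}
	// ... set up and start the controller manager here ...
}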
Hi Jian, I followed up in the GitHub issue with some comments about the leader election approaches and some possible things to investigate to improve the reliability of leaders giving up their lock when they're evicted. However, I think there is still a common case where the leader-for-life approach can result in a deadlock: a network partition that causes the operator's leader pod to lose contact with the API server and the API server to lose contact with the kubelet on the node where the leader pod is running. For simple examples, think of a power outage on that node or a network cable being unplugged.

- https://github.com/operator-framework/operator-sdk/issues/1874#issuecomment-528842246
- https://github.com/operator-framework/operator-sdk/issues/1874#issuecomment-529529313

Are CNV operators able to use a different election approach (e.g. leader with lease) or fix the resource consumption bug?
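To make the lease-based alternative concrete, here is a rough sketch of enabling it through controller-runtime (the election ID and namespace are placeholders, not taken from any CNV operator). With a lease, the leader must keep renewing its lock, so an evicted or partitioned leader loses it automatically:

package main

import (
	"log"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Lease-based election: if this pod stops renewing (eviction, node
	// outage, network partition), the lease expires and another replica
	// can take over without manual cleanup.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "cnv-operator-leader-election",
		LeaderElectionNamespace: "openshift-cnv",
	})
	if err != nil {
		log.Fatalf("failed to create manager: %v", err)
	}
	// ... register controllers with mgr ...
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		log.Fatalf("manager exited: %v", err)
	}
}

The trade-off, as noted in the linked comments, is that lease-based election allows a brief window with two leaders, while leader-for-life guarantees a single leader but can deadlock if the old leader's pod is never deleted.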
> Due to a bug in my code which caused a memory leak, the pod was evicted after a few hours.
> However, the evicted pod's lock was not released, so the new pod could not become the leader.
> The only way to fix that was to manually delete the evicted pod, which was holding the lock on the resource, and to delete the new pod, which could not acquire the lock and gave up after a few tries.
> According to the Kubernetes documentation, evicted pods' locks are deleted when the pods are evicted.

Soft evictions result in deleted pods, but some circumstances (including memory pressure) can result in a hard eviction, which does not delete the pod. The fix for this is to detect the evicted leader and delete its pod, allowing garbage collection to clean up the lock.

https://github.com/operator-framework/operator-sdk/pull/2210/
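A rough sketch of the idea behind that fix (simplified and with hypothetical helper/parameter names; see the linked PR for the actual change):

package leaderutil

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteEvictedLeader checks whether the pod that owns the lock ConfigMap
// was hard-evicted (phase Failed, reason "Evicted"). If so, it deletes the
// pod so the owned ConfigMap is garbage-collected and the caller can retry
// acquiring the lock.
func deleteEvictedLeader(ctx context.Context, c client.Client, ownerName, namespace string) error {
	pod := &corev1.Pod{}
	if err := c.Get(ctx, types.NamespacedName{Name: ownerName, Namespace: namespace}, pod); err != nil {
		return err
	}
	if pod.Status.Phase == corev1.PodFailed && pod.Status.Reason == "Evicted" {
		return c.Delete(ctx, pod)
	}
	return nil
}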
Fixed upstream issues:
https://github.com/operator-framework/operator-sdk/issues/1874
https://github.com/operator-framework/operator-sdk/issues/1305
Cluster version is 4.4.0-0.nightly-2020-02-18-042756
Removed 1811212 from 'Depends On' since the associated bug was verified.
We are not able to reproduce this. We have not seen this happen during any of our tests. We are not sure how we can verify this apart from checking that the components are using the fixed and updated version of operator-sdk (and at the moment it seems that not all of them are). We are okay with pushing this back (yet) again.
As per comment #35 and after discussing with Austin Macdonald, I am closing this bug.