Version-Release number:
---------------------------
[kni@provisionhost-0-0 ~]$ oc version
Client Version: 4.10.0-0.nightly-2022-03-23-153617
Server Version: 4.10.0-0.nightly-2022-03-23-153617
Kubernetes Version: v1.23.5+b0357ed

env:
-----------
Disconnected OCP 4.10 cluster with the OVNKubernetes network type, an IPv4 baremetal network, and an IPv6 provisioning network.

Description of problem:
-------------------------
As part of testing a self-remediation operator (poison-pill), a DaemonSet pod is placed on each node and constantly monitors its health. We simulated a case in which a node becomes unhealthy by creating a "disk pressure" condition on it. Once the node is detected as unhealthy, it goes through a remediation process in which it is eventually rebooted (with a system reboot) and returns to the cluster as a healthy, functional node.

In this case we expect the poison-pill pod placed on that node to be running and functional as well, which does not happen when the cluster uses the OVNKubernetes network type. The pod is stuck in CrashLoopBackOff status, and other pods scheduled on that node are in the same state:

[kni@provisionhost-0-0 ~]$ oc get po -A -o wide | grep -v Running | grep -v Completed
NAMESPACE             NAME                   READY   STATUS             RESTARTS         AGE    IP            NODE         NOMINATED NODE   READINESS GATES
openshift-dns         dns-default-985hd      1/2     CrashLoopBackOff   35 (2m13s ago)   174m   10.130.2.15   worker-0-0   <none>           <none>
openshift-operators   poison-pill-ds-r62xf   0/1     CrashLoopBackOff   9 (33s ago)      76m    10.131.2.4    worker-0-0   <none>           <none>

[kni@provisionhost-0-0 ~]$ oc get po -o wide -n openshift-operators
NAME                                              READY   STATUS             RESTARTS        AGE     IP            NODE         NOMINATED NODE   READINESS GATES
poison-pill-controller-manager-84c85d56fb-67f29   1/1     Running            0               3h30m   10.131.0.40   worker-0-1   <none>           <none>
poison-pill-ds-jsctj                              1/1     Running            0               70m     10.128.2.17   worker-0-2   <none>           <none>
poison-pill-ds-r62xf                              0/1     CrashLoopBackOff   7 (4m47s ago)   70m     10.131.2.4    worker-0-0   <none>           <none>
poison-pill-ds-rzfzn                              1/1     Running            0               70m     10.131.0.53   worker-0-1   <none>           <none>

Looking at the poison-pill-ds-r62xf pod description, we can see the following error:

  Warning  ErrorAddingLogicalPort  64s  controlplane  unable to parse node L3 gw annotation: k8s.ovn.org/l3-gateway-config annotation not found for node "worker-0-0"
  Normal   AddedInterface          34s  multus        Add eth0 [10.131.2.4/23] from ovn-kubernetes

---------------------------
From the log of this pod:

2022-03-27T13:07:44.247Z  ERROR  controller-runtime.manager  Failed to get API Group-Resources  {"error": "Get \"https://172.30.0.1:443/api?timeout=32s\": dial tcp 172.30.0.1:443: connect: no route to host"}
github.com/go-logr/zapr.(*zapLogger).Error
    /remote-source/app/vendor/github.com/go-logr/zapr/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/cluster.New
    /remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/cluster/cluster.go:160
sigs.k8s.io/controller-runtime/pkg/manager.New
    /remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/manager/manager.go:278
main.main
    /remote-source/app/main.go:96
runtime.main
    /opt/rh/go-toolset-1.16/root/usr/lib/go-toolset-1.16-golang/src/runtime/proc.go:225
2022-03-27T13:07:44.247Z  ERROR  setup  unable to start manager  {"error": "Get \"https://172.30.0.1:443/api?timeout=32s\": dial tcp 172.30.0.1:443: connect: no route to host"}
github.com/go-logr/zapr.(*zapLogger).Error

When we delete the pod, it returns to Running state.

How reproducible:
-------------------------
Always

Steps to Reproduce:
-------------------------
1. Install the poison-pill operator.
2. Place a large file on a node to simulate disk pressure (for example: fallocate -l 28G Gigfile) and wait for the node to reboot and be re-created (a repro sketch follows at the end of this comment).
3. The node returns to Ready state, but the poison-pill pod is stuck in CrashLoopBackOff.

Actual results:
-------------------
The pod is stuck in CrashLoopBackOff status, and other pods scheduled on that node are in the same state.

Expected results:
-------------------
After the reboot and re-creation of the node instance, all the workloads on this worker are running.

Additional info:
-------------------
must-gather attached to the bug in the next comment
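For completeness, a minimal sketch of the reproduction and the first checks to run, assuming SSH access to the worker as the core user (the 28G size comes from the repro above and may need adjusting to the node's actual free space):

# Fill the node's disk to trigger the disk-pressure condition:
ssh core@worker-0-0 'fallocate -l 28G Gigfile'

# Watch the node go NotReady and get deleted/re-created by the remediation:
oc get nodes -w

# Once the node is Ready again, list non-Running pods scheduled on it and
# check the OVN gateway annotation the ErrorAddingLogicalPort event refers to:
oc get pods -A -o wide --field-selector spec.nodeName=worker-0-0 | grep -v Running
oc get node worker-0-0 -o jsonpath='{.metadata.annotations.k8s\.ovn\.org/l3-gateway-config}'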
must-gather - http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/BZ-2068910-must-gather
OK, sure. I'll try to reproduce and I'll update you.
1. I create disk pressure on the worker node
2. The node becomes NotReady
3. NHC detects it
4. A resource named poison pill remediation is created; this resource indicates that remediation has started for this node
5. The node is rebooted (this process can occur several times)
6. After that the node is deleted (this process can also occur several times)
7. After that the node is returned; however, the pod is in CrashLoopBackOff state
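A hedged sketch of how this sequence can be watched from the CLI; the poisonpillremediation resource kind matches the CR named later in this bug, but which namespace it is created in is an assumption:

# In one terminal, watch the node being rebooted, deleted, and re-created:
oc get nodes -w

# In another, watch the remediation CR appear for the unhealthy node:
oc get poisonpillremediation -A -w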
> Could you confirm that the node is indeed added from a backup, and what are the contents of this backup, and does it contain all the previous annotations?

I checked the process again. I added a label and an annotation to worker-0-0 before the remediation; however, after the node was re-created this data was not preserved. So I suppose we have a problem with the backup part (on our side).

--------------------
Before Node Deletion:

Labels:       beta.kubernetes.io/arch=amd64
              beta.kubernetes.io/os=linux
              kubernetes.io/arch=amd64
              kubernetes.io/hostname=worker-0-0
              kubernetes.io/os=linux
              node-role.kubernetes.io/worker=
              node.openshift.io/os_id=rhcos
              test-deletion=
Annotations:  k8s.ovn.org/host-addresses: ["192.168.123.139"]
              k8s.ovn.org/l3-gateway-config: {"default":{"mode":"shared","interface-id":"br-ex_worker-0-0","mac-address":"52:54:00:3e:2b:0e","ip-addresses":["192.168.123.139/24"],"ip-...
              k8s.ovn.org/node-chassis-id: 9304749a-a518-4caf-b91f-f6b543746d5a
              k8s.ovn.org/node-mgmt-port-mac-address: 2a:57:c9:9f:97:92
              k8s.ovn.org/node-primary-ifaddr: {"ipv4":"192.168.123.139/24"}
              k8s.ovn.org/node-subnets: {"default":"10.129.2.0/23"}
              machine.openshift.io/machine: openshift-machine-api/ocp-edge-cluster-0-5mr87-worker-0-vn4vt
              machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
              machineconfiguration.openshift.io/currentConfig: rendered-worker-4ae896e1ae6f565724711c746bbc37f1
              machineconfiguration.openshift.io/desiredConfig: rendered-worker-4ae896e1ae6f565724711c746bbc37f1
              machineconfiguration.openshift.io/reason:
              machineconfiguration.openshift.io/state: Done
              test/deletion: true
              volumes.kubernetes.io/controller-managed-attach-detach: true

--------------------
After Node re-creation:

Labels:       beta.kubernetes.io/arch=amd64
              beta.kubernetes.io/os=linux
              kubernetes.io/arch=amd64
              kubernetes.io/hostname=worker-0-0
              kubernetes.io/os=linux
              node-role.kubernetes.io/worker=
              node.openshift.io/os_id=rhcos
Annotations:  k8s.ovn.org/host-addresses: ["192.168.123.139","fd00:1101:0:1:5207:3236:7124:1e1"]
              k8s.ovn.org/l3-gateway-config: {"default":{"mode":"shared","interface-id":"br-ex_worker-0-0","mac-address":"52:54:00:3e:2b:0e","ip-addresses":["192.168.123.139/24"],"ip-...
              k8s.ovn.org/node-chassis-id: 9304749a-a518-4caf-b91f-f6b543746d5a
              k8s.ovn.org/node-mgmt-port-mac-address: 2a:57:c9:9f:97:92
              k8s.ovn.org/node-primary-ifaddr: {"ipv4":"192.168.123.139/24"}
              k8s.ovn.org/node-subnets: {"default":"10.130.2.0/23"}
              machine.openshift.io/machine: openshift-machine-api/ocp-edge-cluster-0-5mr87-worker-0-vn4vt
              machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
              machineconfiguration.openshift.io/currentConfig: rendered-worker-4ae896e1ae6f565724711c746bbc37f1
              machineconfiguration.openshift.io/desiredConfig: rendered-worker-4ae896e1ae6f565724711c746bbc37f1
              machineconfiguration.openshift.io/reason:
              machineconfiguration.openshift.io/state: Done
              volumes.kubernetes.io/controller-managed-attach-detach: true

As we can see, the label (test-deletion=) and the annotation (test/deletion: true) were not preserved.
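A sketch of how such a before/after comparison can be automated (assumes a bash shell and jq on the bastion; file names are illustrative):

# Snapshot labels and annotations before triggering remediation:
oc get node worker-0-0 -o jsonpath='{.metadata.labels}' > labels-before.json
oc get node worker-0-0 -o jsonpath='{.metadata.annotations}' > annotations-before.json

# ...trigger remediation, wait for the node to be re-created, then:
oc get node worker-0-0 -o jsonpath='{.metadata.labels}' > labels-after.json
oc get node worker-0-0 -o jsonpath='{.metadata.annotations}' > annotations-after.json

# Compare, with keys sorted so ordering differences don't show up:
diff <(jq -S . labels-before.json) <(jq -S . labels-after.json)
diff <(jq -S . annotations-before.json) <(jq -S . annotations-after.json)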
Before closing this BZ, I'll check with OpenShiftSDN as well, to be sure the problem is on our side and is not related to OVN.
I checked with OpenShiftSDN:

--------------------
Before Node Deletion:

Labels:       beta.kubernetes.io/arch=amd64
              beta.kubernetes.io/os=linux
              kubernetes.io/arch=amd64
              kubernetes.io/hostname=worker-0-0
              kubernetes.io/os=linux
              node-role.kubernetes.io/worker=
              node.openshift.io/os_id=rhcos
              test/deletion=
Annotations:  is-reboot-capable.poison-pill.medik8s.io: true
              machine.openshift.io/machine: openshift-machine-api/ocp-edge-cluster-0-xjdwg-worker-0-bflt9
              machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
              machineconfiguration.openshift.io/currentConfig: rendered-worker-33d59c6e83fda8461b8b44b86b53f155
              machineconfiguration.openshift.io/desiredConfig: rendered-worker-33d59c6e83fda8461b8b44b86b53f155
              machineconfiguration.openshift.io/reason:
              machineconfiguration.openshift.io/state: Done
              test/Deletion: true
              volumes.kubernetes.io/controller-managed-attach-detach: true

--------------------
After Node re-creation:

Labels:       beta.kubernetes.io/arch=amd64
              beta.kubernetes.io/os=linux
              kubernetes.io/arch=amd64
              kubernetes.io/hostname=worker-0-0
              kubernetes.io/os=linux
              node-role.kubernetes.io/worker=
              node.openshift.io/os_id=rhcos
              test/deletion=
Annotations:  is-reboot-capable.poison-pill.medik8s.io: true
              machine.openshift.io/machine: openshift-machine-api/ocp-edge-cluster-0-xjdwg-worker-0-bflt9
              machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
              machineconfiguration.openshift.io/currentConfig: rendered-worker-33d59c6e83fda8461b8b44b86b53f155
              machineconfiguration.openshift.io/desiredConfig: rendered-worker-33d59c6e83fda8461b8b44b86b53f155
              machineconfiguration.openshift.io/reason:
              machineconfiguration.openshift.io/state: Done
              test/Deletion: true
              volumes.kubernetes.io/controller-managed-attach-detach: true

As we can see, the label and annotation were preserved.
In a nutshell:
1. On detecting a faulty node, NHC creates an SNR remediation
2. SNR backs up the node onto the SNR remediation
3. SNR triggers a reboot via one of (a quick check for this is sketched below):
   - watchdog (first priority)
   - softdog (second priority)
   - system reboot (last priority)
4. After a period of time (in which the reboot is assumed to have completed) we delete the Node
5. We re-create the Node from the backup

If you have more questions, feel free to write to me or Michael on Slack.
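Regarding the priority list in step 3, a hedged sketch of how to check which reboot mechanism a node can use; the operator also records its conclusion in the is-reboot-capable annotation shown in the SDN comparison above:

# Look for a hardware watchdog / softdog device on the node:
oc debug node/worker-0-0 -- chroot /host sh -c 'ls -l /dev/watchdog*'

# Check what the operator itself concluded:
oc get node worker-0-0 -o jsonpath='{.metadata.annotations.is-reboot-capable\.poison-pill\.medik8s\.io}'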
I've provided a must-gather with the required information (poison-pill-related logs & CRs) - http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/BZ-2068910-must-gather-poison

[kni@provisionhost-0-0 ~]$ oc get pods -n openshift-operators -o wide
NAME                                                           READY   STATUS             RESTARTS         AGE   IP            NODE         NOMINATED NODE   READINESS GATES
node-healthcheck-operator-controller-manager-df4944675-f6c8x   2/2     Running            0                70m   10.130.1.39   master-0-1   <none>           <none>
poison-pill-controller-manager-6574647945-z9w9z                1/1     Running            0                60m   10.129.2.14   worker-0-2   <none>           <none>
poison-pill-ds-dklj9                                           0/1     CrashLoopBackOff   13 (4m40s ago)   54m   10.128.2.5    worker-0-0   <none>           <none>
poison-pill-ds-xr26v                                           1/1     Running            0                70m   10.129.2.13   worker-0-2   <none>           <none>
poison-pill-ds-zn6x8                                           1/1     Running            0                70m   10.131.0.62   worker-0-1   <none>           <none>
The must-gather provided in comment#19 unfortunately does not have the required CRs (the poisonpillremediation). It does feature other poison-pill-related CRs (poisonpillconfig / poisonpillremediationtemplate), which leads me to believe the sought-after CR was already deleted - or never created?...

Would you upload the `poisonpillremediation` CR, manually captured while the issue is happening? @prabinov

Thanks in advance.
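For capturing it manually while the issue is happening, something like this should do (a sketch; the CR is assumed to be namespaced):

oc get poisonpillremediation -A -o yaml > poisonpillremediation.yaml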
poison pill must-gather:
http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/BZ-2068910-pp-must-gather-during-remediation

regular must-gather (includes OVN-K logs):
http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/BZ-2068910-must-gather-during-remediation
From comment#6 I understand the issue can be worked around by deleting all the workloads running on the node - they will then be reconciled and re-created. From comment#9, I understand there's a different remediation policy - `ResourceDeletion` - that will indeed remove the node's workloads instead of the node itself.

Have you tried this self-remediation mechanism with that remediation policy, @prabinov? If so, did it work as expected?
Yes, I checked the ResourceDeletion strategy; it works as expected.
--------
Until 4.11, NodeDeletion was the default remediation strategy, which is why we wanted to support the OVN network type. Because of this issue we decided it would be more reliable to use the ResourceDeletion strategy as the default. However, even though the NodeDeletion strategy doesn't work properly with OVN, we still provide this strategy to the user (though from 4.11 it is no longer the default).
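For reference, a hedged sketch of what selecting the ResourceDeletion strategy might look like. The API group matches the poison-pill annotations seen earlier in this bug, but the exact field path is an assumption (based on the later self-node-remediation API) and should be verified against the installed CRD:

cat <<'EOF' | oc apply -f -
apiVersion: poison-pill.medik8s.io/v1alpha1
kind: PoisonPillRemediationTemplate
metadata:
  name: poison-pill-resource-deletion    # illustrative name
  namespace: openshift-operators
spec:
  template:
    spec:
      remediationStrategy: ResourceDeletion   # instead of the NodeDeletion default
EOF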
must-gather from the poison-pill components while the issue occurs -
http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/BZ-2068910-must-gather-pp-while-the-issue

must-gather from the openshift namespaces after the issue occurs -
http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/BZ-2068910-regular-must-gather-after-the-issue

must-gather from the poison-pill components after the issue occurs -
http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/BZ-2068910-must-gather-pp-after-the-issue

After the remediation, the node (worker-0-0) was deleted and re-created (you can see the AGE of 5m59s):

[kni@provisionhost-0-0 ~]$ oc get nodes
NAME         STATUS   ROLES    AGE     VERSION
master-0-0   Ready    master   7h15m   v1.23.5+3afdacb
master-0-1   Ready    master   7h15m   v1.23.5+3afdacb
master-0-2   Ready    master   7h16m   v1.23.5+3afdacb
worker-0-0   Ready    worker   5m59s   v1.23.5+3afdacb
worker-0-1   Ready    worker   6h51m   v1.23.5+3afdacb
worker-0-2   Ready    worker   6h51m   v1.23.5+3afdacb

And the node was rebooted (you can see by the uptime):

[core@worker-0-0 ~]$ uptime
 20:01:28 up 10 min,  1 user,  load average: 0.11, 0.29, 0.24
An update, to re-scope the bug: it can be seen in the provided must-gathers that the recovered node *does* feature the correct annotations (i.e. the node subnet annotation is the correct one). This data can be seen in [0]; the node subnet annotation features the same CIDR both before and after the node delete / re-add.

The issue is that the re-created node's logical switches *do not* feature the node load balancers created for some services - notably, the kubernetes service [1]. This, in turn, causes the poison-pill pod on that node to crash-loop [2].

Reconciling these load balancers on node deletion (removing them) would force re-creation when the node is added. This is important because the node logical switch is only associated with the node load balancers *when* the service controller re-creates the node load balancers for a service [3].

Does this seem like the right thing to do?

[0] - https://gist.github.com/maiqueb/feead4bfe72f2bd1e9bb9b1eab915ee9#node-after-reboot
[1] - https://gist.github.com/maiqueb/feead4bfe72f2bd1e9bb9b1eab915ee9#check-the-logical-switches-load-balancers
[2] - https://gist.github.com/maiqueb/feead4bfe72f2bd1e9bb9b1eab915ee9#no-connection-to-the-k8s-api
[3] - https://github.com/ovn-org/ovn-kubernetes/blob/0573fe590a6f307200dc61a9cd0a6409db754c3d/go-controller/pkg/ovn/loadbalancer/loadbalancer.go#L118
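A sketch of how the check in [1] can be reproduced on a live cluster; the pod label, the container name, and the assumption that the node's logical switch is named after the node reflect typical OVN-Kubernetes deployments and may differ per build:

# Pick an ovnkube-master pod and list the load balancers on the node's logical switch:
OVNKUBE_POD=$(oc get pods -n openshift-ovn-kubernetes -l app=ovnkube-master -o name | head -1)
oc exec -n openshift-ovn-kubernetes "$OVNKUBE_POD" -c nbdb -- ovn-nbctl ls-lb-list worker-0-0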
We are soon planning to deprecate and remove the Node Deletion strategy, therefore I think this bug can be closed - since it will soon be irrelevant.
(In reply to Michael Shitrit from comment #28)
> We are soon planning to deprecate and remove the Node Deletion strategy,
> therefore I think this bug can be closed - since it will soon be irrelevant.

We have just recently merged code upstream that addresses this bug.

Up to you - we can close it, or merge it into 4.12 (and optionally back-port it to the required releases).

@mshitrit
Thanks! I think it would be great if you merge it into 4.12. At the moment I'm not sure about back-porting, but we're definitely glad that this is an option.
The downstream PR (not yet merged) is https://github.com/openshift/ovn-kubernetes/pull/1205
I ran the remediation process several times in a row, and I see that the pods are still in CrashLoopBackOff:

[kni@provisionhost-0-0 ~]$ oc get pods -n openshift-operators -o wide
NAME                                                            READY   STATUS             RESTARTS      AGE     IP             NODE         NOMINATED NODE   READINESS GATES
node-healthcheck-operator-controller-manager-564685c55f-s9v4x   2/2     Running            0             88m     10.128.0.42    master-0-1   <none>           <none>
poison-pill-controller-manager-7cd87f55-8b644                   2/2     Running            0             17m     10.131.1.135   worker-0-2   <none>           <none>
poison-pill-ds-cs96l                                            1/1     Running            0             15m     10.131.1.137   worker-0-2   <none>           <none>
poison-pill-ds-v9gz8                                            0/1     CrashLoopBackOff   4 (91s ago)   5m38s   10.129.2.9     worker-0-0   <none>           <none>
poison-pill-ds-xqz2x                                            0/1     CrashLoopBackOff   3 (32s ago)   3m42s   10.128.2.4     worker-0-1   <none>           <none>

$ oc describe pod poison-pill-ds-v9gz8 -n openshift-operators
Events:
  Type     Reason          Age                     From               Message
  ----     ------          ----                    ----               -------
  Normal   Scheduled       8m5s                    default-scheduler  Successfully assigned openshift-operators/poison-pill-ds-v9gz8 to worker-0-0 by master-0-2
  Normal   AddedInterface  8m4s                    multus             Add eth0 [10.129.2.9/23] from ovn-kubernetes
  Normal   Pulled          8m2s                    kubelet            Successfully pulled image "quay.io/medik8s/poison-pill-operator:0.1.4" in 2.148536905s
  Normal   Pulled          7m29s                   kubelet            Successfully pulled image "quay.io/medik8s/poison-pill-operator:0.1.4" in 1.757707605s
  Normal   Pulled          6m44s                   kubelet            Successfully pulled image "quay.io/medik8s/poison-pill-operator:0.1.4" in 1.746327879s
  Normal   Created         5m50s (x4 over 8m2s)    kubelet            Created container manager
  Normal   Started         5m50s (x4 over 8m2s)    kubelet            Started container manager
  Normal   Pulled          5m50s                   kubelet            Successfully pulled image "quay.io/medik8s/poison-pill-operator:0.1.4" in 1.809316164s
  Normal   Pulling         4m31s (x5 over 8m4s)    kubelet            Pulling image "quay.io/medik8s/poison-pill-operator:0.1.4"
  Normal   Pulled          4m29s                   kubelet            Successfully pulled image "quay.io/medik8s/poison-pill-operator:0.1.4" in 1.868824057s
  Warning  BackOff         2m54s (x13 over 6m59s)  kubelet            Back-off restarting failed container
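Until the fix lands, the workaround noted earlier in this bug (comment#6) still applies: deleting the crash-looping pods lets them be re-created and return to Running:

oc delete pod -n openshift-operators poison-pill-ds-v9gz8 poison-pill-ds-xqz2x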
Please note the bug was *not* back-ported to 4.11 / 4.10 - as per comment#33, we're waiting on feedback to understand whether we should backport or not.

To verify the bug, you'll need to check on 4.12 - the fix landed in 4.12.0-0.nightly-2022-07-22-194831; any build *after that one* will include the fix we want to get feedback on.
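To confirm a cluster under test is running a build at or after that nightly, a quick check:

oc get clusterversion -o jsonpath='{.status.desired.version}{"\n"}'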
After discussion with Michael, we decided that we do want the back-port to 4.11 & 4.10. We still support the NodeDeletion strategy, so this fix would help us. If it's not too difficult for you and carries no risk, we definitely want the back-port. Thanks for your help!

@mduarted
(In reply to Polina Rabinovich from comment #37)
> After discussion with Michael, we decided that we do want the back-port to
> 4.11 & 4.10. We still support the NodeDeletion strategy, so this fix would
> help us. If it's not too difficult for you and carries no risk, we
> definitely want the back-port. Thanks for your help!
>
> @mduarted

Created bugs and PRs to backport this into 4.11 and 4.10, as requested. The bugs are:
- https://bugzilla.redhat.com/show_bug.cgi?id=2113860
- https://bugzilla.redhat.com/show_bug.cgi?id=2113861

Currently they're blocked because they don't have enough priority to be merged into 4.11 after the feature freeze, meaning we need to wait until 4.11 GAs; only then can we finish the back-ports.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399