Bug 1771738 - [OVN] Missing Readiness probe on OVNKubernetes pods
Summary: [OVN] Missing Readiness probe on OVNKubernetes pods
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.4.0
Assignee: Alexander Constantinescu
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-11-12 20:58 UTC by Anurag saxena
Modified: 2020-02-13 07:43 UTC (History)
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-02-06 20:20:42 UTC
Target Upstream Version:
Flags: anusaxen: needinfo-


Attachments (Terms of Use)
oc describe on running pod on Not Ready node (14.05 KB, text/plain)
2019-11-12 20:58 UTC, Anurag saxena


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-network-operator pull 411 0 'None' closed Bug 1771738: Cherry-pick readinessProbe to 4.3 2020-05-07 14:51:50 UTC

Description Anurag saxena 2019-11-12 20:58:09 UTC
Created attachment 1635501 [details]
oc describe on running pod on Not Ready node

Description of problem: A worker node was turned off and became NotReady within a few seconds. Test pods on that node went into Terminating state, which is expected, but the ovnkube pods under the openshift-ovn-kubernetes project are still Running. I was expecting those pods to land in Error or Terminating state.

Are we handling DaemonSet pod behavior differently from non-DaemonSet pods? I believe this is not specific to OVN clusters but applies in general.

oc describe pod on the ovnkube pod reports "Readiness probe failed".
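One way to see this (the pod name is one of the ovnkube-node pods from the listing in Additional info; the exact command is illustrative, not taken from the attachment):

# Show the readiness probe configuration and any probe-failure events for a pod on the NotReady node
oc -n openshift-ovn-kubernetes describe pod ovnkube-node-tc6ww | grep -i -A 2 readiness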

Version-Release number of selected component (if applicable): 4.3.0-0.nightly-2019-11-12-095307


How reproducible: Always


Steps to Reproduce:
1. Kill any node
2. Check the corresponding ovn pods under openshift-ovn-kubernetes (see the command sketch after these steps)
3.
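A sketch of step 2, scoped to the stopped node (the node name below is the NotReady worker from the listing in Additional info):

# List the OVN pods scheduled on the stopped node
oc -n openshift-ovn-kubernetes get pods -o wide \
  --field-selector spec.nodeName=ip-10-0-131-115.ap-south-1.compute.internal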

Actual results: Pod still in Running state


Expected results: Expecting the pod to land in Error or Terminating state.


Additional info:

$ oc get nodes
NAME                                          STATUS     ROLES    AGE     VERSION
ip-10-0-131-115.ap-south-1.compute.internal   NotReady   worker   3h53m   v1.16.2
ip-10-0-141-227.ap-south-1.compute.internal   Ready      master   4h2m    v1.16.2
ip-10-0-151-119.ap-south-1.compute.internal   Ready      worker   3h53m   v1.16.2
ip-10-0-153-109.ap-south-1.compute.internal   Ready      master   4h2m    v1.16.2
ip-10-0-168-34.ap-south-1.compute.internal    Ready      master   4h2m    v1.16.2
[anusaxen@anusaxen ~]$ oc get pods
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-7jzqv   4/4     Running   5          4h2m
ovnkube-master-9qrmt   4/4     Running   0          4h2m
ovnkube-master-mzgv7   4/4     Running   4          4h2m
ovnkube-node-jglc2     3/3     Running   3          4h2m
ovnkube-node-sz5ss     3/3     Running   4          4h2m
ovnkube-node-tbjhd     3/3     Running   7          4h2m
ovnkube-node-tc6ww     3/3     Running   1          3h54m
ovnkube-node-vnt7x     3/3     Running   2          3h54m
[anusaxen@anusaxen ~]$ oc get pods -o wide
NAME                   READY   STATUS    RESTARTS   AGE     IP             NODE                                          NOMINATED NODE   READINESS GATES
ovnkube-master-7jzqv   4/4     Running   5          4h2m    10.0.141.227   ip-10-0-141-227.ap-south-1.compute.internal   <none>           <none>
ovnkube-master-9qrmt   4/4     Running   0          4h2m    10.0.168.34    ip-10-0-168-34.ap-south-1.compute.internal    <none>           <none>
ovnkube-master-mzgv7   4/4     Running   4          4h2m    10.0.153.109   ip-10-0-153-109.ap-south-1.compute.internal   <none>           <none>
ovnkube-node-jglc2     3/3     Running   3          4h2m    10.0.168.34    ip-10-0-168-34.ap-south-1.compute.internal    <none>           <none>
ovnkube-node-sz5ss     3/3     Running   4          4h2m    10.0.141.227   ip-10-0-141-227.ap-south-1.compute.internal   <none>           <none>
ovnkube-node-tbjhd     3/3     Running   7          4h2m    10.0.153.109   ip-10-0-153-109.ap-south-1.compute.internal   <none>           <none>
ovnkube-node-tc6ww     3/3     Running   1          3h54m   10.0.131.115   ip-10-0-131-115.ap-south-1.compute.internal   <none>           <none>
ovnkube-node-vnt7x     3/3     Running   2          3h54m   10.0.151.119   ip-10-0-151-119.ap-south-1.compute.internal   <none>           <none>
[anusaxen@anusaxen ~]$ oc get clusterversion

Comment 1 Casey Callendrello 2019-11-13 12:35:53 UTC
This is a nice-to-have for 4.3, but I don't consider it a blocking bug.

Alexander, when the higher-priority things have been taken care of, let's think about readiness probes.

Comment 2 Alexander Constantinescu 2019-12-09 22:44:26 UTC
Hi

This bug has a corresponding PR on GitHub, which I have not managed to link here: https://github.com/openshift/cluster-network-operator/pull/414

/Alex

Comment 5 Alexander Constantinescu 2020-02-04 12:19:11 UTC
Hi Anurag

I am not sure what the expected behavior is when you restart/stop a node to test whether the readinessProbe works. The kubelet is what executes all readinessProbes on a node, and if the node is stopped, so is the kubelet, so there is nothing left on that node to run the readinessProbes.

Two things:

- Could you create a new bug and assign it to the node team for a further look? I have reproduced what you noted and saw that all hostNetwork/hostPID pods were still in a "Running" state once the node was stopped, while all normal pods were in a "Terminating" state, just as you noted. They might need to look at why that is, and whether it "works as designed".

- You should test readinessProbes differently. One example: in the ovnkube-node container of any ovnkube-node pod, there is a readinessProbe that checks whether the file /etc/cni/net.d/10-ovn-kubernetes.conf exists; if it doesn't, that container should go not ready. Delete that file in a container and check (see the command sketch after this comment). That should work just fine.

aconstan@linux-3 ~ $ oc rsh -c ovnkube-node ovnkube-node-s8sh4 
sh-4.2# ls -l /etc/cni/net.d/10-ovn-kubernetes.conf
-rw-------. 1 root root 94 Feb  4 09:28 /etc/cni/net.d/10-ovn-kubernetes.conf
sh-4.2# rm /etc/cni/net.d/10-ovn-kubernetes.conf
aconstan@linux-3 ~ $ oc get pod ovnkube-node-s8sh4 
ovnkube-node-s8sh4     1/2     Running   0          115m

In general, executing the readinessProbe is up to the kubelet. What each readinessProbe does (and ensuring it is done correctly) is up to the developer creating that probe. You just need to find a good way to test the failure of the probe, and I think stopping the node is too extreme.

/Alex
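As a reference for the test above, a minimal sketch of how to inspect the probe and watch the container go not ready. It assumes the DaemonSet is named ovnkube-node (inferred from the pod names; the bug does not state it) and reuses the pod name from the transcript:

# Print the readinessProbe definition of the ovnkube-node container (DaemonSet name assumed)
oc -n openshift-ovn-kubernetes get daemonset ovnkube-node \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="ovnkube-node")].readinessProbe}'

# After deleting /etc/cni/net.d/10-ovn-kubernetes.conf in the container, watch the
# READY column drop (2/2 -> 1/2 in the transcript above) once the probe starts failing
oc -n openshift-ovn-kubernetes get pod ovnkube-node-s8sh4 -w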

Comment 7 Weibin Liang 2020-02-06 20:17:55 UTC
(In reply to Alexander Constantinescu from comment #5)
> Hi Anurag
> 
> I am not sure what the expected behavior is when you restart/stop a node
> to test whether the readinessProbe works. The kubelet is what executes all
> readinessProbes on a node, and if the node is stopped, so is the kubelet,
> so there is nothing left on that node to run the readinessProbes.
> 
> Two things:
> 
> - Could you create a new bug and assign it to the node team for a further
> look? I have reproduced what you noted and saw that all hostNetwork/hostPID
> pods were still in a "Running" state once the node was stopped, while all
> normal pods were in a "Terminating" state, just as you noted. They might
> need to look at why that is, and whether it "works as designed".
> 

New bug submitted: Bug 1800136 - Not all pods get Terminating after shutdown the worker node


Comment 8 Weibin Liang 2020-02-06 20:20:42 UTC
Tested in 4.4.0-0.nightly-2020-02-06-131745; the log below shows the readinessProbe working as described in comment #5.

[root@dhcp-41-193 FILE]# oc rsh -c ovnkube-node ovnkube-node-62z9b
sh-4.2# ls -l /etc/cni/net.d/10-ovn-kubernetes.conf
-rw-------. 1 root root 94 Feb  6 18:57 /etc/cni/net.d/10-ovn-kubernetes.conf
sh-4.2# rm /etc/cni/net.d/10-ovn-kubernetes.conf
sh-4.2# exit
exit
[root@dhcp-41-193 FILE]# oc get pods
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-88zm2   4/4     Running   1          75m
ovnkube-master-gjsbx   4/4     Running   0          78m
ovnkube-master-j6976   4/4     Running   0          76m
ovnkube-node-62z9b     2/2     Running   0          77m
ovnkube-node-bpt9q     2/2     Running   0          76m
ovnkube-node-h99q2     2/2     Running   0          77m
ovnkube-node-p6v9c     2/2     Running   0          76m
ovnkube-node-rcq94     2/2     Running   2          76m
ovnkube-node-vh45h     2/2     Running   0          76m
ovs-node-9zs5q         1/1     Running   0          76m
ovs-node-d9c2g         1/1     Running   0          84m
ovs-node-hx6v5         1/1     Running   0          76m
ovs-node-mf68m         1/1     Running   1          76m
ovs-node-rxp97         1/1     Running   0          84m
ovs-node-zc9lr         1/1     Running   0          84m
[root@dhcp-41-193 FILE]# oc get pods ovnkube-node-62z9b
NAME                 READY   STATUS    RESTARTS   AGE
ovnkube-node-62z9b   1/2     Running   0          77m
[root@dhcp-41-193 FILE]#
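If it helps to confirm which container in the pod is the one reporting not ready, a hedged check (this command is an illustration and was not part of the original verification):

# Print per-container readiness for the affected pod
oc -n openshift-ovn-kubernetes get pod ovnkube-node-62z9b \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.ready}{"\n"}{end}'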

