Bug 1769962 - Ovn-kubernetes pods crashing when enabling hybrid networking
Summary: Ovn-kubernetes pods crashing when enabling hybrid networking
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.3.0
Assignee: Dan Williams
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-11-07 19:55 UTC by sumehta
Modified: 2019-12-03 14:28 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-12-03 14:28:11 UTC
Target Upstream Version:
Embargoed:



Description sumehta 2019-11-07 19:55:21 UTC
Description of problem:
To enable hybrid networking with OVNKubernetes, a 4.3 OpenShift cluster was created and the Network operator configuration was changed to enable hybrid networking. After that, the openshift-ovn-kubernetes pods were stuck in CrashLoopBackOff, and deleting them did not resolve the problem.
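
For triage, a sketch of the commands used to inspect the crashing pods (the pod name placeholder is illustrative):

```
oc get pods -n openshift-ovn-kubernetes
# Fetch logs from the previous (crashed) container instance:
oc logs -p <ovnkube-node-pod> -c ovnkube-node -n openshift-ovn-kubernetes
```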


Version-Release number of selected component (if applicable):
OpenShift installer and client version 4.3.0-0.ci-2019-11-07-124057

How reproducible:
Always

Steps to Reproduce:
1. Bring up an OpenShift 4.3 cluster with the network type set to OVNKubernetes
2. Run oc edit Network.operator.openshift.io cluster and update the defaultNetwork section to look like the following (a non-interactive equivalent is sketched after step 3):
  defaultNetwork:
    ovnKubernetesConfig:
      hybridOverlayConfig:
        hybridClusterNetwork:
        - cidr: 10.132.0.0/14
          hostPrefix: 23
    type: OVNKubernetes

3. Check the status of the pods in the openshift-ovn-kubernetes namespace
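
For reference, a non-interactive sketch of steps 2 and 3, using the same CIDR and hostPrefix values as above:

```
# Merge the hybrid overlay config into the cluster network operator config:
oc patch network.operator.openshift.io cluster --type=merge -p \
  '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"hybridOverlayConfig":{"hybridClusterNetwork":[{"cidr":"10.132.0.0/14","hostPrefix":23}]}}}}}'

# Watch the OVN pods roll over after the change:
oc get pods -n openshift-ovn-kubernetes -w
```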

Actual results:
Pods in CrashLoopBackOff or Error state and constantly restarting

Expected results:
All pods in Running state

Additional info:

Comment 1 Dan Williams 2019-11-10 02:48:45 UTC
We believe the issue is fixed by https://github.com/openshift/ovn-kubernetes/pull/57

Comment 2 Casey Callendrello 2019-11-15 15:49:49 UTC
To Anurag: can you let us know if hybrid Windows networking is working?

Comment 3 Anurag saxena 2019-11-15 16:11:01 UTC
Currently blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1771083

Comment 4 Anurag saxena 2019-11-15 16:12:45 UTC
(In reply to Anurag saxena from comment #3)
> Currently blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1771083

Also have to check on the latest build. Will let you know, thanks.

Comment 6 Anurag saxena 2019-11-15 21:15:59 UTC
This seems okay to me on 4.3.0-0.nightly-2019-11-13-233341

# oc get Network.operator.openshift.io cluster -oyaml
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2019-11-15T20:39:55Z"
  generation: 2
  name: cluster
  resourceVersion: "20551"
  selfLink: /apis/operator.openshift.io/v1/networks/cluster
  uid: 28f18360-7121-4315-a653-58a03e8e0868
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  defaultNetwork:
    ovnKubernetesConfig:
      hybridOverlayConfig:
        hybridClusterNetwork:
        - cidr: 10.132.0.0/14
          hostPrefix: 23
    type: OVNKubernetes
  logLevel: ""
  serviceNetwork:
  - 172.30.0.0/16
status: {}

[root@localhost fedora]# oc get pods -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-65gj4   4/4     Running   0          3m9s
ovnkube-master-6w72z   4/4     Running   0          3m26s
ovnkube-master-r8snk   4/4     Running   0          2m48s
ovnkube-node-886pd     3/3     Running   0          2m39s
ovnkube-node-bpzbg     3/3     Running   0          3m3s
ovnkube-node-dddcg     3/3     Running   0          2m16s
ovnkube-node-mz4zt     3/3     Running   1          3m26s
ovnkube-node-v76nb     3/3     Running   0          113s

[root@localhost fedora]# oc get co | awk '{print $4,$5}' | grep True    <<< no output: all operators seem okay (none Progressing or Degraded)

[root@localhost fedora]# oc get ds -n openshift-ovn-kubernetes
NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                 AGE
ovnkube-master   3         3         3       3            3           beta.kubernetes.io/os=linux,node-role.kubernetes.io/master=   33m
ovnkube-node     5         5         5       5            5           beta.kubernetes.io/os=linux                                   33m

Comment 7 Aravindh Puthiyaparambil 2019-11-15 22:30:18 UTC
@dcbw we are seeing issues with 4.3.0-0.ci-2019-11-15-172916 on Azure

On applying the following to the network.operator cluster object:
```
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2019-11-15T20:14:46Z"
  generation: 2
  name: cluster
  resourceVersion: "41906"
  selfLink: /apis/operator.openshift.io/v1/networks/cluster
  uid: 7952352d-11c2-4233-80b4-d479117bb454
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  defaultNetwork:
    ovnKubernetesConfig:
      hybridOverlayConfig:
        hybridClusterNetwork:
        - cidr: 10.132.0.0/14
          hostPrefix: 23
    type: OVNKubernetes
  logLevel: ""
  serviceNetwork:
  - 172.30.0.0/16
status: {}
```
one of the ovnkube-node pods goes into CrashLoopBackOff:

# oc logs -f ovnkube-node-msdpv -c ovnkube-node -n openshift-ovn-kubernetes
+ [[ -f /env/aravindh-winc-2fn76-master-2 ]]
+ cp -f /usr/libexec/cni/ovn-k8s-cni-overlay /cni-bin-dir/
+ ovn_config_namespace=openshift-ovn-kubernetes
+ retries=0
+ true
++ kubectl get ep -n openshift-ovn-kubernetes ovnkube-db -o 'jsonpath={.subsets[0].addresses[0].ip}'
+ db_ip=10.0.0.4
+ [[ -n 10.0.0.4 ]]
+ break
+ hybrid_overlay_flags=
+ [[ -n true ]]
+ hybrid_overlay_flags=--enable-hybrid-overlay
+ [[ -n 10.132.0.0/14 ]]
+ hybrid_overlay_flags='--enable-hybrid-overlay --hybrid-overlay-cluster-subnets=10.132.0.0/14'
+ OVN_NODES_ARRAY=(aravindh-winc-2fn76-master-0 aravindh-winc-2fn76-master-1 aravindh-winc-2fn76-master-2)
+ nb_addr_list=
+ sb_addr_list=
+ for i in '"${!OVN_NODES_ARRAY[@]}"'
+ [[ 0 != 0 ]]
++ getent ahostsv4 aravindh-winc-2fn76-master-0
++ grep RAW
++ awk '{print $1}'
+ host=10.0.0.5
+ nb_addr_list=ssl://10.0.0.5:9641
+ sb_addr_list=ssl://10.0.0.5:9642
+ for i in '"${!OVN_NODES_ARRAY[@]}"'
+ [[ 1 != 0 ]]
+ nb_addr_list=ssl://10.0.0.5:9641,
+ sb_addr_list=ssl://10.0.0.5:9642,
++ getent ahostsv4 aravindh-winc-2fn76-master-1
++ grep RAW
++ awk '{print $1}'
+ host=10.0.0.6
+ nb_addr_list=ssl://10.0.0.5:9641,ssl://10.0.0.6:9641
+ sb_addr_list=ssl://10.0.0.5:9642,ssl://10.0.0.6:9642
+ for i in '"${!OVN_NODES_ARRAY[@]}"'
+ [[ 2 != 0 ]]
+ nb_addr_list=ssl://10.0.0.5:9641,ssl://10.0.0.6:9641,
+ sb_addr_list=ssl://10.0.0.5:9642,ssl://10.0.0.6:9642,
++ getent ahostsv4 aravindh-winc-2fn76-master-2
++ grep RAW
++ awk '{print $1}'
+ host=10.0.0.4
+ nb_addr_list=ssl://10.0.0.5:9641,ssl://10.0.0.6:9641,ssl://10.0.0.4:9641
+ sb_addr_list=ssl://10.0.0.5:9642,ssl://10.0.0.6:9642,ssl://10.0.0.4:9642
+ exec /usr/bin/ovnkube --init-node aravindh-winc-2fn76-master-2 --cluster-subnets 10.128.0.0/14/23 --k8s-service-cidr 172.30.0.0/16 --k8s-apiserver https://api-int.aravindh-winc.winc.azure.devcluster.openshift.com:6443 --ovn-config-namespace openshift-ovn-kubernetes --nb-address ssl://10.0.0.5:9641,ssl://10.0.0.6:9641,ssl://10.0.0.4:9641 --sb-address ssl://10.0.0.5:9642,ssl://10.0.0.6:9642,ssl://10.0.0.4:9642 --nb-client-privkey /ovn-cert/tls.key --nb-client-cert /ovn-cert/tls.crt --nb-client-cacert /ovn-ca/ca-bundle.crt --sb-client-privkey /ovn-cert/tls.key --sb-client-cert /ovn-cert/tls.crt --sb-client-cacert /ovn-ca/ca-bundle.crt --nodeport --gateway-mode local --enable-hybrid-overlay --hybrid-overlay-cluster-subnets=10.132.0.0/14 --pidfile /var/run/openvswitch/ovnkube-node.pid --loglevel 4 --logfile /dev/stdout --metrics-bind-address 0.0.0.0:9101
E1115 22:18:31.957597  268074 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Endpoints: Get https://api-int.aravindh-winc.winc.azure.devcluster.openshift.com:6443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.0.0.8:6443: i/o timeout
E1115 22:18:31.958403  268074 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Namespace: Get https://api-int.aravindh-winc.winc.azure.devcluster.openshift.com:6443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 10.0.0.8:6443: i/o timeout
time="2019-11-15T22:18:33Z" level=info msg="Node aravindh-winc-2fn76-master-2 ready for ovn initialization with subnet 10.129.0.0/23"
time="2019-11-15T22:18:33Z" level=info msg="node aravindh-winc-2fn76-master-2 connection status = connected"
time="2019-11-15T22:18:33Z" level=info msg="Opening healthcheck \"openshift-ssh-bastion/ssh-bastion\" on port 32252"
time="2019-11-15T22:18:33Z" level=info msg="Opening healthcheck \"openshift-ingress/router-default\" on port 31464"
time="2019-11-15T22:18:33Z" level=info msg="Setting annotations map[k8s.ovn.org/node-gateway-iface-id:br-local_aravindh-winc-2fn76-master-2 k8s.ovn.org/node-gateway-ip:169.254.33.2/24 k8s.ovn.org/node-gateway-mac-address:3e:45:60:65:9c:4a k8s.ovn.org/node-gateway-mode:local k8s.ovn.org/node-gateway-next-hop:169.254.33.1 k8s.ovn.org/node-gateway-vlan-id:\x00 k8s.ovn.org/node-mgmt-port-mac-address:ee:d3:dc:aa:a6:d5] on node aravindh-winc-2fn76-master-2"
time="2019-11-15T22:18:33Z" level=error msg="Error while obtaining gateway router addresses for aravindh-winc-2fn76-master-2 - OVN command '/usr/bin/ovn-nbctl --private-key=/ovn-cert/tls.key --certificate=/ovn-cert/tls.crt --bootstrap-ca-cert=/ovn-ca/ca-bundle.crt --db=ssl:10.0.0.5:9641,ssl:10.0.0.6:9641,ssl:10.0.0.4:9641 --timeout=15 lsp-get-addresses etor-GR_aravindh-winc-2fn76-master-2' failed: exit status 1"
time="2019-11-15T22:18:33Z" level=fatal msg="Timeout error while obtaining addresses for k8s-aravindh-winc-2fn76-master-2 (OVN command '/usr/bin/ovn-nbctl --private-key=/ovn-cert/tls.key --certificate=/ovn-cert/tls.crt --bootstrap-ca-cert=/ovn-ca/ca-bundle.crt --db=ssl:10.0.0.5:9641,ssl:10.0.0.6:9641,ssl:10.0.0.4:9641 --timeout=15 lsp-get-addresses etor-GR_aravindh-winc-2fn76-master-2' failed: exit status 1)"


The corresponding node also goes NotReady:

~ oc get nodes
NAME                                          STATUS     ROLES    AGE    VERSION
aravindh-winc-2fn76-master-0                  Ready      master   129m   v1.16.2
aravindh-winc-2fn76-master-1                  Ready      master   129m   v1.16.2
aravindh-winc-2fn76-master-2                  NotReady   master   129m   v1.16.2
aravindh-winc-2fn76-worker-centralus1-ncftp   Ready      worker   120m   v1.16.2
aravindh-winc-2fn76-worker-centralus2-ntjmf   Ready      worker   120m   v1.16.2
aravindh-winc-2fn76-worker-centralus3-mkptf   Ready      worker   120m   v1.16.2

Checking the kubelet logs on the node shows:
Nov 15 22:24:26 aravindh-winc-2fn76-master-2 hyperkube[2777]: E1115 22:24:26.728808    2777 kubelet.go:2195] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network
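
To check by hand whether the gateway router port exists in the NB database, the failing ovn-nbctl command from the log above can be rerun from one of the ovnkube-master pods. A sketch (the pod and container names here are assumptions; substitute real ones from your cluster):

```
oc -n openshift-ovn-kubernetes exec <ovnkube-master-pod> -c ovnkube-master -- \
  /usr/bin/ovn-nbctl --private-key=/ovn-cert/tls.key \
  --certificate=/ovn-cert/tls.crt --bootstrap-ca-cert=/ovn-ca/ca-bundle.crt \
  --db=ssl:10.0.0.5:9641,ssl:10.0.0.6:9641,ssl:10.0.0.4:9641 \
  lsp-get-addresses etor-GR_aravindh-winc-2fn76-master-2
# A non-zero exit (as in the log) means the etor-GR_* port was never created.
```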

Comment 8 sumehta 2019-11-15 22:31:24 UTC
(In reply to Anurag saxena from comment #6)
> This seems okay to me on 4.3.0-0.nightly-2019-11-13-233341
> 
> # oc get Network.operator.openshift.io cluster -oyaml
> apiVersion: operator.openshift.io/v1
> kind: Network
> metadata:
>   creationTimestamp: "2019-11-15T20:39:55Z"
>   generation: 2
>   name: cluster
>   resourceVersion: "20551"
>   selfLink: /apis/operator.openshift.io/v1/networks/cluster
>   uid: 28f18360-7121-4315-a653-58a03e8e0868
> spec:
>   clusterNetwork:
>   - cidr: 10.128.0.0/14
>     hostPrefix: 23
>   defaultNetwork:
>     ovnKubernetesConfig:
>       hybridOverlayConfig:
>         hybridClusterNetwork:
>         - cidr: 10.132.0.0/14
>           hostPrefix: 23
>     type: OVNKubernetes
>   logLevel: ""
>   serviceNetwork:
>   - 172.30.0.0/16
> status: {}
> 
> [root@localhost fedora]# oc get pods -n openshift-ovn-kubernetes
> NAME                   READY   STATUS    RESTARTS   AGE
> ovnkube-master-65gj4   4/4     Running   0          3m9s
> ovnkube-master-6w72z   4/4     Running   0          3m26s
> ovnkube-master-r8snk   4/4     Running   0          2m48s
> ovnkube-node-886pd     3/3     Running   0          2m39s
> ovnkube-node-bpzbg     3/3     Running   0          3m3s
> ovnkube-node-dddcg     3/3     Running   0          2m16s
> ovnkube-node-mz4zt     3/3     Running   1          3m26s
> ovnkube-node-v76nb     3/3     Running   0          113s
> 
> [root@localhost fedora]# oc get co | awk '{print $4,$5}' | grep True   
> <<< no output: all operators seem okay (none Progressing or Degraded)
> 
> [root@localhost fedora]# oc get ds -n openshift-ovn-kubernetes
> NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE
> SELECTOR                                                 AGE
> ovnkube-master   3         3         3       3            3          
> beta.kubernetes.io/os=linux,node-role.kubernetes.io/master=   33m
> ovnkube-node     5         5         5       5            5          
> beta.kubernetes.io/os=linux                                   33m


The nightly version used here (4.3.0-0.nightly-2019-11-13-233341) does not reflect the latest changes to OVN-Kubernetes and the cluster-network-operator.
Nightlies built tonight should have them.

Comment 9 Anurag saxena 2019-11-15 23:15:26 UTC
Debugged Aravindh's cluster.

It seems the CNI config could not be written on that node for some reason:

$ oc debug node/aravindh-winc-2fn76-master-2 -- chroot /host ls -l /var/run/multus/cni/net.d/
Starting pod/aravindh-winc-2fn76-master-2-debug ...
To use host binaries, run chroot /host
total 0
Removing debug pod ...

While on a Ready node it is fine:
$ oc debug node/aravindh-winc-2fn76-master-0 -- chroot /host ls -l /var/run/multus/cni/net.d/
Starting pod/aravindh-winc-2fn76-master-0-debug ...
To use host binaries, run chroot /host
total 4
-rw-------. 1 root root 94 Nov 15 20:16 10-ovn-kubernetes.conf
Removing debug pod ...

I believe the CNI config is only written on the node once the controller finishes one successful iteration. Hence some issues were noticed with the corresponding node container (ovnkube-node) too.
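
A sketch to run the same CNI config check across every node at once (assuming the same multus path as above):

```
for n in $(oc get nodes -o name); do
  echo "== ${n}"
  # An empty directory here matches the NotReady symptom above
  oc debug "${n}" -- chroot /host ls -l /var/run/multus/cni/net.d/
done
```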

Comment 10 Anurag saxena 2019-11-16 03:47:18 UTC
Seems like we had a glitch in the previous few builds; the Azure cluster has not been coming up for me on the last 2-3 builds.
I would suggest trying 4.3.0-0.nightly-2019-11-15-213610 or later, which I did on Azure, and I am seeing no issues. As Dan Williams also mentioned, a lot of fixes went in.


# oc get networks.operator.openshift.io cluster -oyaml
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2019-11-16T03:10:11Z"
  generation: 2
  name: cluster
  resourceVersion: "20328"
  selfLink: /apis/operator.openshift.io/v1/networks/cluster
  uid: b45923b9-10ab-45da-a483-e1024b70e663
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  defaultNetwork:
    ovnKubernetesConfig:
      hybridOverlayConfig:
        hybridClusterNetwork:
        - cidr: 10.132.0.0/14
          hostPrefix: 23
    type: OVNKubernetes
  logLevel: ""
  serviceNetwork:
  - 172.30.0.0/16
status: {}
[root@localhost fedora]# oc get pods
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-8hwdd   4/4     Running   0          3m46s
ovnkube-master-l9nrs   4/4     Running   0          3m56s
ovnkube-master-vrrt8   4/4     Running   0          4m3s
ovnkube-node-68gns     3/3     Running   0          2m33s
ovnkube-node-9fb7g     3/3     Running   0          4m7s
ovnkube-node-s2b92     3/3     Running   0          77s
ovnkube-node-vnvvg     3/3     Running   0          3m25s
ovnkube-node-xpsz9     3/3     Running   0          104s

Comment 11 Anurag saxena 2019-11-18 21:09:26 UTC
This needs more investigation once the Azure environment blocker for openshift-qe is resolved: https://bugzilla.redhat.com/show_bug.cgi?id=1773676
I will let it stay ON_QA until the fix for the referenced test blocker makes its way in.

Comment 15 Anurag saxena 2019-11-19 03:22:38 UTC
Thanks Aravindh. 

@Dan Williams, it seems this issue is caused by a race condition that was fixed in https://github.com/openshift/ovn-kubernetes/pull/57 and merged into 4.3.0-0.nightly-2019-11-15-213610, but Suhani reproduced the issue on the same build today, which suggests it might still pop up intermittently. I say intermittently because I haven't yet seen it in my env.

Suhani had to follow the same steps (oc edit the node and remove the k8s.ovn.org annotations; a sketch follows) to bring the cluster back to a stable state.
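
For reference, a sketch of that workaround using oc annotate rather than oc edit (the annotation keys are taken from the log in comment 7; the node name is illustrative; a trailing '-' removes the annotation):

```
oc annotate node <node-name> \
  k8s.ovn.org/node-gateway-iface-id- \
  k8s.ovn.org/node-gateway-ip- \
  k8s.ovn.org/node-gateway-mac-address- \
  k8s.ovn.org/node-gateway-mode- \
  k8s.ovn.org/node-gateway-next-hop- \
  k8s.ovn.org/node-gateway-vlan-id- \
  k8s.ovn.org/node-mgmt-port-mac-address-
```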

Comment 16 Anurag saxena 2019-11-25 15:37:10 UTC
From talking with Phil Cameron a bit in person, this seems like a race condition that might be hardware/environment specific (CPU, clocks, etc.). The reason I am not able to reproduce it even a single time while Suhani/Aravindh reproduce it consistently is the difference in our environments and how we spin up clusters. I would request dev to debug in their env if required.

Comment 17 sumehta 2019-11-25 16:49:26 UTC
(In reply to Anurag saxena from comment #16)
> From talking with Phil Cameron a bit in person, this seems like a race
> condition that might be hardware/environment specific (CPU, clocks, etc.).
> The reason I am not able to reproduce it even a single time while
> Suhani/Aravindh reproduce it consistently is the difference in our
> environments and how we spin up clusters. I would request dev to debug in
> their env if required.

Can you give us more information about how it is hardware env specific if it is happening on Azure?
Also, are you editing the network config before or after bringing up the cluster?

Comment 18 Anurag saxena 2019-11-25 18:34:23 UTC
>>Can you give us more information about how it is hardware env specific if it is happening on Azure?
To make it a more apples-to-apples comparison, can you tell me what your master and worker instance types are?

In my case,

master: 'Standard_DS3_v2'
worker: 'Standard_DS2_v2' 

>>Also, are you editing the network config before or after bringing up the cluster?
After bringing up the cluster

Comment 19 Anurag saxena 2019-11-25 18:36:17 UTC
Sorry, it is 'Standard_DS4_v2' for both master and worker in our env.

Comment 20 sumehta 2019-11-25 18:43:27 UTC
(In reply to Anurag saxena from comment #18)
> >>Can you give us more information about how it is hardware env specific if it is happening on Azure?
> To make it a more apples-to-apples comparison, can you tell me what your
> master and worker instance types are?
> 
> In my case,
> 
> master: 'Standard_DS3_v2'
> worker: 'Standard_DS2_v2' 
> 
> >>Also, are you editing the network config before or after bringing up the cluster?
> After bringing up the cluster

Master: 'Standard D8s v3 (8 vcpus, 32 GiB memory)'
Worker: 'Standard D2s v3 (2 vcpus, 8 GiB memory)'

Comment 21 Anurag saxena 2019-11-25 18:50:31 UTC
(In reply to sumehta from comment #20)
> (In reply to Anurag saxena from comment #18)
> > >>Can you give us more information about how it is hardware env specific if it is happening on Azure?
> > To make it a more apples-to-apples comparison, can you tell me what your
> > master and worker instance types are?
> > 
> > In my case,
> > 
> > master: 'Standard_DS3_v2'
> > worker: 'Standard_DS2_v2' 
> > 
> > >>Also, are you editing the network config before or after bringing up the cluster?
> > After bringing up the cluster
> 
> Master: 'Standard D8s v3 (8 vcpus, 32 GiB memory)'
> Worker: 'Standard D2s v3 (2 vcpus, 8 GiB memory)'

Thanks. Let me give that set of config a shot.

Comment 22 Anurag saxena 2019-11-25 21:40:11 UTC
This is not reproducible in my case on D8s v3 and D2s v2 instances either.

Steps I took:

1) Brought up a cluster with networkType: OVNKubernetes
2) Once it came up, I checked the nodes and the openshift-ovn-kubernetes pods' status, which seemed good
3) Went ahead and edited networks.operator.openshift.io cluster and added the overlay config:

  defaultNetwork:
    ovnKubernetesConfig:
      hybridOverlayConfig:
        hybridClusterNetwork:
        - cidr: 10.132.0.0/14
          hostPrefix: 23
    type: OVNKubernetes

4) Once the ovnkube daemonsets rolled out completely (see the sketch below), pods and nodes were back in Running/Ready status
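
For reference, the rollout in step 4 can be watched with something like the following sketch:

```
oc -n openshift-ovn-kubernetes rollout status ds/ovnkube-node
oc -n openshift-ovn-kubernetes rollout status ds/ovnkube-master
oc get nodes   # all nodes should report Ready once the rollout completes
```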

Comment 23 Jacob Tanenbaum 2019-11-26 20:42:39 UTC
I reproduced this with openshift-install-linux-4.3.0-0.nightly-2019-11-25-153929

I used the steps outlined above.

Comment 24 sumehta 2019-11-27 18:59:28 UTC
We have tested with newer CI and nightly builds of 4.3 after the latest commits went in on Nov 21st, and we no longer see this issue on Azure.

