Description of problem:
To enable hybrid networking with OVNKubernetes, a 4.3 OpenShift cluster was created and the Network operator configuration was changed to enable hybrid networking. After that, the openshift-ovn-kubernetes pods went into CrashLoopBackOff state, and deleting them didn't solve the problem.

Version-Release number of selected component (if applicable):
OpenShift installer and client version 4.3.0-0.ci-2019-11-07-124057

How reproducible:
Always

Steps to Reproduce:
1. Bring up an OpenShift 4.3 cluster with network type OVNKubernetes.
2. oc edit Network.operator.openshift.io cluster so that defaultNetwork looks like the following (see the sketch after this description for a non-interactive equivalent):

  defaultNetwork:
    ovnKubernetesConfig:
      hybridOverlayConfig:
        hybridClusterNetwork:
        - cidr: 10.132.0.0/14
          hostPrefix: 23
    type: OVNKubernetes

3. Check the status of the pods in openshift-ovn-kubernetes.

Actual results:
Pods in CrashLoopBackOff or Error state and constantly restarting.

Expected results:
All pods in Running state.

Additional info:
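For anyone who prefers not to use an interactive edit, the same change from step 2 can probably be applied as a merge patch. This is only a sketch using the cidr/hostPrefix values from the steps above, not something taken from this bug's testing:

```
# Sketch: apply the hybridOverlayConfig from step 2 non-interactively.
# Values are copied from the Steps to Reproduce; adjust them for your cluster.
oc patch Network.operator.openshift.io cluster --type=merge -p \
  '{"spec":{"defaultNetwork":{"type":"OVNKubernetes","ovnKubernetesConfig":{"hybridOverlayConfig":{"hybridClusterNetwork":[{"cidr":"10.132.0.0/14","hostPrefix":23}]}}}}}'
```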
We believe the issue is fixed by https://github.com/openshift/ovn-kubernetes/pull/57
To Anurag - can you let us know if hybrid networking with Windows is working?
Currently blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1771083
(In reply to Anurag saxena from comment #3)
> Currently blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1771083

Also have to check on the latest build. Will let you know, thanks.
This seems okay to me on 4.3.0-0.nightly-2019-11-13-233341

# oc get Network.operator.openshift.io cluster -oyaml
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2019-11-15T20:39:55Z"
  generation: 2
  name: cluster
  resourceVersion: "20551"
  selfLink: /apis/operator.openshift.io/v1/networks/cluster
  uid: 28f18360-7121-4315-a653-58a03e8e0868
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  defaultNetwork:
    ovnKubernetesConfig:
      hybridOverlayConfig:
        hybridClusterNetwork:
        - cidr: 10.132.0.0/14
          hostPrefix: 23
    type: OVNKubernetes
  logLevel: ""
  serviceNetwork:
  - 172.30.0.0/16
status: {}

[root@localhost fedora]# oc get pods -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-65gj4   4/4     Running   0          3m9s
ovnkube-master-6w72z   4/4     Running   0          3m26s
ovnkube-master-r8snk   4/4     Running   0          2m48s
ovnkube-node-886pd     3/3     Running   0          2m39s
ovnkube-node-bpzbg     3/3     Running   0          3m3s
ovnkube-node-dddcg     3/3     Running   0          2m16s
ovnkube-node-mz4zt     3/3     Running   1          3m26s
ovnkube-node-v76nb     3/3     Running   0          113s

[root@localhost fedora]# oc get co | awk '{print $4,$5}' | grep True   <<< all operators seems okay

[root@localhost fedora]# oc get ds -n openshift-ovn-kubernetes
NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                  AGE
ovnkube-master   3         3         3       3            3           beta.kubernetes.io/os=linux,node-role.kubernetes.io/master=   33m
ovnkube-node     5         5         5       5            5           beta.kubernetes.io/os=linux                                   33m
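Side note: the awk/grep one-liner above only flags operators showing True in the PROGRESSING/DEGRADED columns. A minimal alternative sketch, assuming the default `oc get clusteroperators` column order (NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE), that prints only unhealthy operators:

```
# Sketch: list any ClusterOperator that is not Available, or is Progressing/Degraded.
# Assumes the default column layout of `oc get clusteroperators`.
oc get clusteroperators --no-headers | awk '$3!="True" || $4!="False" || $5!="False"'
```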
@dcbw we are seeing issues with 4.3.0-0.ci-2019-11-15-172916 on Azure.

On applying the following to the network.operator cluster object:

```
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2019-11-15T20:14:46Z"
  generation: 2
  name: cluster
  resourceVersion: "41906"
  selfLink: /apis/operator.openshift.io/v1/networks/cluster
  uid: 7952352d-11c2-4233-80b4-d479117bb454
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  defaultNetwork:
    ovnKubernetesConfig:
      hybridOverlayConfig:
        hybridClusterNetwork:
        - cidr: 10.132.0.0/14
          hostPrefix: 23
    type: OVNKubernetes
  logLevel: ""
  serviceNetwork:
  - 172.30.0.0/16
status: {}
```

one of the ovnkube-node pods goes into CrashLoopBackOff:

# oc logs -f ovnkube-node-msdpv -c ovnkube-node -n openshift-ovn-kubernetes
+ [[ -f /env/aravindh-winc-2fn76-master-2 ]]
+ cp -f /usr/libexec/cni/ovn-k8s-cni-overlay /cni-bin-dir/
+ ovn_config_namespace=openshift-ovn-kubernetes
+ retries=0
+ true
++ kubectl get ep -n openshift-ovn-kubernetes ovnkube-db -o 'jsonpath={.subsets[0].addresses[0].ip}'
+ db_ip=10.0.0.4
+ [[ -n 10.0.0.4 ]]
+ break
+ hybrid_overlay_flags=
+ [[ -n true ]]
+ hybrid_overlay_flags=--enable-hybrid-overlay
+ [[ -n 10.132.0.0/14 ]]
+ hybrid_overlay_flags='--enable-hybrid-overlay --hybrid-overlay-cluster-subnets=10.132.0.0/14'
+ OVN_NODES_ARRAY=(aravindh-winc-2fn76-master-0 aravindh-winc-2fn76-master-1 aravindh-winc-2fn76-master-2)
+ nb_addr_list=
+ sb_addr_list=
+ for i in '"${!OVN_NODES_ARRAY[@]}"'
+ [[ 0 != 0 ]]
++ getent ahostsv4 aravindh-winc-2fn76-master-0
++ grep RAW
++ awk '{print $1}'
+ host=10.0.0.5
+ nb_addr_list=ssl://10.0.0.5:9641
+ sb_addr_list=ssl://10.0.0.5:9642
+ for i in '"${!OVN_NODES_ARRAY[@]}"'
+ [[ 1 != 0 ]]
+ nb_addr_list=ssl://10.0.0.5:9641,
+ sb_addr_list=ssl://10.0.0.5:9642,
++ getent ahostsv4 aravindh-winc-2fn76-master-1
++ grep RAW
++ awk '{print $1}'
+ host=10.0.0.6
+ nb_addr_list=ssl://10.0.0.5:9641,ssl://10.0.0.6:9641
+ sb_addr_list=ssl://10.0.0.5:9642,ssl://10.0.0.6:9642
+ for i in '"${!OVN_NODES_ARRAY[@]}"'
+ [[ 2 != 0 ]]
+ nb_addr_list=ssl://10.0.0.5:9641,ssl://10.0.0.6:9641,
+ sb_addr_list=ssl://10.0.0.5:9642,ssl://10.0.0.6:9642,
++ getent ahostsv4 aravindh-winc-2fn76-master-2
++ grep RAW
++ awk '{print $1}'
+ host=10.0.0.4
+ nb_addr_list=ssl://10.0.0.5:9641,ssl://10.0.0.6:9641,ssl://10.0.0.4:9641
+ sb_addr_list=ssl://10.0.0.5:9642,ssl://10.0.0.6:9642,ssl://10.0.0.4:9642
+ exec /usr/bin/ovnkube --init-node aravindh-winc-2fn76-master-2 --cluster-subnets 10.128.0.0/14/23 --k8s-service-cidr 172.30.0.0/16 --k8s-apiserver https://api-int.aravindh-winc.winc.azure.devcluster.openshift.com:6443 --ovn-config-namespace openshift-ovn-kubernetes --nb-address ssl://10.0.0.5:9641,ssl://10.0.0.6:9641,ssl://10.0.0.4:9641 --sb-address ssl://10.0.0.5:9642,ssl://10.0.0.6:9642,ssl://10.0.0.4:9642 --nb-client-privkey /ovn-cert/tls.key --nb-client-cert /ovn-cert/tls.crt --nb-client-cacert /ovn-ca/ca-bundle.crt --sb-client-privkey /ovn-cert/tls.key --sb-client-cert /ovn-cert/tls.crt --sb-client-cacert /ovn-ca/ca-bundle.crt --nodeport --gateway-mode local --enable-hybrid-overlay --hybrid-overlay-cluster-subnets=10.132.0.0/14 --pidfile /var/run/openvswitch/ovnkube-node.pid --loglevel 4 --logfile /dev/stdout --metrics-bind-address 0.0.0.0:9101
E1115 22:18:31.957597 268074 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Endpoints: Get https://api-int.aravindh-winc.winc.azure.devcluster.openshift.com:6443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.0.0.8:6443: i/o timeout
E1115 22:18:31.958403 268074 reflector.go:126] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Namespace: Get https://api-int.aravindh-winc.winc.azure.devcluster.openshift.com:6443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 10.0.0.8:6443: i/o timeout
time="2019-11-15T22:18:33Z" level=info msg="Node aravindh-winc-2fn76-master-2 ready for ovn initialization with subnet 10.129.0.0/23"
time="2019-11-15T22:18:33Z" level=info msg="node aravindh-winc-2fn76-master-2 connection status = connected"
time="2019-11-15T22:18:33Z" level=info msg="Opening healthcheck \"openshift-ssh-bastion/ssh-bastion\" on port 32252"
time="2019-11-15T22:18:33Z" level=info msg="Opening healthcheck \"openshift-ingress/router-default\" on port 31464"
time="2019-11-15T22:18:33Z" level=info msg="Setting annotations map[k8s.ovn.org/node-gateway-iface-id:br-local_aravindh-winc-2fn76-master-2 k8s.ovn.org/node-gateway-ip:169.254.33.2/24 k8s.ovn.org/node-gateway-mac-address:3e:45:60:65:9c:4a k8s.ovn.org/node-gateway-mode:local k8s.ovn.org/node-gateway-next-hop:169.254.33.1 k8s.ovn.org/node-gateway-vlan-id:\x00 k8s.ovn.org/node-mgmt-port-mac-address:ee:d3:dc:aa:a6:d5] on node aravindh-winc-2fn76-master-2"
time="2019-11-15T22:18:33Z" level=error msg="Error while obtaining gateway router addresses for aravindh-winc-2fn76-master-2 - OVN command '/usr/bin/ovn-nbctl --private-key=/ovn-cert/tls.key --certificate=/ovn-cert/tls.crt --bootstrap-ca-cert=/ovn-ca/ca-bundle.crt --db=ssl:10.0.0.5:9641,ssl:10.0.0.6:9641,ssl:10.0.0.4:9641 --timeout=15 lsp-get-addresses etor-GR_aravindh-winc-2fn76-master-2' failed: exit status 1"
time="2019-11-15T22:18:33Z" level=fatal msg="Timeout error while obtaining addresses for k8s-aravindh-winc-2fn76-master-2 (OVN command '/usr/bin/ovn-nbctl --private-key=/ovn-cert/tls.key --certificate=/ovn-cert/tls.crt --bootstrap-ca-cert=/ovn-ca/ca-bundle.crt --db=ssl:10.0.0.5:9641,ssl:10.0.0.6:9641,ssl:10.0.0.4:9641 --timeout=15 lsp-get-addresses etor-GR_aravindh-winc-2fn76-master-2' failed: exit status 1)"

The corresponding node also goes NotReady:

~ oc get nodes
NAME                                          STATUS     ROLES    AGE    VERSION
aravindh-winc-2fn76-master-0                  Ready      master   129m   v1.16.2
aravindh-winc-2fn76-master-1                  Ready      master   129m   v1.16.2
aravindh-winc-2fn76-master-2                  NotReady   master   129m   v1.16.2
aravindh-winc-2fn76-worker-centralus1-ncftp   Ready      worker   120m   v1.16.2
aravindh-winc-2fn76-worker-centralus2-ntjmf   Ready      worker   120m   v1.16.2
aravindh-winc-2fn76-worker-centralus3-mkptf   Ready      worker   120m   v1.16.2

Checking the kubelet logs on the node shows:

Nov 15 22:24:26 aravindh-winc-2fn76-master-2 hyperkube[2777]: E1115 22:24:26.728808 2777 kubelet.go:2195] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network
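For whoever picks this up: one way to get the raw error behind the lsp-get-addresses failure above is to re-run the exact ovn-nbctl command from the log inside the failing ovnkube-node container. This is only a sketch; the pod name and ssl:// addresses are the ones from this particular log, and it assumes the container stays up long enough to exec into:

```
# Sketch: re-run the ovn-nbctl call that failed in the log above, from inside the
# crashing ovnkube-node container (pod name and DB addresses taken from this log).
oc exec -n openshift-ovn-kubernetes ovnkube-node-msdpv -c ovnkube-node -- \
  /usr/bin/ovn-nbctl --private-key=/ovn-cert/tls.key --certificate=/ovn-cert/tls.crt \
    --bootstrap-ca-cert=/ovn-ca/ca-bundle.crt \
    --db=ssl:10.0.0.5:9641,ssl:10.0.0.6:9641,ssl:10.0.0.4:9641 \
    --timeout=15 lsp-get-addresses etor-GR_aravindh-winc-2fn76-master-2
```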
(In reply to Anurag saxena from comment #6)
> This seems okay to me on 4.3.0-0.nightly-2019-11-13-233341
>
> # oc get Network.operator.openshift.io cluster -oyaml
> apiVersion: operator.openshift.io/v1
> kind: Network
> metadata:
>   creationTimestamp: "2019-11-15T20:39:55Z"
>   generation: 2
>   name: cluster
>   resourceVersion: "20551"
>   selfLink: /apis/operator.openshift.io/v1/networks/cluster
>   uid: 28f18360-7121-4315-a653-58a03e8e0868
> spec:
>   clusterNetwork:
>   - cidr: 10.128.0.0/14
>     hostPrefix: 23
>   defaultNetwork:
>     ovnKubernetesConfig:
>       hybridOverlayConfig:
>         hybridClusterNetwork:
>         - cidr: 10.132.0.0/14
>           hostPrefix: 23
>     type: OVNKubernetes
>   logLevel: ""
>   serviceNetwork:
>   - 172.30.0.0/16
> status: {}
>
> [root@localhost fedora]# oc get pods -n openshift-ovn-kubernetes
> NAME                   READY   STATUS    RESTARTS   AGE
> ovnkube-master-65gj4   4/4     Running   0          3m9s
> ovnkube-master-6w72z   4/4     Running   0          3m26s
> ovnkube-master-r8snk   4/4     Running   0          2m48s
> ovnkube-node-886pd     3/3     Running   0          2m39s
> ovnkube-node-bpzbg     3/3     Running   0          3m3s
> ovnkube-node-dddcg     3/3     Running   0          2m16s
> ovnkube-node-mz4zt     3/3     Running   1          3m26s
> ovnkube-node-v76nb     3/3     Running   0          113s
>
> [root@localhost fedora]# oc get co | awk '{print $4,$5}' | grep True   <<< all operators seems okay
>
> [root@localhost fedora]# oc get ds -n openshift-ovn-kubernetes
> NAME             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                  AGE
> ovnkube-master   3         3         3       3            3           beta.kubernetes.io/os=linux,node-role.kubernetes.io/master=   33m
> ovnkube-node     5         5         5       5            5           beta.kubernetes.io/os=linux                                   33m

The nightly version used here (4.3.0-0.nightly-2019-11-13-233341) does not reflect the latest changes for OVN-Kubernetes and the cluster-network-operator. Nightlies built tonight should have them.
Debugged Aravindh's cluster. It looks like the CNI config was never written on that node for some reason:

$ oc debug node/aravindh-winc-2fn76-master-2 -- chroot /host ls -l /var/run/multus/cni/net.d/
Starting pod/aravindh-winc-2fn76-master-2-debug ...
To use host binaries, run chroot /host
total 0

Removing debug pod ...

While on a Ready node it is fine:

$ oc debug node/aravindh-winc-2fn76-master-0 -- chroot /host ls -l /var/run/multus/cni/net.d/
Starting pod/aravindh-winc-2fn76-master-0-debug ...
To use host binaries, run chroot /host
total 4
-rw-------. 1 root root 94 Nov 15 20:16 10-ovn-kubernetes.conf

Removing debug pod ...

I believe the CNI config is only written on the node once the controller finishes one successful iteration, which would explain the issues seen in the corresponding ovnkube-node container as well.
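For completeness, a sketch of checking the same CNI config directory on every node in one pass (same path and oc debug invocation as above; just a convenience loop, not something run as part of this debugging):

```
# Sketch: list the multus CNI config dir on every node. An empty listing ("total 0")
# suggests the ovn-kubernetes node controller never completed a successful iteration there.
for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do
  echo "== ${node} =="
  oc debug node/"${node}" -- chroot /host ls -l /var/run/multus/cni/net.d/
done
```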
Seems like there was a glitch in the previous few builds; the Azure cluster has not been coming up for me on the last 2-3 builds. I would suggest trying 4.3.0-0.nightly-2019-11-15-213610 or newer, which I tried on Azure and am seeing no issues with. As Dan Williams also mentioned, a lot of fixes went in.

# oc get networks.operator.openshift.io cluster -oyaml
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  creationTimestamp: "2019-11-16T03:10:11Z"
  generation: 2
  name: cluster
  resourceVersion: "20328"
  selfLink: /apis/operator.openshift.io/v1/networks/cluster
  uid: b45923b9-10ab-45da-a483-e1024b70e663
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  defaultNetwork:
    ovnKubernetesConfig:
      hybridOverlayConfig:
        hybridClusterNetwork:
        - cidr: 10.132.0.0/14
          hostPrefix: 23
    type: OVNKubernetes
  logLevel: ""
  serviceNetwork:
  - 172.30.0.0/16
status: {}

[root@localhost fedora]# oc get pods
NAME                   READY   STATUS    RESTARTS   AGE
ovnkube-master-8hwdd   4/4     Running   0          3m46s
ovnkube-master-l9nrs   4/4     Running   0          3m56s
ovnkube-master-vrrt8   4/4     Running   0          4m3s
ovnkube-node-68gns     3/3     Running   0          2m33s
ovnkube-node-9fb7g     3/3     Running   0          4m7s
ovnkube-node-s2b92     3/3     Running   0          77s
ovnkube-node-vnvvg     3/3     Running   0          3m25s
ovnkube-node-xpsz9     3/3     Running   0          104s
This needs more investigation once the Azure environment blocker for openshift-qe gets resolved: https://bugzilla.redhat.com/show_bug.cgi?id=1773676. I will leave this bug in ON_QA until the fix for the referenced test blocker lands.
Thanks Aravindh. @Dan Williams, it seems this issue is caused by a race condition that was fixed in https://github.com/openshift/ovn-kubernetes/pull/57 and merged into 4.3.0-0.nightly-2019-11-15-213610, but Suhani reproduced the issue on that same build today, which suggests it might still pop up intermittently. I say "intermittently" because I haven't yet seen it in my environment. Suhani had to follow the same workaround ("oc edit the node and remove the k8s.ovn.org annotations") to bring the cluster back to a stable state.
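For reference, a sketch of that annotation cleanup using `oc annotate` with the trailing-dash removal syntax instead of oc edit. The annotation names are the ones from the "Setting annotations" line in the ovnkube-node log earlier in this bug, and the node name is only an example taken from that log; I have not verified this exact command against an affected cluster:

```
# Sketch of the workaround: strip the k8s.ovn.org annotations from the affected node
# so ovnkube can re-create them (annotation names from the log earlier in this bug).
NODE=aravindh-winc-2fn76-master-2   # example node name from the log; replace as needed
oc annotate node "${NODE}" \
  k8s.ovn.org/node-gateway-iface-id- \
  k8s.ovn.org/node-gateway-ip- \
  k8s.ovn.org/node-gateway-mac-address- \
  k8s.ovn.org/node-gateway-mode- \
  k8s.ovn.org/node-gateway-next-hop- \
  k8s.ovn.org/node-gateway-vlan-id- \
  k8s.ovn.org/node-mgmt-port-mac-address-
```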
After talking with Phil Cameron a bit in person, this seems like a race condition that might be hardware/environment specific (CPU, clocks, etc.). The reason I am not able to reproduce it even once while Suhani/Aravindh reproduce it consistently is likely the difference in our environments and in how we spin up clusters. I would request that dev debug this in their environment if required.
(In reply to Anurag saxena from comment #16)
> After talking with Phil Cameron a bit in person, this seems like a race
> condition that might be hardware/environment specific (CPU, clocks, etc.).
> The reason I am not able to reproduce it even once while Suhani/Aravindh
> reproduce it consistently is likely the difference in our environments and
> in how we spin up clusters. I would request that dev debug this in their
> environment if required.

Can you give us more information about how it is hardware env specific if it is happening on Azure? Also, are you editing the network config before or after bringing up the cluster?
>> Can you give us more information about how it is hardware env specific if it is happening on Azure?

To make it a more apples-to-apples comparison, can you tell me what your master and worker instance types are?

In my case:

master: 'Standard_DS3_v2'
worker: 'Standard_DS2_v2'

>> Also, are you editing the network config before or after bringing up the cluster?

After bringing up the cluster.
Sorry, 'Standard_DS4_v2' for both master and worker in our env.
(In reply to Anurag saxena from comment #18)
> >> Can you give us more information about how it is hardware env specific if it is happening on Azure?
>
> To make it a more apples-to-apples comparison, can you tell me what your
> master and worker instance types are?
>
> In my case:
>
> master: 'Standard_DS3_v2'
> worker: 'Standard_DS2_v2'
>
> >> Also, are you editing the network config before or after bringing up the cluster?
>
> After bringing up the cluster.

Master: 'Standard D8s v3 (8 vcpus, 32 GiB memory)'
Worker: 'Standard D2s v3 (2 vcpus, 8 GiB memory)'
(In reply to sumehta from comment #20)
> (In reply to Anurag saxena from comment #18)
> > >> Can you give us more information about how it is hardware env specific if it is happening on Azure?
> >
> > To make it a more apples-to-apples comparison, can you tell me what your
> > master and worker instance types are?
> >
> > In my case:
> >
> > master: 'Standard_DS3_v2'
> > worker: 'Standard_DS2_v2'
> >
> > >> Also, are you editing the network config before or after bringing up the cluster?
> >
> > After bringing up the cluster.
>
> Master: 'Standard D8s v3 (8 vcpus, 32 GiB memory)'
> Worker: 'Standard D2s v3 (2 vcpus, 8 GiB memory)'

Thanks. Let me give that set of instance types a shot.
This is not reproducible on D8s v3 and D2s v3 instances either in my case.

Steps I took:

1) Brought up a cluster with networkType: OVNKubernetes.
2) Once it came up, checked the nodes and the openshift-ovn-kubernetes pod status, which looked good.
3) Edited networks.operator.openshift.io cluster and added the overlay config:

  defaultNetwork:
    ovnKubernetesConfig:
      hybridOverlayConfig:
        hybridClusterNetwork:
        - cidr: 10.132.0.0/14
          hostPrefix: 23
    type: OVNKubernetes

4) Once the ovnkube daemonsets rolled out completely (see the rollout check sketched below), the pods were Running and the nodes Ready.
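A minimal sketch of the rollout/health check from step 4, assuming the daemonset names shown in the earlier `oc get ds` output and that they use a RollingUpdate strategy:

```
# Wait for the OVN daemonsets to finish rolling out, then verify pods and nodes.
oc rollout status ds/ovnkube-master -n openshift-ovn-kubernetes
oc rollout status ds/ovnkube-node -n openshift-ovn-kubernetes
oc get pods -n openshift-ovn-kubernetes
oc get nodes
```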
I reproduced this with openshift-install-linux-4.3.0-0.nightly-2019-11-25-153929. I used the steps outlined above and
We have tested with newer CI and nightly builds of 4.3 after the latest commits went in on Nov 21st, and we no longer see this issue on Azure.