1819611 – [4.5]ovn cluster cannot be started with network errors

Summary: [4.5]ovn cluster cannot be started with network errors

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Ben Bennett
QA Contact:	zhaozhanqi
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1819930 (view as bug list)
Depends On:
Blocks:

Reported:	2020-04-01 07:45 UTC by huirwang
Modified:	2020-07-13 17:25 UTC (History)
CC List:	5 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-07-13 17:24:43 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift machine-config-operator pull 1600	0	None	closed	Revert "cri-o: set manage_ns_lifecycle to true"	2020-09-11 08:27:35 UTC
Red Hat Product Errata	RHBA-2020:2409	0	None	None	None	2020-07-13 17:25:14 UTC

Description huirwang 2020-04-01 07:45:17 UTC

Description of problem:


Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-03-30-174407


How reproducible:


Steps to Reproduce:
1. Set up a cluster with 3 masters and 7 workers.
2. Create more pods with:
for i in {1..9} ; do oc new-project project$i ; done
for i in {1..9} ; do oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/networking/list_for_pods.json -n project$i ; done

3. Create a new project named test and create a NetworkPolicy in it:
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: test-podselector-and-ipblock
spec:
  podSelector: {}
  ingress:
  - from:
    - ipBlock:
        cidr: 10.131.0.0/24

4. Create a PV workload in the test project, following this guide:
 https://github.com/qinpingli/external-storage/tree/master/iscsi/targetd2
oc get pvc
NAME      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS               AGE
myclaim   Bound    pvc-4a01bb9a-e406-440e-af35-a3790b796358   100Mi      RWO            iscsi-targetd-vg-targetd   9s

oc get pod iscsi-pv-pod1 
NAME            READY   STATUS    RESTARTS   AGE
iscsi-pv-pod1   1/1     Running   0          23s

5. oc annotate Network.operator.openshift.io cluster "networkoperator.openshift.io/network-migration"=""

6. oc patch Network.config.openshift.io cluster --type='merge' --patch '{"spec":{"networkType":"OVNKubernetes"}}'

7. Wait until all the old pods (openshift-sdn) are gone.

8. Reboot all the nodes:
for ip in `oc get node -o wide | egrep -v "NAME" |awk '{print $6}'`
do
   echo "reboot node $ip"
   ssh -i ~/.ssh/openshift-qe.pem -o StrictHostKeyChecking=no core@$ip sudo shutdown -r -t 3
done



Actual results:

After the nodes reboot and return to Ready status, check the pod status:
oc get pods --all-namespaces -o wide | egrep -v "Running|Comple" | wc -l
     119
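
Note that the one-line count above also counts the header row of the `oc get pods` output, because the header contains neither "Running" nor "Comple". A minimal sketch of an alternative that skips the header and groups the unhealthy pods by status is shown here; the function name summarize_unhealthy_pods is hypothetical, and it assumes STATUS is the fourth column, as it is when --all-namespaces is used.

```shell
# Summarize pods that are neither Running nor Completed, grouped by STATUS.
# Reads `oc get pods --all-namespaces` output on stdin. With --all-namespaces
# the columns are NAMESPACE NAME READY STATUS ..., so STATUS is field 4.
summarize_unhealthy_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { count[$4]++ }
       END { for (s in count) print count[s], s }' | sort -rn
}

# Usage against a live cluster (assumes a logged-in `oc` session):
#   oc get pods --all-namespaces | summarize_unhealthy_pods
```

Grouping by status makes it easier to tell whether the pods are all stuck at the same stage (for example ContainerCreating) or failing in different ways.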


 Warning  FailedCreatePodSandBox  94s  kubelet, hrw-bar5-77h2d-compute-2  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_test-rc-xm4r9_test_19f60add-93f2-4b9e-805b-81f5a99b5126_0(c24e4057c820edf25465d91466a7b9ee96aaad1f09036e19fe793a5a497c6bcd): Multus: [test/test-rc-xm4r9]: error adding container to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: '[test/test-rc-xm4r9] failed to configure pod interface: failed to open netns "/var/run/crio/ns/2e699a7e-ed0c-4bc7-9c38-196cc3e5ee83/net": failed to Statfs "/var/run/crio/ns/2e699a7e-ed0c-4bc7-9c38-196cc3e5ee83/net": no such file or directory
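
The event above names the network namespace path that could not be opened. As a rough sketch for collecting those paths from many events, the helper below pulls out the path quoted after "failed to open netns"; the function name extract_netns_paths is hypothetical, and it assumes the quotes in the event text are plain (not backslash-escaped), as they appear in this report.

```shell
# Extract network namespace paths from FailedCreatePodSandBox event text.
# Reads event text on stdin and prints each unique path that appears in
# double quotes right after the words: failed to open netns
extract_netns_paths() {
  grep -o 'failed to open netns "[^"]*"' | cut -d'"' -f2 | sort -u
}

# Usage (assumes a logged-in `oc` session and a pod named in the events):
#   oc describe pod <pod> -n <namespace> | extract_netns_paths
# Each printed path can then be checked on the affected node, for example
# with `sudo stat <path>`.
```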


Expected results:
All the pods should be in Running or Completed status.

Additional info:

Rebooting all the nodes again does not help.

Comment 2 zhaozhanqi 2020-04-01 10:06:42 UTC

This does not seem to be a migration issue; the cluster cannot work with OVN at all.
Maybe this PR caused it: https://github.com/openshift/machine-config-operator/pull/1568/

Comment 3 Yang Yang 2020-04-01 10:33:30 UTC

GCP cluster installation with OVN failed with error:

# oc describe -n openshift-apiserver-operator pod/openshift-apiserver-operator-7f64c8f747-8czpd
Warning  FailedCreatePodSandBox  <invalid> (x309 over 69m)  kubelet, yy3775-6pnkt-m-0.c.openshift-qe.internal  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_openshift-apiserver-operator-7f64c8f747-8czpd_openshift-apiserver-operator_ed803cd5-47b9-45cc-b80c-b15a37765bc1_0(b79e99040efde0ac4818ff48e50d385cfb9cde8d02b4c2a0574f9f81d7a9f763): Multus: [openshift-apiserver-operator/openshift-apiserver-operator-7f64c8f747-8czpd]: error adding container to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: '[openshift-apiserver-operator/openshift-apiserver-operator-7f64c8f747-8czpd] failed to configure pod interface: failed to open netns "/var/run/crio/ns/b92b8d46-b3f4-40b4-b338-efd3765234e7/net": failed to Statfs "/var/run/crio/ns/b92b8d46-b3f4-40b4-b338-efd3765234e7/net": no such file or directory 
' 

The ovnkube-master pod shows the following error:

[core@yy3775-6pnkt-m-0 ~]$ sudo crictl logs $(sudo crictl ps --pod=$(sudo crictl pods --name=ovnkube-master --quiet) --quiet)
E0401 07:09:02.951433       1 reflector.go:283] k8s.io/client-go/informers/factory.go:133: Failed to watch *v1.Namespace: Get https://api-int.yy3775.qe.gcp.devcluster.openshift.com:6443/api/v1/namespaces?resourceVersion=2345&timeout=6m33s&timeoutSeconds=393&watch=true: dial tcp 34.71.121.10:6443: connect: connection refused

The ovs-node pod shows the following errors:

# oc logs ovs-node-4xrsp -n openshift-ovn-kubernetes
2020-04-01T06:52:34.886Z|00004|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2020-04-01T06:52:34.886Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2020-04-01T06:52:34.888Z|00006|dpdk|INFO|DPDK Disabled - Use other_config:dpdk-init to enable
2020-04-01T06:52:34.929Z|00007|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.12.0
2020-04-01T06:52:44.581Z|00003|memory|INFO|6296 kB peak resident set size after 10.0 seconds
2020-04-01T06:52:44.582Z|00004|memory|INFO|cells:38 monitors:3 sessions:2
2020-04-01T06:53:12.613Z|00005|jsonrpc|WARN|unix#7: receive error: Connection reset by peer
2020-04-01T06:53:12.613Z|00006|reconnect|WARN|unix#7: connection dropped (Connection reset by peer)
2020-04-01T06:53:12.646Z|00007|stream_ssl|ERR|SSL_use_certificate_file: error:02001002:system library:fopen:No such file or directory
2020-04-01T06:53:12.646Z|00008|stream_ssl|ERR|SSL_use_PrivateKey_file: error:20074002:BIO routines:FILE_CTRL:system lib

Comment 4 Johnny Liu 2020-04-02 02:08:07 UTC

AWS installation also hits this problem, so I am setting the testblocker keyword.

Comment 5 zhaozhanqi 2020-04-02 02:18:23 UTC

https://github.com/openshift/machine-config-operator/pull/1600 has been merged.
This issue should be fixed; it can be checked with registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-04-01-232323
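
Pull request 1600 is listed in the Links table as Revert "cri-o: set manage_ns_lifecycle to true", so one way to see which behaviour a node has is to look for that key in its CRI-O configuration. The following is only a sketch: the function name crio_manage_ns_lifecycle is hypothetical, and the config file location on a given node (commonly /etc/crio/crio.conf) is an assumption, which is why the path is passed as an argument.

```shell
# Print the manage_ns_lifecycle value found in a CRI-O config file,
# or "unset" when the key is not present in that file.
# $1: path to the config file (assumed location: /etc/crio/crio.conf).
crio_manage_ns_lifecycle() {
  conf=$1
  # Take the last uncommented "manage_ns_lifecycle = <value>" line.
  value=$(sed -n 's/^[[:space:]]*manage_ns_lifecycle[[:space:]]*=[[:space:]]*\([A-Za-z]*\).*/\1/p' "$conf" | tail -n 1)
  echo "${value:-unset}"
}

# Usage on a node (path is an assumption):
#   crio_manage_ns_lifecycle /etc/crio/crio.conf
```

This only inspects the single file it is given; any drop-in configuration files would need to be checked separately.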

Comment 6 zhaozhanqi 2020-04-02 08:42:12 UTC

*** Bug 1819930 has been marked as a duplicate of this bug. ***

Comment 7 zhaozhanqi 2020-04-02 08:43:01 UTC

Verified this bug according to comment 5

Comment 10 errata-xmlrpc 2020-07-13 17:24:43 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
