Created attachment 1510067 [details]
logs from ovn-logs ovnkube-master-pt46f

Description of problem:
Using the latest image docker.io/ovnkube/ovn-daemonset:latest (id: 67f36c783fd7), the ovn-master-xxx pod cannot reach Running state because the 'run-ovn-northd' container's connection is not stable.

Version-Release number of selected component (if applicable):

How reproducible:
always

Steps to Reproduce:
1. Set up the cluster with OVN kubernetes according to the docs.

Actual results:
The ovn-master pod cannot reach Running state; see the logs in the attachment.

Expected results:
The ovn-master pod works well.

Additional info:
It appears there is something wrong with the ovs daemons. The image (id: 67f36c783fd7) has been tested and works in a kubernetes cluster. What version of OpenShift is running? Is ovs running when installing ovn (were all existing networking daemonsets deleted)? How was ovn installed on this cluster? Please run ./ovn-logs and attach the report. The default, without args, reports on all pods. The high number of restarts, especially in the ovnkube pods, shows that there is a problem.
I was trying this with v4.0.0-0.79.0 and v3.11.50, following https://github.com/openvswitch/ovn-kubernetes/blob/master/dist/READMEopenshifttechpreview.md; both show the same issue.
Created attachment 1510777 [details] ovn pod failed logs
The container start sequences have been reworked. See https://github.com/openvswitch/ovn-kubernetes/pull/536
Created attachment 1516067 [details] logs for the ovn pod that cannot reach Running state.
Hi Phil, please help check the logs above; the OVN-related pods still cannot reach Running state.
Hi Zhanqi, I am working on a fix in PR 559; it's not ready yet.
Hi Zhanqi, it is proving difficult to get ovn to be stable and operate properly. I am still working on PR 559, which will fix the problem.
Note ovn is tech preview in 4.0
Removed the TestBlocker label since OVN is not going to be included in 4.0.
https://github.com/openvswitch/ovn-kubernetes/pull/559 - MERGED
Hi Phil Cameron,
I did the testing on the 3.11 version following the docs. I found that the ovnkube-master pod cannot be scheduled onto a node, and the ovnkube-node pods do not stay in a stable Running state.

# oc get pod -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP           NODE                               NOMINATED NODE
ovnkube-master-685564c955-qsh8m   0/4     Pending   0          1h    <none>       <none>                             <none>
ovnkube-node-jjsmk                3/3     Running   16         1h    10.0.77.78   qe-zzhao2-master-etcd-nfs-1        <none>
ovnkube-node-mm5hm                3/3     Running   16         1h    10.0.77.73   qe-zzhao2-node-registry-router-1   <none>
ovnkube-node-x8dfr                3/3     Running   16         1h    10.0.77.69   qe-zzhao2-node-1                   <none>

# oc get node
NAME                               STATUS     ROLES     AGE   VERSION
qe-zzhao2-master-etcd-nfs-1        NotReady   master    1h    v1.11.0+d4cacc0
qe-zzhao2-node-1                   NotReady   compute   58m   v1.11.0+d4cacc0
qe-zzhao2-node-registry-router-1   NotReady   <none>    58m   v1.11.0+d4cacc0

I found that the kind has changed from 'DaemonSet' to 'Deployment' in ovnkube-master.yaml.j2. As far as I know, in 3.11 a Deployment's pods are only scheduled onto nodes that are Ready. So the current issue is that the nodes cannot become Ready because OVN is not running, and ovnkube-master cannot be scheduled because no node is Ready.
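The chicken-and-egg cycle described above is exactly the scheduling difference between the two kinds: DaemonSet pods tolerate the not-ready/unreachable node taints (recent Kubernetes adds those tolerations to DaemonSet pods automatically), so they can land on NotReady nodes, while a Deployment's pods stay Pending until some node becomes Ready. A minimal sketch of a network agent that can schedule onto NotReady nodes; the names here are illustrative and this is not the actual ovnkube-master manifest:

```yaml
# Hypothetical manifest fragment: a DaemonSet whose pods may be scheduled on
# nodes that are not yet Ready, which is what a cluster network component
# needs to break the "network pod needs a Ready node / node needs the
# network pod to become Ready" cycle.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-network-agent
spec:
  selector:
    matchLabels:
      app: example-network-agent
  template:
    metadata:
      labels:
        app: example-network-agent
    spec:
      tolerations:
      # DaemonSet pods receive these tolerations automatically in recent
      # Kubernetes versions; listing them explicitly documents the intent.
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoExecute
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute
      containers:
      - name: agent
        image: docker.io/ovnkube/ovn-daemonset:latest
```

A Deployment carrying the same pod template would still sit in Pending on this cluster, because the default scheduler will not place its pods on NotReady nodes.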
Could you run oc describe pod ovnkube-master-685564c955-qsh8m and attach the results? Thanks. 4.0 starts the deployment and does not have this problem. Community developers on kubernetes are not seeing this either.
Created attachment 1539726 [details] oc describe output for the ovnkube-master pod
Yes, I'm not sure whether 4.0 has this issue, but in the 3.11 version a Deployment's pods are only scheduled onto nodes that are Ready.
This bug is not related to the above errata, so I changed the version to 3.11, since it can be reproduced on 3.11 at least, per comment 12. I will file another bug if this still happens in 4.0 once OVN is merged into 4.0.
Apparently, reverting back to "Daemonset" doesn't solve this problem.

Steps taken:
1) Changed 'kind' to Daemonset in ovnkube-master.yaml
# cat ovnkube-master.yaml | grep kind
kind: Daemonset
#kind: Deployment
2) Deleted the ovn master pod to trigger a restart
3) The ovn master pod is stuck in Pending status indefinitely
4) The ovn node pods continuously cycle through CrashLoopBackOff and restart.

(Attaching oc describe output for the ovn master and ovn node pods as oc_describe_master_comment18.txt and oc_describe_ovn_node_comment18.txt)
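One thing worth double-checking in the steps above: Kubernetes kind values are case-sensitive, and "Daemonset" is not a registered kind, so an apply with that spelling is rejected by the API server (with an error along the lines of no matches for kind "Daemonset") instead of creating a DaemonSet. The correct spelling is "DaemonSet". A sketch of the corrected header; the apiVersion is an assumption (3.11 also accepted extensions/v1beta1), so check what the rest of the manifest already uses:

```yaml
# Corrected header sketch; only the kind/apiVersion lines matter here,
# the rest of the existing spec stays unchanged.
apiVersion: apps/v1
kind: DaemonSet   # case-sensitive: "Daemonset" is not a valid kind
metadata:
  name: ovnkube-master
```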
Created attachment 1546276 [details] oc describe ovn node for comment18
Created attachment 1546277 [details] oc describe ovn master for comment18
Please note that I have been developing ovn on a bare metal 3 node 3.11 cluster and have had no problems with the deployment. We need to investigate what is different in the cluster setup.
(In reply to Phil Cameron from comment #21)
> Please note that I have been developing ovn on a bare metal 3 node 3.11
> cluster and have had no problems with the deployment. We need to investigate
> what is different in the cluster setup.

Phil, I can share my setup with you if you like. I am in Boston, same time zone as you, I believe (Westford?). Thanks!
The master deployment has

metadata:
  labels:
    app: ovnkube-master
    node-role.kubernetes.io/master: "true"

and

nodeSelector:
  node-role.kubernetes.io/master: "true"
  beta.kubernetes.io/os: "linux"

The label may be missing. It feels like there is a selector problem. I don't think it has to do with networking being up. I am working on 4.1 bugs and can't devote a lot of time to this until 4.2 dev starts. I am usually in Westford; today I am wfh.
(In reply to Phil Cameron from comment #23)
> Master deployment has
> metadata:
>   labels:
>     app: ovnkube-master
>     node-role.kubernetes.io/master: "true"
> and
> nodeSelector:
>   node-role.kubernetes.io/master: "true"
>   beta.kubernetes.io/os: "linux"
>
> The label may be missing.
>
> It feels like there is a selector problem. I don't think it has to do
> with networking being up.
> I am working on 4.1 bugs and can't devote a lot of time to this until 4.2
> dev starts.
> I am usually in Westford, today I am wfh

Hi Phil, I can see the following under the master deployment on my setup with the :latest image:

metadata:
  labels:
    name: ovnkube-master
    component: network
    type: infra
    openshift.io/component: network
    beta.kubernetes.io/os: "linux"

and

nodeSelector:
  node-role.kubernetes.io/master: "true"
  beta.kubernetes.io/os: "linux"
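A quick way to confirm or rule out the selector theory is to compare the pod template's nodeSelector with the labels actually present on the intended node: with the selectors quoted above, the scheduler will only place the pod on a node that carries both labels, and if either is missing the pod stays Pending with a "didn't match node selector" scheduling event. A hypothetical matching pair, using one of the node names from the earlier output purely for illustration:

```yaml
# The pod template requires both of these labels on the node:
#
#   nodeSelector:
#     node-role.kubernetes.io/master: "true"
#     beta.kubernetes.io/os: "linux"
#
# so the target node object must carry them, e.g.
# (verify with: oc get node qe-zzhao2-master-etcd-nfs-1 --show-labels):
apiVersion: v1
kind: Node
metadata:
  name: qe-zzhao2-master-etcd-nfs-1
  labels:
    node-role.kubernetes.io/master: "true"
    beta.kubernetes.io/os: "linux"
```

If the master node is missing node-role.kubernetes.io/master="true", adding it (oc label node ... node-role.kubernetes.io/master=true) should let the pod schedule, which would confirm the label theory.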
3.11 (from openvswitch/ovn-kubernetes) is a work in progress. This is a sandbox that we used to work on ovn before the cluster-network-operator became available. None of what is here is stable or testable. Sometimes it works, sometimes it doesn't. The real, testable ovn will be installed by the cluster-network-operator on OCP 4.2+.