Bug 1654942 - [OVN] ovn-master pod cannot be scheduled after the change from 'DaemonSet' to 'Deployment' in 3.11
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.11.0
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.11.z
Assignee: Phil Cameron
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-11-30 06:11 UTC by zhaozhanqi
Modified: 2019-05-06 13:00 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: start sequence broken
Consequence: reported bug
Fix: fix start sequence
Result:
Clone Of:
Environment:
Last Closed: 2019-04-08 19:09:16 UTC
Target Upstream Version:
Embargoed:


Attachments
logs from ovn-logs ovnkube-master-pt46f (15.70 KB, text/plain), 2018-11-30 06:11 UTC, zhaozhanqi
ovn pod failed logs (44.59 KB, text/plain), 2018-12-03 02:22 UTC, zhaozhanqi
logs for ovn pod cannot be running (43.95 KB, text/plain), 2018-12-21 08:53 UTC, zhaozhanqi
oc describe pod ovnkube-master (9.01 KB, text/plain), 2019-03-01 05:56 UTC, zhaozhanqi
oc describe ovn node for comment 18 (9.83 KB, text/plain), 2019-03-21 00:55 UTC, Anurag saxena
oc describe ovn master for comment 18 (8.99 KB, text/plain), 2019-03-21 00:56 UTC, Anurag saxena

Description zhaozhanqi 2018-11-30 06:11:32 UTC
Created attachment 1510067 [details]
logs from ovn-logs ovnkube-master-pt46f

Description of problem:
Using the latest image docker.io/ovnkube/ovn-daemonset:latest (id: 67f36c783fd7), the ovn-master-xxx pod cannot reach Running because the 'run-ovn-northd' container's connection is not stable.
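
For reference, the failing container's log can also be pulled directly with oc (pod name taken from the attachment title; the namespace is assumed to be the current project):

# oc logs ovnkube-master-pt46f -c run-ovn-northd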

Version-Release number of selected component (if applicable):


How reproducible:
always

Steps to Reproduce:
1. Set up the cluster with OVN Kubernetes according to the docs

Actual results:

The ovn-master pod cannot reach Running; see the logs in the attachment.


Expected results:
The ovn-master pod works well.

Additional info:

Comment 1 Phil Cameron 2018-11-30 15:22:16 UTC
It appears there is something wrong with the ovs daemons. The image, id: 67f36c783fd7, has been tested and works in a Kubernetes cluster.

What version of OpenShift is running?

Was ovs running when ovn was installed (were all existing networking DaemonSets deleted)?

How was ovn installed on this cluster?

Please run ./ovn-logs and attach the report. By default, with no arguments, it reports on all pods.
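
For example, to capture the default report to a file for attaching (assuming the script is run from the ovn-kubernetes dist directory where it was installed; the exact path is an assumption):

# ./ovn-logs > ovn-logs.txt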


The high number of restarts, especially in the ovnkube pods, shows that there is a problem.

Comment 2 zhaozhanqi 2018-12-03 02:21:33 UTC
I tried this with v4.0.0-0.79.0 and v3.11.50; the same issue occurs in both, following https://github.com/openvswitch/ovn-kubernetes/blob/master/dist/READMEopenshifttechpreview.md

Comment 3 zhaozhanqi 2018-12-03 02:22:19 UTC
Created attachment 1510777 [details]
ovn pod failed logs

Comment 4 Phil Cameron 2018-12-10 15:04:43 UTC
The container start sequences have been reworked. See 
https://github.com/openvswitch/ovn-kubernetes/pull/536

Comment 5 zhaozhanqi 2018-12-21 08:53:44 UTC
Created attachment 1516067 [details]
logs for ovn pod cannot be running.

Comment 6 zhaozhanqi 2018-12-21 08:54:37 UTC
Hi Phil,

Please check the logs above; the ovn-related pods still cannot reach Running.

Comment 7 Phil Cameron 2019-01-03 14:30:57 UTC
Hi Zhanqi
I am working on a fix in PR 559; it's not ready yet.

Comment 8 Phil Cameron 2019-01-08 14:03:20 UTC
Hi Zhanqi
It is proving difficult to get ovn stable and operating properly. I am still working on PR 559, which will fix the problem.

Comment 9 Phil Cameron 2019-01-09 21:28:26 UTC
Note: ovn is tech preview in 4.0.

Comment 10 Ben Bennett 2019-01-17 18:33:53 UTC
Removed the TestBlocker label since OVN is not going to be included in 4.0.

Comment 11 Phil Cameron 2019-01-31 18:31:11 UTC
https://github.com/openvswitch/ovn-kubernetes/pull/559 - MERGED

Comment 12 zhaozhanqi 2019-02-28 02:33:19 UTC
hi, Phil Cameron

I tested on 3.11 following the docs and found that ovnkube-master cannot be scheduled onto a node and ovnkube-node does not stay Running in a stable state.

# oc  get pod -o wide
NAME                              READY     STATUS    RESTARTS   AGE       IP           NODE                               NOMINATED NODE
ovnkube-master-685564c955-qsh8m   0/4       Pending   0          1h        <none>       <none>                             <none>
ovnkube-node-jjsmk                3/3       Running   16         1h        10.0.77.78   qe-zzhao2-master-etcd-nfs-1        <none>
ovnkube-node-mm5hm                3/3       Running   16         1h        10.0.77.73   qe-zzhao2-node-registry-router-1   <none>
ovnkube-node-x8dfr                3/3       Running   16         1h        10.0.77.69   qe-zzhao2-node-1                   <none>


# oc get node
NAME                               STATUS     ROLES     AGE       VERSION
qe-zzhao2-master-etcd-nfs-1        NotReady   master    1h        v1.11.0+d4cacc0
qe-zzhao2-node-1                   NotReady   compute   58m       v1.11.0+d4cacc0
qe-zzhao2-node-registry-router-1   NotReady   <none>    58m       v1.11.0+d4cacc0



I found the kind has changed from 'DaemonSet' to 'Deployment' in ovnkube-master.yaml.j2. As far as I know, in 3.11 a Deployment's pods go through the default scheduler, which only places them on nodes that are Ready (DaemonSet pods do not have this restriction).
So the current issue is a deadlock: the nodes cannot become Ready because OVN is not running, and ovnkube-master cannot be scheduled because no node is Ready.
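
A quick way to confirm the deadlock (pod and node names are taken from the output above; the exact wording of the Conditions and Events output varies):

# oc describe node qe-zzhao2-node-1 | grep -A 5 Conditions
# oc describe pod ovnkube-master-685564c955-qsh8m | grep -A 5 Events

The node's Ready condition would be expected to show False with a network-plugin-not-ready message, and the pod's events would be expected to show FailedScheduling with no nodes available.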

Comment 13 Phil Cameron 2019-02-28 13:29:59 UTC
Could you run
oc describe pod ovnkube-master-685564c955-qsh8m
and attach the results? Thanks.

4.0 starts the deployment and does not have this problem. Community developers on Kubernetes are not seeing this either.

Comment 14 zhaozhanqi 2019-03-01 05:56:30 UTC
Created attachment 1539726 [details]
oc describe pod ovnkube-master

Comment 15 zhaozhanqi 2019-03-01 06:04:20 UTC
Yes, I'm not sure whether 4.0 has this issue, but in 3.11 the Deployment's pods are only scheduled onto nodes that are Ready.

Comment 17 zhaozhanqi 2019-03-19 06:05:55 UTC
This bug is not related to the above errata, so I changed the version to 3.11, since it can be reproduced on 3.11 at least, per comment 12.

I will file another bug if this still happens in 4.0 once OVN is merged there.

Comment 18 Anurag saxena 2019-03-21 00:54:27 UTC
Apparently, reverting to "Daemonset" doesn't solve this problem.

Steps taken

1) Changed 'Kind' to Daemonset in ovnkube-master.yaml (see the note after these steps)

# cat ovnkube-master.yaml | grep kind
kind: Daemonset
#kind: Deployment

2) Deleted ovn master pod to trigger restart
3) ovn master pod stuck in Pending status indefinitely
4) ovn node pods continuously cycle back and forth between CrashLoopBackOff and restarting.

(Attaching ovn master and ovn node oc describe details as oc_describe_master_comment18.txt and oc_describe_ovn_node_comment18.txt)
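
Note that Kubernetes kinds are case-sensitive, so the 'kind: Daemonset' shown in the grep above is not a valid kind; a DaemonSet manifest would need the header below (a minimal sketch of just the relevant fields; the apiVersion in the actual template may differ):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ovnkube-master

With 'kind: Daemonset', oc apply typically fails with a "no matches for kind" error instead of creating a DaemonSet.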

Comment 19 Anurag saxena 2019-03-21 00:55:34 UTC
Created attachment 1546276 [details]
oc describe ovn node for comment18

Comment 20 Anurag saxena 2019-03-21 00:56:12 UTC
Created attachment 1546277 [details]
oc describe ovn master for comment18

Comment 21 Phil Cameron 2019-03-29 12:03:35 UTC
Please note that I have been developing ovn on a bare-metal three-node 3.11 cluster and have had no problems with the deployment. We need to investigate what is different in the cluster setup.

Comment 22 Anurag saxena 2019-03-29 13:53:56 UTC
(In reply to Phil Cameron from comment #21)
> Please note that I have been developing ovn on a bare metal 3 node 3.11
> cluster and have had no problems with the deployment. We need to investigate
> what is different in the cluster setup.

Phil, I can share my setup with you if you like. I am in Boston, the same time zone as you, I believe (Westford?). Thanks!

Comment 23 Phil Cameron 2019-04-05 18:06:45 UTC
Master deployment has
    metadata:
      labels:
        app: ovnkube-master
        node-role.kubernetes.io/master: "true"
and
     nodeSelector:
        node-role.kubernetes.io/master: "true"
        beta.kubernetes.io/os: "linux"

The label may be missing.

It feels like there is a selector problem. I don't think it has to do with networking being up.
I am working on 4.1 bugs and can't devote a lot of time to this until 4.2 dev starts.
I am usually in Westford; today I am WFH.
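
A quick way to check the selector theory (node name taken from comment 12; the label value is what the nodeSelector above expects):

# oc get node qe-zzhao2-master-etcd-nfs-1 --show-labels | grep node-role.kubernetes.io/master

If the label is missing, it can be added by hand:

# oc label node qe-zzhao2-master-etcd-nfs-1 node-role.kubernetes.io/master=true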

Comment 24 Anurag saxena 2019-04-05 20:45:21 UTC
(In reply to Phil Cameron from comment #23)
> Master deployment has
>     metadata:
>       labels:
>         app: ovnkube-master
>         node-role.kubernetes.io/master: "true"
> and
>      nodeSelector:
>         node-role.kubernetes.io/master: "true"
>         beta.kubernetes.io/os: "linux"
> 
> The label may be missing.
> 
> It feels like there is a selector problem. I don't think it has to do
> with networking being up.
> I am working on 4.1 bugs and can't devote a lot of time to this until 4.2
> dev starts.
> I am usually in Westford, today I am wfh

Hi Phil, I can see the following present under the master deployment on my setup with the :latest image:

metadata:
  labels:
    name: ovnkube-master
    component: network
    type: infra
    openshift.io/component: network
    beta.kubernetes.io/os: "linux"

and

nodeSelector:
  node-role.kubernetes.io/master: "true"
  beta.kubernetes.io/os: "linux"

Comment 25 Phil Cameron 2019-04-08 19:09:16 UTC
3.11 (from openvswitch/ovn-kubernetes) is a work in progress. It is a sandbox we used to work on ovn before the cluster-network-operator became available. None of what is here is stable or testable; sometimes it works, sometimes it doesn't. The real, testable ovn will be installed by the cluster-network-operator on OCP 4.2+.

