Bug 1746616 - [ovn] pod cannot be created when ovnkube-master pod is recreated and scheduled to another master
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.2.0
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: ---
Assignee: Casey Callendrello
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-08-29 01:03 UTC by zhaozhanqi
Modified: 2019-08-29 13:39 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-08-29 13:39:24 UTC
Target Upstream Version:
Embargoed:



Description zhaozhanqi 2019-08-29 01:03:07 UTC
Description of problem:
Create a cluster with 3 masters and 2 workers using the OVN network type. Check that the ovnkube-master pod is running on one of the masters:

# oc get pod -o wide
NAME                              READY   STATUS    RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
ovnkube-master-86db46c79b-n8mhn   4/4     Running   0          7h15m   10.0.130.36    ip-10-0-130-36.eu-west-2.compute.internal    <none>           <none>
ovnkube-node-bbhvj                3/3     Running   0          7h10m   10.0.141.250   ip-10-0-141-250.eu-west-2.compute.internal   <none>           <none>
ovnkube-node-mk74p                3/3     Running   0          7h10m   10.0.146.207   ip-10-0-146-207.eu-west-2.compute.internal   <none>           <none>
ovnkube-node-qc5mq                3/3     Running   0          7h15m   10.0.130.36    ip-10-0-130-36.eu-west-2.compute.internal    <none>           <none>
ovnkube-node-r2s67                3/3     Running   0          7h15m   10.0.146.202   ip-10-0-146-202.eu-west-2.compute.internal   <none>           <none>
ovnkube-node-t6742                3/3     Running   0          7h15m   10.0.160.194   ip-10-0-160-194.eu-west-2.compute.internal   <none>           <none>

When I delete the ovnkube-master pod, the newly created ovnkube-master pod is scheduled to another master.
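For reference, the deletion was done with a command of this form (a sketch, run in the same project as the oc get pod output above; the pod name is taken from that output):

# oc delete pod ovnkube-master-86db46c79b-n8mhn

The Deployment then recreates the pod, and the scheduler is free to place the replacement on a different master node, as the following output shows.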

# oc get pod -o wide
NAME                              READY   STATUS    RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
ovnkube-master-86db46c79b-fgp4v   4/4     Running   1          39s     10.0.146.202   ip-10-0-146-202.eu-west-2.compute.internal   <none>           <none>

Check the run-ovn-northd container logs of the newly created pod:

# oc logs ovnkube-master-86db46c79b-fgp4v -c run-ovn-northd
================== ovnkube.sh --- version: 3 ================
 ==================== command: run-ovn-northd
 =================== hostname: ip-10-0-146-202
 =================== daemonset version 3
 =================== Image built from ovn-kubernetes ref: refs/heads/rhaos-4.2-rhel-7  commit: fb435e034a426d1a11fc61b284426e8ea82187ee
=============== run-ovn-northd (wait for ready_to_start_node)
=============== run_ovn_northd ========== MASTER ONLY
ovn_db_host 10.0.130.36
ovn_nbdb tcp://10.0.130.36:6641   ovn_sbdb tcp://10.0.130.36:6642
ovn_northd_opts=--db-nb-sock=/var/run/openvswitch/ovnnb_db.sock --db-sb-sock=/var/run/openvswitch/ovnsb_db.sock
ovn_log_northd=-vconsole:info
nice: cannot set niceness: Permission denied
2019-08-29T00:25:50Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovn-northd.log
Starting ovn-northd.
run as: /usr/share/openvswitch/scripts/ovn-ctl start_northd --no-monitor --ovn-manage-ovsdb=no --ovn-northd-nb-db=tcp:10.0.130.36:6641 --ovn-northd-sb-db=tcp:10.0.130.36:6642 --ovn-northd-log=-vconsole:info --db-nb-sock=/var/run/openvswitch/ovnnb_db.sock --db-sb-sock=/var/run/openvswitch/ovnsb_db.sock
=============== run_ovn_northd ========== RUNNING
2019-08-29T00:25:50.576Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovn-northd.log
2019-08-29T00:25:50.577Z|00002|reconnect|INFO|tcp:10.0.130.36:6641: connecting...
2019-08-29T00:25:50.577Z|00003|reconnect|INFO|tcp:10.0.130.36:6642: connecting...
2019-08-29T00:25:50.577Z|00004|reconnect|INFO|tcp:10.0.130.36:6641: connected
2019-08-29T00:25:50.577Z|00005|reconnect|INFO|tcp:10.0.130.36:6642: connected
2019-08-29T00:26:02.580Z|00006|reconnect|INFO|tcp:10.0.130.36:6641: connection closed by peer
2019-08-29T00:26:03.579Z|00007|reconnect|INFO|tcp:10.0.130.36:6641: connecting...
2019-08-29T00:26:03.579Z|00008|reconnect|INFO|tcp:10.0.130.36:6641: connection attempt failed (Connection refused)
2019-08-29T00:26:03.579Z|00009|reconnect|INFO|tcp:10.0.130.36:6641: waiting 2 seconds before reconnect
2019-08-29T00:26:04.387Z|00010|reconnect|INFO|tcp:10.0.130.36:6642: connection closed by peer
2019-08-29T00:26:05.388Z|00011|reconnect|INFO|tcp:10.0.130.36:6642: connecting...
2019-08-29T00:26:05.389Z|00012|reconnect|INFO|tcp:10.0.130.36:6642: connection attempt failed (Connection refused)
2019-08-29T00:26:05.389Z|00013|reconnect|INFO|tcp:10.0.130.36:6642: waiting 2 seconds before reconnect
2019-08-29T00:26:05.580Z|00014|reconnect|INFO|tcp:10.0.130.36:6641: connecting...
2019-08-29T00:26:05.581Z|00015|reconnect|INFO|tcp:10.0.130.36:6641: connection attempt failed (Connection refused)
2019-08-29T00:26:05.581Z|00016|reconnect|INFO|tcp:10.0.130.36:6641: waiting 4 seconds before reconnect
2019-08-29T00:26:07.389Z|00017|reconnect|INFO|tcp:10.0.130.36:6642: connecting...
2019-08-29T00:26:07.390Z|00018|reconnect|INFO|tcp:10.0.130.36:6642: connection attempt failed (Connection refused)
2019-08-29T00:26:07.390Z|00019|reconnect|INFO|tcp:10.0.130.36:6642: waiting 4 seconds before reconnect
2019-08-29T00:26:09.582Z|00020|reconnect|INFO|tcp:10.0.130.36:6641: connecting...
2019-08-29T00:26:09.583Z|00021|reconnect|INFO|tcp:10.0.130.36:6641: connection attempt failed (Connection refused)
2019-08-29T00:26:09.583Z|00022|reconnect|INFO|tcp:10.0.130.36:6641: continuing to reconnect in the background but suppressing further logging
2019-08-29T00:26:11.391Z|00023|reconnect|INFO|tcp:10.0.130.36:6642: connecting...
2019-08-29T00:26:11.392Z|00024|reconnect|INFO|tcp:10.0.130.36:6642: connection attempt failed (Connection refused)
2019-08-29T00:26:11.392Z|00025|reconnect|INFO|tcp:10.0.130.36:6642: continuing to reconnect in the background but suppressing further logging
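
The log above shows the problem: ovn-northd on the new master is still pointed at the OVN NB/SB databases at the old master's address (10.0.130.36), which stop answering shortly after the old ovnkube-master pod is gone (presumably because the database servers ran alongside it). A quick way to confirm the databases are unreachable is to query them directly; a sketch, assuming the ovn-nbctl/ovn-sbctl client utilities are present in the run-ovn-northd container:

# oc exec ovnkube-master-86db46c79b-fgp4v -c run-ovn-northd -- ovn-nbctl --db=tcp:10.0.130.36:6641 --timeout=5 show
# oc exec ovnkube-master-86db46c79b-fgp4v -c run-ovn-northd -- ovn-sbctl --db=tcp:10.0.130.36:6642 --timeout=5 show

Both commands are expected to fail to connect or time out, matching the reconnect errors in the northd log.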

*****************************************************

Create test pods; they fail with the following error:

 Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_test-rc-gmq5f_z3_0ed2d931-c9f4-11e9-b781-0ad6b5818122_0(b7ab06ded547b23f37415c134610e507daaa087e4b436aa3574c085308e63411): CNI request failed with status 400: 'Nil response to CNI request
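
The same error is visible in the failing pod's events; for example (the pod name test-rc-gmq5f and the namespace z3 are taken from the sandbox ID in the error above):

# oc -n z3 describe pod test-rc-gmq5f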

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-08-28-083236

How reproducible:
always

Steps to Reproduce (see the command sketch after this list):
1. Set up a cluster with 3 masters and 2 workers using the OVN network type
2. Create a test pod and check that it works
3. Delete the ovnkube-master pod so that it is rescheduled to another master
4. Check the ovnkube-master logs
5. Create a test pod
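
A rough command-level sketch of these steps (the names in angle brackets and the test pod image are placeholders; run the ovnkube commands in the project that contains the OVN pods and the test-pod commands in an ordinary test project):

# step 2: create a test pod and confirm it reaches Running
oc run test1 --image=<any-small-image> --restart=Never
oc get pod test1 -o wide

# step 3: delete the ovnkube-master pod; the Deployment recreates it, possibly on another master
oc delete pod <ovnkube-master-pod>

# step 4: inspect the run-ovn-northd logs of the replacement pod
oc logs <new-ovnkube-master-pod> -c run-ovn-northd

# step 5: create another test pod; it stays in ContainerCreating with the CNI error shown above
oc run test2 --image=<any-small-image> --restart=Never
oc describe pod test2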

Actual results:

4. See the logs in the description above
5. The test pod cannot be created

Expected results:

After the ovnkube-master pod is rescheduled to another master, new test pods can still be created.


Additional info:

Comment 1 Casey Callendrello 2019-08-29 13:39:24 UTC
This is definitely a known issue, and we have an extensive effort underway to fix it. Marking this bug as WONTFIX.

