Bug 1746616 - [ovn] pod cannot be created when ovnkube-master pod is recreated and scheduled to another master
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.2.0
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: ---
Assignee: Casey Callendrello
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-08-29 01:03 UTC by zhaozhanqi
Modified: 2019-08-29 13:39 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-08-29 13:39:24 UTC
Target Upstream Version:
Embargoed:



Description zhaozhanqi 2019-08-29 01:03:07 UTC
Description of problem:
Create a cluster with 3 masters and 2 workers using the OVN network type. Check that the ovnkube-master pod is running on one of the masters:

# oc get pod -o wide
NAME                              READY   STATUS    RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
ovnkube-master-86db46c79b-n8mhn   4/4     Running   0          7h15m   10.0.130.36    ip-10-0-130-36.eu-west-2.compute.internal    <none>           <none>
ovnkube-node-bbhvj                3/3     Running   0          7h10m   10.0.141.250   ip-10-0-141-250.eu-west-2.compute.internal   <none>           <none>
ovnkube-node-mk74p                3/3     Running   0          7h10m   10.0.146.207   ip-10-0-146-207.eu-west-2.compute.internal   <none>           <none>
ovnkube-node-qc5mq                3/3     Running   0          7h15m   10.0.130.36    ip-10-0-130-36.eu-west-2.compute.internal    <none>           <none>
ovnkube-node-r2s67                3/3     Running   0          7h15m   10.0.146.202   ip-10-0-146-202.eu-west-2.compute.internal   <none>           <none>
ovnkube-node-t6742                3/3     Running   0          7h15m   10.0.160.194   ip-10-0-160-194.eu-west-2.compute.internal   <none>           <none>

When I delete the ovnkube-master pod, the newly created ovnkube-master pod is scheduled to another master.
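For reference, the deletion was done with a command of this form (a sketch, run in the same project as the oc get pod output above; the pod name is taken from that output):

# oc delete pod ovnkube-master-86db46c79b-n8mhn

The Deployment then recreates the pod, and the scheduler is free to place the replacement on a different master node, as the following output shows.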

# oc get pod -o wide
NAME                              READY   STATUS    RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
ovnkube-master-86db46c79b-fgp4v   4/4     Running   1          39s     10.0.146.202   ip-10-0-146-202.eu-west-2.compute.internal   <none>           <none>

Check the run-ovn-northd container logs of the newly created pod:

# oc logs ovnkube-master-86db46c79b-fgp4v -c run-ovn-northd
================== ovnkube.sh --- version: 3 ================
 ==================== command: run-ovn-northd
 =================== hostname: ip-10-0-146-202
 =================== daemonset version 3
 =================== Image built from ovn-kubernetes ref: refs/heads/rhaos-4.2-rhel-7  commit: fb435e034a426d1a11fc61b284426e8ea82187ee
=============== run-ovn-northd (wait for ready_to_start_node)
=============== run_ovn_northd ========== MASTER ONLY
ovn_db_host 10.0.130.36
ovn_nbdb tcp://10.0.130.36:6641   ovn_sbdb tcp://10.0.130.36:6642
ovn_northd_opts=--db-nb-sock=/var/run/openvswitch/ovnnb_db.sock --db-sb-sock=/var/run/openvswitch/ovnsb_db.sock
ovn_log_northd=-vconsole:info
nice: cannot set niceness: Permission denied
2019-08-29T00:25:50Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovn-northd.log
Starting ovn-northd.
run as: /usr/share/openvswitch/scripts/ovn-ctl start_northd --no-monitor --ovn-manage-ovsdb=no --ovn-northd-nb-db=tcp:10.0.130.36:6641 --ovn-northd-sb-db=tcp:10.0.130.36:6642 --ovn-northd-log=-vconsole:info --db-nb-sock=/var/run/openvswitch/ovnnb_db.sock --db-sb-sock=/var/run/openvswitch/ovnsb_db.sock
=============== run_ovn_northd ========== RUNNING
2019-08-29T00:25:50.576Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovn-northd.log
2019-08-29T00:25:50.577Z|00002|reconnect|INFO|tcp:10.0.130.36:6641: connecting...
2019-08-29T00:25:50.577Z|00003|reconnect|INFO|tcp:10.0.130.36:6642: connecting...
2019-08-29T00:25:50.577Z|00004|reconnect|INFO|tcp:10.0.130.36:6641: connected
2019-08-29T00:25:50.577Z|00005|reconnect|INFO|tcp:10.0.130.36:6642: connected
2019-08-29T00:26:02.580Z|00006|reconnect|INFO|tcp:10.0.130.36:6641: connection closed by peer
2019-08-29T00:26:03.579Z|00007|reconnect|INFO|tcp:10.0.130.36:6641: connecting...
2019-08-29T00:26:03.579Z|00008|reconnect|INFO|tcp:10.0.130.36:6641: connection attempt failed (Connection refused)
2019-08-29T00:26:03.579Z|00009|reconnect|INFO|tcp:10.0.130.36:6641: waiting 2 seconds before reconnect
2019-08-29T00:26:04.387Z|00010|reconnect|INFO|tcp:10.0.130.36:6642: connection closed by peer
2019-08-29T00:26:05.388Z|00011|reconnect|INFO|tcp:10.0.130.36:6642: connecting...
2019-08-29T00:26:05.389Z|00012|reconnect|INFO|tcp:10.0.130.36:6642: connection attempt failed (Connection refused)
2019-08-29T00:26:05.389Z|00013|reconnect|INFO|tcp:10.0.130.36:6642: waiting 2 seconds before reconnect
2019-08-29T00:26:05.580Z|00014|reconnect|INFO|tcp:10.0.130.36:6641: connecting...
2019-08-29T00:26:05.581Z|00015|reconnect|INFO|tcp:10.0.130.36:6641: connection attempt failed (Connection refused)
2019-08-29T00:26:05.581Z|00016|reconnect|INFO|tcp:10.0.130.36:6641: waiting 4 seconds before reconnect
2019-08-29T00:26:07.389Z|00017|reconnect|INFO|tcp:10.0.130.36:6642: connecting...
2019-08-29T00:26:07.390Z|00018|reconnect|INFO|tcp:10.0.130.36:6642: connection attempt failed (Connection refused)
2019-08-29T00:26:07.390Z|00019|reconnect|INFO|tcp:10.0.130.36:6642: waiting 4 seconds before reconnect
2019-08-29T00:26:09.582Z|00020|reconnect|INFO|tcp:10.0.130.36:6641: connecting...
2019-08-29T00:26:09.583Z|00021|reconnect|INFO|tcp:10.0.130.36:6641: connection attempt failed (Connection refused)
2019-08-29T00:26:09.583Z|00022|reconnect|INFO|tcp:10.0.130.36:6641: continuing to reconnect in the background but suppressing further logging
2019-08-29T00:26:11.391Z|00023|reconnect|INFO|tcp:10.0.130.36:6642: connecting...
2019-08-29T00:26:11.392Z|00024|reconnect|INFO|tcp:10.0.130.36:6642: connection attempt failed (Connection refused)
2019-08-29T00:26:11.392Z|00025|reconnect|INFO|tcp:10.0.130.36:6642: continuing to reconnect in the background but suppressing further logging
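
The log above shows the problem: ovn-northd on the new master is still pointed at the OVN NB/SB databases at the old master's address (10.0.130.36), which stop answering shortly after the old ovnkube-master pod is gone (presumably because the database servers ran alongside it). A quick way to confirm the databases are unreachable is to query them directly; a sketch, assuming the ovn-nbctl/ovn-sbctl client utilities are present in the run-ovn-northd container:

# oc exec ovnkube-master-86db46c79b-fgp4v -c run-ovn-northd -- ovn-nbctl --db=tcp:10.0.130.36:6641 --timeout=5 show
# oc exec ovnkube-master-86db46c79b-fgp4v -c run-ovn-northd -- ovn-sbctl --db=tcp:10.0.130.36:6642 --timeout=5 show

Both commands are expected to fail to connect or time out, matching the reconnect errors in the northd log.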

*****************************************************

Create test pods; they fail with the following error:

 Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_test-rc-gmq5f_z3_0ed2d931-c9f4-11e9-b781-0ad6b5818122_0(b7ab06ded547b23f37415c134610e507daaa087e4b436aa3574c085308e63411): CNI request failed with status 400: 'Nil response to CNI request
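
The same error is visible in the failing pod's events; for example (the pod name test-rc-gmq5f and the namespace z3 are taken from the sandbox ID in the error above):

# oc -n z3 describe pod test-rc-gmq5f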

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-08-28-083236

How reproducible:
always

Steps to Reproduce (see the command sketch after this list):
1. Set up a cluster with 3 masters and 2 workers using the OVN network type
2. Create a test pod and check that it works
3. Delete the ovnkube-master pod so that it is rescheduled to another master
4. Check the ovnkube-master logs
5. Create a test pod
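
A rough command-level sketch of these steps (the names in angle brackets and the test pod image are placeholders; run the ovnkube commands in the project that contains the OVN pods and the test-pod commands in an ordinary test project):

# step 2: create a test pod and confirm it reaches Running
oc run test1 --image=<any-small-image> --restart=Never
oc get pod test1 -o wide

# step 3: delete the ovnkube-master pod; the Deployment recreates it, possibly on another master
oc delete pod <ovnkube-master-pod>

# step 4: inspect the run-ovn-northd logs of the replacement pod
oc logs <new-ovnkube-master-pod> -c run-ovn-northd

# step 5: create another test pod; it stays in ContainerCreating with the CNI error shown above
oc run test2 --image=<any-small-image> --restart=Never
oc describe pod test2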

Actual results:

4. See the logs in the description above
5. The test pod cannot be created

Expected results:

After the ovnkube-master pod is rescheduled to another master, new test pods can still be created.


Additional info:

Comment 1 Casey Callendrello 2019-08-29 13:39:24 UTC
This is definitely a known issue, and we have an extensive effort underway to fix it. Marking this bug as WONTFIX.

