
Bug 1579744

Summary:	Should be able to detect that OVS is down and recreate the ovs pod automatically
Product:	OpenShift Container Platform	Reporter:	Meng Bo <bmeng>
Component:	Networking	Assignee:	Ben Bennett <bbennett>
Networking sub component:	openshift-sdn	QA Contact:	zhaozhanqi <zzhao>
Status:	CLOSED CURRENTRELEASE	Docs Contact:
Severity:	low
Priority:	medium	CC:	aos-bugs, bbennett, cshereme, hongli, jortizpa, jtanenba
Version:	3.10.0
Target Milestone:	---
Target Release:	3.11.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-05-15 09:17:46 UTC	Type:	Bug

Description Meng Bo 2018-05-18 08:41:45 UTC

Description of problem:
Set up an OCP cluster; the ovs and sdn pods will be running on the node. If the OVS-related processes are killed for some reason, the sdn pod keeps restarting because the OVS db.sock has disappeared, while the ovs pod keeps running as if nothing were wrong.

Version-Release number of selected component (if applicable):
v3.10.0-0.47.0

How reproducible:
always

Steps to Reproduce:
1. Set up an OCP cluster
2. Kill the OVS-related processes on the node
killall ovsdb-server
killall ovs-vswitchd
3. Check the sdn and ovs pods on the node

Actual results:
The sdn pod keeps restarting, while the ovs pod appears healthy.

# oc get po 
NAME        READY     STATUS    RESTARTS   AGE
ovs-tvv9h   1/1       Running   0          54m
sdn-6dlnl   1/1       Running   7          54m

# oc logs sdn-6dlnl
I0518 08:14:20.592928   26616 node.go:292] Starting openshift-sdn network plugin
I0518 08:14:20.639228   26616 healthcheck.go:29] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I0518 08:14:21.641202   26616 healthcheck.go:29] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I0518 08:14:22.641369   26616 healthcheck.go:29] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I0518 08:14:23.641290   26616 healthcheck.go:29] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory

# oc logs ovs-tvv9h
/etc/openvswitch/conf.db does not exist ... (warning).
Creating empty database /etc/openvswitch/conf.db [  OK  ]
Starting ovsdb-server [  OK  ]
PMD: net_mlx4: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx4: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx4)
PMD: net_mlx5: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx5: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx5)
Configuring Open vSwitch system IDs [  OK  ]
Inserting openvswitch module [  OK  ]
PMD: net_mlx4: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx4: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx4)
PMD: net_mlx5: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx5: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx5)
Starting ovs-vswitchd [  OK  ]
Enabling remote OVSDB managers [  OK  ]


Expected results:
The problem should be detected on the ovs pod, and the pod should be restarted.

Additional info:
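As a rough illustration of the kind of check that could detect this state (a sketch only, not the actual fix), a probe could try to connect to the OVS database socket and report failure when nothing is listening. The socket path is taken from the sdn log above; the function name ovs_db_reachable and the timeout value are hypothetical.

```python
import socket

# Socket path as it appears in the sdn log; everything else here is illustrative.
OVS_DB_SOCK = "/var/run/openvswitch/db.sock"


def ovs_db_reachable(path: str = OVS_DB_SOCK, timeout: float = 1.0) -> bool:
    """Return True if something accepts connections on the unix socket at `path`."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return True
    except OSError:
        # Covers a missing socket file, a refused connection, and a timeout.
        return False
    finally:
        sock.close()
```

A liveness-style probe in the ovs pod could exit non-zero whenever this returns False, which would give the platform a reason to restart the pod instead of leaving it in Running.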

Comment 1 Ben Bennett 2018-05-30 18:52:42 UTC

The ovs-vswitchd and ovsdb-server both have monitoring processes that catch crashes due to abnormal termination.  If you send SIGSEGV to the monitored process, it will be restarted.

If all of the processes are killed with killall, the pod is not restarted.  The same is true in 3.9, because systemd also treats this as a normal termination and does not restart the process.
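
To make the distinction concrete, here is a minimal sketch of the restart decision a monitor of this kind might make. It assumes exit statuses in the style of Python's subprocess module, where death by signal N is reported as -N. The CRASH_SIGNALS set and the name should_restart are illustrative and are not taken from the OVS source.

```python
import signal

# Illustrative set of "crash" signals; the list used by the real monitor may differ.
CRASH_SIGNALS = {
    signal.SIGSEGV,
    signal.SIGABRT,
    signal.SIGBUS,
    signal.SIGFPE,
    signal.SIGILL,
}


def should_restart(returncode: int) -> bool:
    """Return True only when the child died from a crash-type signal.

    A non-negative return code is an ordinary exit. A negative one means the
    child was killed by signal -returncode.
    """
    if returncode >= 0:
        return False
    return -returncode in CRASH_SIGNALS
```

killall sends SIGTERM by default, so under this rule should_restart(-signal.SIGTERM) is False while should_restart(-signal.SIGSEGV) is True, which matches the behaviour described above.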

Comment 2 Meng Bo 2018-05-31 02:22:54 UTC

Yeah, it also won't restart in 3.9.

But I think we should handle this better, since the sdn and ovs services now both run in containers and no longer rely on systemd.

Comment 3 Casey Callendrello 2019-05-15 09:17:46 UTC

We fixed this in 3.10, 3.11, and 4.1 as part of another bug. Marking this as FIXED.