Bug 1579744 - Should be able to detect that ovs is down and recreate the ovs pod automatically
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
medium
low
Target Milestone: ---
Target Release: 3.11.0
Assignee: Ben Bennett
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-05-18 08:41 UTC by Meng Bo
Modified: 2020-05-18 13:13 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-05-15 09:17:46 UTC
Target Upstream Version:
Embargoed:


Description Meng Bo 2018-05-18 08:41:45 UTC
Description of problem:
Set up an OCP cluster; the ovs and sdn pods will be running on the node. If the ovs-related processes are killed for some reason, the sdn pod keeps restarting because the ovs db.sock has disappeared, while the ovs pod keeps running as if nothing were wrong.

Version-Release number of selected component (if applicable):
v3.10.0-0.47.0

How reproducible:
always

Steps to Reproduce:
1. Set up an OCP cluster
2. Kill the ovs-related processes on the node
killall ovsdb-server
killall ovs-vswitchd
3. Check the sdn and ovs pods on the node

Actual results:
The sdn pod keeps restarting, while the ovs pod appears healthy.

# oc get po 
NAME        READY     STATUS    RESTARTS   AGE
ovs-tvv9h   1/1       Running   0          54m
sdn-6dlnl   1/1       Running   7          54m

# oc logs sdn-6dlnl
I0518 08:14:20.592928   26616 node.go:292] Starting openshift-sdn network plugin
I0518 08:14:20.639228   26616 healthcheck.go:29] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I0518 08:14:21.641202   26616 healthcheck.go:29] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I0518 08:14:22.641369   26616 healthcheck.go:29] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I0518 08:14:23.641290   26616 healthcheck.go:29] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory

# oc logs ovs-tvv9h
/etc/openvswitch/conf.db does not exist ... (warning).
Creating empty database /etc/openvswitch/conf.db [  OK  ]
Starting ovsdb-server [  OK  ]
PMD: net_mlx4: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx4: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx4)
PMD: net_mlx5: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx5: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx5)
Configuring Open vSwitch system IDs [  OK  ]
Inserting openvswitch module [  OK  ]
PMD: net_mlx4: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx4: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx4)
PMD: net_mlx5: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx5: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx5)
Starting ovs-vswitchd [  OK  ]
Enabling remote OVSDB managers [  OK  ]


Expected results:
The problem in the ovs pod should be detected and the pod should be restarted.
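
One way such a detection could work is a check that dials the same socket the sdn pod is waiting on. The following is only a rough sketch, not the fix that was eventually shipped: the function name ovsdb_socket_alive is hypothetical, and the default path is simply the one shown in the sdn log above.

```python
import socket


def ovsdb_socket_alive(path="/var/run/openvswitch/db.sock", timeout=1.0):
    """Return True if something accepts connections on the unix socket at path.

    Hypothetical helper for illustration; the default path is taken from the
    sdn pod log in this report.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return True
    except OSError:
        # Covers a missing socket file, a refused connection, and a timeout.
        return False
    finally:
        sock.close()
```

A probe built on this would exit non-zero whenever the function returns False, so that the ovs pod is restarted instead of staying in Running state with no OVS processes behind it.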

Additional info:

Comment 1 Ben Bennett 2018-05-30 18:52:42 UTC
ovs-vswitchd and ovsdb-server each have a monitoring process that catches crashes caused by abnormal termination.  If you send SIGSEGV to the monitored process, it will be restarted.

If all of the processes are killed with killall, the pod is not restarted.  The process is not restarted in 3.9 either, because systemd also treats this as a normal termination.
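
The distinction can be sketched roughly as follows. This is an illustration, not the actual OVS monitor code: the function name is hypothetical, and the signal list is an assumption based on how the ovs-vswitchd(8) man page describes --monitor (signals that indicate a programming error). killall sends SIGTERM by default, which is not in that set.

```python
import signal

# Assumption: signals the OVS monitor treats as a crash, as recalled from the
# ovs-vswitchd(8) description of --monitor. Verify against your OVS version.
RESTART_SIGNALS = {
    signal.SIGABRT, signal.SIGALRM, signal.SIGBUS, signal.SIGFPE,
    signal.SIGILL, signal.SIGPIPE, signal.SIGSEGV, signal.SIGXCPU,
    signal.SIGXFSZ,
}


def monitor_would_restart(returncode):
    """Decide whether a crash monitor of this kind would restart the daemon.

    returncode follows the subprocess convention: a negative value -N means
    the process was killed by signal N. Hypothetical helper for illustration.
    """
    return returncode < 0 and -returncode in RESTART_SIGNALS
```

Under these assumptions, monitor_would_restart(-signal.SIGSEGV) is true while monitor_would_restart(-signal.SIGTERM) is false, which matches the behavior described in this comment.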

Comment 2 Meng Bo 2018-05-31 02:22:54 UTC
Yeah, it also won't restart in 3.9.

But I think we should handle this better, since the sdn and ovs services now all run in containers and no longer rely on systemd.

Comment 3 Casey Callendrello 2019-05-15 09:17:46 UTC
We fixed this in 3.10, 3.11, and 4.1 as part of another bug. Marking this as FIXED.

