Description of problem:
Set up an OCP cluster; the ovs and sdn pods will be running on the node. If the ovs-related processes are killed for some reason, the sdn pod keeps restarting because the ovs db.sock has disappeared, while the ovs pod keeps running as if nothing were wrong.

Version-Release number of selected component (if applicable):
v3.10.0-0.47.0

How reproducible:
always

Steps to Reproduce:
1. Set up an OCP cluster
2. Kill the ovs-related processes on the node
   killall ovsdb-server
   killall ovs-vswitchd
3. Check the sdn and ovs pods on the node

Actual results:
The sdn pod keeps restarting and the ovs pod looks healthy.

# oc get po
NAME        READY     STATUS    RESTARTS   AGE
ovs-tvv9h   1/1       Running   0          54m
sdn-6dlnl   1/1       Running   7          54m

# oc logs sdn-6dlnl
I0518 08:14:20.592928   26616 node.go:292] Starting openshift-sdn network plugin
I0518 08:14:20.639228   26616 healthcheck.go:29] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I0518 08:14:21.641202   26616 healthcheck.go:29] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I0518 08:14:22.641369   26616 healthcheck.go:29] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I0518 08:14:23.641290   26616 healthcheck.go:29] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory

# oc logs ovs-tvv9h
/etc/openvswitch/conf.db does not exist ... (warning).
Creating empty database /etc/openvswitch/conf.db [  OK  ]
Starting ovsdb-server [  OK  ]
PMD: net_mlx4: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx4: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx4)
PMD: net_mlx5: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx5: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx5)
Configuring Open vSwitch system IDs [  OK  ]
Inserting openvswitch module [  OK  ]
PMD: net_mlx4: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx4: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx4)
PMD: net_mlx5: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
PMD: net_mlx5: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx5)
Starting ovs-vswitchd [  OK  ]
Enabling remote OVSDB managers [  OK  ]

Expected results:
The problem should be detected on the ovs pod and the pod should be restarted.

Additional info:
The ovs-vswitchd and ovsdb-server both have monitoring processes that catch crashes due to abnormal termination. If you send SIGSEGV to the monitored process, it will be restarted. If all of the processes are killed with killall (which sends SIGTERM by default), that counts as a normal termination, so nothing is restarted and the pod is not restarted either. 3.9 behaves the same way, because systemd also considers it a normal termination and does not restart the process.
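
To illustrate the distinction between a crash and a normal termination, here is a minimal Python sketch of how a monitor might decide whether to restart a child. It is not the OVS monitor code. The names `CRASH_SIGNALS`, `should_restart` and `run_and_signal` are hypothetical, and the set of signals treated as crashes is an assumption made for this example.

```python
import signal
import subprocess
import sys

# Assumption for this sketch only: signals we treat as "crashes".
# SIGTERM is deliberately absent, so a plain killall is a normal exit.
CRASH_SIGNALS = {
    signal.SIGSEGV,
    signal.SIGABRT,
    signal.SIGBUS,
    signal.SIGFPE,
    signal.SIGILL,
}


def should_restart(returncode):
    """Return True if the child died from a crash-type signal."""
    # On POSIX, subprocess reports death by signal as a negative
    # return code equal to -signum.
    return returncode < 0 and -returncode in CRASH_SIGNALS


def run_and_signal(sig):
    """Start a sleeping child, send it `sig`, and return its exit status."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    child.send_signal(sig)
    return child.wait()


if __name__ == "__main__":
    for sig in (signal.SIGTERM, signal.SIGSEGV):
        rc = run_and_signal(sig)
        action = "restart" if should_restart(rc) else "leave stopped"
        print(sig.name, "->", action)
```

With this kind of policy, a child killed by SIGTERM stays down, which matches the behaviour seen after killall in the reproduction steps.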
Yeah, it also won't restart in 3.9. But I think we should handle this better, since the sdn and ovs services all run in containers now and no longer rely on systemd.
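
One way the ovs pod could notice this condition is a liveness-style check that tries to connect to the OVSDB socket, the same path the sdn pod is waiting on in the logs above. The Python sketch below is only an illustration of that idea, not the fix that was shipped. The function name `ovsdb_socket_alive` and the one-second timeout are assumptions.

```python
import socket

# Path taken from the sdn pod log in this report.
OVSDB_SOCK = "/var/run/openvswitch/db.sock"


def ovsdb_socket_alive(path=OVSDB_SOCK, timeout=1.0):
    """Return True if something is accepting connections on the socket."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return True
    except OSError:
        # Covers a missing socket file, a stale file with no listener
        # (connection refused), and a connect timeout.
        return False
    finally:
        sock.close()


if __name__ == "__main__":
    # A probe script would exit non-zero so the pod is marked unhealthy.
    raise SystemExit(0 if ovsdb_socket_alive() else 1)
```

A check like this only covers ovsdb-server. A killed ovs-vswitchd would need a separate check of its own.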
We fixed this in 3.10, 3.11, and 4.1 as part of another bug. Marking this as FIXED.