Description of problem:
Killing the ovs-vswitchd process sometimes causes some OVS OpenFlow flows to be lost.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-06-17-090638

How reproducible:
Intermittent. It is difficult to reproduce manually, but happens often in automation runs.

Steps to Reproduce:
1. Create a project.
2. Create 2 pods in the project.
3. On the node of one pod, kill the OVS process:
   pgrep ovs-vswitchd | xargs kill
4. After the new OVS process comes up, curl one pod from the other.

Actual Result:
$ oc project 42itc
$ oc get pods -o wide
NAME            READY   STATUS    RESTARTS   AGE     IP            NODE                                        NOMINATED NODE   READINESS GATES
test-rc-j8gnp   1/1     Running   0          6m      10.131.0.37   ip-10-0-67-181.us-east-2.compute.internal   <none>           <none>
test-rc-tlckr   1/1     Running   0          5m59s   10.129.2.22   ip-10-0-49-187.us-east-2.compute.internal   <none>           <none>

$ oc exec test-rc-tlckr -- curl 10.131.0.37:8080
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:02 --:--:--     0curl: (7) Failed to connect to 10.131.0.37 port 8080: Host is unreachable
command terminated with exit code 7

$ oc get netnamespaces 42itc
NAME    NETID      EGRESS IPS
42itc   15911775

$ printf '%x\n' 15911775
f2cb5f

Check the OVS flows in the SDN pod on the node where the OVS process was killed. The related flows are gone:

$ oc rsh -n openshift-sdn sdn-6xbx6
sh-4.2# ovs-ofctl dump-flows br0 -O openflow13 | grep f2cb5f
sh-4.2#

Expected Result:
The related flows should not be lost.
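For automation runs, the NETID check above could be wrapped in a small helper. This is only a sketch: the function names netid_to_hex and count_netid_flows are made up for illustration, and it assumes that a plain substring match on the hex NETID is good enough, as in the manual grep (it could also match an unrelated field that happens to contain the same digits).

```shell
#!/bin/sh
# Convert a decimal NETID (as printed by `oc get netnamespaces`) to the
# lowercase hex string that shows up in the br0 flow dump.
netid_to_hex() {
    printf '%x\n' "$1"
}

# Count the lines of a flow dump (read from stdin) that mention the NETID.
# Prints 0 when no flow references it; `|| true` hides grep's non-zero
# exit status in that case so the helper itself does not fail.
count_netid_flows() {
    grep -c "$(netid_to_hex "$1")" || true
}

# Intended use inside the sdn pod (not run here):
#   ovs-ofctl dump-flows br0 -O openflow13 | count_netid_flows 15911775
```

A result of 0 after the ovs-vswitchd restart would correspond to the empty grep output shown in the report.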
Logs: http://virt-openshift-05.lab.eng.nay.redhat.com/huirwang/bug1848374/
On 4.6, ovs-vswitchd is managed by systemd, and ovs-vswitchd.service is set to Restart=on-failure. According to https://www.freedesktop.org/software/systemd/man/systemd.service.html#Restart= "on-failure" does not restart the service after a clean exit, and systemd counts termination by SIGTERM (the default signal sent by kill and pkill) as a clean exit, so the service won't restart after pkill. It looks like ovs-vswitchd.service ships in RHCOS, so do we need to change it to Restart=always in RHCOS?
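To make the Restart= question concrete, here is a rough sketch of what an override could look like as a systemd drop-in. It is not a proposed fix: the drop-in file name and the helper write_restart_dropin are hypothetical, and on RHCOS a change like this would presumably have to come from the image or a MachineConfig rather than a manual edit on the node.

```shell
#!/bin/sh
# Write a drop-in that overrides Restart= for a unit.
# The target directory is a parameter so this can be tried against a
# scratch directory first; the file name is an arbitrary choice.
write_restart_dropin() {
    dropin_dir="$1"
    mkdir -p "$dropin_dir" || return 1
    printf '[Service]\nRestart=always\n' > "$dropin_dir/10-restart-always.conf"
}

# On a real host (as root), assuming the standard drop-in location:
#   write_restart_dropin /etc/systemd/system/ovs-vswitchd.service.d
#   systemctl daemon-reload
#   systemctl show -p Restart ovs-vswitchd.service
```

With Restart=always, systemd restarts the service after a clean exit as well, which covers the SIGTERM case from pkill.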
Also, since ovs-vswitchd is a systemd service unit on 4.6+, killing the process directly (pkill) is not a good idea; ovs-vswitchd should be managed with systemctl now. That said, we still need to investigate why flows are lost after pkill on versions < 4.6.
However, verification on 4.6 will still be blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1854801
*** This bug has been marked as a duplicate of bug 1852618 ***