Bug 1848374
| Summary: | Killing ovs-vswitchd cause some ovs openflows lost | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | huirwang |
| Component: | Networking | Assignee: | Daniel Mellado <dmellado> |
| Networking sub component: | openshift-sdn | QA Contact: | huirwang |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | medium | | |
| Priority: | unspecified | CC: | aconstan, anbhat, anusaxen, danw, mcambria, rbrattai, weliang, zzhao |
| Version: | 4.4 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| Clones: | 1855118 | Environment: | |
| Last Closed: | 2020-07-15 13:19:07 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1854801 | | |
| Bug Blocks: | | | |
On 4.6, ovs-vswitchd is managed by systemd, and ovs-vswitchd.service is set to Restart=on-failure. According to https://www.freedesktop.org/software/systemd/man/systemd.service.html#Restart= "on-failure" does not restart the service after a clean exit, and systemd counts termination by SIGTERM (the default signal sent by pkill and kill) as a clean exit, so the service is not restarted after pkill.

It looks like ovs-vswitchd.service is part of RHCOS, so do we need to change it to Restart=always in RHCOS?

Also, since ovs-vswitchd is a systemd service unit on 4.6+, killing it with pkill is not a good test; it should be stopped and restarted with systemctl instead. That said, we still need to investigate why flows are lost on versions before 4.6 after pkill.

This will still block verification on 4.6, though, due to https://bugzilla.redhat.com/show_bug.cgi?id=1854801

*** This bug has been marked as a duplicate of bug 1852618 ***
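The restart reasoning above follows the Restart= table in systemd.service(5). The sketch below is a simplified Python model of that table, not systemd's own code: it ignores SuccessExitStatus=, RestartPreventExitStatus=, RestartForceExitStatus= and the Type=oneshot special case, and the names `RESTART_TABLE`, `classify_exit` and `should_restart` are hypothetical. It shows why a SIGTERM from kill/pkill leaves the service down under Restart=on-failure but brings it back under Restart=always.

```python
import signal

# Signals systemd treats as a clean termination for non-oneshot services
# (per systemd.service(5)): SIGHUP, SIGINT, SIGTERM, SIGPIPE.
CLEAN_SIGNALS = {signal.SIGHUP, signal.SIGINT, signal.SIGTERM, signal.SIGPIPE}

# Exit causes that trigger a restart for each Restart= value,
# transcribed from the table in systemd.service(5).
# "timeout" and "watchdog" are listed for completeness only;
# classify_exit() below only produces the first three causes.
RESTART_TABLE = {
    "no": set(),
    "always": {"clean", "unclean-exit", "unclean-signal", "timeout", "watchdog"},
    "on-success": {"clean"},
    "on-failure": {"unclean-exit", "unclean-signal", "timeout", "watchdog"},
    "on-abnormal": {"unclean-signal", "timeout", "watchdog"},
    "on-abort": {"unclean-signal"},
    "on-watchdog": {"watchdog"},
}


def classify_exit(exit_code=None, sig=None):
    """Map a process exit (exit code or terminating signal) to a cause."""
    if sig is not None:
        return "clean" if sig in CLEAN_SIGNALS else "unclean-signal"
    if exit_code is None:
        raise ValueError("pass either exit_code or sig")
    return "clean" if exit_code == 0 else "unclean-exit"


def should_restart(policy, exit_code=None, sig=None):
    """Return True if the modelled Restart= policy would restart the unit."""
    return classify_exit(exit_code, sig) in RESTART_TABLE[policy]
```

For example, `should_restart("on-failure", sig=signal.SIGTERM)` is `False` while `should_restart("always", sig=signal.SIGTERM)` is `True`; a SIGKILL (`kill -9`) would be restarted under either policy.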
Description of problem: Sometimes killing the ovs-vswitchd process causes some OVS OpenFlow flows to be lost.

Version-Release number of selected component (if applicable): 4.4.0-0.nightly-2020-06-17-090638

How reproducible: Intermittent. It is difficult to reproduce manually, but it happens often in automation runs.

Steps to Reproduce:

1. Create a project.
2. Create 2 pods in the project.
3. On the node running one of the pods, kill the ovs-vswitchd process: `pgrep ovs-vswitchd | xargs kill`
4. After the new ovs-vswitchd process comes up, curl one pod from the other.

Actual Result:

`oc project 42itc`

`oc get pods -o wide`

| NAME | READY | STATUS | RESTARTS | AGE | IP | NODE | NOMINATED NODE | READINESS GATES |
|---|---|---|---|---|---|---|---|---|
| test-rc-j8gnp | 1/1 | Running | 0 | 6m | 10.131.0.37 | ip-10-0-67-181.us-east-2.compute.internal | `<none>` | `<none>` |
| test-rc-tlckr | 1/1 | Running | 0 | 5m59s | 10.129.2.22 | ip-10-0-49-187.us-east-2.compute.internal | `<none>` | `<none>` |

`oc exec test-rc-tlckr -- curl 10.131.0.37:8080`

`curl: (7) Failed to connect to 10.131.0.37 port 8080: Host is unreachable`

`command terminated with exit code 7`

`oc get netnamespaces 42itc`

| NAME | NETID | EGRESS IPS |
|---|---|---|
| 42itc | 15911775 | |

The NETID in hex is `f2cb5f` (`printf '%x\n' 15911775`).

Checking the OVS flows from the sdn pod on the node where ovs-vswitchd was killed shows that the flows for this NETID are gone; the grep returns nothing:

`oc rsh -n openshift-sdn sdn-6xbx6`

`sh-4.2# ovs-ofctl dump-flows br0 -O openflow13 | grep f2cb5f`

Expected Result: The flows for the project should not be lost.
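The NETID-to-hex conversion and the flow check from the reproduction steps can be expressed as a small helper, which may be useful when scripting the automation check. This is a hypothetical sketch: the function names are illustrative, and it assumes the output of `ovs-ofctl dump-flows br0 -O openflow13` has already been captured as text from the sdn pod.

```python
from typing import List


def netid_to_hex(netid: int) -> str:
    """Return the NETID as lowercase hex without a 0x prefix,
    the same text that printf '%x' prints."""
    return format(netid, "x")


def flows_for_netid(dump: str, netid: int) -> List[str]:
    """Return the lines of a dump-flows output that mention the NETID's hex value.

    Like the grep in the report, this is a plain substring match, so it can
    also match unrelated values that happen to contain the same hex digits.
    """
    needle = netid_to_hex(netid)
    return [line for line in dump.splitlines() if needle in line]


def netid_flows_missing(dump: str, netid: int) -> bool:
    """True when no flow references the NETID, i.e. the failure in this bug."""
    return not flows_for_netid(dump, netid)
```

For the project in this report, `netid_to_hex(15911775)` returns `"f2cb5f"`, and `netid_flows_missing(dump, 15911775)` returning `True` corresponds to the empty grep result above.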