1848374 – Killing ovs-vswitchd cause some ovs openflows lost

Summary: Killing ovs-vswitchd cause some ovs openflows lost

Keywords:
Status:	CLOSED DUPLICATE of bug 1852618
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Daniel Mellado
QA Contact:	huirwang
Docs Contact:
URL:
Whiteboard:
Depends On:	1854801
Blocks:

Reported:	2020-06-18 09:06 UTC by huirwang
Modified:	2020-07-15 13:19 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Clones:	1855118
Environment:
Last Closed:	2020-07-15 13:19:07 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-network-operator pull 683	0	None	closed	Bug 1848374: Delete flows.sh after restore.	2020-11-30 19:35:44 UTC
Github	openshift sdn pull 158	0	None	closed	Add support for --may-exist when adding the bridge in sdn	2020-11-30 19:35:17 UTC

Description huirwang 2020-06-18 09:06:01 UTC

Description of problem:
Killing the ovs-vswitchd process sometimes causes some OVS OpenFlow flows to be lost.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-06-17-090638

How reproducible:
Intermittent. It is difficult to reproduce manually, but it happens often in automation runs.

Steps to Reproduce:
1. Create a project
2. Create 2 pods in the project.
3. On the node hosting one of the pods, kill the ovs-vswitchd process:
pgrep ovs-vswitchd | xargs kill
4. After the new ovs-vswitchd process comes up, curl from one pod to the other pod.

Actual Result:

oc project 42itc

oc get pods -o wide
NAME            READY   STATUS    RESTARTS   AGE     IP            NODE                                        NOMINATED NODE   READINESS GATES
test-rc-j8gnp   1/1     Running   0          6m      10.131.0.37   ip-10-0-67-181.us-east-2.compute.internal   <none>           <none>
test-rc-tlckr   1/1     Running   0          5m59s   10.129.2.22   ip-10-0-49-187.us-east-2.compute.internal   <none>           <none>

oc exec test-rc-tlckr -- curl 10.131.0.37:8080
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:02 --:--:--     0
curl: (7) Failed to connect to 10.131.0.37 port 8080: Host is unreachable
command terminated with exit code 7

oc get netnamespaces 42itc 
NAME    NETID      EGRESS IPS
42itc   15911775

printf '%x\n'  15911775
f2cb5f

Check the OVS OpenFlow flows in the sdn pod on the node where the ovs-vswitchd process was killed. The flows for this namespace are missing.

oc rsh -n openshift-sdn sdn-6xbx6 
sh-4.2# ovs-ofctl dump-flows br0 -O openflow13  | grep f2cb5f
sh-4.2# 
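
The manual check above (convert the NetID to hex, then grep the flow dump for it) could be wrapped in two small helpers. This is only a sketch: the function names `netid_to_hex` and `flows_mention_netid` are made up for illustration and are not part of openshift-sdn or OVS. Like the plain grep in the report, it is a substring match on the dump text.

```shell
# Convert a decimal NetID to the lowercase hex string that shows up in the flow dump.
netid_to_hex() {
    printf '%x\n' "$1"
}

# Read a flow dump on stdin; succeed (exit 0) if any line mentions the NetID in hex.
# Hypothetical helper: a plain substring match, same as the grep used in the report.
flows_mention_netid() {
    grep -qi -- "$(netid_to_hex "$1")"
}

# Usage inside the sdn pod on the affected node:
#   ovs-ofctl dump-flows br0 -O openflow13 | flows_mention_netid 15911775 \
#       || echo "flows for NetID 15911775 are missing"
```

A non-zero exit status from `flows_mention_netid` corresponds to the empty grep output shown above.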


Expected Result:
The related OpenFlow flows should not be lost.

Comment 2 huirwang 2020-06-18 10:50:51 UTC

Logs: http://virt-openshift-05.lab.eng.nay.redhat.com/huirwang/bug1848374/

Comment 12 Ross Brattain 2020-07-08 16:44:07 UTC

On 4.6, ovs-vswitchd is managed by systemd, and ovs-vswitchd.service is set to Restart=on-failure.

According to
https://www.freedesktop.org/software/systemd/man/systemd.service.html#Restart=

"on-failure" does not restart the service after a clean exit, and termination by SIGTERM counts as a clean exit, so the service won't restart after pkill.

It looks like ovs-vswitchd.service is shipped in RHCOS, so do we need to change it to Restart=always in RHCOS?
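
To make the restart rule concrete, here is a rough sketch of how Restart=on-failure treats termination by a signal, based on the systemd.service documentation linked above. The function name `restarts_with_on_failure` is hypothetical, and the sketch assumes the unit does not override the defaults (for example with SuccessExitStatus= or RestartPreventExitStatus=).

```shell
# Succeed (exit 0) if a unit with Restart=on-failure would be restarted after
# its main process is terminated by the given signal (with or without "SIG").
# Assumption: default unit settings. systemd treats SIGHUP, SIGINT, SIGTERM
# and SIGPIPE as a clean exit, so those do not trigger a restart.
restarts_with_on_failure() {
    case "${1#SIG}" in
        HUP|INT|TERM|PIPE) return 1 ;;  # clean exit: no restart
        *) return 0 ;;                  # any other signal: unclean, restart
    esac
}

# kill and pkill send SIGTERM by default, hence no restart:
#   restarts_with_on_failure TERM || echo "no restart after plain kill/pkill"
# The policy actually configured on a node can be inspected with:
#   systemctl show -p Restart ovs-vswitchd
```

With Restart=always the service would be restarted regardless of how the process ended.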

Comment 13 Anurag saxena 2020-07-08 17:10:55 UTC

Also, since ovs-vswitchd is a systemd service unit on 4.6+, killing the process directly (pkill) is not a good idea there; ovs-vswitchd should be managed with systemctl now. That said, we still need to investigate why flows are lost after pkill on versions < 4.6.

Comment 24 Anurag saxena 2020-07-10 16:10:45 UTC

This will still block verification on 4.6, though, due to https://bugzilla.redhat.com/show_bug.cgi?id=1854801

Comment 25 Aniket Bhat 2020-07-15 13:19:07 UTC


*** This bug has been marked as a duplicate of bug 1852618 ***
