Bug 1595092
| Summary: | Creating and deleting a large number of pods causes ovs errors | ||||||
|---|---|---|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | sfu <sfu> | ||||
| Component: | Networking | Assignee: | Ben Bennett <bbennett> | ||||
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> | ||||
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |||||
| Severity: | high | ||||||
| Priority: | high | CC: | abodhe, aconole, aos-bugs, bbennett, bleanhar, cdc, danw, ekuric, erich, fleitner, jtanenba, mifiedle, pasik, rbost, scuppett, sfu, zzhao | ||||
| Version: | 3.9.0 | Keywords: | NeedsTestCase | ||||
| Target Milestone: | --- | ||||||
| Target Release: | 3.11.z | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2019-03-26 18:18:08 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
I was able to reproduce the error messages in the description:

|WARN|could not open network device veth389bfe8b (No such device)
|WARN|could not add network device veth90f0d4ed to ofproto (No such device)
NetworkPlugin cni failed to set up pod "deploymentconfig1-1-wvmnl_svt" network: CNI request failed with status 400: 'error on port vethd9ba5ea3: "could not open network device vethd9ba5ea3 (No such device)"

These messages are not the core problem - they are a result of deleting the dc before the pods/containers are fully initialized. As long as the ports are cleaned up, things are operating correctly.

However, I was not able to reproduce the reported condition where OVS failed to delete the container interface ports. In my test, each time the dc was deleted, all ports were removed. I will attach the script and dc used to test this.

OCP 3.9.33 with openvswitch 2.8 and 2.7:
openvswitch.x86_64 1:2.8.0-4.el7fdb
openvswitch.x86_64 1:2.7.0-8.git20170530.el7fdb

Going to let this run overnight, but so far, the condition where ports are not removed has not been reproduced.

Created attachment 1459283 [details]
Script and deployment config
The test has been running for about 20 hours now without seeing ports left in a bad state in OVS - they continue to be cleaned up with each deletion of the dc. I do still see the error messages in the description and in comment 6, but they are not affecting cluster operation.

(In reply to Mike Fiedler from comment #8)
> The test has been running for about 20 hours now without seeing ports left
> in a bad state in OVS - they continue to be cleaned up with each deletion of
> the dc. I do still see the error messages in the description and in
> comment 6, but they are not affecting cluster operation.

Thanks for taking the time to reproduce this. If you look at the sosreport from the customer production server (sosreport-node1.szcsmanager.cmbchina.net), "ovs-vsctl show" reports 2273 ports on the bridge. Most of them are in an error state like the following, because the container is no longer running:

----------
Port "vethb66c0018"
    Interface "vethb66c0018"
        error: "could not add network device vethb66c0018 to ofproto (No such device)"
Port "vethfd31e168"
    Interface "vethfd31e168"
        error: "could not open network device vethfd31e168 (No such device)"
--------

It looks like ovsdb performance may not be sufficient, but I'm not sure about that. I found that CNI is using the commands below to create / delete ovs ports.
Jul 17 22:40:39 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br0 veth75304f6a -- set Interface veth75304f6a "external-ids=sandbox=\"5b984c16b456ac76698dc93566b8a84ff053d8646467b848551c3f67ccc1953e\",ip=\"10.128.0.34\""
Jul 17 22:40:39 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists clear port veth75304f6a qos
Jul 17 22:40:39 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists destroy qos
Jul 17 22:40:40 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br0 veth3c30dd21 -- set Interface veth3c30dd21 "external-ids=sandbox=\"25c5d987c5f7ec0e44546723b2bc5d176b983a45b0c32427a916c9770e335e60\",ip=\"10.128.0.35\""
Jul 17 22:40:40 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists clear port veth3c30dd21 qos
Jul 17 22:40:40 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists destroy qos
Jul 17 22:40:43 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br0 veth00c05ab7 -- set Interface veth00c05ab7 "external-ids=sandbox=\"c7a3dcd25910075bafb0e572e5fff353dfa45a28432942631cd2f9a633907e1f\",ip=\"10.128.0.36\""
Jul 17 22:40:43 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists clear port veth00c05ab7 qos
Jul 17 22:40:43 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists destroy qos
Jul 17 22:40:54 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br0 veth847a495b -- set Interface veth847a495b "external-ids=sandbox=\"38a310de0687703edbfd667e08a6b55e5a0b9af363fa90149fc6f17899b1a924\",ip=\"10.128.0.37\""
Jul 17 22:40:54 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists clear port veth847a495b qos
Jul 17 22:40:54 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists destroy qos
Jul 17 22:50:41 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists clear port veth75304f6a qos
Jul 17 22:50:41 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists destroy qos
Jul 17 22:50:41 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists del-port br0 veth75304f6a
Jul 17 22:50:42 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists clear port veth3c30dd21 qos
Jul 17 22:50:42 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists destroy qos
Jul 17 22:50:42 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists del-port br0 veth3c30dd21
Jul 17 22:50:43 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists clear port veth00c05ab7 qos
Jul 17 22:50:43 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists destroy qos
Jul 17 22:50:43 ocp39 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --if-exists del-port br0 veth00c05ab7

I'm trying to write a script to fire these commands directly at the vswitch and see what happens.

OK, I think there may be an actual OVS problem here (as opposed to an OpenShift's-use-of-OVS problem).
For veth2d159069, in the OpenShift logs we see:
May 29 16:41:14 node1... atomic-openshift-node[17092]: I0529 16:41:14.829786 17092 kubelet.go:1837] SyncLoop (ADD, "api"): "csdemo-7fdf8fdb64-748jr_yxw-prod(0c14beb3-631c-11e8-b361-005056947c2e)"
May 29 16:41:15 node1... atomic-openshift-node[17092]: I0529 16:41:15.398903 17092 kubelet.go:1853] SyncLoop (DELETE, "api"): "csdemo-7fdf8fdb64-748jr_yxw-prod(0c14beb3-631c-11e8-b361-005056947c2e)"
May 29 16:41:15 node1.... ovs-vsctl[90114]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br0 veth2d159069 -- set Interface veth2d159069 "external-ids=sandbox=\"ccad598ff4218127ef969954c848e263cdc19d00d26d52414e72c71281007e5f\",ip=\"10.130.0.99\""
May 29 16:41:15 node1... atomic-openshift-node[17092]: I0529 16:41:15.948196 17092 kubelet_pods.go:1118] Killing unwanted pod "csdemo-7fdf8fdb64-748jr"
May 29 16:41:25 node1... atomic-openshift-node[17092]: E0529 16:41:25.586803 17092 cni.go:259] Error adding network: CNI request failed with status 400: 'error on port veth2d159069: "could not open network device veth2d159069 (No such device)"
So, kubernetes starts creating the pod, then decides it needs to destroy it instead, and starts destroying it while it's in the middle of creating it, because kubernetes is like that. The "ovs-vsctl add-port" call fails because by the time it reaches OVS, the veth device has already been destroyed. Because "ovs-vsctl add-port" failed, OpenShift does not call "ovs-vsctl del-port" or do any other sort of cleanup on the OVS side.
But the "ovs-vsctl show" output shows that interface as being attached to the bridge (in an error state), and ovs-vswitchd is still spewing errors about that device to ovs-vswitchd.log multiple times a second when the sosreport is generated. (Note that ovs-vswitchd.log and the journal output seem to be in different timezones.)
Likewise for veth61060184 which has a different error ("could not add network device veth61060184 to ofproto (No such device)"). Kubernetes started creating the pod and then started deleting it while it was still being created. It looks like the time between creation and deletion was a bit longer in that case, which may explain the different error. In any case, from OpenShift's perspective, it's still the same: ovs-vsctl returned an error, so we don't do any cleanup, and we expect OVS to have forgotten about the interface, but apparently it hasn't. (Are we wrong?)
(For veth2d319e8e, that is completely unrelated and isn't really an OVS issue at all; the problem there is that docker got restarted for some reason, which kills all of the containers on the node, and in the process of ensuring that everything is cleaned up correctly, kubernetes attempts to delete a network interface that the kernel already deleted. But then it figures things out and successfully recreates the pod; there's no persistent problem in that case.)
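The cleanup gap described above (an "add-port" that fails is never followed by a "del-port") suggests a simple defensive pattern on the caller's side. Here is a minimal, hypothetical sketch of that pattern; the function name and the injectable `run` callable are illustrative and not the actual openshift-sdn code, which lives in the PR referenced later in this bug:

```python
import subprocess

def add_port_with_cleanup(bridge, port, run=subprocess.run):
    """Try to attach `port` to `bridge`; if ovs-vsctl reports an error,
    defensively delete the port so a stale row cannot linger in ovsdb
    and keep ovs-vswitchd retrying it forever.

    `run` is injectable so the control flow can be tested without a
    live Open vSwitch installation.
    """
    result = run(["ovs-vsctl", "--may-exist", "add-port", bridge, port])
    if result.returncode != 0:
        # add-port may have created the Port row in the database before
        # failing; --if-exists makes this a no-op when it did not.
        run(["ovs-vsctl", "--if-exists", "del-port", bridge, port])
        return False
    return True
```

The key point is that the del-port runs even though add-port "failed", precisely because (as discussed below) the failed add can still leave the port in the database.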
Please attach the output of 'ovsdb-tool -m show-log' after having reproduced the issue. That output will tell us what OVS was instructed to do, how, and when.

Also note that ovs by default waits for vswitchd to reconfigure before returning, so openshift (if using the ovs-vsctl tool) would be executing synchronous operations. However, if OpenShift is passing --no-wait, then the command returns right after the database is updated, not when the operation has actually completed.

fbl

BTW, regarding the OVS operation itself: ovs-vsctl's task is to configure the database. It has no knowledge of the interfaces in the system.
It does echo ovs-vswitchd error reports when --no-wait is not used, but it is up to the caller to fix the database or do whatever is necessary.
See this example:
[root@plex ~]# ovs-vsctl add-port ovsbr0 foo
ovs-vsctl: Error detected while setting up 'foo': could not open network device foo (No such device). See ovs-vswitchd log for details.
ovs-vsctl: The default log directory is "/var/log/openvswitch".
[root@plex ~]# ovs-vsctl show
36a1b8c2-9512-4f38-a9b1-2328ea8832e2
    Bridge "ovsbr0"
        Port foo
            Interface foo
                error: "could not open network device foo (No such device)"
        Port "ovsbr0"
            Interface "ovsbr0"
                type: internal
[root@plex ~]# ip link add foo type veth peer name veth10
[root@plex ~]# ovs-vsctl show
36a1b8c2-9512-4f38-a9b1-2328ea8832e2
    Bridge "ovsbr0"
        Port foo
            Interface foo
        Port "ovsbr0"
            Interface "ovsbr0"
                type: internal
fbl
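Since the error annotations survive in "ovs-vsctl show" output, the ports stuck in this state can be found by scanning that output. A rough sketch of such a scan; the parsing assumes the indented Port/Interface/error layout shown above, and the sample text is fabricated for illustration:

```python
import re

def ports_in_error(show_output):
    """Return {port_name: error_message} for every Port stanza in
    `ovs-vsctl show` output that carries an 'error:' annotation."""
    current_port = None
    errors = {}
    for line in show_output.splitlines():
        line = line.strip()
        # Port names may or may not be quoted ('Port foo' vs 'Port "br0"').
        m = re.match(r'Port "?([^"]+)"?$', line)
        if m:
            current_port = m.group(1)
        elif line.startswith('error:') and current_port:
            errors[current_port] = line[len('error:'):].strip().strip('"')
    return errors

sample = '''
    Bridge "br0"
        Port "vethb66c0018"
            Interface "vethb66c0018"
                error: "could not add network device vethb66c0018 to ofproto (No such device)"
        Port "br0"
            Interface "br0"
                type: internal
'''
print(ports_in_error(sample))
# → {'vethb66c0018': 'could not add network device vethb66c0018 to ofproto (No such device)'}
```

A more robust tool would query ovsdb directly instead of scraping text, which is what the script later in this bug does.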
(In reply to Flavio Leitner from comment #13)
> It does echo ovs-vswitchd error reports when --no-wait is not used, but it
> is up to the caller to fix the database or do whatever is necessary.

Huh... OK, I can see where being able to add the interface to the bridge before it exists might be useful, but it's pretty confusing that it returns an error despite having "succeeded". So I guess we need to fix that part in OpenShift, but there still seems to be some problem with OVS getting stuck in some sort of loop, constantly re-logging errors about those ports.

(In reply to Dan Winship from comment #14)
> So I guess we need to fix that part in OpenShift

https://github.com/openshift/origin/pull/20670

OVS allows you to add ports to the database that may become active in the future. The vswitchd will complain that they don't exist. Perhaps you mean that this error should be ratelimited? Regardless, there isn't anything wrong with OvS here - it allows those non-existent ports to exist in the database because:
1. They could disappear at any time and the software needs to handle that
2. The initialization order could be such that ovs is up before the networking is up (although, that shouldn't be the case on rhel).
The issue is an 'add' was not followed up with a 'delete' - and that means the port will live on in the configuration forever. There might be ways that we can change the software to mitigate this kind of behaviour (for instance, maybe we have a task that can delete non-existent ports) and/or make the log a bit less spew-y.
As an example, here's a quick and dirty python script that will clean up non-existent interfaces. I haven't really tested it well.
---- 8< ----
import netifaces
import sys
import time

from ovs.db import idl
from ovs import jsonrpc
from ovs.poller import Poller
from ovs.stream import Stream


class OVSDBException(Exception):
    pass


class OVSDB(object):
    @staticmethod
    def wait_for_db_change(idl):
        seq = idl.change_seqno
        stop = time.time() + 10
        while idl.change_seqno == seq and not idl.run():
            poller = Poller()
            idl.wait(poller)
            poller.block()
            if time.time() >= stop:
                raise Exception('Retry Timeout')

    def __init__(self, db_sock):
        self._db_sock = db_sock
        self._txn = None
        schema = self._get_schema()
        schema.register_all()
        self._idl_conn = idl.Idl(db_sock, schema)
        OVSDB.wait_for_db_change(self._idl_conn)  # Initial Sync with DB

    def _get_schema(self):
        error, strm = Stream.open_block(Stream.open(self._db_sock))
        if error:
            raise Exception("Unable to connect to %s" % self._db_sock)
        rpc = jsonrpc.Connection(strm)
        req = jsonrpc.Message.create_request('get_schema', ['Open_vSwitch'])
        error, resp = rpc.transact_block(req)
        rpc.close()
        if error or resp.error:
            raise Exception('Unable to retrieve schema.')
        return idl.SchemaHelper(None, resp.result)

    def get_table(self, table_name):
        return self._idl_conn.tables[table_name]

    def _start_txn(self):
        if self._txn is not None:
            raise OVSDBException("ERROR: A transaction was started already")
        self._idl_conn.change_seqno += 1
        self._txn = idl.Transaction(self._idl_conn)
        return self._txn

    def _complete_txn(self, try_again_fn):
        if self._txn is None:
            raise OVSDBException("ERROR: Not in a transaction")
        status = self._txn.commit_block()
        if status is idl.Transaction.TRY_AGAIN:
            if self._idl_conn._session.rpc.status != 0:
                self._idl_conn.force_reconnect()
                OVSDB.wait_for_db_change(self._idl_conn)
            self._txn = None  # let the retry open a fresh transaction
            return try_again_fn(self)
        elif status is idl.Transaction.ERROR:
            return False

    def _find_row(self, table_name, find):
        return next(
            (row for row in self.get_table(table_name).rows.values()
             if find(row)), None)

    def _find_row_by_name(self, table_name, value):
        return self._find_row(table_name, lambda row: row.name == value)

    def Ports(self):
        return self.get_table('Port').rows.values()

    def Bridges(self):
        return self.get_table('Bridge').rows.values()

    def destroy_port(self, port_name):
        txn = self._start_txn()
        br = None
        for br_row in self.Bridges():
            for port in br_row.ports:
                if port.name == port_name:
                    br = br_row
        if br is None:
            print("unable to find bridge for port '%s'" % port_name)
            return
        ports = [port for port in br.ports if port.name != port_name]
        br.ports = ports

        def try_again(db_entity):
            db_entity.destroy_port(port_name)

        txn.add_comment("ovs-purge: destroy_port=%s" % (port_name))
        status = self._complete_txn(try_again)
        if status is False:
            raise OVSDBException('unable to delete Port %s: %s' %
                                 (port_name, txn.get_error()))
        self._txn = None


def argv_tuples(lst):
    cur, nxt = iter(lst), iter(lst)
    next(nxt, None)
    try:
        while True:
            yield next(cur), next(nxt, None)
    except StopIteration:
        pass


def main():
    db_sock = 'unix:/var/run/openvswitch/db.sock'
    for cur, nxt in argv_tuples(sys.argv[1:]):
        if cur in ['--db-sock']:
            db_sock = nxt
    ovsdb = OVSDB(db_sock)
    ifs = netifaces.interfaces()
    print(ifs)
    for port in ovsdb.Ports():
        if port.name not in ifs:
            print("Found bad port '%s'" % port.name)
            ovsdb.destroy_port(port.name)


if __name__ == '__main__':
    main()
---- >8 ----
Does it make sense?
(In reply to Aaron Conole from comment #16)
> OVS allows you to add ports that may become active in the future to the
> database. The vswitchd will complain that they exist. Perhaps you mean
> that this error should be ratelimited? Regardless there isn't anything
> wrong with OvS here

To clarify: it is apparently spewing errors about every non-existent port, multiple times a second, apparently forever (see the ovs-vswitchd.log greps in the initial comment).

(In reply to Dan Winship from comment #17)
> To clarify: it is apparently spewing errors about every non-existent port,
> multiple times a second, apparently forever (see the ovs-vswitchd.log greps
> in the initial comment)

Because each time the vswitchd has to reconfigure itself, it will try to rebuild according to what is in the database. If the problematic interface is there, the error will be noted.

fbl

I tested this bug on v3.11.0-0.32.0; this issue can still be reproduced. Steps:

1. Create an rc with this json:
oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/networking/list_for_pods.json
oc scale rc test-rc --replicas=100

2. ssh into the node the pods are scheduled on and delete the docker containers manually:
while true ; do docker ps | grep test-rc | awk '{ print $1 }' | xargs docker rm -f ; done

3. After running for 10 minutes, check that the ovs ports and pod containers do not match:
sh-4.2# ovs-vsctl list-ports br0 | grep veth | wc -l
28
# ip a | grep veth | awk '{ print $2}' | awk -F@ '{ print $1}' | sort | wc -l
23

4. Check the log on the node:
cat /var/log/messages | grep "veth57be0e6f"
Sep 12 05:56:57 ip-172-18-8-246 NetworkManager[5100]: <info> [1536746217.8748] device (veth57be0e6f): carrier: link connected
Sep 12 05:56:57 ip-172-18-8-246 NetworkManager[5100]: <info> [1536746217.8760] manager: (veth57be0e6f): new Veth device (/org/freedesktop/NetworkManager/Devices/188)
Sep 12 05:57:03 ip-172-18-8-246 atomic-openshift-node: E0912 05:57:03.425374 8149 cni.go:260] Error adding network: CNI request failed with status 400: 'error on port veth57be0e6f: "could not open network device veth57be0e6f (No such device)"
Sep 12 05:57:03 ip-172-18-8-246 atomic-openshift-node: E0912 05:57:03.425415 8149 cni.go:228] Error while adding to cni network: CNI request failed with status 400: 'error on port veth57be0e6f: "could not open network device veth57be0e6f (No such device)"
Sep 12 05:57:05 ip-172-18-8-246 atomic-openshift-node: E0912 05:57:05.579413 8149 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = [failed to set up sandbox container "69de5646d393d0715a425e54dd50f939fc002e8fecf92944a33ae6803caf052f" network for pod "test-rc-v6c69": NetworkPlugin cni failed to set up pod "test-rc-v6c69_z1" network: CNI request failed with status 400: 'error on port veth57be0e6f: "could not open network device veth57be0e6f (No such device)"
Sep 12 05:57:05 ip-172-18-8-246 atomic-openshift-node: E0912 05:57:05.579470 8149 kuberuntime_sandbox.go:56] CreatePodSandbox for pod "test-rc-v6c69_z1(061b9cc1-b671-11e8-b93b-0ea178bb84b2)" failed: rpc error: code = Unknown desc = [failed to set up sandbox container "69de5646d393d0715a425e54dd50f939fc002e8fecf92944a33ae6803caf052f" network for pod "test-rc-v6c69": NetworkPlugin cni failed to set up pod "test-rc-v6c69_z1" network: CNI request failed with status 400: 'error on port veth57be0e6f: "could not open network device veth57be0e6f (No such device)"
Sep 12 05:57:05 ip-172-18-8-246 atomic-openshift-node: E0912 05:57:05.579498 8149 kuberuntime_manager.go:646] createPodSandbox for pod "test-rc-v6c69_z1(061b9cc1-b671-11e8-b93b-0ea178bb84b2)" failed: rpc error: code = Unknown desc = [failed to set up sandbox container "69de5646d393d0715a425e54dd50f939fc002e8fecf92944a33ae6803caf052f" network for pod "test-rc-v6c69": NetworkPlugin cni failed to set up pod "test-rc-v6c69_z1" network: CNI request failed with status 400: 'error on port veth57be0e6f: "could not open network device veth57be0e6f (No such device)"
Sep 12 05:57:05 ip-172-18-8-246 atomic-openshift-node: E0912 05:57:05.579589 8149 pod_workers.go:186] Error syncing pod 061b9cc1-b671-11e8-b93b-0ea178bb84b2 ("test-rc-v6c69_z1(061b9cc1-b671-11e8-b93b-0ea178bb84b2)"), skipping: failed to "CreatePodSandbox" for "test-rc-v6c69_z1(061b9cc1-b671-11e8-b93b-0ea178bb84b2)" with CreatePodSandboxError: "CreatePodSandbox for pod \"test-rc-v6c69_z1(061b9cc1-b671-11e8-b93b-0ea178bb84b2)\" failed: rpc error: code = Unknown desc = [failed to set up sandbox container \"69de5646d393d0715a425e54dd50f939fc002e8fecf92944a33ae6803caf052f\" network for pod \"test-rc-v6c69\": NetworkPlugin cni failed to set up pod \"test-rc-v6c69_z1\" network: CNI request failed with status 400: 'error on port veth57be0e6f: \"could not open network device veth57be0e6f (No such device)\"\n', failed to stop sandbox container \"69de5646d393d0715a425e54dd50f939fc002e8fecf92944a33ae6803caf052f\" for pod \"test-rc-v6c69\": Error response from daemon: No such container: 69de5646d393d0715a425e54dd50f939fc002e8fecf92944a33ae6803caf052f]"
Sep 12 05:57:05 ip-172-18-8-246 atomic-openshift-node: I0912 05:57:05.579936 8149 server.go:470] Event(v1.ObjectReference{Kind:"Pod", Namespace:"z1", Name:"test-rc-v6c69", UID:"061b9cc1-b671-11e8-b93b-0ea178bb84b2", APIVersion:"v1", ResourceVersion:"91472", FieldPath:""}): type: 'Warning' reason: 'FailedCreatePodSandBox' Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "69de5646d393d0715a425e54dd50f939fc002e8fecf92944a33ae6803caf052f" network for pod "test-rc-v6c69": NetworkPlugin cni failed to set up pod "test-rc-v6c69_z1" network: CNI request failed with status 400: 'error on port veth57be0e6f: "could not open network device veth57be0e6f (No such device)"

(In reply to zhaozhanqi from comment #20)
> Sep 12 05:57:03 ip-172-18-8-246 atomic-openshift-node: E0912 05:57:03.425374
> 8149 cni.go:260] Error adding network: CNI request failed with status 400:
> 'error on port veth57be0e6f: "could not open network device veth57be0e6f (No
> such device)"

The patch does not get rid of these errors; it just tries to ensure that if this error happens, the veth device mentioned gets removed from the OVS bridge properly afterward. So it should be getting rid of the errors about veths in "ovs-vsctl show", but this suggests it may not be doing that entirely:

> 3. after running 10 mins and check the ovs port and pod container did not
> match
> sh-4.2# ovs-vsctl list-ports br0 | grep veth | wc -l
> 28
>
> # ip a | grep veth | awk '{ print $2}' | awk -F@ '{ print $1}' | sort | wc -l
> 23

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days
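The mismatch check in step 3 of the QA reproduction is a set difference between what OVS lists on the bridge and what the kernel actually has. A small sketch of that comparison (the function name and the veth names other than veth57be0e6f are fabricated for illustration):

```python
def stale_ovs_ports(ovs_ports, kernel_ifaces):
    """Ports OVS still lists on the bridge but the kernel no longer has.

    `ovs_ports` would come from `ovs-vsctl list-ports br0`, and
    `kernel_ifaces` from the interface names in `ip a` output (the part
    before '@' on each veth line).
    """
    return sorted(set(ovs_ports) - set(kernel_ifaces))

# Example: OVS lists 4 veths but the kernel only has 3 of them.
print(stale_ovs_ports(
    ["veth57be0e6f", "vethaaaa0001", "vethaaaa0002", "vethaaaa0003"],
    ["vethaaaa0001", "vethaaaa0002", "vethaaaa0003"]))
# → ['veth57be0e6f']
```

Any names left over in the difference are exactly the candidates the purge script earlier in this bug would remove from the database.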
Description of problem:
Creating and deleting a large number of pods causes ovs errors; it becomes unable to create and delete ovs virtual ports.

Version-Release number of selected component (if applicable):
ocp 3.9.14
openvswitch-2.7.3-3.git20180112.el7fdp.x86_64

How reproducible:
always

Steps to Reproduce:
1. create a project
2. create a tomcat app
3. scale the pod count up to 500, delete the pods immediately, then repeat creating and deleting; the error will appear after a while

Actual results:
can't create and delete pods

Expected results:
working normally

Additional info:
Creating and deleting a large number of pods causes ovs errors; it becomes unable to create and delete ovs virtual ports. The customer is using ocp 3.9.14 for a stress test. Test steps: use a dc to scale up to 500 pods, then delete them; after a while, the error occurred. At first the problem seemed to be that ocp cannot delete ovs ports; after that, creation failed too. The problem seems to lie in the rapid creation and deletion.

ovs version: openvswitch-2.7.3-3.git20180112.el7fdp.x86_64

1. In ovs-vswitchd.log there are many failure-to-create messages:
2018-05-29T12:11:42.593Z|7139362|bridge|WARN|could not add network device veth50e17307 to ofproto (No such device)
2018-05-29T12:11:42.595Z|7139363|bridge|WARN|could not open network device veth497797fd (No such device)

2. OVS continues to perform open or add operations on the same port, unable to complete:
# cat ovs-vswitchd.log |grep veth2d159069
2018-05-29T12:11:42.490Z|7139314|bridge|WARN|could not open network device veth2d159069 (No such device)
2018-05-29T12:11:47.415Z|7141594|bridge|WARN|could not open network device veth2d159069 (No such device)
2018-05-29T12:11:52.296Z|7143876|bridge|WARN|could not open network device veth2d159069 (No such device)
2018-05-29T12:11:57.534Z|7146163|bridge|WARN|could not open network device veth2d159069 (No such device)
2018-05-29T12:12:02.446Z|7148445|bridge|WARN|could not open network device veth2d159069 (No such device)
2018-05-29T12:12:07.371Z|7150725|bridge|WARN|could not open network device veth2d159069 (No such device)
2018-05-29T12:12:12.627Z|7153012|bridge|WARN|could not open network device veth2d159069 (No such device)
2018-05-29T12:12:17.534Z|7155294|bridge|WARN|could not open network device veth2d159069 (No such device)
2018-05-29T12:12:22.435Z|7157574|bridge|WARN|could not open network device veth2d159069 (No such device)
2018-05-29T12:12:27.771Z|7159865|bridge|WARN|could not open network device veth2d159069 (No such device)

# grep veth61060184 ovs-vswitchd.log
2018-05-29T12:11:45.489Z|7140717|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:11:50.426Z|7142999|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:11:55.293Z|7145279|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:12:00.536Z|7147566|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:12:05.473Z|7149848|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:12:10.378Z|7152128|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:12:15.643Z|7154419|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:12:20.561Z|7156697|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:12:25.493Z|7158979|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:12:30.784Z|7161268|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:12:35.682Z|7163550|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:12:40.581Z|7165830|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:12:45.833Z|7168119|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:12:50.735Z|7170401|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:12:55.640Z|7172681|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:13:00.903Z|7174970|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:13:05.783Z|7177250|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:13:10.669Z|7179530|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:13:15.916Z|7181819|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:13:20.849Z|7184099|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:13:25.733Z|7186381|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:13:30.955Z|7188668|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:13:35.849Z|7190950|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:13:40.737Z|7193232|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)
2018-05-29T12:13:45.994Z|7195519|bridge|WARN|could not add network device veth61060184 to ofproto (No such device)

3. The above OVS problem leads to failures of the openshift CNI call, resulting in pod creation problems:
#cat messages |grep -i veth2d319e8e
May 29 19:31:07 node1 NetworkManager[1049]: <info> [1527593467.0108] device (veth2d319e8e): carrier: link connected
May 29 19:31:07 node1 NetworkManager[1049]: <info> [1527593467.0116] manager: (veth2d319e8e): new Veth device (/org/freedesktop/NetworkManager/Devices/10678)
May 29 19:34:03 node1 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br0 veth2d319e8e -- set Interface veth2d319e8e "external-ids=sandbox=\"c404ff9d3536bb791246e601f84dce09aa809f0c340acd61a6d7f2a6394b4125\",ip=\"10.130.0.18\""
May 29 19:34:08 node1 atomic-openshift-node: E0529 19:34:08.928553 17092 cni.go:259] Error adding network: CNI request failed with status 400: 'error on port veth2d319e8e: "could not open network device veth2d319e8e (No such device)"
May 29 19:34:08 node1 atomic-openshift-node: E0529 19:34:08.928579 17092 cni.go:227] Error while adding to cni network: CNI request failed with status 400: 'error on port veth2d319e8e: "could not open network device veth2d319e8e (No such device)"
May 29 19:34:08 node1 atomic-openshift-node: E0529 19:34:08.930176 17092 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "qperf-server-13-swlwr_yjfns" network: CNI request failed with status 400: 'error on port veth2d319e8e: "could not open network device veth2d319e8e (No such device)"
May 29 19:34:08 node1 atomic-openshift-node: E0529 19:34:08.930210 17092 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "qperf-server-13-swlwr_yjfns(6d67c53e-627b-11e8-b361-005056947c2e)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "qperf-server-13-swlwr_yjfns" network: CNI request failed with status 400: 'error on port veth2d319e8e: "could not open network device veth2d319e8e (No such device)"
May 29 19:34:08 node1 atomic-openshift-node: E0529 19:34:08.930224 17092 kuberuntime_manager.go:647] createPodSandbox for pod "qperf-server-13-swlwr_yjfns(6d67c53e-627b-11e8-b361-005056947c2e)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "qperf-server-13-swlwr_yjfns" network: CNI request failed with status 400: 'error on port veth2d319e8e: "could not open network device veth2d319e8e (No such device)"
May 29 19:34:08 node1 atomic-openshift-node: E0529 19:34:08.930274 17092 pod_workers.go:186] Error syncing pod 6d67c53e-627b-11e8-b361-005056947c2e ("qperf-server-13-swlwr_yjfns(6d67c53e-627b-11e8-b361-005056947c2e)"), skipping: failed to "CreatePodSandbox" for "qperf-server-13-swlwr_yjfns(6d67c53e-627b-11e8-b361-005056947c2e)" with CreatePodSandboxError: "CreatePodSandbox for pod \"qperf-server-13-swlwr_yjfns(6d67c53e-627b-11e8-b361-005056947c2e)\" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod \"qperf-server-13-swlwr_yjfns\" network: CNI request failed with status 400: 'error on port veth2d319e8e: \"could not open network device veth2d319e8e (No such device)\"\n'"

4. All of the above issues disappear after rebooting the node.