Description of problem:

Upgrade is failing in CI with the following error:

  clusterversion/version changed Failing to True: ClusterOperatorDegraded: Cluster operator dns is reporting a failure: Not all desired DNS DaemonSets available.

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6925
CoreDNS on one node can't talk to the apiserver:

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6925/artifacts/e2e-aws-upgrade/pods/openshift-dns_dns-default-lsf6p_dns.log

The dns-operator is correctly reporting a degraded status. The apiserver and SDN operators are all reporting healthy, pods are all reporting ready, etc. Reassigning to networking for further triage. Let me know if I made a mistake!
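For the record, the symptom should be directly reproducible from the affected pod. A hypothetical spot check (assuming curl exists in the dns container image, and the default 172.30.0.1 kubernetes service VIP; the pod and container names come from the log linked above):

  # From a cluster-admin session; should time out rather than
  # return the apiserver version if the bridge is broken.
  oc exec -n openshift-dns dns-default-lsf6p -c dns -- \
    curl -k -sS --max-time 5 https://172.30.0.1:443/version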
It looks like the problems started when the OVS pod was upgraded/restarted. The OVS pod logs from that node are at

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6925/artifacts/e2e-aws-upgrade/pods/openshift-sdn_ovs-hf2nr_openvswitch.log

and they show a lot of:

  2019-09-11T09:33:15.285Z|00103|jsonrpc|WARN|unix#228: receive error: Connection reset by peer
  2019-09-11T09:33:15.285Z|00104|reconnect|WARN|unix#228: connection dropped (Connection reset by peer)

The other OVS pods show *some* of those errors, but ovs-hf2nr logs that pair every minute or so (and doesn't log anything else) from 08:47:42 until the cluster is shut down. I don't know what that error means... Aaron, any ideas?
The connection being dropped usually indicates the database was shut down or that some kind of negotiation between the two processes failed (possibly the db got restarted a few times, which would explain some of the reconnects?).
So the kubelet executes "ovs-vsctl status" every 30 seconds. That should detect a down ovsdb, right? The node's iptables rules look fine, but the two pods scheduled on the node couldn't reach the apiserver, which indicates pretty clearly that something is wrong with the ovs bridge. So, I guess ovsdb somehow got wedged? That doesn't seem super likely, but I suppose it's plausible. Aaron, is there a better status command we can use? Perhaps one that really checks for a functioning process?
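For illustration, a stricter check might look something like the sketch below (just a sketch, assuming the daemons use the default pidfile/control-socket locations; the explicit --timeout matters because ovs-vsctl otherwise blocks indefinitely against a wedged ovsdb):

  #!/bin/sh
  # Talk to ovsdb-server through the db socket; fail fast instead of hanging.
  ovs-vsctl --timeout=5 list-br || exit 1
  # Confirm ovs-vswitchd itself still answers on its control socket.
  ovs-appctl -t ovs-vswitchd version || exit 1
  # And confirm ovsdb-server answers on its own control socket.
  ovs-appctl -t ovsdb-server ovsdb-server/list-dbs || exit 1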
*** Bug 1723914 has been marked as a duplicate of this bug. ***
(In reply to Aaron Conole from comment #3)
> The connection being dropped usually indicates the database was shut down
> or that some kind of negotiation between the two processes failed (possibly
> the db got restarted a few times, which would explain some of the
> reconnects?).

ovs-vswitchd and ovsdb-server are always started/stopped/restarted as a team. The db shouldn't be restarting while vswitchd is running.

And note that that log file is a mix of both the vswitchd and ovsdb logs; it comes from a startup script that ends with:

  tail -F --pid=$(cat /var/run/openvswitch/ovs-vswitchd.pid) /var/log/openvswitch/ovs-vswitchd.log &
  tail -F --pid=$(cat /var/run/openvswitch/ovsdb-server.pid) /var/log/openvswitch/ovsdb-server.log &

Anyway, it appears to be ovsdb, not vswitchd, that is logging the "reconnect" errors. E.g., reindented for clarity:

  2019-09-11T08:44:48.919Z|00038|bridge|INFO|bridge br0: using datapath ID 0000aa530131d545
  2019-09-11T08:44:48.919Z|00039|connmgr|INFO|br0: added service controller "punix:/var/run/openvswitch/br0.mgmt"
  2019-09-11T08:44:48.924Z|00040|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.11.0

      2019-09-11T08:44:48.815Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovsdb-server.log
      2019-09-11T08:44:48.824Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.11.0
      2019-09-11T08:44:48.829Z|00003|jsonrpc|WARN|unix#0: receive error: Connection reset by peer
      2019-09-11T08:44:48.829Z|00004|reconnect|WARN|unix#0: connection dropped (Connection reset by peer)
      2019-09-11T08:44:48.841Z|00005|jsonrpc|WARN|unix#1: receive error: Connection reset by peer
      2019-09-11T08:44:48.841Z|00006|reconnect|WARN|unix#1: connection dropped (Connection reset by peer)
      2019-09-11T08:44:48.936Z|00007|jsonrpc|WARN|unix#6: receive error: Connection reset by peer
      2019-09-11T08:44:48.936Z|00008|reconnect|WARN|unix#6: connection dropped (Connection reset by peer)

  2019-09-11T08:44:50.057Z|00041|bridge|INFO|bridge br0: deleted interface vethcd1e33a0 on port 33
  2019-09-11T08:44:50.057Z|00042|bridge|INFO|bridge br0: deleted interface veth4018383c on port 37

So the errors don't seem to have anything to do with ovsdb or ovs-vswitchd restarting.
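As an aside, one rough way to split the combined pod log back apart is to grep on the vlog module name in the third |-delimited field. This is only an approximation (some modules can log from either daemon), but it works for the excerpt above:

  grep -E '\|(ovsdb_server|jsonrpc|reconnect)\|' openvswitch.log   # ovsdb-server side
  grep -E '\|(bridge|connmgr)\|' openvswitch.log                   # ovs-vswitchd side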
Gotcha - so which process is timing out / restarting? Is it ovsdb, ovs-vswitchd, or openshift-sdn? Openshift-sdn, which is normally quite chatty, doesn't seem to log anything. This makes me think that its connection to ovs-vswitchd is fine.
Pushing to 4.3 because we have only seen this once. We can fix it later in a Z-stream if necessary.
No idea - maybe the ovs-vsctl is dying. Is it on a timeout? Is the timeout expiring? I don't know all of the processes that connect to the ovsdb.
So here's an example:

  3376 accept(15, {sa_family=AF_UNIX}, [128->2]) = 18
  3376 fcntl(18, F_GETFL) = 0x2 (flags O_RDWR)
  3376 fcntl(18, F_SETFL, O_RDWR|O_NONBLOCK) = 0
  3376 recvfrom(18, "{\"id\":0,\"method\":\"get_schema\",\"params\":[\"_Server\"]}", 512, 0, NULL, NULL) = 51
  3376 sendto(18, "{\"id\":0,\"result\":{\"cksum\":\"3236486585 698\",\"name\":\"_Server\",\"version\":\"1.1.0\",\"tables\":{\"Database\":{\"columns\":{\"name\":{\"type\":\"string\"},\"model\":{\"type\":{\"key\":{\"type\":\"string\",\"enum\":[\"set\",[\"clustered\",\"standalone\"]]}}},\"connected\":{\"type\":\"boolean\"},\"leader\":{\"type\":\"boolean\"},\"schema\":{\"type\":{\"min\":0,\"key\":\"string\"}},\"sid\":{\"type\":{\"min\":0,\"key\":\"uuid\"}},\"cid\":{\"type\":{\"min\":0,\"key\":\"uuid\"}},\"index\":{\"type\":{\"min\":0,\"key\":\"integer\"}}}}}},\"error\":null}", 459, 0, NULL, 0) = 459
  3376 recvfrom(18, "{\"id\":1,\"method\":\"monitor_cond\",\"params\":[\"_Server\",[\"monid\",\"_Server\"],{\"Database\":[{\"where\":[[\"name\",\"==\",\"Open_vSwitch\"]],\"columns\":[\"cid\",\"connected\",\"index\",\"leader\",\"model\",\"name\",\"schema\",\"sid\"]}]}]}", 461, 0, NULL, NULL) = 206
  3376 sendto(18, "{\"id\":1,\"result\":{\"Database\":{\"41cd9e6f-8c9b-406e-bf23-231ee3431260\":{\"initial\":{\"connected\":true,\"name\":\"Open_vSwitch\",\"leader\":true,\"schema\":\"{\\\"cksum\\\":\\\"1452282319 23860\\\",\\\"name\\\":\\\"Open_vSwitch\\\",\\\"tables\\\":{\\\"AutoAttach\\\":{\\\"columns\\\":{\\\"mappings\\\":{\\\"type\\\":{\\\"key\\\":{\\\"maxInteger\\\":16777215,\\\"minInteger\\\":0,\\\"type\\\":\\\"integer\\\"},\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0,\\\"value\\\":{\\\"maxInteger\\\":4095,\\\"minInteger\\\":0,\\\"type\\\":\\\"integer\\\"}}},\\\"system_description\\\":{\\\"type\\\":\\\"string\\\"},\\\"system_name\\\":{\\\"type\\\":\\\"string\\\"}}},\\\"Bridge\\\":{\\\"columns\\\":{\\\"auto_attach\\\":{\\\"type\\\":{\\\"key\\\":{\\\"refTable\\\":\\\"AutoAttach\\\",\\\"type\\\":\\\"uuid\\\"},\\\"min\\\":0}},\\\"controller\\\":{\\\"type\\\":{\\\"key\\\":{\\\"refTable\\\":\\\"Controller\\\",\\\"type\\\":\\\"uuid\\\"},\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0}},\\\"datapath_id\\\":{\\\"ephemeral\\\":true,\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"min\\\":0}},\\\"datapath_type\\\":{\\\"type\\\":\\\"string\\\"},\\\"datapath_version\\\":{\\\"type\\\":\\\"string\\\"},\\\"external_ids\\\":{\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0,\\\"value\\\":\\\"string\\\"}},\\\"fail_mode\\\":{\\\"type\\\":{\\\"key\\\":{\\\"enum\\\":[\\\"set\\\",[\\\"secure\\\",\\\"standalone\\\"]],\\\"type\\\":\\\"string\\\"},\\\"min\\\":0}},\\\"flood_vlans\\\":{\\\"type\\\":{\\\"key\\\":{\\\"maxInteger\\\":4095,\\\"minInteger\\\":0,\\\"type\\\":\\\"integer\\\"},\\\"max\\\":4096,\\\"min\\\":0}},\\\"flow_tables\\\":{\\\"type\\\":{\\\"key\\\":{\\\"maxInteger\\\":254,\\\"minInteger\\\":0,\\\"type\\\":\\\"integer\\\"},\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0,\\\"value\\\":{\\\"refTable\\\":\\\"Flow_Table\\\",\\\"type\\\":\\\"uuid\\\"}}},\\\"ipfix\\\":{\\\"type\\\":{\\\"key\\\":{\\\"refTable\\\":\\\"IPFIX\\\",\\\"type\\\":\\\"uuid\\\"},\\\"min\\\":0}},\\\"mcast_snooping_enable\\\":{\\\"type\\\":\\\"boolean\\\"},\\\"mirrors\\\":{\\\"type\\\":{\\\"key\\\":{\\\"refTable\\\":\\\"Mirror\\\",\\\"type\\\":\\\"uuid\\\"},\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0}},\\\"name\\\":{\\\"mutable\\\":false,\\\"type\\\":\\\"string\\\"},\\\"netflow\\\":{\\\"type\\\":{\\\"key\\\":{\\\"refTable\\\":\\\"NetFlow\\\",\\\"type\\\":\\\"uuid\\\"},\\\"min\\\":0}},\\\"other_config\\\":{\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0,\\\"value\\\":\\\"string\\\"}},\\\"ports\\\":{\\\"type\\\":{\\\"key\\\":{\\\"refTable\\\":\\\"Port\\\",\\\"type\\\":\\\"uuid\\\"},\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0}},\\\"protocols\\\":{\\\"type\\\":{\\\"key\\\":{\\\"enum\\\":[\\\"set\\\",[\\\"OpenFlow10\\\",\\\"OpenFlow11\\\",\\\"OpenFlow12\\\",\\\"OpenFlow13\\\",\\\"OpenFlow14\\\",\\\"OpenFlow15\\\",\\\"OpenFlow16\\\"]],\\\"type\\\":\\\"string\\\"},\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0}},\\\"rstp_enable\\\":{\\\"type\\\":\\\"boolean\\\"},\\\"rstp_status\\\":{\\\"ephemeral\\\":true,\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0,\\\"value\\\":\\\"string\\\"}},\\\"sflow\\\":{\\\"type\\\":{\\\"key\\\":{\\\"refTable\\\":\\\"sFlow\\\",\\\"type\\\":\\\"uuid\\\"},\\\"min\\\":0}},\\\"status\\\":{\\\"ephemeral\\\":true,\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0,\\\"value\\\":\\\"string\\\"}},\\\"stp_enable\\\":{\\\"type\\\":\\\"boolean\\\"}},\\\"indexes\\\":[[\\\"name\\\"]]},\\\"Controller\\\":{\\\"columns\\\":{\\\"connection_mode\\\":{\\\"type\\\":{\\\"key\\\":{\\\"enum\\\":[\\\"set\\\",[\\\"in-band\\\",\\\"out-of-band\\\"]],\\\"type\\\":\\\"string\\\"},\\\"min\\\":0}},\\\"controller_burst_limit\\\":{\\\"type\\\":{\\\"key\\\":{\\\"minInteger\\\":25,\\\"type\\\":\\\"integer\\\"},\\\"min\\\":0}},\\\"controller_rate_limit\\\":{\\\"type\\\":{\\\"key\\\":{\\\"minInteger\\\":100,\\\"type\\\":\\\"integer\\\"},\\\"min\\\":0}},\\\"enable_async_messages\\\":{\\\"type\\\":{\\\"key\\\":\\\"boolean\\\",\\\"min\\\":0}},\\\"external_ids\\\":{\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0,\\\"value\\\":\\\"string\\\"}},\\\"inactivity_probe\\\":{\\\"type\\\":{\\\"key\\\":\\\"integer\\\",\\\"min\\\":0}},\\\"is_connected\\\":{\\\"ephemeral\\\":true,\\\"type\\\":\\\"boolean\\\"},\\\"local_gateway\\\":{\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"min\\\":0}},\\\"local_ip\\\":{\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"min\\\":0}},\\\"local_netmask\\\":{\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"min\\\":0}},\\\"max_backoff\\\":{\\\"type\\\":{\\\"key\\\":{\\\"minInteger\\\":1000,\\\"type\\\":\\\"integer\\\"},\\\"min\\\":0}},\\\"other_config\\\":{\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0,\\\"value\\\":\\\"string\\\"}},\\\"role\\\":{\\\"ephemeral\\\":true,\\\"type\\\":{\\\"key\\\":{\\\"enum\\\":[\\\"set\\\",[\\\"master\\\",\\\"other\\\",\\\"slave\\\"]],\\\"type\\\":\\\"string\\\"},\\\"min\\\":0}},\\\"status\\\":{\\\"ephemeral\\\":true,\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0,\\\"value\\\":\\\"string\\\"}},\\\"target\\\":{\\\"type\\\":\\\"string\\\"}}},\\\"Flow_Sample_Collector_Set\\\":{\\\"columns\\\":{\\\"bridge\\\":{\\\"type\\\":{\\\"key\\\":{\\\"refTable\\\":\\\"Bridge\\\",\\\"type\\\":\\\"uuid\\\"}}},\\\"external_id"..., 16015, 0, NULL, 0) = 16015
  3376 recvfrom(18, "{\"id\":2,\"method\":\"monitor_cond\",\"params\":[\"Open_vSwitch\",[\"monid\",\"Open_vSwitch\"],{\"Interface\":[{\"columns\":[\"name\",\"options\"]}],\"Open_vSwitch\":[{\"columns\":[\"cur_cfg\"]}]}]}{\"id\":3,\"method\":\"set_db_change_aware\",\"params\":[true]}", 255, 0, NULL, NULL) = 226
  3376 sendto(18, "{\"id\":2,\"result\":{\"Interface\":{\"c3c8f8d9-fcd7-4a01-a26f-ec9e985bffef\":{\"initial\":{\"name\":\"veth57fcc4fb\"}},\"2c8fee01-ec4e-467e-ae25-c3f6c494fbe7\":{\"initial\":{\"name\":\"vethc3e93576\"}},\"0b86132f-10a3-45e7-913b-8564a48add51\":{\"initial\":{\"name\":\"vethd5e50465\"}},\"137e5eca-b967-404c-882a-ffa985bfd4c8\":{\"initial\":{\"name\":\"vethb2c50177\"}},\"fcb05554-0102-4cfa-8346-332bdc64e866\":{\"initial\":{\"name\":\"vethc9fd9f4a\"}},\"5866f70a-fe4d-4d73-a379-b5fe2ee86008\":{\"initial\":{\"name\":\"veth7cdf84e3\"}},\"444f2b35-8e62-4966-a463-8c62b6b92f5a\":{\"initial\":{\"name\":\"vxlan0\",\"options\":[\"map\",[[\"dst_port\",\"4789\"],[\"key\",\"flow\"],[\"remote_ip\",\"flow\"]]]}},\"fd11de49-fde4-467c-b8b6-567962375074\":{\"initial\":{\"name\":\"veth8a1683d4\"}},\"fb0203b5-12ad-4eea-8e75-d7849dcba8de\":{\"initial\":{\"name\":\"veth4c6db7a2\"}},\"d83a5766-6fad-44c7-88fa-3439fb7b653e\":{\"initial\":{\"name\":\"veth80afd067\"}},\"0bd45664-58ea-407a-81b0-9554f3406cf7\":{\"initial\":{\"name\":\"br0\"}},\"f61e422d-12a5-4467-b2aa-0ae26575c815\":{\"initial\":{\"name\":\"vethd48cdce8\"}},\"ec991686-5c70-44f6-a670-cb68416e3a30\":{\"initial\":{\"name\":\"vethef45f97d\"}},\"42450e23-b9fa-423e-a508-f2f8fc3c0ca7\":{\"initial\":{\"name\":\"vethd1adefa5\"}},\"992aac17-7284-46bb-a3cb-720f8f6a2d1c\":{\"initial\":{\"name\":\"veth81f1a00d\"}},\"0e0d2ae5-3c6d-4c72-bc17-a195621ad8db\":{\"initial\":{\"name\":\"vethdf90af52\"}},\"97a2c47b-6a5b-44d0-893e-df6fd031dcfe\":{\"initial\":{\"name\":\"vethfa2985a0\"}},\"816e70a0-efef-4613-b088-3dc2da1e3e0c\":{\"initial\":{\"name\":\"tun0\"}},\"c9120f66-e362-441f-8a46-d424c85aec1d\":{\"initial\":{\"name\":\"vetha5266535\"}},\"0385e547-7fd4-4f3c-a77f-620aa1cd05c1\":{\"initial\":{\"name\":\"veth9c0f9133\"}}},\"Open_vSwitch\":{\"660748cb-9c68-4f75-8f43-f4b0dd297df3\":{\"initial\":{\"cur_cfg\":2415}}}},\"error\":null}", 1684, 0, NULL, 0) = 1684
  3376 sendto(18, "{\"id\":3,\"result\":{},\"error\":null}", 33, 0, NULL, 0) = -1 EPIPE (Broken pipe)
  3376 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3376, si_uid=998} ---
  3376 write(5, "2019-09-18T21:09:54.047Z|00761|jsonrpc|WARN|unix#5251: send error: Broken pipe\n", 79) = 79
  3376 close(18) = 0

Note the "method":"monitor_cond". So that looks like this is an "ovsdb-client monitor-cond" call, and I have no idea where that would be coming from. Openshift-sdn doesn't run ovsdb-client at all...
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062