Bug 1751246

Summary: Cluster upgrade failing with `Cluster operator dns is reporting a failure`
Product: OpenShift Container Platform
Component: Networking (sub component: openshift-sdn)
Reporter: ravig <rgudimet>
Assignee: Casey Callendrello <cdc>
QA Contact: zhaozhanqi <zzhao>
CC: aconole, aos-bugs, bbennett, danw, wking
Status: CLOSED ERRATA
Severity: high
Priority: high
Version: 4.2.0
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2020-01-23 11:05:47 UTC

Description ravig 2019-09-11 13:55:42 UTC
Description of problem:

Upgrade is failing in CI with the following error:

clusterversion/version changed Failing to True: ClusterOperatorDegraded: Cluster operator dns is reporting a failure: Not all desired DNS DaemonSets available.


https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6925



Comment 1 Dan Mace 2019-09-11 14:28:02 UTC
CoreDNS on one node can't talk to the apiserver.

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6925/artifacts/e2e-aws-upgrade/pods/openshift-dns_dns-default-lsf6p_dns.log

The dns-operator is correctly reporting a degraded status.

The apiserver and SDN operators are all reporting healthy, pods are all reporting ready, etc.
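
For reference, a minimal sketch of the spot checks behind that summary (the pod and container names come from the log URL above; hypothetical commands, not something the CI job ran):

    # Which node hosts the failing DNS pod?
    oc -n openshift-dns get pods -o wide
    # Tail the failing pod's dns container log.
    oc -n openshift-dns logs dns-default-lsf6p -c dns | tail -n 50
    # Cross-check the operator statuses mentioned above.
    oc get clusteroperator dns network kube-apiserver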

Reassigning to networking for further triage. Let me know if I made a mistake!

Comment 2 Dan Winship 2019-09-11 16:24:49 UTC
It looks like the problems started when the OVS pod was upgraded/restarted. The OVS pod log from that node is https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6925/artifacts/e2e-aws-upgrade/pods/openshift-sdn_ovs-hf2nr_openvswitch.log, and it shows a lot of

    2019-09-11T09:33:15.285Z|00103|jsonrpc|WARN|unix#228: receive error: Connection reset by peer
    2019-09-11T09:33:15.285Z|00104|reconnect|WARN|unix#228: connection dropped (Connection reset by peer)

The other OVS pods show *some* of those errors, but ovs-hf2nr logs that pair every minute or so (and doesn't log anything else) from 08:47:42 until the cluster is shut down.
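
A quick way to compare the noise level across nodes, assuming the per-pod logs have been downloaded with the CI artifact naming scheme above:

    # Count the "connection dropped" warnings in each OVS pod log.
    for log in openshift-sdn_ovs-*_openvswitch.log; do
        echo "$log: $(grep -c 'connection dropped' "$log")"
    done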

I don't know what that error means... Aaron, any ideas?

Comment 3 Aaron Conole 2019-09-11 19:57:47 UTC
The connection being dropped usually indicates the database was shutdown or that some kind of negotiation between the two processes failed (possibly the db got restarted a few times - which would explain some of the reconnects?).

Comment 4 Casey Callendrello 2019-09-12 16:37:12 UTC
So the kubelet executes "ovs-vsctl status" every 30 seconds. That should detect a down ovsdb, right?

The node's iptables rules look fine. But the two pods scheduled on the node couldn't reach the apiserver. That indicates pretty clearly that something is wrong with the OVS bridge. So, I guess ovsdb somehow got wedged? That doesn't seem super likely, but I suppose it's plausible.

Aaron, is there a better status command we can use? Perhaps one that really checks for a functioning process?
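
For illustration, a stricter probe could exercise both daemons end to end instead of a single CLI call; a minimal sketch, assuming the default OVS control sockets (not what openshift-sdn actually ships):

    #!/bin/sh
    # Round-trip through ovsdb-server; --timeout makes it fail fast
    # instead of blocking forever on a wedged database.
    ovs-vsctl --timeout=5 show > /dev/null || exit 1
    # Poke ovs-vswitchd directly over its unixctl socket ("version" is
    # a generic unixctl command every OVS daemon answers).
    ovs-appctl -t ovs-vswitchd version > /dev/null || exit 1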

Comment 5 W. Trevor King 2019-09-13 04:01:36 UTC
*** Bug 1723914 has been marked as a duplicate of this bug. ***

Comment 6 Dan Winship 2019-09-13 11:54:47 UTC
(In reply to Aaron Conole from comment #3)
> The connection being dropped usually indicates the database was shutdown or
> that some kind of negotiation between the two processes failed (possibly the
> db got restarted a few times - which would explain some of the reconnects?).

ovs-vswitchd and ovsdb-server are always started/stopped/restarted as a team. The db shouldn't be restarting while vswitchd is running.

And note that that log file is a mix of both the vswitchd and ovsdb logs; it comes from a startup script that ends with:

          tail -F --pid=$(cat /var/run/openvswitch/ovs-vswitchd.pid) /var/log/openvswitch/ovs-vswitchd.log &
          tail -F --pid=$(cat /var/run/openvswitch/ovsdb-server.pid) /var/log/openvswitch/ovsdb-server.log &

Anyway, it appears to be ovsdb-server, not ovs-vswitchd, that is logging the "reconnect" errors. E.g. (reindented for clarity; the extra-indented lines are from ovs-vswitchd.log, the rest from ovsdb-server.log):

        2019-09-11T08:44:48.919Z|00038|bridge|INFO|bridge br0: using datapath ID 0000aa530131d545
        2019-09-11T08:44:48.919Z|00039|connmgr|INFO|br0: added service controller "punix:/var/run/openvswitch/br0.mgmt"
        2019-09-11T08:44:48.924Z|00040|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.11.0
    2019-09-11T08:44:48.815Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovsdb-server.log
    2019-09-11T08:44:48.824Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.11.0
    2019-09-11T08:44:48.829Z|00003|jsonrpc|WARN|unix#0: receive error: Connection reset by peer
    2019-09-11T08:44:48.829Z|00004|reconnect|WARN|unix#0: connection dropped (Connection reset by peer)
    2019-09-11T08:44:48.841Z|00005|jsonrpc|WARN|unix#1: receive error: Connection reset by peer
    2019-09-11T08:44:48.841Z|00006|reconnect|WARN|unix#1: connection dropped (Connection reset by peer)
    2019-09-11T08:44:48.936Z|00007|jsonrpc|WARN|unix#6: receive error: Connection reset by peer
    2019-09-11T08:44:48.936Z|00008|reconnect|WARN|unix#6: connection dropped (Connection reset by peer)
        2019-09-11T08:44:50.057Z|00041|bridge|INFO|bridge br0: deleted interface vethcd1e33a0 on port 33
        2019-09-11T08:44:50.057Z|00042|bridge|INFO|bridge br0: deleted interface veth4018383c on port 37

So the errors don't seem to have anything to do with ovsdb or ovs-vswitchd restarting.
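
One way to sanity-check that: each daemon writes an "opened log file" line (like the 00001 line above) every time it starts, so more than one such line per log would indicate a restart. A quick check on the node, assuming the standard log paths from the startup script:

    grep -c 'opened log file' /var/log/openvswitch/ovsdb-server.log
    grep -c 'opened log file' /var/log/openvswitch/ovs-vswitchd.log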

Comment 7 Casey Callendrello 2019-09-13 12:01:59 UTC
Gotcha - so which process is timing out / restarting? Is it ovsdb, ovs-vswitchd, or openshift-sdn? Openshift-sdn, which is normally quite chatty, doesn't seem to log anything. This makes me think that its connection to ovs-vswitchd is fine.
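
One hypothetical way to answer that on a live node is to compare daemon uptimes, reusing the pid files the startup script above already references; a recently restarted daemon would show a short elapsed time:

    # etime is the elapsed time since each process started.
    ps -o pid,etime,cmd \
       -p "$(cat /var/run/openvswitch/ovs-vswitchd.pid),$(cat /var/run/openvswitch/ovsdb-server.pid)"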

Comment 8 Ben Bennett 2019-09-17 13:15:33 UTC
Pushing to 4.3 because we have only seen this once.  We can fix it later in a Z-stream if necessary.

Comment 9 Aaron Conole 2019-09-17 13:48:48 UTC
No idea - maybe ovs-vsctl is dying. Is it run with a timeout? Is the timeout expiring? I don't know all of the processes that connect to the ovsdb.

Comment 10 Dan Winship 2019-09-18 21:36:54 UTC
So here's an example (strace of ovsdb-server, pid 3376, accepting and servicing a client connection on its unix socket):

3376  accept(15, {sa_family=AF_UNIX}, [128->2]) = 18
3376  fcntl(18, F_GETFL)                = 0x2 (flags O_RDWR)
3376  fcntl(18, F_SETFL, O_RDWR|O_NONBLOCK) = 0
3376  recvfrom(18, "{\"id\":0,\"method\":\"get_schema\",\"params\":[\"_Server\"]}", 512, 0, NULL, NULL) = 51
3376  sendto(18, "{\"id\":0,\"result\":{\"cksum\":\"3236486585 698\",\"name\":\"_Server\",\"version\":\"1.1.0\",\"tables\":{\"Database\":{\"columns\":{\"name\":{\"type\":\"string\"},\"model\":{\"type\":{\"key\":{\"type\":\"string\",\"enum\":[\"set\",[\"clustered\",\"standalone\"]]}}},\"connected\":{\"type\":\"boolean\"},\"leader\":{\"type\":\"boolean\"},\"schema\":{\"type\":{\"min\":0,\"key\":\"string\"}},\"sid\":{\"type\":{\"min\":0,\"key\":\"uuid\"}},\"cid\":{\"type\":{\"min\":0,\"key\":\"uuid\"}},\"index\":{\"type\":{\"min\":0,\"key\":\"integer\"}}}}}},\"error\":null}", 459, 0, NULL, 0) = 459
3376  recvfrom(18, "{\"id\":1,\"method\":\"monitor_cond\",\"params\":[\"_Server\",[\"monid\",\"_Server\"],{\"Database\":[{\"where\":[[\"name\",\"==\",\"Open_vSwitch\"]],\"columns\":[\"cid\",\"connected\",\"index\",\"leader\",\"model\",\"name\",\"schema\",\"sid\"]}]}]}", 461, 0, NULL, NULL) = 206
3376  sendto(18, "{\"id\":1,\"result\":{\"Database\":{\"41cd9e6f-8c9b-406e-bf23-231ee3431260\":{\"initial\":{\"connected\":true,\"name\":\"Open_vSwitch\",\"leader\":true,\"schema\":\"{\\\"cksum\\\":\\\"1452282319 23860\\\",\\\"name\\\":\\\"Open_vSwitch\\\",\\\"tables\\\":{\\\"AutoAttach\\\":{\\\"columns\\\":{\\\"mappings\\\":{\\\"type\\\":{\\\"key\\\":{\\\"maxInteger\\\":16777215,\\\"minInteger\\\":0,\\\"type\\\":\\\"integer\\\"},\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0,\\\"value\\\":{\\\"maxInteger\\\":4095,\\\"minInteger\\\":0,\\\"type\\\":\\\"integer\\\"}}},\\\"system_description\\\":{\\\"type\\\":\\\"string\\\"},\\\"system_name\\\":{\\\"type\\\":\\\"string\\\"}}},\\\"Bridge\\\":{\\\"columns\\\":{\\\"auto_attach\\\":{\\\"type\\\":{\\\"key\\\":{\\\"refTable\\\":\\\"AutoAttach\\\",\\\"type\\\":\\\"uuid\\\"},\\\"min\\\":0}},\\\"controller\\\":{\\\"type\\\":{\\\"key\\\":{\\\"refTable\\\":\\\"Controller\\\",\\\"type\\\":\\\"uuid\\\"},\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0}},\\\"datapath_id\\\":{\\\"ephemeral\\\":true,\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"min\\\":0}},\\\"datapath_type\\\":{\\\"type\\\":\\\"string\\\"},\\\"datapath_version\\\":{\\\"type\\\":\\\"string\\\"},\\\"external_ids\\\":{\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0,\\\"value\\\":\\\"string\\\"}},\\\"fail_mode\\\":{\\\"type\\\":{\\\"key\\\":{\\\"enum\\\":[\\\"set\\\",[\\\"secure\\\",\\\"standalone\\\"]],\\\"type\\\":\\\"string\\\"},\\\"min\\\":0}},\\\"flood_vlans\\\":{\\\"type\\\":{\\\"key\\\":{\\\"maxInteger\\\":4095,\\\"minInteger\\\":0,\\\"type\\\":\\\"integer\\\"},\\\"max\\\":4096,\\\"min\\\":0}},\\\"flow_tables\\\":{\\\"type\\\":{\\\"key\\\":{\\\"maxInteger\\\":254,\\\"minInteger\\\":0,\\\"type\\\":\\\"integer\\\"},\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0,\\\"value\\\":{\\\"refTable\\\":\\\"Flow_Table\\\",\\\"type\\\":\\\"uuid\\\"}}},\\\"ipfix\\\":{\\\"type\\\":{\\\"key\\\":{\\\"refTable\\\":\\\"IPFIX\\\",\\\"type\\\":\\\"uuid\\\"},\\\"min\\\":0}},\\\"mcast_snooping_enable\\\":{\\\"type\\\":\\\"boolean\\\"},\\\"mirrors\\\":{\\\"type\\\":{\\\"key\\\":{\\\"refTable\\\":\\\"Mirror\\\",\\\"type\\\":\\\"uuid\\\"},\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0}},\\\"name\\\":{\\\"mutable\\\":false,\\\"type\\\":\\\"string\\\"},\\\"netflow\\\":{\\\"type\\\":{\\\"key\\\":{\\\"refTable\\\":\\\"NetFlow\\\",\\\"type\\\":\\\"uuid\\\"},\\\"min\\\":0}},\\\"other_config\\\":{\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0,\\\"value\\\":\\\"string\\\"}},\\\"ports\\\":{\\\"type\\\":{\\\"key\\\":{\\\"refTable\\\":\\\"Port\\\",\\\"type\\\":\\\"uuid\\\"},\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0}},\\\"protocols\\\":{\\\"type\\\":{\\\"key\\\":{\\\"enum\\\":[\\\"set\\\",[\\\"OpenFlow10\\\",\\\"OpenFlow11\\\",\\\"OpenFlow12\\\",\\\"OpenFlow13\\\",\\\"OpenFlow14\\\",\\\"OpenFlow15\\\",\\\"OpenFlow16\\\"]],\\\"type\\\":\\\"string\\\"},\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0}},\\\"rstp_enable\\\":{\\\"type\\\":\\\"boolean\\\"},\\\"rstp_status\\\":{\\\"ephemeral\\\":true,\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0,\\\"value\\\":\\\"string\\\"}},\\\"sflow\\\":{\\\"type\\\":{\\\"key\\\":{\\\"refTable\\\":\\\"sFlow\\\",\\\"type\\\":\\\"uuid\\\"},\\\"min\\\":0}},\\\"status\\\":{\\\"ephemeral\\\":true,\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0,\\\"value\\\":\\\"string\\\"}},\\\"stp_enable\\\":{\\\"type\\\":\\\"boolean\\\"}},\\\"indexes\\\":[[\\\"name\\\"]]},\\\"Controller
\\\":{\\\"columns\\\":{\\\"connection_mode\\\":{\\\"type\\\":{\\\"key\\\":{\\\"enum\\\":[\\\"set\\\",[\\\"in-band\\\",\\\"out-of-band\\\"]],\\\"type\\\":\\\"string\\\"},\\\"min\\\":0}},\\\"controller_burst_limit\\\":{\\\"type\\\":{\\\"key\\\":{\\\"minInteger\\\":25,\\\"type\\\":\\\"integer\\\"},\\\"min\\\":0}},\\\"controller_rate_limit\\\":{\\\"type\\\":{\\\"key\\\":{\\\"minInteger\\\":100,\\\"type\\\":\\\"integer\\\"},\\\"min\\\":0}},\\\"enable_async_messages\\\":{\\\"type\\\":{\\\"key\\\":\\\"boolean\\\",\\\"min\\\":0}},\\\"external_ids\\\":{\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0,\\\"value\\\":\\\"string\\\"}},\\\"inactivity_probe\\\":{\\\"type\\\":{\\\"key\\\":\\\"integer\\\",\\\"min\\\":0}},\\\"is_connected\\\":{\\\"ephemeral\\\":true,\\\"type\\\":\\\"boolean\\\"},\\\"local_gateway\\\":{\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"min\\\":0}},\\\"local_ip\\\":{\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"min\\\":0}},\\\"local_netmask\\\":{\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"min\\\":0}},\\\"max_backoff\\\":{\\\"type\\\":{\\\"key\\\":{\\\"minInteger\\\":1000,\\\"type\\\":\\\"integer\\\"},\\\"min\\\":0}},\\\"other_config\\\":{\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0,\\\"value\\\":\\\"string\\\"}},\\\"role\\\":{\\\"ephemeral\\\":true,\\\"type\\\":{\\\"key\\\":{\\\"enum\\\":[\\\"set\\\",[\\\"master\\\",\\\"other\\\",\\\"slave\\\"]],\\\"type\\\":\\\"string\\\"},\\\"min\\\":0}},\\\"status\\\":{\\\"ephemeral\\\":true,\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0,\\\"value\\\":\\\"string\\\"}},\\\"target\\\":{\\\"type\\\":\\\"string\\\"}}},\\\"Flow_Sample_Collector_Set\\\":{\\\"columns\\\":{\\\"bridge\\\":{\\\"type\\\":{\\\"key\\\":{\\\"refTable\\\":\\\"Bridge\\\",\\\"type\\\":\\\"uuid\\\"}}},\\\"external_id"..., 16015, 0, NULL, 0) = 16015
3376  recvfrom(18, "{\"id\":2,\"method\":\"monitor_cond\",\"params\":[\"Open_vSwitch\",[\"monid\",\"Open_vSwitch\"],{\"Interface\":[{\"columns\":[\"name\",\"options\"]}],\"Open_vSwitch\":[{\"columns\":[\"cur_cfg\"]}]}]}{\"id\":3,\"method\":\"set_db_change_aware\",\"params\":[true]}", 255, 0, NULL, NULL) = 226
3376  sendto(18, "{\"id\":2,\"result\":{\"Interface\":{\"c3c8f8d9-fcd7-4a01-a26f-ec9e985bffef\":{\"initial\":{\"name\":\"veth57fcc4fb\"}},\"2c8fee01-ec4e-467e-ae25-c3f6c494fbe7\":{\"initial\":{\"name\":\"vethc3e93576\"}},\"0b86132f-10a3-45e7-913b-8564a48add51\":{\"initial\":{\"name\":\"vethd5e50465\"}},\"137e5eca-b967-404c-882a-ffa985bfd4c8\":{\"initial\":{\"name\":\"vethb2c50177\"}},\"fcb05554-0102-4cfa-8346-332bdc64e866\":{\"initial\":{\"name\":\"vethc9fd9f4a\"}},\"5866f70a-fe4d-4d73-a379-b5fe2ee86008\":{\"initial\":{\"name\":\"veth7cdf84e3\"}},\"444f2b35-8e62-4966-a463-8c62b6b92f5a\":{\"initial\":{\"name\":\"vxlan0\",\"options\":[\"map\",[[\"dst_port\",\"4789\"],[\"key\",\"flow\"],[\"remote_ip\",\"flow\"]]]}},\"fd11de49-fde4-467c-b8b6-567962375074\":{\"initial\":{\"name\":\"veth8a1683d4\"}},\"fb0203b5-12ad-4eea-8e75-d7849dcba8de\":{\"initial\":{\"name\":\"veth4c6db7a2\"}},\"d83a5766-6fad-44c7-88fa-3439fb7b653e\":{\"initial\":{\"name\":\"veth80afd067\"}},\"0bd45664-58ea-407a-81b0-9554f3406cf7\":{\"initial\":{\"name\":\"br0\"}},\"f61e422d-12a5-4467-b2aa-0ae26575c815\":{\"initial\":{\"name\":\"vethd48cdce8\"}},\"ec991686-5c70-44f6-a670-cb68416e3a30\":{\"initial\":{\"name\":\"vethef45f97d\"}},\"42450e23-b9fa-423e-a508-f2f8fc3c0ca7\":{\"initial\":{\"name\":\"vethd1adefa5\"}},\"992aac17-7284-46bb-a3cb-720f8f6a2d1c\":{\"initial\":{\"name\":\"veth81f1a00d\"}},\"0e0d2ae5-3c6d-4c72-bc17-a195621ad8db\":{\"initial\":{\"name\":\"vethdf90af52\"}},\"97a2c47b-6a5b-44d0-893e-df6fd031dcfe\":{\"initial\":{\"name\":\"vethfa2985a0\"}},\"816e70a0-efef-4613-b088-3dc2da1e3e0c\":{\"initial\":{\"name\":\"tun0\"}},\"c9120f66-e362-441f-8a46-d424c85aec1d\":{\"initial\":{\"name\":\"vetha5266535\"}},\"0385e547-7fd4-4f3c-a77f-620aa1cd05c1\":{\"initial\":{\"name\":\"veth9c0f9133\"}}},\"Open_vSwitch\":{\"660748cb-9c68-4f75-8f43-f4b0dd297df3\":{\"initial\":{\"cur_cfg\":2415}}}},\"error\":null}", 1684, 0, NULL, 0) = 1684
3376  sendto(18, "{\"id\":3,\"result\":{},\"error\":null}", 33, 0, NULL, 0) = -1 EPIPE (Broken pipe)
3376  --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3376, si_uid=998} ---
3376  write(5, "2019-09-18T21:09:54.047Z|00761|jsonrpc|WARN|unix#5251: send error: Broken pipe\n", 79) = 79
3376  close(18)                         = 0


"method":"monitor_cond"

So it looks like this is an "ovsdb-client monitor-cond" call, and I have no idea where that would be coming from. Openshift-sdn doesn't run ovsdb-client at all...
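
For comparison only (this is not a claim about what opened the connection): the stock CLI can produce a monitor_cond request of the same shape, with the conditions written in the same form as the "where" clause in the id:1 request above:

    # Sends a monitor_cond request like the id:2 one in the trace;
    # the socket path is the stock ovsdb-server default.
    ovsdb-client monitor-cond unix:/var/run/openvswitch/db.sock \
        Open_vSwitch '[["name","==","br0"]]' Interface name,options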

Comment 15 errata-xmlrpc 2020-01-23 11:05:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062