Bug 1751246 - Cluster upgrade failing with `Cluster operator dns is reporting a failure`
Summary: Cluster upgrade failing with `Cluster operator dns is reporting a failure`
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Casey Callendrello
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-09-11 13:55 UTC by ravig
Modified: 2020-01-23 11:06 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-23 11:05:47 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-network-operator pull 321 0 None closed Bug 1751246: openshift-sdn: add better ovs liveness/readiness probe 2021-01-22 14:21:53 UTC
Red Hat Product Errata RHBA-2020:0062 0 None None None 2020-01-23 11:06:11 UTC

Description ravig 2019-09-11 13:55:42 UTC
Description of problem:

Upgrade is failing in CI with the following error:

clusterversion/version changed Failing to True: ClusterOperatorDegraded: Cluster operator dns is reporting a failure: Not all desired DNS DaemonSets available.


https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6925
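
For anyone reproducing this outside CI, a minimal sketch of how to confirm the degraded operator on a live cluster (standard oc commands, not taken from the CI artifacts above; the namespace and DaemonSet names are the usual OpenShift 4 DNS defaults):

    # Check the dns clusteroperator and its DaemonSet; the wide pod listing
    # shows which node's dns-default pod is not ready.
    oc get clusteroperator dns
    oc -n openshift-dns get daemonset dns-default
    oc -n openshift-dns get pods -o wide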


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Dan Mace 2019-09-11 14:28:02 UTC
CoreDNS on one node can't talk to the apiserver.

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6925/artifacts/e2e-aws-upgrade/pods/openshift-dns_dns-default-lsf6p_dns.log

The dns-operator is correctly reporting a degraded status.

The apiserver and SDN operators are all reporting healthy, pods all reporting ready, etc.

Reassigning to networking for further triage. Let me know if I made a mistake!

Comment 2 Dan Winship 2019-09-11 16:24:49 UTC
It looks like the problems started when the OVS pod was upgraded/restarted. OVS pod logs from that node are https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6925/artifacts/e2e-aws-upgrade/pods/openshift-sdn_ovs-hf2nr_openvswitch.log, and they show a lot of

    2019-09-11T09:33:15.285Z|00103|jsonrpc|WARN|unix#228: receive error: Connection reset by peer
    2019-09-11T09:33:15.285Z|00104|reconnect|WARN|unix#228: connection dropped (Connection reset by peer)

The other OVS pods show *some* of those errors, but ovs-hf2nr logs them every minute or so (and doesn't log anything else) from 08:47:42 until the cluster is shut down.

I don't know what that error means... Aaron, any ideas?

Comment 3 Aaron Conole 2019-09-11 19:57:47 UTC
The connection being dropped usually indicates that the database was shut down or that some kind of negotiation between the two processes failed (possibly the db got restarted a few times - which would explain some of the reconnects?).

Comment 4 Casey Callendrello 2019-09-12 16:37:12 UTC
So the kubelet executes "ovs-vsctl status" every 30 seconds. That should detect a down ovsdb, right?

The node's iptables rules look fine. But the two pods scheduled on the node couldn't reach the apiserver. That indicates pretty clearly that something is wrong with the ovs bridge. So, I guess ovsdb somehow got wedged? That doesn't seem super likely, but I suppose it's plausible.

Aaron, is there a better status command we can use? Perhaps one that really checks for a functioning process?
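
(For illustration only, a minimal sketch of what a stricter probe could look like, assuming it should verify that both daemons still answer on their control sockets; this is an assumption, not necessarily what any eventual fix implements:)

    # Hypothetical stricter health check: fail if either daemon stops
    # responding, and bound the ovsdb round trip with a timeout.
    ovs-appctl -t ovs-vswitchd version >/dev/null || exit 1
    ovs-appctl -t ovsdb-server version >/dev/null || exit 1
    ovs-vsctl --timeout=5 show >/dev/null || exit 1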

Comment 5 W. Trevor King 2019-09-13 04:01:36 UTC
*** Bug 1723914 has been marked as a duplicate of this bug. ***

Comment 6 Dan Winship 2019-09-13 11:54:47 UTC
(In reply to Aaron Conole from comment #3)
> The connection being dropped usually indicates the database was shutdown or
> that some kind of negotiation between the two processes failed (possibly the
> db got restarted a few times - which would explain some of the reconnects?).

ovs-vswitchd and ovsdb-server are always started/stopped/restarted as a team. The db shouldn't be restarting while vswitchd is running.

And note that that log file is a mix of both the vswitchd and ovsdb logs; it comes from a startup script that ends with:

          tail -F --pid=$(cat /var/run/openvswitch/ovs-vswitchd.pid) /var/log/openvswitch/ovs-vswitchd.log &
          tail -F --pid=$(cat /var/run/openvswitch/ovsdb-server.pid) /var/log/openvswitch/ovsdb-server.log &

Anyway, it appears to be ovsdb, not vswitchd, that is logging the "reconnect" errors. E.g., reindented for clarity:

        2019-09-11T08:44:48.919Z|00038|bridge|INFO|bridge br0: using datapath ID 0000aa530131d545
        2019-09-11T08:44:48.919Z|00039|connmgr|INFO|br0: added service controller "punix:/var/run/openvswitch/br0.mgmt"
        2019-09-11T08:44:48.924Z|00040|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.11.0
    2019-09-11T08:44:48.815Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovsdb-server.log
    2019-09-11T08:44:48.824Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.11.0
    2019-09-11T08:44:48.829Z|00003|jsonrpc|WARN|unix#0: receive error: Connection reset by peer
    2019-09-11T08:44:48.829Z|00004|reconnect|WARN|unix#0: connection dropped (Connection reset by peer)
    2019-09-11T08:44:48.841Z|00005|jsonrpc|WARN|unix#1: receive error: Connection reset by peer
    2019-09-11T08:44:48.841Z|00006|reconnect|WARN|unix#1: connection dropped (Connection reset by peer)
    2019-09-11T08:44:48.936Z|00007|jsonrpc|WARN|unix#6: receive error: Connection reset by peer
    2019-09-11T08:44:48.936Z|00008|reconnect|WARN|unix#6: connection dropped (Connection reset by peer)
        2019-09-11T08:44:50.057Z|00041|bridge|INFO|bridge br0: deleted interface vethcd1e33a0 on port 33
        2019-09-11T08:44:50.057Z|00042|bridge|INFO|bridge br0: deleted interface veth4018383c on port 37

So the errors don't seem to have anything to do with ovsdb or ovs-vswitchd restarting.
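
(To rule that out even more directly, a sketch assuming shell access to the OVS pod on that node; this was not captured in the CI artifacts:)

    # Compare each daemon's uptime against the pod's age; a short etime
    # would mean the process restarted underneath the pod.
    ps -o pid,etime,cmd -p "$(cat /var/run/openvswitch/ovs-vswitchd.pid)"
    ps -o pid,etime,cmd -p "$(cat /var/run/openvswitch/ovsdb-server.pid)"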

Comment 7 Casey Callendrello 2019-09-13 12:01:59 UTC
Gotcha - so which process is timing out / restarting? Is it ovsdb, ovs-vswitchd, or openshift-sdn? Openshift-sdn, which is normally quite chatty, doesn't seem to log anything. This makes me think that its connection to ovs-vswitchd is fine.

Comment 8 Ben Bennett 2019-09-17 13:15:33 UTC
Pushing to 4.3 because we have only seen this once.  We can fix it later in a Z-stream if necessary.

Comment 9 Aaron Conole 2019-09-17 13:48:48 UTC
No idea - maybe the ovs-vsctl is dying.  Is it on a timeout?  Is the timeout expiring?  I don't know all of the processes that connect to the ovsdb.

Comment 10 Dan Winship 2019-09-18 21:36:54 UTC
So here's an example:

3376  accept(15, {sa_family=AF_UNIX}, [128->2]) = 18
3376  fcntl(18, F_GETFL)                = 0x2 (flags O_RDWR)
3376  fcntl(18, F_SETFL, O_RDWR|O_NONBLOCK) = 0
3376  recvfrom(18, "{\"id\":0,\"method\":\"get_schema\",\"params\":[\"_Server\"]}", 512, 0, NULL, NULL) = 51
3376  sendto(18, "{\"id\":0,\"result\":{\"cksum\":\"3236486585 698\",\"name\":\"_Server\",\"version\":\"1.1.0\",\"tables\":{\"Database\":{\"columns\":{\"name\":{\"type\":\"string\"},\"model\":{\"type\":{\"key\":{\"type\":\"string\",\"enum\":[\"set\",[\"clustered\",\"standalone\"]]}}},\"connected\":{\"type\":\"boolean\"},\"leader\":{\"type\":\"boolean\"},\"schema\":{\"type\":{\"min\":0,\"key\":\"string\"}},\"sid\":{\"type\":{\"min\":0,\"key\":\"uuid\"}},\"cid\":{\"type\":{\"min\":0,\"key\":\"uuid\"}},\"index\":{\"type\":{\"min\":0,\"key\":\"integer\"}}}}}},\"error\":null}", 459, 0, NULL, 0) = 459
3376  recvfrom(18, "{\"id\":1,\"method\":\"monitor_cond\",\"params\":[\"_Server\",[\"monid\",\"_Server\"],{\"Database\":[{\"where\":[[\"name\",\"==\",\"Open_vSwitch\"]],\"columns\":[\"cid\",\"connected\",\"index\",\"leader\",\"model\",\"name\",\"schema\",\"sid\"]}]}]}", 461, 0, NULL, NULL) = 206
3376  sendto(18, "{\"id\":1,\"result\":{\"Database\":{\"41cd9e6f-8c9b-406e-bf23-231ee3431260\":{\"initial\":{\"connected\":true,\"name\":\"Open_vSwitch\",\"leader\":true,\"schema\":\"{\\\"cksum\\\":\\\"1452282319 23860\\\",\\\"name\\\":\\\"Open_vSwitch\\\",\\\"tables\\\":{\\\"AutoAttach\\\":{\\\"columns\\\":{\\\"mappings\\\":{\\\"type\\\":{\\\"key\\\":{\\\"maxInteger\\\":16777215,\\\"minInteger\\\":0,\\\"type\\\":\\\"integer\\\"},\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0,\\\"value\\\":{\\\"maxInteger\\\":4095,\\\"minInteger\\\":0,\\\"type\\\":\\\"integer\\\"}}},\\\"system_description\\\":{\\\"type\\\":\\\"string\\\"},\\\"system_name\\\":{\\\"type\\\":\\\"string\\\"}}},\\\"Bridge\\\":{\\\"columns\\\":{\\\"auto_attach\\\":{\\\"type\\\":{\\\"key\\\":{\\\"refTable\\\":\\\"AutoAttach\\\",\\\"type\\\":\\\"uuid\\\"},\\\"min\\\":0}},\\\"controller\\\":{\\\"type\\\":{\\\"key\\\":{\\\"refTable\\\":\\\"Controller\\\",\\\"type\\\":\\\"uuid\\\"},\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0}},\\\"datapath_id\\\":{\\\"ephemeral\\\":true,\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"min\\\":0}},\\\"datapath_type\\\":{\\\"type\\\":\\\"string\\\"},\\\"datapath_version\\\":{\\\"type\\\":\\\"string\\\"},\\\"external_ids\\\":{\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0,\\\"value\\\":\\\"string\\\"}},\\\"fail_mode\\\":{\\\"type\\\":{\\\"key\\\":{\\\"enum\\\":[\\\"set\\\",[\\\"secure\\\",\\\"standalone\\\"]],\\\"type\\\":\\\"string\\\"},\\\"min\\\":0}},\\\"flood_vlans\\\":{\\\"type\\\":{\\\"key\\\":{\\\"maxInteger\\\":4095,\\\"minInteger\\\":0,\\\"type\\\":\\\"integer\\\"},\\\"max\\\":4096,\\\"min\\\":0}},\\\"flow_tables\\\":{\\\"type\\\":{\\\"key\\\":{\\\"maxInteger\\\":254,\\\"minInteger\\\":0,\\\"type\\\":\\\"integer\\\"},\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0,\\\"value\\\":{\\\"refTable\\\":\\\"Flow_Table\\\",\\\"type\\\":\\\"uuid\\\"}}},\\\"ipfix\\\":{\\\"type\\\":{\\\"key\\\":{\\\"refTable\\\":\\\"IPFIX\\\",\\\"type\\\":\\\"uuid\\\"},\\\"min\\\":0}},\\\"mcast_snooping_enable\\\":{\\\"type\\\":\\\"boolean\\\"},\\\"mirrors\\\":{\\\"type\\\":{\\\"key\\\":{\\\"refTable\\\":\\\"Mirror\\\",\\\"type\\\":\\\"uuid\\\"},\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0}},\\\"name\\\":{\\\"mutable\\\":false,\\\"type\\\":\\\"string\\\"},\\\"netflow\\\":{\\\"type\\\":{\\\"key\\\":{\\\"refTable\\\":\\\"NetFlow\\\",\\\"type\\\":\\\"uuid\\\"},\\\"min\\\":0}},\\\"other_config\\\":{\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0,\\\"value\\\":\\\"string\\\"}},\\\"ports\\\":{\\\"type\\\":{\\\"key\\\":{\\\"refTable\\\":\\\"Port\\\",\\\"type\\\":\\\"uuid\\\"},\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0}},\\\"protocols\\\":{\\\"type\\\":{\\\"key\\\":{\\\"enum\\\":[\\\"set\\\",[\\\"OpenFlow10\\\",\\\"OpenFlow11\\\",\\\"OpenFlow12\\\",\\\"OpenFlow13\\\",\\\"OpenFlow14\\\",\\\"OpenFlow15\\\",\\\"OpenFlow16\\\"]],\\\"type\\\":\\\"string\\\"},\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0}},\\\"rstp_enable\\\":{\\\"type\\\":\\\"boolean\\\"},\\\"rstp_status\\\":{\\\"ephemeral\\\":true,\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0,\\\"value\\\":\\\"string\\\"}},\\\"sflow\\\":{\\\"type\\\":{\\\"key\\\":{\\\"refTable\\\":\\\"sFlow\\\",\\\"type\\\":\\\"uuid\\\"},\\\"min\\\":0}},\\\"status\\\":{\\\"ephemeral\\\":true,\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0,\\\"value\\\":\\\"string\\\"}},\\\"stp_enable\\\":{\\\"type\\\":\\\"boolean\\\"}},\\\"indexes\\\":[[\\\"name\\\"]]},\\\"Controller
\\\":{\\\"columns\\\":{\\\"connection_mode\\\":{\\\"type\\\":{\\\"key\\\":{\\\"enum\\\":[\\\"set\\\",[\\\"in-band\\\",\\\"out-of-band\\\"]],\\\"type\\\":\\\"string\\\"},\\\"min\\\":0}},\\\"controller_burst_limit\\\":{\\\"type\\\":{\\\"key\\\":{\\\"minInteger\\\":25,\\\"type\\\":\\\"integer\\\"},\\\"min\\\":0}},\\\"controller_rate_limit\\\":{\\\"type\\\":{\\\"key\\\":{\\\"minInteger\\\":100,\\\"type\\\":\\\"integer\\\"},\\\"min\\\":0}},\\\"enable_async_messages\\\":{\\\"type\\\":{\\\"key\\\":\\\"boolean\\\",\\\"min\\\":0}},\\\"external_ids\\\":{\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0,\\\"value\\\":\\\"string\\\"}},\\\"inactivity_probe\\\":{\\\"type\\\":{\\\"key\\\":\\\"integer\\\",\\\"min\\\":0}},\\\"is_connected\\\":{\\\"ephemeral\\\":true,\\\"type\\\":\\\"boolean\\\"},\\\"local_gateway\\\":{\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"min\\\":0}},\\\"local_ip\\\":{\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"min\\\":0}},\\\"local_netmask\\\":{\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"min\\\":0}},\\\"max_backoff\\\":{\\\"type\\\":{\\\"key\\\":{\\\"minInteger\\\":1000,\\\"type\\\":\\\"integer\\\"},\\\"min\\\":0}},\\\"other_config\\\":{\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0,\\\"value\\\":\\\"string\\\"}},\\\"role\\\":{\\\"ephemeral\\\":true,\\\"type\\\":{\\\"key\\\":{\\\"enum\\\":[\\\"set\\\",[\\\"master\\\",\\\"other\\\",\\\"slave\\\"]],\\\"type\\\":\\\"string\\\"},\\\"min\\\":0}},\\\"status\\\":{\\\"ephemeral\\\":true,\\\"type\\\":{\\\"key\\\":\\\"string\\\",\\\"max\\\":\\\"unlimited\\\",\\\"min\\\":0,\\\"value\\\":\\\"string\\\"}},\\\"target\\\":{\\\"type\\\":\\\"string\\\"}}},\\\"Flow_Sample_Collector_Set\\\":{\\\"columns\\\":{\\\"bridge\\\":{\\\"type\\\":{\\\"key\\\":{\\\"refTable\\\":\\\"Bridge\\\",\\\"type\\\":\\\"uuid\\\"}}},\\\"external_id"..., 16015, 0, NULL, 0) = 16015
3376  recvfrom(18, "{\"id\":2,\"method\":\"monitor_cond\",\"params\":[\"Open_vSwitch\",[\"monid\",\"Open_vSwitch\"],{\"Interface\":[{\"columns\":[\"name\",\"options\"]}],\"Open_vSwitch\":[{\"columns\":[\"cur_cfg\"]}]}]}{\"id\":3,\"method\":\"set_db_change_aware\",\"params\":[true]}", 255, 0, NULL, NULL) = 226
3376  sendto(18, "{\"id\":2,\"result\":{\"Interface\":{\"c3c8f8d9-fcd7-4a01-a26f-ec9e985bffef\":{\"initial\":{\"name\":\"veth57fcc4fb\"}},\"2c8fee01-ec4e-467e-ae25-c3f6c494fbe7\":{\"initial\":{\"name\":\"vethc3e93576\"}},\"0b86132f-10a3-45e7-913b-8564a48add51\":{\"initial\":{\"name\":\"vethd5e50465\"}},\"137e5eca-b967-404c-882a-ffa985bfd4c8\":{\"initial\":{\"name\":\"vethb2c50177\"}},\"fcb05554-0102-4cfa-8346-332bdc64e866\":{\"initial\":{\"name\":\"vethc9fd9f4a\"}},\"5866f70a-fe4d-4d73-a379-b5fe2ee86008\":{\"initial\":{\"name\":\"veth7cdf84e3\"}},\"444f2b35-8e62-4966-a463-8c62b6b92f5a\":{\"initial\":{\"name\":\"vxlan0\",\"options\":[\"map\",[[\"dst_port\",\"4789\"],[\"key\",\"flow\"],[\"remote_ip\",\"flow\"]]]}},\"fd11de49-fde4-467c-b8b6-567962375074\":{\"initial\":{\"name\":\"veth8a1683d4\"}},\"fb0203b5-12ad-4eea-8e75-d7849dcba8de\":{\"initial\":{\"name\":\"veth4c6db7a2\"}},\"d83a5766-6fad-44c7-88fa-3439fb7b653e\":{\"initial\":{\"name\":\"veth80afd067\"}},\"0bd45664-58ea-407a-81b0-9554f3406cf7\":{\"initial\":{\"name\":\"br0\"}},\"f61e422d-12a5-4467-b2aa-0ae26575c815\":{\"initial\":{\"name\":\"vethd48cdce8\"}},\"ec991686-5c70-44f6-a670-cb68416e3a30\":{\"initial\":{\"name\":\"vethef45f97d\"}},\"42450e23-b9fa-423e-a508-f2f8fc3c0ca7\":{\"initial\":{\"name\":\"vethd1adefa5\"}},\"992aac17-7284-46bb-a3cb-720f8f6a2d1c\":{\"initial\":{\"name\":\"veth81f1a00d\"}},\"0e0d2ae5-3c6d-4c72-bc17-a195621ad8db\":{\"initial\":{\"name\":\"vethdf90af52\"}},\"97a2c47b-6a5b-44d0-893e-df6fd031dcfe\":{\"initial\":{\"name\":\"vethfa2985a0\"}},\"816e70a0-efef-4613-b088-3dc2da1e3e0c\":{\"initial\":{\"name\":\"tun0\"}},\"c9120f66-e362-441f-8a46-d424c85aec1d\":{\"initial\":{\"name\":\"vetha5266535\"}},\"0385e547-7fd4-4f3c-a77f-620aa1cd05c1\":{\"initial\":{\"name\":\"veth9c0f9133\"}}},\"Open_vSwitch\":{\"660748cb-9c68-4f75-8f43-f4b0dd297df3\":{\"initial\":{\"cur_cfg\":2415}}}},\"error\":null}", 1684, 0, NULL, 0) = 1684
3376  sendto(18, "{\"id\":3,\"result\":{},\"error\":null}", 33, 0, NULL, 0) = -1 EPIPE (Broken pipe)
3376  --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=3376, si_uid=998} ---
3376  write(5, "2019-09-18T21:09:54.047Z|00761|jsonrpc|WARN|unix#5251: send error: Broken pipe\n", 79) = 79
3376  close(18)                         = 0


"method":"monitor_cond"

So this looks like an "ovsdb-client monitor-cond" call, and I have no idea where that would be coming from. Openshift-sdn doesn't run ovsdb-client at all...
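
(One way to hunt for the mystery client, a sketch assuming access to the node; a short-lived client would be missed, but a long-lived one should show up:)

    # Look for ovsdb-client (or other JSON-RPC client) processes and for
    # open connections to the OVS control/database sockets.
    pgrep -af ovsdb-client
    ss -xp | grep openvswitch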

Comment 15 errata-xmlrpc 2020-01-23 11:05:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

