Bug 1852618 - [4.4]Cluster operator ingress has been degraded
Summary: [4.4]Cluster operator ingress has been degraded
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Alexander Constantinescu
QA Contact: huirwang
URL:
Whiteboard:
Duplicates: 1831283 1848374 (view as bug list)
Depends On:
Blocks: 1855118 1857738
 
Reported: 2020-06-30 21:33 UTC by Hongkai Liu
Modified: 2020-10-13 09:23 UTC
11 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1857738 1857742 (view as bug list)
Environment:
Last Closed:
Target Upstream Version:


Attachments (Terms of Use)
flow issues output (507.79 KB, text/plain)
2020-07-09 13:36 UTC, Alexander Constantinescu
second flow restore with db dump (2.10 MB, text/plain)
2020-07-09 17:04 UTC, Alexander Constantinescu


Links
System ID Priority Status Summary Last Updated
Github openshift cluster-network-operator pull 701 None closed Set timeoutSeconds to reflect expected timeout for OVS commands 2020-10-23 09:07:41 UTC

Description Hongkai Liu 2020-06-30 21:33:01 UTC
Description of problem:
One router pod is NOT ready.

oc --context build02 get pod -n openshift-ingress
NAME                              READY   STATUS    RESTARTS   AGE
router-default-574c789c6b-5zvlw   1/1     Running   0          4d22h
router-default-574c789c6b-sdjhs   0/1     Running   1          4d22h

In the pod logs:

Trace[1278989275]: [30.000584618s] [30.000584618s] END
E0630 21:30:21.755674       1 reflector.go:123] github.com/openshift/router/pkg/router/template/service_lookup.go:32: Failed to list *v1.Service: Get https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout

oc --context build02 get clusterversions.config.openshift.io
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.10    True        False         4d22h   Error while reconciling 4.4.10: an unknown error has occurred

The cluster build02 is a production cluster for CI.

Comment 2 Hongkai Liu 2020-06-30 21:50:07 UTC
https://coreos.slack.com/archives/CEKNRGF25/p1593549189041400?thread_ts=1593543661.033500&cid=CEKNRGF25

Miciah Dashiel Butler Masters  1 hour ago
I found this in namespaces/openshift-sdn/pods/sdn-prp75/sdn/sdn/logs/previous.log:
2020-06-30T18:09:16.127346496Z I0630 18:09:16.127059 3076489 node.go:487] Reattaching pod 'openshift-ingress/router-default-574c789c6b-sdjhs' to SDN
2020-06-30T18:09:16.146756418Z I0630 18:09:16.145634 3076489 ovs.go:169] Error executing ovs-vsctl: ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (Connection refused)
2020-06-30T18:09:16.186187339Z I0630 18:09:16.161740 3076489 ovs.go:169] Error executing ovs-vsctl: ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (Connection refused)
2020-06-30T18:09:16.333740014Z W0630 18:09:16.333253 3076489 pod.go:274] CNI_ADD openshift-ingress/router-default-574c789c6b-sdjhs failed: exit status 1
2020-06-30T18:09:16.333740014Z W0630 18:09:16.333317 3076489 node.go:493] Could not reattach pod 'openshift-ingress/router-default-574c789c6b-sdjhs' to SDN: exit status 1

Comment 3 W. Trevor King 2020-06-30 22:17:29 UTC
Random bits from the must-gather for context, and to give folks some higher-level symptoms to help with Bugzilla searches:

$ yaml2json <cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml | jq -r '.status.history[] | .startedTime + " " + .completionTime + " " + .version + " " + .state + " " + (.verified | tostring)'
2020-06-25T19:46:53Z 2020-06-25T20:40:09Z 4.4.10 Completed true
2020-05-21T19:30:15Z 2020-05-21T19:53:42Z 4.4.4 Completed false
$ yaml2json <cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
2020-05-21T19:53:42Z Available=True -: Done applying 4.4.10
2020-06-30T19:21:54Z Failing=False -: -
2020-06-25T20:40:09Z Progressing=False -: Cluster version is 4.4.10
2020-05-21T19:30:15Z RetrievedUpdates=True -: -

Hmm, I guess the gather was post-resolution?  Getting an Insights tarball for this cluster shows the ingress condition:

$ tar -xOz config/version < "$(ls | tail -n1)" | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + " " + .status + " " + (.reason // "-") + " " + (.message // "-")'
2020-05-21T19:53:42Z Available True - Done applying 4.4.10
2020-06-30T20:38:31Z Failing True MultipleErrors Multiple errors are preventing progress:
* Cluster operator ingress is reporting a failure: Some ingresscontrollers are degraded: default
* Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Grafana failed: reconciling Grafana Deployment failed: updating deployment object failed: waiting for DeploymentRollout of grafana: deployment grafana is not ready. status: (replicas: 1, updated: 1, ready: 0, unavailable: 1)
2020-06-25T20:40:09Z Progressing False MultipleErrors Error while reconciling 4.4.10: an unknown error has occurred
2020-05-21T19:30:15Z RetrievedUpdates True - -
$ tar -xOz config/clusteroperator/ingress < "$(ls | tail -n1)" | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + " " + .status + " " + (.reason // "-") + " " + (.message // "-")'
2020-06-30T18:09:43Z Degraded True IngressControllersDegraded Some ingresscontrollers are degraded: default
2020-06-30T18:09:13Z Progressing True Reconciling Not all ingress controllers are available.
2020-06-30T18:09:13Z Available False IngressUnavailable Not all ingress controllers are available.

Those ingress-operator conditions are still in the must-gather:

$ yaml2json <cluster-scoped-resources/config.openshift.io/clusteroperators/ingress.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
2020-06-30T18:09:43Z Degraded=True IngressControllersDegraded: Some ingresscontrollers are degraded: default
2020-06-30T18:09:13Z Progressing=True Reconciling: Not all ingress controllers are available.
2020-06-30T18:09:13Z Available=False IngressUnavailable: Not all ingress controllers are available.
$ yaml2json <namespaces/openshift-ingress-operator/operator.openshift.io/ingresscontrollers/default.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + " " + .status + " " + (.reason // "-") + " " + (.message // "-")'
2020-05-21T19:46:15Z Admitted True Valid -
2020-06-30T18:09:13Z Available False MinimumReplicasUnavailable The deployment is unavailable: Deployment does not have minimum availability.
2020-06-30T18:09:13Z DeploymentDegraded True DeploymentUnavailable The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.
2020-05-21T19:46:19Z LoadBalancerManaged True WantedByEndpointPublishingStrategy The endpoint publishing strategy supports a managed load balancer
2020-05-21T19:47:03Z LoadBalancerReady True LoadBalancerProvisioned The LoadBalancer service is provisioned
2020-05-21T19:46:19Z DNSManaged True Normal DNS management is supported and zones are specified in the cluster DNS config.
2020-05-21T19:47:04Z DNSReady True NoFailedZones The record is provisioned in all reported zones.
2020-06-30T18:09:43Z Degraded True DegradedConditions One or more other status conditions indicate a degraded state: DeploymentDegraded=True
$ yaml2json <namespaces/openshift-ingress/apps/deployments.yaml | jq -r '.items[] | select(.metadata.name == "router-default").status'
{
  "conditions": [
    {
      "status": "True",
      "lastUpdateTime": "2020-06-25T19:57:25Z",
      "reason": "NewReplicaSetAvailable",
      "lastTransitionTime": "2020-05-21T19:46:15Z",
      "type": "Progressing",
      "message": "ReplicaSet \"router-default-574c789c6b\" has successfully progressed."
    },
    {
      "status": "False",
      "lastUpdateTime": "2020-06-30T18:09:13Z",
      "reason": "MinimumReplicasUnavailable",
      "lastTransitionTime": "2020-06-30T18:09:13Z",
      "type": "Available",
      "message": "Deployment does not have minimum availability."
    }
  ],
  "availableReplicas": "1",
  "updatedReplicas": "2",
  "readyReplicas": "1",
  "replicas": "2",
  "unavailableReplicas": "1",
  "observedGeneration": "4"
}
$ yaml2json <namespaces/openshift-ingress/apps/deployments.yaml | jq -r '.items[] | select(.metadata.name == "router-default").spec.strategy'
{
  "rollingUpdate": {
    "maxUnavailable": "0",
    "maxSurge": "1"
  },
  "type": "RollingUpdate"
}

Hmm.  maxUnavailable 0 is pretty aggressive.  Is ingress really broken with only a single ready router pod?  Back to the broken router pod:
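For context, a sketch of the availability math implied above (illustrative values only, not taken from the must-gather): with replicas 2 and maxUnavailable 0, the Deployment needs all 2 pods ready to count as available, so a single unready router pod is enough to flip Available=False.

```yaml
# Hypothetical illustration of the rollingUpdate settings quoted above:
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # minimum available = replicas - maxUnavailable = 2
      maxSurge: 1         # up to 3 pods may exist during a rollout
# With only 1 of 2 pods ready, availableReplicas (1) < minimum available (2),
# so the Deployment reports MinimumReplicasUnavailable.
```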

$ yaml2json <namespaces/openshift-ingress/pods/router-default-574c789c6b-sdjhs/router-default-574c789c6b-sdjhs.yaml | jq -r '.status.containerStatuses[].lastState.terminated'
{
  "message": "08096s] [30.000408096s] END\nE0630 19:10:20.942115       1 reflector.go:123] github.com/openshift/router/pkg/router/controller/factory/factory.go:115: Failed to list *v1.Route: Get https://172.30.0.1:443/apis/route.openshift.io/v1/routes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout\nI0630 19:10:51.941091       1 trace.go:116] Trace[1208412245]: \"Reflector ListAndWatch\" name:github.com/openshift/router/pkg/router/controller/factory/factory.go:115 (started: 2020-06-30 19:10:21.940404787 +0000 UTC m=+3581.314237596) (total time: 30.000641626s):\nTrace[1208412245]: [30.000641626s] [30.000641626s] END\nE0630 19:10:51.941133       1 reflector.go:123] github.com/openshift/router/pkg/router/controller/factory/factory.go:115: Failed to list *v1.Endpoints: Get https://172.30.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout\nI0630 19:10:51.942009       1 trace.go:116] Trace[1959082013]: \"Reflector ListAndWatch\" name:github.com/openshift/router/pkg/router/template/service_lookup.go:32 (started: 2020-06-30 19:10:21.941540332 +0000 UTC m=+3581.315373190) (total time: 30.000436572s):\nTrace[1959082013]: [30.000436572s] [30.000436572s] END\nE0630 19:10:51.942034       1 reflector.go:123] github.com/openshift/router/pkg/router/template/service_lookup.go:32: Failed to list *v1.Service: Get https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout\nI0630 19:10:51.943090       1 trace.go:116] Trace[1299276755]: \"Reflector ListAndWatch\" name:github.com/openshift/router/pkg/router/controller/factory/factory.go:115 (started: 2020-06-30 19:10:21.942654557 +0000 UTC m=+3581.316487446) (total time: 30.000412715s):\nTrace[1299276755]: [30.000412715s] [30.000412715s] END\nE0630 19:10:51.943109       1 reflector.go:123] github.com/openshift/router/pkg/router/controller/factory/factory.go:115: Failed to list *v1.Route: Get 
https://172.30.0.1:443/apis/route.openshift.io/v1/routes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout\n",
  "containerID": "cri-o://ee50c0e1a7e7ee76623d0719dcf05b970635e0342bcdfbafc1b0d96032449481",
  "exitCode": "137",
  "reason": "Error",
  "finishedAt": "2020-06-30T19:11:14Z",
  "startedAt": "2020-06-30T18:10:40Z"
}

So that's what the router is crashlooping on, and kicking the SDN pod seems to have unstuck the update.

Comment 5 Clayton Coleman 2020-07-03 19:31:40 UTC
This is a release blocker for 4.5; it happened in a CI cluster after upgrading to rc.6.

Comment 6 zhaozhanqi 2020-07-06 10:12:24 UTC
This issue is very hard to reproduce in QE's clusters during upgrade.
@Hongkai, could you show which kind of platform and which kind of installation method you used? Maybe it's not related to this issue, but we need to give it a try. Thanks.

Comment 7 Hongkai Liu 2020-07-06 13:31:44 UTC
@zhaozhanqi 

We got the 4.4.4 installer from the release page.
https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/

The upgrade path is
4.4.4 to 4.4.10, and then 4.5.0-rc.6.

The cluster is hosted on GCP.
However, the problem showed up only under some workload.
The only way of reproducing it that I know of is to run CI tests with the original repo and the installer repo.
It would be hard for you to get the whole Prow setup running in your testing env.

I would suggest doing many OpenShift builds to see if you can hit the bug.

Comment 8 Alexander Constantinescu 2020-07-06 16:53:00 UTC
Hi

I've had a look at the must-gather in #comment 1.

I saw that the networking pods did not manage to bring up router-default-574c789c6b-sdjhs, as mentioned in #comment 2. Looking closer at those networking pods to find out why, I saw this:

From: namespaces/openshift-sdn/pods/sdn-prp75/sdn/sdn/logs/previous.log


2020-06-25T20:12:58.212104807Z I0625 20:12:58.212006 2453586 proxier.go:368] userspace proxy: processing 0 service events
2020-06-25T20:12:58.212104807Z I0625 20:12:58.212032 2453586 proxier.go:347] userspace syncProxyRules took 48.100771ms
2020-06-25T20:12:58.428219987Z I0625 20:12:58.428100 2453586 proxier.go:368] userspace proxy: processing 0 service events
2020-06-25T20:12:58.428219987Z I0625 20:12:58.428136 2453586 proxier.go:347] userspace syncProxyRules took 43.537689ms
2020-06-25T20:12:58.636985192Z I0625 20:12:58.636904 2453586 proxier.go:368] userspace proxy: processing 0 service events
2020-06-25T20:12:58.636985192Z I0625 20:12:58.636935 2453586 proxier.go:347] userspace syncProxyRules took 45.291026ms
2020-06-25T20:13:21.375947577Z I0625 20:13:21.375893 2453586 pod.go:539] CNI_DEL openshift-must-gather-t2l25/must-gather-hnf22
2020-06-25T20:13:26.258670957Z interrupt: Gracefully shutting down ...
2020-06-30T18:08:48.080756489Z I0630 18:08:48.080221 3076489 cmd.go:123] Reading proxy configuration from /config/kube-proxy-config.yaml
2020-06-30T18:08:48.105782541Z I0630 18:08:48.093063 3076489 feature_gate.go:243] feature gates: &{map[]}
2020-06-30T18:08:48.105782541Z I0630 18:08:48.093133 3076489 cmd.go:227] Watching config file /config/kube-proxy-config.yaml for changes
2020-06-30T18:08:48.105782541Z I0630 18:08:48.093183 3076489 cmd.go:227] Watching config file /config/..2020_06_25_19_59_50.375303914/kube-proxy-config.yaml for changes
2020-06-30T18:08:48.625784451Z I0630 18:08:48.622857 3076489 node.go:147] Initializing SDN node of type "redhat/openshift-ovs-networkpolicy" with configured hostname "build0-gstfj-w-b-9x7xl.c.openshift-ci-build-farm.internal" (IP "10.0.32.5")
2020-06-30T18:08:48.852416356Z I0630 18:08:48.852108 3076489 cmd.go:160] Starting node networking (v0.0.0-alpha.0-143-g26523681)
2020-06-30T18:08:48.852416356Z I0630 18:08:48.852152 3076489 node.go:385] Starting openshift-sdn network plugin
2020-06-30T18:08:51.515353736Z I0630 18:08:51.515003 3076489 sdn_controller.go:139] [SDN setup] full SDN setup required (plugin is not setup)
2020-06-30T18:08:58.307766761Z I0630 18:08:58.307242 3076489 node.go:434] Starting openshift-sdn pod manager
2020-06-30T18:08:58.311807815Z I0630 18:08:58.308377 3076489 node.go:487] Reattaching pod 'ocp/base-4-5-rhel8-7c99d64

It seems something killed the SDN pod on the 25th of June and it didn't restart until the 30th. 

For the OVS pods (running on the same node) I have even less to go on, `namespaces/openshift-sdn $ less pods/ovs-2t98n/openvswitch/openvswitch/logs/previous.log`

2020-06-30T18:09:30.04182099Z Starting ovsdb-server.
2020-06-30T18:09:30.805895582Z PMD: net_mlx4: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
2020-06-30T18:09:30.805895582Z PMD: net_mlx4: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx4)
2020-06-30T18:09:30.811775694Z net_mlx5: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
2020-06-30T18:09:30.811775694Z net_mlx5: cannot 

I.e., there are no logs from before the 30th of June.

That also explains the following logs mentioned in #comment 2:

2020-06-30T18:09:16.146756418Z I0630 18:09:16.145634 3076489 ovs.go:169] Error executing ovs-vsctl: ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (Connection refused)

It's because the OVS pod was not even up and running. 

I am in discussions with Hongkai trying to get the kubelet logs from the node in question: build0-gstfj-w-b-9x7xl.c.openshift-ci-build-farm.internal

Once we have the kubelet logs we might understand why the networking pods were not running. 

-Alexander

Comment 9 zhaozhanqi 2020-07-07 02:06:37 UTC
Thanks @Hongkai Liu and @Alexander for helping with this. I will also give it a try on a GCP cluster to see if we can hit this issue.

Comment 10 zhaozhanqi 2020-07-07 11:43:38 UTC
I triggered the upgrade from 4.4.10 to 4.5.0-rc.6 seven times on GCP today, but had no luck reproducing this issue.

Comment 11 Alexander Constantinescu 2020-07-07 12:01:26 UTC
Hi 

An update on this issue:

I have not been able to get to the bottom of it. With Hongkai's help we are able to reproduce the issue using this CI job: https://github.com/openshift/release/pull/10001

What I have observed is that new nodes are added to the cluster once that job is executed, which in turn deploys new networking pods on those nodes. At the same time, however, I am also seeing the OVS liveness probes fail on some already existing nodes. I believe the root cause is related to this.

The OVS liveness probe starts failing and triggers a restart of OVS on these nodes. The probe failure might have to do with the following (which is logged right after the OVS pod receives the SIGTERM):

2020-07-07T09:24:36.163Z|00148|connmgr|INFO|br0<->unix#55: 5 flow_mods in the last 0 s (5 adds)
2020-07-07T09:24:37Z|00001|vconn|WARN|unix:/var/run/openvswitch/br0.mgmt: version negotiation failed (we support version 0x01, peer supports version 0x04)
ovs-ofctl: br0: failed to connect to socket (Broken pipe)
2020-07-07T09:24:37.184Z|00149|bridge|INFO|bridge br0: added interface vethc7afb3e4 on port 17
2020-07-07T09:24:37.408Z|00150|vconn|WARN|unix#57: version negotiation failed (we support version 0x04, peer supports version 0x01)
2020-07-07T09:24:37.408Z|00151|rconn|WARN|br0<->unix#57: connection dropped (Protocol error)
2020-07-07T09:24:38.019Z|00152|connmgr|INFO|br0<->unix#60: 5 flow_mods in the last 0 s (5 adds)
2020-07-07 09:24:38 info: Saved flows

When OVS restarts it tries to restore the flows it had when exiting; this can be seen in the logs of the newly created OVS pod:

2020-07-07 09:25:08 info: Created br0, now adding flows ...
+ ovs-ofctl add-tlv-map br0 ''
2020-07-07T09:25:09Z|00001|vconn|WARN|unix:/var/run/openvswitch/br0.mgmt: version negotiation failed (we support version 0x01, peer supports version 0x04)
ovs-ofctl: br0: failed to connect to socket (Protocol error)
+ ovs-ofctl -O OpenFlow13 add-groups br0 /var/run/openvswitch/ovs-save.1FbOMfuiWG/br0.groups.dump
+ ovs-ofctl -O OpenFlow13 replace-flows br0 /var/run/openvswitch/ovs-save.1FbOMfuiWG/br0.flows.dump
+ rm -rf /var/run/openvswitch/ovs-save.1FbOMfuiWG
2020-07-07 09:25:10 info: Done restoring the existing flows ...

It is thus unable to restore the flows. 

After talking to people more experienced with OVS, the "we support version 0x01, peer supports version 0x04" message seems to indicate that we are not using the correct protocol (OpenFlow13) when communicating with OVS. This seems very strange to me, and I am not fully sure I understand it. Anyhow, manually testing this script on a working OVS pod does not seem to indicate that it is critical: the commands can be executed despite these messages being logged.

OVS restarting in turn restarts SDN. SDN, upon restart, does not retrieve the networking information for the pods already running on the node; the logs are filled with:

I0707 09:25:45.532835 2276853 sdn_controller.go:139] [SDN setup] full SDN setup required (Link not found)
W0707 09:25:53.680616 2276853 pod.go:181] No network information for pod ci/d16f07b6-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680661 2276853 pod.go:181] No network information for pod ci/ce8f7fc9-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680670 2276853 pod.go:181] No network information for pod ci/ce6f94aa-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680679 2276853 pod.go:181] No network information for pod openshift-dns/dns-default-q6shx
W0707 09:25:53.680687 2276853 pod.go:181] No network information for pod ci-op-p5hncqt3/src-build
W0707 09:25:53.680695 2276853 pod.go:181] No network information for pod ci-op-f1c8w395/openstack-installer-build
W0707 09:25:53.680704 2276853 pod.go:181] No network information for pod ci-op-msqpw03z/libvirt-installer-build
W0707 09:25:53.680711 2276853 pod.go:181] No network information for pod ci-op-834xvgl1/src-build
W0707 09:25:53.680718 2276853 pod.go:181] No network information for pod ci/cec88a62-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680725 2276853 pod.go:181] No network information for pod ci/namespace-ttl-controller-784-znmmr
W0707 09:25:53.680733 2276853 pod.go:181] No network information for pod ci-op-24thkvqb/baremetal-installer-build
W0707 09:25:53.680741 2276853 pod.go:181] No network information for pod ci/d1124342-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680748 2276853 pod.go:181] No network information for pod ci/cfc278fe-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680756 2276853 pod.go:181] No network information for pod openshift-image-registry/node-ca-g68ld
W0707 09:25:53.680763 2276853 pod.go:181] No network information for pod ci/d1aab262-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680770 2276853 pod.go:181] No network information for pod ci/ce850b4a-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680776 2276853 pod.go:181] No network information for pod ci/secret-mirroring-controller-692-cvhrg
W0707 09:25:53.680783 2276853 pod.go:181] No network information for pod ci/cec01950-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680791 2276853 pod.go:181] No network information for pod ci/d079db6c-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680797 2276853 pod.go:181] No network information for pod ci/cf67b9bc-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680805 2276853 pod.go:181] No network information for pod ci/ce807aae-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680813 2276853 pod.go:181] No network information for pod ci-op-p9hct7ip/src-build
W0707 09:25:53.680821 2276853 pod.go:181] No network information for pod openshift-monitoring/prometheus-adapter-b9b49579d-9wjms
W0707 09:25:53.680827 2276853 pod.go:181] No network information for pod ci/cf103584-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680834 2276853 pod.go:181] No network information for pod ci/ce9ec660-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680841 2276853 pod.go:181] No network information for pod openshift-monitoring/grafana-f457c8645-wr7bg
W0707 09:25:53.680848 2276853 pod.go:181] No network information for pod ci/d05b1829-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680855 2276853 pod.go:181] No network information for pod ci-op-0zw7qv58/upi-installer-build

This is because there's nothing in OVS:

[root@build0-gstfj-w-c-8l7mw /]#  ovs-vsctl  --columns=name,external_ids,ofport find interface
name                : "vxlan0"
external_ids        : {}
ofport              : 1

name                : "br0"
external_ids        : {}
ofport              : 65534

name                : "tun0"
external_ids        : {}
ofport              : 2


As all those pods were already running on the node, we get no CNI_ADD and subsequently do nothing to rectify the situation.

The root cause of everything mentioned in this comment is the liveness probe, which starts failing for no reason on a node that was running fine until the workload was triggered.

FYI: the DNS operator is hung because it tries communicating with the api-server on 172.30.0.1, which is obviously unsuccessful as that pod's networking is non-existent.

I suspect that whatever causes the liveness probe to fail (i.e.: openvswitch) means it can't dump the flows prior to exiting when it gets its SIGTERM, which means that it restores files which are empty or similar.

I need to investigate what causes the liveness probe to fail, which means temporarily stopping the CNO on the CI cluster.

Comment 12 Scott Dodson 2020-07-07 13:47:11 UTC
Given this was present in 4.4, this does not meet the requirements for a 4.5 blocker. Moving to 4.6, please clone back to 4.5.z and 4.4.z as appropriate.

Comment 13 Alexander Constantinescu 2020-07-08 08:46:42 UTC
OK, there's definitely a problem with the flow restore for OVS. 

I've modified the OVS daemonset to not delete the files containing the flows it restores upon booting the OVS pod, and the outcome is strange: on some nodes, the flow restore does not happen (though the logs do not indicate any issues).

This is the file used to restore the flows (taken from a node which experienced the issue):

cat /var/run/openvswitch/flows.sh
ovs-ofctl add-tlv-map br0 ''
ovs-ofctl -O OpenFlow13 add-groups br0               "/var/run/openvswitch/ovs-save.ATtBUXQktE/br0.groups.dump" 
ovs-ofctl -O OpenFlow13 replace-flows br0               "/var/run/openvswitch/ovs-save.ATtBUXQktE/br0.flows.dump" 

It will use the temporary file ovs-save.ATtBUXQktE to restore everything. I grep-ed for the DNS pod IP in /var/run/openvswitch/ovs-save.ATtBUXQktE/br0.flows.dump:

[root@build0-gstfj-w-d-ckk6s /]# grep -ri 10.128.70.2 /var/run/openvswitch/ovs-save.ATtBUXQktE/br0.flows.dump
/var/run/openvswitch/ovs-save.ATtBUXQktE/br0.flows.dump: table=20, priority=100,arp,in_port=3,arp_spa=10.128.70.2,arp_sha=00:00:0a:80:46:02/00:00:ff:ff:ff:ff actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
/var/run/openvswitch/ovs-save.ATtBUXQktE/br0.flows.dump: table=20, priority=100,arp,in_port=8,arp_spa=10.128.70.2,arp_sha=00:00:0a:80:46:02/00:00:ff:ff:ff:ff actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
/var/run/openvswitch/ovs-save.ATtBUXQktE/br0.flows.dump: table=20, priority=100,ip,in_port=3,nw_src=10.128.70.2 actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
/var/run/openvswitch/ovs-save.ATtBUXQktE/br0.flows.dump: table=20, priority=100,ip,in_port=8,nw_src=10.128.70.2 actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
/var/run/openvswitch/ovs-save.ATtBUXQktE/br0.flows.dump: table=25, priority=100,ip,nw_src=10.128.70.2 actions=load:0x5a2682->NXM_NX_REG0[],goto_table:30
/var/run/openvswitch/ovs-save.ATtBUXQktE/br0.flows.dump: table=40, priority=100,arp,arp_tpa=10.128.70.2 actions=output:8
/var/run/openvswitch/ovs-save.ATtBUXQktE/br0.flows.dump: table=70, priority=100,ip,nw_dst=10.128.70.2 actions=load:0x5a2682->NXM_NX_REG1[],load:0x8->NXM_NX_REG2[],goto_table:80

However performing a dump of the flows on that node:

[root@build0-gstfj-w-d-ckk6s /]#  ovs-ofctl -O OpenFlow13 dump-flows br0 | grep -i 10.128.70.2

Empty...

Restoring the flows manually later on succeeded however:

[root@build0-gstfj-w-d-ckk6s /]# sh +x /var/run/openvswitch/flows.sh 
2020-07-08T08:31:48Z|00001|vconn|WARN|unix:/var/run/openvswitch/br0.mgmt: version negotiation failed (we support version 0x01, peer supports version 0x04)
ovs-ofctl: br0: failed to connect to socket (Broken pipe)
[root@build0-gstfj-w-d-ckk6s /]#  ovs-ofctl -O OpenFlow13 dump-flows br0 | grep -i 10.128.70.2
 cookie=0x0, duration=4.569s, table=20, n_packets=0, n_bytes=0, priority=100,arp,in_port=3,arp_spa=10.128.70.2,arp_sha=00:00:0a:80:46:02/00:00:ff:ff:ff:ff actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
 cookie=0x0, duration=4.568s, table=20, n_packets=0, n_bytes=0, priority=100,arp,in_port=8,arp_spa=10.128.70.2,arp_sha=00:00:0a:80:46:02/00:00:ff:ff:ff:ff actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
 cookie=0x0, duration=4.568s, table=20, n_packets=0, n_bytes=0, priority=100,ip,in_port=3,nw_src=10.128.70.2 actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
 cookie=0x0, duration=4.567s, table=20, n_packets=0, n_bytes=0, priority=100,ip,in_port=8,nw_src=10.128.70.2 actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
 cookie=0x0, duration=4.566s, table=25, n_packets=0, n_bytes=0, priority=100,ip,nw_src=10.128.70.2 actions=load:0x5a2682->NXM_NX_REG0[],goto_table:30
 cookie=0x0, duration=4.565s, table=40, n_packets=2, n_bytes=84, priority=100,arp,arp_tpa=10.128.70.2 actions=output:8
 cookie=0x0, duration=4.563s, table=70, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.128.70.2 actions=load:0x5a2682->NXM_NX_REG1[],load:0x8->NXM_NX_REG2[],goto_table:80


I can only imagine that the restore commands nested in that flows.sh script failed at the moment the pod booted. I will add a retry for that in the entrypoint of that container.
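A retry wrapper along these lines could paper over a transiently failing restore. This is a hypothetical sketch, not the actual entrypoint change:

```shell
#!/bin/sh
# Hypothetical retry helper: run a command up to $1 times,
# sleeping $2 seconds between attempts. Not the real
# openvswitch container entrypoint, just a sketch.
retry() {
    attempts=$1; delay=$2; shift 2
    i=1
    while true; do
        if "$@"; then
            return 0
        fi
        if [ "$i" -ge "$attempts" ]; then
            echo "retry: giving up after $attempts attempts: $*" >&2
            return 1
        fi
        i=$((i + 1))
        sleep "$delay"
    done
}

# Example usage (the flows.sh path matches the one quoted above):
# retry 5 2 sh /var/run/openvswitch/flows.sh
```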

FYI: the reason our livenessProbe fails so often is that we never set a "timeoutSeconds" on it. Both the liveness and readiness probes execute 3 OVS commands with timeouts attached:

https://github.com/openshift/cluster-network-operator/blob/release-4.5/bindata/network/openshift-sdn/sdn-ovs.yaml#L139

The total expected timeout seems to be a maximum of 20 seconds. However, the livenessProbe will time out after 1 second (its default value).

I will modify the livenessProbe accordingly so that we don't reboot OVS so often.
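A sketch of what such a probe change could look like, with hypothetical values (the actual change is tracked in the linked cluster-network-operator PR 701, "Set timeoutSeconds to reflect expected timeout for OVS commands"; the probe command below is a placeholder):

```yaml
# Hypothetical sketch: give the exec probe at least as much time
# as the OVS commands it runs (~20s total), instead of the 1s default.
livenessProbe:
  exec:
    command: ["/usr/share/openvswitch/scripts/ovs-ctl", "status"]  # placeholder
  initialDelaySeconds: 10
  periodSeconds: 30
  timeoutSeconds: 20   # kubelet default is 1s, far below the commands' own timeouts
```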

Comment 14 zhaozhanqi 2020-07-08 09:17:22 UTC
Thanks for the detailed information, @Alexander.
From the above, this seems to be the same issue as bug https://bugzilla.redhat.com/show_bug.cgi?id=1848374

Comment 15 Alexander Constantinescu 2020-07-08 10:23:32 UTC
I am in talks with Zhao and Huiran from QE to see if they can provide a QE reproducer on 4.5, so that I can step away from using the Prow CI production cluster.

FYI: I've modified the OVS daemonset to check the exit status of the restore script, thinking it might be erroring when trying to restore the flows. It does not error; it just doesn't restore the flows.

Comment 16 Flavio Leitner 2020-07-08 18:30:42 UTC
(In reply to Alexander Constantinescu from comment #13)

2020-07-08T08:31:48Z|00001|vconn|WARN|unix:/var/run/openvswitch/br0.mgmt: version negotiation failed (we support version 0x01, peer supports version 0x04)

That message means the client is using the default OpenFlow10 while the bridge is using OpenFlow13.
Perhaps something failed to configure the bridge?
It may have prevented the flow loading, and then the flow dump is empty. See bz#1278136


> OK, there's definitely a problem with the flow restore for OVS. 
> 
> I've modified the OVS daemonset to not delete the files containing the flows
> it restores upon booting the OVS pod, and the outcome is strange. On some
> nodes, the flow restore does not happen (though logs do not indicate any
> issues)
> 
> This is the file used to restore the flows (taken from a node which
> experienced the issue):
> 
> cat /var/run/openvswitch/flows.sh
> ovs-ofctl add-tlv-map br0 ''
> ovs-ofctl -O OpenFlow13 add-groups br0              
> "/var/run/openvswitch/ovs-save.ATtBUXQktE/br0.groups.dump" 
> ovs-ofctl -O OpenFlow13 replace-flows br0              
> "/var/run/openvswitch/ovs-save.ATtBUXQktE/br0.flows.dump" 
> 

Can you run the following command before restoring the flows and leave it running
until you see the empty flow table?

   ovs-ofctl monitor <bridge> watch:

This will show if the flows are accepted and what happened after that.
If you don't see anything, add a generic flow and test if that shows up.

Another useful command to run after is:

   ovsdb-tool -m show-log -v

It will tell if there was any modification in the database.
Is the bridge in secure mode? If so, is the controller up? Could it have managed the bridge in parallel?

Thanks
fbl

Comment 17 Flavio Leitner 2020-07-08 18:31:53 UTC
Correcting comment #16:
Perhaps something failed to configure the client and/or bridge?
Because the versions used are not the same.
fbl

Comment 19 Alexander Constantinescu 2020-07-09 13:35:28 UTC
Hi Flavio

Thanks for looking into this. I have the following, which might be interesting for you and could help pinpoint the problem with the flow restore:

Essentially, what happens for us in OpenShift is:

1. we have OVS running 
2. OVS receives a SIGTERM
3. Prior to exiting it creates a dump of the existing flows on the node, and generates this script:

$ cat /var/run/openvswitch/flows.sh 
ovs-ofctl -v add-tlv-map br0 ''
ovs-ofctl -v -O OpenFlow13 add-groups br0               "/var/run/openvswitch/ovs-save.mqGw09z9In/br0.groups.dump" 
ovs-ofctl -v -O OpenFlow13 replace-flows br0               "/var/run/openvswitch/ovs-save.mqGw09z9In/br0.flows.dump" 

"/var/run/openvswitch/ovs-save.mqGw09z9In/br0.flows.dump" = contains the dump of flows

4. OVS then restarts
5. When OVS restarts, it executes these lines of code:

echo Restoring these flows:
cat /var/run/openvswitch/ovs-save.*/*
sh -x /var/run/openvswitch/flows.sh
echo "$(date -u "+%Y-%m-%d %H:%M:%S") info: Done restoring the existing flows ..." 2>&1
echo You now have these flows:
ovs-ofctl -O OpenFlow13 dump-flows br0 

The attached output is from one run of these commands.

Look out for the IP "10.130.2.102": it is present right after "Restoring these flows:", meaning it exists in "/var/run/openvswitch/ovs-save.mqGw09z9In/br0.flows.dump". However, it is no longer there after "You now have these flows:".

I am unable to tell why, even with the verbose replace-flows output.

Thanks in advance!
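
One mechanical way to pin down which flows went missing is to normalize both the saved dump and the live dump-flows output, then diff them. A sketch, assuming the field layout shown in the attachment (the normalization and file names are illustrative):

```shell
# Sketch: strip the volatile fields (cookie, duration, packet counters)
# that dump-flows prepends, so the live table can be diffed line-by-line
# against the saved dump.
normalize_flows() {
    sed -e 's/cookie=[^,]*, *//' \
        -e 's/duration=[^,]*, *//' \
        -e 's/n_packets=[^,]*, *//' \
        -e 's/n_bytes=[^,]*, *//' \
        -e 's/^ *//' | sort
}

# Hypothetical usage on a node:
#   normalize_flows < /var/run/openvswitch/ovs-save.*/br0.flows.dump > saved.txt
#   ovs-ofctl -O OpenFlow13 dump-flows br0 | normalize_flows > live.txt
#   diff saved.txt live.txt    # '<' lines were in the dump but not restored
normalize_flows <<'EOF'
 cookie=0x0, duration=0.931s, table=0, n_packets=0, n_bytes=0, priority=400,ip,in_port=2,nw_src=10.130.2.1 actions=goto_table:30
EOF
```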

Comment 20 Alexander Constantinescu 2020-07-09 13:36:52 UTC
Created attachment 1700453 [details]
flow issues output

Comment 21 Flavio Leitner 2020-07-09 14:37:20 UTC
Ok, that's interesting because the debug output shows a few flows being added.
I marked one as an example:

[...]
  8463 2020-07-08T15:14:31Z|00050|vconn|DBG|unix:/var/run/openvswitch/br0.mgmt: sent (Success): OFPST_FLOW request (OF1.3) (xid=0x6):
  8464 2020-07-08T15:14:31Z|00051|vconn|DBG|unix:/var/run/openvswitch/br0.mgmt: received: OFPST_FLOW reply (OF1.3) (xid=0x6):
  8465 2020-07-08T15:14:31Z|00052|vconn|DBG|unix:/var/run/openvswitch/br0.mgmt: sent (Success): OFPT_FLOW_MOD (OF1.3) (xid=0x7): ADD priority=400,ip,in_port=2,nw_src=10.130.2.1 actions=goto_table:30
  8466 2020-07-08T15:14:31Z|00053|vconn|DBG|unix:/var/run/openvswitch/br0.mgmt: sent (Success): OFPT_BARRIER_REQUEST (OF1.3) (xid=0x6e):
  8467 2020-07-08T15:14:31Z|00054|vconn|DBG|unix:/var/run/openvswitch/br0.mgmt: received: OFPT_BARRIER_REPLY (OF1.3) (xid=0x6e):
> 8468 2020-07-08T15:14:31Z|00055|vconn|DBG|unix:/var/run/openvswitch/br0.mgmt: sent (Success): OFPT_FLOW_MOD (OF1.3) (xid=0x8): ADD priority=300,ct_state=-trk,ip actions=ct(table=0)
  8469 2020-07-08T15:14:31Z|00056|vconn|DBG|unix:/var/run/openvswitch/br0.mgmt: sent (Success): OFPT_BARRIER_REQUEST (OF1.3) (xid=0x6f):
  8470 2020-07-08T15:14:31Z|00057|vconn|DBG|unix:/var/run/openvswitch/br0.mgmt: received: OFPT_BARRIER_REPLY (OF1.3) (xid=0x6f):

That flow is actually added:

  8791 2020-07-08 15:14:32 info: Done restoring the existing flows ...
  8792 You now have these flows:
  8793 OFPST_FLOW reply (OF1.3) (xid=0x2):
  8794  cookie=0x0, duration=0.931s, table=0, n_packets=0, n_bytes=0, priority=400,ip,in_port=2,nw_src=10.130.2.1 actions=goto_table:30
> 8795  cookie=0x0, duration=0.931s, table=0, n_packets=0, n_bytes=0, priority=300,ct_state=-trk,ip actions=ct(table=0)

Well, replace-flows will dump the tables, find the differences, and push only what differs, so either there was already something in the flow table, which doesn't make sense after a restart, or the restore itself is misbehaving.

Can you change the script to use --readd? That will force it to re-add all flows regardless, and will help to determine whether the tool is getting lost processing the differences or something else is going on.

Thanks
fbl
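
For reference, Flavio's suggestion amounts to one extra flag in the generated restore script shown in comment 19. A sketch of the modified flows.sh (bridge name and paths taken from comment 19; this is illustrative, not a tested fix):

```shell
# flows.sh, modified per comment 21: replace-flows normally diffs the live
# table against the dump and pushes only the differences; --readd forces
# every flow in the dump to be re-added unconditionally.
ovs-ofctl -v add-tlv-map br0 ''
ovs-ofctl -v -O OpenFlow13 add-groups br0 "/var/run/openvswitch/ovs-save.mqGw09z9In/br0.groups.dump"
ovs-ofctl -v -O OpenFlow13 --readd replace-flows br0 "/var/run/openvswitch/ovs-save.mqGw09z9In/br0.flows.dump"
```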

Comment 22 Aniket Bhat 2020-07-09 16:17:28 UTC
@flavio, what we may be seeing is likely a symptom and not a true problem with save/restore.

Latest bug update from slack:

> OK, so there are a couple of issues with openshift-sdn in 4.3/4.4/4.5. Here's what I and @Aniket Bhat have found:
> Our livenessProbes are not correctly tuned, i.e. on heavily loaded clusters like CI prod the livenessProbes were timing out, leading to unnecessary OVS restarts. The fix will be provided by https://github.com/openshift/cluster-network-operator/pull/701 and subsequent back-ports.
> We had race conditions when restarting OVS and SDN, leading to cases where we delete the OVS database and are thus unable to properly sync state when we start. The fix will be provided by https://github.com/openshift/sdn/pull/158 and subsequent back-ports.
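
The probe-tuning fix boils down to giving the exec probe a timeout that matches how long OVS commands can take on a heavily loaded node. A hypothetical sketch of what such a probe could look like (the command and all values are illustrative, not the actual ones from PR 701):

```yaml
# Illustrative only - not the real openshift-sdn/OVS daemonset probe.
livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - ovs-appctl -t ovs-vswitchd version   # a cheap round-trip to the daemon
  initialDelaySeconds: 15
  periodSeconds: 10
  timeoutSeconds: 15    # must exceed worst-case OVS command latency under load
  failureThreshold: 3
```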

Comment 23 Flavio Leitner 2020-07-09 16:22:25 UTC
(In reply to Aniket Bhat from comment #22)
> @flavio, what we may be seeing is likely a symptom and not a true problem
> with save/restore.
> 
> Latest bug update from slack:
> 
> > OK, so there are a couple of issues with openshift-sdn in 4.3/4.4/4.5. Here's what I and @Aniket Bhat have found:
> > Our livenessProbes are not correctly tuned, i.e: on heavily loaded clusters like CI prod the livenessProbes were timing out leading to unnecessary OVS restarts. Fix, will be provided by:  > https://github.com/openshift/cluster-network-operator/pull/701 and subsequent back-ports
> > We had race conditions when restarting OVS and SDN leading to cases where we delete the OVS database and thus are unable to properly sync state when we start, fix will be provided by: >https://github.com/openshift/sdn/pull/158 and subsequent back-ports.

That makes sense, and using ovsdb-tool, as recommended in comment#16, can confirm the changes made to the DB.

Comment 24 Alexander Constantinescu 2020-07-09 16:33:09 UTC
I am running a new attempt with your recommendation from comment#16 for verification.

Will update the BZ with new data should I hit the issue again.

Comment 25 Alexander Constantinescu 2020-07-09 17:03:07 UTC
Hi Flavio

I just hit a different form of the restore issue: this time not all flows were restored for IP "10.131.1.46".

Please find attached all the output, hopefully enough to understand the issue.

grep -i -e 10.131.1.46 -e "You now" ovs-hdwhq-10.131.1.46 

Flows that should be restored:

 table=20, priority=100,arp,in_port=3,arp_spa=10.131.1.46,arp_sha=00:00:0a:83:01:2e/00:00:ff:ff:ff:ff actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
 table=20, priority=100,ip,in_port=3,nw_src=10.131.1.46 actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
 table=25, priority=100,ip,nw_src=10.131.1.46 actions=load:0x5a2682->NXM_NX_REG0[],goto_table:30
 table=40, priority=100,arp,arp_tpa=10.131.1.46 actions=output:3
 table=70, priority=100,ip,nw_dst=10.131.1.46 actions=load:0x5a2682->NXM_NX_REG1[],load:0x3->NXM_NX_REG2[],goto_table:80
 table=20, priority=100,arp,in_port=3,arp_spa=10.131.1.46,arp_sha=00:00:0a:83:01:2e/00:00:ff:ff:ff:ff actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
 table=20, priority=100,ip,in_port=3,nw_src=10.131.1.46 actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
 table=25, priority=100,ip,nw_src=10.131.1.46 actions=load:0x5a2682->NXM_NX_REG0[],goto_table:30
 table=40, priority=100,arp,arp_tpa=10.131.1.46 actions=output:3
 table=70, priority=100,ip,nw_dst=10.131.1.46 actions=load:0x5a2682->NXM_NX_REG1[],load:0x3->NXM_NX_REG2[],goto_table:80
 table=20, priority=100,arp,in_port=3,arp_spa=10.131.1.46,arp_sha=00:00:0a:83:01:2e/00:00:ff:ff:ff:ff actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
 table=20, priority=100,ip,in_port=3,nw_src=10.131.1.46 actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
 table=25, priority=100,ip,nw_src=10.131.1.46 actions=load:0x5a2682->NXM_NX_REG0[],goto_table:30
 table=40, priority=100,arp,arp_tpa=10.131.1.46 actions=output:3
 table=70, priority=100,ip,nw_dst=10.131.1.46 actions=load:0x5a2682->NXM_NX_REG1[],load:0x3->NXM_NX_REG2[],goto_table:80

You now have these flows:

 cookie=0x0, duration=0.147s, table=20, n_packets=0, n_bytes=0, priority=100,arp,in_port=3,arp_spa=10.131.1.46,arp_sha=00:00:0a:83:01:2e/00:00:ff:ff:ff:ff actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
 cookie=0x0, duration=0.137s, table=20, n_packets=0, n_bytes=0, priority=100,ip,in_port=3,nw_src=10.131.1.46 actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
 cookie=0x0, duration=0.130s, table=25, n_packets=0, n_bytes=0, priority=100,ip,nw_src=10.131.1.46 actions=load:0x5a2682->NXM_NX_REG0[],goto_table:30
 cookie=0x0, duration=0.121s, table=40, n_packets=0, n_bytes=0, priority=100,arp,arp_tpa=10.131.1.46 actions=output:3
 cookie=0x0, duration=0.113s, table=70, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.131.1.46 actions=load:0x5a2682->NXM_NX_REG1[],load:0x3->NXM_NX_REG2[],goto_table:80

Comment 26 Alexander Constantinescu 2020-07-09 17:04:04 UTC
Created attachment 1700468 [details]
second flow restore with db dump

Comment 27 Alexander Constantinescu 2020-07-09 17:13:22 UTC
Actually, my bad. I realized the problem is rather that the dump of flows is duplicated three times.
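
The triplication is easy to confirm mechanically. A sketch (on a node the input would be the saved dump under /var/run/openvswitch/ovs-save.*/; the here-doc stands in for the real file):

```shell
# Sketch: surface flows that appear more than once in a saved dump.
count_duplicate_flows() {
    sort | uniq -cd    # -c prefixes each line with its count, -d keeps only duplicated lines
}

count_duplicate_flows <<'EOF'
 table=40, priority=100,arp,arp_tpa=10.131.1.46 actions=output:3
 table=40, priority=100,arp,arp_tpa=10.131.1.46 actions=output:3
 table=40, priority=100,arp,arp_tpa=10.131.1.46 actions=output:3
 table=25, priority=100,ip,nw_src=10.131.1.46 actions=load:0x5a2682->NXM_NX_REG0[],goto_table:30
EOF
```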

Comment 28 Flavio Leitner 2020-07-09 17:48:33 UTC
(In reply to Alexander Constantinescu from comment #27)
> Actually, my bad. I realized the problem is more that the dump of flows is
> duplicated three times.

No worries, so nothing to see on comment#25 and comment#26?

Comment 29 Michael Gugino 2020-07-09 18:56:22 UTC
This looks related to this: https://bugzilla.redhat.com/show_bug.cgi?id=1843998

One of these is probably a dupe.

Comment 30 Alexander Constantinescu 2020-07-10 09:07:50 UTC
In response to comment#28: I am not sure. It is strange that we have duplicate flows in the generated dump, but reproducing the issue is very difficult, and collecting the data even more so. So I think we can put the restore issue on hold, for the following reasons:

- The original problem should be fixed by the resolutions mentioned in comment#22.
- Those resolutions might also fix the restore issues I have mentioned here.
- We only really care about flow restore on upgrade, which should not happen while clusters are heavily loaded in the middle of their operations.

Comment 31 Aniket Bhat 2020-07-15 13:19:07 UTC
*** Bug 1848374 has been marked as a duplicate of this bug. ***

Comment 32 Alexander Constantinescu 2020-07-16 13:21:37 UTC
Moving to ON_QE as the 4.6 PR has merged. 

Note to QE: this fix cannot be verified on 4.6, as the fix does not apply to that version. 4.6 does not have a livenessProbe; it was removed with "system OVS". The fix can only start being verified on 4.5, but I need this bug to go in so that the process is respected and my 4.5 PR can merge.

Please move this to VERIFIED

Comment 33 W. Trevor King 2020-07-17 15:03:56 UTC
Moving back to MODIFIED, based on the PR landing. ART's Elliott sweeper will move it to ON_QA when it builds a nightly and attaches it to an errata.

Comment 39 Juan Luis de Sousa-Valadas 2020-10-13 09:23:36 UTC
*** Bug 1831283 has been marked as a duplicate of this bug. ***
