Description of problem: One of the router pods is NOT ready.

oc --context build02 get pod -n openshift-ingress
NAME                              READY   STATUS    RESTARTS   AGE
router-default-574c789c6b-5zvlw   1/1     Running   0          4d22h
router-default-574c789c6b-sdjhs   0/1     Running   1          4d22h

In the pod logs:

Trace[1278989275]: [30.000584618s] [30.000584618s] END
E0630 21:30:21.755674 1 reflector.go:123] github.com/openshift/router/pkg/router/template/service_lookup.go:32: Failed to list *v1.Service: Get https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout

oc --context build02 get clusterversions.config.openshift.io
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.10    True        False         4d22h   Error while reconciling 4.4.10: an unknown error has occurred

The cluster build02 is a production cluster for CI.
https://coreos.slack.com/archives/CEKNRGF25/p1593549189041400?thread_ts=1593543661.033500&cid=CEKNRGF25

Miciah Dashiel Butler Masters (1 hour ago): I found this in namespaces/openshift-sdn/pods/sdn-prp75/sdn/sdn/logs/previous.log:

2020-06-30T18:09:16.127346496Z I0630 18:09:16.127059 3076489 node.go:487] Reattaching pod 'openshift-ingress/router-default-574c789c6b-sdjhs' to SDN
2020-06-30T18:09:16.146756418Z I0630 18:09:16.145634 3076489 ovs.go:169] Error executing ovs-vsctl: ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (Connection refused)
2020-06-30T18:09:16.186187339Z I0630 18:09:16.161740 3076489 ovs.go:169] Error executing ovs-vsctl: ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (Connection refused)
2020-06-30T18:09:16.333740014Z W0630 18:09:16.333253 3076489 pod.go:274] CNI_ADD openshift-ingress/router-default-574c789c6b-sdjhs failed: exit status 1
2020-06-30T18:09:16.333740014Z W0630 18:09:16.333317 3076489 node.go:493] Could not reattach pod 'openshift-ingress/router-default-574c789c6b-sdjhs' to SDN: exit status 1
Random bits from the must-gather for context, and to give folks some higher-level symptoms to help with Bugzilla searches:

$ yaml2json <cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml | jq -r '.status.history[] | .startedTime + " " + .completionTime + " " + .version + " " + .state + " " + (.verified | tostring)'
2020-06-25T19:46:53Z 2020-06-25T20:40:09Z 4.4.10 Completed true
2020-05-21T19:30:15Z 2020-05-21T19:53:42Z 4.4.4 Completed false

$ yaml2json <cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
2020-05-21T19:53:42Z Available=True -: Done applying 4.4.10
2020-06-30T19:21:54Z Failing=False -: -
2020-06-25T20:40:09Z Progressing=False -: Cluster version is 4.4.10
2020-05-21T19:30:15Z RetrievedUpdates=True -: -

Hmm, I guess the gather was post-resolution? Getting an Insights tarball for this cluster shows the ingress condition:

$ tar -xOz config/version < "$(ls | tail -n1)" | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + " " + .status + " " + (.reason // "-") + " " + (.message // "-")'
2020-05-21T19:53:42Z Available True - Done applying 4.4.10
2020-06-30T20:38:31Z Failing True MultipleErrors Multiple errors are preventing progress:
* Cluster operator ingress is reporting a failure: Some ingresscontrollers are degraded: default
* Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Grafana failed: reconciling Grafana Deployment failed: updating deployment object failed: waiting for DeploymentRollout of grafana: deployment grafana is not ready.
status: (replicas: 1, updated: 1, ready: 0, unavailable: 1)
2020-06-25T20:40:09Z Progressing False MultipleErrors Error while reconciling 4.4.10: an unknown error has occurred
2020-05-21T19:30:15Z RetrievedUpdates True - -

$ tar -xOz config/clusteroperator/ingress < "$(ls | tail -n1)" | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + " " + .status + " " + (.reason // "-") + " " + (.message // "-")'
2020-06-30T18:09:43Z Degraded True IngressControllersDegraded Some ingresscontrollers are degraded: default
2020-06-30T18:09:13Z Progressing True Reconciling Not all ingress controllers are available.
2020-06-30T18:09:13Z Available False IngressUnavailable Not all ingress controllers are available.

Those ingress-operator conditions are still in the must-gather:

$ yaml2json <cluster-scoped-resources/config.openshift.io/clusteroperators/ingress.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
2020-06-30T18:09:43Z Degraded=True IngressControllersDegraded: Some ingresscontrollers are degraded: default
2020-06-30T18:09:13Z Progressing=True Reconciling: Not all ingress controllers are available.
2020-06-30T18:09:13Z Available=False IngressUnavailable: Not all ingress controllers are available.

$ yaml2json <namespaces/openshift-ingress-operator/operator.openshift.io/ingresscontrollers/default.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + " " + .status + " " + (.reason // "-") + " " + (.message // "-")'
2020-05-21T19:46:15Z Admitted True Valid -
2020-06-30T18:09:13Z Available False MinimumReplicasUnavailable The deployment is unavailable: Deployment does not have minimum availability.
2020-06-30T18:09:13Z DeploymentDegraded True DeploymentUnavailable The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.
2020-05-21T19:46:19Z LoadBalancerManaged True WantedByEndpointPublishingStrategy The endpoint publishing strategy supports a managed load balancer
2020-05-21T19:47:03Z LoadBalancerReady True LoadBalancerProvisioned The LoadBalancer service is provisioned
2020-05-21T19:46:19Z DNSManaged True Normal DNS management is supported and zones are specified in the cluster DNS config.
2020-05-21T19:47:04Z DNSReady True NoFailedZones The record is provisioned in all reported zones.
2020-06-30T18:09:43Z Degraded True DegradedConditions One or more other status conditions indicate a degraded state: DeploymentDegraded=True

$ yaml2json <namespaces/openshift-ingress/apps/deployments.yaml | jq -r '.items[] | select(.metadata.name == "router-default").status'
{
  "conditions": [
    {
      "status": "True",
      "lastUpdateTime": "2020-06-25T19:57:25Z",
      "reason": "NewReplicaSetAvailable",
      "lastTransitionTime": "2020-05-21T19:46:15Z",
      "type": "Progressing",
      "message": "ReplicaSet \"router-default-574c789c6b\" has successfully progressed."
    },
    {
      "status": "False",
      "lastUpdateTime": "2020-06-30T18:09:13Z",
      "reason": "MinimumReplicasUnavailable",
      "lastTransitionTime": "2020-06-30T18:09:13Z",
      "type": "Available",
      "message": "Deployment does not have minimum availability."
    }
  ],
  "availableReplicas": "1",
  "updatedReplicas": "2",
  "readyReplicas": "1",
  "replicas": "2",
  "unavailableReplicas": "1",
  "observedGeneration": "4"
}

$ yaml2json <namespaces/openshift-ingress/apps/deployments.yaml | jq -r '.items[] | select(.metadata.name == "router-default").spec.strategy'
{
  "rollingUpdate": {
    "maxUnavailable": "0",
    "maxSurge": "1"
  },
  "type": "RollingUpdate"
}

Hmm. maxUnavailable 0 is pretty aggressive. Is ingress really broken with only a single ready router pod?
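For reference, that rollout strategy corresponds to a Deployment fragment like the one below (a sketch reconstructed from the must-gather values; only the strategy fields and replica count come from the gather, the comments are my reading of them):

```yaml
# Sketch of the router-default rollout behavior seen in the must-gather.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: router-default
  namespace: openshift-ingress
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      # No old pod may be taken down before its replacement is Ready,
      # and at most one extra pod may be created during the rollout.
      # With 2 replicas, a new pod that never becomes Ready stalls the
      # rollout indefinitely while one old pod keeps serving traffic.
      maxUnavailable: 0
      maxSurge: 1
```

That matches the observed state: replicas 2, updated 2, ready 1, unavailable 1, with the rollout stuck rather than degrading further.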
Back to the broken router pod:

$ yaml2json <namespaces/openshift-ingress/pods/router-default-574c789c6b-sdjhs/router-default-574c789c6b-sdjhs.yaml | jq -r '.status.containerStatuses[].lastState.terminated'
{
  "message": "08096s] [30.000408096s] END\nE0630 19:10:20.942115 1 reflector.go:123] github.com/openshift/router/pkg/router/controller/factory/factory.go:115: Failed to list *v1.Route: Get https://172.30.0.1:443/apis/route.openshift.io/v1/routes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout\nI0630 19:10:51.941091 1 trace.go:116] Trace[1208412245]: \"Reflector ListAndWatch\" name:github.com/openshift/router/pkg/router/controller/factory/factory.go:115 (started: 2020-06-30 19:10:21.940404787 +0000 UTC m=+3581.314237596) (total time: 30.000641626s):\nTrace[1208412245]: [30.000641626s] [30.000641626s] END\nE0630 19:10:51.941133 1 reflector.go:123] github.com/openshift/router/pkg/router/controller/factory/factory.go:115: Failed to list *v1.Endpoints: Get https://172.30.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout\nI0630 19:10:51.942009 1 trace.go:116] Trace[1959082013]: \"Reflector ListAndWatch\" name:github.com/openshift/router/pkg/router/template/service_lookup.go:32 (started: 2020-06-30 19:10:21.941540332 +0000 UTC m=+3581.315373190) (total time: 30.000436572s):\nTrace[1959082013]: [30.000436572s] [30.000436572s] END\nE0630 19:10:51.942034 1 reflector.go:123] github.com/openshift/router/pkg/router/template/service_lookup.go:32: Failed to list *v1.Service: Get https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout\nI0630 19:10:51.943090 1 trace.go:116] Trace[1299276755]: \"Reflector ListAndWatch\" name:github.com/openshift/router/pkg/router/controller/factory/factory.go:115 (started: 2020-06-30 19:10:21.942654557 +0000 UTC m=+3581.316487446) (total time: 30.000412715s):\nTrace[1299276755]: [30.000412715s] [30.000412715s] END\nE0630 19:10:51.943109 1 reflector.go:123] github.com/openshift/router/pkg/router/controller/factory/factory.go:115: Failed to list *v1.Route: Get https://172.30.0.1:443/apis/route.openshift.io/v1/routes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: i/o timeout\n",
  "containerID": "cri-o://ee50c0e1a7e7ee76623d0719dcf05b970635e0342bcdfbafc1b0d96032449481",
  "exitCode": "137",
  "reason": "Error",
  "finishedAt": "2020-06-30T19:11:14Z",
  "startedAt": "2020-06-30T18:10:40Z"
}

So that's what the router is crashlooping on, and kicking the SDN pod seems to have unstuck the update.
This is a release blocker for 4.5; it happened in the CI cluster after upgrading to rc.6.
This issue is very hard to reproduce in QE's clusters during upgrade. @Hongkai, could you share which platform and which installation method you used? It may not be related to this issue, but we need to give it a try. Thanks.
@zhaozhanqi We got the 4.4.4 installer from the release page: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/ The upgrade path is 4.4.4 to 4.4.10, and then to 4.5.0-rc.6. The cluster is hosted on GCP. However, the problem showed up only under some workload. The only reproduction I know of is to run CI tests with the original repo and the installer repo. It would be hard for you to get the whole of Prow running in your testing environment; I would suggest running many OpenShift builds to see if you can hit the bug.
Hi

I've had a look at the must-gather in #comment 1 and have seen that the networking pods did not manage to bring up router-default-574c789c6b-sdjhs, as mentioned in #comment 2. Looking closer at those networking pods to find out why, I saw this:

From: namespaces/openshift-sdn/pods/sdn-prp75/sdn/sdn/logs/previous.log

2020-06-25T20:12:58.212104807Z I0625 20:12:58.212006 2453586 proxier.go:368] userspace proxy: processing 0 service events
2020-06-25T20:12:58.212104807Z I0625 20:12:58.212032 2453586 proxier.go:347] userspace syncProxyRules took 48.100771ms
2020-06-25T20:12:58.428219987Z I0625 20:12:58.428100 2453586 proxier.go:368] userspace proxy: processing 0 service events
2020-06-25T20:12:58.428219987Z I0625 20:12:58.428136 2453586 proxier.go:347] userspace syncProxyRules took 43.537689ms
2020-06-25T20:12:58.636985192Z I0625 20:12:58.636904 2453586 proxier.go:368] userspace proxy: processing 0 service events
2020-06-25T20:12:58.636985192Z I0625 20:12:58.636935 2453586 proxier.go:347] userspace syncProxyRules took 45.291026ms
2020-06-25T20:13:21.375947577Z I0625 20:13:21.375893 2453586 pod.go:539] CNI_DEL openshift-must-gather-t2l25/must-gather-hnf22
2020-06-25T20:13:26.258670957Z interrupt: Gracefully shutting down ...
2020-06-30T18:08:48.080756489Z I0630 18:08:48.080221 3076489 cmd.go:123] Reading proxy configuration from /config/kube-proxy-config.yaml
2020-06-30T18:08:48.105782541Z I0630 18:08:48.093063 3076489 feature_gate.go:243] feature gates: &{map[]}
2020-06-30T18:08:48.105782541Z I0630 18:08:48.093133 3076489 cmd.go:227] Watching config file /config/kube-proxy-config.yaml for changes
2020-06-30T18:08:48.105782541Z I0630 18:08:48.093183 3076489 cmd.go:227] Watching config file /config/..2020_06_25_19_59_50.375303914/kube-proxy-config.yaml for changes
2020-06-30T18:08:48.625784451Z I0630 18:08:48.622857 3076489 node.go:147] Initializing SDN node of type "redhat/openshift-ovs-networkpolicy" with configured hostname "build0-gstfj-w-b-9x7xl.c.openshift-ci-build-farm.internal" (IP "10.0.32.5")
2020-06-30T18:08:48.852416356Z I0630 18:08:48.852108 3076489 cmd.go:160] Starting node networking (v0.0.0-alpha.0-143-g26523681)
2020-06-30T18:08:48.852416356Z I0630 18:08:48.852152 3076489 node.go:385] Starting openshift-sdn network plugin
2020-06-30T18:08:51.515353736Z I0630 18:08:51.515003 3076489 sdn_controller.go:139] [SDN setup] full SDN setup required (plugin is not setup)
2020-06-30T18:08:58.307766761Z I0630 18:08:58.307242 3076489 node.go:434] Starting openshift-sdn pod manager
2020-06-30T18:08:58.311807815Z I0630 18:08:58.308377 3076489 node.go:487] Reattaching pod 'ocp/base-4-5-rhel8-7c99d64

It seems something killed the SDN pod on the 25th of June and it didn't restart until the 30th. For the OVS pods (running on the same node) I have even less to go on:

namespaces/openshift-sdn $ less pods/ovs-2t98n/openvswitch/openvswitch/logs/previous.log
2020-06-30T18:09:30.04182099Z Starting ovsdb-server.
2020-06-30T18:09:30.805895582Z PMD: net_mlx4: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
2020-06-30T18:09:30.805895582Z PMD: net_mlx4: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx4)
2020-06-30T18:09:30.811775694Z net_mlx5: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
2020-06-30T18:09:30.811775694Z net_mlx5: cannot

I.e. there are no logs from before the 30th of June. That also explains the following logs mentioned in #comment 2:

2020-06-30T18:09:16.146756418Z I0630 18:09:16.145634 3076489 ovs.go:169] Error executing ovs-vsctl: ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (Connection refused)

It's because the OVS pod was not even up and running.

I am in discussions with Hongkai, trying to get the kubelet logs from the node in question: build0-gstfj-w-b-9x7xl.c.openshift-ci-build-farm.internal. Once we have the kubelet logs we might understand why the networking pods were not running.

-Alexander
Thanks @Hongkai Liu and @Alexander for helping with this. I will also try on a GCP cluster to see if we can hit this issue.
I triggered the upgrade from 4.4.10 to 4.5.0-rc.6 on GCP 7 times today, but had no luck reproducing this issue.
Hi

An update on this issue: I have not been able to get to the bottom of it. With the help of Hongkai we are able to reproduce the issue using this CI job: https://github.com/openshift/release/pull/10001

What I have observed is that new nodes are added to the cluster once that job is executed, which in turn deploys new networking pods on those nodes. At the same time, however, I am also seeing the OVS liveness probes fail on some already-existing nodes, and I believe the root cause is related to this: the OVS liveness probes start failing and trigger a restart of OVS on these nodes. The probe failure might have to do with the following, which is logged right after the OVS pod receives the SIGTERM:

2020-07-07T09:24:36.163Z|00148|connmgr|INFO|br0<->unix#55: 5 flow_mods in the last 0 s (5 adds)
2020-07-07T09:24:37Z|00001|vconn|WARN|unix:/var/run/openvswitch/br0.mgmt: version negotiation failed (we support version 0x01, peer supports version 0x04)
ovs-ofctl: br0: failed to connect to socket (Broken pipe)
2020-07-07T09:24:37.184Z|00149|bridge|INFO|bridge br0: added interface vethc7afb3e4 on port 17
2020-07-07T09:24:37.408Z|00150|vconn|WARN|unix#57: version negotiation failed (we support version 0x04, peer supports version 0x01)
2020-07-07T09:24:37.408Z|00151|rconn|WARN|br0<->unix#57: connection dropped (Protocol error)
2020-07-07T09:24:38.019Z|00152|connmgr|INFO|br0<->unix#60: 5 flow_mods in the last 0 s (5 adds)
2020-07-07 09:24:38 info: Saved flows

When OVS restarts it tries to restore the flows it had when exiting; this can be seen in the logs of the newly created OVS pod:

2020-07-07 09:25:08 info: Created br0, now adding flows ...
+ ovs-ofctl add-tlv-map br0 ''
2020-07-07T09:25:09Z|00001|vconn|WARN|unix:/var/run/openvswitch/br0.mgmt: version negotiation failed (we support version 0x01, peer supports version 0x04)
ovs-ofctl: br0: failed to connect to socket (Protocol error)
+ ovs-ofctl -O OpenFlow13 add-groups br0 /var/run/openvswitch/ovs-save.1FbOMfuiWG/br0.groups.dump
+ ovs-ofctl -O OpenFlow13 replace-flows br0 /var/run/openvswitch/ovs-save.1FbOMfuiWG/br0.flows.dump
+ rm -rf /var/run/openvswitch/ovs-save.1FbOMfuiWG
2020-07-07 09:25:10 info: Done restoring the existing flows ...

It is thus unable to restore the flows. After talking to people with more OVS experience, the "we support version 0x01, peer supports version 0x04" message seems to indicate that we are not using the correct protocol (OpenFlow13) when communicating with OVS. This seems very strange to me, and I am not fully sure I understand it. Anyhow, manual testing of this script on a working OVS pod does not seem to indicate that it is critical: the commands can be executed despite these warnings being logged. OVS restarting in turn restarts SDN.
Upon restart, SDN does not retrieve the networking information for the pods already running on the node; the logs are filled with:

I0707 09:25:45.532835 2276853 sdn_controller.go:139] [SDN setup] full SDN setup required (Link not found)
W0707 09:25:53.680616 2276853 pod.go:181] No network information for pod ci/d16f07b6-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680661 2276853 pod.go:181] No network information for pod ci/ce8f7fc9-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680670 2276853 pod.go:181] No network information for pod ci/ce6f94aa-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680679 2276853 pod.go:181] No network information for pod openshift-dns/dns-default-q6shx
W0707 09:25:53.680687 2276853 pod.go:181] No network information for pod ci-op-p5hncqt3/src-build
W0707 09:25:53.680695 2276853 pod.go:181] No network information for pod ci-op-f1c8w395/openstack-installer-build
W0707 09:25:53.680704 2276853 pod.go:181] No network information for pod ci-op-msqpw03z/libvirt-installer-build
W0707 09:25:53.680711 2276853 pod.go:181] No network information for pod ci-op-834xvgl1/src-build
W0707 09:25:53.680718 2276853 pod.go:181] No network information for pod ci/cec88a62-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680725 2276853 pod.go:181] No network information for pod ci/namespace-ttl-controller-784-znmmr
W0707 09:25:53.680733 2276853 pod.go:181] No network information for pod ci-op-24thkvqb/baremetal-installer-build
W0707 09:25:53.680741 2276853 pod.go:181] No network information for pod ci/d1124342-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680748 2276853 pod.go:181] No network information for pod ci/cfc278fe-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680756 2276853 pod.go:181] No network information for pod openshift-image-registry/node-ca-g68ld
W0707 09:25:53.680763 2276853 pod.go:181] No network information for pod ci/d1aab262-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680770 2276853 pod.go:181] No network information for pod ci/ce850b4a-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680776 2276853 pod.go:181] No network information for pod ci/secret-mirroring-controller-692-cvhrg
W0707 09:25:53.680783 2276853 pod.go:181] No network information for pod ci/cec01950-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680791 2276853 pod.go:181] No network information for pod ci/d079db6c-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680797 2276853 pod.go:181] No network information for pod ci/cf67b9bc-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680805 2276853 pod.go:181] No network information for pod ci/ce807aae-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680813 2276853 pod.go:181] No network information for pod ci-op-p9hct7ip/src-build
W0707 09:25:53.680821 2276853 pod.go:181] No network information for pod openshift-monitoring/prometheus-adapter-b9b49579d-9wjms
W0707 09:25:53.680827 2276853 pod.go:181] No network information for pod ci/cf103584-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680834 2276853 pod.go:181] No network information for pod ci/ce9ec660-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680841 2276853 pod.go:181] No network information for pod openshift-monitoring/grafana-f457c8645-wr7bg
W0707 09:25:53.680848 2276853 pod.go:181] No network information for pod ci/d05b1829-c031-11ea-9b97-0a58ac1051c3
W0707 09:25:53.680855 2276853 pod.go:181] No network information for pod ci-op-0zw7qv58/upi-installer-build

This is because there's nothing in OVS:

[root@build0-gstfj-w-c-8l7mw /]# ovs-vsctl --columns=name,external_ids,ofport find interface
name                : "vxlan0"
external_ids        : {}
ofport              : 1

name                : "br0"
external_ids        : {}
ofport              : 65534

name                : "tun0"
external_ids        : {}
ofport              : 2

As all those pods were already running on the node, we get no CNI_ADD and subsequently do not do anything to rectify the situation. The root cause of everything mentioned in this comment is the liveness probe, which starts failing for no apparent reason on a node that was running fine until the workload was triggered.
FYI: the DNS operator is hung because it tries communicating with the api-server on 172.30.0.1, which is obviously unsuccessful as that pod's networking is non-existent.

I suspect that whatever causes the liveness probe to fail (i.e. openvswitch) also means it can't dump the flows prior to exiting when it gets its SIGTERM, which means that it restores files which are empty or similar. I need to investigate what causes the liveness probe to fail, which means temporarily stopping the CNO on the CI cluster.
Given this was present in 4.4, this does not meet the requirements for a 4.5 blocker. Moving to 4.6, please clone back to 4.5.z and 4.4.z as appropriate.
OK, there's definitely a problem with the flow restore for OVS.

I've modified the OVS daemonset to not delete the files containing the flows it restores upon booting the OVS pod, and the outcome is strange: on some nodes, the flow restore does not happen (though the logs do not indicate any issues).

This is the file used to restore the flows (taken from a node which experienced the issue):

cat /var/run/openvswitch/flows.sh
ovs-ofctl add-tlv-map br0 ''
ovs-ofctl -O OpenFlow13 add-groups br0 "/var/run/openvswitch/ovs-save.ATtBUXQktE/br0.groups.dump"
ovs-ofctl -O OpenFlow13 replace-flows br0 "/var/run/openvswitch/ovs-save.ATtBUXQktE/br0.flows.dump"

It will use the temporary directory ovs-save.ATtBUXQktE to restore everything. I grepped for the DNS pod IP in /var/run/openvswitch/ovs-save.ATtBUXQktE/br0.flows.dump:

[root@build0-gstfj-w-d-ckk6s /]# grep -ri 10.128.70.2 /var/run/openvswitch/ovs-save.ATtBUXQktE/br0.flows.dump
/var/run/openvswitch/ovs-save.ATtBUXQktE/br0.flows.dump: table=20, priority=100,arp,in_port=3,arp_spa=10.128.70.2,arp_sha=00:00:0a:80:46:02/00:00:ff:ff:ff:ff actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
/var/run/openvswitch/ovs-save.ATtBUXQktE/br0.flows.dump: table=20, priority=100,arp,in_port=8,arp_spa=10.128.70.2,arp_sha=00:00:0a:80:46:02/00:00:ff:ff:ff:ff actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
/var/run/openvswitch/ovs-save.ATtBUXQktE/br0.flows.dump: table=20, priority=100,ip,in_port=3,nw_src=10.128.70.2 actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
/var/run/openvswitch/ovs-save.ATtBUXQktE/br0.flows.dump: table=20, priority=100,ip,in_port=8,nw_src=10.128.70.2 actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
/var/run/openvswitch/ovs-save.ATtBUXQktE/br0.flows.dump: table=25, priority=100,ip,nw_src=10.128.70.2 actions=load:0x5a2682->NXM_NX_REG0[],goto_table:30
/var/run/openvswitch/ovs-save.ATtBUXQktE/br0.flows.dump: table=40, priority=100,arp,arp_tpa=10.128.70.2 actions=output:8
/var/run/openvswitch/ovs-save.ATtBUXQktE/br0.flows.dump: table=70, priority=100,ip,nw_dst=10.128.70.2 actions=load:0x5a2682->NXM_NX_REG1[],load:0x8->NXM_NX_REG2[],goto_table:80

However, performing a dump of the flows on that node:

[root@build0-gstfj-w-d-ckk6s /]# ovs-ofctl -O OpenFlow13 dump-flows br0 | grep -i 10.128.70.2

Empty...

Restoring the flows manually later on succeeded, however:

[root@build0-gstfj-w-d-ckk6s /]# sh +x /var/run/openvswitch/flows.sh
2020-07-08T08:31:48Z|00001|vconn|WARN|unix:/var/run/openvswitch/br0.mgmt: version negotiation failed (we support version 0x01, peer supports version 0x04)
ovs-ofctl: br0: failed to connect to socket (Broken pipe)
[root@build0-gstfj-w-d-ckk6s /]# ovs-ofctl -O OpenFlow13 dump-flows br0 | grep -i 10.128.70.2
cookie=0x0, duration=4.569s, table=20, n_packets=0, n_bytes=0, priority=100,arp,in_port=3,arp_spa=10.128.70.2,arp_sha=00:00:0a:80:46:02/00:00:ff:ff:ff:ff actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
cookie=0x0, duration=4.568s, table=20, n_packets=0, n_bytes=0, priority=100,arp,in_port=8,arp_spa=10.128.70.2,arp_sha=00:00:0a:80:46:02/00:00:ff:ff:ff:ff actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
cookie=0x0, duration=4.568s, table=20, n_packets=0, n_bytes=0, priority=100,ip,in_port=3,nw_src=10.128.70.2 actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
cookie=0x0, duration=4.567s, table=20, n_packets=0, n_bytes=0, priority=100,ip,in_port=8,nw_src=10.128.70.2 actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
cookie=0x0, duration=4.566s, table=25, n_packets=0, n_bytes=0, priority=100,ip,nw_src=10.128.70.2 actions=load:0x5a2682->NXM_NX_REG0[],goto_table:30
cookie=0x0, duration=4.565s, table=40, n_packets=2, n_bytes=84, priority=100,arp,arp_tpa=10.128.70.2 actions=output:8
cookie=0x0, duration=4.563s, table=70, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.128.70.2 actions=load:0x5a2682->NXM_NX_REG1[],load:0x8->NXM_NX_REG2[],goto_table:80

I can only imagine that the restore commands nested in that flows.sh script failed at the moment the pod booted.
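If the restore commands can fail transiently at boot, one way to make the restore more robust would be a retry wrapper along these lines (a hypothetical sketch; the `retry` helper, attempt count, and backoff are mine, not the actual entrypoint change):

```shell
#!/bin/sh
# retry CMD...: run a command up to N times, backing off 1s between attempts.
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
    attempts=$1; shift
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $i/$attempts failed: $*" >&2
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# Only attempt the flow restore on a host that actually has ovs-ofctl;
# the flows.sh path is the one generated by OVS on shutdown, as shown above.
if command -v ovs-ofctl >/dev/null 2>&1; then
    retry 3 sh -x /var/run/openvswitch/flows.sh
fi
```

The wrapper deliberately retries the whole script rather than individual ovs-ofctl calls, since replace-flows is idempotent over the saved dump.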
I will add a retry for that in the entrypoint to that container.

FYI: the reason our livenessProbe fails so often is that we never set "timeoutSeconds" on it. Both the liveness and readiness probes execute 3 OVS commands with timeouts attached: https://github.com/openshift/cluster-network-operator/blob/release-4.5/bindata/network/openshift-sdn/sdn-ovs.yaml#L139 The total expected timeout is a maximum of 20 seconds; however, the livenessProbe will time out after 1 second (its default value). I will modify the livenessProbe accordingly so that we don't reboot OVS so often.
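For illustration, a probe tuned along those lines might look like the following (a sketch only; the exec command is a placeholder for the real OVS checks, and the exact values are assumptions based on the ~20-second worst case described above, not the actual PR content):

```yaml
livenessProbe:
  exec:
    command:
    - /bin/bash
    - -c
    # Placeholder for the real probe, which runs several OVS status
    # commands, each with its own internal timeout.
    - /usr/share/openvswitch/scripts/ovs-ctl status
  initialDelaySeconds: 15
  periodSeconds: 10
  # kubelet's default timeoutSeconds is 1; on a heavily loaded node the
  # probe's OVS commands alone can take far longer, so without explicit
  # headroom the probe fails and kubelet restarts a healthy container.
  timeoutSeconds: 20
```

The key point is the last field: an unset timeoutSeconds silently inherits the 1-second default, which is what turned load spikes into OVS restarts here.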
Thanks for the detailed information above, @Alexander. This seems to be the same issue as bug https://bugzilla.redhat.com/show_bug.cgi?id=1848374
I am in talks with Zhao and Huiran from QE to see if they can provide a QE reproducer on 4.5 so that I can step away from using the Prow CI production cluster.

FYI: I've modified the OVS daemonset to check the exit status of the restore script, thinking it might be erroring when trying to restore the flows... it does not error; it just doesn't restore the flows.
(In reply to Alexander Constantinescu from comment #13)

> 2020-07-08T08:31:48Z|00001|vconn|WARN|unix:/var/run/openvswitch/br0.mgmt: version negotiation failed (we support version 0x01, peer supports version 0x04)

That message means the client is using the default OpenFlow10 while the bridge is using OpenFlow13. Perhaps something failed to configure the bridge? It may have prevented the flow loading, and then the flow dump is empty. See bz#1278136

> OK, there's definitely a problem with the flow restore for OVS.
>
> I've modified the OVS daemonset to not delete the files containing the flows
> it restores upon booting the OVS pod, and the outcome is strange. On some
> nodes, the flow restore does not happen (though logs do not indicate any
> issues)
>
> This is the file used to restore the flows (taken from a node which
> experienced the issue):
>
> cat /var/run/openvswitch/flows.sh
> ovs-ofctl add-tlv-map br0 ''
> ovs-ofctl -O OpenFlow13 add-groups br0
> "/var/run/openvswitch/ovs-save.ATtBUXQktE/br0.groups.dump"
> ovs-ofctl -O OpenFlow13 replace-flows br0
> "/var/run/openvswitch/ovs-save.ATtBUXQktE/br0.flows.dump"

Can you run the following command before restoring the flows and leave it running until you see the empty flow table?

''ovs-ofctl monitor <bridge> watch:''

This will show if the flows are accepted and what happened after that. If you don't see anything, add a generic flow and test if that shows up.

Another useful command to run afterwards is:

''ovsdb-tool -m show-log -v''

It will tell if there was any modification in the database.

Is the bridge in secure mode? If so, is the controller up? Could it have managed the bridge in parallel?

Thanks
fbl
Correcting comment #16: perhaps something failed to configure the client and/or bridge, because the versions used are not the same. fbl
Hi Flavio

Thanks for looking into it all. I have the following, which might be interesting for you and could help pin-point the problem with the flow restore. Essentially what happens for us in OpenShift is:

1. We have OVS running.
2. OVS receives a SIGTERM.
3. Prior to exiting, it creates a dump of the existing flows on the node and generates this script:

$ cat /var/run/openvswitch/flows.sh
ovs-ofctl -v add-tlv-map br0 ''
ovs-ofctl -v -O OpenFlow13 add-groups br0 "/var/run/openvswitch/ovs-save.mqGw09z9In/br0.groups.dump"
ovs-ofctl -v -O OpenFlow13 replace-flows br0 "/var/run/openvswitch/ovs-save.mqGw09z9In/br0.flows.dump"

"/var/run/openvswitch/ovs-save.mqGw09z9In/br0.flows.dump" contains the dump of flows.

4. OVS then restarts.
5. When OVS restarts it executes these lines of code:

echo Restoring these flows:
cat /var/run/openvswitch/ovs-save.*/*
sh -x /var/run/openvswitch/flows.sh
echo "$(date -u "+%Y-%m-%d %H:%M:%S") info: Done restoring the existing flows ..." 2>&1
echo You now have these flows:
ovs-ofctl -O OpenFlow13 dump-flows br0

The output I am attaching is from one run with these commands. Look out for the IP "10.130.2.102": it is present right after "Restoring these flows:", meaning it exists in "/var/run/openvswitch/ovs-save.mqGw09z9In/br0.flows.dump". However, it is no longer there after "You now have these flows:". I am unable to interpret why, even with the verbose replace-flows output.

Thanks in advance!
Created attachment 1700453 [details] flow issues output
Ok, that's interesting because the debug output shows a few flows being added. I marked one as an example: [...] 8463 2020-07-08T15:14:31Z|00050|vconn|DBG|unix:/var/run/openvswitch/br0.mgmt: sent (Success): OFPST_FLOW request (OF1.3) (xid=0x6): 8464 2020-07-08T15:14:31Z|00051|vconn|DBG|unix:/var/run/openvswitch/br0.mgmt: received: OFPST_FLOW reply (OF1.3) (xid=0x6): 8465 2020-07-08T15:14:31Z|00052|vconn|DBG|unix:/var/run/openvswitch/br0.mgmt: sent (Success): OFPT_FLOW_MOD (OF1.3) (xid=0x7): ADD priority=400,ip,in_port=2,nw_src=10.130.2.1 actions=goto_table:30 8466 2020-07-08T15:14:31Z|00053|vconn|DBG|unix:/var/run/openvswitch/br0.mgmt: sent (Success): OFPT_BARRIER_REQUEST (OF1.3) (xid=0x6e): 8467 2020-07-08T15:14:31Z|00054|vconn|DBG|unix:/var/run/openvswitch/br0.mgmt: received: OFPT_BARRIER_REPLY (OF1.3) (xid=0x6e): > 8468 2020-07-08T15:14:31Z|00055|vconn|DBG|unix:/var/run/openvswitch/br0.mgmt: sent (Success): OFPT_FLOW_MOD (OF1.3) (xid=0x8): ADD priority=300,ct_state=-trk,ip actions=ct(table=0) 8469 2020-07-08T15:14:31Z|00056|vconn|DBG|unix:/var/run/openvswitch/br0.mgmt: sent (Success): OFPT_BARRIER_REQUEST (OF1.3) (xid=0x6f): 8470 2020-07-08T15:14:31Z|00057|vconn|DBG|unix:/var/run/openvswitch/br0.mgmt: received: OFPT_BARRIER_REPLY (OF1.3) (xid=0x6f): that is actually added: 8791 2020-07-08 15:14:32 info: Done restoring the existing flows ... 8792 You now have these flows: 8793 OFPST_FLOW reply (OF1.3) (xid=0x2): 8794 cookie=0x0, duration=0.931s, table=0, n_packets=0, n_bytes=0, priority=400,ip,in_port=2,nw_src=10.130.2.1 actions=goto_table:30 > 8795 cookie=0x0, duration=0.931s, table=0, n_packets=0, n_bytes=0, priority=300,ct_state=-trk,ip actions=ct(table=0) Well the replace-flows will dump the tables, find the differences and push only what is difference, so either there was something in the flow table, which doesn't make sense after a restart, or restore Can you change the script to use --readd? 
That will force it to re-add all flows regardless, which will help us see whether the tool is getting lost processing the differences or something else is going on.

Thanks,
fbl
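For reference, the requested change amounts to inserting --readd into the replace-flows line of the generated script. A sketch of that edit, applied to a copy of the script under /tmp for illustration (the ovs-save directory name is from the earlier run and will differ each time):

```shell
# Recreate the generated script (contents as shown earlier in this bug):
cat > /tmp/flows.sh <<'EOF'
ovs-ofctl -v add-tlv-map br0 ''
ovs-ofctl -v -O OpenFlow13 add-groups br0 "/var/run/openvswitch/ovs-save.mqGw09z9In/br0.groups.dump"
ovs-ofctl -v -O OpenFlow13 replace-flows br0 "/var/run/openvswitch/ovs-save.mqGw09z9In/br0.flows.dump"
EOF

# Insert --readd so replace-flows deletes and re-installs every flow instead
# of computing a minimal diff against the current table (ovs-ofctl options
# precede the command name):
sed -i 's/replace-flows/--readd replace-flows/' /tmp/flows.sh
grep 'replace-flows' /tmp/flows.sh
```

The grep at the end shows the rewritten invocation, now carrying --readd.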
@flavio, what we may be seeing is likely a symptom and not a true problem with save/restore. Latest bug update from Slack:

> OK, so there are a couple of issues with openshift-sdn in 4.3/4.4/4.5. Here's what I and @Aniket Bhat have found:
>
> Our livenessProbes are not correctly tuned, i.e. on heavily loaded clusters like CI prod the livenessProbes were timing out, leading to unnecessary OVS restarts. The fix will be provided by https://github.com/openshift/cluster-network-operator/pull/701 and subsequent back-ports.
>
> We had race conditions when restarting OVS and SDN, leading to cases where we delete the OVS database and thus are unable to properly sync state when we start. The fix will be provided by https://github.com/openshift/sdn/pull/158 and subsequent back-ports.
(In reply to Aniket Bhat from comment #22)
> @flavio, what we may be seeing is likely a symptom and not a true problem with save/restore.
>
> Latest bug update from Slack:
>
> > OK, so there are a couple of issues with openshift-sdn in 4.3/4.4/4.5. Here's what I and @Aniket Bhat have found:
> > Our livenessProbes are not correctly tuned, i.e. on heavily loaded clusters like CI prod the livenessProbes were timing out, leading to unnecessary OVS restarts. The fix will be provided by https://github.com/openshift/cluster-network-operator/pull/701 and subsequent back-ports.
> > We had race conditions when restarting OVS and SDN, leading to cases where we delete the OVS database and thus are unable to properly sync state when we start. The fix will be provided by https://github.com/openshift/sdn/pull/158 and subsequent back-ports.

That makes sense, and using ovsdb-tool, as recommended in comment #16, we can confirm the changes done to the DB.
I am running a new attempt with your recommendation in comment #16 for verification. I will update the BZ with new data should I hit the issue again.
Hi Flavio,

I just hit a different form of the restore issue: this time not all flows were added for IP "10.131.1.46". Please find attached all the output needed, hopefully, to understand the issue.

$ grep -i -e 10.131.1.46 -e "You now" ovs-hdwhq-10.131.1.46

Flows that should be restored:

 table=20, priority=100,arp,in_port=3,arp_spa=10.131.1.46,arp_sha=00:00:0a:83:01:2e/00:00:ff:ff:ff:ff actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
 table=20, priority=100,ip,in_port=3,nw_src=10.131.1.46 actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
 table=25, priority=100,ip,nw_src=10.131.1.46 actions=load:0x5a2682->NXM_NX_REG0[],goto_table:30
 table=40, priority=100,arp,arp_tpa=10.131.1.46 actions=output:3
 table=70, priority=100,ip,nw_dst=10.131.1.46 actions=load:0x5a2682->NXM_NX_REG1[],load:0x3->NXM_NX_REG2[],goto_table:80
 table=20, priority=100,arp,in_port=3,arp_spa=10.131.1.46,arp_sha=00:00:0a:83:01:2e/00:00:ff:ff:ff:ff actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
 table=20, priority=100,ip,in_port=3,nw_src=10.131.1.46 actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
 table=25, priority=100,ip,nw_src=10.131.1.46 actions=load:0x5a2682->NXM_NX_REG0[],goto_table:30
 table=40, priority=100,arp,arp_tpa=10.131.1.46 actions=output:3
 table=70, priority=100,ip,nw_dst=10.131.1.46 actions=load:0x5a2682->NXM_NX_REG1[],load:0x3->NXM_NX_REG2[],goto_table:80
 table=20, priority=100,arp,in_port=3,arp_spa=10.131.1.46,arp_sha=00:00:0a:83:01:2e/00:00:ff:ff:ff:ff actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
 table=20, priority=100,ip,in_port=3,nw_src=10.131.1.46 actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
 table=25, priority=100,ip,nw_src=10.131.1.46 actions=load:0x5a2682->NXM_NX_REG0[],goto_table:30
 table=40, priority=100,arp,arp_tpa=10.131.1.46 actions=output:3
 table=70, priority=100,ip,nw_dst=10.131.1.46 actions=load:0x5a2682->NXM_NX_REG1[],load:0x3->NXM_NX_REG2[],goto_table:80

You now have these flows:

 cookie=0x0, duration=0.147s, table=20, n_packets=0, n_bytes=0, priority=100,arp,in_port=3,arp_spa=10.131.1.46,arp_sha=00:00:0a:83:01:2e/00:00:ff:ff:ff:ff actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
 cookie=0x0, duration=0.137s, table=20, n_packets=0, n_bytes=0, priority=100,ip,in_port=3,nw_src=10.131.1.46 actions=load:0x5a2682->NXM_NX_REG0[],goto_table:21
 cookie=0x0, duration=0.130s, table=25, n_packets=0, n_bytes=0, priority=100,ip,nw_src=10.131.1.46 actions=load:0x5a2682->NXM_NX_REG0[],goto_table:30
 cookie=0x0, duration=0.121s, table=40, n_packets=0, n_bytes=0, priority=100,arp,arp_tpa=10.131.1.46 actions=output:3
 cookie=0x0, duration=0.113s, table=70, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.131.1.46 actions=load:0x5a2682->NXM_NX_REG1[],load:0x3->NXM_NX_REG2[],goto_table:80
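One quick sanity check on a saved dump is to count repeated entries, since each distinct flow should normally appear exactly once. A small sketch, with an illustrative file path and sample contents standing in for the real dump:

```shell
# Sketch: count duplicate entries in a saved flow dump. Counts above 1
# indicate the same flow was written to the dump more than once.
# File path and sample contents are illustrative.
cat > /tmp/br0.flows.dump <<'EOF'
 table=40, priority=100,arp,arp_tpa=10.131.1.46 actions=output:3
 table=40, priority=100,arp,arp_tpa=10.131.1.46 actions=output:3
 table=40, priority=100,arp,arp_tpa=10.131.1.46 actions=output:3
 table=25, priority=100,ip,nw_src=10.131.1.46 actions=goto_table:30
EOF

# Print each distinct flow with its occurrence count, highest first, so
# duplicated flows float to the top.
sort /tmp/br0.flows.dump | uniq -c | sort -rn
```

With the sample data the table=40 flow shows a count of 3, i.e. it is triplicated in the dump.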
Created attachment 1700468 [details] second flow restore with db dump
Actually, my bad. I realized the problem is more that the dump of flows is duplicated three times.
(In reply to Alexander Constantinescu from comment #27)
> Actually, my bad. I realized the problem is more that the dump of flows is duplicated three times.

No worries. So nothing to see in comment #25 and comment #26?
This looks related to https://bugzilla.redhat.com/show_bug.cgi?id=1843998. One of these is probably a dupe.
In response to comment #28: I am not sure. It is strange that we have duplicate flows in the dump which is created, but reproducing the issue is very difficult, and collecting the data is even more so. So I think we can put the restore issue on hold for the following reasons:

- The original problem should be fixed by the resolutions mentioned in comment #22.
- Those resolutions might also fix the restore issues I have mentioned here.
- We only really care about flow restore on upgrade, which should not happen while clusters are heavily loaded in the middle of their operations.
*** Bug 1848374 has been marked as a duplicate of this bug. ***
Moving to ON_QA as the 4.6 PR has merged.

Note to QE: this fix cannot be verified on 4.6, as the fix does not apply to that version; 4.6 does not have a livenessProbe, as it was removed with "system OVS". The fix can only start being verified on 4.5, but I need this bug to go in so that the process is respected and my 4.5 PR can merge. Please move this to VERIFIED.
Moving back to MODIFIED, based on the PR landing. ART's Elliott sweeper will move it to ON_QA when it builds a nightly and attaches the bug to an errata.
*** Bug 1831283 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196