ARO team has seen the same issue on a customer-initiated cluster upgrade from 4.6.8 -> 4.6.9. In this customer's case the router pods are completely inaccessible off host, so all router workload and the console are inaccessible.

* We are missing an expected table 80 (network policy) entry for the openshift-ingress namespace in the OVS flows on one of the nodes hosting a router pod, which means all incoming traffic to the router pod is dropped on the floor:

```
$ oc get netnamespace | grep openshift-ingress
openshift-ingress    11026750
```

11026750 == 0xa8413e

```
sh-4.4# ovs-ofctl -O OpenFlow13 dump-flows br0 | grep a8413e
cookie=0x0, duration=268457.230s, table=20, n_packets=67252, n_bytes=2824584, priority=100,arp,in_port=11,arp_spa=10.128.4.10,arp_sha=00:00:0a:80:04:0a/00:00:ff:ff:ff:ff actions=load:0xa8413e->NXM_NX_REG0[],goto_table:21
cookie=0x0, duration=268457.231s, table=20, n_packets=3249309, n_bytes=483914783, priority=100,ip,in_port=11,nw_src=10.128.4.10 actions=load:0xa8413e->NXM_NX_REG0[],goto_table:21
cookie=0x0, duration=268457.231s, table=25, n_packets=3, n_bytes=222, priority=100,ip,nw_src=10.128.4.10 actions=load:0xa8413e->NXM_NX_REG0[],goto_table:30
cookie=0x0, duration=268457.231s, table=70, n_packets=2319179, n_bytes=444378891, priority=100,ip,nw_dst=10.128.4.10 actions=load:0xa8413e->NXM_NX_REG1[],load:0xb->NXM_NX_REG2[],goto_table:80
```

^ The table=80 rule for this namespace is missing; we expect to see something like `cookie=0x0, duration=XXXs, table=80, n_packets=XXX, n_bytes=XXX, priority=50,reg1=0xa8413e actions=output:NXM_NX_REG2[]`.

* We are seeing bursts of "Error syncing OVS flows: timed out waiting for the condition" on the SDN pod in question, which we believe is likely to be associated.

* No interesting openvswitch messages in `dmesg`.

We tried deleting the SDN pod on the node in question, and when it was recreated the rule did *not* appear. We restarted the openvswitch systemd "service" on the node, and the rule did appear. We had to restart openvswitch on all the cluster nodes to get back to health.

Problem #1: we presume that `ovs-ofctl bundle` is repeatedly returning an error, but the SDN does not log the actual error, so we don't know what it is (log line needed in `func (tx *ovsExecTx) Commit() error`? -- see the sketch below).

Problem #2: I wonder if the network policy code here is edge-triggered rather than level-triggered (e.g. the comment `// Push internal data to OVS (for namespaces that have changed)`), which would mean there is no retry/resync capability.

Problem #3: So far the root cause for this issue is unknown.
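For Problem #1, a minimal sketch of the kind of logging being suggested, under the assumption of simplified ovsExecTx/execBundle plumbing -- the type fields and helper names below are illustrative, not the real openshift/sdn code:

```go
package ovs

import (
	"fmt"
	"os/exec"
	"strings"

	"k8s.io/klog/v2"
)

// Sketch only: ovsExec and ovsExecTx here are simplified stand-ins, not the
// actual openshift/sdn types. The interesting part is the klog.Errorf call in
// Commit(), which surfaces the underlying ovs-ofctl error instead of leaving
// only the generic "Error syncing OVS flows: timed out waiting for the
// condition" message.

type ovsExec struct {
	bridge string
}

// execBundle feeds the queued flow mods to "ovs-ofctl bundle" on stdin and
// folds the command's combined output into any returned error.
func (o *ovsExec) execBundle(flows []string) error {
	cmd := exec.Command("ovs-ofctl", "bundle", o.bridge, "-")
	cmd.Stdin = strings.NewReader(strings.Join(flows, "\n"))
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("%v: %s", err, out)
	}
	return nil
}

type ovsExecTx struct {
	ovsif *ovsExec
	flows []string
}

// Commit applies all queued flow mods as one bundle.
func (tx *ovsExecTx) Commit() error {
	err := tx.ovsif.execBundle(tx.flows)
	if err != nil {
		// Proposed addition: log the real failure so it can be correlated
		// with the sync-retry timeouts seen in the SDN pod logs.
		klog.Errorf("Error committing OVS flow bundle (%d flows): %v", len(tx.flows), err)
	}
	tx.flows = nil
	return err
}
```

With something like this in place, the "timed out waiting for the condition" bursts would likely have carried the underlying ovs-ofctl detail that only surfaced later in this bug.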
It is notable that https://github.com/openshift/sdn/pull/228 is new in 4.6.9; I fear a regression, @danw.
Quite feasibly https://bugzilla.redhat.com/show_bug.cgi?id=1914393 could be a root cause of the "Error syncing OVS flows: timed out waiting for the condition" message. The ARO cluster had Strimzi/Kafka running on it.
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it's always been like this, we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1
After some testing I was still unable to reproduce this issue on a development cluster; however, we did learn a few things.

It seems that restarting systemd's openvswitch service is a viable fix both in this instance and in the associated case seen in https://bugzilla.redhat.com/show_bug.cgi?id=1914393, which leads me to believe it is in fact a regression in the SDN going into 4.6.9. To combat this problem, all upgrades to 4.6.9 have been blocked until we can pinpoint exactly what's going on.

Due to the problem's association with only OCP 4.6.9, the absence of any SDN changes in 4.6.7 and 4.6.8, and the issue's relation to the network policy OVS rule table, we believe the problem originated in the following PR, which refactored much of the network policy code -> https://github.com/openshift/sdn/pull/228

Specifically, I think it has to do with the syncNamespace() function in the SDN code, where updates to OVS rules are never getting applied, with errors like:

`I0108 17:31:21.735682 1819 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:3: 0/0: invalid IP address`

Moving forward we will work with the rest of the team to find a fix and push a patch as soon as possible.

- Andrew
*** Bug 1914393 has been marked as a duplicate of this bug. ***
> Specifically I think it has to do with the syncNamespace() function in the SDN code, where updates to OVS rules are never getting applied... With errors like
>
> `I0108 17:31:21.735682 1819 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:3: 0/0: invalid IP address`

which is because the SDN is generating:

> flow add table=80, priority=150, reg1=8083932, ip, nw_dst=10.128.2.8, reg0=8083932, ip, nw_src=, tcp, tp_dst=2181, actions=output:NXM_NX_REG2[]

Note the missing `nw_src` value. This _is_ a regression introduced by sdn#228.
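For illustration only, a toy example of how an empty source CIDR produces exactly that broken match; the function below is made up and is not the actual syncNamespace() code from openshift/sdn:

```go
package main

import "fmt"

// peerFlow is a made-up helper that mimics how a table=80 network policy
// flow is assembled as a string before being handed to ovs-ofctl.
func peerFlow(vnid uint32, dstIP, srcCIDR string, tcpPort int) string {
	// If srcCIDR is empty (e.g. a peer pod whose IP was never filled in),
	// the result contains "nw_src=," and ovs-ofctl rejects the whole
	// bundle with an "invalid IP address" error, as quoted above.
	return fmt.Sprintf(
		"flow add table=80, priority=150, reg1=%d, ip, nw_dst=%s, reg0=%d, ip, nw_src=%s, tcp, tp_dst=%d, actions=output:NXM_NX_REG2[]",
		vnid, dstIP, vnid, srcCIDR, tcpPort)
}

func main() {
	// Reproduces the shape of the broken flow quoted above.
	fmt.Println(peerFlow(8083932, "10.128.2.8", "", 2181))
}
```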
(In reply to Scott Dodson from comment #4)

> Who is impacted? If we have to block upgrade edges based on this issue,
> which edges would need blocking?

The bug is not upgrade-related; it's just a bug in 4.6.9 / 4.5.25, so all edges to those releases are affected.

> What is the impact? Is it serious enough to warrant blocking edges?

Any customer using NetworkPolicies is potentially affected, and may sporadically lose network connectivity within namespaces that use NetworkPolicies.

> How involved is remediation (even moderately serious impacts might be
> acceptable if they are easy to mitigate)?

Restarting openvswitch on every node appears to resolve it at least temporarily (by forcing sdn to restart and regenerate all OVS flows), but the bug will probably come back fairly quickly. There is no permanent fix.

> Is this a regression (if all previous versions were also vulnerable,
> updating to the new, vulnerable version does not increase exposure)?

Yes, from 4.6.8 / 4.5.24.
> may sporadically lose network connectivity within namespaces that use NetworkPolicies

We saw complete loss of network connectivity to the openshift-ingress namespace, which I don't believe uses NetworkPolicies.
Reproduced this issue with the following steps on 4.6.9; will use this as the verification steps.

1. Deploy a 4.6.8 cluster.

2. Create two namespaces, z1 and z2:

```
oc new-project z1
oc new-project z2
```

3. Label namespace z2:

```
oc label namespace z2 team=operations
```

4. Create test pods in both namespaces:

```
oc create -f https://raw.githubusercontent.com/openshift/verification-tests/master/testdata/rc/idle-rc-1.yaml -n z1
oc create -f https://raw.githubusercontent.com/openshift/verification-tests/master/testdata/networking/list_for_pods.json -n z1
oc create -f https://raw.githubusercontent.com/openshift/verification-tests/master/testdata/rc/idle-rc-1.yaml -n z2
oc create -f https://raw.githubusercontent.com/openshift/verification-tests/master/testdata/networking/list_for_pods.json -n z2
```

5. Create the following policy in namespace z1. The policy means that only pods labelled 'name=test-pods' in namespace z2 (team=operations) can access the pods labelled 'name=test-pods' in namespace z1:

```
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: allow-ns-and-pod
spec:
  podSelector:
    matchLabels:
      name: test-pods
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          team: operations
      podSelector:
        matchLabels:
          name: test-pods
    - podSelector:
        matchLabels:
          name: test-pods
```

6. Upgrade to a 4.6.9 build. (If your cluster is already on 4.6.9, you can create a resource, e.g. an imagecontentsourcepolicies.operator.openshift.io object, to make all masters and workers reboot.)

7. Check the operators with `oc get co`; the authentication and console operators are not available:

```
oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.9     False       True          True       4h18m
cloud-credential                           4.6.9     True        False         False      5h49m
cluster-autoscaler                         4.6.9     True        False         False      5h47m
config-operator                            4.6.9     True        False         False      5h49m
console                                    4.6.9     False       False         True       4h14m
csi-snapshot-controller                    4.6.9     True        False         False      5h18m
dns                                        4.6.9     True        False         False      5h48m
etcd                                       4.6.9     True        False         False      5h47m
image-registry                             4.6.9     True        False         False      5h39m
ingress                                    4.6.9     True        False         False      5h39m
insights                                   4.6.9     True        False         False      5h49m
kube-apiserver                             4.6.9     True        False         False      5h46m
kube-controller-manager                    4.6.9     True        False         False      5h46m
kube-scheduler                             4.6.9     True        False         False      5h46m
kube-storage-version-migrator              4.6.9     True        False         False      4h15m
machine-api                                4.6.9     True        False         False      5h38m
machine-approver                           4.6.9     True        False         False      5h48m
machine-config                             4.6.9     True        False         False      5h47m
marketplace                                4.6.9     True        False         False      4h13m
monitoring                                 4.6.9     True        False         False      5h38m
network                                    4.6.9     True        False         False      5h49m
node-tuning                                4.6.9     True        False         False      5h49m
openshift-apiserver                        4.6.9     True        False         False      4h18m
openshift-controller-manager               4.6.9     True        False         False      5h47m
openshift-samples                          4.6.9     True        False         False      5h41m
operator-lifecycle-manager                 4.6.9     True        False         False      5h48m
operator-lifecycle-manager-catalog         4.6.9     True        False         False      5h48m
operator-lifecycle-manager-packageserver   4.6.9     True        False         False      4h14m
service-ca                                 4.6.9     True        False         False      5h49m
storage                                    4.6.9     True        False         False      4h14m
```

8. Check the sdn pod on the node where the openshift-ingress pod is located:

```
$ oc logs sdn-zm57r -n openshift-sdn -c sdn | grep invalid
I0112 06:32:53.217868 1945 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:2: 0/0: invalid IP address
I0112 06:32:53.741193 1945 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:2: 0/0: invalid IP address
I0112 06:32:54.384676 1945 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:2: 0/0: invalid IP address
I0112 06:32:55.183985 1945 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:2: 0/0: invalid IP address
I0112 06:32:56.178511 1945 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:2: 0/0: invalid IP address
I0112 06:32:57.417222 1945 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:2: 0/0: invalid IP address
I0112 06:32:58.960991 1945 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:2: 0/0: invalid IP address
I0112 06:33:00.886775 1945 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:2: 0/0: invalid IP address
I0112 06:33:03.290008 1945 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:2: 0/0: invalid IP address
I0112 06:33:06.288150 1945 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:2: 0/0: invalid IP address
I0112 06:39:21.424547 1945 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:2: 0/0: invalid IP address
I0112 06:39:21.958804 1945 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:2: 0/0: invalid IP address
I0112 06:39:22.607541 1945 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:2: 0/0: invalid IP address
I0112 06:39:23.407708 1945 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:2: 0/0: invalid IP address
I0112 06:39:24.402952 1945 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:2: 0/0: invalid IP address
```

9. Get the openshift-ingress netnamespace ID:

```
oc get netnamespace | grep openshift-ingress
openshift-ingress    4736711
echo `printf %x 4736711`
4846c7
```

10. Check the table 80 openflows on the above sdn pod; the allow rule for the openshift-ingress VNID is missing (we would expect to see something like `cookie=0x0, duration=932.385s, table=80, n_packets=1797, n_bytes=255409, priority=50,reg1=0x4846c7 actions=output:NXM_NX_REG2[]`):

```
$ oc exec sdn-zm57r -n openshift-sdn -- ovs-ofctl dump-flows br0 -O openflow13 'table=80'
Defaulting container name to sdn.
Use 'oc describe pod/sdn-zm57r -n openshift-sdn' to see all of the containers in this pod.
OFPST_FLOW reply (OF1.3) (xid=0x6):
 cookie=0x0, duration=1085.671s, table=80, n_packets=4779, n_bytes=489979, priority=300,ip,nw_src=10.129.2.1 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=1085.631s, table=80, n_packets=35387, n_bytes=109699949, priority=200,ct_state=+rpl,ip actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=7.179s, table=80, n_packets=0, n_bytes=0, priority=150,ip,reg0=0x43e3cd,reg1=0x6cf60e,nw_src=10.129.2.41,nw_dst=10.129.2.5 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=7.179s, table=80, n_packets=0, n_bytes=0, priority=150,ip,reg0=0x43e3cd,reg1=0x6cf60e,nw_src=10.131.0.4,nw_dst=10.129.2.5 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=7.179s, table=80, n_packets=0, n_bytes=0, priority=150,ip,reg0=0x43e3cd,reg1=0x6cf60e,nw_src=10.129.2.41,nw_dst=10.131.0.7 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=7.179s, table=80, n_packets=0, n_bytes=0, priority=150,ip,reg0=0x43e3cd,reg1=0x6cf60e,nw_src=10.131.0.4,nw_dst=10.131.0.7 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=7.179s, table=80, n_packets=0, n_bytes=0, priority=100,ip,reg1=0x6cf60e,nw_dst=10.131.0.7 actions=drop
 cookie=0x0, duration=7.179s, table=80, n_packets=0, n_bytes=0, priority=100,ip,reg1=0x6cf60e,nw_dst=10.129.2.5 actions=drop
 cookie=0x0, duration=1084.433s, table=80, n_packets=3825, n_bytes=416689, priority=50,reg1=0xdb4361 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=1059.798s, table=80, n_packets=3373, n_bytes=723105, priority=50,reg1=0x951de7 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=7.179s, table=80, n_packets=0, n_bytes=0, priority=50,reg1=0x6cf60e actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=1085.671s, table=80, n_packets=1385, n_bytes=102490, priority=0 actions=drop
```
*** Bug 1905761 has been marked as a duplicate of this bug. ***
Hello - I have a case where the customer began an upgrade to 4.6.9 before it was blocked. Is there any workaround or patch to get their cluster moving? Currently the DaemonSets "openshift-multus/multus" and "openshift-sdn/ovs" are not rolling out.
Hello - Case 02831577, which is attached to this bug, has a degraded worker node, and the customer is looking to resolve this. They were upgrading from 4.6.8 to 4.6.9 before 4.6.9 was blocked, and the update hung with their MCO being degraded because their SDN and OVS daemonsets failed to roll out. We've tried deleting the daemonsets and rebooting the worker node, but it is in the same state. Will upload the latest must-gather from today.

Insights shows:

Operator: 'node-tuning'
  Issue          : Progressing
  Reason         : Reconciling
  Message        : Working towards "4.6.9"
  LastTransition : 2021-01-07T13:38:20Z

Operator: 'machine-config'
  Issue          : Not available
  Reason         :
  Message        : Cluster not available for 4.6.8
  LastTransition : 2020-12-29T09:35:18Z

  Issue          : Degraded
  Reason         : MachineConfigDaemonFailed
  Message        : Failed to resync 4.6.8 because: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 8, updated: 8, ready: 7, unavailable: 1)
  LastTransition : 2020-12-29T09:35:18Z

Operator: 'monitoring'
  Issue          : Not available
  Reason         :
  Message        :
  LastTransition : 2020-12-29T09:30:59Z

  Issue          : Degraded
  Reason         : UpdatingnodeExporterFailed
  Message        : Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: expected 8 desired scheduled nodes, got 0 updated scheduled nodes
  LastTransition : 2021-01-07T13:43:07Z

  Issue          : Progressing
  Reason         : RollOutInProgress
  Message        : Rolling out the stack.
  LastTransition : 2021-01-14T17:13:13Z

Operator: 'network'
  Issue          : Degraded
  Reason         : RolloutHung
  Message        : DaemonSet "openshift-multus/multus" rollout is not making progress - last change 2020-12-29T09:25:14Z
                   DaemonSet "openshift-sdn/ovs" rollout is not making progress - last change 2021-01-14T11:35:25Z
                   DaemonSet "openshift-sdn/sdn" rollout is not making progress - last change 2020-12-29T09:25:14Z
  LastTransition : 2020-12-29T09:36:40Z

  Issue          : Progressing
  Reason         : Deploying
  Message        : DaemonSet "openshift-multus/multus" is not available (awaiting 1 nodes)
                   DaemonSet "openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
                   DaemonSet "openshift-sdn/ovs" is not available (awaiting 1 nodes)
                   DaemonSet "openshift-sdn/sdn" is not available (awaiting 1 nodes)
  LastTransition : 2020-12-29T09:25:13Z
*** Bug 1916428 has been marked as a duplicate of this bug. ***
I would expect us to add an e2e test that ensures this does not regress in the future (if there are no default network policies on the cluster during our upgrade e2e, that is something to remedy).
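One possible shape for such a check (illustrative only; the helper below and its inputs are assumptions, not existing test code): after the upgrade, assert that every NetNamespace VNID still appears in a table 80 rule.

```go
package main

import (
	"fmt"
	"strings"
)

// missingTable80VNIDs is a hypothetical helper: given the NetNamespace VNIDs
// and the output of "ovs-ofctl -O OpenFlow13 dump-flows br0 table=80", it
// returns the VNIDs that have no table 80 rule referencing them in reg1.
// (The SDN writes the VNID in hex, e.g. VNID 11026750 appears as reg1=0xa8413e.)
func missingTable80VNIDs(vnids []uint32, table80Dump string) []uint32 {
	var missing []uint32
	for _, vnid := range vnids {
		needle := fmt.Sprintf("reg1=0x%x", vnid)
		if !strings.Contains(table80Dump, needle) {
			missing = append(missing, vnid)
		}
	}
	return missing
}

func main() {
	dump := "cookie=0x0, table=80, priority=50,reg1=0xdb4361 actions=output:NXM_NX_REG2[]"
	// 11026750 (0xa8413e, openshift-ingress in comment 1) has no rule in this dump,
	// so it is reported as missing.
	fmt.Println(missingTable80VNIDs([]uint32{14369633, 11026750}, dump))
}
```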
If there are

`I0112 06:32:53.217868 1945 ovs.go:158] Error executing ovs-ofctl: ovs-ofctl: -:2: 0/0: invalid IP address`

errors in the sdn pod logs, then it is this bug, and there is no workaround other than downgrading to 4.6.8 or upgrading to 4.6.12 (which will be out very soon). (Restarting OVS forces the SDN to recompute all the OVS flows and will fix things at least momentarily, but there's no guarantee it won't just hit the bug again immediately after.)
4.6.12 is at this time available in the fast-4.6 channel.

```
{
  "metadata": {
    "description": "",
    "io.openshift.upgrades.graph.release.channels": "candidate-4.6,fast-4.6,candidate-4.7",
    "io.openshift.upgrades.graph.release.manifestref": "sha256:5c3618ab914eb66267b7c552a9b51c3018c3a8f8acf08ce1ff7ae4bfdd3a82bd",
    "url": "https://access.redhat.com/errata/RHSA-2021:0037"
  },
  "payload": "quay.io/openshift-release-dev/ocp-release@sha256:5c3618ab914eb66267b7c552a9b51c3018c3a8f8acf08ce1ff7ae4bfdd3a82bd",
  "version": "4.6.12"
}
```
This bug targets 4.7 and, per the doc text, covers the moving-forward fix; it is not the bug to look at for 4.6. The issue discussed in this bug was fixed by reverting the breaking change in 4.6.12 [1] and 4.5.27 [2].

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1915007#c7
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1915008#c6
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
Scott asked for an impact statement in comment 4, but I don't think we ever got one. And the fix went out with 4.7's GA, so I don't think we ever blocked any edges on this bug series either. Removing UpgradeBlocker to get this out of our suspect queue [1].

[1]: https://github.com/openshift/enhancements/pull/475
(In reply to W. Trevor King from comment #38)

> Scott asked for an impact statement in comment 4, but I don't think we ever
> got one.

comment 9

> And the fix went out with 4.7's GA, so I don't think we ever
> blocked any edges on this bug series either.

No 4.7 edge was blocked, but 4.5 -> 4.6.9 was blocked until 4.6.12 went out with the bug reverted.
Ah, thanks. Those 4.5 -> 4.6.9 edges were pulled in [1], which I see was mentioned in comment 5.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/603