Bug 1886127 - 4.5 clusters should handle systemd openvswitch from a 4.6 downgrade
Summary: 4.5 clusters should handle systemd openvswitch from a 4.6 downgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.5.z
Assignee: Casey Callendrello
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On: 1885848 1886148
Blocks: 1914958
 
Reported: 2020-10-07 17:15 UTC by Casey Callendrello
Modified: 2021-01-11 15:21 UTC
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-26 15:11:50 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-network-operator pull 833 0 None closed Bug 1886127: Handle downgrades from 4.6 to 4.5 2021-02-11 19:52:35 UTC
Github openshift cluster-network-operator pull 837 0 None closed Bug 1886127: [4.5] sdn-ovs: fix liveness probe for downgrade case 2021-02-11 19:52:35 UTC
Red Hat Product Errata RHBA-2020:4268 0 None None None 2020-10-26 15:12:17 UTC

Description Casey Callendrello 2020-10-07 17:15:47 UTC
This bug was initially created as a copy of Bug #1885848

I am copying this bug because: 



Description of problem:

Downgrade (4.6.0-0.nightly-2020-10-05-234751 -> 4.5.0-0.nightly-2020-10-05-204452) is stuck on the network operator.

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-10-05-234751
4.5.0-0.nightly-2020-10-05-204452

How reproducible:
Always

Steps to Reproduce:
1. Install a 4.6 cluster
2. Downgrade the cluster version from 4.6 to 4.5

Actual results:

The downgrade is stuck on the network operator.

Expected results:
The downgrade should succeed.

Additional info:

$ oc logs ovs-6xdx4
ovsdb-server: /var/run/openvswitch/ovsdb-server.pid: pidfile check failed (No such process), aborting
Starting ovsdb-server ... failed!
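
For reference, the downgrade handling tracked in PR 833 ("Handle downgrades from 4.6 to 4.5") makes the 4.5 ovs DaemonSet detect that OVS is already managed by systemd on the host (as a 4.6 node sets it up) instead of trying to start a second ovsdb-server, which is what produces the pidfile failure above. A minimal sketch of that guard, assuming the /host/var/run/ovs-config-executed marker visible in the pod spec below is the signal left by the 4.6 host configuration (the actual entrypoint is longer):

#!/bin/bash
# Hypothetical, condensed version of the DaemonSet entrypoint; the real
# script (shown in the pod description below) also fixes ownership, saves
# and restores flows, and tunes thread counts.
set -euo pipefail

if [ -f /host/var/run/ovs-config-executed ]; then
  # OVS is managed by systemd on the host (left over from 4.6): do not start
  # a containerized ovsdb-server/ovs-vswitchd, just surface the host logs.
  echo "openvswitch is running in systemd"
  exec tail -F /host/var/log/openvswitch/ovs-vswitchd.log \
               /host/var/log/openvswitch/ovsdb-server.log
fi

# Otherwise start OVS inside the container as before.
/usr/share/openvswitch/scripts/ovs-ctl start --system-id=random --no-monitor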

Comment 4 zhaozhanqi 2020-10-13 08:00:30 UTC
The ovs pod crashed when downgrading from 4.6.0-0.nightly-2020-10-12-223649 to 4.5.0-0.nightly-2020-10-10-013307.

$ oc get pod -n openshift-sdn
NAME                   READY   STATUS             RESTARTS   AGE
ovs-4l7d8              1/1     Running            0          111m
ovs-9k6qp              1/1     Running            0          105m
ovs-9l9kc              1/1     Running            0          105m
ovs-dcv72              0/1     CrashLoopBackOff   11         29m
ovs-ms9d9              1/1     Running            0          111m
ovs-qg4k7              1/1     Running            0          111m
sdn-2fk22              1/1     Running            0          29m
sdn-4qq2v              1/1     Running            0          29m
sdn-5gvbh              1/1     Running            0          28m
sdn-5wd6b              1/1     Running            0          29m
sdn-controller-6c2j9   1/1     Running            0          29m
sdn-controller-g74ll   1/1     Running            0          29m
sdn-controller-z9dmk   1/1     Running            0          29m
sdn-l2kmw              1/1     Running            0          29m
sdn-n57x8              1/1     Running            0          28m


$ oc logs ovs-dcv72 -n openshift-sdn
openvswitch is running in systemd
rm: cannot remove '/var/run/openvswitch/flows.sh': No such file or directory
==> /host/var/log/openvswitch/ovs-vswitchd.log <==
2020-10-13T07:48:36.295Z|00300|connmgr|INFO|br0<->unix#1033: 2 flow_mods in the last 0 s (2 deletes)
2020-10-13T07:48:38.176Z|00301|bridge|INFO|bridge br0: added interface veth00c406f2 on port 36
2020-10-13T07:48:38.221Z|00302|connmgr|INFO|br0<->unix#1036: 5 flow_mods in the last 0 s (5 adds)
2020-10-13T07:48:38.256Z|00303|connmgr|INFO|br0<->unix#1039: 2 flow_mods in the last 0 s (2 deletes)
2020-10-13T07:49:37.697Z|00304|connmgr|INFO|br0<->unix#1048: 2 flow_mods in the last 0 s (2 deletes)
2020-10-13T07:49:37.723Z|00305|connmgr|INFO|br0<->unix#1051: 4 flow_mods in the last 0 s (4 deletes)
2020-10-13T07:49:37.742Z|00306|bridge|INFO|bridge br0: deleted interface veth32348243 on port 33
2020-10-13T07:49:47.016Z|00307|connmgr|INFO|br0<->unix#1054: 2 flow_mods in the last 0 s (2 deletes)
2020-10-13T07:49:47.043Z|00308|connmgr|INFO|br0<->unix#1057: 4 flow_mods in the last 0 s (4 deletes)
2020-10-13T07:49:47.063Z|00309|bridge|INFO|bridge br0: deleted interface vethb1890334 on port 34

==> /host/var/log/openvswitch/ovsdb-server.log <==
2020-10-13T06:11:45.485Z|00021|jsonrpc|WARN|unix#49: send error: Broken pipe
2020-10-13T06:11:45.485Z|00022|reconnect|WARN|unix#49: connection dropped (Broken pipe)
2020-10-13T06:11:45.568Z|00023|jsonrpc|WARN|unix#51: send error: Broken pipe
2020-10-13T06:11:45.568Z|00024|reconnect|WARN|unix#51: connection dropped (Broken pipe)
2020-10-13T06:27:10.031Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovsdb-server.log
2020-10-13T06:27:10.047Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.2
2020-10-13T06:27:20.059Z|00003|memory|INFO|6292 kB peak resident set size after 10.0 seconds
2020-10-13T06:27:20.059Z|00004|memory|INFO|cells:358 monitors:3 sessions:2
2020-10-13T06:27:22.865Z|00005|jsonrpc|WARN|unix#22: send error: Broken pipe
2020-10-13T06:27:22.865Z|00006|reconnect|WARN|unix#22: connection dropped (Broken pipe)


$ oc describe pod ovs-dcv72 -n openshift-sdn
Name:                 ovs-dcv72
Namespace:            openshift-sdn
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 ip-10-0-136-200.us-east-2.compute.internal/10.0.136.200
Start Time:           Tue, 13 Oct 2020 15:27:28 +0800
Labels:               app=ovs
                      component=network
                      controller-revision-hash=774dd84995
                      openshift.io/component=network
                      pod-template-generation=2
                      type=infra
Annotations:          <none>
Status:               Running
IP:                   10.0.136.200
IPs:
  IP:           10.0.136.200
Controlled By:  DaemonSet/ovs
Containers:
  openvswitch:
    Container ID:  cri-o://e6d53f5a2b387e3d0e9d0c5924477970c787e9d80f870b163c1df4f095db29f3
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:94012d3b73f7c59f93c8fb04eb85d25b85437b3eea72765166253d6ba79b8a34
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:94012d3b73f7c59f93c8fb04eb85d25b85437b3eea72765166253d6ba79b8a34
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      #!/bin/bash
      set -euo pipefail
      chown -R openvswitch:openvswitch /var/run/openvswitch
      chown -R openvswitch:openvswitch /etc/openvswitch
      
      if [ -f /host/var/run/ovs-config-executed ]; then
        echo "openvswitch is running in systemd"
        # Don't need to worry about restoring flows; this can only change if we've rebooted
        rm /var/run/openvswitch/flows.sh || true
        exec tail -F /host/var/log/openvswitch/ovs-vswitchd.log /host/var/log/openvswitch/ovsdb-server.log
        # executes forever
      fi
      
      # if another process is listening on the cni-server socket, wait until it exits
      retries=0
      while true; do
        if /usr/share/openvswitch/scripts/ovs-ctl status &>/dev/null; then
          echo "warning: Another process is currently managing OVS, waiting 15s ..." 2>&1
          sleep 15 & wait
          (( retries += 1 ))
        else
          break
        fi
        if [[ "${retries}" -gt 40 ]]; then
          echo "error: Another process is currently managing OVS, exiting" 2>&1
          exit 1
        fi
      done
      
      function quit {
          # Save the flows
          echo "$(date -u "+%Y-%m-%d %H:%M:%S") info: Saving flows ..." 2>&1
          bridges=$(ovs-vsctl -- --real list-br)
          TMPDIR=/var/run/openvswitch /usr/share/openvswitch/scripts/ovs-save save-flows $bridges > /var/run/openvswitch/flows.sh
          echo "$(date -u "+%Y-%m-%d %H:%M:%S") info: Saved flows" 2>&1
      
          # Don't allow ovs-vswitchd to clear datapath flows on exit
          kill -9 $(cat /var/run/openvswitch/ovs-vswitchd.pid 2>/dev/null) 2>/dev/null || true
          kill $(cat /var/run/openvswitch/ovsdb-server.pid 2>/dev/null) 2>/dev/null || true
          exit 0
      }
      trap quit SIGTERM
      
      # launch OVS
      # Start the ovsdb so that we can prep it before we start the ovs-vswitchd
      /usr/share/openvswitch/scripts/ovs-ctl start --ovs-user=openvswitch:openvswitch --no-ovs-vswitchd --system-id=random --no-monitor
      
      # Set the flow-restore-wait to true so ovs-vswitchd will wait till flows are restored
      ovs-vsctl --no-wait set Open_vSwitch . other_config:flow-restore-wait=true
      
      # Restrict the number of pthreads ovs-vswitchd creates to reduce the
      # amount of RSS it uses on hosts with many cores
      # https://bugzilla.redhat.com/show_bug.cgi?id=1571379
      # https://bugzilla.redhat.com/show_bug.cgi?id=1572797
      if [[ `nproc` -gt 12 ]]; then
          ovs-vsctl --no-wait set Open_vSwitch . other_config:n-revalidator-threads=4
          ovs-vsctl --no-wait set Open_vSwitch . other_config:n-handler-threads=10
      fi
      
      # And finally start the ovs-vswitchd now the DB is prepped
      /usr/share/openvswitch/scripts/ovs-ctl start --ovs-user=openvswitch:openvswitch --no-ovsdb-server --system-id=random --no-monitor
      
      # Load any flows that we saved
      echo "$(date -u "+%Y-%m-%d %H:%M:%S") info: Loading previous flows ..." 2>&1
      if [[ -f /var/run/openvswitch/flows.sh ]]; then
         echo "$(date -u "+%Y-%m-%d %H:%M:%S") info: Adding br0 if it doesn't exist ..." 2>&1
         /usr/bin/ovs-vsctl --may-exist add-br br0 -- set Bridge br0 fail_mode=secure protocols=OpenFlow13
         echo "$(date -u "+%Y-%m-%d %H:%M:%S") info: Created br0, now adding flows ..." 2>&1
         mv /var/run/openvswitch/flows.sh /var/run/openvswitch/flows-old.sh
         sh -x /var/run/openvswitch/flows-old.sh
         echo "$(date -u "+%Y-%m-%d %H:%M:%S") info: Done restoring the existing flows ..." 2>&1
         rm /var/run/openvswitch/flows-old.sh
      fi
      
      echo "$(date -u "+%Y-%m-%d %H:%M:%S") info: Remove other config ..." 2>&1
      ovs-vsctl --no-wait --if-exists remove Open_vSwitch . other_config flow-restore-wait=true
      echo "$(date -u "+%Y-%m-%d %H:%M:%S") info: Removed other config ..." 2>&1
      
      tail -F --pid=$(cat /var/run/openvswitch/ovs-vswitchd.pid) /var/log/openvswitch/ovs-vswitchd.log &
      tail -F --pid=$(cat /var/run/openvswitch/ovsdb-server.pid) /var/log/openvswitch/ovsdb-server.log &
      wait
      
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   penvswitch is running in systemd
rm: cannot remove '/var/run/openvswitch/flows.sh': No such file or directory
==> /host/var/log/openvswitch/ovs-vswitchd.log <==
2020-10-13T07:48:36.295Z|00300|connmgr|INFO|br0<->unix#1033: 2 flow_mods in the last 0 s (2 deletes)
2020-10-13T07:48:38.176Z|00301|bridge|INFO|bridge br0: added interface veth00c406f2 on port 36
2020-10-13T07:48:38.221Z|00302|connmgr|INFO|br0<->unix#1036: 5 flow_mods in the last 0 s (5 adds)
2020-10-13T07:48:38.256Z|00303|connmgr|INFO|br0<->unix#1039: 2 flow_mods in the last 0 s (2 deletes)
2020-10-13T07:49:37.697Z|00304|connmgr|INFO|br0<->unix#1048: 2 flow_mods in the last 0 s (2 deletes)
2020-10-13T07:49:37.723Z|00305|connmgr|INFO|br0<->unix#1051: 4 flow_mods in the last 0 s (4 deletes)
2020-10-13T07:49:37.742Z|00306|bridge|INFO|bridge br0: deleted interface veth32348243 on port 33
2020-10-13T07:49:47.016Z|00307|connmgr|INFO|br0<->unix#1054: 2 flow_mods in the last 0 s (2 deletes)
2020-10-13T07:49:47.043Z|00308|connmgr|INFO|br0<->unix#1057: 4 flow_mods in the last 0 s (4 deletes)
2020-10-13T07:49:47.063Z|00309|bridge|INFO|bridge br0: deleted interface vethb1890334 on port 34

==> /host/var/log/openvswitch/ovsdb-server.log <==
2020-10-13T06:11:45.485Z|00021|jsonrpc|WARN|unix#49: send error: Broken pipe
2020-10-13T06:11:45.485Z|00022|reconnect|WARN|unix#49: connection dropped (Broken pipe)
2020-10-13T06:11:45.568Z|00023|jsonrpc|WARN|unix#51: send error: Broken pipe
2020-10-13T06:11:45.568Z|00024|reconnect|WARN|unix#51: connection dropped (Broken pipe)
2020-10-13T06:27:10.031Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovsdb-server.log
2020-10-13T06:27:10.047Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.13.2
2020-10-13T06:27:20.059Z|00003|memory|INFO|6292 kB peak resident set size after 10.0 seconds
2020-10-13T06:27:20.059Z|00004|memory|INFO|cells:358 monitors:3 sessions:2
2020-10-13T06:27:22.865Z|00005|jsonrpc|WARN|unix#22: send error: Broken pipe
2020-10-13T06:27:22.865Z|00006|reconnect|WARN|unix#22: connection dropped (Broken pipe)

      Exit Code:    137
      Started:      Tue, 13 Oct 2020 15:52:13 +0800
      Finished:     Tue, 13 Oct 2020 15:53:08 +0800
    Ready:          False
    Restart Count:  11
    Requests:
      cpu:     100m
      memory:  400Mi
    Liveness:  exec [/bin/bash -c #!/bin/bash
/usr/bin/ovs-appctl -T 5 ofproto/list > /dev/null &&
/usr/bin/ovs-vsctl -t 5 show > /dev/null &&
if /usr/bin/ovs-vsctl -t 5 br-exists br0; then /usr/bin/ovs-ofctl -t 5 -O OpenFlow13 probe br0; else true; fi
] delay=15s timeout=21s period=5s #success=1 #failure=3
    Readiness:  exec [/bin/bash -c #!/bin/bash
/usr/share/openvswitch/scripts/ovs-ctl status > /dev/null &&
/usr/bin/ovs-appctl -T 5 ofproto/list > /dev/null &&
/usr/bin/ovs-vsctl -t 5 show > /dev/null
] delay=15s timeout=11s period=5s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/openvswitch from host-config-openvswitch (rw)
      /host from host-slash (ro)
      /lib/modules from host-modules (ro)
      /run/openvswitch from host-run-ovs (rw)
      /sys from host-sys (ro)
      /var/run/openvswitch from host-run-ovs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from sdn-token-4v95v (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True
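
Exit code 137 together with the short container lifetime (started 15:52:13, finished 15:53:08) suggests the kubelet is killing the container on failed liveness probes: in the systemd-managed case the container only tails host logs, while the probe still runs checks that assume the pod itself owns OVS. PR 837 ("sdn-ovs: fix liveness probe for downgrade case") addresses this. A minimal sketch of a downgrade-aware liveness check, assuming the same ovs-config-executed marker; the shipped probe may be structured differently:

#!/bin/bash
# Hypothetical downgrade-aware liveness check.
# If OVS is managed by systemd on the host (4.6 leftover), the container is
# only tailing logs, so report healthy and leave OVS health to the host units.
if [ -f /host/var/run/ovs-config-executed ]; then
  exit 0
fi

# Otherwise run the original containerized-OVS checks from the pod spec above.
/usr/bin/ovs-appctl -T 5 ofproto/list > /dev/null &&
/usr/bin/ovs-vsctl -t 5 show > /dev/null &&
if /usr/bin/ovs-vsctl -t 5 br-exists br0; then
  /usr/bin/ovs-ofctl -t 5 -O OpenFlow13 probe br0
else
  true
fi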

Comment 5 Wei Duan 2020-10-13 08:28:47 UTC
I hit the same issue when downgrading from 4.6 to the latest 4.5 release while verifying a storage downgrade issue (the storage part passed).

Comment 6 Casey Callendrello 2020-10-13 10:03:37 UTC
Aha, I suspect I see the issue. Do you have a cluster in this state for me to test something? Thanks!

Comment 16 errata-xmlrpc 2020-10-26 15:11:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.16 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4268

