Bug 1913582 - [Migration]SDN to OVN migration stucks on MCO for rhel worker
Summary: [Migration]SDN to OVN migration stucks on MCO for rhel worker
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
: 4.7.0
Assignee: Kirsten Garrison
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-01-07 07:50 UTC by huirwang
Modified: 2021-02-24 15:51 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:50:56 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 2332 0 None closed Bug 1913582: fix IsLikeTraditionalRhel7() to account for minor versions 2021-02-01 23:52:59 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:51:11 UTC

Description huirwang 2021-01-07 07:50:40 UTC
Description of problem:
SDN to OVN migration stucks on MCO for rhel worker

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2021-01-06-133533


Steps to Reproduce:
1. Setup IPI vsphere env with SDN network type, and scale up 2 rhel workers.
2. Follow below procedure to do migration.
1) oc annotate Network.operator.openshift.io cluster \
  'networkoperator.openshift.io/network-migration'=""
 
2) 
oc patch MachineConfigPool master --type='merge' --patch \
  '{ "spec": { "paused": true } }'
 oc patch MachineConfigPool worker --type='merge' --patch \
  '{ "spec":{ "paused" :true } }'

3)
oc patch Network.config.openshift.io cluster \
  --type='merge' --patch '{ "spec": { "networkType": "OVNKubernetes" } }'

4) Wait for all the multus pods recreated

5) Manually restart all the nodes.

6) Enable MCO


3.

Actual results:
MCO stucks on rhel worker.
 oc get nodes
NAME                                 STATUS                     ROLES    AGE     VERSION
huirwang-vs0107-6dp7q-master-0       Ready                      master   4h22m   v1.20.0+b1e9f0d
huirwang-vs0107-6dp7q-master-1       Ready                      master   4h22m   v1.20.0+b1e9f0d
huirwang-vs0107-6dp7q-master-2       Ready                      master   4h22m   v1.20.0+b1e9f0d
huirwang-vs0107-6dp7q-rhel-0         Ready,SchedulingDisabled   worker   3h9m    v1.20.0+bdb9571
huirwang-vs0107-6dp7q-rhel-1         Ready                      worker   3h9m    v1.20.0+bdb9571
huirwang-vs0107-6dp7q-worker-9zp8j   Ready                      worker   4h11m   v1.20.0+b1e9f0d
huirwang-vs0107-6dp7q-worker-mvkmx   Ready                      worker   4h12m   v1.20.0+b1e9f0d


 oc logs machine-config-daemon-fbfxf  -n openshift-machine-config-operator -c machine-config-daemon
..........
I0107 06:58:41.963900    3031 update.go:588] Checking Reconcilable for config rendered-worker-64a9749c22336a21aa5a446e4e8a7623 to rendered-worker-22482dd0ebf8110d0be0c26d7dd9d295
I0107 06:58:41.998392    3031 update.go:1844] Starting update from rendered-worker-64a9749c22336a21aa5a446e4e8a7623 to rendered-worker-22482dd0ebf8110d0be0c26d7dd9d295: &{osUpdate:false kargs:false fips:false passwd:false files:false units:true kernelType:false extensions:false}
I0107 06:58:42.015026    3031 update.go:1844] Update prepared; beginning drain
E0107 06:58:42.257578    3031 daemon.go:342] WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-5lv8r, openshift-dns/dns-default-z2qsp, openshift-image-registry/node-ca-tq7xg, openshift-ingress-canary/ingress-canary-c9tdt, openshift-machine-config-operator/machine-config-daemon-fbfxf, openshift-monitoring/node-exporter-9m6zj, openshift-multus/multus-xv4wx, openshift-multus/network-metrics-daemon-zdnpc, openshift-network-diagnostics/network-check-target-b9rqr, openshift-ovn-kubernetes/ovnkube-node-72rvj, openshift-ovn-kubernetes/ovs-node-hb5nm
I0107 06:58:42.257603    3031 update.go:1844] drain complete
I0107 06:58:42.258614    3031 update.go:237] Successful drain took 0.242665656 seconds
I0107 06:58:42.258633    3031 update.go:1207] Updating files
I0107 06:58:42.272297    3031 update.go:1556] Writing file "/etc/NetworkManager/dispatcher.d/40-mdns-hostname"
I0107 06:58:42.274270    3031 update.go:1556] Writing file "/etc/NetworkManager/conf.d/99-vsphere.conf"
I0107 06:58:42.275623    3031 update.go:1556] Writing file "/etc/NetworkManager/dispatcher.d/30-resolv-prepender"
I0107 06:58:42.280679    3031 update.go:1556] Writing file "/etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt"
I0107 06:58:42.281739    3031 update.go:1556] Writing file "/etc/tmpfiles.d/cleanup-cni.conf"
I0107 06:58:42.286332    3031 update.go:1556] Writing file "/etc/kubernetes/static-pod-resources/configmaps/cloud-config/ca-bundle.pem"
I0107 06:58:42.291080    3031 update.go:1556] Writing file "/usr/local/bin/configure-ovs.sh"
I0107 06:58:42.295256    3031 update.go:1556] Writing file "/etc/containers/storage.conf"
I0107 06:58:42.298481    3031 update.go:1556] Writing file "/etc/kubernetes/static-pod-resources/coredns/Corefile.tmpl"
I0107 06:58:42.303043    3031 update.go:1556] Writing file "/etc/kubernetes/manifests/coredns.yaml"
I0107 06:58:42.307687    3031 update.go:1556] Writing file "/etc/systemd/system/crio.service.d/20-stream-address.conf"
I0107 06:58:42.312552    3031 update.go:1556] Writing file "/etc/systemd/system.conf.d/10-default-env-godebug.conf"
I0107 06:58:42.318389    3031 update.go:1556] Writing file "/etc/modules-load.d/iptables.conf"
I0107 06:58:42.324349    3031 update.go:1556] Writing file "/etc/kubernetes/static-pod-resources/keepalived/keepalived.conf.tmpl"
I0107 06:58:42.329728    3031 update.go:1556] Writing file "/etc/kubernetes/manifests/keepalived.yaml"
I0107 06:58:42.335052    3031 update.go:1556] Writing file "/etc/tmpfiles.d/kni.conf"
I0107 06:58:42.339619    3031 update.go:1556] Writing file "/etc/kubernetes/kubelet-ca.crt"
I0107 06:58:42.344284    3031 update.go:1556] Writing file "/etc/systemd/system.conf.d/kubelet-cgroups.conf"
I0107 06:58:42.348759    3031 update.go:1556] Writing file "/etc/systemd/system/kubelet.service.d/20-logging.conf"
I0107 06:58:42.353795    3031 update.go:1556] Writing file "/etc/kubernetes/static-pod-resources/mdns/config.hcl.tmpl"
I0107 06:58:42.359312    3031 update.go:1556] Writing file "/etc/kubernetes/manifests/mdns-publisher.yaml"
I0107 06:58:42.364061    3031 update.go:1556] Writing file "/etc/NetworkManager/conf.d/sdn.conf"
I0107 06:58:42.368670    3031 update.go:1556] Writing file "/var/lib/kubelet/config.json"
I0107 06:58:42.373570    3031 update.go:1556] Writing file "/etc/kubernetes/ca.crt"
I0107 06:58:42.378364    3031 update.go:1556] Writing file "/etc/ssh/sshd_config.d/10-disable-ssh-key-dir.conf"
I0107 06:58:42.383725    3031 update.go:1556] Writing file "/etc/sysctl.d/forward.conf"
I0107 06:58:42.388857    3031 update.go:1556] Writing file "/etc/sysctl.d/inotify.conf"
I0107 06:58:42.393315    3031 update.go:1556] Writing file "/usr/local/sbin/set-valid-hostname.sh"
I0107 06:58:42.395075    3031 update.go:1556] Writing file "/etc/kubernetes/kubelet-plugins/volume/exec/.dummy"
I0107 06:58:42.399352    3031 update.go:1556] Writing file "/usr/local/bin/vsphere-hostname.sh"
I0107 06:58:42.404033    3031 update.go:1556] Writing file "/etc/containers/registries.conf"
I0107 06:58:42.405581    3031 update.go:1556] Writing file "/etc/crio/crio.conf.d/00-default"
I0107 06:58:42.410622    3031 update.go:1556] Writing file "/etc/containers/policy.json"
I0107 06:58:42.412186    3031 update.go:1556] Writing file "/etc/kubernetes/cloud.conf"
I0107 06:58:42.418568    3031 update.go:1556] Writing file "/etc/kubernetes/kubelet.conf"
I0107 06:58:42.424214    3031 update.go:1497] Writing systemd unit "etc-NetworkManager-system\\x2dconnections\\x2dmerged.mount"
I0107 06:58:42.429481    3031 update.go:1462] Writing systemd unit dropin "10-mco-default-env.conf"
I0107 06:58:42.524281    3031 update.go:1451] Preset systemd unit crio.service
I0107 06:58:42.524302    3031 update.go:1462] Writing systemd unit dropin "mco-disabled.conf"
I0107 06:58:42.531538    3031 update.go:1534] Could not reset unit preset for docker.socket, skipping. (Error msg: error running preset on unit: Failed to execute operation: No such file or directory
)
I0107 06:58:42.531558    3031 update.go:1462] Writing systemd unit dropin "10-mco-default-env.conf"
I0107 06:58:42.536308    3031 update.go:1497] Writing systemd unit "kubelet.service"
I0107 06:58:42.540880    3031 update.go:1497] Writing systemd unit "machine-config-daemon-firstboot.service"
I0107 06:58:42.545540    3031 update.go:1497] Writing systemd unit "machine-config-daemon-pull.service"
I0107 06:58:42.550628    3031 update.go:1497] Writing systemd unit "node-valid-hostname.service"
I0107 06:58:42.556061    3031 update.go:1497] Writing systemd unit "nodeip-configuration.service"
I0107 06:58:42.561882    3031 update.go:1497] Writing systemd unit "ovs-configuration.service"
I0107 06:58:42.566399    3031 update.go:1462] Writing systemd unit dropin "10-ovs-vswitchd-restart.conf"
I0107 06:58:42.662668    3031 update.go:1451] Preset systemd unit ovs-vswitchd.service
I0107 06:58:42.662688    3031 update.go:1462] Writing systemd unit dropin "10-ovsdb-restart.conf"
I0107 06:58:42.667465    3031 update.go:1462] Writing systemd unit dropin "10-mco-default-env.conf"
I0107 06:58:42.675951    3031 update.go:1534] Could not reset unit preset for pivot.service, skipping. (Error msg: error running preset on unit: Failed to execute operation: No such file or directory
)
I0107 06:58:42.675969    3031 update.go:1497] Writing systemd unit "vsphere-hostname.service"
I0107 06:58:42.681272    3031 update.go:1462] Writing systemd unit dropin "mco-disabled.conf"
I0107 06:58:42.689236    3031 update.go:1534] Could not reset unit preset for zincati.service, skipping. (Error msg: error running preset on unit: Failed to execute operation: No such file or directory
)
I0107 06:58:42.772740    3031 update.go:1429] Enabled systemd units: [etc-NetworkManager-system\x2dconnections\x2dmerged.mount kubelet.service machine-config-daemon-firstboot.service machine-config-daemon-pull.service node-valid-hostname.service nodeip-configuration.service openvswitch.service ovs-configuration.service ovsdb-server.service vsphere-hostname.service]
I0107 06:58:42.772766    3031 update.go:1280] Deleting stale data
I0107 06:58:42.786883    3031 update.go:1675] Writing SSHKeys at "/home/core/.ssh/authorized_keys"
I0107 06:58:42.829983    3031 update.go:1675] Writing SSHKeys at "/home/core/.ssh/authorized_keys"
I0107 06:58:42.834252    3031 update.go:1207] Updating files
I0107 06:58:42.848137    3031 update.go:1556] Writing file "/etc/NetworkManager/dispatcher.d/40-mdns-hostname"
I0107 06:58:42.854039    3031 update.go:1556] Writing file "/etc/NetworkManager/conf.d/99-vsphere.conf"
I0107 06:58:42.858559    3031 update.go:1556] Writing file "/etc/NetworkManager/dispatcher.d/30-resolv-prepender"
I0107 06:58:42.863409    3031 update.go:1556] Writing file "/etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt"
I0107 06:58:42.867458    3031 update.go:1556] Writing file "/etc/tmpfiles.d/cleanup-cni.conf"
I0107 06:58:42.872077    3031 update.go:1556] Writing file "/etc/kubernetes/static-pod-resources/configmaps/cloud-config/ca-bundle.pem"
I0107 06:58:42.876675    3031 update.go:1556] Writing file "/usr/local/bin/configure-ovs.sh"
I0107 06:58:42.882405    3031 update.go:1556] Writing file "/etc/containers/storage.conf"
I0107 06:58:42.887765    3031 update.go:1556] Writing file "/etc/kubernetes/static-pod-resources/coredns/Corefile.tmpl"
I0107 06:58:42.892995    3031 update.go:1556] Writing file "/etc/kubernetes/manifests/coredns.yaml"
I0107 06:58:42.897549    3031 update.go:1556] Writing file "/etc/systemd/system/crio.service.d/20-stream-address.conf"
I0107 06:58:42.901967    3031 update.go:1556] Writing file "/etc/systemd/system.conf.d/10-default-env-godebug.conf"
I0107 06:58:42.906497    3031 update.go:1556] Writing file "/etc/modules-load.d/iptables.conf"
I0107 06:58:42.911101    3031 update.go:1556] Writing file "/etc/kubernetes/static-pod-resources/keepalived/keepalived.conf.tmpl"
I0107 06:58:42.912756    3031 update.go:1556] Writing file "/etc/kubernetes/manifests/keepalived.yaml"
I0107 06:58:42.917556    3031 update.go:1556] Writing file "/etc/tmpfiles.d/kni.conf"
I0107 06:58:42.919097    3031 update.go:1556] Writing file "/etc/kubernetes/kubelet-ca.crt"
I0107 06:58:42.924322    3031 update.go:1556] Writing file "/etc/systemd/system.conf.d/kubelet-cgroups.conf"
I0107 06:58:42.929051    3031 update.go:1556] Writing file "/etc/systemd/system/kubelet.service.d/20-logging.conf"
I0107 06:58:42.930678    3031 update.go:1556] Writing file "/etc/kubernetes/static-pod-resources/mdns/config.hcl.tmpl"
I0107 06:58:42.936360    3031 update.go:1556] Writing file "/etc/kubernetes/manifests/mdns-publisher.yaml"
I0107 06:58:42.942025    3031 update.go:1556] Writing file "/etc/NetworkManager/conf.d/sdn.conf"
I0107 06:58:42.947964    3031 update.go:1556] Writing file "/var/lib/kubelet/config.json"
I0107 06:58:42.953785    3031 update.go:1556] Writing file "/etc/kubernetes/ca.crt"
I0107 06:58:42.955570    3031 update.go:1556] Writing file "/etc/ssh/sshd_config.d/10-disable-ssh-key-dir.conf"
I0107 06:58:42.960360    3031 update.go:1556] Writing file "/etc/sysctl.d/forward.conf"
I0107 06:58:42.965069    3031 update.go:1556] Writing file "/etc/sysctl.d/inotify.conf"
I0107 06:58:42.969643    3031 update.go:1556] Writing file "/usr/local/sbin/set-valid-hostname.sh"
I0107 06:58:42.975443    3031 update.go:1556] Writing file "/etc/kubernetes/kubelet-plugins/volume/exec/.dummy"
I0107 06:58:42.980194    3031 update.go:1556] Writing file "/usr/local/bin/vsphere-hostname.sh"
I0107 06:58:42.985379    3031 update.go:1556] Writing file "/etc/containers/registries.conf"
I0107 06:58:42.990322    3031 update.go:1556] Writing file "/etc/crio/crio.conf.d/00-default"
I0107 06:58:42.994921    3031 update.go:1556] Writing file "/etc/containers/policy.json"
I0107 06:58:42.999591    3031 update.go:1556] Writing file "/etc/kubernetes/cloud.conf"
I0107 06:58:43.004299    3031 update.go:1556] Writing file "/etc/kubernetes/kubelet.conf"
I0107 06:58:43.009986    3031 update.go:1497] Writing systemd unit "etc-NetworkManager-system\\x2dconnections\\x2dmerged.mount"
I0107 06:58:43.015281    3031 update.go:1462] Writing systemd unit dropin "10-mco-default-env.conf"
I0107 06:58:43.095593    3031 update.go:1451] Preset systemd unit crio.service
I0107 06:58:43.095615    3031 update.go:1462] Writing systemd unit dropin "mco-disabled.conf"
I0107 06:58:43.100454    3031 update.go:1534] Could not reset unit preset for docker.socket, skipping. (Error msg: error running preset on unit: Failed to execute operation: No such file or directory
)
I0107 06:58:43.100471    3031 update.go:1462] Writing systemd unit dropin "10-mco-default-env.conf"
I0107 06:58:43.105018    3031 update.go:1497] Writing systemd unit "kubelet.service"
I0107 06:58:43.106440    3031 update.go:1497] Writing systemd unit "machine-config-daemon-firstboot.service"
I0107 06:58:43.108059    3031 update.go:1497] Writing systemd unit "machine-config-daemon-pull.service"
I0107 06:58:43.113291    3031 update.go:1497] Writing systemd unit "node-valid-hostname.service"
I0107 06:58:43.118779    3031 update.go:1497] Writing systemd unit "nodeip-configuration.service"
I0107 06:58:43.124258    3031 update.go:1497] Writing systemd unit "ovs-configuration.service"
I0107 06:58:43.129414    3031 update.go:1462] Writing systemd unit dropin "10-ovs-vswitchd-restart.conf"
I0107 06:58:43.221929    3031 update.go:1451] Preset systemd unit ovs-vswitchd.service
I0107 06:58:43.221957    3031 update.go:1462] Writing systemd unit dropin "10-ovsdb-restart.conf"
I0107 06:58:43.223851    3031 update.go:1462] Writing systemd unit dropin "10-mco-default-env.conf"
I0107 06:58:43.231717    3031 update.go:1534] Could not reset unit preset for pivot.service, skipping. (Error msg: error running preset on unit: Failed to execute operation: No such file or directory
)
I0107 06:58:43.231737    3031 update.go:1497] Writing systemd unit "vsphere-hostname.service"
I0107 06:58:43.236564    3031 update.go:1462] Writing systemd unit dropin "mco-disabled.conf"
I0107 06:58:43.243636    3031 update.go:1534] Could not reset unit preset for zincati.service, skipping. (Error msg: error running preset on unit: Failed to execute operation: No such file or directory
)
I0107 06:58:43.334973    3031 update.go:1429] Enabled systemd units: [etc-NetworkManager-system\x2dconnections\x2dmerged.mount kubelet.service machine-config-daemon-firstboot.service machine-config-daemon-pull.service node-valid-hostname.service nodeip-configuration.service openvswitch.service ovs-configuration.service ovsdb-server.service vsphere-hostname.service]
I0107 06:58:43.334998    3031 update.go:1280] Deleting stale data
E0107 06:58:43.335049    3031 writer.go:135] Marking Degraded due to: failed to log pending config: logger: unrecognized option '--journald'

Usage:
 logger [options] [message]

Options:
 -T, --tcp             use TCP only
 -d, --udp             use UDP only
 -i, --id              log the process ID too
 -f, --file <file>     log the contents of this file
 -h, --help            display this help text and exit
 -S, --size <num>      maximum size for a single message (default 1024)
 -n, --server <name>   write to this remote syslog server
 -P, --port <port>     use this port for UDP or TCP connection
 -p, --priority <prio> mark given message with this priority
 -s, --stderr          output message to standard error as well
 -t, --tag <tag>       mark every line with this tag
 -u, --socket <socket> write to this Unix socket
 -V, --version         output version information and exit

: exit status 1

Expected results:
Shoud migrate to OVN successfully

Additional info:

Comment 3 Peng Liu 2021-01-07 09:01:50 UTC
Flag `--journald` is not supported in RHEL 7, so it looks like an MCO use this flag without checking the running operating system. 
https://github.com/openshift/machine-config-operator/blob/0e40c961cdb200f886c06e88a7673c8dcd33b5be/pkg/daemon/update.go#L1829

Assign to MCO team.

Comment 5 Steve Milner 2021-01-11 15:16:23 UTC
Good catch Peng Liu! There is some sensing code in the package at https://github.com/openshift/machine-config-operator/blob/master/pkg/daemon/daemon.go#L243-L252 which might make sense to be turned into a function and used.

Comment 6 Kirsten Garrison 2021-01-11 19:24:36 UTC
This doesn't quite make sense as it _would_ be using  storePendingStateLegacyLogger() (see: https://github.com/openshift/machine-config-operator/blob/e70033b04d31f63661d59f92535ae460b6119bbe/pkg/daemon/update.go#L1816) which is for when journald not being present... which is 2 lines before what Peng Liu linked in the if statement: https://github.com/openshift/machine-config-operator/blob/0e40c961cdb200f886c06e88a7673c8dcd33b5be/pkg/daemon/update.go#L1826


To be clear:
```
if !dn.loggerSupportsJournal {
    return dn.storePendingStateLegacyLogger(pending, isPending) <--- this is what would be used if logger doesn't support journal
}
logger := exec.Command("logger", "--journald") <- this wouldn't be used if logger doesn't support journal
```

Comment 7 Kirsten Garrison 2021-01-11 19:40:50 UTC
Will look a little further..

Comment 8 Kirsten Garrison 2021-01-11 20:28:53 UTC
So the problem actually lies elsewhere in the VersionId check:
https://github.com/openshift/machine-config-operator/blob/4a005cb1e426a4d844fad89bd88fef95e0ce0246/pkg/daemon/osrelease.go#L39

we are checking for "7" but 

$ rpmdev-extract redhat-release-7.5-0.14.el7.x86_64.rpm 
redhat-release-7.5-0.14.el7.x86_64/etc/issue
redhat-release-7.5-0.14.el7.x86_64/etc/issue.net
redhat-release-7.5-0.14.el7.x86_64/etc/pki/product-default
<snip>
$ egrep "VERSION_ID=" redhat-release-7.5-0.14.el7.x86_64/usr/lib/os-release 
VERSION_ID="7.5"

will fix

Comment 9 Kirsten Garrison 2021-01-11 21:48:13 UTC
Just to summarize the BZ: 

the VERSION_ID was 7.5 but our check IsLikeTraditionalRhel7() did not account for minor versions (only for "7" not for 7.x) and so returned false. This resulted in dn.loggerSupportsJournal being set to true (https://github.com/openshift/machine-config-operator/blob/8c7dab8a8e613e6a3a0533c6607c459d7fa480dd/pkg/daemon/daemon.go#L245) which is the relevant value when selecting which logger to use. So instead of using storePendingStateLegacyLogger() (for cases where journald isn't avail) we used normal logger resulting in the --journald flag erroneously being used for a rhel7 node.

Comment 14 errata-xmlrpc 2021-02-24 15:50:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633


Note You need to log in before you can comment on or make changes to this bug.