Description of problem:

It seems that during the upgrade of the network operator, the operator changes the ovnkube-config ConfigMap too early, causing the ovnkube-master pods to crashloop because the older pods do not accept the new "host-network-namespace" variable. From what I can tell, during the upgrade the operator changes the ConfigMap before it starts updating the daemon sets. The first daemon set it updates is ovnkube-node, which leaves a large gap in time before the operator updates the ovnkube-master daemon set, whose pods will accept the new variable.

$ omg get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             True        True          5m12s   Unable to apply 4.7.19: the update could not be applied

$ omg get co | grep -v "True False False"
NAME                  VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication        4.7.19    True        True          True       -3s
ingress               4.7.19    True        False         True       15m
kube-apiserver        4.7.19    True        True          True       2h52m
machine-config        4.6.26    False       False         True       2h49m
network               4.6.26    True        True          True       3h2m
openshift-apiserver   4.7.19    True        False         True       -1s

$ omg get pods -o wide
NAME                   READY   STATUS    RESTARTS   AGE     IP            NODE
ovnkube-master-9cwb2   4/6     Running   40         8d      10.230.0.9    ip-10-230-0-9.cluster.example.com    <--- Seems to be restarting and not completely ready
ovnkube-master-j9t5d   4/6     Running   41         8d      10.230.0.7    ip-10-230-0-7.cluster.example.com    <--- Seems to be restarting and not completely ready
ovnkube-master-z7nq8   5/6     Running   41         69d     10.230.0.11   ip-10-230-0-11.cluster.example.com   <--- Seems to be restarting and not completely ready
ovnkube-node-2f99b     3/3     Running   0          3h12m   10.230.0.7    ip-10-230-0-7.cluster.example.com
ovnkube-node-77smm     3/3     Running   0          14d     10.230.0.5    ip-10-230-0-5.cluster.example.com
ovnkube-node-bbflh     3/3     Running   0          3h6m    10.230.0.10   ip-10-230-0-10.cluster.example.com
ovnkube-node-bkpzh     3/3     Running   2          14d     10.230.0.6    ip-10-230-0-6.cluster.example.com
ovnkube-node-hggzp     3/3     Running   0          3h9m    10.230.0.4    ip-10-230-0-4.cluster.example.com
ovnkube-node-hh5w6     3/3     Running   0          69d     10.230.0.2    ip-10-230-0-2.cluster.example.com
ovnkube-node-j92sf     3/3     Running   1          69d     10.230.0.11   ip-10-230-0-11.cluster.example.com
ovnkube-node-l5j26     3/3     Running   0          69d     10.230.0.1    ip-10-230-0-1.cluster.example.com
ovnkube-node-mzqsd     2/3     Running   24         3h11m   10.230.0.3    ip-10-230-0-3.cluster.example.com    <--- Also not having a good time
ovnkube-node-qmlgn     2/3     Running   24         3h4m    10.230.0.9    ip-10-230-0-9.cluster.example.com    <--- Also not having a good time
ovnkube-node-w6cxv     3/3     Running   0          3h7m    10.230.0.8    ip-10-230-0-8.cluster.example.com

$ omg logs ovnkube-master-z7nq8 -c ovnkube-master -p
2021-07-29T17:15:04.743180681Z + [[ -f /env/_master ]]
2021-07-29T17:15:04.743180681Z + gateway_mode_flags=
2021-07-29T17:15:04.743248215Z + grep -q OVNKubernetes /etc/systemd/system/ovs-configuration.service
2021-07-29T17:15:04.744238205Z + '[' -f /host/var/run/ovs-config-executed ']'
2021-07-29T17:15:04.744268618Z + gateway_mode_flags='--gateway-mode local --gateway-interface br-ex'
2021-07-29T17:15:04.744595205Z ++ date '+%m%d %H:%M:%S.%N'
2021-07-29T17:15:04.746512592Z + echo 'I0729 17:15:04.746071717 - ovnkube-master - start nbctl daemon for caching'
2021-07-29T17:15:04.746523207Z I0729 17:15:04.746071717 - ovnkube-master - start nbctl daemon for caching
2021-07-29T17:15:04.746866455Z ++ ovn-nbctl --pidfile=/var/run/ovn/ovn-nbctl.pid --detach -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt --db ssl:10.230.0.11:9641,ssl:10.230.0.7:9641,ssl:10.230.0.9:9641
2021-07-29T17:15:04.778434481Z + export OVN_NB_DAEMON=/var/run/ovn/ovn-nbctl.13.ctl
2021-07-29T17:15:04.778434481Z + OVN_NB_DAEMON=/var/run/ovn/ovn-nbctl.13.ctl
2021-07-29T17:15:04.778460700Z + ln -sf /var/run/ovn/ovn-nbctl.13.ctl /var/run/ovn/
2021-07-29T17:15:04.779890164Z ln:
2021-07-29T17:15:04.779922602Z '/var/run/ovn/ovn-nbctl.13.ctl' and '/var/run/ovn/ovn-nbctl.13.ctl' are the same file
2021-07-29T17:15:04.779937974Z
2021-07-29T17:15:04.780133557Z + true
2021-07-29T17:15:04.780459651Z ++ date '+%m%d %H:%M:%S.%N'
2021-07-29T17:15:04.782171747Z + echo 'I0729 17:15:04.781766566 - ovnkube-master - start ovnkube --init-master ip-10-230-0-11.cluster.example.com'
2021-07-29T17:15:04.782181628Z I0729 17:15:04.781766566 - ovnkube-master - start ovnkube --init-master ip-10-230-0-11.cluster.example.com
2021-07-29T17:15:04.782292929Z + exec /usr/bin/ovnkube --init-master ip-10-230-0-11.cluster.example.com --config-file=/run/ovnkube-config/ovnkube.conf --ovn-empty-lb-events --loglevel 4 --metrics-bind-address 127.0.0.1:29102 --gateway-mode local --gateway-interface br-ex --sb-address ssl:10.230.0.11:9642,ssl:10.230.0.7:9642,ssl:10.230.0.9:9642 --sb-client-privkey /ovn-cert/tls.key --sb-client-cert /ovn-cert/tls.crt --sb-client-cacert /ovn-ca/ca-bundle.crt --sb-cert-common-name ovn --nb-address ssl:10.230.0.11:9641,ssl:10.230.0.7:9641,ssl:10.230.0.9:9641 --nb-client-privkey /ovn-cert/tls.key --nb-client-cert /ovn-cert/tls.crt --nb-client-cacert /ovn-ca/ca-bundle.crt --nbctl-daemon-mode --nb-cert-common-name ovn --enable-multicast
2021-07-29T17:15:04.791275721Z F0729 17:15:04.791226 1 ovnkube.go:130] failed to parse config file /run/ovnkube-config/ovnkube.conf: warning:
2021-07-29T17:15:04.791275721Z can't store data at section "kubernetes", variable "host-network-namespace"

I'm not sure what caused the ovnkube-master pods to pick up the new ConfigMap; I think a MachineConfig change occurred during the upgrade and the masters were restarted. Once we worked around this exception and all of the Pending pods were scheduled, we noticed that the nodes continued to reboot from updates. I have documented our workaround to get the upgrades to progress.

I believe we can prevent this issue by making sure the ConfigMap is changed only at the point where the daemon set for the ovnkube-master pods is also being changed.

Version-Release number of selected component (if applicable):
4.6

How reproducible:
Only reproduces in the customer environment

Steps to Reproduce:
N/A

Actual results:
The network operator fails to upgrade.

Expected results:
The network operator upgrades successfully.

Additional info:
The must-gather (the first one) for the attached case has the logs and shows the state of the cluster.
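For context, the variable the 4.6 binary rejects is a new entry in the "kubernetes" section of the shared ovnkube.conf rendered by the 4.7 CNO. A minimal excerpt of what that section might look like is below; the surrounding key and the namespace value are assumptions for illustration, not copied from the customer's ConfigMap:

# Hypothetical excerpt of /run/ovnkube-config/ovnkube.conf as rendered by the 4.7 CNO.
# host-network-namespace is the variable the 4.6 ovnkube binary cannot store.
[kubernetes]
apiserver=https://api.cluster.example.com:6443
host-network-namespace=openshift-host-network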
(In reply to Aniket Bhat from comment #1)
> @ngirard great analysis on the bug. I think we haven't seen this
> in our upgrade jobs, but I do understand the problem. If the master pods
> don't get restarted between the time the CNO updates the config map and when
> the ovnkube-master daemonset rolls out with the new image, we should be
> covered.
>
> I will try to figure out updating the config map closer to the daemonset
> roll out of the masters to narrow down the window of failure.

Yeah, actually we can't move the configmap update closer to the master rollouts. The ovnkube-node pods also use the same configmap, so if the nodes roll out before the CNO picks up the configmap (and since the CNO is level-triggered for reconciliation, this ordering is hard to achieve), then once the CNO applies the configmap the nodes will reboot a second time, which we don't want.
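To illustrate the level-triggered point, here is a toy sketch only (not the CNO's actual code): each reconcile pass re-renders the complete desired state for the target version and applies all of it, so there is no natural hook to hold back the ConfigMap write until just before the ovnkube-master daemonset update.

package main

import "fmt"

// Toy illustration of a level-triggered operator loop. Every pass re-renders
// the full desired state and applies whatever differs from the cluster, so the
// ConfigMap and both DaemonSets are updated in the same pass.

type object struct {
	kind, name, spec string
}

func renderDesiredState(version string) []object {
	return []object{
		{"ConfigMap", "ovnkube-config", "rendered for " + version},
		{"DaemonSet", "ovnkube-node", "image " + version},
		{"DaemonSet", "ovnkube-master", "image " + version},
	}
}

func apply(cluster map[string]object, o object) {
	key := o.kind + "/" + o.name
	if cur, ok := cluster[key]; !ok || cur != o {
		cluster[key] = o
		fmt.Println("applied", key, "->", o.spec)
	}
}

func main() {
	// The cluster starts reconciled at 4.6; a single reconcile toward 4.7.19
	// rewrites the ConfigMap and both DaemonSets together.
	cluster := map[string]object{}
	for _, o := range renderDesiredState("4.6.26") {
		apply(cluster, o)
	}
	for _, o := range renderDesiredState("4.7.19") {
		apply(cluster, o)
	}
}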
(In reply to cstabler from comment #4)
> What do you think about making ovnkube-(node&master) more resilient against
> unknown fields in the configmap?

If this configmap cannot be manipulated by users, i.e. it's not user facing (which I think it isn't, since the CNO would reconcile any manual changes), we should be good. The same goes for the upstream scenario: if we don't expose the knobs to users and it's an internal thing, I don't mind us silencing/ignoring unknown fields. I just want to call out that it is poor API behavior (again, since it's not user facing we can get away with this); I don't want folks to supply values and be surprised that their changes aren't taking effect. Apart from the gateway-mode-config overrides that we allow users to make, which aren't passed directly to ovn-k but are parsed straight into the exec commands, I don't think we allow changing the configmap values. So we should be good to make this change from the OCP perspective; let's make sure upstream is fine as well.
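For what it's worth, the fatal message in the master log ("warning: can't store data at section ..., variable ...") is the style of warning produced by gopkg.in/gcfg.v1, so one way to tolerate unknown fields is to filter warnings out of the parse result and fail only on fatal errors. A minimal sketch follows, assuming the configmap is parsed with gcfg; the struct fields and config string are illustrative, not the real ovn-kubernetes config types:

package main

import (
	"fmt"
	"log"

	gcfg "gopkg.in/gcfg.v1"
)

// Illustrative config struct; the real ovn-kubernetes types have many more fields.
type config struct {
	Kubernetes struct {
		APIServer string `gcfg:"apiserver"`
	}
}

// Config text containing a variable the struct does not know about, which is
// what an older binary sees after the CNO writes the newer ConfigMap.
const conf = `
[kubernetes]
apiserver=https://api.cluster.example.com:6443
host-network-namespace=openshift-host-network
`

func main() {
	var cfg config
	err := gcfg.ReadStringInto(&cfg, conf)
	// gcfg reports data for unknown sections/variables as warnings. FatalOnly
	// drops those warnings and keeps only fatal parse errors, so an older
	// binary no longer crashloops on fields added by a newer ConfigMap.
	if err := gcfg.FatalOnly(err); err != nil {
		log.Fatalf("failed to parse config: %v", err)
	}
	fmt.Println("apiserver:", cfg.Kubernetes.APIServer)
}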
Added QE test coverage: https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-46654
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056