Description of problem:
While performing an upgrade of a cri-o based installation, the upgrade failed because nodes did not become ready. Checking the logs showed that the cri-o package had not been updated at that point, which caused the failure.
Jun 01 04:20:45 qe-ghuang-master-etcd-1 atomic-openshift-node: E0601 04:20:45.939342 99814 remote_runtime.go:69] Version from runtime service failed: rpc error: code = Unimplemented desc = unknown service runtime.v1alpha2.RuntimeService
Jun 01 04:20:45 qe-ghuang-master-etcd-1 atomic-openshift-node: E0601 04:20:45.939431 99814 kuberuntime_manager.go:172] Get runtime version failed: rpc error: code = Unimplemented desc = unknown service runtime.v1alpha2.RuntimeService
Jun 01 04:20:45 qe-ghuang-master-etcd-1 atomic-openshift-node: F0601 04:20:45.939454 99814 server.go:233] failed to run Kubelet: failed to create kubelet: rpc error: code = Unimplemented desc = unknown service runtime.v1alpha2.RuntimeService
Jun 01 04:20:45 qe-ghuang-master-etcd-1 systemd: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Jun 01 04:20:45 qe-ghuang-master-etcd-1 systemd: Failed to start OpenShift Node.
Version-Release number of the following components:
Steps to Reproduce:
1. Spin up a 3.9 cluster with cri-o enabled
2. Upgrade to 3.10
Failed at task:
TASK [openshift_node : Wait for node to be ready] ******************************
FAILED - RETRYING: Wait for node to be ready (36 retries left).
The cri-o version was still the 3.9 release at this point:
# crio -version
crio version 1.9.12
Once the cri-o package was updated and the cri-o service restarted, the node became ready again.
I've reproduced this locally with a clean install. It looks to me like I had a systemd unit from a previous system container based install of cri-o lingering around. As soon as I removed that, cri-o started properly and the node came up as well.
We'll look at whether we need to clean up a system container based cri-o install, but since that was never officially supported I'm not sure we should consider this a blocker.
Let's verify that you don't have system container leftovers:
rm /etc/systemd/system/cri-o.service && systemctl daemon-reload
atomic containers delete cri-o
then re-run the installer
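The leftover check above can be sketched as a small script. This is a minimal sketch assuming a POSIX shell; the `check_leftover` helper name is mine, not part of the installer, and the unit path is the one from the cleanup commands above.

```shell
#!/bin/sh
# Sketch: detect a leftover systemd unit from an unsupported
# system-container based cri-o install before re-running the installer.

check_leftover() {
    # Print whether the given stale unit file is still present.
    if [ -e "$1" ]; then
        echo "leftover: $1 (rm it, then systemctl daemon-reload)"
        return 1
    fi
    echo "clean: $1"
    return 0
}

check_leftover /etc/systemd/system/cri-o.service
```

A non-zero return from `check_leftover` makes the script usable as a pre-flight gate in automation.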
https://github.com/openshift/openshift-ansible/pull/8612 cleans up a few issues I ran into while testing crio and 3.10 installs. These problems were introduced very recently due to the oreg_url refactoring.
Scott, this is an rpm cri-o installation. We won't test the cri-o system container installation, as it was not officially supported.
The issue is that we should upgrade the cri-o rpm package whenever the node is updated, because cri-o isn't compatible across minor versions of the kubelet.
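That compatibility rule can be expressed as a simple version check. This is an illustrative sketch only: the `minor_of` helper and the hard-coded version strings are mine, standing in for the output of `crio --version` and the post-upgrade kubelet version.

```shell
#!/bin/sh
# Sketch: cri-o tracks the kubelet's minor version (cri-o 1.9.x pairs with
# OpenShift 3.9 / Kubernetes 1.9, cri-o 1.10.x with 3.10), so a node upgrade
# must also upgrade the cri-o package.

minor_of() {
    # Reduce a version string like "1.9.12" to its "major.minor" prefix.
    echo "$1" | cut -d. -f1,2
}

crio_version="1.9.12"     # illustrative; e.g. taken from `crio --version`
kubelet_version="1.10.0"  # illustrative; the post-upgrade kubelet

if [ "$(minor_of "$crio_version")" != "$(minor_of "$kubelet_version")" ]; then
    echo "mismatch: crio $crio_version vs kubelet $kubelet_version"
else
    echo "ok"
fi
```

With the sample values above the check reports a mismatch, which is exactly the state the failed upgrade left the node in.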
Let me know if you still need something from me.
https://github.com/openshift/openshift-ansible/pull/8628 ensures that the cri-o package is updated when upgrading the node.
Verified in openshift-ansible-3.10.0-0.60.0
The rpm cri-o package is updated successfully.