Description of problem: https://docs.openshift.com/container-platform/latest/install_config/upgrading/os_upgrades.html describes how to apply OS updates on OCP nodes. Especially when the number of nodes is large, a manual approach is not feasible; this should be fully automated. There is already a playbook that does this automatically, which could perhaps serve as inspiration here: https://github.com/myllynen/openshift-automation-tools/blob/master/conf/install-os-updates.yml Thanks.
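For context, the drain/update/reboot cycle such automation needs can be sketched as a standalone play. This is a minimal, hedged sketch only; the group names ("nodes", "masters") and task details are assumptions for illustration, not taken from the linked playbook:

```yaml
---
# Sketch: rolling OS updates across OCP nodes, one node at a time.
# Assumes inventory groups "nodes" and "masters" (hypothetical names).
- hosts: nodes
  serial: 1
  tasks:
    - name: Mark node unschedulable
      command: oc adm manage-node {{ inventory_hostname }} --schedulable=false
      delegate_to: "{{ groups['masters'][0] }}"

    - name: Drain node
      command: oc adm drain {{ inventory_hostname }} --force --delete-local-data --ignore-daemonsets
      delegate_to: "{{ groups['masters'][0] }}"

    - name: Apply OS updates
      yum:
        name: '*'
        state: latest

    - name: Reboot node
      shell: sleep 2 && shutdown -r now "OS updates applied"
      async: 1
      poll: 0
      ignore_errors: true

    - name: Wait for node to come back
      wait_for_connection:
        delay: 60
        timeout: 300

    - name: Mark node schedulable again
      command: oc adm manage-node {{ inventory_hostname }} --schedulable=true
      delegate_to: "{{ groups['masters'][0] }}"
```

Running such a play with `serial: 1` keeps only one node out of service at a time, which is the same property the hook-based approach below provides during an upgrade.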
Since, as you mention, this problem is most pronounced for large installations, we're focusing our improvements in this area on "golden images", which allow efficient blue/green updates whether for the operating system or for OpenShift. This is the direction we're taking for OpenShift Online, and we will encourage large OCP clusters to use them as well once they're available.
Golden images aren't a suitable solution unless OpenShift is deployed on an IaaS; for physical hosts, blue/green isn't an option.
Upstream PR here: https://github.com/openshift/openshift-ansible/pull/6558
Node upgrade hooks are implemented in openshift-ansible-3.7.45-1, openshift-ansible-3.9.26-1, and all versions of 3.10. Documentation changes are pending: https://github.com/openshift/openshift-docs/pull/8552/files
Verified this bug with openshift-ansible-3.10.9-1.git.240.1c86105.el7.noarch. With the node pre-upgrade hook, we can run the OS upgrade after the node has been marked unschedulable and drained. With the node upgrade hook, which runs after the node is upgraded but before it is made schedulable again, we can also complete a server reboot. The hook definitions were added to the Ansible inventory used for the upgrade:

openshift_node_upgrade_pre_hook=/root/workspace/pre_node.yml
openshift_node_upgrade_hook=/root/workspace/node.yml

[root@gpei-preserve-ansible-slave ~]# cat /root/workspace/pre_node.yml
---
- name: Note the start of node OS upgrade
  debug:
    msg: "Node OS upgrade of {{ inventory_hostname }} is about to start"

- name: Upgrade the OS
  yum: name=* state=latest

- name: Note the end of node OS upgrade
  debug:
    msg: "OS upgrade of {{ inventory_hostname }} finished"

[root@gpei-preserve-ansible-slave ~]# cat /root/workspace/node.yml
---
- name: Note the reboot of node
  debug:
    msg: "Node {{ inventory_hostname }} is upgraded, going to be rebooted..."

- name: Restart server
  shell: sleep 2 && shutdown -r now "Ansible updates triggered"
  async: 1
  poll: 0
  become: true
  ignore_errors: true

- name: Waiting for the server to come back
  wait_for_connection:
    delay: 120
    timeout: 300

Ran a 3.9 -> 3.10 upgrade; both hooks executed successfully and the host OS upgrade completed.
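For anyone reproducing this: the hook variables above are ordinary inventory variables, so in a standard inventory they would normally live in the [OSEv3:vars] section. A minimal fragment, using the same paths as the verification above, might look like:

```ini
# Inventory fragment wiring the hook playbooks into a node upgrade run.
# The [OSEv3:vars] placement follows the usual openshift-ansible convention.
[OSEv3:vars]
openshift_node_upgrade_pre_hook=/root/workspace/pre_node.yml
openshift_node_upgrade_hook=/root/workspace/node.yml
```

The upgrade itself is then started with the regular upgrade playbook (for example, the v3_10 upgrade playbook shipped in the openshift-ansible package; the exact playbook path varies between releases, so check the installed package).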
PLAY [Drain and upgrade nodes] *************************************************

TASK [Gathering Facts] *********************************************************
ok: [ec2-54-89-92-174.compute-1.amazonaws.com]

TASK [Mark node unschedulable] *************************************************
changed: [ec2-54-89-92-174.compute-1.amazonaws.com -> ec2-54-242-151-46.compute-1.amazonaws.com] => {"attempts": 1, "changed": true, "failed": false, "results": {"cmd": "/usr/bin/oc adm manage-node ip-172-18-1-82.ec2.internal --schedulable=False", "nodes": [{"name": "ip-172-18-1-82.ec2.internal", "schedulable": false}], "results": "NAME STATUS ROLES AGE VERSION\nip-172-18-1-82.ec2.internal Ready,SchedulingDisabled compute 1h v1.9.1+a0ce1bc657\n", "returncode": 0}, "state": "present"}

TASK [Drain Node for Kubelet upgrade] ******************************************
changed: [ec2-54-89-92-174.compute-1.amazonaws.com -> ec2-54-242-151-46.compute-1.amazonaws.com] => {"attempts": 1, "changed": true, "cmd": ["oc", "adm", "drain", "ip-172-18-1-82.ec2.internal", "--config=/etc/origin/master/admin.kubeconfig", "--force", "--delete-local-data", "--ignore-daemonsets", "--timeout=0s"], "delta": "0:00:12.639060", "end": "2018-06-28 07:35:52.590467", "failed": false, ...
"node \"ip-172-18-1-82.ec2.internal\" drained"]}

TASK [debug] *******************************************************************
ok: [ec2-54-89-92-174.compute-1.amazonaws.com] => {
    "msg": "Running node pre-upgrade hook /root/workspace/pre_node.yml"
}

TASK [include_tasks] ***********************************************************
included: /root/workspace/pre_node.yml for ec2-54-89-92-174.compute-1.amazonaws.com

TASK [Note the start of node OS upgrade] ***************************************
ok: [ec2-54-89-92-174.compute-1.amazonaws.com] => {
    "msg": "Node OS upgrade of ec2-54-89-92-174.compute-1.amazonaws.com is about to start"
}

TASK [Upgrade the OS] **********************************************************
changed: [ec2-54-89-92-174.compute-1.amazonaws.com] => {"changed": true, "failed": false, "msg": "", "rc": 0, "results": ["Loaded plugins: amazon-id, search-disabled-repos\nResolving Dependencies\n--> Running transaction check\n---> Package NetworkManager.x86_64 1:1.10.2-13.el7 will be updated\n---> Package NetworkManager.x86_64 1:1.10.2-14.el7_5 will be an update\n---> Package NetworkManager-config-server.noarch 1:1.10.2-13.el7 will be updated\n---> Package NetworkManager-config-server.noarch 1:1.10.2-14.el7_5 will be an update\n---> Package NetworkManager-libnm.x86_64 1:1.10.2-13.el7 will be updated\n---> Package NetworkManager-libnm.x86_64 1:1.10.2-14.el7_5 will be an update\n---> Package NetworkManager-team.x86_64 1:1.10.2-13.el7 will be updated\n---> Package NetworkManager-team.x86_64 1:1.10.2-14.el7_5 will be an update\n---> Package NetworkManager-tui.x86_64 1:1.10.2-13.el7 will be updated\n---> Package NetworkManager-tui.x86_64 1:1.10.2-14.el7_5 will be an update\n---> Package atomic-openshift.x86_64 0:3.9.31-1.git.0.ef9737b.el7 will be updated\n ... python-urllib3.noarch 0:1.10.2-5.el7 \n\nComplete!\n"]}

TASK [debug] *******************************************************************
ok: [ec2-54-89-92-174.compute-1.amazonaws.com] => {
    "msg": "OS upgrade of ec2-54-89-92-174.compute-1.amazonaws.com finished"
}

...

TASK [openshift_node : Restart journald] ***************************************
skipping: [ec2-54-89-92-174.compute-1.amazonaws.com] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}

TASK [debug] *******************************************************************
ok: [ec2-54-89-92-174.compute-1.amazonaws.com] => {
    "msg": "Running node upgrade hook /root/workspace/node.yml"
}

TASK [include_tasks] ***********************************************************
included: /root/workspace/node.yml for ec2-54-89-92-174.compute-1.amazonaws.com

TASK [Note the reboot of node] *************************************************
ok: [ec2-54-89-92-174.compute-1.amazonaws.com] => {
    "msg": "Node ec2-54-89-92-174.compute-1.amazonaws.com is upgraded, going to be rebooted..."
}

TASK [Restart server] **********************************************************
changed: [ec2-54-89-92-174.compute-1.amazonaws.com] => {"ansible_job_id": "992188283596.45295", "changed": true, "failed": false, "finished": 0, "results_file": "/root/.ansible_async/992188283596.45295", "started": 1}

TASK [Waiting for the server to come back] *************************************
ok: [ec2-54-89-92-174.compute-1.amazonaws.com] => {"changed": false, "elapsed": 123, "failed": false}