Description of problem: https://docs.openshift.com/container-platform/latest/install_config/upgrading/os_upgrades.html describes how to apply OS updates on OCP nodes. Especially when the number of nodes is large, a manual approach is not feasible; this should be fully automated. There is already a playbook that does this automatically, which could perhaps serve as inspiration here: https://github.com/myllynen/openshift-automation-tools/blob/master/conf/install-os-updates.yml Thanks.
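For context, the drain/update/reboot cycle such automation needs can be sketched as a standalone play. This is a minimal, hedged sketch only; the group names ("nodes", "masters") and task details are assumptions for illustration, not taken from the linked playbook:

```yaml
---
# Sketch: rolling OS updates across OCP nodes, one node at a time.
# Assumes inventory groups "nodes" and "masters" (hypothetical names).
- hosts: nodes
  serial: 1
  tasks:
    - name: Mark node unschedulable
      command: oc adm manage-node {{ inventory_hostname }} --schedulable=false
      delegate_to: "{{ groups['masters'][0] }}"

    - name: Drain node
      command: oc adm drain {{ inventory_hostname }} --force --delete-local-data --ignore-daemonsets
      delegate_to: "{{ groups['masters'][0] }}"

    - name: Apply OS updates
      yum:
        name: '*'
        state: latest

    - name: Reboot node
      shell: sleep 2 && shutdown -r now "OS updates applied"
      async: 1
      poll: 0
      ignore_errors: true

    - name: Wait for node to come back
      wait_for_connection:
        delay: 60
        timeout: 300

    - name: Mark node schedulable again
      command: oc adm manage-node {{ inventory_hostname }} --schedulable=true
      delegate_to: "{{ groups['masters'][0] }}"
```

Running such a play with `serial: 1` keeps only one node out of service at a time, which is the same property the hook-based approach below provides during an upgrade.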
Since, as you mention, this problem is most pronounced for large installations, we're focusing our improvements in this area on "golden images", which allow efficient blue/green updates whether for the operating system or for OpenShift. This is the direction we're taking for OpenShift Online, and we will encourage large OCP clusters to use them as well once they're available.
Golden images aren't a suitable solution unless OpenShift is deployed on an IaaS; for physical hosts, blue/green isn't an option.
Upstream PR here: https://github.com/openshift/openshift-ansible/pull/6558
Node upgrade hooks are implemented in openshift-ansible-3.7.45-1, openshift-ansible-3.9.26-1, and all versions of 3.10. Documentation changes are pending: https://github.com/openshift/openshift-docs/pull/8552/files
Verified this bug with openshift-ansible-3.10.9-1.git.240.1c86105.el7.noarch. With the node pre-upgrade hook, we can run the OS upgrade after the node has been marked unschedulable and drained. With the node upgrade hook, which runs after the node is upgraded but before it is made schedulable again, we can also complete a server reboot. The hook definitions were added to the Ansible inventory used for the upgrade:

openshift_node_upgrade_pre_hook=/root/workspace/pre_node.yml
openshift_node_upgrade_hook=/root/workspace/node.yml

[root@gpei-preserve-ansible-slave ~]# cat /root/workspace/pre_node.yml
---
- name: Note the start of node OS upgrade
  debug:
    msg: "Node OS upgrade of {{ inventory_hostname }} is about to start"

- name: Upgrade the OS
  yum: name=* state=latest

- name: Note the end of node OS upgrade
  debug:
    msg: "OS upgrade of {{ inventory_hostname }} finished"

[root@gpei-preserve-ansible-slave ~]# cat /root/workspace/node.yml
---
- name: Note the reboot of node
  debug:
    msg: "Node {{ inventory_hostname }} is upgraded, going to be rebooted..."

- name: Restart server
  shell: sleep 2 && shutdown -r now "Ansible updates triggered"
  async: 1
  poll: 0
  become: true
  ignore_errors: true

- name: Waiting for the server to come back
  wait_for_connection:
    delay: 120
    timeout: 300

Ran a 3.9 -> 3.10 upgrade; both hooks executed successfully and the host OS upgrade completed.
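For anyone reproducing this: the hook variables above are ordinary inventory variables, so in a standard inventory they would normally live in the [OSEv3:vars] section. A minimal fragment, using the same paths as the verification above, might look like:

```ini
# Inventory fragment wiring the hook playbooks into a node upgrade run.
# The [OSEv3:vars] placement follows the usual openshift-ansible convention.
[OSEv3:vars]
openshift_node_upgrade_pre_hook=/root/workspace/pre_node.yml
openshift_node_upgrade_hook=/root/workspace/node.yml
```

The upgrade itself is then started with the regular upgrade playbook (for example, the v3_10 upgrade playbook shipped in the openshift-ansible package; the exact playbook path varies between releases, so check the installed package).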
PLAY [Drain and upgrade nodes] *************************************************

TASK [Gathering Facts] *********************************************************
ok: [ec2-54-89-92-174.compute-1.amazonaws.com]

TASK [Mark node unschedulable] *************************************************
changed: [ec2-54-89-92-174.compute-1.amazonaws.com -> ec2-54-242-151-46.compute-1.amazonaws.com] => {"attempts": 1, "changed": true, "failed": false, "results": {"cmd": "/usr/bin/oc adm manage-node ip-172-18-1-82.ec2.internal --schedulable=False", "nodes": [{"name": "ip-172-18-1-82.ec2.internal", "schedulable": false}], "results": "NAME STATUS ROLES AGE VERSION\nip-172-18-1-82.ec2.internal Ready,SchedulingDisabled compute 1h v1.9.1+a0ce1bc657\n", "returncode": 0}, "state": "present"}

TASK [Drain Node for Kubelet upgrade] ******************************************
changed: [ec2-54-89-92-174.compute-1.amazonaws.com -> ec2-54-242-151-46.compute-1.amazonaws.com] => {"attempts": 1, "changed": true, "cmd": ["oc", "adm", "drain", "ip-172-18-1-82.ec2.internal", "--config=/etc/origin/master/admin.kubeconfig", "--force", "--delete-local-data", "--ignore-daemonsets", "--timeout=0s"], "delta": "0:00:12.639060", "end": "2018-06-28 07:35:52.590467", "failed": false, ...
"node \"ip-172-18-1-82.ec2.internal\" drained"]}

TASK [debug] *******************************************************************
ok: [ec2-54-89-92-174.compute-1.amazonaws.com] => {
    "msg": "Running node pre-upgrade hook /root/workspace/pre_node.yml"
}

TASK [include_tasks] ***********************************************************
included: /root/workspace/pre_node.yml for ec2-54-89-92-174.compute-1.amazonaws.com

TASK [Note the start of node OS upgrade] ***************************************
ok: [ec2-54-89-92-174.compute-1.amazonaws.com] => {
    "msg": "Node OS upgrade of ec2-54-89-92-174.compute-1.amazonaws.com is about to start"
}

TASK [Upgrade the OS] **********************************************************
changed: [ec2-54-89-92-174.compute-1.amazonaws.com] => {"changed": true, "failed": false, "msg": "", "rc": 0, "results": ["Loaded plugins: amazon-id, search-disabled-repos\nResolving Dependencies\n--> Running transaction check\n---> Package NetworkManager.x86_64 1:1.10.2-13.el7 will be updated\n---> Package NetworkManager.x86_64 1:1.10.2-14.el7_5 will be an update\n---> Package NetworkManager-config-server.noarch 1:1.10.2-13.el7 will be updated\n---> Package NetworkManager-config-server.noarch 1:1.10.2-14.el7_5 will be an update\n---> Package NetworkManager-libnm.x86_64 1:1.10.2-13.el7 will be updated\n---> Package NetworkManager-libnm.x86_64 1:1.10.2-14.el7_5 will be an update\n---> Package NetworkManager-team.x86_64 1:1.10.2-13.el7 will be updated\n---> Package NetworkManager-team.x86_64 1:1.10.2-14.el7_5 will be an update\n---> Package NetworkManager-tui.x86_64 1:1.10.2-13.el7 will be updated\n---> Package NetworkManager-tui.x86_64 1:1.10.2-14.el7_5 will be an update\n---> Package atomic-openshift.x86_64 0:3.9.31-1.git.0.ef9737b.el7 will be updated\n ... python-urllib3.noarch 0:1.10.2-5.el7 \n\nComplete!\n"]}

TASK [debug] *******************************************************************
ok: [ec2-54-89-92-174.compute-1.amazonaws.com] => {
    "msg": "OS upgrade of ec2-54-89-92-174.compute-1.amazonaws.com finished"
}

...

TASK [openshift_node : Restart journald] ***************************************
skipping: [ec2-54-89-92-174.compute-1.amazonaws.com] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}

TASK [debug] *******************************************************************
ok: [ec2-54-89-92-174.compute-1.amazonaws.com] => {
    "msg": "Running node upgrade hook /root/workspace/node.yml"
}

TASK [include_tasks] ***********************************************************
included: /root/workspace/node.yml for ec2-54-89-92-174.compute-1.amazonaws.com

TASK [Note the reboot of node] *************************************************
ok: [ec2-54-89-92-174.compute-1.amazonaws.com] => {
    "msg": "Node ec2-54-89-92-174.compute-1.amazonaws.com is upgraded, going to be rebooted..."
}

TASK [Restart server] **********************************************************
changed: [ec2-54-89-92-174.compute-1.amazonaws.com] => {"ansible_job_id": "992188283596.45295", "changed": true, "failed": false, "finished": 0, "results_file": "/root/.ansible_async/992188283596.45295", "started": 1}

TASK [Waiting for the server to come back] *************************************
ok: [ec2-54-89-92-174.compute-1.amazonaws.com] => {"changed": false, "elapsed": 123, "failed": false}