Bug 1645164 - Upgrade to 3.11 failed - node is removed and can't be attached back to the cluster
Summary: Upgrade to 3.11 failed - node is removed and can't be attached back to the cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.11.z
Assignee: Michael Gugino
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-11-01 14:08 UTC by Vladislav Walek
Modified: 2022-03-13 15:56 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-10 09:04:10 UTC
Target Upstream Version:
Embargoed:


Attachments


Links:
Red Hat Product Errata RHBA-2019:0024 (last updated 2019-01-10 09:05:50 UTC)

Description Vladislav Walek 2018-11-01 14:08:22 UTC
Description of problem:

When doing an upgrade from 3.10 to 3.11 on Atomic Host, the upgrade failed on the task:

TASK [openshift_node : Install or Update node system container] *****************************************************************************************************************************************************************************
fatal: [master1]: FAILED! => {"changed": false, "msg": "Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?\n\n", "rc": 1}
        to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade_control_plane.retry

The docker service was stopped, so I restarted it manually.
Then I relaunched the playbook like this:

# ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade_control_plane.yml  --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade_control_plane.retry

but the atomic-openshift-node service on the first master is lost. The host was rebooted.

After a scale-up from 3.10 (as the upgrade failed on the first master), the playbook is failing.

The next attempt at the upgrade fails again.
Logs will be attached.

Version-Release number of the following components:
rpm -q openshift-ansible
rpm -q ansible
ansible --version

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
Please include the entire output from the last TASK line through the end of output if an error is generated

Expected results:

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 3 Scott Dodson 2018-11-01 14:50:16 UTC

*** This bug has been marked as a duplicate of bug 1641245 ***

Comment 4 Scott Dodson 2018-11-01 14:53:38 UTC
Oops not a dupe, just needs a 3.10 backport.

Comment 5 Scott Dodson 2018-11-01 14:54:34 UTC
Actually, the bug description says upgrade to 3.11, which would mean the version should be 3.11, not 3.10? Please confirm what your customer is doing.

Comment 6 Michael Gugino 2018-11-01 17:02:45 UTC
My hypothesis here is that the atomic command cleans up the old systemd unit file for the node service, then attempts to create a new one and fails due to docker being down.

Latest 3.11 would prevent this problem from occurring in the first place due to task re-ordering, but we don't have anything in place to fix it for clusters that are already broken.

Curiously, I'm not sure why the atomic command needs to talk to the docker daemon in the first place.  Perhaps a regression there?

Investigating work-around to restore node service unit.
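A quick way to confirm that on the first master (minimal checks, assuming the docker runtime and the default node unit name) would be:

# systemctl is-active docker
# systemctl status atomic-openshift-node.service

If docker reports inactive and the node unit is missing, it matches the scenario described above.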

Comment 7 Michael Gugino 2018-11-01 17:18:52 UTC
https://github.com/openshift/openshift-ansible/blob/5155ab9bb8bc0bec530cf52b0fbd00c6f1684be2/roles/lib_openshift/library/oc_atomic_container.py#L103-L105

The logic of the module is to remove any existing systemd service unit if no existing containers are found. This will be the case if docker is stopped: the module will think it is a fresh install and remove any existing systemd units.
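On a master that has already hit this, the result looks like the following (a sketch, assuming the default atomic-openshift-node unit name):

# systemctl restart atomic-openshift-node.service
Failed to restart atomic-openshift-node.service: Unit not found.
# ls -la /etc/systemd/system/ | grep atomic-openshift-node
(no output; the unit file has been removed)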

3.11 task ordering was fixed to prevent this condition: https://github.com/openshift/openshift-ansible/pull/10555

However, anyone who has already hit this will be affected, and the service unit will have been removed.

Unfortunately, all the config changes are in place for 3.11, so copying an existing systemd unit file from another host may be problematic. I'll get a patch out with a play that someone in this scenario can run ad hoc in order to rectify this condition.

Comment 8 Michael Gugino 2018-11-01 17:33:00 UTC
PR Created in master: https://github.com/openshift/openshift-ansible/pull/10579

To utilize this fix, first ensure the container runtime (either docker or crio) is running on the first master. Next, run 'playbooks/common/openshift-cluster/upgrades/v3_11/fix_first_master_node.yml' the same way as an upgrade playbook. This should only affect the first master.

If that playbook completes successfully, re-run upgrade playbook as normal.
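Put together, the recovery sequence would look roughly like this (playbook paths as shipped by openshift-ansible; start the runtime on the first master, then run the playbooks from the host where openshift-ansible is installed, adding -i <inventory> if you do not use the default inventory; use 'systemctl start crio' instead when the cluster uses CRI-O):

# systemctl start docker
# ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/v3_11/fix_first_master_node.yml
# ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade_control_plane.yml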

Comment 9 Vladislav Walek 2018-11-12 12:26:42 UTC
Hi,
the main issue is the upgrade from 3.10 to 3.11.
However, the node service was lost during that upgrade, and a scale-up on the old version still doesn't work.
Thx

Comment 13 liujia 2018-12-06 05:22:05 UTC
The 3.10-to-3.11 upgrade issue has been fixed and verified in bz1641245. This bug covers repairing a cluster that was already broken.

Version:
openshift-ansible-3.11.51-2.git.0.51c90a3.el7.noarch

Steps:
1. System-container install of OCP v3.10.83 on Atomic Hosts.
# oc version
oc v3.10.83
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-jliu-10-master-etcd-1:8443
openshift v3.10.83
kubernetes v1.10.0+b81c8f8

2. Upgrade the above OCP with the v3.11.16 installer, which still had the issue in bz1641245, without "system_images_registry" set in the hosts file, to get a broken cluster.

3. The upgrade failed at task [openshift_node : Install or Update node system container] as expected. Checked that the atomic-openshift-node.service (on the master node) could not be restarted because the systemd service unit was not available.
# systemctl restart atomic-openshift-node.service 
Failed to restart atomic-openshift-node.service: Unit not found.
# ls -la /etc/systemd/system/ |grep atomic

4. Ran the restore playbook to get the node service back; it succeeded.
ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/v3_11/fix_first_master_node.yml -v

# ls -la /etc/systemd/system|grep atomic
-rw-r--r--.  1 root root  589 Dec  6 03:19 atomic-openshift-node.service

# systemctl status atomic-openshift-node.service 
● atomic-openshift-node.service
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2018-12-06 03:22:47 UTC; 12s ago

# oc get node
NAME                                STATUS    ROLES     AGE       VERSION
qe-jliu-10-master-etcd-1            Ready     master    1h        v1.11.0+d4cacc0
qe-jliu-10-node-1                   Ready     compute   1h        v1.10.0+b81c8f8
qe-jliu-10-node-registry-router-1   Ready     <none>    1h        v1.10.0+b81c8f8

5. Re-ran the upgrade playbook with the latest installer, v3.11.51, to continue the OCP upgrade.

# oc version
oc v3.11.51
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-jliu-10-master-etcd-1:8443
openshift v3.11.51
kubernetes v1.11.0+d4cacc0

Verified the bug, and changed the target to v3.11 since the PR merged in v3.11.

Comment 16 errata-xmlrpc 2019-01-10 09:04:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0024

