1495135 – Upgrade failed due to can not find atomic-openshift-master-api service in non-ha containerized env

Bug 1495135 - Upgrade failed due to can not find atomic-openshift-master-api service in non-ha containerized env

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	3.7.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	3.7.0
Assignee:	Jan Chaloupka
QA Contact:	liujia
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-09-25 09:24 UTC by liujia
Modified:	2017-12-18 06:01 UTC (History)
CC List:	8 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-11-28 22:12:28 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2017:3188	0	normal	SHIPPED_LIVE	Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update	2017-11-29 02:34:54 UTC

Description liujia 2017-09-25 09:24:39 UTC

Description of problem:
Running the upgrade against a non-HA containerized environment fails because the installer tries to find the atomic-openshift-master-api service.


TASK [Ensure HA Master is running] *********************************************
fatal: [qe-master]: FAILED! => {
    "changed": false, 
    "failed": true
}

MSG:

Could not find the requested service atomic-openshift-master-api: host


# ls -la /etc/sysconfig/ |grep atomic
-rw-r--r--.  1 root root  304 Sep 25 03:39 atomic-openshift-master
-rw-r--r--.  1 root root   95 Sep 25 03:46 atomic-openshift-node
-rw-r--r--.  1 root root  141 Sep 25 03:47 atomic-openshift-node-dep


PLAY RECAP *********************************************************************
localhost                  : ok=11   changed=0    unreachable=0    failed=0
qe-etcd                    : ok=41   changed=4    unreachable=0    failed=0
qe-master                  : ok=90   changed=11   unreachable=0    failed=1
qe-node                    : ok=83   changed=10   unreachable=0    failed=0



Failure summary:

  1. Hosts:    qe-master
     Play:     Verify master processes
     Task:     Ensure HA Master is running
     Message:  Could not find the requested service atomic-openshift-master-api: host

Version-Release number of the following components:
ansible-2.3.2.0-2.el7.noarch
openshift-ansible-3.7.0-0.127.0.git.0.b9941e4.el7.noarch

How reproducible:
always 

Steps to Reproduce:
1. Install a non-HA containerized environment
2. Run the upgrade

Actual results:
Upgrade failed.

Expected results:
Upgrade succeeds.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 1 Jan Chaloupka 2017-09-27 11:52:02 UTC

I am not able to reproduce it. My inventory:
```ini
[OSEv3:children]
masters
nodes
etcd

[OSEv3:vars]
ansible_ssh_user = root
deployment_type = openshift-enterprise
openshift_deployment_type = openshift-enterprise
osm_use_cockpit = false
openshift_release = v3.7
openshift_docker_insecure_registries=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888
openshift_docker_additional_registries="brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888,registry.ops.openshift.com"
containerized=True

openshift_pkg_version=-3.6.173.0.37
openshift_version=3.6.173.0.39

[masters]
10.8.174.18 ansible_ssh_host=10.8.174.18

[nodes]
10.8.174.183 ansible_ssh_host=10.8.174.183

[etcd]
10.8.174.18 ansible_ssh_host=10.8.174.18

```

Can you share your inventory file?

Comment 2 Jan Chaloupka 2017-09-27 11:52:55 UTC

Both openshift_pkg_version and openshift_version are supposed to be commented out, they were used to deploy the v3.6 cluster.

Comment 3 Jan Chaloupka 2017-09-27 12:11:38 UTC

Please see https://github.com/openshift/openshift-ansible/pull/4832.

Citing Clayton:
"Native clustering is the default configuration mode, even when only one master is configured" [1]
"We don't support upgrade from non-HA to HA" [2]

All the changes are for OCP 3.7+ so the error message is expected. The only item to complete is to document this case.

[1] https://github.com/openshift/openshift-ansible/pull/4832#issue-244862534
[2] https://github.com/openshift/openshift-ansible/pull/4832#discussion_r130642101

Comment 4 liujia 2017-09-28 05:12:28 UTC

Always reproducible with v3.7.0-0.127.0.

# rpm -qa|grep openshift
openshift-ansible-filter-plugins-3.7.0-0.127.0.git.0.b9941e4.el7.noarch
openshift-ansible-playbooks-3.7.0-0.127.0.git.0.b9941e4.el7.noarch
atomic-openshift-clients-3.7.0-0.127.0.git.0.459b70b.el7.x86_64
openshift-ansible-docs-3.7.0-0.127.0.git.0.b9941e4.el7.noarch
openshift-ansible-lookup-plugins-3.7.0-0.127.0.git.0.b9941e4.el7.noarch
openshift-ansible-roles-3.7.0-0.127.0.git.0.b9941e4.el7.noarch
atomic-openshift-utils-3.7.0-0.127.0.git.0.b9941e4.el7.noarch
openshift-ansible-3.7.0-0.127.0.git.0.b9941e4.el7.noarch
openshift-ansible-callback-plugins-3.7.0-0.127.0.git.0.b9941e4.el7.noarch

Inventory and upgrade.log are attached.

Comment 7 liujia 2017-09-28 05:31:56 UTC

So far there has been no explicit statement that the installer will not support upgrading a non-HA containerized OCP from v3.6 to v3.7. All QE was told is that the single master service will be split into master-api.service and master-controller.service in 3.7, so the upgrade process may need not only to detect the old layout but also to migrate it in order to complete the split, as in point 2 of [1].

Documenting this case looks like a compromise rather than the best solution. In any case, it should be tracked as a bug until a final conclusion is reached.

[1] https://github.com/openshift/openshift-ansible/issues/4979

Comment 9 Jan Chaloupka 2017-09-29 08:50:16 UTC

Clayton, can you elaborate more on the issue and comment #7?

Comment 10 Clayton Coleman 2017-10-16 14:17:13 UTC

I would expect the upgrade to re-run the openshift-master systemd_units.yml task on each master node, which would convert the monolithic master process into api and controller units.

Comment 11 Jan Chaloupka 2017-10-23 13:27:21 UTC

Once the control plane check passes, the non-ha master is upgraded to ha without any problems. So only the "Ensure HA Master is running" tasks need to be modified so they check the non-ha service if available.
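
The check described above can be sketched roughly as follows. This is only an illustration in Python, not the code from the upstream PR: the helper name `master_services_to_check` and the idea of passing in the unit names found on the host are assumptions made for the example, and the order of preference (split units first, monolithic unit as fallback) is likewise an assumption. Only the three service names come from this report.

```python
# Hypothetical helper: decide which master service(s) a pre-upgrade
# "is the master running" check should look at.
HA_SERVICES = [
    "atomic-openshift-master-api",
    "atomic-openshift-master-controllers",
]
NON_HA_SERVICE = "atomic-openshift-master"


def master_services_to_check(known_units):
    """Return the master services to verify, given the unit names on a host.

    Prefers the split api/controllers units and falls back to the
    monolithic unit when the split units are absent (a non-HA 3.6 master).
    """
    units = set(known_units)
    if all(name in units for name in HA_SERVICES):
        return list(HA_SERVICES)
    if NON_HA_SERVICE in units:
        return [NON_HA_SERVICE]
    raise LookupError("no known master service found on this host")
```

For the host in the description, which only has `atomic-openshift-master`, calling `master_services_to_check(["atomic-openshift-master", "atomic-openshift-node"])` selects the monolithic service instead of failing on the missing api unit.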

Comment 12 Jan Chaloupka 2017-10-23 14:37:47 UTC

Upstream PR: https://github.com/openshift/openshift-ansible/pull/5845

Comment 14 liujia 2017-10-25 07:03:02 UTC

Version:
openshift-ansible-docs-3.7.0-0.178.0.git.0.27a1039.el7.noarch

Steps:
1. Containerized install of OCP v3.6 (one master+etcd host and one node)
2. Upgrade OCP to the latest v3.7

Upgrade succeeds, with the atomic-openshift-master-api and atomic-openshift-master-controllers services running.
# docker ps|grep master
01116881e2bf        openshift3/ose:v3.7.0                   "/usr/bin/openshift s"   9 minutes ago       Up 9 minutes                            atomic-openshift-master-controllers
897c7f49f878        openshift3/ose:v3.7.0                   "/usr/bin/openshift s"   10 minutes ago      Up 10 minutes                           atomic-openshift-master-api
77d439489612        openshift3/ose:v3.6.173.0.59            "/usr/bin/openshift s"   12 minutes ago      Up 12 minutes                           atomic-openshift-master

It is strange that the original atomic-openshift-master container keeps running alongside the api and controllers containers after the upgrade. I will track that in a new bug if it causes further problems. As for this bug, the upgrade works well against a non-HA containerized OCP.
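
A check for the leftover container could be scripted along these lines. This is a minimal sketch, assuming the container name is the last whitespace-separated column of `docker ps` output as in the listing above; the function names `master_container_names` and `leftover_monolithic_master` are made up for illustration and are not part of openshift-ansible.

```python
MASTER_PREFIX = "atomic-openshift-master"


def master_container_names(docker_ps_output):
    """Collect names (last column) of containers whose name starts with the master prefix."""
    names = []
    for line in docker_ps_output.splitlines():
        fields = line.split()
        if fields and fields[-1].startswith(MASTER_PREFIX):
            names.append(fields[-1])
    return names


def leftover_monolithic_master(names):
    """True when the old single master container runs next to split master containers."""
    has_monolithic = MASTER_PREFIX in names
    has_split = any(name != MASTER_PREFIX for name in names)
    return has_monolithic and has_split
```

Feeding the three `docker ps` lines above into `master_container_names` and passing the result to `leftover_monolithic_master` would flag this host, since `atomic-openshift-master` is still listed next to the api and controllers containers.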

Comment 18 errata-xmlrpc 2017-11-28 22:12:28 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188
