1544399 – [GSS] Migrating etcd Data: v2 to v3 playbook issues (openshift.master.cluster_method missing)

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	3.6.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	3.6.z
Assignee:	Vadim Rutkovsky
QA Contact:	liujia
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-02-12 11:30 UTC by Francesco Marchioni
Modified:	2021-06-10 14:34 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-04-12 06:03:40 UTC
Target Upstream Version:
Embargoed:

Attachments
Error related to openshift.master.cluster_method (719.07 KB, text/plain) 2018-02-12 11:30 UTC, Francesco Marchioni	no flags
Ansible inventory file (2.61 KB, text/plain) 2018-02-21 08:26 UTC, Francesco Marchioni	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2018:1106	0	None	None	None	2018-04-12 06:04:58 UTC

Description Francesco Marchioni 2018-02-12 11:30:06 UTC

Created attachment 1394847 [details]
Error related to openshift.master.cluster_method

Description of problem:
During the etcd v2 to v3 migration, "openshift.master.cluster_method" was not defined in our Ansible inventory. The playbook failed at the end, but it left some steps incomplete (the modification of master-config.yaml, for example).
The playbook cannot be rerun: it now fails because the etcd data has already been migrated.
The parameter "openshift.master.cluster_method" should therefore be verified before the migration starts, to avoid leaving a migration incomplete.
 

Steps to Reproduce:
Migrate etcd without defining "openshift.master.cluster_method".

Actual results:
Attached output of the error

Expected results:
We expect an early failure of the migration to avoid having an inconsistent environment.
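
An early failure of this kind could be sketched as a small pre-flight play that runs before the migration playbook. This is only an illustration under stated assumptions: the play is hypothetical and not part of openshift-ansible, it assumes the inventory defines a `masters` group, and it reads the inventory variable `openshift_master_cluster_method`.

```yaml
# Hypothetical pre-flight play; not part of openshift-ansible.
# Assumes the inventory defines a [masters] group.
- name: Verify inventory before migrating etcd data
  hosts: masters
  gather_facts: false
  tasks:
    - name: Fail early when a multi-master inventory lacks a valid cluster method
      fail:
        msg: >-
          openshift_master_cluster_method must be 'native' or 'pacemaker'
          when more than one master is defined; aborting before any etcd
          data is touched.
      when:
        # Both conditions must hold for the task to fail the run.
        - groups['masters'] | length > 1
        - openshift_master_cluster_method | default('') not in ['native', 'pacemaker']
```

Because the play only reads inventory variables, it changes nothing on the hosts and can be run repeatedly.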

Additional info:

Comment 1 Vadim Rutkovsky 2018-02-20 13:27:59 UTC

I can no longer reproduce this with openshift-ansible-3.6.173.0.103-1. The facts are probably being included correctly now.

Which openshift-ansible and ansible versions are you using? Could you attach the inventory file and the output of an ansible-playbook run with the '-vvv' parameter?

Comment 2 Vadim Rutkovsky 2018-02-20 21:14:53 UTC

Created https://github.com/openshift/openshift-ansible/pull/7226 to fix this

Comment 3 Francesco Marchioni 2018-02-21 08:26:58 UTC

Created attachment 1398566 [details]
Ansible inventory file

Comment 5 Vadim Rutkovsky 2018-02-27 14:54:03 UTC

Fix is available in openshift-ansible-3.6.173.0.104-1-4-g76aa5371e

Comment 7 liujia 2018-03-02 09:36:25 UTC

@Francesco

According to the attached hosts file, "openshift_master_cluster_method=native" appears to be set in the hosts file. Why, then, does the description say "in the ansible inventory we had not defined "openshift.master.cluster_method""?

BTW, I tried commenting out "openshift_master_cluster_method=native" in the inventory file. The migration fails, but not at the same task as in the log.

Could you confirm whether the variable "openshift_master_cluster_method=native" was set in the inventory file when you hit the issue?

Comment 8 liujia 2018-03-05 09:18:48 UTC

@Francesco

I cannot reproduce the issue on openshift-ansible-3.6.173.0.96-1.git.0.2954b4a.el7.noarch. There is no information about whether any extra operations were performed, or whether the cluster was already in a partially migrated state before the migration was run. BTW, "openshift_master_cluster_method" should be consistent with the current cluster's deployment.

QE cannot verify the fix by following any steps unless we can reproduce the issue. But if we do need PR 7226 merged into v3.6, QE can check that the PR is merged and that there is no regression in order to verify the bug. Do you think that's OK?

Comment 9 liujia 2018-03-06 02:21:37 UTC

Version:
openshift-ansible-3.6.173.0.104-1.git.0.ee43cc5.el7.noarch

Steps:
1. HA install ocp v3.5
2. Upgrade v3.5 to v3.6
3. Edit the inventory file to comment out "openshift_master_cluster_method=native"
(This step does not seem reasonable, but I am just following the steps in this bug's description.)
4. Run migrate

The migration failed at TASK [Start master services]:
TASK [Start master services] ************************************************************************************************************************************************
FAILED - RETRYING: Start master services (5 retries left).
...
FAILED - RETRYING: Start master services (1 retries left).
failed: [x.x.x.x] (item=atomic-openshift-master) => {
    "attempts": 5, 
    "changed": false, 
    "item": "atomic-openshift-master"
}
MSG:
Unable to start service atomic-openshift-master: Failed to start atomic-openshift-master.service: Unit is masked.
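
To confirm the state that the error message reports, the unit can be queried on the masters. The tasks below are a hypothetical diagnostic sketch, not something taken from openshift-ansible; the registered variable name `master_unit_state` is made up for illustration. `systemctl is-enabled` prints the unit state (for example `masked`) and exits non-zero for masked or disabled units, so the task is told not to treat that as a failure.

```yaml
# Hypothetical diagnostic tasks; run them against the master hosts.
- name: Query the enablement state of the master unit
  command: systemctl is-enabled atomic-openshift-master
  register: master_unit_state
  changed_when: false   # read-only query, never reports a change
  failed_when: false    # is-enabled exits non-zero for masked or disabled units

- name: Show the reported state
  debug:
    msg: "atomic-openshift-master is {{ master_unit_state.stdout }}"
```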

To summarize: if we regard commenting out "openshift_master_cluster_method=native" as a reasonable scenario, then the bug needs to be assigned back. If we don't support leaving "openshift_master_cluster_method" undefined in an HA-deployed cluster, then this bug should be closed as NOTABUG. So from the QE side, this bug needs to be assigned back first.

Comment 10 Vadim Rutkovsky 2018-03-07 12:47:55 UTC

(In reply to liujia from comment #9)
> Steps:
> 1. HA install ocp v3.5
> 2. Upgrade v3.5 to v3.6
> 3. Edit the inventory file to comment out "openshift_master_cluster_method=native"
> (This step does not seem reasonable, but I am just following the steps in this
> bug's description.)

A scenario in which an HA installation is used and openshift_master_cluster_method is then removed from the inventory is certainly invalid. openshift-ansible has no other way to detect whether this is an HA installation.

The valid scenario is:

* Prepare inventory where openshift_master_cluster_method is not specified
* Update roles/openshift_master/templates/master.yaml.v1.j2 and remove etcd3 part (to make sure etcd v2 data is created)
* Install 3.6
* Run migration

I've checked this on the latest release-3.6 branch and no error occurred.

This was probably a valid issue (the facts were set incorrectly), but I can't reproduce it on the latest code.

Comment 11 liujia 2018-03-08 01:52:15 UTC

(In reply to Vadim Rutkovsky from comment #10)
> The valid scenario is:
> 
> * Prepare inventory where openshift_master_cluster_method is not specified
> * Update roles/openshift_master/templates/master.yaml.v1.j2 and remove etcd3
> part (to make sure etcd v2 data is created)
> * Install 3.6
> * Run migration

With this scenario it is easy to install a non-HA 3.6 cluster without etcd3 and run the migration. But the inventory file in the attachment indicates that the cluster in this issue is HA. So I think the key question is why openshift_master_cluster_method was not specified for an HA-deployed cluster.

QE's scenario:
Install 3.5 (HA or non-HA) with etcd v2
Upgrade 3.5 to 3.6 with the above inventory file, keeping etcd v2
Migrate etcd from v2 to v3 with the above inventory file

This would be a valid QE scenario, and it works well, because users will not install 3.6 with etcd v2 the way you describe.

> I've checked this on latest release-3.6 branch and no error occurred.
I think both your scenario and QE's scenario work well on both the old branch and the latest branch, because we use a correct inventory file that corresponds to the real deployment.

> This has probably been a valid issue - the facts were set incorrectly, but I
> can't reproduce it on latest code
Do you mean the issue in the description is valid? If so, could you reproduce it on openshift-ansible-3.6.173.0.96-1.git.0.2954b4a.el7.noarch, the version on which the customer hit it? QE still cannot reproduce it with the current information.

Comment 12 liujia 2018-03-16 01:49:15 UTC

Changing back to MODIFIED to wait for a decision on this bug.

Comment 14 Vadim Rutkovsky 2018-03-19 15:51:34 UTC

This is blocked by bug 1554707 now. 

I've tested the following setup:
1) Set up a 3.5 cluster without `openshift_master_cluster_method` set
2) Update to 3.6
3) Add etcd keys
4) Run migrate

With the PR from bug 1554707 applied, the error no longer appears.

Keeping this in MODIFIED until the blocking bug is resolved and a new package is released.

Comment 15 liujia 2018-03-22 03:19:24 UTC

@Vadim Rutkovsky

Could you share how you set up a 3.5 cluster without "openshift_master_cluster_method" set? From what I found in the code, installing a cluster without openshift_master_cluster_method does not seem to be supported.

# vim roles/openshift_master/tasks/main.yml

# HA Variable Validation
- fail:
    msg: "openshift_master_cluster_method must be set to either 'native' or 'pacemaker' for multi-master installations"
  when: openshift_master_ha | bool and ((openshift_master_cluster_method is not defined) or (openshift_master_cluster_method is defined and openshift_master_cluster_method not in ["native", "pacemaker"]))
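
Read as a predicate, the `when` expression above fails the run for a multi-master installation when the variable is either undefined or set to anything other than 'native' or 'pacemaker'. A more compact way to express the same test, shown only as an illustrative sketch and not as the code in the repository, uses the Jinja2 `default` filter so that an undefined variable is treated as an empty string:

```yaml
# Illustrative rewrite only; the task in roles/openshift_master/tasks/main.yml
# is the one quoted above.
- fail:
    msg: "openshift_master_cluster_method must be set to either 'native' or 'pacemaker' for multi-master installations"
  when:
    # List items under `when` are combined with a logical AND.
    - openshift_master_ha | bool
    # An undefined variable becomes '', which is not in the allowed list.
    - openshift_master_cluster_method | default('') not in ['native', 'pacemaker']
```

Either form leaves single-master installations untouched, since the first condition is false when openshift_master_ha is not set to true.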

Comment 16 Vadim Rutkovsky 2018-03-26 12:21:35 UTC

(In reply to liujia from comment #15)
> @Vadim Rutkovsky
> 
> Could you share how you set up a 3.5 cluster without
> "openshift_master_cluster_method" set? From what I found in the code,
> installing a cluster without openshift_master_cluster_method does not seem
> to be supported.

Oh, sorry, I forgot to mention that I ran this with just one master. As you noted in the previous comment, there is no way to set up a 3.5 cluster with multiple masters without setting 'openshift_master_cluster_method' (same in 3.6), so I verified that even one master works after upgrading to 3.6 using the latest code.

Comment 17 liujia 2018-03-27 02:03:07 UTC

(In reply to Vadim Rutkovsky from comment #16)

> Oh, sorry, I forgot to mention that I ran this with just one master. As you
> noted in the previous comment, there is no way to set up a 3.5 cluster with
> multiple masters without setting 'openshift_master_cluster_method' (same in
> 3.6), so I verified that even one master works after upgrading to 3.6
> using the latest code.

But if it is not an HA deployment, we do not need to specify 'openshift_master_cluster_method' in the hosts file, and that is expected.

In conclusion, neither of us can reproduce it, and we have no idea why "openshift_master_cluster_method" was not set in the inventory file for an HA cluster in the customer's environment. Am I right?

So if you don't think the bug should be closed directly, would you mind if QE just runs a regression test against the PR you linked? If the issue comes up later, we can re-open the bug or file a new one.

Comment 18 Vadim Rutkovsky 2018-03-27 08:09:55 UTC

(In reply to liujia from comment #17)
> In conclusion, neither of us can reproduce it, and we have no idea why
> "openshift_master_cluster_method" was not set in the inventory file for an
> HA cluster in the customer's environment. Am I right?

That's correct. If it's missing, there is no way I can reproduce the issue anymore.

> So if you don't think the bug should be closed directly, would you mind if
> QE just runs a regression test against the PR you linked? If the issue comes
> up later, we can re-open the bug or file a new one.

Yes, a regression test would be nice, thank you. Let's open a new issue with more specific details in case some error is found.

Comment 19 liujia 2018-03-28 01:09:35 UTC

Cannot reproduce, so QE just ran a regression test on openshift-ansible-3.6.173.0.110-1.git.0.ca81843.el7.noarch. Migration succeeded.

Comment 22 errata-xmlrpc 2018-04-12 06:03:40 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1106
