Created attachment 1394847 [details]
Error related to openshift.master.cluster_method

Description of problem:
During the etcd v2 to v3 migration, "openshift.master.cluster_method" was not defined in our Ansible inventory. The playbook failed at the end, but it left some steps undone (for example, the modification of master-config.yaml). The playbook cannot be rerun, because it now fails on the grounds that the etcd data has already been migrated. The parameter "openshift.master.cluster_method" should therefore be verified up front, so that a migration is never left incomplete.

Steps to Reproduce:
1. Migrate etcd without defining "openshift.master.cluster_method".

Actual results:
See the attached error output.

Expected results:
The migration should fail early, to avoid leaving the environment in an inconsistent state.

Additional info:
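The expected result in the report above is a fail-fast ordering: validate the inventory before any step that changes state. The following is only a minimal sketch of that ordering in Python, not how openshift-ansible implements it. The names `require_var`, `run_migration`, and the step functions are hypothetical, and the inventory is modelled as a plain dictionary.

```python
from typing import Callable, Dict, List, Optional

Inventory = Dict[str, str]
# A check returns an error message, or None when the inventory passes.
Check = Callable[[Inventory], Optional[str]]
# A step mutates the environment (hypothetical stand-in for a playbook task).
Step = Callable[[Inventory], None]


def require_var(name: str) -> Check:
    """Build a check that reports an error when `name` is missing or empty."""
    def check(inventory: Inventory) -> Optional[str]:
        if not inventory.get(name):
            return f"{name} must be defined before migrating"
        return None
    return check


def run_migration(inventory: Inventory, checks: List[Check], steps: List[Step]) -> None:
    """Run every check first; execute the steps only if all checks pass."""
    errors = [msg for msg in (check(inventory) for check in checks) if msg]
    if errors:
        # Nothing has been modified yet, so the run can simply be repeated
        # after the inventory is corrected.
        raise ValueError("; ".join(errors))
    for step in steps:
        step(inventory)
```

With this ordering, a missing variable stops the run before the etcd data or master-config.yaml is touched, so a rerun after fixing the inventory starts from a clean state.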
Can't reproduce anymore in openshift-ansible-3.6.173.0.103-1. Probably the facts are being included correctly now.

Which openshift-ansible and ansible versions are you using? Could you attach an inventory file and the output of an ansible-playbook run with the '-vvv' parameter?
Created https://github.com/openshift/openshift-ansible/pull/7226 to fix this
Created attachment 1398566 [details] Ansible inventory file
Fix is available in openshift-ansible-3.6.173.0.104-1-4-g76aa5371e
@Francesco

According to the attached hosts file, "openshift_master_cluster_method=native" was set in the hosts file. So why does the description say "in the ansible inventory we had not defined "openshift.master.cluster_method""?

BTW, I tried commenting out "openshift_master_cluster_method=native" in the inventory file. The migration does fail, but not at the same task as in the log.

Could you confirm whether the variable "openshift_master_cluster_method=native" was set in the inventory file when you hit the issue?
@Francesco

I cannot reproduce the issue on openshift-ansible-3.6.173.0.96-1.git.0.2954b4a.el7.noarch. There is no information about whether any extra operations were performed, or whether the cluster was already in a partially migrated state before the migration was run.

BTW, "openshift_master_cluster_method" should be consistent with the current cluster's deployment. So QE cannot verify this bug by following any steps unless we can reproduce it. But if PR 7226 does need to be merged into v3.6, QE can verify the bug by checking that the PR is merged and causes no regression. Do you think that's OK?
Version:
openshift-ansible-3.6.173.0.104-1.git.0.ee43cc5.el7.noarch

Steps:
1. HA install OCP v3.5
2. Upgrade v3.5 to v3.6
3. Edit the inventory file to comment out "openshift_master_cluster_method=native" (this step does not seem reasonable, but I am just following the steps in this bug's description)
4. Run the migration

The migration failed at TASK [Start master services].

TASK [Start master services] ************************************************************************************************************************************************
FAILED - RETRYING: Start master services (5 retries left).
...
FAILED - RETRYING: Start master services (1 retries left).
failed: [x.x.x.x] (item=atomic-openshift-master) => {
    "attempts": 5,
    "changed": false,
    "item": "atomic-openshift-master"
}

MSG:

Unable to start service atomic-openshift-master: Failed to start atomic-openshift-master.service: Unit is masked.

To summarize: if we regard commenting out "openshift_master_cluster_method=native" as a reasonable scenario, then the bug needs to be assigned back. If we don't support leaving "openshift_master_cluster_method" undefined in an HA-deployed cluster, then this bug should be closed as NOTABUG.

So from the QE side, this bug needs to be assigned back first.
(In reply to liujia from comment #9)
> Steps:
> 1. HA install ocp v3.5
> 2. Upgrade v3.5 to v3.6
> 3. Edit inventory file to annotate "openshift_master_cluster_method=native"
> (This step seems not reasonable, but according to this bug's description,
> just following the steps)

A scenario where an HA installation is used and openshift_master_cluster_method is then removed from the inventory is certainly invalid. There is no other way for openshift-ansible to detect whether this is an HA installation or not.

The valid scenario is:

* Prepare an inventory where openshift_master_cluster_method is not specified
* Update roles/openshift_master/templates/master.yaml.v1.j2 and remove the etcd3 part (to make sure etcd v2 data is created)
* Install 3.6
* Run the migration

I've checked this on the latest release-3.6 branch and no error occurred.

This has probably been a valid issue - the facts were set incorrectly, but I can't reproduce it on the latest code.
(In reply to Vadim Rutkovsky from comment #10)
> The valid scenario is:
>
> * Prepare inventory where openshift_master_cluster_method is not specified
> * Update roles/openshift_master/templates/master.yaml.v1.j2 and remove etcd3
> part (to make sure etcd v2 data is created)
> * Install 3.6
> * Run migration

This scenario is an easy way to install non-HA 3.6 without etcd3 and then run the migration. But the inventory file in the attachment indicates that the cluster in this issue is HA. So I think the key point is why openshift_master_cluster_method was not specified for an HA-deployed cluster.

QE's scenario:

* Install 3.5 (HA or non-HA) with etcd v2
* Upgrade 3.5 to 3.6 with the above inventory file, keeping etcd v2
* Migrate etcd from v2 to v3 with the above inventory file

This is the valid QE scenario, and it works well, because a user will not install 3.6 with etcd v2 in the way you describe.

> I've checked this on latest release-3.6 branch and no error occurred.

I think both your scenario and QE's scenario work well on both the old branch and the latest branch, because we use a correct inventory file that corresponds to the real deployment.

> This has probably been a valid issue - the facts were set incorrectly, but I
> can't reproduce it on latest code

You mean the issue in the description is valid? Then could you reproduce it on openshift-ansible-3.6.173.0.96-1.git.0.2954b4a.el7.noarch, the version on which the customer hit it? QE still cannot reproduce it with the current information.
Changing back to MODIFIED to wait for a decision on the bug.
This is blocked by bug 1554707 now.

I've tested the following setup:
1) Set up a 3.5 cluster without `openshift_master_cluster_method` set
2) Update to 3.6
3) Add etcd keys
4) Run migrate

With the PR from bug 1554707 applied, the error no longer appears. Keeping this in MODIFIED until the blocking bug is resolved and a new package is released.
@Vadim Rutkovsky

Could you share how you set up a 3.5 cluster without "openshift_master_cluster_method" set? From what I found in the code, installing a cluster without openshift_master_cluster_method does not seem to be supported.

# vim roles/openshift_master/tasks/main.yml
# HA Variable Validation
- fail:
    msg: "openshift_master_cluster_method must be set to either 'native' or 'pacemaker' for multi-master installations"
  when: openshift_master_ha | bool and ((openshift_master_cluster_method is not defined) or (openshift_master_cluster_method is defined and openshift_master_cluster_method not in ["native", "pacemaker"]))
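For readers less familiar with Ansible conditionals, the `when` expression quoted above can be restated as a small Python function. This is only an illustrative sketch of the same boolean logic, not code from openshift-ansible: the function name `ha_validation_error` is hypothetical, and an undefined variable is modelled as `None`.

```python
from typing import Optional

VALID_CLUSTER_METHODS = ("native", "pacemaker")

HA_ERROR = (
    "openshift_master_cluster_method must be set to either 'native' or "
    "'pacemaker' for multi-master installations"
)


def ha_validation_error(master_ha: bool, cluster_method: Optional[str]) -> Optional[str]:
    """Return the error message when the HA check would fail, else None.

    Assumption: `cluster_method=None` stands in for "variable not defined".
    """
    # Fails only for HA deployments: either the method is undefined (None)
    # or it is defined but not one of the accepted values.
    if master_ha and cluster_method not in VALID_CLUSTER_METHODS:
        return HA_ERROR
    return None
```

Under this reading, a single-master deployment passes regardless of the variable, while a multi-master deployment passes only with 'native' or 'pacemaker'.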
(In reply to liujia from comment #15)
> @Vadim Rutkovsky
>
> Could u share your way about setup 3.5 cluster without
> "openshift_master_cluster_method" setting? It seems not supported to install
> a cluster without openshift_master_cluster_method from what I found in the
> code.

Oh, sorry, I forgot to mention that I ran this with just one master. As you noted in the previous comment, there is no way to set up a 3.5 cluster with multiple masters without setting 'openshift_master_cluster_method' (same in 3.6), so I verified that even one master would work after upgrading to 3.6 using the latest code.
(In reply to Vadim Rutkovsky from comment #16)
> Oh, sorry, I forgot to mention I've ran this with just one master. As you've
> noted in the previous comment there is no way to setup 3.5 cluster with
> multiple masters without setting 'openshift_master_cluster_method' (same in
> 3.6), so I verified that even one master would work after upgrading to 3.6
> using latest code

But if it is not an HA deployment, we do not need to specify 'openshift_master_cluster_method' in the hosts file, and that is expected.

In conclusion, neither you nor I can reproduce it, and we have no idea why "openshift_master_cluster_method" was not set in the inventory file for an HA cluster in the customer's environment. Am I right?

So if you don't think the bug should be closed directly, would you mind if QE just runs a regression test against the PR you linked? If the issue comes up again later, we can reopen the bug or file a new one.
(In reply to liujia from comment #17)
> In conclusion, you and I, both can not re-produce it and have no idea about
> why there was not "openshift_master_cluster_method" setting in inventory
> file for a ha cluster in customer env. Am i right?

That's correct. If it's missing, there is no way I can reproduce the issue anymore.

> So if you don't think it should be closed directly, then would u mind that
> QE just do regression test against the pr you linked. And if the issue comes
> out later, then re-open the bug or file a new one?

Yes, a regression test would be nice, thank you. Let's open a new issue with more specific details in case some error is found.
Cannot reproduce, so QE just ran a regression test on openshift-ansible-3.6.173.0.110-1.git.0.ca81843.el7.noarch. The migration succeeded.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1106