Bug 1280230 - Failed to upgrade cluster HA environment
Summary: Failed to upgrade cluster HA environment
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Andrew Butcher
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-11-11 09:50 UTC by Anping Li
Modified: 2015-11-20 15:41 UTC
CC List: 5 users

Fixed In Version: openshift-ansible-3.0.12-1.git.25.269bbc5.el7aos
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-11-20 15:41:26 UTC
Target Upstream Version:
Embargoed:



Description Anping Li 2015-11-11 09:50:08 UTC
Description of problem:
Failed to upgrade the cluster HA environment to v3.1 (3 masters controlled by Pacemaker + 3 etcd + 2 nodes).


Version-Release number of selected component (if applicable):
atomic-openshift-utils-3.0.8-1.git.0.59ae79c.el7aos.noarch

Steps to Reproduce:
1. Set up the cluster HA environment.
2. Upgrade the cluster environment to v3.1 with the upgrade playbook (see below).
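
The upgrade is driven by the BYO upgrade playbook; the invocation below is the same one quoted in comment 6:

ansible-playbook ~/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_0_to_v3_1/upgrade.yml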


Actual results:
The following error messages were printed:

<---skip--->
TASK: [Upgrade master configuration] ****************************************** 
skipping: [master1.example.com]
fatal: [master2.example.com] => error while evaluating conditional: deployment_type in ['openshift-enterprise', 'atomic-enterprise'] and g_aos_versions.curr_version | version_compare('3.1', '>=')
fatal: [master3.example.com] => error while evaluating conditional: deployment_type in ['openshift-enterprise', 'atomic-enterprise'] and g_aos_versions.curr_version | version_compare('3.1', '>=')

<---skip--->
<---skip--->

fatal: [master1.example.com] => One or more undefined variables: 'dict object' has no attribute 'master_cert_subdir'

FATAL: all hosts have already failed -- aborting

PLAY RECAP ******************************************************************** 
           to retry, use: --limit @/root/upgrade.retry

localhost                  : ok=3    changed=0    unreachable=0    failed=0   
master1.example.com        : ok=43   changed=7    unreachable=1    failed=0   
master2.example.com        : ok=23   changed=4    unreachable=1    failed=0   
master3.example.com        : ok=23   changed=4    unreachable=1    failed=0   
node1.example.com          : ok=5    changed=0    unreachable=0    failed=0   
node2.example.com          : ok=5    changed=0    unreachable=0    failed=0   
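
Both failures stem from fact collection rather than from the conditional's logic: on master2 and master3, g_aos_versions was never populated, so the version_compare expression cannot be evaluated at all, and master1 later hits the equally undefined master_cert_subdir key. A minimal sketch of the kind of defensive guard that would let the task skip cleanly instead of aborting, assuming the task shape in the upgrade playbook (the task body is elided; the actual fix landed in the PRs referenced in the comments below):

- name: Upgrade master configuration
  # ... module invocation elided ...
  when: deployment_type in ['openshift-enterprise', 'atomic-enterprise']
        and g_aos_versions is defined
        and g_aos_versions.curr_version is defined
        and g_aos_versions.curr_version | version_compare('3.1', '>=')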


Expected results:
The upgrade should succeed.

Additional info:

Comment 2 Andrew Butcher 2015-11-11 15:03:26 UTC
Proposed fix is here:

https://github.com/openshift/openshift-ansible/pull/870

Comment 3 Andrew Butcher 2015-11-11 15:53:50 UTC
PR #870 has been closed in favor of #839, which has been merged into master.

Comment 5 Anping Li 2015-11-12 06:38:19 UTC
The upgrade failed when checking 'pcs status':

TASK: [openshift_master_cluster | Test if cluster is already configured] ****** 
fatal: [master1.example.com] => error while evaluating conditional: openshift.master.cluster_method == "pacemaker"

FATAL: all hosts have already failed -- aborting

PLAY RECAP ******************************************************************** 
           to retry, use: --limit @/root/upgrade.retry

localhost                  : ok=9    changed=0    unreachable=0    failed=0   
master1.example.com        : ok=65   changed=19   unreachable=1    failed=0   
master2.example.com        : ok=38   changed=12   unreachable=0    failed=0   
master3.example.com        : ok=38   changed=12   unreachable=0    failed=0   
node1.example.com          : ok=15   changed=3    unreachable=0    failed=0   
node2.example.com          : ok=15   changed=3    unreachable=0    failed=0   
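
This is the same class of failure as before: openshift.master.cluster_method is an installer-written local fact, not a plain inventory variable, so when the fact is missing on a host the conditional cannot be evaluated at all. A quick way to see what a master actually recorded, assuming openshift-ansible wrote its local facts to the default path (the path is an assumption on my part):

# inspect the installer's local facts on the master (default path assumed)
grep -i cluster_method /etc/ansible/facts.d/openshift.fact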


After that, I checked the pcs status and found the following.

[root@master1 ~]# pc status
-bash: pc: command not found
[root@master1 ~]# pcs status
Error: cluster is not currently running on this node
[root@master1 ~]# ps -ef|grep pcs
root       622     1  0 13:19 ?        00:00:00 /bin/sh /usr/lib/pcsd/pcsd start
root       681   622  0 13:19 ?        00:00:00 /bin/bash -c ulimit -S -c 0 >/dev/null 2>&1 ; /usr/bin/ruby -I/usr/lib/pcsd /usr/lib/pcsd/ssl.rb
root       684   681  0 13:19 ?        00:00:01 /usr/bin/ruby -I/usr/lib/pcsd /usr/lib/pcsd/ssl.rb
root     26430 26387  0 14:36 pts/0    00:00:00 grep --color=auto pcs
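
So pcsd itself is up, but the cluster stack (corosync/pacemaker) is not running, which is enough to make the pre-upgrade check fail. Bringing the cluster back before retrying should look like this, using stock pcs commands (nothing playbook-specific):

[root@master1 ~]# pcs cluster start --all
[root@master1 ~]# pcs status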

Comment 6 Andrew Butcher 2015-11-12 15:35:38 UTC
I'm unable to reproduce this issue with the master branch of openshift-ansible and my pacemaker HA upgrade completes successfully. Can you verify that your cluster is up and running prior to upgrade and that you have the latest ansible code?

I am launching the playbook like this:

ansible-playbook ~/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_0_to_v3_1/upgrade.yml


Prior to beginning the upgrade, 'pcs status' should indicate that the cluster is started.

# pcs status
Cluster name: openshift_master
Last updated: Thu Nov 12 09:34:52 2015          Last change: Thu Nov 12 09:12:26 2015 by root via crm_resource on master4.example.com
Stack: corosync
Current DC: master5.example.com (version 1.1.13-10.el7-44eb2dd) - partition with quorum
3 nodes and 2 resources configured

Online: [ master4.example.com master5.example.com master6.example.com ]

Full list of resources:

 Resource Group: atomic-openshift-master
     virtual-ip (ocf::heartbeat:IPaddr2):       Started master4.example.com
     master     (systemd:atomic-openshift-master):      Started master4.example.com

PCSD Status:
  master4.example.com: Online
  master5.example.com: Online
  master6.example.com: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled


Additionally, Ansible should not have an issue evaluating that conditional, as the inventory variable is checked when verifying that the upgrade can proceed. If multiple masters are configured and openshift_master_cluster_method is not set to "pacemaker", the upgrade will fail with the following message:


PLAY [Verify upgrade can proceed] *********************************************

TASK: [fail ] *****************************************************************
failed: [master4.example.com] => {"failed": true}
msg: openshift_master_cluster_method must be set to 'pacemaker'
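
For completeness, a minimal sketch of the inventory bits this check looks at, assuming the common BYO [OSEv3] inventory layout (group and file names here are illustrative):

# excerpt from the BYO inventory (e.g. /etc/ansible/hosts); layout assumed
[OSEv3:vars]
deployment_type=openshift-enterprise
openshift_master_cluster_method=pacemaker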

Comment 10 Andrew Butcher 2015-11-13 20:23:56 UTC
Proposed fix is here: https://github.com/openshift/openshift-ansible/pull/892

Comment 12 Anping Li 2015-11-16 01:22:46 UTC
Verified and passed with the latest build.

