Bug 1280230 - Failed to upgrade cluster HA environment
Product: OpenShift Container Platform
Classification: Red Hat
Component: Upgrade
Platform: Unspecified
Priority: high
Severity: high
Assigned To: Andrew Butcher
QA Contact: Johnny Liu
Whiteboard: UpcomingRelease
Reported: 2015-11-11 04:50 EST by Anping Li
Modified: 2015-11-20 10:41 EST (History)

Fixed In Version: openshift-ansible-3.0.12-1.git.25.269bbc5.el7aos
Doc Type: Bug Fix
Last Closed: 2015-11-20 10:41:26 EST
Type: Bug

Description Anping Li 2015-11-11 04:50:08 EST
Description of problem:
Failed to upgrade the cluster environment to v3.1 (3 masters controlled by pacemaker + 3 etcd + 2 nodes).

Version-Release number of selected component (if applicable):

Steps to Reproduce:
1. Set up the cluster environment.
2. Upgrade the cluster environment to v3.1.

Actual results:
The following error messages were printed:

<--skip --->
TASK: [Upgrade master configuration] ****************************************** 
skipping: [master1.example.com]
fatal: [master2.example.com] => error while evaluating conditional: deployment_type in ['openshift-enterprise', 'atomic-enterprise'] and g_aos_versions.curr_version | version_compare('3.1', '>=')
fatal: [master3.example.com] => error while evaluating conditional: deployment_type in ['openshift-enterprise', 'atomic-enterprise'] and g_aos_versions.curr_version | version_compare('3.1', '>=')


fatal: [master1.example.com] => One or more undefined variables: 'dict object' has no attribute 'master_cert_subdir'

FATAL: all hosts have already failed -- aborting

PLAY RECAP ******************************************************************** 
           to retry, use: --limit @/root/upgrade.retry

localhost                  : ok=3    changed=0    unreachable=0    failed=0   
master1.example.com        : ok=43   changed=7    unreachable=1    failed=0   
master2.example.com        : ok=23   changed=4    unreachable=1    failed=0   
master3.example.com        : ok=23   changed=4    unreachable=1    failed=0   
node1.example.com          : ok=5    changed=0    unreachable=0    failed=0   
node2.example.com          : ok=5    changed=0    unreachable=0    failed=0   
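
The conditional failure on master2 and master3 suggests that `g_aos_versions.curr_version` had not been gathered on those hosts when the `when:` clause was evaluated. The following is only a hypothetical sketch of how such a conditional can be guarded, not the actual fix that was merged:

```yaml
# Hypothetical sketch: guard the conditional so hosts where g_aos_versions
# is not yet populated are skipped instead of failing the whole play.
- name: Upgrade master configuration
  # ... task body elided ...
  when:
    - deployment_type in ['openshift-enterprise', 'atomic-enterprise']
    - g_aos_versions.curr_version is defined
    - g_aos_versions.curr_version | version_compare('3.1', '>=')
```

With `is defined` checked first, an undefined variable short-circuits the list of conditions rather than raising the "error while evaluating conditional" fatal seen above.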

Expected results:
The upgrade should succeed.

Additional info:
Comment 2 Andrew Butcher 2015-11-11 10:03:26 EST
Proposed fix is here:

Comment 3 Andrew Butcher 2015-11-11 10:53:50 EST
PR #870 has been closed in favor of #839 which has been merged into master.
Comment 5 Anping Li 2015-11-12 01:38:19 EST
The upgrade failed when checking 'pcs status'.

TASK: [openshift_master_cluster | Test if cluster is already configured] ****** 
fatal: [master1.example.com] => error while evaluating conditional: openshift.master.cluster_method == "pacemaker"

FATAL: all hosts have already failed -- aborting

PLAY RECAP ******************************************************************** 
           to retry, use: --limit @/root/upgrade.retry

localhost                  : ok=9    changed=0    unreachable=0    failed=0   
master1.example.com        : ok=65   changed=19   unreachable=1    failed=0   
master2.example.com        : ok=38   changed=12   unreachable=0    failed=0   
master3.example.com        : ok=38   changed=12   unreachable=0    failed=0   
node1.example.com          : ok=15   changed=3    unreachable=0    failed=0   
node2.example.com          : ok=15   changed=3    unreachable=0    failed=0   

After that, I checked the pcs status and found the following.

[root@master1 ~]# pc status
-bash: pc: command not found
[root@master1 ~]# pcs status
Error: cluster is not currently running on this node
[root@master1 ~]# ps -ef|grep pcs
root       622     1  0 13:19 ?        00:00:00 /bin/sh /usr/lib/pcsd/pcsd start
root       681   622  0 13:19 ?        00:00:00 /bin/bash -c ulimit -S -c 0 >/dev/null 2>&1 ; /usr/bin/ruby -I/usr/lib/pcsd /usr/lib/pcsd/ssl.rb
root       684   681  0 13:19 ?        00:00:01 /usr/bin/ruby -I/usr/lib/pcsd /usr/lib/pcsd/ssl.rb
root     26430 26387  0 14:36 pts/0    00:00:00 grep --color=auto pcs
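
Since `pcs status` reports that the cluster is not running on this node, a pre-flight check before the upgrade would surface this earlier. A minimal sketch of such a check as an Ansible task (hypothetical, not part of the playbook):

```yaml
# Hypothetical pre-flight sketch: fail early if the pacemaker cluster is
# down, instead of letting the upgrade abort partway through the play.
- name: Verify pacemaker cluster is running before upgrade
  command: pcs status
  register: pcs_status
  changed_when: false
  failed_when: "'cluster is not currently running' in pcs_status.stderr"
  when: openshift.master.cluster_method == "pacemaker"
```

In this state the cluster can typically be brought back up with `pcs cluster start --all` before retrying the upgrade.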
Comment 6 Andrew Butcher 2015-11-12 10:35:38 EST
I'm unable to reproduce this issue with the master branch of openshift-ansible, and my pacemaker HA upgrade completes successfully. Can you verify that your cluster is up and running prior to the upgrade, and that you have the latest Ansible code?

I am launching the playbook like this:

ansible-playbook ~/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_0_to_v3_1/upgrade.yml

Prior to beginning the upgrade 'pcs status' should indicate that the cluster is started.

# pcs status
Cluster name: openshift_master
Last updated: Thu Nov 12 09:34:52 2015          Last change: Thu Nov 12 09:12:26 2015 by root via crm_resource on master4.example.com
Stack: corosync
Current DC: master5.example.com (version 1.1.13-10.el7-44eb2dd) - partition with quorum
3 nodes and 2 resources configured

Online: [ master4.example.com master5.example.com master6.example.com ]

Full list of resources:

 Resource Group: atomic-openshift-master
     virtual-ip (ocf::heartbeat:IPaddr2):       Started master4.example.com
     master     (systemd:atomic-openshift-master):      Started master4.example.com

PCSD Status:
  master4.example.com: Online
  master5.example.com: Online
  master6.example.com: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled

Additionally, Ansible should not have an issue evaluating that conditional, as the inventory variable is checked when verifying that the upgrade can proceed. If multiple masters are configured and openshift_master_cluster_method is not set to "pacemaker", the upgrade will fail with the following message:

PLAY [Verify upgrade can proceed] *********************************************

TASK: [fail ] *****************************************************************
failed: [master4.example.com] => {"failed": true}
msg: openshift_master_cluster_method must be set to 'pacemaker'
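
For reference, a minimal sketch of the relevant inventory settings for a multi-master BYO upgrade (host names and values here are examples, not taken from the reporter's environment):

```ini
# Hypothetical inventory fragment: multi-master upgrades require the
# cluster method to be declared explicitly.
[OSEv3:vars]
deployment_type=openshift-enterprise
openshift_master_cluster_method=pacemaker

[masters]
master1.example.com
master2.example.com
master3.example.com
```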
Comment 10 Andrew Butcher 2015-11-13 15:23:56 EST
Proposed fix is here: https://github.com/openshift/openshift-ansible/pull/892
Comment 12 Anping Li 2015-11-15 20:22:46 EST
Verified and passed with the latest build.
