Bug 1298803 - Retrieve pcs status failed during rolling restart when ansible is not running on master
Summary: Retrieve pcs status failed during rolling restart when ansible is not running...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Andrew Butcher
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-01-15 06:29 UTC by Anping Li
Modified: 2016-09-07 21:11 UTC
CC List: 6 users

Fixed In Version: openshift-ansible-3.0.38-1.git.0.66ba7e2.el7aos
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-09-07 21:11:08 UTC
Target Upstream Version:
Embargoed:


Attachments
rolling restart pcs masters output (98.67 KB, text/plain)
2016-01-22 03:07 UTC, Anping Li

Description Anping Li 2016-01-15 06:29:04 UTC
Description of problem:
If Ansible is run from a machine outside of the masters, the 'Retrieve pcs status' task fails.


Version-Release number of selected component (if applicable):
atomic-openshift-utils-3.0.32

How reproducible:
always

Steps to Reproduce:
1. Install atomic-openshift-utils on a machine outside of the masters
2. ansible-playbook -i /root/config/ose31nativeha /root/openshift-ansible/playbooks/byo/openshift-master/restart.yml

Actual results:
TASK: [Evaluate oo_current_masters] ******************************************* 
skipping: [localhost] => (item=master1.example.com)
skipping: [localhost] => (item=master2.example.com)
skipping: [localhost] => (item=master3.example.com)

PLAY [Validate pacemaker cluster] ********************************************* 

GATHERING FACTS *************************************************************** 
ok: [master1.example.com]

TASK: [Retrieve pcs status] *************************************************** 
ok: [master1.example.com]

TASK: [fail ] ***************************************************************** 
fatal: [master1.example.com] => Failed to template {% if not (pcs_status_output.stdout | validate_pcs_cluster(groups.oo_masters_to_config)) | bool %} True {% else %} False {% endif %}: |failed expects data is a string

FATAL: all hosts have already failed -- aborting

PLAY RECAP ******************************************************************** 
           to retry, use: --limit @/root/restart.retry

localhost                  : ok=12   changed=0    unreachable=0    failed=0   
master1.example.com        : ok=16   changed=0    unreachable=1    failed=0   
master2.example.com        : ok=14   changed=0    unreachable=0    failed=0   
master3.example.com        : ok=14   changed=0    unreachable=0    failed=0 


Expected results:
The rolling restart can be performed from a machine outside of the masters.

Additional info:

Comment 4 Anping Li 2016-01-20 05:54:33 UTC
Still fails with the same error with pull/1211 applied.

TASK: [Retrieve pcs status] *************************************************** 
ok: [master1.example.com]

TASK: [fail ] ***************************************************************** 
fatal: [master1.example.com] => Failed to template {% if not (pcs_status_output.stdout | validate_pcs_cluster(groups.oo_masters_to_config)) | bool %} True {% else %} False {% endif %}: |failed expects data is a string

FATAL: all hosts have already failed -- aborting

Comment 6 Anping Li 2016-01-21 10:26:37 UTC
TASK: [Retrieve pcs status] *************************************************** 
<master2.example.com> ESTABLISH CONNECTION FOR USER: root
<master2.example.com> REMOTE_MODULE command pcs status
<master2.example.com> EXEC ssh -C -tt -v -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/root/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 master2.example.com /bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1453371950.59-153437476202709 && echo $HOME/.ansible/tmp/ansible-tmp-1453371950.59-153437476202709'
<master2.example.com> PUT /tmp/tmpxYu489 TO /root/.ansible/tmp/ansible-tmp-1453371950.59-153437476202709/command
<master2.example.com> EXEC ssh -C -tt -v -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/root/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 master2.example.com /bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python /root/.ansible/tmp/ansible-tmp-1453371950.59-153437476202709/command; rm -rf /root/.ansible/tmp/ansible-tmp-1453371950.59-153437476202709/ >/dev/null 2>&1'
ok: [master2.example.com] => {"changed": false, "cmd": ["pcs", "status"], "delta": "0:00:00.887346", "end": "2016-01-21 18:25:51.554467", "rc": 0, "start": "2016-01-21 18:25:50.667121", "stderr": "", "stdout": "Cluster name: openshift_master\nLast updated: Thu Jan 21 18:25:50 2016\t\tLast change: Wed Jan 20 13:15:20 2016 by root via crm_resource on master1.example.com\nStack: corosync\nCurrent DC: master2.example.com (version 1.1.13-a14efad) - partition with quorum\n3 nodes and 2 resources configured\n\nOnline: [ master1.example.com master2.example.com master3.example.com ]\n\nFull list of resources:\n\n Resource Group: atomic-openshift-master\n     virtual-ip\t(ocf::heartbeat:IPaddr2):\tStarted master2.example.com\n     master\t(systemd:atomic-openshift-master):\tStarted master2.example.com\n\nPCSD Status:\n  master1.example.com: Online\n  master2.example.com: Online\n  master3.example.com: Online\n\nDaemon Status:\n  corosync: active/enabled\n  pacemaker: active/enabled\n  pcsd: active/enabled", "stdout_lines": ["Cluster name: openshift_master", "Last updated: Thu Jan 21 18:25:50 2016\t\tLast change: Wed Jan 20 13:15:20 2016 by root via crm_resource on master1.example.com", "Stack: corosync", "Current DC: master2.example.com (version 1.1.13-a14efad) - partition with quorum", "3 nodes and 2 resources configured", "", "Online: [ master1.example.com master2.example.com master3.example.com ]", "", "Full list of resources:", "", " Resource Group: atomic-openshift-master", "     virtual-ip\t(ocf::heartbeat:IPaddr2):\tStarted master2.example.com", "     master\t(systemd:atomic-openshift-master):\tStarted master2.example.com", "", "PCSD Status:", "  master1.example.com: Online", "  master2.example.com: Online", "  master3.example.com: Online", "", "Daemon Status:", "  corosync: active/enabled", "  pacemaker: active/enabled", "  pcsd: active/enabled"], "warnings": []}

TASK: [fail ] ***************************************************************** 
fatal: [master2.example.com] => Failed to template {% if not (pcs_status_output.stdout | validate_pcs_cluster(groups.oo_masters_to_config)) | bool %} True {% else %} False {% endif %}: |failed expects data is a string

FATAL: all hosts have already failed -- aborting

PLAY RECAP ******************************************************************** 
           to retry, use: --limit @/root/restart.retry

localhost                  : ok=12   changed=0    unreachable=0    failed=0   
master1.example.com        : ok=14   changed=0    unreachable=0    failed=0   
master2.example.com        : ok=16   changed=0    unreachable=1    failed=0   
master3.example.com        : ok=14   changed=0    unreachable=0    failed=0

Comment 7 Anping Li 2016-01-21 10:27:43 UTC
Andrew, you can still use those machines for debugging.

Comment 8 Andrew Butcher 2016-01-21 16:09:28 UTC
Thanks Anping. I think the data type in your case is unicode, so the filter input will now be validated against both string and unicode.

Proposed fix: https://github.com/openshift/openshift-ansible/pull/1251
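
For reference, the underlying issue is a Python 2 detail: the registered stdout can come back as unicode rather than str, and a check against str alone rejects it (the exact pre-fix check is an assumption on my part). A minimal sketch of the difference:

    # Python 2: str and unicode are distinct types that share the base class basestring.
    data = u"Cluster name: openshift_master"  # stdout arriving as unicode

    isinstance(data, str)         # False -> a str-only check rejects it ("expects data is a string")
    isinstance(data, basestring)  # True  -> accepts both str and unicode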

Comment 9 Anping Li 2016-01-22 03:07:33 UTC
Created attachment 1117086 [details]
rolling restart pcs masters output

I am sure I was using the latest code, but I still get the error 'failed expects data is a string'. It is strange.
# grep -A 15 validate_pcs_cluster /root/openshift-ansible/filter_plugins/openshift_master.py 
    def validate_pcs_cluster(data, masters=None):
        ''' Validates output from "pcs status", ensuring that each master
            provided is online.
            Ex: data = ('...',
                        'PCSD Status:',
                        'master1.example.com: Online',
                        'master2.example.com: Online',
                        'master3.example.com: Online',
                        '...')
                masters = ['master1.example.com',
                           'master2.example.com',
                           'master3.example.com']
               returns True
        '''
        if not issubclass(type(data), basestring):
            raise errors.AnsibleFilterError("|failed expects data is a string or unicode")
--
                "validate_pcs_cluster": self.validate_pcs_cluster}

Comment 10 Anping Li 2016-01-22 03:10:25 UTC
The Ansible versions are as follows.
The command I used is 'ansible-playbook -i /root/config/ose31pacemakerha2 /root/openshift-ansible/playbooks/byo/openshift-master/restart.yml -vvv|tee  pcserror.log'
[root@dhcp-129-213 ~]# rpm -qa|grep ansible
openshift-ansible-lookup-plugins-3.0.35-1.git.0.6a386dd.el7aos.noarch
openshift-ansible-3.0.35-1.git.0.6a386dd.el7aos.noarch
openshift-ansible-roles-3.0.35-1.git.0.6a386dd.el7aos.noarch
ansible-1.9.4-1.el7aos.noarch
openshift-ansible-filter-plugins-3.0.35-1.git.0.6a386dd.el7aos.noarch
openshift-ansible-playbooks-3.0.35-1.git.0.6a386dd.el7aos.noarch

Comment 11 Anping Li 2016-01-22 03:24:47 UTC
When I run restart.yml on master1.example.com, I get the following messages. You can log in to this machine and try the command below:
ansible-playbook -i /root/config/ose31pacemakerha2 /root/openshift-ansible/playbooks/byo/openshift-master/restart.yml -vvv|tee  pcserror.log

TASK: [fail ] ***************************************************************** 
<master1.example.com> ESTABLISH CONNECTION FOR USER: root
failed: [master1.example.com] => {"failed": true}
msg: Pacemaker cluster validation failed. One or more nodes are not online.


FATAL: all hosts have already failed -- aborting

PLAY RECAP ******************************************************************** 
           to retry, use: --limit @/root/restart.retry

Comment 12 Andrew Butcher 2016-01-22 18:10:23 UTC
Thanks Anping. The output in comment 11 indicates that the cluster was unhealthy prior to the master restart. Checking the output of "pcs status", master1 and master2 are offline in the PCSD status, so this message is expected.

[root@master1 ~]# pcs status
Cluster name: openshift_master
Last updated: Fri Jan 22 23:11:40 2016          Last change: Fri Jan 22 23:11:17 2016 by root via crm_resource on master1.example.com
Stack: corosync
Current DC: master1.example.com (version 1.1.13-a14efad) - partition with quorum
3 nodes and 2 resources configured

Online: [ master1.example.com master2.example.com master3.example.com ]

Full list of resources:

 Resource Group: atomic-openshift-master
     virtual-ip (ocf::heartbeat:IPaddr2):       Started master3.example.com
     master     (systemd:atomic-openshift-master):      Started master3.example.com

PCSD Status:
  master1.example.com: Offline <---- master is offline
  master2.example.com: Offline <---- master is offline
  master3.example.com: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: failed/enabled <---- pcsd is unhealthy


By restarting pcsd on master1 and master2 I can restore the cluster.


[root@master1 ~]# pcs status
Cluster name: openshift_master
Last updated: Fri Jan 22 23:12:09 2016          Last change: Fri Jan 22 23:11:17 2016 by root via crm_resource on master1.example.com
Stack: corosync
Current DC: master1.example.com (version 1.1.13-a14efad) - partition with quorum
3 nodes and 2 resources configured

Online: [ master1.example.com master2.example.com master3.example.com ]

Full list of resources:

 Resource Group: atomic-openshift-master
     virtual-ip (ocf::heartbeat:IPaddr2):       Started master3.example.com
     master     (systemd:atomic-openshift-master):      Started master3.example.com

PCSD Status:
  master1.example.com: Online
  master2.example.com: Online
  master3.example.com: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


I encountered an issue with the is-active service check, which I've addressed in https://github.com/openshift/openshift-ansible/pull/1266.

Restarting the cluster is then successful. Logs are available here: http://file.rdu.redhat.com/abutcher/ansible.2.bz1298803.txt

Comment 13 Anping Li 2016-01-25 10:01:15 UTC
That only fixes the issue in Comment 10.
If Ansible is not run on the masters, the issue described in comment 9 still exists.

Comment 14 Andrew Butcher 2016-01-25 14:38:01 UTC
Anping, are you testing with the master branch of openshift-ansible?

I would expect the error message in Comment 9 to be "|failed expects data is a string or unicode" with the latest changes.

Comment 15 Anping Li 2016-01-26 02:31:00 UTC
Andrew, I am testing with PR1266. I can find the expected message in the code; however, the output is different.

grep -r 'failed expects data' /root/openshift-ansible/
/root/openshift-ansible/filter_plugins/openshift_master.py:            raise errors.AnsibleFilterError("|failed expects data is a string or unicode")

ansible-playbook -i /root/config/ose31pacemakerha /root/openshift-ansible/playbooks/byo/openshift-master/restart.yml
PLAY [Populate config host groups] ********************************************
<----snip---->
<----snip---->
<----snip---->

TASK: [fail ] ***************************************************************** 
fatal: [master1.example.com] => Failed to template {% if not (pcs_status_output.stdout | validate_pcs_cluster(groups.oo_masters_to_config)) | bool %} True {% else %} False {% endif %}: |failed expects data is a string

Comment 16 Andrew Butcher 2016-01-26 02:35:11 UTC
Thanks for confirming, Anping. I'll see what I can figure out.

Comment 18 Anping Li 2016-02-25 08:02:04 UTC
Using the package that includes the fix, the rolling restart works well, so I am moving the bug to VERIFIED.

