Bug 1298803 - Retrieve pcs status failed during rolling restart when ansible is not running on master
Status: CLOSED CURRENTRELEASE
Product: OpenShift Container Platform
Classification: Red Hat
Component: Upgrade
Version: 3.1.0
Priority: medium  Severity: low
Assigned To: Andrew Butcher
QA Contact: Anping Li
Reported: 2016-01-15 01:29 EST by Anping Li
Modified: 2016-09-07 17:11 EDT
CC List: 6 users

Fixed In Version: openshift-ansible-3.0.38-1.git.0.66ba7e2.el7aos
Doc Type: Bug Fix
Last Closed: 2016-09-07 17:11:08 EDT
Type: Bug


Attachments
rolling restart pcs masters output (98.67 KB, text/plain)
2016-01-21 22:07 EST, Anping Li

Description Anping Li 2016-01-15 01:29:04 EST
Description of problem:
If ansible is run from a machine outside of the masters, the "Retrieve pcs status" task will fail.


Version-Release number of selected component (if applicable):
atomic-openshift-utils-3.0.32

How reproducible:
always

Steps to Reproduce:
1. Install atomic-openshift-utils on a machine that is not one of the masters
2. ansible-playbook -i /root/config/ose31nativeha /root/openshift-ansible/playbooks/byo/openshift-master/restart.yml

Actual results:
TASK: [Evaluate oo_current_masters] ******************************************* 
skipping: [localhost] => (item=master1.example.com)
skipping: [localhost] => (item=master2.example.com)
skipping: [localhost] => (item=master3.example.com)

PLAY [Validate pacemaker cluster] ********************************************* 

GATHERING FACTS *************************************************************** 
ok: [master1.example.com]

TASK: [Retrieve pcs status] *************************************************** 
ok: [master1.example.com]

TASK: [fail ] ***************************************************************** 
fatal: [master1.example.com] => Failed to template {% if not (pcs_status_output.stdout | validate_pcs_cluster(groups.oo_masters_to_config)) | bool %} True {% else %} False {% endif %}: |failed expects data is a string

FATAL: all hosts have already failed -- aborting

PLAY RECAP ******************************************************************** 
           to retry, use: --limit @/root/restart.retry

localhost                  : ok=12   changed=0    unreachable=0    failed=0   
master1.example.com        : ok=16   changed=0    unreachable=1    failed=0   
master2.example.com        : ok=14   changed=0    unreachable=0    failed=0   
master3.example.com        : ok=14   changed=0    unreachable=0    failed=0 


Expected results:
The rolling restart can be performed from a machine outside of the masters.

Additional info:
Comment 4 Anping Li 2016-01-20 00:54:33 EST
Still failed with the same errors with pull/1211.

TASK: [Retrieve pcs status] *************************************************** 
ok: [master1.example.com]

TASK: [fail ] ***************************************************************** 
fatal: [master1.example.com] => Failed to template {% if not (pcs_status_output.stdout | validate_pcs_cluster(groups.oo_masters_to_config)) | bool %} True {% else %} False {% endif %}: |failed expects data is a string

FATAL: all hosts have already failed -- aborting
Comment 6 Anping Li 2016-01-21 05:26:37 EST
TASK: [Retrieve pcs status] *************************************************** 
<master2.example.com> ESTABLISH CONNECTION FOR USER: root
<master2.example.com> REMOTE_MODULE command pcs status
<master2.example.com> EXEC ssh -C -tt -v -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/root/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 master2.example.com /bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1453371950.59-153437476202709 && echo $HOME/.ansible/tmp/ansible-tmp-1453371950.59-153437476202709'
<master2.example.com> PUT /tmp/tmpxYu489 TO /root/.ansible/tmp/ansible-tmp-1453371950.59-153437476202709/command
<master2.example.com> EXEC ssh -C -tt -v -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/root/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 master2.example.com /bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python /root/.ansible/tmp/ansible-tmp-1453371950.59-153437476202709/command; rm -rf /root/.ansible/tmp/ansible-tmp-1453371950.59-153437476202709/ >/dev/null 2>&1'
ok: [master2.example.com] => {"changed": false, "cmd": ["pcs", "status"], "delta": "0:00:00.887346", "end": "2016-01-21 18:25:51.554467", "rc": 0, "start": "2016-01-21 18:25:50.667121", "stderr": "", "stdout": "Cluster name: openshift_master\nLast updated: Thu Jan 21 18:25:50 2016\t\tLast change: Wed Jan 20 13:15:20 2016 by root via crm_resource on master1.example.com\nStack: corosync\nCurrent DC: master2.example.com (version 1.1.13-a14efad) - partition with quorum\n3 nodes and 2 resources configured\n\nOnline: [ master1.example.com master2.example.com master3.example.com ]\n\nFull list of resources:\n\n Resource Group: atomic-openshift-master\n     virtual-ip\t(ocf::heartbeat:IPaddr2):\tStarted master2.example.com\n     master\t(systemd:atomic-openshift-master):\tStarted master2.example.com\n\nPCSD Status:\n  master1.example.com: Online\n  master2.example.com: Online\n  master3.example.com: Online\n\nDaemon Status:\n  corosync: active/enabled\n  pacemaker: active/enabled\n  pcsd: active/enabled", "stdout_lines": ["Cluster name: openshift_master", "Last updated: Thu Jan 21 18:25:50 2016\t\tLast change: Wed Jan 20 13:15:20 2016 by root via crm_resource on master1.example.com", "Stack: corosync", "Current DC: master2.example.com (version 1.1.13-a14efad) - partition with quorum", "3 nodes and 2 resources configured", "", "Online: [ master1.example.com master2.example.com master3.example.com ]", "", "Full list of resources:", "", " Resource Group: atomic-openshift-master", "     virtual-ip\t(ocf::heartbeat:IPaddr2):\tStarted master2.example.com", "     master\t(systemd:atomic-openshift-master):\tStarted master2.example.com", "", "PCSD Status:", "  master1.example.com: Online", "  master2.example.com: Online", "  master3.example.com: Online", "", "Daemon Status:", "  corosync: active/enabled", "  pacemaker: active/enabled", "  pcsd: active/enabled"], "warnings": []}

TASK: [fail ] ***************************************************************** 
fatal: [master2.example.com] => Failed to template {% if not (pcs_status_output.stdout | validate_pcs_cluster(groups.oo_masters_to_config)) | bool %} True {% else %} False {% endif %}: |failed expects data is a string

FATAL: all hosts have already failed -- aborting

PLAY RECAP ******************************************************************** 
           to retry, use: --limit @/root/restart.retry

localhost                  : ok=12   changed=0    unreachable=0    failed=0   
master1.example.com        : ok=14   changed=0    unreachable=0    failed=0   
master2.example.com        : ok=16   changed=0    unreachable=1    failed=0   
master3.example.com        : ok=14   changed=0    unreachable=0    failed=0
Comment 7 Anping Li 2016-01-21 05:27:43 EST
Andrew, you can still use those machines for debugging.
Comment 8 Andrew Butcher 2016-01-21 11:09:28 EST
Thanks Anping. I think the data type in your case is unicode, so the filter input will now be validated against string or unicode.

Proposed fix: https://github.com/openshift/openshift-ansible/pull/1251
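
For reference, a minimal sketch of the broadened check (assuming the Python 2 / ansible 1.9 environment shown in comment 10; the "Online" matching at the end is illustrative and not the exact code from the pull request):

from ansible import errors

def validate_pcs_cluster(data, masters=None):
    # "pcs status" stdout can arrive as unicode rather than str, so accept
    # both; on Python 2, basestring covers str and unicode.
    if not issubclass(type(data), basestring):
        raise errors.AnsibleFilterError("|failed expects data is a string or unicode")
    # Illustrative only: report whether every master is listed as Online.
    return all('{0}: Online'.format(master) in data for master in (masters or []))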
Comment 9 Anping Li 2016-01-21 22:07 EST
Created attachment 1117086 [details]
rolling restart pcs masters output

I am sure I was using the latest code, but I still get the error 'failed expects data is a string'. It is strange.
# grep -A 15 validate_pcs_cluster /root/openshift-ansible/filter_plugins/openshift_master.py 
    def validate_pcs_cluster(data, masters=None):
        ''' Validates output from "pcs status", ensuring that each master
            provided is online.
            Ex: data = ('...',
                        'PCSD Status:',
                        'master1.example.com: Online',
                        'master2.example.com: Online',
                        'master3.example.com: Online',
                        '...')
                masters = ['master1.example.com',
                           'master2.example.com',
                           'master3.example.com']
               returns True
        '''
        if not issubclass(type(data), basestring):
            raise errors.AnsibleFilterError("|failed expects data is a string or unicode")
--
                "validate_pcs_cluster": self.validate_pcs_cluster}
Comment 10 Anping Li 2016-01-21 22:10:25 EST
The ansible version is as follows.
The command I used is 'ansible-playbook -i /root/config/ose31pacemakerha2 /root/openshift-ansible/playbooks/byo/openshift-master/restart.yml -vvv|tee  pcserror.log'
[root@dhcp-129-213 ~]# rpm -qa|grep ansible
openshift-ansible-lookup-plugins-3.0.35-1.git.0.6a386dd.el7aos.noarch
openshift-ansible-3.0.35-1.git.0.6a386dd.el7aos.noarch
openshift-ansible-roles-3.0.35-1.git.0.6a386dd.el7aos.noarch
ansible-1.9.4-1.el7aos.noarch
openshift-ansible-filter-plugins-3.0.35-1.git.0.6a386dd.el7aos.noarch
openshift-ansible-playbooks-3.0.35-1.git.0.6a386dd.el7aos.noarch
Comment 11 Anping Li 2016-01-21 22:24:47 EST
When I run restart.yml on master1.example.com, I get the following messages. You can log in to this machine and try the command below:
ansible-playbook -i /root/config/ose31pacemakerha2 /root/openshift-ansible/playbooks/byo/openshift-master/restart.yml -vvv|tee  pcserror.log

TASK: [fail ] ***************************************************************** 
<master1.example.com> ESTABLISH CONNECTION FOR USER: root
failed: [master1.example.com] => {"failed": true}
msg: Pacemaker cluster validation failed. One or more nodes are not online.


FATAL: all hosts have already failed -- aborting

PLAY RECAP ******************************************************************** 
           to retry, use: --limit @/root/restart.retry
Comment 12 Andrew Butcher 2016-01-22 13:10:23 EST
Thanks Anping, the output in comment 11 indicates that the cluster was unhealthy prior to the master restart. Checking the output of "pcs status", master1 and master2 are offline, so this message is expected.

[root@master1 ~]# pcs status
Cluster name: openshift_master
Last updated: Fri Jan 22 23:11:40 2016          Last change: Fri Jan 22 23:11:17 2016 by root via crm_resource on master1.example.com
Stack: corosync
Current DC: master1.example.com (version 1.1.13-a14efad) - partition with quorum
3 nodes and 2 resources configured

Online: [ master1.example.com master2.example.com master3.example.com ]

Full list of resources:

 Resource Group: atomic-openshift-master
     virtual-ip (ocf::heartbeat:IPaddr2):       Started master3.example.com
     master     (systemd:atomic-openshift-master):      Started master3.example.com

PCSD Status:
  master1.example.com: Offline <---- master is offline
  master2.example.com: Offline <---- master is offline
  master3.example.com: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: failed/enabled <---- pcsd is unhealthy


By restarting pcsd on master1 and master2 I can restore the cluster.
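A minimal sketch of that restart, assuming pcsd is the systemd-managed service on these RHEL 7 hosts:

[root@master1 ~]# systemctl restart pcsd
[root@master2 ~]# systemctl restart pcsd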


[root@master1 ~]# pcs status
Cluster name: openshift_master
Last updated: Fri Jan 22 23:12:09 2016          Last change: Fri Jan 22 23:11:17 2016 by root via crm_resource on master1.example.com
Stack: corosync
Current DC: master1.example.com (version 1.1.13-a14efad) - partition with quorum
3 nodes and 2 resources configured

Online: [ master1.example.com master2.example.com master3.example.com ]

Full list of resources:

 Resource Group: atomic-openshift-master
     virtual-ip (ocf::heartbeat:IPaddr2):       Started master3.example.com
     master     (systemd:atomic-openshift-master):      Started master3.example.com

PCSD Status:
  master1.example.com: Online
  master2.example.com: Online
  master3.example.com: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


I encountered an issue with the is-active service check which I've addressed in https://github.com/openshift/openshift-ansible/pull/1266.

Restarting the cluster is then successful. Logs are available here: http://file.rdu.redhat.com/abutcher/ansible.2.bz1298803.txt
Comment 13 Anping Li 2016-01-25 05:01:15 EST
That only fixes the issue in Comment 10.
If ansible is not run on the masters, the issue described in comment 9 still exists.
Comment 14 Andrew Butcher 2016-01-25 09:38:01 EST
Anping, are you testing with the master branch of openshift-ansible?

I would expect the error message in Comment 9 to be "|failed expects data is a string or unicode" with the latest changes.
Comment 15 Anping Li 2016-01-25 21:31:00 EST
Andrew, I am testing with PR 1266. I can find the expected message in the filter plugin; however, the playbook output is different.

grep -r 'failed expects data' /root/openshift-ansible/
/root/openshift-ansible/filter_plugins/openshift_master.py:            raise errors.AnsibleFilterError("|failed expects data is a string or unicode")

ansible-playbook -i /root/config/ose31pacemakerha /root/openshift-ansible/playbooks/byo/openshift-master/restart.yml
PLAY [Populate config host groups] ********************************************
<----snip---->
<----snip---->
<----snip---->

TASK: [fail ] ***************************************************************** 
fatal: [master1.example.com] => Failed to template {% if not (pcs_status_output.stdout | validate_pcs_cluster(groups.oo_masters_to_config)) | bool %} True {% else %} False {% endif %}: |failed expects data is a string
Comment 16 Andrew Butcher 2016-01-25 21:35:11 EST
Thanks for confirming, Anping. I'll see what I can figure out.
Comment 18 Anping Li 2016-02-25 03:02:04 EST
Using the package that includes the fix, the rolling restart works well, so I am moving the bug to VERIFIED.
