Bug 1298803 - Retrieve pcs status failed during rolling restart when ansible is not running on master
Status: CLOSED CURRENTRELEASE
Product: OpenShift Container Platform
Classification: Red Hat
Component: Upgrade
Version: 3.1.0
Priority: medium  Severity: low
Assigned To: Andrew Butcher
QA Contact: Anping Li
Reported: 2016-01-15 01:29 EST by Anping Li
Modified: 2016-09-07 17:11 EDT
CC List: 6 users

Fixed In Version: openshift-ansible-3.0.38-1.git.0.66ba7e2.el7aos
Doc Type: Bug Fix
Last Closed: 2016-09-07 17:11:08 EDT
Type: Bug


Attachments
rolling restart pcs masters output (98.67 KB, text/plain)
2016-01-21 22:07 EST, Anping Li

Description Anping Li 2016-01-15 01:29:04 EST
Description of problem:
If ansible is run from a machine outside of the masters, the "Retrieve pcs status" task will fail.


Version-Release number of selected component (if applicable):
atomic-openshift-utils-3.0.32

How reproducible:
always

Steps to Reproduce:
1. Install atomic-openshift-utils on a machine that is not one of the masters
2. ansible-playbook -i /root/config/ose31nativeha /root/openshift-ansible/playbooks/byo/openshift-master/restart.yml

Actual results:
TASK: [Evaluate oo_current_masters] ******************************************* 
skipping: [localhost] => (item=master1.example.com)
skipping: [localhost] => (item=master2.example.com)
skipping: [localhost] => (item=master3.example.com)

PLAY [Validate pacemaker cluster] ********************************************* 

GATHERING FACTS *************************************************************** 
ok: [master1.example.com]

TASK: [Retrieve pcs status] *************************************************** 
ok: [master1.example.com]

TASK: [fail ] ***************************************************************** 
fatal: [master1.example.com] => Failed to template {% if not (pcs_status_output.stdout | validate_pcs_cluster(groups.oo_masters_to_config)) | bool %} True {% else %} False {% endif %}: |failed expects data is a string

FATAL: all hosts have already failed -- aborting

PLAY RECAP ******************************************************************** 
           to retry, use: --limit @/root/restart.retry

localhost                  : ok=12   changed=0    unreachable=0    failed=0   
master1.example.com        : ok=16   changed=0    unreachable=1    failed=0   
master2.example.com        : ok=14   changed=0    unreachable=0    failed=0   
master3.example.com        : ok=14   changed=0    unreachable=0    failed=0 


Expected results:
The rolling restart can be performed from a machine outside of the masters.

Additional info:
Comment 4 Anping Li 2016-01-20 00:54:33 EST
Still failed with the same errors with pull/1211.

TASK: [Retrieve pcs status] *************************************************** 
ok: [master1.example.com]

TASK: [fail ] ***************************************************************** 
fatal: [master1.example.com] => Failed to template {% if not (pcs_status_output.stdout | validate_pcs_cluster(groups.oo_masters_to_config)) | bool %} True {% else %} False {% endif %}: |failed expects data is a string

FATAL: all hosts have already failed -- aborting
Comment 6 Anping Li 2016-01-21 05:26:37 EST
TASK: [Retrieve pcs status] *************************************************** 
<master2.example.com> ESTABLISH CONNECTION FOR USER: root
<master2.example.com> REMOTE_MODULE command pcs status
<master2.example.com> EXEC ssh -C -tt -v -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/root/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 master2.example.com /bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1453371950.59-153437476202709 && echo $HOME/.ansible/tmp/ansible-tmp-1453371950.59-153437476202709'
<master2.example.com> PUT /tmp/tmpxYu489 TO /root/.ansible/tmp/ansible-tmp-1453371950.59-153437476202709/command
<master2.example.com> EXEC ssh -C -tt -v -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/root/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 master2.example.com /bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python /root/.ansible/tmp/ansible-tmp-1453371950.59-153437476202709/command; rm -rf /root/.ansible/tmp/ansible-tmp-1453371950.59-153437476202709/ >/dev/null 2>&1'
ok: [master2.example.com] => {"changed": false, "cmd": ["pcs", "status"], "delta": "0:00:00.887346", "end": "2016-01-21 18:25:51.554467", "rc": 0, "start": "2016-01-21 18:25:50.667121", "stderr": "", "stdout": "Cluster name: openshift_master\nLast updated: Thu Jan 21 18:25:50 2016\t\tLast change: Wed Jan 20 13:15:20 2016 by root via crm_resource on master1.example.com\nStack: corosync\nCurrent DC: master2.example.com (version 1.1.13-a14efad) - partition with quorum\n3 nodes and 2 resources configured\n\nOnline: [ master1.example.com master2.example.com master3.example.com ]\n\nFull list of resources:\n\n Resource Group: atomic-openshift-master\n     virtual-ip\t(ocf::heartbeat:IPaddr2):\tStarted master2.example.com\n     master\t(systemd:atomic-openshift-master):\tStarted master2.example.com\n\nPCSD Status:\n  master1.example.com: Online\n  master2.example.com: Online\n  master3.example.com: Online\n\nDaemon Status:\n  corosync: active/enabled\n  pacemaker: active/enabled\n  pcsd: active/enabled", "stdout_lines": ["Cluster name: openshift_master", "Last updated: Thu Jan 21 18:25:50 2016\t\tLast change: Wed Jan 20 13:15:20 2016 by root via crm_resource on master1.example.com", "Stack: corosync", "Current DC: master2.example.com (version 1.1.13-a14efad) - partition with quorum", "3 nodes and 2 resources configured", "", "Online: [ master1.example.com master2.example.com master3.example.com ]", "", "Full list of resources:", "", " Resource Group: atomic-openshift-master", "     virtual-ip\t(ocf::heartbeat:IPaddr2):\tStarted master2.example.com", "     master\t(systemd:atomic-openshift-master):\tStarted master2.example.com", "", "PCSD Status:", "  master1.example.com: Online", "  master2.example.com: Online", "  master3.example.com: Online", "", "Daemon Status:", "  corosync: active/enabled", "  pacemaker: active/enabled", "  pcsd: active/enabled"], "warnings": []}

TASK: [fail ] ***************************************************************** 
fatal: [master2.example.com] => Failed to template {% if not (pcs_status_output.stdout | validate_pcs_cluster(groups.oo_masters_to_config)) | bool %} True {% else %} False {% endif %}: |failed expects data is a string

FATAL: all hosts have already failed -- aborting

PLAY RECAP ******************************************************************** 
           to retry, use: --limit @/root/restart.retry

localhost                  : ok=12   changed=0    unreachable=0    failed=0   
master1.example.com        : ok=14   changed=0    unreachable=0    failed=0   
master2.example.com        : ok=16   changed=0    unreachable=1    failed=0   
master3.example.com        : ok=14   changed=0    unreachable=0    failed=0
Comment 7 Anping Li 2016-01-21 05:27:43 EST
Andrew, you can still use those machines for debugging.
Comment 8 Andrew Butcher 2016-01-21 11:09:28 EST
Thanks Anping. I think the data type in your case is unicode, so the filter input will now be validated against string or unicode.

Proposed fix: https://github.com/openshift/openshift-ansible/pull/1251
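
For reference, a minimal sketch of the broadened check (assuming the Python 2 / ansible 1.9 environment shown in comment 10; the "Online" matching at the end is illustrative and not the exact code from the pull request):

from ansible import errors

def validate_pcs_cluster(data, masters=None):
    # "pcs status" stdout can arrive as unicode rather than str, so accept
    # both; on Python 2, basestring covers str and unicode.
    if not issubclass(type(data), basestring):
        raise errors.AnsibleFilterError("|failed expects data is a string or unicode")
    # Illustrative only: report whether every master is listed as Online.
    return all('{0}: Online'.format(master) in data for master in (masters or []))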
Comment 9 Anping Li 2016-01-21 22:07 EST
Created attachment 1117086 [details]
rolling restart pcs masters output

I am sure I was using the latest code, but I still get the error 'failed expects data is a string'. It is strange.
# grep -A 15 validate_pcs_cluster /root/openshift-ansible/filter_plugins/openshift_master.py 
    def validate_pcs_cluster(data, masters=None):
        ''' Validates output from "pcs status", ensuring that each master
            provided is online.
            Ex: data = ('...',
                        'PCSD Status:',
                        'master1.example.com: Online',
                        'master2.example.com: Online',
                        'master3.example.com: Online',
                        '...')
                masters = ['master1.example.com',
                           'master2.example.com',
                           'master3.example.com']
               returns True
        '''
        if not issubclass(type(data), basestring):
            raise errors.AnsibleFilterError("|failed expects data is a string or unicode")
--
                "validate_pcs_cluster": self.validate_pcs_cluster}
Comment 10 Anping Li 2016-01-21 22:10:25 EST
The ansible version is as follows.
The command I used is 'ansible-playbook -i /root/config/ose31pacemakerha2 /root/openshift-ansible/playbooks/byo/openshift-master/restart.yml -vvv|tee  pcserror.log'
[root@dhcp-129-213 ~]# rpm -qa|grep ansible
openshift-ansible-lookup-plugins-3.0.35-1.git.0.6a386dd.el7aos.noarch
openshift-ansible-3.0.35-1.git.0.6a386dd.el7aos.noarch
openshift-ansible-roles-3.0.35-1.git.0.6a386dd.el7aos.noarch
ansible-1.9.4-1.el7aos.noarch
openshift-ansible-filter-plugins-3.0.35-1.git.0.6a386dd.el7aos.noarch
openshift-ansible-playbooks-3.0.35-1.git.0.6a386dd.el7aos.noarch
Comment 11 Anping Li 2016-01-21 22:24:47 EST
When I run restart.yml on master1.example.com, I get the following messages. You can log in to this machine and try the command below:
ansible-playbook -i /root/config/ose31pacemakerha2 /root/openshift-ansible/playbooks/byo/openshift-master/restart.yml -vvv|tee  pcserror.log

TASK: [fail ] ***************************************************************** 
<master1.example.com> ESTABLISH CONNECTION FOR USER: root
failed: [master1.example.com] => {"failed": true}
msg: Pacemaker cluster validation failed. One or more nodes are not online.


FATAL: all hosts have already failed -- aborting

PLAY RECAP ******************************************************************** 
           to retry, use: --limit @/root/restart.retry
Comment 12 Andrew Butcher 2016-01-22 13:10:23 EST
Thanks Anping, the output in comment 11 indicates that the cluster was unhealthy prior to the master restart. Checking the output of "pcs status", master1 and master2 are offline, so this message is expected.

[root@master1 ~]# pcs status
Cluster name: openshift_master
Last updated: Fri Jan 22 23:11:40 2016          Last change: Fri Jan 22 23:11:17 2016 by root via crm_resource on master1.example.com
Stack: corosync
Current DC: master1.example.com (version 1.1.13-a14efad) - partition with quorum
3 nodes and 2 resources configured

Online: [ master1.example.com master2.example.com master3.example.com ]

Full list of resources:

 Resource Group: atomic-openshift-master
     virtual-ip (ocf::heartbeat:IPaddr2):       Started master3.example.com
     master     (systemd:atomic-openshift-master):      Started master3.example.com

PCSD Status:
  master1.example.com: Offline <---- master is offline
  master2.example.com: Offline <---- master is offline
  master3.example.com: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: failed/enabled <---- pcsd is unhealthy


By restarting pcsd on master1 and master2 I can restore the cluster.
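A minimal sketch of that restart, assuming pcsd is the systemd-managed service on these RHEL 7 hosts:

[root@master1 ~]# systemctl restart pcsd
[root@master2 ~]# systemctl restart pcsd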


[root@master1 ~]# pcs status
Cluster name: openshift_master
Last updated: Fri Jan 22 23:12:09 2016          Last change: Fri Jan 22 23:11:17 2016 by root via crm_resource on master1.example.com
Stack: corosync
Current DC: master1.example.com (version 1.1.13-a14efad) - partition with quorum
3 nodes and 2 resources configured

Online: [ master1.example.com master2.example.com master3.example.com ]

Full list of resources:

 Resource Group: atomic-openshift-master
     virtual-ip (ocf::heartbeat:IPaddr2):       Started master3.example.com
     master     (systemd:atomic-openshift-master):      Started master3.example.com

PCSD Status:
  master1.example.com: Online
  master2.example.com: Online
  master3.example.com: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


I encountered an issue with the is-active service check which I've addressed in https://github.com/openshift/openshift-ansible/pull/1266.

Restarting the cluster is then successful. Logs are available here: http://file.rdu.redhat.com/abutcher/ansible.2.bz1298803.txt
Comment 13 Anping Li 2016-01-25 05:01:15 EST
That only fixes the issue in Comment 10.
If ansible is not run on the masters, the issue described in comment 9 still exists.
Comment 14 Andrew Butcher 2016-01-25 09:38:01 EST
Anping, are you testing with the master branch of openshift-ansible?

I would expect the error message in Comment 9 to be "|failed expects data is a string or unicode" with the latest changes.
Comment 15 Anping Li 2016-01-25 21:31:00 EST
Andrew, I am testing with PR 1266. I can find the expected message in the filter plugin; however, the playbook output is different.

grep -r 'failed expects data' /root/openshift-ansible/
/root/openshift-ansible/filter_plugins/openshift_master.py:            raise errors.AnsibleFilterError("|failed expects data is a string or unicode")

ansible-playbook -i /root/config/ose31pacemakerha /root/openshift-ansible/playbooks/byo/openshift-master/restart.yml
PLAY [Populate config host groups] ********************************************
<----snip---->
<----snip---->
<----snip---->

TASK: [fail ] ***************************************************************** 
fatal: [master1.example.com] => Failed to template {% if not (pcs_status_output.stdout | validate_pcs_cluster(groups.oo_masters_to_config)) | bool %} True {% else %} False {% endif %}: |failed expects data is a string
Comment 16 Andrew Butcher 2016-01-25 21:35:11 EST
Thanks for confirming, Anping. I'll see what I can figure out.
Comment 18 Anping Li 2016-02-25 03:02:04 EST
Using the package that includes the fix, the rolling restart works well, so I am moving the bug to VERIFIED.
