Description of problem:
If ansible is run from a machine outside of the masters, the "Retrieve pcs status" task will fail.

Version-Release number of selected component (if applicable):
atomic-openshift-utils-3.0.32

How reproducible:
Always

Steps to Reproduce:
1. Install atomic-openshift-utils on a machine outside of the masters
2. ansible-playbook -i /root/config/ose31nativeha /root/openshift-ansible/playbooks/byo/openshift-master/restart.yml

Actual results:

TASK: [Evaluate oo_current_masters] *******************************************
skipping: [localhost] => (item=master1.example.com)
skipping: [localhost] => (item=master2.example.com)
skipping: [localhost] => (item=master3.example.com)

PLAY [Validate pacemaker cluster] *********************************************

GATHERING FACTS ***************************************************************
ok: [master1.example.com]

TASK: [Retrieve pcs status] ***************************************************
ok: [master1.example.com]

TASK: [fail ] *****************************************************************
fatal: [master1.example.com] => Failed to template {% if not (pcs_status_output.stdout | validate_pcs_cluster(groups.oo_masters_to_config)) | bool %} True {% else %} False {% endif %}: |failed expects data is a string

FATAL: all hosts have already failed -- aborting

PLAY RECAP ********************************************************************
           to retry, use: --limit @/root/restart.retry

localhost                  : ok=12   changed=0    unreachable=0    failed=0
master1.example.com        : ok=16   changed=0    unreachable=1    failed=0
master2.example.com        : ok=14   changed=0    unreachable=0    failed=0
master3.example.com        : ok=14   changed=0    unreachable=0    failed=0

Expected results:
The rolling restart can be performed from outside of the masters.

Additional info:
Still failing with the same errors after applying pull/1211:

TASK: [Retrieve pcs status] ***************************************************
ok: [master1.example.com]

TASK: [fail ] *****************************************************************
fatal: [master1.example.com] => Failed to template {% if not (pcs_status_output.stdout | validate_pcs_cluster(groups.oo_masters_to_config)) | bool %} True {% else %} False {% endif %}: |failed expects data is a string

FATAL: all hosts have already failed -- aborting
TASK: [Retrieve pcs status] ***************************************************
<master2.example.com> ESTABLISH CONNECTION FOR USER: root
<master2.example.com> REMOTE_MODULE command pcs status
<master2.example.com> EXEC ssh -C -tt -v -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/root/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 master2.example.com /bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1453371950.59-153437476202709 && echo $HOME/.ansible/tmp/ansible-tmp-1453371950.59-153437476202709'
<master2.example.com> PUT /tmp/tmpxYu489 TO /root/.ansible/tmp/ansible-tmp-1453371950.59-153437476202709/command
<master2.example.com> EXEC ssh -C -tt -v -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/root/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 master2.example.com /bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python /root/.ansible/tmp/ansible-tmp-1453371950.59-153437476202709/command; rm -rf /root/.ansible/tmp/ansible-tmp-1453371950.59-153437476202709/ >/dev/null 2>&1'
ok: [master2.example.com] => {"changed": false, "cmd": ["pcs", "status"], "delta": "0:00:00.887346", "end": "2016-01-21 18:25:51.554467", "rc": 0, "start": "2016-01-21 18:25:50.667121", "stderr": "", "stdout": "Cluster name: openshift_master\nLast updated: Thu Jan 21 18:25:50 2016\t\tLast change: Wed Jan 20 13:15:20 2016 by root via crm_resource on master1.example.com\nStack: corosync\nCurrent DC: master2.example.com (version 1.1.13-a14efad) - partition with quorum\n3 nodes and 2 resources configured\n\nOnline: [ master1.example.com master2.example.com master3.example.com ]\n\nFull list of resources:\n\n Resource Group: atomic-openshift-master\n virtual-ip\t(ocf::heartbeat:IPaddr2):\tStarted master2.example.com\n master\t(systemd:atomic-openshift-master):\tStarted master2.example.com\n\nPCSD Status:\n master1.example.com: Online\n master2.example.com: Online\n master3.example.com: Online\n\nDaemon Status:\n corosync: active/enabled\n pacemaker: active/enabled\n pcsd: active/enabled", "stdout_lines": ["Cluster name: openshift_master", "Last updated: Thu Jan 21 18:25:50 2016\t\tLast change: Wed Jan 20 13:15:20 2016 by root via crm_resource on master1.example.com", "Stack: corosync", "Current DC: master2.example.com (version 1.1.13-a14efad) - partition with quorum", "3 nodes and 2 resources configured", "", "Online: [ master1.example.com master2.example.com master3.example.com ]", "", "Full list of resources:", "", " Resource Group: atomic-openshift-master", " virtual-ip\t(ocf::heartbeat:IPaddr2):\tStarted master2.example.com", " master\t(systemd:atomic-openshift-master):\tStarted master2.example.com", "", "PCSD Status:", " master1.example.com: Online", " master2.example.com: Online", " master3.example.com: Online", "", "Daemon Status:", " corosync: active/enabled", " pacemaker: active/enabled", " pcsd: active/enabled"], "warnings": []}

TASK: [fail ] *****************************************************************
fatal: [master2.example.com] => Failed to template {% if not (pcs_status_output.stdout | validate_pcs_cluster(groups.oo_masters_to_config)) | bool %} True {% else %} False {% endif %}: |failed expects data is a string

FATAL: all hosts have already failed -- aborting

PLAY RECAP ********************************************************************
           to retry, use: --limit @/root/restart.retry

localhost                  : ok=12   changed=0    unreachable=0    failed=0
master1.example.com        : ok=14   changed=0    unreachable=0    failed=0
master2.example.com        : ok=16   changed=0    unreachable=1    failed=0
master3.example.com        : ok=14   changed=0    unreachable=0    failed=0
Andrew, you can still use those machines for debugging.
Thanks Anping. I think the data type in your case is unicode, so the filter input will now be validated as either string or unicode. Proposed fix: https://github.com/openshift/openshift-ansible/pull/1251
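For reference, a minimal sketch of the type-check change, assuming Python 2 semantics (which ansible 1.9 runs under): basestring is the common base class of str and unicode, so checking against it accepts stdout captured as either type. This is illustrative only, not the exact PR diff, and check_filter_input is a hypothetical helper name:

from ansible import errors

def check_filter_input(data):
    # Old check: issubclass(type(data), str) -- rejects unicode stdout.
    # New check: basestring covers both str and unicode in Python 2.
    if not issubclass(type(data), basestring):
        raise errors.AnsibleFilterError(
            "|failed expects data is a string or unicode")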
Created attachment 1117086 [details]
rolling restart pcs masters output

I am sure I was using the latest code, but I still get the error 'failed expects data is a string'. It is strange.

# grep -A 15 validate_pcs_cluster /root/openshift-ansible/filter_plugins/openshift_master.py
def validate_pcs_cluster(data, masters=None):
    ''' Validates output from "pcs status", ensuring that each master
        provided is online.
        Ex: data = ('...',
                    'PCSD Status:',
                    'master1.example.com: Online',
                    'master2.example.com: Online',
                    'master3.example.com: Online',
                    '...')
            masters = ['master1.example.com',
                       'master2.example.com',
                       'master3.example.com']
        returns True
    '''
    if not issubclass(type(data), basestring):
        raise errors.AnsibleFilterError("|failed expects data is a string or unicode")
--
            "validate_pcs_cluster": self.validate_pcs_cluster}
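For reference, the matching step that the docstring describes can be exercised standalone. A hypothetical sketch (masters_online is an illustrative name, not the repo's function; the shipped filter may implement the comparison differently):

def masters_online(pcs_stdout, masters):
    # A master counts as online when pcsd reports "<host>: Online".
    return all("{0}: Online".format(m) in pcs_stdout for m in masters)

pcs_stdout = "\n".join([
    "PCSD Status:",
    "  master1.example.com: Online",
    "  master2.example.com: Online",
    "  master3.example.com: Online",
])
masters = ["master1.example.com", "master2.example.com", "master3.example.com"]
print(masters_online(pcs_stdout, masters))  # True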
The ansible version is as follows. The command I used is 'ansible-playbook -i /root/config/ose31pacemakerha2 /root/openshift-ansible/playbooks/byo/openshift-master/restart.yml -vvv|tee pcserror.log'

[root@dhcp-129-213 ~]# rpm -qa|grep ansible
openshift-ansible-lookup-plugins-3.0.35-1.git.0.6a386dd.el7aos.noarch
openshift-ansible-3.0.35-1.git.0.6a386dd.el7aos.noarch
openshift-ansible-roles-3.0.35-1.git.0.6a386dd.el7aos.noarch
ansible-1.9.4-1.el7aos.noarch
openshift-ansible-filter-plugins-3.0.35-1.git.0.6a386dd.el7aos.noarch
openshift-ansible-playbooks-3.0.35-1.git.0.6a386dd.el7aos.noarch
When I run restart.yml on master1.example.com, I get the following messages. You can log in to this machine and try the command below:

ansible-playbook -i /root/config/ose31pacemakerha2 /root/openshift-ansible/playbooks/byo/openshift-master/restart.yml -vvv|tee pcserror.log

TASK: [fail ] *****************************************************************
<master1.example.com> ESTABLISH CONNECTION FOR USER: root
failed: [master1.example.com] => {"failed": true}
msg: Pacemaker cluster validation failed. One or more nodes are not online.

FATAL: all hosts have already failed -- aborting

PLAY RECAP ********************************************************************
           to retry, use: --limit @/root/restart.retry
Thanks Anping. The output in comment 11 indicates that the cluster was unhealthy prior to the master restart. Checking the output of "pcs status", master1 and master2 are offline, so this message is expected.

[root@master1 ~]# pcs status
Cluster name: openshift_master
Last updated: Fri Jan 22 23:11:40 2016    Last change: Fri Jan 22 23:11:17 2016 by root via crm_resource on master1.example.com
Stack: corosync
Current DC: master1.example.com (version 1.1.13-a14efad) - partition with quorum
3 nodes and 2 resources configured

Online: [ master1.example.com master2.example.com master3.example.com ]

Full list of resources:

 Resource Group: atomic-openshift-master
     virtual-ip    (ocf::heartbeat:IPaddr2):    Started master3.example.com
     master    (systemd:atomic-openshift-master):    Started master3.example.com

PCSD Status:
  master1.example.com: Offline    <---- master is offline
  master2.example.com: Offline    <---- master is offline
  master3.example.com: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: failed/enabled    <---- pcsd is unhealthy

By restarting pcsd on master1 and master2 I was able to restore the cluster.

[root@master1 ~]# pcs status
Cluster name: openshift_master
Last updated: Fri Jan 22 23:12:09 2016    Last change: Fri Jan 22 23:11:17 2016 by root via crm_resource on master1.example.com
Stack: corosync
Current DC: master1.example.com (version 1.1.13-a14efad) - partition with quorum
3 nodes and 2 resources configured

Online: [ master1.example.com master2.example.com master3.example.com ]

Full list of resources:

 Resource Group: atomic-openshift-master
     virtual-ip    (ocf::heartbeat:IPaddr2):    Started master3.example.com
     master    (systemd:atomic-openshift-master):    Started master3.example.com

PCSD Status:
  master1.example.com: Online
  master2.example.com: Online
  master3.example.com: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

I also encountered an issue with the is-active service check, which I've addressed in https://github.com/openshift/openshift-ansible/pull/1266. Restarting the cluster is then successful. Logs are available here: http://file.rdu.redhat.com/abutcher/ansible.2.bz1298803.txt
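As an aside, an is-active service check of this kind boils down to asking systemd whether a unit such as pcsd is running. A minimal sketch, assuming a plain systemctl call (service_is_active is an illustrative helper, not the code touched by the PR above):

import subprocess

def service_is_active(name):
    # "systemctl is-active <unit>" prints the state and exits 0 only
    # when the unit is active.
    proc = subprocess.Popen(["systemctl", "is-active", name],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, _ = proc.communicate()
    return proc.returncode == 0 and out.strip() == b"active"

print(service_is_active("pcsd"))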
This only fixes the issue in comment 10. If ansible is not run on the masters, the issue described in comment 9 still exists.
Anping, are you testing with the master branch of openshift-ansible? I would expect the error message in Comment 9 to be "|failed expects data is a string or unicode" with the latest changes.
Andrew, I am testing with PR 1266. I can find the expected message in the source; however, the error in the output is different.

grep -r 'failed expects data' /root/openshift-ansible/
/root/openshift-ansible/filter_plugins/openshift_master.py:        raise errors.AnsibleFilterError("|failed expects data is a string or unicode")

ansible-playbook -i /root/config/ose31pacemakerha /root/openshift-ansible/playbooks/byo/openshift-master/restart.yml

PLAY [Populate config host groups] ********************************************
<----snip---->
<----snip---->
<----snip---->

TASK: [fail ] *****************************************************************
fatal: [master1.example.com] => Failed to template {% if not (pcs_status_output.stdout | validate_pcs_cluster(groups.oo_masters_to_config)) | bool %} True {% else %} False {% endif %}: |failed expects data is a string
Thanks for confirming, Anping. I'll see what I can figure out.
Using the package that includes the fix, the rolling restart works well, so I am moving the bug to VERIFIED.