Bug 1631162
| Summary: | 3.11 Upgrade fails on "Check for GlusterFS cluster health" |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Installer |
| Installer sub component: | openshift-ansible |
| Reporter: | Jaspreet Kaur <jkaur> |
| Assignee: | Russell Teague <rteague> |
| QA Contact: | Johnny Liu <jialiu> |
| Status: | CLOSED DEFERRED |
| Severity: | medium |
| Priority: | medium |
| CC: | aos-bugs, dmoessne, jarrpa, jkaur, jokerman, mmccomas, mnoguera, mtaru, roysjosh, sdodson, wmeng |
| Version: | 3.11.0 |
| Target Milestone: | --- |
| Target Release: | 3.11.z |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Doc Type: | If docs needed, set a value |
| Last Closed: | 2019-08-19 15:53:33 UTC |
| Type: | Bug |
This is a safety check to ensure that Gluster has fully healed before we remove an additional host from the cluster.

Message: volume heketidbstorage is not ready

Why is that volume not ready?

I am running into the very same issue when trying to upgrade OCP 3.9 to 3.10 with OCS 3.10 already running on it (master and infra nodes are already upgraded; the below happens when dealing with the storage nodes):
# rpm -q openshift-ansible
openshift-ansible-3.10.47-1.git.0.95bc2d2.el7_5.noarch
#
# rpm -q ansible
ansible-2.4.6.0-1.el7ae.noarch
#
# ansible --version
ansible 2.4.6.0
config file = /etc/ansible/ansible.cfg
configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python2.7/site-packages/ansible
executable location = /usr/bin/ansible
python version = 2.7.5 (default, May 31 2018, 09:41:32) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]
#
FAILED - RETRYING: Check for GlusterFS cluster health (6 retries left).
FAILED - RETRYING: Check for GlusterFS cluster health (5 retries left).
FAILED - RETRYING: Check for GlusterFS cluster health (4 retries left).
FAILED - RETRYING: Check for GlusterFS cluster health (3 retries left).
FAILED - RETRYING: Check for GlusterFS cluster health (2 retries left).
FAILED - RETRYING: Check for GlusterFS cluster health (1 retries left).
fatal: [inf152.example.com -> inf152.example.com]: FAILED! => {"attempts": 120, "changed": false, "failed": true, "msg": "volume heketidbstorage is not ready", "state": "unknown"}
to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade_nodes.retry
PLAY RECAP ***********************************************************************************************************************************************************************************************************************************
inf152.example.com : ok=141 changed=5 unreachable=0 failed=1
inf153.example.com : ok=53 changed=5 unreachable=0 failed=0
inf154.example.com : ok=53 changed=5 unreachable=0 failed=0
inf155.example.com : ok=68 changed=10 unreachable=0 failed=0
inf156.example.com : ok=69 changed=11 unreachable=0 failed=0
inf157.example.com : ok=68 changed=10 unreachable=0 failed=0
inf158.example.com : ok=68 changed=10 unreachable=0 failed=0
inf159.example.com : ok=68 changed=11 unreachable=0 failed=0
inf160.example.com : ok=68 changed=10 unreachable=0 failed=0
inf161.example.com : ok=68 changed=10 unreachable=0 failed=0
inf162.example.com : ok=68 changed=10 unreachable=0 failed=0
inf163.example.com : ok=68 changed=10 unreachable=0 failed=0
inf164.example.com : ok=68 changed=10 unreachable=0 failed=0
inf165.example.com : ok=68 changed=10 unreachable=0 failed=0
inf166.example.com : ok=68 changed=10 unreachable=0 failed=0
inf167.example.com : ok=68 changed=10 unreachable=0 failed=0
inf168.example.com : ok=68 changed=10 unreachable=0 failed=0
inf169.example.com : ok=68 changed=10 unreachable=0 failed=0
localhost : ok=13 changed=0 unreachable=0 failed=0
Failure summary:
1. Hosts: inf152.example.com
Play: Verify upgrade can proceed on first master
Task: Check for GlusterFS cluster health
Message: volume heketidbstorage is not ready
I followed:
https://docs.openshift.com/container-platform/3.10/upgrading/automated_upgrades.html#special-considerations-for-glusterfs
- which asks to remove the daemonset label on one node so that the OCS pod is terminated
- add the type=upgrade label
- # ansible-playbook -i /etc/ansible/hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade_nodes.yml -e openshift_upgrade_nodes_label="type=upgrade"

which ends up in the scenario above.

However, if I put the label back in place, get the OCS pods up again, and verify everything is in sync, the node upgrade succeeds:
Before the failed upgrade (note that those two nodes are part of different OCS Gluster clusters: registry and apps):
# oc get nodes -l type=upgrade
NAME STATUS ROLES AGE VERSION
inf157.example.com Ready compute 1d v1.9.1+a0ce1bc657
inf158.example.com Ready compute 1d v1.9.1+a0ce1bc657
#
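The "verify everything is in sync" step can be made concrete by counting pending heal entries per brick. The snippet below runs against a sample of `gluster volume heal heketidbstorage info` output (the sample text, pod name, and brick paths are illustrative assumptions, not taken from this cluster); in a live environment the text would be captured via `oc rsh` into a running glusterfs-storage pod:

```shell
# Sample `gluster volume heal heketidbstorage info` output (illustrative).
# On a live cluster, capture the real output with something like:
#   oc rsh -n glusterfs <glusterfs-storage-pod> gluster volume heal heketidbstorage info
heal_info='Brick inf157.example.com:/var/lib/heketi/mounts/vg_x/brick_y/brick
Status: Connected
Number of entries: 0

Brick inf158.example.com:/var/lib/heketi/mounts/vg_x/brick_y/brick
Status: Connected
Number of entries: 0'

# Sum the pending heal entries across all bricks; a total of 0 means the
# volume is in sync and it should be safe to proceed to the next node.
pending=$(printf '%s\n' "$heal_info" | awk '/Number of entries:/ {sum += $NF} END {print sum + 0}')
echo "pending=$pending"
```

Repeating this for every volume, not just heketidbstorage, mirrors what the playbook's health check is guarding against.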
--> upgrade
PLAY RECAP ***********************************************************************************************************************************************************************************************************************************
inf152.example.com : ok=164 changed=5 unreachable=0 failed=0
inf153.example.com : ok=70 changed=5 unreachable=0 failed=0
inf154.example.com : ok=70 changed=5 unreachable=0 failed=0
inf155.example.com : ok=78 changed=10 unreachable=0 failed=0
inf156.example.com : ok=79 changed=10 unreachable=0 failed=0
inf157.example.com : ok=149 changed=45 unreachable=0 failed=0
inf158.example.com : ok=149 changed=45 unreachable=0 failed=0
inf159.example.com : ok=78 changed=10 unreachable=0 failed=0
inf160.example.com : ok=78 changed=10 unreachable=0 failed=0
inf161.example.com : ok=78 changed=10 unreachable=0 failed=0
inf162.example.com : ok=78 changed=10 unreachable=0 failed=0
inf163.example.com : ok=78 changed=10 unreachable=0 failed=0
inf164.example.com : ok=78 changed=10 unreachable=0 failed=0
inf165.example.com : ok=78 changed=10 unreachable=0 failed=0
inf166.example.com : ok=78 changed=10 unreachable=0 failed=0
inf167.example.com : ok=78 changed=10 unreachable=0 failed=0
inf168.example.com : ok=78 changed=10 unreachable=0 failed=0
inf169.example.com : ok=78 changed=10 unreachable=0 failed=0
localhost : ok=13 changed=0 unreachable=0 failed=0
#
# oc get nodes -l type=upgrade
NAME STATUS ROLES AGE VERSION
inf157.example.com Ready compute,infra 1d v1.10.0+b81c8f8
inf158.example.com Ready compute,infra 1d v1.10.0+b81c8f8
#
--> then this succeeds, so I wonder whether the steps given in the doc:
https://docs.openshift.com/container-platform/3.10/upgrading/automated_upgrades.html#special-considerations-for-glusterfs
(Special Considerations When Using Containerized GlusterFS)
are still valid for the OCP 3.9 -> 3.10 upgrade, and whether the playbooks have changed such that the upgrade procedure should now be different as well?
Oh.... yes, those instructions are out of date. With the new GlusterFS health checks in the upgrade playbooks, it is no longer a requirement to remove the DaemonSet label from the GlusterFS nodes. This is a doc bug that should be relatively easy to fix.

Jose, I am happy to file a docs bug. Just to be clear: looking at that very doc, the label removal is no longer required, but the rest still is, right? I guess we want to serially update the Gluster nodes and ensure the volumes are in sync before going to the next node, right?
Thanks, daniel

(In reply to daniel from comment #5)
> I am happy to file a docs bug.
> Just to be clear, when looking at the very doc, remove is no longer
> required, but the rest still is, right. Guess we want to serially update
> gluster nodes and ensure volumes are in sync, before going to the next node,
> right ?

Correct.

Same message here, just updating the control plane rather than the nodes: running the upgrade_control_plane.yaml playbook got the error:
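The serial workflow confirmed above can be sketched roughly as follows. This is an illustrative outline only: the echo lines stand in for the real oc and ansible-playbook invocations, and the node names are reused from this report:

```shell
#!/bin/sh
# Illustrative sketch: upgrade GlusterFS nodes one at a time, and only move on
# once the volumes have healed. echo stands in for the real commands.
upgraded=0
for node in inf157.example.com inf158.example.com; do
  echo "oc label node $node type=upgrade"
  echo "ansible-playbook ... -e openshift_upgrade_nodes_label=type=upgrade"
  # At this point, poll 'gluster volume heal <volume> info' for every volume
  # and proceed only when the pending-entry count is 0 everywhere.
  echo "oc label node $node type-"
  upgraded=$((upgraded + 1))
done
echo "nodes upgraded serially: $upgraded"
```

The key point from the discussion is the ordering: one node at a time, with a heal check between nodes, and no DaemonSet-label removal.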
Error message is:
fatal: [master-0.example.com -> master-0.labs.com]: FAILED! => {"attempts": 120, "changed": false, "msg": "volume heketidbstorage is not ready", "state": "unknown"}
And it reported in the summary:
1. Hosts: master-0.example.com
Play: Verify upgrade can proceed on first master
Task: Check for GlusterFS cluster health
Message: volume heketidbstorage is not ready
(In reply to mnoguera from comment #8)
> same message just updating the control plane not the nodes, running the
> upgrade_control_plane.yaml playbook got the error:
>
> Error message is:
> fatal: [master-0.example.com -> master-0.labs.com]: FAILED! => {"attempts":
> 120, "changed": false, "msg": "volume heketidbstorage is not ready",
> "state": "unknown"}
>
> And it reported in the summary:
> 1. Hosts: master-0.example.com
>    Play: Verify upgrade can proceed on first master
>    Task: Check for GlusterFS cluster health
>    Message: volume heketidbstorage is not ready

Jose - any idea for the above?

What exact playbook are you running? Please provide the directory path. Also, please provide your inventory file.

Closing as there are no remaining open cases against this bug.
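For reference, the GlusterFS-related portion of an inventory for such a cluster typically looks like the fragment below. Everything here is hypothetical (hostnames, device paths); only the `[glusterfs]` group name and the `glusterfs_devices` host variable are the ones openshift-ansible's storage playbooks actually read:

```ini
[OSEv3:children]
masters
nodes
glusterfs

[glusterfs]
; glusterfs_devices lists the raw block devices heketi manages on each node
inf157.example.com glusterfs_devices='[ "/dev/sdb" ]'
inf158.example.com glusterfs_devices='[ "/dev/sdb" ]'
```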
Created attachment 1485018 [details]
verbose ansible logs

Description of problem:
Upgrade fails on the check below:

1. Hosts: 10.10.x.y
   Play: Verify upgrade can proceed on first master
   Task: Check for GlusterFS cluster health
   Message: volume heketidbstorage is not ready

Version-Release number of the following components:

$ rpm -q openshift-ansible
openshift-ansible-3.11.7-1.git.0.911481d.el7_5.noarch

$ rpm -q ansible
ansible-2.6.4-1.el7ae.noarch

$ ansible --version
ansible 2.6.4
  config file = /home/quicklab/ansible.cfg
  configured module search path = [u'/home/quicklab/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Feb 20 2018, 09:19:12) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]

How reproducible:

Steps to Reproduce:
1. Upgrade an OpenShift environment containing CNS
2. CNS details:

oc get pods -n glusterfs
NAME                                          READY   STATUS             RESTARTS   AGE
bonds-service-1-56zr2                         0/1     CrashLoopBackOff   533        1d
glusterblock-storage-provisioner-dc-1-cs2rf   1/1     Running            0          31d
glusterfs-storage-fg59w                       1/1     Running            3          48d
glusterfs-storage-tmrfl                       1/1     Running            2          48d
glusterfs-storage-wh2sl                       1/1     Running            0          2d
heketi-storage-1-tj6l5                        1/1     Running            0          8d

Actual results:
Fails every time

Expected results:
Should succeed.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
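Gathering the requested verbose logs is usually done by redirecting the playbook run to a file; the playbook path below is copied from the 3.10 run quoted earlier and may differ per release, so verify it against your installed openshift-ansible. The filter afterwards shows how to pull the fatal line out of a large log, demonstrated here on a sample fragment (the sample log text is illustrative):

```shell
# Real invocation (path from the 3.10 log above; adjust for your release):
#   ansible-playbook -vvv -i /etc/ansible/hosts \
#     /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade_nodes.yml \
#     2>&1 | tee /tmp/upgrade-vvv.log

# Sample log fragment (illustrative) and a filter for the fatal lines:
log='FAILED - RETRYING: Check for GlusterFS cluster health (1 retries left).
fatal: [inf152.example.com]: FAILED! => {"msg": "volume heketidbstorage is not ready"}
PLAY RECAP *****'
printf '%s\n' "$log" | grep -c '^fatal:'
```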