Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1631162

Summary: 3.11 Upgrade fails on "Check for GlusterFS cluster health"
Product: OpenShift Container Platform
Reporter: Jaspreet Kaur <jkaur>
Component: Installer
Assignee: Russell Teague <rteague>
Installer sub component: openshift-ansible
QA Contact: Johnny Liu <jialiu>
Status: CLOSED DEFERRED
Severity: medium
Priority: medium
CC: aos-bugs, dmoessne, jarrpa, jkaur, jokerman, mmccomas, mnoguera, mtaru, roysjosh, sdodson, wmeng
Version: 3.11.0
Target Milestone: ---
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-08-19 15:53:33 UTC
Type: Bug

Attachments: verbose ansible logs

Description Jaspreet Kaur 2018-09-20 06:43:04 UTC
Created attachment 1485018 [details]
verbose ansible logs

Description of problem: Upgrade fails on the check below:

 1. Hosts:    10.10.x.y
     Play:     Verify upgrade can proceed on first master
     Task:     Check for GlusterFS cluster health
     Message:  volume heketidbstorage is not ready
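
The volume state behind that message can be inspected manually from inside one of the running GlusterFS pods; a minimal sketch (the pod name is a placeholder, the volume name comes from the failure message):

$ oc rsh -n glusterfs <glusterfs-storage-pod>
sh-4.2# gluster volume status heketidbstorage
sh-4.2# gluster volume heal heketidbstorage info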


Version-Release number of the following components:


$ rpm -q openshift-ansible
openshift-ansible-3.11.7-1.git.0.911481d.el7_5.noarch

$ rpm -q ansible
ansible-2.6.4-1.el7ae.noarch

$ ansible --version
ansible 2.6.4
  config file = /home/quicklab/ansible.cfg
  configured module search path = [u'/home/quicklab/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Feb 20 2018, 09:19:12) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]




How reproducible:

Steps to Reproduce:
1. Upgrade an OpenShift cluster with CNS (containerized GlusterFS) deployed
2. CNS details:

oc get pods -n glusterfs
NAME                                          READY     STATUS             RESTARTS   AGE
bonds-service-1-56zr2                         0/1       CrashLoopBackOff   533        1d
glusterblock-storage-provisioner-dc-1-cs2rf   1/1       Running            0          31d
glusterfs-storage-fg59w                       1/1       Running            3          48d
glusterfs-storage-tmrfl                       1/1       Running            2          48d
glusterfs-storage-wh2sl                       1/1       Running            0          2d
heketi-storage-1-tj6l5                        1/1       Running            0          8d



Actual results: Fails every time

Expected results: The upgrade should succeed.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
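
For example (the inventory path is a placeholder; the playbook path assumes the 3.11 upgrade playbooks shipped with openshift-ansible):

$ ansible-playbook -vvv -i /path/to/inventory \
    /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade.yml \
    2>&1 | tee upgrade-vvv.log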

Comment 1 Scott Dodson 2018-09-20 12:30:41 UTC
This is a safety check to ensure that gluster has fully healed before we remove an additional host from the cluster.


Message:  volume heketidbstorage is not ready

Why is that volume not ready?
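
For what it's worth, a GlusterFS volume is generally considered healthy only when all of its bricks are online and heal info shows no pending entries, so checking that directly inside a GlusterFS pod should narrow down why the playbook keeps retrying:

sh-4.2# gluster volume heal heketidbstorage info | grep 'Number of entries'
# every brick should report "Number of entries: 0"; a non-zero count means
# self-heal is still catching up and the readiness check will keep failing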

Comment 3 daniel 2018-09-25 17:28:37 UTC
I am running into the very same issue when trying to upgrade OCP 3.9 to 3.10 with OCS 3.10 already running on it
(master and infra nodes already upgraded; the failure below happens when dealing with the storage nodes):


# rpm -q openshift-ansible
openshift-ansible-3.10.47-1.git.0.95bc2d2.el7_5.noarch
# 
# rpm -q ansible
ansible-2.4.6.0-1.el7ae.noarch
# 
# ansible --version
ansible 2.4.6.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, May 31 2018, 09:41:32) [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)]
# 



FAILED - RETRYING: Check for GlusterFS cluster health (6 retries left).
FAILED - RETRYING: Check for GlusterFS cluster health (5 retries left).
FAILED - RETRYING: Check for GlusterFS cluster health (4 retries left).
FAILED - RETRYING: Check for GlusterFS cluster health (3 retries left).
FAILED - RETRYING: Check for GlusterFS cluster health (2 retries left).
FAILED - RETRYING: Check for GlusterFS cluster health (1 retries left).
fatal: [inf152.example.com -> inf152.example.com]: FAILED! => {"attempts": 120, "changed": false, "failed": true, "msg": "volume heketidbstorage is not ready", "state": "unknown"}
        to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade_nodes.retry

PLAY RECAP ***********************************************************************************************************************************************************************************************************************************
inf152.example.com  : ok=141  changed=5    unreachable=0    failed=1   
inf153.example.com  : ok=53   changed=5    unreachable=0    failed=0   
inf154.example.com  : ok=53   changed=5    unreachable=0    failed=0   
inf155.example.com  : ok=68   changed=10   unreachable=0    failed=0   
inf156.example.com  : ok=69   changed=11   unreachable=0    failed=0   
inf157.example.com  : ok=68   changed=10   unreachable=0    failed=0   
inf158.example.com  : ok=68   changed=10   unreachable=0    failed=0   
inf159.example.com  : ok=68   changed=11   unreachable=0    failed=0   
inf160.example.com  : ok=68   changed=10   unreachable=0    failed=0   
inf161.example.com  : ok=68   changed=10   unreachable=0    failed=0   
inf162.example.com  : ok=68   changed=10   unreachable=0    failed=0   
inf163.example.com  : ok=68   changed=10   unreachable=0    failed=0   
inf164.example.com  : ok=68   changed=10   unreachable=0    failed=0   
inf165.example.com  : ok=68   changed=10   unreachable=0    failed=0   
inf166.example.com  : ok=68   changed=10   unreachable=0    failed=0   
inf167.example.com  : ok=68   changed=10   unreachable=0    failed=0   
inf168.example.com  : ok=68   changed=10   unreachable=0    failed=0   
inf169.example.com  : ok=68   changed=10   unreachable=0    failed=0   
localhost                  : ok=13   changed=0    unreachable=0    failed=0   



Failure summary:


  1. Hosts:    inf152.example.com
     Play:     Verify upgrade can proceed on first master
     Task:     Check for GlusterFS cluster health
     Message:  volume heketidbstorage is not ready


I followed:

https://docs.openshift.com/container-platform/3.10/upgrading/automated_upgrades.html#special-considerations-for-glusterfs
- which asks to remove the daemonset label on one node so its GlusterFS (OCS) pod terminates
- add the type=upgrade label
- # ansible-playbook -i /etc/ansible/hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade_nodes.yml -e openshift_upgrade_nodes_label="type=upgrade"

which ends up in the scenario above.
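
For reference, the label steps above are plain oc commands; a minimal sketch, assuming the default CNS daemonset label glusterfs=storage-host (the label key can differ per deployment) and a node name from this report:

# oc label node inf157.example.com glusterfs-        # remove the daemonset label; the GlusterFS pod terminates
# oc label node inf157.example.com type=upgrade      # mark the node for the upgrade run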


However, if I put the label back in place, get the OCS pods up again, and verify everything is in sync, the node upgrade succeeds:

Before the failed upgrade (note those 2 nodes are part of different OCS Gluster clusters: registry and apps):

# oc get nodes  -l type=upgrade
NAME                        STATUS    ROLES     AGE       VERSION
inf157.example.com   Ready     compute   1d        v1.9.1+a0ce1bc657   
inf158.example.com   Ready     compute   1d        v1.9.1+a0ce1bc657
# 

--> upgrade

PLAY RECAP ***********************************************************************************************************************************************************************************************************************************
inf152.example.com  : ok=164  changed=5    unreachable=0    failed=0
inf153.example.com  : ok=70   changed=5    unreachable=0    failed=0
inf154.example.com  : ok=70   changed=5    unreachable=0    failed=0
inf155.example.com  : ok=78   changed=10   unreachable=0    failed=0
inf156.example.com  : ok=79   changed=10   unreachable=0    failed=0
inf157.example.com  : ok=149  changed=45   unreachable=0    failed=0
inf158.example.com  : ok=149  changed=45   unreachable=0    failed=0
inf159.example.com  : ok=78   changed=10   unreachable=0    failed=0
inf160.example.com  : ok=78   changed=10   unreachable=0    failed=0
inf161.example.com  : ok=78   changed=10   unreachable=0    failed=0
inf162.example.com  : ok=78   changed=10   unreachable=0    failed=0
inf163.example.com  : ok=78   changed=10   unreachable=0    failed=0
inf164.example.com  : ok=78   changed=10   unreachable=0    failed=0
inf165.example.com  : ok=78   changed=10   unreachable=0    failed=0
inf166.example.com  : ok=78   changed=10   unreachable=0    failed=0
inf167.example.com  : ok=78   changed=10   unreachable=0    failed=0
inf168.example.com  : ok=78   changed=10   unreachable=0    failed=0
inf169.example.com  : ok=78   changed=10   unreachable=0    failed=0
localhost                  : ok=13   changed=0    unreachable=0    failed=0

# 
# oc get nodes  -l type=upgrade
NAME                        STATUS    ROLES           AGE       VERSION
inf157.example.com   Ready     compute,infra   1d        v1.10.0+b81c8f8
inf158.example.com   Ready     compute,infra   1d        v1.10.0+b81c8f8
# 

--> then this succeeds, so I wonder if the steps given in the doc:

https://docs.openshift.com/container-platform/3.10/upgrading/automated_upgrades.html#special-considerations-for-glusterfs
Special Considerations When Using Containerized GlusterFS

are still valid for the OCP 3.9 -> 3.10 upgrade, or whether the playbooks changed and the upgrade procedure should now be different as well?

Comment 4 Jose A. Rivera 2018-09-25 18:33:28 UTC
Oh.... yes, those instructions are out of date. With the new GlusterFS health checks in the upgrade playbooks, it is no longer a requirement to remove the DaemonSet label from the GlusterFS nodes. This is a doc bug that should be relatively easy to fix.

Comment 5 daniel 2018-09-25 18:44:12 UTC
Jose,

I am happy to file a docs bug.
Just to be clear: looking at that very doc, removing the label is no longer required, but the rest still is, right? I guess we want to serially update the Gluster nodes and ensure the volumes are in sync before going to the next node, right?

Thanks,
daniel

Comment 6 Jaspreet Kaur 2018-10-02 07:14:09 UTC
docs bug https://bugzilla.redhat.com/show_bug.cgi?id=1633471

Comment 7 Jose A. Rivera 2018-10-02 14:36:17 UTC
(In reply to daniel from comment #5)
> I am happy to file a docs bug.
> Just to be clear, when looking at the very doc, remove is no longer
> required, but the rest still is, right. Guess we want to serially update
> gluster nodes and ensure volumes are in sync, before going to the next node,
> right ?

Correct.
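
A minimal sketch of that serial flow, assuming the documented openshift_upgrade_nodes_serial variable (the value shown is an example):

# ansible-playbook -i /etc/ansible/hosts \
    /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade_nodes.yml \
    -e openshift_upgrade_nodes_label="type=upgrade" \
    -e openshift_upgrade_nodes_serial=1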

Comment 8 Camino Noguera 2018-10-31 13:50:32 UTC
Same message when updating just the control plane, not the nodes; running the upgrade_control_plane.yml playbook got the error:

Error message is:
fatal: [master-0.example.com -> master-0.labs.com]: FAILED! => {"attempts": 120, "changed": false, "msg": "volume heketidbstorage is not ready", "state": "unknown"}

And it reported in the summary: 
  1. Hosts:    master-0.example.com
     Play:     Verify upgrade can proceed on first master
     Task:     Check for GlusterFS cluster health
     Message:  volume heketidbstorage is not ready

Comment 9 Yaniv Kaul 2018-11-20 08:01:10 UTC
(In reply to mnoguera from comment #8)
> Same message when updating just the control plane, not the nodes; running
> the upgrade_control_plane.yml playbook got the error:
> 
> Error message is:
> fatal: [master-0.example.com -> master-0.labs.com]: FAILED! => {"attempts":
> 120, "changed": false, "msg": "volume heketidbstorage is not ready",
> "state": "unknown"}
> 
> And it reported in the summary: 
>   1. Hosts:    master-0.example.com
>      Play:     Verify upgrade can proceed on first master
>      Task:     Check for GlusterFS cluster health
>      Message:  volume heketidbstorage is not ready

Jose - any idea for the above?

Comment 10 Jose A. Rivera 2018-11-20 16:57:13 UTC
What exact playbook are you running? Please provide the directory path. Also, please provide your inventory file.
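
For anyone following along, a minimal sketch of the GlusterFS-related parts of such an inventory (hostnames and devices are placeholders; glusterfs_devices is the standard openshift-ansible variable):

[OSEv3:children]
masters
nodes
glusterfs

[glusterfs]
node1.example.com glusterfs_devices='["/dev/sdb"]'
node2.example.com glusterfs_devices='["/dev/sdb"]'
node3.example.com glusterfs_devices='["/dev/sdb"]'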

Comment 12 Scott Dodson 2019-08-19 15:53:33 UTC
Closing as there are no remaining open cases against this bug.