Bug 1563690 - [3.7] GlusterFS pods fail to deploy without restarting OpenShift services
Summary: [3.7] GlusterFS pods fail to deploy without restarting OpenShift services
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: CNS-deployment
Version: rhgs-3.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: CNS 3.10
Assignee: Jose A. Rivera
QA Contact: Prasanth
Depends On:
Reported: 2018-04-04 13:10 UTC by mmariyan
Modified: 2018-11-08 08:46 UTC
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2018-05-16 11:12:48 UTC
Target Upstream Version:


Description mmariyan 2018-04-04 13:10:32 UTC
Description of problem:

Fresh installation on OCP 3.7

1> The customer noted that the GlusterFS installation playbook failed at the "Wait for GlusterFS pods" task, which is defined in /usr/share/ansible/openshift-ansible/roles/openshift_storage_glusterfs/tasks/glusterfs_deploy.yml. After making the change below, the playbook ran successfully:


#  retries: "{{ (glusterfs_timeout | int / 10) | int }}"
   retries: 150
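The commented-out formula turns glusterfs_timeout into a retry count, with the task polling roughly every 10 seconds. A minimal sketch of that arithmetic, assuming a default glusterfs_timeout of 300 seconds (the value I believe the openshift-ansible role ships with; verify against your installed version):

```shell
# Sketch of the retry arithmetic (values are assumptions; check your role/inventory).
glusterfs_timeout=300                          # assumed role default, in seconds
default_retries=$(( glusterfs_timeout / 10 ))  # the commented-out formula: 30 retries
hardcoded_retries=150                          # the customer's hardcoded override

# Each retry waits ~10 seconds, so:
echo "default:  ${default_retries} retries (~$(( default_retries * 10 / 60 )) min)"
echo "override: ${hardcoded_retries} retries (~$(( hardcoded_retries * 10 / 60 )) min)"
```

In other words, the hardcoded 150 raised the wait from roughly 5 minutes to roughly 25 minutes. Rather than editing the role file, the same effect can likely be achieved by raising the timeout in the inventory (for example, openshift_storage_glusterfs_timeout=1500); treat that variable name as an assumption and confirm it against your openshift-ansible version.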

> At that point, not all of the GlusterFS pods reached the "1/1 Running" state,
so the customer restarted the docker and node services, after which the pods reached the "1/1 Running" state. Is this restart mandatory?

systemctl restart docker.service
systemctl restart atomic-openshift-node.service

After the restarts, the pods came up in the correct state:
1/1           Running

Actual results:

Playbook fails

Expected results:
The GlusterFS installation playbook should succeed on OpenShift 3.7 without requiring docker/node service restarts.

Additional info:

Comment 1 Jose A. Rivera 2018-04-04 15:18:25 UTC
No, this should not be required. I'm not sure I fully understand the problem, however: are you saying that the customer set the retries to 150, and the playbook then succeeded but the pods were not actually running?

Did you wipe the failed installation before re-running the playbooks? The GlusterFS playbooks are not idempotent, and it is not supported to run them more than once without uninstalling/wiping the failed deployment first. If you did do a second run, I suspect the playbooks detected the extant pods and skipped over the task that initially gave you problems. I don't know how they would have completed without heketi, however.

Please inspect the logs found in /var/log/glusterfs/glusterd.log as well as systemctl/journalctl logs for docker and atomic-openshift-node for additional information.
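The log inspection suggested above can be sketched as follows (a hedged sketch: the log path and unit names are taken from this comment and may differ on your hosts, and with containerized GlusterFS the glusterd log lives inside the pod):

```shell
# Hedged sketch of the suggested log collection; the guards keep it safe
# on hosts where a file or command is missing.
GLUSTERD_LOG=/var/log/glusterfs/glusterd.log

# glusterd log (for containerized GlusterFS, check inside the pod via `oc rsh`)
if [ -f "$GLUSTERD_LOG" ]; then
    tail -n 100 "$GLUSTERD_LOG"
fi

# node-side service logs around the failure window
if command -v journalctl >/dev/null 2>&1; then
    journalctl -u docker.service --no-pager | tail -n 100
    journalctl -u atomic-openshift-node.service --no-pager | tail -n 100
fi
```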

Updating the summary line to a more useful sentence.
