Bug 1452563
| Summary: | deployment stuck if gluster node went offline | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Alexander Koksharov <akokshar> |
| Component: | heketi | Assignee: | Humble Chirammal <hchiramm> |
| Status: | CLOSED ERRATA | QA Contact: | RamaKasturi <knarra> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | cns-3.5 | CC: | akhakhar, andcosta, annair, aos-bugs, bkunal, hchiramm, jarrpa, jmulligan, kramdoss, madam, pprakash, rgeorge, rhs-bugs, rreddy, rtalur, sankarshan, storage-qa-internal, vinug |
| Target Milestone: | --- | | |
| Target Release: | OCS 3.11 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | rhgs-volmanager-container-3.11.0-4 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-10-24 04:51:02 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1622458, 1629577 | | |
Description
Alexander Koksharov
2017-05-19 08:33:47 UTC
In the problem description you mentioned:

--snip--
I have successfully replicated this on v3.2 and it is also reproducible on v3.5. However, on 3.5 it is a bit better. I have installed two OCP clusters of different versions and integrated them with one two-node gluster cluster.
--/snip--

In CNS we only support the 'replica 3' configuration. Please refer to this doc: https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.2/html/container-native_storage_for_openshift_container_platform/chap-documentation-red_hat_gluster_storage_container_native_with_openshift_platform-setting_the_environment-deploy_cns

Replica 2 could be prone to split-brain, so the recommendation is to use replica 3. Can you please look into this configuration and reproduce the issue again?

Hello Humble,

Could you please clarify how having a three-node gluster cluster can yield any improvement for this problem? If there is some kind of 'split brain' situation and the gluster node in the minority does not accept mount requests, why does the volume eventually get mounted after a delay of a few minutes? I had only two nodes in my test, and if they got split then, if the above were true (which it is not), nothing should work. But it does. It does not work well, but it does work!

The issue is not in the Gluster cluster at all. The issue is at the connection level. The OpenShift configuration consists of:
- endpoints representing each gluster node
- a service that load-shares across those endpoints (by means of iptables)

The system uses the service IP to connect to gluster. The service IP gets translated to one of the endpoint (gluster node) IPs. When the system tries to establish a connection, it chooses one endpoint and sends a TCP SYN. If the SYN gets no reply because of an iptables DROP rule, we see a delay of several minutes. If the SYN is rejected instead, meaning an "ICMP unreachable" is sent back, there is no delay in mounting the volume. (A minimal illustration of this difference is sketched further down.)

If you still insist this should be tested with a three-node gluster cluster, please explain the call flows between OpenShift and the gluster cluster when the gluster cluster is healthy and when one node has gone down. I would very much appreciate it if you could clarify what logic OpenShift applies when it tries to pick a node to connect to.

Thank you,
Lex.

Hi Humble,

Can you give me an update on this bugzilla?

Thanks,
Andre

@humble, can you please update the FIV for this bug?

Below are the test steps I executed as per comment 13 to move this bug to verified state (a command-level sketch of these steps also appears at the end of this report):

I tried performing test case 1 in comment 13 and I see the results below.
1) Created a cirros app pod with a gluster-backed PVC and mounted it at /mnt. I see that the volume is mounted using the server 10.70.46.170 inside the pod.
2) I powered off the node from which the volume was mounted.
3) I then restarted the app pod by deleting it with the command `oc delete pod <pod_name>`.
4) I now see that another pod comes up, and the volume is mounted inside the pod using the same node, 10.70.46.170.

Bug verification is done, but I am waiting for the FIV to be set so that this can be moved to verified state.

> Sorry Humble. My bad.

Nw :)

> Looks like the hypervisor where the VM is hosted has high memory consumption and due to this the command was stuck. Now the memory consumption has reduced, so I was able to execute the command. Thanks, Karthick, for helping me out with this.

I am glad to hear this :)

> Can you please update the Fixed In Version for this bug, which would help me to move the bug to verified state?
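The following is a minimal sketch of the DROP versus REJECT difference described above. It is not taken from this bug: the port (24007, the glusterd management port), the IP address, the volume name, and the mount point are illustrative assumptions only.

```sh
# Case 1: silently drop TCP SYNs to the gluster management port, which is
# effectively what a powered-off or firewalled node does. The client keeps
# retransmitting SYNs until the TCP connect timeout expires, so the mount
# hangs for several minutes before failing over.
iptables -A INPUT -p tcp --dport 24007 -j DROP

# Case 2: actively reject the connection instead. The client receives an
# ICMP unreachable immediately, gives up on this endpoint at once, and the
# mount proceeds through the remaining endpoints without a long delay.
iptables -D INPUT -p tcp --dport 24007 -j DROP
iptables -A INPUT -p tcp --dport 24007 -j REJECT --reject-with icmp-port-unreachable

# Observe the difference from an OpenShift node (volume name is hypothetical):
time mount -t glusterfs 10.70.46.170:/testvol /mnt/test
```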
rhgs-volmanager-container-3.11.0-4

Verified the fix in rhgs-volmanager-rhel7:3.11.0-4 and it works fine as per comment 24.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2986
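For reference, the verification flow quoted in the comments above (app pod with a gluster-backed PVC, powering off the gluster node, recreating the pod) could be exercised roughly as follows. This is only a sketch: the manifest and pod names are hypothetical, and 10.70.46.170 is simply the node address cited in the steps.

```sh
# 1. Deploy a cirros app pod backed by a gluster PVC mounted at /mnt
#    (manifest name is hypothetical).
oc create -f cirros-app-with-gluster-pvc.yaml
oc get pods -o wide

# 2. Check which gluster server the volume is mounted from inside the pod.
oc rsh <pod_name> mount | grep glusterfs    # e.g. 10.70.46.170:vol_... on /mnt

# 3. Power off the gluster node serving the mount, then delete the pod so
#    its controller schedules a replacement.
oc delete pod <pod_name>

# 4. Confirm the replacement pod starts and the gluster volume is mounted
#    again, even though the original server is offline.
oc get pods
oc rsh <new_pod_name> mount | grep glusterfs
```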