Bug 1618826
| Summary: | Gluster pods are going down during node drain while doing ocp upgrade | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Nitin Goyal <nigoyal> |
| Component: | CNS-deployment | Assignee: | Michael Adam <madam> |
| Status: | CLOSED WONTFIX | QA Contact: | Nitin Goyal <nigoyal> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | cns-3.9 | CC: | akhakhar, hchiramm, jarrpa, kramdoss, madam, pprakash, rhs-bugs, rtalur, sankarshan |
| Target Milestone: | --- | Flags: | kramdoss: needinfo+ |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-08-22 20:12:20 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1568862 | | |
| Attachments: | | | |
**Description** Nitin Goyal 2018-08-17 17:28:12 UTC
Some outputs:

During upgrade
==============

    oc get pods
    NAME                                          READY     STATUS     RESTARTS   AGE
    glusterblock-storage-provisioner-dc-1-pbg6f   1/1       Running    5          1d
    glusterfs-storage-46tmw                       1/1       Running    5          1d
    glusterfs-storage-8d25h                       1/1       NodeLost   5          1d
    glusterfs-storage-99j95                       1/1       Running    5          1d
    heketi-storage-1-g4fr4                        1/1       Running    0          30m

After upgrade
=============

    oc get pods -o wide
    NAME                                          READY     STATUS    RESTARTS   AGE   IP             NODE
    glusterblock-storage-provisioner-dc-1-7bnx4   1/1       Running   0          6m    10.128.0.32    dhcp47-102.lab.eng.blr.redhat.com
    glusterfs-storage-46tmw                       1/1       Running   5          1d    10.70.46.138   dhcp46-138.lab.eng.blr.redhat.com
    glusterfs-storage-8d25h                       1/1       Running   6          1d    10.70.46.107   dhcp46-107.lab.eng.blr.redhat.com
    glusterfs-storage-99j95                       1/1       Running   5          1d    10.70.46.208   dhcp46-208.lab.eng.blr.redhat.com
    heketi-storage-1-g4fr4                        1/1       Running   0          39m   10.130.0.10    dhcp46-208.lab.eng.blr.redhat.com

Heal info from inside a gluster pod:

    [root@dhcp47-102 ~]# oc rsh glusterfs-storage-46tmw
    sh-4.2# gluster vol heal vol_glusterfs_claim194_d71a5a3f-a17d-11e8-b8df-005056a5454b info
    Brick 10.70.46.208:/var/lib/heketi/mounts/vg_d459f800b1ab62c20a71b04e98d7606c/brick_fc46e9c423b8adbf91ed86e517bb9882/brick
    /
    Status: Connected
    Number of entries: 1

    Brick 10.70.46.138:/var/lib/heketi/mounts/vg_153d8fd7a68e940be56532d032f7ded0/brick_6ca3abb9779fafde36dafbd5d4ee996e/brick
    Status: Transport endpoint is not connected
    Number of entries: -

    Brick 10.70.46.107:/var/lib/heketi/mounts/vg_74b0b7452b2bed44b07cb37428612c8d/brick_63815bbfacbd0cde57dad39474046a98/brick
    /
    Status: Connected
    Number of entries: 1

Volume status:

    sh-4.2# gluster vol status vol_glusterfs_claim194_d71a5a3f-a17d-11e8-b8df-005056a5454b
    Status of volume: vol_glusterfs_claim194_d71a5a3f-a17d-11e8-b8df-005056a5454b
    Gluster process                                          TCP Port  RDMA Port  Online  Pid
    ------------------------------------------------------------------------------
    Brick 10.70.46.208:/var/lib/heketi/mounts/vg_d459f800b1ab62c20a71b04e98d7606c/brick_fc46e9c423b8adbf91ed86e517bb9882/brick   49152  0  Y  5516
    Brick 10.70.46.138:/var/lib/heketi/mounts/vg_153d8fd7a68e940be56532d032f7ded0/brick_6ca3abb9779fafde36dafbd5d4ee996e/brick   49152  0  Y  5532
    Brick 10.70.46.107:/var/lib/heketi/mounts/vg_74b0b7452b2bed44b07cb37428612c8d/brick_63815bbfacbd0cde57dad39474046a98/brick   49152  0  Y  5598
    Self-heal Daemon on localhost                            N/A       N/A        Y       5495
    Self-heal Daemon on dhcp46-208.lab.eng.blr.redhat.com    N/A       N/A        Y       5507
    Self-heal Daemon on 10.70.46.107                         N/A       N/A        Y       5583

    Task Status of Volume vol_glusterfs_claim194_d71a5a3f-a17d-11e8-b8df-005056a5454b
    ------------------------------------------------------------------------------
    There are no active volume tasks

Created attachment 1476681 [details]
Ansible log
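The pending-heal count in output like the above can be extracted mechanically. A minimal sketch, where `pending_heals` is a hypothetical helper (not part of heketi, gluster, or any tooling mentioned in this bug), fed a sample shaped like the heal-info output in this report:

```shell
# Hypothetical helper: sum the numeric "Number of entries:" counts from
# `gluster vol heal <volume> info` output. Non-numeric counts ("-", reported
# when a brick's transport endpoint is not connected) are skipped.
pending_heals() {
  awk '/^Number of entries:/ { if ($4 ~ /^[0-9]+$/) total += $4 }
       END { print total + 0 }'
}

# Sample matching the heal-info output in this report (brick lines omitted):
sample='Status: Connected
Number of entries: 1
Status: Transport endpoint is not connected
Number of entries: -
Status: Connected
Number of entries: 1'

printf '%s\n' "$sample" | pending_heals   # prints 2
```

A node is only safe to drain once this count reaches 0 for every volume; note that a "-" count hides entries behind a disconnected brick, so a zero here is necessary but not sufficient while any brick is unreachable.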
(In reply to Nitin Goyal from comment #0)

> Description of problem:
> As part of OCP upgrade, each node is drained and pods are moved out of the
> node for the upgrade to happen on the node. With CNS configured, some of the
> nodes have gluster pods and they are restarted as part of this upgrade
> process. We have couple of issues here,

It is slightly problematic that this report covers two issues; it might be better to file two BZs for that.

> 1) This will trigger self-heal on all the volumes that has active IO and
> there is a chance that 2nd gluster pod is taken down before the self-heal
> gets completed. This will result in gluster volume going into read-only state

The 3.10 upgrade instructions explicitly cover gluster and instruct to wait for heals to complete:
https://access.redhat.com/documentation/en-us/openshift_container_platform/3.10/html-single/upgrading_clusters/#special-considerations-for-glusterfs

The same text applies when coming from 3.9:
https://access.redhat.com/documentation/en-us/openshift_container_platform/3.9/html-single/upgrading_clusters/#special-considerations-for-glusterfs

If those instructions are followed, the problem should not occur. Furthermore, Jose points out that those instructions are even somewhat outdated, since the upgrade playbook of openshift-ansible 3.10 already waits for heals to complete. Have you been using the latest version of the upgrade playbook?

> 2) Also with OCP being upgraded with CNS 3.9, we hit this issue -
> https://bugzilla.redhat.com/show_bug.cgi?id=1615324 due to restart of
> gluster pod.

Well, that is no surprise. This is a bug in 3.9 that we are just about to fix in 3.10. Without fixing 3.9, we cannot prevent it.

Finally, this is a test of the OCP 3.9->3.10 upgrade with CNS 3.9 installed, so there is no point marking this as a bug, let alone a blocker, against CNS/OCS 3.10. :-)

This is not a bug for OCS 3.10. ==> devel_ack - The blocker flag needs to be removed.
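The instructions referenced above amount to a wait loop before each node drain: keep polling until no heals are pending, then drain. A sketch of that pattern, with stubbed poll results ("5 2 0") standing in for real per-poll queries of the form `oc rsh <gluster-pod> gluster vol heal <volume> info` (the pod and volume names would come from the cluster, and a real loop would sleep between polls):

```shell
# Stubbed pending-heal counts returned by successive polls; in a real
# cluster each value would be parsed from `gluster vol heal <volume> info`.
set -- 5 2 0

# Poll until the volume reports zero pending heals.
while [ "$1" -ne 0 ]; do
  echo "pending heal entries: $1, not safe to drain yet"
  shift
done
echo "no pending heals; safe to drain the node"
```

This is the check the openshift-ansible 3.10 upgrade playbook performs automatically, per the comment above.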
Part 1 of the bug is a bug in the OCP upgrade process that has already been fixed. Part 2 is a bug in CNS 3.9 that can't be fixed. ==> What to do? I suggest closing.

Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.