Bug 1521027 - Volume mismatch between OpenShift PVs, Heketi, and gluster
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: heketi
Version: cns-3.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Michael Adam
QA Contact: Rahul Hinduja
URL:
Whiteboard:
Depends On:
Blocks: 1724792 1622458
Reported: 2017-12-05 16:25 UTC by Thom Carlin
Modified: 2021-06-10 13:50 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-19 17:45:12 UTC
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 1401984 | 0 | unspecified | CLOSED | [RFE] Provide a "force" option in heketi-cli to allow the user to forcefully delete/flush any entries from heketi DB | 2021-02-22 00:41:40 UTC
Red Hat Bugzilla | 1489082 | 0 | unspecified | CLOSED | Deleting a project with a pod that uses a CNS PV results in pod stuck in terminating state | 2021-06-10 12:57:48 UTC
Red Hat Bugzilla | 1516288 | 0 | urgent | CLOSED | [GSS] heketi doesn't remove old pods | 2021-03-11 16:21:43 UTC
Red Hat Bugzilla | 1516598 | 1 | None | None | None | 2024-09-18 00:46:51 UTC
Red Hat Bugzilla | 1519549 | 0 | unspecified | CLOSED | Heketi timeouts leading to inconsistent CNS state | 2021-03-11 16:28:06 UTC
Red Hat Bugzilla | 1519919 | 0 | unspecified | CLOSED | [RFE] Provide mechanism to update/modify heketi database | 2021-02-22 00:41:40 UTC
Red Hat Bugzilla | 1524816 | 0 | high | CLOSED | heketi was not removing the LVs associated with Bricks removed when Gluster Volumes were deleted | 2021-02-22 00:41:40 UTC
Red Hat Knowledge Base (Solution) | 3242321 | 0 | None | None | None | 2018-02-15 12:58:31 UTC


Description Thom Carlin 2017-12-05 16:25:21 UTC
Description of problem:

Every volume should be present in all 3 areas: OpenShift PVs, Heketi, and gluster.  However, some volumes are missing from one or more of them.

Version-Release number of selected component (if applicable):

OCP 3.6/CNS 3.6

How reproducible:

Uncertain but seems to be a known issue under various guises

Steps to Reproduce:
1. oc describe pv
2. heketi-cli topology info
3. gluster volume list
4. Compare the 3 lists 
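
The comparison in step 4 can be sketched as a small helper. This is a minimal sketch, not a vetted tool: it assumes each list has already been reduced to one gluster volume name per line, and the function name and file names are hypothetical.

```shell
# missing_from A B: print the names found in file A but not in file B.
# Both files are assumed to hold one volume name per line (hypothetical inputs).
missing_from() {
    _mf_a=$(mktemp)
    _mf_b=$(mktemp)
    # comm needs sorted input; sort -u also drops duplicate names.
    sort -u "$1" > "$_mf_a"
    sort -u "$2" > "$_mf_b"
    # -2 hides lines unique to B, -3 hides lines common to both.
    comm -23 "$_mf_a" "$_mf_b"
    rm -f "$_mf_a" "$_mf_b"
}
```

For example, after "gluster volume list > gluster.txt", running "missing_from gluster.txt heketi.txt" would print the gluster volumes that heketi does not know about. How heketi.txt and pv.txt are produced is left open here; one possibility for the PV side, stated as an assumption, is "oc get pv -o jsonpath='{range .items[*]}{.spec.glusterfs.path}{"\n"}{end}'".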

Actual results:

Some volumes are present in some areas and absent in others

Expected results:

All volumes present in all areas

Additional info:

Initial workaround:
1) delete the PVs that don't have volumes in heketi
2) delete the heketi volumes that don't have PVs
3) delete the gluster volumes that don't have heketi volumes
4) Shut down the glusterd pods
5) On each GlusterFS node, for volumes that don't seem to have valid bricks: delete the volume directories
6) Start up glusterd
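
Before deleting anything in steps 1) to 3), it may help to print the candidates first. The following is a dry-run sketch only: it deletes nothing, it assumes three hypothetical input files that each hold one gluster volume name per line, and the function names and labels are made up for illustration.

```shell
# not_in REF LIST: print the lines of LIST that do not occur in REF.
not_in() {
    awk 'FILENAME == ARGV[1] { seen[$0] = 1; next } !($0 in seen)' "$1" "$2"
}

# cleanup_candidates PV_LIST HEKETI_LIST GLUSTER_LIST
# Prints one line per candidate, labelled with the workaround step it belongs to.
cleanup_candidates() {
    # Step 1: volumes referenced by a PV but unknown to heketi.
    not_in "$2" "$1" | sed 's/^/step1 pv-without-heketi-volume: /'
    # Step 2: heketi volumes that no PV refers to.
    not_in "$1" "$2" | sed 's/^/step2 heketi-volume-without-pv: /'
    # Step 3: gluster volumes that heketi does not know about.
    not_in "$2" "$3" | sed 's/^/step3 gluster-volume-without-heketi: /'
}
```

The output should be reviewed by hand. A volume can legitimately have no PV, such as the one heketi keeps its own database on (commonly named heketidbstorage), so a step2 line is not automatically safe to delete.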

Comment 2 Thom Carlin 2017-12-05 16:47:54 UTC
Additional information:

1) OCP Chronology:
   a) OCP cluster has been running for about 230 days.
   b) OCP cluster was initially 3.4/NFS
   c) OCP cluster was upgraded to 3.5/NFS
   d) OCP cluster was switched from 3.5/NFS to 3.5/CNS 3.5
   e) OCP cluster was upgraded to 3.6/CNS 3.5
   f) OCP cluster was upgraded to its current state: 3.6/CNS 3.6
   g) OCP cluster has all GA patches applied, with the exception of python-requests (due to another bz)

2) OCP Persistent Volumes:
   a) There are currently 13 PVs.
   b) The number can vary from about 10 to 40, depending on the mix.
   c) The mix comes from running CI/CD Jenkins jobs and ad hoc requests from other projects

3) Topology:
   a) 1 Heketi instance (pod)
   b) 3 glusterfs instances (pods) in a single pool using local storage
   c) 1 gluster-s3-dc instance (pod) [defined but not currently used]
   d) 1 glusterblock-provisioner-dc (pod) [defined but not currently used]

Comment 3 Thom Carlin 2017-12-05 16:53:43 UTC
Replacing step 4 of Steps to Reproduce with:
4) lvs
5) Compare the 4 lists
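
To fold the lvs output into the comparison, the brick IDs can be pulled out of both the LV names and the gluster brick paths. This is a sketch under one assumption: that heketi names bricks brick_<32 hex characters>, as in the names that appear later in this report. The function name is hypothetical.

```shell
# brick_ids: read any text on stdin and print the unique brick names it mentions.
# Assumes the naming pattern brick_<32 lowercase hex characters>.
brick_ids() {
    grep -o 'brick_[0-9a-f]\{32\}' | sort -u
}
```

Possible usage inside a glusterfs pod: "lvs --noheadings -o lv_name | brick_ids > lv_bricks.txt" and "gluster volume info | brick_ids > gluster_bricks.txt". A brick that appears only in lv_bricks.txt would be a Logical Volume with no gluster volume behind it.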

Comment 4 Thom Carlin 2017-12-05 17:02:46 UTC
Initial workaround, additional step:
7) Delete any gluster-related Logical Volumes (brick_* and tp_*)

For production environments, please carefully test the commands in another environment and contact GSS.  This workaround has *not* been fully vetted.

Comment 12 John Call 2018-01-22 18:00:45 UTC
(In reply to Thom Carlin from comment #0)
> Initial workaround:
> 1) delete the PVs that don't have volumes in heketi
> 2) delete the heketi volumes that don't have PVs
> 3) delete the gluster volumes that don't have heketi volumes
> 4) Shutdown glusterd pods
> 5) on each GlusterFS node, for volumes that don't seem to have valid bricks:
> delete the volume directories 
> 6) Startup glusterd

I also found it necessary to clean up (remove) logical volumes that had no associated gluster volumes.  I stole these steps from the heketi logs...

# umount /var/lib/heketi/mounts/vg_f067a6d1192e10332ef54923357f5d31/brick_9dff504c8f34d706c1a718f7b3f768da

# lvremove -f vg_f067a6d1192e10332ef54923357f5d31/tp_9dff504c8f34d706c1a718f7b3f768da

# sed -i.save "/brick_9dff504c8f34d706c1a718f7b3f768da/d" /var/lib/heketi/fstab

^^^ these commands were executed inside the CNS pods (e.g. oc rsh glusterfs-storage-ABCXYZ)
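
When several orphaned bricks need the same treatment, the three commands above can be generated per brick instead of typed by hand. This is a sketch that only prints the commands for review and runs none of them; the function name is hypothetical, and it assumes the same paths and naming as the commands quoted from the heketi logs.

```shell
# print_brick_cleanup VG BRICK_ID
# VG is the full volume group name (including the vg_ prefix);
# BRICK_ID is the hex ID shared by the brick_ and tp_ Logical Volumes.
# Prints the commands without executing them.
print_brick_cleanup() {
    _pbc_vg=$1
    _pbc_id=$2
    printf 'umount /var/lib/heketi/mounts/%s/brick_%s\n' "$_pbc_vg" "$_pbc_id"
    printf 'lvremove -f %s/tp_%s\n' "$_pbc_vg" "$_pbc_id"
    printf 'sed -i.save "/brick_%s/d" /var/lib/heketi/fstab\n' "$_pbc_id"
}
```

Calling "print_brick_cleanup vg_f067a6d1192e10332ef54923357f5d31 9dff504c8f34d706c1a718f7b3f768da" reproduces the three commands shown above. The printed lines are meant to be checked and then pasted into a shell inside the matching glusterfs pod.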

