1521027 – Volume mismatch between OpenShift PVs, Heketi, and gluster

Summary: Volume mismatch between OpenShift PVs, Heketi, and gluster

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	heketi
Sub Component:
Version:	cns-3.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Michael Adam
QA Contact:	Rahul Hinduja
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1724792 1622458

Reported:	2017-12-05 16:25 UTC by Thom Carlin
Modified:	2021-06-10 13:50 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-09-19 17:45:12 UTC
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1401984	0	unspecified	CLOSED	[RFE] Provide a "force" option in heketi-cli to allow the user to forcefully delete/flush any entries from heketi DB	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	1489082	0	unspecified	CLOSED	Deleting a project with a pod that uses a CNS PV results in pod stuck in terminating state	2021-06-10 12:57:48 UTC
Red Hat Bugzilla	1516288	0	urgent	CLOSED	[GSS] heketi doesn't remove old pods	2021-03-11 16:21:43 UTC
Red Hat Bugzilla	1516598	1	None	None	None	2024-09-18 00:46:51 UTC
Red Hat Bugzilla	1519549	0	unspecified	CLOSED	Heketi timeouts leading to inconsistent CNS state	2021-03-11 16:28:06 UTC
Red Hat Bugzilla	1519919	0	unspecified	CLOSED	[RFE] Provide mechanism to update/modify heketi database	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	1524816	0	high	CLOSED	heketi was not removing the LVs associated with Bricks removed when Gluster Volumes were deleted	2021-02-22 00:41:40 UTC
Red Hat Knowledge Base (Solution)	3242321	0	None	None	None	2018-02-15 12:58:31 UTC

Internal Links: 1401984 1489082 1516288 1516598 1519549 1519919 1524816

Description Thom Carlin 2017-12-05 16:25:21 UTC

Description of problem:

Every volume should be present in all 3 areas (OpenShift PVs, Heketi, and gluster). However, in this cluster some volumes are missing from one or more of them.

Version-Release number of selected component (if applicable):

OCP 3.6/CNS 3.6

How reproducible:

Uncertain, but this seems to be a known issue under various guises

Steps to Reproduce:
1. oc describe pv
2. heketi-cli topology info
3. gluster volume list
4. Compare the 3 lists 
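
The comparison in step 4 can be sketched as a small set operation. This is only a minimal illustration, not part of the original report: it assumes the volume identifiers from the three commands have already been extracted by hand and normalized to one common name per volume, and the function name find_mismatches and the area labels are hypothetical.

```python
from typing import Dict, Iterable, List

# Hypothetical labels for the three places a volume should be known.
AREAS = ("openshift_pv", "heketi", "gluster")


def find_mismatches(
    pv_ids: Iterable[str],
    heketi_ids: Iterable[str],
    gluster_ids: Iterable[str],
) -> Dict[str, List[str]]:
    """Map each inconsistent volume id to the areas it is missing from.

    Assumption: all three inputs use the same identifier for a volume.
    """
    seen = {
        "openshift_pv": set(pv_ids),
        "heketi": set(heketi_ids),
        "gluster": set(gluster_ids),
    }
    all_ids = set().union(*seen.values())
    mismatches: Dict[str, List[str]] = {}
    for vol in sorted(all_ids):
        missing = [area for area in AREAS if vol not in seen[area]]
        if missing:
            mismatches[vol] = missing
    return mismatches


if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    report = find_mismatches(
        pv_ids=["vol_a", "vol_b"],
        heketi_ids=["vol_a", "vol_c"],
        gluster_ids=["vol_a", "vol_b", "vol_c", "vol_d"],
    )
    for vol, missing in report.items():
        print(f"{vol}: missing from {', '.join(missing)}")
```

A volume that appears in all three inputs is omitted from the result, so an empty result means the three lists agree.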

Actual results:

Some volumes are present in some areas and absent in others

Expected results:

All volumes present in all areas

Additional info:

Initial workaround:
1) delete the PVs that don't have volumes in heketi
2) delete the heketi volumes that don't have PVs
3) delete the gluster volumes that don't have heketi volumes
4) Shut down the glusterd pods
5) On each GlusterFS node, delete the volume directories of volumes that don't seem to have valid bricks
6) Start up glusterd
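
Steps 1 to 3 of the workaround amount to three set differences. The sketch below only builds a dry-run list of candidates and deletes nothing. It assumes the identifiers were collected by hand and normalized to a common name, and that all three differences are taken against the lists as they were before any deletion. The names plan_cleanup, CleanupPlan and the keep parameter are hypothetical; keep is meant for volumes that legitimately have no PV, such as heketi's own database volume (commonly named heketidbstorage in CNS deployments, which is an assumption here).

```python
from typing import Iterable, List, NamedTuple


class CleanupPlan(NamedTuple):
    pvs_to_delete: List[str]              # step 1
    heketi_volumes_to_delete: List[str]   # step 2
    gluster_volumes_to_delete: List[str]  # step 3


def plan_cleanup(
    pv_ids: Iterable[str],
    heketi_ids: Iterable[str],
    gluster_ids: Iterable[str],
    keep: Iterable[str] = (),
) -> CleanupPlan:
    """Return deletion candidates only; nothing is executed."""
    pv, heketi, gluster = set(pv_ids), set(heketi_ids), set(gluster_ids)
    protected = set(keep)  # ids that must never be listed for deletion
    return CleanupPlan(
        pvs_to_delete=sorted(pv - heketi - protected),
        heketi_volumes_to_delete=sorted(heketi - pv - protected),
        gluster_volumes_to_delete=sorted(gluster - heketi - protected),
    )


if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    plan = plan_cleanup(
        pv_ids=["vol_a", "vol_b"],
        heketi_ids=["vol_a", "vol_c", "heketidbstorage"],
        gluster_ids=["vol_a", "vol_c", "vol_d", "heketidbstorage"],
        keep=["heketidbstorage"],
    )
    print(plan)
```

Reviewing the plan by eye before acting on it is the point of keeping this a dry run.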

Comment 2 Thom Carlin 2017-12-05 16:47:54 UTC

Additional information:

1) OCP Chronology:
   a) OCP cluster has been running for about 230 days.
   b) OCP cluster was initially 3.4/NFS
   c) OCP cluster was upgraded to 3.5/NFS
   d) OCP cluster was switched from 3.5/NFS to 3.5/CNS 3.5
   e) OCP cluster was upgraded to 3.6/CNS 3.5
   f) OCP cluster was upgraded to its current state: 3.6/CNS 3.6
   g) OCP cluster has all GA patches applied, with the exception of python-requests (due to another bz)

2) OCP Persistent Volumes:
   a) There are currently 13 PVs.
   b) The number can vary from about 10 to 40, depending on the mix.
   c) The mix comes from running CI/CD Jenkins jobs and ad hoc requests from other projects

3) Topology:
   a) 1 Heketi instance (pod)
   b) 3 glusterfs instances (pods) in a single pool using local storage
   c) 1 gluster-s3-dc instance (pod) [defined but not currently used]
   d) 1 glusterblock-provisioner-dc (pod) [defined but not currently used]

Comment 3 Thom Carlin 2017-12-05 16:53:43 UTC

Replacing step 4 of Steps to Reproduce with:
4) lvs
5) Compare the 4 lists

Comment 4 Thom Carlin 2017-12-05 17:02:46 UTC

Initial workaround:
7) Delete any gluster-related Logical Volumes (brick_* and tp_*)

For production environments, please carefully test the commands in another environment and contact GSS. This workaround has *not* been fully vetted.
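
Picking out the Logical Volumes for step 7 can be sketched as a filter on the names printed by lvs. This is an illustration under assumptions, not a vetted procedure: it assumes gluster-related LVs are named brick_<id> and tp_<id>, and that the brick ids still in use have been collected separately by hand (for example from the heketi topology output). The name orphan_lvs is hypothetical.

```python
from typing import Iterable, List

# Assumed naming scheme for gluster-related Logical Volumes.
LV_PREFIXES = ("brick_", "tp_")


def orphan_lvs(lv_names: Iterable[str], known_brick_ids: Iterable[str]) -> List[str]:
    """Return brick_*/tp_* LV names whose id is not a known brick id.

    LVs without one of the prefixes (root, swap, ...) are never returned.
    """
    known = set(known_brick_ids)
    orphans = []
    for name in lv_names:
        for prefix in LV_PREFIXES:
            if name.startswith(prefix):
                if name[len(prefix):] not in known:
                    orphans.append(name)
                break  # a name matches at most one prefix
    return sorted(orphans)


if __name__ == "__main__":
    # Made-up LV names and brick ids, for illustration only.
    lvs_output = ["brick_aaa", "tp_aaa", "brick_bbb", "tp_bbb", "root", "swap"]
    print(orphan_lvs(lvs_output, known_brick_ids=["aaa"]))
```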

Comment 12 John Call 2018-01-22 18:00:45 UTC

(In reply to Thom Carlin from comment #0)
> Initial workaround:
> 1) delete the PVs that don't have volumes in heketi
> 2) delete the heketi volumes that don't have PVs
> 3) delete the gluster volumes that don't have heketi volumes
> 4) Shutdown glusterd pods
> 5) on each GlusterFS node, for volumes that don't seem to have valid bricks:
> delete the volume directories 
> 6) Startup glusterd

I also found it necessary to clean up (remove) logical volumes that had no associated gluster volumes.  I stole these steps from the heketi logs...

# umount /var/lib/heketi/mounts/vg_f067a6d1192e10332ef54923357f5d31/brick_9dff504c8f34d706c1a718f7b3f768da

# lvremove -f vg_f067a6d1192e10332ef54923357f5d31/tp_9dff504c8f34d706c1a718f7b3f768da

# sed -i.save "/brick_9dff504c8f34d706c1a718f7b3f768da/d" /var/lib/heketi/fstab

^^^ These commands were executed inside the CNS pods (e.g. oc rsh glusterfs-storage-ABCXYZ)
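
For more than one brick, the three commands above can be generated rather than typed. The sketch below only builds the command strings for review and runs nothing. It follows the paths and naming shown in this comment, and it assumes that volume group and brick ids are lowercase hexadecimal strings like the ones above. The function name brick_cleanup_commands is hypothetical.

```python
import re
from typing import List

# Paths as they appear in the commands above.
HEKETI_MOUNT_ROOT = "/var/lib/heketi/mounts"
HEKETI_FSTAB = "/var/lib/heketi/fstab"


def brick_cleanup_commands(vg_id: str, brick_id: str) -> List[str]:
    """Build (but do not run) the umount, lvremove and sed commands."""
    for value in (vg_id, brick_id):
        # Assumption: ids are lowercase hex; reject anything else so that
        # no unexpected characters end up in a shell command.
        if not re.fullmatch(r"[0-9a-f]+", value):
            raise ValueError(f"unexpected id: {value!r}")
    vg = f"vg_{vg_id}"
    brick = f"brick_{brick_id}"
    return [
        f"umount {HEKETI_MOUNT_ROOT}/{vg}/{brick}",
        f"lvremove -f {vg}/tp_{brick_id}",
        f'sed -i.save "/{brick}/d" {HEKETI_FSTAB}',
    ]


if __name__ == "__main__":
    for cmd in brick_cleanup_commands(
        "f067a6d1192e10332ef54923357f5d31",
        "9dff504c8f34d706c1a718f7b3f768da",
    ):
        print(cmd)
```

Printing the commands first keeps a human in the loop, which matches the caution given earlier in this bug about testing in another environment.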
