Bug 1303571 - Bricks used by glusterfs get unmounted from their respective nodes while attempting to stress.
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: lvm2
Version: 7.2
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: LVM and device-mapper development team
QA Contact: cluster-qe@redhat.com
Docs Contact:
Depends On:
Blocks:
Reported: 2016-02-01 05:44 EST by Ambarish
Modified: 2016-02-01 06:02 EST
CC: 7 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-02-01 06:02:51 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Ambarish 2016-02-01 05:44:07 EST
Description of problem:

The volumes are created using RHGS (Red Hat Gluster Storage), which uses LVM thin volumes carved out of the underlying PVs as bricks, the building blocks of a Gluster volume.
While running parallel I/O, the bricks get unmounted from their nodes (roughly 20 minutes after starting the workload).
This looks like an LVM issue (outside the scope of RHGS).
The VGs, PVs and LVs are intact.

One of the problematic LVs is RHS_vg6/RHS_lv6 (on server 10.70.37.134).
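
For reference, a brick of this kind is typically laid out roughly as follows (a minimal sketch reusing the VG/LV names from the logs; the device path, sizes and mount options are assumptions, not taken from this report):

# illustrative only - /dev/vdi and all sizes are assumed
pvcreate /dev/vdi
vgcreate RHS_vg6 /dev/vdi
lvcreate --size 90G --thinpool RHS_pool6 RHS_vg6                      # thin pool
lvcreate --virtualsize 200G --thin RHS_vg6/RHS_pool6 --name RHS_lv6   # thin LV used as the brick
mkfs.xfs /dev/RHS_vg6/RHS_lv6
mkdir -p /rhs/brick6
mount /dev/RHS_vg6/RHS_lv6 /rhs/brick6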


*SNIPPET FROM LOGS* :


Jan 30 19:01:36 dhcp37-134 lvm[16942]: WARNING: Device for PV Kc8B3r-1Qg1-kfVy-VU6c-1WlR-rczA-0q0eRO not found or rejected by a filter.
Jan 30 19:01:36 dhcp37-134 lvm[16942]: Cannot change VG RHS_vg4 while PVs are missing.
Jan 30 19:01:36 dhcp37-134 lvm[16942]: Consider vgreduce --removemissing.
Jan 30 19:01:36 dhcp37-134 lvm[16942]: Failed to extend thin RHS_vg4-RHS_pool4-tpool.
Jan 30 19:01:36 dhcp37-134 lvm[16942]: Unmounting thin volume RHS_vg4-RHS_pool4-tpool from /rhs/brick4.

Jan 30 19:12:01 dhcp37-134 lvm[16942]: WARNING: Device for PV PuNir0-yPa4-qC9a-aO5a-GnwY-2iOl-7Y2vAg not found or rejected by a filter.
Jan 30 19:12:01 dhcp37-134 lvm[16942]: Cannot change VG RHS_vg5 while PVs are missing.
Jan 30 19:12:01 dhcp37-134 lvm[16942]: Consider vgreduce --removemissing.
Jan 30 19:12:01 dhcp37-134 lvm[16942]: Failed to extend thin RHS_vg5-RHS_pool5-tpool.
Jan 30 19:12:01 dhcp37-134 lvm[16942]: Unmounting thin volume RHS_vg5-RHS_pool5-tpool from /rhs/brick5.

Jan 30 19:37:20 dhcp37-134 kernel: XFS (dm-26): Unmounting Filesystem
Jan 30 19:37:20 dhcp37-134 kernel: XFS (dm-21): Unmounting Filesystem

Jan 31 12:20:48 dhcp37-134 lvm[16942]: WARNING: Device for PV g1FsKG-lxkE-cxe5-VKFx-YZnp-HCEQ-kJmQkK not found or rejected by a filter.
Jan 31 12:20:48 dhcp37-134 lvm[16942]: Cannot change VG RHS_vg6 while PVs are missing.
Jan 31 12:20:48 dhcp37-134 lvm[16942]: Consider vgreduce --removemissing.
Jan 31 12:20:48 dhcp37-134 lvm[16942]: Failed to extend thin RHS_vg6-RHS_pool6-tpool.
Jan 31 12:20:48 dhcp37-134 lvm[16942]: Unmounting thin volume RHS_vg6-RHS_pool6-tpool from /rhs/brick6.
Jan 31 12:20:49 dhcp37-134 lvm[16942]: WARNING: Device for PV ENWJcR-Q8Cj-2ld6-hTa3-lrC7-ugfw-vne6em not found or rejected by a filter.
Jan 31 12:20:49 dhcp37-134 lvm[16942]: Cannot change VG RHS_vg7 while PVs are missing.
Jan 31 12:20:49 dhcp37-134 lvm[16942]: Consider vgreduce --removemissing.
Jan 31 12:20:49 dhcp37-134 lvm[16942]: Failed to extend thin RHS_vg7-RHS_pool7-tpool.
Jan 31 12:20:49 dhcp37-134 lvm[16942]: Unmounting thin volume RHS_vg7-RHS_pool7-tpool from /rhs/brick7.
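
When dmeventd reports a missing/filtered PV and a failed pool extension like the above, the state can be checked with standard LVM commands (illustrative; the actual data for this incident is in the attached sosreports):

pvs -o pv_name,pv_uuid,vg_name          # is the PV with the reported UUID still visible?
vgs RHS_vg6                             # free space left in the VG for autoextend?
lvs -a -o lv_name,lv_size,data_percent,metadata_percent RHS_vg6   # thin-pool fullness
grep -E '^[[:space:]]*(global_)?filter' /etc/lvm/lvm.conf         # any device filter that could reject the PV?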



*SNIPPET FROM DMESG* :

[/dev/vdi was the partition used for brick6]

[root@dhcp37-103 core]# dmesg|grep vdi
[258262.255796]  vdi: unknown partition table
[259125.585924]  vdi: unknown partition table
[259125.633420]  vdi: unknown partition table
[259125.636292]  vdi: unknown partition table
[259125.668662]  vdi: unknown partition table
[259147.521809]  vdi: unknown partition table
[259178.239697]  vdi: unknown partition table


*VOLUME CONFIGURATION* :

[root@dhcp37-134 tmp]# gluster v info khal
 
Volume Name: khal
Type: Tier
Volume ID: b261b0d8-e9dc-4014-90f5-0e869755e146
Status: Started
Number of Bricks: 26
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 4 x 2 = 8
Brick1: 10.70.37.134:/rhs/brick7/A1
Brick2: 10.70.37.134:/rhs/brick6/A1
Brick3: 10.70.37.134:/rhs/brick5/A1
Brick4: 10.70.37.103:/rhs/brick7/A1
Brick5: 10.70.37.134:/rhs/brick4/A1
Brick6: 10.70.37.103:/rhs/brick6/A1
Brick7: 10.70.37.134:/rhs/brick3/A1
Brick8: 10.70.37.103:/rhs/brick5/A1
Cold Tier:
Cold Tier Type : Distributed-Replicate
Number of Bricks: 9 x 2 = 18
Brick9: 10.70.37.218:/rhs/brick1/A1
Brick10: 10.70.37.41:/rhs/brick1/A1
Brick11: 10.70.37.218:/rhs/brick2/A1
Brick12: 10.70.37.41:/rhs/brick2/A1
Brick13: 10.70.37.218:/rhs/brick3/A1
Brick14: 10.70.37.41:/rhs/brick3/A1
Brick15: 10.70.37.218:/rhs/brick4/A1
Brick16: 10.70.37.41:/rhs/brick4/A1
Brick17: 10.70.37.218:/rhs/brick5/A1
Brick18: 10.70.37.41:/rhs/brick5/A1
Brick19: 10.70.37.103:/rhs/brick1/A1
Brick20: 10.70.37.218:/rhs/brick6/A1
Brick21: 10.70.37.103:/rhs/brick2/A1
Brick22: 10.70.37.218:/rhs/brick7/A1
Brick23: 10.70.37.103:/rhs/brick3/A1
Brick24: 10.70.37.134:/rhs/brick2/A1
Brick25: 10.70.37.103:/rhs/brick4/A1
Brick26: 10.70.37.134:/rhs/brick1/A1
Options Reconfigured:
cluster.self-heal-daemon: on
features.quota-deem-statfs: off
features.inode-quota: on
features.quota: on
cluster.watermark-hi: 50
cluster.watermark-low: 20
cluster.read-freq-threshold: 1
cluster.write-freq-threshold: 1
performance.io-cache: off
performance.quick-read: off
features.record-counters: on
cluster.tier-mode: cache
features.ctr-enabled: on
performance.readdir-ahead: on
[root@dhcp37-134 tmp]# 


Version-Release number of selected component (if applicable):

[root@dhcp37-134 tmp]# cat /etc/redhat-storage-release 
Red Hat Gluster Storage Server 3.1 Update 2

[root@dhcp37-134 tmp]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.2 (Maipo)
[root@dhcp37-134 tmp]# 


How reproducible:
Tried Once

Steps to Reproduce:

Set up a 9x2 volume, attach a 4x2 hot tier, and run parallel I/O from multiple clients.
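
For illustration, the setup corresponds roughly to the following (a hedged sketch; the brick lists are the ones shown in the volume info above, but the exact tiering syntax depends on the glusterfs/RHGS release, so attach-tier here is an assumption):

gluster volume create khal replica 2 <the 9x2 cold-tier bricks listed above>
gluster volume start khal
gluster volume attach-tier khal replica 2 <the 4x2 hot-tier bricks listed above>
# then mount the volume on each client and start the parallel I/O described below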

Actual results:

4 of the 8 hot-tier bricks get unmounted suddenly about 20 minutes into the workload.

Expected results:

Bricks should not be unmounted, brick processes should not be killed, and I/O should run successfully without hangs or crashes.

Additional info:


*NODES* [root/redhat]:

10.70.37.103
10.70.37.134


*CLIENTS* [root/redhat]:

10.70.37.199
10.70.37.87
10.70.37.96
10.70.37.61


*WORKLOAD DESCRIPTION*:

The following were run in parallel from the clients below (an illustrative command sketch follows the list).

Client 1 -> dd
Client 2 -> dd
Client 3 -> Linux untar
Client 4 -> Linux untar + Media copy
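
The commands below illustrate the workload mix (file sizes, paths, the /mnt/khal mount point and the kernel tarball are assumptions; only the workload types are from the report):

mkdir -p /mnt/khal/$(hostname)
dd if=/dev/zero of=/mnt/khal/$(hostname)/ddfile bs=1M count=10240   # clients 1 and 2
tar xf linux-4.4.tar.xz -C /mnt/khal/$(hostname)                    # clients 3 and 4
cp -r ~/media /mnt/khal/$(hostname)/ &                              # client 4, alongside the untar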
Comment 1 Ambarish 2016-02-01 05:53:29 EST
Tier logs, brick logs and sosreports have been copied here:

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1303571/
Comment 3 Ambarish 2016-02-01 05:57:01 EST
dmesg copied to same location
Comment 4 Ambarish 2016-02-01 05:58:57 EST
The environment is preserved for further debugging by the LVM team.
Comment 5 Zdenek Kabelac 2016-02-01 06:02:04 EST
(In reply to Ambarish from comment #0)
> Description of problem:
> 
> The volumes are created using RHGS(Red Hat Gluster Storage) which uses the
> underlying LVM PVs as bricks as the building block.
> While running parallel I/O,the bricks get unmounted from the nodes (after
> ~20 mins of starting the workload).
> This looks like an LVM issue(outside the scope of RHGS).
> The VGs,PVs and LV are intact.
> 
> One of the problematic LVs is RHS_vg6/RHS_lv6 (from Server : 10.70.37.134)
> 
> 
> *SNIPPET FROM LOGS* :
> 
> 
> Jan 30 19:01:36 dhcp37-134 lvm[16942]: WARNING: Device for PV
> Kc8B3r-1Qg1-kfVy-VU6c-1WlR-rczA-0q0eRO not found or rejected by a filter.
> Jan 30 19:01:36 dhcp37-134 lvm[16942]: Cannot change VG RHS_vg4 while PVs
> are missing.
> Jan 30 19:01:36 dhcp37-134 lvm[16942]: Consider vgreduce --removemissing.
> Jan 30 19:01:36 dhcp37-134 lvm[16942]: Failed to extend thin
> RHS_vg4-RHS_pool4-tpool.
> Jan 30 19:01:36 dhcp37-134 lvm[16942]: Unmounting thin volume
> RHS_vg4-RHS_pool4-tpool from /rhs/brick4.
> 


So the log output clearly shows what happens.

This is intended lvm2 behavior: a failed thin-pool extension currently triggers the unmount of every related thin volume, to avoid a bigger disaster (overfilling the pool).

If you want higher 'occupancy' of the thin-pool, raise the threshold; lvm2 currently tries to avoid overfilling the thin-pool by dropping thin volumes from use (so they cannot generate further load on the thin-pool).

As a fix, provide more space in the VG so that the thin-pool resize does not fail.
Use a higher percentage (up to 95%) as the threshold for thin-pool resize.
Use a smaller resize step (down to 1%), though more resize operations will occur and may slow down thin-pool usage a bit more.
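
For example, the knobs referred to above are the dmeventd autoextend settings in lvm.conf, plus manual pool extension; the values below are illustrative only, not a recommendation for this setup:

# /etc/lvm/lvm.conf, activation section
activation {
    thin_pool_autoextend_threshold = 95   # start autoextending once the pool is 95% full (100 disables it)
    thin_pool_autoextend_percent = 1      # grow the pool by 1% of its size per step
}

# or grow the pool by hand while the VG still has free space:
lvextend -L +10G RHS_vg6/RHS_pool6
lvs -a -o lv_name,lv_size,data_percent RHS_vg6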

lvm2 currently does not provide configurable options for this dmeventd behavior;
if you want lvm2 behavior other than described, please open an RFE.

We do plan to provide some more 'fine-grained' policy modes.
Comment 6 Zdenek Kabelac 2016-02-01 06:02:51 EST
No need for further debugging.

It works as designed, thus closing this BZ.
