Description of problem:

The volumes are created using RHGS (Red Hat Gluster Storage), which uses thin LVs carved from the underlying LVM PVs as the building blocks for its bricks. While running parallel I/O, the bricks get unmounted from the nodes (after ~20 minutes of starting the workload). This looks like an LVM issue (outside the scope of RHGS). The VGs, PVs and LVs are intact.

One of the problematic LVs is RHS_vg6/RHS_lv6 (from server 10.70.37.134).

*SNIPPET FROM LOGS* :

Jan 30 19:01:36 dhcp37-134 lvm[16942]: WARNING: Device for PV Kc8B3r-1Qg1-kfVy-VU6c-1WlR-rczA-0q0eRO not found or rejected by a filter.
Jan 30 19:01:36 dhcp37-134 lvm[16942]: Cannot change VG RHS_vg4 while PVs are missing.
Jan 30 19:01:36 dhcp37-134 lvm[16942]: Consider vgreduce --removemissing.
Jan 30 19:01:36 dhcp37-134 lvm[16942]: Failed to extend thin RHS_vg4-RHS_pool4-tpool.
Jan 30 19:01:36 dhcp37-134 lvm[16942]: Unmounting thin volume RHS_vg4-RHS_pool4-tpool from /rhs/brick4.
Jan 30 19:12:01 dhcp37-134 lvm[16942]: WARNING: Device for PV PuNir0-yPa4-qC9a-aO5a-GnwY-2iOl-7Y2vAg not found or rejected by a filter.
Jan 30 19:12:01 dhcp37-134 lvm[16942]: Cannot change VG RHS_vg5 while PVs are missing.
Jan 30 19:12:01 dhcp37-134 lvm[16942]: Consider vgreduce --removemissing.
Jan 30 19:12:01 dhcp37-134 lvm[16942]: Failed to extend thin RHS_vg5-RHS_pool5-tpool.
Jan 30 19:12:01 dhcp37-134 lvm[16942]: Unmounting thin volume RHS_vg5-RHS_pool5-tpool from /rhs/brick5.
Jan 30 19:37:20 dhcp37-134 kernel: XFS (dm-26): Unmounting Filesystem
Jan 30 19:37:20 dhcp37-134 kernel: XFS (dm-21): Unmounting Filesystem
Jan 31 12:20:48 dhcp37-134 lvm[16942]: WARNING: Device for PV g1FsKG-lxkE-cxe5-VKFx-YZnp-HCEQ-kJmQkK not found or rejected by a filter.
Jan 31 12:20:48 dhcp37-134 lvm[16942]: Cannot change VG RHS_vg6 while PVs are missing.
Jan 31 12:20:48 dhcp37-134 lvm[16942]: Consider vgreduce --removemissing.
Jan 31 12:20:48 dhcp37-134 lvm[16942]: Failed to extend thin RHS_vg6-RHS_pool6-tpool.
Jan 31 12:20:48 dhcp37-134 lvm[16942]: Unmounting thin volume RHS_vg6-RHS_pool6-tpool from /rhs/brick6.
Jan 31 12:20:49 dhcp37-134 lvm[16942]: WARNING: Device for PV ENWJcR-Q8Cj-2ld6-hTa3-lrC7-ugfw-vne6em not found or rejected by a filter.
Jan 31 12:20:49 dhcp37-134 lvm[16942]: Cannot change VG RHS_vg7 while PVs are missing.
Jan 31 12:20:49 dhcp37-134 lvm[16942]: Consider vgreduce --removemissing.
Jan 31 12:20:49 dhcp37-134 lvm[16942]: Failed to extend thin RHS_vg7-RHS_pool7-tpool.
Jan 31 12:20:49 dhcp37-134 lvm[16942]: Unmounting thin volume RHS_vg7-RHS_pool7-tpool from /rhs/brick7.
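For reference (not from the original report), a minimal sketch of how the PV/VG/LV state could be re-checked and one of the affected bricks remounted once the device is visible again; the VG, LV and mount point names are taken from this setup, everything else is an assumption:

  # rescan devices and verify that the PVs/VGs/LVs are visible again
  pvscan --cache
  pvs -o pv_name,pv_uuid,vg_name
  vgs RHS_vg6
  lvs -a RHS_vg6

  # if the VG is complete again, re-activate it and remount the brick LV
  vgchange -ay RHS_vg6
  mount /dev/RHS_vg6/RHS_lv6 /rhs/brick6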
*SNIPPET FROM DMESG* :

[/dev/vdi was the partition used for brick6]

[root@dhcp37-103 core]# dmesg|grep vdi
[258262.255796] vdi: unknown partition table
[259125.585924] vdi: unknown partition table
[259125.633420] vdi: unknown partition table
[259125.636292] vdi: unknown partition table
[259125.668662] vdi: unknown partition table
[259147.521809] vdi: unknown partition table
[259178.239697] vdi: unknown partition table

*VOLUME CONFIGURATION* :

[root@dhcp37-134 tmp]# gluster v info khal

Volume Name: khal
Type: Tier
Volume ID: b261b0d8-e9dc-4014-90f5-0e869755e146
Status: Started
Number of Bricks: 26
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 4 x 2 = 8
Brick1: 10.70.37.134:/rhs/brick7/A1
Brick2: 10.70.37.134:/rhs/brick6/A1
Brick3: 10.70.37.134:/rhs/brick5/A1
Brick4: 10.70.37.103:/rhs/brick7/A1
Brick5: 10.70.37.134:/rhs/brick4/A1
Brick6: 10.70.37.103:/rhs/brick6/A1
Brick7: 10.70.37.134:/rhs/brick3/A1
Brick8: 10.70.37.103:/rhs/brick5/A1
Cold Tier:
Cold Tier Type : Distributed-Replicate
Number of Bricks: 9 x 2 = 18
Brick9: 10.70.37.218:/rhs/brick1/A1
Brick10: 10.70.37.41:/rhs/brick1/A1
Brick11: 10.70.37.218:/rhs/brick2/A1
Brick12: 10.70.37.41:/rhs/brick2/A1
Brick13: 10.70.37.218:/rhs/brick3/A1
Brick14: 10.70.37.41:/rhs/brick3/A1
Brick15: 10.70.37.218:/rhs/brick4/A1
Brick16: 10.70.37.41:/rhs/brick4/A1
Brick17: 10.70.37.218:/rhs/brick5/A1
Brick18: 10.70.37.41:/rhs/brick5/A1
Brick19: 10.70.37.103:/rhs/brick1/A1
Brick20: 10.70.37.218:/rhs/brick6/A1
Brick21: 10.70.37.103:/rhs/brick2/A1
Brick22: 10.70.37.218:/rhs/brick7/A1
Brick23: 10.70.37.103:/rhs/brick3/A1
Brick24: 10.70.37.134:/rhs/brick2/A1
Brick25: 10.70.37.103:/rhs/brick4/A1
Brick26: 10.70.37.134:/rhs/brick1/A1
Options Reconfigured:
cluster.self-heal-daemon: on
features.quota-deem-statfs: off
features.inode-quota: on
features.quota: on
cluster.watermark-hi: 50
cluster.watermark-low: 20
cluster.read-freq-threshold: 1
cluster.write-freq-threshold: 1
performance.io-cache: off
performance.quick-read: off
features.record-counters: on
cluster.tier-mode: cache
features.ctr-enabled: on
performance.readdir-ahead: on
[root@dhcp37-134 tmp]#

Version-Release number of selected component (if applicable):

[root@dhcp37-134 tmp]# cat /etc/redhat-storage-release
Red Hat Gluster Storage Server 3.1 Update 2
[root@dhcp37-134 tmp]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.2 (Maipo)
[root@dhcp37-134 tmp]#

How reproducible:
Tried once.

Steps to Reproduce:
1. Create a 9 x 2 distributed-replicate volume.
2. Attach a 4 x 2 hot tier.
3. Run parallel I/O from multiple clients (a rough CLI sketch follows below).

Actual results:
4 of the 8 hot tier bricks get unmounted all of a sudden, ~20 minutes into the workload.

Expected results:
Bricks should not be unmounted, brick processes should not be killed, and I/O must run successfully without hangs or crashes.

Additional info:

*NODES* [root/redhat]:
10.70.37.103
10.70.37.134

*CLIENTS* [root/redhat]:
10.70.37.199
10.70.37.87
10.70.37.96
10.70.37.61

*WORKLOAD DESCRIPTION*:
The following were run in parallel from different threads:
Client 1 -> dd
Client 2 -> dd
Client 3 -> Linux untar
Client 4 -> Linux untar + media copy
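As referenced in the steps above, a rough reproduction sketch (not from the original report) using GlusterFS 3.7-era CLI syntax as shipped with RHGS 3.1.x; the tier attach command has changed in later releases, the brick lists are abbreviated, and the client commands are only illustrative:

  # create and start the 9 x 2 distributed-replicate (cold) volume
  gluster volume create khal replica 2 \
      10.70.37.218:/rhs/brick1/A1 10.70.37.41:/rhs/brick1/A1 \
      10.70.37.218:/rhs/brick2/A1 10.70.37.41:/rhs/brick2/A1   # ... remaining 7 brick pairs
  gluster volume start khal

  # attach the 4 x 2 hot tier (attach-tier syntax as in glusterfs 3.7)
  gluster volume attach-tier khal replica 2 \
      10.70.37.134:/rhs/brick7/A1 10.70.37.134:/rhs/brick6/A1   # ... remaining 6 hot bricks

  # on each client: mount the volume and drive parallel I/O
  mkdir -p /mnt/khal
  mount -t glusterfs 10.70.37.134:/khal /mnt/khal
  dd if=/dev/zero of=/mnt/khal/ddfile.$HOSTNAME bs=1M count=10240 &   # clients 1 and 2
  tar xf linux-4.4.tar.xz -C /mnt/khal/                               # clients 3 and 4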
Tier logs, brick logs and sosreports have been copied here: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1303571/
dmesg copied to same location
The environment is preserved for further debugging by the LVM team.
(In reply to Ambarish from comment #0)
> Description of problem:
> 
> The volumes are created using RHGS(Red Hat Gluster Storage) which uses the
> underlying LVM PVs as bricks as the building block.
> While running parallel I/O,the bricks get unmounted from the nodes (after
> ~20 mins of starting the workload).
> This looks like an LVM issue(outside the scope of RHGS).
> The VGs,PVs and LV are intact.
> 
> One of the problematic LVs is RHS_vg6/RHS_lv6 (from Server : 10.70.37.134)
> 
> 
> *SNIPPET FROM LOGS* :
> 
> 
> Jan 30 19:01:36 dhcp37-134 lvm[16942]: WARNING: Device for PV
> Kc8B3r-1Qg1-kfVy-VU6c-1WlR-rczA-0q0eRO not found or rejected by a filter.
> Jan 30 19:01:36 dhcp37-134 lvm[16942]: Cannot change VG RHS_vg4 while PVs
> are missing.
> Jan 30 19:01:36 dhcp37-134 lvm[16942]: Consider vgreduce --removemissing.
> Jan 30 19:01:36 dhcp37-134 lvm[16942]: Failed to extend thin
> RHS_vg4-RHS_pool4-tpool.
> Jan 30 19:01:36 dhcp37-134 lvm[16942]: Unmounting thin volume
> RHS_vg4-RHS_pool4-tpool from /rhs/brick4.
> 

So the log output shows clearly what happens.

This is intended lvm2 behavior: when extension of a thin pool fails, every related thin volume is unmounted to avoid a bigger disaster (overfilling the pool). lvm2 currently tries to avoid overfilling a thin pool by dropping thin volumes from use (and thus preventing them from generating further load on the pool). If you want higher occupancy of the thin pool, raise the autoextend threshold.

As a fix:
Provide more space in the VG so the thin-pool resize does not fail.
Use a higher threshold percentage (up to 95%) for the thin-pool resize.
Use a smaller resize step (down to 1%), though more resize operations will occur and may slow down thin-pool usage a bit more.

lvm2 currently does not provide configurable options for dmeventd behavior. If you want lvm2 behavior other than described here, please open an RFE. We do plan to provide some more fine-grained policy modes.
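To make the knobs mentioned above concrete, here is an illustrative excerpt of the standard dmeventd autoextend settings in /etc/lvm/lvm.conf (activation section); the parameter names are the stock lvm2 options, but the values shown are only examples, not a recommendation from this BZ:

  # /etc/lvm/lvm.conf
  activation {
      # dmeventd attempts to extend the thin pool once it is this % full;
      # a higher value (towards 95) allows higher pool occupancy
      thin_pool_autoextend_threshold = 95

      # each autoextend step grows the pool by this % of its size;
      # a smaller step (down to 1) resizes in finer increments
      thin_pool_autoextend_percent = 10
  }

Independently of these settings, the VG itself needs enough free space for the autoextend to succeed, e.g. by adding another PV with "vgextend RHS_vg6 /dev/<new-device>" (placeholder device name).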
No need for further debugging. It works as designed, thus closing this BZ.