+++ This bug was initially created as a clone of Bug #1815192 +++

Description of problem:

One host (of six) became non-operational in the RHV GUI this week. This seems to be the first instance of the error:

[root@dell-r640-02 vdsm]# xzcat vdsm.log.69.xz | less
...
2020-03-16 12:40:46,953-0400 INFO (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmIoTunePolicies succeeded in 0.00 seconds (__init__:312)
2020-03-16 12:40:47,351-0400 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/glusterSD/dell-r640-01.gluster.tamlab.rdu2.redhat.com:_vmstore/2881e321-b95b-40c8-9874-a21a8eeb3167/dom_md/metadata (monitor:499)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 497, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line 391, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
MiscFileReadException: Internal file read failure: (u'/rhev/data-center/mnt/glusterSD/dell-r640-01.gluster.tamlab.rdu2.redhat.com:_vmstore/2881e321-b95b-40c8-9874-a21a8eeb3167/dom_md/metadata', 1, 'Read timeout')
2020-03-16 12:40:47,352-0400 INFO (check/loop) [storage.Monitor] Domain 2881e321-b95b-40c8-9874-a21a8eeb3167 became INVALID (monitor:470)
...
2020-03-16 12:41:49,805-0400 INFO (periodic/2009) [vdsm.api] START repoStats(domains=()) from=internal, task_id=8b1ff4a9-ac21-4ec4-b777-55ddda3fe03e (api:48)
2020-03-16 12:41:49,806-0400 INFO (periodic/2009) [vdsm.api] FINISH repoStats return={u'a102ec6f-31b7-4503-87a8-d581f921099f': {'code': 0, 'actual': True, 'version': 5, 'acquired': True, 'delay': '0.000721486', 'lastCheck': '5.8', 'valid': True}, u'689445af-cc20-4f4d-80f1-cd6cc1ec6bca': {'code': 0, 'actual': True, 'version': 0, 'acquired': True, 'delay': '0.000332825', 'lastCheck': '2.4', 'valid': True}, u'2881e321-b95b-40c8-9874-a21a8eeb3167': {'code': 0, 'actual': True, 'version': 5, 'acquired': True, 'delay': '0.000685773', 'lastCheck': '11.9', 'valid': True}, u'05b3133d-04da-4f2b-b289-21e2da94b6e1': {'code': 0, 'actual': True, 'version': 0, 'acquired': True, 'delay': '0.000273442', 'lastCheck': '7.2', 'valid': True}} from=internal, task_id=8b1ff4a9-ac21-4ec4-b777-55ddda3fe03e (api:54)
2020-03-16 12:41:50,544-0400 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/glusterSD/dell-r640-01.gluster.tamlab.rdu2.redhat.com:_vmstore/2881e321-b95b-40c8-9874-a21a8eeb3167/dom_md/metadata (monitor:499)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 497, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line 391, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
MiscFileReadException: Internal file read failure: (u'/rhev/data-center/mnt/glusterSD/dell-r640-01.gluster.tamlab.rdu2.redhat.com:_vmstore/2881e321-b95b-40c8-9874-a21a8eeb3167/dom_md/metadata', 1, bytearray(b"/usr/bin/dd: failed to open \'/rhev/data-center/mnt/glusterSD/dell-r640-01.gluster.tamlab.rdu2.redhat.com:_vmstore/2881e321-b95b-40c8-9874-a21a8eeb3167/dom_md/metadata\': Software caused connection abort\n"))
2020-03-16 12:41:50,544-0400 INFO (check/loop) [storage.Monitor] Domain 2881e321-b95b-40c8-9874-a21a8eeb3167 became INVALID (monitor:470)
2020-03-16 12:41:50,545-0400 ERROR (monitor/2881e32) [storage.Monitor] Error checking domain 2881e321-b95b-40c8-9874-a21a8eeb3167 (monitor:425)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 406, in _checkDomainStatus
    self.domain.selftest()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 717, in selftest
    self.oop.os.statvfs(self.domaindir)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 245, in statvfs
    return self._iop.statvfs(path)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 479, in statvfs
    resdict = self._sendCommand("statvfs", {"path": path}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 448, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 103] Software caused connection abort
2020-03-16 12:41:50,546-0400 INFO (jsonrpc/3) [storage.StorageDomainManifest] Renaming dir /rhev/data-center/mnt/glusterSD/dell-r640-01.gluster.tamlab.rdu2.redhat.com:_vmstore/2881e321-b95b-40c8-9874-a21a8eeb3167/images/8b1efd6b-e636-4699-bbc9-f5c813d3dd89 to /rhev/data-center/mnt/glusterSD/dell-r640-01.gluster.tamlab.rdu2.redhat.com:_vmstore/2881e321-b95b-40c8-9874-a21a8eeb3167/images/_remove_me_8b1efd6b-e636-4699-bbc9-f5c813d3dd89 (fileSD:240)
2020-03-16 12:41:50,547-0400 ERROR (jsonrpc/3) [storage.StorageDomainManifest] image: /rhev/data-center/mnt/glusterSD/dell-r640-01.gluster.tamlab.rdu2.redhat.com:_vmstore/2881e321-b95b-40c8-9874-a21a8eeb3167/images/8b1efd6b-e636-4699-bbc9-f5c813d3dd89 can't be moved (fileSD:244)
2020-03-16 12:41:50,548-0400 INFO (jsonrpc/3) [vdsm.api] FINISH deleteImage error=Could not remove all image's volumes: (u'8b1efd6b-e636-4699-bbc9-f5c813d3dd89 [Errno 107] Transport endpoint is not connected',) from=::ffff:10.8.117.7,56200, flow_id=c087f480-fa93-4a76-ae9d-e4a6d58f3814, task_id=f49e527e-e8f8-4366-bfb7-beeb0238dba8 (api:52)
2020-03-16 12:41:50,548-0400 ERROR (jsonrpc/3) [storage.TaskManager.Task] (Task='f49e527e-e8f8-4366-bfb7-beeb0238dba8') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in deleteImage
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 1527, in deleteImage
    pool.deleteImage(dom, imgUUID, volsByImg)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 2003, in deleteImage
    domain.deleteImage(domain.sdUUID, imgUUID, volsByImg)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 740, in deleteImage
    self._manifest.deleteImage(sdUUID, imgUUID, volsImgs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 245, in deleteImage
    raise se.ImageDeleteError("%s %s" % (imgUUID, str(e)))
ImageDeleteError: Could not remove all image's volumes: (u'8b1efd6b-e636-4699-bbc9-f5c813d3dd89 [Errno 107] Transport endpoint is not connected',)
2020-03-16 12:41:50,548-0400 INFO (jsonrpc/3) [storage.TaskManager.Task] (Task='f49e527e-e8f8-4366-bfb7-beeb0238dba8') aborting: Task is aborted: "Could not remove all image's volumes: (u'8b1efd6b-e636-4699-bbc9-f5c813d3dd89 [Errno 107] Transport endpoint is not connected',)" - code 256 (task:1181)
2020-03-16 12:41:50,548-0400 ERROR (jsonrpc/3) [storage.Dispatcher] FINISH deleteImage error=Could not remove all image's volumes: (u'8b1efd6b-e636-4699-bbc9-f5c813d3dd89 [Errno 107] Transport endpoint is not connected',) (dispatcher:83)
2020-03-16 12:41:50,548-0400 INFO (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call Image.delete failed (error 256) in 6.66 seconds (__init__:312)

Since then, we are unable to get the host to mount the vmstore storage domain.

[root@dell-r640-02 ~]# df -h
df: ‘/rhev/data-center/mnt/glusterSD/dell-r640-01.gluster.tamlab.rdu2.redhat.com:_vmstore’: Transport endpoint is not connected
Filesystem                                           Size  Used Avail Use% Mounted on
/dev/mapper/rhvh-rhvh--4.3.0.7--0.20190512.0+1        99G  2.0G   92G   3% /
devtmpfs                                              94G     0   94G   0% /dev
tmpfs                                                 94G   16K   94G   1% /dev/shm
tmpfs                                                 94G   99M   94G   1% /run
tmpfs                                                 94G     0   94G   0% /sys/fs/cgroup
/dev/mapper/rhvh-home                                976M  2.6M  907M   1% /home
/dev/sdc1                                            976M  183M  726M  21% /boot
/dev/mapper/rhvh-tmp                                 976M  2.7M  907M   1% /tmp
/dev/mapper/rhvh-var                                  15G  333M   14G   3% /var
/dev/mapper/rhvh-var_crash                           9.8G   37M  9.2G   1% /var/crash
/dev/mapper/rhvh-var_log                              20G  797M   18G   5% /var/log
/dev/mapper/rhvh-var_log_audit                       2.0G   43M  1.8G   3% /var/log/audit
/dev/mapper/gluster_vg-gluster_lv_engine             100G   36G   65G  36% /gluster_bricks/engine
/dev/mapper/gluster_vg-gluster_lv_vmstore            3.5T  3.5T   42G  99% /gluster_bricks/vmstore
dell-r640-01.gluster.tamlab.rdu2.redhat.com:/engine  100G   37G   64G  37% /rhev/data-center/mnt/glusterSD/dell-r640-01.gluster.tamlab.rdu2.redhat.com:_engine
dell-per630-4.gsslab.rdu2.redhat.com:/home/isos      491G  130G  361G  27% /rhev/data-center/mnt/dell-per630-4.gsslab.rdu2.redhat.com:_home_isos
dell-per630-4.gsslab.rdu2.redhat.com:/home/exports   491G  130G  361G  27% /rhev/data-center/mnt/dell-per630-4.gsslab.rdu2.redhat.com:_home_exports
tmpfs                                                 19G     0   19G   0% /run/user/0

We realize that the vmstore SD is full, but none of the other hosts have this issue.

Version-Release number of selected component (if applicable):

[root@dell-r640-02 ~]# rpm -q glusterfs
glusterfs-3.12.2-47.el7rhgs.x86_64

--- Additional comment from RHEL Program Management on 2020-03-19 18:10:37 UTC ---

This bug is automatically being proposed for RHHI-V 1.8 release at Red Hat Hyperconverged Infrastructure for Virtualization product, by setting the release flag 'rhiv-1.8' to '?'. If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Krutika Dhananjay on 2020-03-20 12:06:07 UTC ---

What was the size of the VM?

-Krutika

--- Additional comment from Krutika Dhananjay on 2020-03-20 12:50:51 UTC ---

Cancelling needinfo, found the issue.
Upstream tracker - https://github.com/gluster/glusterfs/issues/1127

--- Additional comment from Krutika Dhananjay on 2020-03-23 07:04:18 UTC ---

Upstream patch - https://review.gluster.org/c/glusterfs/+/24244

--- Additional comment from Donald Berry on 2020-03-26 19:15:34 UTC ---

Removed the ghost shards and both nodes became operational. Here is part of the email thread:

Hi Krutika,

1. .remove_me/

The same gfids were in all three hosts (4-6) of that subvolume (brick /rhgs/vmstore), e.g. 4:

[root@dell-r640-04 .remove_me]# ll
total 0
----------. 2 root root 0 Mar 18 20:56 066a8381-42ab-49fe-b18b-31cec5274092
----------. 2 root root 0 Mar 18 18:47 0d3d148a-e0bb-4033-9cc1-a6f0abddf65a
----------. 2 root root 0 Mar 18 18:47 10242631-4de0-478a-bdb4-91bf6420dd2e
----------. 2 root root 0 Mar 18 18:09 1cd8dcd6-3236-4ec0-bcd1-f59d85426ecb
----------. 2 root root 0 Mar 18 18:10 4e62d28e-8910-45a6-9ad8-098b4cf3bb07
----------. 2 root root 0 Mar 18 18:47 5d2b410e-cf1b-4aff-95e5-667450f8aea0
----------. 2 root root 0 Mar 18 18:47 70738a38-5e21-4180-aeec-17fda76991ed
----------. 2 root root 0 Mar 18 18:10 7d7d3fb3-f0fc-472f-93e4-f1d290209ce9
----------. 2 root root 0 Mar 18 18:47 7e9d6fe1-ce21-43a8-8ec6-fb586ba233b2
----------. 2 root root 0 Mar 18 18:47 91ef3b6b-c06d-4b14-aae3-d004bf19a523
----------. 2 root root 0 Mar 18 18:47 92ba4df0-828a-476a-b1d0-a4de40e1e1aa
----------. 2 root root 0 Mar 18 20:56 a78844f2-642e-4e15-a841-293e27edc663
----------. 2 root root 0 Mar 18 18:10 abf696e8-c8e8-480a-bac0-9348241e1d01
----------. 2 root root 0 Mar 18 18:09 adc71f9f-9408-4193-b5f2-f2a0ceb1423b
----------. 2 root root 0 Mar 18 18:09 c186910e-c8f8-4ce0-881d-1da99b7fc48a
----------. 2 root root 0 Mar 24 16:06 cb3ba6bd-b174-44b1-89ba-d8d507c320cb
----------. 2 root root 0 Mar 18 20:56 ed869b0e-470a-4f49-9c06-7e46ed1bf7a0
----------. 2 root root 0 Mar 18 18:47 fbaf5657-f41b-41d5-af7d-bdbfbfa5c305
[root@dell-r640-04 .remove_me]#

There were no gfids in .remove_me/ on hosts 1-3:

[root@dell-r640-01 .remove_me]# ll
total 0
[root@dell-r640-01 .remove_me]#

2. Stale file handle

I checked all 6 hosts, but only two (3, 5) had those messages, and they were for the same gfids that were in .remove_me on 4-6:

[root@dell-r640-03 glusterfs]# awk '/MSGID: 133021/ {print $15}' rhev-data-center-mnt-glusterSD-dell-r640-01.gluster.tamlab.rdu2.redhat.com\:_vmstore.log | sort -u
066a8381-42ab-49fe-b18b-31cec5274092
0d3d148a-e0bb-4033-9cc1-a6f0abddf65a
10242631-4de0-478a-bdb4-91bf6420dd2e
1cd8dcd6-3236-4ec0-bcd1-f59d85426ecb
4e62d28e-8910-45a6-9ad8-098b4cf3bb07
5d2b410e-cf1b-4aff-95e5-667450f8aea0
70738a38-5e21-4180-aeec-17fda76991ed
7d7d3fb3-f0fc-472f-93e4-f1d290209ce9
7e9d6fe1-ce21-43a8-8ec6-fb586ba233b2
91ef3b6b-c06d-4b14-aae3-d004bf19a523
92ba4df0-828a-476a-b1d0-a4de40e1e1aa
a78844f2-642e-4e15-a841-293e27edc663
abf696e8-c8e8-480a-bac0-9348241e1d01
adc71f9f-9408-4193-b5f2-f2a0ceb1423b
c186910e-c8f8-4ce0-881d-1da99b7fc48a
ed869b0e-470a-4f49-9c06-7e46ed1bf7a0
fbaf5657-f41b-41d5-af7d-bdbfbfa5c305

[root@dell-r640-05 glusterfs]# awk '/MSGID: 133021/ {print $15}' rhev-data-center-mnt-glusterSD-dell-r640-01.gluster.tamlab.rdu2.redhat.com\:_vmstore.log | sort -u
066a8381-42ab-49fe-b18b-31cec5274092
0d3d148a-e0bb-4033-9cc1-a6f0abddf65a
10242631-4de0-478a-bdb4-91bf6420dd2e
1cd8dcd6-3236-4ec0-bcd1-f59d85426ecb
4e62d28e-8910-45a6-9ad8-098b4cf3bb07
5d2b410e-cf1b-4aff-95e5-667450f8aea0
70738a38-5e21-4180-aeec-17fda76991ed
7d7d3fb3-f0fc-472f-93e4-f1d290209ce9
7e9d6fe1-ce21-43a8-8ec6-fb586ba233b2
91ef3b6b-c06d-4b14-aae3-d004bf19a523
92ba4df0-828a-476a-b1d0-a4de40e1e1aa
a78844f2-642e-4e15-a841-293e27edc663
abf696e8-c8e8-480a-bac0-9348241e1d01
adc71f9f-9408-4193-b5f2-f2a0ceb1423b
c186910e-c8f8-4ce0-881d-1da99b7fc48a
ed869b0e-470a-4f49-9c06-7e46ed1bf7a0
fbaf5657-f41b-41d5-af7d-bdbfbfa5c305

The other hosts either did not have those messages, or they were from Mar. 22:

[root@dell-r640-01 glusterfs]# grep -cs 'MSGID: 133021' * | grep -v 0$
rhev-data-center-mnt-glusterSD-dell-r640-01.gluster.tamlab.rdu2.redhat.com:_vmstore.log-20200322:338

[root@dell-r640-02 glusterfs]# grep -s 'MSGID: 133021' *
[root@dell-r640-02 glusterfs]#

[root@dell-r640-04 glusterfs]# grep -s 'MSGID: 133021' *
[root@dell-r640-04 glusterfs]#

[root@dell-r640-06 glusterfs]# grep -cs 'MSGID: 133021' * | grep -v 0$
rhev-data-center-mnt-glusterSD-dell-r640-01.gluster.tamlab.rdu2.redhat.com:_vmstore.log-20200322:6147
3. I removed the ghost shards on 4-6:

[root@dell-r640-04 ~]# for gfid in $(ls /rhgs/vmstore/vmstore/.shard/.remove_me); do rm -f /rhgs/vmstore/vmstore/.shard/$gfid*; done
[root@dell-r640-05 ~]# for gfid in $(ls /rhgs/vmstore/vmstore/.shard/.remove_me); do rm -f /rhgs/vmstore/vmstore/.shard/$gfid*; done
[root@dell-r640-06 ~]# for gfid in $(ls /rhgs/vmstore/vmstore/.shard/.remove_me); do rm -f /rhgs/vmstore/vmstore/.shard/$gfid*; done

After a few minutes there were no more gfids in .remove_me/:

[root@dell-r640-04 .remove_me]# ll
total 0
[root@dell-r640-04 .remove_me]#

4. gfids on 1-3

Immediately after deleting the ghost shards on 4-6, nodes 1-3 showed the following shard counts:

create a list:
[root@dell-r640-03 glusterfs]# awk '/MSGID: 133021/ {print $15}' rhev-data-center-mnt-glusterSD-dell-r640-01.gluster.tamlab.rdu2.redhat.com\:_vmstore.log | sort -u > /tmp/gfid

shards:
[root@dell-r640-03 .shard]# for gfid in $(cat /tmp/gfid); do ls /gluster_bricks/vmstore/vmstore/.shard/$gfid* | wc -l; done
2344
22
30
692
949
31
26
934
32
12
29
790
ls: cannot access /gluster_bricks/vmstore/vmstore/.shard/abf696e8-c8e8-480a-bac0-9348241e1d01*: No such file or directory
0
571
355
985
16

After a few minutes, the shard counts were almost 0, but there are still some remaining:

[root@dell-r640-01 .shard]# for gfid in $(cat /tmp/gfid); do echo -n "$gfid: "; ls /gluster_bricks/vmstore/vmstore/.shard/$gfid* 2>/dev/null | wc -l; done
066a8381-42ab-49fe-b18b-31cec5274092: 0
0d3d148a-e0bb-4033-9cc1-a6f0abddf65a: 0
10242631-4de0-478a-bdb4-91bf6420dd2e: 0
1cd8dcd6-3236-4ec0-bcd1-f59d85426ecb: 3 <--
4e62d28e-8910-45a6-9ad8-098b4cf3bb07: 0
5d2b410e-cf1b-4aff-95e5-667450f8aea0: 0
70738a38-5e21-4180-aeec-17fda76991ed: 0
7d7d3fb3-f0fc-472f-93e4-f1d290209ce9: 0
7e9d6fe1-ce21-43a8-8ec6-fb586ba233b2: 0
91ef3b6b-c06d-4b14-aae3-d004bf19a523: 0
92ba4df0-828a-476a-b1d0-a4de40e1e1aa: 0
a78844f2-642e-4e15-a841-293e27edc663: 0
abf696e8-c8e8-480a-bac0-9348241e1d01: 0
adc71f9f-9408-4193-b5f2-f2a0ceb1423b: 7 <--
c186910e-c8f8-4ce0-881d-1da99b7fc48a: 1 <--
ed869b0e-470a-4f49-9c06-7e46ed1bf7a0: 0
fbaf5657-f41b-41d5-af7d-bdbfbfa5c305: 0

Q. Do I need to delete the remaining ghost gfids on nodes 1-3? Do I need to run 'gluster volume heal vmstore full'?

Note, I activated both nodes (2 and 6), and both are still up. Storage/Volumes/vmstore shows 6/6 bricks active.

[root@dell-r640-02 ~]# df /gluster_bricks/vmstore /rhev/data-center/mnt/glusterSD/dell-r640-01.gluster.tamlab.rdu2.redhat.com:_vmstore
Filesystem                                            1K-blocks       Used  Available Use% Mounted on
/dev/mapper/gluster_vg-gluster_lv_vmstore            3757840384 1568428228 2189412156  42% /gluster_bricks/vmstore
dell-r640-01.gluster.tamlab.rdu2.redhat.com:/vmstore 7515680768 3883147280 3632533488  52% /rhev/data-center/mnt/glusterSD/dell-r640-01.gluster.tamlab.rdu2.redhat.com:_vmstore

Thanks.

Don

On 2020-03-26 5:31 a.m., Krutika Dhananjay wrote:
>
>
> On Thu, Mar 26, 2020 at 12:37 AM Donald Berry <dberry <dberry>> wrote:
>
> Hi Krutika,
>
> If I check the gfids in .remove_me/, there are many matching entries, e.g. for the first one, 066a8381-42ab-49fe-b18b-31cec5274092 there are 2455 shards:
>
> [root@dell-r640-06 ~]# cd /rhgs/vmstore/vmstore/.shard/
> [root@dell-r640-06 .shard]# ll 066a8381-42ab-49fe-b18b-31cec5274092*
> -rw-rw----. 2 root root 67108864 Sep 26 13:10 066a8381-42ab-49fe-b18b-31cec5274092.1
> -rw-rw----. 2 root root 67108864 Sep 26 13:10 066a8381-42ab-49fe-b18b-31cec5274092.10
> -rw-rw----. 2 root root 67108864 Sep 26 13:10 066a8381-42ab-49fe-b18b-31cec5274092.1000
> ...
> rw-rw----. 2 root root 67108864 Sep 26 13:10 066a8381-42ab-49fe-b18b-31cec5274092.996
> -rw-rw----. 2 root root 67108864 Sep 26 13:10 066a8381-42ab-49fe-b18b-31cec5274092.998
> -rw-rw----. 2 root root 67108864 Sep 26 13:10 066a8381-42ab-49fe-b18b-31cec5274092.999
> [root@dell-r640-06 .shard]# ll 066a8381-42ab-49fe-b18b-31cec5274092* | wc -l
> 2455
>
> Yes this is expected, they are all sharded files.
>
> Are you saying we need to remove all of these? Here are the others:
>
> [root@dell-r640-06 ~]# for gfid in $(ls /rhgs/vmstore/vmstore/.shard/.remove_me); do ls /rhgs/vmstore/vmstore/.shard/$gfid* | wc -l ; done
> 2455
> 21
> 23
> 573
> 970
> 28
> 29
> 985
> 23
> 23
> 24
> 809
> ls: cannot access /rhgs/vmstore/vmstore/.shard/abf696e8-c8e8-480a-bac0-9348241e1d01*: No such file or directory
> 0
> 571
> 375
> 3
> 1919
> 15
>
> Also, if I check for dht errors, there is only one error, yesterday at 2:30 AM, which was cleaned up yesterday (shard c186910e-c8f8-4ce0-881d-1da99b7fc48a.480):
>
>
> So here's the thing. That's one part of the solution. But what I'm trying to get at is that there are all these gfids that are listed under .shard/.remove_me of 06 that have ghost shards but the base file (the file to which they all belong) themselves are non-existent.
> I confirmed this by trying to for example find c186910e-c8f8-4ce0-881d-1da99b7fc48a in the file system by checking .glusterfs/c1/86 in the 6 bricks and it doesn't exist. Same thing is true for the rest of the gfids under .remove_me. So while c186910e-c8f8-4ce0-881d-1da99b7fc48a.480 may temporarily fix your problem, there are two issues -
> 1. There could be more files with some shard that has this gfid mismatch. Finding them would be like finding a needle in a haystack.
> 2. Moreover, these are all ghost shards. So their very existence will consume unnecessary space in your filesystem. So they may be safely deleted. In fact, there is even ample evidence in the mount logs that indicate failure to clean up shards of all these gfids. Grep for the particular gfid and filter strings with 'MSGID: 133021'. For example this -
>
> [2020-03-19 00:56:52.165883] E [MSGID: 133021] [shard.c:3630:shard_delete_shards] 0-vmstore-shard: Failed to clean up shards of gfid c186910e-c8f8-4ce0-881d-1da99b7fc48a [Stale file handle]
>
> While under normal circumstances the clean up would have happened automatically in the background, due to this crash and the 'Stale file handle' issues, you'd have to clean them up manually.
>
> Let me know if that made sense.
>
> -Krutika
>
> [root@dell-r640-06 ~]# cd /var/log/glusterfs
> [root@dell-r640-06 glusterfs]# grep -Iisr vmstore-dht.*gfid.different rhev-data-center-mnt-glusterSD-dell-r640-01.gluster.tamlab.rdu2.redhat.com:_vmstore.log
> [2020-03-24 02:30:04.762603] W [MSGID: 109009] [dht-common.c:3071:dht_lookup_linkfile_cbk] 0-vmstore-dht: /.shard/c186910e-c8f8-4ce0-881d-1da99b7fc48a.480: gfid different on data file on vmstore-replicate-0, gfid local = 00000000-0000-0000-0000-000000000000, gfid node = ad489910-9461-4cb3-b06b-2e4e12de7cc1
>
> Thanks.
>
> Don

--- Additional comment from Krutika Dhananjay on 2020-03-28 03:08:30 UTC ---

(In reply to Donald Berry from comment #5)
Just reiterating what was discussed yesterday over mail (missed this notification as there was no needinfo) -

Yes, you just need to remove the ghost shards from 1..3 as well. And no, there's no need to run heal-full, really.

-Krutika
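For the record, the remaining cleanup on nodes 1-3 follows the same pattern already used on 4-6. A minimal sketch (not part of the original thread), assuming the brick path /gluster_bricks/vmstore/vmstore and the mount-log name used earlier in this bug; adjust per node:

    # Sketch only: rebuild the gfid list from the mount log (same awk as above),
    # confirm each base file is really gone, then remove its ghost shards.
    BRICK=/gluster_bricks/vmstore/vmstore
    LOG=/var/log/glusterfs/rhev-data-center-mnt-glusterSD-dell-r640-01.gluster.tamlab.rdu2.redhat.com:_vmstore.log

    awk '/MSGID: 133021/ {print $15}' "$LOG" | sort -u > /tmp/gfid

    while read -r gfid; do
        # The base file's gfid hard link would live under .glusterfs/<first 2>/<next 2>/<gfid>
        # (this is the check Krutika describes above, e.g. .glusterfs/c1/86 for c186910e-...).
        p1=${gfid:0:2}; p2=${gfid:2:2}
        if [ ! -e "$BRICK/.glusterfs/$p1/$p2/$gfid" ]; then
            echo "$gfid: base file absent, removing $(ls "$BRICK"/.shard/"$gfid"* 2>/dev/null | wc -l) ghost shard(s)"
            rm -f "$BRICK"/.shard/"$gfid"*
        fi
    done < /tmp/gfid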
There are two parts to this issue -

1. The crash itself, which is fixed by https://review.gluster.org/#/c/glusterfs/+/24244/. This is a simple null-dereference bug and the patch is non-intrusive.

2. The source of the ESTALE error itself, which triggers the error-logging path in shard where the crash happens. The ESTALE issue is seen only in distributed-replicate RHHI volumes and is being tracked at https://bugzilla.redhat.com/show_bug.cgi?id=1835180 as a separate bug in DHT, the source of the ESTALE errors.

The combined effect is that as long as the ESTALE error persists, the mount process keeps crashing in the same place every time, even after restarts. So far two customers have hit this bug.

The fix for 1) has been merged upstream. If we at least take the fix for 1) in, customers won't see a service disruption in terms of mounts crashing: the FUSE mount point will continue to operate smoothly without interruption. Background deletion will keep failing with ESTALE, but that's OK; it's a background job.

-Krutika
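As a quick way to tell whether a mount is still hitting the second (ESTALE) part, one can look for the shard cleanup failures in the FUSE mount log. A small sketch, assuming the default /var/log/glusterfs location and the mount-log name used in this environment:

    # Count shard-cleanup failures (MSGID 133021), see how many are ESTALE,
    # and list the gfids whose shards were never cleaned up.
    LOG=/var/log/glusterfs/rhev-data-center-mnt-glusterSD-dell-r640-01.gluster.tamlab.rdu2.redhat.com:_vmstore.log

    grep -c 'MSGID: 133021' "$LOG"
    grep 'MSGID: 133021' "$LOG" | grep -c 'Stale file handle'
    awk '/MSGID: 133021/ {print $15}' "$LOG" | sort -u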
Verified with RHGS 3.5.2 (glusterfs-6.0-37.el8rhgs) and RHVH 4.4.1:

1. Created an RHHI-V deployment with a glusterfs storage domain for hosted-engine storage
2. Created a few more gluster storage domains with 2x3 gluster replica volumes
3. Ran all sorts of workloads and VM life-cycle operations
4. Added more bricks and expanded the volume into 3x3
5. Triggered rebalance
6. Removed the set of bricks

With all these operations, no errors were seen.
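For reference, the volume-reshaping steps (4-6) correspond to standard gluster CLI operations. A rough sketch with hypothetical host and brick names (the exact bricks used during verification are not recorded in this bug), assuming a replica 3 volume named vmstore:

    # 4. Expand 2x3 -> 3x3 by adding one more replica set of three bricks
    gluster volume add-brick vmstore \
        host1:/gluster_bricks/vmstore_new/vmstore \
        host2:/gluster_bricks/vmstore_new/vmstore \
        host3:/gluster_bricks/vmstore_new/vmstore

    # 5. Rebalance data across the new brick layout
    gluster volume rebalance vmstore start
    gluster volume rebalance vmstore status

    # 6. Remove a replica set again; data is migrated off before the commit
    gluster volume remove-brick vmstore \
        host1:/gluster_bricks/vmstore_new/vmstore \
        host2:/gluster_bricks/vmstore_new/vmstore \
        host3:/gluster_bricks/vmstore_new/vmstore start
    gluster volume remove-brick vmstore \
        host1:/gluster_bricks/vmstore_new/vmstore \
        host2:/gluster_bricks/vmstore_new/vmstore \
        host3:/gluster_bricks/vmstore_new/vmstore commit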
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2572
*** Bug 1851526 has been marked as a duplicate of this bug. ***
*** Bug 2128703 has been marked as a duplicate of this bug. ***