Bug 1329466
Summary: | Gluster brick got inode-locked and froze the whole cluster | |
---|---|---|---
Product: | [Community] GlusterFS | Reporter: | Chen Chen <aflyhorse>
Component: | locks | Assignee: | bugs <bugs>
Status: | CLOSED CURRENTRELEASE | QA Contact: |
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 3.7.10 | CC: | aflyhorse, amukherj, aspandey, bugs, jbyers, ndevos, pkarampu, skoduri
Target Milestone: | --- | Keywords: | Triaged
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2016-08-19 07:39:00 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1330997, 1344836, 1360576, 1361402 | |
Attachments: | 1149635, 1149637 | |
Description
Chen Chen
2016-04-22 02:50:14 UTC
Created attachment 1149635 [details]
gluster volume statedump [nfs] when the volume is frozen
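A brick or NFS statedump like the one attached is produced with the standard gluster CLI. A minimal sketch, assuming the volume name mainvol (as it appears in the dumps quoted below) and the default dump directory /var/run/gluster:

```sh
# Dump the state of every brick process of the volume; the files are written
# to /var/run/gluster/*.dump.<timestamp> on each server.
gluster volume statedump mainvol

# Dump the built-in gNFS server's state instead of the bricks.
gluster volume statedump mainvol nfs

# Quick scan for inode locks that were queued but never granted.
grep -B 5 'BLOCKED' /var/run/gluster/*.dump.*
```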
Created attachment 1149637 [details]
/var/log/glusterfs of sm11 (whose brick reported blocked) after "start force"
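The "start force" mentioned above is the usual way to bring back bricks that have gone down without touching the ones that are still running. A sketch, again assuming the volume name mainvol:

```sh
# Restart any brick processes of the volume that are not running; bricks
# that are already up are left untouched.
gluster volume start mainvol force

# Verify that every brick and the NFS server are back online.
gluster volume status mainvol
```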
Could you provide the script or program that does the parallel I/O on the same file? Are you executing this from one client system, or from multiple clients?

I was running GATK CombineVariants (multi-threaded mode, -nt 16) when I noticed this inode lock. I executed it from one client system.
https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_CombineVariants.php

Besides, in another lock (which failed to recover from "start force"), I was running GATK HaplotypeCaller (single-threaded) from multiple clients. The following is the statedump snapshot from this lock. I don't really know why there were *write* operations on the GATK jar. Both the brick and the volume were mounted with the noatime flag, and "ls -la" showed it had not been modified since I downloaded the jar.

[xlator.features.locks.mainvol-locks.inode]
path=/home/analyzer/softs/bin/GenomeAnalysisTK.jar
mandatory=0
inodelk-count=4
lock-dump.domain.domain=mainvol-disperse-0:self-heal
lock-dump.domain.domain=mainvol-disperse-0
inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 1, owner=dc2d3dfcc57f0000, client=0x7ff03435d5f0, connection-id=sm12-8063-2016/04/01-07:51:46:892384-mainvol-client-0-0-0, blocked at 2016-04-01 16:52:58, granted at 2016-04-01 16:52:58
inodelk.inodelk[1](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 1, owner=1414371e1a7f0000, client=0x7ff034204490, connection-id=hw10-17315-2016/04/01-07:51:44:421807-mainvol-client-0-0-0, blocked at 2016-04-01 16:58:51
inodelk.inodelk[2](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 1, owner=a8eb14cd9b7f0000, client=0x7ff01400dbd0, connection-id=sm14-879-2016/04/01-07:51:56:133106-mainvol-client-0-0-0, blocked at 2016-04-01 17:03:41
inodelk.inodelk[3](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 1, owner=b41a0482867f0000, client=0x7ff01800e670, connection-id=sm15-30906-2016/04/01-07:51:45:711474-mainvol-client-0-0-0, blocked at 2016-04-01 17:05:09

Any scheduled update? I don't know what the GlusterFS equivalent of RHGS 3.1.3 is.

Again, another tight lock here. It couldn't be released by a force start, so I cold-reset the affected node (which had a noticeably huge 1-minute load).

[xlator.features.locks.mainvol-locks.inode]
path=<gfid:62adaa3a-a1b8-458c-964f-5742f942cd0f>/.WGC037694D_combined_R1.fastq.gz.gzT9LR
mandatory=0
inodelk-count=38
lock-dump.domain.domain=mainvol-disperse-1:self-heal
inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551610, owner=c4dd93ee0c7f0000, client=0x7f3eb4887cc0, connection-id=hw10-18694-2016/05/02-05:57:47:620063-mainvol-client-6-0, blocked at 2016-05-11 08:22:56, granted at 2016-05-11 08:27:07
inodelk.inodelk[1](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551610, owner=98acff20d77f0000, client=0x7f3ea869e250, connection-id=sm16-23349-2016/05/02-05:57:01:49902-mainvol-client-6-0, blocked at 2016-05-11 08:27:07
inodelk.inodelk[2](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551610, owner=8004a958097f0000, client=0x7f3ebc66e630, connection-id=sm15-28555-2016/05/02-05:57:53:27608-mainvol-client-6-0, blocked at 2016-05-11 08:27:07
......(tailored)
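Not something tried in this report, but for a lock that survives a force start the gluster CLI also has a clear-locks command that can sometimes release it without a cold reset. A minimal sketch, using the volume name and the jar path from the first dump above; use it with care, since it forcibly drops granted locks:

```sh
# Re-take a statedump to confirm which lock is still ACTIVE on the file.
gluster volume statedump mainvol

# Forcibly clear the granted full-file inode lock (start=0, len=0 in the
# dump corresponds to the range 0,0-0), then clear the blocked requests too.
gluster volume clear-locks mainvol /home/analyzer/softs/bin/GenomeAnalysisTK.jar kind granted inode 0,0-0
gluster volume clear-locks mainvol /home/analyzer/softs/bin/GenomeAnalysisTK.jar kind blocked inode 0,0-0
```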
Got another block here. I'm a bit puzzled.

I'm only initiating *ONE* rsync process on *ONE* client connecting to *ONE* GlusterFS via NFS. Why would *ALL* of the nodes want to lock it?

[xlator.features.locks.mainvol-locks.inode]
path=/home/support/bak2t/camel.alone/mapping/camel62/.camel62.clean.r1.fastq.SnlBzz
mandatory=0
inodelk-count=8
lock-dump.domain.domain=dht.file.migrate
lock-dump.domain.domain=mainvol-disperse-1:self-heal
inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551610, owner=a85c3672497f0000, client=0x7f3910091b20, connection-id=sm14-20329-2016/06/28-05:35:35:487901-mainvol-client-6-0, blocked at 2016-07-14 11:44:42, granted at 2016-07-14 12:13:25
inodelk.inodelk[1](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551610, owner=606f8a77c07f0000, client=0x7f39101d1c80, connection-id=sm13-14349-2016/06/28-05:35:35:486931-mainvol-client-6-0, blocked at 2016-07-14 12:13:25
inodelk.inodelk[2](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551610, owner=2cbb10b1b57f0000, client=0x7f3910004ca0, connection-id=hw10-63151-2016/06/28-05:35:33:427463-mainvol-client-6-0, blocked at 2016-07-14 12:13:25
inodelk.inodelk[3](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551610, owner=28b725b5337f0000, client=0x7f39100ce4c0, connection-id=sm11-5958-2016/06/28-05:35:35:510742-mainvol-client-6-0, blocked at 2016-07-14 12:13:25
inodelk.inodelk[4](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551610, owner=c8e18ca77c7f0000, client=0x7f391024c340, connection-id=sm16-16031-2016/06/28-05:35:35:487112-mainvol-client-6-0, blocked at 2016-07-14 12:13:25
inodelk.inodelk[5](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551610, owner=08026bee887f0000, client=0x7f391c701700, connection-id=sm15-29608-2016/06/28-05:35:35:523099-mainvol-client-6-0, blocked at 2016-07-14 12:13:25
inodelk.inodelk[6](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551610, owner=005ad35f5b7f0000, client=0x7f39102b24f0, connection-id=sm12-22762-2016/06/28-05:35:35:487941-mainvol-client-6-0, blocked at 2016-07-14 12:13:25
inodelk.inodelk[7](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551610, owner=507f26b5337f0000, client=0x7f39100ce4c0, connection-id=sm11-5958-2016/06/28-05:35:35:510742-mainvol-client-6-0, blocked at 2016-07-14 12:13:25
lock-dump.domain.domain=mainvol-disperse-1
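Each entry in the dump carries a connection-id naming the client process that issued the request (one per server: sm11 through sm16 and hw10 above), so the per-node origin of the locks can be tallied directly from the dump files. A small sketch, assuming the default /var/run/gluster dump location:

```sh
# Granted vs. blocked inode locks in each dump file.
grep -c 'inodelk.*(ACTIVE)'  /var/run/gluster/*.dump.*
grep -c 'inodelk.*(BLOCKED)' /var/run/gluster/*.dump.*

# Which client connections the lock requests came from.
grep -o 'connection-id=[^,]*' /var/run/gluster/*.dump.* | sort | uniq -c | sort -rn
```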
(In reply to Chen Chen from comment #6)
> I'm only initiating *ONE* rsync process on *ONE* client connecting to
> *ONE* GlusterFS via NFS. Why would *ALL* of the nodes want to lock it?

Are you still running 3.7.10? How about upgrading it to 3.7.13 and retesting?

(In reply to Atin Mukherjee from comment #7)
> Are you still running 3.7.10? How about upgrading it to 3.7.13 and retesting?

Yes, I'm still on 3.7.10.

Are you sure this one is touched in updates? The two #BZs blocked by this one are still in ON_QA/POST status. If so, I'll schedule a downtime.

I'm hesitant to upgrade, fearing it might introduce some new bugs.

(In reply to Chen Chen from comment #8)
> I'm hesitant to upgrade, fearing it might introduce some new bugs.

Well, the fix for 1344836 is definitely in mainline, but not in the 3.7 branch.

@Pranith - Do you mind backporting this to 3.7?

(In reply to Atin Mukherjee from comment #9)
> @Pranith - Do you mind backporting this to 3.7?

I'm willing to jump to 3.8, if the upgrade process won't cause much turbulence (such as extensive configuration modification, possible data loss, etc.).

Error: Package: glusterfs-ganesha-3.8.1-1.el7.x86_64 (centos-gluster38)
       Requires: nfs-ganesha-gluster
You could try using --skip-broken to work around the problem

So I'll fall back to the Gluster-native NFSv3 server first.
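Falling back to the built-in gNFS server does not require the glusterfs-ganesha package at all; it comes down to a single volume option. A sketch, assuming the volume name mainvol (note that from 3.8 onward the built-in NFS server is disabled by default for newly created volumes):

```sh
# Update only the core packages, leaving out the nfs-ganesha integration
# package whose dependency is missing from the repository.
yum update --exclude=glusterfs-ganesha 'glusterfs*'

# Keep the built-in NFSv3 server serving the volume.
gluster volume set mainvol nfs.disable off
gluster volume status mainvol nfs
```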
After updating to 3.8.2, the cluster still sometimes hangs. However, there is no longer an "inodelk" lock in the statedump, so I figure it is another bug. I'll close this bug report and open a new one.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.