--- Additional comment from Csaba Henk on 2018-05-31 10:10:03 EDT ---

This is what we find in the fusedump: the file dm_errori.sas7bdat is managed via two names, this one ("the base file") and dm_errori.sas7bdat.lck ("the lck file"). At the start, the base file exists and some interactions occur with it. Then the lck file is looked up, fails to be found, and gets created:

{"Unique":479868,"op":"LOOKUP","Nodeid":139842860710128,"Pid":14012,"matches":["dm_errori.sas7bdat.lck"]}
{"Unique":479869,"op":"LOOKUP","Nodeid":139842860710128,"Pid":14012,"matches":["dm_errori.sas7bdat.lck"]}
{"Unique":479870,"op":"CREATE","Nodeid":139842860710128,"Pid":14012,"matches":["dm_errori.sas7bdat.lck"],"nodemap":{"dm_errori.sas7bdat.lck":139842869444064},"size":[0]}

(These are just summaries of the respective fuse messages. Occasionally, if it's needed to make my point, I might provide the complete message. Here we know that the LOOKUPs fail from the absence of the "nodemap" field, which indicates the association of a file name with a nodeid. The fact that the lookup is doubled is not significant.)

Data is written to the lck file, and finally it replaces the previous base file via RENAME:

{"Unique":480062,"op":"RENAME","Nodeid":139842860710128,"Pid":14012,"matches":["dm_errori.sas7bdat.lck","dm_errori.sas7bdat"]}

Then, for a long while, the base file is used in various ways, until at some point the above scenario seems to be replayed:

{"Unique":822175,"op":"LOOKUP","Nodeid":139842860710128,"Pid":53795,"matches":["dm_errori.sas7bdat.lck",139843072187888],"nodemap":{"dm_errori.sas7bdat.lck":139843072187888},"size":[1481113600]}
{"Unique":822176,"op":"LOOKUP","Nodeid":139842860710128,"Pid":53795,"matches":["dm_errori.sas7bdat.lck",139843072187888],"nodemap":{"dm_errori.sas7bdat.lck":139843072187888},"size":[1481113600]}
{"Unique":822177,"op":"OPEN","Nodeid":139843072187888,"Pid":53795,"matches":[139843072187888]}
{"Unique":822178,"op":"SETATTR","Nodeid":139843072187888,"Pid":53795,"matches":[139843072187888]}

(While this data lacks temporal information, we say "long while" based on the Unique values, which index fuse messages; this second window of activity comes about 350,000 messages later.)

This time the lck file is found, but the process wants to start from scratch: the file is immediately truncated to 0 via the SETATTR request (Unique 822178), which can be seen by looking at the complete request (inferring from the values of the Valid and Size fields):

{"Truncated":false,"Msg":[null,"SETATTR",{"Len":128,"Opcode":4,"Unique":822178,"Nodeid":139843072187888,"Uid":4447,"Gid":9003,"Pid":53795,"Padding":0},[{"Valid":584,"Padding":0,"Fh":139842872218944,"Size":0,"LockOwner":8496685147530397345,"Atime":0,"Mtime":0,"Unused2":0,"Atimensec":0,"Mtimensec":0,"Unused3":0,"Mode":0,"Unused4":0,"Uid":0,"Gid":0,"Unused5":0}]]}

However, there is something irregular if we go back a bit further:
{"Unique":822172,"op":"LOOKUP","Nodeid":139842860710128,"Pid":53795,"matches":["dm_errori.sas7bdat",139843072187888],"nodemap":{"dm_errori.sas7bdat":139843072187888},"size":[1481113600]}
{"Unique":822173,"op":"OPEN","Nodeid":139843072187888,"Pid":53795,"matches":[139843072187888]}
{"Unique":822174,"op":"SETLK","Nodeid":139843072187888,"Pid":53795,"matches":[139843072187888]}
{"Unique":822175,"op":"LOOKUP","Nodeid":139842860710128,"Pid":53795,"matches":["dm_errori.sas7bdat.lck",139843072187888],"nodemap":{"dm_errori.sas7bdat.lck":139843072187888},"size":[1481113600]}
{"Unique":822176,"op":"LOOKUP","Nodeid":139842860710128,"Pid":53795,"matches":["dm_errori.sas7bdat.lck",139843072187888],"nodemap":{"dm_errori.sas7bdat.lck":139843072187888},"size":[1481113600]}
{"Unique":822177,"op":"OPEN","Nodeid":139843072187888,"Pid":53795,"matches":[139843072187888]}
{"Unique":822178,"op":"SETATTR","Nodeid":139843072187888,"Pid":53795,"matches":[139843072187888]}

Just before the lck file was accessed, the base file was also looked up -- and it resolved to the same nodeid as the lck file! That is, they are the same file under different names (hardlinks of each other). That is almost certainly against the intent of the application, as the fusedump does not contain any LINK messages (via which hardlinking is facilitated).

*** Please confirm whether the applications accessing the volume do any hard linking -- since we are looking at the fuse dump of a single client, it is theoretically possible that another client created the hardlink. ***

So, most likely the application assumes that the base file and the lck file are independent entities, and it is an oddity of the glusterfs backend that these two files are the same. The application might want to reset only the lck file -- but, that being the same as the base file, it is doing away with all of the content of the base file as well. Hence the corruption.

How can the base file and the lck file get accidentally hardlinked to each other on the gluster backend? Most likely it's an issue with dht. One possibility is that the RENAME above was not properly executed. In that case the situation of the base and the lck file being the same should persist over a longer period, from the RENAME (Unique 480062) until the eventual removal of the lck file (Unique 822200). The other possibility is that just at the time when the base and the lck files get resolved to the same id, some other client is performing a RENAME on them which is not sufficiently synchronized, and the LOOKUPs on this side can hit an inconsistent intermediate state; in this case the hardlinked situation is ephemeral.

We'd be happy to get stat/getfattr information on *both files* (base and lck -- or a note on one's absence), on *all bricks* and *all clients*. However, it would be much more useful than post festum file metadata to monitor what's going on during the workload. Is there a period when the presence of the lck file is observable? Is it a hardlink of the base file?

--- Additional comment from Raghavendra G on 2018-06-01 08:56:20 EDT ---

(In reply to Csaba Henk from comment #23)
> The other possibility is that just at the time when the base and the lck
> files get resolved to the same id, some other client is performing a RENAME
> on them which is not sufficiently synchronized, and the LOOKUPs on this side
> can hit an inconsistent intermediate state; in this case the hardlinked
> situation is ephemeral.
There is one scenario in dht_rename where, for a brief period, lookups on src and dst can both succeed and identify them as hardlinks to each other. For the scenario to happen, the following conditions must hold for a rename (src, dst):

(dst-cached != src-cached) and (dst-hashed == src-hashed) and (src-cached != dst-hashed)

In this scenario the control flow of dht_rename is:

1. link (src, dst) on src-cached.
2. rename (src, dst) on dst-hashed/src-hashed (note that dst-hashed == src-hashed).
3. The rest of the rename, which removes the hardlink src on src-cached.

Between 2 and 3, until the hardlink is removed:

* lookup (src) fails on src-hashed, resulting in lookup-everywhere. Since the hardlink src exists on src-cached, the lookup succeeds, mapping src to the inode with src-gfid.
* lookup (dst) identifies a linkto file on dst-hashed. The linkto file points to src-cached, where we then find the hardlink dst. lookup (dst) succeeds, mapping dst to the inode with src-gfid.

Both src and dst are thus identified as hardlinks to the file with src-gfid in the client's inode table, and the same result is conveyed back to the application.

If we've hit this scenario, we would see lookup (src) fail on src-hashed and the file eventually being found on src-cached through lookup-everywhere. We need to set diagnostics.client-log-level to DEBUG to see these logs.

The other workaround (if the hypothesis is true) is to set cluster.lookup-optimize to true. When lookup-optimize is turned on, dht_lookup doesn't resort to lookup-everywhere on not finding src on src-hashed; instead it just conveys a failure to the application. Since the lookup won't reach src-cached, it won't find the hardlink.
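To make the window and its consequence concrete, here is a small stand-alone C sketch. It uses plain POSIX calls on a local scratch directory -- it is not GlusterFS code, and the file names are made up -- and reproduces what the client's inode table effectively ends up with between steps 2 and 3 above: two directory entries backed by one inode, so the application's "reset" of the lck file (the SETATTR with Size=0 in comment #23) also empties the base file.

/* Hypothetical local reproduction of the corrupting effect; run it in an
 * empty scratch directory. "base" stands in for dm_errori.sas7bdat and
 * "base.lck" for dm_errori.sas7bdat.lck. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        struct stat st;

        /* The base file with some content. */
        int fd = open("base", O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0 || write(fd, "payload", 7) != 7)
                return 1;
        close(fd);

        /* The accidental state between steps 2 and 3: base and lck are
         * two names for the same inode. */
        if (link("base", "base.lck") != 0)
                return 1;

        /* The application "resets" only the lck file... */
        truncate("base.lck", 0);

        /* ...but the base file shares the inode, so its data is gone too. */
        stat("base", &st);
        printf("size of base after truncating base.lck: %lld\n",
               (long long)st.st_size); /* prints 0 */
        return 0;
}

If the hypothesis is correct, the two mitigations above would typically be applied as "gluster volume set <VOLNAME> diagnostics.client-log-level DEBUG" and "gluster volume set <VOLNAME> cluster.lookup-optimize on" (assuming the usual volume-set CLI form of these options).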
(In reply to Raghavendra G from comment #24)
> (In reply to Csaba Henk from comment #23)
>
> The other workaround (if the hypothesis is true) is to set
> cluster.lookup-optimize to true. When lookup-optimize is turned on,
> dht_lookup doesn't resort to lookup-everywhere on not finding src on
> src-hashed; instead it just conveys a failure to the application. Since the
> lookup won't reach src-cached, it won't find the hardlink.

Since patch [1], we've made the hashed-subvol the subvol on which locks are taken during rename. This means that the presence of entrylks on the hashed-subvol on the basename of the src/dst of a rename indicates a rename in progress. On observing such locks, dht_lookup can:

* acquire an entrylk on the parent inode with the basename,
* do the lookup,
* unlock the entrylk,
* unwind the lookup.

This will help lookup synchronize with rename and hence preserve the atomicity of rename (a minimal model of the idea is sketched below). Note that this algorithm works even when the first lookup fails with ENOENT. Also, the cost of synchronization is isolated to lookups that happen during a rename; lookups outside the rename window won't pay it. This will be a code change and I'll be submitting a patch to do this.

[1] https://review.gluster.org/19547/
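As a rough stand-alone illustration of the proposed synchronization -- this is a generic model, not GlusterFS xlator code; a pthread mutex stands in for the entrylk on (parent, basename), and all names here are hypothetical -- the sketch below shows why a lookup that takes the same per-entry lock as an in-flight rename can never observe the intermediate hardlink state.

/* Model only: the mutex plays the role of the entrylk taken on the
 * hashed-subvol for the basename; state 1 is the transient hardlink
 * window between steps 2 and 3 of the rename. For simplicity the lookup
 * always takes the lock, whereas the proposal takes it only when locks
 * held by a rename are observed. Build with: cc -pthread model.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t entry_lock = PTHREAD_MUTEX_INITIALIZER;
static int state = 0; /* 0 = before rename, 1 = hardlink window, 2 = rename done */

static void *renamer(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&entry_lock);  /* rename holds the entry lock throughout */
        state = 1;                        /* step 1: link (src, dst) on src-cached */
        usleep(100 * 1000);               /* steps 2..3 take some time */
        state = 2;                        /* step 3: hardlink src removed */
        pthread_mutex_unlock(&entry_lock);
        return NULL;
}

static void *lookup(void *arg)
{
        (void)arg;
        usleep(10 * 1000);                /* arrives while the rename is in flight */
        pthread_mutex_lock(&entry_lock);  /* proposed: wait for the in-flight rename */
        printf("lookup observes state %d (never 1)\n", state);
        pthread_mutex_unlock(&entry_lock);
        return NULL;
}

int main(void)
{
        pthread_t r, l;
        pthread_create(&r, NULL, renamer, NULL);
        pthread_create(&l, NULL, lookup, NULL);
        pthread_join(r, NULL);
        pthread_join(l, NULL);
        return 0;
}

In the actual proposal the lookup would only pay this cost when it observes entrylks held by a rename on the hashed-subvol, so lookups outside a rename window remain unaffected.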
REVIEW: https://review.gluster.org/21123 (performance/write-behind: remove the request from wip queue in wb_fulfill_request) posted (#1) for review on master by Raghavendra G
REVISION POSTED: https://review.gluster.org/21123 (performance/write-behind: remove the request from wip queue in wb_fulfill_request) posted (#2) for review on master by Raghavendra G
The above patch is merged, but it seems to refer to a different issue. Hence moving this back to NEW. Please update with the proper status if it is being worked on.
This bug has been moved to https://github.com/gluster/glusterfs/issues/902, and will be tracked there from now on. Visit the GitHub issue URL for further details.