Bug 1136720 - DHT + Rebalance + rename :- files are getting truncated after rename and rebalance
Summary: DHT + Rebalance + rename :- files are getting truncated after rename and reba...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: distribute
Version: rhgs-3.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Nithya Balachandran
QA Contact: storage-qa-internal@redhat.com
URL:
Whiteboard: dht-data-loss
Depends On:
Blocks:
 
Reported: 2014-09-03 07:13 UTC by Rachana Patel
Modified: 2016-05-30 16:04 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-30 16:04:43 UTC
Embargoed:


Attachments
Log snippets as discussed in Comment #3 (83.23 KB, text/plain)
2014-09-13 12:53 UTC, Shyamsundar

Description Rachana Patel 2014-09-03 07:13:53 UTC
Description of problem:
=======================
Some of the files are getting truncated to 0 size after doing renames from multiple mounts and running rebalance simultaneously.

Version-Release number of selected component (if applicable):
=============================================================
3.6.0.27-6.el6rhs.x86_64

How reproducible:
=================
intermittent

Steps to Reproduce:
==================
1. Created 100 files of size 1 MB on the mount point.
2. Started renaming the files from multiple mount points.
3. Ran add-brick and rebalance.
4. Ran rebalance many times.
5. Created a directory test/.
6. Moved all the files into the test/ directory.
7. Created another directory test1/.
8. mv test/* test1/
Result: some of the files were truncated to 0 size.
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f83-18
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f84-18
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f85-18
-rw-r--r-- 1 root root       0 Aug 28 05:43 f8-57
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f86-18
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f87-18
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f88-18
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f89-18
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f90-18
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f91-18
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f9-19
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f92-18
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f93-18
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f94-18
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f95-18
-rw-r--r-- 1 root root       0 Aug 28 05:43 f9-57
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f96-18
Files that were truncated:
[root@localhost test1]# find . -size 0
./f5-19
./f8-57
./f4-57
./f9-57
./f6-57
files
====
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f100-16
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f10-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f11-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f1-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f12-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f13-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f14-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f15-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f16-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f17-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f18-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f19-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f20-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f21-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f2-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f22-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f23-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f24-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f25-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f26-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f27-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f28-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f29-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f30-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f31-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f3-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f32-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f33-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f34-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f35-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f36-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f37-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f38-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f39-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f40-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f41-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f4-19
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f42-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f43-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f44-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f45-17
-rw-r--r-- 1 root root       0 Aug 28 05:43 f4-57
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f46-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f47-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f48-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f49-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f50-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f51-17
-rw-r--r-- 1 root root       0 Aug 28 05:43 f5-19
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f52-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f53-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f54-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f55-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f56-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f57-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f58-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f59-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f60-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f61-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f6-19
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f62-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f63-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f64-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f65-17
-rw-r--r-- 1 root root       0 Aug 28 05:43 f6-57
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f66-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f67-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f68-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f69-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f70-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f71-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f7-19
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f72-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f73-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f74-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f75-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f76-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f77-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f78-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f79-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f80-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f81-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f8-19
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f82-17
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f83-18
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f84-18
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f85-18
-rw-r--r-- 1 root root       0 Aug 28 05:43 f8-57
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f86-18
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f87-18
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f88-18
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f89-18
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f90-18
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f91-18
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f9-19
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f92-18
-rw-r--r-- 1 root root 1048576 Aug 28 05:43 f93-18

Actual results:
===============
A few files are truncated to size 0.
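
The check done with `find . -size 0` above can also be scripted, for example to compare the set of empty files before and after a test run. The following is a minimal sketch only; the function names and the temporary demo directory are illustrative assumptions and were not part of the original test setup.

```python
import os
import tempfile


def find_zero_byte_files(directory):
    """Return the sorted names of regular files in `directory` whose size is 0."""
    empty = []
    with os.scandir(directory) as entries:
        for entry in entries:
            # Only regular files are considered; symlinks are not followed.
            if not entry.is_file(follow_symlinks=False):
                continue
            if entry.stat(follow_symlinks=False).st_size == 0:
                empty.append(entry.name)
    return sorted(empty)


def demo_zero_byte_scan():
    """Build a throwaway directory with one empty and one 1 MB file, then scan it."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "f8-57"), "wb"):
            pass  # hypothetical empty file
        with open(os.path.join(tmp, "f83-18"), "wb") as fh:
            fh.write(b"\0" * 1048576)  # hypothetical 1 MB file
        return find_zero_byte_files(tmp)
```

Running `find_zero_byte_files` against the mount point before the renames start would show whether any 0-byte files already exist at that stage.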

Comment 3 Shyamsundar 2014-09-13 12:51:58 UTC
Analysis for the issue that was logged as part of the #Description:
-------------------------------------------------------------------

This is not a data loss scenario; in fact, everything is in working order. Here is why:

1) 0-byte files named f[4,5,6,8,9]-57 already existed on the system.
Evidence: NFS log has the CREATE for these files

2) The logs show a rename of f5-57 -> f5-1 at a time when f5-1 already exists (created by an earlier rename of f5-340). This is a rename over an existing file: from this point on f5-* is a 0-byte file, and only one instance survives.
Evidence: NFS log
[2014-08-28 10:18:15.259705] D [MSGID: 0] [dht-rename.c:1281:dht_rename] 0-spacebar-dht: renaming /f5-340 (hash=spacebar-replicate-17/cache=spacebar-replicate-17) => /f5-1 (hash=spacebar-replicate-21/cache=<nul>)
[2014-08-28 10:19:18.896736] D [MSGID: 0] [dht-rename.c:1281:dht_rename] 0-spacebar-dht: renaming /f5-57 (hash=spacebar-replicate-2/cache=spacebar-replicate-2) => /f5-1 (hash=spacebar-replicate-21/cache=spacebar-replicate-17) (NOTE: dstcached location found as R17 and will be unlinked, this unlink will not appear in the logs as this is a trace level message)
[2014-08-28 10:24:05.684857] D [MSGID: 0] [dht-rename.c:1281:dht_rename] 0-spacebar-dht: renaming /f5-1 (hash=spacebar-replicate-21/cache=spacebar-replicate-2) => /f5-2 (hash=spacebar-replicate-4/cache=<nul>)

The rename of f5-57 to f5-1 also deletes the existing f5-1 on replicate-17, because a cached destination (dstcached) already existed.
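
The rename-over-an-existing-file behaviour described in (2) can be reproduced on any local POSIX file system. The sketch below only illustrates generic rename semantics with `os.replace`; it does not exercise DHT or a Gluster volume. The file names are borrowed from the logs above, and the helper name is a hypothetical one.

```python
import os
import tempfile


def rename_over_existing(src_size, dst_size):
    """Rename a file of `src_size` bytes over an existing file of `dst_size` bytes.

    Returns (size of the destination after the rename, whether the source
    name still exists).
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "f5-57")  # stands in for the stale 0-byte file
        dst = os.path.join(tmp, "f5-1")   # stands in for the existing 1 MB file
        with open(src, "wb") as fh:
            fh.write(b"x" * src_size)
        with open(dst, "wb") as fh:
            fh.write(b"x" * dst_size)
        # os.replace overwrites the destination if it already exists.
        os.replace(src, dst)
        return os.path.getsize(dst), os.path.exists(src)
```

Calling `rename_over_existing(0, 1048576)` leaves a single 0-byte file under the destination name, which matches the listing in the description: the data of the old destination is gone because it was replaced, not because it was truncated.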

3) From the rebalance logs, we can see that f5-* was rebalanced a few times without any clashes. However, between two rebalances of f5-*, the cached location changed. That led to the investigation above, to determine what changed the cached location (it was the stale f5-57 created by a prior test case).
Evidence: rebalance logs
./192.168.12.67/spacebar-rebalance.log:[2014-08-28 10:08:39.564778] I [dht-rebalance.c:865:dht_migrate_file] 0-spacebar-dht: /f5-340: attempting to move from spacebar-replicate-22 to spacebar-replicate-17
./192.168.12.67/spacebar-rebalance.log:[2014-08-28 10:08:40.459044] I [MSGID: 109022] [dht-rebalance.c:1143:dht_migrate_file] 0-spacebar-dht: completed migration of /f5-340 from subvolume spacebar-replicate-22 to spacebar-replicate-17

According to the timestamps, this rebalance session moved the file from R22 (R = replicate) to R17.

The very next rebalance of f5-*, shown below, moves it from R2 to R15:

./192.168.12.17/spacebar-rebalance.log:[2014-08-28 10:24:25.015226] I [dht-rebalance.c:865:dht_migrate_file] 0-spacebar-dht: /f5-3: attempting to move from spacebar-replicate-2 to spacebar-replicate-15
./192.168.12.17/spacebar-rebalance.log:[2014-08-28 10:24:25.262409] I [MSGID: 109022] [dht-rebalance.c:1143:dht_migrate_file] 0-spacebar-dht: completed migration of /f5-3 from subvolume spacebar-replicate-2 to spacebar-replicate-15
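
The discontinuity between the two migrations (the first ends on R17, the next starts on R2) can be spotted mechanically. The following is a rough sketch that assumes the caller has already selected, in time order, the "completed migration" lines belonging to one file across its renames (here f5-*). The regular expression is derived only from the two log lines quoted above, and the function names are hypothetical.

```python
import re

# Pattern based on the "completed migration" messages quoted in this comment.
MIGRATION_RE = re.compile(
    r"completed migration of (?P<path>\S+) "
    r"from subvolume (?P<src>\S+) to (?P<dst>\S+)"
)

SAMPLE_REBALANCE_LINES = [
    "[2014-08-28 10:08:40.459044] I [MSGID: 109022] "
    "[dht-rebalance.c:1143:dht_migrate_file] 0-spacebar-dht: completed migration "
    "of /f5-340 from subvolume spacebar-replicate-22 to spacebar-replicate-17",
    "[2014-08-28 10:24:25.262409] I [MSGID: 109022] "
    "[dht-rebalance.c:1143:dht_migrate_file] 0-spacebar-dht: completed migration "
    "of /f5-3 from subvolume spacebar-replicate-2 to spacebar-replicate-15",
]


def parse_migrations(lines):
    """Extract (path, source subvolume, destination subvolume) tuples."""
    migrations = []
    for line in lines:
        match = MIGRATION_RE.search(line)
        if match:
            migrations.append((match["path"], match["src"], match["dst"]))
    return migrations


def find_location_jumps(migrations):
    """Return consecutive pairs where the next source differs from the previous destination."""
    jumps = []
    for prev, nxt in zip(migrations, migrations[1:]):
        if nxt[1] != prev[2]:
            jumps.append((prev, nxt))
    return jumps
```

For the two sample lines, `find_location_jumps(parse_migrations(SAMPLE_REBALANCE_LINES))` reports one jump, from spacebar-replicate-17 to spacebar-replicate-2, which is the change that something other than rebalance must have caused.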

Rebalance is normally the only thing expected to change the cached location of a file, so this unexplained change is what led to the hunt explained in (1) and (2) above.
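
The hunt through the NFS log can be partly automated by scanning the dht_rename debug messages for renames whose destination already has a cached subvolume, that is, where the destination cache is not `<nul>`. This is only a sketch: the pattern is inferred from the three log lines quoted in (2), the sample lines are shortened to the message part, and the names used are hypothetical.

```python
import re

# Pattern inferred from the dht_rename debug messages quoted in (2).
RENAME_RE = re.compile(
    r"renaming (?P<src>\S+) "
    r"\(hash=(?P<src_hash>[^/]+)/cache=(?P<src_cache>[^)]+)\)"
    r" => (?P<dst>\S+) "
    r"\(hash=(?P<dst_hash>[^/]+)/cache=(?P<dst_cache>[^)]+)\)"
)

SAMPLE_RENAME_LINES = [
    "0-spacebar-dht: renaming /f5-340 (hash=spacebar-replicate-17/"
    "cache=spacebar-replicate-17) => /f5-1 (hash=spacebar-replicate-21/cache=<nul>)",
    "0-spacebar-dht: renaming /f5-57 (hash=spacebar-replicate-2/"
    "cache=spacebar-replicate-2) => /f5-1 (hash=spacebar-replicate-21/"
    "cache=spacebar-replicate-17)",
    "0-spacebar-dht: renaming /f5-1 (hash=spacebar-replicate-21/"
    "cache=spacebar-replicate-2) => /f5-2 (hash=spacebar-replicate-4/cache=<nul>)",
]


def overwriting_renames(lines):
    """Return (source, destination, destination cache) for renames onto an existing file."""
    hits = []
    for line in lines:
        match = RENAME_RE.search(line)
        # A destination cache other than "<nul>" means the target name
        # already had a data file, so the rename replaces it.
        if match and match["dst_cache"] != "<nul>":
            hits.append((match["src"], match["dst"], match["dst_cache"]))
    return hits
```

On the sample lines this flags only the rename of /f5-57 onto /f5-1, whose existing copy lived on spacebar-replicate-17.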

Overall, the test is working as intended, and we are good here. Attached are the log snippets from the various logs with some text on the analysis.

Comment 4 Shyamsundar 2014-09-13 12:53:25 UTC
Created attachment 937204 [details]
Log snippets as discussed in Comment #3

Log snippets to show the issue as described in comment #3

Comment 11 Pavithra 2014-11-19 06:44:50 UTC
Moving this bug back to Nithya as we decided not to document the warning in the admin guide.

