Description of problem:
Geo-replication "status detail" reports zero files pending, zero deletes pending and 0 bytes of data pending to be synced. Despite that, the number of files present in the slave volume is less than in the master, and the arequal-checksum does not match.

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.18rhs-1.el6rhs.x86_64

How reproducible:
Hit once. Have not tried to reproduce it again.

Steps to Reproduce:
1. Create a session between a 2x2 distributed-replicate master volume and a slave volume.
2. Use crefi to generate a few files from the *nfs* mount:
   ./crefi.py -n 10 --random --multi -I 100 --min=10 --max=500 -b 10 -d 10 --type tar /mnt/master/
   ./crefi.py -n 10 --random --multi -I 100 --min=10 --max=500 -b 10 -d 10 /mnt/master/
3. While the data creation is in progress, stop the geo-rep session.
4. Start creating data again:
   ./crefi.py -n 100 --random --multi -I 100 --min=10 --max=500 -b 5 -d 5 /mnt/master/
5. Start the geo-rep session again.
6. Wait for some time.

Actual results:

Run status detail:

NODE                      HEALTH    UPTIME      FILES SYNCD    FILES PENDING    BYTES PENDING    DELETES PENDING
-----------------------------------------------------------------------------------------------------------------
spitfire.blr.redhat.com   Stable    00:10:12    1962           0                0Bytes           0
typhoon.blr.redhat.com    Stable    00:10:12    0              0                0Bytes           0
mustang.blr.redhat.com    Stable    00:10:12    0              0                0Bytes           0
harrier.blr.redhat.com    Stable    00:10:12    1953           0                0Bytes           0

The number of files in master and slave:

[root@spacex ~]# /opt/qa/tools/arequal-checksum /mnt/master/

Entry counts
Regular files   : 3675
Directories     : 101
Symbolic links  : 0
Other           : 0
Total           : 3776

Metadata checksums
Regular files   : 486e85
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 36df2d443b71bd18c4c1d3cbf59ce429
Directories     : 7c3e3c6210534312
Symbolic links  : 0
Other           : 0
Total           : 8e20c2eddebe1a23

[root@spacex ~]# /opt/qa/tools/arequal-checksum /mnt/slave/

Entry counts
Regular files   : 3658
Directories     : 101
Symbolic links  : 0
Other           : 0
Total           : 3759

Metadata checksums
Regular files   : 3e9
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : 4ccbdb83d2be7e85472424d24dcc058d
Directories     : 3c3578756c1a5071
Symbolic links  : 0
Other           : 0
Total           : 37da8724f3682b79

The regular-files count differs between master and slave, and so does the checksum.

Expected results:
All the files should be synced to the slave volume.

Additional info:
I found the following in the logs of the node from which the geo-rep commands were executed. These might be harmless, since the worker got restarted afterwards. I did not find any other unusual error in the logs.

[2013-08-13 19:40:55.675932] I [master(/rhs/bricks/brick0):343:crawlwrap] _GMaster: crawl interval: 3 seconds
[2013-08-13 19:40:55.696681] I [master(/rhs/bricks/brick0):366:crawlwrap] _GMaster: new master is ce2e7783-f759-43ae-a76c-0d68cc814e13
[2013-08-13 19:40:55.697092] I [master(/rhs/bricks/brick0):370:crawlwrap] _GMaster: primary master with volume id ce2e7783-f759-43ae-a76c-0d68cc814e13
...
[2013-08-13 19:41:48.262649] I [monitor(monitor):81:set_state] Monitor: new state: Stable
[2013-08-13 19:41:58.173151] I [master(/rhs/bricks/brick0):356:crawlwrap] _GMaster: 19 crawls, 2 turns
[2013-08-13 19:42:32.664465] E [syncdutils(/rhs/bricks/brick0):206:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 133, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 513, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1060, in service_loop
    g2.crawlwrap()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 396, in crawlwrap
    time.sleep(self.sleep_interval)
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 297, in <lambda>
    def set_term_handler(hook=lambda *a: finalize(*a, **{'exval': 1})):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 150, in finalize
    shutil.rmtree(gconf.ssh_ctl_dir)
  File "/usr/lib64/python2.6/shutil.py", line 217, in rmtree
    onerror(os.remove, fullname, sys.exc_info())
  File "/usr/lib64/python2.6/shutil.py", line 215, in rmtree
    os.remove(fullname)
OSError: [Errno 2] No such file or directory: '/tmp/gsyncd-aux-ssh-KW3ISD/gsycnd-ssh-root@falcon:22'
[2013-08-13 19:42:32.665458] E [syncdutils(/rhs/bricks/brick0):189:log_raise_exception] <top>: connection to peer is broken
[2013-08-13 19:42:32.673480] E [resource(/rhs/bricks/brick0):204:errlog] Popen: command "ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-KW3ISD/gsycnd-ssh-%r@%h:%p root@falcon /nonexistent/gsyncd --session-owner ce2e7783-f759-43ae-a76c-0d68cc814e13 -N --listen --timeout 120 gluster://localhost:slave" returned with 255, saying:
[2013-08-13 19:42:32.673796] E [resource(/rhs/bricks/brick0):207:logerr] Popen: ssh> [2013-08-13 14:10:49.039772] I [socket.c:3487:socket_init] 0-glusterfs: SSL support is NOT enabled
[2013-08-13 19:42:32.674027] E [resource(/rhs/bricks/brick0):207:logerr] Popen: ssh> [2013-08-13 14:10:49.039837] I [socket.c:3502:socket_init] 0-glusterfs: using system polling thread
[2013-08-13 19:42:32.674318] E [resource(/rhs/bricks/brick0):207:logerr] Popen: ssh> [2013-08-13 14:10:49.291131] I [cli-rpc-ops.c:5461:gf_cli_getwd_cbk] 0-cli: Received resp to getwd
[2013-08-13 19:42:32.674527] E [resource(/rhs/bricks/brick0):207:logerr] Popen: ssh> [2013-08-13 14:10:49.291323] I [input.c:36:cli_batch] 0-: Exiting with: 0
[2013-08-13 19:42:32.674738] E [resource(/rhs/bricks/brick0):207:logerr] Popen: ssh> Killed by signal 15.
[2013-08-13 19:45:11.77749] I [monitor(monitor):237:distribute] <top>: slave bricks: [{'host': 'falcon', 'dir': '/rhs/bricks/brick0'}, {'host': 'hornet', 'dir': '/rhs/bricks/brick1'}, {'host': 'interceptor', 'dir': '/rhs/bricks/brick2'}, {'host': 'lightning', 'dir': '/rhs/bricks/brick3'}]
[2013-08-13 19:45:11.78715] I [monitor(monitor):256:distribute] <top>: worker specs: [('/rhs/bricks/brick0', 'ssh://root@falcon:gluster://localhost:slave')]

I have archived the logs.
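Judging from the traceback, finalize() is removing the ssh control directory while the ssh master process is concurrently tearing down its control socket, so shutil.rmtree() hits an entry that has already vanished. A minimal sketch of an ENOENT-tolerant cleanup follows; this is illustrative only and is not the actual gsyncd code or fix:

import errno
import shutil

def rmtree_ignore_missing(path):
    # Remove a directory tree, ignoring entries that disappear while the
    # removal is in progress; any other error is re-raised.
    def on_error(func, fullname, exc_info):
        err = exc_info[1]
        if not (isinstance(err, OSError) and err.errno == errno.ENOENT):
            raise err
    shutil.rmtree(path, onerror=on_error)

# e.g. rmtree_ignore_missing('/tmp/gsyncd-aux-ssh-KW3ISD')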
I just hit this once more, with a more straightforward case/steps:

1. Create 10k text files and as many symlinks (one for each of the files) from the *nfs* mount point.
2. Start the geo-rep and wait for it to sync.

MASTER: master    SLAVE: euclid::slave

NODE                        HEALTH    UPTIME      FILES SYNCD    FILES PENDING    BYTES PENDING    DELETES PENDING
-------------------------------------------------------------------------------------------------------------------
pythagoras.blr.redhat.com   Stable    01:18:31    20600          0                0Bytes           0
ramanujan.blr.redhat.com    Stable    01:18:31    0              0                0Bytes           0

Number of files in master and slave are as below:

[root@archimedes ~]# find /mnt/master/ | wc -l
20391
[root@archimedes ~]# find /mnt/slave/ | wc -l
20187
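For reference, a minimal sketch of this kind of data set; the counts, file names and mount path below are illustrative (the original run created the files over an nfs mount of the master volume):

import os

MOUNT = "/mnt/master"      # hypothetical mount point
NUM_FILES = 10000

for i in range(NUM_FILES):
    fname = os.path.join(MOUNT, "file_%05d.txt" % i)
    with open(fname, "w") as f:
        f.write("contents of file %d\n" % i)
    # one symlink per regular file
    os.symlink(fname, os.path.join(MOUNT, "link_%05d" % i))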
(In reply to M S Vishwanath Bhat from comment #1)
> I just hit this once more, with a more straightforward case/steps:
> 
> 1. Create 10k text files and as many symlinks (one for each of the files)
> from the *nfs* mount point.
> 2. Start the geo-rep and wait for it to sync.

Were these empty files by any chance? I ask because, as I've mentioned in other bugs, entry operations do fail at times without gsyncd getting a failure back from GlusterFS. The way we identify this is when rsync errors out due to a missing entry on the slave. For that to happen, there needs to be a data operation on the file (and that too should be captured in the same changelog where the entry operation is captured).

Regarding the crawl instrumentation values: "Files Pending", "Bytes Pending" and "Deletes Pending" are *short-lived* instrumentation values. They are non-zero while a changelog is being processed and get reset after that ("Files Pending" is summed into "Files Synced" and that information is persisted to disk).
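To make that lifecycle concrete, here is a rough, purely illustrative sketch (not the actual gsyncd code; the class and field names are made up) of how the pending counters are populated per changelog and then folded back into the persisted total:

class CrawlStats(object):
    def __init__(self):
        self.files_synced = 0      # persisted across changelogs
        self.files_pending = 0     # short-lived, per changelog
        self.bytes_pending = 0
        self.deletes_pending = 0

    def process_changelog(self, entries):
        # entries: list of (op, size) tuples parsed from one changelog
        for op, size in entries:
            if op == "DELETE":
                self.deletes_pending += 1
            else:
                self.files_pending += 1
                self.bytes_pending += size
        # ... sync this batch to the slave here ...
        # Once the changelog is done, the pending values are folded into
        # the persisted total and reset, so an idle session shows zeros.
        self.files_synced += self.files_pending
        self.files_pending = 0
        self.bytes_pending = 0
        self.deletes_pending = 0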
I still see this in 24rhs.

[root@spacex ~]# /opt/qa/tools/arequal-checksum /mnt/master/

Entry counts
Regular files   : 43814
Directories     : 2888
Symbolic links  : 2
Other           : 0
Total           : 46704

Metadata checksums
Regular files   : 1676cd9
Directories     : 3e9
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : f788c9d2c89e76396e939f03aa8a9491
Directories     : 517e296b1d301a57
Symbolic links  : 41530f524b577a5c
Other           : 0
Total           : 893670e8347382a3

[root@spacex ~]# /opt/qa/tools/arequal-checksum /mnt/slave/

Entry counts
Regular files   : 43814
Directories     : 2888
Symbolic links  : 2
Other           : 0
Total           : 46704

Metadata checksums
Regular files   : 1676cd9
Directories     : 3e9
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : f788c9d2c89e76396e939f03aa8a9491
Directories     : 517e296b1d301a57
Symbolic links  : 41530f524b577a5c
Other           : 0
Total           : 893670e8347382a3

But in status detail:

MASTER: master    SLAVE: euclid::slave

NODE                        HEALTH    UPTIME      FILES SYNCD    FILES PENDING    BYTES PENDING    DELETES PENDING
-------------------------------------------------------------------------------------------------------------------
pythagoras.blr.redhat.com   Stable    01:06:02    24770          0                0Bytes           0
ramanujan.blr.redhat.com    Stable    01:05:59    0              0                0Bytes           0

All of them were synced from changelog and no node went down. AFAIK there were no zero-byte files; I just did an untar of the linux kernel and glusterfs.
MS,

From your last comment the arequal checksums match, so what's the issue? Is it about the status detail output, or about files not syncing?
OK, so I get it now. It's about status detail, i.e. "Files Synced" does not match the total number of files.

From the status detail output you have given, were there only 2 nodes in the master cluster (and that too a pure replica)?

Also, I did not understand the zero-byte file thing you have mentioned. What does that have to do with "status detail"?
(In reply to Venky Shankar from comment #6)
> OK, so I get it now. It's about status detail, i.e. "Files Synced" does not
> match the total number of files.

Yes. The "Files Synced" count is about half the number of files actually synced.

> From the status detail output you have given, were there only 2 nodes in the
> master cluster (and that too a pure replica)?

No. It was a 2x2 distributed-replicate volume across 2 nodes. Both active gsyncd workers (one for each replica pair) were running on the same node.

> Also, I did not understand the zero-byte file thing you have mentioned. What
> does that have to do with "status detail"?

I mentioned zero-byte files because you asked about them in comment #3.
(In reply to M S Vishwanath Bhat from comment #7)
> (In reply to Venky Shankar from comment #6)

[snip]

> No. It was a 2x2 distributed-replicate volume across 2 nodes. Both active
> gsyncd workers (one for each replica pair) were running on the same node.

Then the "status detail" state file is getting overwritten by the two processes.

> > Also, I did not understand the zero-byte file thing you have mentioned.
> > What does that have to do with "status detail"?
> 
> I mentioned zero-byte files because you asked about them in comment #3.

Ah! Replying to that comment would not have confused me :-)

Yeah, so empty files, or files whose data operations get captured in a different changelog than the one where the entry operation is captured, cannot be retried if the entry operation fails on the slave.
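A minimal illustration of the suspected overwrite, under the assumption of a single shared per-node status file that each worker writes whole (the path, format and numbers below are made up):

import json

STATUS_FILE = "/tmp/geo-rep-node-status.json"   # hypothetical path

def write_status(files_synced):
    # Last-writer-wins update: no locking, no aggregation across workers.
    with open(STATUS_FILE, "w") as f:
        json.dump({"files_synced": files_synced}, f)

# Two active workers on the same node, one per replica pair:
write_status(2674)   # worker for brick0 reports its count
write_status(2690)   # worker for brick1 overwrites it; brick0's count is lost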
MASTER: master    SLAVE: falcon::slave

NODE                      HEALTH    UPTIME      FILES SYNCD    FILES PENDING    BYTES PENDING    DELETES PENDING
-----------------------------------------------------------------------------------------------------------------
spitfire.blr.redhat.com   Stable    02:10:02    2674           29               15.3MB           24
typhoon.blr.redhat.com    Stable    02:09:59    0              0                0Bytes           0
mustang.blr.redhat.com    Stable    02:09:59    0              0                0Bytes           0
harrier.blr.redhat.com    Stable    02:09:59    2674           21               11.6MB           25

A few files are listed as deletes pending, even though I have not done any deletes or renames.
Closing this bug since the RHGS 2.1 release has reached EOL. The required bugs have been cloned to RHGS 3.1. Please re-open this issue if it is seen again.