Bug 1002987
Summary: | Dist-geo-rep: status becomes faulty and process is restarted with error 'master is corrupt' in log | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Rachana Patel <racpatel> |
Component: | geo-replication | Assignee: | Aravinda VK <avishwan> |
Status: | CLOSED WORKSFORME | QA Contact: | amainkar |
Severity: | medium | Docs Contact: | |
Priority: | high | ||
Version: | 2.1 | CC: | aavati, bhubbard, csaba, grajaiya, hamiller, ndevos, nsathyan, rhs-bugs, rwheeler, vagarwal, vshankar |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | Previously, geo-replication depended entirely on the xtime extended attribute, so a missing xtime attribute was treated as an error. With changelog-based geo-replication this is no longer an error condition, so the error backtrace has been converted to a warning-level log message. | Story Points: | ---
Clone Of: | Environment: | ||
Last Closed: | 2015-01-12 08:26:51 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1005474, 1286162 | ||
Bug Blocks: |
Description
Rachana Patel
2013-08-30 12:27:09 UTC
I see the following logs in the auxiliary mount log file:

[2013-09-02 04:03:40.228262] I [dht-layout.c:633:dht_layout_normalize] 0-4_master2-dht: found anomalies in /flat/flat/100/etc10/dbus-1/session.d. holes=1 overlaps=1 missing=2 down=0 misc=0
[2013-09-02 04:03:40.493924] I [dht-layout.c:633:dht_layout_normalize] 0-4_master2-dht: found anomalies in /flat/flat/100/etc10/dbus-1/system.d. holes=1 overlaps=1 missing=2 down=0 misc=0
[2013-09-02 04:03:41.567090] I [dht-layout.c:633:dht_layout_normalize] 0-4_master2-dht: found anomalies in /flat/flat/100/etc10/libibverbs.d. holes=1 overlaps=1 missing=2 down=0 misc=0
[2013-09-02 04:03:41.675607] I [dht-layout.c:633:dht_layout_normalize] 0-4_master2-dht: found anomalies in /flat/flat/100/etc10/rpm. holes=1 overlaps=1 missing=2 down=0 misc=0
[2013-09-02 04:03:42.236039] I [dht-layout.c:633:dht_layout_normalize] 0-4_master2-dht: found anomalies in /flat/flat/100/etc10/mcelog. holes=1 overlaps=1 missing=2 down=0 misc=0
[2013-09-02 04:03:42.427043] I [dht-layout.c:633:dht_layout_normalize] 0-4_master2-dht: found anomalies in /flat/flat/100/etc10/makedev.d. holes=1 overlaps=1 missing=2 down=0 misc=0

Fetching the xtime xattrs results in the log entries above, so gsyncd falls back to the default xtime (URXTIME: (-1, 0)), which is less than the slave's xtime. This trips an assertion in gsyncd and hence the complaint about the master being corrupt.

Getting the same defect in build 3.4.0.32rhs-1.el6_4.x86_64 as well. Not sure whether these are related defects, but as in Bug 981708 the log is full of "0-master1-dht: found anomalies in (null). holes=1 overlaps=0 missing=0 down=1 misc=0", and as mentioned in Bug 981837 the log says the disk layout is missing. After those errors the process is restarted.

More info: while creating data (before starting the geo-rep session), one or more bricks went down for a few hours.

https://code.engineering.redhat.com/gerrit/#/c/14999/

This fix can prevent the geo-replication setup from getting into a corrupted, non-recoverable state. The fix is needed to turn the failure into a warning log and continue; the latest geo-replication would then be able to self-correct from the situation above.
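To make the failure mode above concrete, here is a minimal, self-contained sketch. It is not the actual gsyncd source; the helper names check_pair and master_xtime, the lenient flag, and the xattr key are invented for illustration. It only mimics the mechanism described above: a missing master-side xtime falls back to URXTIME = (-1, 0), compares lower than the slave's xtime, and either aborts the worker ("master is corrupt") or, with the behaviour described in the Doc Text, merely logs a warning and continues.

import logging

URXTIME = (-1, 0)   # default "unreached" xtime used when the xattr is absent

def master_xtime(xattrs):
    """Return the stored xtime tuple, or URXTIME if the xattr is missing."""
    return xattrs.get('trusted.glusterfs.xtime', URXTIME)

def check_pair(master_xattrs, slave_xtime, lenient=True):
    xt_master = master_xtime(master_xattrs)
    if xt_master < slave_xtime:
        if lenient:
            # behaviour after the change described in the Doc Text:
            # log a warning and let the (changelog-based) crawl continue
            logging.warning("master xtime %s < slave xtime %s; skipping",
                            xt_master, slave_xtime)
            return
        # old behaviour: abort the worker; the monitor then marks the session faulty
        raise RuntimeError("master is corrupt")

if __name__ == '__main__':
    logging.basicConfig(level=logging.WARNING)
    check_pair({}, (1378093420, 0))                      # missing xtime: warning only
    try:
        check_pair({}, (1378093420, 0), lenient=False)   # old behaviour
    except RuntimeError as e:
        print("worker would die with:", e)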
Able to reproduce with 3.4.0.39rhs-1.el6rhs.x86_64. Status becomes faulty and the process is restarted with an error (the 'master is corrupt' message is not there, as that error string no longer exists).

Log snippet:

[2013-11-07 13:34:59.792462] E [syncdutils(/rhs/brick2):207:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 233, in twrap
    tf(*aa)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 909, in Xsyncer
    self.Xcrawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1004, in Xcrawl
    self.Xcrawl(e, xtr)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1004, in Xcrawl
    self.Xcrawl(e, xtr)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 985, in Xcrawl
    xte = self.xtime(e)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 289, in xtime
    return self.xtime_low(rsc, path, **opts)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 115, in xtime_low
    rsc.server.aggregated.set_xtime(path, self.uuid, xt)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 299, in ff
    return f(*a)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 467, in set_xtime
    Xattr.lsetxattr(path, '.'.join([cls.GX_NSPACE, uuid, 'xtime']), struct.pack('!II', *mark))
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 66, in lsetxattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 25, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 2] No such file or directory
[2013-11-07 13:34:59.974319] I [syncdutils(/rhs/brick2):159:finalize] <top>: exiting.
[2013-11-07 13:35:00.17575] I [monitor(monitor):81:set_state] Monitor: new state: faulty
[2013-11-07 13:35:11.87167] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2013-11-07 13:35:11.87543] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker

(An illustrative sketch of how this ENOENT aborts the worker is given after the steps to reproduce below.)

As mentioned in Comment #6, the log is full of "found anomalies ..." entries. Log snippet:

[2013-11-06 04:32:11.734009] I [dht-layout.c:646:dht_layout_normalize] 0-master1-dht: found anomalies in /7/8/etc3/cron.hourly. holes=2 overlaps=1 missing=1 down=0 misc=0
[2013-11-06 04:32:24.484921] I [fuse-bridge.c:5714:fuse_thread_proc] 0-fuse: unmounting /tmp/gsyncd-aux-mount-ueeknN
[2013-11-06 04:32:24.485769] W [glusterfsd.c:1097:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x357bee894d] (-->/lib64/libpthread.so.0() [0x357c607851] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x4053cd]))) 0-: received signum (15), shutting down
[2013-11-06 04:32:24.485804] I [fuse-bridge.c:6393:fini] 0-fuse: Unmounting '/tmp/gsyncd-aux-mount-ueeknN'.
[2013-11-06 04:32:36.265138] I [glusterfsd.c:2024:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.4.0.38rhs (/usr/sbin/glusterfs --aux-gfid-mount --log-file=/var/log/glusterfs/geo-replication/master1/ssh%3A%2F%2Froot%4010.70.36.134%3Agluster%3A%2F%2F127.0.0.1%3Aslave1.%2Frhs%2Fbrick2.gluster.log --volfile-server=localhost --volfile-id=master1 --client-pid=-1 /tmp/gsyncd-aux-mount-yXKk66)

(Able to reproduce without the add-brick and rebalance steps.)

Steps to reproduce:
1. Create a distributed volume.
2. FUSE- and NFS-mount that volume and create data on it.
3. While creating data, bring bricks up and down by killing brick processes (done 3-4 times).
4. Bring all bricks up and create a geo-rep session for that volume.
5. Start the geo-rep session and keep checking status.
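The following is an illustrative sketch only, not the gsyncd fix itself. It shows why the OSError traceback above ends the worker: setting an xtime xattr on a path that has disappeared (for example, an entry affected by bricks that were down while data was being created) raises ENOENT, and an unhandled OSError kills the crawl, so the monitor marks the session faulty and restarts it. The wrapper name set_xtime_safe and the xattr key are hypothetical, and Python 3's os.setxattr is used here instead of gsyncd's own libcxattr wrapper.

import errno
import logging
import os
import struct

XTIME_KEY = "trusted.glusterfs.geo-rep-example.xtime"  # illustrative key name only

def set_xtime_safe(path, xtime):
    """Pack (sec, nsec) the way the traceback shows ('!II') and set it, tolerating ENOENT."""
    value = struct.pack("!II", *xtime)
    try:
        # follow_symlinks=False corresponds to lsetxattr in the traceback above
        os.setxattr(path, XTIME_KEY, value, follow_symlinks=False)
    except OSError as e:
        if e.errno == errno.ENOENT:
            # the entry vanished between the crawl's readdir and this setxattr:
            # warn and move on instead of letting the whole worker go faulty
            logging.warning("skipping %s: %s", path, e)
            return False
        raise
    return True

if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    set_xtime_safe("/tmp/definitely-missing-file", (1383831299, 0))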
Master volume info:

[root@7-VM1 rpm]# gluster v info master1
Volume Name: master1
Type: Distribute
Volume ID: 66193974-4233-4735-9102-a15fb3e4633e
Status: Started
Number of Bricks: 6
Transport-type: tcp
Bricks:
Brick1: 10.70.36.130:/rhs/brick2
Brick2: 10.70.36.131:/rhs/brick2
Brick3: 10.70.36.132:/rhs/brick2
Brick4: 10.70.36.130:/rhs/brick3
Brick5: 10.70.36.131:/rhs/brick3
Brick6: 10.70.36.132:/rhs/brick3
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on

Slave volume info:

[root@7-VM5 rpm]# gluster v info slave1
Volume Name: slave1
Type: Distribute
Volume ID: 0e99cc94-3dc7-46d7-83b6-1b5d1606a3a3
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: 10.70.36.134:/rhs/brick1
Brick2: 10.70.36.134:/rhs/brick2
Brick3: 10.70.36.135:/rhs/brick1
Brick4: 10.70.36.135:/rhs/brick2

As per comments 19 and 21, this bug is only applicable to RHS 2.0. Closing this bug since it is not applicable to 2.1. Please reopen the bug if this issue is found again.