| Summary: | Dist-geo-rep: status becomes faulty and process is restarted with error ' master is corrupt' in log | ||
|---|---|---|---|
| Product: | Red Hat Gluster Storage | Reporter: | Rachana Patel <racpatel> |
| Component: | geo-replication | Assignee: | Aravinda VK <avishwan> |
| Status: | CLOSED WORKSFORME | QA Contact: | amainkar |
| Severity: | medium | Docs Contact: | |
| Priority: | high | ||
| Version: | 2.1 | CC: | aavati, bhubbard, csaba, grajaiya, hamiller, ndevos, nsathyan, rhs-bugs, rwheeler, vagarwal, vshankar |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Previously, the geo-replication logic depended entirely on the xtime extended attribute, so whenever the xtime attribute was missing it errored out. With changelog-based geo-replication this is no longer an error, so the error backtrace has been converted to a warning-level log message. | Story Points: | --- |
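As a rough illustration of the Doc Text above: the xtime xattr stores a (sec, nsec) pair packed as two big-endian unsigned 32-bit integers (the `struct.pack('!II', *mark)` call visible in the traceback further down), and the older gsyncd treated a missing value as corruption. A minimal sketch, assuming a hypothetical helper name (`decode_xtime` is illustrative, not an actual gsyncd function):

```python
import struct

# Sentinel "uninitialized" xtime used when the xattr is absent;
# gsyncd's URXTIME default mentioned in this bug has this value.
URXTIME = (-1, 0)

def decode_xtime(raw):
    """Decode a raw xtime xattr value, or fall back to URXTIME.

    The xtime xattr packs (sec, nsec) as two big-endian unsigned
    32-bit ints, matching struct.pack('!II', *mark) in resource.py.
    """
    if raw is None:           # xattr missing on the brick
        return URXTIME        # sentinel; compares less than any real xtime
    return struct.unpack('!II', raw)
```

Because URXTIME compares less than any real (sec, nsec) pair, a master falling back to it looks "older" than the slave, which is the comparison that used to trip the "master is corrupt" assertion.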
| Clone Of: | | Environment: | |
| Last Closed: | 2015-01-12 08:26:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | 1005474, 1286162 | ||
| Bug Blocks: | |||
I see the following logs in the auxiliary mount log file:

[2013-09-02 04:03:40.228262] I [dht-layout.c:633:dht_layout_normalize] 0-4_master2-dht: found anomalies in /flat/flat/100/etc10/dbus-1/session.d. holes=1 overlaps=1 missing=2 down=0 misc=0
[2013-09-02 04:03:40.493924] I [dht-layout.c:633:dht_layout_normalize] 0-4_master2-dht: found anomalies in /flat/flat/100/etc10/dbus-1/system.d. holes=1 overlaps=1 missing=2 down=0 misc=0
[2013-09-02 04:03:41.567090] I [dht-layout.c:633:dht_layout_normalize] 0-4_master2-dht: found anomalies in /flat/flat/100/etc10/libibverbs.d. holes=1 overlaps=1 missing=2 down=0 misc=0
[2013-09-02 04:03:41.675607] I [dht-layout.c:633:dht_layout_normalize] 0-4_master2-dht: found anomalies in /flat/flat/100/etc10/rpm. holes=1 overlaps=1 missing=2 down=0 misc=0
[2013-09-02 04:03:42.236039] I [dht-layout.c:633:dht_layout_normalize] 0-4_master2-dht: found anomalies in /flat/flat/100/etc10/mcelog. holes=1 overlaps=1 missing=2 down=0 misc=0
[2013-09-02 04:03:42.427043] I [dht-layout.c:633:dht_layout_normalize] 0-4_master2-dht: found anomalies in /flat/flat/100/etc10/makedev.d. holes=1 overlaps=1 missing=2 down=0 misc=0

Fetching the xtime xattrs produces the log entries above, so gsyncd falls back to the default xtime (URXTIME: (-1, 0)), which is less than the slave's xtime. This trips an assertion in gsyncd, hence the complaint that the master is corrupt.

Seeing the same defect in build 3.4.0.32rhs-1.el6_4.x86_64 as well. Not sure whether these are related defects, but as in Bug 981708 the log is full of "0-master1-dht: found anomalies in (null). holes=1 overlaps=0 missing=0 down=1 misc=0", and as mentioned in Bug 981837 the log says the disk layout is missing. After these errors the process gets restarted.

More info: while creating data (before starting the geo-rep session), one or more bricks went down for a few hours.

https://code.engineering.redhat.com/gerrit/#/c/14999/

This bug can leave the geo-replication setup in a corrupted, non-recoverable state.
The fix should turn this into a warning log and continue. The latest geo-replication can self-correct from the situation above. Able to reproduce with 3.4.0.39rhs-1.el6rhs.x86_64.
Status becomes faulty and process is restarted with error:-
(the 'master is corrupt' message does not appear, as that error string is no longer present in the code)
log snippet:-
[2013-11-07 13:34:59.792462] E [syncdutils(/rhs/brick2):207:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 233, in twrap
tf(*aa)
File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 909, in Xsyncer
self.Xcrawl()
File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1004, in Xcrawl
self.Xcrawl(e, xtr)
File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1004, in Xcrawl
self.Xcrawl(e, xtr)
File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 985, in Xcrawl
xte = self.xtime(e)
File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 289, in xtime
return self.xtime_low(rsc, path, **opts)
File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 115, in xtime_low
rsc.server.aggregated.set_xtime(path, self.uuid, xt)
File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 299, in ff
return f(*a)
File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 467, in set_xtime
Xattr.lsetxattr(path, '.'.join([cls.GX_NSPACE, uuid, 'xtime']), struct.pack('!II', *mark))
File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 66, in lsetxattr
cls.raise_oserr()
File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 25, in raise_oserr
raise OSError(errn, os.strerror(errn))
OSError: [Errno 2] No such file or directory
[2013-11-07 13:34:59.974319] I [syncdutils(/rhs/brick2):159:finalize] <top>: exiting.
[2013-11-07 13:35:00.17575] I [monitor(monitor):81:set_state] Monitor: new state: faulty
[2013-11-07 13:35:11.87167] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2013-11-07 13:35:11.87543] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
----------------------------
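The root cause of the traceback above: the Xcrawl pass races with ongoing changes on the volume, so by the time set_xtime issues lsetxattr the entry is gone and the syscall fails with ENOENT. A minimal stdlib demonstration of the same failure mode (Linux-only; the path and xattr name here are made up, and the `user.` namespace is used since gsyncd's `trusted.` namespace needs root):

```python
import errno
import os

# Writing an xattr on a path that no longer exists fails at path
# resolution with ENOENT -- the same OSError: [Errno 2] that ends
# the gsyncd traceback.
try:
    os.setxattr('/no/such/path', 'user.test.xtime',
                b'\x00' * 8, follow_symlinks=False)
except OSError as e:
    assert e.errno == errno.ENOENT   # "No such file or directory"
```

This is why the bug asks for a warning instead of a fatal error: a deleted entry mid-crawl is an ordinary race, not corruption.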
As mentioned in Comment #6, the log is full of "found anomalies..."
log snippet:-
[2013-11-06 04:32:11.734009] I [dht-layout.c:646:dht_layout_normalize] 0-master1-dht: found anomalies in /7/8/etc3/cron.hourly. holes=2 overlaps=1 missing=1 down=0 misc=0
[2013-11-06 04:32:24.484921] I [fuse-bridge.c:5714:fuse_thread_proc] 0-fuse: unmounting /tmp/gsyncd-aux-mount-ueeknN
[2013-11-06 04:32:24.485769] W [glusterfsd.c:1097:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x357bee894d] (-->/lib64/libpthread.so.0() [0x357c607851] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x4053cd]))) 0-: received signum (15), shutting down
[2013-11-06 04:32:24.485804] I [fuse-bridge.c:6393:fini] 0-fuse: Unmounting '/tmp/gsyncd-aux-mount-ueeknN'.
[2013-11-06 04:32:36.265138] I [glusterfsd.c:2024:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.4.0.38rhs (/usr/sbin/glusterfs --aux-gfid-mount --log-file=/var/log/glusterfs/geo-replication/master1/ssh%3A%2F%2Froot%4010.70.36.134%3Agluster%3A%2F%2F127.0.0.1%3Aslave1.%2Frhs%2Fbrick2.gluster.log --volfile-server=localhost --volfile-id=master1 --client-pid=-1 /tmp/gsyncd-aux-mount-yXKk66)
(able to reproduce without add-brick and rebalance steps)
Steps to reproduce:
1. Create a distributed volume.
2. FUSE- and NFS-mount the volume and create data on it.
3. While creating data, bring bricks up and down by killing the brick processes (done 3-4 times).
4. Bring all bricks up and create a geo-rep session for the volume.
5. Start the geo-rep session and keep checking the status.
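The steps above can be sketched with the gluster CLI roughly as follows. Host names, brick paths, and the `<brick-pid>` placeholder are illustrative, not taken from this bug's environment:

```sh
# 1. distributed volume
gluster volume create master1 \
    host1:/rhs/brick2 host2:/rhs/brick2 host3:/rhs/brick2
gluster volume start master1

# 2. mount and create data
mount -t glusterfs host1:/master1 /mnt/master1     # FUSE
mount -t nfs -o vers=3 host1:/master1 /mnt/nfs     # NFS

# 3. kill/restart a brick process a few times while data is written
kill -9 <brick-pid>                  # repeat 3-4 times
gluster volume start master1 force   # brings killed bricks back up

# 4-5. create and start the geo-rep session, then watch its status
gluster system:: execute gsec_create
gluster volume geo-replication master1 slavehost::slave1 create push-pem
gluster volume geo-replication master1 slavehost::slave1 start
watch gluster volume geo-replication master1 slavehost::slave1 status
```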
master volume info :-
[root@7-VM1 rpm]# gluster v info master1
Volume Name: master1
Type: Distribute
Volume ID: 66193974-4233-4735-9102-a15fb3e4633e
Status: Started
Number of Bricks: 6
Transport-type: tcp
Bricks:
Brick1: 10.70.36.130:/rhs/brick2
Brick2: 10.70.36.131:/rhs/brick2
Brick3: 10.70.36.132:/rhs/brick2
Brick4: 10.70.36.130:/rhs/brick3
Brick5: 10.70.36.131:/rhs/brick3
Brick6: 10.70.36.132:/rhs/brick3
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
slave volume info:-
[root@7-VM5 rpm]# gluster v info slave1
Volume Name: slave1
Type: Distribute
Volume ID: 0e99cc94-3dc7-46d7-83b6-1b5d1606a3a3
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: 10.70.36.134:/rhs/brick1
Brick2: 10.70.36.134:/rhs/brick2
Brick3: 10.70.36.135:/rhs/brick1
Brick4: 10.70.36.135:/rhs/brick2
As per comments 19 and 21, this bug is only applicable to RHS 2.0. Closing this bug since it is not applicable in 2.1. Please reopen the bug if this issue is found again.
Description of problem:
Dist-geo-rep: status becomes faulty and the process is restarted with the error 'master is corrupt' in the log.

Version-Release number of selected component (if applicable):

How reproducible: not tried

Steps to Reproduce:
1. Had a distributed volume; created data on it.
2. Created and started a geo-rep session.
3. Added a new RHSS node (10.70.37.67) to the master cluster and added a brick to the volume (Brick6: 10.70.37.67:/rhs/brick1/2).
4. Started the rebalance process for that server.
5. Ran gsec, create push-pem and start commands (with the force option).
6. Kept checking status: the process on one server goes into faulty status and the log gives the error 'master is corrupt'.

[root@4DVM3 ~]# gluster v info 4_master2
Volume Name: 4_master2
Type: Distribute
Volume ID: 82f7b96b-ac89-4603-9e64-c3be5d6b2db3
Status: Started
Number of Bricks: 6
Transport-type: tcp
Bricks:
Brick1: 10.70.37.110:/rhs/brick1/2
Brick2: 10.70.37.81:/rhs/brick1/2
Brick3: 10.70.37.110:/rhs/brick2/2
Brick4: 10.70.37.81:/rhs/brick2/2
Brick5: 10.70.37.110:/rhs/brick3/2
Brick6: 10.70.37.67:/rhs/brick1/2
Options Reconfigured:
geo-replication.indexing: on
geo-replication.ignore-pid-check: on
changelog.changelog: on

[root@4DVM3 ~]# gluster volume geo 4_master2 status
NODE                          MASTER     SLAVE                                              HEALTH  UPTIME
-----------------------------------------------------------------------------------------------------------------------
4DVM3.lab.eng.blr.redhat.com  4_master2  ssh://rhsauto026.lab.eng.blr.redhat.com::4_slave2  Stable  02:22:23
4DVM5.lab.eng.blr.redhat.com  4_master2  ssh://rhsauto026.lab.eng.blr.redhat.com::4_slave2  Stable  00:43:03
4DVM2.lab.eng.blr.redhat.com  4_master2  ssh://rhsauto026.lab.eng.blr.redhat.com::4_slave2  Stable  01:24:50

log snippet:
[root@4DVM5 ~]# ifconfig | grep inet
inet addr:10.70.37.81 Bcast:10.70.37.255 Mask:255.255.254.0
inet addr:127.0.0.1 Mask:255.0.0.0
less /var/log/glusterfs/geo-replication/4_master2/ssh%3A%2F%2Froot%4010.70.37.1%3Agluster%3A%2F%2F127.0.0.1%3A4_slave2.log
[2013-08-30 07:10:41.435367] E [syncdutils(/rhs/brick2/2):184:log_raise_exception] <top>: master is corrupt
[2013-08-30 07:10:41.435650] I [syncdutils(/rhs/brick2/2):158:finalize] <top>: exiting.
[2013-08-30 07:10:41.447088] I [monitor(monitor):81:set_state] Monitor: new state: faulty
[2013-08-30 07:10:51.472136] I [monitor(monitor):129:monitor] Monitor: ------------------------------------------------------------
[2013-08-30 07:10:51.472404] I [monitor(monitor):130:monitor] Monitor: starting gsyncd worker
[2013-08-30 07:10:52.6149] I [gsyncd(/rhs/brick2/2):503:main_i] <top>: syncing: gluster://localhost:4_master2 -> ssh://root.eng.blr.redhat.com:gluster://localhost:4_slave2
[2013-08-30 07:10:54.703702] I [master(/rhs/brick2/2):57:gmaster_builder] <top>: setting up xsync change detection mode
[2013-08-30 07:10:54.704690] I [master(/rhs/brick2/2):57:gmaster_builder] <top>: setting up changelog change detection mode
[2013-08-30 07:10:54.709770] I [master(/rhs/brick2/2):816:register] _GMaster: xsync temp directory: /var/run/gluster/4_master2/ssh%3A%2F%2Froot%4010.70.37.1%3Agluster%3A%2F%2F127.0.0.1%3A4_slave2/2445012e69193433989d6d552ad269a6/xsync
[2013-08-30 07:10:55.370009] I [master(/rhs/brick2/2):335:crawlwrap] _GMaster: primary master with volume id 82f7b96b-ac89-4603-9e64-c3be5d6b2db3 ...
[2013-08-30 07:10:55.447535] I [master(/rhs/brick2/2):345:crawlwrap] _GMaster: crawl interval: 60 seconds
[2013-08-30 07:11:51.833565] I [monitor(monitor):81:set_state] Monitor: new state: Stable
[2013-08-30 07:31:57.533741] E [syncdutils(/rhs/brick2/2):184:log_raise_exception] <top>: master is corrupt
[2013-08-30 07:31:57.534126] I [syncdutils(/rhs/brick2/2):158:finalize] <top>: exiting.
[2013-08-30 07:31:57.547292] I [monitor(monitor):81:set_state] Monitor: new state: faulty

Actual results: this happened more than 18 times in less than 3 hours.

Expected results:

Additional info:
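The geo-replication log file names quoted throughout this bug are URL-encoded slave specifications. When hunting for the right log file, decoding the name with the standard library makes it readable:

```python
from urllib.parse import unquote

# Geo-rep log file names URL-encode the slave URL (%3A -> ':', %2F -> '/',
# %40 -> '@'); unquote recovers the original specification.
name = 'ssh%3A%2F%2Froot%4010.70.37.1%3Agluster%3A%2F%2F127.0.0.1%3A4_slave2.log'
print(unquote(name))
# -> ssh://root@10.70.37.1:gluster://127.0.0.1:4_slave2.log
```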