Bug 1477087 - [geo-rep] master worker crash with interrupted system call
Summary: [geo-rep] master worker crash with interrupted system call
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: geo-replication
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Aravinda VK
QA Contact: Rochelle
Whiteboard: rebase
Depends On:
Blocks: 1499393 1500845 1503134
Reported: 2017-08-01 08:25 UTC by Rochelle
Modified: 2018-09-12 07:25 UTC
CC List: 5 users

Fixed In Version: glusterfs-3.12.2-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1499393
Last Closed: 2018-09-04 06:34:23 UTC
Target Upstream Version:

External Tracker:
System: Red Hat Product Errata   ID: RHSA-2018:2607   Last Updated: 2018-09-04 06:35:54 UTC

Description Rochelle 2017-08-01 08:25:00 UTC
Description of problem:
Ran automated snapshot + geo-replication test cases and observed a master worker crash with an interrupted system call.

[2017-07-31 17:32:22.560633] I [master(/bricks/brick2/master_brick10):1132:crawl] _GMaster: slave's time: (1501521813, 0)
[2017-07-31 17:32:22.668236] I [master(/bricks/brick1/master_brick6):1132:crawl] _GMaster: slave's time: (1501521812, 0)
[2017-07-31 17:32:23.242929] I [gsyncd(monitor):714:main_i] <top>: Monitor Status: Paused
[2017-07-31 17:33:24.706393] I [gsyncd(monitor):714:main_i] <top>: Monitor Status: Started
[2017-07-31 17:33:24.708093] E [syncdutils(/bricks/brick1/master_brick6):296:log_raise_exception] <top>: FAIL: 
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 204, in main
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 780, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1582, in service_loop
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 570, in crawlwrap
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1143, in crawl
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1118, in changelogs_batch_process
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1001, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 894, in process_change
    rl = errno_wrap(os.readlink, [en], [ENOENT], [ESTALE])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 495, in errno_wrap
    return call(*arg)
OSError: [Errno 4] Interrupted system call: '.gfid/6858a52c-4d7d-4c06-889f-3c43e3a91e68/597f69a3%%SAZPTFV05C'
[2017-07-31 17:33:24.714128] I [syncdutils(/bricks/brick1/master_brick6):237:finalize] <top>: exiting.
[2017-07-31 17:33:24.719995] I [repce(/bricks/brick1/master_brick6):92:service_loop] RepceServer: terminating on reaching EOF.
[2017-07-31 17:33:24.720269] I [syncdutils(/bricks/brick1/master_brick6):237:finalize] <top>: exiting.
[2017-07-31 17:33:24.762686] I [gsyncdstatus(monitor):240:set_worker_status] GeorepStatus: Worker Status: Faulty
[2017-07-31 17:33:27.880553] I [master(/bricks/brick2/master_brick10):1132:crawl] _GMaster: slave's time: (1501522338, 0)
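
The traceback shows why the worker dies: errno_wrap() is invoked with only ENOENT and ESTALE as handled errnos, so the EINTR (Errno 4) raised when a signal interrupts the in-flight readlink falls through and propagates as a fatal error. A minimal sketch of the missing retry behavior, using a hypothetical helper (not the gsyncd code):

import errno
import os
import time

def eintr_safe_readlink(path, retries=5, delay=0.1):
    # EINTR means the call was interrupted by a signal, not that it
    # failed; retry a bounded number of times before giving up.
    for attempt in range(retries):
        try:
            return os.readlink(path)
        except OSError as e:
            if e.errno == errno.EINTR and attempt < retries - 1:
                time.sleep(delay)  # brief pause before retrying
                continue
            raise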

The client log shows the following at the same time (note the aux mount being unmounted at 17:33:24, the moment the readlink failed):

[2017-07-31 16:53:11.296205] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-master-client-9: Server lk version = 1
[2017-07-31 16:53:11.319704] I [MSGID: 108031] [afr-common.c:2264:afr_local_discovery_cbk] 0-master-replicate-3: selecting local read_child master-client-6
[2017-07-31 16:53:11.319885] I [MSGID: 108031] [afr-common.c:2264:afr_local_discovery_cbk] 0-master-replicate-1: selecting local read_child master-client-2
[2017-07-31 16:53:11.320437] I [MSGID: 108031] [afr-common.c:2264:afr_local_discovery_cbk] 0-master-replicate-5: selecting local read_child master-client-10
[2017-07-31 17:33:24.751800] I [fuse-bridge.c:5092:fuse_thread_proc] 0-fuse: initating unmount of /tmp/gsyncd-aux-mount-dW_b8o
[2017-07-31 17:33:24.752358] W [glusterfsd.c:1290:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x3777a07aa1) [0x7fc5556f6aa1] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xd5) [0x7fc556b0a845] -->/usr/sbin/glusterfs(cleanup_and_exit+0x76) [0x7fc556b0a2b6] ) 0-: received signum (15), shutting down
[2017-07-31 17:33:24.752386] I [fuse-bridge.c:5827:fini] 0-fuse: Unmounting '/tmp/gsyncd-aux-mount-dW_b8o'.
[2017-07-31 17:33:24.752400] I [fuse-bridge.c:5832:fini] 0-fuse: Closing fuse connection to '/tmp/gsyncd-aux-mount-dW_b8o'.
[2017-07-31 17:33:39.461354] I [MSGID: 100030] [glusterfsd.c:2431:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.8.4 (args: /usr/sbin/glusterfs --aux-gfid-mount --acl --log-file=/var/log/glusterfs/geo-replication/master/ssh%3A%2F%2Froot%4010.70.41.228%3Agluster%3A%2F%2F127.0.0.1%3Aslave.%2Fbricks%2Fbrick1%2Fmaster_brick6.gluster.log --volfile-server=localhost --volfile-id=master --client-pid=-1 /tmp/gsyncd-aux-mount-dz_HYe)
[2017-07-31 17:33:39.484689] I [MSGID: 101190] [event-epoll.c:602:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-07-31 17:33:39.500307] I [MSGID: 101173] [graph.c:269:gf_add_cmdline_options] 0-master-md-cache: adding option 'cache-posix-acl' for volume 'master-md-cache' with value 'true'

Version-Release number of selected component (if applicable):

How reproducible:
Saw this only once so far. 

Steps to Reproduce:
Ran automated snapshot cases with a geo-replication setup

Actual results:
The worker crashed

Expected results:
The worker should not crash; a transient interrupted system call (EINTR) should be retried.

Comment 4 Kotresh HR 2017-10-07 03:54:57 UTC
Upstream patch:

https://review.gluster.org/18447 (master)
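
The fix treats an interrupted system call as retriable rather than fatal. Conceptually (a sketch of the approach, not the merged patch), an errno_wrap-style helper gains EINTR in its retry list:

import errno
import time

def errno_wrap(call, args=(), tolerated=(), retriable=()):
    # Sketch only: errnos in `tolerated` are swallowed (e.g. ENOENT),
    # errnos in `retriable` (e.g. ESTALE, EINTR) trigger a bounded
    # retry, and anything else propagates to the caller.
    retries = 0
    while True:
        try:
            return call(*args)
        except OSError as ex:
            if ex.errno in tolerated:
                return
            if ex.errno in retriable and retries < 5:
                retries += 1
                time.sleep(0.25)  # back off briefly before retrying
                continue
            raise

With EINTR retried, the readlink from the traceback would survive the interruption instead of marking the worker Faulty.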

Comment 9 errata-xmlrpc 2018-09-04 06:34:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

