Bug 1033451 - geo-rep passive sessions went to faulty state while creating 10K files and doing metadata operations in loop.
Summary: geo-rep passive sessions went to faulty state while creating 10K files and do...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: geo-replication
Version: 2.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Bug Updates Notification Mailing List
QA Contact: storage-qa-internal@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-11-22 06:36 UTC by Vijaykumar Koppad
Modified: 2014-12-24 09:58 UTC (History)
5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-12-24 09:58:41 UTC
Target Upstream Version:



Description Vijaykumar Koppad 2013-11-22 06:36:13 UTC
Description of problem: 

geo-rep passive sessions went to a faulty state while creating 10K files and performing metadata operations. The brick process received SIGTERM, with emergency ('M' level) messages in the brick logs.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-11-22 04:53:17.746951] I [server-helpers.c:590:server_log_conn_destroy] 0-master-server: destroyed connection of redcloak.blr.redhat.com-16047-2013/11/22-04:53:16:649410-master-client-1-0  
[2013-11-22 04:53:19.502941] E [posix.c:374:posix_setattr] 0-master-posix: setattr (lstat) on /bricks/master_brick2/ failed: No such file or directory
[2013-11-22 04:53:19.503000] I [server-rpc-fops.c:1778:server_setattr_cbk] 0-master-server: 901166: SETATTR / (00000000-0000-0000-0000-000000000001) ==> (No such file or directory)
[2013-11-22 04:53:19.503088] E [posix.c:3349:posix_getxattr] 0-master-posix: listxattr failed on /bricks/master_brick2/: No such file or directory
[2013-11-22 04:53:19.528109] E [posix.c:616:posix_opendir] 0-master-posix: opendir failed on /bricks/master_brick2/: No such file or directory
[2013-11-22 04:53:19.528131] I [server-rpc-fops.c:705:server_opendir_cbk] 0-master-server: 901171: OPENDIR / (00000000-0000-0000-0000-000000000001) ==> (No such file or directory)
[2013-11-22 04:53:22.139406] M [posix-helpers.c:1309:posix_health_check_thread_proc] 0-master-posix: still alive! -> SIGTERM
[2013-11-22 04:53:22.139655] W [glusterfsd.c:1097:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x33a66e894d] (-->/lib64/libpthread.so.0() [0x33a6e07851] (-->/usr/sbin/glusterfsd(glusterfs_sigwaiter+0xcd) [0x4053cd]))) 0-: received signum (15), shutting down
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>


Version-Release number of selected component (if applicable): glusterfs-3.4.0.44rhs-1


How reproducible: Didn't try to reproduce 


Steps to Reproduce:
1. Create and start a geo-rep session between the master and slave volumes.
2. Run the following script:
for i in {1..10}
do
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K    /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K  --fop=chmod  /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K   --fop=chown  /mnt/master/
    sleep 100 
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K   --fop=chgrp  /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K  --fop=symlink  /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K   --fop=hardlink  /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K   --fop=rename /mnt/master/
    sleep 500
    rm -rvf /mnt/master/*
done


Actual results: geo-rep passive sessions went to a faulty state because the brick process received SIGTERM with an emergency message.


Expected results: The brick process should not receive SIGTERM during the run.


Additional info:

Comment 2 Niels de Vos 2013-11-22 08:14:46 UTC
This message caused the brick process to stop:

[2013-11-22 04:53:22.139406] M [posix-helpers.c:1309:posix_health_check_thread_proc] 0-master-posix: still alive! -> SIGTERM


The reason the health-check failed, is because of this (and similar):

[2013-11-22 04:53:19.528109] E [posix.c:616:posix_opendir] 0-master-posix: opendir failed on /bricks/master_brick2/: No such file or directory


It suggests that the directory for this brick does not exist. Could it be that the brick process was started before the directory /bricks/master_brick2/ was available (is /bricks a mountpoint)? Alternatively, if the directory was removed or renamed while the brick process was running, I think stopping the brick process is the correct behavior.

Could you post the 'gluster volume info master' output to verify the bricks?
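A quick way to answer both questions on the affected brick host is a short check like the one below; it is only a diagnostic sketch using the paths from this report (`mountpoint` is part of util-linux):

```shell
#!/bin/sh
# Diagnostic sketch: verify that /bricks is a mounted filesystem and
# that the brick directory from this bug still exists on disk.
BRICK_DIR=/bricks/master_brick2

if mountpoint -q /bricks; then
    echo "/bricks is a mountpoint"
else
    echo "/bricks is NOT a mountpoint (brick may live on the root fs)"
fi

if [ -d "$BRICK_DIR" ]; then
    echo "$BRICK_DIR exists"
else
    echo "$BRICK_DIR is missing: matches the ENOENT errors in the brick log"
fi
```

If the second check fails while the brick process is running, the posix health check failing and the brick shutting down is the expected outcome.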

Comment 3 Vijaykumar Koppad 2013-11-22 09:21:01 UTC
# gluster volume info master
 
Volume Name: master
Type: Distributed-Replicate
Volume ID: 94f27837-1db2-457e-9a61-08a44d84c7ef
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.43.0:/bricks/master_brick1
Brick2: 10.70.43.29:/bricks/master_brick2
Brick3: 10.70.43.40:/bricks/master_brick3
Brick4: 10.70.43.53:/bricks/master_brick4
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on

Comment 5 Aravinda VK 2014-12-24 09:58:21 UTC
Looks like the brick process received SIGTERM from the posix health check, which indicates a problem with the underlying storage layer. Geo-rep sessions going to a faulty state is expected behavior in this scenario.
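For reference, the health check in question is the posix translator's periodic access check on the brick root; when it fails, the brick terminates itself (the "still alive! -> SIGTERM" line above). Its interval is governed by a volume option. A sketch, assuming the storage.health-check-interval option is present in this glusterfs build (if not, `gluster volume set help` lists the available options):

```shell
# Configuration sketch (not a fix for this bug): adjust the posix
# health-check interval, in seconds, on the master volume.
# Assumption: storage.health-check-interval exists in this build.
gluster volume set help | grep -A2 health
gluster volume set master storage.health-check-interval 60
```

Note that tuning the interval only changes how quickly a bad brick is detected; the underlying storage problem still needs to be resolved.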

