Bug 1033451

Summary: geo-rep passive sessions went to faulty state while creating 10K files and doing metadata operations in loop.
Product: Red Hat Gluster Storage
Reporter: Vijaykumar Koppad <vkoppad>
Component: geo-replication
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
Status: CLOSED NOTABUG
QA Contact: storage-qa-internal <storage-qa-internal>
Severity: high
Docs Contact:
Priority: unspecified
Version: 2.1
CC: aavati, avishwan, csaba, david.macdonald, ndevos
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-12-24 09:58:41 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Vijaykumar Koppad 2013-11-22 06:36:13 UTC
Description of problem: 

geo-rep passive sessions went to a faulty state while creating 10K files and performing metadata operations. The cause: the brick process received SIGTERM, with emergency ('M') messages in the brick logs.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-11-22 04:53:17.746951] I [server-helpers.c:590:server_log_conn_destroy] 0-master-server: destroyed connection of redcloak.blr.redhat.com-16047-2013/11/22-04:53:16:649410-master-client-1-0  
[2013-11-22 04:53:19.502941] E [posix.c:374:posix_setattr] 0-master-posix: setattr (lstat) on /bricks/master_brick2/ failed: No such file or directory
[2013-11-22 04:53:19.503000] I [server-rpc-fops.c:1778:server_setattr_cbk] 0-master-server: 901166: SETATTR / (00000000-0000-0000-0000-000000000001) ==> (No such file or directory)
[2013-11-22 04:53:19.503088] E [posix.c:3349:posix_getxattr] 0-master-posix: listxattr failed on /bricks/master_brick2/: No such file or directory
[2013-11-22 04:53:19.528109] E [posix.c:616:posix_opendir] 0-master-posix: opendir failed on /bricks/master_brick2/: No such file or directory
[2013-11-22 04:53:19.528131] I [server-rpc-fops.c:705:server_opendir_cbk] 0-master-server: 901171: OPENDIR / (00000000-0000-0000-0000-000000000001) ==> (No such file or directory)
[2013-11-22 04:53:22.139406] M [posix-helpers.c:1309:posix_health_check_thread_proc] 0-master-posix: still alive! -> SIGTERM
[2013-11-22 04:53:22.139655] W [glusterfsd.c:1097:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x33a66e894d] (-->/lib64/libpthread.so.0() [0x33a6e07851] (-->/usr/sbin/glusterfsd(glusterfs_sigwaiter+0xcd) [0x4053cd]))) 0-: received signum (15), shutting down
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>


Version-Release number of selected component (if applicable): glusterfs-3.4.0.44rhs-1


How reproducible: Didn't try to reproduce.


Steps to Reproduce:
1. Create and start a geo-rep session between the master and slave volumes.
2. Run the following script:
for i in {1..10}
do
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=chmod /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=chown /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=chgrp /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=symlink /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=hardlink /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=rename /mnt/master/
    sleep 500
    rm -rvf /mnt/master/*
done


Actual results: geo-rep passive sessions went to a faulty state because the brick process received SIGTERM following an emergency message.


Expected results: The brick process should not receive SIGTERM during the run.


Additional info:

Comment 2 Niels de Vos 2013-11-22 08:14:46 UTC
This message caused the brick process to stop:

[2013-11-22 04:53:22.139406] M [posix-helpers.c:1309:posix_health_check_thread_proc] 0-master-posix: still alive! -> SIGTERM


The health-check failed because of errors like this one:

[2013-11-22 04:53:19.528109] E [posix.c:616:posix_opendir] 0-master-posix: opendir failed on /bricks/master_brick2/: No such file or directory


It suggests that the directory for this brick does not exist. Could it be that the brick process was started before the directory /bricks/master_brick2/ was available (is /bricks a mountpoint)? Alternatively, if the directory was removed or renamed while the brick process was running, I think stopping the brick process is the correct behavior.
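One quick way to test this hypothesis is a pre-flight check like the sketch below (not part of GlusterFS; the brick path is the one from this report): confirm the brick directory exists, and compare device numbers to see whether its parent is a separate filesystem from / (if it is not, a dedicated brick filesystem may simply not have been mounted when the brick process started).

```shell
# Sketch: pre-flight sanity check for a brick directory (not GlusterFS code).
check_brick() {
    brick="$1"
    if [ ! -d "$brick" ]; then
        echo "MISSING: $brick"
        return 0
    fi
    parent=$(dirname "$brick")
    # Same device number as / means the parent is not a separate mount.
    if [ "$(stat -c %d "$parent")" = "$(stat -c %d /)" ]; then
        echo "SAME-FS: $parent shares a filesystem with /"
    else
        echo "SEPARATE: $parent is its own filesystem"
    fi
}

check_brick /bricks/master_brick2   # brick path taken from this report
```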

Could you post the 'gluster volume info master' output to verify the bricks?

Comment 3 Vijaykumar Koppad 2013-11-22 09:21:01 UTC
# gluster volume info master
 
Volume Name: master
Type: Distributed-Replicate
Volume ID: 94f27837-1db2-457e-9a61-08a44d84c7ef
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.43.0:/bricks/master_brick1
Brick2: 10.70.43.29:/bricks/master_brick2
Brick3: 10.70.43.40:/bricks/master_brick3
Brick4: 10.70.43.53:/bricks/master_brick4
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on

Comment 5 Aravinda VK 2014-12-24 09:58:21 UTC
The brick process received SIGTERM from the posix health-check, which indicates a problem with the underlying storage layer. Geo-rep sessions going to a faulty state is the expected behavior in this scenario.
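The mechanism behind the "still alive! -> SIGTERM" log line can be illustrated with a minimal sketch (not GlusterFS code, and the temporary directory stands in for the brick root): the health-check periodically stats the brick root, and when that fails, the real implementation logs an emergency ('M') message and shuts the brick process down.

```shell
# Sketch of the posix health-check idea (not GlusterFS code).
brick=$(mktemp -d)          # stand-in for /bricks/master_brick2

health_check() {
    # The real check is periodic; one iteration shown here.
    stat "$1" >/dev/null 2>&1
}

health_check "$brick" && echo "healthy"

rmdir "$brick"              # simulate the brick directory disappearing

health_check "$brick" || echo "unhealthy -> would log M and send SIGTERM"
```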