Description of problem:
geo-rep passive sessions went to a faulty state while creating 10K files and performing metadata operations. The cause: the brick process received SIGTERM, with emergency ("M") messages in the brick logs.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2013-11-22 04:53:17.746951] I [server-helpers.c:590:server_log_conn_destroy] 0-master-server: destroyed connection of redcloak.blr.redhat.com-16047-2013/11/22-04:53:16:649410-master-client-1-0
[2013-11-22 04:53:19.502941] E [posix.c:374:posix_setattr] 0-master-posix: setattr (lstat) on /bricks/master_brick2/ failed: No such file or directory
[2013-11-22 04:53:19.503000] I [server-rpc-fops.c:1778:server_setattr_cbk] 0-master-server: 901166: SETATTR / (00000000-0000-0000-0000-000000000001) ==> (No such file or directory)
[2013-11-22 04:53:19.503088] E [posix.c:3349:posix_getxattr] 0-master-posix: listxattr failed on /bricks/master_brick2/: No such file or directory
[2013-11-22 04:53:19.528109] E [posix.c:616:posix_opendir] 0-master-posix: opendir failed on /bricks/master_brick2/: No such file or directory
[2013-11-22 04:53:19.528131] I [server-rpc-fops.c:705:server_opendir_cbk] 0-master-server: 901171: OPENDIR / (00000000-0000-0000-0000-000000000001) ==> (No such file or directory)
[2013-11-22 04:53:22.139406] M [posix-helpers.c:1309:posix_health_check_thread_proc] 0-master-posix: still alive! -> SIGTERM
[2013-11-22 04:53:22.139655] W [glusterfsd.c:1097:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x33a66e894d] (-->/lib64/libpthread.so.0() [0x33a6e07851] (-->/usr/sbin/glusterfsd(glusterfs_sigwaiter+0xcd) [0x4053cd]))) 0-: received signum (15), shutting down
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.44rhs-1

How reproducible:
Did not try to reproduce.

Steps to Reproduce:
1. Create and start a geo-rep session between master and slave.
2. Run the following script:

for i in {1..10}
do
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=chmod /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=chown /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=chgrp /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=symlink /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=hardlink /mnt/master/
    sleep 100
    ./crefi.py -n 100 --multi -b 10 -d 10 --random --max=2K --min=1K --fop=rename /mnt/master/
    sleep 500
    rm -rvf /mnt/master/*
done

Actual results:
geo-rep passive sessions went to faulty; the brick process received SIGTERM along with an emergency ("M") log message.

Expected results:
The brick process should not receive SIGTERM mid-run.

Additional info:
This message caused the brick process to stop:

[2013-11-22 04:53:22.139406] M [posix-helpers.c:1309:posix_health_check_thread_proc] 0-master-posix: still alive! -> SIGTERM

The health check failed because of errors like this one:

[2013-11-22 04:53:19.528109] E [posix.c:616:posix_opendir] 0-master-posix: opendir failed on /bricks/master_brick2/: No such file or directory

This suggests that the directory for the brick does not exist. Could the brick process have been started before the directory /bricks/master_brick2/ was available (is /bricks a mountpoint)? Alternatively, if the directory was removed or renamed while the brick process was running, I think stopping the brick process is the correct behavior. Could you post the output of 'gluster volume info master' so the bricks can be verified?
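For context, the mechanism behind that "M" message: the posix translator runs a health-check thread that periodically stats the brick root, and when that stat fails it terminates the brick process. The following Python sketch illustrates the idea only; it is not gluster's actual code, and the function name is mine:

```python
import os

def brick_healthy(brick_path):
    """Return True if the brick root still answers a stat() call.

    Illustrative only: in glusterfs the posix translator's health-check
    thread does a similar periodic check on the brick root, and on
    failure logs "still alive! -> SIGTERM" and kills the brick process.
    """
    try:
        os.stat(brick_path)
        return True
    except OSError:
        # ENOENT here corresponds to the "No such file or directory"
        # errors seen in the brick log above.
        return False
```

In the real translator the check interval is tunable (the storage.health-check-interval volume option, if I recall correctly) and a failed check sends SIGTERM to the brick's own process, which matches the cleanup_and_exit line in the log.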
# gluster volume info master

Volume Name: master
Type: Distributed-Replicate
Volume ID: 94f27837-1db2-457e-9a61-08a44d84c7ef
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.43.0:/bricks/master_brick1
Brick2: 10.70.43.29:/bricks/master_brick2
Brick3: 10.70.43.40:/bricks/master_brick3
Brick4: 10.70.43.53:/bricks/master_brick4
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
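To distinguish the two cases raised above (brick started before its backing filesystem was mounted, versus the directory removed while the brick ran), each brick path from the volume info could be checked on its host. A minimal sketch; the function name is mine, not a gluster tool:

```python
import os

def check_brick_path(path):
    """Hypothetical diagnostic: classify a brick path as missing,
    sitting on (or directly under) a mountpoint, or a plain
    directory on the parent filesystem."""
    if not os.path.isdir(path):
        return "missing"
    if os.path.ismount(path) or os.path.ismount(os.path.dirname(path)):
        return "on a mountpoint"
    return "plain directory"
```

Running check_brick_path("/bricks/master_brick2") on 10.70.43.29 and getting "missing" would be consistent with the opendir/setattr ENOENT errors in the brick log.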
It looks like the brick process received SIGTERM from the posix health check, which indicates a problem with the underlying storage layer. Geo-rep sessions going faulty is expected behavior in this scenario.