Description of problem: Gluster volume is down.

Version-Release number of selected component (if applicable): 3.0.4

How reproducible: Tried once.

Steps to Reproduce:
1. Created some large files.
2. Added a huge number of files.
3. Tried adding 8 bricks.

Actual results:
The volume collapsed. Some symptoms:

[root@dht-rhs-23 ~]# gluster v info

Volume Name: gv0
Type: Distributed-Replicate
Volume ID: e3452801-37df-4ac5-af75-8b0b295cfa7d
Status: Started
Snap Volume: no
Number of Bricks: 17 x 2 = 34
Transport-type: tcp
Bricks:
Brick1: 10.70.47.114:/rhs/brick1/gv0
Brick2: 10.70.47.174:/rhs/brick1/gv0
Brick3: 10.70.47.114:/rhs/brick2/gv0
Brick4: 10.70.47.174:/rhs/brick2/gv0
Brick5: 10.70.47.114:/rhs/brick3/gv0
Brick6: 10.70.47.174:/rhs/brick3/gv0
Brick7: 10.70.47.114:/rhs/brick4/gv0
Brick8: 10.70.47.174:/rhs/brick4/gv0
Brick9: 10.70.47.114:/rhs/brick5/gv0
Brick10: 10.70.47.174:/rhs/brick5/gv0
Brick11: 10.70.47.114:/rhs/brick6/gv0
Brick12: 10.70.47.174:/rhs/brick6/gv0
Brick13: 10.70.47.114:/rhs/brick7/gv0
Brick14: 10.70.47.174:/rhs/brick7/gv0
Brick15: 10.70.47.114:/rhs/brick8/gv0
Brick16: 10.70.47.174:/rhs/brick8/gv0
Brick17: 10.70.47.114:/rhs/brick9/gv0
Brick18: 10.70.47.174:/rhs/brick9/gv0
Brick19: 10.70.47.114:/rhs/brick10/gv0
Brick20: 10.70.47.174:/rhs/brick10/gv0
Brick21: 10.70.47.114:/rhs/brick11/gv0
Brick22: 10.70.47.174:/rhs/brick11/gv0
Brick23: 10.70.47.114:/rhs/brick12/gv0
Brick24: 10.70.47.174:/rhs/brick12/gv0
Brick25: 10.70.47.114:/rhs/brick13/gv0
Brick26: 10.70.47.174:/rhs/brick13/gv0
Brick27: 10.70.47.114:/rhs/brick14/gv0
Brick28: 10.70.47.174:/rhs/brick14/gv0
Brick29: 10.70.47.114:/rhs/brick15/gv0
Brick30: 10.70.47.174:/rhs/brick15/gv0
Brick31: 10.70.47.114:/rhs/brick16/gv0
Brick32: 10.70.47.174:/rhs/brick16/gv0
Brick33: 10.70.47.114:/rhs/brick17/gv0
Brick34: 10.70.47.174:/rhs/brick17/gv0
Options Reconfigured:
performance.readdir-ahead: on
features.quota: on
cluster.min-free-disk: 20%
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256

[root@dht-rhs-23 ~]# gluster v status all
Status of volume: gv0
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.114:/rhs/brick1/gv0                      N/A     N       N/A
Brick 10.70.47.114:/rhs/brick2/gv0                      N/A     N       N/A
Brick 10.70.47.114:/rhs/brick3/gv0                      N/A     N       N/A
Brick 10.70.47.114:/rhs/brick4/gv0                      N/A     N       N/A
Brick 10.70.47.114:/rhs/brick5/gv0                      N/A     N       N/A
Brick 10.70.47.114:/rhs/brick6/gv0                      N/A     N       N/A
Brick 10.70.47.114:/rhs/brick7/gv0                      N/A     N       N/A
Brick 10.70.47.114:/rhs/brick8/gv0                      N/A     N       N/A
Brick 10.70.47.114:/rhs/brick9/gv0                      N/A     N       N/A
Brick 10.70.47.114:/rhs/brick10/gv0                     N/A     N       N/A
Brick 10.70.47.114:/rhs/brick11/gv0                     N/A     N       N/A
Brick 10.70.47.114:/rhs/brick12/gv0                     N/A     N       N/A
Brick 10.70.47.114:/rhs/brick13/gv0                     N/A     N       N/A
Brick 10.70.47.114:/rhs/brick14/gv0                     N/A     N       N/A
Brick 10.70.47.114:/rhs/brick15/gv0                     N/A     N       N/A
Brick 10.70.47.114:/rhs/brick16/gv0                     N/A     N       N/A
Brick 10.70.47.114:/rhs/brick17/gv0                     N/A     N       N/A
NFS Server on localhost                                 N/A     N       N/A
Self-heal Daemon on localhost                           N/A     N       N/A
Quota Daemon on localhost                               N/A     N       N/A

Task Status of Volume gv0
------------------------------------------------------------------------------
Task                 : Rebalance
ID                   : 8b9f9c3d-0282-4851-84a0-fbadba689585
Status               : completed

[root@dht-rhs-23 ~]# gluster volume start all
volume start: all: failed: Volume all does not exist

[root@dht-rhs-24 ~]# mount -t glusterfs 10.70.47.114:/gv0 /fuse_mnt1/
Mount failed. Please check the log file for more details.

Expected results:
Bricks should have been added successfully.
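For reference, given the final 17 x 2 layout above, the add-brick in step 3 was presumably of the following form. This is a reconstruction for readability only: the exact command and new-brick paths were not captured in this report, and brick14-brick17 on each node are assumed to be the newly added pairs.

# Hypothetical reconstruction of step 3 (13 x 2 -> 17 x 2); not the verbatim command.
gluster volume add-brick gv0 \
    10.70.47.114:/rhs/brick14/gv0 10.70.47.174:/rhs/brick14/gv0 \
    10.70.47.114:/rhs/brick15/gv0 10.70.47.174:/rhs/brick15/gv0 \
    10.70.47.114:/rhs/brick16/gv0 10.70.47.174:/rhs/brick16/gv0 \
    10.70.47.114:/rhs/brick17/gv0 10.70.47.174:/rhs/brick17/gv0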
Additional info:
Essentially looks like an epoll issue. Log snippet:

[2015-02-20 19:39:11.003972] I [MSGID: 100030] [glusterfsd.c:2019:main] 0-glusterfs: Started running glusterfs version 3.6.0.45 (args: glusterfs -s localhost --volfile-id=gv0 --client-pid=-42 /tmp/tmp.YCvfdf5b6W)
[2015-02-20 19:39:11.022439] I [event-epoll.c:606:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2015-02-20 20:09:21.449501] E [rpc-clnt.c:201:call_bail] 0-glusterfs: bailing out frame type(GlusterFS Handshake) op(GETSPEC(2)) xid = 0x1 sent = 2015-02-20 19:39:11.022580. timeout = 1800 for 127.0.0.1:24007
[2015-02-20 20:09:21.449606] E [glusterfsd-mgmt.c:1596:mgmt_getspec_cbk] 0-mgmt: failed to fetch volume file (key:gv0)
[2015-02-20 20:09:21.450082] W [glusterfsd.c:1183:cleanup_and_exit] (--> 0-: received signum (0), shutting down
[2015-02-20 20:09:21.450151] I [fuse-bridge.c:5584:fini] 0-fuse: Unmounting '/tmp/tmp.YCvfdf5b6W'.
[2015-02-20 20:09:21.462573] W [glusterfsd.c:1183:cleanup_and_exit] (--> 0-: received signum (15), shutting down

[root@dht-rhs-24 ~]# tail -30 tmp-tmp.TRH3z0MH7n.log
tail: cannot open `tmp-tmp.TRH3z0MH7n.log' for reading: No such file or directory
[root@dht-rhs-24 ~]# mount -t glusterfs 10.70.47.114:/gv0 /fuse_mnt1/
Mount failed. Please check the log file for more details.
[root@dht-rhs-24 ~]# tail -50 tmp-tmp.TRH3z0MH7n.log
tail: cannot open `tmp-tmp.TRH3z0MH7n.log' for reading: No such file or directory
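The call_bail above is the FUSE client giving up on its volfile (GETSPEC) request to glusterd on 127.0.0.1:24007 after the 1800-second frame timeout, which points at glusterd rather than the bricks. The following quick checks can confirm whether glusterd is reachable at all; this is a suggested diagnostic, not output from the affected setup, and the log-file path is an arbitrary choice:

# Is glusterd alive and listening on the management port?
pgrep -lf glusterd
netstat -tlnp | grep 24007
# Retry the mount with a dedicated client log to make triage easier:
mount -t glusterfs -o log-file=/var/log/glusterfs/fuse_mnt1.log \
    10.70.47.114:/gv0 /fuse_mnt1/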
Additionally, there is no connectivity between the cluster nodes:

[root@dht-rhs-23 ~]# gluster peer status
Number of Peers: 1

Hostname: 10.70.47.174
Uuid: 7e0465e6-029a-4052-bfdb-1b8db2cbdb47
State: Peer Rejected (Connected)
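If this turns out to be a plain configuration mismatch, the standard recovery for a "Peer Rejected" state is to resync the rejected peer's configuration from a healthy node. A sketch of that documented procedure follows, assuming the default /var/lib/glusterd path; it wipes the node's local volume configuration, so it should only be run after the logs have been collected:

# On the rejected peer only. Destructive to local config; collect logs first.
service glusterd stop
cd /var/lib/glusterd
# Keep glusterd.info (this node's UUID); remove everything else.
find . -mindepth 1 -maxdepth 1 ! -name glusterd.info -exec rm -rf {} +
service glusterd start
gluster peer probe 10.70.47.114    # re-probe a healthy node to resync volume info
service glusterd restart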
Sos reports are available at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1194733/
Amit, please provide access permissions for these files.
In the sos report, the "etc-glusterfs-glusterd.vol.log" file contains no glusterd logs, and we need the glusterd logs to analyze this bug. Also, in your volume status output all of the bricks are offline, so the volume will not mount. Please provide the necessary information.
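If the brick processes have simply died, a force start may bring them back and let the mount succeed; this is a generic suggestion, not something verified on this setup:

gluster volume start gv0 force    # respawn brick processes of an already-started volume
gluster volume status gv0         # the bricks should now report Online = Y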
I have copied /var/log/glusterfs from both nodes to the above location. Please have a look. Note that these logs may also contain recent operations carried out after this issue occurred. If the logs are still not helpful, please let me know; I plan to try to reproduce the issue with different settings.
The glusterd logs in your sos report are empty, and they are needed to analyze this bug; the logs provided so far are still not helpful. Can you reproduce this bug and attach a fresh sos report?
Amit, since the log files were rotated and sosreport doesn't capture all the log files unless an explicit -all option is provided, the current log files are insufficient to analyze the issue. Please try to reproduce it; otherwise we will have to close this BZ. ~Atin
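For the next run, the rotated logs can be captured explicitly. Assuming the sos version in use supports it, --all-logs is the usual spelling of the option referred to above; the tar command is just a fallback suggestion:

# Capture all log files, including rotated ones, in the new sosreport:
sosreport --all-logs
# Fallback: archive the gluster logs directly on each node:
tar czf /tmp/glusterfs-logs-$(hostname).tar.gz /var/log/glusterfs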