Bug 1194733

Summary: 3.0.4 : Multithreaded Epoll : Volume does not start after adding bricks.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Amit Chaurasia <achauras>
Component: glusterfs
Assignee: Gaurav Kumar Garg <ggarg>
Status: CLOSED NOTABUG
QA Contact: Amit Chaurasia <achauras>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.0
CC: achauras, amukherj, annair, mzywusko, nbalacha, nlevinki, nsathyan, rcyriac, smohan, vagarwal, vbellur
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-03-30 05:32:18 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Amit Chaurasia 2015-02-20 16:12:23 UTC
Description of problem:
The Gluster volume went down after an add-brick operation and does not start again.


Version-Release number of selected component (if applicable):
3.0.4


How reproducible:
Tried once.


Steps to Reproduce:
1. Created some large files on the volume.
2. Added a very large number of files.
3. Tried adding 8 bricks to the volume (see the command sketch below).
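
A minimal sketch of the add-brick step, assuming the eight bricks were added as four replica pairs with the naming scheme visible in the volume info output below (paths and counts are inferred from that output, not stated in the original report):

gluster volume add-brick gv0 \
    10.70.47.114:/rhs/brick14/gv0 10.70.47.174:/rhs/brick14/gv0 \
    10.70.47.114:/rhs/brick15/gv0 10.70.47.174:/rhs/brick15/gv0 \
    10.70.47.114:/rhs/brick16/gv0 10.70.47.174:/rhs/brick16/gv0 \
    10.70.47.114:/rhs/brick17/gv0 10.70.47.174:/rhs/brick17/gv0
gluster volume rebalance gv0 start

The volume status output further down shows the rebalance task as completed, so the rebalance step itself appears to have finished before the bricks went offline.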

Actual results:
The volume went down and all bricks went offline. Symptoms:

[root@dht-rhs-23 ~]# gluster v info
 
Volume Name: gv0
Type: Distributed-Replicate
Volume ID: e3452801-37df-4ac5-af75-8b0b295cfa7d
Status: Started
Snap Volume: no
Number of Bricks: 17 x 2 = 34
Transport-type: tcp
Bricks:
Brick1: 10.70.47.114:/rhs/brick1/gv0
Brick2: 10.70.47.174:/rhs/brick1/gv0
Brick3: 10.70.47.114:/rhs/brick2/gv0
Brick4: 10.70.47.174:/rhs/brick2/gv0
Brick5: 10.70.47.114:/rhs/brick3/gv0
Brick6: 10.70.47.174:/rhs/brick3/gv0
Brick7: 10.70.47.114:/rhs/brick4/gv0
Brick8: 10.70.47.174:/rhs/brick4/gv0
Brick9: 10.70.47.114:/rhs/brick5/gv0
Brick10: 10.70.47.174:/rhs/brick5/gv0
Brick11: 10.70.47.114:/rhs/brick6/gv0
Brick12: 10.70.47.174:/rhs/brick6/gv0
Brick13: 10.70.47.114:/rhs/brick7/gv0
Brick14: 10.70.47.174:/rhs/brick7/gv0
Brick15: 10.70.47.114:/rhs/brick8/gv0
Brick16: 10.70.47.174:/rhs/brick8/gv0
Brick17: 10.70.47.114:/rhs/brick9/gv0
Brick18: 10.70.47.174:/rhs/brick9/gv0
Brick19: 10.70.47.114:/rhs/brick10/gv0
Brick20: 10.70.47.174:/rhs/brick10/gv0
Brick21: 10.70.47.114:/rhs/brick11/gv0
Brick22: 10.70.47.174:/rhs/brick11/gv0
Brick23: 10.70.47.114:/rhs/brick12/gv0
Brick24: 10.70.47.174:/rhs/brick12/gv0
Brick25: 10.70.47.114:/rhs/brick13/gv0
Brick26: 10.70.47.174:/rhs/brick13/gv0
Brick27: 10.70.47.114:/rhs/brick14/gv0
Brick28: 10.70.47.174:/rhs/brick14/gv0
Brick29: 10.70.47.114:/rhs/brick15/gv0
Brick30: 10.70.47.174:/rhs/brick15/gv0
Brick31: 10.70.47.114:/rhs/brick16/gv0
Brick32: 10.70.47.174:/rhs/brick16/gv0
Brick33: 10.70.47.114:/rhs/brick17/gv0
Brick34: 10.70.47.174:/rhs/brick17/gv0
Options Reconfigured:
performance.readdir-ahead: on
features.quota: on
cluster.min-free-disk: 20%
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256

[root@dht-rhs-23 ~]# gluster v status all
Status of volume: gv0
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick 10.70.47.114:/rhs/brick1/gv0			N/A	N	N/A
Brick 10.70.47.114:/rhs/brick2/gv0			N/A	N	N/A
Brick 10.70.47.114:/rhs/brick3/gv0			N/A	N	N/A
Brick 10.70.47.114:/rhs/brick4/gv0			N/A	N	N/A
Brick 10.70.47.114:/rhs/brick5/gv0			N/A	N	N/A
Brick 10.70.47.114:/rhs/brick6/gv0			N/A	N	N/A
Brick 10.70.47.114:/rhs/brick7/gv0			N/A	N	N/A
Brick 10.70.47.114:/rhs/brick8/gv0			N/A	N	N/A
Brick 10.70.47.114:/rhs/brick9/gv0			N/A	N	N/A
Brick 10.70.47.114:/rhs/brick10/gv0			N/A	N	N/A
Brick 10.70.47.114:/rhs/brick11/gv0			N/A	N	N/A
Brick 10.70.47.114:/rhs/brick12/gv0			N/A	N	N/A
Brick 10.70.47.114:/rhs/brick13/gv0			N/A	N	N/A
Brick 10.70.47.114:/rhs/brick14/gv0			N/A	N	N/A
Brick 10.70.47.114:/rhs/brick15/gv0			N/A	N	N/A
Brick 10.70.47.114:/rhs/brick16/gv0			N/A	N	N/A
Brick 10.70.47.114:/rhs/brick17/gv0			N/A	N	N/A
NFS Server on localhost					N/A	N	N/A
Self-heal Daemon on localhost				N/A	N	N/A
Quota Daemon on localhost				N/A	N	N/A
 
Task Status of Volume gv0
------------------------------------------------------------------------------
Task                 : Rebalance           
ID                   : 8b9f9c3d-0282-4851-84a0-fbadba689585
Status               : completed           
 


[root@dht-rhs-23 ~]# gluster volume start all
volume start: all: failed: Volume all does not exist
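
Note that "gluster volume start" expects a volume name and has no "all" keyword, which is why the command above fails with "Volume all does not exist". For a volume that is already in Started state but whose brick processes are offline, the usual way to respawn the bricks is the force variant (a sketch only, not necessarily sufficient in this situation):

gluster volume start gv0 force
gluster volume status gv0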

[root@dht-rhs-24 ~]# mount -t glusterfs 10.70.47.114:/gv0 /fuse_mnt1/
Mount failed. Please check the log file for more details.
[root@dht-rhs-24 ~]# 


Expected results:
The bricks should have been added successfully and the volume should have remained online and mountable.


Additional info: This essentially looks like a multithreaded epoll issue.
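
Since the suspicion is on the multithreaded epoll feature, it may help to record the epoll worker-thread settings in play. A sketch, assuming this RHGS 3.0.4 build exposes the upstream multithreaded-epoll tunables client.event-threads and server.event-threads (the option names are an assumption and may differ in this build):

# Assumed option names for the multithreaded epoll worker counts
gluster volume set gv0 client.event-threads 4
gluster volume set gv0 server.event-threads 4
gluster volume info gv0    # non-default values appear under "Options Reconfigured"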

Log snippet:

[2015-02-20 19:39:11.003972] I [MSGID: 100030] [glusterfsd.c:2019:main] 0-glusterfs: Started running glusterfs version 3.6.0.45 (args: glusterfs -s localhost --volfile-id=gv0 --client-pid=-42 /tmp/tmp.YCvfdf5b6W)
[2015-02-20 19:39:11.022439] I [event-epoll.c:606:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2015-02-20 20:09:21.449501] E [rpc-clnt.c:201:call_bail] 0-glusterfs: bailing out frame type(GlusterFS Handshake) op(GETSPEC(2)) xid = 0x1 sent = 2015-02-20 19:39:11.022580. timeout = 1800 for 127.0.0.1:24007
[2015-02-20 20:09:21.449606] E [glusterfsd-mgmt.c:1596:mgmt_getspec_cbk] 0-mgmt: failed to fetch volume file (key:gv0)
[2015-02-20 20:09:21.450082] W [glusterfsd.c:1183:cleanup_and_exit] (--> 0-: received signum (0), shutting down
[2015-02-20 20:09:21.450151] I [fuse-bridge.c:5584:fini] 0-fuse: Unmounting '/tmp/tmp.YCvfdf5b6W'.
[2015-02-20 20:09:21.462573] W [glusterfsd.c:1183:cleanup_and_exit] (--> 0-: received signum (15), shutting down
[root@dht-rhs-24 ~]# 
[root@dht-rhs-24 ~]# 
[root@dht-rhs-24 ~]# tail -30 tmp-tmp.TRH3z0MH7n.log
tail: cannot open `tmp-tmp.TRH3z0MH7n.log' for reading: No such file or directory
[root@dht-rhs-24 ~]# 
[root@dht-rhs-24 ~]# mount -t glusterfs 10.70.47.114:/gv0 /fuse_mnt1/
Mount failed. Please check the log file for more details.
[root@dht-rhs-24 ~]# 
[root@dht-rhs-24 ~]# tail -50 tmp-tmp.TRH3z0MH7n.log
tail: cannot open `tmp-tmp.TRH3z0MH7n.log' for reading: No such file or directory
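
The call_bail messages above show the mount helper timing out on GETSPEC against glusterd on port 24007, i.e. the client never received the volume file. A quick server-side sanity check, sketched here assuming RHEL 6 style service tooling on these nodes (use systemctl on RHEL 7 based nodes):

service glusterd status
ss -ltnp | grep 24007    # or: netstat -ltnp | grep 24007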

Comment 1 Amit Chaurasia 2015-02-20 17:00:31 UTC
Additionally, there is no connectivity between the cluster nodes.


[root@dht-rhs-23 ~]# gluster peer status
Number of Peers: 1

Hostname: 10.70.47.174
Uuid: 7e0465e6-029a-4052-bfdb-1b8db2cbdb47
State: Peer Rejected (Connected)
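
"Peer Rejected (Connected)" generally means the two glusterd instances disagree on the volume configuration stored under /var/lib/glusterd. The commonly documented recovery is sketched below for reference only; it assumes the rejected peer should be resynced from the good node (back up /var/lib/glusterd first, and do not run this before the root cause has been analyzed):

# On the rejected peer, after backing up /var/lib/glusterd:
service glusterd stop
find /var/lib/glusterd -mindepth 1 -maxdepth 1 ! -name glusterd.info -exec rm -rf {} +
service glusterd start
gluster peer probe 10.70.47.114
service glusterd restart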

Comment 2 Amit Chaurasia 2015-02-23 12:09:42 UTC
Sos reports are available at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1194733/

Comment 3 Atin Mukherjee 2015-02-24 05:00:33 UTC
Amit,

Please provide access permissions for these files.

Comment 5 Gaurav Kumar Garg 2015-02-26 11:33:34 UTC
In the sos report, the glusterd log file "etc-glusterfs-glusterd.vol.log" contains no log entries. We need the glusterd logs to analyze this bug.

Also, in your volume status output all of your bricks are offline, so the volume cannot be mounted.

Please provide the necessary information.

Comment 6 Amit Chaurasia 2015-02-27 06:44:39 UTC
I have copied /var/log/glusterfs from both nodes to the above location. Please have a look. Note that these logs may also contain entries from operations carried out after this issue occurred.

If these logs are still not helpful, please let me know. I plan to try and reproduce the issue with different settings.

Comment 8 Gaurav Kumar Garg 2015-02-27 10:52:49 UTC
In your sos report the glusterd logs are empty. We need the glusterd logs to analyze this bug.

These logs are still not helpful. Can you reproduce this bug and attach the sos report again?

Comment 9 Atin Mukherjee 2015-03-01 15:08:32 UTC
Amit,

Since the log files were rotated, and sosreport doesn't capture all the log files unless an explicit -all option is provided, the current log files are insufficient to analyze the issue. Please try to reproduce it; otherwise we will have to close this BZ.

~Atin
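
If the issue can be reproduced, regenerating the report with all log files (including rotated ones) should avoid this gap. A sketch, assuming a sosreport version that supports the all-logs switch:

sosreport --all-logs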