Description of problem:
======================
On my systemic setup, I am doing same-path directory creation simultaneously from 3 different clients. Each client used a different server IP to mount the volume using the FUSE protocol. In addition, each client was dumping sosreports every 5 minutes into the volume mount in a screen session, along with top output being appended to a file every minute.

The directory creations were being done by different users, e.g.:
client1 (EL 7.2) was running the dir-creation as pavan@rhs-client23
client2 (EL 6.7) as root@rhs-client24
client3 (EL 7.3) as cli21@rhs-client21

Note: these logs are with respect to client1, i.e. rhs-client23. Also note that I am still able to access the mount.

I am seeing the below logs:

[2017-01-02 12:49:04.209279] E [MSGID: 108008] [afr-transaction.c:2602:afr_write_txn_refresh_done] 0-sysvol-replicate-3: Failing WRITE on gfid 2fa5d60d-9536-4d50-a1e9-5e512b9bee08: split-brain observed.
[2017-01-02 12:49:04.209330] W [fuse-bridge.c:1291:fuse_err_cbk] 0-glusterfs-fuse: 97459985: FLUSH() ERR => -1 (Input/output error)
[2017-01-02 12:49:04.211225] W [MSGID: 108008] [afr-read-txn.c:239:afr_read_txn] 0-sysvol-replicate-3: Unreadable subvolume -1 found with event generation 376 for gfid 2fa5d60d-9536-4d50-a1e9-5e512b9bee08. (Possible split-brain)
[2017-01-02 12:49:04.213489] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-sysvol-replicate-3: Failing STAT on gfid 2fa5d60d-9536-4d50-a1e9-5e512b9bee08: split-brain observed. [Input/output error]
[2017-01-02 12:49:04.214165] W [MSGID: 108008] [afr-read-txn.c:239:afr_read_txn] 0-sysvol-replicate-3: Unreadable subvolume -1 found with event generation 376 for gfid 2fa5d60d-9536-4d50-a1e9-5e512b9bee08. (Possible split-brain)
[2017-01-02 12:49:04.221112] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-sysvol-replicate-3: Failing GETXATTR on gfid 2fa5d60d-9536-4d50-a1e9-5e512b9bee08: split-brain observed. [Input/output error]
[2017-01-02 12:49:04.810823] W [MSGID: 108027] [afr-common.c:2589:afr_discover_done] 0-sysvol-replicate-3: no read subvols for /data/same-dir-creation/dir.rhs-client23.lab.eng.blr.redhat.com.log
[2017-01-02 12:49:04.810934] W [MSGID: 108008] [afr-read-txn.c:239:afr_read_txn] 0-sysvol-replicate-3: Unreadable subvolume -1 found with event generation 376 for gfid 2fa5d60d-9536-4d50-a1e9-5e512b9bee08. (Possible split-brain)
[2017-01-02 12:49:04.812838] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-sysvol-replicate-3: Failing STAT on gfid 2fa5d60d-9536-4d50-a1e9-5e512b9bee08: split-brain observed. [Input/output error]
[2017-01-02 12:49:04.814947] W [MSGID: 108008] [afr-read-txn.c:239:afr_read_txn] 0-sysvol-replicate-3: Unreadable subvolume -1 found with event generation 376 for gfid 2fa5d60d-9536-4d50-a1e9-5e512b9bee08. (Possible split-brain)
[2017-01-02 12:49:04.826529] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-sysvol-replicate-3: Failing STAT on gfid 2fa5d60d-9536-4d50-a1e9-5e512b9bee08: split-brain observed. [Input/output error]
[2017-01-02 12:49:04.827121] W [MSGID: 108008] [afr-read-txn.c:239:afr_read_txn] 0-sysvol-replicate-3: Unreadable subvolume -1 found with event generation 376 for gfid 2fa5d60d-9536-4d50-a1e9-5e512b9bee08. (Possible split-brain)
[2017-01-02 12:49:04.828029] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-sysvol-replicate-3: Failing GETXATTR on gfid 2fa5d60d-9536-4d50-a1e9-5e512b9bee08: split-brain observed. [Input/output error]
[2017-01-02 12:49:04.830158] W [MSGID: 108027] [afr-common.c:2589:afr_discover_done] 0-sysvol-replicate-3: no read subvols for /data/same-dir-creation/dir.rhs-client23.lab.eng.blr.redhat.com.log
[2017-01-02 12:49:04.830299] W [MSGID: 108008] [afr-read-txn.c:239:afr_read_txn] 0-sysvol-replicate-3: Unreadable subvolume -1 found with event generation 376 for gfid 2fa5d60d-9536-4d50-a1e9-5e512b9bee08. (Possible split-brain)
[2017-01-02 12:49:04.831159] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-sysvol-replicate-3: Failing STAT on gfid 2fa5d60d-9536-4d50-a1e9-5e512b9bee08: split-brain observed. [Input/output error]
[2017-01-02 12:49:04.833027] W [MSGID: 108008] [afr-read-txn.c:239:afr_read_txn] 0-sysvol-replicate-3: Unreadable subvolume -1 found with event generation 376 for gfid 2fa5d60d-9536-4d50-a1e9-5e512b9bee08. (Possible split-brain)
[2017-01-02 12:49:04.833890] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-sysvol-replicate-3: Failing STAT on gfid 2fa5d60d-9536-4d50-a1e9-5e512b9bee08: split-brain observed. [Input/output error]
[2017-01-02 12:49:04.834378] W [MSGID: 108008] [afr-read-txn.c:239:afr_read_txn] 0-sysvol-replicate-3: Unreadable subvolume -1 found with event generation 376 for gfid 2fa5d60d-9536-4d50-a1e9-5e512b9bee08. (Possible split-brain)
[2017-01-02 12:49:04.876508] E [MSGID: 108008] [afr-read-txn.c:80:afr_read_txn_refresh_done] 0-sysvol-replicate-3: Failing GETXATTR on gfid 2fa5d60d-9536-4d50-a1e9-5e512b9bee08: split-brain observed. [Input/output error]
[2017-01-02 12:49:04.886720] W [MSGID: 108027] [afr-common.c:2589:afr_discover_done] 0-sysvol-replicate-3: no read subvols for /data/same-dir-creation/dir.rhs-client23.lab.eng.blr.redhat.com.log
[2017-01-02 12:49:04.886865] W [MSGID: 108008] [afr-read-txn.c:239:afr_read_txn] 0-sysvol-replicate-3: Unreadable subvolume -1 found with event generation 376 for gfid 2fa5d60d-9536-4d50-a1e9-5e512b9bee08. (Possible split-brain)

sosreports available at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/3.2_logs/systemic_testing_logs/regression_cycle/same_dir_create_clients/rhs-client23.lab.eng.blr.redhat.com/

test execution details available at https://docs.google.com/spreadsheets/d/1iP5Mi1TewBFVh8HTmlcBm9072Bgsbgkr3CLcGmawDys/edit#gid=632186609

Version-Release number of selected component (if applicable):
============
3.8.4-10

Note that 2 bricks are down, for which separate bzs have been raised as below; they are related to the same setup:
1409472 - brick crashed on systemic setup
1397907 - seeing frequent kernel hangs when doing operations both on fuse client and gluster nodes on replica volumes
1409568 - seeing socket disconnects and transport endpoint not connected frequently on systemic setup

But I am seeing disconnects even for bricks which are not down.

[root@dhcp35-20 ~]# gluster v status
Status of volume: sysvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.20:/rhs/brick1/sysvol        49154     0          Y       2404
Brick 10.70.37.86:/rhs/brick1/sysvol        49154     0          Y       26789
Brick 10.70.35.156:/rhs/brick1/sysvol       N/A       N/A        N       N/A
Brick 10.70.37.154:/rhs/brick1/sysvol       49154     0          Y       2192
Brick 10.70.35.20:/rhs/brick2/sysvol        49155     0          Y       2424
Brick 10.70.37.86:/rhs/brick2/sysvol        N/A       N/A        N       N/A
Brick 10.70.35.156:/rhs/brick2/sysvol       49155     0          Y       26793
Brick 10.70.37.154:/rhs/brick2/sysvol       49155     0          Y       2212
Snapshot Daemon on localhost                49156     0          Y       2449
Self-heal Daemon on localhost               N/A       N/A        Y       3131
Quota Daemon on localhost                   N/A       N/A        Y       3139
Snapshot Daemon on 10.70.37.86              49156     0          Y       26832
Self-heal Daemon on 10.70.37.86             N/A       N/A        Y       2187
Quota Daemon on 10.70.37.86                 N/A       N/A        Y       2195
Snapshot Daemon on 10.70.35.156             49156     0          Y       26816
Self-heal Daemon on 10.70.35.156            N/A       N/A        Y       1376
Quota Daemon on 10.70.35.156                N/A       N/A        Y       1384
Snapshot Daemon on 10.70.37.154             49156     0          Y       2235
Self-heal Daemon on 10.70.37.154            N/A       N/A        Y       9321
Quota Daemon on 10.70.37.154                N/A       N/A        Y       9329

Task Status of Volume sysvol
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp35-20 ~]# gluster v info

Volume Name: sysvol
Type: Distributed-Replicate
Volume ID: 4efd4f77-85c7-4eb9-b958-6769d31d84c8
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: 10.70.35.20:/rhs/brick1/sysvol
Brick2: 10.70.37.86:/rhs/brick1/sysvol
Brick3: 10.70.35.156:/rhs/brick1/sysvol
Brick4: 10.70.37.154:/rhs/brick1/sysvol
Brick5: 10.70.35.20:/rhs/brick2/sysvol
Brick6: 10.70.37.86:/rhs/brick2/sysvol
Brick7: 10.70.35.156:/rhs/brick2/sysvol
Brick8: 10.70.37.154:/rhs/brick2/sysvol
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
features.uss: enable
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
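For reference, the per-client workload described above can be approximated with a sketch like the one below (the mount point, directory names, counts and intervals are illustrative assumptions, not the exact test script; the sosreport and top loops were run in screen sessions):

# Run on each client; MNT is the FUSE mount, e.g. mount -t glusterfs <server-ip>:/sysvol /mnt/sysvol
MNT=/mnt/sysvol
LOG=$MNT/data/same-dir-creation/dir.$(hostname -f).log

# same-path directory creation, racing against the other clients
for i in $(seq 1 100000); do
    mkdir -p "$MNT/data/same-dir-creation/dir.$i" >> "$LOG" 2>&1
done &

# dump a sosreport into the volume mount every 5 minutes
while true; do
    sosreport --batch --tmp-dir "$MNT/sosreports.$(hostname -s)" >/dev/null 2>&1
    sleep 300
done &

# append top output to a file on the mount every minute
while true; do
    top -b -n 1 >> "$MNT/top.$(hostname -s).out"
    sleep 60
done &
wait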
Client sosreports are available in the repo; they were copied with: scp -r /var/tmp/$HOSTNAME qe@rhsqe-repo:/var/www/html/sosreports/nchilaka/3.2_logs/systemic_testing_logs/regression_cycle/same_dir_create_clients/
Hitting this even on 3.8.4-12.
I went through the attached logs and could see a number of split-brain related messages, but there are also a lot of log entries related to brick unavailability. For further analysis, is it possible to get the brick logs? Also, what time did the tests start, and until when was everything working fine?
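For reference, the brick logs and the current heal/split-brain state can usually be gathered on the server nodes with something like the sketch below (assuming the default RHGS log locations and the brick paths shown in the volume info above):

# brick logs live under the default glusterfs log directory on each server
ls -l /var/log/glusterfs/bricks/

# heal and split-brain state for the volume
gluster volume heal sysvol info
gluster volume heal sysvol info split-brain

# AFR changelog xattrs for the file named in the client errors, inspected directly on the bricks
getfattr -d -m . -e hex /rhs/brick*/sysvol/data/same-dir-creation/dir.rhs-client23.lab.eng.blr.redhat.com.log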
Putting the needinfo back for the same (comment 5).
See if these server logs help; they were reported for another bz around the same day: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1409472/sosreports/
I tried to reproduce this bug locally, though I couldn't create a systemic setup with the same workload. I tried normal I/O on a 2x2 volume with 2 clients, but was not successful in reproducing the bug. Since we don't have a proper reproducer, I'm not planning to invest much more time on this. So I think the best option for now is to close this bug; if you see the problem again on the latest systemic or non-systemic setup, we can reopen it. Let me know your thoughts.
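For reference, the kind of local 2x2 reproduction attempt described above might look roughly like the sketch below (server hostnames, brick paths and mount points are illustrative assumptions):

# on the servers: create and start a 2x2 distributed-replicate volume
gluster volume create testvol replica 2 \
    server1:/bricks/b1 server2:/bricks/b1 \
    server1:/bricks/b2 server2:/bricks/b2
gluster volume start testvol

# on each of the two clients: FUSE-mount via a different server IP and race on the same paths
mount -t glusterfs server1:/testvol /mnt/testvol    # client1
mount -t glusterfs server2:/testvol /mnt/testvol    # client2
for i in $(seq 1 10000); do mkdir -p /mnt/testvol/same-dir-creation/dir.$i; done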
(In reply to Mohammed Rafi KC from comment #10)
> I tried to reproduce this bug locally, though I couldn't create a systemic
> setup with the same workload. I tried normal I/O on a 2x2 volume with 2
> clients, but was not successful in reproducing the bug.
>
> Since we don't have a proper reproducer, I'm not planning to invest much
> more time on this. So I think the best option for now is to close this bug;
> if you see the problem again on the latest systemic or non-systemic setup,
> we can reopen it.
>
> Let me know your thoughts.

Agreed; I know you tried and spent enough time on a reproducer. I will reopen this bug when I hit it again. You can go ahead and do the necessary.