Description of problem:
=======================
Observed a crash while performing IO from multiple clients on a tiered volume.
The volume has a distributed-disperse (EC) cold tier and a distributed-replicate hot tier.

Backtrace:
(gdb) bt
#0  dht_layout_search (this=0x7ffa24133c80, layout=0x0, name=0x7ffa26ec1708 ".") at dht-layout.c:171
#1  0x00007ffa2b13f2c8 in dht_readdirp_cbk (frame=0x7ffa36c52750, cookie=0x7ffa36c2bc84, this=0x7ffa24133c80, op_ret=4, op_errno=2, orig_entries=0x7ffa2c029900, xdata=0x7ffa366434e8) at dht-common.c:4654
#2  0x00007ffa2b38dfd4 in afr_readdir_cbk (frame=<optimized out>, cookie=<optimized out>, this=<optimized out>, op_ret=4, op_errno=2, subvol_entries=<optimized out>, xdata=0x7ffa366434e8) at afr-dir-read.c:238
#3  0x00007ffa2b5fadca in client3_3_readdirp_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7ffa36c3b1c0) at client-rpc-fops.c:2671
#4  0x00007ffa38ec2b20 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7ffa24905fa0, pollin=pollin@entry=0x7ff9b2beadb0) at rpc-clnt.c:766
#5  0x00007ffa38ec2ddf in rpc_clnt_notify (trans=<optimized out>, mydata=0x7ffa24905fd0, event=<optimized out>, data=0x7ff9b2beadb0) at rpc-clnt.c:907
#6  0x00007ffa38ebe913 in rpc_transport_notify (this=this@entry=0x7ffa24915ca0, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7ff9b2beadb0) at rpc-transport.c:545
#7  0x00007ffa2dcfc4b6 in socket_event_poll_in (this=this@entry=0x7ffa24915ca0) at socket.c:2236
#8  0x00007ffa2dcff3a4 in socket_event_handler (fd=fd@entry=63, idx=idx@entry=51, data=0x7ffa24915ca0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2349
#9  0x00007ffa391558ca in event_dispatch_epoll_handler (event=0x7ffa2c029e80, event_pool=0x7ffa3a4add10) at event-epoll.c:575
#10 event_dispatch_epoll_worker (data=0x7ffa3a4fa320) at event-epoll.c:678
#11 0x00007ffa37f5cdc5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007ffa378a321d in clone () from /lib64/libc.so.6
(gdb)

Volume info:
============
[root@rhs-client17 ~]# gluster v info ec_tier

Volume Name: ec_tier
Type: Tier
Volume ID: 84855431-e6cf-41e9-9cfc-7a735f2685ed
Status: Started
Number of Bricks: 44
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 4 x 2 = 8
Brick1: 10.70.35.77:/rhs/brick4/ec-ht8
Brick2: 10.70.35.191:/rhs/brick4/ec-ht7
Brick3: 10.70.35.202:/rhs/brick4/ec-ht6
Brick4: 10.70.35.49:/rhs/brick4/ec-ht5
Brick5: 10.70.36.41:/rhs/brick4/ec-ht4
Brick6: 10.70.35.196:/rhs/brick4/ec-ht3
Brick7: 10.70.35.38:/rhs/brick4/ec-ht2
Brick8: dhcp35-153.lab.eng.blr.redhat.com:/rhs/brick4/ec-ht1
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 3 x (8 + 4) = 36
Brick9: dhcp35-153.lab.eng.blr.redhat.com:/rhs/brick1/ec
Brick10: 10.70.35.38:/rhs/brick1/ec
Brick11: 10.70.35.196:/rhs/brick1/ec
Brick12: 10.70.36.41:/rhs/brick1/ec
Brick13: 10.70.35.49:/rhs/brick1/ec
Brick14: 10.70.35.202:/rhs/brick1/ec
Brick15: 10.70.35.191:/rhs/brick1/ec
Brick16: 10.70.35.77:/rhs/brick1/ec
Brick17: 10.70.35.98:/rhs/brick1/ec
Brick18: 10.70.35.132:/rhs/brick1/ec
Brick19: 10.70.35.35:/rhs/brick1/ec
Brick20: 10.70.35.51:/rhs/brick1/ec
Brick21: 10.70.35.138:/rhs/brick1/ec
Brick22: 10.70.35.122:/rhs/brick1/ec
Brick23: 10.70.36.43:/rhs/brick1/ec
Brick24: 10.70.36.42:/rhs/brick1/ec
Brick25: dhcp35-153.lab.eng.blr.redhat.com:/rhs/brick2/ec
Brick26: 10.70.35.38:/rhs/brick2/ec
Brick27: 10.70.35.196:/rhs/brick2/ec
Brick28: 10.70.36.41:/rhs/brick2/ec
Brick29: 10.70.35.49:/rhs/brick2/ec
Brick30: 10.70.35.202:/rhs/brick2/ec
Brick31: 10.70.35.191:/rhs/brick2/ec
Brick32: 10.70.35.77:/rhs/brick2/ec
Brick33: 10.70.35.98:/rhs/brick2/ec
Brick34: 10.70.35.132:/rhs/brick2/ec
Brick35: 10.70.35.35:/rhs/brick2/ec
Brick36: 10.70.35.51:/rhs/brick2/ec
Brick37: 10.70.35.138:/rhs/brick2/ec
Brick38: 10.70.35.122:/rhs/brick2/ec
Brick39: 10.70.36.43:/rhs/brick2/ec
Brick40: 10.70.36.42:/rhs/brick2/ec
Brick41: dhcp35-153.lab.eng.blr.redhat.com:/rhs/brick3/ec
Brick42: 10.70.35.38:/rhs/brick3/ec
Brick43: 10.70.35.196:/rhs/brick3/ec
Brick44: 10.70.36.41:/rhs/brick3/ec
Options Reconfigured:
diagnostics.brick-log-level: INFO
features.scrub-freq: hourly
nfs.outstanding-rpc-limit: 0
features.scrub: Inactive
features.bitrot: off
features.barrier: disable
cluster.tier-mode: cache
features.ctr-enabled: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
features.uss: on
performance.readdir-ahead: on
[root@rhs-client17 ~]#

Version-Release number of selected component (if applicable):
=============================================================
3.7.5.19

How reproducible:
=================
Seen once

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
================
Core file will be copied to repo.
Another crash on 10.70.35.153

BT:
(gdb) bt
#0  dht_layout_ref (this=0x7f373c133c80, layout=layout@entry=0x0) at dht-layout.c:149
#1  0x00007f3742ebc2db in dht_selfheal_restore (frame=frame@entry=0x7f374ea304cc, dir_cbk=dir_cbk@entry=0x7f3742ec4fa0 <dht_rmdir_selfheal_cbk>, loc=loc@entry=0x7f37325e0c74, layout=0x0) at dht-selfheal.c:1934
#2  0x00007f3742eca6e2 in dht_rmdir_hashed_subvol_cbk (frame=0x7f374ea304cc, cookie=0x7f374e9bd738, this=0x7f373c133c80, op_ret=-1, op_errno=39, preparent=0x7f37318ec04c, postparent=0x7f37318ec0bc, xdata=0x0) at dht-common.c:6788
#3  0x00007f374311bcd7 in afr_rmdir_unwind (frame=<optimized out>, this=<optimized out>) at afr-dir-write.c:1339
#4  0x00007f374311d619 in __afr_dir_write_cbk (frame=0x7f374e9c13b0, cookie=<optimized out>, this=0x7f373c1320e0, op_ret=<optimized out>, op_errno=<optimized out>, buf=buf@entry=0x0, preparent=0x7f3743db5930, postparent=postparent@entry=0x7f3743db59a0, preparent2=preparent2@entry=0x0, postparent2=postparent2@entry=0x0, xdata=xdata@entry=0x0) at afr-dir-write.c:246
#5  0x00007f374311d816 in afr_rmdir_wind_cbk (frame=<optimized out>, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, preparent=<optimized out>, postparent=0x7f3743db59a0, xdata=0x0) at afr-dir-write.c:1351
#6  0x00007f374339a7e1 in client3_3_rmdir_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f374e9e85e0) at client-rpc-fops.c:729
#7  0x00007f3750c4eb20 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f373c8917e0, pollin=pollin@entry=0x7f36801731a0) at rpc-clnt.c:766
#8  0x00007f3750c4eddf in rpc_clnt_notify (trans=<optimized out>, mydata=0x7f373c891810, event=<optimized out>, data=0x7f36801731a0) at rpc-clnt.c:907
#9  0x00007f3750c4a913 in rpc_transport_notify (this=this@entry=0x7f373c8a14e0, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f36801731a0) at rpc-transport.c:545
#10 0x00007f3745a884b6 in socket_event_poll_in (this=this@entry=0x7f373c8a14e0) at socket.c:2236
#11 0x00007f3745a8b3a4 in socket_event_handler (fd=fd@entry=82, idx=idx@entry=99, data=0x7f373c8a14e0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2349
#12 0x00007f3750ee18ca in event_dispatch_epoll_handler (event=0x7f3743db5e80, event_pool=0x7f3752a66d10) at event-epoll.c:575
#13 event_dispatch_epoll_worker (data=0x7f3752ab3260) at event-epoll.c:678
#14 0x00007f374fce8dc5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007f374f62f21d in clone () from /lib64/libc.so.6
(gdb) q

Core file:
[root@dhcp35-153 ~]# ll /var/log/core/core.22231.1455016922.dump
-rw-------. 1 root root 5343039488 Feb  9 16:53 /var/log/core/core.22231.1455016922.dump
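Both backtraces show DHT being handed a NULL layout (layout=0x0). If it helps, the NULL argument can be confirmed directly from the core above; this is only a minimal sketch, and the binary path is an assumption (run `file` on the core first to see which process actually dumped it):

[root@dhcp35-153 ~]# file /var/log/core/core.22231.1455016922.dump
[root@dhcp35-153 ~]# gdb --batch \
    -ex 'bt' \
    -ex 'frame 0' \
    -ex 'info args' \
    -ex 'print layout' \
    /usr/sbin/glusterfs /var/log/core/core.22231.1455016922.dump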
IO: dd (1 GB files), Linux untar, and deletes (rm -rf).
I am unable to recreate this using the procedure in comment #1. Can QE give us a way to reproduce it reliably?

If it is related to the server that was restarted when the LVM pool became full (comment #7), why did the LVM pool become full, and is gluster resilient to such situations?
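Could QE also confirm on the brick nodes whether the thin pool actually ran full at the time of the crash? A rough sketch of the check (the VG name is a placeholder, adjust to the actual brick layout; data_percent near 100 would indicate a full pool):

# lvs -a -o lv_name,pool_lv,data_percent,metadata_percent <brick_vg>
# df -h /rhs/brick1 /rhs/brick2 /rhs/brick3 /rhs/brick4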
This occurs during 'rm' operations:
1. Run multiple Linux untar instances (3-4) and delete them with "rm -rf".
2. Continue dd (varying block sizes, creates) from another client.

A rough sketch of this workload is included below.

Removing the needinfo. Let me know if you need any help.
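Rough sketch of the workload (the mount point and tarball path are assumptions, not taken from this report):

# Client 1: 3-4 parallel untars of a kernel tarball, each followed by rm -rf
for i in 1 2 3 4; do
    ( mkdir -p /mnt/ec_tier/untar.$i
      tar -xf /root/linux.tar.xz -C /mnt/ec_tier/untar.$i
      rm -rf /mnt/ec_tier/untar.$i ) &
done
wait

# Client 2 (run in parallel from another mount): ~1 GB dd creates with varying block sizes
dd if=/dev/zero of=/mnt/ec_tier/file.1 bs=1M  count=1024 &
dd if=/dev/zero of=/mnt/ec_tier/file.2 bs=64k count=16384 &
dd if=/dev/zero of=/mnt/ec_tier/file.3 bs=4k  count=262144 &
wait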
Do we know if this is related to the full-LVM-pool issue suggested in comment #7 and comment #8? Has this been reproduced on a normal volume?
As there is insufficient information to debug this issue now, the initial analysis indicates that the crashes are a side effect of the gfid mismatch seen because of a full brick, and, per comment #7, QE was unable to reproduce the crash on a clean volume, I am closing this as "WorksForMe".

Please file a new BZ if this is seen with the latest builds.