Description of problem:
------------------------
4-node cluster, 3 volumes: one 2 x (4+2) disperse, one 2 x 3 and one 2 x 2. All volumes are exported via Ganesha and mounted on 4 clients, each client having more than one mount via NFSv3/v4.

Almost an hour into my workload, I see that I/O has errored out on my clients:

*********** On gqac019 ***********
Mount: 192.168.97.152:/butcher

[root@gqac019 gluster-mount]# for i in {1..25};do mkdir dir$i;cp linux-4.9.4.tar.xz dir$i;cd dir$i;tar xvfJ linux-4.9.4.tar.xz >> /tar$i.log;cd ..;done
cp: error reading ‘linux-4.9.4.tar.xz’: Input/output error
cp: failed to extend ‘dir1/linux-4.9.4.tar.xz’: Input/output error
tar: linux-4.9.4/drivers/bus/arm-cci.c: Cannot change ownership to uid 0, gid 0: Remote I/O error

********** On gqac011 **********
Mount: 192.168.97.150:/butcher

[root@gqac011 W]# for i in {1..25};do mkdir dir$i;cp linux-4.9.4.tar.xz dir$i;cd dir$i;tar xvfJ linux-4.9.4.tar.xz >> /tar$i.log;cd ..;done
cp: error reading ‘linux-4.9.4.tar.xz’: Input/output error
cp: failed to extend ‘dir1/linux-4.9.4.tar.xz’: Input/output error

********** On gqac019 **********
Mount: 192.168.97.152:/butcher

[root@gqac019 ~]# /opt/qa/tools/system_light/run.sh -w /gluster-mount -t bonnie -l /var/tmp/bonnie.log
/opt/qa/tools/system_light/scripts /opt/qa/tools/system_light /root /gluster-mount /
-----
/gluster-mount /gluster-mount/run2179/
Tests available: arequal bonnie compile_kernel coverage dbench dd ffsb fileop fs_mark fsx fuse glusterfs glusterfs_build iozone locks ltp multiple_files openssl posix_compliance postmark read_large rpc syscallbench tiobench
===========================TESTS RUNNING===========================
Changing to the specified mountpoint /gluster-mount/run2179
executing bonnie
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...Bonnie: drastic I/O error (re-write read): Stale file handle

The brick processes are up and running.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
nfs-ganesha-2.4.1-6.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.4.1-6.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-12.el7rhgs.x86_64

How reproducible:
-----------------
1/1

Actual results:
----------------
EIO on clients.
Expected results:
------------------
Zero error status on clients.

Additional info:
----------------
[root@gqas013 tmp]# gluster v status
Status of volume: butcher
Gluster process                                          TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gqas013.sbu.lab.eng.bos.redhat.com:/bricks1/brick  49153     0          Y       2486
Brick gqas008.sbu.lab.eng.bos.redhat.com:/bricks1/brick  49152     0          Y       9116
Brick gqas005.sbu.lab.eng.bos.redhat.com:/bricks1/brick  49153     0          Y       5724
Brick gqas006.sbu.lab.eng.bos.redhat.com:/bricks1/brick  49153     0          Y       6853
Brick gqas013.sbu.lab.eng.bos.redhat.com:/bricks3/brick  49154     0          Y       2513
Brick gqas008.sbu.lab.eng.bos.redhat.com:/bricks3/brick  49153     0          Y       9135
Brick gqas013.sbu.lab.eng.bos.redhat.com:/bricks2/brick  49155     0          Y       2532
Brick gqas008.sbu.lab.eng.bos.redhat.com:/bricks2/brick  49154     0          Y       9154
Brick gqas005.sbu.lab.eng.bos.redhat.com:/bricks2/brick  49154     0          Y       5743
Brick gqas006.sbu.lab.eng.bos.redhat.com:/bricks2/brick  49154     0          Y       6872
Brick gqas005.sbu.lab.eng.bos.redhat.com:/bricks3/brick  49155     0          Y       5762
Brick gqas006.sbu.lab.eng.bos.redhat.com:/bricks3/brick  49155     0          Y       6891
Self-heal Daemon on localhost                            N/A       N/A        Y       7066
Self-heal Daemon on gqas008.sbu.lab.eng.bos.redhat.com   N/A       N/A        Y       2621
Self-heal Daemon on gqas005.sbu.lab.eng.bos.redhat.com   N/A       N/A        Y       26374
Self-heal Daemon on gqas006.sbu.lab.eng.bos.redhat.com   N/A       N/A        Y       27934

Task Status of Volume butcher
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: gluster_shared_storage
Gluster process                                          TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gqas013.sbu.lab.eng.bos.redhat.com:/var/lib/glusterd/ss_brick  49152  0  Y  30998
Brick gqas005.sbu.lab.eng.bos.redhat.com:/var/lib/glusterd/ss_brick  49152  0  Y  30630
Brick gqas006.sbu.lab.eng.bos.redhat.com:/var/lib/glusterd/ss_brick  49152  0  Y  30713
Self-heal Daemon on localhost                            N/A       N/A        Y       7066
Self-heal Daemon on gqas008.sbu.lab.eng.bos.redhat.com   N/A       N/A        Y       2621
Self-heal Daemon on gqas005.sbu.lab.eng.bos.redhat.com   N/A       N/A        Y       26374
Self-heal Daemon on gqas006.sbu.lab.eng.bos.redhat.com   N/A       N/A        Y       27934

Task Status of Volume gluster_shared_storage
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: rep2
Gluster process                                          TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gqas013.sbu.lab.eng.bos.redhat.com:/bricks11/brick 49157     0          Y       3022
Brick gqas008.sbu.lab.eng.bos.redhat.com:/bricks11/brick 49156     0          Y       13329
Brick gqas005.sbu.lab.eng.bos.redhat.com:/bricks11/brick 49158     0          Y       8299
Brick gqas006.sbu.lab.eng.bos.redhat.com:/bricks11/brick 49158     0          Y       27834
Self-heal Daemon on localhost                            N/A       N/A        Y       7066
Self-heal Daemon on gqas006.sbu.lab.eng.bos.redhat.com   N/A       N/A        Y       27934
Self-heal Daemon on gqas005.sbu.lab.eng.bos.redhat.com   N/A       N/A        Y       26374
Self-heal Daemon on gqas008.sbu.lab.eng.bos.redhat.com   N/A       N/A        Y       2621

Task Status of Volume rep2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: rep3
Gluster process                                          TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gqas013.sbu.lab.eng.bos.redhat.com:/bricks9/brick  49156     0          Y       2823
Brick gqas005.sbu.lab.eng.bos.redhat.com:/bricks9/brick  49156     0          Y       7499
Brick gqas006.sbu.lab.eng.bos.redhat.com:/bricks9/brick  49156     0          Y       13758
Brick gqas008.sbu.lab.eng.bos.redhat.com:/bricks9/brick  49155     0          Y       12041
Brick gqas005.sbu.lab.eng.bos.redhat.com:/bricks10/brick 49157     0          Y       7518
Brick gqas006.sbu.lab.eng.bos.redhat.com:/bricks10/brick 49157     0          Y       13848
Self-heal Daemon on localhost                            N/A       N/A        Y       7066
Self-heal Daemon on gqas008.sbu.lab.eng.bos.redhat.com   N/A       N/A        Y       2621
Self-heal Daemon on gqas006.sbu.lab.eng.bos.redhat.com   N/A       N/A        Y       27934
Self-heal Daemon on gqas005.sbu.lab.eng.bos.redhat.com   N/A       N/A        Y       26374

Task Status of Volume rep3
------------------------------------------------------------------------------
There are no active volume tasks

[root@gqas013 tmp]#
On the nodes these clients are connected to (i.e., the nodes holding VIPs 192.168.97.152 and 192.168.97.150), I see ENOTCONN errors:

=== gqas005 ===
19/01/2017 14:34:52 : epoch 48930000 : gqas005.sbu.lab.eng.bos.redhat.com : ganesha.nfsd-28019[work-41] glusterfs_setattr2 :FSAL :CRIT :setattrs failed with error Transport endpoint is not connected

********** On gqac011 **********
Mount: 192.168.97.150:/butcher

[root@gqac011 W]# for i in {1..25};do mkdir dir$i;cp linux-4.9.4.tar.xz dir$i;cd dir$i;tar xvfJ linux-4.9.4.tar.xz >> /tar$i.log;cd ..;done
cp: error reading ‘linux-4.9.4.tar.xz’: Input/output error
cp: failed to extend ‘dir1/linux-4.9.4.tar.xz’: Input/output error

From ganesha-gfapi.log:

[2017-01-19 19:34:52.984372] I [MSGID: 122002] [ec-heal.c:2368:ec_heal_do] 0-butcher-disperse-0: /W/linux-4.9.4.tar.xz: name heal failed [Transport endpoint is not connected]

There are many ENOTCONN errors on gqas006 as well.
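For context on where that CRIT line comes from: the FSAL_GLUSTER setattr path calls into gfapi, and a lost brick/subvolume connection surfaces as ENOTCONN from the library. Below is a minimal sketch of that error path; demo_setattr and the log text are illustrative assumptions, not the actual FSAL_GLUSTER code, though glfs_h_setattrs() is the real gfapi call.

/* Build against gfapi, e.g.: cc -c demo_setattr.c $(pkg-config --cflags glusterfs-api) */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <glusterfs/api/glfs.h>
#include <glusterfs/api/glfs-handles.h>

/* Illustrative only: roughly how a setattr wrapper would turn an
 * ENOTCONN from gfapi into the "setattrs failed ..." CRIT message,
 * which Ganesha then maps to an NFS error the client sees as EIO. */
static int demo_setattr(struct glfs *fs, struct glfs_object *obj,
                        struct stat *sb, int valid)
{
        int rc = glfs_h_setattrs(fs, obj, sb, valid);
        if (rc != 0)
                fprintf(stderr, "setattrs failed with error %s\n",
                        strerror(errno));
        return rc;
}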
Pranith, can you have a look at comment 3?
This is not necessarily a multi-volume issue. I could reproduce it with a single EC volume as well, with small-file creates and kernel untars. Adjusting the summary accordingly.
[root@gqas009 gluster]# grep -i mismatch /var/log/ganesha-gfapi.log
[2017-01-22 05:34:47.319714] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 0-testvol-disperse-7: Mismatching GFID's in loc
[2017-01-22 11:18:11.939976] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 10-testvol-disperse-7: Mismatching GFID's in loc
[2017-01-22 12:35:54.265400] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 10-testvol-disperse-7: Mismatching GFID's in loc
[2017-01-22 13:03:13.471747] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 10-testvol-disperse-4: Mismatching GFID's in loc
[2017-01-22 13:04:04.590037] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 10-testvol-disperse-4: Mismatching GFID's in loc
[2017-01-22 13:10:33.799298] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 10-testvol-disperse-4: Mismatching GFID's in loc
[2017-01-22 13:11:06.763801] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 10-testvol-disperse-5: Mismatching GFID's in loc
[2017-01-22 13:43:24.219911] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 10-testvol-disperse-7: Mismatching GFID's in loc
[2017-01-22 14:35:53.688066] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 10-testvol-disperse-6: Mismatching GFID's in loc
[2017-01-23 03:09:18.811044] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 10-testvol-disperse-7: Mismatching GFID's in loc
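MSGID 122019 is logged when EC finds that the gfid carried in the loc disagrees with the gfid it resolved for the same entry. Here is a minimal sketch of that kind of consistency check, assuming only that a gfid is a 16-byte id inside loc_t (which the gdb session below confirms); gfid_check and its return convention are illustrative, not the actual ec_loc_gfid_check() source.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

typedef uint8_t gfid16[16];

/* If loc's gfid is unset, adopt the resolved one; if it is set, it must
 * match, otherwise warn (as EC does) and fail; callers turn that into EIO. */
static int gfid_check(gfid16 loc_gfid, const gfid16 resolved)
{
        static const gfid16 null_gfid = { 0 };

        if (memcmp(loc_gfid, null_gfid, sizeof(gfid16)) == 0) {
                memcpy(loc_gfid, resolved, sizeof(gfid16));
                return 1;
        }
        if (memcmp(loc_gfid, resolved, sizeof(gfid16)) != 0) {
                fprintf(stderr, "Mismatching GFID's in loc\n");
                return 0;
        }
        return 1;
}

int main(void)
{
        gfid16 loc = { 0 };                      /* unset: gets adopted */
        const gfid16 good = { 0x3b, 0x82, 0x10, 0x5e };
        printf("%d\n", gfid_check(loc, good));   /* prints 1 */
        loc[0] ^= 0xff;                          /* simulate a modified loc */
        printf("%d\n", gfid_check(loc, good));   /* prints 0, after the warning */
        return 0;
}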
Thanks to Pranith & Ashish. Below is a snippet of the debugging done on the core:

>>>>
(gdb) fr 7
#7  0x00007fde6c184df2 in ec_manager_lookup (fop=0x7fde575a5a80, state=<optimized out>) at ec-generic.c:837
837             ec_lookup_rebuild(fop->xl->private, fop, cbk);
(gdb) p *fop->req_frame->parent->parent->parent->parent
$17 = {root = 0x7fde75452350, parent = 0x7fde755344d4, frames = {next = 0x7fde755344e4, prev = 0x7fde7559c148}, local = 0x7fdc240cc420, this = 0x7fde6007ae20, ret = 0x7fde7a7de090 <default_lookup_cbk>, ref_count = 1, lock = {spinlock = 1, mutex = {__data = {__lock = 1, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\001", '\000' <repeats 38 times>, __align = 1}}, cookie = 0x7fde7555340c, complete = _gf_false, op = GF_FOP_NULL, begin = {tv_sec = 0, tv_usec = 0}, end = {tv_sec = 0, tv_usec = 0}, wind_from = 0x7fde7a804d90 <__FUNCTION__.16554> "default_lookup_resume", wind_to = 0x7fde7a803d30 "FIRST_CHILD(this)->fops->lookup", unwind_from = 0x0, unwind_to = 0x7fde7a8043cd "default_lookup_cbk"}
(gdb) p *fop->req_frame->parent->parent->parent->parent->parent
$18 = {root = 0x7fde75452350, parent = 0x7fde755929ac, frames = {next = 0x7fde755929bc, prev = 0x7fde7555341c}, local = 0x0, this = 0x7fde6007c300, ret = 0x7fde5eafbb30 <io_stats_lookup_cbk>, ref_count = 1, lock = {spinlock = 1, mutex = {__data = {__lock = 1, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\001", '\000' <repeats 38 times>, __align = 1}}, cookie = 0x7fde755344d4, complete = _gf_false, op = GF_FOP_NULL, begin = {tv_sec = 0, tv_usec = 0}, end = {tv_sec = 0, tv_usec = 0}, wind_from = 0x7fde5eb04ec0 <__FUNCTION__.16584> "io_stats_lookup", wind_to = 0x7fde5eb037b8 "FIRST_CHILD(this)->fops->lookup", unwind_from = 0x0, unwind_to = 0x7fde5eb02997 "io_stats_lookup_cbk"}
(gdb) p *fop->req_frame->parent->parent->parent->parent
$19 = {root = 0x7fde75452350, parent = 0x7fde755344d4, frames = {next = 0x7fde755344e4, prev = 0x7fde7559c148}, local = 0x7fdc240cc420, this = 0x7fde6007ae20, ret = 0x7fde7a7de090 <default_lookup_cbk>, ref_count = 1, lock = {spinlock = 1, mutex = {__data = {__lock = 1, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\001", '\000' <repeats 38 times>, __align = 1}}, cookie = 0x7fde7555340c, complete = _gf_false, op = GF_FOP_NULL, begin = {tv_sec = 0, tv_usec = 0}, end = {tv_sec = 0, tv_usec = 0}, wind_from = 0x7fde7a804d90 <__FUNCTION__.16554> "default_lookup_resume", wind_to = 0x7fde7a803d30 "FIRST_CHILD(this)->fops->lookup", unwind_from = 0x0, unwind_to = 0x7fde7a8043cd "default_lookup_cbk"}
(gdb) p *fop->req_frame->parent->parent->parent->parent->this->name
$20 = 98 'b'
(gdb) p fop->req_frame->parent->parent->parent->parent->this->name
$21 = 0x7fde6007b9c0 "butcher-md-cache"
(gdb) p (mdc_local_t*)fop->req_frame->parent->parent->parent->parent->local
$22 = (mdc_local_t *) 0x7fdc240cc420
(gdb) p *$22->loc
Structure has no component named operator*.
(gdb) p $22->loc
$23 = {path = 0x7fdc24192880 "<gfid:9d7aeb82-3c80-4316-8128-bba593c65f5c>/..", name = 0x7fdc241928ac "..", inode = 0x7fde4d02a9ec, parent = 0x7fde4d355240, gfid = '\000' <repeats 15 times>, pargfid = "\235z\353\202<\200C\026\201(\273\245\223\306_\\"}
(gdb) p *$22->loc.inode
$24 = {table = 0x7fde607fb710, gfid = '\000' <repeats 15 times>, lock = {spinlock = 1, mutex = {__data = {__lock = 1, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\001", '\000' <repeats 38 times>, __align = 1}}, nlookup = 0, fd_count = 0, ref = 28, ia_type = IA_INVAL, fd_list = {next = 0x7fde4d02aa44, prev = 0x7fde4d02aa44}, dentry_list = {next = 0x7fde4d02aa54, prev = 0x7fde4d02aa54}, hash = {next = 0x7fde4d02aa64, prev = 0x7fde4d02aa64}, list = {next = 0x7fde4d28d9b8, prev = 0x7fde4d27befc}, _ctx = 0x7fdcac0191e0}
(gdb) p (mdc_local_t*)fop->req_frame->parent->parent->parent
$25 = (mdc_local_t *) 0x7fde7559c138
(gdb) p fop->req_frame->parent->parent->parent
$26 = (call_frame_t *) 0x7fde7559c138
(gdb) p *fop->req_frame->parent->parent->parent
$27 = {root = 0x7fde75452350, parent = 0x7fde7555340c, frames = {next = 0x7fde7555341c, prev = 0x7fde7558a6e4}, local = 0x7fde4d02a9ec, this = 0x7fde600784a0, ret = 0x7fde5ef22600 <mdc_lookup_cbk>, ref_count = 1, lock = {spinlock = 1, mutex = {__data = {__lock = 1, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\001", '\000' <repeats 38 times>, __align = 1}}, cookie = 0x7fde7559c138, complete = _gf_false, op = GF_FOP_NULL, begin = {tv_sec = 0, tv_usec = 0}, end = {tv_sec = 0, tv_usec = 0}, wind_from = 0x7fde5ef2856f <__FUNCTION__.15558> "mdc_lookup", wind_to = 0x7fde7a803d30 "FIRST_CHILD(this)->fops->lookup", unwind_from = 0x0, unwind_to = 0x7fde5ef27709 "mdc_lookup_cbk"}
(gdb) p *fop->req_frame->parent->parent->parent->this->name
$28 = 98 'b'
(gdb) p fop->req_frame->parent->parent->parent->this->name
$29 = 0x7fde60079040 "butcher-quick-read"
(gdb) p fop->req_frame->parent->parent
$30 = (call_frame_t *) 0x7fde7558a6d4
(gdb) p *fop->req_frame->parent->parent
$31 = {root = 0x7fde75452350, parent = 0x7fde7559c138, frames = {next = 0x7fde7559c148, prev = 0x7fde754d2f30}, local = 0x7fde6009ad54, this = 0x7fde60076d80, ret = 0x7fde5f33ab30 <qr_lookup_cbk>, ref_count = 1, lock = {spinlock = 1, mutex = {__data = {__lock = 1, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\001", '\000' <repeats 38 times>, __align = 1}}, cookie = 0x7fde7558a6d4, complete = _gf_false, op = GF_FOP_NULL, begin = {tv_sec = 0, tv_usec = 0}, end = {tv_sec = 0, tv_usec = 0}, wind_from = 0x7fde5f33cae0 <__FUNCTION__.16084> "qr_lookup", wind_to = 0x7fde5f33c4c8 "FIRST_CHILD(this)->fops->lookup", unwind_from = 0x0, unwind_to = 0x7fde5f33c7a4 "qr_lookup_cbk"}
(gdb) p *fop->req_frame->parent->parent->this->name
$32 = 98 'b'
(gdb) p fop->req_frame->parent->parent->this->name
$33 = 0x7fde60076c00 "butcher-io-cache"
(gdb) p ((ioc_local_t*)fop->req_frame->parent->parent->local)->file_loc
$34 = {path = 0x7fdc240798a0 "<gfid:9d7aeb82-3c80-4316-8128-bba593c65f5c>/..", name = 0x7fdc240798cc "..", inode = 0x7fde4d02a9ec, parent = 0x7fde4d355240, gfid = '\000' <repeats 15 times>, pargfid = "\235z\353\202<\200C\026\201(\273\245\223\306_\\"}
(gdb) p *fop->req_frame->parent
$35 = {root = 0x7fde75452350, parent = 0x7fde7558a6d4, frames = {next = 0x7fde7558a6e4, prev = 0x7fde754ce3d4}, local = 0x7fde5de5b3cc, this = 0x7fde60070d00, ret = 0x7fde5f546c60 <ioc_lookup_cbk>, ref_count = 3, lock = {spinlock = 1, mutex = {__data = {__lock = 1, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\001", '\000' <repeats 38 times>, __align = 1}}, cookie = 0x7fde754d2f20, complete = _gf_false, op = GF_FOP_NULL, begin = {tv_sec = 0, tv_usec = 0}, end = {tv_sec = 0, tv_usec = 0}, wind_from = 0x7fde5f54ecd7 <__FUNCTION__.16334> "ioc_lookup", wind_to = 0x7fde7a803d30 "FIRST_CHILD(this)->fops->lookup", unwind_from = 0x0, unwind_to = 0x7fde5f54df48 "ioc_lookup_cbk"}
(gdb) p fop->req_frame->parent->this->name
$36 = 0x7fde60062720 "butcher-dht"
(gdb) p (dht_local_t*)fop->req_frame->parent->local
$37 = (dht_local_t *) 0x7fde5de5b3cc
(gdb) p $37->loc
$38 = {path = 0x7fdc240420d0 "<gfid:9d7aeb82-3c80-4316-8128-bba593c65f5c>/..", name = 0x7fdc240420fc "..", inode = 0x7fde4d02a9ec, parent = 0x7fde4d355240, gfid = ";\202\020^@eC\024\240\306\b\365l,\270V", pargfid = "\235z\353\202<\200C\026\201(\273\245\223\306_\\"}
(gdb) p /x $37->loc.gfid
$39 = {0x3b, 0x82, 0x10, 0x5e, 0x40, 0x65, 0x43, 0x14, 0xa0, 0xc6, 0x8, 0xf5, 0x6c, 0x2c, 0xb8, 0x56}
(gdb) fr 7
#7  0x00007fde6c184df2 in ec_manager_lookup (fop=0x7fde575a5a80, state=<optimized out>) at ec-generic.c:837
837             ec_lookup_rebuild(fop->xl->private, fop, cbk);
(gdb) p /x fop->loc[0].gfid
$40 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x43, 0x14, 0xa0, 0xc6, 0x8, 0xf5, 0x6c, 0x2c, 0xb8, 0x56}
(gdb) fr 6
#6  0x00007fde6c184b85 in ec_lookup_rebuild (ec=<optimized out>, fop=fop@entry=0x7fde575a5a80, cbk=cbk@entry=0x7fde56dd8aa8) at ec-generic.c:644
644         err = ec_loc_update(fop->xl, &fop->loc[0], cbk->inode, &cbk->iatt[0]);
(gdb) p /x cbk->iatt[0].ia_gfid
$41 = {0x3b, 0x82, 0x10, 0x5e, 0x40, 0x65, 0x43, 0x14, 0xa0, 0xc6, 0x8, 0xf5, 0x6c, 0x2c, 0xb8, 0x56}
(gdb) p /x $37->gfid
$42 = {0x3b, 0x82, 0x10, 0x5e, 0x40, 0x65, 0x43, 0x14, 0xa0, 0xc6, 0x8, 0xf5, 0x6c, 0x2c, 0xb8, 0x56}
(gdb) p *fop->req_frame
$43 = {root = 0x7fde75452350, parent = 0x7fde754d2f20, frames = {next = 0x7fde75526d94, prev = 0x7fde75546864}, local = 0x0, this = 0x7fde6006c320, ret = 0x7fde5fdafd90 <dht_lookup_dir_cbk>, ref_count = 0, lock = {spinlock = 1, mutex = {__data = {__lock = 1, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\001", '\000' <repeats 38 times>, __align = 1}}, cookie = 0x7fde75503c0c, complete = _gf_false, op = GF_FOP_NULL, begin = {tv_sec = 0, tv_usec = 0}, end = {tv_sec = 0, tv_usec = 0}, wind_from = 0x7fde5fdee590 <__FUNCTION__.18890> "dht_lookup_directory", wind_to = 0x7fde5fde9528 "conf->subvolumes[i]->fops->lookup", unwind_from = 0x0, unwind_to = 0x7fde5fdeb3cd "dht_lookup_dir_cbk"}
(gdb) shell
[root@gqas013 tmp]# gluster v info butcher

Volume Name: butcher
Type: Distributed-Disperse
Volume ID: 5be704df-e1e1-43a2-8f67-5bf720efe9e5
Status: Started
Snapshot Count: 0
Number of Bricks: 12 x (4 + 2) = 72
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick2: gqas014.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick3: gqas015.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick4: gqas005.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick5: gqas006.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick6: gqas008.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick7: gqas013.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick8: gqas014.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick9: gqas015.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick10: gqas005.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick11: gqas006.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick12: gqas008.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick13: gqas013.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick14: gqas014.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick15: gqas015.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick16: gqas005.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick17: gqas006.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick18: gqas008.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick19: gqas013.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick20: gqas014.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick21: gqas015.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick22: gqas005.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick23: gqas006.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick24: gqas008.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick25: gqas013.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick26: gqas014.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick27: gqas015.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick28: gqas005.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick29: gqas006.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick30: gqas008.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick31: gqas013.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick32: gqas014.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick33: gqas015.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick34: gqas005.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick35: gqas006.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick36: gqas008.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick37: gqas013.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick38: gqas014.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick39: gqas015.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick40: gqas005.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick41: gqas006.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick42: gqas008.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick43: gqas013.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick44: gqas014.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick45: gqas015.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick46: gqas005.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick47: gqas006.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick48: gqas008.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick49: gqas013.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick50: gqas014.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick51: gqas015.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick52: gqas005.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick53: gqas006.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick54: gqas008.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick55: gqas013.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick56: gqas014.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick57: gqas015.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick58: gqas005.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick59: gqas006.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick60: gqas008.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick61: gqas013.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick62: gqas014.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick63: gqas015.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick64: gqas005.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick65: gqas006.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick66: gqas008.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick67: gqas013.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick68: gqas014.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick69: gqas015.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick70: gqas005.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick71: gqas006.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick72: gqas008.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
nfs.disable: on
performance.readdir-ahead: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
[root@gqas013 tmp]# exit
(gdb) p *fop->req_frame
$44 = {root = 0x7fde75452350, parent = 0x7fde754d2f20, frames = {next = 0x7fde75526d94, prev = 0x7fde75546864}, local = 0x0, this = 0x7fde6006c320, ret = 0x7fde5fdafd90 <dht_lookup_dir_cbk>, ref_count = 0, lock = {spinlock = 1, mutex = {__data = {__lock = 1, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\001", '\000' <repeats 38 times>, __align = 1}}, cookie = 0x7fde75503c0c, complete = _gf_false, op = GF_FOP_NULL, begin = {tv_sec = 0, tv_usec = 0}, end = {tv_sec = 0, tv_usec = 0}, wind_from = 0x7fde5fdee590 <__FUNCTION__.18890> "dht_lookup_directory", wind_to = 0x7fde5fde9528 "conf->subvolumes[i]->fops->lookup", unwind_from = 0x0, unwind_to = 0x7fde5fdeb3cd "dht_lookup_dir_cbk"}
<<<<

Pranith observed that loc->gfid had been set in the dht layer but got modified by the time ec was validating it in ec_lookup_cbk. He should be able to provide more details on the code flow under which this issue arises.
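Note what the session shows: $39 (dht's loc.gfid) and $41/$42 are fully populated, while $40 (the same fop's loc[0].gfid, read a moment later in EC) has its first six bytes zeroed, i.e. the shared buffer changed between dht filling it in and EC validating it. Below is a minimal, self-contained sketch of that hazard and of the defensive-copy pattern that avoids a torn comparison; it illustrates the failure mode only and is not the upstream fix.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint8_t gfid16[16];

static gfid16 shared_gfid;  /* stands in for the loc->gfid both layers see */

static void *wiper(void *arg)
{
        (void)arg;
        /* Zero the leading bytes, mirroring $40 where the first six bytes
         * of fop->loc[0].gfid were 0x0 while the tail survived. */
        memset(shared_gfid, 0, 6);
        return NULL;
}

int main(void)
{
        const gfid16 expected = { 0x3b, 0x82, 0x10, 0x5e, 0x40, 0x65,
                                  0x43, 0x14, 0xa0, 0xc6, 0x08, 0xf5,
                                  0x6c, 0x2c, 0xb8, 0x56 };  /* gfid from $39 */
        memcpy(shared_gfid, expected, sizeof(expected));

        /* Defensive pattern: snapshot before validating, so a concurrent
         * writer cannot change the bytes under the comparison. */
        gfid16 snapshot;
        memcpy(snapshot, shared_gfid, sizeof(snapshot));

        pthread_t t;
        pthread_create(&t, NULL, wiper, NULL);
        pthread_join(&t, NULL);

        printf("live matches: %d, snapshot matches: %d\n",
               memcmp(shared_gfid, expected, 16) == 0,
               memcmp(snapshot, expected, 16) == 0);
        return 0;
}

Built with -lpthread, this prints "live matches: 0, snapshot matches: 1": the live buffer fails validation exactly the way EC's gfid check did, while the private copy taken earlier is still intact.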
Patch posted upstream: https://review.gluster.org/16986
I could not reproduce it in two attempts on 3.8.4-25. Since it's a stress test, I'll be revisiting it during 3.3 and will reopen if I hit it again. Moving this to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774