Bug 1415038
| Summary: | [Ganesha + EC] : Input/Output Error while creating LOTS of smallfiles | |||
|---|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Ambarish <asoman> | |
| Component: | distribute | Assignee: | Pranith Kumar K <pkarampu> | |
| Status: | CLOSED ERRATA | QA Contact: | Ambarish <asoman> | |
| Severity: | high | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | rhgs-3.2 | CC: | amukherj, asoman, aspandey, asrivast, bturner, jthottan, pkarampu, rcyriac, rhinduja, rhs-bugs, skoduri, storage-qa-internal | |
| Target Milestone: | --- | |||
| Target Release: | RHGS 3.3.0 | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | glusterfs-3.8.4-23 | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1438411 | Environment: | ||
| Last Closed: | 2017-09-21 04:30:55 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1417147, 1438411, 1438423, 1438424 | |||
On the nodes those clients are connected to (i.e., the ones with VIPs 192.168.97.152 and 192.168.97.150), I see ENOTCONN errors:
=== gqas005 ===
19/01/2017 14:34:52 : epoch 48930000 : gqas005.sbu.lab.eng.bos.redhat.com : ganesha.nfsd-28019[work-41] glusterfs_setattr2 :FSAL :CRIT :setattrs failed with error Transport endpoint is not connected
**********
On gqac011
**********
Mount : 192.168.97.150:/butcher
[root@gqac011 W]# for i in {1..25};do mkdir dir$i;cp linux-4.9.4.tar.xz dir$i;cd dir$i;tar xvfJ linux-4.9.4.tar.xz >> /tar$i.log;cd ..;done
cp: error reading ‘linux-4.9.4.tar.xz’: Input/output error
cp: failed to extend ‘dir1/linux-4.9.4.tar.xz’: Input/output error
From ganesha-gfapi.log:
[2017-01-19 19:34:52.984372] I [MSGID: 122002] [ec-heal.c:2368:ec_heal_do] 0-butcher-disperse-0: /W/linux-4.9.4.tar.xz: name heal failed [Transport endpoint is not connected]
There are many ENOTCONN errors on gqas006 as well.
This is not necessarily multivolume. I could reproduce it with one EC volume as well, with small-file creates and kernel untars. Adjusting the summary accordingly.

[root@gqas009 gluster]# grep -i mismatch /var/log/ganesha-gfapi.log
[2017-01-22 05:34:47.319714] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 0-testvol-disperse-7: Mismatching GFID's in loc
[2017-01-22 11:18:11.939976] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 10-testvol-disperse-7: Mismatching GFID's in loc
[2017-01-22 12:35:54.265400] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 10-testvol-disperse-7: Mismatching GFID's in loc
[2017-01-22 13:03:13.471747] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 10-testvol-disperse-4: Mismatching GFID's in loc
[2017-01-22 13:04:04.590037] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 10-testvol-disperse-4: Mismatching GFID's in loc
[2017-01-22 13:10:33.799298] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 10-testvol-disperse-4: Mismatching GFID's in loc
[2017-01-22 13:11:06.763801] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 10-testvol-disperse-5: Mismatching GFID's in loc
[2017-01-22 13:43:24.219911] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 10-testvol-disperse-7: Mismatching GFID's in loc
[2017-01-22 14:35:53.688066] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 10-testvol-disperse-6: Mismatching GFID's in loc
[2017-01-23 03:09:18.811044] W [MSGID: 122019] [ec-helpers.c:354:ec_loc_gfid_check] 10-testvol-disperse-7: Mismatching GFID's in loc

Thanks to Pranith & Ashish.
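For context on that warning, the following is a minimal, hypothetical C sketch of the kind of GFID consistency check that logs "Mismatching GFID's in loc": if the loc already carries a GFID and it does not match the GFID resolved for the inode, the lookup is rejected. The names (gfid_check, GFID_SIZE) and the main() driver are illustrative only, not the actual ec-helpers.c code.

```c
#include <stdio.h>
#include <string.h>
#include <stdbool.h>
#include <stdint.h>

#define GFID_SIZE 16

static bool gfid_is_null(const uint8_t *gfid)
{
    static const uint8_t zero[GFID_SIZE] = {0};
    return memcmp(gfid, zero, GFID_SIZE) == 0;
}

/* Returns true if the loc's GFID is consistent with the resolved GFID,
 * adopting the resolved value when the loc's GFID is still unset. */
static bool gfid_check(uint8_t *loc_gfid, const uint8_t *resolved_gfid)
{
    if (gfid_is_null(loc_gfid)) {
        memcpy(loc_gfid, resolved_gfid, GFID_SIZE);
        return true;
    }
    if (memcmp(loc_gfid, resolved_gfid, GFID_SIZE) != 0) {
        fprintf(stderr, "W [MSGID: 122019] Mismatching GFID's in loc\n");
        return false;
    }
    return true;
}

int main(void)
{
    /* Resolved GFID taken from the core below; the loc copy is simulated
     * as partially overwritten (first six bytes zeroed). */
    uint8_t resolved[GFID_SIZE] = {0x3b, 0x82, 0x10, 0x5e, 0x40, 0x65, 0x43, 0x14,
                                   0xa0, 0xc6, 0x08, 0xf5, 0x6c, 0x2c, 0xb8, 0x56};
    uint8_t loc_gfid[GFID_SIZE];

    memcpy(loc_gfid, resolved, GFID_SIZE);
    memset(loc_gfid, 0, 6);
    printf("lookup loc usable: %s\n", gfid_check(loc_gfid, resolved) ? "yes" : "no");
    return 0;
}
```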
Below is a snippet of the debugging done on the core:
>>>>
(gdb) fr 7
#7 0x00007fde6c184df2 in ec_manager_lookup (fop=0x7fde575a5a80, state=<optimized out>) at ec-generic.c:837
837 ec_lookup_rebuild(fop->xl->private, fop, cbk);
(gdb) p *fop->req_frame->parent->parent->parent->parent
$17 = {root = 0x7fde75452350, parent = 0x7fde755344d4, frames = {next = 0x7fde755344e4, prev = 0x7fde7559c148}, local = 0x7fdc240cc420, this = 0x7fde6007ae20, ret = 0x7fde7a7de090 <default_lookup_cbk>, ref_count = 1, lock = {
spinlock = 1, mutex = {__data = {__lock = 1, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\001", '\000' <repeats 38 times>, __align = 1}},
cookie = 0x7fde7555340c, complete = _gf_false, op = GF_FOP_NULL, begin = {tv_sec = 0, tv_usec = 0}, end = {tv_sec = 0, tv_usec = 0}, wind_from = 0x7fde7a804d90 <__FUNCTION__.16554> "default_lookup_resume",
wind_to = 0x7fde7a803d30 "FIRST_CHILD(this)->fops->lookup", unwind_from = 0x0, unwind_to = 0x7fde7a8043cd "default_lookup_cbk"}
(gdb) p *fop->req_frame->parent->parent->parent->parent->parent
$18 = {root = 0x7fde75452350, parent = 0x7fde755929ac, frames = {next = 0x7fde755929bc, prev = 0x7fde7555341c}, local = 0x0, this = 0x7fde6007c300, ret = 0x7fde5eafbb30 <io_stats_lookup_cbk>, ref_count = 1, lock = {spinlock = 1,
mutex = {__data = {__lock = 1, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\001", '\000' <repeats 38 times>, __align = 1}}, cookie = 0x7fde755344d4,
complete = _gf_false, op = GF_FOP_NULL, begin = {tv_sec = 0, tv_usec = 0}, end = {tv_sec = 0, tv_usec = 0}, wind_from = 0x7fde5eb04ec0 <__FUNCTION__.16584> "io_stats_lookup", wind_to = 0x7fde5eb037b8 "FIRST_CHILD(this)->fops->lookup",
unwind_from = 0x0, unwind_to = 0x7fde5eb02997 "io_stats_lookup_cbk"}
(gdb) p *fop->req_frame->parent->parent->parent->parent
$19 = {root = 0x7fde75452350, parent = 0x7fde755344d4, frames = {next = 0x7fde755344e4, prev = 0x7fde7559c148}, local = 0x7fdc240cc420, this = 0x7fde6007ae20, ret = 0x7fde7a7de090 <default_lookup_cbk>, ref_count = 1, lock = {
spinlock = 1, mutex = {__data = {__lock = 1, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\001", '\000' <repeats 38 times>, __align = 1}},
cookie = 0x7fde7555340c, complete = _gf_false, op = GF_FOP_NULL, begin = {tv_sec = 0, tv_usec = 0}, end = {tv_sec = 0, tv_usec = 0}, wind_from = 0x7fde7a804d90 <__FUNCTION__.16554> "default_lookup_resume",
wind_to = 0x7fde7a803d30 "FIRST_CHILD(this)->fops->lookup", unwind_from = 0x0, unwind_to = 0x7fde7a8043cd "default_lookup_cbk"}
(gdb) p *fop->req_frame->parent->parent->parent->parent->this->name
$20 = 98 'b'
(gdb) p fop->req_frame->parent->parent->parent->parent->this->name
$21 = 0x7fde6007b9c0 "butcher-md-cache"
(gdb) p (mdc_local_t*)fop->req_frame->parent->parent->parent->parent->local
$22 = (mdc_local_t *) 0x7fdc240cc420
(gdb) p *$22->loc
Structure has no component named operator*.
(gdb) p $22->loc
$23 = {path = 0x7fdc24192880 "<gfid:9d7aeb82-3c80-4316-8128-bba593c65f5c>/..", name = 0x7fdc241928ac "..", inode = 0x7fde4d02a9ec, parent = 0x7fde4d355240, gfid = '\000' <repeats 15 times>,
pargfid = "\235z\353\202<\200C\026\201(\273\245\223\306_\\"}
(gdb) p *$22->loc.inode
$24 = {table = 0x7fde607fb710, gfid = '\000' <repeats 15 times>, lock = {spinlock = 1, mutex = {__data = {__lock = 1, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = "\001", '\000' <repeats 38 times>, __align = 1}}, nlookup = 0, fd_count = 0, ref = 28, ia_type = IA_INVAL, fd_list = {next = 0x7fde4d02aa44, prev = 0x7fde4d02aa44}, dentry_list = {next = 0x7fde4d02aa54,
prev = 0x7fde4d02aa54}, hash = {next = 0x7fde4d02aa64, prev = 0x7fde4d02aa64}, list = {next = 0x7fde4d28d9b8, prev = 0x7fde4d27befc}, _ctx = 0x7fdcac0191e0}
(gdb) p (mdc_local_t*)fop->req_frame->parent->parent->parent
$25 = (mdc_local_t *) 0x7fde7559c138
(gdb) p fop->req_frame->parent->parent->parent
$26 = (call_frame_t *) 0x7fde7559c138
(gdb) p *fop->req_frame->parent->parent->parent
$27 = {root = 0x7fde75452350, parent = 0x7fde7555340c, frames = {next = 0x7fde7555341c, prev = 0x7fde7558a6e4}, local = 0x7fde4d02a9ec, this = 0x7fde600784a0, ret = 0x7fde5ef22600 <mdc_lookup_cbk>, ref_count = 1, lock = {spinlock = 1,
mutex = {__data = {__lock = 1, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\001", '\000' <repeats 38 times>, __align = 1}}, cookie = 0x7fde7559c138,
complete = _gf_false, op = GF_FOP_NULL, begin = {tv_sec = 0, tv_usec = 0}, end = {tv_sec = 0, tv_usec = 0}, wind_from = 0x7fde5ef2856f <__FUNCTION__.15558> "mdc_lookup", wind_to = 0x7fde7a803d30 "FIRST_CHILD(this)->fops->lookup",
unwind_from = 0x0, unwind_to = 0x7fde5ef27709 "mdc_lookup_cbk"}
(gdb) p *fop->req_frame->parent->parent->parent->this->name
$28 = 98 'b'
(gdb) p fop->req_frame->parent->parent->parent->this->name
$29 = 0x7fde60079040 "butcher-quick-read"
(gdb) p fop->req_frame->parent->parent
$30 = (call_frame_t *) 0x7fde7558a6d4
(gdb) p *fop->req_frame->parent->parent
$31 = {root = 0x7fde75452350, parent = 0x7fde7559c138, frames = {next = 0x7fde7559c148, prev = 0x7fde754d2f30}, local = 0x7fde6009ad54, this = 0x7fde60076d80, ret = 0x7fde5f33ab30 <qr_lookup_cbk>, ref_count = 1, lock = {spinlock = 1,
mutex = {__data = {__lock = 1, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\001", '\000' <repeats 38 times>, __align = 1}}, cookie = 0x7fde7558a6d4,
complete = _gf_false, op = GF_FOP_NULL, begin = {tv_sec = 0, tv_usec = 0}, end = {tv_sec = 0, tv_usec = 0}, wind_from = 0x7fde5f33cae0 <__FUNCTION__.16084> "qr_lookup", wind_to = 0x7fde5f33c4c8 "FIRST_CHILD(this)->fops->lookup",
unwind_from = 0x0, unwind_to = 0x7fde5f33c7a4 "qr_lookup_cbk"}
(gdb) p *fop->req_frame->parent->parent->this->name
$32 = 98 'b'
(gdb) p fop->req_frame->parent->parent->this->name
$33 = 0x7fde60076c00 "butcher-io-cache"
(gdb) p ((ioc_local_t*)fop->req_frame->parent->parent->local)->file_loc
$34 = {path = 0x7fdc240798a0 "<gfid:9d7aeb82-3c80-4316-8128-bba593c65f5c>/..", name = 0x7fdc240798cc "..", inode = 0x7fde4d02a9ec, parent = 0x7fde4d355240, gfid = '\000' <repeats 15 times>,
pargfid = "\235z\353\202<\200C\026\201(\273\245\223\306_\\"}
(gdb) p *fop->req_frame->parent
$35 = {root = 0x7fde75452350, parent = 0x7fde7558a6d4, frames = {next = 0x7fde7558a6e4, prev = 0x7fde754ce3d4}, local = 0x7fde5de5b3cc, this = 0x7fde60070d00, ret = 0x7fde5f546c60 <ioc_lookup_cbk>, ref_count = 3, lock = {spinlock = 1,
mutex = {__data = {__lock = 1, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\001", '\000' <repeats 38 times>, __align = 1}}, cookie = 0x7fde754d2f20,
complete = _gf_false, op = GF_FOP_NULL, begin = {tv_sec = 0, tv_usec = 0}, end = {tv_sec = 0, tv_usec = 0}, wind_from = 0x7fde5f54ecd7 <__FUNCTION__.16334> "ioc_lookup", wind_to = 0x7fde7a803d30 "FIRST_CHILD(this)->fops->lookup",
unwind_from = 0x0, unwind_to = 0x7fde5f54df48 "ioc_lookup_cbk"}
(gdb) p fop->req_frame->parent->this->name
$36 = 0x7fde60062720 "butcher-dht"
(gdb) p (dht_local_t*)fop->req_frame->parent->local
$37 = (dht_local_t *) 0x7fde5de5b3cc
(gdb) p $37->loc
$38 = {path = 0x7fdc240420d0 "<gfid:9d7aeb82-3c80-4316-8128-bba593c65f5c>/..", name = 0x7fdc240420fc "..", inode = 0x7fde4d02a9ec, parent = 0x7fde4d355240, gfid = ";\202\020^@eC\024\240\306\b\365l,\270V",
pargfid = "\235z\353\202<\200C\026\201(\273\245\223\306_\\"}
(gdb) p /x $37->loc.gfid
$39 = {0x3b, 0x82, 0x10, 0x5e, 0x40, 0x65, 0x43, 0x14, 0xa0, 0xc6, 0x8, 0xf5, 0x6c, 0x2c, 0xb8, 0x56}
(gdb) fr 7
#7 0x00007fde6c184df2 in ec_manager_lookup (fop=0x7fde575a5a80, state=<optimized out>) at ec-generic.c:837
837 ec_lookup_rebuild(fop->xl->private, fop, cbk);
(gdb) p /x fop->loc[0].gfid
$40 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x43, 0x14, 0xa0, 0xc6, 0x8, 0xf5, 0x6c, 0x2c, 0xb8, 0x56}
(gdb) fr 6
#6 0x00007fde6c184b85 in ec_lookup_rebuild (ec=<optimized out>, fop=fop@entry=0x7fde575a5a80, cbk=cbk@entry=0x7fde56dd8aa8) at ec-generic.c:644
644 err = ec_loc_update(fop->xl, &fop->loc[0], cbk->inode, &cbk->iatt[0]);
(gdb) p /x cbk->iatt[0].ia_gfid
$41 = {0x3b, 0x82, 0x10, 0x5e, 0x40, 0x65, 0x43, 0x14, 0xa0, 0xc6, 0x8, 0xf5, 0x6c, 0x2c, 0xb8, 0x56}
(gdb) p /x $37->gfid
$42 = {0x3b, 0x82, 0x10, 0x5e, 0x40, 0x65, 0x43, 0x14, 0xa0, 0xc6, 0x8, 0xf5, 0x6c, 0x2c, 0xb8, 0x56}
(gdb) p *fop->req_frame
$43 = {root = 0x7fde75452350, parent = 0x7fde754d2f20, frames = {next = 0x7fde75526d94, prev = 0x7fde75546864}, local = 0x0, this = 0x7fde6006c320, ret = 0x7fde5fdafd90 <dht_lookup_dir_cbk>, ref_count = 0, lock = {spinlock = 1, mutex = {
__data = {__lock = 1, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\001", '\000' <repeats 38 times>, __align = 1}}, cookie = 0x7fde75503c0c,
complete = _gf_false, op = GF_FOP_NULL, begin = {tv_sec = 0, tv_usec = 0}, end = {tv_sec = 0, tv_usec = 0}, wind_from = 0x7fde5fdee590 <__FUNCTION__.18890> "dht_lookup_directory",
wind_to = 0x7fde5fde9528 "conf->subvolumes[i]->fops->lookup", unwind_from = 0x0, unwind_to = 0x7fde5fdeb3cd "dht_lookup_dir_cbk"}
(gdb) shell
[root@gqas013 tmp]# gluster v info butcher
Volume Name: butcher
Type: Distributed-Disperse
Volume ID: 5be704df-e1e1-43a2-8f67-5bf720efe9e5
Status: Started
Snapshot Count: 0
Number of Bricks: 12 x (4 + 2) = 72
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick2: gqas014.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick3: gqas015.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick4: gqas005.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick5: gqas006.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick6: gqas008.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick7: gqas013.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick8: gqas014.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick9: gqas015.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick10: gqas005.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick11: gqas006.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick12: gqas008.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick13: gqas013.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick14: gqas014.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick15: gqas015.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick16: gqas005.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick17: gqas006.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick18: gqas008.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick19: gqas013.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick20: gqas014.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick21: gqas015.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick22: gqas005.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick23: gqas006.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick24: gqas008.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick25: gqas013.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick26: gqas014.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick27: gqas015.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick28: gqas005.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick29: gqas006.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick30: gqas008.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick31: gqas013.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick32: gqas014.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick33: gqas015.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick34: gqas005.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick35: gqas006.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick36: gqas008.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick37: gqas013.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick38: gqas014.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick39: gqas015.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick40: gqas005.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick41: gqas006.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick42: gqas008.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick43: gqas013.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick44: gqas014.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick45: gqas015.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick46: gqas005.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick47: gqas006.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick48: gqas008.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick49: gqas013.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick50: gqas014.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick51: gqas015.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick52: gqas005.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick53: gqas006.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick54: gqas008.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick55: gqas013.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick56: gqas014.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick57: gqas015.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick58: gqas005.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick59: gqas006.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick60: gqas008.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick61: gqas013.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick62: gqas014.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick63: gqas015.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick64: gqas005.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick65: gqas006.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick66: gqas008.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick67: gqas013.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick68: gqas014.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick69: gqas015.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick70: gqas005.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick71: gqas006.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick72: gqas008.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
nfs.disable: on
performance.readdir-ahead: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
[root@gqas013 tmp]# exit
(gdb) p *fop->req_frame
$44 = {root = 0x7fde75452350, parent = 0x7fde754d2f20, frames = {next = 0x7fde75526d94, prev = 0x7fde75546864}, local = 0x0, this = 0x7fde6006c320, ret = 0x7fde5fdafd90 <dht_lookup_dir_cbk>, ref_count = 0, lock = {spinlock = 1, mutex = {
__data = {__lock = 1, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\001", '\000' <repeats 38 times>, __align = 1}}, cookie = 0x7fde75503c0c,
complete = _gf_false, op = GF_FOP_NULL, begin = {tv_sec = 0, tv_usec = 0}, end = {tv_sec = 0, tv_usec = 0}, wind_from = 0x7fde5fdee590 <__FUNCTION__.18890> "dht_lookup_directory",
wind_to = 0x7fde5fde9528 "conf->subvolumes[i]->fops->lookup", unwind_from = 0x0, unwind_to = 0x7fde5fdeb3cd "dht_lookup_dir_cbk"}
<<<<
Pranith observed that loc->gfid had been set in the dht layer but was modified by the time ec validated it in ec_lookup_cbk. He will be able to provide more details on the code flow that leads to this issue.
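To illustrate the hazard, here is a minimal standalone C sketch (not GlusterFS code) assuming the GFID bytes are rewritten by one path while another path is still reading them with no lock around the bytes, as the observation above suggests. The reader can then see a torn value, e.g. a GFID whose leading bytes are still zero, like the fop->loc[0].gfid in the core. The thread layout and names are purely illustrative, and the race is deliberate; it may need several runs to trigger.

```c
/* Build with: cc -O2 -pthread torn_gfid.c */
#include <pthread.h>
#include <stdio.h>
#include <stdint.h>

#define GFID_SIZE 16

static volatile uint8_t shared_gfid[GFID_SIZE];   /* starts as the null GFID */
static volatile int done;

/* Repeatedly flips the shared GFID between the null GFID and a full value
 * (the value observed in the core), byte by byte, with no synchronization. */
static void *writer(void *arg)
{
    const uint8_t full[GFID_SIZE] = {0x3b, 0x82, 0x10, 0x5e, 0x40, 0x65, 0x43, 0x14,
                                     0xa0, 0xc6, 0x08, 0xf5, 0x6c, 0x2c, 0xb8, 0x56};
    while (!done) {
        for (int b = 0; b < GFID_SIZE; b++) shared_gfid[b] = full[b];
        for (int b = 0; b < GFID_SIZE; b++) shared_gfid[b] = 0;
    }
    return arg;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);

    for (long i = 0; i < 50000000 && !done; i++) {
        uint8_t snap[GFID_SIZE];
        int zeros = 0;
        for (int b = 0; b < GFID_SIZE; b++) {      /* unsynchronized read */
            snap[b] = shared_gfid[b];
            zeros += (snap[b] == 0);
        }
        if (zeros != 0 && zeros != GFID_SIZE) {    /* neither null nor complete */
            printf("observed a torn GFID with %d zero bytes\n", zeros);
            done = 1;
        }
    }
    done = 1;
    pthread_join(t, NULL);
    return 0;
}
```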
https://review.gluster.org/16986 - Patch upstream

I could not reproduce it in two attempts on 3.8.4-25. Since it's a stress test, I'll revisit it during 3.3 and will reopen if I hit it again. Moving this to Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774
Description of problem:
------------------------
4 node cluster - 3 volumes - one 2*(4+2), one 2*3 and one 2*2. All volumes exported via Ganesha and mounted on 4 clients, each one having >1 mount via v3/v4.

Almost an hour into my workload, I see that IO has errored out on my clients:

***********
On gqac019
***********

Mount : 192.168.97.152:/butcher

[root@gqac019 gluster-mount]# for i in {1..25};do mkdir dir$i;cp linux-4.9.4.tar.xz dir$i;cd dir$i;tar xvfJ linux-4.9.4.tar.xz >> /tar$i.log;cd ..;done
cp: error reading ‘linux-4.9.4.tar.xz’: Input/output error
cp: failed to extend ‘dir1/linux-4.9.4.tar.xz’: Input/output error
tar: linux-4.9.4/drivers/bus/arm-cci.c: Cannot change ownership to uid 0, gid 0: Remote I/O error

**********
On gqac011
**********

Mount : 192.168.97.150:/butcher

[root@gqac011 W]# for i in {1..25};do mkdir dir$i;cp linux-4.9.4.tar.xz dir$i;cd dir$i;tar xvfJ linux-4.9.4.tar.xz >> /tar$i.log;cd ..;done
cp: error reading ‘linux-4.9.4.tar.xz’: Input/output error
cp: failed to extend ‘dir1/linux-4.9.4.tar.xz’: Input/output error

**********
On gqac019
**********

Mount : 192.168.97.152:/butcher

[root@gqac019 ~]# /opt/qa/tools/system_light/run.sh -w /gluster-mount -t bonnie -l /var/tmp/bonnie.log
/opt/qa/tools/system_light/scripts /opt/qa/tools/system_light /root
/gluster-mount /
-----
/gluster-mount /gluster-mount/run2179/
Tests available:
arequal bonnie compile_kernel coverage dbench dd ffsb fileop fs_mark fsx fuse glusterfs glusterfs_build iozone locks ltp multiple_files openssl posix_compliance postmark read_large rpc syscallbench tiobench
===========================TESTS RUNNING===========================
Changing to the specified mountpoint /gluster-mount/run2179
executing bonnie
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...Bonnie: drastic I/O error (re-write read): Stale file handle

The brick processes are up and running.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
nfs-ganesha-2.4.1-6.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.4.1-6.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-12.el7rhgs.x86_64

How reproducible:
-----------------
1/1

Actual results:
----------------
EIO.
Expected results:
------------------
Zero error status on clients

Additional info:
----------------
[root@gqas013 tmp]# gluster v status
Status of volume: butcher
Gluster process                                              TCP Port  RDMA Port  Online  Pid
----------------------------------------------------------------------------------------------
Brick gqas013.sbu.lab.eng.bos.redhat.com:/bricks1/brick      49153     0          Y       2486
Brick gqas008.sbu.lab.eng.bos.redhat.com:/bricks1/brick      49152     0          Y       9116
Brick gqas005.sbu.lab.eng.bos.redhat.com:/bricks1/brick      49153     0          Y       5724
Brick gqas006.sbu.lab.eng.bos.redhat.com:/bricks1/brick      49153     0          Y       6853
Brick gqas013.sbu.lab.eng.bos.redhat.com:/bricks3/brick      49154     0          Y       2513
Brick gqas008.sbu.lab.eng.bos.redhat.com:/bricks3/brick      49153     0          Y       9135
Brick gqas013.sbu.lab.eng.bos.redhat.com:/bricks2/brick      49155     0          Y       2532
Brick gqas008.sbu.lab.eng.bos.redhat.com:/bricks2/brick      49154     0          Y       9154
Brick gqas005.sbu.lab.eng.bos.redhat.com:/bricks2/brick      49154     0          Y       5743
Brick gqas006.sbu.lab.eng.bos.redhat.com:/bricks2/brick      49154     0          Y       6872
Brick gqas005.sbu.lab.eng.bos.redhat.com:/bricks3/brick      49155     0          Y       5762
Brick gqas006.sbu.lab.eng.bos.redhat.com:/bricks3/brick      49155     0          Y       6891
Self-heal Daemon on localhost                                N/A       N/A        Y       7066
Self-heal Daemon on gqas008.sbu.lab.eng.bos.redhat.com       N/A       N/A        Y       2621
Self-heal Daemon on gqas005.sbu.lab.eng.bos.redhat.com       N/A       N/A        Y       26374
Self-heal Daemon on gqas006.sbu.lab.eng.bos.redhat.com       N/A       N/A        Y       27934

Task Status of Volume butcher
----------------------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: gluster_shared_storage
Gluster process                                              TCP Port  RDMA Port  Online  Pid
----------------------------------------------------------------------------------------------
Brick gqas013.sbu.lab.eng.bos.redhat.com:/var/lib/glusterd/ss_brick  49152  0     Y       30998
Brick gqas005.sbu.lab.eng.bos.redhat.com:/var/lib/glusterd/ss_brick  49152  0     Y       30630
Brick gqas006.sbu.lab.eng.bos.redhat.com:/var/lib/glusterd/ss_brick  49152  0     Y       30713
Self-heal Daemon on localhost                                N/A       N/A        Y       7066
Self-heal Daemon on gqas008.sbu.lab.eng.bos.redhat.com       N/A       N/A        Y       2621
Self-heal Daemon on gqas005.sbu.lab.eng.bos.redhat.com       N/A       N/A        Y       26374
Self-heal Daemon on gqas006.sbu.lab.eng.bos.redhat.com       N/A       N/A        Y       27934

Task Status of Volume gluster_shared_storage
----------------------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: rep2
Gluster process                                              TCP Port  RDMA Port  Online  Pid
----------------------------------------------------------------------------------------------
Brick gqas013.sbu.lab.eng.bos.redhat.com:/bricks11/brick     49157     0          Y       3022
Brick gqas008.sbu.lab.eng.bos.redhat.com:/bricks11/brick     49156     0          Y       13329
Brick gqas005.sbu.lab.eng.bos.redhat.com:/bricks11/brick     49158     0          Y       8299
Brick gqas006.sbu.lab.eng.bos.redhat.com:/bricks11/brick     49158     0          Y       27834
Self-heal Daemon on localhost                                N/A       N/A        Y       7066
Self-heal Daemon on gqas006.sbu.lab.eng.bos.redhat.com       N/A       N/A        Y       27934
Self-heal Daemon on gqas005.sbu.lab.eng.bos.redhat.com       N/A       N/A        Y       26374
Self-heal Daemon on gqas008.sbu.lab.eng.bos.redhat.com       N/A       N/A        Y       2621

Task Status of Volume rep2
----------------------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: rep3
Gluster process                                              TCP Port  RDMA Port  Online  Pid
----------------------------------------------------------------------------------------------
Brick gqas013.sbu.lab.eng.bos.redhat.com:/bricks9/brick      49156     0          Y       2823
Brick gqas005.sbu.lab.eng.bos.redhat.com:/bricks9/brick      49156     0          Y       7499
Brick gqas006.sbu.lab.eng.bos.redhat.com:/bricks9/brick      49156     0          Y       13758
Brick gqas008.sbu.lab.eng.bos.redhat.com:/bricks9/brick      49155     0          Y       12041
Brick gqas005.sbu.lab.eng.bos.redhat.com:/bricks10/brick     49157     0          Y       7518
Brick gqas006.sbu.lab.eng.bos.redhat.com:/bricks10/brick     49157     0          Y       13848
Self-heal Daemon on localhost                                N/A       N/A        Y       7066
Self-heal Daemon on gqas008.sbu.lab.eng.bos.redhat.com       N/A       N/A        Y       2621
Self-heal Daemon on gqas006.sbu.lab.eng.bos.redhat.com       N/A       N/A        Y       27934
Self-heal Daemon on gqas005.sbu.lab.eng.bos.redhat.com       N/A       N/A        Y       26374

Task Status of Volume rep3
----------------------------------------------------------------------------------------------
There are no active volume tasks

[root@gqas013 tmp]#