Description of problem:
=======================
A brick went down unexpectedly on my setup.
I had a 44x(4+2) EC volume with brick multiplexing enabled.
I started I/O from 4 clients over the weekend and found that one of the bricks had stopped running, possibly after only about an hour of I/O.
(No core was seen.)

glusterd log of node where brick went down:
=================================
[2019-04-26 13:40:35.491566] I [MSGID: 106568] [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: scrub service is stopped
[2019-04-26 15:15:32.943490] I [MSGID: 106143] [glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick /gluster/brick10/bmxecv on port 49154

sample of brick log of node where brick went down:
======================================
[2019-04-26 15:15:02.322306] I [dict.c:541:dict_get] (-->/usr/lib64/glusterfs/6.0/xlator/features/worm.so(+0x7241) [0x7fe857bb3241] -->/usr/lib64/glusterfs/6.0/xlator/features/locks.so(+0x1c219) [0x7fe857dda219] -->/lib64/libglusterfs.so.0(dict_get+0x94) [0x7fe86c2d7294] ) 41-dict: !this || key=trusted.glusterfs.enforce-mandatory-lock [Invalid argument]
[2019-04-26 15:15:28.291214] W [MSGID: 113018] [posix-helpers.c:743:posix_istat] 9-bmxecv-posix: lstat failed on /gluster/brick10/bmxecv/.glusterfs/b0/02/b0022d0e-a294-4932-b3f6-b2b047cee4fb [Input/output error]
[2019-04-26 15:15:28.291267] E [MSGID: 113039] [posix-inode-fd-ops.c:1459:posix_open] 9-bmxecv-posix: open on /gluster/brick10/bmxecv/.glusterfs/b0/02/b0022d0e-a294-4932-b3f6-b2b047cee4fb, flags: 133120 [Input/output error]
[2019-04-26 15:15:28.291305] E [MSGID: 115070] [server-rpc-fops_v2.c:1503:server4_open_cbk] 0-bmxecv-server: 821939: OPEN /rhs-client24.lab.eng.blr.redhat.com/dir.4/linux-4.20.8/scripts/kconfig/preprocess.c (b0022d0e-a294-4932-b3f6-b2b047cee4fb), client: CTX_ID:5fed0f70-e26f-48cf-8478-6bf0f760c868-GRAPH_ID:0-PID:7877-HOST:rhs-client24.lab.eng.blr.redhat.com-PC_NAME:bmxecv-client-28-RECON_NO:-0, error-xlator: bmxecv-posix [Input/output error]
[2019-04-26 15:15:28.292053] W [MSGID: 113018] [posix-helpers.c:818:posix_pstat] 9-bmxecv-posix: lstat failed on /gluster/brick10/bmxecv/rhs-client28.lab.eng.blr.redhat.com/dir.4/linux-4.20.8/sound/soc/intel/boards [Input/output error]
[2019-04-26 15:15:28.292114] W [MSGID: 113018] [posix-helpers.c:743:posix_istat] 9-bmxecv-posix: lstat failed on /gluster/brick10/bmxecv/.glusterfs/66/3c/663c61f4-f544-4a84-b845-cfdcfcf5d5af [Input/output error]
[2019-04-26 15:15:28.292138] W [MSGID: 113018] [posix-helpers.c:818:posix_pstat] 9-bmxecv-posix: lstat failed on /gluster/brick10/bmxecv/rhs-client28.lab.eng.blr.redhat.com/dir.4/linux-4.20.8/sound/soc/intel [Input/output error]
[2019-04-26 15:15:28.292154] E [MSGID: 113018] [posix-entry-ops.c:706:posix_mkdir] 9-bmxecv-posix: pre-operation lstat on parent /gluster/brick10/bmxecv/rhs-client28.lab.eng.blr.redhat.com/dir.4/linux-4.20.8/sound/soc/intel failed [Input/output error]
[2019-04-26 15:15:28.292182] E [MSGID: 115056] [server-rpc-fops_v2.c:515:server4_mkdir_cbk] 0-bmxecv-server: 776541: MKDIR /rhs-client28.lab.eng.blr.redhat.com/dir.4/linux-4.20.8/sound/soc/intel/boards (b699c200-fa86-47d0-871a-d022c23e28f4/boards) client: CTX_ID:b2a2d1db-deb3-42f3-85a3-b7b7555f18c3-GRAPH_ID:0-PID:7881-HOST:rhs-client28.lab.eng.blr.redhat.com-PC_NAME:bmxecv-client-28-RECON_NO:-0, error-xlator: bmxecv-posix [Input/output error]
[2019-04-26 15:15:28.292316] E [MSGID: 113040] [posix-helpers.c:1905:__posix_fd_ctx_get] 9-bmxecv-posix: Failed to get anonymous fd for real_path: /gluster/brick10/bmxecv/.glusterfs/b0/02/b0022d0e-a294-4932-b3f6-b2b047cee4fb. [Input/output error]
[2019-04-26 15:15:28.292348] W [MSGID: 113055] [posix-inode-fd-ops.c:4749:do_xattrop] 9-bmxecv-posix: failed to get pfd from fd=0x7fe3663986d8 [Interrupted system call]
[2019-04-26 15:15:28.292380] E [MSGID: 115073] [server-rpc-fops_v2.c:1805:server4_fxattrop_cbk] 0-bmxecv-server: 821941: FXATTROP -2 (b0022d0e-a294-4932-b3f6-b2b047cee4fb), client: CTX_ID:5fed0f70-e26f-48cf-8478-6bf0f760c868-GRAPH_ID:0-PID:7877-HOST:rhs-client24.lab.eng.blr.redhat.com-PC_NAME:bmxecv-client-28-RECON_NO:-0, error-xlator: bmxecv-posix [Input/output error]
[2019-04-26 15:15:28.293999] W [MSGID: 113018] [posix-helpers.c:818:posix_pstat] 9-bmxecv-posix: lstat failed on /gluster/brick10/bmxecv/rhs-client24.lab.eng.blr.redhat.com/dir.4/linux-4.20.8/scripts/kconfig/preprocess.c [Input/output error]
[2019-04-26 15:15:28.294031] W [MSGID: 113018] [posix-entry-ops.c:235:posix_lookup] 9-bmxecv-posix: lstat on /gluster/brick10/bmxecv/rhs-client24.lab.eng.blr.redhat.com/dir.4/linux-4.20.8/scripts/kconfig/preprocess.c failed [Input/output error]
[2019-04-26 15:15:28.294048] W [MSGID: 113018] [posix-helpers.c:818:posix_pstat] 9-bmxecv-posix: lstat failed on /gluster/brick10/bmxecv/rhs-client24.lab.eng.blr.redhat.com/dir.4/linux-4.20.8/scripts/kconfig [Input/output error]
[2019-04-26 15:15:28.294059] E [MSGID: 113018] [posix-entry-ops.c:306:posix_lookup] 9-bmxecv-posix: post-operation lstat on parent /gluster/brick10/bmxecv/rhs-client24.lab.eng.blr.redhat.com/dir.4/linux-4.20.8/scripts/kconfig failed [Input/output error]
[2019-04-26 15:15:28.294088] E [MSGID: 113040] [posix-helpers.c:1905:__posix_fd_ctx_get] 9-bmxecv-posix: Failed to get anonymous fd for real_path: /gluster/brick10/bmxecv/.glusterfs/b0/02/b0022d0e-a294-4932-b3f6-b2b047cee4fb. [Input/output error]
[2019-04-26 15:15:28.294113] W [MSGID: 113055] [posix-inode-fd-ops.c:4749:do_xattrop] 9-bmxecv-posix: failed to get pfd from fd=0x7fe39618cc38 [Interrupted system call]
[2019-04-26 15:15:28.294138] E [MSGID: 115073] [server-rpc-fops_v2.c:1805:server4_fxattrop_cbk] 0-bmxecv-server: 821946: FXATTROP -2 (b0022d0e-a294-4932-b3f6-b2b047cee4fb), client: CTX_ID:5fed0f70-e26f-48cf-8478-6bf0f760c868-GRAPH_ID:0-PID:7877-HOST:rhs-client24.lab.eng.blr.redhat.com-PC_NAME:bmxecv-client-28-RECON_NO:-0, error-xlator: bmxecv-posix [Input/output error]
[2019-04-26 15:15:28.294165] W [MSGID: 113018] [posix-helpers.c:818:posix_pstat] 9-bmxecv-posix: lstat failed on /gluster/brick10/bmxecv/rhs-client24.lab.eng.blr.redhat.com/dir.4/linux-4.20.8/scripts/kconfig/preprocess.c [Input/output error]
[2019-04-26 15:15:28.294200] E [MSGID: 115050] [server-rpc-fops_v2.c:158:server4_lookup_cbk] 0-bmxecv-server: 821945: LOOKUP /rhs-client24.lab.eng.blr.redhat.com/dir.4/linux-4.20.8/scripts/kconfig/preprocess.c (b4db8f2f-b89c-4ce8-a373-1b7dc919b34b/preprocess.c), client: CTX_ID:5fed0f70-e26f-48cf-8478-6bf0f760c868-GRAPH_ID:0-PID:7877-HOST:rhs-client24.lab.eng.blr.redhat.com-PC_NAME:bmxecv-client-28-RECON_NO:-0, error-xlator: bmxecv-posix [Input/output error]

Version-Release number of selected component (if applicable):
===================
6.0.2

Steps to Reproduce:
====================
1. Created a 3-node cluster and enabled brick multiplexing.
2. Created a 44x(4+2) EC volume.
3. Mounted the volume on 4 clients.
4. Turned off other-eager-lock manually due to BZ#1703455.
5. Started I/O from all 4 clients (to separate directories).
6. Issued volume status every 2 minutes from one of the servers (for log-capturing purposes).
Left the above running over the weekend.

Actual results:
===============
One brick had stopped running about 1 hour after I/O was started.

Expected results:

Additional info:
This is not an xfs issue; your storage is failing:

[Fri Apr 26 20:45:28 2019] megaraid_sas 0000:06:00.0: 102219 (609606928s/0x0001/FATAL) - Uncorrectable medium error logged for VD 03/3 at f22421a (on PD 02(e0x00/s2) at f22421a)
[Fri Apr 26 20:45:28 2019] sd 0:2:3:0: [sdd] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[Fri Apr 26 20:45:28 2019] sd 0:2:3:0: [sdd] tag#0 CDB: Read(10) 28 00 0f 22 42 00 00 00 20 00
[Fri Apr 26 20:45:28 2019] blk_update_request: I/O error, dev sdd, sector 253903360

xfs is simply reporting the IO error it received from storage.
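
For anyone who wants to cross-check the kernel messages against the brick log, here is a rough Python sketch. It decodes the Read(10) CDB from the `sd` line using the standard READ(10) layout (opcode 0x28, 4-byte big-endian LBA in bytes 2-5, 2-byte transfer length in bytes 7-8) and compares the timestamps. Two things are assumptions on my part and not stated anywhere in the logs: that the address `f22421a` in the megaraid_sas message is a hexadecimal block address in the same address space as `sdd`, and that the brick log is in UTC while the kernel timestamps are in the node's local time. The function name `decode_read10` is made up for this example.

```python
from datetime import datetime

def decode_read10(cdb_hex: str) -> tuple[int, int]:
    """Return (start_lba, block_count) from a SCSI READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is ignored (Python 3.7+)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    start_lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: logical block address
    block_count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return start_lba, block_count

# CDB copied from the "sd 0:2:3:0: [sdd] tag#0 CDB: Read(10)" line.
start, count = decode_read10("28 00 0f 22 42 00 00 00 20 00")

# Address from the megaraid_sas line, ASSUMED to be a hex LBA comparable to sdd's.
medium_error_lba = int("f22421a", 16)
error_in_read = start <= medium_error_lba < start + count

# First lstat failure in the brick log vs. the kernel messages.
# ASSUMPTION: brick log is UTC, kernel timestamps are node-local time.
brick_time = datetime(2019, 4, 26, 15, 15, 28)
kernel_time = datetime(2019, 4, 26, 20, 45, 28)
offset = kernel_time - brick_time

print(f"read starts at sector {start}, length {count} blocks")
print(f"medium error at {medium_error_lba}, inside failed read: {error_in_read}")
print(f"kernel log minus brick log: {offset}")
```

The decoded start LBA equals the sector that `blk_update_request` reports, and the medium-error address falls inside the 32-block read. The two logs differ by exactly 5 hours 30 minutes, which would fit a UTC brick log and a node clock set to IST, so under those assumptions the failed read and the brick errors happened in the same second.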