Bug 1704211 - Brick gone down unexpectedly
Summary: Brick gone down unexpectedly
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: disperse
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Ashish Pandey
QA Contact: Nag Pavan Chilakam
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-04-29 10:52 UTC by Nag Pavan Chilakam
Modified: 2019-10-30 14:41 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-05-08 13:46:26 UTC
Embargoed:



Description Nag Pavan Chilakam 2019-04-29 10:52:43 UTC
Description of problem:
=======================
A brick went down unexpectedly on my setup.
I had a 44x(4+2) EC volume with brick multiplexing enabled.

I started IOs from 4 clients over the weekend and saw that one of the bricks had stopped running (possibly only about an hour after the IOs started).

(No core was seen.)

glusterd log of node where brick went down:
=================================
[2019-04-26 13:40:35.491566] I [MSGID: 106568] [glusterd-svc-mgmt.c:261:glusterd_svc_stop] 0-management: scrub service is stopped
[2019-04-26 15:15:32.943490] I [MSGID: 106143] [glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick /gluster/brick10/bmxecv on port 49154


sample of brick log of node where brick went down:
======================================
[2019-04-26 15:15:02.322306] I [dict.c:541:dict_get] (-->/usr/lib64/glusterfs/6.0/xlator/features/worm.so(+0x7241) [0x7fe857bb3241] -->/usr/lib64/glusterfs/6.0/xlator/features/locks.so(+0x1c219) [0x7fe857dda219] -->/lib64/libglusterfs.so.0(dict_get+0x94) [0x7fe86c2d7294] ) 41-dict: !this || key=trusted.glusterfs.enforce-mandatory-lock [Invalid argument]
[2019-04-26 15:15:28.291214] W [MSGID: 113018] [posix-helpers.c:743:posix_istat] 9-bmxecv-posix: lstat failed on /gluster/brick10/bmxecv/.glusterfs/b0/02/b0022d0e-a294-4932-b3f6-b2b047cee4fb [Input/output error]
[2019-04-26 15:15:28.291267] E [MSGID: 113039] [posix-inode-fd-ops.c:1459:posix_open] 9-bmxecv-posix: open on /gluster/brick10/bmxecv/.glusterfs/b0/02/b0022d0e-a294-4932-b3f6-b2b047cee4fb, flags: 133120 [Input/output error]
[2019-04-26 15:15:28.291305] E [MSGID: 115070] [server-rpc-fops_v2.c:1503:server4_open_cbk] 0-bmxecv-server: 821939: OPEN /rhs-client24.lab.eng.blr.redhat.com/dir.4/linux-4.20.8/scripts/kconfig/preprocess.c (b0022d0e-a294-4932-b3f6-b2b047cee4fb), client: CTX_ID:5fed0f70-e26f-48cf-8478-6bf0f760c868-GRAPH_ID:0-PID:7877-HOST:rhs-client24.lab.eng.blr.redhat.com-PC_NAME:bmxecv-client-28-RECON_NO:-0, error-xlator: bmxecv-posix [Input/output error]
[2019-04-26 15:15:28.292053] W [MSGID: 113018] [posix-helpers.c:818:posix_pstat] 9-bmxecv-posix: lstat failed on /gluster/brick10/bmxecv/rhs-client28.lab.eng.blr.redhat.com/dir.4/linux-4.20.8/sound/soc/intel/boards [Input/output error]
[2019-04-26 15:15:28.292114] W [MSGID: 113018] [posix-helpers.c:743:posix_istat] 9-bmxecv-posix: lstat failed on /gluster/brick10/bmxecv/.glusterfs/66/3c/663c61f4-f544-4a84-b845-cfdcfcf5d5af [Input/output error]
[2019-04-26 15:15:28.292138] W [MSGID: 113018] [posix-helpers.c:818:posix_pstat] 9-bmxecv-posix: lstat failed on /gluster/brick10/bmxecv/rhs-client28.lab.eng.blr.redhat.com/dir.4/linux-4.20.8/sound/soc/intel [Input/output error]
[2019-04-26 15:15:28.292154] E [MSGID: 113018] [posix-entry-ops.c:706:posix_mkdir] 9-bmxecv-posix: pre-operation lstat on parent /gluster/brick10/bmxecv/rhs-client28.lab.eng.blr.redhat.com/dir.4/linux-4.20.8/sound/soc/intel failed [Input/output error]
[2019-04-26 15:15:28.292182] E [MSGID: 115056] [server-rpc-fops_v2.c:515:server4_mkdir_cbk] 0-bmxecv-server: 776541: MKDIR /rhs-client28.lab.eng.blr.redhat.com/dir.4/linux-4.20.8/sound/soc/intel/boards (b699c200-fa86-47d0-871a-d022c23e28f4/boards) client: CTX_ID:b2a2d1db-deb3-42f3-85a3-b7b7555f18c3-GRAPH_ID:0-PID:7881-HOST:rhs-client28.lab.eng.blr.redhat.com-PC_NAME:bmxecv-client-28-RECON_NO:-0, error-xlator: bmxecv-posix [Input/output error]
[2019-04-26 15:15:28.292316] E [MSGID: 113040] [posix-helpers.c:1905:__posix_fd_ctx_get] 9-bmxecv-posix: Failed to get anonymous fd for real_path: /gluster/brick10/bmxecv/.glusterfs/b0/02/b0022d0e-a294-4932-b3f6-b2b047cee4fb. [Input/output error]
[2019-04-26 15:15:28.292348] W [MSGID: 113055] [posix-inode-fd-ops.c:4749:do_xattrop] 9-bmxecv-posix: failed to get pfd from fd=0x7fe3663986d8 [Interrupted system call]
[2019-04-26 15:15:28.292380] E [MSGID: 115073] [server-rpc-fops_v2.c:1805:server4_fxattrop_cbk] 0-bmxecv-server: 821941: FXATTROP -2 (b0022d0e-a294-4932-b3f6-b2b047cee4fb), client: CTX_ID:5fed0f70-e26f-48cf-8478-6bf0f760c868-GRAPH_ID:0-PID:7877-HOST:rhs-client24.lab.eng.blr.redhat.com-PC_NAME:bmxecv-client-28-RECON_NO:-0, error-xlator: bmxecv-posix [Input/output error]
[2019-04-26 15:15:28.293999] W [MSGID: 113018] [posix-helpers.c:818:posix_pstat] 9-bmxecv-posix: lstat failed on /gluster/brick10/bmxecv/rhs-client24.lab.eng.blr.redhat.com/dir.4/linux-4.20.8/scripts/kconfig/preprocess.c [Input/output error]
[2019-04-26 15:15:28.294031] W [MSGID: 113018] [posix-entry-ops.c:235:posix_lookup] 9-bmxecv-posix: lstat on /gluster/brick10/bmxecv/rhs-client24.lab.eng.blr.redhat.com/dir.4/linux-4.20.8/scripts/kconfig/preprocess.c failed [Input/output error]
[2019-04-26 15:15:28.294048] W [MSGID: 113018] [posix-helpers.c:818:posix_pstat] 9-bmxecv-posix: lstat failed on /gluster/brick10/bmxecv/rhs-client24.lab.eng.blr.redhat.com/dir.4/linux-4.20.8/scripts/kconfig [Input/output error]
[2019-04-26 15:15:28.294059] E [MSGID: 113018] [posix-entry-ops.c:306:posix_lookup] 9-bmxecv-posix: post-operation lstat on parent /gluster/brick10/bmxecv/rhs-client24.lab.eng.blr.redhat.com/dir.4/linux-4.20.8/scripts/kconfig failed [Input/output error]
[2019-04-26 15:15:28.294088] E [MSGID: 113040] [posix-helpers.c:1905:__posix_fd_ctx_get] 9-bmxecv-posix: Failed to get anonymous fd for real_path: /gluster/brick10/bmxecv/.glusterfs/b0/02/b0022d0e-a294-4932-b3f6-b2b047cee4fb. [Input/output error]
[2019-04-26 15:15:28.294113] W [MSGID: 113055] [posix-inode-fd-ops.c:4749:do_xattrop] 9-bmxecv-posix: failed to get pfd from fd=0x7fe39618cc38 [Interrupted system call]
[2019-04-26 15:15:28.294138] E [MSGID: 115073] [server-rpc-fops_v2.c:1805:server4_fxattrop_cbk] 0-bmxecv-server: 821946: FXATTROP -2 (b0022d0e-a294-4932-b3f6-b2b047cee4fb), client: CTX_ID:5fed0f70-e26f-48cf-8478-6bf0f760c868-GRAPH_ID:0-PID:7877-HOST:rhs-client24.lab.eng.blr.redhat.com-PC_NAME:bmxecv-client-28-RECON_NO:-0, error-xlator: bmxecv-posix [Input/output error]
[2019-04-26 15:15:28.294165] W [MSGID: 113018] [posix-helpers.c:818:posix_pstat] 9-bmxecv-posix: lstat failed on /gluster/brick10/bmxecv/rhs-client24.lab.eng.blr.redhat.com/dir.4/linux-4.20.8/scripts/kconfig/preprocess.c [Input/output error]
[2019-04-26 15:15:28.294200] E [MSGID: 115050] [server-rpc-fops_v2.c:158:server4_lookup_cbk] 0-bmxecv-server: 821945: LOOKUP /rhs-client24.lab.eng.blr.redhat.com/dir.4/linux-4.20.8/scripts/kconfig/preprocess.c (b4db8f2f-b89c-4ce8-a373-1b7dc919b34b/preprocess.c), client: CTX_ID:5fed0f70-e26f-48cf-8478-6bf0f760c868-GRAPH_ID:0-PID:7877-HOST:rhs-client24.lab.eng.blr.redhat.com-PC_NAME:bmxecv-client-28-RECON_NO:-0, error-xlator: bmxecv-posix [Input/output error]
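
To see at a glance which posix calls failed and with which error, the brick log excerpt above can be tallied by log level, reporting function, and trailing error string. The following is a minimal Python sketch and not part of the original report; the regular expressions are an assumption based only on the line layout visible in this excerpt, and the helper name `summarize` is hypothetical.

```python
import re
from collections import Counter

# Assumed layout, inferred from the excerpt only:
# [timestamp] LEVEL [MSGID: n] [file.c:line:function] ... [error string]
# The MSGID field is optional (the dict_get line has none).
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z])\s+"
    r"(?:\[MSGID:\s*(?P<msgid>\d+)\]\s+)?"
    r"\[(?P<src>[^\]]+)\]"
)
# The error string is the last bracketed field on the line.
ERR_RE = re.compile(r"\[(?P<err>[^\[\]]+)\]\s*$")


def summarize(lines):
    """Count log lines by (level, reporting function, trailing error string)."""
    counts = Counter()
    for line in lines:
        head = LINE_RE.match(line)
        if head is None:
            continue  # not a log line in the assumed layout
        tail = ERR_RE.search(line)
        err = tail.group("err") if tail else None
        func = head.group("src").rsplit(":", 1)[-1]
        counts[(head.group("level"), func, err)] += 1
    return counts


# Three lines copied from the excerpt above.
SAMPLE = [
    "[2019-04-26 15:15:28.291214] W [MSGID: 113018] [posix-helpers.c:743:posix_istat] 9-bmxecv-posix: lstat failed on /gluster/brick10/bmxecv/.glusterfs/b0/02/b0022d0e-a294-4932-b3f6-b2b047cee4fb [Input/output error]",
    "[2019-04-26 15:15:28.292114] W [MSGID: 113018] [posix-helpers.c:743:posix_istat] 9-bmxecv-posix: lstat failed on /gluster/brick10/bmxecv/.glusterfs/66/3c/663c61f4-f544-4a84-b845-cfdcfcf5d5af [Input/output error]",
    "[2019-04-26 15:15:28.292348] W [MSGID: 113055] [posix-inode-fd-ops.c:4749:do_xattrop] 9-bmxecv-posix: failed to get pfd from fd=0x7fe3663986d8 [Interrupted system call]",
]

for key, n in summarize(SAMPLE).items():
    print(n, key)
```

On these three sample lines the tally groups two posix_istat warnings under "Input/output error" and one do_xattrop warning under "Interrupted system call".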



Version-Release number of selected component (if applicable):
===================
6.0.2



Steps to Reproduce:
====================
1. Created a 3-node cluster and enabled brick multiplexing.
2. Created a 44x(4+2) EC volume.
3. Mounted the volume on 4 clients.
4. Turned off other-eager-lock manually due to BZ#1703455.
5. Started IOs from all 4 clients (to separate directories).
6. Issued volume status every 2 minutes from one of the servers (for log capturing purposes).
Left the above running over the weekend.
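
The periodic status capture in step 6 could be scripted roughly as follows. This is a minimal Python sketch and not the script used for this report; the volume name `bmxecv` is inferred from the brick path and xlator names in the logs, and the helper names `volume_status_cmd` and `poll_status` are hypothetical. The `run` and `sleep` parameters exist only so the loop can be exercised without a Gluster installation.

```python
import subprocess
import time


def volume_status_cmd(volname):
    """Build the 'gluster volume status <VOLNAME>' command line."""
    return ["gluster", "volume", "status", volname]


def poll_status(volname, interval_s=120, iterations=None,
                run=subprocess.run, sleep=time.sleep):
    """Yield (returncode, stdout) for each status call, pausing between calls.

    iterations=None polls until the caller stops consuming the generator.
    """
    done = 0
    while iterations is None or done < iterations:
        result = run(volume_status_cmd(volname), capture_output=True, text=True)
        yield result.returncode, result.stdout
        done += 1
        sleep(interval_s)


if __name__ == "__main__":
    # Volume name is an assumption inferred from the log excerpts.
    for rc, out in poll_status("bmxecv", interval_s=120):
        print(time.strftime("%Y-%m-%d %H:%M:%S"), "rc=%d" % rc)
        print(out)
```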

Actual results:
===============
One brick had stopped running about 1 hour after the IOs were started.

Expected results:


Additional info:

Comment 14 Eric Sandeen 2019-05-02 13:52:08 UTC
This is not an xfs issue; your storage is failing:

[Fri Apr 26 20:45:28 2019] megaraid_sas 0000:06:00.0: 102219 (609606928s/0x0001/FATAL) - Uncorrectable medium error logged for VD 03/3 at f22421a (on PD 02(e0x00/s2) at f22421a)
[Fri Apr 26 20:45:28 2019] sd 0:2:3:0: [sdd] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[Fri Apr 26 20:45:28 2019] sd 0:2:3:0: [sdd] tag#0 CDB: Read(10) 28 00 0f 22 42 00 00 00 20 00
[Fri Apr 26 20:45:28 2019] blk_update_request: I/O error, dev sdd, sector 253903360



xfs is simply reporting the IO error it received from storage.
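
The kernel messages above can be cross-checked against each other and against the brick log. The sketch below (Python, not part of the original comment) decodes the Read(10) CDB using the standard SCSI READ(10) layout and checks that the reported medium-error address falls inside the failed read. It assumes that the megaraid_sas address f22421a is a hexadecimal block address in the same units as the SCSI LBA, that the brick log timestamps are UTC, and that the kernel timestamps are local time at UTC+05:30; none of these is stated in the report.

```python
from datetime import datetime, timedelta, timezone


def decode_read10(cdb_hex):
    """Return (lba, block_count) from a 10-byte SCSI READ(10) CDB given as hex.

    Layout: byte 0 opcode (0x28), bytes 2-5 LBA, bytes 7-8 transfer length,
    all multi-byte fields big-endian.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")
    count = int.from_bytes(cdb[7:9], "big")
    return lba, count


# CDB bytes copied from the sd 0:2:3:0 line.
lba, count = decode_read10("28 00 0f 22 42 00 00 00 20 00")

# Address from the megaraid_sas line; treating it as a hex LBA is an assumption.
bad_block = int("f22421a", 16)
in_failed_read = lba <= bad_block < lba + count

# Assumption: kernel timestamps are local time at UTC+05:30, brick log is UTC.
ASSUMED_LOCAL_TZ = timezone(timedelta(hours=5, minutes=30))
kernel_time = datetime(2019, 4, 26, 20, 45, 28, tzinfo=ASSUMED_LOCAL_TZ)
brick_time = datetime(2019, 4, 26, 15, 15, 28, tzinfo=timezone.utc)
same_instant = kernel_time == brick_time

print("read starts at LBA", lba, "for", count, "blocks")
print("medium error block", bad_block, "inside failed read:", in_failed_read)
print("kernel and brick log timestamps match:", same_instant)
```

Under those assumptions the failed read covers LBAs 253903360 to 253903391, which starts at the sector in the blk_update_request line and contains the medium-error address 253903386, and 20:45:28 local time is the same instant as the 15:15:28 errors in the brick log.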

