Bug 1237038
| Summary: | bad brick daemon | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Lubos Trilety <ltrilety> |
| Component: | core | Assignee: | Bug Updates Notification Mailing List <rhs-bugs> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Anoop <annair> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | rhgs-3.1 | CC: | atumball, hanakp, rhs-bugs, smohan |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-02-05 08:58:49 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Lubos Trilety
2015-06-30 09:08:03 UTC
Hello,

Did you find the issue? We had a similar problem. Our server crashed, and since the crash we have not been able to add one particular brick to Gluster. When we try, the server panics. The same messages appear in the brick log just before each crash.

Setup: Gluster 3.1, 4 nodes, RHEL 7.1, NFS export.

Thank you,
Peter

[2016-05-01 09:35:33.435317] W [MSGID: 101105] [gfdb_sqlite3.h:240:gfdb_set_sql_params] 0-distrepvol-changetimerecorder: Failed to retrieve sql-db-pagesize from params.Assigning default value: 4096
[2016-05-01 09:35:33.435325] W [MSGID: 101105] [gfdb_sqlite3.h:240:gfdb_set_sql_params] 0-distrepvol-changetimerecorder: Failed to retrieve sql-db-cachesize from params.Assigning default value: 1000
[2016-05-01 09:35:33.435331] W [MSGID: 101105] [gfdb_sqlite3.h:240:gfdb_set_sql_params] 0-distrepvol-changetimerecorder: Failed to retrieve sql-db-journalmode from params.Assigning default value: wal
[2016-05-01 09:35:33.435336] W [MSGID: 101105] [gfdb_sqlite3.h:240:gfdb_set_sql_params] 0-distrepvol-changetimerecorder: Failed to retrieve sql-db-wal-autocheckpoint from params.Assigning default value: 1000
[2016-05-01 09:35:33.435342] W [MSGID: 101105] [gfdb_sqlite3.h:240:gfdb_set_sql_params] 0-distrepvol-changetimerecorder: Failed to retrieve sql-db-sync from params.Assigning default value: normal
[2016-05-01 09:35:33.435348] W [MSGID: 101105] [gfdb_sqlite3.h:240:gfdb_set_sql_params] 0-distrepvol-changetimerecorder: Failed to retrieve sql-db-autovacuum from params.Assigning default value: none
[2016-05-01 09:35:33.436379] I [trash.c:2363:init] 0-distrepvol-trash: no option specified for 'eliminate', using NULL
[2016-05-01 09:35:33.452832] W [MSGID: 101174] [graph.c:362:_log_if_unknown_option] 0-distrepvol-server: option 'rpc-auth.auth-glusterfs' is not recognized
[2016-05-01 09:35:33.452863] W [MSGID: 101174] [graph.c:362:_log_if_unknown_option] 0-distrepvol-server: option 'rpc-auth.auth-unix' is not recognized
[2016-05-01 09:35:33.452877] W [MSGID: 101174] [graph.c:362:_log_if_unknown_option] 0-distrepvol-server: option 'rpc-auth.auth-null' is not recognized
[2016-05-01 09:35:33.452918] W [MSGID: 101174] [graph.c:362:_log_if_unknown_option] 0-distrepvol-quota: option 'timeout' is not recognized
[2016-05-01 09:35:33.452956] W [MSGID: 101174] [graph.c:362:_log_if_unknown_option] 0-distrepvol-trash: option 'brick-path' is not recognized
[2016-05-01 09:35:33.459524] W [MSGID: 113026] [posix.c:1326:posix_mkdir] 0-distrepvol-posix: mkdir (/.trashcan/): gfid (00000000-0000-0000-0000-000000000005) isalready associated with directory (/gluster/brick1b_3/brick/.glusterfs/00/00/00000000-0000-0000-0000-000000000001/.trashcan). Hence,both directories will share same gfid and thiscan lead to inconsistencies.
[2016-05-01 09:35:33.459550] E [MSGID: 113027] [posix.c:1348:posix_mkdir] 0-distrepvol-posix: mkdir of /gluster/brick1b_3/brick/.trashcan/ failed [File exists]
[2016-05-01 09:35:33.459619] W [MSGID: 113026] [posix.c:1326:posix_mkdir] 0-distrepvol-posix: mkdir (/.trashcan/internal_op): gfid (00000000-0000-0000-0000-000000000006) isalready associated with directory (/gluster/brick1b_3/brick/.glusterfs/00/00/00000000-0000-0000-0000-000000000005/internal_op). Hence,both directories will share same gfid and thiscan lead to inconsistencies.
[2016-05-01 09:35:33.459630] E [MSGID: 113027] [posix.c:1348:posix_mkdir] 0-distrepvol-posix: mkdir of /gluster/brick1b_3/brick/.trashcan/internal_op failed [File exists]
[2016-05-01 09:35:33.459639] E [trash.c:387:trash_internal_op_mkdir_cbk] 0-distrepvol-trash: mkdir failed for internal op directory : File exists

Our issue was sorted out by Red Hat support.
To reiterate, the RCA of this issue:
The root cause is a problem in the XFS multi-block buffer logging mechanism. The issue is seen in your case because of the 16k directory block size.
The problem is that the buffer logging code marked an area of the bitmap associated with larger block sizes rather than with a multi-block buffer on a 4k block size filesystem.
Larger block sizes are supported on the ppc64 architecture. The marked areas of the bitmap are invalid for a 4k block size filesystem, so the logging code does not find valid regions in the buffer and does not allocate a log vector. The log code expects each logged object to have a log vector (lv) with the associated data, which ultimately leads to the crash when the CIL code finds an incorrectly constructed item without an lv.
Fixed by kernel update.
kernel-3.10.0-327.25.1.el7.x86_64
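
For anyone checking whether a node already runs a kernel at or past the fixed build, a minimal Python sketch follows. It is a simplified numeric comparison, not a full RPM version comparison, and it assumes an el7 release string of the form shown above and that later builds in the same RHEL 7 stream carry the fix. The helper names `parse_el7_release` and `has_fix` are hypothetical.

```python
import re

# Fixed kernel build named in this bug.
FIXED = "3.10.0-327.25.1.el7.x86_64"


def parse_el7_release(release):
    """Turn e.g. '3.10.0-327.25.1.el7.x86_64' into (3, 10, 0, 327, 25, 1).

    Assumption: the string looks like <version>-<build>.el7[...].
    """
    m = re.match(r"^(\d+(?:\.\d+)*)-(\d+(?:\.\d+)*)\.el7", release)
    if m is None:
        raise ValueError(f"not an el7 kernel release: {release!r}")
    return tuple(int(p) for p in (m.group(1) + "." + m.group(2)).split("."))


def has_fix(release, fixed=FIXED):
    """True if `release` compares numerically at or above the fixed build."""
    return parse_el7_release(release) >= parse_el7_release(fixed)


# Literal example strings only; they are not taken from the reporter's systems.
for candidate in ("3.10.0-327.el7.x86_64", FIXED):
    print(candidate, has_fix(candidate))
```

On a live node the release string could be taken from `platform.release()` (the same value `uname -r` prints) and passed to `has_fix`.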
Not seen in recent releases. Please re-open if seen in RHGS 3.3+ versions. |
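
If the problem is seen again, the brick log can be narrowed down to its error-level entries before attaching it. The sketch below is only an illustration: the line layout (timestamp, one-letter level, optional MSGID, source location, translator name, message) is inferred from the excerpt quoted in this bug rather than from a format specification, and `LOG_LINE` and `errors_only` are hypothetical names.

```python
import re

# Layout inferred from the excerpt in this bug:
# [timestamp] LEVEL [MSGID: n] [file:line:function] translator: message
# The MSGID field is optional (the trash.c entries do not carry one).
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]) "
    r"(?:\[MSGID: (?P<msgid>\d+)\] )?"
    r"\[(?P<src>[^\]]+)\] (?P<domain>\S+): (?P<msg>.*)$"
)


def errors_only(lines):
    """Return (timestamp, source, message) for every level-E entry."""
    found = []
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m and m.group("level") == "E":
            found.append((m.group("ts"), m.group("src"), m.group("msg")))
    return found


# Three entries copied from the log quoted in this bug.
sample = [
    "[2016-05-01 09:35:33.452956] W [MSGID: 101174] [graph.c:362:_log_if_unknown_option] 0-distrepvol-trash: option 'brick-path' is not recognized",
    "[2016-05-01 09:35:33.459550] E [MSGID: 113027] [posix.c:1348:posix_mkdir] 0-distrepvol-posix: mkdir of /gluster/brick1b_3/brick/.trashcan/ failed [File exists]",
    "[2016-05-01 09:35:33.459639] E [trash.c:387:trash_internal_op_mkdir_cbk] 0-distrepvol-trash: mkdir failed for internal op directory : File exists",
]

for ts, src, msg in errors_only(sample):
    print(ts, src, msg)
```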