Bug 1528558

Summary: /usr/sbin/glusterfs crashing on Red Hat OpenShift Container Platform node
Product: [Community] GlusterFS
Component: write-behind
Reporter: Raghavendra G <rgowdapp>
Assignee: Raghavendra G <rgowdapp>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Version: mainline
Hardware: x86_64
OS: Linux
Fixed In Version: glusterfs-4.0.0
CC: atumball, bkunal, csaba, ravishankar, rgowdapp, rhinduja, rhs-bugs, sreber, storage-qa-internal
Clone Of: 1516638
Bug Depends On: 1516638
Bug Blocks: 1529094, 1529095, 1529096
Last Closed: 2018-03-15 11:24:00 UTC
Type: Bug

Comment 1 Raghavendra G 2017-12-22 06:26:00 UTC
Description of problem:

Noticed a crash of /usr/sbin/glusterfs on two different Red Hat OpenShift Container Platform nodes.

(gdb) bt
#0  0x00007fba76602b17 in __wb_preprocess_winds () from /usr/lib64/glusterfs/3.8.4/xlator/performance/write-behind.so
#1  0x00007fba76602fdd in wb_process_queue () from /usr/lib64/glusterfs/3.8.4/xlator/performance/write-behind.so
#2  0x00007fba766036fc in wb_writev () from /usr/lib64/glusterfs/3.8.4/xlator/performance/write-behind.so
#3  0x00007fba763f3e35 in ra_writev () from /usr/lib64/glusterfs/3.8.4/xlator/performance/read-ahead.so
#4  0x00007fba846cd839 in default_writev () from /lib64/libglusterfs.so.0
#5  0x00007fba75fd942e in ioc_writev () from /usr/lib64/glusterfs/3.8.4/xlator/performance/io-cache.so
#6  0x00007fba75dce173 in qr_writev () from /usr/lib64/glusterfs/3.8.4/xlator/performance/quick-read.so
#7  0x00007fba846e6976 in default_writev_resume () from /lib64/libglusterfs.so.0
#8  0x00007fba84676138 in call_resume_wind () from /lib64/libglusterfs.so.0
#9  0x00007fba846765c5 in call_resume () from /lib64/libglusterfs.so.0
#10 0x00007fba75bc42e8 in open_and_resume () from /usr/lib64/glusterfs/3.8.4/xlator/performance/open-behind.so
#11 0x00007fba75bc4350 in ob_writev () from /usr/lib64/glusterfs/3.8.4/xlator/performance/open-behind.so
#12 0x00007fba759afd4d in mdc_writev () from /usr/lib64/glusterfs/3.8.4/xlator/performance/md-cache.so
#13 0x00007fba7578dfdc in io_stats_writev () from /usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so
#14 0x00007fba846cd839 in default_writev () from /lib64/libglusterfs.so.0
#15 0x00007fba755713c0 in meta_writev () from /usr/lib64/glusterfs/3.8.4/xlator/meta.so
#16 0x00007fba7ba40c4f in fuse_write_resume () from /usr/lib64/glusterfs/3.8.4/xlator/mount/fuse.so
#17 0x00007fba7ba37885 in fuse_resolve_all () from /usr/lib64/glusterfs/3.8.4/xlator/mount/fuse.so
#18 0x00007fba7ba37598 in fuse_resolve () from /usr/lib64/glusterfs/3.8.4/xlator/mount/fuse.so
#19 0x00007fba7ba378ce in fuse_resolve_all () from /usr/lib64/glusterfs/3.8.4/xlator/mount/fuse.so
#20 0x00007fba7ba36bb3 in fuse_resolve_continue () from /usr/lib64/glusterfs/3.8.4/xlator/mount/fuse.so
#21 0x00007fba7ba37516 in fuse_resolve () from /usr/lib64/glusterfs/3.8.4/xlator/mount/fuse.so
#22 0x00007fba7ba378ae in fuse_resolve_all () from /usr/lib64/glusterfs/3.8.4/xlator/mount/fuse.so
#23 0x00007fba7ba378f0 in fuse_resolve_and_resume () from /usr/lib64/glusterfs/3.8.4/xlator/mount/fuse.so
#24 0x00007fba7ba4d90e in fuse_thread_proc () from /usr/lib64/glusterfs/3.8.4/xlator/mount/fuse.so
#25 0x00007fba834b2e25 in start_thread (arg=0x7fba6ffff700) at pthread_create.c:308
#26 0x00007fba82d7f34d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Version-Release number of selected component (if applicable):

 - glusterfs-3.8.4-44.el7.x86_64
 - kernel-3.10.0-693.2.2.el7.x86_64

How reproducible:

 - Has happened only twice so far, both times on customer systems

Steps to Reproduce:
1. N/A

Actual results:

`glusterfs-fuse` client crashed, impacting a production application


Expected results:

`glusterfs-fuse` client should not crash

Additional info:

Comment 2 Raghavendra G 2017-12-22 06:27:32 UTC
(gdb) thr 1
[Switching to thread 1 (Thread 0x7fba6ffff700 (LWP 64279))]
#0  0x00007fba76602b17 in __wb_preprocess_winds (wb_inode=wb_inode@entry=0x7fba6434cd60) at write-behind.c:1325
1325			offset_expected = holder->stub->args.offset
(gdb) p *holder
$13 = {all = {next = 0x7fba65756520, prev = 0x7fba65756520}, todo = {next = 0x7fba643e8240, prev = 0x7fba6434cd88}, lie = {next = 0x7fba65756540, 
    prev = 0x7fba65756540}, winds = {next = 0x7fba65756550, prev = 0x7fba65756550}, unwinds = {next = 0x7fba65756560, prev = 0x7fba65756560}, wip = {
    next = 0x7fba65756570, prev = 0x7fba65756570}, stub = 0x0, write_size = 20480, orig_size = 131072, total_size = 0, op_ret = 131072, op_errno = 0, refcount = 0, 
  wb_inode = 0x7fba6434cd60, fop = GF_FOP_WRITE, lk_owner = {len = 8, data = '\000' <repeats 1023 times>}, iobref = 0x0, gen = 9, fd = 0x7fba648d99d0, wind_count = 9, 
  ordering = {size = 131072, off = 1179648, append = 0, tempted = -1, lied = -1, fulfilled = -1, go = 0}}
(gdb) p &holder->todo
$14 = (list_head_t *) 0x7fba65756530
(gdb) p holder->lie
$15 = {next = 0x7fba65756540, prev = 0x7fba65756540}
(gdb) p &holder->lie
$16 = (list_head_t *) 0x7fba65756540
(gdb) p holder->todo
$17 = {next = 0x7fba643e8240, prev = 0x7fba6434cd88}
(gdb) p &holder->todo
$18 = (list_head_t *) 0x7fba65756530
(gdb) p holder->winds
$19 = {next = 0x7fba65756550, prev = 0x7fba65756550}
(gdb) p &holder->winds
$20 = (list_head_t *) 0x7fba65756550
(gdb) p holder->unwinds
$21 = {next = 0x7fba65756560, prev = 0x7fba65756560}
(gdb) p &holder->unwinds
$22 = (list_head_t *) 0x7fba65756560
(gdb) p holder->wip
$23 = {next = 0x7fba65756570, prev = 0x7fba65756570}
(gdb) p &holder->wip
$24 = (list_head_t *) 0x7fba65756570
(gdb) p holder->stub
$25 = (call_stub_t *) 0x0

Some observations:
* holder->stub is NULL, which is what causes the crash (see the simplified sketch below)
* holder->refcount is 0, so "holder" has likely been freed
* holder->todo is not an empty list; however, holder is not on any of the other lists
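
For reference, the faulting statement at write-behind.c:1325 (frame #0 and the source line shown above) reads through holder->stub. Below is a minimal, simplified illustration of the failing access (not the full function, just the shape of the dereference), assuming holder has been picked off wb_inode->todo:

        /* write-behind.c:1325, simplified: holder is a request taken from wb_inode->todo */
        offset_expected = holder->stub->args.offset;   /* SIGSEGV here, since holder->stub is NULL */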

Comment 3 Raghavendra G 2017-12-22 06:28:38 UTC
Note that the wind_count for holder is 9. This means the request was retried multiple times to sync the content to the backend. This can be due to two reasons:
* previous syncs failed
* previous syncs succeeded only partially, resulting in a short write

Note that holder->orig_size is 131072, but holder->write_size is only 20480, which is less than orig_size. This means we ran into case 2 above. Looking through how write-behind handles short writes, I saw this code:

        list_for_each_entry_safe (req, next, &head->winds, winds) {
                accounted_size = __wb_fulfill_short_write (req, size,
                                                           &fulfilled);
                size -= accounted_size;

                if (size == 0) {
                        if (fulfilled)
                                req = next;
                        break;
                }
        }

Note that if the flag "fulfilled" is true, the entire content of the request has been written to the backend and the request can potentially be freed. Also, as the contents are written, there is no need to "retry" that request, so we move the "head" of the "requests to be retried" (tracked by the variable req) to the next request in the list. However, the variable "fulfilled" is never reset for each member of the list. This means that if _any_ of the previous requests was fulfilled, then when we break from the loop (if (size == 0)), the "head" is moved to "next". This has some interesting consequences:

* If we break from the loop while processing the last member of the list head->winds, req is reset to head, as the list is circular. However, head is already fulfilled and can potentially be freed, so we end up adding a freed request to the wb_inode->todo list. This is the RCA for the crash tracked by this bug (note that the "holder" we saw on the todo list had been freed).

* If we break from the loop while processing any member other than the last one of the list head->winds, req is set to the next member in the list, skipping the current request even though it is not entirely synced. This can lead to data corruption.

The fix is simple: change the code so that "fulfilled" reflects only whether the current request is fulfilled, rather than carrying the history of previous requests in the list.
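
One way to realize that, sketched against the snippet quoted above, is to clear "fulfilled" at the top of every iteration so that it only describes the request currently being processed. This is an illustration of the idea, not the exact committed diff; see the review linked in the next comment for the authoritative change:

        list_for_each_entry_safe (req, next, &head->winds, winds) {
                fulfilled = 0;  /* reset per request; must not carry over from earlier members */

                accounted_size = __wb_fulfill_short_write (req, size,
                                                           &fulfilled);
                size -= accounted_size;

                if (size == 0) {
                        /* "fulfilled" now refers only to the current req, so the head of
                         * the requests-to-retry is advanced only when this request was
                         * written completely. */
                        if (fulfilled)
                                req = next;
                        break;
                }
        }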

Comment 4 Worker Ant 2017-12-22 07:05:17 UTC
REVIEW: https://review.gluster.org/19064 (performance/write-behind: fix bug while handling short writes) posted (#1) for review on master by Raghavendra G

Comment 5 Worker Ant 2017-12-22 17:45:28 UTC
COMMIT: https://review.gluster.org/19064 committed in master by "Raghavendra G" <rgowdapp> with a commit message- performance/write-behind: fix bug while handling short writes

The variable "fulfilled" in wb_fulfill_short_write is not reset to 0
while handling every member of the list.

This has some interesting consequences:

* If we break from the loop while processing last member of the list
  head->winds, req is reset to head as the list is a circular
  one. However, head is already fulfilled and can potentially be
  freed. So, we end up adding a freed request to wb_inode->todo
  list. This is the RCA for the crash tracked by the bug associated
  with this patch (Note that we saw "holder" which is freed in todo
  list).

* If we break from the loop while processing any of the last but one
  member of the list head->winds, req is set to next member in the
  list, skipping the current request, even though it is not entirely
  synced. This can lead to data corruption.

The fix is very simple and we've to change the code to make sure
"fulfilled" reflects whether the current request is fulfilled or not
and it doesn't carry history of previous requests in the list.

Change-Id: Ia3d6988175a51c9e08efdb521a7b7938b01f93c8
BUG: 1528558
Signed-off-by: Raghavendra G <rgowdapp>

Comment 6 Shyamsundar 2018-03-15 11:24:00 UTC
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-4.0.0, please open a new bug report.

glusterfs-4.0.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2018-March/000092.html
[2] https://www.gluster.org/pipermail/gluster-users/

Comment 7 Xavi Hernandez 2019-11-20 10:32:02 UTC
*** Bug 1434332 has been marked as a duplicate of this bug. ***