Bug 1449245

Summary: [gluster-block]: vmcore generated when deleting a block with ha 3 when one of the nodes is down
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: surabhi <sbhaloth>
Component: tcmu-runner Assignee: Prasanna Kumar Kalever <prasanna.kalever>
Status: CLOSED ERRATA QA Contact: Sweta Anandpara <sanandpa>
Severity: high Docs Contact:
Priority: unspecified    
Version: rhgs-3.3 CC: amukherj, kramdoss, prasanna.kalever, rcyriac, rhs-bugs, sanandpa, storage-qa-internal
Target Milestone: ---   
Target Release: RHGS 3.3.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: rhel-fix
Fixed In Version: tcmu-runner-1.2.0-9.el7rhgs Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1468963 (view as bug list) Environment:
Last Closed: 2017-09-21 04:17:54 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1430225    
Bug Blocks: 1417151, 1468963    

Description surabhi 2017-05-09 13:21:00 UTC
Description of problem:
******************************
A vmcore was generated when I tried to delete a block that was created with HA 3 while one of the nodes was down.

This issue has been seen multiple times: the VM becomes inaccessible for a while and there is a crash.

The above is one of the scenarios where this is observed. The exact steps to reproduce are not known, but it is happening quite often in my setup.
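
For reference, a block with HA 3 is created roughly along the following lines (the volume/block names, host addresses and size here are placeholders, not the exact ones from this setup):

  gluster-block create <volname>/<blockname> ha 3 HOST1,HOST2,HOST3 1GiB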

************************************


[251127.647553] CPU: 1 PID: 3853 Comm: tcmu-runner Not tainted 3.10.0-514.16.1.el7.x86_64 #1
[251127.647595] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[251127.647627] task: ffff880036558000 ti: ffff8802307c4000 task.ti: ffff8802307c4000
[251127.647666] RIP: 0010:[<ffffffff810fb7ed>]  [<ffffffff810fb7ed>] module_put+0x1d/0x80
[251127.647715] RSP: 0018:ffff8802307c7e60  EFLAGS: 00010282
[251127.647748] RAX: dead000000000100 RBX: ffff88023fec9840 RCX: 0000000000000000
[251127.647784] RDX: ffffffffa06f8820 RSI: ffff88020ce0b048 RDI: ffff880228b49240
[251127.647819] RBP: ffff8802307c7e78 R08: 0000000000000000 R09: 0000000000000000
[251127.647854] R10: ffff88020ce0b048 R11: ffff880227ff5c10 R12: ffff880228b49240
[251127.647900] R13: 0000000000000000 R14: ffff8802245d9a80 R15: ffff88023feabd20
[251127.647936] FS:  00007f70765cf800(0000) GS:ffff88023fd00000(0000) knlGS:0000000000000000
[251127.647975] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[251127.648004] CR2: 00007f7030002000 CR3: 000000022572e000 CR4: 00000000000006e0
[251127.648045] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[251127.648080] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[251127.648114] Stack:
[251127.648129]  ffff88023fec9840 ffff880228b49840 0000000000000000 ffff8802307c7ea0
[251127.648171]  ffffffffa0654530 ffff880227ff5c00 0000000000000008 ffff88020ce0b048
[251127.648211]  ffff8802307c7ee8 ffffffff81200589 ffff88020ce0b048 ffff880227ff5c10
[251127.648251] Call Trace:
[251127.648275]  [<ffffffffa0654530>] uio_release+0x40/0x60 [uio]
[251127.648308]  [<ffffffff81200589>] __fput+0xe9/0x260
[251127.648334]  [<ffffffff8120083e>] ____fput+0xe/0x10
[251127.648362]  [<ffffffff810ad1e7>] task_work_run+0xa7/0xe0
[251127.648393]  [<ffffffff8102ab22>] do_notify_resume+0x92/0xb0
[251127.648424]  [<ffffffff8169733d>] int_signal+0x12/0x17
[251127.648449] Code: eb bf 66 90 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 85 ff 48 89 e5 41 55 41 54 49 89 fc 53 74 1c 48 8b 87 30 02 00 00 <65> 48 ff 40 08 4c 8b 6d 08 66 66 66 66 90 41 83 3c 24 02 74 38 
[251127.649711] RIP  [<ffffffff810fb7ed>] module_put+0x1d/0x80
[251127.650220]  RSP <ffff8802307c7e60>

Could this be related to tcmu-runner not running on one of the nodes?


Version-Release number of selected component (if applicable):

tcmu-runner-1.2.0-1.el7rhgs.x86_64


How reproducible:
Multiple times.

Steps to Reproduce:
Seen once with the following steps; not sure of the exact reproducer (a rough command sketch follows the steps).

1. Delete the block that was created with HA 3.
2. Out of the 3 nodes, bring down gluster-blockd on one node and check the status of the delete.
3. The node becomes inaccessible for a while.
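
A rough command sequence for the above steps (the volume/block names are placeholders, and the exact timing and ordering on the original setup may differ):

  # from one of the nodes, issue the delete of the HA-3 block
  gluster-block delete <volname>/<blockname>

  # on one of the 3 HA nodes, bring down the block daemon
  systemctl stop gluster-blockd

  # back on the initiating node, check whether the delete went through
  gluster-block list <volname>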

Actual results:
***************
The block does not get deleted, and a vmcore is generated (core dumped).


Expected results:
*****************

There should not be any crash.

Additional info:
Sosreports and core dump to follow soon.

Comment 9 Atin Mukherjee 2017-06-20 10:42:15 UTC
upstream patch  : https://review.gluster.org/#/c/17545/

Comment 10 Atin Mukherjee 2017-06-20 10:59:16 UTC
(In reply to Atin Mukherjee from comment #9)
> upstream patch  : https://review.gluster.org/#/c/17545/

ignore this.

Comment 17 Sweta Anandpara 2017-07-24 09:07:49 UTC
Tested and verified this on the builds glusterfs-3.8.4-33 and gluster-block-0.2.1-6.

Hitting another issue with respect to the failed delete (bz 1474256), but the VM crash mentioned in this bug is not seen when tried multiple times.

Moving this bug to verified in 3.3.

Comment 20 errata-xmlrpc 2017-09-21 04:17:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:2773