Bug 1449245 - [gluster-block]:vmcore generated when deleting the block with ha 3 when one of the node is down
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: tcmu-runner
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.3.0
Assignee: Prasanna Kumar Kalever
QA Contact: Sweta Anandpara
URL:
Whiteboard: rhel-fix
Depends On: 1430225
Blocks: 1417151 1468963
 
Reported: 2017-05-09 13:21 UTC by surabhi
Modified: 2017-09-21 04:17 UTC
CC List: 7 users

Fixed In Version: tcmu-runner-1.2.0-9.el7rhgs
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1468963
Environment:
Last Closed: 2017-09-21 04:17:54 UTC
Embargoed:




Links
System: Red Hat Product Errata
ID: RHEA-2017:2773
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: new packages: gluster-block
Last Updated: 2017-09-21 08:16:22 UTC

Description surabhi 2017-05-09 13:21:00 UTC
Description of problem:
******************************
A vmcore is generated when deleting a block that was created with HA 3 while one of the nodes is down.

This issue has been seen multiple times: the VM becomes inaccessible for a while and then there is a crash.

The above is one scenario in which this is observed. The exact steps to reproduce are not known, but it happens quite often on this setup.

************************************


[251127.647553] CPU: 1 PID: 3853 Comm: tcmu-runner Not tainted 3.10.0-514.16.1.el7.x86_64 #1
[251127.647595] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[251127.647627] task: ffff880036558000 ti: ffff8802307c4000 task.ti: ffff8802307c4000
[251127.647666] RIP: 0010:[<ffffffff810fb7ed>]  [<ffffffff810fb7ed>] module_put+0x1d/0x80
[251127.647715] RSP: 0018:ffff8802307c7e60  EFLAGS: 00010282
[251127.647748] RAX: dead000000000100 RBX: ffff88023fec9840 RCX: 0000000000000000
[251127.647784] RDX: ffffffffa06f8820 RSI: ffff88020ce0b048 RDI: ffff880228b49240
[251127.647819] RBP: ffff8802307c7e78 R08: 0000000000000000 R09: 0000000000000000
[251127.647854] R10: ffff88020ce0b048 R11: ffff880227ff5c10 R12: ffff880228b49240
[251127.647900] R13: 0000000000000000 R14: ffff8802245d9a80 R15: ffff88023feabd20
[251127.647936] FS:  00007f70765cf800(0000) GS:ffff88023fd00000(0000) knlGS:0000000000000000
[251127.647975] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[251127.648004] CR2: 00007f7030002000 CR3: 000000022572e000 CR4: 00000000000006e0
[251127.648045] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[251127.648080] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[251127.648114] Stack:
[251127.648129]  ffff88023fec9840 ffff880228b49840 0000000000000000 ffff8802307c7ea0
[251127.648171]  ffffffffa0654530 ffff880227ff5c00 0000000000000008 ffff88020ce0b048
[251127.648211]  ffff8802307c7ee8 ffffffff81200589 ffff88020ce0b048 ffff880227ff5c10
[251127.648251] Call Trace:
[251127.648275]  [<ffffffffa0654530>] uio_release+0x40/0x60 [uio]
[251127.648308]  [<ffffffff81200589>] __fput+0xe9/0x260
[251127.648334]  [<ffffffff8120083e>] ____fput+0xe/0x10
[251127.648362]  [<ffffffff810ad1e7>] task_work_run+0xa7/0xe0
[251127.648393]  [<ffffffff8102ab22>] do_notify_resume+0x92/0xb0
[251127.648424]  [<ffffffff8169733d>] int_signal+0x12/0x17
[251127.648449] Code: eb bf 66 90 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 85 ff 48 89 e5 41 55 41 54 49 89 fc 53 74 1c 48 8b 87 30 02 00 00 <65> 48 ff 40 08 4c 8b 6d 08 66 66 66 66 90 41 83 3c 24 02 74 38 
[251127.649711] RIP  [<ffffffff810fb7ed>] module_put+0x1d/0x80
[251127.650220]  RSP <ffff8802307c7e60>
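
For reference, a minimal sketch of how a vmcore such as the one above can be inspected with the crash utility, assuming kdump captured the dump and the matching kernel-debuginfo package is installed (the crash directory name below is illustrative, not from this report):

# Kernel version taken from the trace above; the dump directory differs per setup.
crash /usr/lib/debug/lib/modules/3.10.0-514.16.1.el7.x86_64/vmlinux \
      /var/crash/<host>-<timestamp>/vmcore

# Useful commands at the crash> prompt:
#   bt    - backtrace of the crashing task (tcmu-runner, PID 3853 above)
#   log   - kernel ring buffer, including the RIP/Call Trace lines shown here
#   mod   - list loaded modules (uio, target_core_user, ...)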

This looks like it is related to tcmu-runner not running on one of the nodes?


Version-Release number of selected component (if applicable):

tcmu-runner-1.2.0-1.el7rhgs.x86_64


How reproducible:
multiple times.

Steps to Reproduce:
Seen once with the following steps; the exact reproducer is not certain. (A hedged command sketch follows the list.)

1. Delete a block that was created with HA 3.
2. Out of the 3 nodes, bring down gluster-blockd on one node and check the status of the delete.
3. The node becomes inaccessible for a while.
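
A minimal sketch of the commands behind these steps, assuming a Gluster volume named blockvol, three hosts node1/node2/node3, and a 1 GiB block (names and size are illustrative, not from this report):

# Create the block with HA 3 (hypothetical volume, block, and host names):
gluster-block create blockvol/block1 ha 3 node1,node2,node3 1GiB

# Bring down gluster-blockd on one of the three nodes, e.g. node3:
systemctl stop gluster-blockd

# From another node, attempt the delete and check its status:
gluster-block delete blockvol/block1
gluster-block list blockvol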

Actual results:
***************
The block does not get deleted; a vmcore is generated and a core is dumped.


Expected results:
*****************

There should not be any crash.

Additional info:
Sosreports and core dump to follow soon.

Comment 9 Atin Mukherjee 2017-06-20 10:42:15 UTC
upstream patch  : https://review.gluster.org/#/c/17545/

Comment 10 Atin Mukherjee 2017-06-20 10:59:16 UTC
(In reply to Atin Mukherjee from comment #9)
> upstream patch  : https://review.gluster.org/#/c/17545/

ignore this.

Comment 17 Sweta Anandpara 2017-07-24 09:07:49 UTC
Tested and verified this on the builds glusterfs-3.8.4-33 and gluster-block-0.2.1-6.

Hitting another issue with respect to a failed delete (bz 1474256). The VM crash mentioned in this bug was not seen when tried multiple times.

Moving this bug to Verified in 3.3.
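
As a side note, a quick way to confirm the builds installed on a node (a sketch; package names are the ones referenced in this bug):

rpm -q glusterfs gluster-block tcmu-runner
# The fixed tcmu-runner build for this bug is tcmu-runner-1.2.0-9.el7rhgs (see Fixed In Version).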

Comment 20 errata-xmlrpc 2017-09-21 04:17:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:2773

