Bug 1468506 - gluster-block: block delete causing kernel crash/reboot due to page_fault
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: gluster-block
Version: 3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: RHGS 3.3.0
Assigned To: Prasanna Kumar Kalever
QA Contact: Sweta Anandpara
Depends On:
Blocks: 1417151 1468990
Reported: 2017-07-07 05:32 EDT by Prasanna Kumar Kalever
Modified: 2017-09-21 00:20 EDT
CC List: 9 users

See Also:
Fixed In Version: gluster-block-0.2.1-5.el7rhgs
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1468990
Environment:
Last Closed: 2017-09-21 00:20:54 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Prasanna Kumar Kalever 2017-07-07 05:32:53 EDT
Description of problem:

Noticed a node reboot caused by a kernel page fault:
[171884.552080] BUG: unable to handle kernel NULL pointer dereference at (null)


Action:
Delete the block (which had been in use for about two days of I/O).


Observation from userspace:
---------------------------
[root@gprfc076 mnt]# ps -aux | grep target
root       923  0.0  0.0      0     0 ?        S<   Jun27   0:00 [target_completi]
root      7167  0.0  0.0 117300  1448 ?        S    03:30   0:00 sh -c targetcli /backstores/user:glfs delete block && targetcli /iscsi delete iqn.2016-12.org.gluster-block:d81024d7-21a6-4a8c-ac11-06fe56fee9d6 && targetcli / saveconfig > /dev/null
root      7168  0.0  0.0 258008 15840 ?        D    03:30   0:00 /usr/bin/python /usr/bin/targetcli /backstores/user:glfs delete block
root      7191  0.0  0.0 255072 14652 pts/6    D+   03:31   0:00 /usr/bin/python /usr/bin/targetcli ls
root      7289  0.0  0.0 114712   972 pts/9    S+   03:36   0:00 grep --color=auto target

Note that the targetcli delete command went into uninterruptible sleep (D state) and hung. A tcmu-runner segfault was also observed at the same time; unfortunately abrtd was not running on that machine, so a core dump could not be captured, sorry about this.
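
A minimal shell sketch (not part of the original report) for spotting this state on an affected node; it assumes the oops text is still in the kernel ring buffer:

# List processes stuck in uninterruptible sleep (state D), such as the hung targetcli delete above
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'

# Check the kernel ring buffer for the NULL-dereference oops in target_core_user
dmesg | grep -E 'BUG: unable to handle kernel NULL pointer|tcmu_vma_fault'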


tcmu-runner logs:
---------------
2017-06-28 12:10:58.584 5965 [DEBUG] main:808 : handler path: /usr/lib64/tcmu-runner
2017-06-28 12:10:58.656 5965 [DEBUG] load_our_module:524 : Module 'target_core_user' is already loaded
2017-06-28 12:10:58.670 5965 [DEBUG] main:821 : 1 runner handlers found
2017-06-28 12:10:59.976 5965 [DEBUG] dbus_bus_acquired:437 : bus org.kernel.TCMUService1 acquired
2017-06-28 12:10:59.977 5965 [DEBUG] dbus_name_acquired:453 : name org.kernel.TCMUService1 acquired
2017-06-29 03:30:33.444 5965 [DEBUG] handle_netlink:127 : cmd 2. Got header version 2. Supported 2.


Kernel Oops:
-----------
[...]
[171884.552188] Oops: 0000 [#1] SMP 
[171884.552207] Modules linked in: fuse loop target_core_pscsi target_core_file target_core_iblock iscsi_target_mod scsi_transport_iscsi ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter target_core_user target_core_mod uio dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper
[171884.552583]  iTCO_wdt ablk_helper cryptd ipmi_ssif iTCO_vendor_support mei_me pcspkr sg dcdbas joydev ipmi_si ipmi_devintf wmi ipmi_msghandler mei acpi_power_meter acpi_pad lpc_ich shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ixgbe ahci libahci libata crct10dif_pclmul crct10dif_common crc32c_intel tg3 megaraid_sas mdio i2c_core dca ptp pps_core dm_mirror dm_region_hash dm_log dm_mod
[171884.552867] CPU: 2 PID: 7294 Comm: tcmu-runner Not tainted 3.10.0-686.el7.test.x86_64 #1
[171884.552903] Hardware name: Dell Inc. PowerEdge R620/0KCKR5, BIOS 1.3.6 09/11/2012
[171884.552936] task: ffff8810092c3f40 ti: ffff880540f0c000 task.ti: ffff880540f0c000
[171884.552968] RIP: 0010:[<ffffffffc05397c2>]  [<ffffffffc05397c2>] tcmu_vma_fault+0x72/0xf0 [target_core_user]
[171884.553014] RSP: 0000:ffff880540f0fd58  EFLAGS: 00010246
[171884.553038] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff880000000400
[171884.553069] RDX: ffff880000000250 RSI: 00003ffffffff000 RDI: 0000000000000000
[171884.553100] RBP: ffff880540f0fd68 R08: 0000000000000000 R09: ffff880540f0fde8
[171884.553131] R10: 0000000000000002 R11: 0000000000000000 R12: ffff880540f0fd80
[171884.553162] R13: ffff88101b6616c8 R14: 0000000000000000 R15: ffff88081f3fe398
[171884.553193] FS:  00007f1cab944880(0000) GS:ffff88081fa40000(0000) knlGS:0000000000000000
[171884.553228] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[171884.553254] CR2: 0000000000000000 CR3: 0000000544b96000 CR4: 00000000000407e0
[171884.553285] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[171884.553315] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[171884.553346] Stack:
[171884.553357]  0000000000000000 ffff880540f0fde8 ffff880540f0fdc8 ffffffff811ad122
[171884.554683]  0000000000000000 ffff8810000000a8 0000000000000000 00007f1c8e6ef000
[171884.556000]  0000000000000000 0000000000000000 ffff88081f3fe398 0000000026551ff4
[171884.557312] Call Trace:
[171884.558616]  [<ffffffff811ad122>] __do_fault+0x52/0xe0
[171884.559912]  [<ffffffff811ad5cb>] do_read_fault.isra.44+0x4b/0x130
[171884.561204]  [<ffffffff811b1ed1>] handle_mm_fault+0x691/0x1010
[171884.562484]  [<ffffffff811b8c9e>] ? do_mmap_pgoff+0x31e/0x3e0
[171884.563743]  [<ffffffff816aef74>] __do_page_fault+0x154/0x450
[171884.564984]  [<ffffffff816af2a5>] do_page_fault+0x35/0x90
[171884.566218]  [<ffffffff816ab4c8>] page_fault+0x28/0x30
[171884.567428] Code: d7 48 63 d2 48 8d 04 52 48 c1 e7 0c 48 c1 e0 04 48 01 c6 48 03 be 80 10 00 00 83 be 90 10 00 00 02 74 36 e8 31 6c c8 c0 48 89 c3 <48> 8b 03 f6 c4 80 75 59 f0 ff 43 1c 48 8b 03 a9 00 00 00 80 74 
[171884.569986] RIP  [<ffffffffc05397c2>] tcmu_vma_fault+0x72/0xf0 [target_core_user]
[171884.571233]  RSP <ffff880540f0fd58>
[171884.572457] CR2: 0000000000000000
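
For triage, a small sketch (assuming it is run on the affected node) to record which kernel and target_core_user module build hit the fault, since tcmu_vma_fault lives in that module:

# Kernel build that produced the oops (should match the "Not tainted 3.10.0-686.el7.test.x86_64" line above)
uname -r

# Confirm the target_core_user module is loaded and where it was loaded from
lsmod | grep target_core_user
modinfo target_core_user | grep -iE '^(filename|vermagic)'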

How reproducible:
Hit only once so far; the chances of reproducing it appear to be very rare.
Comment 5 Prasanna Kumar Kalever 2017-07-10 01:48:00 EDT
Patch:
https://review.gluster.org/#/c/17725/
Comment 10 Sweta Anandpara 2017-07-14 02:00:36 EDT
Prasanna, please refer to comment 9.
Comment 13 Sweta Anandpara 2017-07-17 07:06:42 EDT
Tested and verified this on the build glusterfs-3.8.4-33 and gluster-block-0.2.1-6.

Gluster-block create and delete work without any issues, and one round of health checks has also been done with gluster-block on the said bits.
As mentioned in comment 11, the attribute cmd_time_out is set to zero for all newly created blocks.

Moving this bug to verified based on comment 11 and the logs pasted below (a short sketch for looping the same cmd_time_out check over every block follows the logs):

[root@dhcp47-115 ~]# targetcli /backstores/user:glfs/nb21 get attribute cmd_time_out
cmd_time_out=0 
[root@dhcp47-115 ~]# targetcli /backstores/user:glfs/nb50 get attribute cmd_time_out
cmd_time_out=0 
[root@dhcp47-115 ~]# 
[root@dhcp47-115 ~]# rpm -qa | grep gluster
glusterfs-cli-3.8.4-33.el7rhgs.x86_64
glusterfs-rdma-3.8.4-33.el7rhgs.x86_64
python-gluster-3.8.4-33.el7rhgs.noarch
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
glusterfs-client-xlators-3.8.4-33.el7rhgs.x86_64
glusterfs-fuse-3.8.4-33.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-events-3.8.4-33.el7rhgs.x86_64
gluster-block-0.2.1-6.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.2.0-14.el7.x86_64
gluster-nagios-addons-0.2.9-1.el7rhgs.x86_64
samba-vfs-glusterfs-4.6.3-3.el7rhgs.x86_64
glusterfs-3.8.4-33.el7rhgs.x86_64
glusterfs-debuginfo-3.8.4-26.el7rhgs.x86_64
glusterfs-api-3.8.4-33.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-33.el7rhgs.x86_64
glusterfs-libs-3.8.4-33.el7rhgs.x86_64
glusterfs-server-3.8.4-33.el7rhgs.x86_64
[root@dhcp47-115 ~]# gluster-block list nash
nb21
nb22
nb23
nb24
nb25
nb26
nb27
nb28
nb29
nb30
nb31
nb32
nb33
nb34
nb35
nb36
nb37
nb38
nb39
nb40
nb41
nb42
nb43
nb44
nb45
nb46
nb47
nb48
nb49
nb50
[root@dhcp47-115 ~]# gluster v info nash
 
Volume Name: nash
Type: Replicate
Volume ID: f1ea3d3e-c536-4f36-b61f-cb9761b8a0a6
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.47.115:/bricks/brick4/nash0
Brick2: 10.70.47.116:/bricks/brick4/nash1
Brick3: 10.70.47.117:/bricks/brick4/nash2
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
performance.open-behind: off
performance.readdir-ahead: off
network.remote-dio: enable
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
server.allow-insecure: on
cluster.brick-multiplex: disable
cluster.enable-shared-storage: enable
[root@dhcp47-115 ~]# gluster pool list
UUID					Hostname                         	State
49610061-1788-4cbc-9205-0e59fe91d842	dhcp47-121.lab.eng.blr.redhat.com	Connected 
a0557927-4e5e-4ff7-8dce-94873f867707	dhcp47-113.lab.eng.blr.redhat.com	Connected 
c0dac197-5a4d-4db7-b709-dbf8b8eb0896	dhcp47-114.lab.eng.blr.redhat.com	Connected 
a96e0244-b5ce-4518-895c-8eb453c71ded	dhcp47-116.lab.eng.blr.redhat.com	Connected 
17eb3cef-17e7-4249-954b-fc19ec608304	dhcp47-117.lab.eng.blr.redhat.com	Connected 
f828fdfa-e08f-4d12-85d8-2121cafcf9d0	localhost                        	Connected 
[root@dhcp47-115 ~]#
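
As referenced above, a short sketch (not part of the original verification; it assumes each block's backstore is named /backstores/user:glfs/<block-name>, as in the outputs above) to repeat the cmd_time_out check for every block in the nash volume:

# Check cmd_time_out for every block in the 'nash' volume
for blk in $(gluster-block list nash); do
    targetcli /backstores/user:glfs/${blk} get attribute cmd_time_out
done

On a fixed build, every line of output is expected to read cmd_time_out=0, as in the two samples above.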
Comment 15 errata-xmlrpc 2017-09-21 00:20:54 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:2773
