Description of problem:
Noticed a node reboot due to a page fault:

[171884.552080] BUG: unable to handle kernel NULL pointer dereference at (null)

Action: Deleted the block (it had been used for about 2 days for some IO).

Observation from userspace:
---------------------------
[root@gprfc076 mnt]# ps -aux | grep target
root       923  0.0  0.0      0     0 ?        S<   Jun27   0:00 [target_completi]
root      7167  0.0  0.0 117300  1448 ?        S    03:30   0:00 sh -c targetcli /backstores/user:glfs delete block && targetcli /iscsi delete iqn.2016-12.org.gluster-block:d81024d7-21a6-4a8c-ac11-06fe56fee9d6 && targetcli / saveconfig > /dev/null
root      7168  0.0  0.0 258008 15840 ?        D    03:30   0:00 /usr/bin/python /usr/bin/targetcli /backstores/user:glfs delete block
root      7191  0.0  0.0 255072 14652 pts/6    D+   03:31   0:00 /usr/bin/python /usr/bin/targetcli ls
root      7289  0.0  0.0 114712   972 pts/9    S+   03:36   0:00 grep --color=auto target

Note that the targetcli delete command went into "uninterruptible sleep" (the "D" state above) and hung. I also noticed tcmu-runner segfault at this time; unfortunately abrtd was not running on that machine, so I could not get a core dump of it, sorry about this.

tcmu-runner logs:
-----------------
2017-06-28 12:10:58.584 5965 [DEBUG] main:808 : handler path: /usr/lib64/tcmu-runner
2017-06-28 12:10:58.656 5965 [DEBUG] load_our_module:524 : Module 'target_core_user' is already loaded
2017-06-28 12:10:58.670 5965 [DEBUG] main:821 : 1 runner handlers found
2017-06-28 12:10:59.976 5965 [DEBUG] dbus_bus_acquired:437 : bus org.kernel.TCMUService1 acquired
2017-06-28 12:10:59.977 5965 [DEBUG] dbus_name_acquired:453 : name org.kernel.TCMUService1 acquired
2017-06-29 03:30:33.444 5965 [DEBUG] handle_netlink:127 : cmd 2. Got header version 2. Supported 2.

Kernel Oops:
------------
[...]
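For context, the hung commands above sit in "D" (uninterruptible sleep) state, where they cannot be killed by signals. A minimal sketch of filtering such tasks follows; it runs against a captured snapshot of the relevant ps columns so it is self-contained, and on a live node the same awk filter would instead be fed from `ps -eo state=,pid=,comm=`:

```shell
# Snapshot of the relevant columns (state, pid, command) from the
# ps output above -- a stand-in for live `ps -eo state=,pid=,comm=`
snapshot='S 923 target_completi
D 7168 targetcli
D 7191 targetcli
S 7289 grep'

# Keep only tasks whose state begins with "D" (uninterruptible sleep)
hung=$(printf '%s\n' "$snapshot" | awk '$1 ~ /^D/ {print $2}')
echo "$hung"
```

Against the snapshot above this prints the PIDs 7168 and 7191, matching the two hung targetcli processes in the report.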
[171884.552188] Oops: 0000 [#1] SMP
[171884.552207] Modules linked in: fuse loop target_core_pscsi target_core_file target_core_iblock iscsi_target_mod scsi_transport_iscsi ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter target_core_user target_core_mod uio dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper
[171884.552583]  iTCO_wdt ablk_helper cryptd ipmi_ssif iTCO_vendor_support mei_me pcspkr sg dcdbas joydev ipmi_si ipmi_devintf wmi ipmi_msghandler mei acpi_power_meter acpi_pad lpc_ich shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ixgbe ahci libahci libata crct10dif_pclmul crct10dif_common crc32c_intel tg3 megaraid_sas mdio i2c_core dca ptp pps_core dm_mirror dm_region_hash dm_log dm_mod
[171884.552867] CPU: 2 PID: 7294 Comm: tcmu-runner Not tainted 3.10.0-686.el7.test.x86_64 #1
[171884.552903] Hardware name: Dell Inc. PowerEdge R620/0KCKR5, BIOS 1.3.6 09/11/2012
[171884.552936] task: ffff8810092c3f40 ti: ffff880540f0c000 task.ti: ffff880540f0c000
[171884.552968] RIP: 0010:[<ffffffffc05397c2>]  [<ffffffffc05397c2>] tcmu_vma_fault+0x72/0xf0 [target_core_user]
[171884.553014] RSP: 0000:ffff880540f0fd58  EFLAGS: 00010246
[171884.553038] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff880000000400
[171884.553069] RDX: ffff880000000250 RSI: 00003ffffffff000 RDI: 0000000000000000
[171884.553100] RBP: ffff880540f0fd68 R08: 0000000000000000 R09: ffff880540f0fde8
[171884.553131] R10: 0000000000000002 R11: 0000000000000000 R12: ffff880540f0fd80
[171884.553162] R13: ffff88101b6616c8 R14: 0000000000000000 R15: ffff88081f3fe398
[171884.553193] FS:  00007f1cab944880(0000) GS:ffff88081fa40000(0000) knlGS:0000000000000000
[171884.553228] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[171884.553254] CR2: 0000000000000000 CR3: 0000000544b96000 CR4: 00000000000407e0
[171884.553285] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[171884.553315] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[171884.553346] Stack:
[171884.553357]  0000000000000000 ffff880540f0fde8 ffff880540f0fdc8 ffffffff811ad122
[171884.554683]  0000000000000000 ffff8810000000a8 0000000000000000 00007f1c8e6ef000
[171884.556000]  0000000000000000 0000000000000000 ffff88081f3fe398 0000000026551ff4
[171884.557312] Call Trace:
[171884.558616]  [<ffffffff811ad122>] __do_fault+0x52/0xe0
[171884.559912]  [<ffffffff811ad5cb>] do_read_fault.isra.44+0x4b/0x130
[171884.561204]  [<ffffffff811b1ed1>] handle_mm_fault+0x691/0x1010
[171884.562484]  [<ffffffff811b8c9e>] ? do_mmap_pgoff+0x31e/0x3e0
[171884.563743]  [<ffffffff816aef74>] __do_page_fault+0x154/0x450
[171884.564984]  [<ffffffff816af2a5>] do_page_fault+0x35/0x90
[171884.566218]  [<ffffffff816ab4c8>] page_fault+0x28/0x30
[171884.567428] Code: d7 48 63 d2 48 8d 04 52 48 c1 e7 0c 48 c1 e0 04 48 01 c6 48 03 be 80 10 00 00 83 be 90 10 00 00 02 74 36 e8 31 6c c8 c0 48 89 c3 <48> 8b 03 f6 c4 80 75 59 f0 ff 43 1c 48 8b 03 a9 00 00 00 80 74
[171884.569986] RIP  [<ffffffffc05397c2>] tcmu_vma_fault+0x72/0xf0 [target_core_user]
[171884.571233]  RSP <ffff880540f0fd58>
[171884.572457] CR2: 0000000000000000

How reproducible:
Have hit it only once; the chances of reproducing it appear very rare.
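Since abrtd was not running and the tcmu-runner core dump was lost, it may help to make sure cores can be captured before attempting a reproduction. A minimal sketch follows; the /var/tmp core path is an example choice rather than a site standard, and writing core_pattern requires root, so that step is guarded:

```shell
# Allow unlimited core file size for processes started from this shell
ulimit -c unlimited

# Send cores to a predictable path (%e = executable name, %p = pid);
# this needs root, so skip silently when the file is not writable
if [ -w /proc/sys/kernel/core_pattern ]; then
    echo '/var/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern
fi

# Confirm the soft limit was raised
ulimit -c
```

Alternatively, simply ensuring abrtd (or systemd-coredump, where available) is running before the test would capture the next tcmu-runner crash automatically.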
Patch: https://review.gluster.org/#/c/17725/
Prasanna, please refer to comment 9.
Tested and verified this on the builds glusterfs-3.8.4-33 and gluster-block-0.2.1-6. Gluster-block create and delete work without any issues, and I have also done one round of health checks with gluster-block on the said bits. As mentioned in comment 11, the attribute cmd_time_out is set to zero for all newly created blocks.

Moving this bug to verified based on comment 11 and the logs pasted below:

[root@dhcp47-115 ~]# targetcli /backstores/user:glfs/nb21 get attribute cmd_time_out
cmd_time_out=0
[root@dhcp47-115 ~]# targetcli /backstores/user:glfs/nb50 get attribute cmd_time_out
cmd_time_out=0
[root@dhcp47-115 ~]#
[root@dhcp47-115 ~]# rpm -qa | grep gluster
glusterfs-cli-3.8.4-33.el7rhgs.x86_64
glusterfs-rdma-3.8.4-33.el7rhgs.x86_64
python-gluster-3.8.4-33.el7rhgs.noarch
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
glusterfs-client-xlators-3.8.4-33.el7rhgs.x86_64
glusterfs-fuse-3.8.4-33.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-events-3.8.4-33.el7rhgs.x86_64
gluster-block-0.2.1-6.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.2.0-14.el7.x86_64
gluster-nagios-addons-0.2.9-1.el7rhgs.x86_64
samba-vfs-glusterfs-4.6.3-3.el7rhgs.x86_64
glusterfs-3.8.4-33.el7rhgs.x86_64
glusterfs-debuginfo-3.8.4-26.el7rhgs.x86_64
glusterfs-api-3.8.4-33.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-33.el7rhgs.x86_64
glusterfs-libs-3.8.4-33.el7rhgs.x86_64
glusterfs-server-3.8.4-33.el7rhgs.x86_64
[root@dhcp47-115 ~]# gluster-block list nash
nb21
nb22
nb23
nb24
nb25
nb26
nb27
nb28
nb29
nb30
nb31
nb32
nb33
nb34
nb35
nb36
nb37
nb38
nb39
nb40
nb41
nb42
nb43
nb44
nb45
nb46
nb47
nb48
nb49
nb50
[root@dhcp47-115 ~]# gluster v info nash

Volume Name: nash
Type: Replicate
Volume ID: f1ea3d3e-c536-4f36-b61f-cb9761b8a0a6
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.47.115:/bricks/brick4/nash0
Brick2: 10.70.47.116:/bricks/brick4/nash1
Brick3: 10.70.47.117:/bricks/brick4/nash2
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
performance.open-behind: off
performance.readdir-ahead: off
network.remote-dio: enable
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
server.allow-insecure: on
cluster.brick-multiplex: disable
cluster.enable-shared-storage: enable
[root@dhcp47-115 ~]# gluster pool list
UUID					Hostname				State
49610061-1788-4cbc-9205-0e59fe91d842	dhcp47-121.lab.eng.blr.redhat.com	Connected
a0557927-4e5e-4ff7-8dce-94873f867707	dhcp47-113.lab.eng.blr.redhat.com	Connected
c0dac197-5a4d-4db7-b709-dbf8b8eb0896	dhcp47-114.lab.eng.blr.redhat.com	Connected
a96e0244-b5ce-4518-895c-8eb453c71ded	dhcp47-116.lab.eng.blr.redhat.com	Connected
17eb3cef-17e7-4249-954b-fc19ec608304	dhcp47-117.lab.eng.blr.redhat.com	Connected
f828fdfa-e08f-4d12-85d8-2121cafcf9d0	localhost				Connected
[root@dhcp47-115 ~]#
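The per-block attribute checks above can be extended into a loop over every listed block. A sketch of the verification logic follows, parsing the `cmd_time_out=<n>` line that targetcli prints; the query function here just replays the captured output from this comment so the snippet is self-contained, and on a live node it would instead run `targetcli /backstores/user:glfs/$blk get attribute cmd_time_out`:

```shell
# Stand-in for the live targetcli query, replaying the captured output;
# replace the body with the real targetcli invocation on a live node
get_cmd_time_out() {
    echo "cmd_time_out=0"
}

bad=0
for blk in nb21 nb50; do
    # Extract the value after "=" and flag any non-zero timeout
    val=$(get_cmd_time_out "$blk" | cut -d= -f2)
    if [ "$val" -ne 0 ]; then
        echo "$blk: cmd_time_out=$val"
        bad=1
    fi
done
[ "$bad" -eq 0 ] && echo "all checked blocks have cmd_time_out=0"
```

With the captured values, the loop reports that all checked blocks have cmd_time_out=0, matching the manual spot checks above.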
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:2773