Description of problem:
=======================
Scenario: a 'gluster-block create' command is executed. It fails for some reason on one of the nodes and succeeds on the others. Such a partial success is actually a failed create, so the code goes ahead with internal block deletion, undoing everything it did moments earlier. That is when the VM crashes. The meta file of the block says 'CLEANUPINPROGRESS' but goes no further. Backtraces are pasted below.

Gluster-block logs and core files will be copied to http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/<bugnumber>/

Version-Release number of selected component (if applicable):
=============================================================
gluster-block-0.2.1-6 and glusterfs-3.8.4-33

How reproducible:
=================
Seen it twice

Additional info:
================
BUG: unable to handle kernel paging request at 00000000db8d38b8
IP: [<ffffffffc0573080>] uio_poll+0x20/0x70 [uio]
PGD b64d0067 PUD 0
Oops: 0000 [#1] SMP
Modules linked in: target_core_pscsi target_core_file target_core_iblock iscsi_target_mod target_core_user target_core_mod crc_t10dif crct10dif_generic uio crct10dif_common fuse nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio pcspkr joydev sg ppdev i2c_piix4 virtio_balloon parport_pc parport nfsd auth_rpcgss nfs_acl lockd dm_multipath grace sunrpc ip_tables xfs libcrc32c sr_mod cdrom ata_generic pata_acpi cirrus virtio_blk drm_kms_helper syscopyarea sysfillrect serio_raw sysimgblt fb_sys_fops ttm ata_piix drm libata 8139too virtio_pci virtio_ring virtio
floppy 8139cp i2c_core mii dm_mirror dm_region_hash dm_log dm_mod 8021q garp mrp bridge stp llc bonding
CPU: 1 PID: 19839 Comm: tcmu-runner Not tainted 3.10.0-693.el7.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
task: ffff88005339af70 ti: ffff880056338000 task.ti: ffff880056338000
RIP: 0010:[<ffffffffc0573080>]  [<ffffffffc0573080>] uio_poll+0x20/0x70 [uio]
RSP: 0018:ffff88005633bb08  EFLAGS: 00010202
RAX: 00000000fffffffb RBX: ffff880109a70780 RCX: 00000000db8d36e8
RDX: ffffffffc0573060 RSI: ffff88005633bc90 RDI: ffff880053326600
RBP: ffff88005633bb18 R08: 0000000000000001 R09: ffff88011fc16d40
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88011883c830
R13: 0000000000000000 R14: 0000000000000000 R15: ffff88005633bb9c
FS:  00007f91e5679700(0000) GS:ffff88011fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000db8d38b8 CR3: 00000000cebf0000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 ffff88005633bba4 0000000000000000 ffff88005633bf38 ffffffff81217297
 00007f91e5678da0 ffff88005633bfd8 ffff88005339af70 0000000000000000
 0000000000000000 0000000000000000 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff81217297>] do_sys_poll+0x327/0x580
 [<ffffffff81215dd0>] ? poll_select_copy_remaining+0x150/0x150
 [<ffffffff8133d9dd>] ? list_del+0xd/0x30
 [<ffffffff810b1671>] ? remove_wait_queue+0x31/0x40
 [<ffffffffc057394d>] ? uio_read+0x11d/0x180 [uio]
 [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
 [<ffffffff812175f4>] SyS_poll+0x74/0x110
 [<ffffffff8111f5c6>] ? __audit_syscall_exit+0x1e6/0x280
 [<ffffffff816b4fc9>] system_call_fastpath+0x16/0x1b
Code: ff ff c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 b8 fb ff ff ff 48 89 e5 41 54 53 4c 8b a7 a8 00 00 00 49 8b 1c 24 48 8b 4b 40 <48> 83 b9 d0 01 00 00 00 75 06 5b 41 5c 5d c3 90 48 85 f6 74 19
RIP  [<ffffffffc0573080>] uio_poll+0x20/0x70 [uio]
 RSP <ffff88005633bb08>
[root@dhcp47-115 ~]#

BUG: unable to handle kernel NULL pointer dereference at 00000000000001d0
IP: [<ffffffffc0566080>] uio_poll+0x20/0x70 [uio]
PGD b5f66067 PUD 3660f067 PMD 0
Oops: 0000 [#1] SMP
Modules linked in: target_core_pscsi target_core_file target_core_iblock iscsi_target_mod target_core_user target_core_mod crc_t10dif crct10dif_generic uio crct10dif_common fuse nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio pcspkr sg ppdev joydev i2c_piix4 parport_pc virtio_balloon parport nfsd auth_rpcgss nfs_acl dm_multipath lockd grace sunrpc ip_tables xfs libcrc32c sr_mod cdrom ata_generic pata_acpi virtio_blk cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ata_piix drm serio_raw libata 8139too virtio_pci virtio_ring virtio 8139cp mii i2c_core floppy 8021q garp mrp bridge stp llc dm_mirror dm_region_hash dm_log bonding dm_mod
CPU: 0 PID: 18943 Comm: tcmu-runner Not tainted 3.10.0-693.el7.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
task: ffff8800657d0000 ti: ffff880060f80000 task.ti: ffff880060f80000
RIP: 0010:[<ffffffffc0566080>]  [<ffffffffc0566080>] uio_poll+0x20/0x70 [uio]
RSP: 0018:ffff880060f83b08  EFLAGS: 00010202
RAX: 00000000fffffffb RBX: ffff8800b610a9c0 RCX: 0000000000000000
RDX: ffffffffc0566060 RSI: ffff880060f83c90 RDI: ffff88008963e700
RBP: ffff880060f83b18 R08: 0000000000000001 R09: ffff88011fd16d40
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880060c17250
R13: 0000000000000000 R14: 0000000000000000 R15: ffff880060f83b9c
FS:  00007fc4086ac700(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000000001d0 CR3: 00000000b5f7a000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 ffff880060f83ba4 0000000000000000 ffff880060f83f38 ffffffff81217297
 00007fc4086abda0 ffff880060f83fd8 ffff8800657d0000 0000000000000000
 0000000000000000 0000000000000000 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff81217297>] do_sys_poll+0x327/0x580
 [<ffffffff81215dd0>] ? poll_select_copy_remaining+0x150/0x150
 [<ffffffff8133d9dd>] ? list_del+0xd/0x30
 [<ffffffff810b1671>] ? remove_wait_queue+0x31/0x40
 [<ffffffffc056694d>] ? uio_read+0x11d/0x180 [uio]
 [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
 [<ffffffff812175f4>] SyS_poll+0x74/0x110
 [<ffffffff8111f5c6>] ? __audit_syscall_exit+0x1e6/0x280
 [<ffffffff816b4fc9>] system_call_fastpath+0x16/0x1b
Code: ff ff c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 b8 fb ff ff ff 48 89 e5 41 54 53 4c 8b a7 a8 00 00 00 49 8b 1c 24 48 8b 4b 40 <48> 83 b9 d0 01 00 00 00 75 06 5b 41 5c 5d c3 90 48 85 f6 74 19
RIP  [<ffffffffc0566080>] uio_poll+0x20/0x70 [uio]
 RSP <ffff880060f83b08>
[root@dhcp47-117 abrt]#

[root@dhcp47-115 ~]# rpm -qa | grep gluster
glusterfs-cli-3.8.4-33.el7rhgs.x86_64
glusterfs-rdma-3.8.4-33.el7rhgs.x86_64
python-gluster-3.8.4-33.el7rhgs.noarch
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
glusterfs-client-xlators-3.8.4-33.el7rhgs.x86_64
glusterfs-fuse-3.8.4-33.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-events-3.8.4-33.el7rhgs.x86_64
gluster-block-0.2.1-6.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.2.0-14.el7.x86_64
gluster-nagios-addons-0.2.9-1.el7rhgs.x86_64
samba-vfs-glusterfs-4.6.3-3.el7rhgs.x86_64
glusterfs-3.8.4-33.el7rhgs.x86_64
glusterfs-debuginfo-3.8.4-26.el7rhgs.x86_64
glusterfs-api-3.8.4-33.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-33.el7rhgs.x86_64
glusterfs-libs-3.8.4-33.el7rhgs.x86_64
glusterfs-server-3.8.4-33.el7rhgs.x86_64

[root@dhcp47-115 ~]# gluster peer status
Number of Peers: 5

Hostname: dhcp47-121.lab.eng.blr.redhat.com
Uuid: 49610061-1788-4cbc-9205-0e59fe91d842
State: Peer in Cluster (Connected)
Other names:
10.70.47.121

Hostname: dhcp47-113.lab.eng.blr.redhat.com
Uuid: a0557927-4e5e-4ff7-8dce-94873f867707
State: Peer in Cluster (Connected)

Hostname: dhcp47-114.lab.eng.blr.redhat.com
Uuid: c0dac197-5a4d-4db7-b709-dbf8b8eb0896
State: Peer in Cluster (Connected)
Other names:
10.70.47.114

Hostname: dhcp47-116.lab.eng.blr.redhat.com
Uuid: a96e0244-b5ce-4518-895c-8eb453c71ded
State: Peer in Cluster (Disconnected)
Other names:
10.70.47.116

Hostname: dhcp47-117.lab.eng.blr.redhat.com
Uuid: 17eb3cef-17e7-4249-954b-fc19ec608304
State: Peer in Cluster (Connected)
Other names:
10.70.47.117

[root@dhcp47-115 ~]# gluster v info nash

Volume Name: nash
Type: Replicate
Volume ID: f1ea3d3e-c536-4f36-b61f-cb9761b8a0a6
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.47.115:/bricks/brick4/nash0
Brick2: 10.70.47.116:/bricks/brick4/nash1
Brick3: 10.70.47.117:/bricks/brick4/nash2
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
performance.open-behind: off
performance.readdir-ahead: off
network.remote-dio: enable
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
server.allow-insecure: on
cluster.brick-multiplex: disable
cluster.enable-shared-storage: enable

[root@dhcp47-115 ~]# gluster-block list nash
nb21
nb22
nb23
nb24
nb25
nb26
nb27
nb28
nb29
nb30
nb31
nb32
nb33
nb34
nb35
nb36
nb37
nb38
nb39
nb40
nb41
nb42
nb43
nb44
nb45
nb46
nb47
nb48
nb49
nb50
nb51
nb52
nb54
nb55
[root@dhcp47-115 ~]#
Seen twice, on two different peer nodes, but I don't have straightforward steps to reproduce it. I would like this bug to be discussed in a wider forum, as I am not completely sure of the likelihood and the repercussions of this happening in a CNS environment. Hence, setting blocker to '?'.
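For context on the stuck-cleanup symptom: when a create partially fails and the internal delete does not complete, the block's meta file on the backing volume is left recording the unfinished state. The sketch below is illustrative only; the meta-file field names and layout are assumptions (the real format written by gluster-blockd under the volume's block-meta directory may differ). The point is that nodes stuck in cleanup can be spotted by grepping for CLEANUPINPROGRESS.

```shell
# Illustrative meta file for a block whose create failed mid-way.
# Field names and per-node status values are hypothetical.
META=$(mktemp)
cat > "$META" <<'EOF'
VOLUME: nash
SIZE: 1073741824
10.70.47.115: CONFIGSUCCESS
10.70.47.116: CONFIGFAIL
10.70.47.117: CLEANUPINPROGRESS
EOF

# List the nodes whose cleanup never completed.
grep 'CLEANUPINPROGRESS' "$META" | cut -d: -f1

rm -f "$META"
```

On a real node one would run the grep against the block's meta file on the hosting volume instead of this sample.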
I hit another VM crash today when a block-create command I issued failed. This was not a negative test; I was expecting the block to be created successfully. The bug title looks the same, though the backtrace is different. Please advise if this is a different issue.

BUG: unable to handle kernel NULL pointer dereference at 00000000000001d0
IP: [<ffffffffc0623080>] uio_poll+0x20/0x70 [uio]
PGD 7d462067 PUD ce7b0067 PMD 0
Oops: 0000 [#1] SMP
Modules linked in: target_core_pscsi target_core_file target_core_iblock iscsi_target_mod target_core_user target_core_mod crc_t10dif crct10dif_generic uio crct10dif_common sctp_diag sctp dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag binfmt_misc fuse nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio ppdev pcspkr joydev sg virtio_balloon parport_pc i2c_piix4 parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_multipath ip_tables xfs libcrc32c sr_mod cdrom ata_generic pata_acpi cirrus drm_kms_helper virtio_blk syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm serio_raw 8139too virtio_pci virtio_ring virtio ata_piix libata 8139cp mii i2c_core floppy dm_mirror dm_region_hash dm_log dm_mod 8021q garp mrp bridge stp llc bonding
CPU: 0 PID: 14320 Comm: tcmu-runner Not tainted 3.10.0-693.el7.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
task: ffff880049cf2f70 ti: ffff880117d94000 task.ti: ffff880117d94000
RIP: 0010:[<ffffffffc0623080>]  [<ffffffffc0623080>] uio_poll+0x20/0x70 [uio]
RSP: 0018:ffff880117d97b08  EFLAGS: 00010202
RAX: 00000000fffffffb RBX: ffff880049c781e0 RCX: 0000000000000000
RDX: ffffffffc0623060 RSI: ffff880117d97c90 RDI: ffff8800c34c8d00
RBP: ffff880117d97b18 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800c866f560
R13: 0000000000000000 R14: 0000000000000000 R15: ffff880117d97b9c
FS:  00007fb730e3b700(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000000001d0 CR3: 000000009c696000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 ffff880117d97ba4 0000000000000000 ffff880117d97f38 ffffffff81217297
 00007fb730e3ada0 ffff880117d97fd8 ffff880049cf2f70 0000000000000000
 0000000000000000 0000000000000000 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff81217297>] do_sys_poll+0x327/0x580
 [<ffffffff810cd794>] ? update_curr+0x104/0x190
 [<ffffffff810c8f18>] ? __enqueue_entity+0x78/0x80
 [<ffffffff810cf90c>] ? enqueue_entity+0x26c/0xb60
 [<ffffffff810ce8d8>] ? check_preempt_wakeup+0x148/0x250
 [<ffffffff810c12d5>] ? check_preempt_curr+0x85/0xa0
 [<ffffffff81215dd0>] ? poll_select_copy_remaining+0x150/0x150
 [<ffffffff810cd794>] ? update_curr+0x104/0x190
 [<ffffffff810ca29e>] ? account_entity_dequeue+0xae/0xd0
 [<ffffffff810cdc7c>] ? dequeue_entity+0x11c/0x5d0
 [<ffffffff81062ede>] ? kvm_clock_read+0x1e/0x20
 [<ffffffff810ce54e>] ? dequeue_task_fair+0x41e/0x660
 [<ffffffff810cb62c>] ? set_next_entity+0x3c/0xe0
 [<ffffffff810cb72f>] ? pick_next_task_fair+0x5f/0x1b0
 [<ffffffff8133d9dd>] ? list_del+0xd/0x30
 [<ffffffff810b1671>] ? remove_wait_queue+0x31/0x40
 [<ffffffffc062394d>] ? uio_read+0x11d/0x180 [uio]
 [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
 [<ffffffff812175f4>] SyS_poll+0x74/0x110
 [<ffffffff8111f5c6>] ? __audit_syscall_exit+0x1e6/0x280
 [<ffffffff816b4fc9>] system_call_fastpath+0x16/0x1b
Code: ff ff c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 b8 fb ff ff ff 48 89 e5 41 54 53 4c 8b a7 a8 00 00 00 49 8b 1c 24 48 8b 4b 40 <48> 83 b9 d0 01 00 00 00 75 06 5b 41 5c 5d c3 90 48 85 f6 74 19
RIP  [<ffffffffc0623080>] uio_poll+0x20/0x70 [uio]
 RSP <ffff880117d97b08>
The above trace was seen with glusterfs-3.8.4-35 and gluster-block-0.2.1-6.
The corresponding CNS bug is verified (https://bugzilla.redhat.com/show_bug.cgi?id=1490350#c3). We are good from the CNS verification perspective.
Tested and verified this on the builds tcmu-runner-1.2.0-15 and gluster-block-0.2.1-13. Executed multiple block creates and deletes, stopped the gluster-blockd service, and performed node reboots. I did not see the reported VM crash in any of my attempts. I did see partially created blocks (on failed creates), for which bz 1490818 has been raised. Moving this bug to VERIFIED in RHGS 3.3.0.
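The create/delete cycle used in the verification above can be sketched roughly as follows. The volume name and HA hosts are taken from this bug's setup; the block names, count, and size are illustrative. GB defaults to a stub (`echo`) so the sketch is self-contained; on a gluster node one would set GB=gluster-block.

```shell
# Sketch of the verification loop: repeated block creates and deletes.
# GB defaults to a stub that just prints the command; override with the
# real binary (GB=gluster-block) to actually exercise a cluster.
GB=${GB:-echo gluster-block}
VOL=nash
HOSTS=10.70.47.115,10.70.47.116,10.70.47.117

for i in 1 2 3 4 5; do
  $GB create "$VOL/blk$i" ha 3 "$HOSTS" 1GiB || echo "create blk$i failed"
done

for i in 1 2 3 4 5; do
  $GB delete "$VOL/blk$i" || echo "delete blk$i failed"
done
```

Interleaving `systemctl stop gluster-blockd` on one node, or rebooting a node, while the loop runs reproduces the failed-create conditions described in this bug.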
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:2773