Bug 1298646

Summary: [Lago] - hosted engine setup causes kernel panic
Product: [Community] ovirt-system-tests
Reporter: Anatoly Litovsky <tlitovsk>
Component: General
Assignee: Bandan Das <bdas>
Status: CLOSED INSUFFICIENT_DATA
QA Contact:
Severity: low
Docs Contact:
Priority: low
Version: ---
CC: bdas, bugs, dfediuck, didi, fdeutsch, lveyde, rgolan, rmartins, sbonazzo, stirabos, tlitovsk, ylavi
Target Milestone: ---
Keywords: AutomationBlocker
Target Release: ---
Flags: ylavi: ovirt-3.6.z?, rule-engine: planning_ack?, rule-engine: devel_ack?, rule-engine: testing_ack?
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-06-24 18:56:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: External
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Anatoly Litovsky 2016-01-14 16:00:36 UTC
Installed hosted engine on a nested virtualization system, with NFS for storage.
The engine VM uses 2 cores and 4096 MB of RAM.
The RHEVH host is defined with 8192 MB of RAM and 4 cores.
CPU: Sandy Bridge.

During engine database creation, a kernel panic appears in the engine VM.
Retrieving the sosreport caused a second kernel panic in the appliance and also brought down RHEVH.

The RHEVH sosreport is not working following that crash,
so a tar of /var/log will have to do.

Comment 1 Simone Tiraboschi 2016-01-14 16:08:35 UTC
We were already discussing it here: https://bugzilla.redhat.com/show_bug.cgi?id=1255026#c16

We saw a strict dependency on the number of cores allocated to the engine VM:
- 1 core: it often happens
- 2 cores: seldom
- 4 cores: sometimes

Comment 2 Bandan Das 2016-01-19 20:07:22 UTC
Can you please post a stack trace?

Comment 3 Simone Tiraboschi 2016-01-20 09:00:21 UTC
It is still failing intermittently on hosted-engine CI jobs.
For instance, here is one example from /var/log/messages on the hosted-engine host (L1):
http://jenkins-ci.eng.lab.tlv.redhat.com/job/hosted_engine_3.6_el7_dynamicip_iscsi_install_on_7.2/126/artifact/logs/messages


Jan 19 06:43:05 hehost kernel: ------------[ cut here ]------------
Jan 19 06:43:05 hehost kernel: WARNING: at fs/block_dev.c:67 bdev_inode_switch_bdi+0x7a/0x90()
Jan 19 06:43:05 hehost kernel: Modules linked in: loop dm_service_time dm_multipath dm_mod sd_mod sg iscsi_tcp libiscsi_tcp libiscsi iscsi_target_mod target_core_pscsi target_core_file target_core_iblock target_core_mod crc_t10dif crct10dif_generic crct10dif_common nfsv3 nfs fscache ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter softdog scsi_transport_iscsi 8021q garp mrp bridge stp llc bonding snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device kvm_intel snd_pcm ppdev kvm pcspkr snd_timer virtio_balloon snd soundcore parport_pc i2c_piix4 parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_console virtio_net virtio_blk qxl syscopyarea sysfillrect sysimgblt drm_kms_helper ttm ata_piix serio_raw drm
Jan 19 06:43:05 hehost kernel: crc32c_intel virtio_pci virtio_ring libata i2c_core virtio floppy
Jan 19 06:43:05 hehost kernel: CPU: 2 PID: 14227 Comm: qemu-img Not tainted 3.10.0-327.el7.x86_64 #1
Jan 19 06:43:05 hehost kernel: Hardware name: Red Hat KVM, BIOS seabios-1.7.5-11.el7 04/01/2014
Jan 19 06:43:05 hehost kernel: 0000000000000000 0000000022ec8ccd ffff880074b7fdb0 ffffffff816351f1
Jan 19 06:43:05 hehost kernel: ffff880074b7fde8 ffffffff8107b200 ffff8802331a3b70 ffff8802331a3bf8
Jan 19 06:43:05 hehost kernel: ffffffff819c1900 000000000000001f 0000000000000000 ffff880074b7fdf8
Jan 19 06:43:05 hehost kernel: Call Trace:
Jan 19 06:43:05 hehost kernel: [<ffffffff816351f1>] dump_stack+0x19/0x1b
Jan 19 06:43:05 hehost kernel: [<ffffffff8107b200>] warn_slowpath_common+0x70/0xb0
Jan 19 06:43:05 hehost kernel: [<ffffffff8107b34a>] warn_slowpath_null+0x1a/0x20
Jan 19 06:43:05 hehost kernel: [<ffffffff8121963a>] bdev_inode_switch_bdi+0x7a/0x90
Jan 19 06:43:05 hehost kernel: [<ffffffff8121a224>] __blkdev_put+0x74/0x1a0
Jan 19 06:43:05 hehost kernel: [<ffffffff8121ac9e>] blkdev_put+0x4e/0x140
Jan 19 06:43:05 hehost kernel: [<ffffffff8121ae45>] blkdev_close+0x25/0x30
Jan 19 06:43:05 hehost kernel: [<ffffffff811e0329>] __fput+0xe9/0x270
Jan 19 06:43:05 hehost kernel: [<ffffffff811e05ee>] ____fput+0xe/0x10
Jan 19 06:43:05 hehost kernel: [<ffffffff810a22d7>] task_work_run+0xa7/0xe0
Jan 19 06:43:05 hehost kernel: [<ffffffff81014b12>] do_notify_resume+0x92/0xb0
Jan 19 06:43:05 hehost kernel: [<ffffffff81645bbd>] int_signal+0x12/0x17
Jan 19 06:43:05 hehost kernel: ---[ end trace a0e1ad231323ba80 ]---



Here is another:
http://jenkins-ci.eng.lab.tlv.redhat.com/job/hosted_engine_3.6_el7_dynamicip_nfs4_install_on_latest/122/artifact/logs/messages

Jan 19 05:55:34 hehost kernel: ------------[ cut here ]------------
Jan 19 05:55:34 hehost kernel: WARNING: at fs/block_dev.c:67 bdev_inode_switch_bdi+0x7a/0x90()
Jan 19 05:55:34 hehost kernel: Modules linked in: loop dm_service_time dm_multipath dm_mod sd_mod sg iscsi_tcp libiscsi_tcp libiscsi iscsi_target_mod target_core_pscsi target_core_file target_core_iblock target_core_mod crc_t10dif crct10dif_generic nfsv3 nfs fscache ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter softdog scsi_transport_iscsi 8021q garp mrp bridge stp llc bonding kvm_intel kvm snd_hda_codec_generic crc32_pclmul ghash_clmulni_intel ppdev snd_hda_intel aesni_intel snd_hda_codec lrw gf128mul glue_helper snd_hda_core ablk_helper snd_hwdep cryptd snd_seq snd_seq_device snd_pcm pcspkr snd_timer virtio_balloon snd parport_pc parport soundcore i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 virtio_net virtio_console virtio_blk ata_generic pata_acpi qxl syscopyarea
Jan 19 05:55:34 hehost kernel: sysfillrect sysimgblt drm_kms_helper ttm drm serio_raw crct10dif_pclmul crct10dif_common crc32c_intel ata_piix virtio_pci i2c_core virtio_ring libata virtio floppy
Jan 19 05:55:34 hehost kernel: CPU: 3 PID: 14180 Comm: qemu-img Not tainted 3.10.0-327.el7.x86_64 #1
Jan 19 05:55:34 hehost kernel: Hardware name: Red Hat KVM, BIOS seabios-1.7.5-11.el7 04/01/2014
Jan 19 05:55:34 hehost kernel: 0000000000000000 00000000e63c4116 ffff880231c7fdb0 ffffffff816351f1
Jan 19 05:55:34 hehost kernel: ffff880231c7fde8 ffffffff8107b200 ffff880233b27830 ffff880233b278b8
Jan 19 05:55:34 hehost kernel: ffffffff819c1900 000000000000001f 0000000000000000 ffff880231c7fdf8
Jan 19 05:55:34 hehost kernel: Call Trace:
Jan 19 05:55:34 hehost kernel: [<ffffffff816351f1>] dump_stack+0x19/0x1b
Jan 19 05:55:34 hehost kernel: [<ffffffff8107b200>] warn_slowpath_common+0x70/0xb0
Jan 19 05:55:34 hehost kernel: [<ffffffff8107b34a>] warn_slowpath_null+0x1a/0x20
Jan 19 05:55:34 hehost kernel: [<ffffffff8121963a>] bdev_inode_switch_bdi+0x7a/0x90
Jan 19 05:55:34 hehost kernel: [<ffffffff8121a224>] __blkdev_put+0x74/0x1a0
Jan 19 05:55:34 hehost kernel: [<ffffffff8121ac9e>] blkdev_put+0x4e/0x140
Jan 19 05:55:34 hehost kernel: [<ffffffff8121ae45>] blkdev_close+0x25/0x30
Jan 19 05:55:34 hehost kernel: [<ffffffff811e0329>] __fput+0xe9/0x270
Jan 19 05:55:34 hehost kernel: [<ffffffff811e05ee>] ____fput+0xe/0x10
Jan 19 05:55:34 hehost kernel: [<ffffffff810a22d7>] task_work_run+0xa7/0xe0
Jan 19 05:55:34 hehost kernel: [<ffffffff81014b12>] do_notify_resume+0x92/0xb0
Jan 19 05:55:34 hehost kernel: [<ffffffff81645bbd>] int_signal+0x12/0x17
Jan 19 05:55:34 hehost kernel: ---[ end trace c6ced33f6a84c494 ]---
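The two warnings above share the same call site. To compare such blocks across several messages files, a minimal Python sketch like the following could pull out each block between the "cut here" and "end trace" markers. The function names are hypothetical, and it assumes the syslog line layout shown in the excerpts above.

```python
import re

# Markers the kernel prints around a WARNING block, as seen in the logs above.
START_MARK = "[ cut here ]"
END_RE = re.compile(r"---\[ end trace ([0-9a-f]+) \]---")


def extract_warning_blocks(lines):
    """Return a list of (trace_id, block_lines) for each complete block."""
    blocks = []
    current = None  # lines of the block being collected, or None
    for line in lines:
        if START_MARK in line:
            current = [line]
            continue
        if current is None:
            continue  # not inside a block
        current.append(line)
        match = END_RE.search(line)
        if match:
            blocks.append((match.group(1), current))
            current = None
    return blocks


def warning_site(block_lines):
    """Return the text after 'WARNING: at ' in a block, or None if absent."""
    for line in block_lines:
        _, sep, rest = line.partition("WARNING: at ")
        if sep:
            return rest.strip()
    return None
```

Calling extract_warning_blocks() on the lines of a messages file and then warning_site() on each block gives one call site per trace, which makes it easy to see whether different jobs hit the same warning.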


This one looks different:
http://jenkins-ci.eng.lab.tlv.redhat.com/job/hosted_engine_3.6_el7_staticip_iscsi_install_on_7.2/125/artifact/logs/messages

Jan 17 08:01:56 hehost kernel: INFO: task supervdsmServer:13000 blocked for more than 120 seconds.
Jan 17 08:01:56 hehost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 17 08:01:56 hehost kernel: supervdsmServer D 0000000002c02d18     0 13000      1 0x00000080
Jan 17 08:01:56 hehost kernel: ffff8800bb577e88 0000000000000082 ffff880230da2e00 ffff8800bb577fd8
Jan 17 08:01:56 hehost kernel: ffff8800bb577fd8 ffff8800bb577fd8 ffff880230da2e00 ffff8800bb577ee0
Jan 17 08:01:56 hehost kernel: 00000000000008e0 ffffffff81c6f7e0 0000000002bc0a60 0000000002c02d18
Jan 17 08:01:56 hehost kernel: Call Trace:
Jan 17 08:01:56 hehost kernel: [<ffffffff8163a909>] schedule+0x29/0x70
Jan 17 08:01:56 hehost kernel: [<ffffffff81057fef>] kvm_async_pf_task_wait+0x1df/0x230
Jan 17 08:01:56 hehost kernel: [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
Jan 17 08:01:56 hehost kernel: [<ffffffff81641100>] ? do_page_fault+0x10/0x80
Jan 17 08:01:56 hehost kernel: [<ffffffff8164094a>] do_async_page_fault+0x9a/0xe0
Jan 17 08:01:56 hehost kernel: [<ffffffff8163d438>] async_page_fault+0x28/0x30
Jan 17 08:01:56 hehost kernel: INFO: task dhclient:13337 blocked for more than 120 seconds.
Jan 17 08:01:56 hehost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 17 08:01:56 hehost kernel: dhclient        D 00007f7338eab010     0 13337    588 0x00000080
Jan 17 08:01:56 hehost kernel: ffff88009a69be88 0000000000000086 ffff8800b99bb980 ffff88009a69bfd8
Jan 17 08:01:56 hehost kernel: ffff88009a69bfd8 ffff88009a69bfd8 ffff8800b99bb980 ffff88009a69bee0
Jan 17 08:01:56 hehost kernel: 0000000000000970 ffffffff81c6f870 00007f7338ec64b0 00007f7338eab010
Jan 17 08:01:56 hehost kernel: Call Trace:
Jan 17 08:01:56 hehost kernel: [<ffffffff8163a909>] schedule+0x29/0x70
Jan 17 08:01:56 hehost kernel: [<ffffffff81057fef>] kvm_async_pf_task_wait+0x1df/0x230
Jan 17 08:01:56 hehost kernel: [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
Jan 17 08:01:56 hehost kernel: [<ffffffff8164094a>] do_async_page_fault+0x9a/0xe0
Jan 17 08:01:56 hehost kernel: [<ffffffff8163d438>] async_page_fault+0x28/0x30
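
To see at a glance which processes were reported as hung in a log like the one above, a small Python sketch could collect the "blocked for more than N seconds" lines. The function names are hypothetical, and the pattern is based only on the message format visible in this excerpt.

```python
import re
from collections import Counter

# Matches e.g. "INFO: task dhclient:13337 blocked for more than 120 seconds."
HUNG_RE = re.compile(r"INFO: task (.+):(\d+) blocked for more than (\d+) seconds")


def hung_tasks(lines):
    """Return (name, pid, seconds) for every hung-task report in lines."""
    found = []
    for line in lines:
        match = HUNG_RE.search(line)
        if match:
            found.append((match.group(1), int(match.group(2)), int(match.group(3))))
    return found


def count_by_name(tasks):
    """Count hung-task reports per process name."""
    return Counter(name for name, _pid, _seconds in tasks)
```

Run over the excerpt above, this would list supervdsmServer and dhclient, both waiting in kvm_async_pf_task_wait according to their call traces.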

Comment 4 Anatoly Litovsky 2016-01-20 16:45:01 UTC
Hi, it seems Bugzilla didn't attach my sosreport,
and the host has already been destroyed.
But it seems to be the same message, since those jobs fail on exactly the step where mine failed:

setting up the rhevm DB.

Comment 5 Bandan Das 2016-01-21 17:18:26 UTC
(In reply to Anatoly Litovsky from comment #4)
> Hi it seems the bugzilla didnt attach my sosreport.
> And the host already destroyed.
> But it seems to be the same message since those jobs fail on exactly the
> step I failed .
> 
> Setting the rhevm db.

I am not sure I understand what you mean. If the failure is nested hypervisor related, there should be some related error message or trace in the host dmesg.

Let me know when you get a chance and can reproduce this again.
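
When the issue is reproduced, a rough Python sketch like the one below could help pick candidate lines out of saved host dmesg output. The keyword list is only an assumption about what a relevant message might contain, not a list of known error strings, and the function name is hypothetical.

```python
# Keywords are an assumption for illustration; adjust them to the actual failure.
DEFAULT_KEYWORDS = ("kvm", "vmx", "Call Trace", "WARNING:", "BUG:")


def filter_dmesg(lines, keywords=DEFAULT_KEYWORDS):
    """Return the lines that contain any keyword (case-insensitive)."""
    lowered = [keyword.lower() for keyword in keywords]
    selected = []
    for line in lines:
        text = line.lower()
        if any(keyword in text for keyword in lowered):
            selected.append(line)
    return selected
```

The input would be the saved dmesg output split into lines; the result is a shortlist to read by hand, not a diagnosis.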

Comment 6 Sandro Bonazzola 2016-02-08 08:13:36 UTC
Moving from hosted engine to distribution: this is something exposed by hosted engine setup under nested virtualization, not a bug in hosted engine itself.