Created attachment 998288 [details]
Error_detail_Upgrade.jpg

Description of problem:
Upgrading from RHEV-H 7.0 GA to the latest RHEV-H 7.1 via TUI, the command line, or RHEV-M fails: after the upgrade finishes, the host cannot reboot successfully. The error is shown in Error_detail_Upgrade.jpg.

Version-Release number of selected component (if applicable):
rhev-hypervisor7-7.0-20150127.0
ovirt-node-3.2.1-6.el7.noarch
rhev-hypervisor7-7.1-20150226.0.el7ev
ovirt-node-3.2.1-9.el7.noarch

How reproducible:
100%

QA Whiteboard: upgrade

Steps to Reproduce:
1. TUI install rhev-hypervisor7-7.0-20150127.0
2. Upgrade RHEV-H 7.0 to RHEV-H 7.1 in three ways:
   1) TUI
   2) CMD
   3) RHEVM 3.5 -- Red Hat Enterprise Virtualization Manager Version: 3.5.0-0.33.el6ev

Actual results:
1. After the upgrade finishes, the host cannot reboot successfully, with an error like the following:
   systemd-readahead[812]: Failed to open pack file: Read-only file system

Expected results:
1. The upgrade from RHEV-H 7.0 to RHEV-H 7.1 completes and RHEV-H 7.1 can be logged into.

Additional info:
Created attachment 998291 [details] All log files in /var/log
Created attachment 998303 [details] sosreport
I can not reproduce this bug in a plain VM.

Shang, does this bug appear on only one machine?

Can you please boot the machine with "rd.debug systemd.log_level=debug", and remove the "quiet" and "rhgb" arguments from the command line?
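For reference, an edited kernel line for such a debug boot might look roughly like the fragment below. This is an illustrative sketch only: the vmlinuz0/initrd paths and the "root=live:LABEL=Root" argument are assumptions, not taken from this host, and the exact entry will differ per install.

```
# before (press "e" at the boot menu to edit the entry):
linux /vmlinuz0 root=live:LABEL=Root ro quiet rhgb
# after ("quiet" and "rhgb" dropped, debug arguments appended):
linux /vmlinuz0 root=live:LABEL=Root ro rd.debug systemd.log_level=debug
```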
(In reply to Fabian Deutsch from comment #5)
> I can not reproduce this bug in a plain VM.
>
> Shang, does this bug appear on only one machine?

As far as I know, all Virt QE physical machines hit this issue, and it reproduces 100% of the time.
Created attachment 998507 [details]
oops after an upgrade from 7.0 to 7.1

I was able to capture this oops in 50% of the cases on reboot:

[26634.453675] SQUASHFS error: unable to read inode lookup table
[26634.467530] BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
[26634.468108] IP: [<ffffffffa02f6fab>] dm_exception_store_set_chunk_size+0x7b/0x120 [dm_snapshot]
[26634.468760] PGD 0
[26634.468962] Oops: 0000 [#1] SMP
[26634.469203] Modules linked in: dm_snapshot dm_bufio ext4 mbcache jbd2 squashfs dm_service_time sd_mod crc_t10dif sr_mod cdrom ata_generic pata_acpi virtio_net virtio_balloon crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel bochs_drm syscopyarea sysfillrect sysimgblt drm_kms_helper ttm aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ahci libahci ata_piix virtio_pci virtio_ring virtio drm i2c_core libata sunrpc dm_mirror dm_region_hash dm_log loop dm_multipath dm_mod
[26634.472715] CPU: 0 PID: 638 Comm: dmsetup Not tainted 3.10.0-230.el7.x86_64 #1
[26634.473231] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014
[26634.473861] task: ffff880075a15b00 ti: ffff880034b30000 task.ti: ffff880034b30000
[26634.474399] RIP: 0010:[<ffffffffa02f6fab>] [<ffffffffa02f6fab>] dm_exception_store_set_chunk_size+0x7b/0x120 [dm_snapshot]
[26634.475175] RSP: 0018:ffff880034b33b68 EFLAGS: 00010246
[26634.475518] RAX: 0000000000000000 RBX: 0000000000000008 RCX: 0000000000000001
[26634.475987] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8800359eb600
[26634.476476] RBP: ffff880034b33b80 R08: 0000000000000000 R09: 0000000000000001
[26634.476928] R10: 000000000000000a R11: f000000000000000 R12: ffff8800353e8d80
[26634.477420] R13: ffffc900003b6088 R14: ffff880034b33c04 R15: ffff8800353e8d80
[26634.477880] FS: 00007f319f9df800(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[26634.478438] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[26634.478803] CR2: 0000000000000098 CR3: 0000000034415000 CR4: 00000000001406f0
[26634.479295] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[26634.479746] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[26634.480242] Stack:
[26634.480367] ffffc900003b6040 ffff8800359eb600 ffff8800353e8a90 ffff880034b33be0
[26634.480883] ffffffffa02f721f 0000000000000018 ffffffffa02fb3e0 ffff8800359eb728
[26634.481430] 00000008353e83d8 00000000c7c4fd00 ffffc900003b6040 ffff8800353e8a80
[26634.481934] Call Trace:
[26634.482092] [<ffffffffa02f721f>] dm_exception_store_create+0x1cf/0x240 [dm_snapshot]
[26634.482631] [<ffffffffa02f539b>] snapshot_ctr+0x14b/0x630 [dm_snapshot]
[26634.483076] [<ffffffffa0005638>] ? dm_split_args+0x68/0x170 [dm_mod]
[26634.483512] [<ffffffffa00058b7>] dm_table_add_target+0x177/0x460 [dm_mod]
[26634.483968] [<ffffffffa0008e57>] table_load+0x157/0x380 [dm_mod]
[26634.484382] [<ffffffffa0008d00>] ? retrieve_status+0x1c0/0x1c0 [dm_mod]
[26634.484826] [<ffffffffa0009ac5>] ctl_ioctl+0x255/0x500 [dm_mod]
[26634.485231] [<ffffffffa0009d83>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
[26634.485625] [<ffffffff811d9bd5>] do_vfs_ioctl+0x2e5/0x4c0
[26634.485988] [<ffffffff8126f0ae>] ? file_has_perm+0xae/0xc0
[26634.486368] [<ffffffff811d9e51>] SyS_ioctl+0xa1/0xc0
[26634.486701] [<ffffffff8160ed99>] ? do_async_page_fault+0x29/0xe0
[26634.487093] [<ffffffff81613ea9>] system_call_fastpath+0x16/0x1b
[26634.487512] Code: 14 06 00 00 66 85 c0 74 15 66 c1 e8 09 31 d2 0f b7 c8 89 d8 f7 f1 85 d2 0f 85 8a 00 00 00 49 8b 7c 24 08 e8 58 d0 ff ff 48 8b 00 <48> 8b 80 98 00 00 00 48 8b 80 60 03 00 00 48 85 c0 74 1d 0f b7
[26634.489314] RIP [<ffffffffa02f6fab>] dm_exception_store_set_chunk_size+0x7b/0x120 [dm_snapshot]
[26634.489906] RSP <ffff880034b33b68>
[26634.490131] CR2: 0000000000000098
[26634.490397] ---[ end trace 3bb6c95d82d638f6 ]---
[26634.490709] Kernel panic - not syncing: Fatal exception
[26634.491151] drm_kms_helper: panic occurred, switching back to text console
Created attachment 998521 [details]
An even longer stack trace.

My current assumption is that something goes wrong with the squashfs decompression, but I am not sure how to debug such an issue.

Virt QE, can you please remove the "quiet" keyword from the kernel command line on boot? You should then get the stack trace that is causing the failed boot for you.
Created attachment 998522 [details]
Stack trace including rd.debug

An additional log, including rd.debug data.
As a note, I was unable to reproduce this on VMs or on two physical systems tested (both Dell, one R300 and one Optiplex 9020). Seconding the request for logs without "quiet rhgb".
Virt QE, can you please try the following things:
1. Remove "quiet" and boot, to see if you get a stack trace
2. Boot in permissive mode (enforcing=0), to see if the bug goes away
(In reply to Fabian Deutsch from comment #8)
> Created attachment 998521 [details]
> An even longer stack trace.
>
> My current assumption is that something goes wrong with the squashfs
> uncompression.

Unlikely to be a bug in Squashfs; a corrupted filesystem is much more likely. People constantly misinterpret the error messages from Squashfs.

"SQUASHFS error: unable to read inode lookup table"

does not mean an error *in* Squashfs, merely that Squashfs experienced an error caused elsewhere which prevented it from continuing. In this case, this error is printed at mount time if Squashfs cannot read the inode lookup table. 99% of the time this is because the file has been truncated.

Two points here:
1. This will cause the Squashfs filesystem mount to fail. This failure to mount should be checked, but it evidently isn't.
2. If you didn't mount the Squashfs filesystem, then you can't mount the embedded ext3/ext4 filesystem.

> But I am not sure how to debug such an issue.

First you need to check that the Squashfs filesystems in the rpm are correct. I in fact did that, and they are. Second, when the Squashfs filesystem fails to mount, the system needs to fall back to a bash prompt, which will allow you to verify the filesystem: is it the correct length, does the checksum match, etc. There is probably a debug option to do this in dmsquash-live.

In any case, this is likely a truncation/file corruption issue; it is not a bug in Squashfs.

> Virt QE, can you please just remove the quiet keyword from the kernel
> commandline on booting, then you should get the stack trace which is causing
> the failed boot for you.
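As a toy illustration of the length check described above, a truncated image can be detected by comparing its actual size against an expected size recorded at build time. This is a sketch only: a stand-in temp file is used instead of a real squashfs.img, and the expected size is invented for the example; it is not the dmsquash-live logic.

```shell
# Sketch: detect truncation by comparing actual vs. expected file size.
# A 4-byte stand-in file is used here instead of a real squashfs image.
img=$(mktemp)
printf 'hsqs' > "$img"          # "hsqs" is the squashfs magic; 4 bytes
expected=4                      # stand-in for a size recorded at build time
actual=$(wc -c < "$img")
if [ "$actual" -eq "$expected" ]; then
    status=ok
else
    status=truncated
fi
echo "$status"
rm -f "$img"
```

The same comparison against the size (or checksum) shipped in the rpm would tell you whether the on-disk image was damaged during the upgrade.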
Created attachment 998657 [details]
Log of a boot with the "quiet" and "rhgb" keywords removed from the kernel command line.
Whether booting with or without enforcing=0, I still can't upgrade from RHEV-H 7.0 to RHEV-H 7.1. If the upgrade from RHEV-H 7.0 to RHEV-H 7.1 fails, you must re-install RHEV-H 7.0 before the next upgrade. If you don't re-install RHEV-H 7.0, the upgrade will succeed.
Checked comment 16, and in my testing upgrading rhevh 7.0 to rhevh 7.1 with enforcing=0 still failed. So let's disregard comment 15.
* with enforcing=0
Test steps:
1. Installed rhevh 7.0
2. Upgraded to rhevh 7.1 with enforcing=0
3. The first upgrade hangs; see screenshot: upgrade_hang.png
4. After rebooting and upgrading to rhevh 7.1 _again_, the second upgrade is successful.
Created attachment 998933 [details] upgrade_upgrade.png for comment 18
Created attachment 998934 [details] varlog for comment 18
Created attachment 998935 [details] sosreport for comment 18
* without enforcing=0
Test steps:
1. Totally clean installed rhevh 7.0 (uninstall, then firstboot install)
2. Upgraded to rhevh 7.1 _without_ enforcing=0
3. The first upgrade hangs; same screenshot as upgrade_hang.png in comment 18.
4. After rebooting and upgrading to rhevh 7.1 _again_, the second upgrade is successful.
Created attachment 998945 [details] sosreport for comment 22
Created attachment 998946 [details] varlog for comment 22
Created attachment 999435 [details] console_output for comment 22
In the log from comment 22 I see nothing special. Could you please provide the logs of a failed boot with the following kargs: systemd.log_level=debug rd.debug debug
Created attachment 999504 [details] console_output_for_comment 27
(In reply to Ying Cui from comment #22)
> * without enforcing=0
> Test steps:
> 1. totally clean installed rhevh 7.0(uninstall, then firstboot install.)
> 2. upgrade to rhevh 7.1 _without_ enforcing=0
> 3. first time upgrade hang. the same screenshot: upgrade_hang.png in comment 18.
> 4. then reboot and upgrade rhevh 7.1 _again_, the second upgrade is successful.

Does the double-upgrade also work with SELinux in enforcing mode?

I haven't been able to reproduce this (though I'm going to try again today), and I don't see anything obvious in the console output, but knowing whether SELinux is involved would be helpful.
(In reply to Ryan Barry from comment #29)
> (In reply to Ying Cui from comment #22)
> > * without enforcing=0
> > Test steps:
> > 1. totally clean installed rhevh 7.0(uninstall, then firstboot install.)
> > 2. upgrade to rhevh 7.1 _without_ enforcing=0
> > 3. first time upgrade hang. the same screenshot: upgrade_hang.png in comment 18.
> > 4. then reboot and upgrade rhevh 7.1 _again_, the second upgrade is successful.
>
> Does the double-upgrade also work with SELinux in enforcing mode?

Yes, the second upgrade also works with SELinux in enforcing mode.

> I haven't been able to reproduce this (though I'm going to try again today),
> and I don't see anything obvious in the console output, but knowing whether
> SELinux is involved would be helpful.
Created attachment 999755 [details]
console_output for comment 30

Test machines:
1. dell r210 server - local disk - 5 times
2. dell 9010 desktop - local disk - 3 times
3. hp 5808 desktop - local disk - 3 times

The first upgrade _always_ hangs with SELinux in enforcing mode, and the second upgrade with SELinux in enforcing mode is successful.

Test steps:
1. Clean install rhevh 7.0 (rhev-hypervisor7-7.0-20150127.0) on a local disk (uninstall first, then TUI install rhevh)
2. First upgrade rhevh 7.0 to rhevh 7.1 (rhev-hypervisor7-7.1-20150309.28.iso); tested both TUI upgrade and cmdline with upgrade kargs, and the behavior is the same: hang!
3. After the upgrade hangs, reboot rhevh
4. Upgrade again (second upgrade) via TUI or cmdline
5. The second upgrade is successful and rhevh 7.1 can be logged into.

Additional info:
Upgrading rhevh-7.1-20150304.0.el7ev.iso to rhev-hypervisor7-7.1-20150309.28.iso succeeds the first time; both are rhevh 7.1.
Thanks for the very precise information.

Maybe the 7.1 kernel (or some component) has an issue being installed alongside 7.0.

Could you please try the following:
1. Install RHEV-H 7.0
2. Upgrade to 7.1
After 2: Please check that the reboot into 7.1 fails after installation
(Now the important part:)
3. Boot the installation CD into the installer, but do NOT install
4. Drop to a shell using F2
5. Run: blkid -L RootBackup to find the partition with the 7.0 image (after installation, 7.0 is on the backup partition)
   example: /dev/sda3
6. Run parted, then in parted: "rm 3" (where the 3 comes from sda3), then "q" to quit
7. Reboot, and try rebooting into 7.1
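Steps 5 and 6 above can be sketched in shell like this. It is a non-destructive sketch only: blkid and parted are not actually executed here, and the "dev" value is a stand-in for the output of "blkid -L RootBackup" on a real host.

```shell
# Sketch of steps 5-6: find the RootBackup partition and derive the
# partition number that parted's "rm" would need.
dev=/dev/sda3                               # stand-in for: dev=$(blkid -L RootBackup)
disk=$(echo "$dev" | sed 's/[0-9]*$//')     # strip trailing digits -> /dev/sda
num=$(echo "$dev" | sed 's/.*[^0-9]//')     # keep trailing digits  -> 3
echo "would run: parted $disk rm $num"
```

On a real system the echo would be replaced by the actual parted invocation; keeping it as an echo first is a cheap way to double-check the derived partition number before deleting anything.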
(In reply to Fabian Deutsch from comment #35)
> Could you please try the following:
>
> 1. Install RHEV-H 7.0
> 2. Upgrade to 7.1
> After 2: Please check that the reboot into 7.1 fails after installation

Fabian, I am not clear on this comment: do you mean the first upgrade or the second upgrade?

If the first upgrade: during the upgrade hang, we reboot rhevh manually, and then there is only rhevh _7.0_, with no rhevh 7.1 installation. Based on this, the "important part" below cannot be tried.

If the second upgrade: the rhevh 7.1 upgrade process succeeds, and after reboot rhevh 7.1 can be logged into, so we do not need to do the "important part" below.

Any thoughts? Thanks.

> (Now the important part:)
> 3. Boot the installation CD, into the installer, but do NOT install
> 4. Drop to shell using F2
> 5. Run: blkid -L RootBackup to find the partition with the 7.0 image (after
> installation 7.0 is on backup)
> example: /dev/sda3
> 6. Run parted, then in parted: "rm 3" (where the 3 comes from sda3), then
> "q" to quit.
> 7. Reboot, and try rebooting into 7.1
(In reply to Ying Cui from comment #36)
…
> Fabian, I am not clearly enough on this comment, here is it the first time
> upgrade? or second time upgrade?
> If the first time upgrade, during upgrade hang, reboot the rhevh manually,
> then there is rhevh _7.0_ only, no 7.1 rhevh installation. Based on this,
> the following trying for important part is invalid.

Okay, it sounds like there are misunderstandings. Some questions:

1. In comment 26 there is a console output: from which boot is this?
2. When exactly does the machine hang?
3. After the first upgrade, does a boot entry appear?
(In reply to Fabian Deutsch from comment #37)
> Okay, it sounds like there are missunderstandings, some questions:
>
> 1. In comment 26 there is a console output: From which boot is this?

It is from the boot into the upgrade process; the hang happens during the upgrade. Checking the screen output.log in comment 26, the hang happens after the plymouth steps, at least judging from output like this:

<snip>
[ 78.662208] systemd[1]: Starting Terminate Plymouth Boot Screen...
[ 78.669886] systemd[1]: About to execute: /usr/bin/plymouth quit
Startin[ 78.676237] systemd[1]: Forked /usr/bin/plymouth as 1770
[ 78.676683] systemd[1770]: Executing: /usr/bin/plymouth quit
g Terminate Plym[ 78.688558] systemd[1]: plymouth-quit.service changed dead -> start
[ 78.696232] systemd[1]: Starting Wait for Plymouth Boot Screen to Quit...
outh Boot Screen[ 78.703221] systemd[1]: About to execute: /usr/bin/plymouth --wait
... S[ 78.710832] systemd[1]: Forked /usr/bin/plymouth as 1771
[ 78.711246] systemd[1771]: Executing: /usr/bin/plymouth --wait
tarting Wait for[ 78.723386] systemd[1]: plymouth-quit-wait.service changed dead -> start
</snip>

> 2. When does the machine hang exactly?

The upgrade process hangs, but if you press ctrl+alt+del, rhevh can be rebooted manually.

> 3. After the first upgrade, does a boot entry appear?

No boot entry for rhevh 7.1 appears after the first upgrade, only the 7.0 boot entry, and after the first upgrade rhevh _7.0_ can still be logged into.
After the clarifications I can now reproduce it as well. It would have helped if someone had mentioned that this bug appears when booting _into_ the upgrade process; it sounded like the boot _after_ the upgrade goes wrong.

I also do not understand how this bug can be reproduced with RHEV-M 3.5. Please clarify the exact steps here.

The steps to reproduce:
1. Install RHEV-H 7.0 (install using regular qemu)
2. Reboot with the RHEV-H 7.1 ISO (booted with qemu in snapshot mode now, to not touch the virtual disk)
2.a During the boot of the RHEV-H 7.1 ISO, the boot process hangs and does not enter the TUI installer

As noted in several comments, this only happens the first time; on the second boot of the RHEV-H 7.1 ISO, the boot succeeds. This indicates that some on-disk change was necessary to boot.

Using qemu to install 7.0, and then using qemu -snapshot to boot 7.1, will help to reproduce the error on every boot.
In production the workaround can be to just reboot the installer once it hangs.
Commands used to reproduce:

Prepare the disk:
$ qemu-img create -f qcow2 dst.img 20G

Install 7.0 and shut down:
$ qemu-kvm -m 2048 -cdrom rhev-hypervisor7-7.0-*.iso -serial stdio -hda dst.img

Boot into the 7.1 installer (which hangs while booting into it):
$ qemu-kvm -m 2048 -cdrom rhev-hypervisor7-7.1-20150304.0.iso -serial stdio -hda dst.img -snapshot -boot d
No solution today. There are messages from logind and dbus which look like errors, but those also appear on a successful boot. I put together a scratch build with a systemd debug shell enabled so I can look through systemd-analyze and see what's hanging up, hopefully. Judging from console output, it appears to be ovirt-early, but systemd does not warn about any hung jobs or jobs still starting, which makes me wonder if it's a systemd issue. Will continue looking tomorrow.
Based on my testing today, this issue is a regression. Upgrading from rhev-hypervisor7-7.0-20150127.0 to rhev-hypervisor7-7.1-20150213.0 succeeds, with no hang during the first upgrade. So I added the Regression keyword to this bug; the regression happened between rhev-hypervisor7-7.1-20150213.0 and rhev-hypervisor7-7.1-20150226.0.
This is blocked by bugs 1275956 and 1263648. I will verify this issue after bugs 1275956 and 1263648 are fixed.
Why is this bug blocked by bug 1263648? The issue in bug 1263648 only affects an optional flow. Please drop the dependency if you agree.
Fabian, I have dropped the bug 1263648 dependency in comment 50.
Version-Release number of selected component (if applicable):
rhev-hypervisor7-7.1-20151015.0.el7ev
ovirt-node-3.2.3-23.el7.noarch
rhev-hypervisor7-7.2-20151112.1.el7ev
ovirt-node-3.6.0-0.20.20151103git3d3779a.el7ev.noarch

Test Steps:
1. TUI install rhev-hypervisor7-7.1-20151015.0.el7ev
2. Upgrade RHEV-H 7.1-20151015.0 to rhev-hypervisor7-7.2-20151112.1.el7ev in three ways:
   1) TUI
   2) CMD
   3) RHEVM 3.5 -- Red Hat Enterprise Virtualization Manager Version: 3.5.6.2-0.1.el6ev

Test results:
1. RHEV-H 7.1 can be upgraded to RHEV-H 7.2, and rhevh 7.2 can be logged into successfully.

So this bug is fixed in rhev-hypervisor7-7.2-20151112.1.el7ev; I will change the status to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0378.html