Bug 1334726 - [PPC][rhevm-3.6.6-0.1] RHEL 7.2 vm with copied disks enters emergency mode when booted.
Summary: [PPC][rhevm-3.6.6-0.1] RHEL 7.2 vm with copied disks enters emergency mode when booted.
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 3.6.5.1
Hardware: ppc64le
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.0.7
Target Release: ---
Assignee: Nir Soffer
QA Contact: Carlos Mestre González
URL:
Whiteboard:
Duplicates: 1439683
Depends On:
Blocks: RHV4.1PPC 1361549
 
Reported: 2016-05-10 12:20 UTC by Carlos Mestre González
Modified: 2017-04-06 13:12 UTC
CC List: 14 users

Fixed In Version:
Clone Of:
Cloned to: 1361549
Environment:
Last Closed: 2016-11-28 16:10:12 UTC
oVirt Team: Storage
Embargoed:
amureini: ovirt-4.0.z?
rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?


Attachments
Collection of messages, journalctl output, engine and vdsm. (1.97 MB, application/x-gzip), 2016-05-10 12:26 UTC, Carlos Mestre González
libvirtd.log and other logs for passing and failing scenario (3.02 MB, application/x-gzip), 2016-05-23 12:44 UTC, Carlos Mestre González
screenshot fstab - kernel version after failure to start (32.76 KB, image/png), 2016-11-22 17:15 UTC, Carlos Mestre González


Links
Red Hat Bugzilla 1374545 (urgent, CLOSED): Guest LVs created in ovirt raw volumes are auto activated on the hypervisor in RHEL 7 (last updated 2021-06-10 11:42:20 UTC)

Internal Links: 1374545

Description Carlos Mestre González 2016-05-10 12:20:21 UTC
Description of problem:
This seems related to RHEL 7.2 on PowerPC and how it handles the devices, but I need your input before moving it to other teams in case I'm missing something.

Basically, the scenario is: copy all the iSCSI disks from a VM to a new one, boot it, and verify it works. On PowerPC, after booting the copied VM the system drops into emergency mode and prompts for the root password (the system does not fully boot).


Version-Release number of selected component (if applicable):
rhevm-3.6.6-0.1
qemu-kvm-rhev-2.3.0-31.el7_2.12.ppc64le
qemu-img-rhev-2.3.0-31.el7_2.12.ppc64le
vdsm-4.17.27-0.el7ev.noarch

VM:
RHEL 7.2 3.10.0-327.13.1.el7.ppc64le #1 SMP Mon Feb 29 13:22:06 EST 2016 ppc64le ppc64le ppc64le GNU/Linux
Host:
RHEL 7.2 3.10.0-327.18.2.el7.ppc64le #1 SMP Fri Apr 8 05:10:45 EDT 2016 ppc64le ppc64le ppc64le GNU/Linux

How reproducible:
100%


Steps to Reproduce:
1. Clone a VM with a RHEL 7.2 3.10.0-327.18.2.el7 thin-provisioned disk.
2. Create 1 GB disks for every permutation of interface (VirtIO, VirtIO-SCSI, sPAPR VSCSI) and allocation (thin provisioned / preallocated), 6 disks in total.
3. Attach those disks to the VM and activate them.
4. Start the VM and, on each disk, create a partition and filesystem (ext4 in this case) and a small file (see the sketch after this list).
5. Shut down the VM.
6. Copy the VM's disks to a new iSCSI domain.
7. Create a new VM and attach all the copied disks to it.
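
A minimal sketch of step 4 for a single disk, assuming it shows up as /dev/vdc inside the guest (device names vary by interface; the mount point name is illustrative):

# GPT label; parted uses "primary" as the partition *name* here, which is
# likely why dev-disk-by-partlabel-primary shows up in the journal below.
parted --script /dev/vdc mklabel gpt mkpart primary ext4 1MiB 100%
mkfs.ext4 /dev/vdc1                      # filesystem, ext4 in this case
mkdir -p /mount-point-test
mount /dev/vdc1 /mount-point-test
echo test > /mount-point-test/marker     # the small file from step 4
umount /mount-point-test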

Actual results:
The VM boots into emergency mode (I cannot tell exactly why; per the logs it seems related to the devices and the failure to mount vdc).

Expected results:
The VM boots normally with all the filesystems present, as in the original VM.

Additional info:
After entering emergency mode it recommends checking journalctl -xb. I checked, but the only highlighted errors are ones that also appear on the original VM (which runs fine), just with a different device:

copied vm journalctl log:
May 09 10:56:09 dhcp167-130.klab.eng.bos.redhat.com systemd[1]: Device dev-disk-by\x2dpartlabel-primary.device appeared twice with different sysfs paths /sys/devices/pci0000:00/0000:00:05.0/virtio3/block/vdc/vdc1 and /sys/devices/pci0000:00/0000:00:07.0/virtio1/block/vda/vda1
May 09 10:56:16 localhost.localdomain systemd[1]: Device dev-disk-by\x2dpartlabel-primary.device appeared twice with different sysfs paths /sys/devices/pci0000:00/0000:00:05.0/virtio3/block/vdc/vdc1 and /sys/devices/vio/2000/host0/target0:0:0/0:0:0:2/block/sdb/sdb1
May 09 10:56:16 localhost.localdomain systemd[1]: Device dev-disk-by\x2dpartlabel-primary.device appeared twice with different sysfs paths /sys/devices/pci0000:00/0000:00:05.0/virtio3/block/vdc/vdc1 and /sys/devices/vio/2000/host0/target0:0:0/0:0:0:3/block/sda/sda1
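
The collision can be confirmed from inside the guest; a small sketch using standard util-linux tooling (columns as supported by RHEL 7's lsblk):

# Several partitions sharing the PARTLABEL "primary" is what triggers the
# "appeared twice" messages above.
lsblk -o NAME,PARTLABEL,UUID,MOUNTPOINT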

The partitions all mount fine except for the vdc device:

Filesystem                          Size  Used Avail Use% Mounted on
/dev/mapper/rhel_dhcp167--130-root  8.5G  2.3G  6.2G  28% /
devtmpfs                            458M     0  458M   0% /dev
tmpfs                               503M     0  503M   0% /dev/shm
tmpfs                               503M   12M  491M   3% /run
tmpfs                               503M     0  503M   0% /sys/fs/cgroup
/dev/sdc1                           992M  2.6M  923M   1% /mount-point8083c9c2bf430399d6343552e55ee67aa9fbd8a9
/dev/sdd1                           992M  2.6M  923M   1% /mount-pointbebfd787df8bfa9eed33b5b1ea3ac5bf4059e754
/dev/vda1                           992M  2.6M  923M   1% /mount-pointac3bff80b8c4147949ffadaaf1af86295d7b512f
/dev/sdb1                           992M  2.6M  923M   1% /mount-pointaf322a92a6ae89bf0b62560771c3d3066390bd1b
/dev/sda1                           992M  2.6M  923M   1% /mount-pointe62eb700e5f83257db9c2bd1a1848075418275cc
/dev/vdb2                           497M  279M  218M  57% /boot


I'll attach the logs.

Comment 1 Carlos Mestre González 2016-05-10 12:26:42 UTC
Created attachment 1155729 [details]
Collection of messages, journalctl output, engine and vdsm.

This contains output of /var/log/messages and journalctl output for both the original vm and the new one (with the copied disks - copied_vm).

Also includes df output for the copied vm and the GET /disks for the vm with all the disks attached.

I also included the engine.log of the run and the vdsm.log just in case.

Comment 2 Tal Nisan 2016-05-11 08:42:30 UTC
Nir, have a look please

Comment 3 Yaniv Kaul 2016-05-16 12:52:42 UTC
- Is it a regression?
- Does it happen on X86?
- Is it reproduced?
- Is it related to the disk types (there are different types here)?

Comment 4 Carlos Mestre González 2016-05-18 14:09:23 UTC
(In reply to Yaniv Kaul from comment #3)
> - Is it a regression?
I don't know; this was a new test that ran for the first time in this build.

> - Does it happen on X86?
No. I marked this specifically for the PPC architecture (if it also happened on x86 I would have set the hardware to x86 and mentioned PPC in the description).

> - Is it reproduced?
100%; I noted that in the description.

> - Is it related to the disk types (there are different types here)?
I'll try other scenarios with different types and update.

Comment 5 Carlos Mestre González 2016-05-20 16:30:09 UTC
I simplified the test and checked the different types; it is the same as in my first comment, but instead of attaching all 6 disks for all permutations, I tested 1 boot disk + 2 attached 1 GB disks (thin and preallocated) with different interfaces:

Boot disk     2 Attached disks with fs
VIRTIO        VIRTIO              => FAILS to boot
VIRTIO        VIRTIO SCSI         => PASS
VIRTIO        sPAPR VSCSI         => PASS
VIRTIO_SCSI   VIRTIO SCSI         => PASS
VIRTIO_SCSI   VIRTIO              => PASS

The emergency mode only happens when all three disks are VIRTIO.

Remember, this happens when copying the disks to another domain and attaching them to a new VM; the original VM works fine with any combination of disks.

Comment 6 Carlos Mestre González 2016-05-20 16:34:39 UTC
Putting needinfo back.

Comment 7 Yaniv Kaul 2016-05-20 17:26:51 UTC
(In reply to Carlos Mestre González from comment #5)
> I simplified the test and checked the different types; it is the same as in
> my first comment, but instead of attaching all 6 disks for all permutations,
> I tested 1 boot disk + 2 attached 1 GB disks (thin and preallocated) with
> different interfaces:
> 
> Boot disk     2 Attached disks with fs
> VIRTIO        VIRTIO              => FAILS to boot
> VIRTIO        VIRTIO SCSI         => PASS
> VIRTIO        sPAPR VSCSI         => PASS
> VIRTIO_SCSI   VIRTIO SCSI         => PASS
> VIRTIO_SCSI   VIRTIO              => PASS
> 
> The emergency mode only happens when all three disks are VIRTIO.
> 
> Remember, this happens when copying the disks to another domain and
> attaching them to a new VM; the original VM works fine with any combination
> of disks.

Excellent information. Can you compare the libvirt XMLs to see what the difference is, if any? I wonder if the disk order changed.

Comment 8 Carlos Mestre González 2016-05-23 12:44:19 UTC
Created attachment 1160608 [details]
libvirtd.log and other logs for passing and failing scenario

I attached the libvirtd logs and other logs for two scenarios that are identical except for the disk interfaces, per my previous comment:

Boot disk     2 Attached disks with fs
VIRTIO        VIRTIO              => FAILS to boot
VIRTIO        VIRTIO SCSI         => PASS

Regarding the tests: copy_disk_test_vm is the one that fails to boot and copy_disk_vm_iscsi is the original one; the timestamps in the qemu logs show the start/shutdown times.

In the failed run (copy_disk_test_vm):
2016-05-23 11:30:36.709+0000:
2016-05-23 11:41:12.287+0000: shutting down

In the passing run:
2016-05-23 12:14:02.631+0000
2016-05-23 12:15:35.476+0000: shutting down

Comment 9 Marian Csontos 2016-06-21 11:33:13 UTC
Ran into this BZ by chance while searching for another one. Found this in messages:

> May  9 10:42:36 dhcp167-130 systemd: Device dev-disk-by\x2dpartlabel-primary.device appeared twice with different sysfs paths /sys/devices/pci0000:00/0000:00:07.0/virtio1/block/vdb/vdb1 and /sys/devices/vio/2000/host0/target0:0:0/0:0:0:3/block/sda/sda1

Any chance LVM (and lvmetad) sees duplicate PVs?

If so, you should either remove the duplicate disk or filter out devices by setting global_filter in lvm.conf.
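
A minimal sketch of such a filter, assuming the duplicate PV is visible as /dev/vdc (adjust the pattern to the actual device):

# /etc/lvm/lvm.conf, in the devices section:
devices {
    # reject anything on /dev/vdc, accept everything else
    global_filter = [ "r|^/dev/vdc.*|", "a|.*|" ]
}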

I also found the following in the journal:

May 09 10:56:16 localhost.localdomain kernel: EXT4-fs (vdb1): VFS: Can't find ext4 filesystem
May 09 10:56:16 localhost.localdomain systemd[1]: mount\x2dpoint6fb75ac675058ef30267bb71e17db05e5d622560.mount mount process exited, code=exited status=32
May 09 10:56:16 localhost.localdomain systemd[1]: Failed to mount /mount-point6fb75ac675058ef30267bb71e17db05e5d622560.
May 09 10:56:16 localhost.localdomain systemd[1]: Dependency failed for Local File Systems.

Are there any `/dev/vdXN` or `/dev/sdXN` entries in /etc/fstab?
I have seen /dev/vdX names change after cloning a VM.

Replace them with `LABEL=` or `UUID=` lines.
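
For example (a sketch; the mount point and UUID value are illustrative, not taken from this VM):

# /etc/fstab, before: bare device name, which can change across reboots
/dev/vdc1  /mount-point1  ext4  defaults  0 0
# after: stable filesystem UUID
UUID=3f5ad593-4546-4a94-a374-7e5a0098d5ac  /mount-point1  ext4  defaults  0 0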

Comment 10 Allon Mureinik 2016-07-11 12:00:13 UTC
No clear RCA ATM, pushing out to 3.6.9.

Comment 11 Sandro Bonazzola 2016-07-29 11:18:28 UTC
3.6 has gone EOL; please re-target this bug to a 4.0 release.

Comment 12 Raz Tamir 2016-08-03 07:23:32 UTC
Affects also 4.0 (4.0.2.3-0.1.el7ev)

Comment 13 Fred Rolland 2016-11-03 15:02:09 UTC
Hi Carlos,

Can you reply to comment #9 regarding '/etc/fstab'?

Thanks

Comment 14 Carlos Mestre González 2016-11-22 17:15:07 UTC
Created attachment 1222805 [details]
screenshot fstab - kernel version after failure to start

Hi, 

The fstab is also in the description of the bug. 

I made a new run and took a screenshot of /etc/fstab; as you can see, there are multiple vdXN and sdXN entries in the file.

This was tested with kernel 3.10.0-327.

Comment 15 Fred Rolland 2016-11-23 10:55:38 UTC
(In reply to Carlos Mestre González from comment #14)
> Created attachment 1222805 [details]
> screenshot fstab - kernel version after failure to start
> 
> Hi, 
> 
> The fstab is also in the description of the bug. 
> 
> I made a new run and took a screenshot of /etc/fstab; as you can see, there
> are multiple vdXN and sdXN entries in the file.
> 
> This was tested with kernel 3.10.0-327.

I think the best practice is to use UUIDs in /etc/fstab.
Can you try to reproduce while mounting with UUIDs?

Comment 16 Carlos Mestre González 2016-11-28 13:41:43 UTC
Yes, with UUIDs the OS boots properly.

Comment 17 Fred Rolland 2016-11-28 16:10:12 UTC
Hi Derek,

We cannot currently fix this issue, but we definitely need to have the workaround/best practice documented somewhere.

Any suggestions ?

Thanks,

Freddy

Comment 18 Fred Rolland 2016-11-29 08:28:23 UTC
From [1]:

Issue
    After rebooting, one of my /dev/sdX partitions did not mount automatically.

Resolution
    Device names like /dev/sdX can change across reboots.
To prevent this from happening, set /etc/fstab to use either UUIDs or labels.


[1] https://access.redhat.com/solutions/424513
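
A sketch of looking up the UUID to use in /etc/fstab (output abbreviated, values illustrative):

# print the filesystem UUID for a partition
blkid /dev/vdc1
# /dev/vdc1: UUID="3f5ad593-4546-4a94-a374-7e5a0098d5ac" TYPE="ext4"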

Comment 21 Carlos Mestre González 2017-04-06 13:12:12 UTC
*** Bug 1439683 has been marked as a duplicate of this bug. ***

