Bug 1494249 - Resizing a Cinder disk for a VM that is powered on corrupts the underlying disk
Summary: Resizing a Cinder disk for a VM that is powered on corrupts the underlying disk
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.1.6.2
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ovirt-4.3.0
Target Release: ---
Assignee: Fred Rolland
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-09-21 20:02 UTC by Logan Kuhn
Modified: 2018-07-16 08:50 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-07-16 08:50:26 UTC
oVirt Team: Storage
Embargoed:
ylavi: ovirt-4.3+


Attachments
engine log during the time the disk was resized (1.90 MB, text/plain)
2017-09-21 20:02 UTC, Logan Kuhn
volume.log (16.09 KB, text/plain)
2017-09-28 13:00 UTC, Logan Kuhn

Description Logan Kuhn 2017-09-21 20:02:53 UTC
Created attachment 1329184 [details]
engine log during the time the disk was resized

Description of problem:
We updated to 4.1.6.2 on Tuesday, and today we tried to resize the root disk of a VM while it was powered on; this worked on 4.0.6.3. The disk that was resized was an LVM root partition.

Version-Release number of selected component (if applicable): 
oVirt - 4.1.6.2
Ceph - 11.2.1
Cinder - 9.1.4
CentOS - 7.2
Scientific Linux - 6.9

How reproducible:
100%

Steps to Reproduce:
1. Increase the size of a Cinder-based disk while the VM is powered on

Actual results:
The VM pauses shortly after the command is run, and the data on the disk is irrecoverable.

Expected results:
The disk is resized, and I'm able to grow it inside the VM with common filesystem tools.

Additional info:
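For reference, a minimal sketch of the same resize driven through the API instead of the UI, assuming the oVirt Python SDK (ovirtsdk4) with a placeholder engine URL, credentials, VM name, and target size; for a Cinder-backed disk this request should correspond to the ExtendCinderDiskCommand flow seen in the engine log:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder engine URL and credentials.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    insecure=True,
)

vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=myvm')[0]  # placeholder VM name

# The resize is requested by raising the disk attachment's provisioned_size
# while the VM is running.
attachments_service = vms_service.vm_service(vm.id).disk_attachments_service()
attachment = attachments_service.list()[0]  # assumes the first attachment is the disk to grow
attachments_service.attachment_service(attachment.id).update(
    types.DiskAttachment(
        disk=types.Disk(
            provisioned_size=50 * 2**30,  # placeholder new size: 50 GiB
        ),
    ),
)

connection.close()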

Comment 1 Logan Kuhn 2017-09-22 12:23:39 UTC
Guest OSes running CentOS 7.3 and 7.4 are also affected. They had the ovirt-guest-agent installed as well.

Comment 2 Logan Kuhn 2017-09-25 00:01:26 UTC
Does the 4.2 flag mean that Red Hat was able to confirm the bug?

Comment 3 Allon Mureinik 2017-09-25 06:50:52 UTC
(In reply to Logan Kuhn from comment #2)
> Does the 4.2 flag mean that Red Hat was able to confirm the bug?

I flagged the BZ with "ovirt-4.2?"; I'm not quite sure why the bot changed it to a +. Fred, the assigned developer, will definitely look into it within this timeline (probably quite shortly), but as we don't have a root cause analysis yet, we cannot commit to a fix date.

Comment 4 Fred Rolland 2017-09-28 09:28:50 UTC
It works fine on 4.2. I will try to reproduce on 4.1.x.

In the log provided, the command starts, but I cannot see any log entries about it completing:

2017-09-21 11:44:18,981 INFO  [org.ovirt.engine.core.bll.storage.disk.cinder.ExtendCinderDiskCommand] (pool-5-thread-6) [6de51a79] Lock Acquired to object 'EngineLock:{exclusiveLocks='[]', sharedLocks='[]'}'
2017-09-21 11:44:18,989 INFO  [org.ovirt.engine.core.bll.storage.disk.cinder.ExtendCinderDiskCommand] (pool-5-thread-6) [6de51a79] Running command: ExtendCinderDiskCommand internal: true. Entities affected :  ID: 0d132c0b-58c0-4166-8ae4-2c6d14f6027a Type: DiskAction group EDIT_DISK_PROPERTIES with role type USER
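
A minimal sketch of checking this (the default engine log path and the correlation ID '6de51a79' come from the excerpt above; everything else is illustrative), pulling every line of this flow out of engine.log to see whether the command ever logged an ending:

# Filter engine.log for the ExtendCinderDiskCommand flow by its correlation ID.
ENGINE_LOG = '/var/log/ovirt-engine/engine.log'  # default engine log location
CORRELATION_ID = '6de51a79'                      # correlation ID from the excerpt above

with open(ENGINE_LOG) as log:
    for line in log:
        if CORRELATION_ID in line or 'ExtendCinderDiskCommand' in line:
            print(line.rstrip())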

Logan,
Can you please provide the Cinder log from the OpenStack server?
It is located here: /var/log/cinder/volume.log

Comment 5 Logan Kuhn 2017-09-28 13:00:39 UTC
Created attachment 1331976 [details]
volume.log

Comment 6 Logan Kuhn 2017-09-28 13:01:59 UTC
tl;dr: a bad DIMM in one of the Cinder servers caused it to be unstable and intermittently unable to communicate with the Ceph cluster.

The longer version is that we have several servers managing Cinder services, and almost all of the traffic goes through the one server I pulled logs from. I've attached the volume log as well; there doesn't appear to be anything from this event in it. The failed server kernel panicked when I went to check its log and is now sitting in a bad state, but it would seem I can now resize any disks I want.

I've resized disks with an LVM root, a non-LVM root, and no LVM at all. All were CentOS 7 guests.

My theory is that the disks that were corrupted (which was all of the disks tested) queried the <good server> API but used the <bad server> volume service, and since that server couldn't consistently talk to Ceph, the operation never returned, causing what appeared to be a hang and then a timeout, which left the disks inconsistent.
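
A minimal sketch (python-cinderclient with placeholder credentials, volume ID, and expected size) of the kind of check that fits this theory: confirm from the Cinder API that the extend really completed before doing any filesystem work inside the guest, so a volume service that hangs or can't reach Ceph shows up as a size/status that never changes rather than as silent corruption:

import time
from cinderclient import client

# Placeholder credentials and endpoint.
cinder = client.Client('2', 'admin', 'secret', 'services',
                       'http://cinder.example.com:5000/v2.0')

VOLUME_ID = '<cinder-volume-id>'   # placeholder
EXPECTED_SIZE_GB = 50              # placeholder: the size requested through oVirt

deadline = time.time() + 300
while time.time() < deadline:
    vol = cinder.volumes.get(VOLUME_ID)
    if vol.size == EXPECTED_SIZE_GB and vol.status in ('available', 'in-use'):
        print('extend confirmed by the Cinder API')
        break
    if 'error' in vol.status:
        raise RuntimeError('extend failed, volume status: %s' % vol.status)
    time.sleep(5)
else:
    raise RuntimeError('extend never confirmed; do not resize the filesystem in the guest')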

