Bug 1494249

Summary: Resizing a Cinder disk for a VM that is powered on corrupts the underlying disk
Product: [oVirt] ovirt-engine
Component: BLL.Storage
Version: 4.1.6.2
Status: CLOSED NOTABUG
Severity: medium
Priority: medium
Reporter: Logan Kuhn <logank>
Assignee: Fred Rolland <frolland>
QA Contact: Elad <ebenahar>
CC: bugs, ebenahar, logank, tnisan, ylavi
Target Milestone: ovirt-4.3.0
Flags: ylavi: ovirt-4.3+
Type: Bug
oVirt Team: Storage
Last Closed: 2018-07-16 08:50:26 UTC
Attachments:
- engine log during the time the disk was resized
- volume.log

Description Logan Kuhn 2017-09-21 20:02:53 UTC
Created attachment 1329184 [details]
engine log during the time the disk was resized

Description of problem:
We updated to 4.1.6.2 on Tuesday, and today we tried to resize the root disk of a VM while it was powered on; this worked on 4.0.6.3. The disk that was resized contained an LVM root partition.

Version-Release number of selected component (if applicable): 
oVirt - 4.1.6.2
Ceph - 11.2.1
Cinder - 9.1.4
CentOS - 7.2
Scientific Linux - 6.9

How reproducible:
100%

Steps to Reproduce:
1. Increase the size of a Cinder-based disk on a powered-on VM (for example via the engine API, as sketched below)
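
For reference, the same resize can be driven through the engine's REST API; the following is a minimal sketch assuming the ovirt-engine-sdk4 Python bindings. The engine URL, credentials, disk ID, and target size are placeholders, not values from this bug, and on some engine versions the update may need to go through the VM's disk attachments service instead.

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details; not the reporter's environment.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)
try:
    # Look up the Cinder-backed disk by its (placeholder) ID and request a
    # larger provisioned size; for a Cinder disk this should correspond to
    # the ExtendCinderDiskCommand seen in the engine log.
    disk_service = connection.system_service().disks_service().disk_service('DISK-UUID')
    disk_service.update(
        types.Disk(provisioned_size=20 * 2**30),  # grow to 20 GiB (example value)
    )
finally:
    connection.close()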

Actual results:
The VM pauses shortly after the command is run and the data on the disk is irrecoverable

Expected results:
The disk is resized and I'm able to grow the filesystem inside the VM with common filesystem tools
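
To illustrate what "common filesystem tools" would mean for the LVM root case, here is a minimal in-guest sketch (Python driving the standard utilities). The device, VG, and LV names assume a default CentOS 7 layout and are illustrative only, not taken from this bug report.

import subprocess

def run(cmd):
    # Echo and run each step, stopping on the first failure.
    print('+', ' '.join(cmd))
    subprocess.run(cmd, check=True)

run(['growpart', '/dev/vda', '2'])                        # grow partition 2 (cloud-utils-growpart)
run(['pvresize', '/dev/vda2'])                            # grow the LVM physical volume
run(['lvextend', '-l', '+100%FREE', '/dev/centos/root'])  # grow the root logical volume
run(['xfs_growfs', '/'])                                  # CentOS 7 default root is XFS; use resize2fs for ext4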

Additional info:

Comment 1 Logan Kuhn 2017-09-22 12:23:39 UTC
Guests running CentOS 7.3 and 7.4 are also affected. They had the ovirt-guest-agent installed as well.

Comment 2 Logan Kuhn 2017-09-25 00:01:26 UTC
Does the 4.2 flag mean that Red Hat was able to confirm the bug?

Comment 3 Allon Mureinik 2017-09-25 06:50:52 UTC
(In reply to Logan Kuhn from comment #2)
> Does the 4.2 flag mean that Red Hat was able to confirm the bug?

I flagged the BZ with "ovirt-4.2?"; I'm not quite sure why the bot changed it to a +. Fred, the assigned developer, will definitely look into it in this timeline (probably quite shortly), but as we don't have a root cause analysis yet, we cannot commit to a fix date.

Comment 4 Fred Rolland 2017-09-28 09:28:50 UTC
It works fine on 4.2. I will try to reproduce on 4.1.x.

In the log provided, the command starts, but I cannot see any log entries showing it completing:

2017-09-21 11:44:18,981 INFO  [org.ovirt.engine.core.bll.storage.disk.cinder.ExtendCinderDiskCommand] (pool-5-thread-6) [6de51a79] Lock Acquired to object 'EngineLock:{exclusiveLocks='[]', sharedLocks='[]'}'
2017-09-21 11:44:18,989 INFO  [org.ovirt.engine.core.bll.storage.disk.cinder.ExtendCinderDiskCommand] (pool-5-thread-6) [6de51a79] Running command: ExtendCinderDiskCommand internal: true. Entities affected :  ID: 0d132c0b-58c0-4166-8ae4-2c6d14f6027a Type: DiskAction group EDIT_DISK_PROPERTIES with role type USER

Logan,
Can you please provide the Cinder log from the OpenStack server?
It is located at /var/log/cinder/volume.log

Comment 5 Logan Kuhn 2017-09-28 13:00:39 UTC
Created attachment 1331976 [details]
volume.log

Comment 6 Logan Kuhn 2017-09-28 13:01:59 UTC
tl;dr: a bad DIMM in one of the Cinder servers caused it to be unstable and intermittently unable to communicate with the Ceph cluster.

The longer version is that we have several servers managing Cinder services, and almost all of the traffic goes through the one server I pulled logs from. I've attached the volume log as well; there doesn't appear to be anything from this event in it. When I went to check the failed server's log, that server kernel panicked and is now sitting in a bad state, but it would seem I can now resize any disks I want.

I've resized a disk with an LVM root, one with a non-LVM root, and one with no LVM at all. All were CentOS 7.

My theory is that the disks that were corrupted (which was all of the disks tested) queried the <good server> API but used the <bad server> volume service, and since that server couldn't consistently talk to Ceph, the request never returned, causing what appeared to be a hang and a timeout that left the disks inconsistent.