Bug 1290160

Summary: Add-Disk operation failed to complete
Product: Red Hat Enterprise Virtualization Manager
Component: vdsm
Version: 3.6.0
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: unspecified
Reporter: Sanjay Rao <srao>
Assignee: Fred Rolland <frolland>
QA Contact: Aharon Canan <acanan>
CC: amureini, bazulay, ecohen, gklein, lsurette, srao, tnisan, ycui, yeylon, ylavi
Target Milestone: ovirt-3.6.2
Target Release: 3.6.0
Hardware: Unspecified
OS: Linux
Whiteboard: storage
oVirt Team: Storage
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-01-11 14:43:44 UTC
Attachments:
  vdsm log from gprfs030
  vdsm log from gprfs031
  vdsm log from gprfs029
  engine.log for the RHEV cluster

Description Sanjay Rao 2015-12-09 19:10:44 UTC
Description of problem:
I have a working RHEL/KVM environment managed by RHEV, with a working glusterfs mount and volumes in the RHEV cluster. When I try to create a new VM, it won't let me add new disks.


Version-Release number of selected component (if applicable):
RHEV 3.6
qemu-kvm-rhev-2.3.0-31.el7_2.3.x86_64
qemu-img-rhev-2.3.0-31.el7_2.3.x86_64
vdsm-4.17.10.1-0.el7ev.noarch



How reproducible:
Easily reproducible

Steps to Reproduce:
1. Attempt to add a new disk to a VM in the RHEV cluster.

Actual results:
The Add-Disk operation fails to complete.

Expected results:
The disk is created on the gluster-backed storage domain.

Additional info:

Comment 1 Allon Mureinik 2015-12-10 10:00:02 UTC
Please include the engine and VDSM logs.

Comment 2 Sanjay Rao 2015-12-10 12:59:51 UTC
Created attachment 1104327 [details]
vdsm log from gprfs030

Comment 3 Sanjay Rao 2015-12-10 13:00:40 UTC
Created attachment 1104328 [details]
vdsm log from gprfs031

Comment 4 Sanjay Rao 2015-12-10 13:01:13 UTC
Created attachment 1104329 [details]
vdsm log from gprfs029

Comment 5 Sanjay Rao 2015-12-10 13:05:38 UTC
This testing is done on a RHEV cluster with 3 hosts: gprfs029, gprfs030, and gprfs031. The 3 hosts also serve a 3-way replicated gluster file system, and the RHEV cluster is using the gluster volume gprfs029:gl_01 for storage.

I was able to successfully create 3 disks in the storage pool and run 1 VM. When I try to add more disks now, the operation fails. The gluster volume itself appears fine, because I can still run the VM that I created earlier.

Comment 6 Sanjay Rao 2015-12-10 13:06:28 UTC
Created attachment 1104332 [details]
engine.log for the RHEV cluster

Comment 7 Allon Mureinik 2015-12-10 13:46:25 UTC
Writing the volume's metadata seems to fail for some reason:

abcf08c7-19bc-4687-94a3-500c96e3ee67::ERROR::2015-12-10 07:53:34,391::volume::518::Storage.Volume::(create) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/volume.py", line 509, in create
    volType, diskType, desc, LEGAL_VOL)
  File "/usr/share/vdsm/storage/volume.py", line 889, in newMetadata
    cls.createMetadata(metaId, meta)
  File "/usr/share/vdsm/storage/fileVolume.py", line 333, in createMetadata
    cls.__putMetadata(metaId, meta)
  File "/usr/share/vdsm/storage/fileVolume.py", line 326, in __putMetadata
    f.write("EOF\n")
IOError: [Errno 22] Invalid argument

Can't imagine why, though - need to dig a bit further.
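
A minimal sketch (not vdsm code) that mimics the failing write pattern from fileVolume.py directly against the gluster-backed mount can help separate a storage-layer EINVAL from a vdsm-level bug. The mount path below matches the one in the vdsm logs; the test file name and metadata keys are made up for the check.

# Sketch only: reproduce the metadata-style write (KEY=value lines followed by
# an EOF marker, as seen in the traceback) outside vdsm, on the same mount.
import os

MOUNT = "/rhev/data-center/mnt/glusterSD/gprfs029-10ge:gl__01"
TEST_FILE = os.path.join(MOUNT, "bz1290160_write_test.meta")  # hypothetical name

def try_metadata_style_write():
    try:
        with open(TEST_FILE, "w") as f:
            f.write("DESCRIPTION=write test\n")
            f.write("LEGALITY=LEGAL\n")
            f.write("EOF\n")  # the traceback fails on this final write
        print("write OK - plain file I/O on the mount works")
    except IOError as e:
        # Errno 22 (EINVAL) here would point at the gluster/FUSE layer
        # rather than at vdsm itself.
        print("write failed: errno=%d (%s)" % (e.errno, e.strerror))
    finally:
        if os.path.exists(TEST_FILE):
            os.unlink(TEST_FILE)

if __name__ == "__main__":
    try_metadata_style_write()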

Comment 9 Sanjay Rao 2015-12-11 11:26:59 UTC
I am sorry, my cluster went into a really bad state when I took the storage offline and tried to bring it back online. I had to remove the storage and recreate the datacenter.

I will try to reproduce the error and report back to the BZ.

Comment 10 Sanjay Rao 2015-12-11 11:59:48 UTC
Not sure if this is related, but I cannot add the storage domain to the cluster. I have re-created the gluster volume and set the permissions:

Volume Name: gl_01
Type: Replicate
Volume ID: 857ed73d-d69c-42f8-81f7-35cfdb2e77bc
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: gprfs029-10ge:/brick/b01/g
Brick2: gprfs030-10ge:/brick/b01/g
Brick3: gprfs031-10ge:/brick/b01/g
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
performance.readdir-ahead: on


But I get a "Failed to add storage domain" error. I have verified that the volume can be mounted manually on the same hosts.
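
A rough sketch of those manual checks, assuming the standard vdsm uid/gid of 36 shown in the volume options above; the scratch mount point and test file name are hypothetical.

# Sketch only: mount the gluster volume and confirm that a write works as the
# vdsm user, which is what adding the storage domain ultimately needs.
import os
import subprocess

VOLUME = "gprfs029-10ge:/gl_01"
MNT = "/mnt/gl_01_check"  # hypothetical scratch mount point

def check_volume_writable_as_vdsm():
    if not os.path.isdir(MNT):
        os.makedirs(MNT)
    subprocess.check_call(["mount", "-t", "glusterfs", VOLUME, MNT])
    try:
        st = os.stat(MNT)
        print("mount root owner uid=%d gid=%d (expect 36/36)" % (st.st_uid, st.st_gid))
        test_file = os.path.join(MNT, "domain_write_check")
        # Run the write as the vdsm user, the way the storage domain would.
        rc = subprocess.call(["runuser", "-u", "vdsm", "--", "touch", test_file])
        print("write as vdsm: %s" % ("OK" if rc == 0 else "FAILED (rc=%d)" % rc))
        if os.path.exists(test_file):
            os.unlink(test_file)
    finally:
        subprocess.call(["umount", MNT])

if __name__ == "__main__":
    check_volume_writable_as_vdsm()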

Comment 11 Sanjay Rao 2015-12-11 12:40:44 UTC
I am re-creating the whole setup. I am getting VM import errors from one of the hosts in the engine.log, although this is a new setup. I am guessing there is some stuff left over on the host from the previous setup.

Comment 16 Fred Rolland 2015-12-21 15:45:00 UTC
Hi,

From the logs provided, I can see that there is a connectivity issue with the gluster storage.
Right after the first metadata writing error, there is an attempt to disconnect from the mount, which also failed:

MountError: (32, ';umount: /rhev/data-center/mnt/glusterSD/gprfs029-10ge:gl__01: mountpoint not found\n')

Sanjay, is this issue still happening?

There are logs under /var/log/glusterfs that may help with understanding the issue.
Can you please provide the logs from around 2015-12-09 14:18:34 on host gprfs029?

Thread-214905::DEBUG::2015-12-09 14:18:34,195::mount::229::Storage.Misc.excCmd::(_runcmd) /usr/bin/sudo -n /usr/bin/umount -f -l /rhev/data-center/mnt/glusterSD/gprfs029-10ge:gl__01

In any case, it seems like a gluster problem more than an issue in vdsm.

Thanks,

Freddy
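
A small helper sketch for pulling the requested lines out of the client logs, assuming the bracketed "[YYYY-MM-DD HH:MM:SS" prefix that glusterfs client logs normally use; widen the window as needed.

# Sketch only: collect glusterfs client log lines around 2015-12-09 14:18
# from /var/log/glusterfs on the affected host.
import glob

WINDOW_PREFIXES = ("[2015-12-09 14:17", "[2015-12-09 14:18", "[2015-12-09 14:19")

def collect_gluster_log_lines(log_dir="/var/log/glusterfs"):
    hits = []
    for path in sorted(glob.glob(log_dir + "/*.log")):
        with open(path) as f:
            for line in f:
                if line.startswith(WINDOW_PREFIXES):
                    hits.append("%s: %s" % (path, line.rstrip()))
    return hits

if __name__ == "__main__":
    for line in collect_gluster_log_lines():
        print(line)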

Comment 18 Sanjay Rao 2016-01-04 19:20:25 UTC
I have not seen the problem with the new config that I created, but I have a theory. I think this happens if a RHEV storage pool is created on the gluster volume and any other files are then created on the gluster volume outside the RHEV storage pool. This can easily happen, as gluster is a shared file system. I think this confuses RHEV because the available space on the gluster volume changes. I have not had the chance to test the theory.
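
As an illustration of the theory, the free space a file-based storage domain reports is just the filesystem free space of its mount, so anything written to the gluster volume from outside RHEV shifts it. The mount path below is the one from the earlier vdsm logs.

# Sketch only: report free space on the gluster-backed mount the same way any
# filesystem-level check would see it.
import os

MOUNT = "/rhev/data-center/mnt/glusterSD/gprfs029-10ge:gl__01"

def report_free_space(path=MOUNT):
    st = os.statvfs(path)
    free_gib = st.f_bavail * st.f_frsize / 2.0 ** 30
    total_gib = st.f_blocks * st.f_frsize / 2.0 ** 30
    print("free: %.1f GiB of %.1f GiB" % (free_gib, total_gib))

if __name__ == "__main__":
    report_free_space()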

Comment 19 Fred Rolland 2016-01-11 12:01:50 UTC
Hi, I have not been able to reproduce this on my setup.

All disk operations were OK, both before and after adding a file to the gluster volume from outside RHEV.

I think we should close this BZ unless we have a clear scenario to reproduce.

Comment 20 Sanjay Rao 2016-01-11 14:33:38 UTC
I am ok with closing this for now because my environment is not reproducing the problem after the rebuild. If I run into the problem again, I will re-open the BZ or file a new ticket.