Bug 1373118 - [scale] Unable to add more disks "VolumeCreationError: Error creating a new volume" (we do not support more than 1948 leases due to lease space exhaustion )
Summary: [scale] Unable to add more disks "VolumeCreationError: Error creating a new v...
Keywords:
Status: CLOSED DUPLICATE of bug 1386732
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.1.0-beta
Assignee: Nir Soffer
QA Contact: eberman
URL:
Whiteboard:
Depends On: 1374545
Blocks: 1386732
Reported: 2016-09-05 09:19 UTC by mlehrer
Modified: 2017-03-06 22:09 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1386732 (view as bug list)
Environment:
Last Closed: 2016-12-22 21:06:04 UTC
oVirt Team: Storage
amureini: ovirt-4.1?
rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?



Description mlehrer 2016-09-05 09:19:27 UTC
Description of problem:
Using 2 hosts connected to iSCSI storage, I created 665 VMs with 3165 disks from a template.

The following errors occur:

-"VolumeCreationError: Error creating a new volume: (u"Volume creation 'Sanlock resource init failure', 'No space left on device')",)" even though the device is not actually out of space.
-Vdsm SchemaCache Warning provided parameters do not match any union VmStats values
-ERROR could not allocate request thread "raise TOOManyTasks"
-ERROR Failed to HSMGetAllTasksStatusesVDS, error = Error creating a new volume, code = 205 //engine.log

Current # of LVs: 3068

After confirming that "VolumeCreationError: Error creating a new volume" occurs whenever I attempt to create another disk, I verified that I can create an LV manually and that the total number of LVs increases after lvcreate is run manually.

     VG                                   #PV #LV  #SN Attr   VSize   VFree  
    86f1a1bc-7749-4a6b-8ea6-4a825c8cbda3   1    8   0 wz--n- 499.62g 495.50g
    b5067b73-9954-45b2-909a-b26f09f7105f   3    9   0 wz--n-   2.98t   2.97t
    c6838f59-c474-457f-8464-9771a318b752   2 3068   0 wz--n-  10.00t   6.54t
    d4baca8b-3aa4-4f0a-8ff1-5222812082c9   1   13   0 wz--n-   5.00t   4.98t
    vg0                                    1    3   0 wz--n- 277.97g  60.00m


Version-Release number of selected component (if applicable):
vdsm-api-4.18.2-0.el7ev.noarch
vdsm-4.18.2-0.el7ev.x86_64
vdsm-yajsonrpc-4.18.2-0.el7ev.noarch
vdsm-python-4.18.2-0.el7ev.noarch
vdsm-jsonrpc-4.18.2-0.el7ev.noarch
vdsm-xmlrpc-4.18.2-0.el7ev.noarch
vdsm-hook-vmfex-dev-4.18.2-0.el7ev.noarch
vdsm-infra-4.18.2-0.el7ev.noarch
vdsm-cli-4.18.2-0.el7ev.noarch


How reproducible:
Seems consistent.

Steps to Reproduce:
1. Set up 2 hosts and 1 engine with iSCSI LUNs
2. Create 4 large SDs (2 or 5 TB)
3. Create VM from template which has 5 disks in total; execute more than 3,100 times

Actual results:
Unable to add more disks to the system.
LV / VG operations take 1-2s; pvscan --cache took minutes (improved after restart)
Dashboard may show information inconsistent with Storage Domain UI changes (e.g. size of LUN after extension)


Expected results:
Ability to add more disks, since I am able to create more LVs manually


Additional info:
See private comment for further details / attachment info

Comment 3 Nir Soffer 2016-09-05 10:43:05 UTC
(In reply to mlehrer from comment #0)
> -"VolumeCreationError: Error creating a new volume: (u"Volume creation
> 'Sanlock resource init failure', 'No space left on device')",)" when not
> true.

Each volume has a 1 MiB lease in the leases volume. The leases volume is 2 GiB,
so it can hold 2048 leases. The first 100 leases are reserved, so you have space
for 1948 leases in each storage domain, limiting the number of volumes to 1948.
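
As a rough check of the arithmetic above, here is a minimal sketch in Python. The constant and function names are made up for illustration and are not taken from vdsm; only the three numbers come from this comment.

```python
# Lease capacity of a storage domain, using the figures quoted above.
MiB = 1024 ** 2
GiB = 1024 ** 3

LEASES_VOLUME_SIZE = 2 * GiB  # size of the leases volume
LEASE_SIZE = 1 * MiB          # one lease per volume
RESERVED_LEASES = 100         # the first 100 leases are reserved


def max_volumes(leases_volume_size=LEASES_VOLUME_SIZE,
                lease_size=LEASE_SIZE,
                reserved=RESERVED_LEASES):
    """Return how many volume leases fit after the reserved ones."""
    total_slots = leases_volume_size // lease_size
    return total_slots - reserved


print(max_volumes())  # 1948
```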

If you tried to create more volumes, this failure is expected. We can improve
the error message in this case; please open a separate bug for that if needed.

> -Vdsm SchemaCache Warning provided parameters do not match any union VmStats
> values

This is not relevant and should not be in the logs; it shows that you are using
an old unsupported version.

> -ERROR could not allocate request thread "raise TOOManyTasks"

This means there are too many tasks in the relevant executor, probably because
you are overloading the host with requests. But this is not relevant to
this bug; please open an infra bug for this.

> -ERROR Failed to HSMGetAllTasksStatusesVDS, error = Error creating a new
> volume, code = 205 //engine.log

Probably related to the previous issue - infra bug.

> Current # of LVs: 3068
> 
> After confirming that "VolumeCreationError: Error creating a new volume"
> occurs whenever I attempt to create another disk, I verified that I can
> create an LV manually and that the total number of LVs increases after
> lvcreate is run manually.
> 
>      VG                                   #PV #LV  #SN Attr   VSize   VFree  
>     86f1a1bc-7749-4a6b-8ea6-4a825c8cbda3   1    8   0 wz--n- 499.62g 495.50g
>     b5067b73-9954-45b2-909a-b26f09f7105f   3    9   0 wz--n-   2.98t   2.97t
>     c6838f59-c474-457f-8464-9771a318b752   2 3068   0 wz--n-  10.00t   6.54t
>     d4baca8b-3aa4-4f0a-8ff1-5222812082c9   1   13   0 wz--n-   5.00t   4.98t
>     vg0                                    1    3   0 wz--n- 277.97g  60.00m

You can create LVs manually, but they cannot be used as oVirt volumes,
since there is no room for the lease.

I wonder how you could create 3068 volumes; this should have failed when
creating the 1949th volume.

I think we should limit testing to the maximum number of volumes we can
support, so this does not block further testing.

If we find that 1948 volumes are usable (unlikely), we can increase the size
of the leases volume in the next storage format to make room for more volumes.

The maximum number of volumes should be documented.

> Version-Release number of selected component (if applicable):
> vdsm-api-4.18.2-0.el7ev.noarch
> vdsm-4.18.2-0.el7ev.x86_64

This is an old unsupported version; please use vdsm-4.18.11 or later.

> vdsm-yajsonrpc-4.18.2-0.el7ev.noarch
> vdsm-python-4.18.2-0.el7ev.noarch
> vdsm-jsonrpc-4.18.2-0.el7ev.noarch
> vdsm-xmlrpc-4.18.2-0.el7ev.noarch
> vdsm-hook-vmfex-dev-4.18.2-0.el7ev.noarch
> vdsm-infra-4.18.2-0.el7ev.noarch
> vdsm-cli-4.18.2-0.el7ev.noarch
> 
> 
> How reproducible:
> Seems consistent.
> 
> Steps to Reproduce:
> 1. Set up 2 hosts and 1 engine with iSCSI LUNs
> 2. Create 4 large SDs (2 or 5 TB)
> 3. Create VM from template which has 5 disks in total; execute more than
> 3,100 times
> 
> Actual results:
> Unable to add more disks to the system.
> LV / VG operations take 1-2s; pvscan --cache took minutes (improved after
> restart)
> Dashboard may show information inconsistent with Storage Domain UI changes
> (e.g. size of LUN after extension)
> 
> 
> Expected results:
> Ability to add more disks, since I am able to create more LVs manually

We don't support this now.

> Additional info:
> See private comment for further details / attachment info

Comment 4 Nir Soffer 2016-09-05 11:12:21 UTC
The relevant flow in run 1:

Initializing the 1949th lease...

d300c9f0-c3b2-4380-bfbf-554258ecb519::DEBUG::2016-08-23 16:03:31,552::blockVolume::299::Storage.VolumeManifest::(newVolumeLease) Initializing volume lease volUUID=bdfa9a28-014d-4d8f-a0bd-d9f9e77f9c94 sdUUID=c6838f59-c474-457f-8464-9771a318b752, metaId=('c6838f59-c474-457f-8464-9771a318b752', 1949)

Sanlock fails to initialize a lease which is after the end of
the lease volume:

d300c9f0-c3b2-4380-bfbf-554258ecb519::ERROR::2016-08-23 16:03:32,636::volume::843::Storage.Volume::(create) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/volume.py", line 837, in create
    cls.newVolumeLease(metaId, sdUUID, volUUID)
  File "/usr/share/vdsm/storage/volume.py", line 1156, in newVolumeLease
    return cls.manifestClass.newVolumeLease(metaId, sdUUID, volUUID)
  File "/usr/share/vdsm/storage/blockVolume.py", line 307, in newVolumeLease
    sanlock.init_resource(sdUUID, volUUID, [(leasePath, leaseOffset)])
SanlockException: (28, 'Sanlock resource init failure', 'No space left on device')

This error is expected, as explained in comment 3.

The rest of the creation attempts fail in the same way.

Bottom line:

1. We should document the maximum number of leases
2. We should check the metadata slot offset and fail the entire
   volume creation *before* it starts once we have reached the maximum.
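
A minimal sketch of the check in item 2, assuming the lease offset is computed as (slot + reserved leases) * lease size. That formula, the function name and the exception are illustrative assumptions, not the actual vdsm code; the sizes are the ones from comment 3.

```python
MiB = 1024 ** 2
GiB = 1024 ** 3

LEASE_SIZE = 1 * MiB          # per-volume lease size (comment 3)
RESERVED_LEASES = 100         # reserved leases at the start (comment 3)
LEASES_VOLUME_SIZE = 2 * GiB  # current size of the leases volume (comment 3)


class NoLeaseSpace(Exception):
    """Hypothetical error, raised before any volume is created."""


def check_lease_slot(slot, leases_volume_size=LEASES_VOLUME_SIZE):
    """Return the lease offset for slot, or raise if the lease does not fit."""
    # Assumed layout: reserved leases first, then one lease per slot.
    offset = (slot + RESERVED_LEASES) * LEASE_SIZE
    if offset + LEASE_SIZE > leases_volume_size:
        raise NoLeaseSpace(
            "slot %d needs offset %d, but the leases volume is %d bytes"
            % (slot, offset, leases_volume_size))
    return offset
```

Under these assumptions, slot 1949 from the log above would be rejected up front instead of failing later in sanlock.init_resource().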

Since we currently recommend using fewer than 500 LVs, these changes can
be scheduled for the next version.

Comment 6 Yaniv Kaul 2016-09-05 11:54:54 UTC
(In reply to Nir Soffer from comment #4)

> 
> Since we currently recommend using fewer than 500 LVs, these changes can
> be scheduled for the next version.

I'm interested in knowing what is the issue limiting us to 500 LVs.
(Separately, we should look at the limit of ~1950 leases of course).

Comment 7 Nir Soffer 2016-09-06 17:13:12 UTC
(In reply to Yaniv Kaul from comment #6)
> (In reply to Nir Soffer from comment #4)
> > Since we currently recommend using fewer than 500 LVs, these changes can
> > be scheduled for the next version.
> 
> I'm interested in knowing what is the issue limiting us to 500 LVs.
> (Separately, we should look at the limit of ~1950 leases of course).

Yes, but we already have an RFE for this, this bug was about failure to create
5000 disks and other issues that should move to other bugs.

To allow 5000 disks for this test, we can do:

1. Put the storage domain into maintenance
2. lvextend -L 5100m vg_uuid/leases --config 'global { use_lvmetad=0 }'
3. Activate the domain
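
The 5100m figure appears to follow from the numbers in comment 3: 5000 leases of 1 MiB each plus the 100 reserved ones. A small sketch of that calculation (the helper name is hypothetical, not a vdsm function):

```python
RESERVED_LEASES = 100  # reserved leases (comment 3)
LEASE_SIZE_MIB = 1     # one 1 MiB lease per volume (comment 3)


def leases_volume_size_mib(wanted_volumes):
    """Size in MiB the leases volume needs to hold wanted_volumes leases."""
    return (wanted_volumes + RESERVED_LEASES) * LEASE_SIZE_MIB


print(leases_volume_size_mib(5000))  # 5100
print(leases_volume_size_mib(1948))  # 2048, i.e. the current 2 GiB volume
```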

Comment 8 mlehrer 2016-09-08 11:48:08 UTC
(In reply to Nir Soffer from comment #7)
> (In reply to Yaniv Kaul from comment #6)
> > (In reply to Nir Soffer from comment #4)
> > > Since we currently recommend using fewer than 500 LVs, these changes can
> > > be scheduled for the next version.
> > 
> > I'm interested in knowing what is the issue limiting us to 500 LVs.
> > (Separately, we should look at the limit of ~1950 leases of course).
> 
> Yes, but we already have an RFE for this, this bug was about failure to
> create
> 5000 disks and other issues that should move to other bugs.
> 
> To allow 5000 disks for this test, we can do:
> 
> 1. Put storage domain to maintenance
> 2. lvextend -L 5100m vg_uuid/leases --config 'global { use_lvmetad=0 }'
> 3. Activate domain

Confirming that Nir's recommendation resolved the issue: I was able to create additional LVs once the leases volume was extended as described above.

Comment 9 Yaniv Kaul 2016-11-21 10:50:27 UTC
This looks like a dup of bug 1386732?

Comment 10 Yaniv Kaul 2016-12-22 21:06:04 UTC

*** This bug has been marked as a duplicate of bug 1386732 ***

