Bug 1727678

Summary: Create VM from template from glance fails with 'enospc', engine hits NPE and VM remains locked
Product: [oVirt] vdsm
Component: General
Version: 4.30.17
Hardware: x86_64
OS: Linux
Status: CLOSED DEFERRED
Severity: medium
Priority: unspecified
Reporter: Germano Veit Michel <gveitmic>
Assignee: Nobody <nobody>
QA Contact: Lukas Svaty <lsvaty>
CC: bugs, eshenitz, frolland, mkalinin
Target Milestone: ---
Target Release: ---
oVirt Team: Storage
Type: Bug
Last Closed: 2021-09-29 11:33:06 UTC

Description Germano Veit Michel 2019-07-08 02:10:54 UTC
Description of problem:

Creating a thick-provisioned VM from a Glance image imported as a template fails on block storage: the destination LV is too small.
To make things worse, the engine fails to properly roll back the operation: it deletes the destination image, but then tries to delete it again and fails with an NPE during rollback, leaving the VM locked.
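
A quick way to see how much space the destination actually needs before the copy is qemu-img measure (a rough sketch, not something from this report; it assumes a qemu-img new enough to have the measure subcommand, and uses the template volume path from the logs below):

import subprocess

# Template volume path taken from the qemu-img convert command in the logs below.
SRC = ('/rhev/data-center/mnt/blockSD/0772f1a4-c22e-47bd-af53-915f2f2110a8/'
       'images/bc2a0459-0ac4-4c1c-ab6b-2883af763d8d/'
       '3482add2-853f-431b-8e67-0b6be14fcb0e')

# Prints "required size: ..." and "fully allocated size: ..." in bytes,
# i.e. how big the destination qcow2 must be for the convert to fit.
out = subprocess.check_output(
    ['qemu-img', 'measure', '-f', 'qcow2', '-O', 'qcow2', SRC])
print(out.decode())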

Version-Release number of selected component (if applicable):
vdsm-4.30.17-1.el7.x86_64
ovirt-engine-4.3.4.3-1.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Import image from Glance as Template
2. Create thick provisioned VM from Template
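
If it helps to script the reproducer, step 2 can be driven with the ovirt-engine-sdk4 Python bindings along these lines (a sketch only, not from this report; engine URL, credentials, cluster and template names are placeholders, and the preallocated disk format picked in the webadmin dialog is not spelled out here):

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholders: point these at a real engine, cluster and template.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    ca_file='ca.pem',
)

vms_service = connection.system_service().vms_service()

# clone=True asks the engine for an independent copy of the template
# disks (the "Clone" storage allocation in the New VM dialog) instead
# of thin qcow2 overlays on top of the template volumes.
vms_service.add(
    types.Vm(
        name='vm-from-glance-template',
        cluster=types.Cluster(name='Default'),
        template=types.Template(name='GlanceTemplate'),
    ),
    clone=True,
)

connection.close()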

Actual results:
Fail

Expected results:
Succeed

Additional info:

VDSM:
=====

The destination LV is 128 MB and is never extended:

2019-07-08 11:44:49,014+1000 INFO  (jsonrpc/0) [vdsm.api] START createVolume(sdUUID=u'0772f1a4-c22e-47bd-af53-915f2f2110a8', spUUID=u'7f6bbc5e-9eca-11e9-b288-52540019c1ff', imgUUID=u'5dfec071-4b81-41c2-80fd-d377ff9549a0', size=u'4294967296', volFormat=4, preallocate=2, diskType=u'DATA', volUUID=u'7be20bde-1333-43ba-b226-84c707cf7570', desc=None, srcImgUUID=u'00000000-0000-0000-0000-000000000000', srcVolUUID=u'00000000-0000-0000-0000-000000000000', initialSize=u'366048350') from=::ffff:192.168.150.253,42354, flow_id=6fcb97db-4cc0-4a95-853f-10c84923e8eb, task_id=9b1fd80c-58ae-4535-974b-a797684e8d0e (api:48)

2019-07-08 11:44:50,086+1000 INFO  (tasks/4) [storage.Volume] Request to create COW volume /rhev/data-center/mnt/blockSD/0772f1a4-c22e-47bd-af53-915f2f2110a8/images/5dfec071-4b81-41c2-80fd-d377ff9549a0/7be20bde-1333-43ba-b226-84c707cf7570 with size = 8388608 blocks (blockVolume:511)

2019-07-08 11:44:55,218+1000 INFO  (jsonrpc/7) [vdsm.api] FINISH getVolumeInfo return={'info': {'status': 'OK', 'domain': '0772f1a4-c22e-47bd-af53-915f2f2110a8', 'voltype': 'LEAF', 'description': 'None', 'parent': '00000000-0000-0000-0000-000000000000', 'format': 'COW', 'generation': 0, 'image': '5dfec071-4b81-41c2-80fd-d377ff9549a0', 'ctime': '1562550291', 'disktype': 'DATA', 'legality': 'LEGAL', 'mtime': '0', 'apparentsize': '402653184', 'children': [], 'pool': '', 'capacity': '4294967296', 'uuid': u'7be20bde-1333-43ba-b226-84c707cf7570', 'truesize': '402653184', 'type': 'SPARSE', 'lease': {'path': '/dev/0772f1a4-c22e-47bd-af53-915f2f2110a8/leases', 'owners': [], 'version': None, 'offset': 113246208}}} from=::ffff:192.168.150.253,41276, flow_id=6fcb97db-4cc0-4a95-853f-10c84923e8eb, task_id=9d035d45-b5b0-44ce-b5d0-190c676258d5 (api:54)
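
For reference, decoding the sizes above (my own arithmetic, not from the logs; the 128 MiB extent size is the usual one for oVirt block domains and is an assumption here):

MiB = 1024.0 ** 2

capacity = 4294967296    # createVolume size / getVolumeInfo capacity: 4 GiB virtual size
blocks   = 8388608       # blockVolume "size = 8388608 blocks" * 512 B = the same 4 GiB
initial  = 366048350     # createVolume initialSize: ~349 MiB requested up front
apparent = 402653184     # apparentsize/truesize of new volume 7be20bde: exactly 384 MiB,
                         # i.e. the initial size rounded up to 128 MiB extents

print(capacity / MiB, blocks * 512 / MiB, initial / MiB, apparent / MiB)
# -> 4096.0 4096.0 349.09... 384.0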

2019-07-08 11:44:55,370+1000 INFO  (jsonrpc/3) [vdsm.api] START sdm_copy_data(job_id=u'd4ac62a3-174c-4470-bf94-b1954ed5e4a8', source={u'img_id': u'bc2a0459-0ac4-4c1c-ab6b-2883af763d8d', u'sd_id': u'0772f1a4-c22e-47bd-af53-915f2f2110a8', u'endpoint_type': u'div', u'vol_id': u'3482add2-853f-431b-8e67-0b6be14fcb0e'}, destination={u'generation': 0, u'img_id': u'5dfec071-4b81-41c2-80fd-d377ff9549a0', u'sd_id': u'0772f1a4-c22e-47bd-af53-915f2f2110a8', u'endpoint_type': u'div', u'vol_id': u'7be20bde-1333-43ba-b226-84c707cf7570'}) from=::ffff:192.168.150.253,41276, flow_id=6fcb97db-4cc0-4a95-853f-10c84923e8eb, task_id=7642dff7-12a4-4009-bb53-3462e2208f43 (api:48)

2019-07-08 11:45:12,043+1000 ERROR (tasks/5) [root] Job u'd4ac62a3-174c-4470-bf94-b1954ed5e4a8' failed (jobs:221)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/jobs.py", line 157, in run
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdm/api/copy_data.py", line 86, in _run
    self._operation.run()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/qemuimg.py", line 343, in run
    for data in self._operation.watch():
  File "/usr/lib/python2.7/site-packages/vdsm/storage/operation.py", line 106, in watch
    self._finalize(b"", err)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/operation.py", line 179, in _finalize
    raise cmdutils.Error(self._cmd, rc, out, err)
Error: Command ['/usr/bin/qemu-img', 'convert', '-p', '-t', 'none', '-T', 'none', '-f', 'qcow2', '/rhev/data-center/mnt/blockSD/0772f1a4-c22e-47bd-af53-915f2f2110a8/images/bc2a0459-0ac4-4c1c-ab6b-2883af763d8d/3482add2-853f-431b-8e67-0b6be14fcb0e', '-O', 'qcow2', '-o', 'compat=1.1', '/rhev/data-center/mnt/blockSD/0772f1a4-c22e-47bd-af53-915f2f2110a8/images/5dfec071-4b81-41c2-80fd-d377ff9549a0/7be20bde-1333-43ba-b226-84c707cf7570'] failed with rc=1 out='' err=bytearray(b'qemu-img: error while writing sector 3742336: No space left on device\n')

Template:
  3482add2-853f-431b-8e67-0b6be14fcb0e 0772f1a4-c22e-47bd-af53-915f2f2110a8 -wi------- 384.00m                                                     IU_bc2a0459-0ac4-4c1c-ab6b-2883af763d8d,MD_1,PU_00000000-0000-0000-0000-000000000000
New VM:
  a8037db8-c9c2-4d93-b710-125ec739ce95 0772f1a4-c22e-47bd-af53-915f2f2110a8 -wi------- 128.00m                                                     IU_d5e1bf8c-44d7-4df3-a607-61d346807184,MD_7,PU_00000000-0000-0000-0000-000000000000

# dd if=/dev/0772f1a4-c22e-47bd-af53-915f2f2110a8/metadata bs=8k skip=129 count=1
CAP=4294967296
CTIME=1562302926
DESCRIPTION={"DiskAlias":"GlanceDisk-ba2dd4d","DiskDescription":"Fedora 30 Cloud Base Image v1.2 for x86_64 (ba2dd4d)"}
DISKTYPE=DATA
DOMAIN=0772f1a4-c22e-47bd-af53-915f2f2110a8
FORMAT=COW
GEN=0
IMAGE=bc2a0459-0ac4-4c1c-ab6b-2883af763d8d
LEGALITY=LEGAL
PUUID=00000000-0000-0000-0000-000000000000
TYPE=SPARSE
VOLTYPE=LEAF
EOF


# qemu-img convert /dev/0772f1a4-c22e-47bd-af53-915f2f2110a8/3482add2-853f-431b-8e67-0b6be14fcb0e -O qcow2 -o compat=1.1 /root/f30.qcow2
# qemu-img info /root/f30.qcow2 
image: /root/f30.qcow2
file format: qcow2
virtual size: 4.0G (4294967296 bytes)
disk size: 898M                            <-------
cluster_size: 65536
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false
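
Putting the numbers together (again my own arithmetic, not from the report): the fully converted image occupies about 898 MiB, but the destination volume only got ~384 MiB up front, so unless the LV is extended during the copy the convert is roughly 500 MiB short:

MiB = 1024.0 ** 2

needed    = 898           # MiB, "disk size" reported by qemu-img info above
allocated = 384           # MiB, apparentsize of the destination volume (getVolumeInfo above)
sector    = 3742336       # sector where qemu-img reported ENOSPC

print('shortfall: %d MiB' % (needed - allocated))                            # ~514 MiB
print('failing write at virtual offset ~%.1f GiB' % (sector * 512 / MiB / 1024))
# the write fails ~1.8 GiB into the 4 GiB virtual disk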

ENGINE:
=======

The engine sees the failure and deletes the disk (which is correct), but then hits an NPE and the VM remains locked.

1. Sees the failed copy

2019-07-08 11:45:21,188+10 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHostJobsVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-2) [6fcb97db-4cc0-4a95-853f-10c84923e8eb] FINISH, GetHostJobsVDSCommand, return: {d4ac62a3-174c-4470-bf94-b1954ed5e4a8=HostJobInfo:{id='d4ac62a3-174c-4470-bf94-b1954ed5e4a8', type='storage', description='copy_data', status='failed', progress='42', error='VDSError:{code='GeneralException', message='General Exception: ("Command ['/usr/bin/qemu-img', 'convert', '-p', '-t', 'none', '-T', 'none', '-f', 'qcow2', '/rhev/data-center/mnt/blockSD/0772f1a4-c22e-47bd-af53-915f2f2110a8/images/bc2a0459-0ac4-4c1c-ab6b-2883af763d8d/3482add2-853f-431b-8e67-0b6be14fcb0e', '-O', 'qcow2', '-o', 'compat=1.1', '/rhev/data-center/mnt/blockSD/0772f1a4-c22e-47bd-af53-915f2f2110a8/images/5dfec071-4b81-41c2-80fd-d377ff9549a0/7be20bde-1333-43ba-b226-84c707cf7570'] failed with rc=1 out='' err=bytearray(b'qemu-img: error while writing sector 3742336: No space left on device\\n')",)'}'}}, log id: 3905ea5a

2. Deletes the image

2019-07-08 11:45:25,750+1000 INFO  (jsonrpc/0) [vdsm.api] START deleteImage(sdUUID=u'0772f1a4-c22e-47bd-af53-915f2f2110a8', spUUID=u'7f6bbc5e-9eca-11e9-b288-52540019c1ff', imgUUID=u'5dfec071-4b81-41c2-80fd-d377ff9549a0', postZero=u'false', force=u'false', discard=False) from=::ffff:192.168.150.253,42354, flow_id=1b6d96d, task_id=e5dd8a00-3e70-4394-ad79-ee5528f9ce88 (api:48)
2019-07-08 11:45:25,897+1000 INFO  (jsonrpc/4) [root] /usr/libexec/vdsm/hooks/after_get_caps/50_openstacknet: rc=0 err= (hooks:114)

2019-07-08 11:45:26,558+1000 INFO  (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call Image.delete succeeded in 0.80 seconds (__init__:312)

3. Tries to delete it again

2019-07-08 11:45:26,824+10 ERROR [org.ovirt.engine.core.bll.storage.disk.image.DestroyImageCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-89) [6fcb97db-4cc0-4a95-853f-10c84923e8eb] Failed to delete image 00000000-0000-0000-0000-000000000000/3482add2-853f-431b-8e67-0b6be14fcb0e: org.ovirt.engine.core.common.errors.EngineException: EngineException: java.lang.NullPointerException (Failed with error ENGINE and code 5001)
        at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:118) [bll.jar:]
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.runVdsCommand(VDSBrokerFrontendImpl.java:33) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runVdsCommand(CommandBase.java:2075) [bll.jar:]
        at org.ovirt.engine.core.bll.storage.disk.image.DestroyImageCommand.executeCommand(DestroyImageCommand.java:56) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1157) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1315) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1964) [bll.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInRequired(TransactionSupport.java:139) [utils.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:105) [utils.jar:]
        at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1375) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:419) [bll.jar:]
        at org.ovirt.engine.core.bll.executor.DefaultBackendActionExecutor.execute(DefaultBackendActionExecutor.java:13) [bll.jar:]
        at org.ovirt.engine.core.bll.Backend.runAction(Backend.java:451) [bll.jar:]


2019-07-08 11:59:13,826+10 ERROR [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller] (EE-ManagedThreadFactory-engineScheduled-Thread-60) [0ea7d026-fcf8-4d79-8cb3-f42fac463f54] Exception: javax.ejb.EJBTransactionRolledbackException: Error retrieving database metadata; nested exception is org.springframework.jdbc.support.MetaDataAccessException: Error while extracting DatabaseMetaData; nested exception is java.sql.SQLException: IJ031070: Transaction cannot proceed: STATUS_MARKED_ROLLBACK
        at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInCallerTx(CMTTxInterceptor.java:203) [wildfly-ejb3-15.0.1.Final.jar:15.0.1.Final]
        at org.jboss.as.ejb3.tx.CMTTxInterceptor.supports(CMTTxInterceptor.java:420) [wildfly-ejb3-15.0.1.Final.jar:15.0.1.Final]
        at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:148) [wildfly-ejb3-15.0.1.Final.jar:15.0.1.Final]
        at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:422)
        at org.jboss.invocation.InterceptorContext$Invocation.proceed(InterceptorContext.java:509)

Comment 1 Germano Veit Michel 2019-07-08 02:22:59 UTC
>   a8037db8-c9c2-4d93-b710-125ec739ce95 0772f1a4-c22e-47bd-af53-915f2f2110a8 -wi------- 128.00m                                                     IU_d5e1bf8c-44d7-4df3-a607-61d346807184,MD_7,PU_00000000-0000-0000-0000-000000000000

This is actually the wrong LV, but I don't see it growing past 128 MB anyway.

There is also this in the logs:

2019-07-08 11:45:29,295+1000 ERROR (jsonrpc/3) [storage.HSM] Empty or not found image 5dfec071-4b81-41c2-80fd-d377ff9549a0 in SD 0772f1a4-c22e-47bd-af53-915f2f2110a8. {'643390a5-e705-47a7-a044-29beb4ef8008': ImgsPar(imgs=['c3605567-62e1-4ae9-89a3-44890b31d516'], parent='00000000-0000-0000-0000-000000000000'), '57f7ad5a-73f3-461a-ba74-59ccb1bb935b': ImgsPar(imgs=['c50e27ae-03d2-49af-959e-960db83b0684'], parent='00000000-0000-0000-0000-000000000000'), '1821313a-12e5-401f-a3eb-d6441cc9494f': ImgsPar(imgs=['345967c9-7e06-4bc1-880d-0269d34c8a73'], parent='00000000-0000-0000-0000-000000000000'), '3482add2-853f-431b-8e67-0b6be14fcb0e': ImgsPar(imgs=['bc2a0459-0ac4-4c1c-ab6b-2883af763d8d'], parent='00000000-0000-0000-0000-000000000000'), '9d8a3686-bca3-47bc-afa2-47250ed924e2': ImgsPar(imgs=['f62f69f6-3bc2-4f26-8ed3-77497ff72def'], parent='00000000-0000-0000-0000-000000000000'), 'a8037db8-c9c2-4d93-b710-125ec739ce95': ImgsPar(imgs=['d5e1bf8c-44d7-4df3-a607-61d346807184'], parent='00000000-0000-0000-0000-000000000000'), 'f803f664-3562-481f-8811-6c69bac96d5b': ImgsPar(imgs=['b248ca9c-4a9b-4de6-90c0-5cb23b2865e4'], parent='00000000-0000-0000-0000-000000000000')} (hsm:1501)

Let me know if you need logs (mine are ugly because 3 VMs failed at the same time). It's very easy to reproduce anyway.

Comment 2 Eyal Shenitzky 2021-08-24 12:20:23 UTC
This bug/RFE is more than 2 years old, has not received enough attention so far, and is now flagged as pending close.
Please review whether it is still relevant and provide additional details/justification/patches if you believe it should get more attention for the next oVirt release.

Comment 3 Michal Skrivanek 2021-09-29 11:33:06 UTC
This bug hasn't received any attention in a long time and is not planned for the foreseeable future; the oVirt development team has no plans to work on it.
Please feel free to reopen it if you have a plan for how to contribute this feature/bug fix.