Bug 1402362

Summary:          Clearing deleteImage task leaves zombie dd process
Product:          [oVirt] vdsm
Component:        Core
Status:           CLOSED CURRENTRELEASE
Severity:         low
Priority:         medium
Version:          4.18.0
Target Milestone: ovirt-4.2.0
Target Release:   4.20.1
Hardware:         Unspecified
OS:               Unspecified
Reporter:         Nir Soffer <nsoffer>
Assignee:         Idan Shaby <ishaby>
QA Contact:       Lilach Zitnitski <lzitnits>
CC:               amureini, bugs, nsoffer, ratamir, tnisan, ylavi
Flags:            ylavi: ovirt-4.2+
Doc Type:         No Doc Update
oVirt Team:       Storage
Type:             Bug
Last Closed:      2017-12-20 11:04:14 UTC

Description Nir Soffer 2016-12-07 11:10:51 UTC
Description of problem:

After clearing a deleteImage task, we leave a zombie dd process:

2016-12-07 13:01:44,531 INFO  (jsonrpc/0) [dispatcher] Run and protect: deleteImage(sdUUID=u'29eb00cc-4d39-44c9-90f3-aa4b88141eb5', spUUID=u'f9374c0e-ae24-4bc1-a596-f61d5f05bc5f', imgUUID=u'dda1d164-427a-48a6-80d2-8756da292814', postZero=u'true', force=u'false') (logUtils:49)
2016-12-07 13:01:44,987 INFO  (jsonrpc/0) [dispatcher] Run and protect: deleteImage, Return response: None (logUtils:52)
2016-12-07 13:01:45,445 INFO  (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call Image.delete succeeded in 0.92 seconds (__init__:515)
2016-12-07 13:01:46,686 INFO  (jsonrpc/7) [dispatcher] Run and protect: stopTask(taskID=u'55c4f367-bf64-4128-a056-9d12bcf3b0c5', spUUID=None, options=None) (logUtils:49)
2016-12-07 13:01:46,687 INFO  (jsonrpc/7) [dispatcher] Run and protect: stopTask, Return response: True (logUtils:52)
2016-12-07 13:01:46,687 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Task.stop succeeded in 0.00 seconds (__init__:515)
2016-12-07 13:01:47,343 ERROR (tmap/0) [storage.BlockSD] Zero volume 4958300c-e73b-4cf5-bd0a-9d16fd9b1c99 task 55c4f367-bf64-4128-a056-9d12bcf3b0c5 failed (blockSD:233)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/blockSD.py", line 230, in zeroVolume
    misc.ddWatchCopy("/dev/zero", path, aborting, size)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/misc.py", line 223, in ddWatchCopy
    ioclass=utils.IOCLASS.IDLE)
  File "/usr/lib/python2.7/site-packages/vdsm/commands.py", line 362, in watchCmd
    raise ActionStopped()
ActionStopped: Action was stopped: ()
2016-12-07 13:01:47,344 ERROR (tmap/1) [storage.BlockSD] Zero volume dbe2a240-4ed5-4282-8580-ebb66d489e10 task 55c4f367-bf64-4128-a056-9d12bcf3b0c5 failed (blockSD:233)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/blockSD.py", line 230, in zeroVolume
    misc.ddWatchCopy("/dev/zero", path, aborting, size)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/misc.py", line 223, in ddWatchCopy
    ioclass=utils.IOCLASS.IDLE)
  File "/usr/lib/python2.7/site-packages/vdsm/commands.py", line 362, in watchCmd
    raise ActionStopped()
ActionStopped: Action was stopped: ()
2016-12-07 13:01:47,346 ERROR (tasks/0) [storage.TaskManager.Task] (Task='55c4f367-bf64-4128-a056-9d12bcf3b0c5') Unexpected error (task:870)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 877, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 333, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/blockSD.py", line 1141, in zeroImage
    zeroImgVolumes(sdUUID, imgUUID, toZero)
  File "/usr/share/vdsm/storage/blockSD.py", line 257, in zeroImgVolumes
    raise se.VolumesZeroingError(errors)
VolumesZeroingError: Cannot zero out volume: (['Action was stopped: ()', 'Action was stopped: ()'],)

# ps auxf

vdsm      2742  2.1  2.2 4809412 83332 ?       S<sl 09:58   3:54 /usr/bin/python2 /usr/share/vdsm/vdsm
vdsm      3707  0.0  0.0 382608  1044 ?        S<l  10:05   0:01  \_ /usr/libexec/ioprocess --read-pipe-fd 131 --write-pipe-fd 130 --max-threads 10 --max-queued-requests 10
vdsm      3762  0.0  0.0 382580   888 ?        S<l  10:05   0:00  \_ /usr/libexec/ioprocess --read-pipe-fd 154 --write-pipe-fd 153 --max-threads 10 --max-queued-requests 10
vdsm      3777  0.0  0.0 382580   880 ?        S<l  10:05   0:00  \_ /usr/libexec/ioprocess --read-pipe-fd 162 --write-pipe-fd 161 --max-threads 10 --max-queued-requests 10
vdsm      3793  0.0  0.0 382608  1020 ?        S<l  10:05   0:00  \_ /usr/libexec/ioprocess --read-pipe-fd 178 --write-pipe-fd 177 --max-threads 10 --max-queued-requests 10
vdsm     20876  0.0  0.0 456340  1088 ?        S<l  12:07   0:00  \_ /usr/libexec/ioprocess --read-pipe-fd 191 --write-pipe-fd 190 --max-threads 10 --max-queued-requests 10
vdsm      2441  0.0  0.0 382608   980 ?        S<l  13:01   0:00  \_ /usr/libexec/ioprocess --read-pipe-fd 205 --write-pipe-fd 199 --max-threads 10 --max-queued-requests 10
vdsm      2596  0.0  0.0      0     0 ?        Z<   13:01   0:00  \_ [dd] <defunct>
vdsm      2597  0.0  0.0      0     0 ?        Z<   13:01   0:00  \_ [dd] <defunct>

This looks like an issue in vdsm.storage.misc.ddWatchCopy.
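
A minimal sketch of the suspected mechanism (not vdsm's actual ddWatchCopy/watchCmd code; the function name, signature, and aborting() callback are illustrative assumptions): if the watcher aborts the dd child and raises without ever waiting on it, the terminated dd stays <defunct> until the parent reaps it.

    # Minimal sketch, not vdsm's actual ddWatchCopy/watchCmd code. It shows how
    # an aborted watched copy can leave a zombie dd, and the reaping pattern
    # that avoids it.
    import subprocess
    import time


    class ActionStopped(Exception):
        pass


    def dd_watch_copy(src, dst, aborting, count_mb=100):
        # Illustrative signature only, not vdsm's real one.
        proc = subprocess.Popen(
            ["dd", "if=" + src, "of=" + dst, "bs=1M", "count=%d" % count_mb])
        try:
            while proc.poll() is None:
                if aborting():
                    proc.terminate()
                    # Suspected bug: raising here *without* waiting on the
                    # child leaves the terminated dd as a <defunct> process.
                    raise ActionStopped("Action was stopped")
                time.sleep(0.1)
        finally:
            # The fix pattern: always reap the child, on the abort path too,
            # so the kernel can release its process table entry.
            proc.wait()


    # Example: abort immediately and verify no zombie is left behind.
    if __name__ == "__main__":
        try:
            dd_watch_copy("/dev/zero", "/dev/null", aborting=lambda: True)
        except ActionStopped:
            pass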

Version-Release number of selected component (if applicable):
master

How reproducible:
Not sure

Steps to Reproduce:
1. Start live storage migration with a running vm
2. Stop the vm
3. Put host to maintenance

Actual results:
Zombie dd processes

Expected results:
No zombies left
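
One way to check the expected result is a small hypothetical helper (not part of vdsm) that scans /proc for dd processes stuck in the zombie (Z) state:

    # Hypothetical helper, not part of vdsm: list dd processes in zombie (Z)
    # state by parsing /proc/<pid>/stat ("pid (comm) state ...").
    import os


    def find_zombie_dds():
        zombies = []
        for pid in os.listdir("/proc"):
            if not pid.isdigit():
                continue
            try:
                with open("/proc/%s/stat" % pid) as f:
                    stat = f.read()
            except (IOError, OSError):  # process exited while scanning
                continue
            comm = stat[stat.index("(") + 1:stat.rindex(")")]
            state = stat[stat.rindex(")") + 2:].split()[0]
            if comm == "dd" and state == "Z":
                zombies.append(int(pid))
        return zombies


    if __name__ == "__main__":
        print("zombie dd pids: %s" % find_zombie_dds())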

Comment 1 Allon Mureinik 2017-09-03 14:47:07 UTC
Is this still relevant with the new blkdiscard implementation?

Comment 2 Nir Soffer 2017-09-03 17:16:02 UTC
(In reply to Allon Mureinik from comment #1)
> Is this still relevant with the new blkdiscard implementation?

It is not related to the new implementation, but it may be fixed in master since
https://gerrit.ovirt.org/69782 removed usage of ddWatchCopy.

Should be easy to test as explained in comment 0. I suggest asking QE to test this.
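
For reference, a rough sketch of the direction of the replacement (not the actual change in https://gerrit.ovirt.org/69782; the exact blkdiscard flags used by vdsm are an assumption): zero the logical volume with blkdiscard instead of a watched dd, and let subprocess reap the child so no defunct process remains.

    # Rough sketch only -- not the actual vdsm code from gerrit 69782. Using
    # "blkdiscard -z" (--zeroout) is an assumption about the exact flags.
    import subprocess


    def zero_volume(device_path):
        # check_call() waits for the child and raises CalledProcessError on
        # a non-zero exit, so the process is always reaped.
        subprocess.check_call(["blkdiscard", "-z", device_path])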

Comment 3 Lilach Zitnitski 2017-11-01 14:57:35 UTC
----------------------------------------
Tested with the following code:
----------------------------------------
ovirt-engine-4.2.0-0.0.master.20171030210714.gitef6bb9c.el7.centos.noarch
vdsm-4.20.5-5.git3d47978.el7.centos.x86_64

Tested with the following scenario:

Steps to Reproduce:
1. Start live storage migration with a running vm
2. Stop the vm
3. Put host to maintenance

Actual results:
No zombie processes left 

Expected results:

Moving to VERIFIED!

Comment 4 Sandro Bonazzola 2017-12-20 11:04:14 UTC
This bugzilla is included in the oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be resolved in that release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.