Bug 1408307 - Temporary file descriptor leak after running storage job
Summary: Temporary file descriptor leak after running storage job
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.19.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.1.0-beta
Target Release: 4.19.2
Assignee: Nir Soffer
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-12-22 20:45 UTC by Nir Soffer
Modified: 2017-02-01 14:34 UTC
4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-02-01 14:34:28 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.1+


Attachments


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 69031 0 ovirt-4.1 MERGED qemuimg: Close properly qemuimg operations 2017-01-02 15:49:15 UTC

Description Nir Soffer 2016-12-22 20:45:07 UTC
Description of problem:

After running storage jobs such as SDM.copy_data and SDM.merge, the jobs are
kept for 1 hour by default. Each job references a QemuImgOperation object,
which references a CPopen object using 3 file descriptors and a select.epoll
object using 1 file descriptor.

The 4 file descriptors are closed one hour later, although they are not needed
once the job has finished.
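
The fix should close the operation as soon as the job finishes, instead of
leaving the file descriptors open until the job object is deleted an hour
later. A minimal sketch of that idea (illustrative only - the class and
method names below are hypothetical, not the actual vdsm code from the
linked patches):

    # Illustrative sketch only - names are hypothetical, not vdsm's real code.
    class Operation(object):
        """Stands in for QemuImgOperation: holds fds until close() is called."""

        def wait_for_completion(self):
            pass  # run qemu-img and poll it until it exits

        def close(self):
            pass  # close the CPopen pipes (3 fds) and the epoll object (1 fd)

    class CopyDataJob(object):

        def _run(self):
            self._operation = Operation()
            try:
                self._operation.wait_for_completion()
            finally:
                # Release the 4 fds as soon as the job finishes, instead of
                # keeping them until the job entry is deleted an hour later.
                self._operation.close()
                self._operation = None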

Version-Release number of selected component (if applicable):
4.18.999

How reproducible:
Always

Steps to Reproduce:

1. Move a disk between domains, or delete a snapshot while the VM is not
   running.
2. Wait until the job starts on the host performing the job

   vdsm-client Host getJobs

3. Check the number of open files

   ls -1 /proc/<vdsm-pid>/fd | wc -l

4. Wait until the job finishes
5. Check the number of open files again (a helper sketch for steps 3 and 5
   follows below)
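
For convenience, steps 3 and 5 can be watched with a small script equivalent
to the ls/wc check above (the script name and the 2 second interval are
arbitrary):

    # count_fds.py - hypothetical helper, equivalent to:
    #   watch "ls -1 /proc/<vdsm-pid>/fd | wc -l"
    import os
    import sys
    import time

    def count_fds(pid):
        # Every entry under /proc/<pid>/fd is one open file descriptor.
        return len(os.listdir("/proc/%d/fd" % pid))

    if __name__ == "__main__":
        pid = int(sys.argv[1])  # the vdsm pid
        while True:
            print("%s %d open fds" % (time.strftime("%H:%M:%S"), count_fds(pid)))
            time.sleep(2)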

Actual results:
Each job increases the number of open files by 4.

Expected results:
The number of open files remains constant after a job is done.

Fixed upstream in https://gerrit.ovirt.org/67028; needs a backport to 4.1.

Comment 1 Elad 2017-01-18 09:38:17 UTC
After the job is done, the number of open files in /proc/<vdsm-pid>/fd returns to what it was before the job started. Tested for image move and for cold snapshot merge.


========================
Before job started:
========================

Every 2.0s: vdsm-client Host getJobs                                                                                                                                  Wed Jan 18 11:25:48 2017

{}
========================

Every 2.0s: ls -1 /proc/29847/fd | wc -l                                                                                                                              Wed Jan 18 11:26:08 2017

100


========================
During running job (move image):
========================

Every 2.0s: vdsm-client Host getJobs                                                                                                                                  Wed Jan 18 11:27:08 2017

{ 
    "ff0dc6cb-f9b4-4f88-9b02-763792fd52eb": {
        "status": "running",
        "progress": 2.08,
        "job_type": "storage",
        "id": "ff0dc6cb-f9b4-4f88-9b02-763792fd52eb",
        "description": "copy_data"
    }
}

========================

Every 2.0s: ls -1 /proc/29847/fd | wc -l                                                                                                                              Wed Jan 18 11:27:19 2017

112


========================
After job is done (move image):
========================

Every 2.0s: vdsm-client Host getJobs                                                                                                                                  Wed Jan 18 11:29:43 2017

{ 
    "ff0dc6cb-f9b4-4f88-9b02-763792fd52eb": {
        "status": "done",
        "progress": 100.0,
        "job_type": "storage",
        "id": "ff0dc6cb-f9b4-4f88-9b02-763792fd52eb",
        "description": "copy_data"
    }
}

========================

Every 2.0s: ls -1 /proc/29847/fd | wc -l                                                                                                                              Wed Jan 18 11:31:10 2017

100


========================


Tested using 
vdsm-4.20.0-231.git4739c81.el7.centos.x86_64
ovirt-engine-4.1.0-0.4.master.20170117182028.git4678f8c.el7.centos.noarch

Comment 2 Nir Soffer 2017-01-18 09:46:33 UTC
(In reply to Elad from comment #1)
> Tested using 
> vdsm-4.20.0-231.git4739c81.el7.centos.x86_64

You tested a master build - we need to test 4.1 (4.19.x).

Thanks for this detailed verification!

Comment 3 Elad 2017-01-18 10:17:15 UTC
Thanks Nir, 

Tested again with latest 4.1 (vdsm-4.19.1-56.gitb2ac850.el7.centos.x86_64)




========================
Before job started:
========================
Every 2.0s: vdsm-client Host getJobs                                                                                                                                  Wed Jan 18 12:12:30 2017

{}
========================
Every 2.0s: ls -1 /proc/14171/fd | wc -l                                                                                                                              Wed Jan 18 12:12:56 2017

107
========================
During running job (move image):
========================
Every 2.0s: vdsm-client Host getJobs                                                                                                                                  Wed Jan 18 12:13:44 2017

{ 
    "2a704487-4234-46bb-83c9-f2601a903746": {
        "status": "running",
        "progress": 9.38,
        "job_type": "storage",
        "id": "2a704487-4234-46bb-83c9-f2601a903746",
        "description": "copy_data"
    }
}



========================

Every 2.0s: ls -1 /proc/14171/fd | wc -l                                                                                                                              Wed Jan 18 12:14:13 2017

163


========================
After job is done (move image):
========================
Every 2.0s: vdsm-client Host getJobs                                                                                                                                  Wed Jan 18 12:16:18 2017

{ 
    "2a704487-4234-46bb-83c9-f2601a903746": {
        "status": "done",
        "progress": 100.0,
        "job_type": "storage",
        "id": "2a704487-4234-46bb-83c9-f2601a903746",
        "description": "copy_data"
    }
}
========================
Every 2.0s: ls -1 /proc/14171/fd | wc -l                                                                                                                              Wed Jan 18 12:16:28 2017

107
========================

Comment 4 Elad 2017-01-18 10:21:29 UTC
BTW, why is the job being cleared only after an hour?

2017-01-18 11:29:36,039 INFO  (tasks/1) [root] Job 'ff0dc6cb-f9b4-4f88-9b02-763792fd52eb' completed (jobs:203)
2017-01-18 11:29:36,040 INFO  (tasks/1) [root] Job 'ff0dc6cb-f9b4-4f88-9b02-763792fd52eb' will be deleted in 3600 seconds (jobs:245)

Comment 5 Nir Soffer 2017-01-18 10:37:59 UTC
(In reply to Elad from comment #4)
> BTW, why is the job being cleared only after an hour?

This is a tradeoff: it ensures that the engine gets the result of the job in
most cases, even if the engine is shut down in the middle of an operation, and
it makes the system easier to debug.

The job data is very small; keeping the last jobs in memory for an hour should
not matter to anyone.
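
For reference, a minimal sketch of the idea (not vdsm's actual jobs module;
the names below are illustrative): a completed job stays in an in-memory dict
for a TTL so engine can still fetch its result, and is dropped afterwards.

    # Illustrative sketch of TTL-based job cleanup - not vdsm's real jobs.py.
    import threading

    AUTODELETE_DELAY = 3600  # seconds, matching "will be deleted in 3600 seconds"

    _jobs = {}
    _lock = threading.Lock()

    def job_completed(job_id):
        # Keep the finished job around so engine can still query its result,
        # then drop it after the delay.
        timer = threading.Timer(AUTODELETE_DELAY, _delete, args=(job_id,))
        timer.daemon = True
        timer.start()

    def _delete(job_id):
        with _lock:
            _jobs.pop(job_id, None)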

