Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1408307

Summary: Temporary file descriptor leak after running storage job
Product: [oVirt] vdsm
Reporter: Nir Soffer <nsoffer>
Component: Core
Assignee: Nir Soffer <nsoffer>
Status: CLOSED CURRENTRELEASE
QA Contact: Elad <ebenahar>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.19.0
CC: bugs, nsoffer, tnisan, ybronhei
Target Milestone: ovirt-4.1.0-beta
Flags: rule-engine: ovirt-4.1+
Target Release: 4.19.2
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-02-01 14:34:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Nir Soffer 2016-12-22 20:45:07 UTC
Description of problem:

After running storage jobs such as SDM.copy_data and SDM.merge, the jobs are
kept for 1 hour by default. Each job references a QemuImgOperation object,
which references a CPopen object using 3 file descriptors and a select.epoll
object using 1 file descriptor.

These 4 file descriptors are closed only one hour later, although they are
not needed once the job has finished.
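The leak pattern can be sketched as follows. This is a simplified stand-in, not vdsm's actual code: the class and helper names here are hypothetical, but the shape matches the description above - a completed job keeps an object alive that still holds the child process pipes (3 fds) and an epoll object (1 fd) until the job record is deleted.

```python
import os
import select
import subprocess


class QemuImgOperation:
    """Hypothetical stand-in: wraps a child process (3 pipe fds)
    and an epoll object (1 fd) used to watch its output."""

    def __init__(self):
        self.proc = subprocess.Popen(
            ["sleep", "0"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        self.epoll = select.epoll()
        self.epoll.register(self.proc.stdout.fileno(), select.EPOLLIN)

    def close(self):
        # Release the 4 fds explicitly; without this, they stay open
        # for as long as the job object is referenced.
        self.epoll.close()
        for f in (self.proc.stdin, self.proc.stdout, self.proc.stderr):
            f.close()
        self.proc.wait()


def open_fds():
    # Count this process's open file descriptors (Linux only).
    return len(os.listdir("/proc/self/fd"))


before = open_fds()
op = QemuImgOperation()
during = open_fds()   # 4 more than before: 3 pipes + 1 epoll
op.close()
after = open_fds()    # back to the baseline
```

Keeping `op` referenced for an hour without calling `close()` is exactly the situation described above.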

Version-Release number of selected component (if applicable):
4.18.999

How reproducible:
Always

Steps to Reproduce:

1. Move a disk between domains, or delete a snapshot while the VM is
   not running.
2. Wait until the job starts on the host performing the job

   vdsm-client Host getJobs

3. Check the number of open files

   ls -1 /proc/<vdsm-pid>/fd | wc -l

4. Wait until the job finishes
5. Check again the number of open files
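The fd count in steps 3 and 5 can also be taken from Python, mirroring the `ls -1 /proc/<vdsm-pid>/fd | wc -l` command above (Linux only; pass the real vdsm pid on a host):

```python
import os


def count_open_fds(pid):
    """Equivalent of `ls -1 /proc/<pid>/fd | wc -l`: count the file
    descriptors the given process currently holds open."""
    return len(os.listdir("/proc/%d/fd" % pid))


# Example: inspect our own process; use the vdsm pid on a real host.
print(count_open_fds(os.getpid()))
```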

Actual results:
Each job increases the number of open files by 4.

Expected results:
The number of open files remains constant after a job is done.

Fixed upstream in https://gerrit.ovirt.org/67028;
needs a backport to 4.1.

Comment 1 Elad 2017-01-18 09:38:17 UTC
After the job is done, the number of open files in /proc/%VDSM-PID%/fd returns to what it was before the job started. Tested for move image and for snapshot cold merge.


========================
Before job started:
========================

Every 2.0s: vdsm-client Host getJobs                                                                                                                                  Wed Jan 18 11:25:48 2017

{}
========================

Every 2.0s: ls -1 /proc/29847/fd | wc -l                                                                                                                              Wed Jan 18 11:26:08 2017

100


========================
During running job (move image):
========================

Every 2.0s: vdsm-client Host getJobs                                                                                                                                  Wed Jan 18 11:27:08 2017

{ 
    "ff0dc6cb-f9b4-4f88-9b02-763792fd52eb": {
        "status": "running",
        "progress": 2.08,
        "job_type": "storage",
        "id": "ff0dc6cb-f9b4-4f88-9b02-763792fd52eb",
        "description": "copy_data"
    }
}

========================

Every 2.0s: ls -1 /proc/29847/fd | wc -l                                                                                                                              Wed Jan 18 11:27:19 2017

112


========================
After job is done (move image):
========================

Every 2.0s: vdsm-client Host getJobs                                                                                                                                  Wed Jan 18 11:29:43 2017

{ 
    "ff0dc6cb-f9b4-4f88-9b02-763792fd52eb": {
        "status": "done",
        "progress": 100.0,
        "job_type": "storage",
        "id": "ff0dc6cb-f9b4-4f88-9b02-763792fd52eb",
        "description": "copy_data"
    }
}

========================

Every 2.0s: ls -1 /proc/29847/fd | wc -l                                                                                                                              Wed Jan 18 11:31:10 2017

100


========================


Tested using 
vdsm-4.20.0-231.git4739c81.el7.centos.x86_64
ovirt-engine-4.1.0-0.4.master.20170117182028.git4678f8c.el7.centos.noarch

Comment 2 Nir Soffer 2017-01-18 09:46:33 UTC
(In reply to Elad from comment #1)
> Tested using 
> vdsm-4.20.0-231.git4739c81.el7.centos.x86_64

You tested a master build - we need to test 4.1 (4.19.x).

Thanks for this detailed verification!

Comment 3 Elad 2017-01-18 10:17:15 UTC
Thanks Nir, 

Tested again with latest 4.1 (vdsm-4.19.1-56.gitb2ac850.el7.centos.x86_64)




========================
Before job started:
========================
Every 2.0s: vdsm-client Host getJobs                                                                                                                                  Wed Jan 18 12:12:30 2017

{}
========================
Every 2.0s: ls -1 /proc/14171/fd | wc -l                                                                                                                              Wed Jan 18 12:12:56 2017

107
========================
During running job (move image):
========================
Every 2.0s: vdsm-client Host getJobs                                                                                                                                  Wed Jan 18 12:13:44 2017

{ 
    "2a704487-4234-46bb-83c9-f2601a903746": {
        "status": "running",
        "progress": 9.38,
        "job_type": "storage",
        "id": "2a704487-4234-46bb-83c9-f2601a903746",
        "description": "copy_data"
    }
}



========================

Every 2.0s: ls -1 /proc/14171/fd | wc -l                                                                                                                              Wed Jan 18 12:14:13 2017

163


========================
After job is done (move image):
========================
Every 2.0s: vdsm-client Host getJobs                                                                                                                                  Wed Jan 18 12:16:18 2017

{ 
    "2a704487-4234-46bb-83c9-f2601a903746": {
        "status": "done",
        "progress": 100.0,
        "job_type": "storage",
        "id": "2a704487-4234-46bb-83c9-f2601a903746",
        "description": "copy_data"
    }
}
========================
Every 2.0s: ls -1 /proc/14171/fd | wc -l                                                                                                                              Wed Jan 18 12:16:28 2017

107
========================

Comment 4 Elad 2017-01-18 10:21:29 UTC
BTW, why is the job being cleared only after an hour?

2017-01-18 11:29:36,039 INFO  (tasks/1) [root] Job 'ff0dc6cb-f9b4-4f88-9b02-763792fd52eb' completed (jobs:203)
2017-01-18 11:29:36,040 INFO  (tasks/1) [root] Job 'ff0dc6cb-f9b4-4f88-9b02-763792fd52eb' will be deleted in 3600 seconds (jobs:245)

Comment 5 Nir Soffer 2017-01-18 10:37:59 UTC
(In reply to Elad from comment #4)
> BTW, why is the job being cleared only after an hour?

This is a tradeoff: it ensures that the engine gets the result of the job in
most cases, even if the engine is shut down in the middle of an operation,
and it makes the system easier to debug.

The job data is very small; keeping the last jobs in memory for an hour
should not matter to anyone.