Description of problem:

After running storage jobs such as SDM.copy_data and SDM.merge, the jobs are kept for 1 hour by default. Each job references a QemuImgOperation object, which references a CPopen object using 3 file descriptors and a select.epoll object using 1 file descriptor. These 4 file descriptors are closed only one hour later, although they are not needed once the job has finished.

Version-Release number of selected component (if applicable):
4.18.999

How reproducible:
Always

Steps to Reproduce:
1. Move a disk between domains, or delete a snapshot while the VM is not running.
2. Wait until the job starts on the host performing the job:
   vdsm-client Host getJobs
3. Check the number of open files:
   ls -1 /proc/<vdsm-pid>/fd | wc -l
4. Wait until the job finishes.
5. Check the number of open files again.

Actual results:
Each job increases the number of open files by 4.

Expected results:
The number of open files remains constant after a job is done.

Fixed upstream in https://gerrit.ovirt.org/67028 - needs a backport to 4.1.
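The fd check in steps 3 and 5 can also be scripted. A minimal sketch using only the stdlib (the function name is my own, not a vdsm API):

```python
import os

def count_open_fds(pid):
    """Count open file descriptors of a process, equivalent to
    `ls -1 /proc/<pid>/fd | wc -l` in the steps above."""
    return len(os.listdir("/proc/%d/fd" % pid))

# Example: count the current process's own descriptors.
print(count_open_fds(os.getpid()))
```

Comparing the value before the job starts and after it reports "done" shows the per-job leak of 4 descriptors.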
After the job is done, the number of open files in /proc/<vdsm-pid>/fd returns to what it was before the job started. Tested for move image and for snapshot cold merge.

========================
Before job started:
========================
Every 2.0s: vdsm-client Host getJobs    Wed Jan 18 11:25:48 2017

{}
========================
Every 2.0s: ls -1 /proc/29847/fd | wc -l    Wed Jan 18 11:26:08 2017

100
========================
During running job (move image):
========================
Every 2.0s: vdsm-client Host getJobs    Wed Jan 18 11:27:08 2017

{
    "ff0dc6cb-f9b4-4f88-9b02-763792fd52eb": {
        "status": "running",
        "progress": 2.08,
        "job_type": "storage",
        "id": "ff0dc6cb-f9b4-4f88-9b02-763792fd52eb",
        "description": "copy_data"
    }
}
========================
Every 2.0s: ls -1 /proc/29847/fd | wc -l    Wed Jan 18 11:27:19 2017

112
========================
After job is done (move image):
========================
Every 2.0s: vdsm-client Host getJobs    Wed Jan 18 11:29:43 2017

{
    "ff0dc6cb-f9b4-4f88-9b02-763792fd52eb": {
        "status": "done",
        "progress": 100.0,
        "job_type": "storage",
        "id": "ff0dc6cb-f9b4-4f88-9b02-763792fd52eb",
        "description": "copy_data"
    }
}
========================
Every 2.0s: ls -1 /proc/29847/fd | wc -l    Wed Jan 18 11:31:10 2017

100
========================

Tested using:
vdsm-4.20.0-231.git4739c81.el7.centos.x86_64
ovirt-engine-4.1.0-0.4.master.20170117182028.git4678f8c.el7.centos.noarch
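The before/during/after check above can be automated with a small polling loop. This is an illustrative sketch, not a vdsm tool; the function names and timeouts are my own:

```python
import os
import time

def fd_count(pid):
    # Same check as `ls -1 /proc/<pid>/fd | wc -l` above.
    return len(os.listdir("/proc/%d/fd" % pid))

def wait_for_fd_baseline(pid, baseline, timeout=300, interval=2.0):
    """Poll until the process's fd count drops back to the baseline
    measured before the job started, or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if fd_count(pid) <= baseline:
            return True
        time.sleep(interval)
    return False
```

With the fix, this returns True shortly after the job reports "done"; without it, only after the 1 hour autodelete delay.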
(In reply to Elad from comment #1)
> Tested using
> vdsm-4.20.0-231.git4739c81.el7.centos.x86_64

You tested a master build - we need to test 4.1 (4.19.x).

Thanks for this detailed verification!
Thanks Nir,

Tested again with latest 4.1 (vdsm-4.19.1-56.gitb2ac850.el7.centos.x86_64):

========================
Before job started:
========================
Every 2.0s: vdsm-client Host getJobs    Wed Jan 18 12:12:30 2017

{}
========================
Every 2.0s: ls -1 /proc/14171/fd | wc -l    Wed Jan 18 12:12:56 2017

107
========================
During running job (move image):
========================
Every 2.0s: vdsm-client Host getJobs    Wed Jan 18 12:13:44 2017

{
    "2a704487-4234-46bb-83c9-f2601a903746": {
        "status": "running",
        "progress": 9.38,
        "job_type": "storage",
        "id": "2a704487-4234-46bb-83c9-f2601a903746",
        "description": "copy_data"
    }
}
========================
Every 2.0s: ls -1 /proc/14171/fd | wc -l    Wed Jan 18 12:14:13 2017

163
========================
After job is done (move image):
========================
Every 2.0s: vdsm-client Host getJobs    Wed Jan 18 12:16:18 2017

{
    "2a704487-4234-46bb-83c9-f2601a903746": {
        "status": "done",
        "progress": 100.0,
        "job_type": "storage",
        "id": "2a704487-4234-46bb-83c9-f2601a903746",
        "description": "copy_data"
    }
}
========================
Every 2.0s: ls -1 /proc/14171/fd | wc -l    Wed Jan 18 12:16:28 2017

107
========================
BTW, why is the job being cleared only after an hour?

2017-01-18 11:29:36,039 INFO  (tasks/1) [root] Job 'ff0dc6cb-f9b4-4f88-9b02-763792fd52eb' completed (jobs:203)
2017-01-18 11:29:36,040 INFO  (tasks/1) [root] Job 'ff0dc6cb-f9b4-4f88-9b02-763792fd52eb' will be deleted in 3600 seconds (jobs:245)
(In reply to Elad from comment #4)
> BTW, why is the job being cleared only after an hour?

This is a tradeoff: it ensures that engine gets the result of the job in most cases, even if you shut down engine in the middle of an operation, and it makes the system easier to debug. The job data is very small, so keeping the last jobs in memory for an hour should not matter to anyone.
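For reference, the shape of the fix (release the operation's file descriptors when the job completes, while keeping the small job record until autodelete) can be sketched like this. This is an illustrative sketch, not vdsm's actual code; all names here are assumptions:

```python
import threading

AUTODELETE_DELAY = 3600  # seconds, matching the "deleted in 3600 seconds" log

class Job(object):
    """Sketch of a storage job that keeps its metadata for an hour
    but drops its file descriptors as soon as it completes."""

    def __init__(self, job_id, operation):
        self.id = job_id
        self.status = "running"
        # The operation holds the 4 leaked fds (subprocess pipes + epoll).
        self._operation = operation

    def complete(self, jobs_registry):
        self.status = "done"
        # The fix: close the underlying operation now, instead of waiting
        # for the job itself to be deleted an hour later.
        self._operation.close()
        self._operation = None
        # Keep the (tiny) job record so engine can still poll the result,
        # then drop it after the autodelete delay.
        t = threading.Timer(AUTODELETE_DELAY,
                            jobs_registry.pop, args=(self.id, None))
        t.daemon = True
        t.start()
```

The key point is that the descriptors are released in complete(), while the job record survives the hour so engine can still query its status.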