Bug 1570807 - openqa-gru delete qcow2 too early before child is dispatched to use it
Summary: openqa-gru delete qcow2 too early before child is dispatched to use it
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: openqa
Version: 27
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Adam Williamson
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-04-23 12:31 UTC by Michel Normand
Modified: 2018-11-27 19:41 UTC (History)
2 users (show)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2018-11-27 17:06:59 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
openqa_gru_journalctl_abanc_20180423.log (941.26 KB, text/x-vhdl)
2018-04-23 12:31 UTC, Michel Normand
no flags Details
server_cockpit_default log (69.11 KB, text/plain)
2018-04-23 12:39 UTC, Michel Normand
no flags Details
server_cockpit_basic log (673 bytes, text/plain)
2018-04-23 12:40 UTC, Michel Normand
no flags Details
journalctl_all_openqa_services.log.gz (465.97 KB, application/x-gzip)
2018-04-23 16:30 UTC, Michel Normand
no flags Details

Description Michel Normand 2018-04-23 12:31:42 UTC
Created attachment 1425653 [details]
openqa_gru_journalctl_abanc_20180423.log

openqa-gru delete qcow2 too early before child is dispatched to use it

Version-Release number of selected component (if applicable):
$rpm -qa |grep openqa
openqa-common-4.5-5.20180207git3977d2f.fc27.noarch
openqa-worker-4.5-5.20180207git3977d2f.fc27.noarch
openqa-client-4.5-5.20180207git3977d2f.fc27.noarch
openqa-4.5-5.20180207git3977d2f.fc27.noarch
openqa-httpd-4.5-5.20180207git3977d2f.fc27.noarch
openqa-local-db-4.5-5.20180207git3977d2f.fc27.noarch
[root@abanc:/home/michel/openqa_fedora]
$rpm -ql openqa |grep gru
/usr/lib/systemd/system/openqa-gru.service

On a local openQA Fedora 27 instance running on a ppc64le host,
The openqa-gru.service is removing qcow2 
generated by server_cockpit_default
before server_cockpit_basic is dispatched

because the host has many pending scheduled tests for different composes.

I will attached the two autoinst.log files of the two tests, and the journalctl of openqa-gru.service.

Comment 1 Michel Normand 2018-04-23 12:39:17 UTC
Created attachment 1425654 [details]
server_cockpit_default log

Comment 2 Michel Normand 2018-04-23 12:40:05 UTC
Created attachment 1425655 [details]
server_cockpit_basic log

===
=== extract server_cockpit_default_rawhide_20180422_autoinst log
===
[2018-04-22T11:41:37.0080 EDT] [info] uploading disk_ppc64le_cockpit.qcow2
===
=== extract openqa_gru_journalctl_abanc_20180423.log
===
Apr 22 12:46:41 abanc.test.toulouse-stg.fr.ibm.com openqa[4737]: [info] GRU: removing /var/lib/openqa/share/factory/hdd/00010103-disk_ppc64le_cockpit.qcow2
===
=== extract server_cockpit_basic_rawhide_20180422_autoinst.log
===
[2018-04-22T14:59:29.0065 EDT] [info] result: setup failure: Cannot find HDD_1 asset hdd/disk_ppc64le_cockpit.qcow2!
===

Comment 3 Adam Williamson 2018-04-23 15:42:20 UTC
Are you sure this specific run of the cockpit_basic test was actually scheduled at the time the asset was removed? There are not enough logs for me to determine this, and there is actually specific code in openQA which should prevent precisely this from happening. So I'd at least want to be sure that's what is happening. At minimum I need the logs from all openQA services for the whole period surrounding this issue (like, from the time the jobs were posted to after the cockpit_basic test failed).

Comment 4 Michel Normand 2018-04-23 16:30:08 UTC
Created attachment 1425700 [details]
journalctl_all_openqa_services.log.gz

I captured the journalctl for all openqa* services and attached here as gzip file.

The two related tests had following Ids:
10103 server_cockpit_default
10108 server_cockpit_basic

a grep of above journalctl log confirm the timing:
===
grep -E '10103|10108' journalctl_all_openqa_services.log
Apr 20 11:20:57 abanc.test.toulouse-stg.fr.ibm.com worker[5194]: [info] 10103: WORKING 9251
Apr 22 11:28:57 abanc.test.toulouse-stg.fr.ibm.com worker[54523]: [info] got job 10103: 00010103-fedora-Rawhide-Server-dvd-iso-ppc64le-BuildFedora-Rawhide-20180422.n.0-server_cockpit_default@ppc64le
Apr 22 11:28:57 abanc.test.toulouse-stg.fr.ibm.com worker[54523]: [info] 66047: WORKING 10103
Apr 22 11:41:48 abanc.test.toulouse-stg.fr.ibm.com worker[54523]: [info] cleaning up 00010103-fedora-Rawhide-Server-dvd-iso-ppc64le-BuildFedora-Rawhide-20180422.n.0-server_cockpit_default@ppc64le
Apr 22 12:46:41 abanc.test.toulouse-stg.fr.ibm.com openqa[4737]: [info] GRU: removing /var/lib/openqa/share/factory/hdd/00010103-disk_ppc64le_cockpit.qcow2
Apr 22 14:59:24 abanc.test.toulouse-stg.fr.ibm.com worker[5194]: [info] got job 10108: 00010108-fedora-Rawhide-Server-dvd-iso-ppc64le-BuildFedora-Rawhide-20180422.n.0-server_cockpit_basic@ppc64le
Apr 22 14:59:24 abanc.test.toulouse-stg.fr.ibm.com openqa[4949]: [info] Got status update for job 10108 that does not belong to Worker 2
Apr 22 14:59:29 abanc.test.toulouse-stg.fr.ibm.com worker[5194]: [info] cleaning up 00010108-fedora-Rawhide-Server-dvd-iso-ppc64le-BuildFedora-Rawhide-20180422.n.0-server_cockpit_basic@ppc64le
===

The settings pages confirm the links between the tests:
===
from 10103
Parents
Chained - install_default_upload 	softfailed
Children
Chained - realmd_join_cockpit 	passed     <= points to 10109
Chained - server_cockpit_basic 	incomplete <= points to 10108
===
from 10108
Parents
Chained - server_cockpit_default 	passed <= points to 10103
===

Comment 5 Michel Normand 2018-04-23 16:37:59 UTC
(In reply to Adam Williamson from comment #3)
> Are you sure this specific run of the cockpit_basic test was actually
> scheduled at the time the asset was removed? There are not enough logs for
> me to determine this, and there is actually specific code in openQA which
> should prevent precisely this from happening. So I'd at least want to be
> sure that's what is happening. At minimum I need the logs from all openQA
> services for the whole period surrounding this issue (like, from the time
> the jobs were posted to after the cockpit_basic test failed).

The cockpit_basic test WAS NOT scheduled at time of removed asset, this was what I tried to explain in my original comment, sorry if it was not clear enough.

Comment 6 Adam Williamson 2018-04-23 17:30:37 UTC
OK, so, what are you proposing, exactly? openQA has to be able to clean up assets, or servers would need infinite disk space. And it's hard for it to know that you're *going to* schedule a job at some point in the future that will turn out to need an asset it garbage collected.

Comment 7 Michel Normand 2018-04-24 07:35:20 UTC
I do not know the details of openqa-gru algorithms for cleanup of assets, but I saw in the log that some are not deleted immediately but after some days.
 Could we have similar delayed removal for qcow2 files ?

Comment 8 Adam Williamson 2018-04-24 14:50:48 UTC
It's all the same cleanup routine for both ISOs and disk images, there isn't a difference. It *should* clean up in approximately FIFO order, *per job group* (that part is important). If it's not doing that, something may be wrong.

Comment 9 Ben Cotton 2018-11-27 16:02:56 UTC
This message is a reminder that Fedora 27 is nearing its end of life.
On 2018-Nov-30  Fedora will stop maintaining and issuing updates for
Fedora 27. It is Fedora's policy to close all bug reports from releases
that are no longer maintained. At that time this bug will be closed as
EOL if it remains open with a Fedora  'version' of '27'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 27 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged  change the 'version' to a later Fedora 
version prior this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 10 Michel Normand 2018-11-27 17:06:59 UTC
will close the bug as assume cause is not enough disk space reserved for the related group.

Comment 11 Adam Williamson 2018-11-27 19:41:58 UTC
Note the asset cleanup code has been rewritten in the new openQA that's currently in updates-testing. But I couldn't identify any specific bug here in the old one in any case. Do keep an eye on it after updating and see if you still have any problems...


Note You need to log in before you can comment on or make changes to this bug.