Created attachment 1425653 [details]
openqa_gru_journalctl_abanc_20180423.log

openqa-gru deletes a qcow2 too early, before the child job that needs it is dispatched.

Version-Release number of selected component (if applicable):

$ rpm -qa | grep openqa
openqa-common-4.5-5.20180207git3977d2f.fc27.noarch
openqa-worker-4.5-5.20180207git3977d2f.fc27.noarch
openqa-client-4.5-5.20180207git3977d2f.fc27.noarch
openqa-4.5-5.20180207git3977d2f.fc27.noarch
openqa-httpd-4.5-5.20180207git3977d2f.fc27.noarch
openqa-local-db-4.5-5.20180207git3977d2f.fc27.noarch

[root@abanc:/home/michel/openqa_fedora]
$ rpm -ql openqa | grep gru
/usr/lib/systemd/system/openqa-gru.service

On a local openQA Fedora 27 instance running on a ppc64le host, the openqa-gru.service removed the qcow2 generated by server_cockpit_default before server_cockpit_basic was dispatched, because the host had many pending scheduled tests for different composes. I will attach the autoinst.log files of the two tests and the journalctl of openqa-gru.service.
Created attachment 1425654 [details] server_cockpit_default log
Created attachment 1425655 [details]
server_cockpit_basic log

=== extract server_cockpit_default_rawhide_20180422_autoinst.log ===
[2018-04-22T11:41:37.0080 EDT] [info] uploading disk_ppc64le_cockpit.qcow2
===

=== extract openqa_gru_journalctl_abanc_20180423.log ===
Apr 22 12:46:41 abanc.test.toulouse-stg.fr.ibm.com openqa[4737]: [info] GRU: removing /var/lib/openqa/share/factory/hdd/00010103-disk_ppc64le_cockpit.qcow2
===

=== extract server_cockpit_basic_rawhide_20180422_autoinst.log ===
[2018-04-22T14:59:29.0065 EDT] [info] result: setup failure: Cannot find HDD_1 asset hdd/disk_ppc64le_cockpit.qcow2!
===
Are you sure this specific run of the cockpit_basic test was actually scheduled at the time the asset was removed? There are not enough logs for me to determine this, and there is actually specific code in openQA which should prevent precisely this from happening. So I'd at least want to be sure that's what is happening. At minimum I need the logs from all openQA services for the whole period surrounding this issue (like, from the time the jobs were posted to after the cockpit_basic test failed).
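To illustrate the kind of protection I mean, here is a rough Python sketch (NOT openQA's actual code, which is Perl; the job states, field names, and sample data are purely illustrative): before deleting an asset, the cleanup should skip anything still referenced by a pending job.

```python
# Hypothetical sketch of asset protection: an asset is only removable if
# no pending (scheduled/assigned/running) job still references it.
# Illustrative only -- not openQA's real implementation.

PENDING_STATES = {"scheduled", "assigned", "running"}

def removable_assets(assets, jobs):
    """Return the subset of assets not referenced by any pending job."""
    # Collect every asset name referenced by a job that is not done yet.
    protected = {
        asset
        for job in jobs
        if job["state"] in PENDING_STATES
        for asset in job.get("assets", [])
    }
    return [a for a in assets if a not in protected]

jobs = [
    {"id": 1, "state": "done", "assets": ["disk_a.qcow2"]},
    {"id": 2, "state": "scheduled", "assets": ["disk_a.qcow2"]},
]
print(removable_assets(["disk_a.qcow2", "old_compose.iso"], jobs))
# disk_a.qcow2 stays protected because job 2 is still scheduled
```

If the failing job was not even scheduled yet when cleanup ran, a check like this cannot help, which is why the timing matters.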
Created attachment 1425700 [details]
journalctl_all_openqa_services.log.gz

I captured the journalctl for all openqa* services and attached it here as a gzip file.

The two related tests had the following IDs:
10103 server_cockpit_default
10108 server_cockpit_basic

A grep of the above journalctl log confirms the timing:

===
grep -E '10103|10108' journalctl_all_openqa_services.log
Apr 20 11:20:57 abanc.test.toulouse-stg.fr.ibm.com worker[5194]: [info] 10103: WORKING 9251
Apr 22 11:28:57 abanc.test.toulouse-stg.fr.ibm.com worker[54523]: [info] got job 10103: 00010103-fedora-Rawhide-Server-dvd-iso-ppc64le-BuildFedora-Rawhide-20180422.n.0-server_cockpit_default@ppc64le
Apr 22 11:28:57 abanc.test.toulouse-stg.fr.ibm.com worker[54523]: [info] 66047: WORKING 10103
Apr 22 11:41:48 abanc.test.toulouse-stg.fr.ibm.com worker[54523]: [info] cleaning up 00010103-fedora-Rawhide-Server-dvd-iso-ppc64le-BuildFedora-Rawhide-20180422.n.0-server_cockpit_default@ppc64le
Apr 22 12:46:41 abanc.test.toulouse-stg.fr.ibm.com openqa[4737]: [info] GRU: removing /var/lib/openqa/share/factory/hdd/00010103-disk_ppc64le_cockpit.qcow2
Apr 22 14:59:24 abanc.test.toulouse-stg.fr.ibm.com worker[5194]: [info] got job 10108: 00010108-fedora-Rawhide-Server-dvd-iso-ppc64le-BuildFedora-Rawhide-20180422.n.0-server_cockpit_basic@ppc64le
Apr 22 14:59:24 abanc.test.toulouse-stg.fr.ibm.com openqa[4949]: [info] Got status update for job 10108 that does not belong to Worker 2
Apr 22 14:59:29 abanc.test.toulouse-stg.fr.ibm.com worker[5194]: [info] cleaning up 00010108-fedora-Rawhide-Server-dvd-iso-ppc64le-BuildFedora-Rawhide-20180422.n.0-server_cockpit_basic@ppc64le
===

The settings pages confirm the links between the tests:

=== from 10103 ===
Parents
Chained - install_default_upload softfailed
Children
Chained - realmd_join_cockpit passed      <= points to 10109
Chained - server_cockpit_basic incomplete <= points to 10108

=== from 10108 ===
Parents
Chained - server_cockpit_default passed   <= points to 10103
===
(In reply to Adam Williamson from comment #3)
> Are you sure this specific run of the cockpit_basic test was actually
> scheduled at the time the asset was removed? There are not enough logs for
> me to determine this, and there is actually specific code in openQA which
> should prevent precisely this from happening. So I'd at least want to be
> sure that's what is happening. At minimum I need the logs from all openQA
> services for the whole period surrounding this issue (like, from the time
> the jobs were posted to after the cockpit_basic test failed).

The cockpit_basic test WAS NOT scheduled at the time the asset was removed; this is what I tried to explain in my original comment. Sorry if it was not clear enough.
OK, so, what are you proposing, exactly? openQA has to be able to clean up assets, or servers would need infinite disk space. And it's hard for it to know that you're *going to* schedule a job at some point in the future that will turn out to need an asset it garbage collected.
I do not know the details of the openqa-gru cleanup algorithm for assets, but I saw in the log that some assets are not deleted immediately, only after some days. Could we have a similar delayed removal for qcow2 files?
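To make the suggestion concrete, here is a minimal sketch of mtime-based delayed removal (the grace period, flat directory layout, and function name are all assumptions for illustration; this is not how openqa-gru actually decides):

```python
# Sketch of the delayed-removal idea: only consider an asset for
# deletion once it is older than a grace period. Assumed values;
# NOT openqa-gru's actual logic.
import os
import time

GRACE_SECONDS = 2 * 24 * 3600  # e.g. keep every asset at least two days

def expired_assets(asset_dir, grace=GRACE_SECONDS, now=None):
    """Return paths under asset_dir whose mtime is older than `grace`."""
    now = time.time() if now is None else now
    old = []
    for name in os.listdir(asset_dir):
        path = os.path.join(asset_dir, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > grace:
            old.append(path)
    return sorted(old)
```

A cleanup pass would then delete only what `expired_assets()` returns, giving late-dispatched child jobs a window to claim their parent's qcow2.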
It's all the same cleanup routine for both ISOs and disk images, there isn't a difference. It *should* clean up in approximately FIFO order, *per job group* (that part is important). If it's not doing that, something may be wrong.
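Roughly modelled, "FIFO order, per job group" means: within each group, delete the oldest assets first until the group's total size fits under its limit. A Python illustration (field names, limits, and the age ordering are assumptions; this is not openQA's actual routine):

```python
# Model of size-limited, per-group FIFO asset cleanup, as described
# above. Illustration only -- not openQA's real implementation.
from collections import defaultdict

def assets_to_remove(assets, group_limits):
    """assets: dicts with keys group, name, size, age (larger = older).
    Returns asset names to delete, oldest first within each group,
    until each group's total size is within its limit."""
    by_group = defaultdict(list)
    for a in assets:
        by_group[a["group"]].append(a)
    doomed = []
    for group, items in by_group.items():
        items.sort(key=lambda a: a["age"], reverse=True)  # oldest first
        total = sum(a["size"] for a in items)
        for a in items:
            if total <= group_limits.get(group, 0):
                break  # this group now fits under its size limit
            doomed.append(a["name"])
            total -= a["size"]
    return doomed
```

The key property is that each group only ever evicts its own oldest assets, so one group's backlog of pending composes should not evict another group's fresh qcow2.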
This message is a reminder that Fedora 27 is nearing its end of life. On 2018-Nov-30 Fedora will stop maintaining and issuing updates for Fedora 27. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '27'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 27 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
I will close the bug, as I assume the cause was not enough disk space reserved for the related group.
Note the asset cleanup code has been rewritten in the new openQA that's currently in updates-testing. But I couldn't identify any specific bug here in the old one in any case. Do keep an eye on it after updating and see if you still have any problems...