Bug 1691487
Summary: | openQA transient test failure as duplicated first character just after a snapshot | |
---|---|---|---
Product: | [Fedora] Fedora | Reporter: | Michel Normand <normand>
Component: | openqa | Assignee: | Adam Williamson <awilliam>
Status: | CLOSED CURRENTRELEASE | QA Contact: | Fedora Extras Quality Assurance <extras-qa>
Severity: | unspecified | Docs Contact: |
Priority: | unspecified | |
Version: | 32 | CC: | awilliam, dan, hannsj_uhl, smooge
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | ppc64le | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-05-03 15:58:58 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1071880 | |
Description (Michel Normand, 2019-03-21 17:49:49 UTC)
I have an openQA bypass with https://pagure.io/fedora-qa/os-autoinst-distri-fedora/pull-request/105, merged 2019-04-16 (a sketch of the bypass idea follows this comment thread). Waiting for investigation of this bug.

I mean, I dunno what I can really 'investigate', unfortunately. If it's happening because the host gets overloaded with work, all we could really do is cut the number of workers :/

/proc/cpuinfo shows the system as having 10 CPUs and being a "PowerNV 8348-21C". We're currently running 8 workers on it; 8 workers on 10 CPUs doesn't seem like it should be a problem. For storage it seems to have a pair of Seagate ST1000NM0033s being used as one big LVM VG. That's a bit weedy, I guess. The 'small' x86_64 worker hosts (which run 10 workers each) have RAID-6 arrays of 8 disks each (can't tell what type offhand, the RAID controller obfuscates it); the 'big' x86_64 worker hosts (which run 40 workers each) have RAID-6 arrays of 10 disks each (again, can't tell what model, but I think they're something beefy; they may even be SSDs).

So my working theory is that we're running into storage contention here. That seems plausible, especially since the way we schedule tests means many of them may reach the point at which they restore a snapshot at basically the same time. Perhaps we can get the storage on the worker host beefed up? Or get a beefier worker host? :)

I previously said "overloading of the workers host", but I meant that the error is not visible when a test runs individually with no contention on the system. I assume the problem is between openQA and qemu: qemu reports completion of the migration while the VM is still not usable immediately after completion. I have no idea what traces could be added between openQA and qemu to identify the cause of the problem.

> I previously said "overloading of the workers host" but I wanted to say the error is not visible when test run individually with no contention on the system

Well yeah, that's what I mean by 'overloading': this looks a lot like the storage kind of taps out if we have multiple tests doing disk-intensive things at once. CCing smooge; are you in a position to advise on this? Can we beef up the storage on openqa-ppc64le-01.qa.fedoraproject.org at all? IIRC there was talk recently of IBM giving us some much more powerful boxes to run openQA tests on; did that go anywhere?

We don't own the hardware, so we can't beef up the storage on the system at all. The IBM Power9 boxes are being used to build various cloud artifacts because of a virt-on-virt bug on Power8. I will find out where the other Power9 is going; the last I heard it was being negotiated.

So, IBM owns the hardware? Then I guess this is back on Michel =) Would it be possible to give it more/faster disks? smooge, perhaps we could at least ask you exactly what the storage setup on qa09/qa14 is, since those don't seem to have this issue (while hosting 10 workers each). What are the disks in that RAID-6 array?

qa09 and qa14 have eight 560 GB SAS drives in a RAID-6 array together. The systems we get from IBM come through a special contract which in the past required the system to be sent back to add hardware to it. When we added drives, it also caused problems because the system didn't match the contract when we returned it. I am checking with IBM on the whereabouts of the systems.
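For background on the snapshot mechanics under discussion: roughly speaking, os-autoinst saves the qemu VM state (via migration) after test modules flagged as milestones, and later modules can roll back to that state; the failure here is that the first character typed right after such a save/restore arrives doubled. The merged bypass worked around this by making the first post-snapshot keystroke disposable. Below is a minimal sketch of that idea as an os-autoinst test module; it is an illustration, not the contents of PR 105. send_key, record_soft_failure, and assert_script_run are real testapi calls, but the module structure and the exact keystrokes are hypothetical.

```perl
use base 'installedtest';
use strict;
use warnings;
use testapi;

sub run {
    # Hypothetical bypass sketch: immediately after a snapshot
    # save/restore, the first character typed at the console may
    # arrive doubled. Send a throwaway keystroke first, then clear
    # the input line, so the doubled character never lands inside a
    # command whose output matters.
    send_key 'ret';     # harmless keystroke to soak up the glitch
    send_key 'ctrl-u';  # discard anything (possibly doubled) left on the line
    record_soft_failure 'brc#1691487 bypass';
    # ...real test steps follow, e.g.:
    assert_script_run 'true';
}

sub test_flags {
    # milestone => 1 is what makes os-autoinst snapshot after this
    # module, so later modules can roll back to this point.
    return { milestone => 1 };
}

1;
```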
This bug is still present for the F31 release on the stg server for two tests:

* server_cockpit_default: https://openqa.stg.fedoraproject.org/tests/596827#next_previous
* server_database_client: https://openqa.stg.fedoraproject.org/tests/596834#next_previous

I found that adding a simple command in the login session of the preceding _console_wait_login.pm module is sufficient to solve the problem (a sketch appears at the end of this report), so the initial problem is related to the current_console not being used before the snapshot. I could add traces in my own openQA instance, but have no idea what to trace in os-autoinst (autotest, bmwqemu, backend/qemu) and/or qemu itself. Any suggestions?

Um. Not sure either. So, let me get this straight: the scenario is we boot, log into a console, immediately take a snapshot without doing anything at the logged-in console, then resume and try to type something? Have you tried reproducing this manually? That might help narrow it down...

This message is a reminder that Fedora 30 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 30 on 2020-05-26. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '30'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue, and we are sorry that we were not able to fix it before Fedora 30 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed, as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

This message is a reminder that Fedora 32 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 32 on 2021-05-25. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '32'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue, and we are sorry that we were not able to fix it before Fedora 32 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed, as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

The workaround is still in the tests; I'm not sure if the bug still happens without it. I checked the history of some of the tests that needed the bypass and did not find occurrences of record_soft_failure 'brc#1691487 bypass', so I propose to revert the commits that used the bypass and close this bug.

Sounds good to me.
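For reference, a minimal sketch of the workaround described above: running one throwaway command in the logged-in console inside _console_wait_login.pm, so the console has actually been used before the snapshot milestone is recorded. This is an illustration, not a quote of the real module; console_login is the distribution's login helper, and its argument names and the ROOT_PASSWORD variable are assumptions about the distri's interface.

```perl
use base 'installedtest';
use strict;
use warnings;
use testapi;
use utils;

sub run {
    my $self = shift;
    # Log in at the console. The named arguments here are an
    # assumption about console_login's interface, not a quote of
    # the real _console_wait_login.pm.
    console_login(user => 'root', password => get_var('ROOT_PASSWORD'));
    # The key line: run one trivial command so the console is used
    # at least once before the post-module snapshot is taken. With
    # this in place, the duplicated-first-character failure no
    # longer appeared in subsequent modules.
    assert_script_run 'true';
}

sub test_flags {
    # The post-login state becomes the snapshot point that later
    # modules restore.
    return { milestone => 1 };
}

1;
```

Either way round, the sacrificial interaction absorbs the race Michel suspected between qemu reporting the migration as complete and the console actually being usable.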