Bug 1691487

Summary: openQA transient test failure as duplicated first character just after a snapshot
Product: Fedora
Component: openqa
Version: 32
Hardware: ppc64le
OS: Unspecified
Reporter: Michel Normand <normand>
Assignee: Adam Williamson <awilliam>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: awilliam, dan, hannsj_uhl, smooge
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Last Closed: 2021-05-03 15:58:58 UTC
Type: Bug
Bug Blocks: 1071880

Description Michel Normand 2019-03-21 17:49:49 UTC
openQA test failure caused by a duplicated first character: 'hhostnamectl' was typed where 'hostnamectl' was expected.

for server_database_client test:
https://openqa.stg.fedoraproject.org/tests/500172#step/_post_network_static/1

This seems to be an openQA problem that only appears when the worker host is overloaded (when a new set of tests is triggered for a new compose); my local trials show that manually retrying an individual test is sufficient to run it successfully.

I locally hit the same problem with server_cockpit_default.
My local bypass was to add
script_run 'echo "bypass duplicated first character"';
at the top of the file.
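
For illustration, a minimal sketch of what such a bypass could look like in an os-autoinst test module; the module skeleton, base class and hostname below are placeholders of mine, not the actual test code:

=== sketch of a bypass (hypothetical, not the merged code)
use base 'installedtest';
use strict;
use testapi;

sub run {
    # Throwaway command typed first: if the first character is duplicated
    # right after the snapshot is restored, it only corrupts this harmless
    # echo instead of the command the test actually depends on.
    script_run 'echo "bypass duplicated first character"';
    # Real command the test cares about (example only).
    assert_script_run 'hostnamectl set-hostname dbclient.domain.local', 90;
}

1;
===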


=== extract of autoinst.log for the URL referenced above
...
[2019-03-21T02:26:26.891 UTC] [debug] Migrating total bytes:     	4312072192
[2019-03-21T02:26:26.891 UTC] [debug] Migrating remaining bytes:   	0
[2019-03-21T02:26:26.892 UTC] [debug] Snapshot complete
[2019-03-21T02:26:26.913 UTC] [debug] EVENT {"event":"RESUME","timestamp":{"microseconds":913591,"seconds":1553135186}}
[2019-03-21T02:26:26.921 UTC] [debug] QEMU: Formatting '/var/lib/openqa/pool/5/raid/hd0-overlay1', fmt=qcow2 size=10737418240 backing_file=/var/lib/openqa/pool/5/raid/hd0-overlay0 backing_fmt=qcow2 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[2019-03-21T02:26:26.921 UTC] [debug] QEMU: Formatting '/var/lib/openqa/pool/5/raid/cd0-overlay1', fmt=qcow2 size=3130112000 backing_file=/var/lib/openqa/pool/5/raid/cd0-overlay0 backing_fmt=qcow2 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[2019-03-21T02:26:26.930 UTC] [debug] ||| starting _post_network_static tests/_post_network_static.pm
[2019-03-21T02:26:26.931 UTC] [debug] /var/lib/openqa/share/tests/fedora/tests/_post_network_static.pm:12 called tapnet::setup_tap_static
[2019-03-21T02:26:26.931 UTC] [debug] <<< testapi::assert_script_run(cmd='hostnamectl set-hostname dbclient.domain.local', wait=90, fail_message=undef)
[2019-03-21T02:26:26.931 UTC] [debug] /var/lib/openqa/share/tests/fedora/tests/_post_network_static.pm:12 called tapnet::setup_tap_static
[2019-03-21T02:26:26.931 UTC] [debug] <<< testapi::type_string(string='hostnamectl set-hostname dbclient.domain.local', max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2019-03-21T02:26:26.932 UTC] [debug] /var/lib/openqa/share/tests/fedora/tests/_post_network_static.pm:12 called tapnet::setup_tap_static
[2019-03-21T02:26:26.932 UTC] [debug] <<< testapi::wait_screen_change(timeout=10, similarity_level=50)
[2019-03-21T02:26:28.067 UTC] [debug] waiting for screen change: 0 32.9796599074581
[2019-03-21T02:26:28.069 UTC] [debug] >>> testapi::wait_screen_change: screen change seen at 0
[2019-03-21T02:26:28.072 UTC] [debug] /var/lib/openqa/share/tests/fedora/tests/_post_network_static.pm:12 called tapnet::setup_tap_static
[2019-03-21T02:26:28.072 UTC] [debug] <<< testapi::type_string(string=' ; echo V5qMN-$?- > /dev/hvc0
', max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2019-03-21T02:26:28.072 UTC] [debug] /var/lib/openqa/share/tests/fedora/tests/_post_network_static.pm:12 called tapnet::setup_tap_static
[2019-03-21T02:26:28.072 UTC] [debug] <<< testapi::wait_screen_change(timeout=10, similarity_level=50)
[2019-03-21T02:26:28.822 UTC] [debug] waiting for screen change: 0 35.9559735377936
[2019-03-21T02:26:28.822 UTC] [debug] >>> testapi::wait_screen_change: screen change seen at 0
[2019-03-21T02:26:28.822 UTC] [debug] /var/lib/openqa/share/tests/fedora/tests/_post_network_static.pm:12 called tapnet::setup_tap_static
[2019-03-21T02:26:28.822 UTC] [debug] <<< testapi::wait_serial(regexp=qr/V5qMN-\d+-/, timeout=90)
[2019-03-21T02:26:30.974 UTC] [debug] >>> testapi::wait_serial: (?^:V5qMN-\d+-): ok
[2019-03-21T02:26:31.096 UTC] [debug] command 'hostnamectl set-hostname dbclient.domain.local' failed at /var/lib/openqa/share/tests/fedora/lib/tapnet.pm line 39.
===
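
For context, a rough reconstruction (mine, inferred only from the log above, not the actual lib/tapnet.pm code) of the failing step and why the duplicated first character breaks it:

=== sketch of the failing step (hypothetical reconstruction)
use testapi;

# assert_script_run types the command, then types something like
# " ; echo V5qMN-$?- > /dev/hvc0" and waits for the V5qMN-<status>- marker
# on the serial console. With the first character duplicated, the shell
# actually runs "hhostnamectl set-hostname ...", which fails, so the
# reported status is non-zero and the assertion aborts the test even
# though wait_serial itself matched.
assert_script_run 'hostnamectl set-hostname dbclient.domain.local', 90;
===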

Comment 1 Michel Normand 2019-04-16 07:32:55 UTC
I have an openQA bypass with https://pagure.io/fedora-qa/os-autoinst-distri-fedora/pull-request/105, merged 2019-04-16.
Waiting for investigation of this bug.

Comment 2 Adam Williamson 2019-04-16 15:09:39 UTC
I mean, I dunno what I can really 'investigate', unfortunately; if it's happening because the host gets overloaded with work, all we could really do is cut the number of workers :/

/proc/cpuinfo shows the system as having 10 CPUs and being a "PowerNV 8348-21C". We're currently running 8 workers on it. 8 workers on 10 CPUs doesn't seem like it should be a problem.

For storage it seems to have a pair of Seagate ST1000NM0033s being used as one big LVM VG. That's a bit weedy, I guess. The 'small' x86_64 worker hosts (which run 10 workers each) have RAID-6 arrays of 8 disks each (can't tell what type off hand, the RAID controller obfuscates it), the 'big' x86_64 worker hosts (which run 40 workers each) have RAID-6 arrays of 10 disks each (again can't tell what model, but I think they're something beefy; they may even be SSDs).

So... my working theory is we're running into storage contention here. That seems plausible, especially since the way we schedule tests means many of them may reach the point at which they restore a snapshot at basically the same time.

Perhaps we can get the storage on the worker host beefed up? Or get a beefier worker host? :)

Comment 3 Michel Normand 2019-04-16 16:34:09 UTC
I previously said "overloading of the workers host", but what I meant is that the error is not visible when a test runs individually with no contention on the system.
I assume the problem is between openQA and qemu: qemu reports completion of the migration while the guest is still not usable immediately after completion.
I have no idea what traces could be added between openQA and qemu to identify the cause of the problem.

Comment 4 Adam Williamson 2019-04-16 16:38:50 UTC
"I previously  said "overloading of the workers host"  but I wanted to say the error is not visible when test run individually with no contention on the system"

well yeah that's what I mean by 'overloading' - this looks a lot like the storage kinda taps out if we have multiple tests doing disk intensive things at once.

Comment 5 Adam Williamson 2019-04-18 15:13:34 UTC
CCing smooge; are you in a position to advise on this? Can we beef up the storage on openqa-ppc64le-01.qa.fedoraproject.org at all?

IIRC there was talk recently of IBM giving us some much more powerful boxes to run openQA tests on; did that go anywhere?

Comment 6 Stephen John Smoogen 2019-04-18 15:17:59 UTC
We don't own the hardware, so we can't beef up the storage on the system at all. The IBM Power9 boxes are being used to build various cloud artifacts because of a bug with Power8 virt-on-virt. I will find out where the other Power9 is going... the last I heard, it was still being negotiated.

Comment 7 Adam Williamson 2019-04-18 15:21:42 UTC
So, IBM owns the hardware? Then I guess this is back on Michel =) Would it be possible to give it more/faster disks?

smooge, perhaps we could at least ask you exactly what the storage setup on qa09/qa14 is, since those don't seem to have this issue (while hosting 10 workers each). What are the disks in that RAID-6 array?

Comment 8 Stephen John Smoogen 2019-04-18 15:55:38 UTC
qa09 and qa14 each have eight 560 GB SAS drives in a RAID-6 array.

The systems we get from IBM come through a special contract which in the past required the system to be sent back in order to add hardware to it. When we added drives, it also caused problems because the system didn't match the contract when we returned it. I am checking with IBM on the whereabouts of the systems.

Comment 9 Michel Normand 2019-08-30 15:42:54 UTC
This bug is still present for the f31 release on the stg server for two tests:
server_cockpit_default: https://openqa.stg.fedoraproject.org/tests/596827#next_previous
server_database_client: https://openqa.stg.fedoraproject.org/tests/596834#next_previous

I found that adding a simple command in the login session of the preceding _console_wait_login.pm module is sufficient to solve the problem.
So the initial problem seems related to the current_console not being used before the snapshot.

I could add traces in my own openQA instance, but I have no idea what to trace in os-autoinst (autotest, bmwqemu, backend/qemu) and/or qemu itself.
Any suggestions?

Comment 10 Adam Williamson 2019-08-30 16:31:38 UTC
Um. Not sure either. So, let me get this straight: the scenario is we boot, log into a console, immediately take a snapshot without doing anything at the logged-in console, then resume and try to type something? Have you tried reproducing this manually? That might help narrow it down...

Comment 11 Ben Cotton 2020-04-30 21:14:14 UTC
This message is a reminder that Fedora 30 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 30 on 2020-05-26.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '30'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 30 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 12 Fedora Program Management 2021-04-29 15:55:23 UTC
This message is a reminder that Fedora 32 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 32 on 2021-05-25.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '32'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 32 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 13 Adam Williamson 2021-04-29 16:02:19 UTC
The workaround is still in the tests; not sure if the bug still happens without it.

Comment 14 Michel Normand 2021-05-03 13:16:08 UTC
I checked the history of some of the tests that needed the bypass and did not find occurrences of record_soft_failure 'brc#1691487 bypass',
so I propose to revert the commits that used the bypass and close this bug.

Comment 15 Adam Williamson 2021-05-03 15:58:58 UTC
Sounds good to me.