Bug 1691487 - openQA transient test failure: duplicated first character just after a snapshot
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: openqa
Version: 30
Hardware: ppc64le
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Adam Williamson
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Keywords:
Depends On:
Blocks: PPCTracker
 
Reported: 2019-03-21 17:49 UTC by Michel Normand
Modified: 2019-04-18 16:41 UTC
CC List: 5 users

Clone Of:
Last Closed:

Description Michel Normand 2019-03-21 17:49:49 UTC
openQA test failure: the first character of a typed command is duplicated, so 'hhostnamectl' is entered where 'hostnamectl' was expected.

For the server_database_client test:
https://openqa.stg.fedoraproject.org/tests/500172#step/_post_network_static/1

This seems to be an openQA problem that only appears when the worker host is overloaded (when a new set of tests is triggered for a new compose): my local trials show that manually retrying an individual test is enough for it to run successfully.

I locally hit the same problem with server_cockpit_default; my local bypass was to add

script_run 'echo "bypass duplicated first character"';

at the top of the test file (see the sketch below).
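
A minimal sketch of where that sacrificial line goes in an os-autoinst-distri-fedora style test module. Only the script_run line, the hostnamectl command and the 90 second timeout come from this report; the module layout, the 'installedtest' base class and the test_flags block are the usual boilerplate and are purely illustrative, not the contents of the merged pull request.

=== illustrative sketch of the local bypass (not the merged PR)
use base 'installedtest';
use strict;
use testapi;

sub run {
    # Sacrificial command: if the first keystroke after the snapshot gets
    # duplicated, the shell only complains about an unknown 'eecho' command,
    # and script_run does not check the exit status, so the test carries on
    # and the commands typed afterwards come out cleanly.
    script_run 'echo "bypass duplicated first character"';

    # The real work follows, e.g. the command that failed in this report.
    assert_script_run 'hostnamectl set-hostname dbclient.domain.local', 90;
}

sub test_flags {
    return { fatal => 1 };
}

1;
===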


=== extract of autoinst.log for the URL referenced above
...
[2019-03-21T02:26:26.891 UTC] [debug] Migrating total bytes:     	4312072192
[2019-03-21T02:26:26.891 UTC] [debug] Migrating remaining bytes:   	0
[2019-03-21T02:26:26.892 UTC] [debug] Snapshot complete
[2019-03-21T02:26:26.913 UTC] [debug] EVENT {"event":"RESUME","timestamp":{"microseconds":913591,"seconds":1553135186}}
[2019-03-21T02:26:26.921 UTC] [debug] QEMU: Formatting '/var/lib/openqa/pool/5/raid/hd0-overlay1', fmt=qcow2 size=10737418240 backing_file=/var/lib/openqa/pool/5/raid/hd0-overlay0 backing_fmt=qcow2 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[2019-03-21T02:26:26.921 UTC] [debug] QEMU: Formatting '/var/lib/openqa/pool/5/raid/cd0-overlay1', fmt=qcow2 size=3130112000 backing_file=/var/lib/openqa/pool/5/raid/cd0-overlay0 backing_fmt=qcow2 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[2019-03-21T02:26:26.930 UTC] [debug] ||| starting _post_network_static tests/_post_network_static.pm
[2019-03-21T02:26:26.931 UTC] [debug] /var/lib/openqa/share/tests/fedora/tests/_post_network_static.pm:12 called tapnet::setup_tap_static
[2019-03-21T02:26:26.931 UTC] [debug] <<< testapi::assert_script_run(cmd='hostnamectl set-hostname dbclient.domain.local', wait=90, fail_message=undef)
[2019-03-21T02:26:26.931 UTC] [debug] /var/lib/openqa/share/tests/fedora/tests/_post_network_static.pm:12 called tapnet::setup_tap_static
[2019-03-21T02:26:26.931 UTC] [debug] <<< testapi::type_string(string='hostnamectl set-hostname dbclient.domain.local', max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2019-03-21T02:26:26.932 UTC] [debug] /var/lib/openqa/share/tests/fedora/tests/_post_network_static.pm:12 called tapnet::setup_tap_static
[2019-03-21T02:26:26.932 UTC] [debug] <<< testapi::wait_screen_change(timeout=10, similarity_level=50)
[2019-03-21T02:26:28.067 UTC] [debug] waiting for screen change: 0 32.9796599074581
[2019-03-21T02:26:28.069 UTC] [debug] >>> testapi::wait_screen_change: screen change seen at 0
[2019-03-21T02:26:28.072 UTC] [debug] /var/lib/openqa/share/tests/fedora/tests/_post_network_static.pm:12 called tapnet::setup_tap_static
[2019-03-21T02:26:28.072 UTC] [debug] <<< testapi::type_string(string=' ; echo V5qMN-$?- > /dev/hvc0
', max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2019-03-21T02:26:28.072 UTC] [debug] /var/lib/openqa/share/tests/fedora/tests/_post_network_static.pm:12 called tapnet::setup_tap_static
[2019-03-21T02:26:28.072 UTC] [debug] <<< testapi::wait_screen_change(timeout=10, similarity_level=50)
[2019-03-21T02:26:28.822 UTC] [debug] waiting for screen change: 0 35.9559735377936
[2019-03-21T02:26:28.822 UTC] [debug] >>> testapi::wait_screen_change: screen change seen at 0
[2019-03-21T02:26:28.822 UTC] [debug] /var/lib/openqa/share/tests/fedora/tests/_post_network_static.pm:12 called tapnet::setup_tap_static
[2019-03-21T02:26:28.822 UTC] [debug] <<< testapi::wait_serial(regexp=qr/V5qMN-\d+-/, timeout=90)
[2019-03-21T02:26:30.974 UTC] [debug] >>> testapi::wait_serial: (?^:V5qMN-\d+-): ok
[2019-03-21T02:26:31.096 UTC] [debug] command 'hostnamectl set-hostname dbclient.domain.local' failed at /var/lib/openqa/share/tests/fedora/lib/tapnet.pm line 39.
===
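
The sequence above is the usual assert_script_run pattern: type the command, then type a second string that echoes a random marker plus the exit status ($?) to the serial console /dev/hvc0, then wait_serial for the marker. Here the marker does arrive (wait_serial reports ok), but the exit status is nonzero because 'hhostnamectl' is not a valid command, which produces the "command ... failed" line from tapnet.pm. Below is a much simplified sketch of that pattern, reusing the marker from this log; it is not the real testapi implementation.

=== simplified sketch of the assert_script_run flow seen above (not the real testapi code)
sub assert_script_run_sketch {
    my ($cmd, $timeout) = @_;
    my $marker = 'V5qMN';    # random tag, as seen in the log above

    # When the bug hits, the very first character is doubled and the
    # command is typed as 'hhostnamectl ...'.
    type_string $cmd;

    # Report the exit status of the command on the serial console.
    type_string " ; echo $marker-\$?- > /dev/hvc0\n";

    my $out = wait_serial(qr/$marker-\d+-/, $timeout)
        or die "serial marker not seen within ${timeout}s";

    # 'hhostnamectl' exits with status 127 (command not found), so this
    # check fails, matching the "command ... failed" message at
    # lib/tapnet.pm line 39.
    die "command '$cmd' failed" unless $out =~ /$marker-0-/;
}
===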

Comment 1 Michel Normand 2019-04-16 07:32:55 UTC
I have an openQA bypass with https://pagure.io/fedora-qa/os-autoinst-distri-fedora/pull-request/105, merged 2019-04-16.
Still waiting for investigation of this bug.

Comment 2 Adam Williamson 2019-04-16 15:09:39 UTC
I mean, I dunno what I can really 'investigate', unfortunately. If it's happening because the host gets overloaded with work, all we could really do is cut the number of workers :/

/proc/cpuinfo shows the system as having 10 CPUs and being a "PowerNV 8348-21C". We're currently running 8 workers on it. 8 workers on 10 CPUs doesn't seem like it should be a problem.

For storage it seems to have a pair of Seagate ST1000NM0033s being used as one big LVM VG. That's a bit weedy, I guess. The 'small' x86_64 worker hosts (which run 10 workers each) have RAID-6 arrays of 8 disks each (can't tell what type off hand, the RAID controller obfuscates it), the 'big' x86_64 worker hosts (which run 40 workers each) have RAID-6 arrays of 10 disks each (again can't tell what model, but I think they're something beefy; they may even be SSDs).

So...my working theory is we're running into storage contention here. That seems plausible especially since the way we schedule tests means many of them may reach the point at which they restore a snapshot at basically the same time.

Perhaps we can get the storage on the worker host beefed up? Or get a beefier worker host? :)

Comment 3 Michel Normand 2019-04-16 16:34:09 UTC
I previously said "overloading of the workers host", but what I meant is that the error is not visible when the test is run individually, with no contention on the system.
I assume the problem is between openQA and qemu: qemu reports that the migration (snapshot) is complete, while the guest is not actually usable immediately afterwards.
I have no idea what traces could be added between openQA and qemu to identify the cause of the problem.

Comment 4 Adam Williamson 2019-04-16 16:38:50 UTC
"I previously  said "overloading of the workers host"  but I wanted to say the error is not visible when test run individually with no contention on the system"

Well, yeah, that's what I mean by 'overloading': this looks a lot like the storage kinda taps out if we have multiple tests doing disk-intensive things at once.

Comment 5 Adam Williamson 2019-04-18 15:13:34 UTC
CCing smooge; are you in a position to advise on this? Can we beef up the storage on openqa-ppc64le-01.qa.fedoraproject.org at all?

IIRC there was talk recently of IBM giving us some much more powerful boxes to run openQA tests on; did that go anywhere?

Comment 6 Stephen John Smoogen 2019-04-18 15:17:59 UTC
We don't own the hardware, so we can't beef up the storage on the system at all. The IBM Power9 boxes are being used to build various cloud artifacts because of a bug with virt-on-virt on Power8. I will find out where the other Power9 is going; the last I heard, it was being negotiated.

Comment 7 Adam Williamson 2019-04-18 15:21:42 UTC
So, IBM owns the hardware? Then I guess this is back on Michel =) Would it be possible to give it more/faster disks?

smooge, perhaps we could at least ask you exactly what the storage setup on qa09/qa14 is, since those don't seem to have this issue (while hosting 10 workers each). What are the disks in that RAID-6 array?

Comment 8 Stephen John Smoogen 2019-04-18 15:55:38 UTC
qa09 and qa14 each have eight 560 GB SAS drives in a RAID-6 array.

The systems we get from IBM come through a special contract which, in the past, required the system to be sent back in order to add hardware to it. When we added drives, it also caused problems because the system no longer matched the contract when we returned it. I am checking with IBM on the whereabouts of the systems.

