Bug 1730617 - Multihost: Task execution synchronization does not work in restraint.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Restraint
Classification: Retired
Component: general
Version: 0.1.39
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: 0.1.40
Assignee: Carol Bouchard
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-07-17 08:20 UTC by Marek Havrila
Modified: 2019-09-10 05:57 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-08-16 14:43:58 UTC
Embargoed:




Links
System: Beaker Project Gerrit
ID: 6625
Private: 0
Priority: None
Status: MERGED
Summary: Multihost Task Sync does not work
Last Updated: 2020-04-24 06:41:49 UTC

Description Marek Havrila 2019-07-17 08:20:53 UTC
Description of problem:

When running multihost tests, task synchronization does not work. With a server machine and a client machine running three tasks each, the 2nd task should start only after the first task has finished on all machines. However, tasks do not wait/synchronize: if one machine gets stuck, for example during installation, the 2nd machine runs its tasks without waiting. This is unexpected behavior, and the /distribution/dummy task loses its purpose in such a case. This behavior differs from the behavior of beah.

Comment 1 Tomas Klohna 🔧 2019-07-17 08:31:14 UTC
Hello Marek, can you supply a failing job?

Comment 2 Marek Havrila 2019-07-17 08:41:31 UTC
Hi Tomas,

Failing job:
https://beaker.engineering.redhat.com/jobs/3673455

Comment 3 Tomas Klohna 🔧 2019-07-17 09:05:06 UTC
Marek, I'm looking at the job and I don't see an issue.
Synchronization has to happen inside your test case and, judging by login/ssh, it is missing there. Just specifying the ROLE in the XML has no effect in restraint unless you adapt your tests to take advantage of it. https://beaker-project.org/docs/user-guide/multihost.html

I might be missing something, but I don't see why a client/server would have to wait until installation has started on the other machines. I understand there may be use cases where you have 5 clients and need all of them alive, and if one of them doesn't even start, it's pointless to continue. From a general point of view, though, I believe it's beneficial that the other machines get ready and install or run tasks before reaching the actual multihost task. That way we don't waste time.

Do you think you can find the previous job that used beah? I would be interested.

Comment 4 Marek Havrila 2019-07-17 10:05:26 UTC
Hi Tomas,


I created a simple Beaker job with beah showing the expected behavior: https://beaker.engineering.redhat.com/jobs/3675407
It shows /distribution/dummy on the Client waiting for /distribution/reservesys on the Server to finish before moving on to the next task.

In https://beaker-project.org/docs/user-guide/multihost.html, look at the section starting with "Firstly, any multihost testing must ensure that the task execution order aligns correctly on all machines." This section explains the use of /distribution/dummy for synchronization purposes. It is also a nice example of a use case.

Let me show another example: 
Let's have tasks:

Server                Client
/test1                /distribution/dummy
/distribution/dummy   /test1

We need test1 on the server to finish, including its cleanup phase, before test1 starts on the client. In this case, sync inside the test1 code will not help, since test1 on the client and test1 on the server are not supposed to run at the same time. However, we need both machines running, so we cannot run it as a single-host job.

Comment 5 Tomas Klohna 🔧 2019-07-17 12:51:43 UTC
Thanks Marek!

Beah really did support this out of the box, but it appears to do nothing more than use the rhts-sync-block and rhts-sync-set commands.
---
I understand the example, but I fail to see why you can't use the rhts-sync commands.
Client -> T1 can use rhts-sync-block to wait for a specific string set by rhts-sync-set on the Server; once it receives it, it can start processing
Server -> T1 sets that string only after the whole Server task has finished, and only then is the Client's task executed

I imagine you have complicated tests, but I really fail to see why rhts-sync couldn't be used. You can also create an additional task before T1 on the Client machines that does nothing but check the status using the sync commands.
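
The Client/Server split described above could be sketched roughly as follows. This assumes the Beaker role variable $SERVERS is set in the task environment; the state name SERVER_T1_DONE, the function names, and the SYNC_SET / SYNC_BLOCK override variables are made up for illustration and are not part of beah or restraint.

```shell
#!/bin/sh
# Hypothetical state name; any string both sides agree on would do.
STATE=SERVER_T1_DONE

# Commands used for syncing. The override variables are an assumption of this
# sketch (handy for dry runs), not something the rhts tools define.
SYNC_SET=${SYNC_SET:-rhts-sync-set}
SYNC_BLOCK=${SYNC_BLOCK:-rhts-sync-block}

# Server side of T1: run the test and its cleanup first, then announce it.
server_t1() {
    # ... test body and cleanup would go here ...
    "$SYNC_SET" -s "$STATE"
}

# Client side of T1: wait until every server has announced the state.
client_t1() {
    # $SERVERS is left unquoted on purpose so each hostname is one argument.
    "$SYNC_BLOCK" -s "$STATE" $SERVERS
    # ... client test body would go here ...
}
```

The Server's task would call server_t1 and the Client's task would call client_t1, so the Client's test body starts only after the Server has set the state.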
---
Ping me on IRC or stop by our floor, we can chat about this. (adding @pholica per your request)
I'm not really convinced at this moment that we should change the default behaviour of restraint, since there are already teams using it for multihost recipes and this would break their workflow. I do wish we had known this sooner.

Comment 6 Bill Peck 2019-07-17 13:22:20 UTC
We could add a shell script plugin to the completed.d directory which would block on $SERVERS, $CLIENTS, $DRIVERS being in a done state.  

We would just need to copy it from beah/rhts:

    export TESTORDER=$(expr $TESTORDER + 1)
    rhts-sync-set -s DONE
    rhts-sync-block -s DONE $SERVERS $CLIENTS $DRIVER

and update the commands to use the restraint versions.

Comment 7 Tomas Klohna 🔧 2019-07-29 17:24:57 UTC
Running Bill's solution does work. Here's a job that uses ks_appends - https://beaker.engineering.redhat.com/jobs/3698453

There's also a standalone /distribution/beaker/beah/misc/sync task that does the syncing/blocking - https://bugzilla.redhat.com/show_bug.cgi?id=1426210
The issue contains some documentation about multitask behaviour as well.

Please try one of these solutions out and let me know if that is enough.
As we discussed on IRC, I'm okay with an opt-in solution (ENV variable) that would enable this plugin.

Carol, I'm switching this one to myself for now.

Comment 8 Marek Havrila 2019-07-30 12:11:59 UTC
Hi Tomas,

Thank you for posting an example of a working solution. This should work for us. However, due to limitations of our internal tooling, we are not able to adopt this solution in our workflow right now.

It would be very helpful if you could provide the opt-in solution you mentioned above. It would also be more robust and available to other teams.

Please note that the multihost documentation (https://beaker-project.org/docs/user-guide/multihost.html) does not reflect the actual behavior. It reflects the behavior of beah, where task synchronization appeared to run by default.

Comment 9 Tomas Klohna 🔧 2019-08-02 10:26:14 UTC
I'm lowering the priority because a workaround has been provided. 
Marek, changing how your Job XML is generated should be straightforward, at least compared to changing anything in restraint.

The ticket is now back in Carol's hands as a clear approach has been defined.

Comment 10 Martin Styk 2019-09-10 05:57:49 UTC
Restraint 0.1.40 has been released.

