Created attachment 1244653
Engine log, VDSM logs from both hosts, engine-backup file
Description of problem:
When several VM pools exist in the system and we attempt to remove them at the same time, at least some of the removals appear to fail. This leaves VMs behind in the system, sometimes detached from the pool and sometimes still attached.
We hit this in our automation. When I tried to reproduce it, I was able to do so 3 times with the following scenario:
Steps to Reproduce:
1. Have a pool (auto, stateless) with 5 VMs, 3 of them prestarted and running.
2. Have a second pool with 3 VMs, none of them running.
3. Invoke removal of both pools asynchronously.
4. Immediately create a new pool.
Actual results:
At least one of the pools (in every attempt, the first pool) fails to complete the remove VM pool action, leaving one or two VMs detached.
In one attempt the remove VM pool task was left stuck in the job table in STARTED status (I am attaching a DB dump of the system containing this task, created with the engine-backup tool).
Expected results:
Both pools are removed successfully.
I'm not sure step 4 is required. The failure might also occur if the system is loaded with other pool-related tasks. If needed I can try to create more scenarios, but this one has worked so far.
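For reference, the reproduction steps above can be sketched roughly as follows. This is not the automation code that hit the bug, only a minimal illustration of the timing: `remove_pool` and `create_pool` are hypothetical callables standing in for whatever client issues the real requests, and the pool names are made up. The removals are submitted without waiting for each other, and the optional create is submitted right behind them, so leaving `new_pool` unset is one way to check whether step 4 matters.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional, Sequence


def run_scenario(
    remove_pool: Callable[[str], None],   # hypothetical: removes one pool by name
    create_pool: Callable[[str], None],   # hypothetical: creates one pool by name
    pools_to_remove: Sequence[str],
    new_pool: Optional[str] = None,
) -> Dict[str, Optional[BaseException]]:
    """Submit all removals concurrently, then optionally a create right after.

    Returns a mapping of operation label to the exception it raised,
    or None if the call returned normally.
    """
    results: Dict[str, Optional[BaseException]] = {}
    with ThreadPoolExecutor(max_workers=len(pools_to_remove) + 1) as executor:
        # Step 3: fire every removal without waiting for the previous one.
        futures = {
            f"remove:{name}": executor.submit(remove_pool, name)
            for name in pools_to_remove
        }
        # Step 4 (optional): create a new pool immediately afterwards.
        if new_pool is not None:
            futures[f"create:{new_pool}"] = executor.submit(create_pool, new_pool)
        # Collect outcomes; exception() blocks until the call has finished.
        for label, future in futures.items():
            results[label] = future.exception()
    return results
```

A call returning normally only means the request was accepted. Whether the pool and its VMs are really gone still has to be checked separately, since the failure described here shows up as leftover VMs or a job stuck in STARTED.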
Version-Release number of selected component (if applicable):
How reproducible:
Not 100%, but most of the time.
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
We cannot reproduce the error on 4.1.
If you encounter a new flow that causes the race, open a new bug with the appropriate steps to reproduce.