Red Hat Bugzilla – Bug 833795
Competing non-responsive and non-operational flows can result in guests being marked in a non-responsive state instead of down.
Last modified: 2013-08-09 01:40:53 EDT
Description of problem:
If a host becomes non-operational after previously becoming non-responsive and being fenced, the
two flows (vdsNotResponding and SetNonOperationalVdsCommand) can leave one or more guests marked
in a non-responsive state instead of down.
Version-Release number of selected component (if applicable):
Unable to reproduce internally as yet.
Steps to Reproduce:
[ Work In Progress ]
1. Use a locally shared NFS mount as a SD.
2. Disable the NFS services at reboot.
3. Have a number of running guests on the host at the time.
4. Block vdsmd to force a non-responsive treatment to be started.
5. Host should be fenced and should become non-operational shortly after booting.
6. One or more guests should be marked as non-responsive instead of down.
Actual results:
Guests marked as non-responsive.
Expected results:
Guests should be marked as down.
Customer logs with an example of this will follow in a private comment.
Given that a host must become non-operational after fencing for this to occur, and that the guests can easily be corrected, I am only assigning a medium priority to this bug.
My recommendation at this time would be that InitVdsOnUpCommand destroy any competing threads for the same host to avoid situations like this, but I am not sure if this is appropriate for every use case.
Looks quite complicated to try to prevent the interleaving. I feel we really need some kind of framework for that kind of task.
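To illustrate what such a framework might look like, here is a minimal sketch that serializes flows per host rather than destroying competing threads. This is a different technique from the one recommended above, and nothing here is taken from the engine code: the class HostFlowSerializer, the method runExclusive, and the string host ID are all hypothetical names chosen for the example.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical helper, not part of the engine: runs at most one flow per host at a time.
class HostFlowSerializer {
    // One lock per host ID, created lazily on first use.
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    // Runs the given flow while holding the lock for hostId and returns its result.
    // Flows for different hosts do not block each other.
    <T> T runExclusive(String hostId, Supplier<T> flow) {
        ReentrantLock lock = locks.computeIfAbsent(hostId, id -> new ReentrantLock());
        lock.lock();
        try {
            return flow.get();
        } finally {
            // Always release, even if the flow throws.
            lock.unlock();
        }
    }
}
```

Under this assumption, a non-responsive flow and a non-operational flow for the same host would each be wrapped in runExclusive with that host's ID, so one would finish updating guest statuses before the other starts. It does not decide which flow should win, only that they do not interleave.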
I am sending a fix to make the migrateVm command return VMs whose migration was unsuccessful to their former status instead of NotResponding. It won't solve the interleaving, but it will keep the outcome sane, I guess.
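The idea of that fix can be sketched roughly as follows. This is not the actual patch: the types VmStatus, Vm and MigrationRollback, their fields, and the method names are hypothetical and exist only to show the "remember the former status, restore it on failure" behaviour.

```java
// Hypothetical status values for the example; not the engine's real enum.
enum VmStatus { UP, DOWN, MIGRATING, NOT_RESPONDING }

// Hypothetical VM record that remembers its status from before a migration started.
class Vm {
    final String name;
    VmStatus status;
    VmStatus statusBeforeMigration; // null when no migration is in progress

    Vm(String name, VmStatus status) {
        this.name = name;
        this.status = status;
    }
}

class MigrationRollback {
    // Record the current status, then mark the VM as migrating.
    static void startMigration(Vm vm) {
        vm.statusBeforeMigration = vm.status;
        vm.status = VmStatus.MIGRATING;
    }

    // On failure, restore the recorded status instead of marking the VM NOT_RESPONDING.
    // If nothing was recorded, the current status is left unchanged.
    static void onMigrationFailed(Vm vm) {
        if (vm.statusBeforeMigration != null) {
            vm.status = vm.statusBeforeMigration;
            vm.statusBeforeMigration = null;
        }
    }
}
```

With this sketch, a VM that was UP before a failed migration ends up UP again rather than in a non-responsive state.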
Quality Engineering Management has reviewed and declined this request.
You may appeal this decision by reopening this request.
not in downstream
OK - SI25.1
Following the steps on https://bugzilla.redhat.com/show_bug.cgi?id=833795#c12, VM state is UP after migration fails.