Possibly because my hosts are remote and have high latency, RHEVM doesn't notice the "service vdsmd restart" that I run on the hosts and doesn't re-initialize the hosts after they return from the restart. No messages are seen in rhevm.log. As discussed with mkublin.
Well, the problem is on the RHEVM side, but there's literally nothing in the RHEVM logs.
We need the logs to see what the response to getVdsStats was during the time vdsm was down.
Created attachment 486452: rhevm.log
Here's the log. The vdsmd restart was on 2011-03-20 at 11:53, but as you can see the log ends at 11:50 because there's nothing there.
Log rotation? Did you check the next file? Does this happen to you on a regular basis with remote hosts?
No, it's not log rotation. And yes, it happens all the time, with 6 different hosts.
Well, if RHEVM can't tell that there was a restart of vdsm, there is nothing we can do automatically. Maybe on high-latency systems the NumberVmRefreshesBeforeSave configuration value should be lowered in order to have more calls to vdsm.

Anyway, IIUC the host is no longer connected to the storage, therefore it should move to Non Operational after 5 minutes, and the user could activate it in order to reconnect it.
(In reply to comment #8)
> well if rhevm cant tell there was a restart in the vdsm, there is nothing we
> can do automatically, maybe in high latency systems NumberVmRefreshesBeforeSave
> configuration should be lowered in order to have more calls to vdsm..
>
> anyway IIUC, the host is no longer connected to the storage, therefore should
> move to non operational after 5 mins, and user could activate it in order to
> reconnect it.

I tried different values of NumberVmRefreshesBeforeSave and it didn't make the issue go away.

And "service vdsmd restart" does not disconnect storage, so the host doesn't become Non Operational.
(In reply to comment #9)
> I tried different values of NumberVmRefreshes and it didn't make the issue go
> away.
>
> And "service vdsmd restart" does not disconnect storage, so the host doesn't
> become Non Operational.

OK, thanks. So if the host is still connected to the storage, are there any problems caused by this? (I'm trying to understand whether there is another way to discover that this has happened.)
The only problem mkublin and I can think of is that if the host is the SPM and I run "service vdsmd restart", the restart isn't detected and the host remains SPM instead of an SPM re-election being run.
If the only issue is the lost SPM, then the next action should detect it and run the SPM election code. We should resolve this by having a state in vdsm indicating whether it got connected by RHEV-M after starting; this is probably related to the policy and config check as well. For this reason, postponing to be looked at in a future version.
I'm refreshing the needinfo on Livnat here, as my email reminders are ignored.

However, my $0.02 is that unless we want to invent a new status that is cleared upon first RHEV Manager access, VDSM should keep the status "Recovering from crash or Initializing" until RHEV Manager has run getVdsStats at least once. VDSM without running VMs can recover very quickly, so RHEV Manager may not detect that it has gone through this state.

So I would move this BZ to the VDSM component and fix it on the closest RHEL6 Z-stream. (The issue is probably less relevant for RHEL5 hosts, though it may still happen there, with lower priority.)

VDSM already does something similar for VMs: if a VM is down, its object remains until RHEV Manager can read the Down status and the reason. There is a difference though, as in that case RHEV Manager explicitly clears the object by sending destroy.

Dan?
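The proposal above can be sketched roughly as follows. This is a minimal illustration only: the class and method names are hypothetical stand-ins, not the real vdsm state machine or API, and the point is just that a fresh process stays in a "Recovering" state until the manager's first stats read, so even a very fast restart cannot go unnoticed.

```python
# Hypothetical sketch of the "hold the recovering status until the first
# stats read" idea. Names (Vdsm, HostState, get_vds_stats) are illustrative.

class HostState:
    RECOVERING = 'Recovering from crash or Initializing'
    UP = 'Up'


class Vdsm:
    def __init__(self):
        # Every fresh process starts in RECOVERING, however quickly it
        # actually finishes recovering.
        self.state = HostState.RECOVERING

    def get_vds_stats(self):
        # The manager's first read both observes and clears the state,
        # guaranteeing it sees that a restart happened.
        observed = self.state
        self.state = HostState.UP
        return {'status': observed}


v = Vdsm()
print(v.get_vds_stats()['status'])  # 'Recovering from crash or Initializing'
print(v.get_vds_stats()['status'])  # 'Up'
```

As Dan argues below in the thread, the awkward part of this semantic is that a read-only verb (the stats call) mutates the host's reported state.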
I think we've discussed this somewhere else several months ago. IMO it is wrong for Vdsm to wait for its (very important!) client in order to start working.

Worse, I think that Engine should not care whether Vdsm was restarted, as long as it is doing its job. I'd rather consolidate getSpmStatus into getVdsStats, so Engine simply sees: oops, this Vdsm lost SPMness.

If Engine HAS to know that vdsmd has restarted, I suggested reporting a generationID: a unique string that is effective as long as Vdsm is alive, and is changed when it is restarted.
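The generationID idea could be sketched as below. This is an assumption-laden illustration, not the real vdsm or engine code: the classes and the `generationID` stats key are invented here purely to show the comparison logic by which a restart would be detected.

```python
# Hypothetical sketch of the generationID proposal: vdsm generates a fresh
# unique string at every service start and reports it in its stats; the
# engine compares it with the last value it saw to detect a restart.
# VdsmService and Engine are illustrative names, not real API.
import uuid


class VdsmService:
    """Stands in for a vdsm process; a new instance means a new process."""

    def __init__(self):
        # Created once per process lifetime; changes only on restart.
        self.generation_id = str(uuid.uuid4())

    def get_vds_stats(self):
        return {'generationID': self.generation_id}


class Engine:
    """Stands in for RHEV-M polling a host."""

    def __init__(self):
        self.last_generation = None

    def poll(self, vdsm):
        stats = vdsm.get_vds_stats()
        restarted = (self.last_generation is not None and
                     stats['generationID'] != self.last_generation)
        self.last_generation = stats['generationID']
        return restarted


engine = Engine()
vdsm = VdsmService()
print(engine.poll(vdsm))  # False: first contact, nothing to compare against
print(engine.poll(vdsm))  # False: same process, same generationID
vdsm = VdsmService()      # simulate "service vdsmd restart"
print(engine.poll(vdsm))  # True: generationID changed, restart detected
```

Note that for this to survive an engine restart, the last seen generationID would also have to be persisted on the engine side, which is the DB change discussed later in the thread.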
(In reply to comment #17)
> I think we've discussed this somewhere else several months ago. IMO it is wrong
> for Vdsm to wait for its (very important!) client in order to start working.

No need to wait with the restart; VDSM should continue to work, only the recovering status should be cleared upon the first VDS status read.

> Worse - I think that Engine should not care if Vdsm was restarted, as long as
> it is doing it job. I'd rather consolidate getSpmStatus in getVdsStats, so
> Engine simply sees: oops, this Vdsm lost SPMness.

Sorry, that will not pass audit; this is an event that should be reported like any other issue in the system.

> If Engine HAS to know that vdsmd has restarted, I suggested to report a
> generationID: a unique string that is effective whenever Vdsm is alive, and is
> changed when it is restarted.

That may be the right solution, but it requires a change in the API, right?
(In reply to comment #18)
> (In reply to comment #17)
> > I think we've discussed this somewhere else several months ago. IMO it is wrong
> > for Vdsm to wait for its (very important!) client in order to start working.
>
> No need to wait with the restart, VDSM should continue to work, only the
> recovering status should be cleared upon first VDS status read.

This is what I call 'waiting'. Staying in the 'recovery' state until the client uses a read-only verb (getVdsStats), and expecting this to change the vdsm state, is an awkward semantic.

> > Worse - I think that Engine should not care if Vdsm was restarted, as long as
> > it is doing it job. I'd rather consolidate getSpmStatus in getVdsStats, so
> > Engine simply sees: oops, this Vdsm lost SPMness.
>
> Sorry, will not pass audit, this is an event that should be reported as any
> other issue in the system.

Whose audit? Why do we need to report the fact that there is a new process implementing the Vdsm service? Why is reporting SPMness not enough for this bug?

> > If Engine HAS to know that vdsmd has restarted, I suggested to report a
> > generationID: a unique string that is effective whenever Vdsm is alive, and is
> > changed when it is restarted.
>
> May be the right solution, but this requires change in the API right?

Correct. You want new functionality; either overload an existing API with new semantics, or add a new API.
Why does a VDSM restart cause the host to disconnect from the storage domains/pool in the first place?

Regardless of the above question: VDSM is currently dependent on the engine for the host configuration (connection to the storage domains/pools), and the engine has to identify that VDSM performed a service restart.

I think Dan's suggestion of adding a generationID is a nice idea. It requires a change not only to the API but also to the DB (this has to be persisted to survive a RHEVM restart), so it is not z-stream material. Simon's idea of holding the recovery state until the first stats poll could be z-stream material, but I agree with Dan that it is not a good option either.

I suggest that for z-stream this be fixed in VDSM. Long term, we can change the API to support generationID and use it.
(In reply to comment #20)
> I suggest that for z-stream this will be fixed in VDSM. long term we can change
> the API to support generationID and use it.

Dan?
The Red Hat process dictates that we first solve this upstream properly, then do whatever we have to for z-stream. And I would really hate it if z-stream required introducing bizarre semantics for getVdsStats.
Following Dan's comment and after consulting with Michael Kublin, we would like to solve this with Dan's suggestion (but this requires an API change).
Closing old bugs. If this issue is still relevant/important in current version, please re-open the bug.
Reopening; this looks very relevant to me.
I don't think it is relevant any more. If this hits us again, then we'll handle it. Moving to CLOSED NOTABUG.