Bug 1040507
Summary: Inconsistent blockInfo values when the VM is finalizing live migration

| Field | Value | Field | Value |
|---|---|---|---|
| Product | Red Hat Enterprise Linux 6 | Reporter | Federico Simoncelli <fsimonce> |
| Component | libvirt | Assignee | John Ferlan <jferlan> |
| Status | CLOSED ERRATA | QA Contact | Virtualization Bugs <virt-bugs> |
| Severity | high | Priority | high |
| Version | 6.5 | Target Milestone | rc |
| Target Release | --- | Keywords | ZStream |
| Hardware | Unspecified | OS | Unspecified |
| Fixed In Version | libvirt-0.10.2-30.el6 | Doc Type | Bug Fix |
| Clone Of | 1038069 | Cloned To | 1065531 (view as bug list) |
| Last Closed | 2014-10-14 04:18:58 UTC | Type | Bug |
| CC | amureini, bazulay, cpelland, djasa, dyuan, eblake, fsimonce, iheim, jferlan, jiahu, lpeer, michal.skrivanek, mjenner, mzhan, rbalakri, scohen, tlavigne, ydu, yeylon, zpeng | | |
| Bug Blocks | 1002699, 1038069, 1059904, 1065531 | | |
Comment 1
Federico Simoncelli
2013-12-11 17:47:10 UTC
Hmm. I think the blockInfo code in qemuDriver always attempts a qemu monitor query, then gracefully falls back to hardcoding capacity = allocation if qemu didn't answer. This works around older qemu that can't query capacity, but differs from most commands, which fail if the domain has quit in the meantime. So maybe we need to update our qemu_capabilities matrix to determine whether qemu is new enough to support the capacity query, then fix the code to only attempt the query when capable, rather than always. That way we can treat qemu failure as command failure rather than as a fallback path, all without losing the fallback path when qemu is not capable.

For that matter, I seem to recall that we didn't like querying qemu on every blockInfo request anyway, because it could block the query command; so in newer qemu we hooked up an event that fires any time the block info changes. In that situation, the query never goes out to qemu, but instead uses the data cached from the last time the event fired. I wonder if this is a case of not having the newer event code backported to the qemu/libvirt version combo being tested; but it may also be a case of libvirt losing the cached data when the qemu monitor goes away. This is all off the top of my head, without actually looking at the code, but from my guess it sounds like this is libvirt's problem to solve.

Since you have an environment ready/working and I don't - is it possible to enable some libvirtd debugging? As described here:

http://libvirt.org/logging.html

I think setting:

    log_level = 1
    log_filters="3:remote 4:event 3:json 3:rpc"
    log_outputs="1:file:/var/log/libvirt/libvirtd.log"

should do the trick. I don't need too much data, just a single pass through and the libvirtd.log output when you're getting back the "bad values". What I'm looking to determine is the order of things.
It's not perfectly clear from the description, but it seems, from the libvirt side of things at least, that there's a "migration" followed by a "get block info" on the domain. When that get is done, allocation == physical, which is unexpected from your point of view. Because of that, "some code" is triggered to extend the LV on/for the storage of the source domain.

From a scan of the libvirt code, what I see is the following: qemuDomainGetBlockInfo() will set allocation == physical, then...

    /* ..but if guest is running & not using raw disk format and on a
       block device, then query highest allocated extent from QEMU */

If we get past that test, then we begin a job to query the monitor for data:

    qemuMonitorGetBlockExtent(... &allocation)

That will (hopefully) call the JSON parsing code, qemuMonitorJSONGetBlockExtent(), which issues a "query-blockstats" query in order to attempt to get the "wr_highest_offset" from the response and return it.

If the libvirtd.log doesn't show a path into this code when you get a return of allocation == physical, then perhaps that has to do with (as I think someone alluded to, either in this case or the one blocked by this) an ordering and synchronization issue of "when" the data can/should be fetched.

Based on what I read here and on a discussion I had with Dave Allan, there's a 'thread' that continually gets the blockInfo for running guests. That thread isn't synchronized with the actual migration start/end - it's just getting data for running guests with the goal of making sure to do an LVM extend when/if allocation gets too close to physical. Suffice to say libvirt isn't doing the extend, right?

Without getting too technical - when a guest is migrating from source to target, it goes through a sequence of steps to copy memory, storage, etc. from source to target. Eventually it reaches a point in time when things need to "freeze", because control needs to change from the source to the target.
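The flow described above can be condensed into a small sketch (a Python stand-in for the C code; the dict-based vm state and helper names are illustrative, not libvirt's actual internals):

```python
def get_block_info(vm, query_monitor):
    """Mimic qemuDomainGetBlockInfo's fallback: default allocation to
    physical, then try qemu's wr_highest_offset for qualifying guests.

    vm: dict with 'capacity', 'physical', 'running', 'format',
        'is_block_device' (a simplified stand-in for the domain object).
    query_monitor: callable standing in for qemuMonitorGetBlockExtent();
        returns the highest written offset, or None if qemu can't answer.
    """
    info = {
        "capacity": vm["capacity"],
        "physical": vm["physical"],
        "allocation": vm["physical"],  # default: assume fully allocated
    }
    # ..but if guest is running & not using raw disk format and on a
    # block device, then query highest allocated extent from QEMU
    if vm["running"] and vm["format"] != "raw" and vm["is_block_device"]:
        extent = query_monitor()  # query-blockstats -> wr_highest_offset
        if extent is not None:
            info["allocation"] = extent
    return info
```

Note that when the monitor query silently fails (returns None here), the caller cannot distinguish "fully allocated" from "qemu was already gone" - which is exactly the ambiguity this bug is about.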
During this time, "last minute" dirty blocks and/or storage changes can be copied. Once the "final bits" are copied over, the target "accepts" responsibility for the running guest, and thus the source can/will destroy the guest. Technically it's told to, but that's a minor detail. The guest on the source still needs to be "on", or at least available to be restarted, because if something goes wrong on the target, the source will reassert control. However, any query to the source in this period would have "limitations". I also think "expecting" to somehow capture (cache) the "last known state" of every data point prior to making that cutoff, so that some (unknown) query afterwards gets the last set of data at that point in time, could be (to say the least) a never ending task...

Anyway, it seems perhaps you are indicating that within this window of time, when the blockInfo thread runs it determines the guest is still running (since it's not destroyed); however, the data it returns has allocation == physical, which triggers some other piece of code to extend the LV. Does this seem to be a "fair assessment"?

Note that the data you are requesting can also be fetched via a 'virsh domblkinfo' on the guest and storage. Whether a guest is running or not, both allocation and physical are returned:

    # virsh domstate rh65 --reason
    shut off (destroyed)

    # virsh domblkinfo <domain> <storage>
    Capacity:       10737418240
    Allocation:     913641472
    Physical:       913641472

So it's not like Allocation could "not be" returned, or "some error" would indicate a timing window between migration complete and destroy. To a degree, libvirt is returning perfectly valid data. What seems to be desired is some indication that a migration is taking place, so that the caller can special-case what to do based on that. That would require a change not only in libvirt, but in the caller (unless there's some other event option I'm overlooking/missing).
Even if a flag were added, the blockInfo thread still has to get some knowledge that a migration is either "in process" or at some point in time where things are frozen before the eventual domain destroy. I'm not sure if there's a libvirt mechanism in place for that (I don't know all the bells and whistles available in/for a libvirt/qemu migration).

So, perhaps it may be a useful exercise in the blockInfo thread to call virDomainGetState(), ensuring to also get the returned reason. Then, if the state from virDomainState is VIR_DOMAIN_PAUSED, check the virDomainPausedReason flags and see if VIR_DOMAIN_PAUSED_MIGRATION is set. If that's true, then although the domain is "not stopped", it's also not really in any condition to return the "exact" blockinfo data.

Cut-n-paste from some email conversation that has taken place and shouldn't be lost:

On 12/16/2013 08:48 AM, Federico Simoncelli wrote:
> ----- Original Message -----
>> From: "John Ferlan" <jferlan>
>> To: "Federico Simoncelli" <fsimonce>, "Ayal Baron" <abaron>
>> Cc: "Dave Allan" <dallan>, "Eric Blake" <eblake>
>> Sent: Sunday, December 15, 2013 11:43:50 PM
>> Subject: Re: [Bug 1040507] New: Inconsistent blockInfo values when the VM is finalizing live migration
>>
>> On 12/15/2013 03:37 PM, Federico Simoncelli wrote:
>>> ----- Original Message -----
>>>> From: "Ayal Baron" <abaron>
>>>> To: "John Ferlan" <jferlan>
>>>> Cc: "Dave Allan" <dallan>, "Eric Blake" <eblake>, "Federico Simoncelli" <fsimonce>
>>>> Sent: Sunday, December 15, 2013 5:25:39 PM
>>>> Subject: Re: [Bug 1040507] New: Inconsistent blockInfo values when the VM is finalizing live migration
>>>>
>>>>> Did you check the bug report for my analysis/response?
>>>>
>>>> Yes.
>>>> Please get access from Fede directly to a machine exhibiting this.
>>>> It appears it is pretty easy to reproduce.
>>>>
>>> Yes, feel free to ping me whenever you want.
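The suggested state check could look like this from the polling thread's side (a Python sketch; the enum values are inlined from libvirt's virDomainState/virDomainPausedReason so it runs without the libvirt bindings, and safe_to_query_blockinfo is a hypothetical helper, not a libvirt API):

```python
# Enum values inlined from libvirt (virDomainState / virDomainPausedReason)
# so this sketch is self-contained.
VIR_DOMAIN_RUNNING = 1
VIR_DOMAIN_PAUSED = 3
VIR_DOMAIN_PAUSED_MIGRATION = 2

def safe_to_query_blockinfo(state, reason):
    """Return True only when blockInfo results can be trusted.

    A domain paused for migration is "not stopped", but it's also not
    in any condition to return exact blockinfo data, so skip it.
    """
    if state == VIR_DOMAIN_PAUSED and reason == VIR_DOMAIN_PAUSED_MIGRATION:
        return False
    return state == VIR_DOMAIN_RUNNING
```

With the real libvirt-python bindings, the inputs would come from `state, reason = dom.state()`.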
>>> Anyway, as far as I understood, the cause is pretty clear and now it's
>>> just a matter of when/how it will be fixed.
>>>
>>
>> So is there any chance of attempting a change to the blockInfo
>> fetching thread on your end like I suggested in the case? I need some
>> feedback on what I put in the BZ to have a chance at determining a root
>> cause.
>
> Ah sorry, I assumed that last time me and Eric actually pinpointed the
> root cause. If there's some analysis to be completed then please get in
> contact with David Jasa (added to this thread) as I have never accessed
> the machines.
>

News to me. Eric's notes in the case are correct to the degree of what happens under "normal circumstances"... Since it seems at one time the thread was getting "real data", I don't believe this is a capabilities issue (although I could be wrong, as Eric certainly knows the code better than I do).

My belief is that the inability to reproduce 100% of the time is because the timing of the blockInfo thread fetch and the destroy of the guest on the source after the migration aren't "synchronized". When the blockInfo thread goes to get the data, it perhaps finds the guest "not down" and thus will inquire. Because the guest is frozen (in transit of a migration), the qemu/json calls to get the "real" blockInfo data won't occur, leaving allocation == physical, because 'virDomainObjIsActive()' finds the domain id == -1 on the source host.

So the question becomes - how does the blockInfo thread ascertain when to attempt to fetch the blockInfo from the guest? Even if you check for more detailed domain state, as I noted at the end of my BZ response, there's still the possibility of a timing window between the check and the actual change in libvirt that sets the guest in a "frozen" state and/or down... In the long run, libvirt isn't extending the LV.
Yes, the choice to do that is made because of a return value from libvirt; however, if my assumption regarding timing is correct, it's not like the code could change to return an error. That could trigger other issues... The return is just as if the guest wasn't active, which, while not expected by the blockInfo thread, is to a degree correct.

> VDSM already runs with a quite verbose configuration for libvirt so you
> might be able to find all you need in /var/log/libvirt/libvirtd.log*.
>

If it's not already there - considering I don't know the VDSM code - I would think it would be easier for someone that knows/understands the code to at least check and see how the blockInfo thread goes about its business.

> Eventually you can change the config and I am sure David will assist you
> with the tests he was running.
>

I would think changing libvirtd.conf is fairly simple, and rerunning the test with that change to get the requested output would take someone who understands the environment a few extra cycles.

John

-=-=-=-=-=-=-=-

On 12/16/2013 11:01 AM, Ayal Baron wrote:
<...snip...>
>
> John, cmiiw, but it is my understanding that this behaviour (returning allocation == physical when domain id == -1) is the wrong behaviour (regardless of migration), since it is inconsistent with the behaviour of the running VM.
> It is not ok to fall back to this behaviour when the vm is not running and give the new behaviour when it is, otherwise you can have a situation where:
> 1. get allocation when VM down <- returns physical
> 2. run VM and get again <- returns high write
> 3. stop VM and get again <- returns physical
>
> If you know that qemu supports high write then you should either report high write or report failure. Reporting inconsistent values doesn't seem ok to me.
> We'd be fine with explicitly requesting the new behaviour as long as it's consistent and is exposed in the python bindings, if you're concerned about backward compatibility (although I can't see where inconsistent values would be ok for any client).
>

Whether changing a "historical" default is the "right thing" to do requires a bit of thought. There are numerous checks in the existing code to "fail" if the domain isn't running. However, there are also instances where the check is made and, if not true, a default value is provided. For example:

qemuDomainGetInfo:
-> cpuTime is 0 if not running, or gets a "real value" if running
-> memory varies on the "memballoon"

qemuDomainGetSchedulerType:
-> Different/default cgroup values are returned if not running (see virsh schedinfo). Those values are 0; 0 isn't necessarily correct, but it's returned...

So if, as you assert, the right thing to do is to return an error if the guest is not running, then we need to consider the instances where an error return is not expected, and how we handle those. If one goes by the theory that qemuDomainGetInfo() returns a non-zero value for 'Used Memory' (see virsh dominfo <domain> on a non-running domain), then the same theory could be applied to why allocation == physical for a non-running domain. Theoretically, the non-running domain isn't using the memory, but it's displayed as if it were.

If a volume exists, we can get some basic information about it. The more specific information is achievable through some qemu call "if" the guest is running. Currently, if a VM is off the allocation can be displayed; see the bz for an example:

    # virsh domstate rh65 --reason
    shut off (destroyed)

    # virsh domblkinfo rh65 <path-to-storage>
    Capacity:       10737418240
    Allocation:     913641472
    Physical:       913641472

So, with the "requested" change in functionality, we'd get an error here, and that, I would think, would be unexpected.
Now, I'm not against changing the default, but understanding the ramifications of that type of change is important. There has to be "some reason" that a default was decided upon. There's nothing in the code/documentation that indicates that the domain must be running in order to get values - just that defaults are supplied in the event the code cannot handle the very particular case where we can pull the data from qemu via a qcow/lvm query.

Looking at the history of the code...

The code was originally added in commit id 'db57a7be':
http://libvirt.org/git/?p=libvirt.git;a=commit;h=db57a7be

The code was changed to attempt to fetch the extent value as part of commit id 'ebb0c19c':
http://libvirt.org/git/?p=libvirt.git;a=commit;h=ebb0c19c

There's nothing in either change that indicates to me that it's expected to have an error if the domain is not running, as you're getting information about a storage volume for a domain - a volume that does exist...

An alternative that could be more easily done is to add a flag that indicates to return an error if the domain isn't running. At least that way we'll know that the caller is expecting the error. Of course, that's akin to having the blockInfo thread check if the domain is running before asking. Changes would have to be made to libvirt, then vdsm would have to rely on those changes being applied. Then vdsm would have to change the way it calls the function to add the flag.

John

IMO: As much as it's a pain to do, conversation should be in the BZ, not via email - that way history is better kept.

> An alternative that could be more easily done is to add a flag that
> indicates to return an error if the domain isn't running. At least that
> way we'll know that the caller is expecting the error. Of course that's
> akin to having the blockInfo thread check if the domain is running
> before asking. Changes would have to be made to libvirt, then vdsm would
> have to rely on those changes being applied. Then vdsm would have to
> change the way it calls the function to add the flag.
>
This sounds like the best solution to me.
We have no problem with explicitly sending such a flag from vdsm; it would both retain backward compatibility and make explicit what behaviour we expect.
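The opt-in flag John proposed - named VIR_DOMAIN_BLOCK_CHECK_ACTIVE in the comment 11 patch, which was never merged upstream, so the value below is purely illustrative - would behave roughly like this Python sketch:

```python
class DomainNotRunningError(Exception):
    """Stands in for the libvirt error the flag would trigger."""

VIR_DOMAIN_BLOCK_CHECK_ACTIVE = 1 << 0  # proposed, never-merged flag

def block_allocation(qemu_extent, physical, flags=0):
    """With the flag set, a failed qemu query becomes a hard error
    instead of silently falling back to allocation == physical."""
    if qemu_extent is None:  # qemu gone: domain inactive or mid-migration
        if flags & VIR_DOMAIN_BLOCK_CHECK_ACTIVE:
            raise DomainNotRunningError("domain is not running")
        return physical  # historical default
    return qemu_extent
```

This keeps backward compatibility: callers that don't pass the flag see the old fallback, while VDSM's blockInfo thread would pass it and treat the error as "don't extend the LV".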
And what type of information are you looking to get? I see no question asked, unless the implied question is how long it would take... For which the answer is - not sure yet. Although it would be good to have some of the data regarding libvirtd.log from comment 4, so that I can feel better about the assumption I'm working under.

Digging deeper into the history of the code shows a few more commits that are quite telling in this area...

Changes as a result of issues during migration:
http://libvirt.org/git/?p=libvirt.git;a=commit;h=054d43f5

Attempt to add the ability to fetch during a migration:
http://libvirt.org/git/?p=libvirt.git;a=commit;h=18c2a592

Fixes as a result of regressions caused by the previous commit:
http://libvirt.org/git/?p=libvirt.git;a=commit;h=2027e184

Removal of the special case code in blockinfo fetch:
http://libvirt.org/git/?p=libvirt.git;a=commit;h=fb3cada0

Another adjustment based on the state of the monitor:
http://libvirt.org/git/?p=libvirt.git;a=commit;h=193cd0f3

Suffice to say it's a tricky timing area, and it certainly seems the varied algorithms tried over the years to get a specific answer in a specific corner case haven't always been successful. Resolving an edge condition is never a straightforward exercise.

It's really too bad your blockInfo thread isn't a bit more stateful with respect to knowing that a migration is taking place on the domain, so it could choose not to query. Once the domain reaches its target (or not, perhaps) and (re)starts again, it should be able to return the data (on whichever host it lands).

I did post a possible patch upstream regarding this case. However, it met with the same concerns that I've voiced in this case regarding why the app (in this case the blockInfo thread) can't have a bit more state knowledge and not make the request on a domain that's not up.
See: http://www.redhat.com/archives/libvir-list/2013-December/msg00984.html

(In reply to John Ferlan from comment #4)
> Since you have an environment ready/working and I don't - is it possible to
> enable some libvirtd debugging? As described here:
>
> http://libvirt.org/logging.html
>
> I think setting:
>
> log_level = 1
> log_filters="3:remote 4:event 3:json 3:rpc"
> log_outputs="1:file:/var/log/libvirt/libvirtd.log"
>
> Should do the trick. I don't need too much data just a single pass through
> and the libvirtd.log output when you're getting back the "bad values".
>

I did several back-and-forth migrations between 16:17 and 16:22 (till Actual Size in the RHEV portal grew by 1 GB). VM name and uuid: w7x32-33-2-dj, de193671-ae6b-42bd-883e-8705b66770d0.

The fix wasn't accepted. I started looking at the debug log information before the holiday break, but haven't been able to get back to it yet. It's too bad the data has multiple domains intermixed, as it makes it difficult & time consuming to pore through it looking for the data specific to the call. So far I haven't seen anything that indicates more than what I've already surmised regarding the data returned and the state of the migrated-from domain.

(In reply to John Ferlan from comment #19)
> The fix wasn't accepted. I started looking at the debug log information
> before the holiday break, but haven't been able to get back to it yet. It's
> too bad the data has multiple domains intermixed, as it makes it difficult &
> time consuming to pore through it looking for the data specific to the
> call. So far I haven't seen anything that indicates more than what I've
> already surmised regarding the data returned and the state of the
> migrated-from domain.

As I suggested upstream:
https://www.redhat.com/archives/libvir-list/2013-December/msg01145.html
it seems that part of the problem is a race condition.
Libvirt treats a transient domain as non-transient when qemu dies before returning the virDomainGetBlockInfo values. Getting rid of the race would make libvirt return either a valid value for virDomainGetBlockInfo or VIR_ERR_NO_DOMAIN (Domain not found). In short, virDomainGetBlockInfo for a transient domain is valid only when the data comes from qemu.

John, I think the distinction Federico was trying to make is that the behaviour for transient domains should be different from the behaviour for non-transient domains. In the transient case, it doesn't make sense to report any value when there is no domain, since it is, well, transient. So you can basically keep the current behaviour for persistent domains while reporting only info from qemu for transient domains.

Correct on why allocation = capacity.

Not sure what you mean by "it doesn't exist". There's a qemu process associated with it. I'm missing what you're driving at, so please be more explicit. An on-disk configuration doesn't exist, but something had to create the domain. The XML for a transient domain doesn't get saved, but in order to create it, the caller still has to provide XML to define things. I think http://wiki.libvirt.org/page/VM_lifecycle defines things pretty well.

When you call virDomainGetBlockInfo on a *persistent* vm which is *not* running, you get allocation=physical.

When you call virDomainGetBlockInfo on a *persistent* vm which *is* running, you get high write.

If you happen to catch the domain when qemu has stopped running but libvirt still has a reference to it, then it will return allocation=physical to mimic the state where the VM is not running. So far, the behaviour is totally reasonable.
Now, if you call virDomainGetBlockInfo on a *transient* vm which is *not* running, you will get an error that libvirt does not know this vm.

So it stands to reason that the behaviour you need to mimic in this case, when you hit the race of the VM being migrated right before it is removed from libvirt's references, is to treat it the same as when it is down = return error. Does this make sense?

Makes sense for a specific corner case where you have all this knowledge built up. Makes sense for what you're trying to accomplish. I'm not disagreeing - it takes time to investigate and figure out the conditions/timing. Unfortunately I keep coming back to taking the code standing alone, without any of this knowledge. All it knows is that it can fetch the high water mark for a running domain that's using a specific storage type. If that's not possible - return something, because that's what the API dictates and has done for multiple releases.

I thought the best scenario was a specific error for the specific case, if the calling application desires that knowledge. That wasn't accepted initially - perhaps now, with more information, it can be acceptable. That conversation still needs to take place, but other things are interfering with it. Hopefully early next week that conversation will be held.

The failure you mention about calling on a transient vm which is not running and receiving an error occurs much earlier in the stack. Once the higher level code checks are made, the process is set in motion. Adding redundant checks in lower level code is a never ending battle.

Considering Comment 26 - would it be viable to only return a failure for a transient domain?
For a persistent domain you'd need something like:

    if (virDomainIsActive()) {
        /* Perhaps other collections, but the one we want: */
        virDomainGetBlockInfo();
        if (allocation == capacity) {
            if (!virDomainIsActive())
                /* don't trust results */
        }
        /* Perhaps other collections */
    }

NOTE: The first virDomainIsActive() could also be the result of a 'virConnectListAllDomains()' to get a list of all active/transient domains.

But for a transient domain, if virDomainGetBlockInfo() runs into this condition, return a failure. I'm guessing it "works" for RHEV - but the question becomes a more generic one: what are the downsides?

Consider the code mentioned in Comment 11, remove the concept of a new flag VIR_DOMAIN_BLOCK_CHECK_ACTIVE, and replace the check:

    + if (activeFail && (flags & VIR_DOMAIN_BLOCK_CHECK_ACTIVE)) {

with

    + if (activeFail && !dom->persistent) {

using a common error message, "domain is not running". Thus, for a transient domain, if you're asking for block info of this special type of device and qemu cannot provide it because the domain is no longer active, then there's an error. Similarly, if you asked right after the domain was removed, there would be an error, even if some initial virDomainGetBlockInfo() were made on a transient domain that had been active in a just-executed virDomainIsActive().

For a persistent domain, I'm not sure what we can do other than document the code snippet and say it's up to the application that cares to make the extra check, since the default is to provide allocation == capacity when the domain isn't running.

I put the needinfo on Ayal, but moving it to Federico also works. What I'm looking to make certain is that there's no downside to returning an error, and that doing so just for transient domains is 'good enough'.

(In reply to John Ferlan from comment #28)
> Considering Comment 26 - would it be viable to only return a failure for a
> transient domain?

Yes, I think that works.
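The transient-only variant can be sketched the same way (a Python stand-in, not the actual C patch; `persistent` mirrors the `dom->persistent` check):

```python
def block_allocation(qemu_extent, physical, persistent):
    """Transient domains error out when qemu can no longer answer;
    persistent domains keep the historical allocation == physical."""
    if qemu_extent is None:  # the activeFail case: qemu query impossible
        if not persistent:
            raise RuntimeError("domain is not running")
        return physical
    return qemu_extent
```

Since VDSM only runs transient domains, it always gets either real qemu data or an error - never the ambiguous fallback.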
Existing behavior for persistent domains is unchanged (the race is still there for whether the domain went offline, but you already have to deal with that race), while VDSM only uses transient domains (when the domain is no longer running, we can't report stats, because the stats we care about only exist while the domain is running).

> But for a transient domain, if virDomainGetBlockInfo() runs into this
> condition, return a failure. I'm guessing it "works" for RHEV - but the
> question becomes a more generic what are the downsides?

I'm not seeing any downsides from libvirt's perspective - let's go with the approach.

(In reply to Eric Blake from comment #29)
> (In reply to John Ferlan from comment #28)
> > Considering Comment 26 - would it be viable to only return a failure for a
> > transient domain?
>
> Yes, I think that works. Existing behavior for persistent domains is
> unchanged (the race is still there for whether the domain went offline, but
> you already have to deal with that race); while VDSM only uses transient
> domains (when the domain is no longer running, we can't report stats because
> the stats we care about only exist when the domain is running).
>
> > But for a transient domain, if virDomainGetBlockInfo() runs into this
> > condition, return a failure. I'm guessing it "works" for RHEV - but the
> > question becomes a more generic what are the downsides?
>
> I'm not seeing any downsides from libvirt's perspective - let's go with the
> approach.

Perfect, thanks.

This bug is hard to reproduce under a pure libvirt environment (according to the descriptions in comment 9 and comment 11 of bug 1059904), so we can verify it by checking the libvirt code. Checking libvirt-0.10.2-32.el6.src.rpm, the code includes the patch, so this bug is marked verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1374.html