Created attachment 771012 [details]
vdsm log with the problem

Description of problem:
"connection to libvirt broken. taking vdsm down. ecode: 38 edom: 7" is logged even though libvirt is up. VDSM receives ECODE=libvirt.VIR_ERR_SYSTEM_ERROR with EDOMAIN=libvirt.VIR_FROM_RPC and takes itself down although libvirt is up and running.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Run the automation test with a migration network
2. The test creates a migration network on DC/Cluster/Host and tries to perform a migration
3.

Actual results:
We get "connection to libvirt broken. taking vdsm down" and the test fails

Expected results:
VDSM should not be taken down and the test should pass

Additional info:
Created attachment 771013 [details] libvirt log
Genadi, I need the scripts so I can reproduce this bug.
Created attachment 774224 [details] libvirt log on the host from which you perform migration
Created attachment 774227 [details] libvirt log on the host to which you perform migration
Created attachment 774228 [details] vdsm log on the host from which you perform migration
We examined the vdsm code related to this. If VDSM gets a VIR_ERR_SYSTEM_ERROR error from libvirt it assumes the connection is gone and takes itself down, and as far as we understand that is the intended behaviour. The origin of the problem is therefore probably libvirt, which throws VIR_ERR_SYSTEM_ERROR, and not VDSM. It is not clear how to proceed from here; IMHO this is a libvirt bug.
Could be a libvirt issue. We restart vdsm only if we get VIR_FROM_REMOTE or VIR_FROM_RPC together with VIR_ERR_SYSTEM_ERROR; when libvirt throws an exception with those codes, vdsm terminates itself. But maybe vdsm should restart here in order to re-establish communication with the new network configuration after the migration, and the test should rely on that and resend the request (or whatever the test does). How did it work before? Did libvirt always throw this exception after the migration? Is it a regression?
Migration network is a new feature: it lets you pick a specific network for migration. Until now migration always took place on the management network and no such issue appeared, which is why this is not a regression. Today, to work around the problem in my tests, I put a sleep of about 30 seconds between the creation of the migration network and the migration itself, and the issue disappears, so it looks like a timing issue.
Interesting, so after 30 seconds libvirt no longer replies with the exception and everything goes back to normal? We might need a delay before sending the SIGTERM. We should consult libvirt's developers; maybe libvirt can throw a different exception in that case. If not, we should add a waiting mechanism.
Genadi, can you please describe the test flow? If a 30-second sleep solves the problem in the test, then the problem might be in the engine (depending on the flow).
The working flow is as follows:
I create a new network with the migration flag on and put it on the DC, Cluster and Host with the setupNetworks action.
Then I wait for 30 seconds and only after that I start the migration on the newly created network.
Removing the 30-second sleep causes the problem described.
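For illustration only, a minimal sketch of that flow in Python; the helper functions below (create_migration_network, setup_networks, migrate_vm) are hypothetical placeholders, not the real automation-library API.

    # Hypothetical sketch of the test flow with the 30-second workaround.
    # create_migration_network, setup_networks and migrate_vm are placeholder
    # names, not the real test-library functions.
    import time


    def create_migration_network(dc, cluster):
        """Placeholder: create a network with the migration flag on DC/Cluster."""
        return 'migration_net'


    def setup_networks(host, network):
        """Placeholder: attach the network to the host via setupNetworks."""


    def migrate_vm(vm, network):
        """Placeholder: start the VM migration over the given network."""


    def run_flow(dc, cluster, host, vm):
        network = create_migration_network(dc, cluster)
        setup_networks(host, network)
        time.sleep(30)  # workaround: without this sleep the bug reproduces
        migrate_vm(vm, network)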
(In reply to GenadiC from comment #11)
> The working flow is as follows:
> I create a new network with the migration flag on and put it on the DC,
> Cluster and Host with the setupNetworks action.
> Then I wait for 30 seconds and only after that I start the migration on the
> newly created network.
> Removing the 30-second sleep causes the problem described.

So IIUC the migration network is already up and running on both hosts before you actually call the migration. Did you verify that?
Yes. When the setupNetworks command completes its action we have the migration network on both hosts (the GUI shows the network attached to the host).
1) I approached the libvirt guys on their IRC (#virt @ irc.devel.redhat.com). They suggested a solution and I will submit a patch using it. I'm attaching the correspondence here:

moolit: Problem: In vdsm, we seem to be getting the same error (ecode: 38 edom: 7) from libvirt when: 1. the libvirt service was killed and 2. someone tried to use an unknown network
danpb: moolit: the error code is fundamentally not a reliable way to detect loss of connection to libvirtd. it is at best a heuristic. we added an explicit API virConnectRegisterCloseCallback for detecting connection death

2) As a minor note: when I crosscheck the vdsmd and libvirt logs (vdsm logs in local time and libvirt is 3 hours behind, in UTC), I see in the attached vdsmd.log that the first "ecode: 38 edom: 7" occurred at 15:47:36,247. In libvirtd.log, around the same time, I search for errors and see this:

2013-07-16 12:47:34.581+0000: 32744: error : virNWFilterDHCPSnoopEnd:2131 : internal error ifname "vnet0" not in key map

IIUC this supports the speculation that the bridge network is not yet up.
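For reference, a minimal sketch (not vdsm code) of how the close callback danpb mentions can be used from Python. It assumes the libvirt-python bindings, a reachable qemu:///system URI, and the libvirt default event loop; the callback is only delivered while an event loop is running.

    # Sketch only: register virConnectRegisterCloseCallback via libvirt-python
    # to be notified explicitly when the connection to libvirtd dies, instead
    # of guessing from error codes.
    import threading

    import libvirt


    def _run_event_loop():
        # Close callbacks are dispatched from the libvirt event loop.
        while True:
            libvirt.virEventRunDefaultImpl()


    def _on_close(conn, reason, opaque):
        # reason is one of the VIR_CONNECT_CLOSE_REASON_* constants
        print('libvirt connection closed, reason=%d' % reason)


    libvirt.virEventRegisterDefaultImpl()
    loop = threading.Thread(target=_run_event_loop)
    loop.daemon = True
    loop.start()

    conn = libvirt.open('qemu:///system')
    conn.registerCloseCallback(_on_close, None)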
Apparently we want to avoid rewriting the error handling code in libvirtconnection.py at this point. Genady, can you please attach your test scripts?
Genadi, it looks like the migration failed due to an unknown network on the destination host. Can you please run the same scenario, and this time do getVdsCaps and getVdsStats on the destination host before starting the migration (and, of course, remove the sleep entirely)?
(Side note: the attached vdsm log is from the source host. We also need to check the vdsm and libvirt logs from the destination; please try to provide one set from when the migration works and one from when it doesn't. It will be easier and more helpful to see the errors on both sides (source and destination). I am also attaching logs from reproducing the scenario, explained below.)

For the investigation: the address of the migration destination network is "1.1.1.2", so by following this address in the attached vdsm.log we see the following flow.

It starts with the command itself, with dstqemu set to 1.1.1.2, which should be set up on the destination:

Thread-185::DEBUG::2013-07-16 15:50:45,872::API::462::vds::(migrate) {'tunneled': 'false', 'dstqemu': '1.1.1.2', 'src': '10.35.128.10', 'dst': '10.35.128.12:54321', 'vmId': '2ad5d84e-7658-4af2-8f2e-d419a1c6d486', 'method': 'online'}

Then, because the "dst" address is valid and appears in the cluster, the migration starts:

Thread-186::DEBUG::2013-07-16 15:50:46,627::vm::349::vm.Vm::(_startUnderlyingMigration) vmId=`2ad5d84e-7658-4af2-8f2e-d419a1c6d486`::starting migration to qemu+tls://10.35.128.12/system with miguri tcp://1.1.1.2

Almost immediately we see the problem:

Thread-186::ERROR::2013-07-16 15:50:51,797::libvirtconnection::131::vds::(wrapper) connection to libvirt broken. taking vdsm down. ecode: 38 edom: 7

This means that vdsm received an exception from libvirt and is killing itself; it is followed by prepareForShutdown and a restart afterwards. Before the reset, we see the error:

Thread-186::ERROR::2013-07-16 15:50:55,887::vm::233::vm.Vm::(_recover) vmId=`2ad5d84e-7658-4af2-8f2e-d419a1c6d486`::unable to connect to server at '1.1.1.2:49154': No route to host

The migration network was not yet bound, hence "No route to host". This may be because it takes libvirt time to bind the network (which would explain why it works after a delay of 30 seconds).

Thread-186::ERROR::2013-07-16 15:50:55,958::vm::313::vm.Vm::(run) vmId=`2ad5d84e-7658-4af2-8f2e-d419a1c6d486`::Failed to migrate ..

In the attached libvirt log we don't see many errors (only 2013-07-16 12:50:50.119+0000: 32746: error : virNWFilterDHCPSnoopEnd:2131 : internal error ifname "vnet0" not in key map), which is odd, because when reproducing this scenario with an invalid "dstqemu" we get the following in the libvirt log:

2013-09-12 15:20:03.089+0000: 17955: error : virNetSocketNewConnectTCP:482 : unable to connect to server at '1.2.3.4:49156': Connection timed out
2013-09-12 15:20:03.496+0000: 17955: error : virNetClientProgramDispatchError:174 : An error occurred, but the cause is unknown

To summarize: libvirt throws the same error that vdsm treats as a broken connection to libvirt itself, and vdsm performs soft fencing (restarts itself) to handle the supposedly broken communication, which is wrong.

First we need to understand from the libvirt developers why libvirt doesn't report a different error code in this case (invalid miguri); then we might need to wait until the network is created on the destination before starting the migration.
Created attachment 797370 [details]
reproducer - libvirtd.log and vdsm log from src and dst hosts

Running on the src host:
vdsClient -s 0 migrate vmId="5b3812b7-d4ef-4cad-8aba-799ed6bb01b5" method=online src=localhost dst=10.35.0.149 dstqemu=1.2.3.4
Putting aside the libvirt error stuff (which we should continue to pursue anyway): in network_cluster we have a status column. The network people say the status becomes 'attached' after the network is attached to every host on the cluster. It seems we should check this status in migrateVmCommand and fail canDoAction, or alternatively create a rerun behavior if the status is not yet 'attached'.
(In reply to Mooli Tayer from comment #21)
> Putting aside the libvirt error stuff (which we should continue to pursue
> anyway): in network_cluster we have a status column. The network people say
> the status becomes 'attached' after the network is attached to every host on
> the cluster. It seems we should check this status in migrateVmCommand and
> fail canDoAction, or alternatively create a rerun behavior if the status is
> not yet 'attached'.

Another alternative is not to fail in canDoAction(); rather, in execute(), check whether the network is in 'operational' status and use a configurable wait/retry on that status. After that we can fail the command, or succeed if the status has changed.
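A minimal, language-agnostic sketch (written here in Python) of that configurable wait/retry idea; get_network_status below is a hypothetical placeholder for however the engine would query the network_cluster status, not a real engine API.

    # Sketch of a configurable wait/retry on a status value. The callable
    # passed in stands for whatever actually queries the network status;
    # get_network_status in the usage comment is a hypothetical placeholder.
    import time


    def wait_for_status(get_status, wanted='operational', retries=10, interval=3):
        """Return True once get_status() == wanted, False if retries run out."""
        for _ in range(retries):
            if get_status() == wanted:
                return True
            time.sleep(interval)
        return False


    # usage (illustrative):
    # if not wait_for_status(lambda: get_network_status(cluster, net)):
    #     fail the migrate command; otherwise proceed with the migration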
(In reply to Yaniv Bronhaim from comment #19)
> To summarize: libvirt throws the same error that vdsm treats as a broken
> connection to libvirt itself, and vdsm performs soft fencing (restarts
> itself) to handle the supposedly broken communication, which is wrong.
>
> First we need to understand from the libvirt developers why libvirt doesn't
> report a different error code in this case (invalid miguri); then we might
> need to wait until the network is created on the destination before starting
> the migration.

The error codes referred to above can be generated by any number of conditions. This applies to anything at all that is network related that libvirt can do. This case of migration to a non-existent network is just one of very many possible ways these error codes can arise. In addition, this code is called from many different areas of libvirt, and does not know anything about 'migration', which is a much higher-level concept than this code.

It is fundamentally incorrect to assume that ERR_SYSTEM_ERROR + FROM_RPC means that libvirtd is down. It is only a heuristic. When this condition is detected, you should *not* take action immediately. You must validate that your heuristic is actually correct. IOW this code in VDSM:

    except libvirt.libvirtError as e:
        edom = e.get_error_domain()
        ecode = e.get_error_code()
        EDOMAINS = (libvirt.VIR_FROM_REMOTE,
                    libvirt.VIR_FROM_RPC)
        ECODES = (libvirt.VIR_ERR_SYSTEM_ERROR,
                  libvirt.VIR_ERR_INTERNAL_ERROR,
                  libvirt.VIR_ERR_NO_CONNECT,
                  libvirt.VIR_ERR_INVALID_CONN)
        if edom in EDOMAINS and ecode in ECODES:
            log.error('connection to libvirt broken.'
                      ' ecode: %d edom: %d', ecode, edom)
            if killOnFailure:
                log.error('taking calling process down.')
                os.kill(os.getpid(), signal.SIGTERM)
        else:
            log.debug('Unknown libvirterror: ecode: %d edom: %d '
                      'level: %d message: %s', ecode, edom,
                      e.get_error_level(), e.get_error_message())

should *not* just blindly os.kill() itself. You need to do something like run 'virConnectGetLibVersion' - or some other trivial API that acts like a "ping" message - to test the state of the libvirt connection.
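A minimal sketch of the validation suggested above, assuming the libvirt-python bindings; the handle_libvirt_error wrapper is illustrative only and not the actual vdsm patch. The idea is to "ping" libvirtd with a trivial call such as getLibVersion() and terminate only if that ping also fails.

    # Sketch only: treat the error codes as a heuristic and confirm the
    # connection is really dead with a trivial "ping" call before killing
    # the calling process.
    import os
    import signal
    import logging

    import libvirt

    log = logging.getLogger('vds')

    EDOMAINS = (libvirt.VIR_FROM_REMOTE, libvirt.VIR_FROM_RPC)
    ECODES = (libvirt.VIR_ERR_SYSTEM_ERROR, libvirt.VIR_ERR_INTERNAL_ERROR,
              libvirt.VIR_ERR_NO_CONNECT, libvirt.VIR_ERR_INVALID_CONN)


    def connection_is_alive(conn):
        """Probe the connection with a trivial call that acts as a ping."""
        try:
            conn.getLibVersion()
            return True
        except libvirt.libvirtError:
            return False


    def handle_libvirt_error(conn, e, killOnFailure=True):
        edom = e.get_error_domain()
        ecode = e.get_error_code()
        if edom in EDOMAINS and ecode in ECODES:
            if connection_is_alive(conn):
                # False alarm: the error came from somewhere else (e.g. a
                # failed migration connection), not from a dead libvirtd.
                log.debug('libvirt error looked like a broken connection '
                          '(ecode: %d edom: %d) but libvirtd still responds',
                          ecode, edom)
                return
            log.error('connection to libvirt broken. ecode: %d edom: %d',
                      ecode, edom)
            if killOnFailure:
                log.error('taking calling process down.')
                os.kill(os.getpid(), signal.SIGTERM)
        else:
            log.debug('Unknown libvirterror: ecode: %d edom: %d level: %d '
                      'message: %s', ecode, edom, e.get_error_level(),
                      e.get_error_message())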
GenadiC, still waiting on the needinfo mentioned in comment #18. In addition, why do you define the migration network as optional?
1) The getVdsCaps output was already attached in the previous attachment.
2) What do you mean by an optional network? It's a regular network that has the migration flag turned on.
(In reply to GenadiC from comment #26)
> 1) The getVdsCaps output was already attached in the previous attachment.

According to the log (getVdsCaps):

Looks like the networks were set correctly both on src and dst.

But according to the below, the dst host was not reachable from src:

2013-09-16 13:57:20,625 - MainThread - network_migrationnetwork_migration.network_migration_test - INFO - Start migration from silver-vdsb.qa.lab.tlv.redhat.com
2013-09-16 13:57:20,625 - MainThread - network_migrationnetwork_migration.network_migration_test - INFO - Migrating VM over migration network
2013-09-16 13:57:20,626 - MainThread - network_migrationnetwork_migration.network_migration_test - INFO - Host is silver-vdsb.qa.lab.tlv.redhat.com
2013-09-16 13:57:20,627 - JobsSet - test_utils - INFO - TcpDump command to be sent: ['tcpdump', '-i', 'eth1', '-c', '1000', '-nn', 'dst', '1.1.1.1', 'and', '', 'tcp', 'and', 'src', '1.1.1.2']
2013-09-16 13:57:36,058 - JobsSet - test_utils - INFO - Tcpdump timed out (no packets passed the filter).
2013-09-16 13:57:36,088 - JobsSet - test_utils - INFO - Checking TcpDump output. RE Pattern: 1.1.1.2(.*)(.*)1.1.1.1(.*)
2013-09-16 13:57:36,089 - JobsSet - test_utils - WARNING - The traffic that was searched for was not found in tcpdump output
2013-09-16 13:57:51,233 - JobsSet - vms - INFO - vm status is 'up'
2013-09-16 13:57:51,233 - JobsSet - art.rhevm_api.tests_lib.low_level.vms - INFO - Getting the VM host after VM migrated.
2013-09-16 13:57:51,402 - JobsSet - art.rhevm_api.tests_lib.low_level.vms - ERROR - VM stayed on the same host
2013-09-16 13:57:51,402 - JobsSet - art.rhevm_api.tests_lib.high_level.vms - ERROR - Couldn't migrate VM VMTest1

I would say this is an environment issue rather than a programmatic one.

> 2) What do you mean by an optional network? It's a regular network that has
> the migration flag turned on.

From comment 16:
...
"'required': 'false',
...

Does it mean optional?
(In reply to Daniel Berrange from comment #24)
> (In reply to Yaniv Bronhaim from comment #19)
> It is fundamentally incorrect to assume that ERR_SYSTEM_ERROR + FROM_RPC
> means that libvirtd is down. It is only a heuristic. When this condition
> is detected, you should *not* take action immediately. You must validate
> that your heuristic is actually correct.

IIRC the callback for distinguishing a broken connection between vdsm and libvirt doesn't work without the keepalive flag on, or something else prevented us from using it (maybe it was not implemented in RHEL 6.4?); the way we do it today was picked after consulting it with you.

Can you explain in more detail why we have error codes + error domains when libvirt always returns the same numbers for network errors? Can't you provide specific codes for specific cases?

I prefer to handle this case without changing our "check broken connection" approach as part of this bug for vdsm 3.3. But if that's the case and you can't change the error code, please supply a safe and proven way to verify a broken connection, something that also works with previous versions of libvirt.
(In reply to Barak from comment #27)
> (In reply to GenadiC from comment #26)
> > 1) The getVdsCaps output was already attached in the previous attachment.
>
> According to the log (getVdsCaps):
>
> Looks like the networks were set correctly both on src and dst.
>
> But according to the below, the dst host was not reachable from src:
>
> 2013-09-16 13:57:20,625 - MainThread - network_migrationnetwork_migration.network_migration_test - INFO - Start migration from silver-vdsb.qa.lab.tlv.redhat.com
> 2013-09-16 13:57:20,625 - MainThread - network_migrationnetwork_migration.network_migration_test - INFO - Migrating VM over migration network
> 2013-09-16 13:57:20,626 - MainThread - network_migrationnetwork_migration.network_migration_test - INFO - Host is silver-vdsb.qa.lab.tlv.redhat.com
> 2013-09-16 13:57:20,627 - JobsSet - test_utils - INFO - TcpDump command to be sent: ['tcpdump', '-i', 'eth1', '-c', '1000', '-nn', 'dst', '1.1.1.1', 'and', '', 'tcp', 'and', 'src', '1.1.1.2']
> 2013-09-16 13:57:36,058 - JobsSet - test_utils - INFO - Tcpdump timed out (no packets passed the filter).
> 2013-09-16 13:57:36,088 - JobsSet - test_utils - INFO - Checking TcpDump output. RE Pattern: 1.1.1.2(.*)(.*)1.1.1.1(.*)
> 2013-09-16 13:57:36,089 - JobsSet - test_utils - WARNING - The traffic that was searched for was not found in tcpdump output
> 2013-09-16 13:57:51,233 - JobsSet - vms - INFO - vm status is 'up'
> 2013-09-16 13:57:51,233 - JobsSet - art.rhevm_api.tests_lib.low_level.vms - INFO - Getting the VM host after VM migrated.
> 2013-09-16 13:57:51,402 - JobsSet - art.rhevm_api.tests_lib.low_level.vms - ERROR - VM stayed on the same host
> 2013-09-16 13:57:51,402 - JobsSet - art.rhevm_api.tests_lib.high_level.vms - ERROR - Couldn't migrate VM VMTest1
>
> I would say this is an environment issue rather than a programmatic one.

The above is incorrect, as the migration between the two qemu processes had not started yet. So please ignore the above comment.

> > 2) What do you mean by an optional network? It's a regular network that has
> > the migration flag turned on.
>
> From comment 16:
> ...
> "'required': 'false',
> ...
>
> Does it mean optional?
(In reply to Yaniv Bronhaim from comment #28)
> (In reply to Daniel Berrange from comment #24)
> > (In reply to Yaniv Bronhaim from comment #19)
> > It is fundamentally incorrect to assume that ERR_SYSTEM_ERROR + FROM_RPC
> > means that libvirtd is down. It is only a heuristic. When this condition
> > is detected, you should *not* take action immediately. You must validate
> > that your heuristic is actually correct.
>
> IIRC the callback for distinguishing a broken connection between vdsm and
> libvirt doesn't work without the keepalive flag on, or something else
> prevented us from using it (maybe it was not implemented in RHEL 6.4?); the
> way we do it today was picked after consulting it with you.

RHEL-6.4 has the connection close callbacks.

> Can you explain in more detail why we have error codes + error domains when
> libvirt always returns the same numbers for network errors? Can't you
> provide specific codes for specific cases?

The error code/domain you are seeing here indicates a failure in the RPC code. The RPC code is not solely used for client<->libvirt communication; the exact same code is used for libvirt<->libvirt communication in migration. There's nothing we can do to distinguish this - we can't issue different codes to different clients. The error code has *never* indicated that the connection has been lost - there have always been many possible causes for these errors - and libvirt does not aim to have a unique error code for every possible scenario, since that would result in hundreds of thousands of error codes.

> I prefer to handle this case without changing our "check broken connection"
> approach as part of this bug for vdsm 3.3. But if that's the case and you
> can't change the error code, please supply a safe and proven way to verify
> a broken connection, something that also works with previous versions of
> libvirt.

As I already said above, you should treat this as a heuristic only, and issue a simple method call to test alive-ness after the condition occurs, to validate that it really corresponds to a broken connection.
So, as you say, the way we do it currently is wrong and we will change it. Thanks for your help and the fast reply! We plan to use getLibVersion to verify the connection after the exception is raised, as suggested.
Pinging libvirt upon error is merged on master. This resolves one issue in this bug (the unnecessary restart of vdsm). The second issue is thought to be the engine's use of a network for migration before it is attached to all hosts (network cluster status='OPERATIONAL').
Verified in rhevm-3.3.0-0.25.beta1.el6ev.noarch
This bug is currently attached to errata RHBA-2013:15291. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise, to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:
https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes

Thanks in advance.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0040.html