Bug 1098285
| Field | Value |
|---|---|
| Summary | VM HostedEngine is down. Exit message: internal error Failed to acquire lock: error -243. |
| Product | Red Hat Enterprise Virtualization Manager |
| Component | ovirt-hosted-engine-ha |
| Version | 3.4.0 |
| Status | CLOSED CURRENTRELEASE |
| Severity | high |
| Priority | unspecified |
| Reporter | Nikolai Sednev <nsednev> |
| Assignee | Jiri Moskovcak <jmoskovc> |
| QA Contact | Nikolai Sednev <nsednev> |
| CC | alukiano, amureini, bazulay, dfediuck, ecohen, eedri, fsimonce, iheim, jmoskovc, lpeer, lsurette, mavital, nsednev, scohen, sherold, tdosek, tnisan, wdaniel, yeylon |
| Keywords | TestOnly |
| Target Milestone | --- |
| Target Release | 3.4.4 |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | sla |
| Fixed In Version | av13 |
| Doc Type | Bug Fix |
| Type | Bug |
| oVirt Team | SLA |
| Last Closed | 2014-12-03 09:55:52 UTC |
Description
Nikolai Sednev
2014-05-15 16:09:15 UTC
Martin Sivák (comment #1):

> 1. Assemble HE setup containing 2 hosts and 2 storage domains, one of them should be an imported NFS/ISO domain.

Are you aware of the third storage domain that is used for hosted engine purposes? Where is it? Can you describe the setup more clearly?

> 2. Block the connection to the non-master storage domain (to the ISO) with "iptables -I INPUT -s 10.35.64.102 -j DROP" on both hosts.

What is 10.35.64.102? Can you describe the setup more clearly? Hostnames, IP addresses, and what runs where; I find the report too confusing in this matter.

> exitMessage = internal error Failed to acquire lock: error -243

This is reported from VDSM/libvirt. Not really related to hosted engine.

> Command "hosted-engine --vm-status" issued on master caused it to be stuck in the ssh shell and non-responsive even to Ctrl+C/Ctrl+D.

This usually happens when you kill the connection to the NFS server holding the hosted engine storage domain.

> Expected: Both hosts should remain in active mode.

What do you mean by this? The engine must not run on more than one host.

I do not see any real errors here except the freeze of the hosted-engine tool. The VM is running properly on the master host and is not running on the second host, as expected.

Rose:

```
[root@rose05 ~]# vdsClient -s 0 list
68dabad2-3ee6-4830-a1c2-a2c2af019428
    Status = Down
```

Master:

```
[root@master-vds10 ~]# vdsClient -s 0 list
68dabad2-3ee6-4830-a1c2-a2c2af019428
    Status = Up
```

> During these troubles, RHEVM actually runs uninterruptedly and is accessible via the UI.

So hosted engine did its job properly and you only have a reporting issue?

(In reply to Martin Sivák from comment #1)

1. HE used its own NFS3 SD, which is hidden from the engine; it lives at 10.35.160.108:/RHEV/artyom_hosted_engine.
2. 10.35.64.102 is the IP address of the imported NFS/ISO SD, the second SD besides the regular NFS3 SD used as the first SD.
3. That is not the issue here, as I killed only the connectivity to the imported ISO SD, which HE has nothing to do with for itself, since it runs from the 10.35.160.108:/RHEV/artyom_hosted_engine NFS3 share.
4. A host with a "green arrow upwards" is an active host; I got one of them as inactive, with no connection at all to where HE is running.
5. Yes, the HE VM continued to run properly, but the engine somehow decided it was down, and one of the hosts went into "Inactive" and remained inactive even after I removed the iptables rules from both hosts.

> 5. Yes, the HE VM continued to run properly, but the engine somehow decided it was down, and one of the hosts went into "Inactive" and remained inactive even after I removed the iptables rules from both hosts.

- That seems more like a problem between vdsm and the engine; hosted-engine does not report the state of the host, that is up to vdsm.
- From the engine log it seems like everything is working as expected: the engine periodically tries the RPC call GetStatsVDSCommand to master-vds10.qa.lab.tlv.redhat.com, but still gets timeouts, thus keeping the host NonOperational. It seems like the host still can't access the storage domain properly: "Host master-vds10.qa.lab.tlv.redhat.com reports about one of the Active Storage Domains as Problematic." Moving to vdsm, because it handles the storage.
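As a side note on the reproduction step discussed above: the raw iptables rule can be wrapped so it is easy to remove again after the test. This is a minimal sketch, assuming the ISO domain address 10.35.64.102 from the report (the function names are hypothetical); it would be run as root on both hosts.

```shell
# Sketch of the reproduction step: block and later unblock traffic from
# the ISO storage domain. ISO_SD_IP (10.35.64.102) is taken from this
# report; adjust it for your own setup. Run as root on both hosts.
ISO_SD_IP="${ISO_SD_IP:-10.35.64.102}"

block_iso_sd() {
    # DROP everything coming from the ISO domain's NFS server
    iptables -I INPUT -s "$ISO_SD_IP" -j DROP
}

unblock_iso_sd() {
    # Delete the exact rule inserted above, restoring connectivity
    iptables -D INPUT -s "$ISO_SD_IP" -j DROP
}
```

Using `-D` with the same match as `-I` removes exactly the rule that was inserted, which matters here because the reporter states the host stayed inactive even after the rules were removed.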
Can you please take a look at this?

Analyzing the logs:

```
$ grep "return vmGetStats.*'vmId': '68dabad2-3ee6-4830-a1c2-a2c2af019428'" mastersvdsm.log | grep -v "'status': 'Up'" | wc -l
0
```

we can see that from 2014-05-15 18:01:03,048 to 2014-05-15 18:39:24,229 vdsm always reported (roughly every 10 seconds) 68dabad2-3ee6-4830-a1c2-a2c2af019428 (the HostedEngine VM) as up on master-vds10. So I have no idea why the agent wanted to start the HostedEngine VM on rose as well:

```
$ grep "client.*vmCreate.*HostedEngine" rosesvdsm.log | cut -c 21-114
2014-05-15 18:04:28,355::BindingXMLRPC::1067::vds::(wrapper) client [127.0.0.1]::call vmCreate
2014-05-15 18:14:57,202::BindingXMLRPC::1067::vds::(wrapper) client [127.0.0.1]::call vmCreate
2014-05-15 18:25:26,240::BindingXMLRPC::1067::vds::(wrapper) client [127.0.0.1]::call vmCreate
2014-05-15 18:35:55,714::BindingXMLRPC::1067::vds::(wrapper) client [127.0.0.1]::call vmCreate
```

which obviously resulted in the locking failures:

```
Thread-1349::DEBUG::2014-05-15 18:04:30,274::libvirtconnection::124::root::(wrapper) Unknown libvirterror: ecode: 1 edom: 42 level: 2 message: internal error Failed to acquire lock: error -243
Thread-1896::DEBUG::2014-05-15 18:14:59,043::libvirtconnection::124::root::(wrapper) Unknown libvirterror: ecode: 1 edom: 42 level: 2 message: internal error Failed to acquire lock: error -243
Thread-2460::DEBUG::2014-05-15 18:25:28,071::libvirtconnection::124::root::(wrapper) Unknown libvirterror: ecode: 1 edom: 42 level: 2 message: internal error Failed to acquire lock: error -243
Thread-3008::DEBUG::2014-05-15 18:35:57,547::libvirtconnection::124::root::(wrapper) Unknown libvirterror: ecode: 1 edom: 42 level: 2 message: internal error Failed to acquire lock: error -243
```

and that finally resulted in dead VMs being picked up by the engine and reported to the audit log.

Doron, can someone from your group take a look please?

What's the question?

(In reply to Nikolai Sednev from comment #17)
> What's the question?
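As an aside, the ad-hoc grep pipeline in the log analysis above can be wrapped into a small reusable check. This is a sketch (the function name `count_non_up` is hypothetical) that assumes the vdsm log format shown in the excerpts: it prints how many vmGetStats responses for a given VM id report a status other than 'Up', so 0 means the VM was always reported up.

```shell
# Count vmGetStats responses for a VM id whose status is not 'Up'.
# Assumes the vdsm log format shown in the excerpts above.
# Usage: count_non_up <vdsm.log> <vmId>
count_non_up() {
    # first grep selects the status reports for this VM,
    # -vc counts those that do NOT say 'status': 'Up'
    grep "return vmGetStats.*'vmId': '$2'" "$1" | grep -vc "'status': 'Up'"
}
```

With the logs from this report, `count_non_up mastersvdsm.log 68dabad2-3ee6-4830-a1c2-a2c2af019428` would print 0, matching the pipeline above.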
The question is whether you also think that it's a dupe of bug 1140824.

*** Bug 1150076 has been marked as a duplicate of this bug. ***

(In reply to Jiri Moskovcak from comment #18)
> The question is whether you also think that it's a dupe of bug 1140824.

1. I think that if it is a duplicate, then mine is not the duplicate bug, as it was opened earlier (Reported: 2014-05-15 12:09 EDT by Nikolai Sednev) than bug 1140824 (Reported: 2014-09-11 14:42 EDT by wdaniel).
2. I really don't think these two bugs are the same, as the scenario in bug 1140824 is different: there, putting HE into maintenance mode and then retrieving the VM status causes the trouble, and the ISO domain is not disconnected, since it is not being blocked by any iptables rules during that test run, IMHO.

(In reply to Nikolai Sednev from comment #20)

Even if the scenario is different, if the root cause is the same and the solution is the same then it should be marked as a duplicate. In this case, having no connection can be related to iptables or other scenarios, but the root cause is the lack of connectivity and the resolution will be the same.

(In reply to Doron Fediuck from comment #21)

I totally agree with you, with one difference: my bug was opened in May this year, while the referred bug was opened in October this year (and has in turn already been closed as a duplicate of this one).

Please let me know if any additional information is required.

(In reply to Nikolai Sednev from comment #22)
> Please let me know if any additional information is required.

In this case please try to reproduce, since we suspect it's already resolved.

(In reply to Doron Fediuck from comment #23)
> In this case please try to reproduce, since we suspect it's already resolved.
Sure, but the problem is that I see it fixed/targeted for release in 3.4.4, while the latest release is 3.4.3:

rhevm-3.4.3-1.2.el6ev.noarch
http://bob.eng.lab.tlv.redhat.com/builds/latest_av/

I tried to reproduce it on the latest build and it works for me, with several exceptions: the original error was not reproduced, and one of the 3 hosts remains active with the HE VM running on top of it, acting as SPM, but the 2 additional hosts keep changing their states to not accessible or down. A screenshot is attached. I ran the setup over 3 RHEL 6.6 hosts, with the HE VM upgraded from RHEL 6.5 to RHEL 6.6.

Created attachment 951438 [details]
screenshot

The question here is whether the current system behaviour looks OK. We're losing HA for HE if the 2 other hosts keep jumping between states; we may have to define them as inactive for regular VMs, but they are still good for HE HA. I rebooted the host that was shown as active and that hosted HE, and saw that all 3 hosts became down and not accessible; the HE VM changed its status to unknown, yet HE kept running, although it was not shown properly via the web UI. Moreover, if I run "hosted-engine --vm-status" on both running hosts while the third, previously rebooted, host is still powering up, no response to the command is received and both hosts' shells get stuck.

The problem with the agent getting stuck when ANY of the storage domains is not accessible is not fixed in 3.4.x, but the problem you're describing is something else: a frozen agent doesn't cause the engine to think that the host is down or non-responsive. I think that your test case, where you kill the connection to the storage domain, somehow messes up the connection between the host and the engine.

(In reply to Jiri Moskovcak from comment #27)

Anyway, for the original bug we should close this one as verified. The additional troubles should get their own bug and be fixed based on the decisions and inputs received from Doron.
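The frozen `hosted-engine --vm-status` shells described above come from the tool blocking on unreachable storage. A defensive sketch (the wrapper name, the `HOSTED_ENGINE_CMD` variable, and the 30-second default are assumptions for illustration, not part of the tool) is to run it under `timeout(1)` so the interactive shell gets control back:

```shell
# Run "hosted-engine --vm-status" under a hard deadline so an unreachable
# storage domain cannot hang the interactive shell indefinitely.
# HOSTED_ENGINE_CMD and the 30s default are assumptions for this sketch.
HOSTED_ENGINE_CMD="${HOSTED_ENGINE_CMD:-hosted-engine --vm-status}"

vm_status_with_timeout() {
    # timeout(1) exits with status 124 when the deadline is hit
    timeout "${1:-30}" $HOSTED_ENGINE_CMD
}
```

Note the caveat: a process blocked in uninterruptible NFS I/O may not die even on SIGKILL, so this only protects the calling shell from hanging, it does not recover the stuck process itself.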