Created attachment 922462 [details]
Engine log

Description of problem:
Got into a state where two hosts were up but neither of them was the SPM.

Version-Release number of selected component (if applicable):
oVirt Engine Version: 3.5.0-0.0.master.20140722232056.git8e1babc.fc19

How reproducible:
Unknown

Steps to Reproduce:
1. Set up an iSCSI data center with 30 storage domains and 2 hosts running Fedora 19.
2. Remove the two hosts and add them back, one host using jsonrpc and the other using xmlrpc.
3. Activate both hosts.
4. Switch the SPM role between the hosts several times.
   On one switch, the host that should have become the new SPM became the SPM, immediately lost the role, and the other host became the SPM.
5. Put the hosts into maintenance and wait until all 30 domain monitors are stopped.
6. Remove both hosts and add them back using jsonrpc.

Actual results:
One host came up but did not become the SPM, although all domains were up. The other host remained in "Initializing" state. After a few minutes, the host became non-operational. Activating the host made it "Unassigned". After a few minutes the host became non-operational again and I moved it into maintenance. After activating the host, it was finally up, but still no host became the SPM.

Expected results:
Both hosts in up state, one of them the SPM.

Workaround:
Restart ovirt-engine.
Created attachment 922463 [details] vdsm log from host that was up
Created attachment 922464 [details] vdsm log from host that had trouble becoming up
Additional info: After putting both hosts into maintenance, all storage domains in the data center remained up(!) - without any active host in the data center.
Nir, did you try the same scenario with XMLRPC?
(In reply to Allon Mureinik from comment #4)
> Nir, did you try the same scenario with XMLRPC?

In step 1 there was one host using xmlrpc. After the problem state was reached, I switched both hosts to xmlrpc and it did not change the engine state. I did not try to repeat the whole test using xmlrpc on both hosts.
Nir, next time it would be great if you could attach only the relevant timeframe of the log, or specify which part is relevant, and reproduce the issue with the minimal steps required. It's hard to track the provided log as it spans a long time and contains many different operations.

Let's handle the issues separately here - this bug describes multiple issues in one report.

1. Host being activated and moving to non-operational:
The host moves to non-operational because the connect to storage pool operation takes more than 3 minutes on VDSM, which causes it to be considered timed out in the engine.
-----------------------------------------------------
Thread-21::INFO::2014-07-30 00:53:32,519::logUtils::44::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID=u'2440ff3d-275f-42e6-b204-7d055b26b174', hostID=2, msdUUID=u'983111eb-5fea-4899-833b-305e6fb91b47', masterVersion=740, domainsMap=None, options=None)
Thread-21::INFO::2014-07-30 00:57:54,293::logUtils::47::dispatcher::(wrapper) Run and protect: connectStoragePool, Return response: True
-----------------------------------------------------

2. Host doesn't become the SPM:
From what I've seen in the log (and if that's not the correct timeframe please let me know), voodoo3 does become the SPM; the issue might be a UI refresh issue.
-----------------------------------------------------
2014-07-30 00:46:27,614 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-30) [6a4da444] FINISH, SpmStartVDSCommand, return: org.ovirt.engine.core.common.businessentities.SpmStatusResult@7a013213, log id: 415fe407
2014-07-30 00:46:27,622 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (DefaultQuartzScheduler_Worker-30) [6a4da444] Initialize Irs proxy from vds: voodoo3.tlv.redhat.com
2014-07-30 00:46:27,638 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-30) [6a4da444] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Storage Pool Manager runs on Host voodoo3 (Address: voodoo3.tlv.redhat.com).
-----------------------------------------------------
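For reference, the connectStoragePool duration implied by the two dispatcher lines in point 1 can be checked with a quick sketch. The 180-second figure below is an assumption based on the "more than 3 minutes" statement above; the exact engine config value may differ:

```python
from datetime import datetime

# Timestamps copied from the two dispatcher lines quoted in point 1.
FMT = "%Y-%m-%d %H:%M:%S,%f"
start = datetime.strptime("2014-07-30 00:53:32,519", FMT)
end = datetime.strptime("2014-07-30 00:57:54,293", FMT)

duration = (end - start).total_seconds()
# 261.774 seconds - well over an assumed 180-second engine timeout.
print("connectStoragePool took %.3f seconds" % duration)
```

Any vdsm.log pair of "Run and protect" entry/exit lines for the same thread can be measured the same way.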
(In reply to Liron Aravot from comment #6)
> Let's handle the issues separately here - this bug describes multiple issues
> in one bug.

I mentioned two issues (no SPM, storage domains up when all hosts are down), but they are probably related, both caused by the same issue. If not, you can open another bug for the separate issue.

> 1.
> Host being activated and moves to non-operational:
> The host moves to non-operational because the connect to storage pool
> operation takes more than 3 minutes on VDSM, which causes it to be
> considered timed out in the engine.
>
> -----------------------------------------------------
> Thread-21::INFO::2014-07-30 00:53:32,519::logUtils::44::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID=u'2440ff3d-275f-42e6-b204-7d055b26b174', hostID=2, msdUUID=u'983111eb-5fea-4899-833b-305e6fb91b47', masterVersion=740, domainsMap=None, options=None)
> Thread-21::INFO::2014-07-30 00:57:54,293::logUtils::47::dispatcher::(wrapper) Run and protect: connectStoragePool, Return response: True
> -----------------------------------------------------

Sure, but why did the connect to storage pool take so long?

> 2.
> Host doesn't become the SPM:
> From what I've seen in the log (and if that's not the correct timeframe
> please let me know), voodoo3 does become the SPM; the issue might be a UI
> refresh issue.
> -----------------------------------------------------
> 2014-07-30 00:46:27,614 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-30) [6a4da444] FINISH, SpmStartVDSCommand, return: org.ovirt.engine.core.common.businessentities.SpmStatusResult@7a013213, log id: 415fe407
> 2014-07-30 00:46:27,622 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (DefaultQuartzScheduler_Worker-30) [6a4da444] Initialize Irs proxy from vds: voodoo3.tlv.redhat.com
> 2014-07-30 00:46:27,638 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-30) [6a4da444] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Storage Pool Manager runs on Host voodoo3 (Address: voodoo3.tlv.redhat.com).
> -----------------------------------------------------

Hopefully this is the case - we can also see in the vdsm logs if one of the hosts became the SPM.

The relevant engine log starts at Jul 29, about 19:00.

The vdsm logs start some time before this issue was seen.

I don't have more information; all the info is in the logs.
(In reply to Nir Soffer from comment #7)
> Sure, but why did the connect to storage pool take so long?

We should inspect why the connect takes that long in that scenario, but that's not an engine issue.
---------------------------------------------
> Hopefully this is the case - we can also see in the vdsm logs if one of the
> hosts became the SPM.
>
> The relevant engine log starts at Jul 29, about 19:00.
>
> The vdsm logs start some time before this issue was seen.
>
> I don't have more information; all the info is in the logs.

From what I've seen in the log (and if that's not the correct timeframe please let me know), voodoo3 does become the SPM; the issue might be a UI refresh issue.
Ori, this seems like the UI refresh issue that you encountered last week, please close this as a duplicate of the opened bug on that issue. thanks.
I haven't opened that one, someone beat me to it.
(In reply to Ori from comment #10) > I haven't opened that one,someone beat me to it Ori, do we have a BZ number for it?
Allon, by someone I meant Nir :)
So, do we have a bug on that already? Nir?
Talked offline with Liron. This is the bug. Removing the needinfo.

Eli - can you check if this reproduces? Play a bit with the hosts and see if the tab is being refreshed? I'm afraid this one will be hard to reproduce, but let's try.
(In reply to Oved Ourfali from comment #14)
> Talked offline with Liron. This is the bug. Removing the needinfo.
>
> Eli - can you check if this reproduces?
> Play a bit with the hosts and see if the tab is being refreshed?
> I'm afraid this one will be hard to reproduce, but let's try.

Indeed, I see no issue with the UI auto refresh, but I am still missing info.

Nir, if you can reproduce, the following info is mandatory:
1) You wrote in the BZ description that the workaround is an engine restart. If this BZ claims that "Hosts tab isn't being refreshed automatically", we can refresh the information on the Hosts tab manually - does this work?
2) If 1) does not work, please issue a REST GET on api/hosts - do you get the correct information?
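As a sketch of step 2, such a GET can be built with only the Python standard library. The engine address and credentials below are placeholders, not values from this bug:

```python
import base64
import urllib.request

# Hypothetical engine address and credentials - replace with real values.
ENGINE = "https://engine.example.com"
USER, PASSWORD = "admin@internal", "secret"

# Build an authenticated GET request for the hosts collection.
req = urllib.request.Request(ENGINE + "/api/hosts")
token = base64.b64encode(("%s:%s" % (USER, PASSWORD)).encode()).decode()
req.add_header("Authorization", "Basic " + token)
req.add_header("Accept", "application/xml")

# Uncomment to run against a live engine and compare the host <status>
# elements with what the UI shows:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode())
```

If the XML reports the correct SPM host while the UI does not, that would support the UI refresh theory.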