Bug 1282187

| Field | Value |
| --- | --- |
| Summary | Host under maintenance still has sanlock lockspaces, which prevents the upgrade of the sanlock package |
| Product | [oVirt] ovirt-hosted-engine-ha |
| Component | Agent |
| Status | CLOSED CURRENTRELEASE |
| Severity | high |
| Priority | urgent |
| Version | 1.2.8 |
| Target Milestone | ovirt-3.6.1 |
| Target Release | 1.3.3.2 |
| Hardware | x86_64 |
| OS | Linux |
| Whiteboard | sla |
| Reporter | Artyom <alukiano> |
| Assignee | Martin Sivák <msivak> |
| QA Contact | Artyom <alukiano> |
| CC | alukiano, amureini, bugs, dfediuck, gklein, mavital, msivak, nsoffer, sbonazzo, stirabos, tnisan, ylavi |
| Keywords | TestBlocker, Triaged |
| Flags | rule-engine: ovirt-3.6.z+, rule-engine: blocker+, ylavi: planning_ack+, dfediuck: devel_ack+, mavital: testing_ack+ |
| Doc Type | Bug Fix |
| Clones | 1298461 (view as bug list) |
| Last Closed | 2015-12-16 12:19:16 UTC |
| Type | Bug |
| oVirt Team | SLA |
| Bug Blocks | 1284954, 1285700, 1294353, 1298461 |
Description (Artyom, 2015-11-15 11:37:50 UTC)
Created attachment 1094386 [details]
output of script
Martin, can you explain what the lockspaces we see in attachment 1094386 [details] are?
Does stopping the hosted engine agent remove these lockspaces?
Artyom, what is this mount? 10.35.64.11\:_vol_RHEV_virt_alukiano__HE__upgrade/ If the host is in maintenance, I don't expect to find this mount point active, and sanlock should not use it.

Artyom, the logs and the output of the script are conflicting. The output of the script suggests that on Sun Nov 15 10:04:06 IST 2015 the host was in maintenance. But according to the logs, the host was in maintenance from 2015-11-15 11:05:27 to 2015-11-15 11:11:13:

Thread-20::INFO::2015-11-15 11:05:27,886::logUtils::47::dispatcher::(wrapper) Run and protect: disconnectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': 'a04857b8-7375-4c5d-a834-d0d20ef00946'}]}
Thread-18::INFO::2015-11-15 11:11:13,612::logUtils::44::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=1, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'protocol_version': 3, 'connection': '10.35.64.11:/vol/RHEV/virt/alukiano_HE_upgrade', 'user': 'kvm', 'id': 'a04857b8-7375-4c5d-a834-d0d20ef00946'}], options=None)

At 10:04:06, the host was connected to storage:

Thread-83::DEBUG::2015-11-15 10:04:05,851::fileSD::262::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n500 bytes (500 B) copied, 0.000449901 s, 1.1 MB/s\n'; <rc> = 0
Detector thread::DEBUG::2015-11-15 10:04:06,056::protocoldetector::187::vds.MultiProtocolAcceptor::(_add_connection) Adding connection from 127.0.0.1:56009
Detector thread::DEBUG::2015-11-15 10:04:06,061::protocoldetector::201::vds.MultiProtocolAcceptor::(_remove_connection) Connection removed from 127.0.0.1:56009
Detector thread::DEBUG::2015-11-15 10:04:06,061::protocoldetector::247::vds.MultiProtocolAcceptor::(_handle_connection_read) Detected protocol xml from 127.0.0.1:56009
Detector thread::DEBUG::2015-11-15 10:04:06,061::BindingXMLRPC::1173::XmlDetector::(handleSocket) xml over http detected from ('127.0.0.1', 56009)
Thread-1457::DEBUG::2015-11-15 10:04:06,062::BindingXMLRPC::1133::vds::(wrapper) client [127.0.0.1]::call vmGetStats with ('8408e3bd-22bc-497e-b98f-cbd50ca754b3',) {}
Thread-1457::DEBUG::2015-11-15 10:04:06,063::BindingXMLRPC::1140::vds::(wrapper) return vmGetStats with {'status': {'message': 'Virtual machine does not exist', 'code': 1}}
Detector thread::DEBUG::2015-11-15 10:04:06,081::protocoldetector::187::vds.MultiProtocolAcceptor::(_add_connection) Adding connection from 127.0.0.1:56010
Detector thread::DEBUG::2015-11-15 10:04:06,085::protocoldetector::201::vds.MultiProtocolAcceptor::(_remove_connection) Connection removed from 127.0.0.1:56010
Detector thread::DEBUG::2015-11-15 10:04:06,085::protocoldetector::247::vds.MultiProtocolAcceptor::(_handle_connection_read) Detected protocol xml from 127.0.0.1:56010
Detector thread::DEBUG::2015-11-15 10:04:06,085::BindingXMLRPC::1173::XmlDetector::(handleSocket) xml over http detected from ('127.0.0.1', 56010)
Thread-1458::DEBUG::2015-11-15 10:04:06,087::task::595::Storage.TaskManager.Task::(_updateState) Task=`5e933a40-bf10-4c23-bc83-cc41699813f1`::moving from state init -> state preparing
Thread-1458::INFO::2015-11-15 10:04:06,087::logUtils::44::dispatcher::(wrapper) Run and protect: repoStats(options=None)
Thread-1458::INFO::2015-11-15 10:04:06,087::logUtils::47::dispatcher::(wrapper) Run and protect: repoStats, Return response: {'5a6cc8b0-07f7-45c3-8bb5-7ba630170130': {'code': 0, 'actual': True, 'version': 3, 'acquired': True, 'delay': '0.000449901', 'lastCheck': '0.2', 'valid': True}}
Thread-1458::DEBUG::2015-11-15 10:04:06,087::task::1191::Storage.TaskManager.Task::(prepare) Task=`5e933a40-bf10-4c23-bc83-cc41699813f1`::finished: {'5a6cc8b0-07f7-45c3-8bb5-7ba630170130': {'code': 0, 'actual': True, 'version': 3, 'acquired': True, 'delay': '0.000449901', 'lastCheck': '0.2', 'valid': True}}
Thread-1458::DEBUG::2015-11-15 10:04:06,087::task::595::Storage.TaskManager.Task::(_updateState) Task=`5e933a40-bf10-4c23-bc83-cc41699813f1`::moving from state preparing -> state finished
Thread-1458::DEBUG::2015-11-15 10:04:06,087::resourceManager::940::Storage.ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-1458::DEBUG::2015-11-15 10:04:06,087::resourceManager::977::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-1458::DEBUG::2015-11-15 10:04:06,088::task::993::Storage.TaskManager.Task::(_decref) Task=`5e933a40-bf10-4c23-bc83-cc41699813f1`::ref 0 aborting False
Detector thread::DEBUG::2015-11-15 10:04:07,054::protocoldetector::187::vds.MultiProtocolAcceptor::(_add_connection) Adding connection from 127.0.0.1:56011

Maybe we are getting different timestamps from the script and from the logs? To make sure we are testing the right thing, let's repeat this test with some more logs:

1. Have a running hosted engine setup on this environment.
2. Add a message to the vdsm log:
   echo "==== `date` maintenance - start ====" >> /var/log/vdsm/vdsm.log
3. Move the host to maintenance.
4. When the host is in maintenance mode in the engine, add the next message to the vdsm log:
   echo "==== `date` maintenance - finish ====" >> /var/log/vdsm/vdsm.log
5. Add a message before stopping the hosted engine agent:
   echo "==== `date` stopping he agent - start ====" >> /var/log/vdsm/vdsm.log
6. Stop the hosted engine agent.
7. Add a message after stopping the hosted engine agent:
   echo "==== `date` stopping he agent - finish ====" >> /var/log/vdsm/vdsm.log
8. Run the suggested script again so we can see the sanlock state. (A consolidated sketch of steps 2-8 appears below, after the attachment.)

Artyom, the attached vdsm.log looks like a wrong combination of different log files:

MainThread::INFO::2015-11-15 11:39:22,892::vdsm::145::vds::(run) <WorkerThread(Thread-8, started daemon 140012301575936)>
VM Channels Listener::INFO::2015-11-15 11:39:23,801::vmchannels::183::vds::(run) VM channels listener thread has ended.
MainThread::INFO::2015-11-15 09:40:09,395::vdsm::132::vds::(run) (PID: 24404) I am the actual vdsm 4.16.29-1.el7ev alma07.qa.lab.tlv.redhat.com (3.10.0-229.24.2.el7.x86_64)
MainThread::DEBUG::2015-11-15 09:40:09,396::resourceManager::421::Storage.ResourceManager::(registerNamespace) Registering namespace 'Storage'

Please avoid these modifications of the logs. If two log files are needed, attach unmodified logs.

Created attachment 1095437 [details]
logs and script outputs
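For convenience, steps 2-8 above can be driven from a single shell script. This is a minimal sketch, not the script actually used in this bug: moving the host to maintenance is still done manually in the engine, and ./sanlock-sample.sh is a hypothetical stand-in for the attached sampling script.

#!/bin/bash
# Sketch of the log-marker procedure from the steps above.
VDSM_LOG=/var/log/vdsm/vdsm.log

echo "==== $(date) maintenance - start ====" >> "$VDSM_LOG"
read -r -p "Move the host to maintenance in the engine, then press Enter"
echo "==== $(date) maintenance - finish ====" >> "$VDSM_LOG"

echo "==== $(date) stopping he agent - start ====" >> "$VDSM_LOG"
systemctl stop ovirt-ha-agent
echo "==== $(date) stopping he agent - finish ====" >> "$VDSM_LOG"

# Hypothetical name for the attached script that samples sanlock state.
./sanlock-sample.sh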
1) It is the storage domain that I use when deploying hosted engine.
2) Attached new logs and script outputs.
3) I did edit the vdsm log; to me it looks like the vdsm log has different dates, from before the host was added to the engine and after (not sure about it).
I hope it will help you.
3) I did not edit the vdsm log.

Nir, is this a blocker indeed?

This bug is not marked for z-stream, yet the milestone is for a z-stream version, therefore the milestone has been reset. Please set the correct milestone or add the z-stream flag.

Martin, thanks for passing the bug to me, but I don't know what you are doing in the hosted engine setup, so I will need your answers. See comment 2.

Uh what? Tal moved the bug to you. And with regards to lockspaces:

hosted-engine:1:/var/run/vdsm/storage/5a6cc8b0-07f7-45c3-8bb5-7ba630170130/a66d5806-f79c-4195-bd9b-37f954282ed5/fca4a3b2-717e-400e-aea0-62c9972eed2e:0
hosted-engine:1:/var/run/vdsm/storage/5a6cc8b0-07f7-45c3-8bb5-7ba630170130/a66d5806-f79c-4195-bd9b-37f954282ed5/fca4a3b2-717e-400e-aea0-62c9972eed2e:0

Those two will go away when the agent is stopped. I have no idea about the rest; none of those belong to the hosted engine agent. They seem to be related to VMs or SPM operation. The hosted engine agent's lockspace always ends with ha_agent.lockspace on NFS, and it seems to start with hosted-engine on iSCSI.

This is an upgrade issue, so pushing out.

Tal, please avoid setting needinfo unless you need info from me.

Thanks Artyom for the detailed logs. Looking at the new logs:

1. Connecting to the storage pool:
Thread-126566::INFO::2015-11-17 13:07:12,867::logUtils::44::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID=u'00000002-0002-0002-0002-0000000001c6', hostID=1, msdUUID=u'34da7c89-06e3-49a3-9a44-912c0ff0b6e7', masterVersion=1, domainsMap={u'34da7c89-06e3-49a3-9a44-912c0ff0b6e7': u'active'}, options=None)

2. Connection finished:
Thread-126566::INFO::2015-11-17 13:07:13,094::logUtils::47::dispatcher::(wrapper) Run and protect: connectStoragePool, Return response: True

3. Disconnecting from the storage pool:
==== Tue Nov 17 13:08:27 IST 2015 maintenance - start ====
Thread-126682::INFO::2015-11-17 13:08:34,627::logUtils::44::dispatcher::(wrapper) Run and protect: disconnectStoragePool(spUUID=u'00000002-0002-0002-0002-0000000001c6', hostID=1, remove=False, options=None)
Thread-126682::INFO::2015-11-17 13:08:39,070::logUtils::47::dispatcher::(wrapper) Run and protect: disconnectStoragePool, Return response: True
==== Tue Nov 17 13:08:53 IST 2015 maintenance - finish ====

4. Stopping the hosted engine agent:
==== Tue Nov 17 13:11:18 IST 2015 stopping he agent - start ====
# agent.log
MainThread::INFO::2015-11-17 13:11:26,854::agent::121::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
==== Tue Nov 17 13:11:53 IST 2015 stopping he agent - finish ====

5. According to the script output, we still have 2 lockspaces, from 13:12:06 until 13:12:56 when the script ends:
s hosted-engine:1:/var/run/vdsm/storage/5a6cc8b0-07f7-45c3-8bb5-7ba630170130/a66d5806-f79c-4195-bd9b-37f954282ed5/fca4a3b2-717e-400e-aea0-62c9972eed2e:0
s 5a6cc8b0-07f7-45c3-8bb5-7ba630170130:1:/rhev/data-center/mnt/10.35.64.11\:_vol_RHEV_virt_alukiano__HE__upgrade/5a6cc8b0-07f7-45c3-8bb5-7ba630170130/dom_md/ids:0

6. The hosted engine agent starts up again:
# agent.log
MainThread::INFO::2015-11-17 13:14:24,984::agent::57::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 1.2.8 started

According to comment 11, this lockspace should be removed when stopping the agent:
hosted-engine:1:/var/run/vdsm/storage/5a6cc8b0-07f7-45c3-8bb5-7ba630170130/a66d5806-f79c-4195-bd9b-37f954282ed5/fca4a3b2-717e-400e-aea0-62c9972eed2e:0

Moving back to Martin to investigate the hosted engine side. In the meantime, I will check what the second lockspace is.
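For reference, the "suggested script" whose output is quoted throughout this bug pairs wall-clock timestamps with sanlock state. A minimal equivalent, assuming nothing beyond the stock sanlock CLI, would be:

#!/bin/bash
# Print a timestamped view of registered sanlock lockspaces once per second.
while true; do
    date
    sanlock client status | grep '^s ' || echo "(no lockspaces)"
    sleep 1
done

Lockspace lines in 'sanlock client status' output start with "s ", which is why the outputs quoted in this bug carry that prefix.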
I think it is also related to hosted engine, but I need to investigate this further.

Regarding the second lockspace:

s 5a6cc8b0-07f7-45c3-8bb5-7ba630170130:1:/rhev/data-center/mnt/10.35.64.11\:_vol_RHEV_virt_alukiano__HE__upgrade/5a6cc8b0-07f7-45c3-8bb5-7ba630170130/dom_md/ids:0

This is a lockspace on the domain 5a6cc8b0-07f7-45c3-8bb5-7ba630170130. This domain is not part of the pool; when we connect to the storage pool, we can see that vdsm is managing only one domain:

Thread-126566::INFO::2015-11-17 13:07:12,867::logUtils::44::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID=u'00000002-0002-0002-0002-0000000001c6', hostID=1, msdUUID=u'34da7c89-06e3-49a3-9a44-912c0ff0b6e7', masterVersion=1, domainsMap={u'34da7c89-06e3-49a3-9a44-912c0ff0b6e7': u'active'}, options=None)

Domain 34da7c89-06e3-49a3-9a44-912c0ff0b6e7 is owned by vdsm. When we disconnect from the storage pool, vdsm stops monitoring this domain:

Thread-126682::INFO::2015-11-17 13:08:37,068::domainMonitor::115::Storage.DomainMonitor::(stopMonitoring) Stop monitoring 34da7c89-06e3-49a3-9a44-912c0ff0b6e7
Thread-126570::DEBUG::2015-11-17 13:08:37,068::domainMonitor::192::Storage.DomainMonitorThread::(_monitorLoop) Stopping domain monitor for 34da7c89-06e3-49a3-9a44-912c0ff0b6e7
Thread-126570::INFO::2015-11-17 13:08:37,069::clusterlock::242::Storage.SANLock::(releaseHostId) Releasing host id for domain 34da7c89-06e3-49a3-9a44-912c0ff0b6e7 (id: 1)

I suspect that this domain is owned by the hosted engine agent. Vdsm supports monitoring external storage domains. External programs can start monitoring these domains using the startMonitoringDomain verb; when they are done, they must invoke the stopMonitoringDomain verb to stop monitoring the domain. See https://github.com/oVirt/vdsm/blob/master/lib/api/vdsmapi-schema.json#L2242 (a sketch of these calls follows the grep results below).

We need older log files showing when vdsm started to monitor this domain. Artyom, can you find and attach the log with this line:

Run and protect: startMonitoringDomain

Created attachment 1097510 [details]
vdsm.log.83
I found matching lines in one of the vdsm logs:
grep "Run and protect: startMonitoringDomain" vdsm.log*
vdsm.log.83:Thread-77::INFO::2015-11-15 11:17:12,286::logUtils::44::dispatcher::(wrapper) Run and protect: startMonitoringDomain(sdUUID='5a6cc8b0-07f7-45c3-8bb5-7ba630170130', hostID=1, options=None)
vdsm.log.83:Thread-77::INFO::2015-11-15 11:17:12,291::logUtils::47::dispatcher::(wrapper) Run and protect: startMonitoringDomain, Return response: None
vdsm.log.83:Thread-82::INFO::2015-11-15 09:41:24,323::logUtils::44::dispatcher::(wrapper) Run and protect: startMonitoringDomain(sdUUID='5a6cc8b0-07f7-45c3-8bb5-7ba630170130', hostID=1, options=None)
vdsm.log.83:Thread-82::INFO::2015-11-15 09:41:24,327::logUtils::47::dispatcher::(wrapper) Run and protect: startMonitoringDomain, Return response: None
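For completeness, the start/stop pairing described above looks roughly like this from a shell. This is a hedged sketch: whether vdsClient exposes these schema verbs under exactly these names is an assumption, and the UUID is the one from this bug.

# Assumption: vdsClient exposes the startMonitoringDomain/stopMonitoringDomain
# verbs from the schema linked above; '-s 0' talks to the local vdsm over SSL.
SD_UUID=5a6cc8b0-07f7-45c3-8bb5-7ba630170130
HOST_ID=1

vdsClient -s 0 startMonitoringDomain "$SD_UUID" "$HOST_ID"
# ... the caller owns the monitor and the host id on the domain's lockspace ...
vdsClient -s 0 stopMonitoringDomain "$SD_UUID"

# Every start should eventually have a matching stop; compare the counts:
grep -h "Run and protect: startMonitoringDomain(" /var/log/vdsm/vdsm.log* | wc -l
grep -h "Run and protect: stopMonitoringDomain(" /var/log/vdsm/vdsm.log* | wc -l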
(In reply to Artyom from comment #16)
> Created attachment 1097510 [details]
> vdsm.log.83
> vdsm.log.83:Thread-82::INFO::2015-11-15 09:41:24,323::logUtils::44::dispatcher::(wrapper) Run and protect: startMonitoringDomain(sdUUID='5a6cc8b0-07f7-45c3-8bb5-7ba630170130', hostID=1, options=None)
> vdsm.log.83:Thread-82::INFO::2015-11-15 09:41:24,327::logUtils::47::dispatcher::(wrapper) Run and protect: startMonitoringDomain, Return response: None

Artyom, is this the time when hosted engine was installed? We must find the program that started monitoring - it owns this domain and must stop monitoring it when it is not needed.

I believe yes, because it is the oldest one. If you need some additional information, ping me and I will provide you all you need or will give you access to my host.

Simone, is it possible that hosted engine setup starts monitoring the hosted engine vm, imports the vm into the engine, and does not stop monitoring the domain after import?

On upgrade you don't re-run hosted-engine-setup at all.

The agent is calling startMonitoringDomain, it seems. But it always needs that unless it is being killed... what happens to domain monitoring when vdsm dies first? Is it restored on VDSM restart?

(In reply to Martin Sivák from comment #21)
> The agent is calling startMonitoringDomain it seems. But it always needs
> that unless it is being killed.. what happens to domain monitoring when vdsm
> dies first? Is it restored on VDSM restart?

External domains are owned by the caller; vdsm does not manage them. If you start monitoring you should stop it, and if vdsm restarts you should start monitoring again. This verb is also deprecated; it seems like a temporary solution added for hosted engine. Once the hosted engine is imported into the DC, hosted engine can stop monitoring this domain. Note that stopping monitoring will release the host id on that domain, which may cause failures in the imported domain until it acquires the host id again.

I think we were using it as there was no other way to make sure vdsm already knows about the storage. Hosted engine storage was supposed to be mounted by vdsm (prepareImage), but it was probably a bit complicated without a SD. If all the hosts in the data center go down at the same time due to a power outage, we still have to properly handle the distributed process of starting the engine.

The attached patch does not fix this issue, but it allows the hosted engine agent to stop monitoring the hosted engine storage domain when entering local maintenance in a race-free way. Martin, what do you think?

QE reports this is a test blocker for them; moving back to 3.6.1. Upgrade should work.

Nir, doesn't your last series of patches fix this issue too?

(In reply to Allon Mureinik from comment #27)
> Nir, doesn't your last series of patches fix this issue too?

No, but now the hosted engine agent can safely stop monitoring the hosted engine storage domain, which will release the host id and allow an upgrade.

What is the status of this bug?

Hi Martin, I see that hosts under maintenance still have one sanlock lockspace; is this the desired behavior?
Before maintenance:

# sanlock client status
daemon 162c27d5-da7c-4cfc-a6e2-0e49eaf4da2c.master-vds
p -1 helper
p -1 listener
p -1 status
s hosted-engine:2:/var/run/vdsm/storage/7c5171bf-bf2b-46cd-b75a-41bbc3a969ba/a5e71225-71cd-43a8-8f6b-8cc68c9537a3/01b08932-0c39-4db6-a190-6a4577a26853:0
s 7c5171bf-bf2b-46cd-b75a-41bbc3a969ba:2:/rhev/data-center/mnt/10.35.64.11\:_vol_RHEV_Virt_alukiano__HE__upgrade/7c5171bf-bf2b-46cd-b75a-41bbc3a969ba/dom_md/ids:0 ADD
s cec30bfe-9640-4da6-94e9-3118f13d0af0:2:/rhev/data-center/mnt/10.35.64.11\:_vol_RHEV_Virt_alukiano__HE__upgrade__nfs__0/cec30bfe-9640-4da6-94e9-3118f13d0af0/dom_md/ids:0 ADD

After maintenance:

# sanlock client status
daemon 162c27d5-da7c-4cfc-a6e2-0e49eaf4da2c.master-vds
p -1 helper
p -1 listener
p -1 status
s hosted-engine:2:/var/run/vdsm/storage/7c5171bf-bf2b-46cd-b75a-41bbc3a969ba/a5e71225-71cd-43a8-8f6b-8cc68c9537a3/01b08932-0c39-4db6-a190-6a4577a26853:0

Yes, this is the expected behaviour. The hosted engine agent has to be stopped to release its lock. It needs it to publish statuses to other nodes, and it does that even in maintenance mode.

Ok, thanks Martin.

Verified on ovirt-hosted-engine-ha-1.3.3.3-1.el7ev.noarch.

After maintenance:

# sanlock client status
daemon 162c27d5-da7c-4cfc-a6e2-0e49eaf4da2c.master-vds
p -1 helper
p -1 listener
p -1 status
s hosted-engine:2:/var/run/vdsm/storage/7c5171bf-bf2b-46cd-b75a-41bbc3a969ba/a5e71225-71cd-43a8-8f6b-8cc68c9537a3/01b08932-0c39-4db6-a190-6a4577a26853:0

After stopping ovirt-ha-agent:

# sanlock client status
daemon 162c27d5-da7c-4cfc-a6e2-0e49eaf4da2c.master-vds
p -1 helper
p -1 listener
p -1 status

According to the verification status and target milestone, this issue should be fixed in oVirt 3.6.1. Closing as CURRENTRELEASE.
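Given the verified behaviour, the upgrade path on a host boils down to: enter maintenance, stop the agent, confirm sanlock holds no lockspaces, then upgrade. A minimal sketch of that final check (the service name ovirt-ha-agent is the one used in the verification above; the yum invocation is illustrative):

#!/bin/bash
# Pre-upgrade guard: refuse to update sanlock while lockspaces are registered.
systemctl stop ovirt-ha-agent   # releases the hosted-engine lockspace

if sanlock client status | grep -q '^s '; then
    echo "sanlock still holds lockspaces; aborting upgrade" >&2
    exit 1
fi
yum update -y sanlock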