Bug 881941
| Summary: | [RFE] VDSM connection management | | |
|---|---|---|---|
| Product: | [oVirt] ovirt-engine | Reporter: | Jason Brooks <jbrooks> |
| Component: | RFEs | Assignee: | Liron Aravot <laravot> |
| Status: | CLOSED EOL | QA Contact: | Aharon Canan <acanan> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | medium | | |
| Version: | --- | CC: | acanan, adrian.gibanel, bazulay, bsettle, bugs, iheim, jbrooks, laravot, mgoldboi, rbalakri, sbonazzo |
| Target Milestone: | --- | Keywords: | FutureFeature |
| Target Release: | --- | Flags: | ylavi: ovirt-future?, ylavi: planning_ack?, ylavi: devel_ack?, ylavi: testing_ack? |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | storage | | |
| Fixed In Version: | | Doc Type: | Enhancement |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-10-02 11:02:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 910434 | | |
| Bug Blocks: | | | |
| Attachments: | | | |
Description (Jason Brooks, 2012-11-29 20:09:13 UTC)
After more tests, it looks like that systemd workaround I mentioned above doesn't work, after all.

I confirm the bug as it has been described. I also confirm that the workaround doesn't work. My system, just in case it helps:

Fedora 17 x86_64
Kernel 3.6.8-2
ovirt-engine-config-3.1.0-4.fc17.noarch
ovirt-image-uploader-3.1.0-0.git9c42c8.fc17.noarch
ovirt-engine-genericapi-3.1.0-4.fc17.noarch
ovirt-engine-setup-plugin-allinone-3.1.0-4.fc17.noarch
ovirt-engine-dbscripts-3.1.0-4.fc17.noarch
ovirt-engine-backend-3.1.0-4.fc17.noarch
ovirt-log-collector-3.1.0-0.git10d719.fc17.noarch
ovirt-engine-setup-3.1.0-4.fc17.noarch
ovirt-engine-tools-common-3.1.0-4.fc17.noarch
ovirt-engine-3.1.0-4.fc17.noarch
ovirt-engine-userportal-3.1.0-4.fc17.noarch
ovirt-engine-webadmin-portal-3.1.0-4.fc17.noarch
ovirt-engine-notification-service-3.1.0-4.fc17.noarch
ovirt-iso-uploader-3.1.0-0.git1841d9.fc17.noarch
ovirt-engine-sdk-3.2.0.2-1.fc17.noarch
ovirt-engine-restapi-3.1.0-4.fc17.noarch
ovirt-release-fedora-4-2.noarch

Created attachment 658935 [details]
vdsm.log that shows how iso domain is being deactivated

Log starts after having rebooted the all-in-one host machine. At 2012-12-06 18:46:47,111 you can see:

Storage.StoragePool::(deactivateSD)
deactivating missing domain 34bbbf29-d93a-44a4-9cb0-85e75ae1dc26

If you need more log files from the same moments, don't hesitate to ask for them.
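Much of the triage in the comments that follow amounts to scanning vdsm.log for a handful of storage lifecycle calls (connectStorageServer, connectStoragePool, disconnectStoragePool, reconstructMaster, deactivateSD). Below is a minimal sketch of that kind of scan; the default log path and the event list are assumptions taken from the excerpts quoted in this report, not part of any official oVirt or VDSM tool:

```python
# Hypothetical helper: skim a copy of vdsm.log for the storage lifecycle
# events discussed in this report. The event names and default path are
# assumptions based on the excerpts above, not an official interface.
import re
import sys

EVENTS = ("connectStorageServer", "connectStoragePool",
          "disconnectStoragePool", "reconstructMaster", "deactivateSD")
PATTERN = re.compile("|".join(EVENTS))

def storage_events(path="vdsm.log"):
    """Yield (line_number, line) for every line mentioning a storage event."""
    with open(path, errors="replace") as log:
        for number, line in enumerate(log, start=1):
            if PATTERN.search(line):
                yield number, line.rstrip()

if __name__ == "__main__":
    log_path = sys.argv[1] if len(sys.argv) > 1 else "vdsm.log"
    for number, line in storage_events(log_path):
        print(f"{number}: {line}")
```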
Looking at the logs I only see the connectStoragePool command:

Thread-21::INFO::2012-12-06 18:41:52,707::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='10b9170f-333d-4462-adf0-2d92b00973da', hostID=1, scsiKey='10b9170f-333d-4462-adf0-2d92b00973da', msdUUID='acb6dd9e-904e-4e3c-9640-1c107cf1f600', masterVersion=1, options=None)

I'd expect to find a connectStorageServer for the nfs domain. Are the uploaded logs incomplete? If not, this could be a setup/engine side issue.

Yes, the logs are complete. I remember having removed lines from before the reboot (about 10 or 15 minutes before). Maybe jbrooks can reproduce the bug in his virtual scenario and provide an additional log file, just in case? I'm not going to reboot the server in the short term and I don't have the full logs from so long ago.

I've tested this several times with F18 and the oVirt 3.2 beta, and have found that the storage domains do eventually sort themselves out. Following a reboot of the engine/host machine, the engine incorrectly reports the iso domain as green. After 5-7 minutes, the engine reports the iso domain as red, and after 5-7 more minutes, the iso domain comes up on its own, and the engine reports this. The takeaway should be: rebooting the engine will lead to a state that either requires manual intervention or ~15 minutes of patience to get things up and running again.

To summarize, the steps to reproduce (Jason, correct me if I'm wrong):

1. Connect (at least) one host (the SPM) to the nfs iso domain.
2. Reboot the engine and the host(s) *at the same time*.
3. After the boot, the engine takes up to 15 minutes to recover the situation (nfs iso up and running).

This scenario is more evident (easier to hit) on an all-in-one setup, since the engine and vdsm live on the same machine and a single machine reboot *always* affects them both.

Moving to ovirt-engine-core as it gets stuck in a connectStoragePool / disconnectStoragePool

Moving to ovirt-engine-core as it gets stuck in a loop of:

connectStoragePool
disconnectStoragePool
reconstructMaster
connectStoragePool

without detecting that the storage servers aren't connected anymore.

(In reply to comment #9)
> Moving to ovirt-engine-core as it gets stuck in a loop of:
>
> connectStoragePool
> disconnectStoragePool
> reconstructMaster
> connectStoragePool
>
> without detecting that the storage servers aren't connected anymore.

Liron, please review this.

Jason, can you also attach the engine.log?

Liron, is there anything we can do here without the engine log?

Liron, please update the bug.

Ayal, it indeed may take some time to recover in that scenario. We should decide whether we want to solve it by adding a way for the engine to determine that vdsm restarted while the engine was down (a vdsm generation id), or whether the situation would be improved/solved by other solutions (like managing the connections). How do you want to proceed with it?
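As a rough illustration of the first option mentioned above (a vdsm "generation id" the engine could compare across polls to notice a restart it missed), the check could look like the sketch below. HostMonitor, fetch_generation_id, and recover_storage_connections are hypothetical names for illustration only, not the actual ovirt-engine or vdsm API:

```python
# Illustrative sketch only: an engine-side host monitor that uses a
# "generation id" to notice that vdsm restarted while the engine was down,
# so it can reconnect the storage servers before touching the pool instead
# of looping over connectStoragePool / reconstructMaster.

class HostMonitor:
    def __init__(self, fetch_generation_id, recover_storage_connections):
        self._fetch_generation_id = fetch_generation_id  # hypothetical vdsm query
        self._recover = recover_storage_connections      # re-issue connectStorageServer, then connectStoragePool
        self._last_seen = None                            # would be persisted across engine restarts

    def poll(self):
        current = self._fetch_generation_id()
        if self._last_seen is not None and current != self._last_seen:
            # vdsm restarted since we last spoke to it, so its storage server
            # connections are gone; reconnect them before any pool operation.
            self._recover()
        self._last_seen = current
```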
Created attachment 761470 [details]
requested engine log

Sorry for the long time in getting this log to you. I recreated the situation: new all-in-one install, data and iso domains up, one iso image in the iso domain, reboot the server; the server comes up showing green iso and data domains before registering that the iso domain isn't actually up and then properly activating it.

For this run, that whole process of coming back up took about five minutes -- much better than 15. Maybe the time this takes is variable? I tested with F18, all updates applied, and oVirt 3.2.2. I did have to manually start the nfs-server service after install, due, apparently, to bz 974633. Let me know if you'd like me to re-run with nightly or something. I tested in a VM (nested KVM FTW), and the test instance is available for re-runs.

After discussion it was decided that this would be solved by using the host connection management feature, which means that after a vdsm restart, vdsm will automatically reconnect to the storage server connections it was connected to before the restart (a rough sketch of this idea appears at the end of this report).

This is an automated message. This Bugzilla report has been opened on a version which is not maintained anymore. Please check if this bug is still relevant in oVirt 3.5.4. If it's not relevant anymore, please close it (you may use the EOL or CURRENT RELEASE resolution). If it's an RFE, please update the version to 4.0 if still relevant.

This is an automated message. This Bugzilla report has been opened on a version which is not maintained anymore. Please check if this bug is still relevant in oVirt 3.5.4 and reopen if still an issue.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
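The resolution above (reconnecting to previously connected storage servers after a vdsm restart) can be pictured roughly as follows. This is only a sketch of the idea under assumed names; the state-file path, JSON format, and connect() callback are illustrative and this is not the actual vdsm connection-management code:

```python
# Rough sketch of the chosen approach: persist the storage server connections
# that have been established so they can be re-established after a restart.
# Paths and formats here are assumptions for illustration only.
import json
import os

STATE_FILE = "/var/lib/example/storage_connections.json"   # hypothetical path

def remember_connection(conn_id, conn_info):
    """Record a connection (e.g. an NFS export) so it survives a restart."""
    state = {}
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            state = json.load(f)
    state[conn_id] = conn_info
    os.makedirs(os.path.dirname(STATE_FILE), exist_ok=True)
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)

def restore_connections(connect):
    """On startup, reconnect everything recorded before the restart."""
    if not os.path.exists(STATE_FILE):
        return
    with open(STATE_FILE) as f:
        state = json.load(f)
    for conn_id, conn_info in state.items():
        connect(conn_id, conn_info)   # e.g. mount the NFS ISO domain again
```

The point of restoring the server connections first is that the pool-level calls (connectStoragePool, reconstructMaster) only succeed once the underlying mounts are back, which is exactly the gap described in the loop reported earlier in this bug.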