Description of problem:
Following a reboot of an all-in-one oVirt 3.1 + F17 installation, NFS-based domains appear green in the storage view, but their associated NFS shares are not mounted. The engine eventually polls these domains, finds that they're inaccessible, and they turn from green to red. The same behavior occurs with an engine+vdsm install that uses an NFS data domain (rather than the default AIO local domain), but there the NFS domains start out red, because the master NFS data domain is never mounted.

Putting the engine's own host into maintenance mode and activating it again causes the NFS shares to be mounted correctly, as does manually activating the domains through the web UI. I also tested this with an oVirt Node and with a separate minimal Fedora+vdsm host, and found that the issue only occurs on engine+vdsm hosts.

I was able to work around the issue by adding "After=vdsmd.service" to the systemd service file for "proc-fs-nfsd.mount". I don't know whether this is a reasonable tweak for the AIO installer. This is an annoying issue that's likely to frustrate new users trying out oVirt.

Version-Release number of selected component (if applicable):
oVirt 3.1.0-4
Fedora 17 x86_64

How reproducible:

Steps to Reproduce:
1. Install oVirt 3.1 on F17 via AIO.
2. Reboot.

Actual results:
The NFS domains appear green, but are not mounted.

Expected results:
NFS domains ought to be mounted, and if an NFS domain appears green, it ought actually to be up.

Additional info:
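For reference, the tweak described above is usually applied as a systemd drop-in rather than by editing the packaged unit file directly. This is a minimal sketch; the drop-in path follows the standard systemd convention, and the unit names are taken from the comment above (note that a later comment in this bug reports the workaround didn't hold up):

```ini
# /etc/systemd/system/proc-fs-nfsd.mount.d/ovirt-aio.conf
# Delay mounting the nfsd filesystem until vdsmd has been started.
[Unit]
After=vdsmd.service
```

After creating the drop-in, a `systemctl daemon-reload` is needed for systemd to pick it up.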
After more tests, it looks like the systemd workaround I mentioned above doesn't work after all.
I confirm the bug as it has been described. I also confirm that the workaround doesn't work. My system, just in case it helps:

Fedora 17 x86_64
Kernel 3.6.8-2
ovirt-engine-config-3.1.0-4.fc17.noarch
ovirt-image-uploader-3.1.0-0.git9c42c8.fc17.noarch
ovirt-engine-genericapi-3.1.0-4.fc17.noarch
ovirt-engine-setup-plugin-allinone-3.1.0-4.fc17.noarch
ovirt-engine-dbscripts-3.1.0-4.fc17.noarch
ovirt-engine-backend-3.1.0-4.fc17.noarch
ovirt-log-collector-3.1.0-0.git10d719.fc17.noarch
ovirt-engine-setup-3.1.0-4.fc17.noarch
ovirt-engine-tools-common-3.1.0-4.fc17.noarch
ovirt-engine-3.1.0-4.fc17.noarch
ovirt-engine-userportal-3.1.0-4.fc17.noarch
ovirt-engine-webadmin-portal-3.1.0-4.fc17.noarch
ovirt-engine-notification-service-3.1.0-4.fc17.noarch
ovirt-iso-uploader-3.1.0-0.git1841d9.fc17.noarch
ovirt-engine-sdk-3.2.0.2-1.fc17.noarch
ovirt-engine-restapi-3.1.0-4.fc17.noarch
ovirt-release-fedora-4-2.noarch
Created attachment 658935 [details]
vdsm.log that shows how the iso domain is being deactivated

The log starts after having rebooted the all-in-one host machine. At 2012-12-06 18:46:47,111 you can see:

Storage.StoragePool::(deactivateSD) deactivating missing domain 34bbbf29-d93a-44a4-9cb0-85e75ae1dc26

If you need more log files from the same moments, don't hesitate to ask for them.
Looking at the logs I only see the connectStoragePool command:

Thread-21::INFO::2012-12-06 18:41:52,707::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='10b9170f-333d-4462-adf0-2d92b00973da', hostID=1, scsiKey='10b9170f-333d-4462-adf0-2d92b00973da', msdUUID='acb6dd9e-904e-4e3c-9640-1c107cf1f600', masterVersion=1, options=None)

I'd expect to find a connectStorageServer for the NFS domain. Are the uploaded logs incomplete? If not, this could be a setup/engine-side issue.
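A quick way to check a vdsm.log for this symptom is to grep for the two commands. The snippet below is a minimal sketch run against a fabricated one-line excerpt modeled on the entry quoted above; on a real host the log normally lives at /var/log/vdsm/vdsm.log:

```shell
# Fabricated stand-in for a real vdsm.log: only the connectStoragePool
# call is present, matching the symptom described above.
cat > vdsm-sample.log <<'EOF'
Thread-21::INFO::2012-12-06 18:41:52,707::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='10b9170f-333d-4462-adf0-2d92b00973da', hostID=1)
EOF

# A healthy startup should log connectStorageServer before connectStoragePool.
if grep -q 'connectStorageServer' vdsm-sample.log; then
    echo "connectStorageServer present"
else
    echo "connectStorageServer missing"
fi
```

Against a real log, `grep -n 'connectStorage' /var/log/vdsm/vdsm.log` shows both calls with line numbers, so their ordering (or the absence of connectStorageServer) is easy to see.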
Yes, the logs were complete; I only removed lines from before the reboot (about 10 or 15 minutes earlier). Maybe jbrooks can reproduce the bug in his virtual scenario and provide an additional log file, just in case? I'm not going to reboot the server in the short term, and I don't have the full logs from that far back.
I've tested this several times with F18 and the oVirt 3.2 beta, and have found that the storage domains do eventually sort themselves out. Following a reboot of the engine/host machine, the engine incorrectly reports the iso domain as green. After 5-7 minutes, the engine reports the iso domain as red, and after 5-7 more minutes, the iso domain comes up on its own, and the engine reports this. The takeaway: rebooting the engine leads to a state that requires either manual intervention or ~15 minutes of patience to get things up and running again.
To summarize, the steps to reproduce (Jason, correct me if I'm wrong):

1. Connect (at least) one host (the SPM) to the NFS iso domain.
2. Reboot the engine and the host(s) *at the same time*.
3. After the boot, the engine takes up to 15 minutes to recover the situation (NFS iso up and running).

This scenario is easier to hit on an all-in-one setup, since the engine and vdsm live on the same machine and a single machine reboot *always* affects them both.
Moving to ovirt-engine-core as it gets stuck in a loop of:

connectStoragePool
disconnectStoragePool
reconstructMaster
connectStoragePool

without detecting that the storage servers aren't connected anymore.
(In reply to comment #9)
> Moving to ovirt-engine-core as it gets stuck in a loop of:
>
> connectStoragePool
> disconnectStoragePool
> reconstructMaster
> connectStoragePool
>
> without detecting that the storage servers aren't connected anymore.

Liron, please review this.
Jason, can you also attach the engine.log?
Liron, Is there anything we can do here without the engine log?
Liron, please update the bug.
Ayal, it may indeed take some time to recover in that scenario. We should decide whether we want to solve it by adding a way for the engine to detect a vdsm restart that happened while the engine was down (a vdsm generation id), or whether the situation would be improved/solved by other solutions (like managed connections). How do you want to proceed with it?
Created attachment 761470 [details]
requested engine log

Sorry for the long delay in getting this log to you. I recreated the situation: new all-in-one install, data and iso domains up, one iso image in the iso domain, reboot the server. The server comes up showing green iso and data domains before registering that the iso domain isn't actually up and then properly activating it. For this run, the whole process of coming back up took about five minutes -- much better than 15. Maybe the time this takes is variable? I tested w/ F18 with all updates applied and oVirt 3.2.2. I did have to manually start the nfs-server service after install, due, apparently, to bz 974633. Let me know if you'd like me to re-run w/ nightly or something. I tested in a VM (nested KVM FTW), and the test instance is available for re-runs.
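For anyone else hitting the nfs-server issue mentioned above, the manual step amounts to something like the following (illustrative commands requiring root; see bz 974633 for the underlying cause):

```shell
# Start the NFS server now, and enable it so it comes up on future boots.
systemctl start nfs-server.service
systemctl enable nfs-server.service
```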
After discussion it was decided that this will be solved by using the host connection management feature, which means that after a vdsm restart, vdsm will automatically reconnect to the storage server connections it was connected to before the restart.
This is an automated message. This Bugzilla report has been opened on a version which is no longer maintained. Please check whether this bug is still relevant in oVirt 3.5.4. If it's not relevant anymore, please close it (you may use the EOL or CURRENT RELEASE resolution). If it's an RFE, please update the version to 4.0 if still relevant.
This is an automated message. This Bugzilla report has been opened on a version which is no longer maintained. Please check whether this bug is still relevant in oVirt 3.5.4, and reopen if it is still an issue.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days