Description of problem:
Following a reboot of an all-in-one oVirt 3.1 + F17 installation, NFS-based domains appear green in the storage view, but their associated NFS shares are not mounted. The engine eventually polls these domains, finds that they're inaccessible, and they turn from green to red. The same behavior occurs with an engine+vdsm install that uses an NFS data domain (rather than the default AIO local domain), but there the NFS domains start out red, because the master NFS data domain is never mounted.

Putting the engine's own host into maintenance mode and activating it again causes the NFS shares to be mounted correctly, as does manually activating the domains through the web UI. I also tested this with an oVirt Node and with a separate minimal Fedora+vdsm host, and found that the issue only occurs on engine+vdsm hosts.

I was able to work around the issue by adding "After=vdsmd.service" to the systemd service file for "proc-fs-nfsd.mount". I don't know whether this is a reasonable tweak for the AIO installer. This is an annoying issue that's likely to frustrate new users trying out oVirt.

Version-Release number of selected component (if applicable):
oVirt 3.1.0-4
Fedora 17 x86_64

How reproducible:

Steps to Reproduce:
1. Install oVirt 3.1 on F17 via AIO.
2. Reboot.

Actual results:
The NFS domains appear green, but are not mounted.

Expected results:
NFS domains ought to be mounted, and if an NFS domain appears green, it ought actually to be up.

Additional info:
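For reference, the tweak described above is usually applied as a systemd drop-in rather than by editing the packaged unit file directly. This is a minimal sketch; the drop-in path follows the standard systemd convention, and the unit names are taken from the comment above (note that a later comment in this bug reports the workaround didn't hold up):

```ini
# /etc/systemd/system/proc-fs-nfsd.mount.d/ovirt-aio.conf
# Delay mounting the nfsd filesystem until vdsmd has been started.
[Unit]
After=vdsmd.service
```

After creating the drop-in, a `systemctl daemon-reload` is needed for systemd to pick it up.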
After more tests, it looks like the systemd workaround I mentioned above doesn't work after all.
I confirm the bug as it has been described. I also confirm that the workaround doesn't work. My system, just in case it helps:

Fedora 17 x86_64
Kernel 3.6.8-2
ovirt-engine-config-3.1.0-4.fc17.noarch
ovirt-image-uploader-3.1.0-0.git9c42c8.fc17.noarch
ovirt-engine-genericapi-3.1.0-4.fc17.noarch
ovirt-engine-setup-plugin-allinone-3.1.0-4.fc17.noarch
ovirt-engine-dbscripts-3.1.0-4.fc17.noarch
ovirt-engine-backend-3.1.0-4.fc17.noarch
ovirt-log-collector-3.1.0-0.git10d719.fc17.noarch
ovirt-engine-setup-3.1.0-4.fc17.noarch
ovirt-engine-tools-common-3.1.0-4.fc17.noarch
ovirt-engine-3.1.0-4.fc17.noarch
ovirt-engine-userportal-3.1.0-4.fc17.noarch
ovirt-engine-webadmin-portal-3.1.0-4.fc17.noarch
ovirt-engine-notification-service-3.1.0-4.fc17.noarch
ovirt-iso-uploader-3.1.0-0.git1841d9.fc17.noarch
ovirt-engine-sdk-3.2.0.2-1.fc17.noarch
ovirt-engine-restapi-3.1.0-4.fc17.noarch
ovirt-release-fedora-4-2.noarch
Created attachment 658935 [details]
vdsm.log that shows how the iso domain is being deactivated

The log starts after having rebooted the all-in-one host machine. At 2012-12-06 18:46:47,111 you can see:

Storage.StoragePool::(deactivateSD) deactivating missing domain 34bbbf29-d93a-44a4-9cb0-85e75ae1dc26

If you need more log files from the same moments, don't hesitate to ask for them.
Looking at the logs I only see the connectStoragePool command:

Thread-21::INFO::2012-12-06 18:41:52,707::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='10b9170f-333d-4462-adf0-2d92b00973da', hostID=1, scsiKey='10b9170f-333d-4462-adf0-2d92b00973da', msdUUID='acb6dd9e-904e-4e3c-9640-1c107cf1f600', masterVersion=1, options=None)

I'd expect to find a connectStorageServer for the NFS domain. Are the uploaded logs incomplete? If not, this could be a setup/engine-side issue.
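A quick way to check a vdsm.log for this symptom is to grep for the two commands. The snippet below is a minimal sketch run against a fabricated one-line excerpt modeled on the entry quoted above; on a real host the log normally lives at /var/log/vdsm/vdsm.log:

```shell
# Fabricated stand-in for a real vdsm.log: only the connectStoragePool
# call is present, matching the symptom described above.
cat > vdsm-sample.log <<'EOF'
Thread-21::INFO::2012-12-06 18:41:52,707::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='10b9170f-333d-4462-adf0-2d92b00973da', hostID=1)
EOF

# A healthy startup should log connectStorageServer before connectStoragePool.
if grep -q 'connectStorageServer' vdsm-sample.log; then
    echo "connectStorageServer present"
else
    echo "connectStorageServer missing"
fi
```

Against a real log, `grep -n 'connectStorage' /var/log/vdsm/vdsm.log` shows both calls with line numbers, so their ordering (or the absence of connectStorageServer) is easy to see.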
Yes, the logs were complete; I only removed lines from before the reboot (about 10 or 15 minutes earlier). Maybe jbrooks can reproduce the bug in his virtual scenario and provide an additional log file, just in case? I'm not going to reboot the server in the short term, and I don't have the full logs from that far back.
I've tested this several times with F18 and the oVirt 3.2 beta, and have found that the storage domains do eventually sort themselves out. Following a reboot of the engine/host machine, the engine incorrectly reports the iso domain as green. After 5-7 minutes, the engine reports the iso domain as red, and after 5-7 more minutes, the iso domain comes up on its own, and the engine reports this. The takeaway: rebooting the engine leads to a state that requires either manual intervention or ~15 minutes of patience to get things up and running again.
To summarize, the steps to reproduce (Jason, correct me if I'm wrong):

1. Connect (at least) one host (the SPM) to the NFS iso domain.
2. Reboot the engine and the host(s) *at the same time*.
3. After the boot, the engine takes up to 15 minutes to recover the situation (NFS iso up and running).

This scenario is easier to hit on an all-in-one setup, since the engine and vdsm live on the same machine and a single machine reboot *always* affects them both.
Moving to ovirt-engine-core as it gets stuck in a loop of:

connectStoragePool
disconnectStoragePool
reconstructMaster
connectStoragePool

without detecting that the storage servers aren't connected anymore.
(In reply to comment #9)
> Moving to ovirt-engine-core as it gets stuck in a loop of:
>
> connectStoragePool
> disconnectStoragePool
> reconstructMaster
> connectStoragePool
>
> without detecting that the storage servers aren't connected anymore.

Liron, please review this.
Jason, can you also attach the engine.log?
Liron, Is there anything we can do here without the engine log?
Liron, please update the bug.
Ayal, it may indeed take some time to recover in that scenario. We should decide whether we want to solve it by adding a way for the engine to detect a vdsm restart that happened while the engine was down (a vdsm generation id), or whether the situation would be improved/solved by other solutions (like managed connections). How do you want to proceed with it?
Created attachment 761470 [details]
requested engine log

Sorry for the long delay in getting this log to you. I recreated the situation: new all-in-one install, data and iso domains up, one iso image in the iso domain, reboot the server. The server comes up showing green iso and data domains before registering that the iso domain isn't actually up and then properly activating it. For this run, the whole process of coming back up took about five minutes -- much better than 15. Maybe the time this takes is variable? I tested w/ F18 with all updates applied and oVirt 3.2.2. I did have to manually start the nfs-server service after install, due, apparently, to bz 974633. Let me know if you'd like me to re-run w/ nightly or something. I tested in a VM (nested KVM FTW), and the test instance is available for re-runs.
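For anyone else hitting the nfs-server issue mentioned above, the manual step amounts to something like the following (illustrative commands requiring root; see bz 974633 for the underlying cause):

```shell
# Start the NFS server now, and enable it so it comes up on future boots.
systemctl start nfs-server.service
systemctl enable nfs-server.service
```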
After discussion it was decided that this will be solved by using the host connection management feature, which means that after a vdsm restart, vdsm will automatically reconnect to the storage server connections it was connected to before the restart.
This is an automated message. This Bugzilla report has been opened on a version which is no longer maintained. Please check whether this bug is still relevant in oVirt 3.5.4. If it's not relevant anymore, please close it (you may use the EOL or CURRENT RELEASE resolution). If it's an RFE, please update the version to 4.0 if still relevant.
This is an automated message. This Bugzilla report has been opened on a version which is no longer maintained. Please check whether this bug is still relevant in oVirt 3.5.4, and reopen if it is still an issue.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days