Bug 881941 - [RFE] VDSM connection management
Summary: [RFE] VDSM connection management
Keywords:
Status: CLOSED EOL
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: RFEs
Version: ---
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Liron Aravot
QA Contact: Aharon Canan
URL:
Whiteboard: storage
Depends On: 910434
Blocks:
 
Reported: 2012-11-29 20:09 UTC by Jason Brooks
Modified: 2023-09-14 01:39 UTC (History)
CC List: 11 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2015-10-02 11:02:53 UTC
oVirt Team: ---
Embargoed:
ylavi: ovirt-future?
ylavi: planning_ack?
ylavi: devel_ack?
ylavi: testing_ack?


Attachments
vdsm.log that shows how iso domain is being deactivated (28.66 KB, application/x-gzip), 2012-12-06 18:05 UTC, Adrian Gibanel Btactic
requested engine log (6.75 KB, application/gzip), 2013-06-15 01:16 UTC, Jason Brooks

Description Jason Brooks 2012-11-29 20:09:13 UTC
Description of problem:

Following a reboot of an all-in-one oVirt 3.1 + F17 installation, NFS-based domains appear green in the storage view, but their associated NFS shares are not mounted. The engine eventually polls these domains, finds that they're inaccessible, and they turn from green to red.

The same behavior occurs with an engine+vdsm install that uses an NFS data domain (rather than the default AIO local domain), but the NFS domains start out "red" because the master NFS data domain is never mounted.

Putting the engine's own host into maintenance mode and activating it again causes the NFS shares to be mounted correctly, as does manually activating the domains through the web ui.

I tested this as well with an oVirt Node host and a separate minimal Fedora+vdsm host, and found that the issue only occurs on engine+vdsm hosts.

I was able to work around the issue by adding "After=vdsmd.service" to the systemd unit file for "proc-fs-nfsd.mount". I don't know whether that is a reasonable tweak for the AIO installer.
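
For reference, that tweak boils down to adding one ordering directive to the [Unit] section of the proc-fs-nfsd.mount unit; this is only a sketch of the attempted workaround, not a recommended fix:

    [Unit]
    # Added line: only start this mount after vdsmd has started.
    After=vdsmd.service

A "systemctl daemon-reload" is needed after editing the unit for the change to take effect.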

This is an annoying issue that's likely to frustrate new users trying out oVirt.

Version-Release number of selected component (if applicable):

oVirt 3.1.0-4
Fedora 17 x86_64

How reproducible:


Steps to Reproduce:
1. Install oVirt 3.1 on F17 via AIO.
2. Reboot
3. 
  
Actual results:

The NFS domains appear green, but they are not actually mounted.

Expected results:

NFS domains ought to be mounted, and if an NFS domain appears green, it ought actually to be up.

Additional info:

Comment 1 Jason Brooks 2012-12-02 18:32:22 UTC
After more tests, it looks like the systemd workaround I mentioned above doesn't work after all.

Comment 2 Adrian Gibanel Btactic 2012-12-02 18:58:17 UTC
I confirm the bug as it has been described.
I also confirm that the workaround doesn't work.

My system, just in case it helps:
Fedora 17 x86_64, kernel 3.6.8-2

ovirt-engine-config-3.1.0-4.fc17.noarch
ovirt-image-uploader-3.1.0-0.git9c42c8.fc17.noarch
ovirt-engine-genericapi-3.1.0-4.fc17.noarch
ovirt-engine-setup-plugin-allinone-3.1.0-4.fc17.noarch
ovirt-engine-dbscripts-3.1.0-4.fc17.noarch
ovirt-engine-backend-3.1.0-4.fc17.noarch
ovirt-log-collector-3.1.0-0.git10d719.fc17.noarch
ovirt-engine-setup-3.1.0-4.fc17.noarch
ovirt-engine-tools-common-3.1.0-4.fc17.noarch
ovirt-engine-3.1.0-4.fc17.noarch
ovirt-engine-userportal-3.1.0-4.fc17.noarch
ovirt-engine-webadmin-portal-3.1.0-4.fc17.noarch
ovirt-engine-notification-service-3.1.0-4.fc17.noarch
ovirt-iso-uploader-3.1.0-0.git1841d9.fc17.noarch
ovirt-engine-sdk-3.2.0.2-1.fc17.noarch
ovirt-engine-restapi-3.1.0-4.fc17.noarch
ovirt-release-fedora-4-2.noarch

Comment 3 Adrian Gibanel Btactic 2012-12-06 18:05:27 UTC
Created attachment 658935 [details]
vdsm.log that shows how iso domain is being deactivated

Log starts after having rebooted the all-in-one host machine.

At 2012-12-06 18:46:47,111 you can see:

Storage.StoragePool::(deactivateSD)
deactivating missing domain 34bbbf29-d93a-44a4-9cb0-85e75ae1dc26

If you need more log files from the same time frame, don't hesitate to ask for them.

Comment 4 Federico Simoncelli 2013-01-02 16:21:13 UTC
Looking at the logs I only see the connectStoragePool command:

Thread-21::INFO::2012-12-06 18:41:52,707::logUtils::37::dispatcher::(wrapper) Run and protect: connectStoragePool(spUUID='10b9170f-333d-4462-adf0-2d92b00973da', hostID=1, scsiKey='10b9170f-333d-4462-adf0-2d92b00973da', msdUUID='acb6dd9e-904e-4e3c-9640-1c107cf1f600', masterVersion=1, options=None)

I'd expect to find a connectStorageServer for the nfs domain.
Are the uploaded logs incomplete?
If not, this could be a setup/engine side issue.
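
For illustration, the call order expected here is roughly the following (a Python sketch only, not engine or vdsm source; both verbs exist in the vdsm API, but the exact arguments are assumptions modeled on the connectStoragePool call quoted above):

    NFS_DOMAIN_TYPE = 1  # assumed value for the NFS storage type in this sketch

    def activate_host_storage(vdsm_proxy, sp_uuid, msd_uuid, master_version, nfs_conn_list):
        # 1. Mount the NFS export(s) first; without this the master storage
        #    domain is simply not there to be found.
        vdsm_proxy.connectStorageServer(NFS_DOMAIN_TYPE, sp_uuid, nfs_conn_list)
        # 2. Only then connect the pool whose master domain lives on that storage.
        vdsm_proxy.connectStoragePool(sp_uuid, 1, sp_uuid, msd_uuid, master_version)

The uploaded log shows step 2 without step 1, which would explain why the domains cannot come up until something reconnects the storage.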

Comment 5 Adrian Gibanel Btactic 2013-01-02 17:03:50 UTC
Yes, the logs were complete; I remember having removed lines from before the reboot (about 10 or 15 minutes before it).

Maybe jbrooks can reproduce the bug in his virtual scenario and provide an additional log file, just in case?

I'm not going to reboot the server in the short term, and I don't have the full logs from that long ago.

Comment 6 Jason Brooks 2013-02-12 18:55:57 UTC
I've tested this several times with F18 and the oVirt 3.2 beta, and have found that the storage domains do eventually sort themselves out. 

Following a reboot of the engine/host machine, the engine incorrectly reports the iso domain as green. After 5-7 minutes, the engine reports the iso domain as red, and after 5-7 more minutes, the iso domain comes up on its own, and the engine reports this. 

The takeaway should be -- rebooting the engine will lead to a state that either requires manual intervention or ~15 minutes of patience to get things up and running again.

Comment 7 Federico Simoncelli 2013-02-12 23:17:26 UTC
To summarize.

Steps to Reproduce (Jason correct me if I'm wrong):
1. connect (at least) one host (spm) to the nfs iso domain
2. reboot *at the same time* the engine and the host(s)
3. after the boot the engine takes up to 15 minutes to recover the situation (nfs iso up and running)

This scenario is more evident (easier to hit) on an all-in-one setup, since the engine and vdsm live on the same machine and a single machine reboot *always* affects them both.

Comment 8 Federico Simoncelli 2013-02-21 11:18:04 UTC
Moving to ovirt-engine-core as it gets stuck in a 

connectStoragePool
disconnectStoragePool

Comment 9 Federico Simoncelli 2013-02-21 11:19:54 UTC
Moving to ovirt-engine-core as it gets stuck in a loop of:

connectStoragePool
disconnectStoragePool
reconstructMaster
connectStoragePool

without detecting that the storage servers aren't connected anymore.
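
In other words, the recovery loop lacks a check along these lines before it retries (an illustrative Python sketch with hypothetical names, not oVirt engine code):

    def recover_pool(engine, host, pool):
        while not pool.is_up():
            # The step the observed loop is missing: after a vdsm restart the
            # storage server connections are gone and must be re-established
            # before connectStoragePool/reconstructMaster can succeed.
            if not engine.storage_servers_connected(host, pool):
                engine.connect_storage_servers(host, pool)
            engine.connect_storage_pool(host, pool)
            if not pool.is_up():
                engine.reconstruct_master(host, pool)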

Comment 10 Ayal Baron 2013-02-24 09:27:58 UTC
(In reply to comment #9)
> Moving to ovirt-engine-core as it gets stuck in a loop of:
> 
> connectStoragePool
> disconnectStoragePool
> reconstructMaster
> connectStoragePool
> 
> without detecting that the storage servers aren't connected anymore.

Liron, please review this.

Comment 11 Liron Aravot 2013-02-24 09:40:15 UTC
Jason, can you also attach the engine.log?

Comment 12 Ayal Baron 2013-03-05 21:50:22 UTC
Liron,

Is there anything we can do here without the engine log?

Comment 13 Ayal Baron 2013-05-08 06:42:52 UTC
Liron, please update the bug.

Comment 14 Liron Aravot 2013-06-10 15:26:09 UTC
Ayal, it indeed may take some time to recover in that scenario.
We should decide whether we want to solve it by adding a way for the engine to detect a vdsm restart that happened while the engine was down (a vdsm generation id), or whether the situation would be improved/solved by other solutions (like the connection management feature) - how do you want to proceed?
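
For context, the "vdsm generation id" idea mentioned above would work roughly like this (a sketch with hypothetical names only; no such field or methods exist in the code as shown here): vdsm reports an identifier that changes on every restart, so when the engine comes back up it can tell whether vdsm restarted in the meantime and re-issue the storage server connections.

    def check_host_after_engine_start(engine, host):
        reported = engine.get_generation_id(host)    # value reported by vdsm
        stored = engine.db.load_generation_id(host)  # last value the engine saw
        if reported != stored:
            # vdsm restarted while the engine was down, so its storage server
            # connections are gone and have to be re-established.
            engine.connect_storage_servers(host)
            engine.db.save_generation_id(host, reported)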

Comment 15 Jason Brooks 2013-06-15 01:16:33 UTC
Created attachment 761470 [details]
requested engine log

Sorry for taking so long to get this log to you.

I recreated the situation: new all-in-one install, data and iso domains up, one iso image in the iso domain, then reboot the server. The server comes up showing green iso and data domains before registering that the iso domain isn't actually up and then properly activating it. For this run, the whole process of coming back up took about five minutes -- much better than 15. Maybe the time this takes is variable?

I tested w/ F18 & all updates applied and oVirt 3.2.2. I did have to manually start the nfs-server service after install, due, apparently, to bz 974633.

Let me know if you'd like me to re-run w/ nightly or something. I tested in a VM (nested KVM FTW), and the test instance is available for re-runs.

Comment 16 Liron Aravot 2013-08-12 05:17:57 UTC
After discussion, it was decided that this will be solved by the host connection management feature, which means that after a vdsm restart, vdsm will automatically reconnect to the storage server connections it was connected to before the restart.
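
A minimal sketch of that behaviour, assuming the simplest possible persistence (this is not the actual vdsm implementation; the file path and format are made up for illustration): vdsm records every storage server connection it establishes and replays the list when it starts up again, instead of waiting for the engine to notice and reconnect it.

    import json
    import os

    STATE_FILE = "/var/lib/example/storage_connections.json"  # hypothetical path

    def load_connections():
        if not os.path.exists(STATE_FILE):
            return []
        with open(STATE_FILE) as f:
            return json.load(f)

    def remember_connection(conn_info):
        # Called whenever a storage server connection is successfully made.
        conns = load_connections()
        conns.append(conn_info)
        with open(STATE_FILE, "w") as f:
            json.dump(conns, f)

    def reconnect_on_startup(connect_storage_server):
        # Called once at daemon startup: restore whatever was connected before
        # the restart, so domains do not stay down until the engine intervenes.
        for conn_info in load_connections():
            connect_storage_server(conn_info)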

Comment 18 Sandro Bonazzola 2015-09-04 08:59:29 UTC
This is an automated message.
This Bugzilla report has been opened on a version which is not maintained anymore.
Please check if this bug is still relevant in oVirt 3.5.4.
If it's not relevant anymore, please close it (you may use EOL or CURRENT RELEASE resolution)
If it's an RFE please update the version to 4.0 if still relevant.

Comment 19 Sandro Bonazzola 2015-10-02 11:02:53 UTC
This is an automated message.
This Bugzilla report has been opened on a version which is not maintained anymore.
Please check if this bug is still relevant in oVirt 3.5.4 and reopen if still an issue.

Comment 20 Red Hat Bugzilla 2023-09-14 01:39:11 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

