Bug 1159314

Summary: Error accessing ha metadata while deploying second host
Product: [Retired] oVirt
Component: vdsm
Status: CLOSED DUPLICATE
Severity: urgent
Priority: unspecified
Version: 3.5
Reporter: Sandro Bonazzola <sbonazzo>
Assignee: Doron Fediuck <dfediuck>
QA Contact: Nikolai Sednev <nsednev>
CC: amureini, bazulay, bugs, dfediuck, didi, ecohen, gklein, istein, lsurette, lveyde, mgoldboi, nsoffer, rbalakri, sbonazzo, stefano.stagnaro, stirabos, yeylon
Target Milestone: ---
Target Release: 3.6.0
Keywords: Regression, TestCaseNeeded
Flags: amureini: needinfo+, istein: needinfo+
Hardware: Unspecified
OS: Unspecified
Whiteboard: sla
Doc Type: Bug Fix
Story Points: ---
Last Closed: 2015-09-07 12:00:23 UTC
Type: Bug
oVirt Team: SLA
Bug Depends On: 1215967, 1226670, 1271272
Description — Sandro Bonazzola, 2014-10-31 13:35:08 UTC
It happens on master; not verified on the 3.5 branch yet.

The patch causing this is not in the 3.5 branch.

(In reply to Jiri Moskovcak from comment #2)
> The patch causing this is not in 3.5 branch

I was wrong about the patch I thought was causing this. So the reason stays unclear and we need a better reproducer. Sandro, please update if reproducible. If not, please close this issue.

I'll try to reproduce tomorrow.

I can reproduce it on CentOS 6.6:

[ ERROR ] Failed to execute stage 'Setup validation': [Errno 2] No such file or directory: '/rhev/data-center/mnt/192.168.1.102:_NotBackedUp_hosted/ac5b8cd3-95ed-4aac-b60d-f57a46351fdb/ha_agent/hosted-engine.metadata'

On the NFS server (F20):

# ls -lZd /NotBackedUp/hosted/ac5b8cd3-95ed-4aac-b60d-f57a46351fdb/ha_agent
drwxrwx---. vdsm kvm system_u:object_r:default_t:s0 /NotBackedUp/hosted/ac5b8cd3-95ed-4aac-b60d-f57a46351fdb/ha_agent
# getenforce
Permissive
# lsattr -d /NotBackedUp/hosted/ac5b8cd3-95ed-4aac-b60d-f57a46351fdb/ha_agent
-------------e-- /NotBackedUp/hosted/ac5b8cd3-95ed-4aac-b60d-f57a46351fdb/ha_agent
# ll /NotBackedUp/hosted/ac5b8cd3-95ed-4aac-b60d-f57a46351fdb/ha_agent
totale 8
lrwxrwxrwx. 1 vdsm kvm 132 28 nov 13.48 hosted-engine.lockspace -> /var/run/vdsm/storage/ac5b8cd3-95ed-4aac-b60d-f57a46351fdb/0ee73836-2f78-462b-aa59-659157f975fc/5b6b0cf1-8349-4750-ac03-84d16295abfe
lrwxrwxrwx. 1 vdsm kvm 132 28 nov 13.48 hosted-engine.metadata -> /var/run/vdsm/storage/ac5b8cd3-95ed-4aac-b60d-f57a46351fdb/cb98ad78-3712-4065-ac75-848d57f3d392/c9b5ec82-90ae-42f4-9b7a-f8726ac0467d

On the host (CentOS 6.6):

# ll /rhev/data-center/mnt/192.168.1.102\:_NotBackedUp_hosted/ac5b8cd3-95ed-4aac-b60d-f57a46351fdb/ha_agent/
ls: cannot open directory /rhev/data-center/mnt/192.168.1.102:_NotBackedUp_hosted/ac5b8cd3-95ed-4aac-b60d-f57a46351fdb/ha_agent/: Permission denied
# ls -ldZ /rhev/data-center/mnt/192.168.1.102\:_NotBackedUp_hosted/ac5b8cd3-95ed-4aac-b60d-f57a46351fdb/ha_agent/
drwxrwx---. vdsm kvm system_u:object_r:nfs_t:s0 /rhev/data-center/mnt/192.168.1.102:_NotBackedUp_hosted/ac5b8cd3-95ed-4aac-b60d-f57a46351fdb/ha_agent/
# getenforce
Permissive
# ls -l /var/run/vdsm/
total 28
-rw-r--r--. 1 vdsm kvm    0 Nov 28 15:05 client.log
drwxr-xr-x. 2 vdsm kvm 4096 Nov 28 15:05 lvm
-rw-r--r--. 1 root root   0 Nov 28 15:05 nets_restored
-rw-r--r--. 1 vdsm kvm    6 Nov 28 15:05 respawn.pid
drwxr-xr-x. 2 vdsm kvm 4096 Nov 27 16:42 sourceRoutes
-rw-r--r--. 1 root root   6 Nov 28 15:05 supervdsm_respawn.pid
-rw-r--r--. 1 root root   6 Nov 28 15:05 supervdsmd.pid
srwxr-xr-x. 1 vdsm kvm    0 Nov 28 15:05 svdsm.sock
drwxr-xr-x. 2 vdsm kvm 4096 Nov 27 16:42 trackedInterfaces
-rw-rw-r--. 1 vdsm kvm    6 Nov 28 15:05 vdsmd.pid

Looks like the storage directory is missing. Moving to vdsm.

# rpm -qa | egrep "(vdsm|ovirt)" | sort
ovirt-engine-sdk-python-3.6.0.0-0.6.20141111.git2006d3f.el6.noarch
ovirt-host-deploy-1.3.1-0.0.master.20141127190154.gitad54173.el6.noarch
ovirt-hosted-engine-ha-1.3.0-0.0.master.20141126094507.20141126094505.git374c42a.el6.noarch
ovirt-hosted-engine-setup-1.3.0-0.0.master.20141126152735.gitdbbeab0.el6.noarch
ovirt-release-master-001-0.2.master.noarch
vdsm-4.17.0-51.gitf943da9.el6.x86_64
vdsm-cli-4.17.0-51.gitf943da9.el6.noarch
vdsm-infra-4.17.0-51.gitf943da9.el6.noarch
vdsm-jsonrpc-4.17.0-51.gitf943da9.el6.noarch
vdsm-python-4.17.0-51.gitf943da9.el6.noarch
vdsm-xmlrpc-4.17.0-51.gitf943da9.el6.noarch
vdsm-yajsonrpc-4.17.0-51.gitf943da9.el6.noarch

Created attachment 962528 [details]
sosreport -o vdsm
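The "[Errno 2] No such file or directory" above is the signature of a dangling symlink: hosted-engine.metadata exists on the NFS share, but it points into /var/run/vdsm/storage/, which only exists on a host after vdsm has prepared the images. A minimal diagnostic sketch of that check (the helper is illustrative, not part of any oVirt tool):

```python
import os

def check_links(ha_agent_dir):
    """Report each symlink in ha_agent_dir and whether its target exists.

    Returns a dict mapping entry name to True (target resolves) or
    False (dangling symlink -- the ENOENT seen during HE setup).
    """
    results = {}
    for name in os.listdir(ha_agent_dir):
        path = os.path.join(ha_agent_dir, name)
        if os.path.islink(path):
            # os.path.exists() follows the link, so a dangling symlink
            # reports False even though the link itself is listable.
            results[name] = os.path.exists(path)
    return results
```

On a correctly prepared second host, both hosted-engine.lockspace and hosted-engine.metadata should map to True.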
I wonder if this error from vdsm.log might be somehow related to it:

Thread-36::DEBUG::2014-11-28 15:17:20,798::resourceManager::198::Storage.ResourceManager.Request::(__init__) ResName=`Storage.b2d35fe4-2d95-4045-8030-c2a4eecd7ebf`ReqID=`76fb0590-9e9f-45b6-8b63-ba15737687b7`::Request was made in '/usr/sha
Thread-36::DEBUG::2014-11-28 15:17:20,798::resourceManager::542::Storage.ResourceManager::(registerResource) Trying to register resource 'Storage.b2d35fe4-2d95-4045-8030-c2a4eecd7ebf' for lock type 'exclusive'
Thread-36::DEBUG::2014-11-28 15:17:20,798::resourceManager::601::Storage.ResourceManager::(registerResource) Resource 'Storage.b2d35fe4-2d95-4045-8030-c2a4eecd7ebf' is free. Now locking as 'exclusive' (1 active user)
Thread-36::DEBUG::2014-11-28 15:17:20,798::resourceManager::238::Storage.ResourceManager.Request::(grant) ResName=`Storage.b2d35fe4-2d95-4045-8030-c2a4eecd7ebf`ReqID=`76fb0590-9e9f-45b6-8b63-ba15737687b7`::Granted request
Thread-36::DEBUG::2014-11-28 15:17:20,798::task::824::Storage.TaskManager.Task::(resourceAcquired) Task=`350a34f3-5607-4f70-b983-22d111a5c4fc`::_resourcesAcquired: Storage.b2d35fe4-2d95-4045-8030-c2a4eecd7ebf (exclusive)
Thread-36::DEBUG::2014-11-28 15:17:20,798::task::990::Storage.TaskManager.Task::(_decref) Task=`350a34f3-5607-4f70-b983-22d111a5c4fc`::ref 1 aborting False
Thread-36::ERROR::2014-11-28 15:17:20,798::task::863::Storage.TaskManager.Task::(_setError) Task=`350a34f3-5607-4f70-b983-22d111a5c4fc`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 870, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 49, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 601, in spmStop
    pool.stopSpm()
  File "/usr/share/vdsm/storage/securable.py", line 75, in wrapper
    raise SecureError("Secured object is not in safe state")
SecureError: Secured object is not in safe state

The same problem arose with a fresh oVirt 3.5 deployment.
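The SecureError in that traceback comes from vdsm's "securable" pattern: guarded methods on a pool object refuse to run unless the object has been switched into the safe (SPM) state. A simplified sketch of the idea, assuming names other than SecureError that are illustrative rather than vdsm's actual code:

```python
class SecureError(Exception):
    pass

def secured(fn):
    """Allow the call only when the owning object is in the safe state."""
    def wrapper(self, *args, **kwargs):
        if not getattr(self, "_secured", False):
            raise SecureError("Secured object is not in safe state")
        return fn(self, *args, **kwargs)
    return wrapper

class Pool(object):
    def __init__(self):
        self._secured = False  # would become True once SPM is started

    @secured
    def stopSpm(self):
        return "spm stopped"
```

Calling stopSpm() on a pool that never reached the safe state raises SecureError, which is exactly what the spmStop flow in the log ran into.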
I was in touch with Simone Tiraboschi, who suggested I perform a new deployment with the NFS server configured with the option "subtree_check" (since "no_subtree_check" is the default). The new HE deployment ended up successfully, also when the second node was added.

(In reply to Stefano Stagnaro from comment #10)
> The new HE deployment ended up successfully, also when the second node was
> added.

That's weird; our docs specifically say:

( ... rw,sync,no_subtree_check,all_squash, ...)

Can anyone from storage clear this up and change the wiki if it's wrong?

(In reply to Jiri Moskovcak from comment #11)
> That's weird; our docs specifically say:
>
> ( ... rw,sync,no_subtree_check,all_squash, ...)
>
> Can anyone from storage clear this up and change the wiki if it's wrong?

subtree_check is a security enhancement. Simone/Jiri, can you clarify why this was suggested?

Jiri, can you also point to our documentation?

(In reply to Sandro Bonazzola from comment #13)
> Jiri, can you also point to our documentation?

Oops, sorry, here it is: http://www.ovirt.org/Troubleshooting_NFS_Storage_Issues

Ok, so this may be the issue, since my NFS server was configured with:

/NotBackedUp/hosted 0.0.0.0/0.0.0.0(rw)

Allon, Jiri, if the configuration suggested in http://www.ovirt.org/Troubleshooting_NFS_Storage_Issues is mandatory for vdsm to work correctly, I guess we need to review the verification done by HE setup and by ovirt-engine too, at least for all-in-one and ovirt-live. Can you ensure that the doc page is updated and reflects vdsm requirements?

I suggested explicitly adding no_subtree_check, while Stefano is reporting that he needs to add subtree_check to have it working.
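The wiki's recommended export maps every client user to vdsm:kvm (uid/gid 36) via all_squash with anonuid/anongid, which the bare "(rw)" export above does not do. A small sketch that checks an exports line for those options (the option values come from this thread; the helper itself is illustrative, not an oVirt tool):

```python
# Recommended export options from the oVirt NFS troubleshooting wiki,
# minus subtree checking, which the thread shows is disputed.
REQUIRED = ("rw", "sync", "all_squash", "anonuid=36", "anongid=36")

def missing_options(exports_line):
    """Return the recommended export options absent from an exports line.

    An exports line looks like: "/path client(opt1,opt2,...)"; the
    options live between the parentheses.
    """
    start, end = exports_line.find("("), exports_line.rfind(")")
    opts = set(exports_line[start + 1:end].split(","))
    return [opt for opt in REQUIRED if opt not in opts]
```

For the bare export above, missing_options('/NotBackedUp/hosted 0.0.0.0/0.0.0.0(rw)') flags everything except rw, while the wiki's line passes cleanly.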
(In reply to Sandro Bonazzola from comment #15)
> Ok, so this may be the issue, since my NFS server was configured with:
>
> /NotBackedUp/hosted 0.0.0.0/0.0.0.0(rw)
>
> Allon, Jiri, if the configuration suggested in
> http://www.ovirt.org/Troubleshooting_NFS_Storage_Issues is mandatory for
> vdsm to work correctly I guess we need to review the verification done by HE
> setup and by ovirt-engine too at least for all-in-one and ovirt-live.
>
> Can you ensure that the doc page is updated and reflects vdsm requirements?

Nir, can you please review?

Sandro, can you explain how to reproduce this issue?

Stefano, can you explain what fails when you configure using no_subtree_check as described in the wiki? According to exports(5), no_subtree_check is a little less secure but more reliable, and is the default configuration because "subtree_checking tends to cause more problems than it is worth". I don't see any connection between this option and the trouble described here. The problem is probably a misconfiguration of the NFS server, not whether subtree_check is used.

Sandro, can you check whether the problem exists with the suggested configuration?

# cat /etc/exports
/storage *(rw,sync,no_subtree_check,all_squash,anonuid=36,anongid=36)

Nir, I'll try to reproduce with the suggested configuration.

(In reply to Nir Soffer from comment #19)
> Stefano, can you explain what fails when you configure using
> no_subtree_check as described in the wiki?
Nir, I've just succeeded in deploying oVirt 3.5.1 with Hosted Engine using the suggested configuration:

# cat /etc/exports
/engine *(rw,sync,no_subtree_check,all_squash,anonuid=36,anongid=36)

Previously, the deployment of the second node was failing with the error:

Failed to execute stage 'Setup validation': [Errno 2] No such file or directory: '/rhev/data-center/mnt/henfs.ovirt:_engine/629de3b1-e259-448b-a9d9-c6c5bb0d49e2/ha_agent/hosted-engine.metadata'

Nir, Stefano, I just reproduced the issue on CentOS 6.6 with rpms from master:

[ ERROR ] Failed to execute stage 'Setup validation': [Errno 2] No such file or directory: '/rhev/data-center/mnt/192.168.1.107:_storage/4e9b5ea2-0762-478b-b9fb-503680724799/ha_agent/hosted-engine.metadata'

# ll /rhev/data-center/mnt/192.168.1.107:_storage/4e9b5ea2-0762-478b-b9fb-503680724799/ha_agent/hosted-engine.metadata
lrwxrwxrwx. 1 vdsm kvm 132 3 feb 14:58 /rhev/data-center/mnt/192.168.1.107:_storage/4e9b5ea2-0762-478b-b9fb-503680724799/ha_agent/hosted-engine.metadata -> /var/run/vdsm/storage/4e9b5ea2-0762-478b-b9fb-503680724799/2da18efc-f017-4630-bf62-670b580aa3bc/3ba06862-628d-4d00-a350-ce79f01e9cd7

# ls -l /var/run/vdsm/
totale 28
-rw-r--r--. 1 vdsm kvm    0  3 feb 15:42 client.log
drwxr-xr-x. 2 vdsm kvm 4096  3 feb 15:39 lvm
-rw-r--r--. 1 root root   0  3 feb 15:39 nets_restored
-rw-r--r--. 1 vdsm kvm    6  3 feb 15:39 respawn.pid
drwxr-xr-x. 2 vdsm kvm 4096  2 feb 20:20 sourceRoutes
-rw-r--r--. 1 root root   6  3 feb 15:39 supervdsmd.pid
-rw-r--r--. 1 root root   6  3 feb 15:38 supervdsm_respawn.pid
srwxr-xr-x. 1 vdsm kvm    0  3 feb 15:39 svdsm.sock
drwxr-xr-x. 2 vdsm kvm 4096  2 feb 20:20 trackedInterfaces
-rw-rw-r--. 1 vdsm kvm    6  3 feb 15:39 vdsmd.pid

# rpm -qa | egrep "(otopi|ovirt|vdsm)" | sort
otopi-1.3.1-1.20150116.git995a46d.el6.noarch
ovirt-engine-sdk-python-3.6.0.0-0.7.20150126.git21170a7.el6.noarch
ovirt-host-deploy-1.3.2-0.0.master.20150131085622.git94089f1.el6.noarch
ovirt-hosted-engine-ha-1.3.0-0.0.master.20150126112239.20150126112233.gita3c842d.el6.noarch
ovirt-hosted-engine-setup-1.3.0-0.0.master.20150203115525.git9d41848.el6.noarch
ovirt-release-master-001-0.3.master.noarch
vdsm-4.17.0-357.git11ad42a.el6.x86_64
vdsm-cli-4.17.0-357.git11ad42a.el6.noarch
vdsm-gluster-4.17.0-357.git11ad42a.el6.noarch
vdsm-infra-4.17.0-357.git11ad42a.el6.noarch
vdsm-jsonrpc-4.17.0-357.git11ad42a.el6.noarch
vdsm-python-4.17.0-357.git11ad42a.el6.noarch
vdsm-xmlrpc-4.17.0-357.git11ad42a.el6.noarch
vdsm-yajsonrpc-4.17.0-357.git11ad42a.el6.noarch

# cat /etc/exports
/storage *(rw,sync,no_subtree_check,all_squash,anonuid=36,anongid=36)

(In reply to Sandro Bonazzola from comment #23)
> Nir, Stefano, I just reproduced the issue on CentOS 6.6 with rpms from
> master:

Can you add this host to the engine successfully? You should check why hosted-engine setup fails; it may be incorrect validation in hosted-engine setup, or some change in vdsm that broke the existing validation code. Since you moved this bug to vdsm, please explain how to reproduce this *without* hosted-engine setup. What are the vdsm verbs I have to call, for example using vdsClient, that reproduce this?

Ok, I'll try to reproduce without using hosted-engine. This will take a while.

Can't reproduce with:

vdsm-4.17.0-458.git05dfa2a.el7.x86_64
vdsm-cli-4.17.0-458.git05dfa2a.el7.noarch
vdsm-gluster-4.17.0-458.git05dfa2a.el7.noarch
vdsm-infra-4.17.0-458.git05dfa2a.el7.noarch
vdsm-jsonrpc-4.17.0-458.git05dfa2a.el7.noarch
vdsm-python-4.17.0-458.git05dfa2a.el7.noarch
vdsm-xmlrpc-4.17.0-458.git05dfa2a.el7.noarch
vdsm-yajsonrpc-4.17.0-458.git05dfa2a.el7.noarch

Moving to QA for further verification. Gil, please see comment 26.
Can you please take over?

Sandro, would you please close this bug as a duplicate of bug 1215967?

Thanks,
Ilanit.

(In reply to Ilanit Stein from comment #28)
> Sandro,
>
> Would you please close this bug as duplicate on bug 1215967?

Done

> Thanks,
> Ilanit.

*** This bug has been marked as a duplicate of bug 1215967 ***