Bug 949248
| Field | Value |
|---|---|
| Summary | vdsm: we fail to activate a host from non-operational state because 1 domain is missing from the tree only (tree broken and not rebuild) |
| Product | Red Hat Enterprise Virtualization Manager |
| Component | vdsm |
| Status | CLOSED DUPLICATE |
| Severity | high |
| Priority | unspecified |
| Version | 3.2.0 |
| Target Release | 3.4.0 |
| Hardware | x86_64 |
| OS | Linux |
| Whiteboard | storage |
| Doc Type | Bug Fix |
| Type | Bug |
| Last Closed | 2014-02-13 13:56:03 UTC |
| oVirt Team | Storage |
| Reporter | Dafna Ron <dron> |
| Assignee | Tal Nisan <tnisan> |
| QA Contact | Aharon Canan <acanan> |
| CC | abaron, amureini, bazulay, iheim, jkt, lpeer, scohen, sgotliv, yeylon |
| Keywords | Reopened, Triaged |
1. "admin" manually deleted files from the host.
2. The host does not see one of the domains while another host sees it; the defined behaviour is for this host to become non-operational.
3. Moving the host to maintenance and activating it fixes the problem.

The system cannot deal with every user-inflicted issue that can happen (e.g. instead of deleting the links, you could have deleted /usr/share/vdsm/*).

Closing.

(In reply to comment #1)

> 1. "admin" manually deleted files from host

I did not manually delete the files; they were deleted by vdsm during an upgrade. I simply gave a way to reproduce this.

> 2. the host doesn't see 1 of the domains while another host sees it, the defined behaviour is for this host to be non-operational

Yes, but only if the domain is actually unavailable. In this case the domain is functional and can be seen by the host; it simply does not exist in the tree (you can see that connectStorageServer and connectStoragePool succeed, and we fail only because the domain is missing from the tree).

> 3. moving host to maintenance and activating fixes the problem

Which means that you need to shut down the VMs on the host to do so.

Given my reply and the impact on users, I think this should be discussed with PM and QE managers before closing.

Still targeted for 3.4?

Yes.

Sergey, this sounds like a dup of the missing-links issue?

It looks like a dup of BZ#1026697. The only thing that really bothers me about this bug is why the host is not activated. BZ#1026697 deals with a host that is activated without restoring these links. We should check that.

*** This bug has been marked as a duplicate of bug 1026697 ***
Created attachment 732310 [details]: logs

Description of problem:

I hit a problem while upgrading domains, and the tree was broken. After we fixed the issue on one of the hosts, the second host, which was non-operational, failed to activate. The engine sends connectStorageServer and connectStoragePool, which succeed, then sends getVdsStats, gets only one domain back in the statistics, and fails to activate the host. If I run vgs and getStorageDomainsList I can see the domain, and there is no issue connecting to it. Also, once I moved the host to maintenance and activated it, the tree was rebuilt and the host started correctly.

Version-Release number of selected component (if applicable):
vdsm-4.10.2-13.0.el6ev.x86_64 (sf12)

How reproducible:

Steps to Reproduce:
1. On iSCSI storage with two domains, block connectivity to the non-master storage domain from one of the hosts.
2. When the host becomes non-operational, remove one domain from the tree on this host only:
   # rm -rf /rhev/data-center/<spUUID>/<deactivatedSdUUID>
   # rm -rf /rhev/data-center/mnt/blockSD/<deactivatedSdUUID>
3. Remove the block on the domain from the non-operational host.
4. Try to activate the host.

Actual results:
getVdsStats checks the tree, and since one domain is missing we do not activate the host, although we can see all the domains. If we put the host in maintenance and then activate it, the host becomes operational.

I am filing this BZ as high severity because if we have running VMs on the host we need to migrate them (which we cannot do if they are paused on EIO) or shut them off.

Expected results:
We should be able to rebuild the tree while the host is non-operational.
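The mismatch described above can be stated simply: getStorageDomainsList reports three domains, while the storageDomains map returned by getVdsStats contains only one. A minimal sketch of that comparison, for illustration only (missing_from_tree is a hypothetical helper, not vdsm code; the UUIDs are taken from the logs in this report):

```python
# Hypothetical diagnostic: compare domains visible on storage with domains
# present in the data-center link tree. Not vdsm code; UUIDs from this report.
def missing_from_tree(visible_domains, tree_domains):
    """Return domain UUIDs that storage reports but the link tree lacks."""
    return sorted(set(visible_domains) - set(tree_domains))

# Output of getStorageDomainsList on the affected host:
visible = [
    "6bbbe226-7456-46da-8fc4-4c4d59472436",
    "3533d774-aff3-4b73-926d-053e2a3dc8a1",
    "cf28adb9-28e7-49f9-88d8-0e6d12336bd9",
]
# Keys of the storageDomains dict in getVdsStats:
in_tree = ["6bbbe226-7456-46da-8fc4-4c4d59472436"]

# Prints the two domains whose links are gone from the tree.
print(missing_from_tree(visible, in_tree))
```

This is exactly the inconsistency the report argues the host should recover from on activation, instead of requiring a maintenance cycle.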
Additional info: logs

```
[root@cougar01 ~]# vdsClient -s 0 | grep getVdsStat
getVdsStats
[root@cougar01 ~]# vdsClient -s 0 getVdsStats
	anonHugePages = 56
	cpuIdle = 99.67
	cpuLoad = 0.01
	cpuSys = 0.12
	cpuSysVdsmd = 0.00
	cpuUser = 0.21
	cpuUserVdsmd = 0.12
	dateTime = 2013-04-07T08:40:22 GMT
	diskStats = {'/var/log': {'free': '428785'}, '/var/log/core': {'free': '428785'}, '/tmp': {'free': '428785'}, '/var/run/vdsm/': {'free': '428785'}}
	elapsedTime = 171598
	generationID = 08825355-61c7-46fe-b503-4a5ddb162192
	ksmCpu = 0
	ksmPages = 100
	ksmState = True
	memAvailable = 15116
	memCommitted = 0
	memShared = 0
	memUsed = 4
	netConfigDirty = False
	network = {'bond4': {'macAddr': '', 'name': 'bond4', 'txDropped': '0', 'rxErrors': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'down', 'speed': '1000', 'rxDropped': '0'},
	           'bond0': {'macAddr': '', 'name': 'bond0', 'txDropped': '0', 'rxErrors': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'down', 'speed': '1000', 'rxDropped': '0'},
	           'bond1': {'macAddr': '', 'name': 'bond1', 'txDropped': '0', 'rxErrors': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'down', 'speed': '1000', 'rxDropped': '0'},
	           'bond2': {'macAddr': '', 'name': 'bond2', 'txDropped': '0', 'rxErrors': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'down', 'speed': '1000', 'rxDropped': '0'},
	           'bond3': {'macAddr': '', 'name': 'bond3', 'txDropped': '0', 'rxErrors': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'down', 'speed': '1000', 'rxDropped': '0'},
	           'eth3': {'macAddr': '', 'name': 'eth3', 'txDropped': '0', 'rxErrors': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'down', 'speed': '1000', 'rxDropped': '0'},
	           'eth2': {'macAddr': '', 'name': 'eth2', 'txDropped': '0', 'rxErrors': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'down', 'speed': '1000', 'rxDropped': '0'},
	           'eth1': {'macAddr': '', 'name': 'eth1', 'txDropped': '0', 'rxErrors': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'down', 'speed': '1000', 'rxDropped': '0'},
	           'eth0': {'macAddr': '', 'name': 'eth0', 'txDropped': '0', 'rxErrors': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'up', 'speed': '10000', 'rxDropped': '0'}}
	rxDropped = 0
	rxRate = 0.00
	statsAge = 1.88
	storageDomains = {'6bbbe226-7456-46da-8fc4-4c4d59472436': {'delay': '0.00945687294006', 'lastCheck': '1.9', 'code': 0, 'valid': True}}
	swapFree = 16119
	swapTotal = 16119
	thpState = always
	txDropped = 0
	txRate = 0.00
	vmActive = 0
	vmCount = 0
	vmMigrating = 0
[root@cougar01 ~]# vdsClient -s 0 getStorageDomainsList
6bbbe226-7456-46da-8fc4-4c4d59472436
3533d774-aff3-4b73-926d-053e2a3dc8a1
cf28adb9-28e7-49f9-88d8-0e6d12336bd9
[root@cougar01 ~]# vgs
  /dev/mapper/1Dafna-31-021363867: read failed after 0 of 4096 at 107374116864: Input/output error
  /dev/mapper/1Dafna-31-021363867: read failed after 0 of 4096 at 107374174208: Input/output error
  /dev/mapper/1Dafna-31-021363867: read failed after 0 of 4096 at 0: Input/output error
  /dev/mapper/1Dafna-31-021363867: read failed after 0 of 4096 at 4096: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/metadata: read failed after 0 of 4096 at 536805376: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/metadata: read failed after 0 of 4096 at 536862720: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/metadata: read failed after 0 of 4096 at 0: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/metadata: read failed after 0 of 4096 at 4096: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/ids: read failed after 0 of 4096 at 134152192: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/ids: read failed after 0 of 4096 at 134209536: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/ids: read failed after 0 of 4096 at 0: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/ids: read failed after 0 of 4096 at 4096: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/leases: read failed after 0 of 4096 at 2147418112: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/leases: read failed after 0 of 4096 at 2147475456: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/leases: read failed after 0 of 4096 at 0: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/leases: read failed after 0 of 4096 at 4096: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/inbox: read failed after 0 of 4096 at 134152192: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/inbox: read failed after 0 of 4096 at 134209536: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/inbox: read failed after 0 of 4096 at 0: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/inbox: read failed after 0 of 4096 at 4096: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/outbox: read failed after 0 of 4096 at 134152192: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/outbox: read failed after 0 of 4096 at 134209536: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/outbox: read failed after 0 of 4096 at 0: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/outbox: read failed after 0 of 4096 at 4096: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/master: read failed after 0 of 4096 at 1073676288: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/master: read failed after 0 of 4096 at 1073733632: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/master: read failed after 0 of 4096 at 0: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/master: read failed after 0 of 4096 at 4096: Input/output error
  VG                                   #PV #LV #SN Attr   VSize   VFree
  3533d774-aff3-4b73-926d-053e2a3dc8a1   1   6   0 wz--n-  99.62g 95.75g
  6bbbe226-7456-46da-8fc4-4c4d59472436   1   7   0 wz--n-  99.62g 94.75g
  cf28adb9-28e7-49f9-88d8-0e6d12336bd9   1   6   0 wz--n-  99.62g 95.75g
  vg0                                    1   2   0 wz--n- 465.56g      0
[root@cougar01 ~]# /rhev/data-center/
9a9db723-c63b-469d-9545-d8cc9407822b/  hsm-tasks/  mnt/
[root@cougar01 ~]# /rhev/data-center/9a9db723-c63b-469d-9545-d8cc9407822b/
6bbbe226-7456-46da-8fc4-4c4d59472436/  mastersd/
[root@cougar01 ~]# /rhev/data-center/9a9db723-c63b-469d-9545-d8cc9407822b/
```
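For context, the layout that step 2 of the reproduction deletes can be sketched as follows. This is an illustrative reconstruction, not vdsm code: it assumes each block domain lives under /rhev/data-center/mnt/blockSD/<sdUUID> and is linked into the pool tree at /rhev/data-center/<spUUID>/<sdUUID>, with paths and UUIDs taken from this report and build_tree being a hypothetical helper.

```python
# Illustrative sketch of the /rhev/data-center link tree (assumed layout;
# build_tree is a hypothetical helper, not part of vdsm).
import os
import tempfile

def build_tree(root, sp_uuid, sd_uuids):
    """Create blockSD dirs and per-pool symlinks, mimicking the tree vdsm keeps."""
    for sd in sd_uuids:
        target = os.path.join(root, "mnt", "blockSD", sd)
        os.makedirs(target)
        link = os.path.join(root, sp_uuid, sd)
        os.makedirs(os.path.dirname(link), exist_ok=True)
        os.symlink(target, link)  # the link the stats check expects to find

root = tempfile.mkdtemp()  # stand-in for /rhev/data-center
sp = "9a9db723-c63b-469d-9545-d8cc9407822b"
sds = ["6bbbe226-7456-46da-8fc4-4c4d59472436",
       "3533d774-aff3-4b73-926d-053e2a3dc8a1"]
build_tree(root, sp, sds)

# Removing one link (as in the reproduction) leaves the domain intact on
# storage but absent from the pool tree, which is what the host trips on:
os.remove(os.path.join(root, sp, sds[1]))
print(sorted(os.listdir(os.path.join(root, sp))))
```

The point of the sketch is that only the symlink is gone; the backing directory (the storage itself) is untouched, matching the vgs and getStorageDomainsList output above.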