Bug 949248

Summary: vdsm: we fail to activate a host from the non-operational state because one domain is missing from the tree only (tree broken and not rebuilt)
Product: Red Hat Enterprise Virtualization Manager
Reporter: Dafna Ron <dron>
Component: vdsm
Assignee: Tal Nisan <tnisan>
Status: CLOSED DUPLICATE
QA Contact: Aharon Canan <acanan>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.2.0
CC: abaron, amureini, bazulay, iheim, jkt, lpeer, scohen, sgotliv, yeylon
Target Milestone: ---
Keywords: Reopened, Triaged
Target Release: 3.4.0
Hardware: x86_64
OS: Linux
Whiteboard: storage
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-02-13 13:56:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  logs (no flags)

Description Dafna Ron 2013-04-07 09:03:18 UTC
Created attachment 732310 [details]
logs

Description of problem:

While upgrading storage domains I hit a problem and the tree was broken.
After we fixed the issue on one of the hosts, the second host, which was non-operational, failed to activate.
The engine sends connectStorageServer and connectStoragePool, which succeed, and then sends getVdsStats; the stats report only one storage domain, so it fails to activate the host.
If I run vgs and getStorageDomainsList I can see the domain, and there is no issue connecting to it.
Also, once I moved the host to maintenance and activated it, the tree was rebuilt and the host started correctly.

Version-Release number of selected component (if applicable):

vdsm-4.10.2-13.0.el6ev.x86_64
sf12

How reproducible:


Steps to Reproduce:
1. On iSCSI storage with two domains, block connectivity to the non-master storage domain from one of the hosts (see the sketch after this list).
2. When the host becomes non-operational, remove the domain from the tree on this host only:
# rm -rf /rhev/data-center/<spUUID>/<deactivatedSdUUID>
# rm -rf /rhev/data-center/mnt/blockSD/<deactivatedSdUUID>
3. Remove the connectivity block to the domain from the non-operational host.
4. Try to activate the host.
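
One possible way to perform steps 1 and 3 (an assumption, not taken from the original report; any method that cuts the host off from the non-master domain's iSCSI portal will do; <portal-ip> is a placeholder):

step 1 - block iSCSI traffic to the non-master domain's portal:
# iptables -A OUTPUT -p tcp -d <portal-ip> --dport 3260 -j DROP
step 3 - remove the block again:
# iptables -D OUTPUT -p tcp -d <portal-ip> --dport 3260 -j DROP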
  
Actual results:

getVdsStats checks the tree, and since one domain is missing from it we do not activate the host, even though the host can see all the domains.
If we put the host in maintenance and then activate it, the host becomes operational.
I am setting this BZ to high severity because if there are running VMs on the host we need to migrate them (which we cannot do if they are paused on EIO) or shut them down.

Expected results:

We should be able to rebuild the tree while the host is non-operational.
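
As an illustration only, these are the two paths that step 2 removed and that a rebuild would have to restore; a minimal sketch of the link layer with placeholder UUIDs (whether recreating the pool-level symlink alone is enough for getVdsStats, and what vdsm normally keeps inside the blockSD directory, is not verified here):
# mkdir -p /rhev/data-center/mnt/blockSD/<deactivatedSdUUID>
# ln -s /rhev/data-center/mnt/blockSD/<deactivatedSdUUID> /rhev/data-center/<spUUID>/<deactivatedSdUUID>
The supported recovery remains moving the host to maintenance and activating it, which rebuilds the tree.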

Additional info: logs

[root@cougar01 ~]# vdsClient -s 0 |grep getVdsStat
getVdsStats
[root@cougar01 ~]# vdsClient -s 0 getVdsStats
	anonHugePages = 56
	cpuIdle = 99.67
	cpuLoad = 0.01
	cpuSys = 0.12
	cpuSysVdsmd = 0.00
	cpuUser = 0.21
	cpuUserVdsmd = 0.12
	dateTime = 2013-04-07T08:40:22 GMT
	diskStats = {'/var/log': {'free': '428785'}, '/var/log/core': {'free': '428785'}, '/tmp': {'free': '428785'}, '/var/run/vdsm/': {'free': '428785'}}
	elapsedTime = 171598
	generationID = 08825355-61c7-46fe-b503-4a5ddb162192
	ksmCpu = 0
	ksmPages = 100
	ksmState = True
	memAvailable = 15116
	memCommitted = 0
	memShared = 0
	memUsed = 4
	netConfigDirty = False
	network = {'bond4': {'macAddr': '', 'name': 'bond4', 'txDropped': '0', 'rxErrors': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'down', 'speed': '1000', 'rxDropped': '0'}, 'bond0': {'macAddr': '', 'name': 'bond0', 'txDropped': '0', 'rxErrors': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'down', 'speed': '1000', 'rxDropped': '0'}, 'bond1': {'macAddr': '', 'name': 'bond1', 'txDropped': '0', 'rxErrors': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'down', 'speed': '1000', 'rxDropped': '0'}, 'bond2': {'macAddr': '', 'name': 'bond2', 'txDropped': '0', 'rxErrors': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'down', 'speed': '1000', 'rxDropped': '0'}, 'bond3': {'macAddr': '', 'name': 'bond3', 'txDropped': '0', 'rxErrors': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'down', 'speed': '1000', 'rxDropped': '0'}, 'eth3': {'macAddr': '', 'name': 'eth3', 'txDropped': '0', 'rxErrors': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'down', 'speed': '1000', 'rxDropped': '0'}, 'eth2': {'macAddr': '', 'name': 'eth2', 'txDropped': '0', 'rxErrors': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'down', 'speed': '1000', 'rxDropped': '0'}, 'eth1': {'macAddr': '', 'name': 'eth1', 'txDropped': '0', 'rxErrors': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'down', 'speed': '1000', 'rxDropped': '0'}, 'eth0': {'macAddr': '', 'name': 'eth0', 'txDropped': '0', 'rxErrors': '0', 'txRate': '0.0', 'rxRate': '0.0', 'txErrors': '0', 'state': 'up', 'speed': '10000', 'rxDropped': '0'}}
	rxDropped = 0
	rxRate = 0.00
	statsAge = 1.88
	storageDomains = {'6bbbe226-7456-46da-8fc4-4c4d59472436': {'delay': '0.00945687294006', 'lastCheck': '1.9', 'code': 0, 'valid': True}}
	swapFree = 16119
	swapTotal = 16119
	thpState = always
	txDropped = 0
	txRate = 0.00
	vmActive = 0
	vmCount = 0
	vmMigrating = 0
[root@cougar01 ~]# vdsClient -s 0 getStorageDomainsList
6bbbe226-7456-46da-8fc4-4c4d59472436
3533d774-aff3-4b73-926d-053e2a3dc8a1
cf28adb9-28e7-49f9-88d8-0e6d12336bd9

[root@cougar01 ~]# vgs
  /dev/mapper/1Dafna-31-021363867: read failed after 0 of 4096 at 107374116864: Input/output error
  /dev/mapper/1Dafna-31-021363867: read failed after 0 of 4096 at 107374174208: Input/output error
  /dev/mapper/1Dafna-31-021363867: read failed after 0 of 4096 at 0: Input/output error
  /dev/mapper/1Dafna-31-021363867: read failed after 0 of 4096 at 4096: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/metadata: read failed after 0 of 4096 at 536805376: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/metadata: read failed after 0 of 4096 at 536862720: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/metadata: read failed after 0 of 4096 at 0: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/metadata: read failed after 0 of 4096 at 4096: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/ids: read failed after 0 of 4096 at 134152192: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/ids: read failed after 0 of 4096 at 134209536: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/ids: read failed after 0 of 4096 at 0: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/ids: read failed after 0 of 4096 at 4096: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/leases: read failed after 0 of 4096 at 2147418112: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/leases: read failed after 0 of 4096 at 2147475456: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/leases: read failed after 0 of 4096 at 0: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/leases: read failed after 0 of 4096 at 4096: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/inbox: read failed after 0 of 4096 at 134152192: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/inbox: read failed after 0 of 4096 at 134209536: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/inbox: read failed after 0 of 4096 at 0: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/inbox: read failed after 0 of 4096 at 4096: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/outbox: read failed after 0 of 4096 at 134152192: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/outbox: read failed after 0 of 4096 at 134209536: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/outbox: read failed after 0 of 4096 at 0: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/outbox: read failed after 0 of 4096 at 4096: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/master: read failed after 0 of 4096 at 1073676288: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/master: read failed after 0 of 4096 at 1073733632: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/master: read failed after 0 of 4096 at 0: Input/output error
  /dev/a3282596-8f78-4930-bb76-bebeb657babf/master: read failed after 0 of 4096 at 4096: Input/output error
  VG                                   #PV #LV #SN Attr   VSize   VFree 
  3533d774-aff3-4b73-926d-053e2a3dc8a1   1   6   0 wz--n-  99.62g 95.75g
  6bbbe226-7456-46da-8fc4-4c4d59472436   1   7   0 wz--n-  99.62g 94.75g
  cf28adb9-28e7-49f9-88d8-0e6d12336bd9   1   6   0 wz--n-  99.62g 95.75g
  vg0                                    1   2   0 wz--n- 465.56g     0 
[root@cougar01 ~]# /rhev/data-center/
9a9db723-c63b-469d-9545-d8cc9407822b/ hsm-tasks/                            mnt/                                  
[root@cougar01 ~]# /rhev/data-center/9a9db723-c63b-469d-9545-d8cc9407822b/
6bbbe226-7456-46da-8fc4-4c4d59472436/ mastersd/                             
[root@cougar01 ~]# /rhev/data-center/9a9db723-c63b-469d-9545-d8cc9407822b/

Comment 1 Ayal Baron 2013-04-07 10:01:35 UTC
1. "admin" manually deleted files from host
2. the host doesn't see one of the domains while another host does see it; the defined behaviour is for this host to be non-operational
3. moving host to maintenance and activating fixes the problem

The system cannot deal with every user-inflicted issue that can happen (e.g. instead of deleting the links, you could have deleted /usr/share/vdsm/*).

Closing.

Comment 2 Dafna Ron 2013-04-07 10:08:18 UTC
(In reply to comment #1)
> 1. "admin" manually deleted files from host

I did not manually delete the files - they were deleted by vdsm during the upgrade.
I simply gave a way to reproduce this. 

> 2. the host doesn't see 1 of the domains while another host sees it, the
> defined behaviour is for this host to be non-operational

Yes, but only if the domain is actually unavailable. In this case the domain is functional and can be seen by the host; it just does not exist in the tree (you can see that connectStorageServer and connectStoragePool succeed, and we fail only because the domain is missing from the tree).
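
For illustration, the mismatch can be seen on the host with the same commands that appear in the attached logs (a sketch; <spUUID> is the pool UUID, not copied from the logs):
# vdsClient -s 0 getStorageDomainsList                  <- the domain is listed
# vdsClient -s 0 getVdsStats | grep storageDomains      <- but missing from the stats
# ls /rhev/data-center/<spUUID>/                        <- and its link is absent from the tree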

> 3. moving host to maintenance and activating fixes the problem

which means that you need to shut down the VMs on the host to do so.

> 
> The system cannot deal with all the user inflicted issues that can happen
> (e.g. instead of deleting the links, you could have deleted
> /usr/share/vdsm/*)
> 
> Closing.

Following my reply and the impact on users, I think this should be discussed with PM and QE managers before closing.

Comment 3 Itamar Heim 2013-12-01 19:57:46 UTC
still targeted for 3.4?

Comment 4 Allon Mureinik 2013-12-02 13:33:20 UTC
yes.

Comment 5 Ayal Baron 2013-12-18 10:01:30 UTC
Sergey, this sounds like a dup of the missing links issue?

Comment 6 Sergey Gotliv 2013-12-18 10:38:07 UTC
It looks like a dup of BZ#1026697. The only thing that really bothers me about this bug is why the host is not activated.

BZ#1026697 deals with a host that is activated without restoring these links.
We should check that.

Comment 9 Tal Nisan 2014-02-13 13:56:03 UTC

*** This bug has been marked as a duplicate of bug 1026697 ***