Bug 1167085 - [rhev-upgrade] > Host can't connect/reconnect to iscsi storage domain after setting the host to maintenance and vdsm upgrade from 3.4 > 3.5
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.5.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.5.0
Assignee: Nir Soffer
QA Contact: Aharon Canan
URL:
Whiteboard: storage
Depends On:
Blocks: rhev35rcblocker rhev35gablocker
 
Reported: 2014-11-23 13:24 UTC by Michael Burman
Modified: 2016-02-10 19:59 UTC
CC: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1168217
Environment:
Last Closed: 2014-12-14 09:38:03 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
vdsm and super vdsm logs (789.35 KB, application/x-gzip)
2014-11-23 13:24 UTC, Michael Burman
engine logs (998.39 KB, application/x-gzip)
2014-11-25 07:00 UTC, Michael Burman
host and engine logs (1.96 MB, application/x-gzip)
2014-12-02 07:42 UTC, Michael Burman
more info and logs (673.37 KB, application/x-gzip)
2014-12-15 07:40 UTC, Michael Burman

Description Michael Burman 2014-11-23 13:24:54 UTC
Created attachment 960431 [details]
vdsm and super vdsm logs

Description of problem:
When the host is set to maintenance, vdsm is upgraded from 3.4 to 3.5, and the host is then reinstalled in the setup, the host cannot be activated after the installation. The host can't find or connect to the iscsi storage domain.

Version-Release number of selected component (if applicable):
3.4.4-2.2.el6ev 
vdsm-4.14.18-3.el6ev > vdsm-4.16.7.4-1.el6ev

How reproducible:
always

Steps to Reproduce:
1. 3.4 engine with 3.4 host
2. Put the host into maintenance and upgrade it to 3.5
3. After the installation finishes successfully, try to activate the host

Actual results:
The host can't connect to or find the iscsi storage domain; activation fails with 'Storage domain does not exist'.

Expected results:
The host should be able to connect to the iscsi storage domain after activation.

Comment 1 Allon Mureinik 2014-11-24 17:15:53 UTC
Please attach the engine logs too.

Comment 2 Michael Burman 2014-11-25 07:00:57 UTC
Created attachment 961028 [details]
engine logs

Comment 3 Michael Burman 2014-11-25 07:02:19 UTC
Sure.

In the engine logs, look at 23.11.14, 12:45-12:55.
Relevant hosts: orange-vdsc and orange-vdsd

Comment 4 Allon Mureinik 2014-11-26 12:52:31 UTC
Nir, does this make any sense to you?

Thread-13::ERROR::2014-11-23 12:12:01,276::sdc::137::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain e04c81c8-8d7e-4dab-b909-2d8443ff8863
Thread-13::ERROR::2014-11-23 12:12:01,276::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain e04c81c8-8d7e-4dab-b909-2d8443ff8863
Thread-13::DEBUG::2014-11-23 12:12:01,276::lvm::365::Storage.OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-13::DEBUG::2014-11-23 12:12:01,279::lvm::288::Storage.Misc.excCmd::(cmd) /usr/bin/sudo -n /sbin/lvm vgs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name e04c81c8-8d7e-4dab-b909-2d8443ff8863 (cwd None)
Thread-13::DEBUG::2014-11-23 12:12:01,783::lvm::288::Storage.Misc.excCmd::(cmd) FAILED: <err> = '  Volume group "e04c81c8-8d7e-4dab-b909-2d8443ff8863" not found\n  Skipping volume group e04c81c8-8d7e-4dab-b909-2d8443ff8863\n'; <rc> = 5
Thread-13::WARNING::2014-11-23 12:12:01,785::lvm::370::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] ['  Volume group "e04c81c8-8d7e-4dab-b909-2d8443ff8863" not found', '  Skipping volume group e04c81c8-8d7e-4dab-b909-2d8443ff8863']

Comment 5 Nir Soffer 2014-11-26 13:00:03 UTC
(In reply to Allon Mureinik from comment #4)
> Nir, does this make any sense to you?
> 12:12:01,783::lvm::288::Storage.Misc.excCmd::(cmd) FAILED: <err> = '  Volume
> group "e04c81c8-8d7e-4dab-b909-2d8443ff8863" not found\n  Skipping volume
> group e04c81c8-8d7e-4dab-b909-2d8443ff8863\n'; <rc> = 5
> Thread-13::WARNING::2014-11-23
> 12:12:01,785::lvm::370::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] [' 
> Volume group "e04c81c8-8d7e-4dab-b909-2d8443ff8863" not found', '  Skipping
> volume group e04c81c8-8d7e-4dab-b909-2d8443ff8863']

Not having a vg does not look like a regression. Need to investigate the logs
to understand why it happened.
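
A minimal on-host check (a sketch only; the VG UUID is taken from the log above) would show whether the iSCSI sessions and the LUNs backing this VG are visible at all:

    # any open iscsi sessions on the host?
    iscsiadm -m session
    # are the multipath devices backing the domain present?
    multipath -ll
    # does LVM see the VG without vdsm's restrictive filter?
    vgs e04c81c8-8d7e-4dab-b909-2d8443ff8863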

I upgraded vdsm from 3.4 to 3.5/master many times and never had such an issue.

Michael, can you reproduce this, or did it happen only once?

Comment 6 Nir Soffer 2014-11-26 13:03:15 UTC
Michael, please also describe exactly how you upgraded to 3.5.

For example, when I upgrade my testing machines, I always do:
1. yum remove vdsm\*
2. yum install vdsm
3. vdsm-tool configure --force

I'm not saying this is the recommended procedure, but it works.
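
The same sequence written out as a shell snippet, with a version check at the end (the check is added for illustration and is not part of the steps above):

    yum remove vdsm\*             # remove the 3.4 packages
    yum install vdsm              # install the 3.5 packages
    vdsm-tool configure --force   # reconfigure vdsm and its dependencies
    rpm -q vdsm                   # verify the installed version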

Comment 7 Nikolai Sednev 2014-11-26 13:46:48 UTC
(In reply to Nir Soffer from comment #6)
> Michael, please also describe exactly how you upgraded to 3.5.
> 
> For example, when I upgrade my testing machines, I always do:
> 1. yum remove vdsm\*
> 2. yum install vdsm
> 3. vdsm-tool configure --force
> 
> I'm not saying this is the recommended procedure, but it works.

Hi Nir,
Please be informed that customers do not always perform upgrades in expected or logical ways. Your way is absolutely logical, but on most of our hosts we simply ran "yum update all -y" and that's it.

Comment 8 Michael Burman 2014-11-26 14:15:50 UTC
Hi Nir,

From what I investigated and understand, this happens because when the host is set to maintenance, the iscsi targets are not removed and the sessions are not closed properly. Nikolai, Ori from storage, and I saw that when the host is in maintenance, the iscsi sessions are left open, or at least some of them.
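
One way to confirm this on a host that is already in maintenance (a sketch, not the exact commands we ran):

    # iscsi sessions that are still logged in
    iscsiadm -m session -P 1
    # targets still recorded on the host
    ls /var/lib/iscsi/nodes/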

Anyway, I reproduced this twice, with two of my hosts. All logs from the hosts are attached.

- These were my steps:

1. Put the 3.4 host into maintenance
2. Run 'yum update vdsm' with the right repos; vdsm was updated to the right version.
3. One host I reinstalled (this is actually not required) and then tried to activate; the second host I just tried to activate. In both cases the host was not able to connect to or find the iscsi storage domain: 'Storage domain does not exist'.

- Nir, a customer shouldn't have to run the upgrade procedure the way you described; 'yum update vdsm' or 'yum update' with the right repos should be enough.

Best regards

Michael B

Comment 9 Nir Soffer 2014-11-27 17:18:24 UTC
Michael, please specify which storage domains were defined on the 3.4 setup when you put the host into maintenance. This bug talks about an iscsi storage domain, but I see a *gluster* domain being disconnected in the vdsm log:

Thread-15::INFO::2014-11-23 12:02:47,818::logUtils::44::dispatcher::(wrapper) Run and protect: disconnectStorageServer(domType=7, spUUID='ba5d5f70-b014-4b33-bc81-de7df2f88574', conList=[{'port': '', 'connection': '10.35.160.202:/ogofen1', 'iqn': '', 'user': '', 'tpgt': '1', 'vfs_type': 'glusterfs', 'password': '******', 'id': 'ef9e98e6-fe20-4599-955e-2d288ba14de2'}], options=None)
Thread-15::DEBUG::2014-11-23 12:02:47,819::mount::227::Storage.Misc.excCmd::(_runcmd) /usr/bin/sudo -n /bin/umount -f -l /rhev/data-center/mnt/glusterSD/10.35.160.202:_ogofen1 (cwd None)

Also specify when you put the host into maintenance. The best way to report such bugs is to add marker messages to the vdsm log or to /var/log/messages before you start the test:

    echo "---- `date` putting host to maintenance -----" >> /var/log/vdsm/vdsm.log

Comment 10 Nir Soffer 2014-11-27 21:33:03 UTC
Michael, other important details - *before* the upgrade:

- On which vdsm version the iscsi domain was created?
- dump of the storage connections table from engine database
- output of:
  tree /var/lib/iscsi
- output of:
  for f in /var/lib/iscsi/nodes/*/*/*; do echo $f; cat $f; echo; done
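
For the connections table, a dump along these lines should do (the database and table names are assumptions based on a default engine setup):

    su - postgres -c "psql engine -c 'select * from storage_server_connections;'"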

Comment 11 Michael Burman 2014-11-30 07:33:07 UTC
Nir, 

The iscsi domain was created on vdsm-4.13.2-0.18.el6ev.x86_64.
- I can't provide any output from before the upgrade, only after. The upgrade is already done. It's a mixed environment.

The hosts were connected to both an iscsi storage domain and an NFS storage domain.

Comment 12 Michael Burman 2014-12-02 07:41:11 UTC
OK, so this issue may be happening again with another host in our environment.
I'm attaching all relevant logs; it seems to me like the same issue.

For more details, please ssh to host alma03.qa.lab.tlv.redhat.com
Upgraded engine: 10.35.161.37

Comment 13 Michael Burman 2014-12-02 07:42:11 UTC
Created attachment 963575 [details]
host and engine logs

Comment 14 Allon Mureinik 2014-12-09 12:38:13 UTC
(In reply to Michael Burman from comment #12)
> OK, so this issue may be happening again with another host in our environment.
> I'm attaching all relevant logs; it seems to me like the same issue.
> 
> For more details, please ssh to host alma03.qa.lab.tlv.redhat.com
> Upgraded engine: 10.35.161.37
And you can't grab the required info (comment 10) from there?

Comment 15 Nir Soffer 2014-12-10 07:45:57 UTC
Michael, please provide the information requested in comment 10.

If needed, install a fresh machine so we can see the state of the engine database and host iscsi sessions *before* the upgrade.

Comment 16 Michael Burman 2014-12-10 08:09:44 UTC
Nir, Allon

The setup was left in this state for a week for your investigation, while the hosts were being disconnected on a daily basis.
This setup is already upgraded; the state of the engine has already changed.

Only if we restore this setup from a snapshot will I be able to provide the info you requested from before the upgrade.

Comment 17 Nir Soffer 2014-12-10 09:40:52 UTC
Adding back the needinfo to make it clear that we are waiting for info.

Comment 18 Michael Burman 2014-12-14 09:36:41 UTC
Clearing the needinfo until I am able to provide the info or to reproduce.

Comment 19 Allon Mureinik 2014-12-14 09:38:03 UTC
Closing the bug until we have a proper reproduction with all the info.

Comment 20 Michael Burman 2014-12-15 07:38:44 UTC
I didn't manage to reproduce this issue, but:
When the host is activated back from maintenance mode, there are a lot of errors in vdsm.log about:

'StorageDomainDoesNotExist: Storage domain does not exist'

'Error while collecting domain de096187-36e2-44e5-b8db-a9b67c3960de monitoring information'

- 
Thread-184::ERROR::2014-12-15 09:30:33,376::sdc::143::Storage.StorageDomainCache::(_findDomain) domain de096187-36e2-44e5-b8db-a9b67c3960de not found
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 171, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'de096187-36e2-44e5-b8db-a9b67c3960de',)
Thread-184::ERROR::2014-12-15 09:30:33,376::domainMonitor::239::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain de096187-36e2-44e5-b8db-a9b67c3960de monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 204, in _monitorDomain
    self.nextStatus.clear()
  File "/usr/share/vdsm/storage/sdc.py", line 98, in produce
    domain.getRealDomain()
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 171, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'de096187-36e2-44e5-b8db-a9b67c3960de',)
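
A quick way to gauge how frequent these errors are in the attached log (an illustrative grep, assuming the default log path):

    grep -c 'StorageDomainDoesNotExist' /var/log/vdsm/vdsm.log
    grep 'Error while collecting domain' /var/log/vdsm/vdsm.log | tail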

Attaching relevant logs and info.

Comment 21 Michael Burman 2014-12-15 07:40:10 UTC
Created attachment 968817 [details]
more info and logs

