Created attachment 960431 [details]
vdsm and super vdsm logs

Description of problem:
When setting a host to maintenance, upgrading vdsm from 3.4 to 3.5, and then reinstalling the host in the setup, the host cannot be activated after the installation. The host cannot find or connect to the iscsi storage domain.

Version-Release number of selected component (if applicable):
3.4.4-2.2.el6ev
vdsm-4.14.18-3.el6ev > vdsm-4.16.7.4-1.el6ev

How reproducible:
always

Steps to Reproduce:
1. 3.4 engine with a 3.4 host
2. Put the host on maintenance and upgrade the host to 3.5
3. After the installation finishes successfully, try to activate the host

Actual results:
Cannot connect to or find the iscsi storage domain: 'Storage domain does not exist'.

Expected results:
The host should be able to connect to the iscsi storage domain after activating.
Please attach the engine logs too.
Created attachment 961028 [details] engine logs
Sure. In the engine logs, look at 23.11.14, 12:45-12:55.
Relevant hosts: orange-vdsc and orange-vdsd
Nir, does this make any sense to you?

Thread-13::ERROR::2014-11-23 12:12:01,276::sdc::137::Storage.StorageDomainCache::(_findDomain) looking for unfetched domain e04c81c8-8d7e-4dab-b909-2d8443ff8863
Thread-13::ERROR::2014-11-23 12:12:01,276::sdc::154::Storage.StorageDomainCache::(_findUnfetchedDomain) looking for domain e04c81c8-8d7e-4dab-b909-2d8443ff8863
Thread-13::DEBUG::2014-11-23 12:12:01,276::lvm::365::Storage.OperationMutex::(_reloadvgs) Operation 'lvm reload operation' got the operation mutex
Thread-13::DEBUG::2014-11-23 12:12:01,279::lvm::288::Storage.Misc.excCmd::(cmd) /usr/bin/sudo -n /sbin/lvm vgs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 obtain_device_list_from_udev=0 filter = [ '\''r|.*|'\'' ] } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min = 50 retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name e04c81c8-8d7e-4dab-b909-2d8443ff8863 (cwd None)
Thread-13::DEBUG::2014-11-23 12:12:01,783::lvm::288::Storage.Misc.excCmd::(cmd) FAILED: <err> = ' Volume group "e04c81c8-8d7e-4dab-b909-2d8443ff8863" not found\n Skipping volume group e04c81c8-8d7e-4dab-b909-2d8443ff8863\n'; <rc> = 5
Thread-13::WARNING::2014-11-23 12:12:01,785::lvm::370::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] [' Volume group "e04c81c8-8d7e-4dab-b909-2d8443ff8863" not found', ' Skipping volume group e04c81c8-8d7e-4dab-b909-2d8443ff8863']
(In reply to Allon Mureinik from comment #4)
> Nir, does this make any sense to you?
> 12:12:01,783::lvm::288::Storage.Misc.excCmd::(cmd) FAILED: <err> = ' Volume
> group "e04c81c8-8d7e-4dab-b909-2d8443ff8863" not found\n Skipping volume
> group e04c81c8-8d7e-4dab-b909-2d8443ff8863\n'; <rc> = 5
> Thread-13::WARNING::2014-11-23
> 12:12:01,785::lvm::370::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] ['
> Volume group "e04c81c8-8d7e-4dab-b909-2d8443ff8863" not found', ' Skipping
> volume group e04c81c8-8d7e-4dab-b909-2d8443ff8863']

Not having a VG does not look like a regression. I need to investigate the logs to understand why it happened. I have upgraded vdsm from 3.4 to 3.5/master many times and never hit such an issue.

Michael, can you reproduce this, or did it happen only once?
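One detail worth noting when triaging this: the vgs command quoted in the log runs with filter = [ 'r|.*|' ], which rejects every device, so a "Volume group not found" from that command does not by itself prove the VG is gone from the host. The helper below (a hypothetical triage sketch, not part of vdsm) builds a plain vgs lookup that scans all devices, for comparison; it is dry-run by default and only executes when passed --run:

```shell
# Hypothetical triage helper: check whether a VG is visible to plain LVM,
# bypassing vdsm's restrictive device filter. Dry-run by default.
vg_visible() {
    local vg_name="$1" mode="${2:---dry-run}"
    # vdsm uses the storage domain UUID as the VG name, so it can be
    # passed to vgs directly as a name.
    local cmd="vgs --noheadings -o vg_name,vg_uuid $vg_name"
    if [ "$mode" = "--run" ]; then
        $cmd
    else
        echo "would run: $cmd"
    fi
}

vg_visible e04c81c8-8d7e-4dab-b909-2d8443ff8863
```

If plain vgs also fails to find the VG, the underlying LUNs are likely not present at all (e.g. the iscsi sessions are not logged in), rather than being filtered out.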
Michael, also describe exactly how you upgraded to 3.5.

For example, when I upgrade my testing machines, I always do:
1. yum remove vdsm\*
2. yum install vdsm
3. vdsm-tool configure --force

I'm not saying this is the recommended procedure, but it works.
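The three steps above could be wrapped in a small script like this (a sketch of the testing procedure described in the comment, not a supported upgrade tool; dry-run by default so the commands can be reviewed before running):

```shell
# Sketch of the testing upgrade procedure above. Prints the plan unless
# called with --run, in which case the commands are actually executed.
upgrade_vdsm() {
    local mode="${1:---dry-run}"
    local steps=(
        "yum remove vdsm\*"            # 1. remove all vdsm packages
        "yum install vdsm"             # 2. install the new vdsm
        "vdsm-tool configure --force"  # 3. reconfigure for the new version
    )
    for step in "${steps[@]}"; do
        if [ "$mode" = "--run" ]; then
            eval "$step"
        else
            echo "would run: $step"
        fi
    done
}
```

Example: `upgrade_vdsm` prints the three commands; `upgrade_vdsm --run` executes them (it would need root and the right repos enabled).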
(In reply to Nir Soffer from comment #6)
> Michael, also describe exactly how did you upgrade to 3.5.
>
> For example, when I upgrade my testing machines, I always do:
> 1. yum remove vdsm\*
> 2. yum install vdsm
> 3. vdsm-tool configure --force
>
> I'm not saying this is the recommended procedure, but it works.

Hi Nir,
Please be informed that customers do not always perform upgrades in expected or logical ways. Your way is absolutely logical, but on most of our hosts we simply ran "yum update all -y", and that was it.
Hi Nir,

From what I investigated and understood, this happens because when the host is set to maintenance, the iscsi targets are not removed and the sessions are not closed properly. Nikolai, Ori from storage, and I saw that when the host is in maintenance, the iscsi sessions, or at least some of them, are left open.

Anyway, I reproduced this twice, with two of my hosts. All logs from the hosts are attached.

These were my steps:
1. Put a 3.4 host to maintenance
2. Run 'yum update vdsm' with the right repos; vdsm was updated to the right version.
3. One host I reinstalled (actually this is not required) and then tried to activate; the second host I just tried to activate. In both cases the host was not able to connect to or find the iscsi storage domain: 'Storage domain does not exist'.

Nir, a customer shouldn't have to run the upgrade procedure the way you described; 'yum update vdsm' or 'yum update' with the right repos should be enough.

Best regards,
Michael B
Michael, please specify which storage domains are defined on the 3.4 setup when you put the host to maintenance.

This bug talks about an iscsi storage domain, but I see a *gluster* domain disconnected in the vdsm log:

Thread-15::INFO::2014-11-23 12:02:47,818::logUtils::44::dispatcher::(wrapper) Run and protect: disconnectStorageServer(domType=7, spUUID='ba5d5f70-b014-4b33-bc81-de7df2f88574', conList=[{'port': '', 'connection': '10.35.160.202:/ogofen1', 'iqn': '', 'user': '', 'tpgt': '1', 'vfs_type': 'glusterfs', 'password': '******', 'id': 'ef9e98e6-fe20-4599-955e-2d288ba14de2'}], options=None)
Thread-15::DEBUG::2014-11-23 12:02:47,819::mount::227::Storage.Misc.excCmd::(_runcmd) /usr/bin/sudo -n /bin/umount -f -l /rhev/data-center/mnt/glusterSD/10.35.160.202:_ogofen1 (cwd None)

Also specify when you put the host in maintenance. The best way to report such bugs is to add marker messages to the vdsm log or to /var/log/messages before you start the test:

echo "---- `date` putting host to maintenance -----" >> /var/log/vdsm/vdsm.log
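A tiny wrapper around that suggestion (a hypothetical helper, not an existing vdsm tool) makes the marker reusable for any test phase and any log file:

```shell
# Append a timestamped marker line to a log file, so test phases are easy
# to find later. First argument is the log file, the rest is the message.
log_marker() {
    local logfile="$1"; shift
    echo "---- $(date) $* -----" >> "$logfile"
}

# Example (needs write access to the log):
# log_marker /var/log/vdsm/vdsm.log "putting host to maintenance"
```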
Michael, other important details - *before* the upgrade:
- On which vdsm version was the iscsi domain created?
- A dump of the storage connections table from the engine database
- Output of: tree /var/lib/iscsi
- Output of: for f in /var/lib/iscsi/nodes/*/*/*; do echo $f; cat $f; echo; done
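The last two items could be collected with one small script (hypothetical helper; the iscsi state directory is a parameter so it can also be pointed at a saved copy):

```shell
# Dump the open-iscsi persistent node database: directory layout first,
# then the contents of each per-node record.
dump_iscsi_state() {
    local root="${1:-/var/lib/iscsi}"
    tree "$root" 2>/dev/null || find "$root" -print   # layout of the iscsi db
    for f in "$root"/nodes/*/*/*; do                  # per-node record contents
        [ -e "$f" ] || continue
        echo "== $f =="
        cat "$f"
        echo
    done
}
```

Running `dump_iscsi_state` before and after putting the host in maintenance would show whether node records (and hence targets) are left behind.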
Nir,
The iscsi domain was created on vdsm-4.13.2-0.18.el6ev.x86_64.
I can't provide any output from before the upgrade, only from after; the upgrade is already done. It's a mixed environment: the hosts were connected both to an iscsi storage domain and an NFS storage domain.
Ok, so this issue may be happening again with another host in our environment. I'm attaching all the relevant logs; it seems to me like the same issue.

For more details, please ssh to the host: alma03.qa.lab.tlv.redhat.com
Upgrade engine: 10.35.161.37
Created attachment 963575 [details] host and engine logs
(In reply to Michael Burman from comment #12)
> Ok, so maybe this issue happening again with another host in our environment,
> i'm attaching all relevant logs, seems to me like the same issue.
>
> Pls for more details ssh to host- alma03.qa.lab.tlv.redhat.com
> upgrade engine- 10.35.161.37

And you can't grab the required info (comment 10) from there?
Michael, please provide the information requested in comment 10. If needed, install a fresh machine so we can see the state of the engine database and host iscsi sessions *before* the upgrade.
Nir, Allon,
The setup was in this state for a week for your investigation, while the hosts were disconnected on a daily basis. This setup is already upgraded, and the state of the engine has already changed. Only if we restore this setup from a snapshot will I be able to provide the info you requested from before the upgrade.
Adding back the needinfo to make it clear that we are waiting for info.
Clearing the needinfo until I am able to provide the info or able to reproduce.
Closing the bug until we have a proper reproduction with all the info.
I didn't manage to reproduce this issue, but when the host is activated back from maintenance mode, there are a lot of errors in vdsm.log:

'StorageDomainDoesNotExist: Storage domain does not exist'
'Error while collecting domain de096187-36e2-44e5-b8db-a9b67c3960de monitoring information'

Thread-184::ERROR::2014-12-15 09:30:33,376::sdc::143::Storage.StorageDomainCache::(_findDomain) domain de096187-36e2-44e5-b8db-a9b67c3960de not found
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 171, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'de096187-36e2-44e5-b8db-a9b67c3960de',)
Thread-184::ERROR::2014-12-15 09:30:33,376::domainMonitor::239::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain de096187-36e2-44e5-b8db-a9b67c3960de monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 204, in _monitorDomain
    self.nextStatus.clear()
  File "/usr/share/vdsm/storage/sdc.py", line 98, in produce
    domain.getRealDomain()
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 171, in _findUnfetchedDomain
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'de096187-36e2-44e5-b8db-a9b67c3960de',)

Attaching relevant logs and info.
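To see whether these errors concern a single domain or several, a quick log-triage helper can group the 'Storage domain does not exist' errors by domain UUID (a sketch assuming the vdsm.log message format quoted above):

```shell
# Count StorageDomainDoesNotExist errors per domain UUID in a vdsm log.
count_missing_domains() {
    local logfile="$1"
    grep -o "StorageDomainDoesNotExist: Storage domain does not exist: (u'[^']*'" "$logfile" \
        | grep -o "[0-9a-f-]\{36\}" \
        | sort | uniq -c | sort -rn
}

# Example: count_missing_domains /var/log/vdsm/vdsm.log
```

If only one UUID shows up, the problem is scoped to that domain; if every domain in the pool appears, it points at a broader connectivity issue after activation.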
Created attachment 968817 [details] more info and logs