Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1205575

Summary: Invalid Data centers status on 3.6 engine. Data center status changing from UP to NON-Responsive every few minutes
Product: [Retired] oVirt
Reporter: Michael Burman <mburman>
Component: ovirt-engine-webadmin
Assignee: Liron Aravot <laravot>
Status: CLOSED WORKSFORME
QA Contact: Aharon Canan <acanan>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 3.6
CC: amureini, bugs, ecohen, gklein, lsurette, mburman, mgoldboi, mshira, rbalakri, tnisan, yeylon, ylavi
Target Milestone: m1
Target Release: 3.6.0
Hardware: x86_64
OS: Linux
Whiteboard: storage
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-07-22 13:48:04 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description    Flags
screenshots    none
engine log     none
more logs      none

Description Michael Burman 2015-03-25 08:51:14 UTC
Created attachment 1006172 [details]
screenshots

Description of problem:
Invalid Data centers status on 3.6 engine. Data center status changing from UP to NON-Responsive every few minutes.
This is new since yesterday's update from master.
I have been updating both ovirt-engine and vdsm from master every morning for the last few months. Since yesterday, my event log and engine.log show the following errors every 2 minutes:
 Invalid status on Data Center mburman2. Setting Data Center status to Non Responsive (On host red-vds2.qa.lab.tlv.redhat.com, Error: Network error during communication with the Host.).
Invalid status on Data Center mburman1. Setting Data Center status to Non Responsive (On host orchid-vds1.qa.lab.tlv.redhat.com, Error: Network error during communication with the Host.).

I have hosts running in 3.5 and 3.6 clusters on a 3.6 engine.
All my servers have been up the whole time and have no connectivity issues at all.
My data centers change their state from Up to Non Responsive and back to Up every few minutes.
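
To quantify how often each data center flaps, the "Invalid status" messages quoted above can be pulled out of the event log or engine.log with a short script. This is a minimal sketch, not engine code: the regular expression is built only from the message text quoted in this report, the function names are hypothetical, and it assumes data center names contain no spaces.

```python
import re
from collections import Counter

# Pattern derived from the event text quoted in this report (assumption:
# the engine's actual message template may differ between versions).
EVENT_RE = re.compile(
    r"Invalid status on Data Center (?P<dc>\S+)\. "
    r"Setting Data Center status to Non Responsive "
    r"\(On host (?P<host>[^,]+), Error: (?P<error>.+?)\)\.?\s*$"
)


def parse_event(line):
    """Return a dict with dc, host and error, or None if the line does not match."""
    match = EVENT_RE.search(line.strip())
    return match.groupdict() if match else None


def count_by_dc(lines):
    """Count matching 'Invalid status' events per data center."""
    parsed = (parse_event(line) for line in lines)
    return Counter(event["dc"] for event in parsed if event)
```

Feeding the lines of a log file to `count_by_dc` gives a per-data-center tally, which makes it easier to tell whether one host or all hosts are involved.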

Version-Release number of selected component (if applicable):


How reproducible:
100%, since yesterday (24.3.2015)

Steps to Reproduce:
1. Have a working 3.6 setup: data centers in a 3.6 engine with 3.5 and 3.6 clusters and running hosts.

Actual results:
The event log and engine log are filled with errors and warnings; DCs change state every few minutes.

Expected results:
There should be no such errors or warnings, and DCs should not change their state every few minutes.

- I think something changed in yesterday's ovirt-engine update, but I'm not sure exactly what is going on.

Comment 1 Michael Burman 2015-03-25 08:56:24 UTC
Created attachment 1006192 [details]
engine log

Comment 2 Michael Burman 2015-03-25 08:57:52 UTC
I wasn't sure which whiteboard to put this BZ on, so I apologize in advance.

Comment 3 Michael Burman 2015-03-25 09:21:17 UTC
This situation blocks creating VMs, adding disks, and so on.

I'm now afraid that BZ 1205559, which I created earlier, is related to this issue. I'm not sure, but this is not good.

Comment 4 Michael Burman 2015-03-25 14:35:56 UTC
And now these errors have started to appear every hour:
Failed to update OVF disks 1009cf4a-9abc-4f01-b99f-fcc81aa006a4, OVF data isn't updated on those OVF stores (Data Center mburman1, Storage Domain mbstrg).

Failed to update OVF disks cd5534da-b4b6-4dda-9ce5-b4783db76bc2, OVF data isn't updated on those OVF stores (Data Center mburman2, Storage Domain mbstrg5).

And this: Failed to Reconstruct Master Domain for Data Center mburman1.
Data Center is being initialized, please wait for initialization to complete.

My setup is filled with errors, crashes, warnings, and failures.
All of this began after I updated vdsm on the servers and ovirt-engine from the nightly snapshot.

Comment 5 Michael Burman 2015-03-30 07:22:27 UTC
The issue(s) are still there; DCs and storage domains change state every few minutes.

And now it's not possible to add a new storage domain to my setup.
The logs are filled with errors.
This is from vdsm.log when trying to add a new storage domain; the operation is stuck forever.

Thread-23::ERROR::2015-03-30 10:19:58,498::monitor::250::Storage.Monitor::(_monitorDomain) Error monitoring domain 0ff09f73-8df6-4619-b47c-55802867213c
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 246, in _monitorDomain
    self._performDomainSelftest()
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 726, in wrapper
    value = meth(self, *a, **kw)
  File "/usr/share/vdsm/storage/monitor.py", line 313, in _performDomainSelftest
    self.domain.selftest()
  File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 120, in _realProduce
    self.refreshStorage()
  File "/usr/share/vdsm/storage/misc.py", line 752, in helper
  File "/usr/share/vdsm/storage/misc.py", line 737, in __call__
  File "/usr/share/vdsm/storage/sdc.py", line 83, in refreshStorage
    multipath.rescan()
  File "/usr/share/vdsm/storage/multipath.py", line 63, in rescan
  File "/usr/share/vdsm/storage/misc.py", line 752, in helper
  File "/usr/share/vdsm/storage/misc.py", line 737, in __call__
  File "/usr/share/vdsm/storage/iscsi.py", line 435, in rescan
  File "/usr/share/vdsm/storage/iscsiadm.py", line 318, in session_rescan_async
  File "/usr/share/vdsm/storage/iscsiadm.py", line 97, in _runCmd
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 626, in execCmd
    deathSignal=deathSignal, childUmask=childUmask)
  File "/usr/lib64/python2.7/site-packages/cpopen/__init__.py", line 51, in __init__
  File "/usr/lib64/python2.7/subprocess.py", line 703, in __init__
  File "/usr/lib64/python2.7/subprocess.py", line 1100, in _get_handles
  File "/usr/lib64/python2.7/subprocess.py", line 1153, in pipe_cloexec
OSError: [Errno 24] Too many open files
Dummy-43::DEBUG::2015-03-30 10:19:58,851::storage_mailbox::732::Storage.Misc.excCmd::(_checkForMail) dd if=/rhev/data-center/26bf9396-8f3d-45bf-8f61-0e5d671bbd9a/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None)
Dummy-43::ERROR::2015-03-30 10:19:58,852::storage_mailbox::790::Storage.MailBox.SpmMailMonitor::(run) Error checking for mail
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 788, in run
  File "/usr/share/vdsm/storage/storage_mailbox.py", line 732, in _checkForMail
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 626, in execCmd
    deathSignal=deathSignal, childUmask=childUmask)
  File "/usr/lib64/python2.7/site-packages/cpopen/__init__.py", line 51, in __init__
  File "/usr/lib64/python2.7/subprocess.py", line 703, in __init__
  File "/usr/lib64/python2.7/subprocess.py", line 1100, in _get_handles
  File "/usr/lib64/python2.7/subprocess.py", line 1153, in pipe_cloexec
OSError: [Errno 24] Too many open files

Attaching some more logs. I'm afraid this is all related and part of one issue, I guess in storage, and it may be related to the SPM changes that were made.
But from a brief investigation, these issues are not limited to my setup.
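
The "OSError: [Errno 24] Too many open files" in the tracebacks above means the process hit its per-process limit on open file descriptors. A quick way to check how close a process is to that limit on a Linux host is to compare the entries under /proc/<pid>/fd with the soft "Max open files" value in /proc/<pid>/limits. The helper names below are hypothetical and this is only a diagnostic sketch; it does not identify what is leaking descriptors.

```python
import os


def open_fd_count(pid):
    """Number of open file descriptors, counted from /proc/<pid>/fd (Linux only)."""
    return len(os.listdir("/proc/%d/fd" % pid))


def parse_soft_nofile(limits_text):
    """Extract the soft 'Max open files' limit from /proc/<pid>/limits content.

    Returns an int, or None if the limit is 'unlimited' or the row is missing.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft = line.split()[3]  # fields: Max, open, files, soft, hard, units
            return None if soft == "unlimited" else int(soft)
    return None


def fd_usage(pid):
    """Return (open_fds, soft_limit) for the given process ID."""
    with open("/proc/%d/limits" % pid) as limits_file:
        soft = parse_soft_nofile(limits_file.read())
    return open_fd_count(pid), soft
```

Calling `fd_usage(pid)` with the PID of the vdsm process (as root or as the user that owns the process) returns the current descriptor count and the soft limit. A count that keeps growing toward the limit between calls would be consistent with a descriptor leak.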

Comment 6 Michael Burman 2015-03-30 07:23:05 UTC
Created attachment 1008244 [details]
more logs

Comment 7 Shira Maximov 2015-04-06 08:46:32 UTC
I also experienced this bug. My setup has 2 hosts that are up and only one DC, which is down.
The errors I have in the event log are:

2015-Apr-06, 07:46  Invalid status on Data Center DC. Setting Data Center status to Non Responsive (On host silver-vdsc, Error: General Exception).
2015-Apr-06, 07:36  Storage Pool Manager runs on Host silver-vdsc (Address: 10.35.108.12).
2015-Apr-06, 07:35  Data Center is being initialized, please wait for initialization to complete.

Comment 8 Yaniv Lavi 2015-07-21 23:51:13 UTC
Is this bug still relevant?

Comment 9 Michael Burman 2015-07-22 13:36:04 UTC
I haven't faced this bug on the latest builds, so i guess it's not relevant.

Comment 10 Allon Mureinik 2015-07-22 13:48:04 UTC
(In reply to Michael Burman from comment #9)
> I haven't faced this bug on the latest builds, so i guess it's not relevant.

Closing based on this statement.
If you encounter it again feel free to re-open.