Bug 1205575
| Summary: | Invalid Data centers status on 3.6 engine. Data center status changing from UP to NON-Responsive every few minutes | | |
|---|---|---|---|
| Product: | [Retired] oVirt | Reporter: | Michael Burman <mburman> |
| Component: | ovirt-engine-webadmin | Assignee: | Liron Aravot <laravot> |
| Status: | CLOSED WORKSFORME | QA Contact: | Aharon Canan <acanan> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.6 | CC: | amureini, bugs, ecohen, gklein, lsurette, mburman, mgoldboi, mshira, rbalakri, tnisan, yeylon, ylavi |
| Target Milestone: | m1 | | |
| Target Release: | 3.6.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | storage | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-07-22 13:48:04 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |

Created attachment 1006192 [details]
engine log
I wasn't sure which whiteboard this BZ belongs on, so I apologize in advance.

This situation blocks creating VMs, adding disks, and so on. I'm afraid that BZ 1205559, which I created earlier, is related to this issue. I'm not sure, but this is not good.

These errors have now started to appear every hour:

Failed to update OVF disks 1009cf4a-9abc-4f01-b99f-fcc81aa006a4, OVF data isn't updated on those OVF stores (Data Center mburman1, Storage Domain mbstrg).
Failed to update OVF disks cd5534da-b4b6-4dda-9ce5-b4783db76bc2, OVF data isn't updated on those OVF stores (Data Center mburman2, Storage Domain mbstrg5).

and this:

Failed to Reconstruct Master Domain for Data Center mburman1.
Data Center is being initialized, please wait for initialization to complete.

My setup is filled with errors, crashes, warnings, and failures. All of this began after I updated vdsm on the servers and ovirt-engine from the nightly snapshot.

The issues are still there: DCs and storage domains change state every few minutes.
It is also no longer possible to add a new storage domain to my setup.
The logs are filled with errors.
This is from vdsm.log when trying to add a new storage domain; the operation stays stuck forever:
Thread-23::ERROR::2015-03-30 10:19:58,498::monitor::250::Storage.Monitor::(_monitorDomain) Error monitoring domain 0ff09f73-8df6-4619-b47c-55802867213c
Traceback (most recent call last):
File "/usr/share/vdsm/storage/monitor.py", line 246, in _monitorDomain
self._performDomainSelftest()
File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 726, in wrapper
value = meth(self, *a, **kw)
File "/usr/share/vdsm/storage/monitor.py", line 313, in _performDomainSelftest
self.domain.selftest()
File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
return getattr(self.getRealDomain(), attrName)
File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
return self._cache._realProduce(self._sdUUID)
File "/usr/share/vdsm/storage/sdc.py", line 120, in _realProduce
self.refreshStorage()
File "/usr/share/vdsm/storage/misc.py", line 752, in helper
File "/usr/share/vdsm/storage/misc.py", line 737, in __call__
File "/usr/share/vdsm/storage/sdc.py", line 83, in refreshStorage
multipath.rescan()
File "/usr/share/vdsm/storage/multipath.py", line 63, in rescan
File "/usr/share/vdsm/storage/misc.py", line 752, in helper
File "/usr/share/vdsm/storage/misc.py", line 737, in __call__
File "/usr/share/vdsm/storage/iscsi.py", line 435, in rescan
File "/usr/share/vdsm/storage/iscsiadm.py", line 318, in session_rescan_async
File "/usr/share/vdsm/storage/iscsiadm.py", line 97, in _runCmd
File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 626, in execCmd
deathSignal=deathSignal, childUmask=childUmask)
File "/usr/lib64/python2.7/site-packages/cpopen/__init__.py", line 51, in __init__
File "/usr/lib64/python2.7/subprocess.py", line 703, in __init__
File "/usr/lib64/python2.7/subprocess.py", line 1100, in _get_handles
File "/usr/lib64/python2.7/subprocess.py", line 1153, in pipe_cloexec
OSError: [Errno 24] Too many open files
Dummy-43::DEBUG::2015-03-30 10:19:58,851::storage_mailbox::732::Storage.Misc.excCmd::(_checkForMail) dd if=/rhev/data-center/26bf9396-8f3d-45bf-8f61-0e5d671bbd9a/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None)
Dummy-43::ERROR::2015-03-30 10:19:58,852::storage_mailbox::790::Storage.MailBox.SpmMailMonitor::(run) Error checking for mail
Traceback (most recent call last):
File "/usr/share/vdsm/storage/storage_mailbox.py", line 788, in run
File "/usr/share/vdsm/storage/storage_mailbox.py", line 732, in _checkForMail
File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 626, in execCmd
deathSignal=deathSignal, childUmask=childUmask)
File "/usr/lib64/python2.7/site-packages/cpopen/__init__.py", line 51, in __init__
File "/usr/lib64/python2.7/subprocess.py", line 703, in __init__
File "/usr/lib64/python2.7/subprocess.py", line 1100, in _get_handles
File "/usr/lib64/python2.7/subprocess.py", line 1153, in pipe_cloexec
OSError: [Errno 24] Too many open files
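
Both tracebacks above end in `OSError: [Errno 24] Too many open files`, raised while creating pipes for a child process. One way to check whether a process is close to its descriptor limit is to compare the number of entries in `/proc/<pid>/fd` with the "Max open files" soft limit in `/proc/<pid>/limits`. The following is a minimal sketch under those assumptions, not something taken from vdsm: the function names and the `proc_root` parameter are hypothetical, it is Linux-only, and reading another user's process normally requires root.

```python
import os


def count_open_fds(pid="self", proc_root="/proc"):
    """Count entries in <proc_root>/<pid>/fd, one per open descriptor (Linux).

    For pid="self" the count includes the descriptor used to list the directory.
    """
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def read_nofile_soft_limit(pid="self", proc_root="/proc"):
    """Return the soft 'Max open files' limit, or None if unlimited or not found."""
    with open(os.path.join(proc_root, str(pid), "limits")) as f:
        for line in f:
            if line.startswith("Max open files"):
                # Fields: Max open files <soft> <hard> files
                soft = line.split()[3]
                return None if soft == "unlimited" else int(soft)
    return None


def fd_headroom(pid="self", proc_root="/proc"):
    """Return (open_fds, soft_limit, remaining); remaining is None if unlimited."""
    used = count_open_fds(pid, proc_root)
    limit = read_nofile_soft_limit(pid, proc_root)
    remaining = None if limit is None else limit - used
    return used, limit, remaining
```

Calling `fd_headroom(pid)` with the PID of the affected daemon at intervals would show whether the descriptor count keeps growing toward the limit, which would point to a leak rather than a limit that is simply set too low.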
Attaching some more logs. I'm afraid this is all related and part of one issue, I guess in storage; it may be related to the SPM changes that were made.
From a brief investigation, these issues are not limited to my setup.
Created attachment 1008244 [details]
more logs
I also experienced this bug. My setup has 2 hosts that are up and only one DC, which is down. The errors in the event log are:

2015-Apr-06, 07:46 Invalid status on Data Center DC. Setting Data Center status to Non Responsive (On host silver-vdsc, Error: General Exception).
2015-Apr-06, 07:36 Storage Pool Manager runs on Host silver-vdsc (Address: 10.35.108.12).
2015-Apr-06, 07:35 Data Center is being initialized, please wait for initialization to complete.

Is this bug still relevant?

I haven't faced this bug on the latest builds, so I guess it's not relevant.

(In reply to Michael Burman from comment #9)
> I haven't faced this bug on the latest builds, so I guess it's not relevant.

Closing based on this statement. If you encounter it again, feel free to reopen.
Created attachment 1006172 [details]
screenshots

Description of problem:
Invalid Data centers status on 3.6 engine. Data center status changing from UP to NON-Responsive every few minutes.
This is new since yesterday's update from master. I have been updating both ovirt-engine and vdsm from master every morning for the last few months. Since yesterday, my event log and engine.log show the following errors every 2 minutes:

Invalid status on Data Center mburman2. Setting Data Center status to Non Responsive (On host red-vds2.qa.lab.tlv.redhat.com, Error: Network error during communication with the Host.).
Invalid status on Data Center mburman1. Setting Data Center status to Non Responsive (On host orchid-vds1.qa.lab.tlv.redhat.com, Error: Network error during communication with the Host.).

I have hosts running in 3.5 and 3.6 clusters on a 3.6 engine. All my servers have been up the whole time and have no connectivity issues at all.
My data centers change their state from UP to NON-Operational and back to UP every few minutes.

Version-Release number of selected component (if applicable):

How reproducible:
100%, since yesterday (24.3.2015)

Steps to Reproduce:
1. Working 3.6 setup > data center in 3.6 engine, 3.5 and 3.6 clusters, servers

Actual results:
The event log and engine log are filled with errors and warnings; DCs change state every few minutes.

Expected results:
No such errors or warnings should appear, and DCs shouldn't change their state every few minutes.

Additional info:
I think something changed during yesterday's ovirt-engine update, but I'm not sure exactly what is going on.
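
To quantify how often each data center flips to Non Responsive, the event messages quoted in the description can be counted per data center and host. The sketch below assumes only the message wording shown above; the regular expression and the function names are hypothetical, and the actual engine.log line prefix (timestamp, log level) is not assumed, so the pattern is searched anywhere in the line.

```python
import re
from collections import Counter

# Matches the event text quoted in this report; surrounding log fields are ignored.
EVENT_RE = re.compile(
    r"Invalid status on Data Center (?P<dc>.+?)\. "
    r"Setting Data Center status to Non Responsive "
    r"\(On host (?P<host>[^,]+), Error: (?P<error>[^)]*)\)"
)


def parse_event(line):
    """Return a dict with 'dc', 'host' and 'error', or None if the line does not match."""
    match = EVENT_RE.search(line)
    return match.groupdict() if match else None


def count_non_responsive_events(lines):
    """Count matching events keyed by (data center, host)."""
    counts = Counter()
    for line in lines:
        event = parse_event(line)
        if event:
            counts[(event["dc"], event["host"])] += 1
    return counts
```

Passing an open log file to `count_non_responsive_events` and printing `counts.most_common()` would show which data center and host pair is affected most often.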