Bug 983641
Summary: | if /var/log runs out of space vdsm.service won't start | ||
---|---|---|---|
Product: | [Retired] oVirt | Reporter: | Lukas Grimm <grimm> |
Component: | vdsm | Assignee: | Yaniv Bronhaim <ybronhei> |
Status: | CLOSED NOTABUG | QA Contact: | yeylon <yeylon> |
Severity: | medium | Docs Contact: | |
Priority: | unspecified | ||
Version: | 3.2 | CC: | aberezin, acathrow, bazulay, danken, emesika, grimm, iheim, jboggs, jkt, mgoldboi, ovirt-bugs, ovirt-maint, srevivo, ybronhei |
Target Milestone: | --- | ||
Target Release: | 3.4.1 | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | infra | ||
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2014-04-13 14:43:45 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Lukas Grimm
2013-07-11 15:23:29 UTC
I expect that this can only be fixed by freeing up space in /var/log. From ovirt-node's perspective, there isn't anything we can do to affect vdsm starting/stopping correctly other than trying to prevent /var/log from filling. It's filling due to vdsm's logrotate settings, I believe, so it's an issue in vdsm anyway.

---

Thanks for your answer. I tried to clean up /var/log, but with no success:

```
[root@ovirtnode2 admin]# du -h /var/log
4.0K    /var/log/vdsm/backup
8.0K    /var/log/vdsm
4.0K    /var/log/core
16K     /var/log
[root@ovirtnode2 admin]# df -h /var/log
Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/HostVG-Logging  2.0G  2.0G     0 100% /var/log
```

So vdsm still won't start... After several reboots of this node, vdsm started again on its own, and /var/log was nearly empty when I checked with df (it took 3 or 4 reboots until it was empty).

---

There was a bug in the vdsm included in this version of ovirt-node: it doesn't clean up its log files completely. It keeps a file handle open to a deleted file and continues to write to it.

---

Here the /var/log vg got full. You have only 2.0G set for it, so that is possible and reasonable. Do we want vdsm to clear rotated logs after reaching a certain size/age, or to become non-responsive when the disk is full (which some users may even prefer)? About the cleaning: it is odd that after a restart vdsm was still non-operational; that doesn't seem related to vdsm specifically.

---

vdsm does not start at all when the space for /var/log/vdsm is full:

```
vdsm: Running load_needed_modules
vdsm: Running tune_system
vdsm: Running test_space
vdsm: low log space
vdsm: stopped during execute test_space task (task returned with error code 1).
vdsm start [FAILED]
```

and after cleaning it does start properly. The command it runs is:

```
df -P /var/log/vdsm | awk '{print $4}' | tail -1
```

and the check passes only if this returns 10000 or greater. Was that the case?

---

Since we updated our hypervisors some time ago, I can't reproduce the problem. Unfortunately I don't have any spare hardware around to replicate the problem again. We use "fat hypervisors" on CentOS now... But I think I remember the log looked like the output above. Hope this helps.

---

We do have behavior here that we need to decide whether it is worth a fix or not. vdsm checks the spare space of /var/log/vdsm before starting, but supervdsm doesn't, so it may be that even if you clean up space, the supervdsm service keeps writing to it, because nothing prevents it from starting. I'm not sure that requires another check there, though; the only thing we want to prevent is being operational when such a space issue occurs, and we get that with vdsm's pre-start task check. In your case the limited space is for everything under /var/log, so even if you cleaned the vdsm folder once, other services may have written under /var/log, filled it up, and vdsm could not start properly ([1] < 10000). Due to this verification I don't consider it a bug, and would close it if there are no objections.

[1] df -P /var/log/vdsm | awk '{print $4}' | tail -1

---

In my opinion it's worth fixing (building a work-around) in the thin hypervisor image, since that is staged with the 2GB /var/log; maybe this is already done in a newer version. For vdsm I think it works as designed. You may close this report.

---

We can add a size limit or age to the rotated vdsm.log files for both vdsm and supervdsm; that will help decrease the size of the vdsm logs. But you still have a bunch of services that write to /var/log and can cause this result, so I'm not sure it will help. The way to work with that host is to expand the /var/log vg size.

---

The only real option to solve the issue is to enable vdsm/svdsm to run with no logs (even if we clean all vdsm logs, /var/log can still be full). I don't think this is an option. I also think we report the filled-up /var/log to the engine, and there might be a warning about it (Yaniv, please verify). Other than that I think this bug should be closed. Arthur?

---

Vdsm should not start running without a log.
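For reference, the pre-start space check discussed above (`df -P /var/log/vdsm | awk '{print $4}' | tail -1` compared against 10000) can be approximated with `os.statvfs`. This is only a sketch of what the check computes, not vdsm's actual implementation (which shells out to df); the assumption that df reports 1 KiB blocks matches GNU df's `-P` output but is not guaranteed everywhere.

```python
import os

# Threshold quoted in the thread: the pre-start check passes only when
# df -P reports at least 10000 available blocks (assumed 1 KiB each,
# i.e. roughly 10 MB free) for /var/log/vdsm.
MIN_FREE_KB = 10000

def has_enough_log_space(path="/var/log/vdsm", min_free_kb=MIN_FREE_KB):
    # os.statvfs reads the same filesystem numbers df does; f_bavail is
    # the space available to unprivileged users, in f_frsize-byte units.
    st = os.statvfs(path)
    free_kb = st.f_bavail * st.f_frsize // 1024
    return free_kb >= min_free_kb
```

With a 2.0G /var/log shared by every service on the host, this check can fail again at any time even after /var/log/vdsm itself is cleaned, which is exactly the situation reported above.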
Local storage can be set properly for /var/log.

---

There is no report on the engine side; the host just becomes non-operational. Looking at its syslog we'll see "vdsm: low log space". vdsm does not start at all, so the engine can't know the reason properly.

---

This is an automated message. Re-targeting all non-blocker bugs still open on 3.4.0 to 3.4.1.

---

This could happen at runtime as well. When VDSM fails due to a full /var/log partition, it should report back to Engine with the exact reason it failed, so a user could understand why the host is not able to go back to an operational state straight from the Engine UI.

---

vdsm can't do it. We use Python's logging module all around vdsm (http://docs.python.org/2/library/logging.html). When we log a new message, the logging flush method can raise exceptions to the caller; this exception can be outOfMemory, as that's what we get when the log partition is full. But still, we can't wrap each logging call in a try clause, so the only way to catch such an exception is to intervene in the logging module itself, which we can't do with vdsm-specific code such as reporting something to the engine. AFAIU the solution should be to constantly report the host's partition status to the engine; with that, when the host becomes non-responsive, at least the user can check this info and correlate it with the actual reason. Dan, did vdsm use to report about its local disks status?

---

Vdsm still reports it in getVdsStats:

```
diskStats = {'/tmp': {'free': '92573'},
             '/var/log': {'free': '92573'},
             '/var/log/core': {'free': '92573'},
             '/var/run/vdsm/': {'free': '92573'}}
```

And BTW, it is possible (though quite a big feature, with multiple caveats) to replace local file loggers with a centralized rsyslog-based network logger.

---

Ohh, Barak was right of course :) I searched for it in the vds_statistics table, but the engine saves those only in cache, in the VDS object.
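The "intervene in the logging module itself" idea mentioned above can be sketched with the stock `logging` API: `logging.Handler.handleError` is the hook the logging machinery calls when `emit()` fails, for example with an out-of-space error on a full /var/log. The subclass below is hypothetical, not vdsm code, and the thread itself argues that silently ignoring such failures is dangerous; it only illustrates that the option is technically doable.

```python
import logging

class DropOnErrorFileHandler(logging.FileHandler):
    """Hypothetical handler: keep callers alive when the log partition
    is full by silently dropping records that fail to write."""

    def handleError(self, record):
        # The logging machinery calls this when emit() raises (e.g. an
        # OSError because the disk has no space left).  The default
        # implementation prints a traceback to stderr; dropping the
        # record instead means logger.info()/error() callers never see
        # the failure -- which is exactly the dangerous part.
        pass
```

Under normal conditions the handler behaves like a plain FileHandler; the override only changes what happens once writes start failing.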
The engine alerts on it in the auditlog when the free space passes VdsLocalDisksCriticallyLowFreeSpace and VdsLocalDisksLowFreeSpace, in each refreshVdsRunTimeInfo interval (in alertIfLowDiskSpaceOnHost). In my testing I checked only after vdsm was already down, so I missed that warning; it appears before vdsm goes down (if vdsm was up and states were collected alright). Is that enough? Maybe the auditlog is not enough and it should appear in the UI as well?

---

(In reply to Yaniv Bronhaim from comment #18)
> on my testings i checked only after vdsm was down already so i missed that
> warning. it appears before vdsm goes down (if vdsm was and states were

if vdsm was UP and responsive

> collected alright). is that enough? maybe auditlog is not enough and should
> appear in UI as well?

I mean something more related to the host, like marking the host's line in the main Host tab with a critical warning label.

---

(In reply to Yaniv Bronhaim from comment #19)
> means more related to the host, like signing the host's line in the main
> Host tab with a critical warning label.

Yes, we should mark the host with a warning in the main Hosts tab, and add VdsLocalDisksCriticallyLowFreeSpace and VdsLocalDisksLowFreeSpace to "Alerts" in the UI. VDSM should also report to Engine with the same alerts when it fails to start due to a full logs partition.

---

(In reply to Arthur Berezin from comment #20)
> VDSM should also report to Engine with same alerts when it fails to start
> due to full logs partition.

Note: When Vdsm fails to start it cannot report any alert... It's dead.

---

Can't we let vdsmd start up in a lite mode with a simple XML binding, which would allow reporting the error in the getCaps response?

---

Very often Vdsm code assumes that it can write to the logs. Failure to do so generates an exception (bz 615733). Ignoring such exceptions seems dangerous to me. I do not like it, but it is doable.

---

This is going nowhere, so before closing this bug as NOTABUG, here is the summary:

- vdsm reports diskStats (comment #17)
- engine logs a warning in the event log (comment #18)

From the VDSM side there is no real option nor desire to run VDSM with no logs. In case vdsm becomes non-responsive due to no disk space, the event can still be viewed in the event log.
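The engine-side threshold check described in the thread can be sketched as follows. The real code is Java inside the engine; this Python sketch only illustrates the logic. The names alertIfLowDiskSpaceOnHost, VdsLocalDisksLowFreeSpace, and VdsLocalDisksCriticallyLowFreeSpace come from the comments above, but the threshold values and the units of the reported 'free' field are assumptions.

```python
# Assumed threshold values; the real ones live in engine configuration.
VDS_LOCAL_DISKS_LOW_FREE_SPACE = 1000            # MB, assumed
VDS_LOCAL_DISKS_CRITICALLY_LOW_FREE_SPACE = 500  # MB, assumed

def alert_if_low_disk_space_on_host(disk_stats):
    """disk_stats: a mapping shaped like the getVdsStats payload quoted
    above, e.g. {'/var/log': {'free': '92573'}}.  Returns a list of
    (severity, path, free) alerts for paths under a threshold."""
    alerts = []
    for path, stats in disk_stats.items():
        free = int(stats['free'])
        if free < VDS_LOCAL_DISKS_CRITICALLY_LOW_FREE_SPACE:
            alerts.append(('CRITICAL', path, free))
        elif free < VDS_LOCAL_DISKS_LOW_FREE_SPACE:
            alerts.append(('WARNING', path, free))
    return alerts
```

Run on every refreshVdsRunTimeInfo interval, this only fires while vdsm is still up and reporting stats, which is why the thread concludes the alert must land before vdsm dies: once vdsm fails to start, there is no channel left to report the reason.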