Bug 983641

Summary: if /var/log runs out of space vdsm.service won't start
Product: [Retired] oVirt
Component: vdsm
Version: 3.2
Hardware: x86_64
OS: Linux
Status: CLOSED NOTABUG
Severity: medium
Priority: unspecified
Reporter: Lukas Grimm <grimm>
Assignee: Yaniv Bronhaim <ybronhei>
QA Contact: yeylon <yeylon>
CC: aberezin, acathrow, bazulay, danken, emesika, grimm, iheim, jboggs, jkt, mgoldboi, ovirt-bugs, ovirt-maint, srevivo, ybronhei
Target Release: 3.4.1
Whiteboard: infra
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-04-13 14:43:45 UTC

Description Lukas Grimm 2013-07-11 15:23:29 UTC
Description of problem:
Today one node went into non-responsive mode; after manually rebooting it, vdsmd didn't start.

systemctl start vdsmd.service failed; after deleting all log files, vdsmd started again.

Version-Release number of selected component (if applicable):
oVirt Node 2.6.1-1 

How reproducible:
As soon as there is no space left in /var/log and vdsm is stopped.

Steps to Reproduce:
1. fill up /var/log
2. stop vdsm
3. try to restart it

Actual results:
vdsm won't start

Expected results:
vdsm should start

Additional info:

Errors:

[root@ovirtnode2 admin]# systemctl start vdsmd.service
Job for vdsmd.service failed. See 'systemctl status vdsmd.service' and 'journalctl -xn' for details.
[root@ovirtnode2 admin]# systemctl status vdsmd.service
vdsmd.service - Virtual Desktop Server Manager
	  Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled)
	  Active: failed (Result: exit-code) since Thu 2013-07-11 15:16:19 UTC; 2s ago
	 Process: 20454 ExecStart=/lib/systemd/systemd-vdsmd start (code=exited, status=1/FAILURE)
	Main PID: 4237 (code=killed, signal=TERM)
	  CGroup: name=systemd:/system/vdsmd.service

Jul 11 15:16:17 ovirtnode2.ep4.puzzle.ch python[20594]: DIGEST-MD5 client step 2
Jul 11 15:16:17 ovirtnode2.ep4.puzzle.ch python[20594]: DIGEST-MD5 client step 2
Jul 11 15:16:17 ovirtnode2.ep4.puzzle.ch python[20594]: DIGEST-MD5 client step 3
Jul 11 15:16:19 ovirtnode2.ep4.puzzle.ch python[20657]: DIGEST-MD5 client step 2
Jul 11 15:16:19 ovirtnode2.ep4.puzzle.ch python[20657]: DIGEST-MD5 client step 2
Jul 11 15:16:19 ovirtnode2.ep4.puzzle.ch python[20657]: DIGEST-MD5 client step 3
Jul 11 15:16:19 ovirtnode2.ep4.puzzle.ch systemd-vdsmd[20454]: vdsm: low log space[FAILED]
Jul 11 15:16:19 ovirtnode2.ep4.puzzle.ch systemd[1]: vdsmd.service: control process exited, code=exited status=1
Jul 11 15:16:19 ovirtnode2.ep4.puzzle.ch systemd[1]: Failed to start Virtual Desktop Server Manager.
Jul 11 15:16:19 ovirtnode2.ep4.puzzle.ch systemd[1]: Unit vdsmd.service entered failed state
[root@ovirtnode2 admin]# journalctl -xn
-- Logs begin at Thu 2013-07-11 09:35:01 UTC, end at Thu 2013-07-11 15:16:19 UTC. --
Jul 11 15:16:17 ovirtnode2.ep4.puzzle.ch collectd[1891]: Filter subsystem: Built-in target `write': Dispatching value to all write plugins failed with status 2 (ENOENT). Most likely this means you didn't load any write plugins.
Jul 11 15:16:17 ovirtnode2.ep4.puzzle.ch collectd[1891]: Filter subsystem: Built-in target `write': Dispatching value to all write plugins failed with status 2 (ENOENT). Most likely this means you didn't load any write plugins.
Jul 11 15:16:17 ovirtnode2.ep4.puzzle.ch systemd-journal[772]: Forwarding to syslog missed 10 messages.
-- Subject: One or more messages could not be forwarded to syslog
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- Documentation: http://www.freedesktop.org/wiki/Software/systemd/catalog/0027229ca0644181a76c4e92458afa2e
-- 
-- One or more messages could not be forwarded to the syslog service
-- running side-by-side with journald. This usually indicates that the
-- syslog implementation has not been able to keep up with the speed of
-- messages queued.
Jul 11 15:16:19 ovirtnode2.ep4.puzzle.ch python[20657]: DIGEST-MD5 client step 2
Jul 11 15:16:19 ovirtnode2.ep4.puzzle.ch python[20657]: DIGEST-MD5 client step 2
Jul 11 15:16:19 ovirtnode2.ep4.puzzle.ch python[20657]: DIGEST-MD5 client step 3
Jul 11 15:16:19 ovirtnode2.ep4.puzzle.ch systemd-vdsmd[20454]: vdsm: low log space[FAILED]
Jul 11 15:16:19 ovirtnode2.ep4.puzzle.ch systemd[1]: vdsmd.service: control process exited, code=exited status=1
Jul 11 15:16:19 ovirtnode2.ep4.puzzle.ch systemd[1]: Failed to start Virtual Desktop Server Manager.
-- Subject: Unit vdsmd.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- Documentation: http://www.freedesktop.org/wiki/Software/systemd/catalog/be02cf6855d2428ba40df7e9d022f03d
-- 
-- Unit vdsmd.service has failed.
-- 
-- The result is failed.
Jul 11 15:16:19 ovirtnode2.ep4.puzzle.ch systemd[1]: Unit vdsmd.service entered failed state

Comment 1 Mike Burns 2013-07-11 15:55:00 UTC
I expect that this can only be fixed by freeing up space in /var/log.  From ovirt-node's perspective, there isn't anything we can do to affect vdsm starting/stopping correctly other than trying to prevent /var/log from filling.  It's filling due to vdsm's logrotate settings, I believe, so it's an issue in vdsm anyway.

Comment 2 Lukas Grimm 2013-07-12 07:17:07 UTC
Thanks for your answer. I tried to clean up /var/log, but somehow without success:

[root@ovirtnode2 admin]# du -h /var/log
4.0K	/var/log/vdsm/backup
8.0K	/var/log/vdsm
4.0K	/var/log/core
16K	/var/log

[root@ovirtnode2 admin]# df -h /var/log
Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/HostVG-Logging  2.0G  2.0G     0 100% /var/log

so vdsm still won't start...

Comment 3 Lukas Grimm 2013-07-12 11:55:43 UTC
After several reboots of this node, vdsm started again on its own and /var/log was nearly empty (when I checked with df); it took 3 or 4 reboots until it was empty.

Comment 4 Mike Burns 2013-07-12 12:16:06 UTC
There was a bug in the vdsm included in this version of ovirt-node where it didn't clean up its log files completely: it kept a file handle open to a deleted file and continued to write to it.

Comment 6 Yaniv Bronhaim 2014-03-02 12:32:47 UTC
Here the /var/log volume got full. You have only 2.0G set for it, so that is possible and reasonable. Do we want vdsm to clear up rotated logs after they reach a certain size/age, or to become non-responsive if the disk is full (which some users may actually prefer)?

About the cleaning: it is odd that after the restart vdsm was still non-operational; that does not sound related to vdsm specifically.

Comment 7 Yaniv Bronhaim 2014-03-02 17:46:57 UTC
vdsm does not start at all when the space for /var/log/vdsm is full:
...
vdsm: Running load_needed_modules
vdsm: Running tune_system
vdsm: Running test_space
vdsm: low log space
vdsm: stopped during execute test_space task (task returned with error code 1).
vdsm start                                                 [FAILED]
10000

And after cleanup it does start properly.
The command it runs is: df -P /var/log/vdsm | awk '{print $4}' | tail -1

and this check passes only if the returned value is 10000 or greater.

was that the case?
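
For illustration, here is a minimal Python sketch of an equivalent free-space check (the real check in systemd-vdsmd is the shell pipeline quoted above; the function name and the use of os.statvfs are illustrative assumptions, not vdsm code):

    # Hypothetical rendering of the pre-start space check; not the actual
    # systemd-vdsmd code, which uses the df pipeline quoted above.
    import os

    MIN_FREE_KB = 10000  # threshold mentioned in this comment

    def log_space_ok(path="/var/log/vdsm", min_free_kb=MIN_FREE_KB):
        """Mirror `df -P path | awk '{print $4}' | tail -1` >= min_free_kb."""
        st = os.statvfs(path)
        free_kb = st.f_bavail * st.f_frsize // 1024  # space available to non-root, in KiB
        return free_kb >= min_free_kb

    if __name__ == "__main__":
        if not log_space_ok():
            raise SystemExit("vdsm: low log space")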

Comment 8 Lukas Grimm 2014-03-03 08:45:21 UTC
Since we updated our hypervisors some time ago, I can't reproduce the problem. Unfortunately I don't have any spare hardware around to replicate the problem again. We use "fat hypervisors" on CentOS now...

But I think I remember the log looked like the output above.

Hope this helps.

Comment 9 Yaniv Bronhaim 2014-03-03 08:57:33 UTC
We do have a behavior here that we need to decide whether it is worth fixing. vdsm checks the spare space under /var/log/vdsm before starting, but supervdsm doesn't, so even if you clean up space, the supervdsm service may keep writing to it because nothing prevents it from starting. I'm not sure that requires another check there; the only thing we want to prevent is being operational when such a space issue occurs, and we get that with vdsm's pre-start task check.

In your case the limited space covers everything under /var/log, so even if you cleaned the vdsm folder once, other services may have written under /var/log, filled it up, and vdsm could not start properly ([1] < 10000).


Based on this verification I don't consider it a bug, and I would close it if there are no objections.


[1] df -P /var/log/vdsm | awk '{print $4}' | tail -1

Comment 10 Lukas Grimm 2014-03-03 09:14:14 UTC
In my opinion it's worth fixing (building a workaround) in the thin hypervisor image, since it ships with the 2GB /var/log; maybe this is already done in a newer version. For vdsm I think it works as designed. You may close this report.

Comment 11 Yaniv Bronhaim 2014-03-03 09:30:18 UTC
We can add a size limit or age to the rotated vdsm.log files for both vdsm and supervdsm; that would help decrease the size of the vdsm logs (see the pruning sketch below). But you still have a bunch of services that write to /var/log and can cause this result, so I'm not sure it would help. The way to work with that host is to expand the /var/log volume size.
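
For illustration, a minimal sketch of what such age/size-based pruning of rotated logs could look like (a hypothetical helper, not existing vdsm code; the directory, glob pattern, and limits are assumptions):

    # Hypothetical pruning helper; directory, pattern, age and size caps are
    # illustrative assumptions, not vdsm defaults.
    import glob
    import os
    import time

    LOG_DIR = "/var/log/vdsm"
    MAX_AGE_DAYS = 14             # drop rotated logs older than this
    MAX_TOTAL_BYTES = 500 << 20   # cap total size of rotated logs at ~500 MiB

    def prune_rotated_logs(log_dir=LOG_DIR):
        rotated = sorted(glob.glob(os.path.join(log_dir, "*.log.*")),
                         key=os.path.getmtime)  # oldest first
        now = time.time()
        # Age-based pruning.
        for path in list(rotated):
            if now - os.path.getmtime(path) > MAX_AGE_DAYS * 86400:
                os.remove(path)
                rotated.remove(path)
        # Size-based pruning: delete the oldest files until under the cap.
        total = sum(os.path.getsize(p) for p in rotated)
        for path in rotated:
            if total <= MAX_TOTAL_BYTES:
                break
            total -= os.path.getsize(path)
            os.remove(path)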

Comment 12 Barak 2014-03-03 11:11:13 UTC
The only real option to solve the issue is to enable vdsm/svdsm to run with no logs (even if we clean all vdsm logs, /var/log can still be full), and I don't think that is an option. I also think we report to the engine that /var/log has filled up, and there might be a warning about it (Yaniv, please verify).

Other than that I think this bug should be closed.

Arthur ?

Comment 13 Yaniv Bronhaim 2014-03-03 12:55:42 UTC
Vdsm should not start running without a log. Local storage can be set up properly for /var/log. There is no report on the engine side; the host just becomes non-operational, and by looking at its syslog we'll see "vdsm: low log space". vdsm does not start at all, so the engine can't know the reason properly.

Comment 14 Sandro Bonazzola 2014-03-04 09:26:21 UTC
This is an automated message.
Re-targeting all non-blocker bugs still open on 3.4.0 to 3.4.1.

Comment 15 Arthur Berezin 2014-03-17 09:35:38 UTC
This could happen at runtime as well. When VDSM fails due to a full /var/log/ partition, it should report back to the Engine with the exact reason it failed, so a user could understand why the host is not able to go back to an operational state straight from the Engine UI.

Comment 16 Yaniv Bronhaim 2014-03-17 12:48:26 UTC
vdsm can't do it. We use Python's logging module all around vdsm (http://docs.python.org/2/library/logging.html). When we log a new message, the logging flush method can raise exceptions to the caller; this can be an out-of-space error, as that's what we get when the log partition is full.

But still, we can't wrap each logging call in a try clause, so the only way to catch such an exception is to intervene in the logging module itself, which we can't do with vdsm-specific code such as reporting something to the engine.
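
For illustration only, a minimal sketch (not actual vdsm code) of one way such an intervention at the logging-handler level could look; the class name and the notify callback are hypothetical, and whether this fits vdsm's real logging setup is not established here:

    # Sketch only: route disk-full errors from the logging machinery to a
    # single hook instead of scattering try clauses around every log call.
    import errno
    import logging
    import sys

    class TolerantFileHandler(logging.FileHandler):
        def __init__(self, filename, notify=None, **kwargs):
            logging.FileHandler.__init__(self, filename, **kwargs)
            self._notify = notify      # hypothetical out-of-band hook
            self._reported = False

        def handleError(self, record):
            # Called by the logging module when emit() fails for a record.
            exc = sys.exc_info()[1]
            if isinstance(exc, EnvironmentError) and exc.errno == errno.ENOSPC:
                if self._notify is not None and not self._reported:
                    self._reported = True
                    self._notify("low log space")
            else:
                logging.FileHandler.handleError(self, record)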

AFAIU the solution should be to constantly report the host's partition status to the engine. With that, when a host becomes non-responsive, at least the user can check this info and might correlate it with the actual reason.

Dan, did vdsm use to report about its local disks status?

Comment 17 Dan Kenigsberg 2014-03-17 23:53:31 UTC
Vdsm still reports it on getVdsStats:

        diskStats = {'/tmp': {'free': '92573'},
                     '/var/log': {'free': '92573'},
                     '/var/log/core': {'free': '92573'},
                     '/var/run/vdsm/': {'free': '92573'}}

and BTW, it is possible (though quite a big feature, with multiple caveats) to replace local file loggers with a centralized rsyslog-based network logger.

Comment 18 Yaniv Bronhaim 2014-03-18 08:44:33 UTC
Ohh, Barak was right of course :) I searched for it in the vds_statistics table, but the engine saves those only in a cache in the VDS object. The engine alerts in the audit log when the space passes the VdsLocalDisksCriticallyLowFreeSpace and VdsLocalDisksLowFreeSpace thresholds, in each refreshVdsRunTimeInfo interval (in alertIfLowDiskSpaceOnHost).
In my testing I checked only after vdsm was already down, so I missed that warning. It appears before vdsm goes down (if vdsm was and states were collected alright). Is that enough? Maybe the audit log is not enough and it should appear in the UI as well?
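
For illustration, a hypothetical Python rendering of that threshold check (the real engine code, alertIfLowDiskSpaceOnHost, is Java; the threshold values and the assumption that 'free' is reported in MB are illustrative, not the engine's configuration):

    # Stand-in values; per this comment the engine compares against its
    # VdsLocalDisksLowFreeSpace / VdsLocalDisksCriticallyLowFreeSpace settings.
    LOW_FREE_MB = 1000
    CRITICALLY_LOW_FREE_MB = 500

    def disk_space_alerts(disk_stats):
        """disk_stats is the getVdsStats structure shown in comment 17,
        e.g. {'/var/log': {'free': '92573'}, ...}."""
        alerts = []
        for path, stats in disk_stats.items():
            free_mb = int(stats['free'])
            if free_mb < CRITICALLY_LOW_FREE_MB:
                alerts.append(('critical', path, free_mb))
            elif free_mb < LOW_FREE_MB:
                alerts.append(('warning', path, free_mb))
        return alerts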

Comment 19 Yaniv Bronhaim 2014-03-18 08:50:49 UTC
(In reply to Yaniv Bronhaim from comment #18)
> ohh barak was right of course :) searched for it in the vds_statistics
> table, but engine saves those only in cache in VDS object. engine alerts it
> in the auditlog when the space passes the
> VdsLocalDisksCriticallyLowFreeSpace and VdsLocalDisksLowFreeSpace in each
> refreshVdsRunTimeInfo interval (in alertIfLowDiskSpaceOnHost).
> on my testings i checked only after vdsm was down already so i missed that
> warning. it appears before vdsm goes down (if vdsm was and states were

if vdsm was UP and responsive 

> collected alright). is that enough? maybe auditlog is not enough and should
> appear in UI as well?

I mean something more related to the host, like signing the host's line in the main Host tab with a critical warning label.

Comment 20 Arthur Berezin 2014-03-20 16:32:09 UTC
(In reply to Yaniv Bronhaim from comment #19)
> (In reply to Yaniv Bronhaim from comment #18)
> > ohh barak was right of course :) searched for it in the vds_statistics
> > table, but engine saves those only in cache in VDS object. engine alerts it
> > in the auditlog when the space passes the
> > VdsLocalDisksCriticallyLowFreeSpace and VdsLocalDisksLowFreeSpace in each
> > refreshVdsRunTimeInfo interval (in alertIfLowDiskSpaceOnHost).
> > on my testings i checked only after vdsm was down already so i missed that
> > warning. it appears before vdsm goes down (if vdsm was and states were
> 
> if vdsm was UP and responsive 
> 
> > collected alright). is that enough? maybe auditlog is not enough and should
> > appear in UI as well?
> 
> means more related to the host, like signing the host's line in the main
> Host tab with a critical warning label.

Yes, we should sign the host with a warning in the main Hosts tab, and add VdsLocalDisksCriticallyLowFreeSpace and VdsLocalDisksLowFreeSpace to "Alerts" in the UI.
VDSM should also report the same alerts to the Engine when it fails to start due to a full logs partition.

Comment 21 Dan Kenigsberg 2014-03-20 16:41:17 UTC
(In reply to Arthur Berezin from comment #20)
...
> VDSM should also report to Engine with same alerts when it fails to start
> due to full logs partition.

Note: When Vdsm fails to start it cannot report any alert... It's dead.

Comment 22 Yaniv Bronhaim 2014-03-23 11:54:21 UTC
Can't we let vdsmd start up in a lite mode with a simple XML binding, which would allow reporting the error in the getCaps response?

Comment 23 Dan Kenigsberg 2014-04-07 09:39:19 UTC
Very often does Vdsm code assume that it can write to the logs. Failure to do so generates an exception (bz 615733). Ignoring such exceptions seems dangerous to me. I do not like it, but it is doable.

Comment 24 Barak 2014-04-13 14:43:45 UTC
This is going nowhere, so before closing this bug as NOTABUG, here is the summary.

- vdsm reports diskStats (comment #17)
- engine logs a warning in the event log (comment #18)

From the VDSM side there is no real option, nor desire, to run VDSM with no logs.
In case vdsm becomes non-responsive due to lack of disk space, the event can still be viewed in the event log.