Bug 1296930 - ovirt-engine fails with "too many open files"
ovirt-engine fails with "too many open files"
Status: CLOSED CURRENTRELEASE
Product: ovirt-engine
Classification: oVirt
Component: General (Show other bugs)
3.6.0.3
x86_64 Linux
unspecified Severity high (vote)
: ovirt-3.6.3
: 3.6.3.2
Assigned To: Piotr Kliczewski
Ondra Machacek
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2016-01-08 08:22 EST by Paul Wightman
Modified: 2016-03-11 02:24 EST (History)
7 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-03-11 02:24:53 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
rule-engine: ovirt‑3.6.z+
rule-engine: exception+
mgoldboi: planning_ack+
oourfali: devel_ack+
pstehlik: testing_ack+


Attachments (Terms of Use)
engine.log (293.22 KB, text/plain)
2016-01-08 08:22 EST, Paul Wightman
no flags Details
Additional log files per needinfo request. (13.89 MB, application/zip)
2016-01-11 11:52 EST, Paul Wightman
no flags Details
Additional Log File (3.46 MB, application/x-gzip)
2016-01-20 10:57 EST, Paul Wightman
no flags Details
Log files immediatly before/after application of test jar file. (87.47 KB, application/zip)
2016-01-27 08:18 EST, Paul Wightman
no flags Details
New log file to accompany possible postgres leak (259.25 KB, application/zip)
2016-02-01 09:34 EST, Paul Wightman
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 52635 master MERGED connect: close channel when NoRouteToHostException 2016-02-02 03:28 EST
oVirt gerrit 52992 ovirt-3.6 MERGED connect: close channel when NoRouteToHostException 2016-02-02 10:52 EST
oVirt gerrit 53234 ovirt-3.6 MERGED Release to 1.1.8 2016-02-09 03:35 EST
oVirt gerrit 53271 ovirt-engine-3.6 MERGED jsonrpc: version bump 2016-02-11 07:10 EST
oVirt gerrit 53274 master MERGED jsonrpc: version bump 2016-02-11 01:59 EST
oVirt gerrit 53411 ovirt-engine-3.6.3 MERGED jsonrpc: version bump 2016-02-11 09:42 EST

  None (edit)
Description Paul Wightman 2016-01-08 08:22:51 EST
Created attachment 1112870 [details]
engine.log

Description of problem:

After running for a few days, ovirt-engine stops responding to RHEV API calls.  The oVirt portal is also non-responsive.  This started occurring shortly after the 3.6.0 release (about the time the async updates started getting released for ovirt-hosted-engine-ha - see Bug 1286850), but we also experienced a hardware fault around that time so it may be coincidental.


Version-Release number of selected component (if applicable): 
ovirt-engine-3.6.0.3-1.el6.noarch
CentOS 6.7 (KVM manager - ovirt-engine host)
CentOS 6.7 & 7.1 KVM hosts


How reproducible:
The system is simply left running with normal, day-to-day user interfaces.  ovirt-engine seems to fail every 3-4 days.  My environment has 8 hosts (7 active, 1 hard-down due to issues but not in maintenance mode for some reason) spread across 3 data centers (all local) hosting a total of 24 active VMs.  VM state is stable - we haven't been creating/destroying or even rebooting VMs much lately.  Nagios is used to monitor system health through the API.

Steps to Reproduce:
1. Boot system and VMs.  Leave one host offline but do not place in maintenance mode.
2. Let run under normal operations.

Actual results:
After a few days, ovirt-engine stops responding to RHEV API calls from Nagios monitor.  oVirt Web Portal is unresponsive.  'lsof' returns over 81,000 open file descriptors, the majority of which are of the form:

java       2164     ovirt *230u     sock                0,6      0t0     464370 can't identify protocol
java       2164     ovirt *231u     sock                0,6      0t0     464377 can't identify protocol
java       2164     ovirt *232u     sock                0,6      0t0     464416 can't identify protocol
java       2164     ovirt *233u     sock                0,6      0t0     464421 can't identify protocol
java       2164     ovirt *234u     sock                0,6      0t0     464448 can't identify protocol
java       2164     ovirt *235u     sock                0,6      0t0     464457 can't identify protocol
java       2164     ovirt *236u     sock                0,6      0t0     464480 can't identify protocol
java       2164     ovirt *237u     sock                0,6      0t0     464489 can't identify protocol

engine.log indicates repeated SocketExceptions for "Too many open files" as below...

2016-01-08 03:39:36,177 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] (DefaultQuartzScheduler_Worker-47) [46614a66] Command 'GetCapabilitiesVDSCommand(HostName = tms-kvmsec01.****, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='727986e5-c7bc-4921-89ff-218007bed1f3', vds='Host[tms-kvmsec01.****,727986e5-c7bc-4921-89ff-218007bed1f3]'})' execution failed: java.net.SocketException: Too many open files
2016-01-08 03:39:36,177 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-47) [46614a66] Failure to refresh Vds runtime info: java.net.SocketException: Too many open files
2016-01-08 03:39:36,178 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-47) [46614a66] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.net.SocketException: Too many open files
	at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.createNetworkException(VdsBrokerCommand.java:157) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:120) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65) [vdsbroker.jar:]
	at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
	at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.VdsManager.refreshCapabilities(VdsManager.java:634) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:119) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.HostMonitoring.refresh(HostMonitoring.java:84) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:226) [vdsbroker.jar:]
	at sun.reflect.GeneratedMethodAccessor123.invoke(Unknown Source) [:1.7.0_91]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_91]
	at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_91]
	at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:]
	at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:]
	at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]
Caused by: java.net.SocketException: Too many open files
	at sun.nio.ch.Net.socket0(Native Method) [rt.jar:1.7.0_91]
	at sun.nio.ch.Net.socket(Net.java:441) [rt.jar:1.7.0_91]
	at sun.nio.ch.Net.socket(Net.java:434) [rt.jar:1.7.0_91]
	at sun.nio.ch.SocketChannelImpl.<init>(SocketChannelImpl.java:105) [rt.jar:1.7.0_91]
	at sun.nio.ch.SelectorProviderImpl.openSocketChannel(SelectorProviderImpl.java:60) [rt.jar:1.7.0_91]
	at java.nio.channels.SocketChannel.open(SocketChannel.java:142) [rt.jar:1.7.0_91]
	at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient$1.call(ReactorClient.java:108) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient$1.call(ReactorClient.java:100) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable.call(Retryable.java:27) [vdsm-jsonrpc-java-client.jar:]
	at java.util.concurrent.FutureTask.run(FutureTask.java:262) [rt.jar:1.7.0_91]
	at org.ovirt.vdsm.jsonrpc.client.utils.ReactorScheduler.performPendingOperations(ReactorScheduler.java:28) [vdsm-jsonrpc-java-client.jar:]
	at org.ovirt.vdsm.jsonrpc.client.reactors.Reactor.run(Reactor.java:61) [vdsm-jsonrpc-java-client.jar:]

2016-01-08 03:39:36,305 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to /****.34
2016-01-08 03:39:36,305 WARN  [org.ovirt.vdsm.jsonrpc.client.utils.retry.Retryable] (SSL Stomp Reactor) [] Retry failed
2016-01-08 03:39:36,305 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (DefaultQuartzScheduler_Worker-42) [5a40a785] Exception during connection
2016-01-08 03:39:36,309 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand] (DefaultQuartzScheduler_Worker-42) [5a40a785] Command 'ListVDSCommand(HostName = tms-kvmsec01.****, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='727986e5-c7bc-4921-89ff-218007bed1f3', vds='Host[tms-kvmsec01.****,727986e5-c7bc-4921-89ff-218007bed1f3]'})' execution failed: java.net.SocketException: Too many open files
2016-01-08 03:39:36,309 INFO  [org.ovirt.engine.core.vdsbroker.PollVmStatsRefresher] (DefaultQuartzScheduler_Worker-42) [5a40a785] Failed to fetch vms info for host 'tms-kvmsec01.****' - skipping VMs monitoring.



Expected results:
ovirt-engine is stable and continues to respond to API and web portal calls.


Additional info:
I have placed the down host into maintenance mode (as it should have been all along) and will monitor to see if any more errors are shown.  However, it seems clear that ovirt-engine (or some other ovirt related process) is opening things and not closing them properly. My suspicion is that the down/unresponsive host is key to reproducing the bug.

I've attached a portion of my engine.log (full log is 28MB).  The x.x.x.34 address does belong to the unresponsive host.
Comment 1 Piotr Kliczewski 2016-01-11 08:26:26 EST
Can you please provide complete logs for the period that the engine was running?
You mentioned 3-4 days.
Comment 2 Paul Wightman 2016-01-11 11:52 EST
Created attachment 1113633 [details]
Additional log files per needinfo request.

I believe this covers an entire cycle.
Comment 3 Piotr Kliczewski 2016-01-12 06:40:17 EST
I checked the logs and I can see first occurrence of the issue at 2015-12-27 03:06:16,423 which means that it is one of the first entries in the oldest log provided. Can you look for time when the issue started to occur and provide logs for a day or two before it occurred.

I tried to reproduce the issue using single host and it looks like fds are closed properly (checked by using lsof and jprofiler).

I am not sure whether it is related to the issue but I can see:

2015-12-27 03:06:07,818 ERROR [org.ovirt.engine.extensions.aaa.builtin.kerberosldap.GSSAPIDirContextAuthenticationStrategy] (default task-79) [] Kerberos error: Cannot locate KDC
2015-12-27 03:06:07,819 ERROR [org.ovirt.engine.extensions.aaa.builtin.kerberosldap.DirectorySearcher] (default task-79) [] Failed ldap search server ldap://level.{domain}:389 using user nagios@{domain} due to Kerberos error. Please check log for further details.. We should not try the next server
2015-12-27 03:06:07,819 ERROR [org.ovirt.engine.extensions.aaa.builtin.kerberosldap.LdapAuthenticateUserCommand] (default task-79) [] Failed authenticating user: nagios to domain {domain}. Ldap Query Type is getUserByName
2015-12-27 03:06:07,819 ERROR [org.ovirt.engine.extensions.aaa.builtin.kerberosldap.LdapAuthenticateUserCommand] (default task-79) [] Kerberos error. Please check log for further details.
2015-12-27 03:06:07,819 ERROR [org.ovirt.engine.extensions.aaa.builtin.kerberosldap.LdapBrokerCommandBase] (default task-79) [] Failed to run command LdapAuthenticateUserCommand. Domain is {domain}. User is nagios.
2015-12-27 03:06:07,819 ERROR [org.ovirt.engine.core.aaa.filters.BasicAuthenticationFilter] (default task-79) [] User nagios authentication failed. profile is {domain}. Invocation Result code is 0. Authn result code is GENERAL_ERROR
Comment 4 Martin Perina 2016-01-12 07:42:56 EST
Hi Paul,

I looked at engine.log and there are lot of kerberos error which may cause login errors using API. So please check your kerberos environment. If everything is OK (from the log provided I'd guess that you have some issue with either kerberos config on the host or DNS setup, because there's and issue to locate KDC). More details can be found if you enable debug logging on AAA module by execution of:

    /usr/share/jbossas/bin/jboss-cli.sh \
        --connect \
        --timeout=30000 \
        --controller=localhost:8706 \
        --user=admin@internal \
        --commands="
            if (outcome != success) of /subsystem=logging/logger=org.ovirt.engine.extensions.aaa:read-attribute(name=level),
              /subsystem=logging/logger=org.ovirt.engine.extensions.aaa:add,
            end-if,
            /subsystem=logging/logger=org.ovirt.engine.extensions.aaa:write-attribute(name=level,value=ALL)
        "

This command uses "admin@internal" user so it's not affected by kerbldap issue.
When done successfully please try to execute your API call and attach new logs.

Also please be aware that kerbldap extension (configured by engine-manage-domains) has been deprecated in RHEV 3.5 and you really should migrate to aaa-ldap extension. For that please install ovirt-engine-extension-aaa-ldap* packages and take a look at provided README files.
Comment 5 Martin Perina 2016-01-12 08:13:16 EST
(In reply to Martin Perina from comment #4)
> Hi Paul,
> 
> I looked at engine.log and there are lot of kerberos error which may cause
> login errors using API. So please check your kerberos environment. If
> everything is OK (from the log provided I'd guess that you have some issue
> with either kerberos config on the host or DNS setup, because there's and
> issue to locate KDC). More details can be found if you enable debug logging
> on AAA module by execution of:
> 
>     /usr/share/jbossas/bin/jboss-cli.sh \
>         --connect \
>         --timeout=30000 \
>         --controller=localhost:8706 \
>         --user=admin@internal \
>         --commands="
>             if (outcome != success) of
> /subsystem=logging/logger=org.ovirt.engine.extensions.aaa:read-
> attribute(name=level),
>               /subsystem=logging/logger=org.ovirt.engine.extensions.aaa:add,
>             end-if,
>            
> /subsystem=logging/logger=org.ovirt.engine.extensions.aaa:write-
> attribute(name=level,value=ALL)
>         "

Sorry, I was confused by "RHEV API call" and I didn't noticed that you are using oVirt, this is correct command for oVirt 3.6:

    /usr/share/ovirt-engine-wildfly/bin/jboss-cli.sh \
        --connect \
        --timeout=30000 \
        --controller=localhost:8706 \
        --user=admin@internal \
        --commands="
            if (outcome != success) of /subsystem=logging/logger=org.ovirt.engine.extensions.aaa:read-attribute(name=level),
              /subsystem=logging/logger=org.ovirt.engine.extensions.aaa:add,
            end-if,
            /subsystem=logging/logger=org.ovirt.engine.extensions.aaa:write-attribute(name=level,value=ALL)
        "
Comment 6 Paul Wightman 2016-01-20 10:57 EST
Created attachment 1116683 [details]
Additional Log File

Sorry for the delay... I've been out of town.  I thought I had pulled enough files to show a complete cycle, but I guess I fell short by one.  Attached is the next log file in the series, and it does show the error.

Interestingly, the KDC errors seemed to stop around 6 Jan.  The system has been running for 16 days and only shows ~13,500 open file descriptors.

I will look into changing the LDAP backend to aaa-ldap, thought that may take a while for me to get done.
Comment 7 Martin Perina 2016-01-21 08:36:12 EST
(In reply to Paul Wightman from comment #6)
> Created attachment 1116683 [details]
> Additional Log File

Hmm, the log ends on Jan 03, so there are no debug logs as I sent you how to enable debug log on Jan 12. If the issue with login still persists, please attach up-to-date log with debug information. Also please be aware that the above method how to turn on debug logs can be used only for currently running ovirt-engine service, if you restart the service, log config is restored to default.

> 
> Sorry for the delay... I've been out of town.  I thought I had pulled enough
> files to show a complete cycle, but I guess I fell short by one.  Attached
> is the next log file in the series, and it does show the error.
> 
> Interestingly, the KDC errors seemed to stop around 6 Jan.  The system has
> been running for 16 days and only shows ~13,500 open file descriptors.
> 
> I will look into changing the LDAP backend to aaa-ldap, thought that may
> take a while for me to get done.

You can try ovirt-engine-kerbldap-migration tool [1] to easily transfer you existing kerbldap config into the new aaa-ldap config. For details please take a look at [2] and also feel free to ask if something doesn't work.


[1] https://github.com/machacekondra/ovirt-engine-kerbldap-migration
[2] https://github.com/machacekondra/ovirt-engine-kerbldap-migration/blob/master/README.md
Comment 8 Piotr Kliczewski 2016-01-25 04:07:18 EST
I pushed a patch. Can you please help me verify whether it fixes the issue?
I could send you a jar a tell how to replace.
Comment 9 Paul Wightman 2016-01-26 15:37:53 EST
Piotr,

I would be happy to test a patch.  Please send instructions and the jar file.
Comment 10 Piotr Kliczewski 2016-01-27 05:07:33 EST
Paul,

I sent the file and steps to your email. Please let me know about the results.
Comment 11 Paul Wightman 2016-01-27 08:18 EST
Created attachment 1118785 [details]
Log files immediatly before/after application of test jar file.

Piotr,

Patch was installed this morning and initial indications are good.  It has been running for about 45 minutes now and ovirt only owns ~1432 files (it's been steady at about that number for 25 minutes).  In that same amount of time yesterday, ovirt was up to just over 2000 open files.

I've attached the engine.log from today as well as a log showing open file descriptors about every 30 seconds starting at 10:10am local time yesterday.  Prior to leaving for the evening, I placed the non-responsive host in maintenance mode.  The extra log shows a corresponding leveling off of the number of ovirt-owned files (~5880) starting at about the same time the host was put in maintenance mode.  The number of ovirt-owned files started increasing rapidly again this morning when I took the host out of maintenance mode just prior to installing the test jar file.

I will continue to monitor and let you know how it goes.
Comment 12 Paul Wightman 2016-01-28 08:20:28 EST
Piotr,

After 24 hours, total open file descriptors is relatively steady at ~16,400.  Ovirt-owned file descriptors have remained steady between 1,431 and 1,435.

engine.log is throwing the below exception every 3 minutes:

2016-01-28 08:09:44,150 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-61) [7b61476d] Failure to refresh Vds runtime info: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2016-01-28 08:09:44,150 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-61) [7b61476d] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
        at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand.executeVdsBrokerCommand(GetCapabilitiesVDSCommand.java:16) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65) [vdsbroker.jar:]
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
        at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsManager.refreshCapabilities(VdsManager.java:634) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:119) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.HostMonitoring.refresh(HostMonitoring.java:84) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:226) [vdsbroker.jar:]
        at sun.reflect.GeneratedMethodAccessor233.invoke(Unknown Source) [:1.7.0_91]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_91]
        at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_91]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]

Not uploading a new log file as this can be seen towards the end of the log file uploaded yesterday.
Comment 13 Piotr Kliczewski 2016-01-28 08:35:20 EST
I am happy that we do not loose fds.

The issue that you mentioned occurs when there is no response from vdsm within 3 minutes. This is high level mechanism making sure that we handle vdsm side issues correctly.

From the stack trace I can see that getCapabilieties call was affected. Please provide vdsm log for the host that it is failing.
Comment 14 Paul Wightman 2016-01-28 08:42:56 EST
VDSM logs are local to the host, right?  The failing/failed host is physically powered off but not in maintenance mode so there are no logs to be gathered from there.  Incorrect cluster management, I know, as it should be in maintenance mode, but that simple oversight was the genesis of this issue.
Comment 15 Piotr Kliczewski 2016-01-28 09:00:31 EST
Based on you description here is what is happening. The engine attempts to trigger monitoring cycle for the host since it is not in maintenance mode. Even though the connection fails the call is tracked and after high level timeout we can see exception as in comment #18.

What is the state of the host that you see on the UI?
Comment 16 Paul Wightman 2016-01-28 09:08:10 EST
In the UI, host status is "Non Responsive."  The Alerts tab is showing "VDSM tms-kvmsec01.{domain} command failed: Message timeout which can be caused by communication issues" (exception text) every 3 minutes, as well.

Understand that the exception is due to a failure of the host to respond (because it's powered off but not in maintenance mode).  Not sure if this is really of major concern, but I did want to point it out because there was a change in behavior (namely, a new exception/log message).
Comment 17 Piotr Kliczewski 2016-01-28 09:29:45 EST
This exception is expected when a host is not responding or down. I need to explore why monitoring cycle is triggered for host in "Non Responsive" but this issue is not related to file descriptor leak.

Thank you for verifying the fix.
Comment 18 Paul Wightman 2016-02-01 09:34 EST
Created attachment 1120124 [details]
New log file to accompany possible postgres leak

Piotr,

I left the system running over the weekend in the same downed-host state and came in to a very sluggish manager host this morning.  

top reported a LOT of postgres-owned "postmaster" commands were running.  I checked open file descriptors twice and found:

TIME: 07:32:46 lsof Total: 21038 ovirt-owned: 1455 postgres-owned: 13236
TIME: 08:59:27 lsof Total: 30548 ovirt-owned: 1526 postgres-owned: 21491

Upon restarting ovirt-engine, the both ovirt- and postgres-owned file descriptors fell to around 800 and the system became much more responsive.  My descriptor monitor log showed:

TIME: 09:09:29 lsof Total: 10423 ovirt-owned: 1205 postgres-owned: 2941
TIME: 09:10:01 lsof Total: 10456 ovirt-owned: 1205 postgres-owned: 3031
TIME: 09:10:32 lsof Total: 10717 ovirt-owned: 1229 postgres-owned: 3211
TIME: 09:11:04 lsof Total: 10759 ovirt-owned: 1250 postgres-owned: 3229
TIME: 09:11:35 lsof Total: 10690 ovirt-owned: 1254 postgres-owned: 3156
TIME: 09:12:07 lsof Total: 10693 ovirt-owned: 1273 postgres-owned: 3142
TIME: 09:12:38 lsof Total: 10717 ovirt-owned: 1273 postgres-owned: 3168
TIME: 09:13:10 lsof Total: 10735 ovirt-owned: 1290 postgres-owned: 3170
TIME: 09:13:41 lsof Total: 10905 ovirt-owned: 1290 postgres-owned: 3337
TIME: 09:14:13 lsof Total: 10892 ovirt-owned: 1299 postgres-owned: 3315
TIME: 09:14:44 lsof Total: 10931 ovirt-owned: 1299 postgres-owned: 3371
restarted ovirt-engine
TIME: 09:15:16 lsof Total: 7823 ovirt-owned: 1126 postgres-owned: 450
TIME: 09:15:48 lsof Total: 10808 ovirt-owned: 1159 postgres-owned: 3353
TIME: 09:16:19 lsof Total: 10858 ovirt-owned: 1184 postgres-owned: 3414
TIME: 09:16:51 lsof Total: 10733 ovirt-owned: 1183 postgres-owned: 3289
TIME: 09:17:22 lsof Total: 10912 ovirt-owned: 1207 postgres-owned: 3453
TIME: 09:17:54 lsof Total: 10924 ovirt-owned: 1208 postgres-owned: 3405
TIME: 09:18:26 lsof Total: 11117 ovirt-owned: 1232 postgres-owned: 3472
TIME: 09:18:58 lsof Total: 14579 ovirt-owned: 1252 postgres-owned: 6987
TIME: 09:19:30 lsof Total: 15194 ovirt-owned: 1277 postgres-owned: 7509
TIME: 09:20:02 lsof Total: 15141 ovirt-owned: 1289 postgres-owned: 7444
TIME: 09:20:34 lsof Total: 15332 ovirt-owned: 1292 postgres-owned: 7631
TIME: 09:21:07 lsof Total: 15304 ovirt-owned: 1305 postgres-owned: 7594
TIME: 09:21:39 lsof Total: 15302 ovirt-owned: 1304 postgres-owned: 7586
TIME: 09:22:11 lsof Total: 15268 ovirt-owned: 1314 postgres-owned: 7552
TIME: 09:22:43 lsof Total: 15352 ovirt-owned: 1315 postgres-owned: 7625
TIME: 09:23:15 lsof Total: 7951 ovirt-owned: 190 postgres-owned: 306
restarted ovirt-engine
TIME: 09:23:46 lsof Total: 8090 ovirt-owned: 1111 postgres-owned: 624
TIME: 09:24:18 lsof Total: 10446 ovirt-owned: 1177 postgres-owned: 2850
TIME: 09:24:50 lsof Total: 10405 ovirt-owned: 1176 postgres-owned: 2839

The UI showed alerts of "There is no full backup available, please run engine-backup to prevent data loss in case of corruption." about every 6 hours.

It seems possible that a leak may have moved from ovirt/java to something in the ovirt/postgres interface...

Today's engine.log is attached.
Comment 19 Piotr Kliczewski 2016-02-02 03:06:33 EST
It is good that we fixed one and it looks like the other one could be there since the beginning.

Eli can you please take a look?
Comment 20 Eli Mesika 2016-02-02 04:40:29 EST
(In reply to Piotr Kliczewski from comment #19)
> It is good that we fixed one and it looks like the other one could be there
> since the beginning.
> 
> Eli can you please take a look?

Can you please provide the exact PG version used ?
I see some reports regarding leaks on 8.4.x
Comment 21 Paul Wightman 2016-02-02 07:27:26 EST
The following packages are installed on the manager:

postgresql-libs-8.4.20-4.el6_7.x86_64
postgresql-server-8.4.20-4.el6_7.x86_64
postgresql-8.4.20-4.el6_7.x86_64

This does not appear to be a steady leak like the earlier issue.  It seems it jumped up rather sharply yesterday then has increased in a much more gradual trend with lots of dips and rises.  I have yet to zero in on a series of events that causes the rate to increase like it did yesterday.  It would seem possible that this is UI related as it jumped up most sharply yesterday while I was checking status of several running VMs and was having trouble getting the console to open (ultimately found to be due to the manager being so bogged down with all the extra postgres stuff).  The numbers remained mostly steady overnight when no one was using the system.  Now that I've opened the UI and launched some consoles, the numbers have risen some.

FYI... The client is Chrome v48.0.2564.97 m.  I tend to keep the UI on the Virtual Machines tab for the whole system (showing me all VMs across 3 datacenters in one view).  All consoles being opened are using SPICE.
Comment 22 Piotr Kliczewski 2016-02-08 11:08:18 EST
I am going to release the change which fixed fd leak in the engine. Paul if you want to track postgresql related leak please open new BZ.
Comment 23 Ondra Machacek 2016-02-22 10:16:48 EST
based on #comment 21 moving to verified.

Note You need to log in before you can comment on or make changes to this bug.