Bug 1296930
Summary: ovirt-engine fails with "too many open files"
Product: [oVirt] ovirt-engine
Component: General
Version: 3.6.0.3
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Reporter: Paul Wightman <pwightman>
Assignee: Piotr Kliczewski <pkliczew>
QA Contact: Ondra Machacek <omachace>
CC: bugs, emesika, mgoldboi, mperina, oourfali, pstehlik, pwightman
Target Milestone: ovirt-3.6.3
Target Release: 3.6.3.2
Flags: rule-engine: ovirt-3.6.z+, rule-engine: exception+, mgoldboi: planning_ack+, oourfali: devel_ack+, pstehlik: testing_ack+
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Type: Bug
oVirt Team: Infra
Last Closed: 2016-03-11 07:24:53 UTC
Description
Paul Wightman
2016-01-08 13:22:51 UTC

Piotr Kliczewski:
Can you please provide complete logs for the period that the engine was running? You mentioned 3-4 days.

Paul Wightman:
Created attachment 1113633 [details]
Additional log files per needinfo request.
I believe this covers an entire cycle.

Piotr Kliczewski:
I checked the logs and I can see the first occurrence of the issue at 2015-12-27 03:06:16,423, which means it is one of the first entries in the oldest log provided. Can you look for the time when the issue started to occur and provide logs for a day or two before that? I tried to reproduce the issue using a single host, and it looks like fds are closed properly (checked using lsof and JProfiler). I am not sure whether it is related to the issue, but I can see:

    2015-12-27 03:06:07,818 ERROR [org.ovirt.engine.extensions.aaa.builtin.kerberosldap.GSSAPIDirContextAuthenticationStrategy] (default task-79) [] Kerberos error: Cannot locate KDC
    2015-12-27 03:06:07,819 ERROR [org.ovirt.engine.extensions.aaa.builtin.kerberosldap.DirectorySearcher] (default task-79) [] Failed ldap search server ldap://level.{domain}:389 using user nagios@{domain} due to Kerberos error. Please check log for further details.. We should not try the next server
    2015-12-27 03:06:07,819 ERROR [org.ovirt.engine.extensions.aaa.builtin.kerberosldap.LdapAuthenticateUserCommand] (default task-79) [] Failed authenticating user: nagios to domain {domain}. Ldap Query Type is getUserByName
    2015-12-27 03:06:07,819 ERROR [org.ovirt.engine.extensions.aaa.builtin.kerberosldap.LdapAuthenticateUserCommand] (default task-79) [] Kerberos error. Please check log for further details.
    2015-12-27 03:06:07,819 ERROR [org.ovirt.engine.extensions.aaa.builtin.kerberosldap.LdapBrokerCommandBase] (default task-79) [] Failed to run command LdapAuthenticateUserCommand. Domain is {domain}. User is nagios.
    2015-12-27 03:06:07,819 ERROR [org.ovirt.engine.core.aaa.filters.BasicAuthenticationFilter] (default task-79) [] User nagios authentication failed. profile is {domain}. Invocation Result code is 0. Authn result code is GENERAL_ERROR

Martin Perina (comment #4):
Hi Paul, I looked at engine.log and there are a lot of Kerberos errors which may cause login errors through the API, so please check your Kerberos environment.
If everything is OK (from the log provided I'd guess that you have some issue with either the Kerberos config on the host or the DNS setup, because there's an issue locating the KDC), more details can be found if you enable debug logging on the AAA module by executing:

    /usr/share/jbossas/bin/jboss-cli.sh \
      --connect \
      --timeout=30000 \
      --controller=localhost:8706 \
      --user=admin@internal \
      --commands="
        if (outcome != success) of /subsystem=logging/logger=org.ovirt.engine.extensions.aaa:read-attribute(name=level),
          /subsystem=logging/logger=org.ovirt.engine.extensions.aaa:add,
        end-if,
        /subsystem=logging/logger=org.ovirt.engine.extensions.aaa:write-attribute(name=level,value=ALL)
      "

This command uses the "admin@internal" user, so it is not affected by the kerbldap issue. When done successfully, please try to execute your API call and attach new logs. Also please be aware that the kerbldap extension (configured by engine-manage-domains) has been deprecated since RHEV 3.5 and you really should migrate to the aaa-ldap extension. To do so, please install the ovirt-engine-extension-aaa-ldap* packages and take a look at the provided README files.

Martin Perina (comment #5):
(In reply to Martin Perina from comment #4)
Sorry, I was confused by "RHEV API call" and I didn't notice that you are using oVirt. This is the correct command for oVirt 3.6:

    /usr/share/ovirt-engine-wildfly/bin/jboss-cli.sh \
      --connect \
      --timeout=30000 \
      --controller=localhost:8706 \
      --user=admin@internal \
      --commands="
        if (outcome != success) of /subsystem=logging/logger=org.ovirt.engine.extensions.aaa:read-attribute(name=level),
          /subsystem=logging/logger=org.ovirt.engine.extensions.aaa:add,
        end-if,
        /subsystem=logging/logger=org.ovirt.engine.extensions.aaa:write-attribute(name=level,value=ALL)
      "

Paul Wightman:
Created attachment 1116683 [details]
Additional Log File
Sorry for the delay... I've been out of town. I thought I had pulled enough files to show a complete cycle, but I guess I fell short by one. Attached is the next log file in the series, and it does show the error.
Interestingly, the KDC errors seemed to stop around 6 Jan. The system has been running for 16 days and only shows ~13,500 open file descriptors.
I will look into changing the LDAP backend to aaa-ldap, though that may take a while for me to get done.
Martin Perina:
(In reply to Paul Wightman from comment #6)
Hmm, the log ends on Jan 03, so there are no debug logs, as I sent you instructions for enabling debug logging on Jan 12. If the issue with login still persists, please attach an up-to-date log with debug information. Also please be aware that the above method of turning on debug logs applies only to the currently running ovirt-engine service; if you restart the service, the log config is restored to the default.

You can try the ovirt-engine-kerbldap-migration tool [1] to easily transfer your existing kerbldap config into the new aaa-ldap config. For details please take a look at [2], and also feel free to ask if something doesn't work.

[1] https://github.com/machacekondra/ovirt-engine-kerbldap-migration
[2] https://github.com/machacekondra/ovirt-engine-kerbldap-migration/blob/master/README.md

Piotr Kliczewski:
I pushed a patch. Can you please help me verify whether it fixes the issue? I could send you a jar and tell you how to replace it.

Paul Wightman:
Piotr, I would be happy to test a patch. Please send instructions and the jar file.

Piotr Kliczewski:
Paul, I sent the file and steps to your email. Please let me know about the results.

Paul Wightman:
Created attachment 1118785 [details]
Log files immediately before/after application of the test jar file.
Piotr,
The patch was installed this morning and initial indications are good. It has been running for about 45 minutes now and ovirt only owns ~1,432 files (it has been steady at about that number for 25 minutes). In the same amount of time yesterday, ovirt was up to just over 2,000 open files.
I've attached the engine.log from today as well as a log showing open file descriptors about every 30 seconds starting at 10:10am local time yesterday. Prior to leaving for the evening, I placed the non-responsive host in maintenance mode. The extra log shows a corresponding leveling off of the number of ovirt-owned files (~5880) starting at about the same time the host was put in maintenance mode. The number of ovirt-owned files started increasing rapidly again this morning when I took the host out of maintenance mode just prior to installing the test jar file.
I will continue to monitor and let you know how it goes.
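The 30-second descriptor monitor mentioned above is not attached to the bug in scripted form. A minimal sketch that produces similar "TIME: … owned: …" lines might look like the following (assumptions: Linux with /proc and pgrep available; the reporter's actual counts came from lsof, which also lists memory-mapped files and per-thread duplicates and therefore reports higher totals):

```shell
#!/bin/sh
# Sum open file descriptors for every process owned by a given user by
# counting entries in /proc/<pid>/fd. Processes that vanish mid-scan or
# are unreadable are silently skipped.
user_fds() {
    total=0
    for pid in $(pgrep -u "$1" 2>/dev/null); do
        n=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
        total=$((total + n))
    done
    echo "$total"
}

# One sample line; wrap in `while :; do ...; sleep 30; done` to build a
# log like the one attached to this bug. The user names "ovirt" and
# "postgres" match this report's environment.
echo "TIME: $(date +%H:%M:%S) ovirt-owned: $(user_fds ovirt) postgres-owned: $(user_fds postgres)"
```

Note that fd counts read from /proc are a lower bound relative to the lsof numbers quoted in this thread, so trends (steady vs. climbing) are more meaningful than absolute values.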
Paul Wightman:
Piotr, after 24 hours the total number of open file descriptors is relatively steady at ~16,400. Ovirt-owned file descriptors have remained steady between 1,431 and 1,435. engine.log is throwing the exception below every 3 minutes:

    2016-01-28 08:09:44,150 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-61) [7b61476d] Failure to refresh Vds runtime info: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
    2016-01-28 08:09:44,150 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-61) [7b61476d] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
        at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand.executeVdsBrokerCommand(GetCapabilitiesVDSCommand.java:16) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65) [vdsbroker.jar:]
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
        at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsManager.refreshCapabilities(VdsManager.java:634) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:119) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.HostMonitoring.refresh(HostMonitoring.java:84) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:226) [vdsbroker.jar:]
        at sun.reflect.GeneratedMethodAccessor233.invoke(Unknown Source) [:1.7.0_91]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_91]
        at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_91]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]

I am not uploading a new log file, as this can be seen towards the end of the log file uploaded yesterday.

Piotr Kliczewski:
I am happy that we do not lose fds. The issue that you mentioned occurs when there is no response from vdsm within 3 minutes. This is a high-level mechanism making sure that we handle vdsm-side issues correctly. From the stack trace I can see that the getCapabilities call was affected. Please provide the vdsm log for the host that is failing.

Paul Wightman:
VDSM logs are local to the host, right? The failing/failed host is physically powered off but not in maintenance mode, so there are no logs to be gathered from there. Incorrect cluster management, I know, as it should be in maintenance mode, but that simple oversight was the genesis of this issue.

Piotr Kliczewski:
Based on your description, here is what is happening: the engine attempts to trigger a monitoring cycle for the host since it is not in maintenance mode. Even though the connection fails, the call is tracked, and after the high-level timeout we see the exception as in comment #18. What is the state of the host that you see in the UI?

Paul Wightman:
In the UI, the host status is "Non Responsive." The Alerts tab is also showing "VDSM tms-kvmsec01.{domain} command failed: Message timeout which can be caused by communication issues" (the exception text) every 3 minutes. I understand that the exception is due to a failure of the host to respond (because it is powered off but not in maintenance mode).
I am not sure if this is really of major concern, but I did want to point it out because there was a change in behavior (namely, a new exception/log message).

Piotr Kliczewski:
This exception is expected when a host is not responding or down. I need to explore why a monitoring cycle is triggered for a host in "Non Responsive", but that issue is not related to the file descriptor leak. Thank you for verifying the fix.

Paul Wightman:
Created attachment 1120124 [details]
New log file to accompany possible postgres leak
Piotr,
I left the system running over the weekend in the same downed-host state and came in to a very sluggish manager host this morning.
top reported that a large number of postgres-owned "postmaster" processes were running. I checked open file descriptors twice and found:
TIME: 07:32:46 lsof Total: 21038 ovirt-owned: 1455 postgres-owned: 13236
TIME: 08:59:27 lsof Total: 30548 ovirt-owned: 1526 postgres-owned: 21491
Upon restarting ovirt-engine, both the ovirt- and postgres-owned file descriptor counts fell to around 800 and the system became much more responsive. My descriptor monitor log showed:
TIME: 09:09:29 lsof Total: 10423 ovirt-owned: 1205 postgres-owned: 2941
TIME: 09:10:01 lsof Total: 10456 ovirt-owned: 1205 postgres-owned: 3031
TIME: 09:10:32 lsof Total: 10717 ovirt-owned: 1229 postgres-owned: 3211
TIME: 09:11:04 lsof Total: 10759 ovirt-owned: 1250 postgres-owned: 3229
TIME: 09:11:35 lsof Total: 10690 ovirt-owned: 1254 postgres-owned: 3156
TIME: 09:12:07 lsof Total: 10693 ovirt-owned: 1273 postgres-owned: 3142
TIME: 09:12:38 lsof Total: 10717 ovirt-owned: 1273 postgres-owned: 3168
TIME: 09:13:10 lsof Total: 10735 ovirt-owned: 1290 postgres-owned: 3170
TIME: 09:13:41 lsof Total: 10905 ovirt-owned: 1290 postgres-owned: 3337
TIME: 09:14:13 lsof Total: 10892 ovirt-owned: 1299 postgres-owned: 3315
TIME: 09:14:44 lsof Total: 10931 ovirt-owned: 1299 postgres-owned: 3371
restarted ovirt-engine
TIME: 09:15:16 lsof Total: 7823 ovirt-owned: 1126 postgres-owned: 450
TIME: 09:15:48 lsof Total: 10808 ovirt-owned: 1159 postgres-owned: 3353
TIME: 09:16:19 lsof Total: 10858 ovirt-owned: 1184 postgres-owned: 3414
TIME: 09:16:51 lsof Total: 10733 ovirt-owned: 1183 postgres-owned: 3289
TIME: 09:17:22 lsof Total: 10912 ovirt-owned: 1207 postgres-owned: 3453
TIME: 09:17:54 lsof Total: 10924 ovirt-owned: 1208 postgres-owned: 3405
TIME: 09:18:26 lsof Total: 11117 ovirt-owned: 1232 postgres-owned: 3472
TIME: 09:18:58 lsof Total: 14579 ovirt-owned: 1252 postgres-owned: 6987
TIME: 09:19:30 lsof Total: 15194 ovirt-owned: 1277 postgres-owned: 7509
TIME: 09:20:02 lsof Total: 15141 ovirt-owned: 1289 postgres-owned: 7444
TIME: 09:20:34 lsof Total: 15332 ovirt-owned: 1292 postgres-owned: 7631
TIME: 09:21:07 lsof Total: 15304 ovirt-owned: 1305 postgres-owned: 7594
TIME: 09:21:39 lsof Total: 15302 ovirt-owned: 1304 postgres-owned: 7586
TIME: 09:22:11 lsof Total: 15268 ovirt-owned: 1314 postgres-owned: 7552
TIME: 09:22:43 lsof Total: 15352 ovirt-owned: 1315 postgres-owned: 7625
TIME: 09:23:15 lsof Total: 7951 ovirt-owned: 190 postgres-owned: 306
restarted ovirt-engine
TIME: 09:23:46 lsof Total: 8090 ovirt-owned: 1111 postgres-owned: 624
TIME: 09:24:18 lsof Total: 10446 ovirt-owned: 1177 postgres-owned: 2850
TIME: 09:24:50 lsof Total: 10405 ovirt-owned: 1176 postgres-owned: 2839
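One way to tell whether growth like the above comes from many short-lived postmaster backends or from a few processes accumulating descriptors is to rank processes by fd count. This is a hedged sketch, not something from the original report; it reads /proc directly, so its numbers will be lower than the lsof totals above, which also include memory-mapped files:

```shell
#!/bin/sh
# List the ten processes holding the most open file descriptors.
# Assumption: Linux with /proc mounted; unreadable processes (other
# users' pids without privileges) simply count as zero and are skipped.
top_fd_holders() {
    for dir in /proc/[0-9]*; do
        n=$(ls "$dir/fd" 2>/dev/null | wc -l)
        if [ "$n" -gt 0 ]; then
            # Print: fd-count, process name, pid
            echo "$n $(cat "$dir/comm" 2>/dev/null) ${dir#/proc/}"
        fi
    done | sort -rn | head -10
}

top_fd_holders
```

Run as root (or the database owner) to see the postgres backends; a long tail of postmaster entries with similar counts would point at connection growth rather than a single leaking process.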
The UI showed alerts of "There is no full backup available, please run engine-backup to prevent data loss in case of corruption." about every 6 hours.
It seems possible that a leak may have moved from ovirt/java to something in the ovirt/postgres interface...
Today's engine.log is attached.
Piotr Kliczewski (comment #19):
It is good that we fixed one, and it looks like the other one could have been there since the beginning. Eli, can you please take a look?

(In reply to Piotr Kliczewski from comment #19)
Can you please provide the exact PG version used? I see some reports regarding leaks on 8.4.x.

Paul Wightman:
The following packages are installed on the manager:

    postgresql-libs-8.4.20-4.el6_7.x86_64
    postgresql-server-8.4.20-4.el6_7.x86_64
    postgresql-8.4.20-4.el6_7.x86_64

This does not appear to be a steady leak like the earlier issue. It seems it jumped up rather sharply yesterday and has since increased in a much more gradual trend with lots of dips and rises. I have yet to zero in on a series of events that causes the rate to increase like it did yesterday. It seems possible that this is UI-related, as it jumped up most sharply yesterday while I was checking the status of several running VMs and was having trouble getting a console to open (ultimately found to be due to the manager being so bogged down with all the extra postgres activity). The numbers remained mostly steady overnight when no one was using the system. Now that I have opened the UI and launched some consoles, the numbers have risen some. FYI: the client is Chrome v48.0.2564.97 m. I tend to keep the UI on the Virtual Machines tab for the whole system (showing all VMs across 3 datacenters in one view). All consoles being opened use SPICE.

Piotr Kliczewski:
I am going to release the change that fixed the fd leak in the engine. Paul, if you want to track the postgresql-related leak, please open a new BZ.

Based on comment #21, moving to verified.