Bug 1118337
| Summary: | ETL service sampling error after kill -9 | | |
|---|---|---|---|
| Product: | [Retired] oVirt | Reporter: | Petr Matyáš <pmatyas> |
| Component: | ovirt-engine-dwh | Assignee: | Shirly Radco <sradco> |
| Status: | CLOSED WONTFIX | QA Contact: | Petr Matyáš <pmatyas> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.4 | CC: | bazulay, bugs, didi, gklein, nbarcet, pmatyas, rbalakri, sbonazzo, s.kieske, sradco, yeylon, ylavi |
| Target Milestone: | --- | | |
| Target Release: | 3.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | integration | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-07-05 11:33:10 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | Ovirt Engine DWH log; Ovirt Engine Dwhd log update | | |
Petr, please provide setup environment details (engine, DWH and reports rpm versions). Thanks, Shirly

The oVirt engine seems to be down, causing this warning. Can you check the engine status? Yaniv

rhevm-3.4.1-0.23.el6ev.noarch
rhevm-dwh-setup-3.4.1-1.el6ev.noarch
rhevm-reports-setup-3.4.1-1.el6ev.noarch

The engine is running; I saw this message in webadmin first.

Any reason that heartbeat will not work in 3.4.1? Yaniv

The log indicates that two dwh processes are running at the same time. Please check. Yaniv

(In reply to Yaniv Dary from comment #4)
> Any reason that heartbeat will not work in 3.4.1?

I looked into the db and the DWH heartbeat was recorded properly in the dwh_history_timekeeping table. There actually were 2 running dwhd processes; after I killed the older one it started to work. But I don't think it should be possible to start a second process.

It should be. Alon, any idea why kill didn't kill the process? Yaniv

I guess the correct handling of this would be the following: when you start dwh, it checks if there is already a dwh process running. If it is running, then output a message to the user and don't start a second dwh process. You could easily track .pid files inside /var/run to achieve this. There are already many working implementations in the wild which do this. This should apply to any critical oVirt component which cannot be run concurrently.

(In reply to Sven Kieske from comment #9)
> I guess the correct handling of this would be the following:
> when you start dwh, it checks if there is already a dwh process running
> if it is running, then output a message to the user and don't start
> a second dwh process.
>
> You could easily track .pid files inside /var/run to achieve this.
> There are already many working implementations in the wild which do this.
>
> This should apply to any critical ovirt component which can not be run
> concurrently.

This is how it should happen. Alon might help with understanding why it didn't happen. The cron content is quite simple:
---
#!/bin/sh
PIDFILE="/var/run/ovirt-engine-dwhd.pid"
if [ -e "${PIDFILE}" ] && ! service ovirt-engine-dwhd status > /dev/null 2>&1; then
    service ovirt-engine-dwhd start > /dev/null 2>&1
fi
---
Please try to run the cron script and see in what situation it causes two processes to run; we already use pid locking, so it should not happen:
---
daemon --user "${PKG_USER}" --pidfile="${PIDFILE}" \
"${PKG_DATA_DIR}/services/${NAME}/${NAME}.py" \
--pidfile="${PIDFILE}" \
--background \
--redirect-output \
${EXTRA_ARGS} \
start
---
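The "pid locking" the `daemon` helper relies on boils down to checking whether the pid recorded in the file still belongs to a live process. A minimal sketch of that check (illustrative names, not the actual initscript code):

```python
import os

def pid_alive(pidfile):
    """Return True if the pidfile names a process that is still running."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (IOError, ValueError):
        return False  # missing or malformed pidfile -> treat as not running
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check, nothing sent
        return True
    except OSError:
        return False  # no such process -> the pidfile is stale
```

This is exactly why the daemon function "ignores the pid file if there is no process with that pid": a stale file alone does not block a new start.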
What can happen is that the service gets stuck during stop while the pid file is removed.
Please reproduce and understand what happens.
Thanks!
Created attachment 917783 [details]
Ovirt Engine Dwhd log update
After something like 38 minutes it started to fail again. Webadmin shows the same error, but the log is different.
Forget the last comment, I didn't have enough free space on the engine; it's being filled by something, I have to find out by what. Sorry about that. Could there be different warnings? This is confusing.

(In reply to Petr Matyáš from comment #13)
> Forget the last comment, I didn't have enough free space on engine, it's
> being filled by something, have to find out with what it is.
> Sorry about that. Could there be different warnings? This is confusing.

This situation should never happen, so we need to find the cause, not handle the errors. If you are unable to recreate, please close the bug.

I just recreated it on my setup (clean RHEL 6.5 with RHEV-M 3.4). I believe I don't have to post the log, it's the same as before. There are 2 processes for the dwhd service. If I kill the older one, it starts to work.

Alon, please see if you can figure out why this is happening. We see with `ps aux | grep dwh` that the two processes are running at the same time.

(In reply to Shirly Radco from comment #17)
> Alon, please see if you can figure out why this is happening.
> We see that with ps aux | grep dwh we get that the two processes are running
> at the same time.

Please assign to someone from integration... this should not happen.

The same bug probably affects all our python-wraps-java daemons, including engine, notifier, reportsd. The only difference is that for dwhd we supply a (downstream-only, afaik) cron job to start it if it's down.

(In reply to Yedidyah Bar David from comment #19)
> The same bug probably affects all our python-wraps-java daemons, including
> engine, notifier, reportsd. The only difference is that for dwhd we supply a
> (downstream-only, afaik) cron job to start it if it's down.

It's not downstream-only, since we use the same from CentOS.

(In reply to Yaniv Dary from comment #20)
> (In reply to Yedidyah Bar David from comment #19)
> > The same bug probably affects all our python-wraps-java daemons, including
> > engine, notifier, reportsd. The only difference is that for dwhd we supply a
> > (downstream-only, afaik) cron job to start it if it's down.
>
> it's not downstream-only, since we use the same from centos.

Actually also on Fedora - we set in its systemd conf:

Restart=on-failure
RestartSec=3600

All of the above daemons do basically this:
- A python wrapper runs and writes its pid to the pidfile
- This wrapper runs the real daemon (written in Java)
- When the real daemon stops, the wrapper also exits
- 'service $SERVICE stop' just sends a SIGTERM to the pid in the file (the python wrapper)
- When the wrapper receives a SIGTERM it kills the real daemon and exits
- When the wrapper receives a SIGKILL (9) it can't do anything and just dies. The child is not notified in any way and remains running. 'service $SERVICE start' then sees that the "daemon" (python wrapper) is not running and starts another copy of the real daemon, which happily runs.

How to solve it (prevent two copies of the real daemon from running)? I can think of a few possible solutions:
1. Make the python wrapper not wait for its child - instead of its own pid, write the child's pid to the file and exit. This will require making the real daemon do the processing currently done by the wrapper (remove the pid file etc).
2. Write another pidfile for the real daemon, and check it too when relevant. Specifically, check it at 'start' to make sure there are no duplicates - if duplicates are found, we should probably alert somewhere, kill the existing real daemon, wait a bit, and start afresh.
3. Do nothing and close wontfix. Not sure this is such a bad choice - the python wrapper uses very few resources, I do not see a reason for it to die unexpectedly (e.g. by the oom killer), and if a user kills it with -9 he should definitely know what he is doing.

Sandro - what do you think? Setting lower severity for now.

I'm tempted to go for option 3, services shouldn't be stopped by randomly killing them with kill -9.
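For illustration, the wrapper lifecycle described in comment 21 can be sketched roughly like this (a simplified stand-in for the real ovirt-engine-dwhd wrapper; names and structure are illustrative, not the shipped code):

```python
import os
import signal
import subprocess

def run_wrapper(pidfile, child_cmd):
    """Write the *wrapper's* pid, spawn the real daemon, forward SIGTERM.

    This mirrors the failure mode under discussion: the pidfile holds
    the wrapper's pid, so SIGKILL on that pid orphans the child and the
    cleanup at the end never runs.
    """
    with open(pidfile, "w") as f:
        f.write(str(os.getpid()))
    child = subprocess.Popen(child_cmd)  # the real (java) daemon

    def on_term(signum, frame):
        child.terminate()  # SIGTERM is forwarded; SIGKILL cannot be caught

    signal.signal(signal.SIGTERM, on_term)
    rc = child.wait()      # wrapper lives as long as the real daemon
    os.unlink(pidfile)     # cleanup -- skipped entirely after kill -9
    return rc
```

After a `kill -9` on the wrapper, the child keeps running while the pidfile points at a dead pid, so the next 'start' happily launches a second real daemon.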
On the other hand, if the parent process receives signal 9, it should not remove the pid file, and when you try to start the service it should fail the start on "subsystem locked". The user should then cleanly stop the service by running service $service stop, and that action should take care of the pid file removal and somehow kill the child process. I'm ok with the lowered severity, I don't think there's any urgency to handle this.

(In reply to Sandro Bonazzola from comment #22)
> I'm tempted to go for option 3, services shouldn't be stopped by randomly
> killing them with kill -9.
> On the other hand, if the parent process receive signal 9, it should not
> remove the pid file and when you try to start the service it should fail the
> start on "subsystem locked". User should then cleanly stop the service by
> running service $service stop and that action should take care of the pid
> file removal and kill somehow the child process.

On systemd we do not pass --pidfile and it already behaves as expected (?) - it also kills the child (which is the real daemon). On sysv we ship an init script that does not check the pid itself - it calls the daemon function, which ignores the pid file if there is no process with that pid. Our cron script to restart it calls 'status', which behaves similarly. We do not use the "lock-file" option of 'status' and do not touch anything under /var/lock (which seems to be done manually by other scripts, as opposed to using an option to daemon/start/etc).

> I'm ok with the lowered severity, I don't think there's any urgency to
> handle this.

We need to match the systemd behaviour, we can't have a diff. I would go with Sandro's suggestion. Please move forward on fixing this.

(In reply to Yaniv Dary from comment #24)
> We need to match the systemd behaviour we can't have diff.
> I would go with sandro's suggestion. Please move forward on fixing this.

Not sure what exactly was "Sandro's suggestion".
I already replied to his comment explaining why it's not applicable directly. Do we want to add a lock file? Go with my option 2? Something else?

(In reply to Yedidyah Bar David from comment #25)
> Not sure what exactly was "Sandro's suggestion". I already replied to his
> comment explaining why it's not applicable directly. We want to add a lock
> file? Go with my option 2? something else?

I think option #2 here would be the best: the service should not allow running the daemon twice, and should handle orphan daemons if the service is SIGKILLed for any reason.

Didi says that Yaniv said that following a discussion with Alon, the python process should not be left running, but only the java one.

kill -9 does not give the process a chance for cleanup; it should not be used unless the process is not responding. Using it makes cleanup the sysadmin's responsibility. kill -TERM should be used to terminate processes. This bug should be closed as notabug.

Shirly, please check whether the second process can actually run after the fix for Bug 1118350 was committed. I seriously doubt it is possible; even if it is, a small fix to that logic can handle this situation. The only reason I would like to handle it is the remote risk to history data correctness (having 2 active DWH processes together might lead to a race and data correctness issues).

Didi, please check why the python process is running in the background and does not use exec.

(In reply to Shirly Radco from comment #30)
> Didi, Please check why the python process is running in the background and
> not use exec.

To be able to monitor the java process and handle signals and cleanups correctly.

PLEASE close this bug. kill -9 behaves as it should as far as services are involved.

On Thursday we (Shirly, Yaniv and I) discussed this bug.
We agreed to continue and apply something like option 2 of comment 21.

For the record - the reason is that the current code can cause corruption of the database if two dwh processes run concurrently.

(In reply to Yedidyah Bar David from comment #32)
> On Thursday we (Shirly, Yaniv and I) discussed this bug. We agreed to
> continue and apply something like option 2 of comment 21.
>
> For the record - the reason is that current code can cause corruption of the
> database if two dwh processes run concurrently.

This violates the way we execute daemons; the wrapper must be kept living. Any other solution should be based on a resource of the java process - for example, bind to port 12312 and fail in the 2nd process.

(In reply to Alon Bar-Lev from comment #33)
> this violates the way we execute daemons. the wrapper must kept living.

"This" - killing with '-9'? I obviously agree, but we decided we still would like to prevent corruption. We might give up if it turns out to be too costly (unlikely).

> any other solution should be based on resource of java, for example, bind to
> port 12312 and fail in 2nd process.

Just bind as a lock? Why not directly lock a file in /var/lock (as Sandro mentioned)? Or use a pidfile? Why is it important that it's done by the java code and not python?

(In reply to Yedidyah Bar David from comment #34)
> Just bind as a lock? Why not directly lock a file in /var/lock (as Sandro
> mentioned)? Or use a pidfile? Why is it important that it's done by the java
> code and not python?

Because java does not know how to manage signals correctly, nor cleanup.
Just to be clear: the wrapper is not to be touched nor eliminated, before or after; it is working perfectly and as designed in all cases that require integration with the system service manager.

Using /var/lock is unusable against kill -9 as well.

Blocking a concurrent process should be done within the java domain, if java cannot figure out whether another instance is running - for example, by examining the last-update column of the database at startup (as it can also corrupt data when running on a different machine), or by locking on the machine by updating files, a socket or any other semi-singleton.

(In reply to Alon Bar-Lev from comment #35)
> because java do not know how to manage signals correctly nor cleanup.

I didn't mean to change the Java code, so not sure if relevant. I see that when the wrapper receives a signal (TERM or INT) it calls p.terminate on the child (java), which sends it TERM too. So does it work or not? Or do you mean it can't handle other signals? Not that important, though.

> just to be clear: the wrapper is not to be touched, nor eliminated before
> and after, it is working perfectly and as designed in all cases that
> required to integrate with system service manager.

I did not intend to change the way it manages itself. The suggestion in option 2 was to change the wrapper to use another pidfile for the pid of the child:
1. Check if it exists - if yes, and the pid seems to belong to dwhd, alert and exit. Otherwise ignore (perhaps after alerting).
2. Write it there on startup.
3. Remove it on cleanup.

> using /var/lock is unusable in kill -9 as well.
> blocking concurrent process should be done within the java domain, if java
> cannot figure out if another instance is running, for example: examining
> last update column of database in startup, as it can corrupt also if running
> on different machine, or at lock machine by updating files, socket or any
> other semi singleton.

We actually discussed this too, and I can't remember why it was rejected. There is already a row in the engine db's dwh_history_timekeeping table called 'DwhCurrentlyRunning'. I asked Shirly about that yesterday in private and she said "We decided it is not enough because we want to use the data from the dwh to see if it is running and not count on the engine db for it.". Shirly - why? IIRC the only reason was that a user doing 'kill -9' on python will then need to connect to the engine db in order to clean up after himself, which is a bit more complex than just removing a file. I do not think this is a good enough reason. If they are smart enough to kill -9, they should be able to connect to the engine db (with some kb/wiki helping them).

As discussed at length, we don't want the user to have to change this value in the engine database. kill -9 can theoretically be caused by the system itself and not only by the user. We want this managed on the dwh server side.

(In reply to Shirly Radco from comment #37)
> As discussed at length, we don't want the user to have to change this value
> in the engine database.
> kill -9 can be caused theoretically by the system itself and not only by the
> user.
> We want this managed on dwh server side.

The user should not change anything. If it is important enough and can corrupt data, this should be embedded within the application, and not assume local or remote execution. When the new process of dwh comes up, it should have a means to determine whether an existing process of dwh is up.
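The "another pidfile for the child" variant (option 2 of comment 21, restated in comment 36) could look something like this sketch. The "pid seems to belong to dwhd" test is the heuristic part, done here via /proc (Linux only); names are illustrative:

```python
import os

def child_seems_alive(child_pidfile, needle="dwhd"):
    """Decide at 'start' whether a previous real daemon is still running,
    using a second pidfile that holds the *child's* pid.

    Returns True only when the recorded pid is alive AND its command line
    contains `needle` -- a heuristic, as noted in the discussion.
    """
    try:
        with open(child_pidfile) as f:
            pid = int(f.read().strip())
    except (IOError, ValueError):
        return False  # no usable pidfile -> nothing to worry about
    try:
        # /proc/<pid>/cmdline is NUL-separated; a plain substring test
        # is enough for this heuristic check.
        with open("/proc/%d/cmdline" % pid, "rb") as f:
            return needle.encode() in f.read()
    except (IOError, OSError):
        return False  # pid is gone -> stale entry, ignore (perhaps alert)
```

At 'start' the wrapper would alert and exit when this returns True, write the new child's pid after spawning it, and remove the file on clean shutdown.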
There should be a mechanism to do so, for example having a field that contains the last update timestamp, while the new process waits for a period and sees whether it changed; or using a database exclusive lock on some object during runtime.

Using another pid file will solve the problem only partially, as these pid files belong to the old sysvinit and should be avoided on newer systems, and once you allow remote dwh installation, the corruption is possible remotely, so any local solution is not correct; "guessing" that a pid belongs to a specific application is always heuristics.

Once again, any mechanism that is to be implemented should be part of the application, not the wrapper, as the application is responsible for corruption protection, even if executed manually.

(In reply to Alon Bar-Lev from comment #38)
> the user should not change anything.
> if it is important enough and can corrupt data this should be embedded
> within application and not assume local or remote execution.
> when the new process of dwh is up it should have a mean to determine if
> existing process of dwh is up.
> there should be mechanism to do so, for example having a field that contains
> last timestamp of update, while the new process waits for a period and see
> if it changed, or use database exclusive lock of some object during runtime.

We already have this mechanism for separate hosts, to make sure dwh is not installed in two locations. Using database values and doing a test as suggested will make the service start very slow, or will cause situations where service start reports OK and java fails a few moments later.
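Alon's last-update-timestamp idea could be sketched as below; `read_heartbeat` is a hypothetical callable standing in for a SELECT of the last-update field from dwh_history_timekeeping, and the mandatory `window` delay is exactly the slow-start cost objected to above:

```python
import time

def another_instance_running(read_heartbeat, window=2.0, interval=0.5):
    """Watch a heartbeat value for `window` seconds; if it advances,
    some other live process is updating it and we must not start."""
    first = read_heartbeat()
    deadline = time.time() + window
    while time.time() < deadline:
        time.sleep(interval)
        if read_heartbeat() != first:
            return True  # heartbeat moved -> another instance is alive
    return False         # unchanged for the whole window -> safe to start
```

This also works across machines (the check is in the database, not in a local pidfile), which is why it was raised for the remote-dwh case.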
> using another pid file will solve the problem partially, as these pid files
> belongs to the old sysvinit and should be avoided for newer, and once you
> allows remote dwh installation, the corruption is possible remotely, so
> using any local solution is not correct, "guessing" that pid belongs to a
> specific application is always heuristics.

systemd doesn't have this issue, it's only a sysv problem. Pid checking will resolve the local issue with a check that will almost eliminate the possibility.

> once again, any mechanism that is to be implemented should be part of the
> application not the wrapper as the application is responsible of corruption
> protection, even if executed manually.

I don't consider the wrapper external to the application. The purpose of this code is to monitor and execute this java application, and it should not do so if there is an issue. If a user runs java twice and without the wrapper, that is really out of scope in my opinion, since this is why the wrapper exists.

Do whatever you wish, it looks like you are the expert. I am done.

Pushing this to 3.6 since this request was not made by a user and we want to try and test db locking as a solution.

How did the db locking solution tests go?

(In reply to Sandro Bonazzola from comment #42)
> How did the db locking solution tests go?

No change on this one, since there is no agreement on a resolution yet. We will not fix this unless a customer complains.

Well, I don't know why you don't use systemd features to fix this. Maybe I'm missing some internal knowledge of the involved components here, but a systemctl restart $service should clean up the whole environment, including all old possibly running/killed processes in the cgroup which is assigned to this service.

You can also replace cron with systemd.

I really don't know why a customer needs to complain to achieve such basic features (sounds like: "we will not include brakes in our cars until a customer complains").
But again, maybe I'm missing some details here on why my proposal won't work. If this is the case, feel free to ignore this message.

regards
Sven

(In reply to Sven Kieske from comment #45)
> well I don't know why you don't use systemd features to fix this, maybe I'm
> missing some internal knowledge over the involved components here, but a
> systemctl restart $service should clean up the whole environment, including
> all old possibly running/killed processes in the cgroup which is assigned to
> this service.
>
> you can also replace cron with systemd.

We did on el7/fc and the bug doesn't recreate with that. The issue is only on sysv, and this should be resolved once we stop supporting el6.

> I really don't know why a customer needs to complain to achieve such basic
> features (sounds like: "we will not include brakes in our cars until a
> customer complains").
>
> But again, maybe I'm missing some details here why my proposal won't work.
> if this is the case, feel free to ignore this message.
>
> regards
>
> Sven
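For reference, the el7/fc behaviour described here comes from systemd unit directives along these lines (a sketch only: the unit layout and ExecStart path are illustrative, while the Restart values are the ones quoted from the Fedora conf earlier in this bug):

```ini
[Unit]
Description=oVirt Engine Data Warehouse daemon (illustrative sketch)

[Service]
Type=simple
# Path is illustrative -- stands in for ${PKG_DATA_DIR}/services/${NAME}/${NAME}.py
ExecStart=/usr/share/ovirt-engine-dwh/services/ovirt-engine-dwhd/ovirt-engine-dwhd.py start
# From the Fedora conf quoted above: auto-restart replaces the cron job
Restart=on-failure
RestartSec=3600
# The default KillMode=control-group means stop/restart signals every
# process in the unit's cgroup, so a kill -9 on the wrapper cannot leave
# the java child behind across a restart.
```

This is why the bug does not reproduce under systemd: process tracking is per cgroup, not per pidfile.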
Created attachment 917069 [details]
Ovirt Engine DWH log

Description of problem:
ETL service sampling is not working after I used 'kill -9 ovirt-engine-dwhd' and let it start again with cron.

Version-Release number of selected component (if applicable):
3.4

How reproducible:
Always on my setup

Steps to Reproduce:
1. Use 'kill -9 ovirt-engine-dwhd' on the engine
2. Wait 1 hour for cron to start the service
3.

Actual results:
Sampling not working

Expected results:
Sampling working

Additional info: