Bug 1118337 - ETL service sampling error after kill -9
Summary: ETL service sampling error after kill -9
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: oVirt
Classification: Retired
Component: ovirt-engine-dwh
Version: 3.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.6.0
Assignee: Shirly Radco
QA Contact: Petr Matyáš
URL:
Whiteboard: integration
Depends On:
Blocks:
 
Reported: 2014-07-10 13:08 UTC by Petr Matyáš
Modified: 2015-11-05 07:45 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-07-05 11:33:10 UTC
oVirt Team: ---


Attachments
Ovirt Engine DWH log (535.19 KB, text/x-log), 2014-07-10 13:08 UTC, Petr Matyáš
Ovirt Engine Dwhd log update (1.28 MB, text/x-log), 2014-07-14 12:38 UTC, Petr Matyáš


Links
System ID: oVirt gerrit 35000 (master)
Status: ABANDONED
Summary: etl: added check to pervent two dwh instances
Last Updated: Never

Description Petr Matyáš 2014-07-10 13:08:19 UTC
Created attachment 917069 [details]
Ovirt Engine DWH log

Description of problem:
ETL service sampling is not working after I used 'kill -9 ovirt-engine-dwhd' and let it start again with cron.

Version-Release number of selected component (if applicable):
3.4

How reproducible:
Always on my setup

Steps to Reproduce:
1. Use 'kill -9 ovirt-engine-dwhd' on engine
2. Wait 1 hour for cron to start the service
3.

Actual results:
Sampling is not working.

Expected results:
Sampling works.

Additional info:

Comment 1 Shirly Radco 2014-07-13 06:25:06 UTC
Petr,

Please provide setup environment details (Engine, DWH and report rpm versions).

Thanks,
Shirly

Comment 2 Yaniv Lavi 2014-07-13 11:18:31 UTC
The oVirt engine seems to be down, causing this warning.
Can you check the engine status?



Yaniv

Comment 3 Petr Matyáš 2014-07-14 08:48:36 UTC
rhevm-3.4.1-0.23.el6ev.noarch
rhevm-dwh-setup-3.4.1-1.el6ev.noarch
rhevm-reports-setup-3.4.1-1.el6ev.noarch

The engine is running; I saw this message in webadmin first.

Comment 4 Yaniv Lavi 2014-07-14 10:55:42 UTC
Any reason that heartbeat will not work in 3.4.1?



Yaniv

Comment 5 Yaniv Lavi 2014-07-14 11:36:42 UTC
The log indicates that two dwh processes are running at the same time.
Please check.



Yaniv

Comment 6 Martin Perina 2014-07-14 11:53:05 UTC
(In reply to Yaniv Dary from comment #4)
> Any reason that heartbeat will not work in 3.4.1?

I looked into the db and the DWH heartbeat was recorded properly in the dwh_history_timekeeping table.

Comment 7 Petr Matyáš 2014-07-14 11:59:43 UTC
There actually were 2 running dwhd processes; after I killed the older one, it started to work. But I don't think it should be possible to start a second process.

Comment 8 Yaniv Lavi 2014-07-14 12:16:29 UTC
It should be. Alon, any idea why kill didn't kill the process?



Yaniv

Comment 9 Sven Kieske 2014-07-14 12:25:09 UTC
I guess the correct handling of this would be the following:
when you start dwh, it checks if there is already a dwh process running
if it is running, then output a message to the user and don't start
a second dwh process.

You could easily use .pid files inside /var/run to achieve this.
There are already many working implementations in the wild which do this.

This should apply to any critical ovirt component which can not be run
concurrently.
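
For illustration, a minimal sketch of such a check in sh (the paths and messages here are illustrative, not the shipped ovirt-engine-dwhd code):

---
#!/bin/sh
# Refuse to start if the pidfile exists and its pid is still a live process.
PIDFILE="/var/run/ovirt-engine-dwhd.pid"

if [ -s "${PIDFILE}" ] && kill -0 "$(cat "${PIDFILE}")" 2>/dev/null; then
        echo "ovirt-engine-dwhd already running (pid $(cat "${PIDFILE}")), not starting a second instance" >&2
        exit 1
fi
echo $$ > "${PIDFILE}"
# ... launch the actual daemon here ...
---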

Comment 10 Yaniv Lavi 2014-07-14 12:30:07 UTC
(In reply to Sven Kieske from comment #9)
> I guess the correct handling of this would be the following:
> when you start dwh, it checks if there is already a dwh process running
> if it is running, then output a message to the user and don't start
> a second dwh process.
> 
> You could easily use .pid files inside /var/run to achieve this.
> There are already many working implementations in the wild which do this.
> 
> This should apply to any critical ovirt component which can not be run
> concurrently.

This is how it should happen. Alon might help with understanding why it didn't happen.

Comment 11 Alon Bar-Lev 2014-07-14 12:35:32 UTC
cron content is quite simple:

--
#!/bin/sh

PIDFILE="/var/run/ovirt-engine-dwhd.pid"

if [ -e "${PIDFILE}" ] && ! service ovirt-engine-dwhd status > /dev/null 2>&1; then
        service ovirt-engine-dwhd start > /dev/null 2>&1
fi
---

please try to run the cron script and see in what situation it causes two processes to run; we already use pid locking, so it should not happen:

---
                daemon --user "${PKG_USER}" --pidfile="${PIDFILE}" \
                        "${PKG_DATA_DIR}/services/${NAME}/${NAME}.py" \
                                --pidfile="${PIDFILE}" \
                                --background \
                                --redirect-output \
                                ${EXTRA_ARGS} \
                                start
---

what can happen is that during stop the service gets stuck and the pid file is removed.

please reproduce and understand what happens.

thanks!
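
For reference, one way to inspect the state after the kill -9 and before the cron job fires (illustrative commands, standard ps/service tooling only):

---
# Does the pidfile still exist, and does its pid still belong to a live process?
cat /var/run/ovirt-engine-dwhd.pid
ps -p "$(cat /var/run/ovirt-engine-dwhd.pid)" -o pid,ppid,cmd 2>/dev/null
# List every dwhd-related process that is actually running (python wrapper and/or java).
ps -ef | grep [d]whd
# What the cron script will decide: a failing 'status' while the pidfile exists triggers 'start'.
service ovirt-engine-dwhd status; echo "status exit code: $?"
---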

Comment 12 Petr Matyáš 2014-07-14 12:38:50 UTC
Created attachment 917783 [details]
Ovirt Engine Dwhd log update

After something like 38 minutes it started to fail again. In webadmin the error is the same, but the log is different.

Comment 13 Petr Matyáš 2014-07-14 13:42:00 UTC
Forget the last comment; I didn't have enough free space on the engine. It's being filled by something; I have to find out by what.
Sorry about that. Could there be different warnings? This is confusing.

Comment 14 Yaniv Lavi 2014-07-14 13:46:00 UTC
(In reply to Petr Matyáš from comment #13)
> Forget the last comment; I didn't have enough free space on the engine. It's
> being filled by something; I have to find out by what.
> Sorry about that. Could there be different warnings? This is confusing.

This situation should never happen, so we need to find the cause, not handle the errors.

Comment 15 Shirly Radco 2014-07-15 08:45:20 UTC
If you are unable to recreate, please close the bug.

Comment 16 Petr Matyáš 2014-07-31 12:39:42 UTC
I just recreated it on my setup (clean RHEL 6.5 with RHEV-M 3.4). I believe I don't have to post the log; it's the same as before. There are 2 processes for the dwhd service. If I kill the older one, it starts to work.

Comment 17 Shirly Radco 2014-07-31 13:39:12 UTC
Alon, please see if you can figure out why this is happening.
With ps aux | grep dwh we see that the two processes are running at the same time.

Comment 18 Alon Bar-Lev 2014-07-31 14:02:24 UTC
(In reply to Shirly Radco from comment #17)
> Alon, please see if you can figure out why this is happening.
> With ps aux | grep dwh we see that the two processes are running at the same
> time.

please assign to someone from integration... this should not happen.

Comment 19 Yedidyah Bar David 2014-08-03 13:05:58 UTC
The same bug probably affects all our python-wraps-java daemons, including engine, notifier, reportsd. The only difference is that for dwhd we supply a (downstream-only, afaik) cron job to start it if it's down.

Comment 20 Yaniv Lavi 2014-08-03 13:21:27 UTC
(In reply to Yedidyah Bar David from comment #19)
> The same bug probably affects all our python-wraps-java daemons, including
> engine, notifier, reportsd. The only difference is that for dwhd we supply a
> (downstream-only, afaik) cron job to start it if it's down.

it's not downstream-only, since we use the same from centos.

Comment 21 Yedidyah Bar David 2014-08-04 11:22:50 UTC
(In reply to Yaniv Dary from comment #20)
> (In reply to Yedidyah Bar David from comment #19)
> > The same bug probably affects all our python-wraps-java daemons, including
> > engine, notifier, reportsd. The only difference is that for dwhd we supply a
> > (downstream-only, afaik) cron job to start it if it's down.
> 
> it's not downstream-only, since we use the same from centos.

Actually also on fedora - we set in its systemd conf:
Restart=on-failure
RestartSec=3600

All of the above daemons do basically this:
- A python wrapper runs and writes its pid to the pidfile
- This wrapper runs the real daemon (written in Java)
- When the real daemon stops, the wrapper also exits
- 'service $SERVICE stop' just sends a SIGTERM to the pid in the file (the python wrapper)
- When the wrapper receives a SIGTERM it kills the real daemon and exits
- When the wrapper receives a SIGKILL (9) it can't do anything and just dies
The child is not notified in any way and remains running.
'service $SERVICE start' then sees that the "daemon" (python wrapper) is not running and starts another copy of the real daemon, which happily runs.

How to solve it (prevent two copies of the real daemon from running)?
I can think of a few possible solutions:
1. Make the python wrapper not wait for its child - instead of its own pid, write the child's pid to the file and exit. This will require making the real daemon do the processing currently done by the wrapper (remove the pid file etc).
2. Write another pidfile of the real daemon, and check also it when relevant. Specifically, check it at 'start' to make sure no duplicates - if duplicates are found, we should probably alert somewhere, kill the existing real daemon, wait a bit, and start afresh.
3. Do nothing and close wontfix. Not sure this is such a bad choice - the python wrapper uses very few resources, I do not see a reason for it to die unexpectedly (e.g. by the oom killer), and if a user kills it with -9 he should definitely know what he is doing.

Sandro - what do you think?

Setting lower severity for now.
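
A rough shell-level sketch of option 2 above, purely for illustration (CHILD_PIDFILE and the process-name check are hypothetical; the real change would live in the python wrapper):

---
CHILD_PIDFILE="/var/run/ovirt-engine-dwhd-java.pid"    # hypothetical second pidfile

# At 'start': refuse to run if the previous real daemon (the java child) is still alive.
if [ -s "${CHILD_PIDFILE}" ] && kill -0 "$(cat "${CHILD_PIDFILE}")" 2>/dev/null; then
        if ps -p "$(cat "${CHILD_PIDFILE}")" -o cmd= | grep -q dwh; then
                echo "a dwhd java process is already running; refusing to start another" >&2
                exit 1
        fi
fi
# On startup the wrapper would write the java child's pid here, and remove it on cleanup.
---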

Comment 22 Sandro Bonazzola 2014-08-05 07:25:56 UTC
I'm tempted to go for option 3; services shouldn't be stopped by randomly killing them with kill -9.
On the other hand, if the parent process receives signal 9, it should not remove the pid file, and when you try to start the service it should fail the start on "subsystem locked". The user should then cleanly stop the service by running service $service stop, and that action should take care of the pid file removal and somehow kill the child process.

I'm ok with the lowered severity, I don't think there's any urgency to handle this.

Comment 23 Yedidyah Bar David 2014-08-05 08:39:27 UTC
(In reply to Sandro Bonazzola from comment #22)
> I'm tempted to go for option 3; services shouldn't be stopped by randomly
> killing them with kill -9.
> On the other hand, if the parent process receives signal 9, it should not
> remove the pid file, and when you try to start the service it should fail the
> start on "subsystem locked". The user should then cleanly stop the service by
> running service $service stop, and that action should take care of the pid
> file removal and somehow kill the child process.

On systemd we do not pass --pidfile and it already behaves as expected (?) - it also kills the child (which is the real daemon).

On sysv we ship a sysv init script that does not check the pid itself - it calls the daemon function, which ignores the pid file if there is no process with that pid. Our cron script to restart it calls 'status' which behaves similarly.

We do not use the "lock-file" option to 'status' and do not touch anything under /var/lock (which seems to be done manually by other scripts, as opposed to using an option to daemon/start/etc).
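
For context, the conventional subsys-lock pattern being referred to looks roughly like this (an illustrative sketch against RHEL 6's /etc/init.d/functions, not the shipped init script):

---
# in start():
        daemon --user "${PKG_USER}" --pidfile="${PIDFILE}" "${PKG_DATA_DIR}/services/${NAME}/${NAME}.py" start
        [ $? -eq 0 ] && touch "/var/lock/subsys/${NAME}"
# in stop():
        killproc -p "${PIDFILE}" "${NAME}"
        rm -f "/var/lock/subsys/${NAME}"
# in status():
        status -p "${PIDFILE}" -l "${NAME}" "${NAME}"
---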

> 
> I'm ok with the lowered severity, I don't think there's any urgency to
> handle this.

Comment 24 Yaniv Lavi 2014-08-07 11:06:46 UTC
We need to match the systemd behaviour; we can't have a difference.
I would go with Sandro's suggestion. Please move forward on fixing this.

Comment 25 Yedidyah Bar David 2014-08-07 11:58:56 UTC
(In reply to Yaniv Dary from comment #24)
> We need to match the systemd behaviour; we can't have a difference.
> I would go with Sandro's suggestion. Please move forward on fixing this.

Not sure what exactly was "Sandro's suggestion". I already replied to his comment explaining why it's not applicable directly. We want to add a lock file? Go with my option 2? something else?

Comment 26 Arthur Berezin 2014-08-18 17:44:51 UTC
(In reply to Yedidyah Bar David from comment #25)
> (In reply to Yaniv Dary from comment #24)
> > We need to match the systemd behaviour; we can't have a difference.
> > I would go with Sandro's suggestion. Please move forward on fixing this.
> 
> Not sure what exactly was "Sandro's suggestion". I already replied to his
> comment explaining why it's not applicable directly. We want to add a lock
> file? Go with my option 2? something else?

I think option #2 here would be the best; the service should not allow the daemon to run twice, and should handle orphan daemons if the service is SIGKILLed for any reason.

Comment 27 Shirly Radco 2014-10-02 09:28:29 UTC
Didi says that Yaniv said that following a discussion with Alon, the python process should not be left running, but only the java one.

Comment 28 Alon Bar-Lev 2014-10-02 09:32:25 UTC
kill -9 does not give the process a chance to clean up; it should not be used unless the process is not responding. If it is used, cleanup is the sysadmin's responsibility.

kill -TERM should be used to terminate processes.

this bug should be closed as notabug.
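
The difference in sh terms, as a toy illustration (not the dwhd wrapper itself; /tmp/wrapper-demo.pid is a throwaway name):

---
# A TERM handler can stop the child and remove the pidfile; no handler ever runs on KILL.
sh -c 'trap "kill \$child; rm -f /tmp/wrapper-demo.pid; exit 0" TERM
       sleep 1000 & child=$!
       echo $$ > /tmp/wrapper-demo.pid
       wait "$child"' &
sleep 1
kill -TERM "$(cat /tmp/wrapper-demo.pid)"   # cleanup runs: the sleep "child" dies, the pidfile is removed
# With kill -KILL instead, the trap never fires: the sleep child and the pidfile are left behind,
# which is the same situation comment 21 describes for the python wrapper and its java child.
---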

Comment 29 Barak 2014-10-23 09:22:33 UTC
Shirly, please check whether the second process can actually run after the fix for Bug 1118350 was committed.

I seriously doubt it is possible; even if it is, a small fix for that logic can handle this situation.

The only reason I would like to handle it is the remote possibility of history data correctness issues (having 2 active DWH processes together might lead to a race and data correctness issues).

Comment 30 Shirly Radco 2014-10-30 11:36:39 UTC
Didi, please check why the python process keeps running in the background instead of using exec.

Comment 31 Alon Bar-Lev 2014-10-30 11:40:34 UTC
(In reply to Shirly Radco from comment #30)
> Didi, please check why the python process keeps running in the background
> instead of using exec.

to be able to monitor the java process and handle signals and cleanup correctly.

PLEASE close this bug; kill -9 behaves as it should as far as services are concerned.

Comment 32 Yedidyah Bar David 2014-11-02 12:22:36 UTC
On Thursday we (Shirly, Yaniv and I) discussed this bug. We agreed to continue and apply something like option 2 of comment 21.

For the record - the reason is that current code can cause corruption of the database if two dwh processes run concurrently.

Comment 33 Alon Bar-Lev 2014-11-02 12:32:22 UTC
(In reply to Yedidyah Bar David from comment #32)
> On Thursday we (Shirly, Yaniv and I) discussed this bug. We agreed to
> continue and apply something like option 2 of comment 21.
> 
> For the record - the reason is that current code can cause corruption of the
> database if two dwh processes run concurrently.

this violates the way we execute daemons. the wrapper must be kept alive.

any other solution should be based on a resource owned by java, for example, binding to port 12312 and failing in the 2nd process.

Comment 34 Yedidyah Bar David 2014-11-02 14:13:44 UTC
(In reply to Alon Bar-Lev from comment #33)
> (In reply to Yedidyah Bar David from comment #32)
> > On Thursday we (Shirly, Yaniv and I) discussed this bug. We agreed to
> > continue and apply something like option 2 of comment 21.
> > 
> > For the record - the reason is that current code can cause corruption of the
> > database if two dwh processes run concurrently.
> 
> this violates the way we execute daemons. the wrapper must be kept alive.

"this" - killing with '-9'? I obviously agree, but we decided we still would like to prevent corruption. We might give up if it turns out to be too costly (unlikely).

> 
> any other solution should be based on a resource owned by java, for example,
> binding to port 12312 and failing in the 2nd process.

Just bind as a lock? Why not directly lock a file in /var/lock (as Sandro mentioned)? Or use a pidfile? Why is it important that it's done by the java code and not python?

Comment 35 Alon Bar-Lev 2014-11-02 16:00:39 UTC
(In reply to Yedidyah Bar David from comment #34)
> Just bind as a lock? Why not directly lock a file in /var/lock (as Sandro
> mentioned)? Or use a pidfile? Why is it important that it's done by the java
> code and not python?

because java does not know how to manage signals or cleanup correctly.

just to be clear: the wrapper is not to be touched or eliminated; it is working perfectly and as designed in all cases that require integration with the system service manager.

using /var/lock does not help with kill -9 either.

blocking a concurrent process should be done within the java domain. if java cannot figure out whether another instance is running by other means, it can, for example, examine the last-update column of the database at startup (since corruption is also possible when running on a different machine), or lock something on the local machine by using files, a socket or any other semi-singleton.

Comment 36 Yedidyah Bar David 2014-11-03 07:03:19 UTC
(In reply to Alon Bar-Lev from comment #35)
> (In reply to Yedidyah Bar David from comment #34)
> > Just bind as a lock? Why not directly lock a file in /var/lock (as Sandro
> > mentioned)? Or use a pidfile? Why is it important that it's done by the java
> > code and not python?
> 
> because java does not know how to manage signals or cleanup correctly.

I didn't mean to change the Java code so not sure if relevant.

I see that when the wrapper receives a signal (TERM or INT) it calls p.terminate on the child (java), which sends it TERM too. So does it work or not? Or do you mean it can't handle other signals? Not that important, though.

> 
> just to be clear: the wrapper is not to be touched or eliminated; it is
> working perfectly and as designed in all cases that require integration with
> the system service manager.

I did not intend to change the way it manages itself.

The suggestion in option 2 was to change the wrapper to use another pidfile for the pid of the child:
1. Check if it exists - if yes, and the pid seems to belong to dwhd, alert and exit. Otherwise ignore (perhaps after alerting).
2. Write it there on startup
3. Remove on cleanup

> 
> using /var/lock does not help with kill -9 either.
> 
> blocking a concurrent process should be done within the java domain. if java
> cannot figure out whether another instance is running by other means, it can,
> for example, examine the last-update column of the database at startup (since
> corruption is also possible when running on a different machine), or lock
> something on the local machine by using files, a socket or any other
> semi-singleton.

We actually discussed this too, and I can't remember why it was rejected. There is already a row in engine db's dwh_history_timekeeping table called 'DwhCurrentlyRunning'. I asked Shirly about that yesterday in private and she said "We decided it is not enough because we want to use the data from the dwh to see if it is running and not count on the engine db for it.". Shirly - why?

IIRC the only reason was that a user doing 'kill -9' on python will then need to connect to engine db in order to clean up after himself, which is a bit more complex than just removing a file. I do not think this is a good enough reason. If they are smart enough to kill -9, they should be able to connect to the engine db (with some kb/wiki helping them).
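
For reference, checking (or resetting) that row from the shell would look roughly like this (the db/user name 'engine' and the var_name/var_value columns are assumptions based on the table mentioned above):

---
# Does the engine db think a dwh instance is currently running?
psql -U engine -d engine -t -A -c \
        "SELECT var_value FROM dwh_history_timekeeping WHERE var_name = 'DwhCurrentlyRunning';"
# The manual cleanup step after a kill -9 (roughly what a kb/wiki article would boil down to):
psql -U engine -d engine -c \
        "UPDATE dwh_history_timekeeping SET var_value = '0' WHERE var_name = 'DwhCurrentlyRunning';"
---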

Comment 37 Shirly Radco 2014-11-03 07:15:06 UTC
As discussed at length, we don't want the user to have to change this value in the engine database.
kill -9 can theoretically be triggered by the system itself and not only by the user.
We want this managed on the dwh server side.

Comment 38 Alon Bar-Lev 2014-11-03 07:47:46 UTC
(In reply to Shirly Radco from comment #37)
> As discussed at length, we don't want the user to have to change this value
> in the engine database.
> kill -9 can theoretically be triggered by the system itself and not only by
> the user.
> We want this managed on the dwh server side.

the user should not change anything.
if it is important enough and can corrupt data, this should be embedded within the application and not assume local or remote execution.
when the new dwh process comes up it should have a means to determine whether an existing dwh process is up.
there should be a mechanism to do so, for example having a field that contains the last update timestamp, with the new process waiting for a period and checking whether it changed, or using a database exclusive lock on some object during runtime.

using another pid file will only solve the problem partially, as these pid files belong to the old sysvinit and should be avoided for newer setups; and once you allow remote dwh installation, the corruption is possible remotely, so any local solution is not correct. "guessing" that a pid belongs to a specific application is always a heuristic.

once again, any mechanism that is to be implemented should be part of the application, not the wrapper, as the application is responsible for corruption protection, even if executed manually.
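
Sketched in shell purely for illustration (the heartbeat row name, the var_datetime column and the wait period are assumptions; as argued above, a real implementation would live inside the application):

---
# Read the last-update timestamp twice, a polling interval apart; if it advanced,
# another dwh instance is alive and this one should refuse to start.
before=$(psql -U engine -d engine -t -A -c \
        "SELECT var_datetime FROM dwh_history_timekeeping WHERE var_name = 'heartBeat';")
sleep 90
after=$(psql -U engine -d engine -t -A -c \
        "SELECT var_datetime FROM dwh_history_timekeeping WHERE var_name = 'heartBeat';")
if [ -n "${after}" ] && [ "${after}" != "${before}" ]; then
        echo "another dwh instance appears to be updating the heartbeat; aborting startup" >&2
        exit 1
fi
---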

Comment 39 Yaniv Lavi 2014-11-03 09:18:36 UTC
(In reply to Alon Bar-Lev from comment #38)
> (In reply to Shirly Radco from comment #37)
> > As discussed at length, we don't want the user to have to change this value
> > in the engine database.
> > kill -9 can theoretically be triggered by the system itself and not only by
> > the user.
> > We want this managed on the dwh server side.
> 
> the user should not change anything.
> if it is important enough and can corrupt data, this should be embedded
> within the application and not assume local or remote execution.
> when the new dwh process comes up it should have a means to determine whether
> an existing dwh process is up.
> there should be a mechanism to do so, for example having a field that contains
> the last update timestamp, with the new process waiting for a period and
> checking whether it changed, or using a database exclusive lock on some object
> during runtime.

We already have this mechanism for separate hosts to make sure dwh is not installed in two locations. Using database values and doing a test as suggested will make the service start very slow, or will cause situations where service start reports OK and java fails a few moments later.


> 
> using another pid file will only solve the problem partially, as these pid
> files belong to the old sysvinit and should be avoided for newer setups; and
> once you allow remote dwh installation, the corruption is possible remotely,
> so any local solution is not correct. "guessing" that a pid belongs to a
> specific application is always a heuristic.

systemd doesn't have this issue; it's only a sysv problem.
pid checking will resolve the local issue, using a check that almost eliminates the possibility.

> 
> once again, any mechanism that is to be implemented should be part of the
> application, not the wrapper, as the application is responsible for corruption
> protection, even if executed manually.

I don't consider the wrapper external to the application. The purpose of this code is to monitor and execute this java application, and it should not do so if there is an issue. If a user runs java twice without the wrapper, it's really out of scope in my opinion, since that is why the wrapper exists.

Comment 40 Alon Bar-Lev 2014-11-03 09:26:57 UTC
do whatever you wish, it looks like you are the expert.
I am done.

Comment 41 Shirly Radco 2014-11-13 12:26:54 UTC
Pushing this to 3.6 since this request was not made by a user and we want to try and test db locking as a solution.

Comment 42 Sandro Bonazzola 2015-06-04 06:38:41 UTC
How did the db locking solution tests go?

Comment 43 Yaniv Lavi 2015-06-21 15:26:49 UTC
(In reply to Sandro Bonazzola from comment #42)
> How did the db locking solution tests go?

No change on this one, since there is no agreement on a resolution yet.

Comment 44 Yaniv Lavi 2015-07-05 11:33:10 UTC
We will not fix this unless a customer complains.

Comment 45 Sven Kieske 2015-07-07 08:30:55 UTC
Well, I don't know why you don't use systemd features to fix this; maybe I'm missing some internal knowledge of the involved components here, but a systemctl restart $service should clean up the whole environment, including all old possibly running/killed processes in the cgroup which is assigned to this service.

you can also replace cron with systemd.

I really don't know why a customer needs to complain to achieve such basic features (sounds like: "we will not include brakes in our cars until a customer complains").

But again, maybe I'm missing some details here why my proposal won't work.
if this is the case, feel free to ignore this message.

regards

Sven

Comment 46 Yaniv Lavi 2015-07-07 08:42:41 UTC
(In reply to Sven Kieske from comment #45)
> Well, I don't know why you don't use systemd features to fix this; maybe I'm
> missing some internal knowledge of the involved components here, but a
> systemctl restart $service should clean up the whole environment, including
> all old possibly running/killed processes in the cgroup which is assigned to
> this service.
> 
> you can also replace cron with systemd.

We did on el7/fc and the bug doesn't recreate there.
The issue is only on sysv and should be resolved once we stop supporting el6.
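
For the record, verifying that on an el7/fc host is straightforward (illustrative commands; only standard systemctl/ps options are used):

---
systemctl status ovirt-engine-dwhd          # the unit's cgroup shows the python wrapper plus the java child
kill -9 "$(systemctl show -p MainPID ovirt-engine-dwhd | cut -d= -f2)"    # simulate the scenario from this bug
systemctl restart ovirt-engine-dwhd         # systemd stops whatever is left in the unit's cgroup before starting again
ps -eo pid,ppid,cmd | grep [d]whd           # exactly one wrapper and one java child remain
---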

> 
> I really don't know why a customer needs to complain to achieve such basic
> features (sounds like: "we will not include brakes in our cars until a
> customer complains").
> 
> But again, maybe I'm missing some details here why my proposal won't work.
> if this is the case, feel free to ignore this message.
> 
> regards
> 
> Sven

