Bug 838527

Summary: [rhevm] unable to start ovirt-engine if service crash and pid is left
Product: Red Hat Enterprise Virtualization Manager Reporter: Haim <hateya>
Component: ovirt-engine-setup Assignee: Alon Bar-Lev <alonbl>
Status: CLOSED ERRATA QA Contact: Pavel Stehlik <pstehlik>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 3.1.0 CC: acathrow, alonbl, bazulay, iheim, jkt, juan.hernandez, knesenko, mgoldboi, oramraz, Rhev-m-bugs, yeylon
Target Milestone: --- Keywords: Reopened
Target Release: 3.3.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: integration
Fixed In Version: is1 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-01-21 17:28:09 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 952297    
Bug Blocks:    

Description Haim 2012-07-09 11:34:44 UTC
Description of problem:

problem:

[root@hateya-rhevm ~]# kill -9 `pgrep java`
[root@hateya-rhevm ~]# /etc/init.d/ovirt-engine start 
The engine PID file "/var/run/ovirt-engine.pid" already exists.

mitigation:


[root@hateya-rhevm ~]# rm -rf /var/run/ovirt-engine.pid
[root@hateya-rhevm ~]# /etc/init.d/ovirt-engine start 
Started engine process 11798.

expected results: behave like any other app and allow user to start the service.
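
The mitigation above removes the PID file unconditionally. A more cautious manual check, sketched below, reports the file as stale only when the recorded process is really gone. This is an illustration, not part of the ovirt-engine scripts: the function name pid_file_is_stale is hypothetical, and it assumes the check is run as root, as in the transcript above (for a non-root user, kill -0 can also fail on a live process owned by someone else).

```shell
#!/bin/sh
# Hypothetical helper, not part of ovirt-engine.
# Succeeds (exit status 0) when the PID file exists but names no live process.
pid_file_is_stale() {
    pid_file="$1"
    # No PID file at all: there is nothing to be stale.
    [ -f "$pid_file" ] || return 1
    pid="$(cat "$pid_file")"
    # An empty or non-numeric PID file cannot name a live process.
    case "$pid" in
        ''|*[!0-9]*) return 0 ;;
    esac
    # kill -0 delivers no signal; it only probes whether the process exists.
    if kill -0 "$pid" 2>/dev/null; then
        return 1
    fi
    return 0
}
```

With this in place, the removal could be guarded as: pid_file_is_stale /var/run/ovirt-engine.pid && rm -f /var/run/ovirt-engine.pid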

Comment 1 Yaniv Kaul 2012-07-09 11:37:04 UTC
Actually, this is a sign of going down uncleanly ('dirty bit'). We may need to run consistency check on the DB or whatever before we delete the PID file and run the service.

Comment 9 Juan Hernández 2012-08-14 13:56:45 UTC
The change suggested for alternative 1 is available here:

http://gerrit.ovirt.org/7175

It changes the service script so that it will send the following message to syslog (/var/log/messages):

Aug 14 15:49:46 f17vm engine-service[18877]: The engine PID file "/var/run/ovirt-engine.pid" contains 18713 but that process doesn't exist. This means that the engine crashed or was killed. You will need to stop and start it again.
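
For illustration only, the reporting side of this behaviour could be sketched as below. This is not the code from the Gerrit change: the function names stale_pid_message and report_stale_pid are hypothetical, and the sketch assumes the standard logger utility is used to reach syslog with the engine-service tag seen in the log line above.

```shell
#!/bin/sh
# Hypothetical sketch, not the actual change from Gerrit.
# Builds the warning text quoted above for a given PID file and PID.
stale_pid_message() {
    printf "The engine PID file \"%s\" contains %s but that process doesn't exist. This means that the engine crashed or was killed. You will need to stop and start it again.\n" "$1" "$2"
}

# Sends the warning to syslog (assumption: via logger with the
# engine-service tag) and returns 1 so that the caller can fail the start.
report_stale_pid() {
    logger -t engine-service "$(stale_pid_message "$1" "$2")"
    return 1
}
```

A start routine would call report_stale_pid with the PID file path and the PID read from it once it has determined that the process no longer exists, and then exit with a non-zero status.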

Comment 10 Simon Grinberg 2012-08-14 14:09:12 UTC
If you are absolutely sure that Comment #1 is a non-issue, then 1 may be an option. However:

1. Is it also presented to the command line when running restart?
2. What happens if the server has crashed? This means that power cycle fencing will never be able to recover the RHEV Manager, right? This may be unacceptable for some customers (unless /var/run/*.pid is cleaned on boot).

Comment 11 Juan Hernández 2012-08-14 15:30:30 UTC
I am not absolutely sure; there may be other issues that I am not aware of. That is why I prefer not to start the service automatically but to warn the user instead.

The message goes to syslog, not to the terminal. In the terminal the user will see only this:

# service ovirt-engine start
Starting engine-service:                                    [FAILED]
# echo $?
1

The /var/run directory is cleaned during boot, so a power cycle will most probably recover the service.

I don't think this is very problematic, as the typical routine of any system administrator will be something like this:

# service ovirt-engine start
Starting engine-service:                                    [FAILED]

# service ovirt-engine status
The engine process 1080 is not running.

# tail /var/log/messages
Aug 14 15:49:46 f17vm engine-service[18877]: The engine PID file "/var/run/ovirt-engine.pid" contains 1080 but that process doesn't exist. This means that the engine crashed or was killed. You will need to stop and start it again.

# service ovirt-engine stop
Stopping engine-service:                                    [  OK  ]

# service ovirt-engine start
Starting engine-service:                                    [  OK  ]

# service ovirt-engine status
The engine process 1082 is running.

Comment 16 Juan Hernández 2012-08-17 12:43:19 UTC
The proposed change has been merged upstream.

Comment 18 Oded Ramraz 2012-08-29 08:35:22 UTC
[root@aqua-rhel ovirt-engine]# kill -9 `pgrep java`
[root@aqua-rhel ovirt-engine]# service ovirt-engine start
Starting engine-service:                                    [FAILED]

## /var/log/messages 

Aug 29 11:33:34 aqua-rhel engine-service[23375]: The engine PID file "/var/run/ovirt-engine.pid" contains 23196 but that process doesn't exist. This means that the engine crashed or was killed. You need to explicitly run 'service ovirt-engine stop' and then 'service ovirt-engine start' to enable it again.

[root@aqua-rhel ovirt-engine]# service ovirt-engine restart
Stopping engine-service:                                    [  OK  ]
Starting engine-service:                                    [  OK  ]

Verified si15.1

Comment 19 Alon Bar-Lev 2013-04-01 09:51:08 UTC
Just a follow up from the future...

There is no reason to prevent the user from starting a daemon because an old PID file is left, as the process surely is not running.

Telling the user to perform a stop and then a start is a void math statement, just like:
  (-1 + 1 = 0)

I suggest removing this non-standard behavior of our daemon, per [1]

[1] http://gerrit.ovirt.org/#/c/13415/
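
The behaviour argued for here can be sketched as follows: a PID file that names a dead process is treated as a leftover, removed, and the start proceeds normally. This is only an illustration and not the patch at [1]. The function name start_engine is hypothetical, the command to launch is supplied by the caller, and the liveness probe assumes the script runs as root.

```shell
#!/bin/sh
# Hypothetical sketch, not the patch referenced in [1].
# Usage: start_engine PID_FILE COMMAND [ARGS...]
start_engine() {
    pid_file="$1"
    shift
    if [ -f "$pid_file" ]; then
        old_pid="$(cat "$pid_file")"
        # kill -0 only probes for the process; no signal is delivered.
        if [ -n "$old_pid" ] && kill -0 "$old_pid" 2>/dev/null; then
            echo "The engine process $old_pid is already running." >&2
            return 1
        fi
        # The recorded process is gone, so the file is a leftover: drop it.
        rm -f "$pid_file"
    fi
    # Launch the given command in the background and record its PID.
    "$@" >/dev/null 2>&1 &
    new_pid=$!
    echo "$new_pid" > "$pid_file"
    echo "Started engine process $new_pid."
}
```

The start is refused only when the recorded process is actually alive, which is the one case where a second instance would be harmful.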

Comment 20 Alon Bar-Lev 2013-04-01 10:11:23 UTC
Per Juan's suggestion, I am reopening this bug to allow further discussion.

As I wrote in comment#19, the decision to force the user to stop an inactive service is not something that is expected per the right comment#0, which was the reason for opening this bug.

Comment 21 Juan Hernández 2013-04-01 10:12:59 UTC
Alon, as you wrote the patch, please assign the bug to yourself.

Comment 22 Alon Bar-Lev 2013-04-15 15:53:29 UTC
Modified per future rebase.

Comment 23 David Botzer 2013-07-04 04:57:17 UTC
Fixed, 3.3/is4
1. kill -9 `pgrep java`
2. service ovirt-engine start
       Starting oVirt Engine:     [  OK  ]

Comment 24 Charlie 2013-11-28 00:24:33 UTC
This bug is currently attached to errata RHEA-2013:15231. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.

Comment 28 errata-xmlrpc 2014-01-21 17:28:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2014-0038.html