Bug 1249862 - [RFE] Provide more information on current running items in katello-service status
Product: Red Hat Satellite 6
Classification: Red Hat
Component: Infrastructure
Severity: medium
Assigned To: satellite6-bugs
Katello QA List
: FutureFeature
Depends On:
Blocks: 260381
Reported: 2015-08-03 21:55 EDT by jnikolak
Modified: 2015-12-03 10:52 EST (History)
4 users

See Also:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2015-09-17 10:50:15 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---

Attachments: None
Description jnikolak 2015-08-03 21:55:54 EDT
1. Proposed title of this feature request  
Provide information on currently running items in katello-service status

2. Who is the customer behind the request?  
Account Name	Cisco Systems	
Contact Name	Landon Noll
Account Number	5554001

TAM customer: no  
SRM customer: no  
Strategic: no
3. What is the nature and description of the request?  

This applies to servers with many CPU cores.

For example, this customer has 1000 cores on his virtual machine.
The server boots so quickly that an administrator can log in and run "katello-service status" before the katello services have finished starting, and as services are still coming up it is not clear from the output whether the services are operational.

Please see customer comment for more information.

The issue is particularly a problem on fast machines and machines with many cores.  One is able to boot, log in quickly, run "katello-service status", and see the Satellite in what seems like a failed state.

Here is just one example:

Tomcat6 (pid 1791) is running...                           [  OK  ]
mongod (pid  2642) is running...
listening on
connection test successful
qpidd (pid  1993) is running...
elasticsearch (pid  1626) is running...
celery init v10.0.
Using config script: /etc/default/pulp_resource_manager
node resource_manager is stopped...
celery init v10.0.
Using config script: /etc/default/pulp_workers
node reserved_resource_worker-0 is stopped...
node reserved_resource_worker-1 is stopped...
node reserved_resource_worker-2 is stopped...
node reserved_resource_worker-3 is stopped...
celery init v10.0.
Using configuration: /etc/default/pulp_workers, /etc/default/pulp_celerybeat
pulp_celerybeat is stopped.
httpd (pid  1942) is running...
dynflow_executor is running.
dynflow_executor_monitor is running.

The "katello-service status" command then exits with a non-zero exit status.

There appears to be no way for the admin to distinguish between the Satellite server being in a failed/bad state and the Satellite server not yet being fully started.  Moreover, this state can last for a while, leading an administrator to think that they need to restart the Satellite.

What if the Satellite service startup script were to first acquire a set of file lock(s)?  As various Satellite components complete their initialization, those components would clear those file lock(s).  Then "katello-service status" would check the status of those file lock(s).  If it found a file was still locked, it would print something like:

Using config script: /etc/default/pulp_workers
node reserved_resource_worker-0 is still in the initialization state
Using configuration: /etc/default/pulp_workers, /etc/default/pulp_celerybeat
pulp_celerybeat is still in the initialization state

Think of these locks as "startup locks".  The locks are set (locked) when the Satellite startup script is launched.
I mention file lock(s), plural, because I suspect each sub-component would want to manage its own lock.  Then when a service was fully started, it would clear its particular "startup lock".

There are multiple strategies with file locks.  Some locking methods are such that if the initializing process dies or is killed, the lock is freed automatically.  Only if the Satellite component successfully initializes does it clear the lock itself.

The "katello-service status" command would, as part of its work, need to query those "startup locks".  A "startup lock" that was still locked would mean that the particular Satellite component was still starting.  The "katello-service status" command would report this fact instead of a simple "it stopped", which is highly misleading.

Only if "katello-service status" found a "startup lock" cleared and no service running would it print a "stopped" or "failed" status.
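A minimal sketch of how such a "startup lock" could work, using flock(1) from util-linux.  The lock directory and function names here are hypothetical, not part of katello-service; this only illustrates the customer's proposal:

```shell
#!/bin/sh
# Sketch of the proposed "startup lock" scheme. LOCKDIR and the function
# names are made up for this example; flock(1) is from util-linux.
LOCKDIR=${LOCKDIR:-/var/lock/katello-startup}

# Init script: take the lock before launching a component. The flock is
# tied to fd 9, so it is released automatically if the process dies.
take_startup_lock() {
    mkdir -p "$LOCKDIR"
    exec 9>"$LOCKDIR/$1.lock"
    flock -n 9
}

# Component: clear the lock only once initialization has fully completed.
clear_startup_lock() {
    exec 9>&-   # closing the fd releases the flock
}

# "katello-service status" side: a lock that is still held means the
# component is still starting, rather than failed or stopped.
startup_in_progress() {
    # flock -n exits non-zero if another process still holds the lock
    ! flock -n "$LOCKDIR/$1.lock" true 2>/dev/null
}
```

Because the lock dies with the holding process, a crashed component leaves the lock free, so a cleared lock plus a missing process can safely be reported as "stopped" or "failed".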

4. Why does the customer need this? (List the business requirements here)  
Please see his comments below.
"Any updates on this case in relation to the points you made on Jun 25 2015  at  10:13 PM -07:00?

For example: "An action item from me is that I could discover if we could have an RFE to display that the server is still starting katello services."  Having katello-service status clearly indicate when the startup of services is underway AND that any "missing services" may be the result of the startup process not yet being complete.  Such an enhancement would be particularly useful on fast multi-core systems where it is easy to log in soon after booting and be misled into thinking that katello has failed to properly start.

chongo (Landon Curt Noll) /\oo/\"

5. How would the customer like to achieve this? (List the functional requirements here)  
6. For each functional requirement listed, specify how Red Hat and the customer can test to confirm the requirement is successfully implemented.  
You would need a multi-core server with at least 500 cores.

7. Is there already an existing RFE upstream or in Red Hat Bugzilla?  

8. Does the customer have any specific timeline dependencies and which release would they like to target (i.e. RHEL5, RHEL6)?  
9. Is the sales team involved in this request and do they have any additional input?  

10. List any affected packages or components.  

11. Would the customer be able to assist in testing this functionality if implemented?  

Comment 1 Sachin Ghai 2015-08-20 03:34:02 EDT
Does this issue appear on RHEL 6.x?

If so, we already have an open issue where, when the user checks `katello-service status`, the status ends with exit code 0 and throws a message at the end: some service failed


mongod (pid  6207) is running...
listening on
connection test successful
qdrouterd (pid 6330) is running...

foreman-proxy (pid  4915) is running...
celery init v10.0.
Using config script: /etc/default/pulp_workers
node reserved_resource_worker-0 (pid 7157) is running...
node reserved_resource_worker-1 (pid 7180) is running...
node reserved_resource_worker-2 (pid 7219) is running...
node reserved_resource_worker-3 (pid 7252) is running...
elasticsearch (pid  6757) is running...
tomcat6 (pid 10011) is running...[  OK  ]
celery init v10.0.
Using config script: /etc/default/pulp_resource_manager
node resource_manager (pid 6935) is running...
celery init v10.0.
Using configuration: /etc/default/pulp_workers, /etc/default/pulp_celerybeat
pulp_celerybeat (pid 7003) is running.
dynflow_executor is running.
dynflow_executor_monitor is running.
httpd (pid  7316) is running...
Some services failed: qpidd

here is bz related to it: https://bugzilla.redhat.com/show_bug.cgi?id=1246152
Comment 2 jnikolak 2015-09-03 22:03:10 EDT
Hello Sachin, I think this is a different issue because,

on the customer's side, the katello services haven't finished starting
(i.e. startup is still running),

whereas in the bug you linked the status will always show failed
(regardless of time).
Comment 3 Mike McCune 2015-09-17 10:50:15 EDT
There is no simple way to determine if services are "starting" or if they failed to start.

Users should rely on the return value of katello-service to determine whether all services are running, but we can't offer a method to determine whether the services are in the process of starting or stopping.

The best way to determine whether the services failed to start is based on a time value of expected startup time.  If after 2-5 minutes the services still have not started, then we know there is most likely an error condition and further investigation should take place.
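The timeout-based approach above can be sketched as a small polling loop.  The wrapper name and default values here are illustrative; the health check is simply whatever command's exit status you trust (by default, "katello-service status"):

```shell
#!/bin/sh
# Illustrative sketch of the timeout-based check described above: poll a
# status command and only call it a failure after a grace period, since a
# non-zero exit during the first few minutes may just mean "still starting".
# The function name and defaults are made up for this example.
wait_for_services() {
    check=${1:-"katello-service status"}  # command whose exit status signals health
    grace=${2:-300}                       # total wait in seconds (per the 2-5 min guidance)
    interval=${3:-15}                     # delay between polls
    elapsed=0
    while :; do
        if $check >/dev/null 2>&1; then
            return 0                      # all services report running
        fi
        [ "$elapsed" -ge "$grace" ] && break
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 1                              # still failing after the grace period
}
```

With this, a non-zero result after the grace period is a much stronger signal of a real failure than a single early status check.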

Going to close this as WONTFIX as this isn't something we can support within the katello-service script.
