Bug 968192
Summary: | Sometimes after finishing openshift.sh and restarting services, oo-diagnostics reports No request sent, we did not discover any nodes. | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Jan Pazdziora <jpazdziora> |
Component: | Node | Assignee: | Jason DeTiberus <jdetiber> |
Status: | CLOSED NOTABUG | QA Contact: | libra bugs <libra-bugs> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 1.2.0 | CC: | bleanhar, jdetiber, jpazdziora, libra-onpremise-devel, mmasters |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2013-05-30 17:44:42 UTC | Type: | Bug |
Description
Jan Pazdziora
2013-05-29 08:04:50 UTC
Jan, Could you try adding a 30 second sleep between the services restart and the oo-diagnostics run? This should allow sufficient time for mcollective to re-establish a connection with activemq following the service restarts.

I believe the problem you are experiencing is when the activemq service on the broker is being restarted after the mcollective service has been restarted on the node.

Adding synchronization between the broker and node so that the broker services are restarted before the node services should be sufficient, yes.

The service script for ActiveMQ can return before the daemon is ready to accept connections. Initialisation of the daemon can take a couple minutes, so there could still be problems. It would be helpful to have /var/log/activemq/activemq.log to see whether that's involved in the problem reported here.

(In reply to Jason DeTiberus from comment #5)
> Adding synchronization between the broker and node so that the broker
> services are restarted before the node services should be sufficient, yes.

Thanks. I will do that and see how it works. I assume this bugzilla can be closed as NOTABUG?

(In reply to Miciah Dashiel Butler Masters from comment #6)
> The service script for ActiveMQ can return before the daemon is ready to
> accept connections. Initialisation of the daemon can take a couple minutes,
> so there could still be problems. It would be helpful to have
> /var/log/activemq/activemq.log to see whether that's involved in the problem
> reported here.

I think it would be very helpful to have a script supported by OpenShift developers to start and restart the services, and that could add the necessary waits and guarantee that when it finishes, all services are in a ready state.
For example, in Spacewalk, we have

https://git.fedorahosted.org/cgit/spacewalk.git/tree/spacewalk/admin/spacewalk-service

which calls

https://git.fedorahosted.org/cgit/spacewalk.git/tree/spacewalk/admin/spacewalk-startup-helper

to check that, for example, tomcat is up and accepting connections (we test with lsof) before starting httpd.

This ensures that users cannot hit Apache and see 503 from tomcat -- once the spacewalk-service start finishes (unless it failed badly), daemons of all components are ready to accept connections and serve.

I probably can create an initial version of such a script if it would be viewed as useful.

(In reply to Jan Pazdziora from comment #9)
> (In reply to Miciah Dashiel Butler Masters from comment #6)
> > The service script for ActiveMQ can return before the daemon is ready to
> > accept connections. Initialisation of the daemon can take a couple minutes,
> > so there could still be problems. It would be helpful to have
> > /var/log/activemq/activemq.log to see whether that's involved in the problem
> > reported here.
>
> I think it would be very helpful to have a script supported by OpenShift
> developers to start and restart the services, and that could add the
> necessary waits and guarantee that when it finishes, all services are in a
> ready state.
>
> For example, in Spacewalk, we have
>
> https://git.fedorahosted.org/cgit/spacewalk.git/tree/spacewalk/admin/spacewalk-service
>
> which calls
>
> https://git.fedorahosted.org/cgit/spacewalk.git/tree/spacewalk/admin/spacewalk-startup-helper
>
> to check that, for example, tomcat is up and accepting connections (we test
> with lsof) before starting httpd.
>
> This ensures that users cannot hit Apache and see 503 from tomcat -- once
> the spacewalk-service start finishes (unless it failed badly), daemons of
> all components are ready to accept connections and serve.
>
> I probably can create an initial version of such a script if it would be
> viewed as useful.
I think a script like that could be useful in an all-in-one type deployment, but once services are broken out into multiple hosts (as we recommend for production), then I think the script gets complicated. Potentially we could have the broker script verify that the services it depends on are up before starting (activemq, mongo, etc.); however, I don't think we necessarily want those to fail if the services aren't available either (since they will reconnect to those services when they become available).
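
As a rough illustration of the readiness check discussed in this thread, the sketch below polls a TCP port until it accepts connections instead of sleeping for a fixed 30 seconds. It is not part of openshift.sh or of any OpenShift tooling: the function names `wait_for_port` and `check_dependencies` are hypothetical, and the hosts and ports to check are assumptions that depend on the deployment. In line with the comment above, it only reports which dependencies are unreachable and leaves it to the caller whether to warn or to fail.

```python
import socket
import time


def wait_for_port(host, port, timeout=120.0, interval=2.0):
    """Poll host:port until a TCP connection succeeds or `timeout` seconds pass.

    Returns True once the port accepts a connection, False on timeout.
    (Hypothetical helper, not an OpenShift API.)
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means the daemon is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused, unreachable, or timed out: retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


def check_dependencies(deps, timeout=120.0, interval=2.0):
    """Return the names of dependencies that never became reachable.

    `deps` maps a service name to a (host, port) pair. The caller decides
    whether an unreachable dependency is a warning or a hard failure.
    """
    return [
        name
        for name, (host, port) in deps.items()
        if not wait_for_port(host, port, timeout, interval)
    ]
```

A node-side restart script could, for instance, call `check_dependencies({"activemq": ("broker.example.com", 61613)})` before running oo-diagnostics and print a warning for each name returned. The host name here is a placeholder, and 61613 is assumed to be the port mcollective uses to reach ActiveMQ (it is ActiveMQ's default STOMP port). A successful TCP connect only shows that the port is open, so it may still report ready slightly before the daemon has finished initialising.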