Description of problem:
Archiving gears sometimes fails due to a timeout on large gears (such as JBoss Fuse).

Version-Release number of selected component (if applicable):
OpenShift Enterprise 2.2.4

How reproducible:
Sometimes

Steps to Reproduce:
1. Install the JBoss Fuse cartridge
2. Enable archiving gears in /etc/openshift/node.conf (see ARCHIVE_DESTROYED_GEARS)
3. Sabotage the Fuse gear install process by adding e.g. "sleep 720" to the end of /var/lib/openshift/.cartridge_repository/redhat-fuse/0.0.*/bin/install
4. Attempt to create the Fuse application
5. Attempt to unpack the /cartout/fuse-*.tar.bz2

Actual results:
The archive is sometimes incomplete and is missing important files, such as the karaf logs. Also note the error messages in the node logs about gear teardown failure due to a timeout.

Expected results:
The archive in /cartout should be complete.

Additional info:
The problem seems to be an unwise choice of compression algorithm, as bz2 seems to be terribly slow.

A workaround is to edit /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.34.1.1/lib/openshift-origin-node/model/v2_cart_model.rb and remove the "--bzip2" option from the tar command (don't forget to remove /opt/rh/ruby193/root/usr/share/gems/cache as well).
(In reply to Marek Schmidt from comment #0)
> Description of problem:
>
> Archiving gears sometimes fail due to a timeout on large gears (such as
> JBoss Fuse).
>
> Version-Release number of selected component (if applicable):
> OpenShift Enterprise 2.2.4
>
> How reproducible:
> Sometimes
>
> Steps to Reproduce:
> 1. install the JBoss Fuse cartridge
> 2. enable archiving gears in /etc/openshift/node.conf (see
> ARCHIVE_DESTROYED_GEARS )
> 3. sabotage the fuse gear install process by adding e.g. "sleep 720" to the
> end of /var/lib/openshift/.cartridge_repository/redhat-fuse/0.0.*/bin/install
> 4. attempt to create the fuse application
> 5. attempt to unpack the /cartout/fuse-*.tar.bz2
>
> Actual results:
> the archive is sometimes not complete, missing important files, such as
> karaf logs.

Is it possible that the gear install step is failing at different points, such as before the log files are generated?

It would be helpful if you could attach a platform.log and platform-trace.log from a system where this has happened "naturally" (e.g. without intentionally sabotaging bin/install) - so far I haven't been able to reproduce this without sabotage.

> also note the error messages in the node logs about gear teardown failure
> due to a timeout
>

I haven't seen gear teardown fail due to timeout yet, only gear deployment. I will continue to try to reproduce a failure of this type, but any additional information would be appreciated.
I have hit the issue "naturally" when testing a Fuse patch, which causes the installation to sometimes time out on its own.

https://issues.jboss.org/browse/ENTESB-2755

This bz slightly complicates debugging of such issues.
While working on this bug, I've discovered some notable details:

* The archive process (tar) continues to run even after the teardown has timed out
* The install process continues to run after it has timed out, which means the archive process can begin before cartridge deployment has completed. This causes tar to return an exit code of 1, which then causes the node to delete the generated archive, presuming it's unusable. This is awful and I was a silly person when I wrote that bit of logic.
* The truncated archives can be caused by:
  * trying to read the archive before it is completely written to disk (see 1st point)
  * the archive being deleted while it is being read (see 2nd point)
  * the tar process being killed because it exceeded the mcollective agent timeout

This last condition is what you correctly surmised in the initial bug report. There are two ways to cope with this: the first is the workaround you identified - disabling bzip2 in favor of quicker archive options - and the second is to extend the OpenShift MCollective agent timeout.

Bzip2 is pretty slow, so I'm submitting a pull request that adds a node.conf option - ARCHIVE_DESTROYED_GEARS_COMPRESSION - so users can specify bzip2 (default), gzip, or no compression. This pull request also changes the gear archive logic so that no matter what tar's exit code is, the generated archive is never deleted.

bzip2 was originally chosen with the assumption that drive space would be a primary constraint. I've opted to leave bzip2 as the default compression option so as to avoid surprising anyone who depends on this feature. In retrospect, however, gzip is probably the correct choice for almost any circumstance.

The second solution is a bit more extreme, but as far as I can tell is the only way to guarantee - given some trial and error - that the archiving process isn't killed prematurely due to agent timeout.
You have to modify the openshift.ddl file on your node(s) and broker(s):

    /opt/rh/ruby193/root/usr/libexec/mcollective/mcollective/agent/openshift.ddl

Find the OpenShift Origin Management section, which should look like this:

    metadata :name => "OpenShift Origin Management",
             :description => "Agent to manage OpenShift Origin services",
             :author => "Mike McGrath",
             :license => "ASL 2.0",
             :version => "0.1",
             :url => "http://www.openshift.com",
             :timeout => 360

Change the "timeout" parameter to something longer - 720 is a good start. The actual timeout for agent actions will be (roughly) this number times 0.65 - so for 360, the timeout will be around 234, and for 720 it will be 468. In practice, this seems to get rounded up to e.g. 240 and 480.

After updating the timeout on all your nodes and brokers, restart ruby193-mcollective on the nodes and openshift-broker on the brokers. If you still see the teardown being killed prematurely, confirm in the logs that it's at least running for longer before timing out, and then adjust the timeout accordingly.

Of course, once you're done debugging your cart, you'll want to set the timeout back to 360. With luck, setting ARCHIVE_DESTROYED_GEARS_COMPRESSION to gzip or none should do the trick, and you won't have to mess with openshift.ddl at all.
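To make the timeout arithmetic above concrete, here is a minimal Ruby sketch. The 0.65 factor is only the rough figure observed in this comment, not a documented MCollective constant, and `estimated_action_timeout` is a hypothetical helper name. The rounding to 240 and 480 seen in practice is not modeled.

```ruby
# Rough estimate of the effective agent action timeout, using the ~0.65
# factor observed in this comment (an empirical figure, not a documented
# MCollective constant).
OBSERVED_FACTOR_PERCENT = 65

# Hypothetical helper: ddl_timeout is the :timeout value from openshift.ddl.
# Integer arithmetic keeps the result free of floating-point noise.
def estimated_action_timeout(ddl_timeout)
  ddl_timeout * OBSERVED_FACTOR_PERCENT / 100
end

[360, 720].each do |ddl_timeout|
  estimate = estimated_action_timeout(ddl_timeout)
  puts "ddl timeout #{ddl_timeout}s -> roughly #{estimate}s per agent action"
end
```

If a teardown is known to need some number of seconds, dividing that number by 0.65 gives a first guess for the ddl value, which then still has to be confirmed against the logs as described above.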
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/2cef72fff70602110da3bfebeaa85c22580cece9

Don't delete archives if tar returns !0

`tar` will return 1 if a "recoverable" error has happened, e.g. if a file changed while it was being archived. We were erroneously deleting the archive if this happened. Also, a "broken" archive may still be useful, so now we only report if archiving failed, we don't delete the archive.

This change also adds the `ARCHIVE_DESTROYED_GEARS_COMPRESSION` option, which can be 'bzip2', 'gzip' or 'none'.

Enterprise bug: 1200096
https://bugzilla.redhat.com/show_bug.cgi?id=1200096
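For readers who don't want to dig through the commit, the behavior described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the commit: the `COMPRESSION` table and the `tar_command` and `archive_gear` helpers are hypothetical names, and the fallback to bzip2 for unknown values is an assumption based on bzip2 being the default.

```ruby
require 'open3'

# Hypothetical mapping from the ARCHIVE_DESTROYED_GEARS_COMPRESSION value
# to a GNU tar flag and file extension (the real node code may differ).
COMPRESSION = {
  'bzip2' => { flag: '--bzip2', ext: '.tar.bz2' },
  'gzip'  => { flag: '--gzip',  ext: '.tar.gz'  },
  'none'  => { flag: nil,       ext: '.tar'     }
}.freeze

# Build the tar argv and the archive path. Unknown or missing values fall
# back to bzip2 (assumed here because bzip2 is the default).
def tar_command(source_dir, archive_base, compression)
  opts = COMPRESSION.fetch(compression.to_s.downcase, COMPRESSION['bzip2'])
  archive = "#{archive_base}#{opts[:ext]}"
  cmd = ['tar', '--create', opts[:flag], "--file=#{archive}", source_dir].compact
  [cmd, archive]
end

# Run tar and report the outcome. The archive is never deleted here,
# whatever the exit status; a partial archive may still be useful.
def archive_gear(source_dir, archive_base, compression)
  cmd, archive = tar_command(source_dir, archive_base, compression)
  _out, err, status = Open3.capture3(*cmd)
  case status.exitstatus
  when 0 then puts "archived #{source_dir} to #{archive}"
  when 1 then warn "tar reported a recoverable problem; keeping #{archive}: #{err}"
  else        warn "tar failed (status #{status.exitstatus.inspect}); keeping #{archive}: #{err}"
  end
  archive
end
```

A call such as `archive_gear('/var/lib/openshift/<uuid>', '/cartout/<app>-<uuid>', 'gzip')` would then write a `.tar.gz` file; the placeholder paths are illustrative only.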
Verified and passed on puddle-2-2-2015-03-18.

1) ARCHIVE_DESTROYED_GEARS_COMPRESSION works; bzip2, gzip and none backups can be created, as below:

    fdefault-550baa534add71cc4000001b.tar.bz2
    fgzip2-550babbd4add71cc40000043.tar.gz
    fnone-550bacf04add71cc40000056.tar

2) The diff result is as below:

    diff -ruNa /cartout/fdefault/var/lib/openshift/550baa534add71cc4000001b/ /var/lib/550ba98a4add71cc40000002 >diffbz2
    diff: /cartout/fdefault/var/lib/openshift/550baa534add71cc4000001b/fuse/container/dependencies.txt: No such file or directory
    diff: /cartout/fdefault/var/lib/openshift/550baa534add71cc4000001b/fuse/container/extras: No such file or directory
    diff: /cartout/fdefault/var/lib/openshift/550baa534add71cc4000001b/fuse/container/licenses: No such file or directory
    diff: /cartout/fdefault/var/lib/openshift/550baa534add71cc4000001b/fuse/container/license.txt: No such file or directory
    diff: /cartout/fdefault/var/lib/openshift/550baa534add71cc4000001b/fuse/container/notices.txt: No such file or directory
    diff: /cartout/fdefault/var/lib/openshift/550baa534add71cc4000001b/fuse/container/readme.txt: No such file or directory
    diff: /cartout/fdefault/var/lib/openshift/550baa534add71cc4000001b/fuse/container/system: No such file or directory
    diff: /cartout/fdefault/var/lib/openshift/550baa534add71cc4000001b/.vimrc: No such file or directory

3) The karaf logs can be found:

    [root@node1 ~]# find /cartout/ -name *.log
    /cartout/fnone/var/lib/openshift/550bacf04add71cc40000056/fuse/log/karaf-fnone.log
    /cartout/fdefault/var/lib/openshift/550baa534add71cc4000001b/fuse/log/karaf-fdefault.log
    /cartout/fgzip2/var/lib/openshift/550babbd4add71cc40000043/fuse/log/karaf-fgzip2.log
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0779.html