Bug 1200096 - ARCHIVE_DESTROYED_GEARS times out if gear too large
Summary: ARCHIVE_DESTROYED_GEARS times out if gear too large
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 2.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: John W. Lamb
QA Contact: libra bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-03-09 17:01 UTC by Marek Schmidt
Modified: 2015-04-06 17:06 UTC
CC List: 9 users

Fixed In Version: rubygem-openshift-origin-node-1.35.4.2-1
Doc Type: Bug Fix
Doc Text:
When gear archiving was enabled on nodes using the ARCHIVE_DESTROYED_GEARS parameter, archives could previously be deleted erroneously if the archiving process encountered a "recoverable" error, for example when a file changed while it was being archived. Additionally, gear archiving could fail if the default compression algorithm, bzip2, caused the operation to exceed the MCollective agent timeout value. This bug fix updates the archiving logic so that broken archives are no longer deleted, and adds the ARCHIVE_DESTROYED_GEARS_COMPRESSION parameter to the /etc/openshift/node.conf file on nodes. This new parameter allows administrators to set their preferred compression algorithm; valid options are "bzip2", "gzip", or "none". As a result, broken archives remain available for debugging purposes, and timeouts while archiving are less likely to occur.
Clone Of:
Environment:
RHEL 6.6, OSE 2.2.4
Last Closed: 2015-04-06 17:06:34 UTC
Target Upstream Version:
Embargoed:


Links
Red Hat Product Errata RHBA-2015:0779 (normal, SHIPPED_LIVE): Red Hat OpenShift Enterprise 2.2.5 bug fix and enhancement update, last updated 2015-04-06 21:05:45 UTC

Description Marek Schmidt 2015-03-09 17:01:29 UTC
Description of problem:

Archiving gears sometimes fails due to a timeout on large gears (such as JBoss Fuse).

Version-Release number of selected component (if applicable):
OpenShift Enterprise 2.2.4

How reproducible:
Sometimes

Steps to Reproduce:
1. install the JBoss Fuse cartridge
2. enable gear archiving in /etc/openshift/node.conf (see ARCHIVE_DESTROYED_GEARS)
3. sabotage the Fuse gear install process by adding e.g. "sleep 720" to the end of /var/lib/openshift/.cartridge_repository/redhat-fuse/0.0.*/bin/install
4. attempt to create the Fuse application
5. attempt to unpack the /cartout/fuse-*.tar.bz2 archive (see the sketch below)
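
For step 5, unpacking the archive for inspection might look like the following (a minimal sketch; the /cartout path comes from this report, the destination directory is illustrative):

    # unpack the bzip2-compressed gear archive into a scratch directory
    mkdir -p /tmp/fuse-archive
    tar -xjf /cartout/fuse-*.tar.bz2 -C /tmp/fuse-archive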

Actual results:
the archive is sometimes incomplete, missing important files such as the Karaf logs.
also note the error messages in the node logs about the gear teardown failing due to a timeout

Expected results:
the archive in /cartout should be complete

Additional info:
the problem seems to be an unwise choice of compression algorithm, as bzip2 appears to be terribly slow.

a workaround is to edit /opt/rh/ruby193/root/usr/share/gems/gems/openshift-origin-node-1.34.1.1/lib/openshift-origin-node/model/v2_cart_model.rb
and remove the "--bzip2" option from the tar command (don't forget to remove the /opt/rh/ruby193/root/usr/share/gems/cache as well)
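
In effect, the workaround changes the tar invocation roughly as follows (a sketch; the variable names are illustrative and the exact command assembled in v2_cart_model.rb may differ):

    # before: bzip2-compressed archive, slow on large gears
    tar --create --bzip2 --file "$GEAR_UUID.tar.bz2" "$GEAR_DIR"
    # after the workaround: plain tar archive, much faster but larger
    tar --create --file "$GEAR_UUID.tar" "$GEAR_DIR"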

Comment 2 John W. Lamb 2015-03-16 21:52:27 UTC
(In reply to Marek Schmidt from comment #0)
> Description of problem:
> 
> Archiving gears sometimes fail due to a timeout on large gears (such as
> JBoss Fuse).
> 
> Version-Release number of selected component (if applicable):
> OpenShift Enterprise 2.2.4
> 
> How reproducible:
> Sometimes
> 
> Steps to Reproduce:
> 1. install the JBoss Fuse cartridge
> 2. enable archiving gears in /etc/openshift/node.conf (see
> ARCHIVE_DESTROYED_GEARS )
> 3. sabotage the fuse gear install process by adding e.g. "sleep 720" to the
> end of /var/lib/openshift/.cartridge_repository/redhat-fuse/0.0.*/bin/install
> 4. attempt to create the fuse application
> 5. attempt to unpack the /cartout/fuse-*.tar.bz2
> 
> Actual results:
> the archive is sometimes not complete, missing important files, such as
> karaf logs.

Is it possible that the gear install step is failing at different points, such as before the log files are generated? It would be helpful if you could attach a platform.log and platform-trace.log from a system where this has happened "naturally" (i.e. without intentionally sabotaging bin/install) - so far I haven't been able to reproduce this without sabotage.

> also note the error messages in the node logs about gear teardown failure
> due to a timeout
> 

I haven't seen gear teardown fail due to timeout yet, only gear deployment. I will continue to try to reproduce a failure of this type, but any additional information would be appreciated.

Comment 3 Marek Schmidt 2015-03-17 08:48:45 UTC
I have hit the issue "naturally" when testing a Fuse patch, which causes the installation to sometimes time out on its own.

https://issues.jboss.org/browse/ENTESB-2755

This bz slightly complicates debugging of such issues.

Comment 4 John W. Lamb 2015-03-18 17:41:19 UTC
While working on this bug, I've discovered some notable details:

* The archive process (tar) continues to run even after the teardown has timed out
 
* The install process continues to run after it has timed out, which means the archive process can begin before cartridge deployment has completed. This causes tar to return an exit code of 1, which then causes the node to delete the generated archive, presuming it's unusable. This is awful and I was a silly person when I wrote that bit of logic.

* The truncated archives can be caused by:
  * trying to read the archive before it is completely written to disk (see 1st point)
  * the archive being deleted while it is being read (see 2nd point)
  * the tar process being killed because it exceeded the MCollective agent timeout

This last condition is what you correctly surmised in the initial bug report. There are two ways to cope with this: the first is the workaround you identified - disabling bzip2 in favor of quicker archive options - and the second is to extend the OpenShift MCollective agent timeout.

Bzip2 is pretty slow, so I'm submitting a pull request that adds a node.conf option - ARCHIVE_DESTROYED_GEARS_COMPRESSION - so users can specify bzip2 (default), gzip, or no compression. This pull request also changes the gear archive logic so that no matter what tar's exit code is, the generated archive is never deleted.
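
On a node, the resulting configuration might look like this (a minimal sketch of /etc/openshift/node.conf; the parameter names come from this fix, but the ARCHIVE_DESTROYED_GEARS value shown is an assumed boolean for illustration):

    # archive gears when they are destroyed (value assumed for illustration)
    ARCHIVE_DESTROYED_GEARS=true
    # compression for destroyed-gear archives: bzip2 (default), gzip, or none
    ARCHIVE_DESTROYED_GEARS_COMPRESSION=gzip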

bzip2 was originally chosen with the assumption that drive space would be a primary constraint. I've opted to leave bzip2 as the default compression option so as to avoid surprising anyone who depends on this feature. In retrospect, however, gzip is probably the correct choice for almost any circumstance.

The second solution is a bit more extreme, but as far as I can tell is the only way to guarantee - given some trial and error - that the archiving process isn't killed prematurely due to agent timeout. You have to modify the openshift.ddl file on your node(s) and broker(s):
    /opt/rh/ruby193/root/usr/libexec/mcollective/mcollective/agent/openshift.ddl

Find the OpenShift Origin Management section, which should look like this:

    metadata    :name        => "OpenShift Origin Management",
                :description => "Agent to manage OpenShift Origin services",
                :author      => "Mike McGrath",
                :license     => "ASL 2.0",
                :version     => "0.1",
                :url         => "http://www.openshift.com",
                :timeout     => 360

Change the "timeout" parameter to something longer - 720 is a good start. The actual timeout for agent actions will be (roughly) this number times 0.65 - so for 360, the timeout will be around 234, and for 720 it will be 468. In practice, this seems to get rounded up to e.g. 240 and 480.

After updating the timeout on all your nodes and brokers, restart ruby193-mcollective on the nodes and openshift-broker on the brokers. If you still see the teardown being killed prematurely, confirm in the logs that it's at least running for longer before timing out, and then adjust the timeout accordingly.
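
On RHEL 6 the restarts amount to something like the following (service names are from this comment; the service invocation is assumed):

    # on each node
    service ruby193-mcollective restart
    # on each broker
    service openshift-broker restart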

Of course, once you're done debugging your cart, you'll want to set the timeout back to 360.

With luck, setting ARCHIVE_DESTROYED_GEARS_COMPRESSION to gzip or none should do the trick, and you won't have to mess with openshift.ddl at all.

Comment 5 openshift-github-bot 2015-03-18 19:05:57 UTC
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/2cef72fff70602110da3bfebeaa85c22580cece9
Don't delete archives if tar returns !0

`tar` will return 1 if a "recoverable" error has happened, e.g. if a
file changed while it was being archived. We were erroneously deleting
the archive if this happened. Also, a "broken" archive may still be
useful, so now we only report if archiving failed, we don't delete the
archive.

This change also adds the `ARCHIVE_DESTROYED_GEARS_COMPRESSION` option,
which can be 'bzip2', 'gzip' or 'none'.

Enterprise bug: 1200096
https://bugzilla.redhat.com/show_bug.cgi?id=1200096
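
Expressed as a shell sketch, the behavior change looks roughly like this (the real change lives in the Ruby node model; all names here are illustrative):

    # create the archive; the compression flag now depends on ARCHIVE_DESTROYED_GEARS_COMPRESSION
    tar --create --gzip --file "$ARCHIVE" -C "$GEAR_HOME" .
    status=$?
    if [ "$status" -ne 0 ]; then
        # old behavior: the archive was deleted on any nonzero exit
        # new behavior: keep the archive and only report the failure
        echo "warning: tar exited with $status; keeping $ARCHIVE for debugging" >&2
    fi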

Comment 8 Anping Li 2015-03-20 05:51:22 UTC
Verified and passed on puddle-2-2-2015-03-18
1) ARCHIVE_DESTROYED_GEARS_COMPRESSION works; bzip2, gzip, and uncompressed backups can be created as below:
fdefault-550baa534add71cc4000001b.tar.bz2
fgzip2-550babbd4add71cc40000043.tar.gz
fnone-550bacf04add71cc40000056.tar
2) The diff results are as below:
diff -ruNa /cartout/fdefault/var/lib/openshift/550baa534add71cc4000001b/  /var/lib/550ba98a4add71cc40000002 >diffbz2
diff: /cartout/fdefault/var/lib/openshift/550baa534add71cc4000001b/fuse/container/dependencies.txt: No such file or directory
diff: /cartout/fdefault/var/lib/openshift/550baa534add71cc4000001b/fuse/container/extras: No such file or directory
diff: /cartout/fdefault/var/lib/openshift/550baa534add71cc4000001b/fuse/container/licenses: No such file or directory
diff: /cartout/fdefault/var/lib/openshift/550baa534add71cc4000001b/fuse/container/license.txt: No such file or directory
diff: /cartout/fdefault/var/lib/openshift/550baa534add71cc4000001b/fuse/container/notices.txt: No such file or directory
diff: /cartout/fdefault/var/lib/openshift/550baa534add71cc4000001b/fuse/container/readme.txt: No such file or directory
diff: /cartout/fdefault/var/lib/openshift/550baa534add71cc4000001b/fuse/container/system: No such file or directory
diff: /cartout/fdefault/var/lib/openshift/550baa534add71cc4000001b/.vimrc: No such file or directory

3) The Karaf logs can be found:
[root@node1 ~]# find /cartout/ -name *.log
/cartout/fnone/var/lib/openshift/550bacf04add71cc40000056/fuse/log/karaf-fnone.log
/cartout/fdefault/var/lib/openshift/550baa534add71cc4000001b/fuse/log/karaf-fdefault.log
/cartout/fgzip2/var/lib/openshift/550babbd4add71cc40000043/fuse/log/karaf-fgzip2.log

Comment 10 errata-xmlrpc 2015-04-06 17:06:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0779.html

