Description of problem:
Today we noticed a few issues with oo-accept-node that cause it to fail on many nodes in production. The error messages we are seeing are the following:

FAIL: no manifest in the cart repo matches /usr/libexec/openshift/cartridges//zend/metadata/manifest.yml
FAIL: no manifest in the cart repo matches /usr/libexec/openshift/cartridges//postgresql/metadata/manifest.yml

Upon further investigation, it appears that under /var/lib/openshift/.cartridge_repository/redhat-zend/ there are three directories: 0.0.1, 0.0.2, and 0.0.3. For 0.0.1 and 0.0.2 there is a metadata directory containing the manifest.yml file used in this check. The 0.0.3 directory has no metadata directory; its contents look like this:

# ls -l /var/lib/openshift/.cartridge_repository/redhat-zend/0.0.3/
total 20
drwxr-xr-x. 2 root root 4096 Aug  7 12:57 env
drwxr-xr-x. 2 root root 4096 Aug  7 12:57 hooks
-rw-r--r--. 1 root root  523 Aug  1 15:24 LICENSE
-rw-r--r--. 1 root root  315 Aug  1 15:24 README.md
drwxr-xr-x. 3 root root 4096 Aug  1 15:24 versions

When these values are compared with the ones under /usr/libexec/openshift/cartridges/zend/metadata/manifest.yml, the check fails:

/usr/libexec/openshift/cartridges/zend/metadata/manifest.yml
Name: zend
Cartridge-Version: 0.0.3

Version-Release number of selected component (if applicable):
rhc-node-1.12.7-1.el6oso.x86_64
openshift-origin-node-util-1.12.6-1.el6oso.noarch

How reproducible:
This is currently happening on several of our nodes in production and INT environments.

Steps to Reproduce:
1. Run oo-accept-node on a host with a missing /var/lib/openshift/.cartridge_repository/redhat-zend/0.0.3/metadata directory.
2. Observe the failure.

Actual results:
oo-accept-node fails.

Expected results:
oo-accept-node should pass.

Additional info:
Spoke with pmorie and we believe there is corruption happening to the cartridge repos.
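For illustration, a minimal sketch of the check that is failing (this is not the actual oo-accept-node code; the function name is hypothetical, and only the repository path is taken from this report). It walks every cartridge version directory and flags any that lack metadata/manifest.yml:

```shell
#!/bin/bash
# Sketch of the oo-accept-node manifest check (illustration only):
# every <cartridge>/<version>/ directory under the cartridge repository
# should contain metadata/manifest.yml.
check_cart_repo() {
    local repo_root="${1:-/var/lib/openshift/.cartridge_repository}"
    local rc=0 version_dir
    for version_dir in "$repo_root"/*/*/; do
        [ -d "$version_dir" ] || continue
        if [ ! -f "${version_dir}metadata/manifest.yml" ]; then
            # Mirrors the FAIL output seen above for redhat-zend/0.0.3
            echo "FAIL: missing manifest in ${version_dir}"
            rc=1
        fi
    done
    return $rc
}
```

On the node from this report, running `check_cart_repo` would flag redhat-zend/0.0.3/ while passing 0.0.1 and 0.0.2.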
We are unsure how or why this is occurring, but we need to fix it ASAP.
Could mcollectived have been restarted while it was still starting up?
Not sure how this happened, but a simple stop/start of mcollective fixed the issue.
platform-trace.log shows copy operations failing:

August 09 16:37:10 INFO oo_spawn buffer(10/) /bin/cp: cannot create directory `/var/lib/openshift/.cartridge_repository/redhat-python/0.0.3/versions/2.7/template/libs': No such file or directory
/bin/cp: preserving times for `/var/lib/openshift/.cartridge_repository/redhat-python/0.0.3/versions/2.7/template': No such file or directory
/bin/cp: preserving times for `/var/lib/openshift/.cartridge_repository/redhat-python/0.0.3/versions/2.7': No such file or directory
/bin/cp: preserving times for `/var/lib/openshift/.cartridge_repository/redhat-python/0.0.3/versions': No such file or directory

August 07 14:19:23 INFO oo_spawn buffer(10/) /bin/cp: cannot create directory `/var/lib/openshift/.cartridge_repository/redhat-phpmyadmin/0.0.3/env': File exists
August 07 14:19:23 INFO oo_spawn buffer(10/) /bin/cp: cannot create regular file `/var/lib/openshift/.cartridge_repository/redhat-phpmyadmin/0.0.3/versions/shared/conf.d/php.conf': File exists
August 07 14:19:23 INFO oo_spawn buffer(10/) /bin/cp: cannot create regular file `/var/lib/openshift/.cartridge_repository/redhat-phpmyadmin/0.0.3/versions/shared/conf.d/openshift.conf.erb': File exists
August 07 14:19:23 INFO oo_spawn buffer(10/) /bin/cp: cannot create regular file `/var/lib/openshift/.cartridge_repository/redhat-phpmyadmin/0.0.3/versions/shared/conf.d/phpMyAdmin.conf': File exists
The cartridge repository became corrupted. Added a lock to the openshift mcollective agent to prevent two agents from rebuilding the cartridge repository at the same time.
Commit pushed to master at https://github.com/openshift/origin-server
https://github.com/openshift/origin-server/commit/55c3e065a6a3a8aeea57068f68cc6d579a419a33

Bug 995599 - Add lock when building cartridge repository
* When building cartridge repository protect from an additional openshift mcollective being started
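The actual fix is a lock inside the Ruby mcollective agent; as a rough illustration of the same serialization idea, here is a flock(1)-based sketch. The lock file path, function name, and the CART_REPO_LOCK override are all hypothetical, and the body is a stand-in, not the real rebuild logic:

```shell
#!/bin/bash
# Illustration of serializing cartridge repository rebuilds with an
# exclusive file lock (the committed fix does this in Ruby, not shell).
rebuild_cart_repo() {
    local lock="${CART_REPO_LOCK:-/var/lock/oo-cartridge-repository.lock}"
    (
        # Exclusive lock on fd 9: a second agent that tries to rebuild
        # concurrently blocks here instead of interleaving its cp(1)
        # calls into the same version directories.
        flock -x 9
        echo "rebuilding cartridge repository (pid $$)"
        sleep 1   # stand-in for the actual copy work
    ) 9>"$lock"
}
```

With this in place, two concurrent rebuilds run one after the other rather than racing, which matches the "File exists" / "No such file or directory" interleaving seen in platform-trace.log.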
Not sure if the following verification method is correct or not.

1. Watch processes matching the keyword "mco":
# watch --interval 0.2 "ps -ef |grep mco|grep -v grep |grep -v update_yaml|grep -v ruby193"

2. Do a parallel cartridge install:
# cd /usr/libexec/openshift/cartridges/
# for i in `ls`; do oo-admin-cartridge -a install -s ./$i/ --mco & done

3. Check whether more than one mco process is generated. Many cartridge install processes still exist at the same time:

root 21733 21376 2 05:41 pts/1 00:00:00 ruby /usr/sbin/mco rpc -q openshift cartridge_repository action=install path=/usr/libexec/openshift/cartridges/jbossew
root 21734 21365 2 05:41 pts/1 00:00:00 ruby /usr/sbin/mco rpc -q openshift cartridge_repository action=install path=/usr/libexec/openshift/cartridges/jbossas
root 21737 21374 2 05:41 pts/1 00:00:00 ruby /usr/sbin/mco rpc -q openshift cartridge_repository action=install path=/usr/libexec/openshift/cartridges/10gen-m
root 21738 21373 2 05:41 pts/1 00:00:00 ruby /usr/sbin/mco rpc -q openshift cartridge_repository action=install path=/usr/libexec/openshift/cartridges/diy
root 21746 21386 2 05:41 pts/1 00:00:00 ruby /usr/sbin/mco rpc -q openshift cartridge_repository action=install path=/usr/libexec/openshift/cartridges/mock
This method does not verify the issue. The issue is multiple mcollectived daemons running multiple openshift agents, not multiple mco clients. The only method I've seen to reproduce the problem, and it is not reliable, is to restart mcollectived repeatedly. /jwh
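A hedged sketch of that reproduction method: restart mcollectived many times, then count the surviving daemons. The restart command and process pattern below are overridable assumptions (adjust for your init system), not the exact commands used in verification:

```shell
#!/bin/bash
# Restart the daemon repeatedly, then report how many mcollectived
# processes survive. RESTART_CMD and PROC_PATTERN are assumptions.
stress_restart() {
    local times="${1:-100}" i
    for i in $(seq 1 "$times"); do
        ${RESTART_CMD:-service mcollective restart} >/dev/null 2>&1
    done
    # Exactly one surviving process means no duplicate agent was left behind.
    pgrep -c -f "${PROC_PATTERN:-mcollectived}"
}
```

On a healthy node this should print 1; a count greater than 1 would indicate the duplicate-daemon condition described above.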
Checked again on devenv_3678: after restarting the mcollective service about 100 times, there is still only one mcollectived running. Moving bug to VERIFIED.