Description of problem:
Today we noticed a few issues with oo-accept-node that cause it to fail on many nodes in production. The error messages we are seeing are the following:

FAIL: no manifest in the cart repo matches /usr/libexec/openshift/cartridges//zend/metadata/manifest.yml
FAIL: no manifest in the cart repo matches /usr/libexec/openshift/cartridges//postgresql/metadata/manifest.yml

Upon further investigation, it appears that under /var/lib/openshift/.cartridge_repository/redhat-zend/ there are three directories: 0.0.1, 0.0.2, and 0.0.3. For 0.0.1 and 0.0.2 there is a metadata directory containing the manifest.yml file used in this check. The 0.0.3 directory has no metadata directory; its contents look like this:

# ls -l /var/lib/openshift/.cartridge_repository/redhat-zend/0.0.3/
total 20
drwxr-xr-x. 2 root root 4096 Aug  7 12:57 env
drwxr-xr-x. 2 root root 4096 Aug  7 12:57 hooks
-rw-r--r--. 1 root root  523 Aug  1 15:24 LICENSE
-rw-r--r--. 1 root root  315 Aug  1 15:24 README.md
drwxr-xr-x. 3 root root 4096 Aug  1 15:24 versions

When these values are compared with the ones under /usr/libexec/openshift/cartridges/zend/metadata/manifest.yml, the check fails:

/usr/libexec/openshift/cartridges/zend/metadata/manifest.yml
Name: zend
Cartridge-Version: 0.0.3

Version-Release number of selected component (if applicable):
rhc-node-1.12.7-1.el6oso.x86_64
openshift-origin-node-util-1.12.6-1.el6oso.noarch

How reproducible:
This is currently happening on several of our nodes in production and INT environments.

Steps to Reproduce:
1. Run oo-accept-node on a host with a missing /var/lib/openshift/.cartridge_repository/redhat-zend/0.0.3/metadata directory.
2. Observe the failure.

Actual results:
oo-accept-node fails.

Expected results:
oo-accept-node should pass.

Additional info:
Spoke with pmorie and we believe there is corruption happening to the cartridge repos.
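For illustration, a minimal sketch of the check that is failing (this is not the actual oo-accept-node code; the function name is hypothetical, and only the repository path is taken from this report). It walks every cartridge version directory and flags any that lack metadata/manifest.yml:

```shell
#!/bin/bash
# Sketch of the oo-accept-node manifest check (illustration only):
# every <cartridge>/<version>/ directory under the cartridge repository
# should contain metadata/manifest.yml.
check_cart_repo() {
    local repo_root="${1:-/var/lib/openshift/.cartridge_repository}"
    local rc=0 version_dir
    for version_dir in "$repo_root"/*/*/; do
        [ -d "$version_dir" ] || continue
        if [ ! -f "${version_dir}metadata/manifest.yml" ]; then
            # Mirrors the FAIL output seen above for redhat-zend/0.0.3
            echo "FAIL: missing manifest in ${version_dir}"
            rc=1
        fi
    done
    return $rc
}
```

On the node from this report, running `check_cart_repo` would flag redhat-zend/0.0.3/ while passing 0.0.1 and 0.0.2.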
We are unsure how or why this is occurring, but we need to fix it ASAP.
Could mcollectived have been restarted while it was still starting up?
Not sure how this happened, but a simple stop/start of mcollective fixed the issue.
platform-trace.log shows copy operations failing:

August 09 16:37:10 INFO oo_spawn buffer(10/) /bin/cp: cannot create directory `/var/lib/openshift/.cartridge_repository/redhat-python/0.0.3/versions/2.7/template/libs': No such file or directory
/bin/cp: preserving times for `/var/lib/openshift/.cartridge_repository/redhat-python/0.0.3/versions/2.7/template': No such file or directory
/bin/cp: preserving times for `/var/lib/openshift/.cartridge_repository/redhat-python/0.0.3/versions/2.7': No such file or directory
/bin/cp: preserving times for `/var/lib/openshift/.cartridge_repository/redhat-python/0.0.3/versions': No such file or directory

August 07 14:19:23 INFO oo_spawn buffer(10/) /bin/cp: cannot create directory `/var/lib/openshift/.cartridge_repository/redhat-phpmyadmin/0.0.3/env': File exists
August 07 14:19:23 INFO oo_spawn buffer(10/) /bin/cp: cannot create regular file `/var/lib/openshift/.cartridge_repository/redhat-phpmyadmin/0.0.3/versions/shared/conf.d/php.conf': File exists
August 07 14:19:23 INFO oo_spawn buffer(10/) /bin/cp: cannot create regular file `/var/lib/openshift/.cartridge_repository/redhat-phpmyadmin/0.0.3/versions/shared/conf.d/openshift.conf.erb': File exists
August 07 14:19:23 INFO oo_spawn buffer(10/) /bin/cp: cannot create regular file `/var/lib/openshift/.cartridge_repository/redhat-phpmyadmin/0.0.3/versions/shared/conf.d/phpMyAdmin.conf': File exists
The cartridge repository became corrupted. Added a lock to the openshift mcollective agent to prevent two agents from rebuilding the cartridge repository at the same time.
Commit pushed to master at https://github.com/openshift/origin-server
https://github.com/openshift/origin-server/commit/55c3e065a6a3a8aeea57068f68cc6d579a419a33

Bug 995599 - Add lock when building cartridge repository
* When building cartridge repository protect from an additional openshift mcollective being started
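The actual fix is a lock inside the Ruby mcollective agent; as a rough illustration of the same serialization idea, here is a flock(1)-based sketch. The lock file path, function name, and the CART_REPO_LOCK override are all hypothetical, and the body is a stand-in, not the real rebuild logic:

```shell
#!/bin/bash
# Illustration of serializing cartridge repository rebuilds with an
# exclusive file lock (the committed fix does this in Ruby, not shell).
rebuild_cart_repo() {
    local lock="${CART_REPO_LOCK:-/var/lock/oo-cartridge-repository.lock}"
    (
        # Exclusive lock on fd 9: a second agent that tries to rebuild
        # concurrently blocks here instead of interleaving its cp(1)
        # calls into the same version directories.
        flock -x 9
        echo "rebuilding cartridge repository (pid $$)"
        sleep 1   # stand-in for the actual copy work
    ) 9>"$lock"
}
```

With this in place, two concurrent rebuilds run one after the other rather than racing, which matches the "File exists" / "No such file or directory" interleaving seen in platform-trace.log.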
Not sure if the following verification method is correct or not.

1. Watch processes matching the keyword "mco":
# watch --interval 0.2 "ps -ef |grep mco|grep -v grep |grep -v update_yaml|grep -v ruby193"

2. Do a parallel cartridge install:
# cd /usr/libexec/openshift/cartridges/
# for i in `ls`; do oo-admin-cartridge -a install -s ./$i/ --mco & done

3. Check whether more than one mco process is generated. Many cartridge install processes still exist at the same time:

root 21733 21376 2 05:41 pts/1 00:00:00 ruby /usr/sbin/mco rpc -q openshift cartridge_repository action=install path=/usr/libexec/openshift/cartridges/jbossew
root 21734 21365 2 05:41 pts/1 00:00:00 ruby /usr/sbin/mco rpc -q openshift cartridge_repository action=install path=/usr/libexec/openshift/cartridges/jbossas
root 21737 21374 2 05:41 pts/1 00:00:00 ruby /usr/sbin/mco rpc -q openshift cartridge_repository action=install path=/usr/libexec/openshift/cartridges/10gen-m
root 21738 21373 2 05:41 pts/1 00:00:00 ruby /usr/sbin/mco rpc -q openshift cartridge_repository action=install path=/usr/libexec/openshift/cartridges/diy
root 21746 21386 2 05:41 pts/1 00:00:00 ruby /usr/sbin/mco rpc -q openshift cartridge_repository action=install path=/usr/libexec/openshift/cartridges/mock
This method does not verify the issue. The issue is multiple mcollectived daemons running multiple openshift agents, not multiple mco clients. The only method I've seen to reproduce the problem, and it is not reliable, is to restart mcollectived repeatedly. /jwh
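A hedged sketch of that reproduction method: restart mcollectived many times, then count the surviving daemons. The restart command and process pattern below are overridable assumptions (adjust for your init system), not the exact commands used in verification:

```shell
#!/bin/bash
# Restart the daemon repeatedly, then report how many mcollectived
# processes survive. RESTART_CMD and PROC_PATTERN are assumptions.
stress_restart() {
    local times="${1:-100}" i
    for i in $(seq 1 "$times"); do
        ${RESTART_CMD:-service mcollective restart} >/dev/null 2>&1
    done
    # Exactly one surviving process means no duplicate agent was left behind.
    pgrep -c -f "${PROC_PATTERN:-mcollectived}"
}
```

On a healthy node this should print 1; a count greater than 1 would indicate the duplicate-daemon condition described above.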
Checked again on devenv_3678: after restarting the mcollective service about 100 times, there is still only one mcollectived running. Moving bug to VERIFIED.