Created attachment 1084935
Summary of affected INT apps

Description of problem:
In the INT environment, I'm seeing these accept-node errors on all std nodes:

OpenShift::MissingElementError error reading /var/lib/openshift/562226120e78864f6700019b/php/metadata/manifest.yml: Version is a required element

The error appears to be occurring with failed app-destroys.

[root ~]# grep 562226120e78864f6700019b /var/log/openshift/node/platform.log |grep delet
October 20 17:21:48 INFO [request_id=1b4417fdcb3fa725035d2c10d3eeef38,app_uuid=562226120e78864f6700019b] Failure while deleting gear 562226120e78864f6700019b: '' is not a legal cartridge identifier

There are many of these errors in the platform logs, as it tries to delete the app throughout the day.

The issue can be fixed by running 'oo-admin-gear destroygear -c $UUID' to remove the gear. Then the app destroy finishes automatically.

Version-Release number of selected component (if applicable):
openshift-origin-node-util-1.38.4-1.el6oso.noarch

How reproducible:
It appears on the most heavily-used nodes in INT. use-std-node1,2,3. It is not yet present in STG.

Steps to Reproduce:
1. Create apps in INT as part of regular QE testing.
2. Ops can run 'oo-accept-node -v' to look for "Version is a required element"

Actual results:
App is never deleted. Mcollective tries forever to delete the apps.

Expected results:
Apps should finish deleting, even if '' is an illegal cartridge identifier.

Additional info:
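For triage, a rough sketch of how the stuck gears could be listed from platform.log. It assumes every failed delete is logged in the same "Failure while deleting gear <uuid>: <reason>" shape as the line quoted above and that gear UUIDs are 24 hex characters; the constant, method name, and sample array are made up for illustration and are not part of any OpenShift tool.

```ruby
# Sketch only: pattern inferred from the single log line quoted in this report.
FAILED_DELETE = /Failure while deleting gear (\h{24}): (.*)\z/

# Returns a hash mapping gear UUID => last failure reason seen for that gear.
def failed_gear_deletes(lines)
  lines.each_with_object({}) do |line, failures|
    match = FAILED_DELETE.match(line.chomp)
    failures[match[1]] = match[2] if match
  end
end

# Sample input: the failure line from this report plus one unrelated line.
SAMPLE_LOG = [
  "October 20 17:21:48 INFO [request_id=1b4417fdcb3fa725035d2c10d3eeef38," \
  "app_uuid=562226120e78864f6700019b] Failure while deleting gear " \
  "562226120e78864f6700019b: '' is not a legal cartridge identifier",
  "October 28 23:12:38 INFO Shell command 'userdel --remove -f " \
  "\"56318c088636d89b2a000051\"' ran. rc=0 out="
].freeze

failed_gear_deletes(SAMPLE_LOG).each do |uuid, reason|
  puts "#{uuid}: #{reason}"
end
```

Each UUID it prints would be a candidate for the 'oo-admin-gear destroygear -c $UUID' workaround mentioned above. On a node, the lines could come from File.foreach on /var/log/openshift/node/platform.log instead of the sample array.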
PR: https://github.com/openshift/origin-server/pull/6285
Commit pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/650f8368aeeabc0e7a9757b3bdcf993962838efc
FrontendHttpServer: Recover from missing manifest

OpenShift::Runtime::FrontendHttpServer#initialize: Rescue any exception from initializing the cartridge model, and set @standalone_web_proxy to false in that contingency.

Before this commit, failure to initialize the cartridge model would cause a failure to initialize the frontend http server, which would cause a failure to initialize the container plugin, which would prevent the container plugin's destroy method from finishing. Consequently, it was impossible to delete a gear with a bad manifest.yml file.

This commit fixes bug 1273658.
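The pattern described in the commit message can be sketched roughly as follows. This is not the code from the commit: FakeCartridgeModel, FrontendSketch, the 'web_proxy' key, and the init_error reader are hypothetical stand-ins, and MissingElementError here is a local class that only mimics the error named in this bug. The sketch rescues StandardError, which may be narrower than what the real change rescues.

```ruby
# Local stand-in for the error named in this bug (not the origin-server class).
class MissingElementError < StandardError; end

# Hypothetical cartridge model: refuses to initialize without a Version element.
class FakeCartridgeModel
  def initialize(manifest)
    raise MissingElementError, 'Version is a required element' unless manifest['Version']
    @web_proxy = manifest.fetch('web_proxy', false) # made-up key, illustration only
  end

  def standalone_web_proxy?
    @web_proxy
  end
end

# Hypothetical frontend: a bad manifest must not abort construction, otherwise
# the caller never gets an object on which to call destroy.
class FrontendSketch
  attr_reader :standalone_web_proxy, :init_error

  def initialize(manifest)
    begin
      @standalone_web_proxy = FakeCartridgeModel.new(manifest).standalone_web_proxy?
    rescue StandardError => e
      # Fall back to a safe default and remember the error for logging.
      @standalone_web_proxy = false
      @init_error = e
    end
  end

  def destroy
    :destroyed
  end
end

frontend = FrontendSketch.new({}) # an empty hash models a truncated manifest.yml
puts frontend.init_error.message
puts frontend.destroy
```

The point of the design is that the failure is contained at the lowest layer, so the objects built on top of it can still be constructed and the destroy path can run to completion.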
Verification steps:
1. Create an app.
2. Delete the following line from manifest.yml:
   Version: '5.4'
3. rhc delete app PHP
4. The app is deleted successfully.
5. Check the log:

grep 56318c088636d89b2a000051 /var/log/openshift/node/platform.log
October 28 23:01:41 INFO Shell command 'quota -p --always-resolve -w 56318c088636d89b2a000051' ran. rc=0 out=Disk quotas for user 56318c088636d89b2a000051 (uid 1000):
October 28 23:12:36 WARN Failure while deleting gear 56318c088636d89b2a000051: Version is a required element
October 28 23:12:36 INFO Failure while deleting gear 56318c088636d89b2a000051: Version is a required element
October 28 23:12:36 INFO Shell command 'rm /var/lib/openshift/.last_access/56318c088636d89b2a000051' ran. rc=1 out= 1802249 600 56318c088636d89b2a000051 56318c088636d89b2a000051 56318c088636d89b2a000051 56318c088636d89b2a000051
October 28 23:12:38 INFO Shell command 'userdel --remove -f "56318c088636d89b2a000051"' ran. rc=0 out=

Version: devenv-stage-1188

Could you help confirm that these steps are OK?
To test, I truncated manifest.yml so that it was empty, which was the initial cause of the reported problem (at least for some of the gears we looked at). However, simply deleting the "Version:" field should trigger the same error.

You should still see the error message in the logs. The difference is that the node runtime should now continue after it encounters the error and ultimately delete the gear. After the rhc command finishes, the gear should be gone: /var/lib/openshift/56318c088636d89b2a000051 should have been removed, along with the frontend configuration etc.

Other than that, your verification procedure looks good.
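As a quick way to confirm that a test manifest is in the broken state before running the delete, something like the following could be used. It is only a sketch: the method name is made up, it checks nothing except the presence of a top-level "Version" key, and it is not how the node itself validates manifests. The "Name: php" content is illustrative.

```ruby
require 'yaml'

# Returns true when the manifest text has no top-level "Version" element.
# This covers both an empty (truncated) file and a file with the field removed.
def manifest_missing_version?(manifest_text)
  doc = YAML.safe_load(manifest_text)
  !(doc.is_a?(Hash) && doc.key?('Version'))
end

puts manifest_missing_version?('')                               # truncated file
puts manifest_missing_version?("Name: php\n")                    # field deleted
puts manifest_missing_version?("Name: php\nVersion: '5.4'\n")    # intact
```

On a node, the text would come from File.read on the gear's manifest.yml.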
Thanks for the info.