Description of problem:
We still see some corrupted pending_ops in the Online deployment. For example, the output of clear-pending-ops says:
ERROR in cleaning up application's op because the type is nil. App uuid - 5217b47b4382ec8f25000223. Op - #<PendingAppOpGroup _id: 52362df6e0b8cd02df000001, _type: nil, created_at: nil, .. <truncated>

Version-Release number of selected component (if applicable):

How reproducible:
Seems to happen repeatedly with some applications that see a lot of REST API volume.

Steps to Reproduce:
1. unknown
2.
3.

Actual results:
A pending_op becomes corrupted, possibly because of mongoid overwriting at the wrong index.

Expected results:
No pending_op should ever be corrupted.

Additional info:
The last round of debugging revealed that this happens when 'lock' sanity is violated. Mcoll can sometimes return after a long time (~15 minutes). If this time happens to be >30 minutes, the lock expires, leaving the broker thread free to write to mongo without a lock.
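To make the lock-expiry scenario concrete, here is a minimal plain-Ruby sketch. It is not broker code: `lock_held?`, `guarded_write`, the hash-based lock, and the thread name are all hypothetical, and only the 30-minute expiry and the ~15-minute delay come from the report. It shows how a write attempted after a slow remote call can fall outside the lock's lifetime, and how a check before writing would catch that.

```ruby
# Hypothetical lock model; only the 30-minute expiry is taken from the report.
def lock_held?(lock, who, now, ttl_seconds = 30 * 60)
  # The lock counts as held only by its owner and only until the TTL runs out.
  lock[:owner] == who && (now - lock[:acquired_at]) < ttl_seconds
end

# Run the block only if the caller still holds the lock at time `now`.
def guarded_write(lock, who, now)
  return :skipped_lock_expired unless lock_held?(lock, who, now)
  yield
  :written
end

start  = Time.at(0)
lock   = { owner: "broker-thread-1", acquired_at: start }
writes = []

# Remote call returns after about 15 minutes: the lock is still valid.
fast = guarded_write(lock, "broker-thread-1", start + 15 * 60) { writes << :op_update }

# Remote call returns after 31 minutes: the lock has expired, so the write is skipped.
slow = guarded_write(lock, "broker-thread-1", start + 31 * 60) { writes << :op_update }
```

Without such a check, the second call would write to mongo while another thread may already hold the lock, which is the condition described above.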
Fixed with rev#ec4f4447e1b65e276f10c2695b9ba799835b7b5b in origin-server. Atomic updates are now made to embedded documents. This bug will be difficult to reproduce, so we will have to wait for feedback from Online production to see whether it happens again.
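As an illustration of why a write at the wrong index can leave a document with a nil type, here is a plain-Ruby simulation. It does not use MongoDB or Mongoid and is not the code from the revision above; `update_by_index`, `update_by_id`, and the sample documents are hypothetical. It contrasts a write addressed by a stale array position with a write addressed by the embedded document's `_id`.

```ruby
# Plain-Ruby simulation; no MongoDB or Mongoid involved. All names are hypothetical.

# Write a field by array position, the way a stale positional path would.
def update_by_index(ops, index, field, value)
  ops[index] ||= {}          # a missing slot is created as an empty document
  ops[index][field] = value
end

# Write a field only on the element whose _id matches; no-op if it is gone.
def update_by_id(ops, id, field, value)
  op = ops.find { |o| o[:_id] == id }
  return false if op.nil?
  op[field] = value
  true
end

def fresh_ops
  [
    { _id: "op1", _type: "PendingAppOpGroup", state: "init" },
    { _id: "op2", _type: "PendingAppOpGroup", state: "init" }
  ]
end

# Writer A remembers that op2 sits at index 1, then stalls.
stale_index = 1

# Meanwhile writer B removes op1, so op2 shifts to index 0.
by_index = fresh_ops
by_index.shift
update_by_index(by_index, stale_index, :state, "completed")
# by_index[1] now exists with only :state set: no _id and no _type.

# The same interleaving, but the write is matched on _id instead of position.
by_id = fresh_ops
by_id.shift
update_by_id(by_id, "op2", :state, "completed")
```

In the index-based case a partial document appears alongside the real one, which resembles the `_type: nil` entries reported by clear-pending-ops. Matching on `_id` updates the intended element or does nothing.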
QE has not been able to reproduce this bug in recent days, so I'm moving it to VERIFIED.