Red Hat Bugzilla – Bug 1008283
Corrupted pending_op in pending_op_groups queue
Last modified: 2015-05-14 20:20:49 EDT
Description of problem:
We still see corrupted pending_ops in the online deployment. For example, the output of clear-pending-ops says:
ERROR in cleaning up application's op because the type is nil. App uuid - 5217b47b4382ec8f25000223. Op - #<PendingAppOpGroup _id: 52362df6e0b8cd02df000001, _type: nil, created_at: nil, .. <truncated>
Version-Release number of selected component (if applicable):
This seems to happen repeatedly with applications that see a high volume of REST API requests.
Steps to Reproduce:
A pending_op becomes corrupted, possibly because Mongoid overwrites an embedded document at the wrong index.
No pending_op should ever be corrupted.
Previous debugging showed that this happens when the lock guarantee is violated. An MCollective (mcoll) call can sometimes take a long time to return (~15 minutes). If it takes longer than 30 minutes, the lock expires, and the broker thread then writes to MongoDB without holding a lock.
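To illustrate the lock-expiry scenario described above, here is a minimal Ruby sketch that re-checks the lock after a slow call and refuses to write once the lock has expired. The names ExpiringLock, LockExpiredError and guarded_write are hypothetical, the lock and store are in-memory stand-ins, and the clock is injected so the timing can be simulated. This is not the broker's actual locking code.

```ruby
# Hypothetical in-memory stand-in for the application lock (not the broker's class).
class ExpiringLock
  def initialize(ttl_seconds, clock)
    @ttl = ttl_seconds
    @clock = clock        # callable returning the current time in seconds
    @acquired_at = nil
  end

  def acquire
    @acquired_at = @clock.call
  end

  # True only while the lock has been acquired and its TTL has not elapsed.
  def held?
    !@acquired_at.nil? && (@clock.call - @acquired_at) < @ttl
  end
end

class LockExpiredError < StandardError; end

# Runs the slow call (the block), then writes only if the lock is still held.
def guarded_write(lock, store, key, value)
  result = yield          # stands in for a slow mcoll call
  raise LockExpiredError, "lock expired before write to #{key}" unless lock.held?
  store[key] = value
  result
end

# Simulated timeline: a 30-minute lock, one call returning after 15 minutes,
# then a second call that returns after the lock has expired.
now = 0
lock = ExpiringLock.new(30 * 60, -> { now })
lock.acquire
store = {}

guarded_write(lock, store, :state, "completed") { now += 15 * 60 }

begin
  guarded_write(lock, store, :state, "rolled_back") { now += 31 * 60 }
rescue LockExpiredError => e
  puts "write skipped: #{e.message}"
end
```

In this sketch the second write is rejected instead of reaching the store unlocked. A check like this narrows the window but does not close it entirely, since the lock could still expire between the check and the write.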
Fixed with rev#ec4f4447e1b65e276f10c2695b9ba799835b7b5b in origin-server.
Atomic updates are now made to embedded documents.
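As a rough illustration of why index-based writes to embedded documents are fragile, the following Ruby sketch simulates the array in memory. It assumes the failure mode suggested in the description (a write at a stale array index) and contrasts it with a write that matches the embedded document by its _id. The helpers set_by_index and set_by_id are hypothetical and do not call Mongoid or MongoDB. The referenced commit may implement the fix differently.

```ruby
# Writes a field at a fixed array position. If the index is stale and points
# past the end, a new element is created that holds only the written field.
def set_by_index(groups, index, field, value)
  groups[index] ||= {}
  groups[index][field] = value
end

# Writes a field on the element whose _id matches; does nothing if it is gone.
def set_by_id(groups, id, field, value)
  target = groups.find { |g| g[:_id] == id }
  return false if target.nil?
  target[field] = value
  true
end

groups = [
  { _id: "a", _type: "PendingAppOpGroup", state: "init" },
  { _id: "b", _type: "PendingAppOpGroup", state: "init" }
]

stale_index = 1   # a writer looked up op "b" and remembered position 1
groups.shift      # meanwhile another request removes op "a"

# Index-based write: lands on a slot that no longer holds op "b".
by_index = groups.map(&:dup)
set_by_index(by_index, stale_index, :state, "completed")
p by_index.last   # an element with no _id and no _type

# Id-based write: finds op "b" wherever it now sits.
by_id = groups.map(&:dup)
set_by_id(by_id, "b", :state, "completed")
p by_id
```

The element produced by the index-based write has no _type or created_at, which resembles the corrupted op shown in the error output above. In MongoDB itself, matching on the embedded _id in the query and updating through the positional `$` operator is one way to get the id-based behavior in a single atomic update.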
This bug will be difficult to reproduce. We will have to wait for feedback from Online production to learn whether it happens again.
This bug has not been reproduced on the QE side in recent days, so I'm moving it to verified.