Bug 1008283

Summary: Corrupted pending_op in pending_op_groups queue
Product: OpenShift Online
Reporter: Rajat Chopra <rchopra>
Component: Pod
Assignee: Rajat Chopra <rchopra>
Status: CLOSED CURRENTRELEASE
QA Contact: libra bugs <libra-bugs>
Severity: high
Docs Contact:
Priority: high
Version: 2.x
CC: jhou, twiest
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-10-17 13:28:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Rajat Chopra 2013-09-16 05:15:04 UTC
Description of problem:

We still see corrupted pending_ops in the Online deployment. For example, the output of clear-pending-ops says:

ERROR in cleaning up application's op because the type is nil. App uuid - 5217b47b4382ec8f25000223. Op - #<PendingAppOpGroup _id: 52362df6e0b8cd02df000001, _type: nil, created_at: nil, .. <truncated>

Version-Release number of selected component (if applicable):


How reproducible:
Seems to recur with some applications that see a lot of REST API volume.

Steps to Reproduce:
1. unknown

Actual results:
The pending_op becomes corrupted, possibly because Mongoid overwrites an entry at the wrong index.
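
The following is a minimal in-memory sketch, not broker code, of how an index-addressed write could produce a document like the one in the error above, where only the field just written is populated and _type is nil. It assumes the write targets a fixed array position and that a position past the end of the array is created on demand. The helper names, the "state" field, and the "TypeA"/"TypeB" values are hypothetical.

```python
def set_by_index(groups, index, field, value):
    """Write one field at a fixed array position, creating the slot if absent."""
    while len(groups) <= index:
        groups.append(None)          # pad up to the requested position
    if groups[index] is None:
        groups[index] = {}           # new, otherwise empty embedded document
    groups[index][field] = value


def simulate_stale_index_write():
    # Hypothetical queue of two embedded op groups.
    groups = [
        {"_id": "g1", "_type": "TypeA", "state": "init"},
        {"_id": "g2", "_type": "TypeB", "state": "init"},
    ]
    stale_index = 1                  # writer A looked up the position of g2
    groups.pop(0)                    # writer B finishes g1 and removes it
    set_by_index(groups, stale_index, "state", "completed")  # writer A, late
    return groups


if __name__ == "__main__":
    for group in simulate_stale_index_write():
        print(group)
```

In this model the real g2 entry is left untouched at position 0, and a second entry with no _id or _type appears at position 1.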

Expected results:
No pending_op should ever be corrupted.

Additional info:
Earlier debugging revealed that this happens when the 'lock' invariant is violated. Mcoll can sometimes take a long time to return (~15 minutes). If the delay exceeds 30 minutes, the lock expires, and the broker thread ends up writing to mongo without holding a lock.
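
The lock expiry scenario can be sketched as a lease that the writer re-checks before touching the datastore. This is an illustration only: the LeaseLock class, the guarded_write helper, and the injectable clock are hypothetical, and the 1800-second TTL is taken from the 30-minute figure above.

```python
class LeaseLock:
    """A lock that silently expires after ttl_seconds (hypothetical model)."""

    def __init__(self, ttl_seconds, clock):
        self.ttl = ttl_seconds
        self.clock = clock           # callable returning the current time in seconds
        self.owner = None
        self.expires_at = 0.0

    def acquire(self, owner):
        now = self.clock()
        if self.owner is None or now >= self.expires_at:
            self.owner = owner
            self.expires_at = now + self.ttl
            return True
        return False

    def held_by(self, owner):
        return self.owner == owner and self.clock() < self.expires_at


def guarded_write(lock, owner, write):
    """Run write() only if the caller still holds an unexpired lease."""
    if not lock.held_by(owner):
        return False                 # lease lapsed while waiting on a slow call
    write()
    return True


if __name__ == "__main__":
    now = [0.0]
    lock = LeaseLock(ttl_seconds=1800, clock=lambda: now[0])
    lock.acquire("broker-1")
    now[0] = 1801.0                  # the remote call came back too late
    print(guarded_write(lock, "broker-1", lambda: None))
```

A check like this narrows the window but does not close it, since the lease can still lapse between the check and the write.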

Comment 1 Rajat Chopra 2013-09-18 04:58:59 UTC
Fixed with rev#ec4f4447e1b65e276f10c2695b9ba799835b7b5b in origin-server.
Atomic updates are now made to embedded documents.

This bug will be difficult to reproduce. We will have to wait for feedback from Online production to learn whether it happens again.
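
The commit itself is not quoted here. One possible reading of "atomic updates to embedded documents" is that each write names its target by _id instead of by array position, so a write whose target has moved still lands on the right entry, and a write whose target is gone becomes a no-op. The in-memory sketch below illustrates that idea only; set_by_id, the "state" field, and the sample values are hypothetical and are not taken from origin-server.

```python
def set_by_id(groups, group_id, field, value):
    """Update the embedded document whose _id matches; never create a new one."""
    for group in groups:
        if group is not None and group.get("_id") == group_id:
            group[field] = value
            return True
    return False                     # target no longer present: nothing is written


def simulate_id_matched_write():
    # Hypothetical queue of two embedded op groups.
    groups = [
        {"_id": "g1", "_type": "TypeA", "state": "init"},
        {"_id": "g2", "_type": "TypeB", "state": "init"},
    ]
    groups.pop(0)                    # another writer removes g1 first
    updated = set_by_id(groups, "g2", "state", "completed")
    missing = set_by_id(groups, "g1", "state", "completed")
    return groups, updated, missing


if __name__ == "__main__":
    print(simulate_id_matched_write())
```

In MongoDB terms this corresponds roughly to matching on the embedded _id in the query and using the positional `$` operator in the update, rather than sending a numeric index.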

Comment 2 Jianwei Hou 2013-09-30 01:44:24 UTC
This bug has not been reproduced on the QE side in recent days, so I'm moving it to verified.