Bug 1008283 - Corrupted pending_op in pending_op_groups queue
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Online
Classification: Red Hat
Component: Pod
Version: 2.x
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Rajat Chopra
QA Contact: libra bugs
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2013-09-16 05:15 UTC by Rajat Chopra
Modified: 2015-05-15 00:20 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-10-17 13:28:51 UTC
Target Upstream Version:
Embargoed:



Description Rajat Chopra 2013-09-16 05:15:04 UTC
Description of problem:

We still see corrupted pending_ops in the Online deployment. For example, the output of clear-pending-ops says:

ERROR in cleaning up application's op because the type is nil. App uuid - 5217b47b4382ec8f25000223. Op - #<PendingAppOpGroup _id: 52362df6e0b8cd02df000001, _type: nil, created_at: nil, .. <truncated>

Version-Release number of selected component (if applicable):


How reproducible:
Seems to happen repeatedly with applications that receive a high volume of REST API requests.

Steps to Reproduce:
1. unknown
2.
3.

Actual results:
A pending_op becomes corrupted, possibly because Mongoid overwrites the embedded document at the wrong array index.
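
To illustrate how a write at the wrong index could produce an op with a nil type, here is a minimal in-memory sketch. It is an assumption about the failure mode, not code from the broker: `set_by_index` is a hypothetical helper that emulates an index-addressed `$set` (which extends an array when the index is past the end), and the `state` field and the short `_id` values are made up for the example.

```python
def set_by_index(ops, index, field, value):
    """Emulate an index-addressed write such as $set on 'ops.<index>.<field>'.

    Like MongoDB, pad the array with None if the index is past the end and
    create a new sub-document that holds only the field being set.
    """
    while len(ops) <= index:
        ops.append(None)
    if ops[index] is None:
        ops[index] = {}
    ops[index][field] = value


# Hypothetical queue with two well-formed op groups.
ops = [
    {"_id": "a", "_type": "PendingAppOpGroup", "state": "init"},
    {"_id": "b", "_type": "PendingAppOpGroup", "state": "init"},
]

# Request 1 remembers that op "b" sits at index 1.
stale_index = 1

# Request 2 completes op "a" and removes it, so "b" shifts to index 0.
ops.pop(0)

# Request 1 now writes through its stale index.
set_by_index(ops, stale_index, "state", "completed")

# ops[0] is op "b", still in state "init".
# ops[1] is a new document {"state": "completed"} with no _id or _type.
```

The stray document at the end has no `_type`, which matches the shape of the op reported by clear-pending-ops.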

Expected results:
No pending_op should ever be corrupted.

Additional info:
Previous debugging revealed that this happens when the lock invariant is violated. An MCollective (mcoll) call can sometimes take a long time to return (~15 minutes). If it takes longer than 30 minutes, the lock expires, and the broker thread is left writing to MongoDB without holding a lock.
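
The lock expiry described above can be sketched with a small time-based lock. This is only an illustration under stated assumptions: `ExpiringLock`, `held_by`, and the thread names are hypothetical, time is passed in explicitly so the example is deterministic, and only the 30-minute expiry comes from this report. The re-check before writing is shown to make the violation visible, not as the fix that was applied.

```python
class ExpiringLock:
    """A lock that another owner may take once ttl_seconds have elapsed."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.owner = None
        self.acquired_at = None

    def acquire(self, owner, now):
        # The lock is free if nobody holds it or the holder's lease ran out.
        expired = self.owner is not None and now - self.acquired_at >= self.ttl
        if self.owner is None or expired:
            self.owner = owner
            self.acquired_at = now
            return True
        return False

    def held_by(self, owner, now):
        # True only while this owner's lease is still valid.
        return self.owner == owner and now - self.acquired_at < self.ttl


LOCK_TTL = 30 * 60  # 30 minutes, in seconds

lock = ExpiringLock(LOCK_TTL)
first_got_lock = lock.acquire("broker-thread-1", now=0)

# While thread 1 waits on a slow remote call, its lease runs out and a
# second thread is able to take the lock at the 31-minute mark.
second_got_lock = lock.acquire("broker-thread-2", now=31 * 60)

# The remote call finally returns after 35 minutes. Checking the lock
# before writing shows that thread 1 no longer holds it.
safe_to_write = lock.held_by("broker-thread-1", now=35 * 60)
```

Here `safe_to_write` is `False`: a thread that wrote at this point without checking would be updating the queue concurrently with the second thread.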

Comment 1 Rajat Chopra 2013-09-18 04:58:59 UTC
Fixed with rev#ec4f4447e1b65e276f10c2695b9ba799835b7b5b in origin-server.
Atomic updates are now made to embedded documents.
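
The commit itself is not shown here, so the following is only a sketch of one common way to update an embedded document atomically in MongoDB: match the element by its `_id` and let the positional `$` operator pick the array slot, instead of addressing a fixed index. `build_update` and `set_by_id` are hypothetical helpers, `state` is an assumed field name, and the ids are the ones from the error message in the description.

```python
def build_update(app_id, op_id, field, value):
    """Build a query/update pair that targets one embedded op by its _id."""
    query = {"_id": app_id, "pending_op_groups._id": op_id}
    update = {"$set": {"pending_op_groups.$." + field: value}}
    return query, update


def set_by_id(ops, op_id, field, value):
    """In-memory stand-in for the id-matched update above.

    Returns False and changes nothing if no element has the given _id.
    """
    for op in ops:
        if op is not None and op.get("_id") == op_id:
            op[field] = value
            return True
    return False


query, update = build_update(
    "5217b47b4382ec8f25000223", "52362df6e0b8cd02df000001", "state", "completed"
)

ops = [
    {"_id": "a", "_type": "PendingAppOpGroup", "state": "init"},
    {"_id": "b", "_type": "PendingAppOpGroup", "state": "init"},
]
ops.pop(0)  # another request removes op "a" first

updated_b = set_by_id(ops, "b", "state", "completed")  # finds "b" at its new index
updated_a = set_by_id(ops, "a", "state", "completed")  # no match, nothing is written
```

Because the element is located by `_id` at write time, a concurrent removal cannot redirect the write to a different slot or create a partial document.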

This bug will be difficult to reproduce. We will have to wait for feedback from Online production on whether it happens again.

Comment 2 Jianwei Hou 2013-09-30 01:44:24 UTC
QE has not been able to reproduce this bug in recent days, so I'm moving it to verified.

