Bug 1610394

Summary: RFE: add dark data detection
Product: Red Hat OpenStack
Reporter: Pete Zaitcev <zaitcev>
Component: openstack-swift
Assignee: Pete Zaitcev <zaitcev>
Status: CLOSED DEFERRED
Severity: medium
Docs Contact: Kim Nylander <knylande>
Priority: medium
Version: 15.0 (Stein)
CC: cschwede, derekh, gcharot, gfidente, njohnston, pgrist, scohen, spower, srevivo, zaitcev
Target Milestone: z2
Keywords: FutureFeature, Triaged
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2024-01-19 10:51:20 UTC
Type: Bug

Description Pete Zaitcev 2018-07-31 14:20:06 UTC
A large user hit a situation where XFS panics and freezes volumes until
a person notices the problem and remounts them. This happens when
volumes overflow. Because of the overflow condition, the users will try
to delete objects in Swift. The proxy places tombstones, and in a week
the replicator deletes the objects. If the frozen volume, which never saw
those deletions, is remounted later, it releases all the stuck objects
inside as dark data.

We need some sort of consistency checking in order to pick up pieces.
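
To make the idea concrete, here is a minimal sketch of such a check, not the proposed implementation. It assumes the usual Swift on-disk layout (objects/<partition>/<suffix>/<hash>/<timestamp>.data) and assumes that a set of expected object hashes has already been built from container listings by some other means. The names hashes_with_data, find_dark_data and expected_hashes are hypothetical.

```python
import os
from typing import Iterable, Set


def hashes_with_data(objects_dir: str) -> Set[str]:
    """Collect names of hash directories that hold at least one .data file.

    Directories that hold only a tombstone (.ts) are ignored, since those
    objects are already marked as deleted.
    """
    found = set()
    for dirpath, _dirnames, filenames in os.walk(objects_dir):
        if any(name.endswith(".data") for name in filenames):
            found.add(os.path.basename(dirpath))
    return found


def find_dark_data(objects_dir: str, expected_hashes: Iterable[str]) -> Set[str]:
    """Return hashes present on disk that no listing accounts for.

    expected_hashes is an assumption of this sketch: the hashes of all
    objects that the container listings say should exist on this device.
    """
    return hashes_with_data(objects_dir) - set(expected_hashes)
```

Anything returned by find_dark_data is only a candidate: an object uploaded moments ago may not be in the listings yet, so a real checker would have to re-verify before removing anything.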

An alternative would be to alert operators better to the overflow
condition, and most specifically to the stuck-volume condition. But
there's only so much we can do if they just don't care to maintain their
cluster until a disaster strikes.

Comment 2 Pete Zaitcev 2018-07-31 14:32:16 UTC
Just to make it easier to find, here's what XFS does:

[27997104.894094] XFS (sde1): Internal error xfs_trans_cancel at line 1007 of file fs/xfs/xfs_trans.c. Caller xfs_create+0x40e/0x710 [xfs]
[27997104.894143] CPU: 0 PID: 2513 Comm: swift-object-se Not tainted 3.10.0-327.13.1.el7.x86_64 #1 
[27997104.894145] Hardware name: Cisco Systems Inc UCSC-C240-M4L/UCSC-C240-M4L, BIOS C240M4.2.0.9a.0.120120151913 12/01/2015
[27997104.894146] ffff88013e6854d8 00000000e98a2272 ffff880847fa3b48 ffffffff816356f4
[27997104.894151] ffff880847fa3b60 ffffffffa02d3e6b ffffffffa02e37fe ffff880847fa3b88
[27997104.894153] ffffffffa02ee919 ffff88025b961dc0 ffff880a45a85000 0000000000000000
[27997104.894156] Call Trace:
[27997104.894165] [<ffffffff816356f4>] dump_stack+0x19/0x1b
[27997104.894180] [<ffffffffa02d3e6b>] xfs_error_report+0x3b/0x40 [xfs]
[27997104.894191] [<ffffffffa02e37fe>] ? xfs_create+0x40e/0x710 [xfs]
[27997104.894203] [<ffffffffa02ee919>] xfs_trans_cancel+0xd9/0x100 [xfs]
[27997104.894213] [<ffffffffa02e37fe>] xfs_create+0x40e/0x710 [xfs]
[27997104.894224] [<ffffffffa02dfd9b>] xfs_vn_mknod+0xbb/0x250 [xfs]
[27997104.894233] [<ffffffffa02dff63>] xfs_vn_create+0x13/0x20 [xfs]
[27997104.894237] [<ffffffff811eac9d>] vfs_create+0xcd/0x130
[27997104.894239] [<ffffffff811ec32f>] do_last+0xbef/0x1270
[27997104.894243] [<ffffffff811c11ee>] ? kmem_cache_alloc_trace+0x1ce/0x1f0
[27997104.894245] [<ffffffff811ee692>] path_openat+0xc2/0x490
[27997104.894248] [<ffffffff811efe5b>] do_filp_open+0x4b/0xb0
[27997104.894252] [<ffffffff811fc9f7>] ? __alloc_fd+0xa7/0x130
[27997104.894256] [<ffffffff811dd803>] do_sys_open+0xf3/0x1f0
[27997104.894259] [<ffffffff811dd91e>] SyS_open+0x1e/0x20
[27997104.894264] [<ffffffff81645e89>] system_call_fastpath+0x16/0x1b
[27997104.894266] XFS (sde1): xfs_do_force_shutdown(0x8) called from line 1008 of file fs/xfs/xfs_trans.c. Return address = 0xffffffffa02ee932
[27997104.894272] XFS (sde1): Corruption of in-memory data detected. Shutting down filesystem
[27997104.894298] XFS (sde1): Please umount the filesystem and rectify the problem(s)
[27997106.503918] XFS (sde1): xfs_log_force: error -5 returned.
[27997136.605229] XFS (sde1): xfs_log_force: error -5 returned.
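
For alerting on stuck volumes, the shutdown messages in the log above could be picked up by a small scanner. The following is a rough sketch only: it matches the two message forms seen in this particular log (xfs_do_force_shutdown and "Shutting down filesystem"), and the wording may differ on other kernel versions. The function name shut_down_devices is hypothetical.

```python
import re
from typing import Iterable, Set

# Matches lines such as:
#   XFS (sde1): xfs_do_force_shutdown(0x8) called from line ...
#   XFS (sde1): Corruption of in-memory data detected. Shutting down filesystem
# The pattern is based only on the messages quoted in this bug.
SHUTDOWN_RE = re.compile(
    r"XFS \((?P<dev>[^)]+)\): "
    r"(?:xfs_do_force_shutdown|.*Shutting down filesystem)"
)


def shut_down_devices(log_lines: Iterable[str]) -> Set[str]:
    """Return the device names for which XFS reported a forced shutdown."""
    devices = set()
    for line in log_lines:
        match = SHUTDOWN_RE.search(line)
        if match:
            devices.add(match.group("dev"))
    return devices
```

Fed with the lines of a kernel log, it returns the set of affected devices, which an operator-facing check could then compare against the devices in the ring.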

Comment 3 Pete Zaitcev 2018-07-31 14:35:57 UTC
There was an old upstream review by Sam Merritt that added watchers
to the auditor, 212824. It might come in useful for this, possibly for us
to package a solution.

Comment 15 Gregory Charot 2020-03-12 12:05:08 UTC
Development is not complete upstream; moving to 17.

This RFE is mostly targeted at support in situations where the cluster is in bad shape. In the meantime, we have other ways to fix such deployments.

Comment 20 spower 2022-05-11 10:30:10 UTC
This RFE is not marked as an MVP for 17.0, so it is being moved for consideration to OSP 17.1. As stated in the OSP Program Call, QE and Docs only have the capacity to verify and document MVP features for OSP 17.0.