1223624 – lock file management is weak and can block

Bug 1223624 - lock file management is weak and can block

Summary: lock file management is weak and can block

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	crudini
Sub Component:
Version:	7.0 (Kilo)
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	ga
Target Release:	7.0 (Kilo)
Assignee:	Pádraig Brady
QA Contact:	Itzik Brown
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1185030 1251948 1261487
TreeView+	depends on / blocked

Reported:	2015-05-21 04:22 UTC by Fabio Massimo Di Nitto
Modified:	2023-02-22 23:02 UTC (History)
CC List:	10 users (show)
Fixed In Version:	crudini-0.7-1.el7ost
Doc Type:	Bug Fix
Doc Text:	Prior to this update, separate lock files were used while updating config files. In addition, directory entries were not correctly synchronized during an update. As a result, a crash during this process could cause deadlock issues on subsequent config update attempts, or very occasionally result in corrupted (empty) config files. This update adds more robust locking and synchronization within the 'crudini' utility. The result is that config file updates are now more robust during system crash events.
Clone Of:
Environment:
Last Closed:	2015-08-05 13:24:25 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2015:1548	0	normal	SHIPPED_LIVE	Red Hat Enterprise Linux OpenStack Platform Enhancement Advisory	2015-08-05 17:07:06 UTC

Description Fabio Massimo Di Nitto 2015-05-21 04:22:48 UTC

Description of problem:

When crudini attempts to read/write values to/from a config file, it creates a lock file.

The lock file management is weak and can lead to blocking situation.

Version-Release number of selected component (if applicable):

crudini-0.5-1.el7ost.noarch

How reproducible:

hard to say. I did hit this bug because the node was rebooted while some operation from the installer was in progress and there was a stale lock file.

Steps to Reproduce:
1. touch /etc/nova/.nova.conf.crudini.lck
2. crudini --get /etc/nova/nova.conf neutron admin_auth_url

Actual results:

strace shows cruding trying over and over to get the lock file and will loop forever.

Expected results:

The lock file should include a pid information and the lock code should check if that pid is still running and/or generated by another crudini.

lack of a matching pid file should be considered as stalled and lock file removed.

if the PID is running then report an error instead of looping indefinetely.

Additional info:

Comment 4 Pádraig Brady 2015-05-21 10:27:55 UTC

This is tricky. I wrote these notes about crudini locking:

# Note we can't combine these methods to provide separated locks
# which are immune to stale file deadlock, as once the separated
# file is unlinked or renamed, you introduce a race with 3 or more
# users if there is an associated fcntl lock.

# Also a pid based method doesn't work with distributed file systems.

Note you can call crudini with the --inplace option to avoid this issue.
Though there are then these caveats:

# Caveats in --inplace mode:
#  - File must be writeable
#  - File should be generally non readable to avoid read lock DoS.

#  - Not Atomic as readers may see incomplete data for a while.
#  - Not Consistent as multiple (non crudini) writers may overlap.
#  - Less Durable as existing data truncated before I/O completes.
#  - Requires write access to file rather than write access to dir.

Isn't UNIX great. I'd love to come up with a scheme to avoid these issues.
For reference the latest crudini is at:
https://github.com/pixelb/crudini/blob/master/crudini

Comment 5 Fabio Massimo Di Nitto 2015-05-21 10:54:22 UTC

(In reply to Pádraig Brady from comment #4)
> This is tricky.

Yeps I know :)

> I wrote these notes about crudini locking:
> 
> # Note we can't combine these methods to provide separated locks
> # which are immune to stale file deadlock, as once the separated
> # file is unlinked or renamed, you introduce a race with 3 or more
> # users if there is an associated fcntl lock.
> 
> # Also a pid based method doesn't work with distributed file systems.

agreed.

> 
> Note you can call crudini with the --inplace option to avoid this issue.
> Though there are then these caveats:
> 
> # Caveats in --inplace mode:
> #  - File must be writeable

this isn't a problem

> #  - File should be generally non readable to avoid read lock DoS.
> 
> #  - Not Atomic as readers may see incomplete data for a while.

In theory this isn't a problem once the deployment is completed, but we can't assume anything.

> #  - Not Consistent as multiple (non crudini) writers may overlap.
> #  - Less Durable as existing data truncated before I/O completes.
> #  - Requires write access to file rather than write access to dir.
> 
> Isn't UNIX great. I'd love to come up with a scheme to avoid these issues.
> For reference the latest crudini is at:
> https://github.com/pixelb/crudini/blob/master/crudini

I am using the one available in OSP7. We need some kind of fix/agreement for product. I don't have bw to test upstream pieces of code sorry.

The issue here boils down to a combination of scripts that need to access config files to get/set some values as services are migrating between cluster nodes.

If any of those scripts fail due to "any random reason" the node can potentially be rebooted hard (poweroff/poweron).

On reboot, if a lock file is stale, the subsequent operations will continue to fail and we will enter in an infinite loop of lock -> reboot -> lock ....

We can adjust the scripts to deal with an error from crudini and avoid the loop (potentially reporting the error up to the user) easily, but a locking infinite loop will trigger a script execution timeout and we can't distinguish it from a non-recoverable failure.

If fixing the locking is "impossible", a "simple" try 10 times and then give up, exit 1 would also be acceptable (even tho it will require some extra code changes in different packages.

Comment 6 Fabio Massimo Di Nitto 2015-06-12 04:44:03 UTC

Raising severity. This is blocking Instance HA deployments and it's causing controller and compute nodes to fail badly when stray lock files are left around.

Comment 7 Pádraig Brady 2015-06-12 16:03:58 UTC

https://github.com/pixelb/crudini/commit/10875e6a2bb

Comment 9 Fabio Massimo Di Nitto 2015-06-16 08:52:40 UTC

I have been using 0.7.1-1 for almost 3 days now, with dozens and dozens of failovers and never hit this locking issue.

The fix looks good from engineering testing.

thanks
Fabio

Comment 12 Toni Freger 2015-06-28 07:39:22 UTC

Can you please suggest a way to verify this bug is solved?

Comment 13 Itzik Brown 2015-06-29 14:36:53 UTC

Checked on RHEL 7.1 , crudini 0.7

Comment 16 errata-xmlrpc 2015-08-05 13:24:25 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2015:1548

Note You need to log in before you can comment on or make changes to this bug.