Bug 1223624

Summary:	lock file management is weak and can block
Product:	Red Hat OpenStack	Reporter:	Fabio Massimo Di Nitto <fdinitto>
Component:	crudini	Assignee:	Pádraig Brady <pbrady>
Status:	CLOSED ERRATA	QA Contact:	Itzik Brown <itbrown>
Severity:	urgent	Docs Contact:
Priority:	urgent
Version:	7.0 (Kilo)	CC:	abeekhof, apevec, jschluet, lhh, mlopes, ohochman, pbrady, sclewis, tfreger, yeylon
Target Milestone:	ga
Target Release:	7.0 (Kilo)
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	crudini-0.7-1.el7ost	Doc Type:	Bug Fix
Doc Text:	Prior to this update, separate lock files were used while updating config files. In addition, directory entries were not correctly synchronized during an update. As a result, a crash during this process could cause deadlock issues on subsequent config update attempts, or very occasionally result in corrupted (empty) config files. This update adds more robust locking and synchronization within the 'crudini' utility. The result is that config file updates are now more robust during system crash events.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2015-08-05 13:24:25 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1185030, 1251948, 1261487

Description Fabio Massimo Di Nitto 2015-05-21 04:22:48 UTC

Description of problem:

When crudini attempts to read/write values to/from a config file, it creates a lock file.

The lock file management is weak and can lead to blocking situation.

Version-Release number of selected component (if applicable):

crudini-0.5-1.el7ost.noarch

How reproducible:

hard to say. I did hit this bug because the node was rebooted while some operation from the installer was in progress and there was a stale lock file.

Steps to Reproduce:
1. touch /etc/nova/.nova.conf.crudini.lck
2. crudini --get /etc/nova/nova.conf neutron admin_auth_url

Actual results:

strace shows cruding trying over and over to get the lock file and will loop forever.

Expected results:

The lock file should include a pid information and the lock code should check if that pid is still running and/or generated by another crudini.

lack of a matching pid file should be considered as stalled and lock file removed.

if the PID is running then report an error instead of looping indefinetely.

Additional info:

Comment 4 Pádraig Brady 2015-05-21 10:27:55 UTC

This is tricky. I wrote these notes about crudini locking:

# Note we can't combine these methods to provide separated locks
# which are immune to stale file deadlock, as once the separated
# file is unlinked or renamed, you introduce a race with 3 or more
# users if there is an associated fcntl lock.

# Also a pid based method doesn't work with distributed file systems.

Note you can call crudini with the --inplace option to avoid this issue.
Though there are then these caveats:

# Caveats in --inplace mode:
#  - File must be writeable
#  - File should be generally non readable to avoid read lock DoS.

#  - Not Atomic as readers may see incomplete data for a while.
#  - Not Consistent as multiple (non crudini) writers may overlap.
#  - Less Durable as existing data truncated before I/O completes.
#  - Requires write access to file rather than write access to dir.

Isn't UNIX great. I'd love to come up with a scheme to avoid these issues.
For reference the latest crudini is at:
https://github.com/pixelb/crudini/blob/master/crudini

Comment 5 Fabio Massimo Di Nitto 2015-05-21 10:54:22 UTC

(In reply to Pádraig Brady from comment #4)
> This is tricky.

Yeps I know :)

> I wrote these notes about crudini locking:
> 
> # Note we can't combine these methods to provide separated locks
> # which are immune to stale file deadlock, as once the separated
> # file is unlinked or renamed, you introduce a race with 3 or more
> # users if there is an associated fcntl lock.
> 
> # Also a pid based method doesn't work with distributed file systems.

agreed.

> 
> Note you can call crudini with the --inplace option to avoid this issue.
> Though there are then these caveats:
> 
> # Caveats in --inplace mode:
> #  - File must be writeable

this isn't a problem

> #  - File should be generally non readable to avoid read lock DoS.
> 
> #  - Not Atomic as readers may see incomplete data for a while.

In theory this isn't a problem once the deployment is completed, but we can't assume anything.

> #  - Not Consistent as multiple (non crudini) writers may overlap.
> #  - Less Durable as existing data truncated before I/O completes.
> #  - Requires write access to file rather than write access to dir.
> 
> Isn't UNIX great. I'd love to come up with a scheme to avoid these issues.
> For reference the latest crudini is at:
> https://github.com/pixelb/crudini/blob/master/crudini

I am using the one available in OSP7. We need some kind of fix/agreement for product. I don't have bw to test upstream pieces of code sorry.

The issue here boils down to a combination of scripts that need to access config files to get/set some values as services are migrating between cluster nodes.

If any of those scripts fail due to "any random reason" the node can potentially be rebooted hard (poweroff/poweron).

On reboot, if a lock file is stale, the subsequent operations will continue to fail and we will enter in an infinite loop of lock -> reboot -> lock ....

We can adjust the scripts to deal with an error from crudini and avoid the loop (potentially reporting the error up to the user) easily, but a locking infinite loop will trigger a script execution timeout and we can't distinguish it from a non-recoverable failure.

If fixing the locking is "impossible", a "simple" try 10 times and then give up, exit 1 would also be acceptable (even tho it will require some extra code changes in different packages.

Comment 6 Fabio Massimo Di Nitto 2015-06-12 04:44:03 UTC

Raising severity. This is blocking Instance HA deployments and it's causing controller and compute nodes to fail badly when stray lock files are left around.

Comment 7 Pádraig Brady 2015-06-12 16:03:58 UTC

https://github.com/pixelb/crudini/commit/10875e6a2bb

Comment 9 Fabio Massimo Di Nitto 2015-06-16 08:52:40 UTC

I have been using 0.7.1-1 for almost 3 days now, with dozens and dozens of failovers and never hit this locking issue.

The fix looks good from engineering testing.

thanks
Fabio

Comment 12 Toni Freger 2015-06-28 07:39:22 UTC

Can you please suggest a way to verify this bug is solved?

Comment 13 Itzik Brown 2015-06-29 14:36:53 UTC

Checked on RHEL 7.1 , crudini 0.7

Comment 16 errata-xmlrpc 2015-08-05 13:24:25 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2015:1548