Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1665216

Summary: Databases crash on Gluster 5 with the option performance.write-behind enabled
Product: [Community] GlusterFS
Reporter: gabisoft
Component: write-behind
Assignee: Raghavendra G <rgowdapp>
Status: CLOSED WORKSFORME
QA Contact:
Severity: high
Docs Contact:
Priority: unspecified
Version: 5
CC: bugs, bugzilla.redhat.com, gabisoft, rgowdapp
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-03-26 14:23:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description | Flags
asb-etcd-1-smjcf.log | none
asb-etcd-3-dsfxf.log | none
dump-fuse, gzipped | none
strace of initdb (which crashed) | none

Description gabisoft 2019-01-10 17:13:52 UTC
Created attachment 1519880 [details]
asb-etcd-1-smjcf.log

Description of problem:
Etcd, Cassandra and PostgreSQL each show a stack trace shortly after starting when their DB files are on a Gluster 5.2 volume that has the volume option performance.write-behind enabled. Using the Gluster volumes to serve normal files does not trigger the issue.

Version-Release number of selected component (if applicable):
5.2


How reproducible:


Steps to Reproduce:
1. Start Etcd with its DB files on a Gluster volume that has the option performance.write-behind set to on
2. Etcd starts, then crashes once it is listening for clients (unexpected fault address 0x7fca0c001040)
3. Disable performance.write-behind on the Gluster volume
4. Restart Etcd
5. Etcd starts normally

Actual results:
Output of an Etcd instance crashing (asb-etcd-1-smjcf.log)

Expected results:
Output of an Etcd instance running with performance.write-behind off (asb-etcd-3-dsfxf.log)


Additional info:
The content or size of the Etcd DB doesn't matter. It is also reproducible if the DB is created from scratch.

Comment 1 gabisoft 2019-01-10 17:14:53 UTC
Created attachment 1519881 [details]
asb-etcd-3-dsfxf.log

Comment 2 Raghavendra G 2019-01-11 03:47:45 UTC
Can you paste the backtrace here? If possible can you attach the core?

Comment 3 Raghavendra G 2019-01-11 03:51:36 UTC
Sorry, I misread the bug as glusterfs crashing. I see now that etcd is having problems coming up. Can you get the following information (I don't need a core of glusterfs, as there is none):

* strace of etcd (strace -ff -v ...), to find out what syscalls it did.
* dump of traffic between fuse kernel module and glusterfs (see --dump-fuse option of glusterfs)
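
The two captures requested above could be collected roughly as follows. This is a sketch, not a verified procedure: the server name, volume name, mount point, output directory and etcd command line are placeholders. The script only prints the command lines, so they can be reviewed and then run by hand. `--volfile-server`, `--volfile-id` and `--dump-fuse` are glusterfs client options. `-o` is strace's output-file option, and combined with `-ff` it writes one file per traced process.

```shell
#!/bin/sh
# Sketch only: every value below is a placeholder, adjust to your setup.
SERVER="gluster-server"
VOLUME="gluster-pv18"
MOUNTPOINT="/mnt/gluster-pv18"
OUTDIR="/var/tmp/bz1665216"   # assumed to be a local directory, not on the gluster volume

# Print the glusterfs client command that mounts the volume and records
# the traffic between the fuse kernel module and glusterfs in a file.
fuse_dump_cmd() {
    printf 'glusterfs --volfile-server=%s --volfile-id=%s --dump-fuse=%s/fuse.dump %s\n' \
        "$SERVER" "$VOLUME" "$OUTDIR" "$MOUNTPOINT"
}

# Print the strace command for the given program and arguments.
# -ff writes one file per process (trace.<pid>), -v prints unabbreviated structures.
# The directory $OUTDIR/strace has to exist before the command is run.
strace_cmd() {
    printf 'strace -ff -v -o %s/strace/trace %s\n' "$OUTDIR" "$*"
}

fuse_dump_cmd
strace_cmd etcd --data-dir "$MOUNTPOINT/etcd"
```

Keeping the output directory on local disk is a choice made for this sketch, so that writing the traces does not add I/O to the volume under test.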

Comment 4 Raghavendra G 2019-01-11 14:05:38 UTC
Also, detailed steps for a reproducer (or, even better, a script or a capture of the commands you executed) would greatly speed up the debugging.

Comment 5 mhutter 2019-02-01 10:23:14 UTC
Reproduction case: exactly as described in the original ticket.


# Prepare gluster volume
gluster volume set gluster-pv18 performance.write-behind off

# mount the volume
mount -t glusterfs <gluster-server>:/gluster-pv18 /mnt/gluster-pv18

# start Postgres
docker run --name psql-test --rm -v /mnt/gluster-pv18:/var/lib/postgresql/data docker.io/postgres:9.5
# this should work as expected

# clean up
docker stop psql-test
rm -rf /mnt/gluster-pv18/*
umount /mnt/gluster-pv18

# enable write-behind
gluster volume set gluster-pv18 performance.write-behind on

# mount the volume
mount -t glusterfs <gluster-server>:/gluster-pv18 /mnt/gluster-pv18

# start Postgres
docker run --name psql-test --rm -v /mnt/gluster-pv18:/var/lib/postgresql/data docker.io/postgres:9.5
# !!! this will now fail:

# creating template1 database in /var/lib/postgresql/data/base/1 ... ok
# initializing pg_authid ... LOG:  invalid primary checkpoint record
# LOG:  invalid secondary checkpoint record
# PANIC:  could not locate a valid checkpoint record
# Aborted (core dumped)
# child process exited with exit code 134
# initdb: removing contents of data directory "/var/lib/postgresql/data"
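
Since the outcome depends on the volume option, it may help to confirm its value before each run. A minimal sketch, assuming the usual two-column (Option, Value) output of `gluster volume get`; the helper name `option_value` is made up for this example.

```shell
#!/bin/sh
# Extract the value of one option from `gluster volume get` output read on stdin.
# Assumes each option row has the form "<option-name> <value>".
option_value() {
    awk -v key="$1" '$1 == key { print $2 }'
}

# Usage against a real cluster (needs the gluster CLI, not run here):
#   gluster volume get gluster-pv18 performance.write-behind \
#       | option_value performance.write-behind
```

The helper prints nothing if the option name does not appear in the input, so an empty result should be treated as "unknown" rather than "off".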

Comment 6 mhutter 2019-02-01 10:39:26 UTC
Created attachment 1525793 [details]
dump-fuse, gzipped

Comment 7 mhutter 2019-02-01 10:41:10 UTC
Created attachment 1525794 [details]
strace of initdb (which crashed)

Also interesting: while creating the TGZ archive (not on the gluster volume) of all strace files (which were on the gluster volume), a lot of messages like this appeared:

tar: strace/initdb.42: file changed as we read it

Comment 8 mhutter 2019-03-18 06:37:56 UTC
Hi, were you able to reproduce the issue?

Comment 9 gabisoft 2019-03-26 13:01:28 UTC
Could not reproduce this issue anymore with Gluster 5.5 and Etcd, Cassandra and PostgreSQL.

Comment 10 Raghavendra G 2019-03-26 14:17:28 UTC
(In reply to gabisoft from comment #9)
> Could not reproduce this issue anymore with Gluster 5.5 and Etcd, Cassandra
> and PostgreSQL.

It's likely that the fixes for bz 1512691 have helped. Can you please close the bug?