Bug 1665216 - Databases crash on Gluster 5 with the option performance.write-behind enabled
Summary: Databases crash on Gluster 5 with the option performance.write-behind enabled
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: GlusterFS
Classification: Community
Component: write-behind
Version: 5
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Raghavendra G
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-01-10 17:13 UTC by gabisoft
Modified: 2019-03-26 14:23 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-26 14:23:09 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
asb-etcd-1-smjcf.log (9.68 KB, application/octet-stream)
2019-01-10 17:13 UTC, gabisoft
asb-etcd-3-dsfxf.log (2.92 KB, application/octet-stream)
2019-01-10 17:14 UTC, gabisoft
dump-fuse, gzipped (8.22 MB, application/gzip)
2019-02-01 10:39 UTC, mhutter
strace of initdb (which crashed) (508.51 KB, application/gzip)
2019-02-01 10:41 UTC, mhutter

Description gabisoft 2019-01-10 17:13:52 UTC
Created attachment 1519880
asb-etcd-1-smjcf.log

Description of problem:
Etcd, Cassandra and PostgreSQL each print a stack trace shortly after starting when their DB files are on a Gluster 5.2 volume that has the volume option performance.write-behind enabled. Serving ordinary files from the same Gluster volumes does not trigger the issue.

Version-Release number of selected component (if applicable):
5.2


How reproducible:


Steps to Reproduce:
1. Start Etcd with its DB files on a Gluster volume that has the option performance.write-behind on
2. Etcd starts, then crashes after it begins listening for clients (unexpected fault address 0x7fca0c001040)
3. Disable performance.write-behind on the Gluster volume
4. Restart Etcd
5. Etcd starts normally
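
The steps above can be sketched as a small shell helper. This is only an illustration, not the reporter's actual setup: the volume name `myvol` and the data directory `/mnt/myvol/etcd` are hypothetical placeholders, and the helper prints the commands by default instead of running them (set DRY_RUN=0 to execute).

```shell
#!/bin/sh
# Sketch only: VOLUME and DATA_DIR are hypothetical placeholders,
# not values taken from this report.
VOLUME="${VOLUME:-myvol}"
DATA_DIR="${DATA_DIR:-/mnt/myvol/etcd}"

# Print commands instead of running them unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Set performance.write-behind to "on" or "off", then show the stored value.
set_write_behind() {
    run gluster volume set "$VOLUME" performance.write-behind "$1"
    run gluster volume get "$VOLUME" performance.write-behind
}

# Start Etcd with its DB files on the mounted Gluster volume.
start_etcd() {
    run etcd --data-dir "$DATA_DIR"
}
```

Calling `set_write_behind on` followed by `start_etcd` would correspond to steps 1 and 2, and `set_write_behind off` followed by `start_etcd` to steps 3 to 5.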

Actual results:
Output of Etcd crashing (asb-etcd-1-smjcf.log)

Expected results:
Output of a Etcd running with performance.write-behind off (asb-etcd-3-dsfxf.log)


Additional info:
The content or size of the Etcd DB doesn't matter. It is also reproducible if the DB is created from scratch.

Comment 1 gabisoft 2019-01-10 17:14:53 UTC
Created attachment 1519881
asb-etcd-3-dsfxf.log

Comment 2 Raghavendra G 2019-01-11 03:47:45 UTC
Can you paste the backtrace here? If possible can you attach the core?

Comment 3 Raghavendra G 2019-01-11 03:51:36 UTC
Sorry, I misread the bug as glusterfs crashing. I see now that it is etcd that has problems coming up. Can you get the following information (I don't need a core of glusterfs, as there is none):

* strace of etcd (strace -ff -v ...), to find out what syscalls it did.
* dump of traffic between fuse kernel module and glusterfs (see --dump-fuse option of glusterfs)
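
A minimal sketch of how the two captures requested above might be collected. The server name, volume name, mount point and output directory are hypothetical placeholders, and writing the captures to a local directory rather than the Gluster volume is an assumption, not something requested in this comment. Commands are printed by default (set DRY_RUN=0 to execute).

```shell
#!/bin/sh
# Sketch only: all names and paths below are hypothetical placeholders.
SERVER="${SERVER:-gluster-server}"
VOLUME="${VOLUME:-myvol}"
MOUNTPOINT="${MOUNTPOINT:-/mnt/myvol}"
OUT_DIR="${OUT_DIR:-/tmp/wb-debug}"   # assumption: a local directory, not on Gluster

# Print commands instead of running them unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Mount with the glusterfs client directly so that the traffic between the
# fuse kernel module and glusterfs is dumped to a file.
mount_with_fuse_dump() {
    run glusterfs --volfile-server="$SERVER" --volfile-id="$VOLUME" \
        --dump-fuse="$OUT_DIR/fuse.dump" "$MOUNTPOINT"
}

# Trace a command and its children; with -ff and -o, strace writes one
# file per process, named <prefix>.<pid>.
trace_command() {
    run strace -ff -v -o "$OUT_DIR/strace" "$@"
}
```

For example, `mount_with_fuse_dump` followed by `trace_command etcd --data-dir /mnt/myvol/etcd` would produce a fuse dump plus one strace file per etcd process.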

Comment 4 Raghavendra G 2019-01-11 14:05:38 UTC
Also, detailed steps for a reproducer (even better, a script or a capture of the commands you executed) would greatly speed up the debugging.

Comment 5 mhutter 2019-02-01 10:23:14 UTC
Reproduction case: Exactly as described in the original Ticket.


# Prepare gluster volume
gluster volume set gluster-pv18 performance.write-behind off

# mount the volume
mount -t glusterfs <gluster-server>:/gluster-pv18 /mnt/gluster-pv18

# start Postgres
docker run --name psql-test --rm -v /mnt/gluster-pv18:/var/lib/postgresql/data docker.io/postgres:9.5
# this should work as expected

# clean up
docker stop psql-test
rm -rf /mnt/gluster-pv18/*
umount /mnt/gluster-pv18

# enable write-behind
gluster volume set gluster-pv18 performance.write-behind on

# mount the volume
mount -t glusterfs <gluster-server>:/gluster-pv18 /mnt/gluster-pv18

# start Postgres
docker run --name psql-test --rm -v /mnt/gluster-pv18:/var/lib/postgresql/data docker.io/postgres:9.5
# !!! this will now fail:

# creating template1 database in /var/lib/postgresql/data/base/1 ... ok
# initializing pg_authid ... LOG:  invalid primary checkpoint record
# LOG:  invalid secondary checkpoint record
# PANIC:  could not locate a valid checkpoint record
# Aborted (core dumped)
# child process exited with exit code 134
# initdb: removing contents of data directory "/var/lib/postgresql/data"

Comment 6 mhutter 2019-02-01 10:39:26 UTC
Created attachment 1525793
dump-fuse, gzipped

Comment 7 mhutter 2019-02-01 10:41:10 UTC
Created attachment 1525794
strace of initdb (which crashed)

Also interesting: while creating the TGZ archive (not on the gluster volume) of all strace files (which were on the gluster volume), a lot of messages like this appeared:

tar: strace/initdb.42: file changed as we read it

Comment 8 mhutter 2019-03-18 06:37:56 UTC
Hi, were you able to reproduce the issue?

Comment 9 gabisoft 2019-03-26 13:01:28 UTC
Could not reproduce this issue anymore with Gluster 5.5 and Etcd, Cassandra and PostgreSQL.

Comment 10 Raghavendra G 2019-03-26 14:17:28 UTC
(In reply to gabisoft from comment #9)
> Could not reproduce this issue anymore with Gluster 5.5 and Etcd, Cassandra
> and PostgreSQL.

It's likely that the fixes for bz 1512691 have helped. Can you please close the bug?

