Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1665216

Summary: Databases crashes on Gluster 5 with the option performance.write-behind enabled
Product: [Community] GlusterFS
Reporter: gabisoft
Component: write-behind
Assignee: Raghavendra G <rgowdapp>
Status: CLOSED WORKSFORME
QA Contact:
Severity: high
Docs Contact:
Priority: unspecified
Version:
CC: bugs, bugzilla.redhat.com, gabisoft, rgowdapp
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-03-26 14:23:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Attachments:

Description	Flags
asb-etcd-1-smjcf.log	none
asb-etcd-3-dsfxf.log	none
dump-fuse, gzipped	none
strace of initdb (which crashed)	none

Description gabisoft 2019-01-10 17:13:52 UTC

Created attachment 1519880
asb-etcd-1-smjcf.log

Description of problem:
Etcd, Cassandra and PostgreSQL each show a stack trace after starting when their DB files are on a Gluster 5.2 volume that has the volume option performance.write-behind enabled. Using the same Gluster volumes to serve normal files does not trigger the issue.

Version-Release number of selected component (if applicable):
5.2


How reproducible:


Steps to Reproduce:
1. Start Etcd with its DB files on a Gluster volume where the option performance.write-behind is on
2. Etcd starts, then crashes once it is listening for clients (unexpected fault address 0x7fca0c001040)
3. Disable performance.write-behind on the Gluster volume
4. Restart Etcd
5. Etcd starts normally
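
The toggle in steps 1 and 3 can be sketched as below. This is a minimal sketch, not something taken from the report: the helper names `wb_cmd` and `wb_show_cmd` and the volume name `myvol` are hypothetical, and it assumes the gluster CLI is installed on a node of the trusted pool. By default the helpers only print the command so it can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helpers (not from the report): print, or with DRY_RUN=0 run,
# the gluster commands that switch and inspect performance.write-behind.
# Assumption: the volume already exists and the gluster CLI is available.

wb_cmd() {
    vol=$1
    state=$2
    case "$state" in
        on|off) ;;
        *) echo "usage: wb_cmd VOLUME on|off" >&2; return 2 ;;
    esac
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run (default): only show what would be executed.
        echo "gluster volume set $vol performance.write-behind $state"
    else
        gluster volume set "$vol" performance.write-behind "$state"
    fi
}

wb_show_cmd() {
    vol=$1
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "gluster volume get $vol performance.write-behind"
    else
        gluster volume get "$vol" performance.write-behind
    fi
}

# Example: review the commands for a volume called "myvol" (placeholder).
wb_cmd myvol off
wb_show_cmd myvol
```

Running the script as-is prints the two commands; setting DRY_RUN=0 in the environment executes them instead.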

Actual results:
Output of Etcd crashing (asb-etcd-1-smjcf.log)

Expected results:
Output of Etcd running with performance.write-behind off (asb-etcd-3-dsfxf.log)


Additional info:
The content or size of the Etcd DB doesn't matter. It is also reproducible if the DB is created from scratch.

Comment 1 gabisoft 2019-01-10 17:14:53 UTC

Created attachment 1519881
asb-etcd-3-dsfxf.log

Comment 2 Raghavendra G 2019-01-11 03:47:45 UTC

Can you paste the backtrace here? If possible can you attach the core?

Comment 3 Raghavendra G 2019-01-11 03:51:36 UTC

Sorry, I misinterpreted the bug as glusterfs crashing. I see now that it is etcd that has problems coming up. Can you get the following information (I don't need a core of glusterfs, as there is none):

* strace of etcd (strace -ff -v ...), to find out what syscalls it did.
* dump of traffic between fuse kernel module and glusterfs (see --dump-fuse option of glusterfs)
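
The two captures requested above could be collected roughly as follows. This is a sketch under assumptions: the server name, volume name, mount point, output directory and the etcd data directory are placeholders, not values from this report, and the helper functions are hypothetical. They only print the command lines so they can be checked before being run by hand.

```shell
#!/bin/sh
# Placeholders (assumptions, not values from the report).
SERVER=gluster-server
VOLUME=myvol
MOUNTPOINT=/mnt/myvol
OUTDIR=/tmp/wb-debug

# Print a glusterfs client invocation that mounts the volume and dumps the
# traffic between the fuse kernel module and glusterfs to a file.
fuse_dump_cmd() {
    echo "glusterfs --volfile-server=$SERVER --volfile-id=$VOLUME --dump-fuse=$OUTDIR/fuse.dump $MOUNTPOINT"
}

# Print an strace invocation for the given command line.
# -ff follows child processes and, combined with -o, writes one output
# file per PID (trace.<pid>); -v prints unabbreviated structures.
strace_cmd() {
    echo "strace -ff -v -o $OUTDIR/trace $*"
}

fuse_dump_cmd
strace_cmd etcd --data-dir "$MOUNTPOINT/etcd"
```

The volume would be mounted with the first command in place of the usual `mount -t glusterfs`, and the database then started under the second one.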

Comment 4 Raghavendra G 2019-01-11 14:05:38 UTC

Also, detailed reproduction steps (even better, a script or a capture of the commands you executed) would greatly speed up debugging.

Comment 5 mhutter 2019-02-01 10:23:14 UTC

Reproduction case: Exactly as described in the original Ticket.


# Prepare gluster volume
gluster volume set gluster-pv18 performance.write-behind off

# mount the volume
mount -t glusterfs <gluster-server>:/gluster-pv18 /mnt/gluster-pv18

# start Postgres
docker run --name psql-test --rm -v /mnt/gluster-pv18:/var/lib/postgresql/data docker.io/postgres:9.5
# this should work as expected

# clean up
docker stop psql-test
rm -rf /mnt/gluster-pv18/*
umount /mnt/gluster-pv18

# enable write-behind
gluster volume set gluster-pv18 performance.write-behind on

# mount the volume
mount -t glusterfs <gluster-server>:/gluster-pv18 /mnt/gluster-pv18

# start Postgres
docker run --name psql-test --rm -v /mnt/gluster-pv18:/var/lib/postgresql/data docker.io/postgres:9.5
# !!! this will now fail:

# creating template1 database in /var/lib/postgresql/data/base/1 ... ok
# initializing pg_authid ... LOG:  invalid primary checkpoint record
# LOG:  invalid secondary checkpoint record
# PANIC:  could not locate a valid checkpoint record
# Aborted (core dumped)
# child process exited with exit code 134
# initdb: removing contents of data directory "/var/lib/postgresql/data"

Comment 6 mhutter 2019-02-01 10:39:26 UTC

Created attachment 1525793
dump-fuse, gzipped

Comment 7 mhutter 2019-02-01 10:41:10 UTC

Created attachment 1525794
strace of initdb (which crashed)

Also interesting: while creating the TGZ archive (not on the gluster volume) of all strace files (which were on the gluster volume), a lot of messages like this appeared:

tar: strace/initdb.42: file changed as we read it

Comment 8 mhutter 2019-03-18 06:37:56 UTC

Hi, were you able to reproduce the issue?

Comment 9 gabisoft 2019-03-26 13:01:28 UTC

Could not reproduce this issue anymore with Gluster 5.5 and Etcd, Cassandra and PostgreSQL.

Comment 10 Raghavendra G 2019-03-26 14:17:28 UTC

(In reply to gabisoft from comment #9)
> Could not reproduce this issue anymore with Gluster 5.5 and Etcd, Cassandra
> and PostgreSQL.

It's likely that the fixes for bz 1512691 have helped. Can you please close the bug?