Bug 1285594
Description  todd_lewis  2015-11-26 01:01:57 UTC
Created attachments:
  1099042: backtrace
  1099043: cgroup
  1099044: core_backtrace
  1099045: dso_list
  1099046: environ
  1099047: exploitable
  1099048: limits
  1099049: maps
  1099050: mountinfo
  1099051: namespaces
  1099052: open_fds
  1099053: proc_pid_status
  1099054: var_log_messages
Thanks for your bug report. There are several useful warning messages in the metadata code. Do you see any? Please capture a log in the following way; I hope it will contain some relevant warnings showing what exactly is wrong:

1/ run "journalctl -f | tee ~/journal.log"
2/ reproduce the bug
3/ attach journal.log from your home directory

It looks like some metadata database file is corrupted. Could you make an archive of the following folder and attach it here for further analysis, please? You can do it this way:

$ zip gvfs-metadata.zip -r ~/.local/share/gvfs-metadata

You said that files in your home folder are not affected. What paths are affected? As a workaround you can remove (rename/back up) the corrupted database file to avoid the crashes, but which database file is corrupted depends on which paths are affected. Or you can remove (rename/back up) the whole .local/share/gvfs-metadata folder, but then you will consequently lose all the metadata...

Created attachment 1100057 [details]
requested journalctl.log
journalctl -f output from two generated SEGVs, as requested.
Created attachment 1100058 [details]
requested gvfs-metadata.zip
gvfs-metadata.zip as requested
(In reply to Ondrej Holy from comment #14)

I have added the two files you requested as attachments. Informed by your comments, I was able to test systematically, and I can confirm that the presence of one particular pair of files in the .local/share/gvfs-metadata directory results in the segfaults, and removing that pair reliably makes the problem go away. That pair matches the pattern "uuid-ac16*", and it is included in the attached gvfs-metadata.zip. I'm hopeful that with this culprit file in hand you'll be able to make gvfsd-metadata more robust. In the meantime, my wife's Fedora experience will be greatly improved. Thanks for the help.

You also asked what paths were affected. We have a shared work space mounted at /work, and in all my testing tonight everything that was affected was under /work; whether reached through a symlink to ../../../work/blah1/blah2 or directly as /work/blah1/blah2 didn't seem to make any difference. Rather than removing all the files from .local/share/gvfs-metadata, I've chosen to remove the minimal amount possible that still affords an acceptable resolution. If I run into additional relevant info before this ticket is closed, I'll add it. But for the moment, things seem to be working well.

Thanks for the feedback! I will try to investigate what is wrong using the corrupted database file. Do you remember what might have caused the database file corruption? Did the computer run out of disk space? Or what were you doing when you first saw the bug? Unfortunately the journalctl log doesn't contain what I expected; could you provide the whole log from the boot (from the day you saw the bug), please? You can list the boots using:

journalctl --list-boots -e

and then export a specific one with (where <ID> is the negative number, e.g. -1, you got from the previous command):

journalctl --boot=<ID> > journalctl.log

Is it really uuid-ac16*? It doesn't seem to be corrupted, and it doesn't crash in my test environment. I use Nautilus and I don't test with NFS, while it seems you are using Caja and the affected files are on NFS; however, both should be irrelevant. Is ~/.local/share/gvfs-metadata on a local drive, or on NFS?

It really is uuid-ac16*, the presence of which causes the segfault. I wouldn't expect it to cause a problem on other systems, as that file name matches the UUID of the filesystem on which the error appeared. Behold:

> # blkid | grep ac16
> /dev/sda7: LABEL="work2" UUID="ac162a19-f3fa-48c1-b1fa-2fcf20c1cb59" TYPE="ext4" PARTUUID="0004fb74-07"

It appears only file operations within my "work2" partition (mounted at /work) would trigger the failure. I initially indicated that the problem showed up on automounted NFS partitions imported from another host, but I was not able to replicate that while working on comment #17. I hesitate to attach the journalctl.log file as it's rather large and contains no gvfsd-metadata messages different from those already posted. For the record: --boot=-26 was my last Fedora 22 boot; --boot=-25 was my first Fedora 23 boot, but no GUI was run then; --boot=-24 contains the first occurrence of the gvfsd-metadata segfault. So the seminal event appears to be the upgrade from fc22 to fc23. If you really really really want the journalctl.log as per comment #18, I can attach it, but (1) it's huge and (2) it contains nothing relevant that hasn't already been posted. Strictly speaking, I don't claim the database file is corrupted, only that gvfsd-metadata crashes in allocator_memalign() if it's present.
I don't know what combination of corner cases on my filesystem and that file conspire to trigger the fault.

Thanks for the reply. The mere presence of this file clearly doesn't cause problems on other systems. I renamed the file to match an existing filesystem, performed some operations and let the metadata daemon update the file... however, the updated database file is written successfully. Ok, forget about journalctl.log; please just confirm that the output from the following is empty:

journalctl -e -a SYSLOG_IDENTIFIER=org.gtk.vfs.Metadata

I supposed that the file was corrupted, but it seems to be valid. Could you please try to get a log from valgrind? It might be really helpful...

1/ Be sure the metadata daemon is not running (to avoid overwriting): pkill gvfsd-metadata
2/ Restore the buggy database files, ideally using command-line tools (i.e. cp, mv).
3/ Run the metadata daemon under valgrind: valgrind --track-origins=yes --log-file=valgrind.log /usr/libexec/gvfsd-metadata --replace
4/ Reproduce the crash
5/ Provide valgrind.log

Created attachment 1101518 [details]
Output from valgrind --track-origins=yes --log-file=/tmp/valgrind.log /usr/libexec/gvfsd-metadata --replace
valgrind output indicates a blown stack. Thanks for the concise instructions for producing valgrind.log. Could it be sensitive to the directory hierarchy on that filesystem? That is, if you're trying to reproduce the fault, might you need a matching directory tree? (Grasping at straws here.)
I can confirm that
journalctl -e -a SYSLOG_IDENTIFIER=org.gtk.vfs.Metadata
is empty.
Created attachment 1101522 [details]
tarball of directory structure
The attached tarball contains the directory structure of the problematic filesystem, produced by
# find . -type d -print0 | sudo xargs -0 tar cvf /tmp/work-dirs.tar --no-recursion
at the root of the /work partition.
Looking at the backtrace, the issue is clearly that meta_file_copy_into() recurses forever until we're out of stack space.
Take the last part of the meta_builder_copy recursion from the backtrace:
#16 meta_file_copy_into (src=src@entry=0x55c89c16db20, dest=0x55c89c16f610, mtime=mtime@entry=1425164163) at metabuilder.c:285
src_child = <optimized out>
dest_child = <optimized out>
l = 0x55c89c16dc80
#17 0x000055c88218da5d in meta_file_copy_into (src=src@entry=0x55c89c16dad0, dest=0x55c89c16f5c0, mtime=mtime@entry=1425164163) at metabuilder.c:291
src_child = 0x55c89c16db20
dest_child = <optimized out>
l = 0x55c89c16dc00
#18 0x000055c88218da5d in meta_file_copy_into (src=src@entry=0x55c89c16c830, dest=0x55c89c16db20, mtime=mtime@entry=1425164163) at metabuilder.c:291
src_child = 0x55c89c16dad0
dest_child = <optimized out>
l = 0x55c89c16c5a0
In particular, notice what happens in frame #18 when we're copying a child of $src into $dest: it creates a new dest child and passes it to the copy in frame #17. The copy in #17 then finds a child of its $src and calls copy_into with it as src. However, that src is now the dest_child from frame #18 (same pointer value)!?
How can this happen? The source tree we're copying from should never contain the new children we've only just allocated.
So, my guess is that there is a COPY_PATH operation in the journal file for that tree, and when we replay the journal we hit meta_builder_copy() with some arguments that cause it to recurse forever in meta_file_copy_into(). I can't immediately see why this happens, but maybe it's a copy into itself in some way. It's probably something like a copy operation from /foo/bar/* into /foo/: the recursion hits bar, copies it to foo, and then continues on the other children of foo, now seeing the new child. Actually, it's probably the other way around: copy /a_dir into /b_dir, and then recurse into the next child, which is b_dir. Eh, copies /a_dir into /a_dir/b_dir, I mean.

Anyway, looking at meta_builder_copy + meta_file_copy_into there are two obvious problems:
1) It's traversing the tree while at the same time possibly writing to it.
2) When copying a file it always creates a new MetaFile and inserts it into the tree, which could cause duplicate files if one already existed with that name.

meta_builder_copy() is used to implement a local move (rename(foo, bar) == copy(foo, bar) + remove(foo)), and as such it should overwrite any target destination. So, I think the correct fix here is to do the entire copy into a MetaFile that is not inserted into the source tree at all, and then, when that copy is done, insert it at the target position, freeing whatever MetaFile tree was there before (a rough sketch of this approach appears below, after this discussion).

It makes sense to me and the solution is obvious, so I will prepare a fix for it, but I wonder how this can happen. This is called when you move, rename, or trash a file, yet all those operations should fail if you try them with such parameters...

Yeah, with a 1-1 mapping of the filesystem and the metadata tree, the operation leading to this copy should fail. However, maybe some combination of symlinks and divergence between the on-disk structure and the metatree hierarchy allowed it to succeed. Might be possible to figure out by analyzing the db + journal.

I have proposed an upstream patch, can you take a look, please?

Original poster here. Just another data point: it started happening again yesterday while (I'm pretty sure) using the "rename" function of gthumb. After deleting a 5 GB backtrace from /var/spool/abrt and deleting ~/.local/share/gvfs-metadata/uuid-ac16* (same file as before), the system behavior was back to normal. Using gthumb's rename function again later threw the system into the same state. The same fix worked around the problem. I'm hopeful your patch solves this issue. Thanks for continuing to dig into this.

If you are able to reproduce the bug, e.g. using gthumb after you removed the metadata files, please write down the exact steps to reproduce, including the details about file paths (otherwise I plan to analyze the attached files once I have some time). Be sure that the metadata daemon is not running when you remove the database files:
1) pkill gvfsd-metadata
2) rm ~/.local/share/gvfs-metadata/uuid-ac16*
3) do something with gthumb (the metadata daemon should be autospawned)?

I made a test build with the mentioned patch if you want to try it: http://koji.fedoraproject.org/koji/taskinfo?taskID=12195299

(In reply to Ondrej Holy from comment #34)
> If you are able to reproduce the bug e.g. using gthumb after you removed the
> metadata files [...]

I am no longer able to reproduce this bug.

Ok, let's close it, the patch is part of gvfs-1.26.3 anyway.
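To illustrate the fix direction described in the discussion above (do the whole copy into a detached node and only splice it into the tree once the copy is finished), here is a rough, simplified C sketch. The MetaNode structure and helper functions are illustrative assumptions made for this report, not the actual gvfs metabuilder API; the real MetaFile also carries mtimes, string data and journal state that are omitted here.

/* Sketch only: a simplified stand-in for the gvfs metadata tree, showing
 * the "copy detached, attach last" idea; not the real metabuilder code. */
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>

typedef struct MetaNode MetaNode;
struct MetaNode {
  char     *name;
  MetaNode *children;  /* first child */
  MetaNode *next;      /* next sibling */
};

static MetaNode *
node_new (const char *name)
{
  MetaNode *n = calloc (1, sizeof *n);
  n->name = strdup (name);
  return n;
}

static void
node_free (MetaNode *n)
{
  for (MetaNode *c = n->children; c != NULL; )
    {
      MetaNode *next = c->next;
      node_free (c);
      c = next;
    }
  free (n->name);
  free (n);
}

/* Copy src recursively into a freshly allocated node that is NOT linked
 * into any tree yet, so the traversal can never run into nodes that this
 * very copy has just created (the cause of the endless recursion above). */
static MetaNode *
node_copy (const MetaNode *src)
{
  MetaNode *copy = node_new (src->name);
  for (const MetaNode *c = src->children; c != NULL; c = c->next)
    {
      MetaNode *child_copy = node_copy (c);
      child_copy->next = copy->children;
      copy->children = child_copy;
    }
  return copy;
}

/* Attach the finished copy under dest as "name", freeing any existing
 * child of that name first (move semantics: the target is overwritten). */
static void
node_attach (MetaNode *dest, MetaNode *copy, const char *name)
{
  for (MetaNode **link = &dest->children; *link != NULL; )
    {
      if (strcmp ((*link)->name, name) == 0)
        {
          MetaNode *old = *link;
          *link = old->next;
          node_free (old);
        }
      else
        link = &(*link)->next;
    }
  free (copy->name);
  copy->name = strdup (name);
  copy->next = dest->children;
  dest->children = copy;
}

/* The rename/move path then becomes: copy first (detached), attach last. */
static void
tree_copy (const MetaNode *src, MetaNode *dest_parent, const char *dest_name)
{
  MetaNode *copy = node_copy (src);      /* never touches dest_parent */
  node_attach (dest_parent, copy, dest_name);
}

int
main (void)
{
  MetaNode *root = node_new ("/");
  MetaNode *a = node_new ("a_dir");
  a->children = node_new ("some_file");
  root->children = a;

  /* Even the degenerate "copy a_dir into a child of itself" terminates,
   * because only the original, unmodified subtree is ever traversed. */
  tree_copy (a, a, "b_dir");

  node_free (root);
  return 0;
}

The main() above exercises a stand-in for the kind of self-copy hypothesized in the analysis (copying /a_dir into /a_dir/b_dir); with the detached-copy approach the traversal only ever reads the original subtree, so it terminates instead of recursing until the stack is exhausted.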
Similar problem has been detected: Trying to access data on Google Nexus phone

reporter:         libreport-2.6.4
backtrace_rating: 4
cmdline:          /usr/libexec/gvfsd-mtp --spawner :1.9 /org/gtk/gvfs/exec_spaw/28
crash_function:   allocator_memalign
executable:       /usr/libexec/gvfsd-mtp
global_pid:       15837
kernel:           4.4.7-300.fc23.x86_64
package:          gvfs-mtp-1.26.3-1.fc23
reason:           gvfsd-mtp killed by SIGSEGV
runlevel:         N 5
type:             CCpp
uid:              1000