Bug 1285594
Description  todd_lewis  2015-11-26 01:01:57 UTC
Created attachments:
  1099042: backtrace
  1099043: cgroup
  1099044: core_backtrace
  1099045: dso_list
  1099046: environ
  1099047: exploitable
  1099048: limits
  1099049: maps
  1099050: mountinfo
  1099051: namespaces
  1099052: open_fds
  1099053: proc_pid_status
  1099054: var_log_messages
Thanks for your bug report. There are several useful warning messages in the metadata code. Do you see any? Please capture a log in the following way; I hope it will contain some relevant warnings showing what exactly is wrong:

1/ run "journalctl -f | tee ~/journal.log"
2/ reproduce the bug
3/ attach journal.log from your home directory

It looks like some metadata database file is corrupted. Could you make an archive of the following folder and attach it here for further analysis, please? You can do it this way:

$ zip gvfs-metadata.zip -r ~/.local/share/gvfs-metadata

You said that files in your home folder are not affected. What paths are affected? As a workaround you can remove (rename/back up) the corrupted database file to avoid the crashes, but which database file is corrupted depends on which paths are affected. Or you can remove (rename/back up) the whole .local/share/gvfs-metadata folder, but then you will consequently lose all the metadata...

Created attachment 1100057 [details]
requested journalctl.log
journalctl -f output from two generated SEGVs, as requested.
Created attachment 1100058 [details]
requested gvfs-metadata.zip
gvfs-metadata.zip as requested
(In reply to Ondrej Holy from comment #14)

I have added the two files you requested as attachments. Informed by your comments, I was able to test systematically, and I can confirm that the presence of one particular pair of files in the .local/share/gvfs-metadata directory results in the segfaults, and removing that pair reliably makes the problem go away. That pair matches the pattern "uuid-ac16*", and it is included in the attached gvfs-metadata.zip. I'm hopeful that with this culprit file in hand you'll be able to make gvfsd-metadata more robust. In the meantime, my wife's Fedora experience will be greatly improved. Thanks for the help.

You also asked what paths were affected. We have a shared work space mounted at /work, and in all my testing tonight everything that was affected was under /work; whether reached through a symlink to ../../../work/blah1/blah2 or directly as /work/blah1/blah2 didn't seem to make any difference. Rather than removing all the files from .local/share/gvfs-metadata, I've chosen to remove the minimal amount possible that still affords an acceptable resolution. If I run into additional relevant info before this ticket is closed, I'll add it. But for the moment, things seem to be working well.

Thanks for the feedback! I will try to investigate what is wrong using the corrupted database file. Do you remember what might have caused the database file corruption? Did the computer run out of disk space? Or what were you doing when you first saw the bug? Unfortunately the journalctl log doesn't contain what I expected; could you provide the whole log from the boot (from the day you saw the bug), please? You can list the boots using:

journalctl --list-boots -e

and then export a specific one with (where <ID> is the negative number, e.g. -1, you got from the previous command):

journalctl --boot=<ID> > journalctl.log

Is it really uuid-ac16*? It doesn't seem to be corrupted, and it doesn't crash in my test environment. I use Nautilus and I don't test with NFS, while it seems you are using Caja and the affected files are on NFS; however, both should be irrelevant. Is ~/.local/share/gvfs-metadata on a local drive, or on NFS?

It really is uuid-ac16*, the presence of which causes the segfault. I wouldn't expect it to cause a problem on other systems, as that file name matches the UUID of the filesystem on which the error appeared. Behold:

> # blkid | grep ac16
> /dev/sda7: LABEL="work2" UUID="ac162a19-f3fa-48c1-b1fa-2fcf20c1cb59" TYPE="ext4" PARTUUID="0004fb74-07"

It appears only file operations within my "work2" partition (mounted at /work) would trigger the failure. I initially indicated that the problem showed up on automounted NFS partitions imported from another host, but I was not able to replicate that while working on comment #17. I hesitate to attach the journalctl.log file as it's rather large and contains no gvfsd-metadata messages different from those already posted. For the record: --boot=-26 was my last Fedora 22 boot; --boot=-25 was my first Fedora 23 boot, but no GUI was run then; --boot=-24 contains the first occurrence of the gvfsd-metadata segfault. So the seminal event appears to be the upgrade from fc22 to fc23. If you really really really want the journalctl.log as per comment #18, I can attach it, but (1) it's huge and (2) it contains nothing relevant that hasn't already been posted. Strictly speaking, I don't claim the database file is corrupted, only that gvfsd-metadata crashes in allocator_memalign() if it's present.
I don't know what combination of corner cases on my filesystem and that file conspire to trigger the fault.

Thanks for the reply. The mere presence of this file clearly doesn't cause problems on other systems. I renamed the file to match an existing filesystem, performed some operations and let the metadata daemon update the file... however, the updated database file is written successfully. Ok, forget about journalctl.log; please just confirm that the output from the following is empty:

journalctl -e -a SYSLOG_IDENTIFIER=org.gtk.vfs.Metadata

I supposed that the file was corrupted, but it seems to be valid. Could you please try to get a log from valgrind? It might be really helpful...

1/ Be sure the metadata daemon is not running (to avoid overwriting): pkill gvfsd-metadata
2/ Restore the buggy database files, ideally using command-line tools (i.e. cp, mv).
3/ Run the metadata daemon under valgrind: valgrind --track-origins=yes --log-file=valgrind.log /usr/libexec/gvfsd-metadata --replace
4/ Reproduce the crash
5/ Provide valgrind.log

Created attachment 1101518 [details]
Output from valgrind --track-origins=yes --log-file=/tmp/valgrind.log /usr/libexec/gvfsd-metadata --replace
valgrind output indicates a blown stack. Thanks for the concise instructions for producing valgrind.log. Could it be sensitive to the directory hierarchy on that filesystem? That is, if you're trying to reproduce the fault, might you need a matching directory tree? (Grasping at straws here.)
I can confirm that
journalctl -e -a SYSLOG_IDENTIFIER=org.gtk.vfs.Metadata
is empty.
Created attachment 1101522 [details]
tarball of directory structure
The attached tarball contains the directory structure of the problematic filesystem, produced by
# find . -type d -print0 | sudo xargs -0 tar cvf /tmp/work-dirs.tar --no-recursion
at the root of the /work partition.
Looking at the backtrace, the issue is clearly that meta_file_copy_into() recurses forever until we're out of stack space.
Take the last part of the meta_builder_copy recursion from the backtrace:
#16 meta_file_copy_into (src=src@entry=0x55c89c16db20, dest=0x55c89c16f610, mtime=mtime@entry=1425164163) at metabuilder.c:285
src_child = <optimized out>
dest_child = <optimized out>
l = 0x55c89c16dc80
#17 0x000055c88218da5d in meta_file_copy_into (src=src@entry=0x55c89c16dad0, dest=0x55c89c16f5c0, mtime=mtime@entry=1425164163) at metabuilder.c:291
src_child = 0x55c89c16db20
dest_child = <optimized out>
l = 0x55c89c16dc00
#18 0x000055c88218da5d in meta_file_copy_into (src=src@entry=0x55c89c16c830, dest=0x55c89c16db20, mtime=mtime@entry=1425164163) at metabuilder.c:291
src_child = 0x55c89c16dad0
dest_child = <optimized out>
l = 0x55c89c16c5a0
In particular, notice what happens in frame #18 when we're copying a child of $src into $dest: it creates a new dest child and passes it to the copy in frame #17. The copy in #17 then finds a child of its $src and calls copy_into with it as src. However, that src is now the dest_child from frame #18 (same pointer value)!?
How can this happen? The source tree we're copying from should never contain the new children we've only just allocated.
So, my guess is that there is a COPY_PATH operation in the journal file for that tree, and when we replay the journal we hit meta_builder_copy() with some arguments that cause it to recurse forever in meta_file_copy_into(). I can't immediately see why this happens, but maybe it's a copy into itself in some way. It's probably something like a copy operation from /foo/bar/* into /foo/: the recursion hits bar, copies it to foo, and then continues on the other children of foo, now seeing the new child. Actually, it's probably the other way around: copy /a_dir into /b_dir, and then recurse into the next child, which is b_dir. Eh, copies /a_dir into /a_dir/b_dir, I mean.

Anyway, looking at meta_builder_copy + meta_file_copy_into there are two obvious problems:
1) It's traversing the tree while at the same time possibly writing to it.
2) When copying a file it always creates a new MetaFile and inserts it into the tree, which could cause duplicate files if one already existed with that name.

meta_builder_copy() is used to implement a local move (rename(foo, bar) == copy(foo, bar) + remove(foo)), and as such it should overwrite any target destination. So, I think the correct fix here is to do the entire copy into a MetaFile that is not inserted into the source tree at all, and then, when that copy is done, insert it at the target position, freeing whatever MetaFile tree was there before (a rough sketch of this approach appears below, after this discussion).

It makes sense to me and the solution is obvious, so I will prepare a fix for it, but I wonder how this can happen. This is called when you move, rename, or trash a file, yet all those operations should fail if you try them with such parameters...

Yeah, with a 1-1 mapping of the filesystem and the metadata tree, the operation leading to this copy should fail. However, maybe some combination of symlinks and divergence between the on-disk structure and the metatree hierarchy allowed it to succeed. Might be possible to figure out by analyzing the db + journal.

I have proposed an upstream patch, can you take a look, please?

Original poster here. Just another data point: it started happening again yesterday while (I'm pretty sure) using the "rename" function of gthumb. After deleting a 5 GB backtrace from /var/spool/abrt and deleting ~/.local/share/gvfs-metadata/uuid-ac16* (same file as before), the system behavior was back to normal. Using gthumb's rename function again later threw the system into the same state. The same fix worked around the problem. I'm hopeful your patch solves this issue. Thanks for continuing to dig into this.

If you are able to reproduce the bug, e.g. using gthumb after you removed the metadata files, please write down the exact steps to reproduce, including the details about file paths (otherwise I plan to analyze the attached files once I have some time). Be sure that the metadata daemon is not running when you remove the database files:
1) pkill gvfsd-metadata
2) rm ~/.local/share/gvfs-metadata/uuid-ac16*
3) do something with gthumb (the metadata daemon should be autospawned)?

I made a test build with the mentioned patch if you want to try it: http://koji.fedoraproject.org/koji/taskinfo?taskID=12195299

(In reply to Ondrej Holy from comment #34)
> If you are able to reproduce the bug e.g. using gthumb after you removed the
> metadata files [...]

I am no longer able to reproduce this bug.

Ok, let's close it, the patch is part of gvfs-1.26.3 anyway.
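To illustrate the fix direction described in the discussion above (do the whole copy into a detached node and only splice it into the tree once the copy is finished), here is a rough, simplified C sketch. The MetaNode structure and helper functions are illustrative assumptions made for this report, not the actual gvfs metabuilder API; the real MetaFile also carries mtimes, string data and journal state that are omitted here.

/* Sketch only: a simplified stand-in for the gvfs metadata tree, showing
 * the "copy detached, attach last" idea; not the real metabuilder code. */
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>

typedef struct MetaNode MetaNode;
struct MetaNode {
  char     *name;
  MetaNode *children;  /* first child */
  MetaNode *next;      /* next sibling */
};

static MetaNode *
node_new (const char *name)
{
  MetaNode *n = calloc (1, sizeof *n);
  n->name = strdup (name);
  return n;
}

static void
node_free (MetaNode *n)
{
  for (MetaNode *c = n->children; c != NULL; )
    {
      MetaNode *next = c->next;
      node_free (c);
      c = next;
    }
  free (n->name);
  free (n);
}

/* Copy src recursively into a freshly allocated node that is NOT linked
 * into any tree yet, so the traversal can never run into nodes that this
 * very copy has just created (the cause of the endless recursion above). */
static MetaNode *
node_copy (const MetaNode *src)
{
  MetaNode *copy = node_new (src->name);
  for (const MetaNode *c = src->children; c != NULL; c = c->next)
    {
      MetaNode *child_copy = node_copy (c);
      child_copy->next = copy->children;
      copy->children = child_copy;
    }
  return copy;
}

/* Attach the finished copy under dest as "name", freeing any existing
 * child of that name first (move semantics: the target is overwritten). */
static void
node_attach (MetaNode *dest, MetaNode *copy, const char *name)
{
  for (MetaNode **link = &dest->children; *link != NULL; )
    {
      if (strcmp ((*link)->name, name) == 0)
        {
          MetaNode *old = *link;
          *link = old->next;
          node_free (old);
        }
      else
        link = &(*link)->next;
    }
  free (copy->name);
  copy->name = strdup (name);
  copy->next = dest->children;
  dest->children = copy;
}

/* The rename/move path then becomes: copy first (detached), attach last. */
static void
tree_copy (const MetaNode *src, MetaNode *dest_parent, const char *dest_name)
{
  MetaNode *copy = node_copy (src);      /* never touches dest_parent */
  node_attach (dest_parent, copy, dest_name);
}

int
main (void)
{
  MetaNode *root = node_new ("/");
  MetaNode *a = node_new ("a_dir");
  a->children = node_new ("some_file");
  root->children = a;

  /* Even the degenerate "copy a_dir into a child of itself" terminates,
   * because only the original, unmodified subtree is ever traversed. */
  tree_copy (a, a, "b_dir");

  node_free (root);
  return 0;
}

The main() above exercises a stand-in for the kind of self-copy hypothesized in the analysis (copying /a_dir into /a_dir/b_dir); with the detached-copy approach the traversal only ever reads the original subtree, so it terminates instead of recursing until the stack is exhausted.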
Similar problem has been detected: Trying to access data on Google Nexus phone

reporter:         libreport-2.6.4
backtrace_rating: 4
cmdline:          /usr/libexec/gvfsd-mtp --spawner :1.9 /org/gtk/gvfs/exec_spaw/28
crash_function:   allocator_memalign
executable:       /usr/libexec/gvfsd-mtp
global_pid:       15837
kernel:           4.4.7-300.fc23.x86_64
package:          gvfs-mtp-1.26.3-1.fc23
reason:           gvfsd-mtp killed by SIGSEGV
runlevel:         N 5
type:             CCpp
uid:              1000