Description of problem: Utilising Dbase database files through a program causes a file error and crashes the accessing program. How reproducible: Occurs on startup of PLC driver consistently. Steps to Reproduce: We are running a SCADA package (Macroview) with Dbase (dbf) files used to store the equipment communication parameters. We have been running the PLC driver separately in order to diagnose the issues we're encountering. Removing the glusterfs mount and running on files mounted on the root drive allows the PLC driver to start, so the dbf file is not corrupted. The file can be opened with a text editor, and it's contents viewed, however accessing it through the PLC driver causes the error shown below. Actual results: Program exits with result the following: >Severe Error Number -950 >Unexpected Result >filelength()Aborted (core dumped) Expected results: Program should be able to access the dbf file, read the relevant configuration out of it and start communications with the equipment. Are there any restrictions on some file access methods using a fuse mount point? Additional info: Relevant log section of mount point (set to DEBUG): [2018-04-13 07:47:53.086444] D [MSGID: 0] [fd-lk.c:455:fd_lk_insert_and_merge] 0-fd-lk: new lock request: owner = 310ac35d69deefe4, fl_type = F_WRLCK, fs_start = 1999999999, fs_end = 1999999999, user_flock: l_type = F_WRLCK, l_start = 1999999999, l_len = 1 [2018-04-13 07:47:53.086475] D [MSGID: 0] [fd-lk.c:421:print_lock_list] 0-fd-lk: lock list: [2018-04-13 07:47:53.086959] D [logging.c:1979:_gf_msg_internal] 0-logging-infra: Buffer overflow of a buffer whose size limit is 5. About to flush least recently used log message to disk [2018-04-13 07:47:53.086492] D [MSGID: 0] [fd-lk.c:432:print_lock_list] 0-fd-lk: owner = 310ac35d69deefe4, cmd = F_SETLK fl_type = F_WRLCK, fs_start = 1999999999, fs_end = 1999999999, user_flock: l_type = F_WRLCK, l_start = 1999999999, l_len = 1, [2018-04-13 07:47:53.086959] D [MSGID: 0] [dht-common.c:2694:dht_lookup] 0-gv0-dht: calling revalidate lookup for /cg-rvwd/config/plcstn.dbf at gv0-replicate-0 [2018-04-13 07:47:53.089387] D [MSGID: 0] [dht-common.c:1056:dht_revalidate_cbk] 0-gv0-dht: revalidate lookup of /cg-rvwd/config/plcstn.dbf returned with op_ret 0 [2018-04-13 07:47:53.090877] D [MSGID: 0] [fd-lk.c:455:fd_lk_insert_and_merge] 0-fd-lk: new lock request: owner = 310ac35d69deefe4, fl_type = F_UNLCK, fs_start = 1999999999, fs_end = 1999999999, user_flock: l_type = F_UNLCK, l_start = 1999999999, l_len = 1 [2018-04-13 07:47:53.097758] D [logging.c:1979:_gf_msg_internal] 0-logging-infra: Buffer overflow of a buffer whose size limit is 5. About to flush least recently used log message to disk
Would it make a difference if you turn off all performance xlators? Following are the relevant cli options: * performance.stat-prefetch * performance.write-behind * performance.read-ahead * performance.readdir-ahead * performance.parallel-readdir * performance.io-cache * performance.open-behind * performance.quick-read
I've disabled the requested performance xlators using "sudo gluster volume gv0 performance.stat-prefetch off" and similar. I'm still receiving the same error while starting the driver. Do you need me to try anything else or get you more information out of the system?
(In reply to joshua.collins from comment #2) > I've disabled the requested performance xlators using "sudo gluster volume > gv0 performance.stat-prefetch off" and similar. I'm still receiving the same > error while starting the driver. Thanks for doing that. > Do you need me to try anything else or get > you more information out of the system? Yes. We need strace and fuse dump to figure out I/O pattern of application. * strace of application # strace -ff -tt -p <pid> -o <file-to-which-strace-logs-are-to-be-stored> * Dump of traffic between kernel and glusterfs # mount -t glusterfs <volfile-server>:/<volname> -o dump-fuse=<file-to-which-fuse-dump-to-be-stored> Please start collecting this information before you start tests.
Created attachment 1422337 [details] Fuse Dump Log File Result of running fuse-dump on glusterfs brick during driver test
Created attachment 1422338 [details] Strace Log File Result of running strace on driver program start up
Created attachment 1422339 [details] Strace Log File (Non gluster mount) strace log of program with configuration files not mounted through glusterfs
(In reply to joshua.collins from comment #5) > Created attachment 1422338 [details] > Strace Log File > > Result of running strace on driver program start up Sorry. I missed one argument to stat i.e., -v. Without this, I get truncated stat output, like: 08:21:34.310477 fstat64(3, {st_mode=S_IFREG|0664, st_size=1649, ...}) = 0 Since this is the last syscall done on the file, I would like to compare whether there was any difference in stat got from gluster mount and from non-gluster mount. Is it possible to collect a new strace? # strace -ff -tt -v -p <pid> -o <file-to-which-strace-logs-are-to-be-stored>
Created attachment 1422362 [details] Strace Log File Strace log file with verbose option
Created attachment 1422363 [details] Strace Log File (Non gluster mount) Strace log file with verbose option on non gluster mount
I am posting the sections where application errored out on Glusterfs (I assume it was accessing file "plcstn.dbf". Non Glusterfs mount: 3:20:38.308077 stat64("plcstn.dbf", {st_dev=makedev(0, 27), st_ino=350496, st_mode=S_IFREG|0664, st_nlink=1, st_uid=123, st_gid=129, st_blksize=4096, st_blocks=8, st_size=1649, st_atime=2018/04/13-15:49:59.008574835, st_mtime=2017/11/20-18:28:38, st_ctime=2018/04/13-15:50:28.156916510}) = 0 13:20:38.308156 open("plcstn.dbf", O_RDONLY) = 3 13:20:38.308197 fcntl64(3, F_GETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=1, l_len=1881195675648, l_pid=18446744073709035320}) = 0 13:20:38.308240 lseek(3, 0, SEEK_SET) = 0 13:20:38.308276 read(3, "\3u\v\24\37\0\0\0\201\0001\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 32) = 32 13:20:38.308315 fstat64(3, {st_dev=makedev(0, 27), st_ino=350496, st_mode=S_IFREG|0664, st_nlink=1, st_uid=123, st_gid=129, st_blksize=4096, st_blocks=8, st_size=1649, st_atime=2018/04/13-15:49:59.008574835, st_mtime=2017/11/20-18:28:38, st_ctime=2018/04/13-15:50:28.156916510}) = 0 Glusterfs mount: 13:23:53.206624 stat64("plcstn.dbf", {st_dev=makedev(0, 56), st_ino=11372107093322068921, st_mode=S_IFREG|0664, st_nlink=1, st_uid=123, st_gid=129, st_blksize=131072, st_blocks=4, st_size=1649, st_atime=2018/04/12-10:31:02.213560658, st_mtime=2018/04/12-10:31:02.225560935, st_ctime=2018/04/12-11:11:00.280660783}) = 0 13:23:53.213693 open("plcstn.dbf", O_RDONLY) = 3 13:23:53.220256 fcntl64(3, F_GETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=1, l_len=1881195675648, l_pid=18446744073707472760}) = 0 13:23:53.227938 lseek(3, 0, SEEK_SET) = 0 13:23:53.231695 read(3, "\3u\v\24\37\0\0\0\201\0001\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 32) = 32 13:23:53.236607 fstat64(3, {st_dev=makedev(0, 56), st_ino=11372107093322068921, st_mode=S_IFREG|0664, st_nlink=1, st_uid=123, st_gid=129, st_blksize=131072, st_blocks=4, st_size=1649, st_atime=2018/04/12-10:31:02.213560658, st_mtime=2018/04/12-10:31:02.225560935, st_ctime=2018/04/12-11:11:00.280660783}) = 0 I see the only differences between both are: * (a)(m)(c)time * inode number * blocks and blcksize * st_dev There is nothing wrong in these value differing for files across different filesystems. Rest of the values like st_size/uid/gid/link/mode etc match. I am at loss to explain why application errored. @Joshua, Any clue on what might've caused application to misbehave? regards, Raghavendra
One difference I can see is in the relation of atime/mtime/ctime. On Non-gluster mount, ctime > atime > mtime On Gluster mount, ctime > mtime > atime there is a difference in relation between atime and mtime. Is there anything in application access pattern to indicate this is wrong?
(In reply to Raghavendra G from comment #11) > One difference I can see is in the relation of atime/mtime/ctime. > > On Non-gluster mount, ctime > atime > mtime > On Gluster mount, ctime > mtime > atime > > there is a difference in relation between atime and mtime. Is there anything > in application access pattern to indicate this is wrong? Technically speaking read of 32 bytes should've changed atime. But I guess filesystems are mounted with "noatime" option. So, I don't see change in atime after read of 32bytes on both Gluster and non-gluster mounts.
The only other information I have that I can see affecting the driver would be that it's running on btrfs in a raid1 configuration. The non gluster file is also running on btrfs so I don't believe it would be the cause. Happy to be wrong though. The btrfs filesystem is mounted with noatime and nodiratime based on a recommendation. Would turning these back on be likely to have an affect?
(In reply to joshua.collins from comment #13) > The only other information I have that I can see affecting the driver would > be that it's running on btrfs in a raid1 configuration. The non gluster file > is also running on btrfs so I don't believe it would be the cause. Happy to > be wrong though. > > The btrfs filesystem is mounted with noatime and nodiratime based on a > recommendation. Would turning these back on be likely to have an affect? I am not sure. But, lets try mounting without noatime and nodiratime. As usual, please capture strace :)
(In reply to Raghavendra G from comment #14) > (In reply to joshua.collins from comment #13) > > The only other information I have that I can see affecting the driver would > > be that it's running on btrfs in a raid1 configuration. The non gluster file > > is also running on btrfs so I don't believe it would be the cause. Happy to > > be wrong though. > > > > The btrfs filesystem is mounted with noatime and nodiratime based on a > > recommendation. Would turning these back on be likely to have an affect? > > I am not sure. But, lets try mounting without noatime and nodiratime. As > usual, please capture strace :) Please make sure, glusterfs too is mounted without these options.
Created attachment 1422391 [details] Strace Log File (Non gluster mount) Strace with noatime, nodiratime disabled on all mount points
I ran up some new gluster systems with a base file system with ext4 as the underlying file system, and still receive the same error message. Attaching the strace now
Created attachment 1422420 [details] EXT4 Gluster Log
(In reply to joshua.collins from comment #16) > Created attachment 1422391 [details] > Strace Log File (Non gluster mount) > > Strace with noatime, nodiratime disabled on all mount points Looks like we got error on non-gluster mount too? 19:00:40.472518 fstat64(3, {st_dev=makedev(0, 54), st_ino=11372107093322068921, st_mode=S_IFREG|0664, st_nlink=1, st_uid=123, st_gid=129, st_blksize=131072, st_blocks=4, st_size=1649, st_atime=2018/04/16-18:59:14.019651366, st_mtime=2018/04/12-10:31:02.225560935, st_ctime=2018/04/12-11:11:00.280660783}) = 0 19:00:40.477823 fstat64(1, {st_dev=makedev(0, 22), st_ino=11, st_mode=S_IFCHR|0620, st_nlink=1, st_uid=1000, st_gid=5, st_blksize=1024, st_blocks=0, st_rdev=makedev(136, 8), st_atime=2018/04/16-19:00:40.087651367, st_mtime=2018/04/16-19:00:40.087651367, st_ctime=2018/04/16-18:59:14.087651367}) = 0 19:00:40.481536 write(1, "\r\n", 2) = 2 19:00:40.485414 write(1, "\r\n", 2) = 2 19:00:40.489142 write(1, "Severe Error Number -950\r\n", 27) = 27 19:00:40.492821 write(1, "Unexpected Result\r\n", 19) = 19 19:00:40.496921 rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0 19:00:40.500826 write(1, "filelength()", 12) = 12 19:00:40.507019 gettid() = 3665 19:00:40.511008 tgkill(3665, 3665, SIGABRT) = 0 19:00:40.514679 --- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=3665, si_uid=1000} --- 19:00:40.630002 +++ killed by SIGABRT (core dumped) +++
Sorry I think auto fill in my browser has changed the title field on that one. That one was the gluster mount with noatime and nodiratime on it from the description.
Release 3.12 has been EOLd and this bug was still found to be in the NEW state, hence moving the version to mainline, to triage the same and take appropriate actions.
We had other reports saying with glusterfs-6.2 and higher versions of fedora, this usecase works properly. Please upgrade and let us know. (ref: bz1672076)