Bug 1647229
| Summary: | Errors w/ directory entries when running yarn | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | John Strunk <jstrunk> |
| Component: | fuse | Assignee: | Csaba Henk <csaba> |
| Status: | CLOSED WONTFIX | QA Contact: | Rahul Hinduja <rhinduja> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | rhgs-3.4 | CC: | amukherj, jahernan, jstrunk, rhopp, rhs-bugs, sabose, storage-qa-internal, sutan |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-01-17 09:33:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
John Strunk
2018-11-06 22:20:35 UTC
Looks like the fix is https://review.gluster.org/#/c/glusterfs/+/21147/ & https://review.gluster.org/#/c/glusterfs/+/21146/

This was fixed by Raghavendra while working on transactional workload support on glusterfs. Not sure if we could get it backported.

(In reply to Amar Tumballi from comment #6)
> Looks like the fix is https://review.gluster.org/#/c/glusterfs/+/21147/ &
> https://review.gluster.org/#/c/glusterfs/+/21146/
>
> This was fixed by Raghavendra while working on transactional workload
> support on glusterfs. Not sure if we could get it backported.

Those two patches were related to rename atomicity. However, with just the logs attached it is difficult to figure out whether this is a rename atomicity issue. Also, note that rename atomicity errors are transient and will go away if the file is accessed again. In this case, however, even after the application failed, ls still complained about ENOENT.

I would need the following data:

* strace of the application that failed (I think it's yarn here).
* fusedump (mount glusterfs with the option --dump-fuse=<path to dump>; note that the mount script doesn't recognise this option, so please copy the glusterfs mount's cmdline from ps and add the --dump-fuse option to it).
* Is the error persistent, or does it go away after some time?
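A minimal sketch of how the requested data might be captured, assuming a host mount at /mnt; the volfile server, volume name and output paths below are illustrative placeholders, not values from this bug:

```bash
# Hypothetical example; adjust paths and volume names to the actual setup.

# 1. strace of the failing application, run inside the container:
strace -f -tt -o /tmp/yarn.strace yarn

# 2. fusedump: the mount script does not accept --dump-fuse, so copy the
#    running client's command line from ps and re-run it with the option added.
ps ax | grep '[g]lusterfs'            # note the exact cmdline serving /mnt
umount /mnt
# Re-run the copied cmdline (arguments below are illustrative) plus --dump-fuse:
/usr/sbin/glusterfs --volfile-server=gluster-server1 --volfile-id=/myvol \
    --dump-fuse=/var/tmp/mnt.fusedump /mnt
```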
The errors seem to disappear after an unpredictable amount of time (can be a minute or so up to many minutes).

Requested logs made available via private link.

A note about correlating the logs: the fuse dump was taken on the host, so the volume is mounted on /mnt. The strace was run within the container, so the gluster volume is on /data. (/mnt/foo == /data/foo)

Based on the strace from comment #8:

[rgowdapp@rgowdapp 1647229]$ grep ENOENT strace.out | grep -wvE "stat|mkdir|open|lstat|connect|openat|access|statfs|epoll_ctl" -c
0
[rgowdapp@rgowdapp 1647229]$ grep "rename(" -w strace.out -c
0

* There are no renames. Hence this is not a rename atomicity issue, and hence the patches in comment #6 are unlikely to fix it.
* Also, ENOENT is not on an fd-based operation. Barring openat, all are path-based syscalls. We need to rule out whether openat failed because the fd didn't provide access-after-unlink semantics.
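A small follow-up check along those lines, assuming the same strace.out capture referenced above (hedged sketch, not output from this bug):

```bash
# List only the openat() calls that returned ENOENT (with line numbers), so
# each failure can be traced back to the path and flags it was issued with.
grep -n "openat(" strace.out | grep -w ENOENT
```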
Has any progress been made on this bug since the last update?

(In reply to Atin Mukherjee from comment #11)
> Has any progress been made on this bug since the last update?

Not much, except for comment #10, which says this is not a rename atomicity issue.

Hello,
Do you have any plan to solve this issue? What is the status? Is this issue WIP or planned for the next sprint?
Thanks,

Hello,
Sorry to ask again,
Could you let us know if there are any plans to investigate this issue? It is fine if you tell us that it is not critical or cannot be fixed in the short term. Though, knowing if this is going to be fixed in 1 week or 1 month or never would help us to move forward and find other solutions. This issue is quite critical for https://che.openshift.io which promotes Eclipse Che, Openshift but also Gluster as it is being used in the background.
Thanks,

(In reply to Sun Tan from comment #14)
> Hello,
> Sorry to ask again,
> Could you let us know if there are any plans to investigate this issue?
> It is fine if you tell us that it is not critical or cannot be fixed in
> the short term. Though, knowing if this is going to be fixed in 1 week or
> 1 month or never would help us to move forward and find other solutions.
> This issue is quite critical for https://che.openshift.io which promotes
> Eclipse Che, Openshift but also Gluster as it is being used in the background.
> Thanks,

Sorry about the inaction on my part. I've not analyzed the data uploaded in comment #8. I'll do so today and then comment on the timelines.

np, thank you. Let us know if you have any update.

Some updates:

* I couldn't find any syscalls on the problematic directory in the attached strace output:

[root@rgowdapp 1647229]# grep -c "/theia/node_modules/color" strace.out
0

* My attempts at recreating the issue failed.

[root@rgowdapp docker]# ln -s /usr/libexec/docker/docker-runc-current /usr/bin/docker-runc
[root@rgowdapp docker]# docker run -it --rm -v /mnt:/home docker.io/sunix/docker-centos-git-yarn bash
bash-4.2$ cd /home/
bash-4.2$ git clone https://github.com/theia-ide/theia.git
Cloning into 'theia'...
remote: Enumerating objects: 23, done.
remote: Counting objects: 100% (23/23), done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 41666 (delta 5), reused 9 (delta 2), pack-reused 41643
Receiving objects: 100% (41666/41666), 111.26 MiB | 2.26 MiB/s, done.
Resolving deltas: 100% (27729/27729), done.
Checking out files: 100% (1495/1495), done.
bash-4.2$ cd theia/
bash-4.2$ yarn
yarn install v1.12.3
[1/5] Validating package.json...
[2/5] Resolving packages...
[3/5] Fetching packages...
warning monaco-languageclient.0: The engine "vscode" appears to be invalid.
warning vscode-base-languageclient.0: The engine "vscode" appears to be invalid.
info fsevents.4: The platform "linux" is incompatible with this module.
info "fsevents.4" is an optional dependency and failed compatibility check. Excluding it from installation.
[4/5] Linking dependencies...
warning " > istanbul-instrumenter-loader.1" has unmet peer dependency "webpack@^2.0.0 || ^3.0.0 || ^4.0.0".
warning " > tslint-language-service.9" has incorrect peer dependency "typescript@>= 2.3.1 < 3".
warning "workspace-aggregator-38ceceb9-bedc-4134-855a-7afa827982aa > @theia/application-manager > font-awesome-webpack.5-beta.2" has unmet peer dependency "font-awesome@>=4.3.0".
[5/5] Building fresh packages...
[-/13] ⠂ waiting...
[-/13] ⠂ waiting...
[3/13] ⠄ electron
[-/13] ⠄ waiting...
[-/13] ⠄ waiting...
error /home/theia/node_modules/electron: Command failed.
Exit code: 1
Command: node install.js
Arguments:
Directory: /home/theia/node_modules/electron
Output:
Downloading electron-v2.0.14-linux-x64.zip
Error: read ECONNRESET
/home/theia/node_modules/electron/install.js:54
throw err
^
bash-4.2$ ls -l /home/theia/node_modules/electron
total 372
-rw-r--r--. 1 default root 1060 Dec 12 07:39 LICENSE
-rw-r--r--. 1 default root 4563 Dec 12 07:39 README.md
-rwxr-xr-x. 1 default root 479 Dec 12 07:39 cli.js
-rw-r--r--. 1 default root 366142 Dec 12 07:39 electron.d.ts
-rw-r--r--. 1 default root 338 Dec 12 07:39 index.js
-rw-r--r--. 1 default root 1887 Dec 12 07:39 install.js
drwxr-xr-x. 4 default root 4096 Dec 12 07:45 node_modules
-rw-r--r--. 1 default root 814 Dec 12 07:39 package.json
bash-4.2$ ls -l /home/theia/node_modules/electron -l
total 372
-rw-r--r--. 1 default root 1060 Dec 12 07:39 LICENSE
-rw-r--r--. 1 default root 4563 Dec 12 07:39 README.md
-rwxr-xr-x. 1 default root 479 Dec 12 07:39 cli.js
-rw-r--r--. 1 default root 366142 Dec 12 07:39 electron.d.ts
-rw-r--r--. 1 default root 338 Dec 12 07:39 index.js
-rw-r--r--. 1 default root 1887 Dec 12 07:39 install.js
drwxr-xr-x. 4 default root 4096 Dec 12 07:45 node_modules
-rw-r--r--. 1 default root 814 Dec 12 07:39 package.json

Note that I ran into an error, but it was not exactly the error reported in this bug. As can be seen, the directory is accessible and the error code is different. Later I changed the log level of glusterfs and running yarn was successful (I think there might be a race).

* Trying to recreate it, I removed everything from the mount point, killed docker and repeated the test. This time I got EPERM errors:

bash-4.2$ ls -l
total 68
-rw-r--r--. 1 root root 12030 Oct 6 19:15 anaconda-post.log
lrwxrwxrwx. 1 root root 7 Oct 6 19:14 bin -> usr/bin
drwxr-xr-x. 5 root root 360 Dec 12 08:22 dev
drwxr-xr-x. 1 root root 4096 Dec 12 08:22 etc
drwxr-xr-x. 3 root root 4096 Dec 12 08:22 home
lrwxrwxrwx. 1 root root 7 Oct 6 19:14 lib -> usr/lib
lrwxrwxrwx. 1 root root 9 Oct 6 19:14 lib64 -> usr/lib64
drwxr-xr-x. 2 root root 4096 Apr 11 2018 media
drwxr-xr-x. 2 root root 4096 Apr 11 2018 mnt
drwxr-xr-x. 2 root root 4096 Apr 11 2018 opt
dr-xr-xr-x. 344 root root 0 Dec 12 08:22 proc
dr-xr-x---. 1 root root 4096 Nov 22 14:01 root
drwxr-xr-x. 1 root root 4096 Dec 12 08:22 run
lrwxrwxrwx. 1 root root 8 Oct 6 19:14 sbin -> usr/sbin
drwxr-xr-x. 2 root root 4096 Apr 11 2018 srv
dr-xr-xr-x. 13 root root 0 Nov 24 14:06 sys
drwxrwxrwt. 1 root root 4096 Nov 22 14:03 tmp
-rwxr-xr-x. 1 root root 195 Nov 22 14:00 uid_entrypoint
drwxr-xr-x. 1 root root 4096 Oct 6 19:14 usr
drwxr-xr-x. 1 root root 4096 Oct 6 19:14 var
bash-4.2$ touch usr/file
touch: cannot touch 'usr/file': Permission denied
bash-4.2$ yarn
yarn install v1.12.3
warning Skipping preferred cache folder "/home/user/.cache/yarn" because it is not writable.
warning Selected the next writable cache folder in the list, will be "/tmp/.yarn-cache-1001".
info No lockfile found.
[1/4] Resolving packages...
[2/4] Fetching packages...
[3/4] Linking dependencies...
[4/4] Building fresh packages...
error Could not write file "/yarn-error.log": "EACCES: permission denied, open '/yarn-error.log'"
error An unexpected error occurred: "EACCES: permission denied, mkdir '/node_modules'".
info Visit https://yarnpkg.com/en/docs/cli/install for documentation about this command.
Error: ENOENT: no such file or directory, open '/home/user/.yarnrc'

Note that the directory is owned by root, but the user seems to be 'default':

bash-4.2$ whoami
default

This doesn't really look like a glusterfs issue. If you've a live setup, I can take a look at what went wrong.

Also, see whether you can stat the gfid handle of the problematic directory (symbolic links) on each brick. From the bug description it looks like the gfid handle couldn't be resolved.
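A hedged sketch of that gfid-handle check on one brick; the brick root and the GFID value are placeholders, and /theia/node_modules/color is only assumed to be the problematic directory based on the earlier grep:

```bash
# Placeholders: adjust the brick root and directory path to the real volume.
BRICK=/bricks/brick1
DIR_ON_BRICK=$BRICK/theia/node_modules/color

# Read the directory's GFID from its trusted.gfid xattr on the brick (run as root).
getfattr -n trusted.gfid -e hex "$DIR_ON_BRICK"

# For directories, the gfid handle under .glusterfs is a symlink named
# .glusterfs/<first 2 hex>/<next 2 hex>/<full gfid>. With a hypothetical
# GFID such as:
GFID=0123abcd-4567-89ef-0123-456789abcdef
ls -l "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"
stat  "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"
# A missing or dangling symlink here would match "gfid handle couldn't be resolved".
```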