Bug 763999 (GLUSTER-2267) - apt-get fails to work inside a Proxmox container: Value too large for defined data type
Summary: apt-get fails to work inside a Proxmox container: Value too large for defined...
Keywords:
Status: CLOSED EOL
Alias: GLUSTER-2267
Product: GlusterFS
Classification: Community
Component: fuse
Version: 3.4.0-alpha
Hardware: x86_64
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-01-03 19:11 UTC by alessandro.iurlano
Modified: 2016-01-18 13:16 UTC
CC: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-10-07 14:05:11 UTC
Regression: ---
Mount Type: fuse
Documentation: ---
CRM:
Verified Versions:
Embargoed:
alessandro.iurlano: needinfo-


Attachments
Logfile (60.01 KB, text/plain)
2011-01-14 15:05 UTC, alessandro.iurlano
Logfile of running LANG=C dpkg-buildpackage -rfakeroot (19.65 KB, application/x-bzip)
2011-01-20 05:08 UTC, Johannes Martin
strace for apt-get update, client log, server logs (303.31 KB, application/x-bzip)
2011-01-20 17:36 UTC, Johannes Martin

Description alessandro.iurlano 2011-01-03 19:11:42 UTC
The setup is two machines running Proxmox Virtual Environment (which is Debian GNU/Linux) with OpenVZ installed.
On these two machines I also installed gluster. I created a volume with replica=2 across the two machines and mounted it via fuse (-t glusterfs) on one of them.
I then created an OpenVZ container from the standard Debian template shipped with Proxmox.
I had to use the latest git of glusterfs (as of 20 December 2010) to install the container on the gluster-mounted filesystem, due to a mknod bug.
The container can be started and entered. I can issue some commands, but after a minute it stops working. All I get is this error: Transport endpoint is not connected
At this point I cannot access the mounted filesystem, neither from inside the container nor from the host itself. The only solution is to reboot the machine.

At first I had a mixed environment: one machine was 64-bit and the other 32-bit. But after a while I tried with two 64-bit machines, and the problem is still there.
There is a thread on the mailing list regarding this issue:
http://www.mail-archive.com/gluster-users@gluster.org/msg04639.html

Here is an extract from the logs:
http://glusterfs.pastebin.com/7WpU8hG5

Comment 1 shishir gowda 2011-01-13 09:34:28 UTC
This seems to be a duplicate of bug 763943.

Please try to reproduce the issue with the latest git or qa release 3.1.2qa4.

*** This bug has been marked as a duplicate of bug 2211 ***

Comment 2 alessandro.iurlano 2011-01-14 15:05:46 UTC
Created attachment 416

Comment 3 alessandro.iurlano 2011-01-14 15:12:50 UTC
I just ran a test with the latest git (as of 01-14-2011) and the crash happened again.
This time it seemed to last longer: I was able to run an apt-get update and apt-get upgrade for about five minutes before I got the "Transport endpoint is not connected" error.

Steps to reproduce (a command-level sketch follows the list):
- create a volume across two machines with replica = 2
- mount the volume with -t glusterfs on a Proxmox machine (that is also one of the glusterfs nodes)
- create a container with its root filesystem on the just-mounted gluster filesystem
- start the container
- enter the container and issue commands
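For reference, a command-level sketch of these steps might look like the following; the hostnames, brick paths, volume name, and container ID here are all hypothetical:
---
# on one gluster node, create and start the replicated volume
gluster volume create vz-vol replica 2 host1:/data/brick1 host2:/data/brick1
gluster volume start vz-vol
# on the Proxmox machine (also a gluster node), mount it via fuse
mount -t glusterfs host1:/vz-vol /var/lib/vz
# create, start, and enter an OpenVZ container rooted on that mount
vzctl create 101 --ostemplate debian-6.0-standard
vzctl start 101
vzctl enter 101
---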

I have attached a logfile of this test session.

Thanks

Comment 4 Johannes Martin 2011-01-19 06:14:01 UTC
I can confirm that this problem occurs with 3.1.2.

My setup is pretty much the same:
two proxmox servers, /var/lib/vz (which hosts the root of the virtual machines) on a replicated gluster volume.

The gluster volume gets stuck occasionally (sometimes during VM boot, sometimes when doing an apt-get).

umount <mountpoint>; mount <mountpoint> makes it accessible again, provided all processes accessing the share have been terminated.

umount -l helps in cases where processes are still using the volume.
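
Concretely, assuming a hypothetical volume host1:/vz-vol mounted on /var/lib/vz, the recovery sketch would be:
---
# if no process still holds the mount
umount /var/lib/vz
mount -t glusterfs host1:/vz-vol /var/lib/vz
# if processes are still using the volume, detach lazily instead
umount -l /var/lib/vz
---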

For my virtual machines this means that I have to reboot them all before they work correctly again (since their root is lost).

I'll try to get some clean debug logs later.

Comment 5 shishir gowda 2011-01-20 03:50:33 UTC
Could you also please upload the log files from the bricks, and if possible a stack trace of the crash?

Do you see any format-related warnings at build time? If so, please provide them as well.

In our build environments, the fixes seem fine.

Comment 6 Johannes Martin 2011-01-20 05:08:13 UTC
Created attachment 427

Comment 7 Johannes Martin 2011-01-20 06:51:10 UTC
The problem occurred a couple of times this morning, before I had strace installed in the virtual machine.

Now that I have tried with strace, the error has disappeared. I could run:
apt-get update
apt-get upgrade (which had quite a lot to do)
Previously, just an apt-get update made the filesystem disappear.

However, now I see a debug entry in the logs that I don't quite understand:
---
[2011-01-20 10:44:35.632509] D [afr-common.c:651:afr_lookup_done] vz-replicate-0: Only 1 child up - do not attempt to detect self heal
---

The message occurs a couple of times per second.

Would this mean that one of the servers is down? As far as I can tell, they are both running.

There are a couple of log entries for vz-client-0. vz-client-1 only prints messages occasionally, and they look like this:
---
[2011-01-20 10:12:01.383611] D [client3_1-fops.c:4308:client3_1_lk] vz-client-1: (4228438): failed to get fd ctx. EBADFD
---

Looking further back in the log, that message occurred earlier for vz-client-0, when apt-get caused the filesystem to stop working:
---
[2011-01-20 08:17:46.123722] W [fuse-bridge.c:184:fuse_entry_cbk] glusterfs-fuse: 431358: LOOKUP() /var-lib-vz/private/6006/usr/share/locale => -1 (Transport endpoint is not connected)
[2011-01-20 08:17:46.123751] D [afr-lk-common.c:410:transaction_lk_op] vz-replicate-0: lk op is for a transaction
[2011-01-20 08:17:46.123823] D [client3_1-fops.c:4470:client3_1_finodelk] vz-client-0: (4228278): failed to get fd ctx. EBADFD
[2011-01-20 08:17:46.123848] D [afr-common.c:651:afr_lookup_done] vz-replicate-0: Only 1 child up - do not attempt to detect self heal
[2011-01-20 08:17:46.123901] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x77) [0x7f9887998d17] (-->/usr/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x7f98879984ae] (-->/usr/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f988799840e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(FXATTROP(34)) called at 2011-01-20 08:17:45.967147
[2011-01-20 08:17:46.123907] D [afr-common.c:651:afr_lookup_done] vz-replicate-0: Only 1 child up - do not attempt to detect self heal
[2011-01-20 08:17:46.123956] D [afr-lk-common.c:410:transaction_lk_op] vz-replicate-0: lk op is for a transaction
[2011-01-20 08:17:46.123999] W [fuse-bridge.c:184:fuse_entry_cbk] glusterfs-fuse: 431359: LOOKUP() /var-lib-vz/private/6006/opt/openoffice-server/jodconverter-tomcat-2.2.2/lib => -1 (Transport endpoint is not connected)
[2011-01-20 08:17:46.124031] D [client3_1-fops.c:4470:client3_1_finodelk] vz-client-0: (4228278): failed to get fd ctx. EBADFD
---

So maybe this time I'm lucky in that client-1 died instead of client-0?

Comment 8 shishir gowda 2011-01-20 08:40:39 UTC
The format-related issue does not seem to be the culprit here, as we had suspected.
Could you provide a stack trace of the crash dump along with the log files?

Comment 9 Johannes Martin 2011-01-20 09:08:32 UTC
I just noticed that I have mounted the ext4 partitions without extended attribute support (which seems to be the default for ext4). Could this be a cause of the problem?

By stack trace do you mean the strace of apt-get?

Comment 10 shishir gowda 2011-01-20 09:12:38 UTC
Extended attribute support shouldn't matter, as gluster does not use user extended attributes.

Please provide the logs from the client and from the servers (at least from the one that crashed). These logs might allow us to triage the issue at hand.

Comment 11 Johannes Martin 2011-01-20 17:36:52 UTC
Created attachment 428


I managed to reproduce the problem with the following steps (a command-level sketch follows the list):
- shutdown glusterd on server 1
- killall glusterfsd on server 1
- start glusterd on server 1

- shutdown OpenVZ virtual machine on server 2
- umount all glusterfs volumes on server 2
- shutdown glusterd on server 2
- killall glusterfsd on server 2
- start glusterd on server 2
- mount all glusterfs volumes on server 2
- start OpenVZ virtual machine
- ssh into virtual machine
- strace apt-get update as root
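
On Debian-based Proxmox hosts of that era, the sequence above would roughly correspond to the following sketch; the service invocations, volume name, and container ID are assumptions:
---
# server 1: restart the gluster daemons
/etc/init.d/glusterd stop
killall glusterfsd
/etc/init.d/glusterd start

# server 2: stop the container, restart gluster, remount, restart the VM
vzctl stop 101
umount /var/lib/vz
/etc/init.d/glusterd stop
killall glusterfsd
/etc/init.d/glusterd start
mount -t glusterfs server2:/vz-vol /var/lib/vz
vzctl start 101

# inside the virtual machine, as root
strace -f -o /tmp/apt-get.strace apt-get update
---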

Comment 12 Johannes Martin 2011-08-22 10:13:31 UTC
Nothing has happened regarding this problem within the last 6+ months. Is there any chance this bug will be fixed?

Comment 13 Amar Tumballi 2011-08-22 13:23:19 UTC
(In reply to comment #12)
> Nothing has happened regarding this problem within the last 6+ months. Is there
> any chance this bug will be fixed?

Hi Martin,

Sorry about the delay. We surely want all the 'bugs' in GlusterFS to be fixed (if it is an enhancement, we go by the roadmap; otherwise we are committed to fixing bugs).

We are in the process of fixing a lot of issues with replicate self-heal (most of the fixes are at the design level and hence quite involved). These changes will be available only in the 3.3.0 release branch.

This week we will be releasing 3.3.0beta. If you still have an OpenVZ setup, please try this version and see whether the issues are fixed for you.

Regards,

Comment 14 Amar Tumballi 2011-09-30 05:50:04 UTC
The effort to fix this bug is ongoing, but I am removing it from the list of 'blocker' bugs for the 3.3.0 release.

Comment 15 Johannes Martin 2011-12-30 11:46:21 UTC
Just tested this with the latest stable version
  glusterfs 3.2.5 built on Dec  9 2011 19:12:06
(deb package downloaded from gluster.org)

Proxmox VE 1.9, Debian 5.0 template installed in OpenVZ VM

If the OpenVZ virtual machine is stored on a glusterfs volume, I can no longer run apt-get update. I get an error message saying "E: Unable to determine a suitable packaging system type". This seems to indicate that apt-get can't read /etc/apt/sources.list (which is present and readable with cat).

The same works fine if I run the VM from the partition that hosts the glusterfs brick.

Also, when running the VM from the glusterfs share, strace does not work:
strace /bin/ls
strace: /bin/ls: command not found

So it looks like things have degraded rather than improved.

Comment 16 Johannes Martin 2011-12-30 11:54:50 UTC
Retried with
  glusterfs 3.3beta2 built on Aug 23 2011 19:00:54

Same problems, same error messages as above:
- strace /bin/ls does not work
- apt-get update does not work

Comment 17 Johannes Martin 2011-12-30 12:21:12 UTC
Retried with
  glusterfs 3git built on Dec 30 2011 13:15:40 (CET)

Same problems and error messages as before :(

Comment 18 shishir gowda 2012-07-11 04:14:39 UTC
Could we get a setup with the latest release that would help us look into this?

Comment 19 Amar Tumballi 2012-11-27 10:52:18 UTC
Closing this against the latest 3.4.0qa2 release; please re-open if it is seen again.

Comment 20 Johannes Martin 2013-02-05 07:16:28 UTC
Is there anywhere I can download a tarball of the 3.4.0qa2 release? I can't find one on the download server.

I tried downloading the qa8 tarball from the git repository, but I'm not familiar enough with autoconf etc. to produce a working configure and Makefile.in.

Comment 22 Johannes Martin 2013-02-05 10:40:14 UTC
Tried the following:
- Host System Ubuntu 12.04.1
- Within a KVM virtual machine:
  - proxmox1: Proxmox VE 2.2 with all updates installed
  - proxmox2: Proxmox VE 2.2 with all updates installed
  - set up the proxmox servers as a cluster (pvecm ...)
- Within each of the proxmox VMs
  - compiled and installed glusterfs 3.4.0qa8
  - unmounted /var/lib/vz
  - created replicated glusterfs share var-lib-vz
  - mounted that share on /var/lib/vz
- Using proxmox web interface on one of the nodes
  - downloaded Debian 6.0 template
  - created container from that template
  - started container
- In that container:
  - open a console:
    - message on login: fstat: Value too large for defined data type
    - ssh/scp anywhere: PRNG is not seeded
    - apt-get update: 
W: unable to read /etc/apt/apt.conf.d/ - DirectoryExists (75: Value too large for defined data type)
E: Unable to determine a suitable packaging system type

Comment 23 Johannes Martin 2013-02-05 10:47:40 UTC
Tried the following:
- unmounted /var/lib/vz
- linked the directory underlying the gluster share to /var/lib/vz (i.e. OpenVZ now accesses the files directly rather than through the gluster mount; see the sketch after this list)
- started the container and opened a console:
  - no error message on login
  - ssh/scp works
  - apt-get update works
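
A sketch of this workaround, assuming a hypothetical brick directory /data/brick1 backing the var-lib-vz volume:
---
umount /var/lib/vz
# move the now-empty mountpoint aside and point OpenVZ at the brick directly
mv /var/lib/vz /var/lib/vz.orig
ln -s /data/brick1 /var/lib/vz
vzctl start 101
---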

I can't figure out how to reopen the bug, would somebody please do this for me?

Comment 24 Amar Tumballi 2013-02-05 10:52:14 UTC
reopened as per comment #23

Comment 26 Niels de Vos 2014-11-09 10:49:59 UTC
Summary/current status:

- Tested with 3.4.0qa8
- Running a proxmox container stored on a glusterfs mount-point
- When updating, apt-get throws the following error:
      W: unable to read /etc/apt/apt.conf.d/ - 
                 DirectoryExists (75: Value too large for defined data type)

This suggests that READDIR returned a structure that could not be used. These structures contain 64-bit values (even when using 32-bit userspace). It is expected to work on all recent versions of Linux, but some legacy operating systems or applications may not accept things like 64-bit inodes (or files larger than 4 GB). If 64-bit inodes are the issue, mounting the Gluster volume with "-o enable-ino32" should work around it.
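
If that is the case, the workaround would be applied at mount time, e.g. (server and volume name are assumptions):
---
mount -t glusterfs -o enable-ino32 host1:/vz-vol /var/lib/vz
---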

Investigation of what is happening should be done inside the Proxmox container. An strace of the "apt-get" process that fails to read the contents of the directory would be a good start.
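
For example, run inside the container as root (the output path is an assumption):
---
# -f follows forked children, -o writes the trace to a file
strace -f -o /tmp/apt-get.strace apt-get update
---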

Comment 27 Niels de Vos 2015-05-17 21:58:26 UTC
GlusterFS 3.7.0 has been released (http://www.gluster.org/pipermail/gluster-users/2015-May/021901.html), and the Gluster project maintains N-2 supported releases. The last two releases before 3.7 are still maintained; at the moment these are 3.6 and 3.5.

This bug has been filed against the 3.4 release and will not get fixed in a 3.4 version any more. Please verify whether newer versions are affected by the reported problem. If that is the case, update the bug with a note, and update the version if you can. If updating the version is not possible, leave a comment in this bug report with the version you tested, and set the "Need additional information" flag below the comment box to "bugs".

If there is no response by the end of the month, this bug will get automatically closed.

Comment 28 Kaleb KEITHLEY 2015-10-07 14:05:11 UTC
GlusterFS 3.4.x has reached end-of-life.

If this bug still exists in a later release, please reopen it and change the version, or open a new bug.

