Bug 1672480 - Bugs Test Module tests failing on s390x [NEEDINFO]
Summary: Bugs Test Module tests failing on s390x
Keywords:
Status: NEW
Alias: None
Product: GlusterFS
Classification: Community
Component: tests
Version: 4.1
Hardware: s390x
OS: Linux
Priority: medium
Severity: urgent
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-02-05 05:17 UTC by abhays
Modified: 2019-10-01 06:01 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
abhaysingh1722: needinfo? (atumball)


Attachments
bug-1161311.log (1.86 KB, text/plain), 2019-02-13 12:08 UTC, abhays
bug-1161311_glusterd.log (1.54 MB, text/plain), 2019-02-13 12:13 UTC, abhays
bug-1193636.log (1.51 KB, text/plain), 2019-02-13 12:14 UTC, abhays
bug-1193636_glusterd.log (136.99 KB, text/plain), 2019-02-13 12:15 UTC, abhays
bug-1619720.log (1.67 KB, text/plain), 2019-02-13 12:16 UTC, abhays
bug-1619720_glusterd.log (135.22 KB, text/plain), 2019-02-13 12:17 UTC, abhays
bugs/distribute/bug-1161311.t (4.47 KB, application/x-shellscript), 2019-02-13 15:11 UTC, Nithya Balachandran
bugs/distribute/bug-1193636.t (1.89 KB, application/x-shellscript), 2019-02-13 15:12 UTC, Nithya Balachandran
bug-847622_brick0.log (66.18 KB, text/plain), 2019-02-14 08:54 UTC, abhays
bug-847622_nfs.log (35.38 KB, text/plain), 2019-02-14 08:55 UTC, abhays
bug-847622_subtest_failure.log (546 bytes, text/plain), 2019-02-14 08:56 UTC, abhays
bug-1619720_mnt_glusterfs-0.log (80.32 KB, text/plain), 2019-02-14 08:57 UTC, abhays
bug-1619720-patchy0.log (112.02 KB, text/plain), 2019-02-14 08:58 UTC, abhays
bug-1619720-patchy1.log (65.00 KB, text/plain), 2019-02-14 09:00 UTC, abhays
bug-902610_diff.log (922 bytes, text/plain), 2019-02-14 09:00 UTC, abhays
bug-902610_mnt-glusterfs-0.log (91.29 KB, text/plain), 2019-02-14 09:04 UTC, abhays
bug-902610_patchy0.log (42.28 KB, text/plain), 2019-02-14 09:05 UTC, abhays
Zip Folder for all the logs (137.32 KB, application/zip), 2019-02-14 09:22 UTC, abhays
Bitrot.log (10.34 KB, application/zip), 2019-02-22 04:12 UTC, abhays


Links
System ID Priority Status Summary Last Updated
Gluster.org Gerrit 22217 None Open tests/dht: Remove hardcoded brick paths 2019-02-18 04:45:02 UTC

Description abhays 2019-02-05 05:17:23 UTC
Description of problem:
Observing test failures for the following test cases:-
./tests/bugs/glusterfs/bug-902610.t
./tests/bugs/posix/bug-1619720.t
./tests/bitrot/bug-1207627-bitrot-scrub-status.t

After analyzing the above test failures, we observed that the hash values for the bricks and files are calculated differently on s390x systems than on x86.

According to the documentation at https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Features/dht/ , to place a file in a directory, a hash is calculated for the file using both the (containing) directory's unique GFID and the file's name. This hash is then matched to one of the layout assignments to yield the hashed location.
However, on s390x, certain files have hash values that fall outside the hash ranges of the available bricks, so these files are not placed at their expected hashed locations.
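
As an illustration of the lookup described above, here is a minimal shell sketch that matches a 32-bit hash against per-brick ranges. The brick names and ranges are invented for the example and are not read from a real volume, and the hash values are just sample inputs. It shows only the range check, not how GlusterFS computes the hash.

```shell
#!/bin/sh
# Sketch only: brick names and ranges are hypothetical.

# find_hashed_brick HASH "name:start:end" ...
# Prints the name of the first brick whose [start, end] range contains HASH,
# or "none" (with a non-zero status) if no range matches.
find_hashed_brick() {
    hash=$(( $1 ))
    shift
    for entry in "$@"; do
        name=${entry%%:*}
        range=${entry#*:}
        start=$(( ${range%%:*} ))
        end=$(( ${range#*:} ))
        if [ "$hash" -ge "$start" ] && [ "$hash" -le "$end" ]; then
            echo "$name"
            return 0
        fi
    done
    echo "none"
    return 1
}

# Hypothetical two-brick layout that splits the 32-bit hash space in half.
set -- brick0:0x00000000:0x7fffffff brick1:0x80000000:0xffffffff

find_hashed_brick 0x22ce7eba "$@"    # prints brick0
find_hashed_brick 0xc8c91469 "$@"    # prints brick1

# A hash outside every known range has no hashed location.
find_hashed_brick 0x90000000 brick0:0x00000000:0x7fffffff || true   # prints none
```

If the hash itself comes out differently on another architecture, the same file name can land in a different range, or in none of them when the ranges do not cover the whole space.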

This has been observed in other test cases too. 
For example, ./tests/bugs/distribute/bug-1161311.t, ./tests/bugs/distribute/bug-1193636.t, ./tests/basic/namespace.t.
 

Is there any workaround to get the correct hashed locations for the files?

Version-Release number of selected component (if applicable):
v4.1.5


How reproducible:
Build GlusterFS v4.1.5 and run the test case with ./run-tests.sh prove -vf <testcase_name>

Actual results:
Tests FAIL

Expected results:
Tests should PASS


Additional info:

Comment 1 abhays 2019-02-06 09:25:13 UTC
Please provide an update on this; it is required urgently.

Comment 2 Nithya Balachandran 2019-02-06 10:31:38 UTC
Do you have a mixed setup where some clients are little-endian and others big-endian? If not this should not be a problem.

Comment 3 Nithya Balachandran 2019-02-06 10:32:25 UTC
(In reply to Nithya Balachandran from comment #2)
> Do you have a mixed setup where some clients are little-endian and others
> big-endian? If not this should not be a problem.

As long as all your servers and clients have the same "endianness", gluster should work fine.

Comment 4 abhays 2019-02-06 10:52:48 UTC
Thanks for the reply.

Agreed @Nithya.
All the clients in our setup have the same endianness (big-endian).
However, certain test cases fail on our big-endian systems, which is our major concern.

Comment 5 Nithya Balachandran 2019-02-06 10:59:37 UTC
(In reply to abhays from comment #4)
> Thanks for the reply.
> 
> Agreed @Nithya.
> The setup used by the clients are having same "endianness"(Big Endian).
> However, certain test cases fail on our Big Endian Systems which is our
> major concern.

I think that is because the tests in question assume that the files will exist on a certain brick, based on the results we got while running them on our little-endian systems. As the hash values are different on big-endian systems, those assumptions no longer hold.

Do you have a list of all the tests that are failing and where they fail? I can check to see if that is the case.

Comment 6 abhays 2019-02-06 11:21:34 UTC
Yes @Nithya, Below are the test cases that fail, their cause of failure and possible workaround on Big-Endian:-

Following are the test cases which pass after changing the bricks in the test case:-
./tests/bugs/distribute/bug-1161311.t --------passes after changing brick3 to brick1 in subtests 31 and 41.
./tests/bugs/distribute/bug-1193636.t --------passes after changing brick3 to brick1 in subtest 10.
./tests/bugs/nfs/bug-847622.t ----------------passes after giving absolute path of testfile in subtest 9.

Following are the test cases that are still failing even after changing the bricks, however if little-endian hash values are hard-coded on big-endian in the file ./xlators/cluster/dht/src/dht-layout.c, then these test cases pass on Big-Endian:-
./tests/bugs/glusterfs/bug-902610.t-------------subtest 7 fails
./tests/bugs/posix/bug-1619720.t----------------subtests 13 and 14 fail

Following test case is failing because of "Cannot allocate memory" issue:-
./tests/bitrot/bug-1207627-bitrot-scrub-status.t----------subtest 20 fails with the below error:-
[client-rpc-fops_v2.c:961:client4_0_fgetxattr_cbk] 0-patchy-client-0: remote operation failed [Cannot allocate memory]

Following test case is failing on which issue has already been raised:-
./tests/features/trash.t-------------- https://bugzilla.redhat.com/show_bug.cgi?id=1627060

So, please look into this and let us know if any workaround can be provided to make the above tests pass on Big-Endian.

Comment 7 Nithya Balachandran 2019-02-07 06:53:41 UTC
I will take a look and get back next week.

Comment 8 abhays 2019-02-11 09:36:55 UTC
(In reply to Nithya Balachandran from comment #5)
> (In reply to abhays from comment #4)
> > Thanks for the reply.
> > 
> > Agreed @Nithya.
> > The setup used by the clients are having same "endianness"(Big Endian).
> > However, certain test cases fail on our Big Endian Systems which is our
> > major concern.
> 
> I think that is because the tests in  question are assuming that the files
> will exist on a certain brick based on the results we got while running them
> on out little-endian systems. As the hash values are different on big-endian
> systems, thos assumptions no longer hold.
> 
> Do you have a list of all the tests that are failing and where they fail? I
> can check to see if that is the case.

@Nithya, the above test cases (except ./tests/bitrot/bug-1207627-bitrot-scrub-status.t) pass on big-endian systems with the following change in "libglusterfs/src/hashfn.c":
diff --git a/libglusterfs/src/hashfn.c b/libglusterfs/src/hashfn.c
index 62f7ab878..4e18144b8 100644
--- a/libglusterfs/src/hashfn.c
+++ b/libglusterfs/src/hashfn.c
@@ -10,7 +10,7 @@

#include <stdint.h>
#include <stdlib.h>
-
+#include <endian.h>
#include "hashfn.h"

#define get16bits(d) (*((const uint16_t *) (d)))
@@ -45,7 +45,6 @@ uint32_t SuperFastHash (const char * data, int32_t len) {

         rem = len & 3;
         len >>= 2;
-
         /* Main loop */
         for (;len > 0; len--) {
                 hash  += get16bits (data);
@@ -151,8 +150,9 @@ gf_dm_hashfn (const char *msg, int len)

         for (i = 0; i < full_quads; i++) {
                 for (j = 0; j < 4; j++) {
-                        word     = *intmsg;
-                        array[j] = word;
+                        //word     = *intmsg;
+                        word     = htole32(*intmsg);
+                       array[j] = word;
                         intmsg++;
                         full_words--;
                         full_bytes -= 4;
@@ -162,8 +162,9 @@ gf_dm_hashfn (const char *msg, int len)

         for (j = 0; j < 4; j++) {
                 if (full_words) {
-                        word     = *intmsg;
-                        array[j] = word;
+                        //word     = *intmsg;
+                        word     = htole32(*intmsg);
+                       array[j] = word;
                         intmsg++;
                         full_words--;
                         full_bytes -= 4;

This confirms that the test cases were failing because the hash values are calculated differently on the two kinds of systems (little-endian and big-endian). Please let us know once you have looked into the failures and have a fix.
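
To make the byte-order effect concrete, the following shell sketch (plain arithmetic, no GlusterFS code) shows how the same four bytes produce different 32-bit words depending on whether the first byte is treated as the least or the most significant one. The function names are made up for this illustration. On a big-endian host, htole32() performs the byte swap that bswap32 shows here; on a little-endian host it leaves the value unchanged.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical helpers for this example.

# word_le B0 B1 B2 B3: the 32-bit value a little-endian CPU reads from these bytes.
word_le() {
    printf '%08x\n' $(( $1 | ($2 << 8) | ($3 << 16) | ($4 << 24) ))
}

# word_be B0 B1 B2 B3: the 32-bit value a big-endian CPU reads from the same bytes.
word_be() {
    printf '%08x\n' $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# bswap32 WORD: reverse the byte order of a 32-bit value.
bswap32() {
    w=$(( $1 ))
    printf '%08x\n' $(( ((w & 0xff) << 24) | ((w & 0xff00) << 8) | ((w >> 8) & 0xff00) | ((w >> 24) & 0xff) ))
}

# The ASCII bytes of "FILE": F=0x46 I=0x49 L=0x4c E=0x45
word_le 0x46 0x49 0x4c 0x45    # prints 454c4946
word_be 0x46 0x49 0x4c 0x45    # prints 46494c45
bswap32 0x46494c45             # prints 454c4946
```

Any hash that mixes such words without normalizing the byte order first will therefore give different results for the same file name on the two architectures.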

Comment 9 Nithya Balachandran 2019-02-13 07:26:28 UTC
The diffs for possible fixes for bug-1161311.t and bug-1193636.t are as follows. Please make the changes on your setup and let me know if it works.


[root@rhgs313-6 tests]# git diff bugs/distribute/bug-1161311.t
diff --git a/tests/bugs/distribute/bug-1161311.t b/tests/bugs/distribute/bug-1161311.t
index c52c69b..3dc45a4 100755
--- a/tests/bugs/distribute/bug-1161311.t
+++ b/tests/bugs/distribute/bug-1161311.t
@@ -82,6 +82,8 @@ for i in {1..10}; do
   cat /tmp/FILE2 >> $M0/dir1/FILE2
 done
 
+brick_loc=$(get_backend_paths $M0/dir1/FILE2)
+
 #dd if=/dev/urandom of=$M0/dir1/FILE2 bs=64k count=10240
 
 # Rename the file to create a linkto, for rebalance to
@@ -99,7 +101,7 @@ TEST $CLI volume rebalance $V0 start force
 
 # Wait for FILE to get the sticky bit on, so that file is under
 # active rebalance, before creating the links
-TEST checksticky $B0/${V0}3/dir1/FILE1
+TEST checksticky $brick_loc
 
 # Create the links
 ## FILE3 FILE5 FILE7 have hashes, c8c91469 566d26ce 22ce7eba
@@ -120,7 +122,7 @@ cd /
 
 # Ideally for this test to have done its job, the file should still be
 # under migration, so check the sticky bit again
-TEST checksticky $B0/${V0}3/dir1/FILE1
+TEST checksticky $brick_loc
 
 # Wait for rebalance to complete
 EXPECT_WITHIN $REBALANCE_TIMEOUT "completed" rebalance_status_field $V0



[root@rhgs313-6 tests]# git diff ./bugs/distribute/bug-1193636.t
diff --git a/tests/bugs/distribute/bug-1193636.t b/tests/bugs/distribute/bug-1193636.t
index ccde02e..6ffa2d9 100644
--- a/tests/bugs/distribute/bug-1193636.t
+++ b/tests/bugs/distribute/bug-1193636.t
@@ -37,6 +37,8 @@ TEST mkdir $M0/dir1
 # Create a large file (1GB), so that rebalance takes time
 dd if=/dev/zero of=$M0/dir1/FILE2 bs=64k count=10240
 
+brick_loc=$(get_backend_paths $M0/dir1/FILE2)
+
 # Rename the file to create a linkto, for rebalance to
 # act on the file
 TEST mv $M0/dir1/FILE2 $M0/dir1/FILE1
@@ -45,7 +47,7 @@ build_tester $(dirname $0)/bug-1193636.c
 
 TEST $CLI volume rebalance $V0 start force
 
-TEST checksticky $B0/${V0}3/dir1/FILE1
+TEST checksticky $brick_loc
 
 TEST setfattr -n "user.test1" -v "test1" $M0/dir1/FILE1
 TEST setfattr -n "user.test2" -v "test1" $M0/dir1/FILE1
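
The idea behind these diffs is to look up where a file actually landed instead of assuming a brick number. A rough, self-contained sketch of that idea follows. Note that locate_on_bricks is a hypothetical helper written for this example (it is not the test framework's get_backend_paths), and temporary directories stand in for the bricks.

```shell
#!/bin/sh
# Sketch only: locate_on_bricks is hypothetical; temp directories stand in for bricks.

# locate_on_bricks RELPATH BRICKDIR...
# Prints the full path of RELPATH under every brick directory that holds it.
locate_on_bricks() {
    rel=$1
    shift
    for brick in "$@"; do
        if [ -e "$brick/$rel" ]; then
            echo "$brick/$rel"
        fi
    done
}

# Stand-in bricks; a real test would pass the volume's brick directories.
demo=$(mktemp -d)
mkdir -p "$demo/patchy1/dir1" "$demo/patchy2/dir1" "$demo/patchy3/dir1"
touch "$demo/patchy2/dir1/FILE1"    # pretend the hash placed FILE1 on brick 2

brick_loc=$(locate_on_bricks dir1/FILE1 "$demo/patchy1" "$demo/patchy2" "$demo/patchy3")
echo "$brick_loc"

rm -rf "$demo"
```

Because the path is discovered at run time, a check written against it does not depend on which brick a given architecture's hash selects.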

Comment 10 Nithya Balachandran 2019-02-13 07:59:42 UTC
Possible fix for ./tests/bugs/posix/bug-1619720.t:


[root@rhgs313-6 tests]# git diff  bugs/posix/bug-1619720.t
diff --git a/tests/bugs/posix/bug-1619720.t b/tests/bugs/posix/bug-1619720.t
index bfd304d..8584476 100755
--- a/tests/bugs/posix/bug-1619720.t
+++ b/tests/bugs/posix/bug-1619720.t
@@ -1,6 +1,7 @@
 #!/bin/bash
 
 . $(dirname $0)/../../include.rc
+. $(dirname $0)/../../volume.rc
 . $(dirname $0)/../../dht.rc
 
 cleanup;
@@ -35,7 +36,8 @@ TEST mkdir $M0/tmp
 # file-2 will hash to the other subvol
 
 TEST touch $M0/tmp/file-2
-pgfid_xattr_name=$(getfattr -m "trusted.pgfid.*" $B0/${V0}1/tmp/file-2 | grep "trusted.pgfid")
+loc_2=$(get_backend_paths $M0/tmp/file-2)
+pgfid_xattr_name=$(getfattr -m "trusted.pgfid.*" $loc_2 | grep "trusted.pgfid")
 echo $pgfid_xattr_name

Comment 11 Nithya Balachandran 2019-02-13 08:13:51 UTC
(In reply to abhays from comment #6)
> Yes @Nithya, Below are the test cases that fail, their cause of failure and
> possible workaround on Big-Endian:-
> 
> Following are the test cases which pass after changing the bricks in the
> test case:-
> ./tests/bugs/distribute/bug-1161311.t --------passes after changing brick3
> to brick1 in subtests 31 and 41.
> ./tests/bugs/distribute/bug-1193636.t --------passes after changing brick3
> to brick1 in subtest 10.


Provided diff of fix for these 2 tests in comment#9. Please try it out and let me know if it works.



> ./tests/bugs/nfs/bug-847622.t ----------------passes after giving absolute
> path of testfile in subtest 9.


I don't see why this should be dependent on the hashing. Please provide output of the test and the gluster logs with debug enabled when this fails.



> 
> Following are the test cases that are still failing even after changing the
> bricks, however if little-endian hash values are hard-coded on big-endian in
> the file ./xlators/cluster/dht/src/dht-layout.c, then these test cases pass
> on Big-Endian:-
> ./tests/bugs/glusterfs/bug-902610.t-------------subtest 7 fails

I don't see why this should be dependent on the hashing. Please provide output of the test and the gluster logs with debug enabled when this fails.


> ./tests/bugs/posix/bug-1619720.t----------------subtests 13 and 14 fail

Diff of fix provided in comment#10.



> 
> Following test case is failing because of "Cannot allocate memory" issue:-
> ./tests/bitrot/bug-1207627-bitrot-scrub-status.t----------subtest 20 fails
> with the below error:-
> [client-rpc-fops_v2.c:961:client4_0_fgetxattr_cbk] 0-patchy-client-0: remote
> operation failed [Cannot allocate memory]
> 

This does not seem related to the hashing algorithm. Please check the brick log to see if there are any errors.



> Following test case is failing on which issue has already been raised:-
> ./tests/features/trash.t--------------
> https://bugzilla.redhat.com/show_bug.cgi?id=1627060
> 

I'll take a look at this and see what can be done.


> So, please look into this and let us know if any workaround can be provided
> to make the above tests pass on Big-Endian.

Comment 12 abhays 2019-02-13 12:06:29 UTC
(In reply to Nithya Balachandran from comment #9)


(In reply to Nithya Balachandran from comment #10)

Thanks for the reply @Nithya.
Unfortunately, the above changes for bugs/distribute/bug-1161311.t, bugs/distribute/bug-1193636.t and bugs/posix/bug-1619720.t do not work.
Please find attached the logs for the same.

Comment 13 abhays 2019-02-13 12:08:36 UTC
Created attachment 1534374 [details]
bug-1161311.log

Comment 14 Nithya Balachandran 2019-02-13 12:12:47 UTC
Please provide the complete test output and the gluster log files.

Comment 15 abhays 2019-02-13 12:13:52 UTC
Created attachment 1534375 [details]
bug-1161311_glusterd.log

Comment 16 abhays 2019-02-13 12:14:56 UTC
Created attachment 1534376 [details]
bug-1193636.log

Comment 17 abhays 2019-02-13 12:15:44 UTC
Created attachment 1534377 [details]
bug-1193636_glusterd.log

Comment 18 abhays 2019-02-13 12:16:20 UTC
Created attachment 1534378 [details]
bug-1619720.log

Comment 19 abhays 2019-02-13 12:17:19 UTC
Created attachment 1534379 [details]
bug-1619720_glusterd.log

Comment 20 Nithya Balachandran 2019-02-13 12:30:53 UTC
Hi,

I need the client and brick logs (not glusterd which is the management daemon). Please do the following:

Add the following lines to the test after the volume is started:
TEST $CLI volume set $V0 client-log-level DEBUG
TEST $CLI volume set $V0 brick-log-level DEBUG


Run the test and send the client and brick logs.


It might be that the hash values on your system mean that the files are on the same hashed subvol.

Please send:
the hashes of the file names in the tests
the trusted.glusterfs.dht xattr values for the parent directories of these files on each brick
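
When comparing those xattr values, it can help to pull out the hash range each brick covers. The sketch below assumes that the value is 16 bytes printed as hex and that its last two 32-bit fields are the start and end of the range; that field layout is an assumption for illustration and should be checked against the DHT source. dht_range is a hypothetical helper name, and the sample value is illustrative.

```shell
#!/bin/sh
# Sketch only. Assumption: the value holds four 32-bit fields in hex, and the
# last two are the start and end of the brick's hash range.

# dht_range HEXVALUE: prints "start end" as 8-digit hex strings.
dht_range() {
    hex=${1#0x}
    if [ "${#hex}" -ne 32 ]; then
        echo "unexpected length: ${#hex} hex digits" >&2
        return 1
    fi
    start=$(printf '%s' "$hex" | cut -c17-24)
    end=$(printf '%s' "$hex" | cut -c25-32)
    echo "$start $end"
}

dht_range 0x000000010000000000000000ffffffff    # prints 00000000 ffffffff
```

Running this over the value from each brick gives one range per brick, which can then be compared with the file-name hashes.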

Comment 21 Nithya Balachandran 2019-02-13 15:11:29 UTC
Created attachment 1534434 [details]
bugs/distribute/bug-1161311.t

Comment 22 Nithya Balachandran 2019-02-13 15:12:10 UTC
Created attachment 1534436 [details]
bugs/distribute/bug-1193636.t

Comment 23 Nithya Balachandran 2019-02-14 05:52:01 UTC
There was a mistake in the diffs sent earlier. Please try the modified .t files that I have attached and let me know if they work for you.

Comment 24 abhays 2019-02-14 08:50:48 UTC
(In reply to Nithya Balachandran from comment #23)
> There was a mistake in the diffs sent earlier. Please try the modified .t
> files that I have attached and let me know if they work for you.

Thanks for letting me know about this.
The test cases bugs/distribute/bug-1193636.t and bugs/distribute/bug-1161311.t are passing successfully on big-endian. 


(In reply to Nithya Balachandran from comment #11)
> (In reply to abhays from comment #6)
> > Yes @Nithya, Below are the test cases that fail, their cause of failure and
> > possible workaround on Big-Endian:-
> > 
> > Following are the test cases which pass after changing the bricks in the
> > test case:-
> > ./tests/bugs/distribute/bug-1161311.t --------passes after changing brick3
> > to brick1 in subtests 31 and 41.
> > ./tests/bugs/distribute/bug-1193636.t --------passes after changing brick3
> > to brick1 in subtest 10.
> 
> 
> Provided diff of fix for these 2 tests in comment#9. Please try it out and
> let me know if it works.
> 

Working fine.


> 
> 
> > ./tests/bugs/nfs/bug-847622.t ----------------passes after giving absolute
> > path of testfile in subtest 9.
> 
> 
> I don't see why this should be dependent on the hashing. Please provide
> output of the test and the gluster logs with debug enabled when this fails.
> 
> 

I agree @Nithya. This test might not be related to hashing. But please look into the attached logs for the same.


> 
> > 
> > Following are the test cases that are still failing even after changing the
> > bricks, however if little-endian hash values are hard-coded on big-endian in
> > the file ./xlators/cluster/dht/src/dht-layout.c, then these test cases pass
> > on Big-Endian:-
> > ./tests/bugs/glusterfs/bug-902610.t-------------subtest 7 fails
> 
> I don't see why this should be dependent on the hashing. Please provide
> output of the test and the gluster logs with debug enabled when this fails.
> 

@Nithya, I am quite certain this test fails due to differing hash values. Refer to comment #8 for the same. Providing the logs for the same.

> 
> > ./tests/bugs/posix/bug-1619720.t----------------subtests 13 and 14 fail
> 
> Diff of fix provided in comment#10.
> 
> 

This test is failing with the changes shared. PFA the logs for the same.

> 
> > 
> > Following test case is failing because of "Cannot allocate memory" issue:-
> > ./tests/bitrot/bug-1207627-bitrot-scrub-status.t----------subtest 20 fails
> > with the below error:-
> > [client-rpc-fops_v2.c:961:client4_0_fgetxattr_cbk] 0-patchy-client-0: remote
> > operation failed [Cannot allocate memory]
> > 
> 
> This does not seem related to the hashing algorithm. Please check the brick
> log to see if there are any errors.
> 
> 

Providing logs for the same.

> 
> > Following test case is failing on which issue has already been raised:-
> > ./tests/features/trash.t--------------
> > https://bugzilla.redhat.com/show_bug.cgi?id=1627060
> > 
> 
> I'll take a look at this and see what can be done.
> 
> 
> > So, please look into this and let us know if any workaround can be provided
> > to make the above tests pass on Big-Endian.

Comment 25 abhays 2019-02-14 08:54:20 UTC
Created attachment 1534710 [details]
bug-847622_brick0.log

Comment 26 abhays 2019-02-14 08:55:12 UTC
Created attachment 1534711 [details]
bug-847622_nfs.log

Comment 27 abhays 2019-02-14 08:56:23 UTC
Created attachment 1534712 [details]
bug-847622_subtest_failure.log

Comment 28 abhays 2019-02-14 08:57:42 UTC
Created attachment 1534713 [details]
bug-1619720_mnt_glusterfs-0.log

Comment 29 abhays 2019-02-14 08:58:53 UTC
Created attachment 1534714 [details]
bug-1619720-patchy0.log

Comment 30 abhays 2019-02-14 09:00:00 UTC
Created attachment 1534715 [details]
bug-1619720-patchy1.log

Comment 31 abhays 2019-02-14 09:00:56 UTC
Created attachment 1534716 [details]
bug-902610_diff.log

Comment 32 abhays 2019-02-14 09:04:34 UTC
Created attachment 1534720 [details]
bug-902610_mnt-glusterfs-0.log

Comment 33 abhays 2019-02-14 09:05:21 UTC
Created attachment 1534721 [details]
bug-902610_patchy0.log

Comment 34 abhays 2019-02-14 09:22:07 UTC
Created attachment 1534723 [details]
Zip Folder for all the logs

Comment 35 abhays 2019-02-14 09:23:49 UTC
I have attached all the logs as requested, along with the hash values (in the Hash_Values.log file).
Let me know if you need anything else.

Comment 36 Nithya Balachandran 2019-02-14 10:45:13 UTC
Does bugs/glusterfs/bug-902610.t pass if you replace

kill_brick $V0 $H0 $B0/${V0}2 
with
kill_brick $V0 $H0 $B0/${V0}1

?

Comment 37 Nithya Balachandran 2019-02-14 10:47:12 UTC
I have attached an updated trash.t to BZ#1627060. Please let me know if that works for you.

Comment 38 abhays 2019-02-14 11:52:49 UTC
(In reply to Nithya Balachandran from comment #36)
> Does bugs/glusterfs/bug-902610.t pass if you replace
> 
> kill_brick $V0 $H0 $B0/${V0}2 
> with
> kill_brick $V0 $H0 $B0/${V0}1
> 
> ?

No, it doesn't.
=========================
TEST 10 (line 67): 0 echo 1
not ok 10 Got "1" instead of "0", LINENUM:67
RESULT 10: 1
Failed 1/10 subtests

Test Summary Report
-------------------
./tests/bugs/glusterfs/bug-902610.t (Wstat: 0 Tests: 10 Failed: 1)
  Failed test:  10
Files=1, Tests=10, 16 wallclock secs ( 0.04 usr  0.01 sys +  2.32 cusr  0.50 csys =  2.87 CPU)
Result: FAIL
End of test ./tests/bugs/glusterfs/bug-902610.t
================================================================================

Comment 39 Worker Ant 2019-02-14 12:22:19 UTC
REVIEW: https://review.gluster.org/22217 (tests/dht: Remove hardcoded brick paths) posted (#1) for review on master by N Balachandran

Comment 40 Nithya Balachandran 2019-02-15 02:56:03 UTC
> > ./tests/bugs/posix/bug-1619720.t----------------subtests 13 and 14 fail
> 
> Diff of fix provided in comment#10.
> 
> 

> This test is failing with the changes shared. PFA the logs for the same.

As it is difficult for me to figure out what is happening without a Big Endian system, I would encourage you to understand what is expected and try to make the changes yourself. We will be happy to take your patches if they work for us as well.

Comment 41 abhays 2019-02-15 04:23:31 UTC
(In reply to Nithya Balachandran from comment #40)
> > > ./tests/bugs/posix/bug-1619720.t----------------subtests 13 and 14 fail
> > 
> > Diff of fix provided in comment#10.
> > 
> > 
> 
> This test is failing with the changes shared. PFA the logs for the same.
> 
> As it is difficult for me to figure out what is happening without a Big
> Endian system, I would encourage you to understand what is expected and try
> to make the changes yourself. We will be happy to take your patches if they
> work for us as well.

Yes sure. We are trying to debug further. Additionally, we'll need your timely help in resolving these failures.

Comment 42 abhays 2019-02-15 05:55:49 UTC
@Nithya, do y'all have a Jenkins CI infrastructure where continuous builds are executed for GlusterFS? We have come across the below links regarding the same:-
https://ci.centos.org/label/gluster/
https://build.gluster.org/


Can you please confirm about these?

Comment 43 Worker Ant 2019-02-18 04:45:03 UTC
REVIEW: https://review.gluster.org/22217 (tests/dht: Remove hardcoded brick paths) merged (#2) on master by Amar Tumballi

Comment 44 Nithya Balachandran 2019-02-18 08:27:13 UTC
(In reply to abhays from comment #42)
> @Nithya, Do ya'll have a Jenkins CI Infrastructure where continuous builds
> are executed for glusterfs. We have come across the below links regarding
> the same:-
> https://ci.centos.org/label/gluster/
> https://build.gluster.org/
> 
> 
> Can you please confirm about these?

https://build.gluster.org/ is the gluster project's CI. All patches that are posted on review.gluster.org  will run the regression suite on this.

Comment 45 abhays 2019-02-19 06:43:03 UTC
(In reply to Nithya Balachandran from comment #44)
> (In reply to abhays from comment #42)
> > @Nithya, Do ya'll have a Jenkins CI Infrastructure where continuous builds
> > are executed for glusterfs. We have come across the below links regarding
> > the same:-
> > https://ci.centos.org/label/gluster/
> > https://build.gluster.org/
> > 
> > 
> > Can you please confirm about these?
> 
> https://build.gluster.org/ is the gluster project's CI. All patches that are
> posted on review.gluster.org  will run the regression suite on this.

Thanks for the information.


(In reply to Nithya Balachandran from comment #40)
> > > ./tests/bugs/posix/bug-1619720.t----------------subtests 13 and 14 fail
> > 
> > Diff of fix provided in comment#10.
> > 
> > 
> 
> This test is failing with the changes shared. PFA the logs for the same.
> 
> As it is difficult for me to figure out what is happening without a Big
> Endian system, I would encourage you to understand what is expected and try
> to make the changes yourself. We will be happy to take your patches if they
> work for us as well.

Would it be possible for us to add our big-endian (s390x) systems to the gluster project's CI so that it's easier for you to debug the test failures on big-endian platforms?

Comment 46 Nithya Balachandran 2019-02-19 06:46:05 UTC
(In reply to abhays from comment #45)
> (In reply to Nithya Balachandran from comment #44)
> > (In reply to abhays from comment #42)
> > > @Nithya, Do ya'll have a Jenkins CI Infrastructure where continuous builds
> > > are executed for glusterfs. We have come across the below links regarding
> > > the same:-
> > > https://ci.centos.org/label/gluster/
> > > https://build.gluster.org/
> > > 
> > > 
> > > Can you please confirm about these?
> > 
> > https://build.gluster.org/ is the gluster project's CI. All patches that are
> > posted on review.gluster.org  will run the regression suite on this.
> 
> Thanks for the information.
> 
> 
> (In reply to Nithya Balachandran from comment #40)
> > > > ./tests/bugs/posix/bug-1619720.t----------------subtests 13 and 14 fail
> > > 
> > > Diff of fix provided in comment#10.
> > > 
> > > 
> > 
> > This test is failing with the changes shared. PFA the logs for the same.
> > 
> > As it is difficult for me to figure out what is happening without a Big
> > Endian system, I would encourage you to understand what is expected and try
> > to make the changes yourself. We will be happy to take your patches if they
> > work for us as well.
> 
> Is it be possible for us to add our big endian(s390x) systems on gluster
> project's CI so that it's easier for you to debug the test failures on big
> endian platforms?


I don't think so. I would recommend that you debug the tests on your systems and post patches which will work on both.

Comment 47 Raghavendra Bhat 2019-02-21 21:36:04 UTC

I am looking at the bitrot error. The error occurs while doing a getxattr on the xattr "trusted.glusterfs.get-signature". But one of the attached files (bug-1207627-bitrot-scrub-status.log.txt) shows the following extended attributes. Unfortunately, "trusted.bit-rot.signature" is not seen.

TEST 22 (line 55): trusted.bit-rot.bad-file check_for_xattr trusted.bit-rot.bad-file //d/backends/patchy1/FILE1
not ok 22 Got "" instead of "trusted.bit-rot.bad-file", LINENUM:55
RESULT 22: 1
getfattr: Removing leading '/' from absolute path names
# file: d/backends/patchy1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.volume-id=0xdc6b47ebd73f46798f5a86b42678fb44

Would it be possible for you to upload the bitd.log from the /var/log/glusterfs directory after running the bitrot test?

It is the bit-rot daemon (whose log file is the bitd.log I asked for) which sends the setxattr requests to the brick to set the extended attributes needed for bit-rot detection.
If setting that attribute failed, the bit-rot daemon's log file would contain an error.
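
As a rough sketch of the check described above (not part of the test suite), the function below reads a getfattr dump and reports whether a given attribute name is present. The name has_xattr is made up for illustration; the sample dump is the one quoted earlier in this comment.

```shell
# has_xattr KEY: hypothetical helper. Reads the output of
# `getfattr -d -m . -e hex <path>` on stdin and succeeds only if KEY
# appears as an attribute name (the text before the first '=').
has_xattr() {
    awk -F= -v key="$1" '$1 == key { found = 1 } END { exit !found }'
}

# Sample dump copied from the test log quoted in this comment.
dump='# file: d/backends/patchy1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
trusted.glusterfs.volume-id=0xdc6b47ebd73f46798f5a86b42678fb44'

if printf '%s\n' "$dump" | has_xattr trusted.bit-rot.signature; then
    echo "signature xattr present"
else
    echo "signature xattr missing"
fi
```

On a live setup the dump would come from running getfattr against the file on the brick rather than from a string.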


Regards,
Raghavendra

Comment 48 abhays 2019-02-22 04:12:00 UTC
Created attachment 1537322 [details]
Bitrot.log

Contains bitd.log and scrub.log for the test case bug-1207627-bitrot-scrub-status.t

Comment 49 abhays 2019-02-22 04:14:50 UTC
Comment on attachment 1537322 [details]
Bitrot.log

Contains bitd.log and scrub.log for the test case bug-1207627-bitrot-scrub-status.t

We have observed an error on the line below in scrub.log:
[2019-02-14 04:16:31.164722] E [MSGID: 118008] [bit-rot.c:479:br_log_object] 0-patchy-bit-rot-0: fgetxattr() failed on object b179d471-b5a0-49c6-a22b-08fbf599e734 [Cannot allocate memory]

Comment 50 Raghavendra Bhat 2019-02-25 17:08:47 UTC
Hi,

Thanks for the logs. From the logs, I see that the following is happening.

1) The scrubbing is started

2) The scrubber always decides whether a file is corrupted by comparing the stored on-disk signature (obtained via getxattr) with its own calculated signature of the file.

3) Here, while getting the on-disk signature, getxattr is failing with ENOMEM (i.e. Cannot allocate memory) because of the endianness.

4) Further test cases in the test fail because they expect the bad-file extended attribute to be present, which the scrubber could not set because of the above error (i.e. had it been able to get the signature of the file via getxattr, it would have compared that signature with its own calculated signature and set the bad-file extended attribute to indicate that the file is corrupted).


Looking at the code to come up with a fix to address this.
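
The decision flow in steps 2 to 4 can be sketched roughly as below. This is only an illustration of the logic, not the scrubber's code: scrub_verdict is a hypothetical name, and sha256sum is used as a stand-in for whatever checksum the scrubber actually computes.

```shell
# scrub_verdict STORED_SIG FILE: hypothetical sketch of the comparison.
# An empty STORED_SIG models a failed getxattr: no verdict can be reached,
# so the file is never marked as bad.
scrub_verdict() {
    stored=$1
    file=$2
    if [ -z "$stored" ]; then
        echo "unknown"
        return 1
    fi
    # sha256sum is an assumption made for this sketch only.
    computed=$(sha256sum "$file" | cut -d ' ' -f 1)
    if [ "$stored" = "$computed" ]; then
        echo "clean"
    else
        echo "bad-file"
    fi
}
```

With an empty stored signature the function never reaches the comparison, which mirrors why the bad-file attribute is missing in the failing run.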

Comment 51 Nithya Balachandran 2019-02-28 09:19:24 UTC
> 
> 
> I don't think so. I would recommend that you debug the tests on your systems
> and post patches which will work on both.

Please note that what I am asking is for you to look at the .t files and modify file names or remove hardcoding as required.

Comment 52 abhays 2019-03-04 04:57:01 UTC
(In reply to Raghavendra Bhat from comment #50)
> Hi,
> 
> Thanks for the logs. From the logs saw that the following things are
> happening.
> 
> 1) The scrubbing is started
> 
> 2) Scrubber always decides whether a file is corrupted or not by comparing
> the stored on-disk signature (gets by getxattr) with its own calculated
> signature of the file.
> 
> 3) Here, while getting the on-disk signature, getxattr is failing with
> ENOMEM (i.e. Cannot allocate memory) because of the endianness.
> 
> 4) Further testcases in the test fail because, they expect the bad-file
> extended attribute to be present which scrubber could not set because of the
> above error (i.e. had it been able to successfully get the signature of the
> file via getxattr, it would have been able to compare the signature with its
> own calculated signature and set the bad-file extended attribute to indicate
> the file is corrupted).
> 
> 
> Looking at the code to come up with a fix to address this.

Thanks for the reply @Raghavendra. We are also looking into the same.

Comment 53 abhays 2019-03-04 05:23:28 UTC
(In reply to Nithya Balachandran from comment #51)
> > 
> > 
> > I don't think so. I would recommend that you debug the tests on your systems
> > and post patches which will work on both.
> 
> Please note what I am referring to is for you to look at the .t files and
> modify file names or remove hardcoding as required.

Yes @Nithya, we understood that you want us to continue debugging the tests and provide patches if a fix is found.
While doing so, we were able to fix ./tests/bugs/nfs/bug-847622.t with the following patch:

diff --git a/tests/bugs/nfs/bug-847622.t b/tests/bugs/nfs/bug-847622.t
index 3b836745a..f21884972 100755
--- a/tests/bugs/nfs/bug-847622.t
+++ b/tests/bugs/nfs/bug-847622.t
@@ -28,7 +32,7 @@ cd $N0

 # simple getfacl setfacl commands
 TEST touch testfile
-TEST setfacl -m u:14:r testfile
+TEST setfacl -m u:14:r $B0/brick0/testfile
 TEST getfacl testfile

Please check whether the above patch can be merged.


However, the following test cases are still failing and only pass if x86 hash values are provided (refer to comment #8):
./tests/bugs/glusterfs/bug-902610.t
./tests/bugs/posix/bug-1619720.t

We have tried modifying filenames in these test cases, but nothing worked.
We think that the test cases might pass if the hash value calculation in the source code is changed to support the s390x architecture as well.
Could you please look into this?

Comment 54 Nithya Balachandran 2019-03-04 05:35:44 UTC
> 
> However, the test cases are still failing and only pass if x86 hash values
> are provided(Refer to comment#8):-
> ./tests/bugs/glusterfs/bug-902610.t
> ./tests/bugs/posix/bug-1619720.t

Please provide more information on what changes you tried.


> 
> We have tried modifying filenames in these test cases, but nothing worked.
> We think that the test cases might pass if the behavior of the source code
> for hash value calculation is changed to support s390x architecture as well.
> Could you please look into the same?


This seems extremely unlikely at this point. As I said, the code will work fine as long as the setup is not mixed-endian. Since the test cases are the only things that fail, and they fail because they use hard-coded values, such a huge change in the source code is not the first step.

Comment 55 abhays 2019-03-04 07:19:31 UTC
(In reply to Nithya Balachandran from comment #54)
> > 
> > However, the test cases are still failing and only pass if x86 hash values
> > are provided(Refer to comment#8):-
> > ./tests/bugs/glusterfs/bug-902610.t
> > ./tests/bugs/posix/bug-1619720.t
> 
> Please provide more information on what changes you tried.

For tests/bugs/glusterfs/bug-902610.t:
In the test case, after the kill_brick function is run, mkdir $M0/dir1 doesn't work and hence the get_layout check fails. So, as a workaround, we tried not killing the bricks and then checked the behaviour of the test case. dir1 did then get created in all 4 bricks; however, the test failed with the following output:
=========================
TEST 9 (line 59): ls -l /mnt/glusterfs/0
ok 9, LINENUM:59
RESULT 9: 0
getfattr: Removing leading '/' from absolute path names
/d/backends/patchy3/dir1 /d/backends/patchy0/dir1 /d/backends/patchy2/dir1 /d/backends/patchy1/dir1
layout1 from 00000000 to 00000000
layout2 from 00000000 to 55555554
target for layout2 = 55555555
=========================
TEST 10 (line 72): 0 echo 1
not ok 10 Got "1" instead of "0", LINENUM:72
RESULT 10: 1
Failed 1/10 subtests
=========================

But the patch below works for the test case (only on big endian):

diff --git a/tests/bugs/glusterfs/bug-902610.t b/tests/bugs/glusterfs/bug-902610.t
index b45e92b8a..8a8eaf7a3 100755
--- a/tests/bugs/glusterfs/bug-902610.t
+++ b/tests/bugs/glusterfs/bug-902610.t
@@ -2,6 +2,7 @@

 . $(dirname $0)/../../include.rc
 . $(dirname $0)/../../volume.rc
+. $(dirname $0)/../../dht.rc

 cleanup;

@@ -11,11 +12,11 @@ function get_layout()
         layout1=`getfattr -n trusted.glusterfs.dht -e hex $1 2>&1|grep dht |cut -d = -f2`
        layout1_s=$(echo $layout1 | cut -c 19-26)
        layout1_e=$(echo $layout1 | cut -c 27-34)
-       #echo "layout1 from $layout1_s to $layout1_e" > /dev/tty
+       echo "layout1 from $layout1_s to $layout1_e" > /dev/tty
         layout2=`getfattr -n trusted.glusterfs.dht -e hex $2 2>&1|grep dht |cut -d = -f2`
        layout2_s=$(echo $layout2 | cut -c 19-26)
        layout2_e=$(echo $layout2 | cut -c 27-34)
-       #echo "layout2 from $layout2_s to $layout2_e" > /dev/tty
+       echo "layout2 from $layout2_s to $layout2_e" > /dev/tty

        if [ x"$layout2_s" = x"00000000" ]; then
                # Reverse so we only have the real logic in one place.
@@ -29,7 +30,7 @@ function get_layout()

        # Figure out where the join point is.
        target=$( $PYTHON -c "print '%08x' % (0x$layout1_e + 1)")
-       #echo "target for layout2 = $target" > /dev/tty
+       echo "target for layout2 = $target" > /dev/tty

        # The second layout should cover everything that the first doesn't.
        if [ x"$layout2_s" = x"$target" -a x"$layout2_e" = x"ffffffff" ]; then
@@ -41,26 +42,30 @@ function get_layout()

 BRICK_COUNT=4

-TEST glusterd
+TEST glusterd --log-level DEBUG
 TEST pidof glusterd

 TEST $CLI volume create $V0 $H0:$B0/${V0}0 $H0:$B0/${V0}1 $H0:$B0/${V0}2 $H0:$B0/${V0}3
 ## set subvols-per-dir option
 TEST $CLI volume set $V0 subvols-per-directory 3
 TEST $CLI volume start $V0
+TEST $CLI volume set $V0 client-log-level DEBUG
+TEST $CLI volume set $V0 brick-log-level DEBUG
+

 ## Mount FUSE
 TEST glusterfs -s $H0 --volfile-id $V0 $M0 --entry-timeout=0 --attribute-timeout=0;

 TEST ls -l $M0
+#brick_loc=$(get_backend_paths $M0)

 ## kill 2 bricks to bring down available subvol < spread count
-kill_brick $V0 $H0 $B0/${V0}2
-kill_brick $V0 $H0 $B0/${V0}3
+kill_brick $V0 $H0 $B0/${V0}0
+kill_brick $V0 $H0 $B0/${V0}1

 mkdir $M0/dir1 2>/dev/null

-get_layout $B0/${V0}0/dir1 $B0/${V0}1/dir1
+get_layout $B0/${V0}2/dir1 $B0/${V0}3/dir1
 EXPECT "0" echo $?

 cleanup;


With the above patch, the output below is seen:
=========================
TEST 9 (line 59): ls -l /mnt/glusterfs/0
ok 9, LINENUM:59
RESULT 9: 0
Socket=/var/run/gluster/e90af2b6fbd74dbe.socket
Brick=/d/backends/patchy0
connected
disconnected
OK
Socket=/var/run/gluster/d7212ecddcb22a08.socket
Brick=/d/backends/patchy1
connected
disconnected
OK
layout1 from 00000000 to 7ffffffe
layout2 from 7fffffff to ffffffff
target for layout2 = 7fffffff
=========================
TEST 10 (line 72): 0 echo 0
ok 10, LINENUM:72
RESULT 10: 0
ok
All tests successful.
Files=1, Tests=10, 13 wallclock secs ( 0.03 usr  0.01 sys +  2.05 cusr  0.34 csys =  2.43 CPU)
Result: PASS
=========================
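
For reference, the join-point check that get_layout performs can be sketched with shell arithmetic alone, assuming the same character offsets as the test (19-26 for the range start, 27-34 for the range end). The function names here are hypothetical and are not in the test file.

```shell
# layout_bounds XATTR_HEX: print "START END" taken from a
# trusted.glusterfs.dht value, using the offsets get_layout uses.
layout_bounds() {
    start=$(printf '%s' "$1" | cut -c 19-26)
    end=$(printf '%s' "$1" | cut -c 27-34)
    printf '%s %s\n' "$start" "$end"
}

# next_start END_HEX: the hash value at which the following range must begin.
next_start() {
    printf '%08x\n' $((0x$1 + 1))
}

# ranges_join S1 E1 S2 E2: succeed if range 2 starts right after range 1
# and runs to the end of the hash space.
ranges_join() {
    [ "$3" = "$(next_start "$2")" ] && [ "$4" = "ffffffff" ]
}

# Values from the passing run above.
ranges_join 00000000 7ffffffe 7fffffff ffffffff && echo "layouts join"
# Values from the earlier failing run (no bricks killed).
ranges_join 00000000 00000000 00000000 55555554 || echo "layouts do not join"
```

Both echo lines fire, matching the PASS and FAIL results shown in the two runs.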


Therefore, can these changes be added to the test case under a separate condition for s390x?

Also, we have a few queries on the test's behaviour.
When a directory or a file gets created, as I understand it, it should be placed in a brick depending on the brick's hash range and the hash value of the file/directory.
However, in the above test, as you can see, if we don't kill bricks {2,3}, the directory gets created in all the bricks {0,1,2,3}. So, does it not consider hash values and ranges at this point, or is it something to do with mounting FUSE?

Comment 56 Nithya Balachandran 2019-03-05 14:52:57 UTC
(In reply to abhays from comment #53)
> (In reply to Nithya Balachandran from comment #51)
> > > 
> > > 
> > > I don't think so. I would recommend that you debug the tests on your systems
> > > and post patches which will work on both.
> > 
> > Please note what I am referring to is for you to look at the .t files and
> > modify file names or remove hardcoding as required.
> 
> Yes @Nithya, We understood that you want us to continue debugging the tests
> and provide patches if fix is found.
> While doing the same, we were able to fix the ./tests/bugs/nfs/bug-847622.t
> with the following patch:-
> 
> diff --git a/tests/bugs/nfs/bug-847622.t b/tests/bugs/nfs/bug-847622.t
> index 3b836745a..f21884972 100755
> --- a/tests/bugs/nfs/bug-847622.t
> +++ b/tests/bugs/nfs/bug-847622.t
> @@ -28,7 +32,7 @@ cd $N0
> 
>  # simple getfacl setfacl commands
>  TEST touch testfile
> -TEST setfacl -m u:14:r testfile
> +TEST setfacl -m u:14:r $B0/brick0/testfile
>  TEST getfacl testfile
> 
> Please check, if the above patch can be merged.
> 
> 

This fix is incorrect. The patch changes the test to modify the brick directly, whereas the test is meant to check that these operations succeed on the mount. You need to see why it fails and then we can figure out the fix.

Comment 57 Nithya Balachandran 2019-03-05 15:05:32 UTC
(In reply to abhays from comment #55)
> (In reply to Nithya Balachandran from comment #54)
> > > 
> > > However, the test cases are still failing and only pass if x86 hash values
> > > are provided(Refer to comment#8):-
> > > ./tests/bugs/glusterfs/bug-902610.t
> > > ./tests/bugs/posix/bug-1619720.t
> > 
> > Please provide more information on what changes you tried.
> 
> For tests/bugs/glusterfs/bug-902610.t:-
> In the test case, after the kill_brick function is run, the mkdir $M0/dir1
> doesn't work and hence the get_layout function test fails. So,as a
> workaround we tried not killing the brick and then checked the functionality
> of the test case, after which the dir1 did get created in all the 4 bricks,
> however, the test failed with the following output:-

The mkdir will fail if the hashed brick of the directory being created is down. In your case, the change in hash values means that one of the bricks killed is the hashed subvol for the directory. Killing a different brick should cause it to succeed.

In any case this is not a feature that we support anymore so I can just remove the test case.

> Therefore, can these changes be added in the test case with a condition for
> s390x separately?

I do not think we should separate it out like this. The better way would be to just find 2 bricks that work for both big and little endian.
I will try out your changes on a big endian system and see if this combination will work there as well.

> 
> Also, We have a few queries on the tests behaviour.
> When a directory or a file gets created, according to me, it should be
> placed in the brick depending on its hash range and value of the
> file/directory.
> However, in the above test, as you can see, if we don't kill the
> bricks{2,3}, the directory gets created in all the bricks{0,1,2,3}.So, does
> it not consider hash values and range at this point or is it something to do
> with mounting FUSE?

The way dht creates files and directories is slightly different.

For files, it calculates the hash and creates it in the subvolume in whose directory layout range it falls.
For directories, it first tries to create it on the hashed subvol. If for some reason that fails, it will not be created on the other bricks. In this test, for s390x, one of the bricks killed was the hashed subvol so mkdir fails. 
The solution here is to make sure the bricks being killed are not the hashed subvol in either big or little endian systems.
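
A small sketch of the range lookup described above: given a hash and a list of layout ranges, find the brick whose range contains it. hashed_brick is a hypothetical helper, the brick names and ranges are taken from the passing run in comment 55, and the two hash values are invented placeholders standing in for the differing big-endian and little-endian results.

```shell
# hashed_brick HASH_HEX "NAME START END"...: print the name of the first
# brick whose [START, END] range contains HASH_HEX; fail if none does.
hashed_brick() {
    hash=$((0x$1))
    shift
    for entry in "$@"; do
        name=${entry%% *}
        rest=${entry#* }
        start=${rest%% *}
        end=${rest#* }
        if [ "$hash" -ge $((0x$start)) ] && [ "$hash" -le $((0x$end)) ]; then
            printf '%s\n' "$name"
            return 0
        fi
    done
    return 1
}

# Placeholder hashes only; real values come from the dht hash of the name.
hashed_brick 12345678 "patchy2 00000000 7ffffffe" "patchy3 7fffffff ffffffff"
hashed_brick 9abcdef0 "patchy2 00000000 7ffffffe" "patchy3 7fffffff ffffffff"
```

If the name hashes into different ranges on the two architectures, any brick returned for either hash is one the test should leave running.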

Comment 58 abhays 2019-03-07 09:43:11 UTC
(In reply to Nithya Balachandran from comment #56)
> (In reply to abhays from comment #53)
> > (In reply to Nithya Balachandran from comment #51)
> > > > 
> > > > 
> > > > I don't think so. I would recommend that you debug the tests on your systems
> > > > and post patches which will work on both.
> > > 
> > > Please note what I am referring to is for you to look at the .t files and
> > > modify file names or remove hardcoding as required.
> > 
> > Yes @Nithya, We understood that you want us to continue debugging the tests
> > and provide patches if fix is found.
> > While doing the same, we were able to fix the ./tests/bugs/nfs/bug-847622.t
> > with the following patch:-
> > 
> > diff --git a/tests/bugs/nfs/bug-847622.t b/tests/bugs/nfs/bug-847622.t
> > index 3b836745a..f21884972 100755
> > --- a/tests/bugs/nfs/bug-847622.t
> > +++ b/tests/bugs/nfs/bug-847622.t
> > @@ -28,7 +32,7 @@ cd $N0
> > 
> >  # simple getfacl setfacl commands
> >  TEST touch testfile
> > -TEST setfacl -m u:14:r testfile
> > +TEST setfacl -m u:14:r $B0/brick0/testfile
> >  TEST getfacl testfile
> > 
> > Please check, if the above patch can be merged.
> > 
> > 
> 
> This fix is incorrect. The patch changes the test to modify the brick
> directly while the test is to check that these operations succeed on the
> mount. You need to see why it fails and then we can figure out the fix.

Okay, thanks for the clarification. Below are some observations I made for this test case:
When the brick is not changed and the test case is left as it is, the following happens on s390x:
getfacl /d/backends/brick0/testfile
getfacl: Removing leading '/' from absolute path names
# file: d/backends/brick0/testfile
# owner: root
# group: root
user::rw-
group::r--
other::r--

Whereas, on x86,
getfacl /d/backends/brick0/testfile
getfacl: Removing leading '/' from absolute path names
# file: d/backends/brick0/testfile
# owner: root
# group: root
user::rw-
user:14:r--
group::r--
mask::r--
other::r--

Since the setfacl command fails, the above behaviour is seen. When I checked the logs,
on s390x, this is shown:
D [MSGID: 0] [client-rpc-fops_v2.c:887:client4_0_getxattr_cbk] 0-patchy-client-0: remote operation failed: No data available. Path: /testfile (fa921dc9-41a3-4fad-9fab-2c0933e54e38). Key: system.posix_acl_access
On x86, this is shown:-
D [MSGID: 0] [nfs3-helpers.c:1660:nfs3_log_fh_entry_call] 0-nfs-nfsv3: XID: a2d2141c, LOOKUP: args: FH: exportid d7f43849-b25a-49d2-8084-aefb8d7797f2, gfid 00000000-0000-0000-0000-000000000001, mountid 8d32c8d1-0000-0000-0000-000000000000, name: libacl.so.1

Therefore, I tried remounting with acl in the test case and also tried adding acl in /etc/fstab, in the following ways:
In the test case-------> mount -o remount,acl /
In /etc/fstab----------> /dev     /boot/zipl  ext2  defaults,acl        0  2

However, the test case still fails. So, can you please provide us with some details as to what happens when the commands
EXPECT_WITHIN $NFS_EXPORT_TIMEOUT "1" is_nfs_export_available;
TEST mount_nfs $H0:/$V0 $N0 nolock
are run in the test case?
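
The difference between the two getfacl outputs above can be checked mechanically. The sketch below is only an illustration; has_acl_entry is a hypothetical name, and the two samples are the s390x and x86 outputs from this comment with the header lines dropped.

```shell
# has_acl_entry ENTRY: read getfacl output on stdin and succeed only if
# one line matches ENTRY exactly (for example "user:14:r--").
has_acl_entry() {
    grep -q -x -F -- "$1"
}

s390x_out='user::rw-
group::r--
other::r--'

x86_out='user::rw-
user:14:r--
group::r--
mask::r--
other::r--'

printf '%s\n' "$x86_out" | has_acl_entry 'user:14:r--' && echo "x86: ACL entry set"
printf '%s\n' "$s390x_out" | has_acl_entry 'user:14:r--' || echo "s390x: ACL entry missing"
```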

Comment 59 abhays 2019-03-07 09:44:07 UTC
(In reply to Nithya Balachandran from comment #57)
> (In reply to abhays from comment #55)
> > (In reply to Nithya Balachandran from comment #54)
> > > > 
> > > > However, the test cases are still failing and only pass if x86 hash values
> > > > are provided(Refer to comment#8):-
> > > > ./tests/bugs/glusterfs/bug-902610.t
> > > > ./tests/bugs/posix/bug-1619720.t
> > > 
> > > Please provide more information on what changes you tried.
> > 
> > For tests/bugs/glusterfs/bug-902610.t:-
> > In the test case, after the kill_brick function is run, the mkdir $M0/dir1
> > doesn't work and hence the get_layout function test fails. So,as a
> > workaround we tried not killing the brick and then checked the functionality
> > of the test case, after which the dir1 did get created in all the 4 bricks,
> > however, the test failed with the following output:-
> 
> The mkdir function will fail if the hashed brick of the directory being
> created is down. In your case, the change in hashed values means the brick
> that was killed is the hashed subvol for the directory. Killing a different
> brick  should cause it to succeed.
> 
> In any case this is not a feature that we support anymore so I can just
> remove the test case.
> 
> > Therefore, can these changes be added in the test case with a condition for
> > s390x separately?
> 
> I do not think we should separate it out like this. The better way would be
> to just find 2 bricks that work for both big and little endian.
> I will try out your changes on a big endian system and see if this
> combination will work there as well.
> 
> > 
> > Also, We have a few queries on the tests behaviour.
> > When a directory or a file gets created, according to me, it should be
> > placed in the brick depending on its hash range and value of the
> > file/directory.
> > However, in the above test, as you can see, if we don't kill the
> > bricks{2,3}, the directory gets created in all the bricks{0,1,2,3}.So, does
> > it not consider hash values and range at this point or is it something to do
> > with mounting FUSE?
> 
> The way dht creates files and directories is slightly different.
> 
> For files, it calculates the hash and creates it in the subvolume in whose
> directory layout range it falls.
> For directories, it first tries to create it on the hashed subvol. If for
> some reason that fails, it will not be created on the other bricks. In this
> test, for s390x, one of the bricks killed was the hashed subvol so mkdir
> fails. 
> The solution here is to make sure the bricks being killed are not the hashed
> subvol in either big or little endian systems.

Thanks for this explanation.

Comment 60 abhays 2019-04-30 12:21:57 UTC
Hi @Nithya,

Any updates on this issue?
It seems that the same test cases are failing in GlusterFS v6.1, along with these additional ones:
./tests/bugs/replicate/bug-1655854-support-dist-to-rep3-arb-conversion.t
./tests/features/fuse-lru-limit.t

One more query with respect to these failures: do they affect the main functionality of GlusterFS, or can they be ignored for now?
Please let us know.


Also, s390x systems have been added to the gluster-ci. Any updates regarding that?

Comment 61 abhays 2019-04-30 12:22:48 UTC
(In reply to abhays from comment #52)
> (In reply to Raghavendra Bhat from comment #50)
> > Hi,
> > 
> > Thanks for the logs. From the logs saw that the following things are
> > happening.
> > 
> > 1) The scrubbing is started
> > 
> > 2) Scrubber always decides whether a file is corrupted or not by comparing
> > the stored on-disk signature (gets by getxattr) with its own calculated
> > signature of the file.
> > 
> > 3) Here, while getting the on-disk signature, getxattr is failing with
> > ENOMEM (i.e. Cannot allocate memory) because of the endianness.
> > 
> > 4) Further testcases in the test fail because, they expect the bad-file
> > extended attribute to be present which scrubber could not set because of the
> > above error (i.e. had it been able to successfully get the signature of the
> > file via getxattr, it would have been able to compare the signature with its
> > own calculated signature and set the bad-file extended attribute to indicate
> > the file is corrupted).
> > 
> > 
> > Looking at the code to come up with a fix to address this.
> 
> Thanks for the reply @Raghavendra. We are also looking into the same.

Any Updates on this @Raghavendra?

Comment 62 Raghavendra Bhat 2019-04-30 20:05:01 UTC
(In reply to abhays from comment #61)
> (In reply to abhays from comment #52)
> > (In reply to Raghavendra Bhat from comment #50)
> > > Hi,
> > > 
> > > Thanks for the logs. From the logs saw that the following things are
> > > happening.
> > > 
> > > 1) The scrubbing is started
> > > 
> > > 2) Scrubber always decides whether a file is corrupted or not by comparing
> > > the stored on-disk signature (gets by getxattr) with its own calculated
> > > signature of the file.
> > > 
> > > 3) Here, while getting the on-disk signature, getxattr is failing with
> > > ENOMEM (i.e. Cannot allocate memory) because of the endianness.
> > > 
> > > 4) Further testcases in the test fail because, they expect the bad-file
> > > extended attribute to be present which scrubber could not set because of the
> > > above error (i.e. had it been able to successfully get the signature of the
> > > file via getxattr, it would have been able to compare the signature with its
> > > own calculated signature and set the bad-file extended attribute to indicate
> > > the file is corrupted).
> > > 
> > > 
> > > Looking at the code to come up with a fix to address this.
> > 
> > Thanks for the reply @Raghavendra. We are also looking into the same.
> 
> Any Updates on this @Raghavendra?

I am still working on a fix for this.

Comment 63 Nithya Balachandran 2019-05-02 04:01:29 UTC
(In reply to abhays from comment #60)
> Hi @Nithya,
> 
> Any updates on this issue?
> Seems that the same test cases are failing in the Glusterfs v6.1 with
> additional ones:-
> ./tests/bugs/replicate/bug-1655854-support-dist-to-rep3-arb-conversion.t
> ./tests/features/fuse-lru-limit.t
> 
> And one query we have with respect to these failures whether they affect the
> main functionality of Glusterfs or they can be ignored for now?
> Please let us know.
> 
> 
> Also, s390x systems have been added on the gluster-ci. Any updates regards
> to that?

I am no longer working on this. @Amar, please assign this to the appropriate person.

Comment 64 Amar Tumballi 2019-05-17 10:38:26 UTC
Will keep it in my name as I am yet to set up a team on this (and ARM).

Comment 65 abhays 2019-05-17 11:01:32 UTC
(In reply to Amar Tumballi from comment #64)
> Will keep it in my name as I am yet to setup a team on this (and ARM).

Thanks for the reply @Amar.


> And one query we have with respect to these failures whether they affect the
> main functionality of Glusterfs or they can be ignored for now?
> Please let us know.
> 
> 
> Also, s390x systems have been added on the gluster-ci. Any updates regards
> to that?

@Amar, could you please comment on this as well?

Comment 66 abhays 2019-07-24 05:38:46 UTC
@Amar, Any updates on this bug?

Comment 67 Amar Tumballi 2019-07-30 04:59:35 UTC
Noticed that most of the failures are because of the errors below:


[2019-06-24 03:33:51.569436] I [MSGID: 100030] [glusterfsd.c:2867:main] 0-/usr/local/sbin/glusterfsd: Started running /usr/local/sbin/glusterfsd version 7dev (args: /usr/local/sbin/glusterfsd -s gluster-rh7-1-1.novalocal --volfile-id patchy.gluster-rh7-1-1.novalocal.d-backends-patchy0 -p /var/run/gluster/vols/patchy/gluster-rh7-1-1.novalocal-d-backends-patchy0.pid -S /var/run/gluster/eb0f5797853f4e92.socket --brick-name /d/backends/patchy0 -l /var/log/glusterfs/bricks/d-backends-patchy0.log --xlator-option *-posix.glusterd-uuid=44c0f533-4993-4372-a5b7-1dcd0eeff367 --process-name brick --brick-port 49152 --xlator-option patchy-server.listen-port=49152) 
[2019-06-24 03:33:51.569991] I [glusterfsd.c:2594:daemonize] 0-glusterfs: Pid of current running process is 16385
[2019-06-24 03:33:51.574799] I [socket.c:955:__socket_server_bind] 0-socket.glusterfsd: closing (AF_UNIX) reuse check socket 9
[2019-06-24 03:33:51.601265] I [MSGID: 101190] [event-epoll.c:674:event_dispatch_epoll_worker] 0-epoll: Started thread with index 0 
[2019-06-24 03:33:51.601304] I [MSGID: 101190] [event-epoll.c:674:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 
[2019-06-24 03:33:52.599189] I [glusterfsd-mgmt.c:2671:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: gluster-rh7-1-1.novalocal
[2019-06-24 03:33:52.599256] I [glusterfsd-mgmt.c:2691:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers


Because of these, volume start fails, and hence most of the tests are failing.
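
To see whether a given brick log shows the same symptom, a filter along these lines could be used. It is a sketch only: volfile_failures is a hypothetical name, and it simply looks for the "Exhausted all volfile servers" message quoted above and prints the timestamp of each occurrence.

```shell
# volfile_failures: read a brick log on stdin and print the timestamp of
# every "Exhausted all volfile servers" line (nothing if there are none).
volfile_failures() {
    grep -F 'Exhausted all volfile servers' | cut -d ']' -f 1 | tr -d '['
}

# Sample line copied from the log excerpt above.
sample='[2019-06-24 03:33:52.599256] I [glusterfsd-mgmt.c:2691:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers'
printf '%s\n' "$sample" | volfile_failures
```

Against a real run, the input would be a brick log such as /var/log/glusterfs/bricks/d-backends-patchy0.log.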

Comment 68 abhays 2019-08-27 12:11:44 UTC
@Amar, could you please share the details of the environment and the log file in which you encountered this error?

Comment 70 abhays 2019-10-01 06:01:07 UTC
(In reply to abhays from comment #68)
> @Amar Could you please share the details of the environment and the log
> file, you encountered this error on?

Any Updates on this @Amar?

