Created attachment 1247149
oc logs po/hawkular-cassandra-1-0l1c5

Description of problem:
Hawkular metrics fails to deploy

# oc get po
NAME                         READY     STATUS             RESTARTS   AGE
hawkular-cassandra-1-0l1c5   0/1       CrashLoopBackOff   11         37m
hawkular-metrics-kj7gz       0/1       CrashLoopBackOff   5          37m
heapster-7rnsp               0/1       Running            4          37m
metrics-deployer-9b07a       1/1       Running            0          37m
ruby-hello-world-2-ctwsj     1/1       Running            0          37m

Version-Release number of selected component (if applicable):
OSE 3.4

How reproducible:
Every time

Steps to Reproduce:
1. wget https://raw.githubusercontent.com/openshift/openshift-ansible/master/roles/openshift_hosted_templates/files/v1.4/enterprise/metrics-deployer.yaml

2. Create PV
oc create -f - << EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: metrics-volume
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /mnt
EOF

3. Create serviceaccounts
oc create -f - << EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: metrics-deployer
secrets:
- name: metrics-deployer
EOF

4. Create policy stuff
oadm policy add-role-to-user \
  edit system:serviceaccount:openshift-infra:metrics-deployer
oadm policy add-cluster-role-to-user \
  cluster-reader system:serviceaccount:openshift-infra:heapster

5. Create secrets
oc secrets new metrics-deployer nothing=/dev/null

6. Initiate deployment
oc new-app \
  --as=system:serviceaccount:openshift-infra:metrics-deployer \
  -f metrics-deployer.yaml \
  -p CASSANDRA_PV_SIZE=100Gi \
  -p \
  -p \
  -p HAWKULAR_METRICS_HOSTNAME=metrics.default.abc1.feedhenry.net

Actual results:

Expected results:
hawkular and heapster pods should be running correctly...

# oc get po
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-0l1c5   1/1       Running   0          37m
hawkular-metrics-kj7gz       1/1       Running   0          37m
heapster-7rnsp               1/1       Running   0          37m
metrics-deployer-9b07a       1/1       Stopped   0          37m
ruby-hello-world-2-ctwsj     1/1       Running   0          37m

Additional info:
Following your steps exactly, I cannot reproduce this on my system.

From the log, it appears that this is failing with:

Exception (java.lang.IllegalArgumentException) encountered during startup: Out of range: -2199023255552

This is caused by https://github.com/apache/cassandra/blob/cassandra-3.0.9/src/java/org/apache/cassandra/config/DatabaseDescriptor.java#L526 which would indicate that something strange is going on when it tries to determine the size of the commit log directory (which should be /cassandra_data/commitlog).

Is there anything special about this directory or the filesystem it is on?
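For anyone trying to make sense of the number in that exception, here is a minimal sketch of arithmetic that produces it. It rests on two assumptions that are not confirmed in this report: that the startup code takes the file store's total space in bytes, divides by 1048576 to get MB and then by 4, and that the file store reported Long.MIN_VALUE as its total space. The class and method names are illustrative, and checkedCast here is a stand-in written for this sketch rather than the Guava method itself.

```java
// Sketch only: reproduces the value in "Out of range: -2199023255552".
// Assumption: the file store reported its total space as Long.MIN_VALUE.
class CommitLogOverflowDemo {

    // Stand-in for a checked long-to-int cast: reject values that do not fit in an int.
    static int checkedCast(long value) {
        int result = (int) value;
        if (result != value) {
            throw new IllegalArgumentException("Out of range: " + value);
        }
        return result;
    }

    // A quarter of the total space, expressed in MB (assumed startup computation).
    static long quarterOfTotalSpaceInMb(long totalSpaceBytes) {
        return (totalSpaceBytes / 1048576L) / 4;
    }

    public static void main(String[] args) {
        long reported = Long.MIN_VALUE; // assumed reading for the commit log file store
        long quarterMb = quarterOfTotalSpaceInMb(reported);
        System.out.println("quarter of total space (MB): " + quarterMb);
        try {
            checkedCast(quarterMb);
        } catch (IllegalArgumentException e) {
            System.out.println("startup would fail with: " + e.getMessage());
        }
    }
}
```

Under those assumptions the quotient is -2^41, which matches the logged value and cannot be represented as an int, so the cast is rejected.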
Since this is being deployed on AWS, the disk size reported for the directory is huge (exabytes), which is causing issues with the code.

It looks like we are running into this Java bug: https://bugs.openjdk.java.net/browse/JDK-8162520

Cassandra possibly does not account for this much disk space either (https://github.com/apache/cassandra/blob/cassandra-3.0.9/src/java/org/apache/cassandra/config/DatabaseDescriptor.java#L526).

This check can be overridden by setting a value in cassandra.yaml, but we currently don't expose this option. We may have to update our Cassandra start script to take this option into consideration and allow setting it via an envar or property.
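To illustrate the idea of letting an operator-supplied value bypass the size detection, here is a rough sketch. It is not code from Cassandra or from origin-metrics; the class, the COMMITLOG_TOTAL_SPACE_IN_MB variable name, and the 8192 MB fallback cap are all hypothetical choices made for this example.

```java
// Sketch only: an explicit setting wins, otherwise fall back to a guarded
// size calculation. All names and the default cap are hypothetical.
class CommitLogSpaceResolver {

    static final int PREFERRED_MB = 8192; // assumed cap, for illustration only

    // Parse an optional setting; null or blank means "not set".
    static Integer parseSetting(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return null;
        }
        return Integer.valueOf(raw.trim());
    }

    // explicitMb: operator-supplied size in MB, or null when not set.
    // totalSpaceBytes: size reported by the file store; may be negative on overflow.
    static int resolve(Integer explicitMb, long totalSpaceBytes) {
        if (explicitMb != null) {
            if (explicitMb <= 0) {
                throw new IllegalArgumentException("commit log space must be positive: " + explicitMb);
            }
            return explicitMb; // skip size detection entirely
        }
        if (totalSpaceBytes < 0) {
            // The reported size overflowed; treat the volume as effectively unbounded.
            return PREFERRED_MB;
        }
        long quarterMb = (totalSpaceBytes / 1048576L) / 4;
        return (int) Math.min(quarterMb, PREFERRED_MB);
    }

    public static void main(String[] args) {
        // Hypothetical environment variable name.
        Integer explicit = parseSetting(System.getenv("COMMITLOG_TOTAL_SPACE_IN_MB"));
        System.out.println("commit log space (MB): " + resolve(explicit, Long.MIN_VALUE));
    }
}
```

The design choice here is that an explicit value is never second-guessed, while a negative reading is treated as "very large" rather than as an error, so startup can proceed on volumes that report no practical ceiling.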
Yes. AWS/EFS is an exabyte-size NFS target. There really is no ceiling, so it is not a good data point on which to base Cassandra sizing.

Red Hat Mobile / SaaS has a deadline of June 1, 2017 to deploy an OpenShift-based eval environment for customers. We have plenty of time to get this right. Review what needs to be reviewed. Prioritize the priorities.

Thanks,
/Chris Callegari
The best you are going to have for the time being is the workaround we are working on (https://github.com/openshift/origin-metrics/pull/292). It will require manual intervention. Getting this working properly upstream is most likely going to take much longer.
Longer than June 1??
Running this with an exabyte-sized filesystem is not supported by the underlying software (e.g. the JDK). If you want this to work properly and automatically, you will need to change how you are running this.
That comment alienates customers wanting to deploy OpenShift to AWS and use AWS/EFS as a storage target for Hawkular.
Same behavior with NFS based persistent volume

# oc edit pv/metrics-volume
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/bound-by-controller: "yes"
  creationTimestamp: 2017-02-09T18:29:59Z
  name: metrics-volume
  resourceVersion: "3647"
  selfLink: /api/v1/persistentvolumes/metrics-volume
  uid: c351e1dd-eef5-11e6-8511-06eb61e6059c
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 1000Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: metrics-cassandra-1
    namespace: openshift-infra
    resourceVersion: "3644"
    uid: 302a0cdb-eef6-11e6-8f3b-0a16016f2cb4
  nfs:
    path: //metrics
    server: fs-c35efe8a.efs.us-east-1.amazonaws.com
  persistentVolumeReclaimPolicy: Retain
status:
  phase: Bound

# oc logs -f po/hawkular-cassandra-1-mfcql
The MAX_HEAP_SIZE envar is not set. Basing the MAX_HEAP_SIZE on the available memory limit for the pod (7933222912).
The memory limit is between 4 and 32GB. Using 1/4 of the available memory for the max_heap_size.
The MAX_HEAP_SIZE has been set to 1891M
THE HEAP_NEWSIZE envar is not set. Setting to 200M based on the CPU_LIMIT of 2000. [100M per CPU core]
About to generate seeds
Trying to access the Seed list [try #1]
Trying to access the Seed list [try #2]
Trying to access the Seed list [try #3]
Setting seeds to be hawkular-cassandra-1-mfcql
The previous version of Cassandra was 3.0.9.redhat-1.
The current version is 3.0.9.redhat-1
cat: /etc/ld.so.conf.d/*.conf: No such file or directory
Picked up JAVA_TOOL_OPTIONS: -Duser.home=/home/jboss -Duser.name=jboss
OpenJDK 64-Bit Server VM warning: Cannot open file /opt/apache-cassandra/logs/gc.log due to No such file or directory
CompilerOracle: dontinline org/apache/cassandra/db/Columns$Serializer.deserializeLargeSubset (Lorg/apache/cassandra/io/util/DataInputPlus;Lorg/apache/cassandra/db/Columns;I)Lorg/apache/cassandra/db/Columns;
CompilerOracle: dontinline org/apache/cassandra/db/Columns$Serializer.serializeLargeSubset (Ljava/util/Collection;ILorg/apache/cassandra/db/Columns;ILorg/apache/cassandra/io/util/DataOutputPlus;)V
CompilerOracle: dontinline org/apache/cassandra/db/Columns$Serializer.serializeLargeSubsetSize (Ljava/util/Collection;ILorg/apache/cassandra/db/Columns;I)I
CompilerOracle: dontinline org/apache/cassandra/db/transform/BaseIterator.tryGetMoreContents ()Z
CompilerOracle: dontinline org/apache/cassandra/db/transform/StoppingTransformation.stop ()V
CompilerOracle: dontinline org/apache/cassandra/db/transform/StoppingTransformation.stopInPartition ()V
CompilerOracle: dontinline org/apache/cassandra/io/util/BufferedDataOutputStreamPlus.doFlush (I)V
CompilerOracle: dontinline org/apache/cassandra/io/util/BufferedDataOutputStreamPlus.writeExcessSlow ()V
CompilerOracle: dontinline org/apache/cassandra/io/util/BufferedDataOutputStreamPlus.writeSlow (JI)V
CompilerOracle: dontinline org/apache/cassandra/io/util/RebufferingInputStream.readPrimitiveSlowly (I)J
CompilerOracle: inline org/apache/cassandra/io/util/Memory.checkBounds (JJ)V
CompilerOracle: inline org/apache/cassandra/io/util/SafeMemory.checkBounds (JJ)V
CompilerOracle: inline org/apache/cassandra/utils/AsymmetricOrdering.selectBoundary (Lorg/apache/cassandra/utils/AsymmetricOrdering/Op;II)I
CompilerOracle: inline org/apache/cassandra/utils/AsymmetricOrdering.strictnessOfLessThan (Lorg/apache/cassandra/utils/AsymmetricOrdering/Op;)I
CompilerOracle: inline org/apache/cassandra/utils/ByteBufferUtil.compare (Ljava/nio/ByteBuffer;[B)I
CompilerOracle: inline org/apache/cassandra/utils/ByteBufferUtil.compare ([BLjava/nio/ByteBuffer;)I
CompilerOracle: inline org/apache/cassandra/utils/ByteBufferUtil.compareUnsigned (Ljava/nio/ByteBuffer;Ljava/nio/ByteBuffer;)I
CompilerOracle: inline org/apache/cassandra/utils/FastByteOperations$UnsafeOperations.compareTo (Ljava/lang/Object;JILjava/lang/Object;JI)I
CompilerOracle: inline org/apache/cassandra/utils/FastByteOperations$UnsafeOperations.compareTo (Ljava/lang/Object;JILjava/nio/ByteBuffer;)I
CompilerOracle: inline org/apache/cassandra/utils/FastByteOperations$UnsafeOperations.compareTo (Ljava/nio/ByteBuffer;Ljava/nio/ByteBuffer;)I
CompilerOracle: inline org/apache/cassandra/utils/vint/VIntCoding.encodeVInt (JI)[B
INFO 18:40:09 Configuration location: file:/opt/apache-cassandra-3.0.9.redhat-1/conf/cassandra.yaml
INFO 18:40:09 Node configuration:[allocate_tokens_for_keyspace=null; authenticator=AllowAllAuthenticator; authorizer=AllowAllAuthorizer; auto_bootstrap=true; auto_snapshot=true; batch_size_fail_threshold_in_kb=50; batch_size_warn_threshold_in_kb=5; batchlog_replay_throttle_in_kb=1024; broadcast_address=null; broadcast_rpc_address=null; buffer_pool_use_heap_if_exhausted=true; cas_contention_timeout_in_ms=1000; client_encryption_options=<REDACTED>; cluster_name=hawkular-metrics; column_index_size_in_kb=64; commit_failure_policy=stop; commitlog_compression=LZ4Compressor; commitlog_directory=/cassandra_data/commitlog; commitlog_max_compression_buffers_in_pool=3; commitlog_periodic_queue_size=-1; commitlog_segment_size_in_mb=32; commitlog_sync=periodic; commitlog_sync_batch_window_in_ms=null; commitlog_sync_period_in_ms=10000; commitlog_total_space_in_mb=null; compaction_large_partition_warning_threshold_mb=100; compaction_throughput_mb_per_sec=16; concurrent_compactors=null; concurrent_counter_writes=32; concurrent_materialized_view_writes=32;
concurrent_reads=32; concurrent_replicates=null; concurrent_writes=32; counter_cache_keys_to_save=2147483647; counter_cache_save_period=7200; counter_cache_size_in_mb=null; counter_write_request_timeout_in_ms=5000; cross_node_timeout=false; data_file_directories=[Ljava.lang.String;@105fece7; disk_access_mode=auto; disk_failure_policy=stop; disk_optimization_estimate_percentile=0.95; disk_optimization_page_cross_chance=0.1; disk_optimization_strategy=ssd; dynamic_snitch=true; dynamic_snitch_badness_threshold=0.1; dynamic_snitch_reset_interval_in_ms=600000; dynamic_snitch_update_interval_in_ms=100; enable_scripted_user_defined_functions=false; enable_user_defined_functions=false; enable_user_defined_functions_threads=true; encryption_options=null; endpoint_snitch=SimpleSnitch; file_cache_size_in_mb=512; gc_log_threshold_in_ms=200; gc_warn_threshold_in_ms=1000; hinted_handoff_disabled_datacenters=[]; hinted_handoff_enabled=true; hinted_handoff_throttle_in_kb=1024; hints_compression=null; hints_directory=null; hints_flush_period_in_ms=10000; incremental_backups=false; index_interval=null; index_summary_capacity_in_mb=null; index_summary_resize_interval_in_minutes=60; initial_token=null; inter_dc_stream_throughput_outbound_megabits_per_sec=200; inter_dc_tcp_nodelay=false; internode_authenticator=null; internode_compression=all; internode_recv_buff_size_in_bytes=null; internode_send_buff_size_in_bytes=null; key_cache_keys_to_save=2147483647; key_cache_save_period=14400; key_cache_size_in_mb=null; listen_address=hawkular-cassandra-1-mfcql; listen_interface=null; listen_interface_prefer_ipv6=false; listen_on_broadcast_address=false; max_hint_window_in_ms=10800000; max_hints_delivery_threads=2; max_hints_file_size_in_mb=128; max_mutation_size_in_kb=null; max_streaming_retries=3; max_value_size_in_mb=256; memtable_allocation_type=heap_buffers; memtable_cleanup_threshold=null; memtable_flush_writers=null; memtable_heap_space_in_mb=null; memtable_offheap_space_in_mb=null; 
min_free_space_per_drive_in_mb=50; native_transport_max_concurrent_connections=-1; native_transport_max_concurrent_connections_per_ip=-1; native_transport_max_frame_size_in_mb=256; native_transport_max_threads=128; native_transport_port=9042; native_transport_port_ssl=null; num_tokens=256; otc_coalescing_strategy=TIMEHORIZON; otc_coalescing_window_us=200; partitioner=org.apache.cassandra.dht.Murmur3Partitioner; permissions_cache_max_entries=1000; permissions_update_interval_in_ms=-1; permissions_validity_in_ms=2000; phi_convict_threshold=8.0; range_request_timeout_in_ms=10000; read_request_timeout_in_ms=5000; request_scheduler=org.apache.cassandra.scheduler.NoScheduler; request_scheduler_id=null; request_scheduler_options=null; request_timeout_in_ms=10000; role_manager=CassandraRoleManager; roles_cache_max_entries=1000; roles_update_interval_in_ms=-1; roles_validity_in_ms=2000; row_cache_class_name=org.apache.cassandra.cache.OHCProvider; row_cache_keys_to_save=2147483647; row_cache_save_period=0; row_cache_size_in_mb=0; rpc_address=hawkular-cassandra-1-mfcql; rpc_interface=null; rpc_interface_prefer_ipv6=false; rpc_keepalive=true; rpc_listen_backlog=50; rpc_max_threads=2147483647; rpc_min_threads=16; rpc_port=9160; rpc_recv_buff_size_in_bytes=null; rpc_send_buff_size_in_bytes=null; rpc_server_type=sync; saved_caches_directory=null; seed_provider=org.apache.cassandra.locator.SimpleSeedProvider{seeds=hawkular-cassandra-1-mfcql}; server_encryption_options=<REDACTED>; snapshot_before_compaction=false; ssl_storage_port=7001; sstable_preemptive_open_interval_in_mb=50; start_native_transport=true; start_rpc=false; storage_port=7000; stream_throughput_outbound_megabits_per_sec=200; streaming_socket_timeout_in_ms=86400000; thrift_framed_transport_size_in_mb=15; thrift_max_message_length_in_mb=16; tombstone_failure_threshold=100000; tombstone_warn_threshold=1000; tracetype_query_ttl=86400; tracetype_repair_ttl=604800; trickle_fsync=false; trickle_fsync_interval_in_kb=10240; 
truncate_request_timeout_in_ms=60000; unlogged_batch_across_partitions_warn_threshold=10; user_defined_function_fail_timeout=1500; user_defined_function_warn_timeout=500; user_function_timeout_policy=die; windows_timer_interval=1; write_request_timeout_in_ms=2000]
INFO 18:40:09 DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap
INFO 18:40:09 Global memtable on-heap threshold is enabled at 468MB
INFO 18:40:09 Global memtable off-heap threshold is enabled at 468MB
Exception (java.lang.IllegalArgumentException) encountered during startup: Out of range: -2199023255552
ERROR 18:40:09 Exception encountered during startup
java.lang.IllegalArgumentException: Out of range: -2199023255552
    at com.google.common.primitives.Ints.checkedCast(Ints.java:91) ~[guava-18.0.jar:na]
    at org.apache.cassandra.config.DatabaseDescriptor.applyConfig(DatabaseDescriptor.java:526) ~[apache-cassandra-3.0.9.redhat-1.jar:3.0.9.redhat-1]
    at org.apache.cassandra.config.DatabaseDescriptor.<clinit>(DatabaseDescriptor.java:119) ~[apache-cassandra-3.0.9.redhat-1.jar:3.0.9.redhat-1]
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:543) [apache-cassandra-3.0.9.redhat-1.jar:3.0.9.redhat-1]
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:696) [apache-cassandra-3.0.9.redhat-1.jar:3.0.9.redhat-1]
java.lang.IllegalArgumentException: Out of range: -2199023255552
    at com.google.common.primitives.Ints.checkedCast(Ints.java:91)
    at org.apache.cassandra.config.DatabaseDescriptor.applyConfig(DatabaseDescriptor.java:526)
    at org.apache.cassandra.config.DatabaseDescriptor.<clinit>(DatabaseDescriptor.java:119)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:543)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:696)
Just an FYI: NFS is not recommended for use with metrics, as it tends to perform poorly even with modestly sized clusters.

We are looking into this issue, but it looks like the workaround may not work. The upstream issue is https://issues.apache.org/jira/browse/CASSANDRA-13067
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0884