Thursday, May 3, 2018

Best Practices for Running Cassandra on AWS



This post explains how to run Cassandra on AWS effectively; in other words, the best approaches to running Cassandra on AWS. AWS is one of the largest cloud platforms in the world. I recently gave a presentation on this topic, and I would like to outline it here for your reference.


What is Cassandra?
Apache Cassandra is a massively scalable, open source NoSQL database.
It delivers:
  • Continuous availability,
  • Linear scalability,
  • Operational simplicity across many commodity servers with no single point of failure,
  • A masterless, peer-to-peer distributed system where data is distributed among all nodes in the cluster.


What is a Cassandra Node?

  • A physical server or an EC2 instance.
  • Each machine has one installation of Cassandra.
  • A node in a cluster is a fully functional machine that is connected to the other nodes in the cluster through a high-speed internal network.
  • All nodes work together so that even if one of them fails due to an unexpected error, the cluster as a whole can still provide service.
  • All nodes in a Cassandra cluster are the same.
  • AWS provides a well-suited platform for running a Cassandra cluster.

How does Cassandra support High Availability?
  • Cassandra is designed to be fault-tolerant and to remain highly available during multiple node failures.
  • AWS Regions and Availability Zones can be used for deployment.
  • Resiliency is ensured through infrastructure automation.
  • Failing nodes can be replaced quickly.
  • In case of a Region-wide failure, if we deploy with the multi-Region option, traffic can be directed to the other active Region.

Deploying Cassandra on AWS
  • Deploying Cassandra on Amazon EC2 can be automated.
  • AWS CloudFormation allows you to describe and provision all your infrastructure resources in AWS. There is no additional charge for it; you pay only for the AWS resources it creates.
  • Cassandra common design patterns on AWS
    • Single AWS Region, 3 Availability Zones
    • Active-Active, Multi-Region
    • Active-Standby, Multi-Region
Single Region, 3 Availability Zones
  • Deploy the Cassandra cluster in one AWS Region and three Availability Zones.
  • There is only one ring in the cluster. 
  • By using EC2 instances in three zones, you ensure that the replicas are distributed uniformly in all zones.
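
To make the replica placement concrete: with an EC2-aware snitch such as Ec2Snitch, Cassandra treats the Region as a datacenter and each Availability Zone as a rack, and NetworkTopologyStrategy tries to put replicas on different racks. The sketch below only builds the CQL statement for such a keyspace. The keyspace name, the datacenter name, and the replication factor of 3 (one replica per zone) are assumptions for illustration, not values from the presentation.

```python
def create_keyspace_cql(keyspace, replicas_per_dc):
    """Build a CREATE KEYSPACE statement that uses NetworkTopologyStrategy.

    replicas_per_dc maps a datacenter name to its replication factor.
    """
    if not replicas_per_dc:
        raise ValueError("at least one datacenter is required")
    # Sort so the generated statement is deterministic.
    options = ", ".join(
        f"'{dc}': {rf}" for dc, rf in sorted(replicas_per_dc.items())
    )
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {options}}};"
    )


# Hypothetical names: use the datacenter name that `nodetool status` reports.
print(create_keyspace_cql("app_data", {"us-east": 3}))
```

With three replicas and three zones acting as racks, each zone should end up holding one copy of every row, provided the nodes are spread evenly across the zones.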
Active-Active, Multi-Region

  • Two rings in two Regions
  • The VPCs in the two Regions are peered.
  • The two Regions should be identical in nature, with the same number of nodes, instance types, and storage configuration.
  • This pattern is most suitable when the applications using the Cassandra cluster are deployed in more than one Region.
  • Read/write traffic can be localized to the Region closest to the user for lower latency and higher performance.
Active-Standby, Multi-Region
  • Two rings in two Regions
  • The VPCs in the two Regions are peered.
  • The two Regions should be identical in nature, with the same number of nodes, instance types, and storage configuration.
  • The second Region does not receive traffic from the applications.
  • It functions only as a secondary location for disaster recovery. If the primary Region is not available, the second Region receives the traffic.

Planning High-Performance Storage Options

  • Cassandra's disk I/O is mostly sequential for write-heavy workloads, but read-heavy workloads require random access.
  • If your working set (data + index) does not fit into memory, more I/O requests have to go to disk.
  • It is therefore very important to select the correct storage option.
  • Magnetic (HDD) volume types are not recommended because of their low performance.
  • AWS provides two main options:
    • Amazon EC2 instance store
    • Amazon EBS
Amazon EC2 Instance Store

  • Disk storage located on disks that are physically attached to the host computer is called “instance store”.
  • If you are using more than one volume, you can stripe the instance store volumes (RAID 0) for enhanced I/O throughput.
  • However, if the instance is stopped, fails, or is terminated, you will lose all the data on it.
  • Therefore, data must be replicated across multiple nodes in different Availability Zones, or across Regions, depending on the requirements.
Amazon EBS - Amazon Elastic Block Store

  • It provides persistent block storage.
  • Each Amazon EBS volume is automatically replicated within its Availability Zone to protect against component failure (high availability and durability).
  • By using Amazon CloudWatch with AWS Lambda, you can automate volume changes.
  • General Purpose SSD (gp2)
    • gp2 is designed to offer single-digit millisecond latencies.
    • It delivers a consistent baseline performance of 3 IOPS/GB (minimum 100 IOPS) up to a maximum of 10,000 IOPS.
    • It provides up to 160 MB/s of throughput per volume.
    • Volumes smaller than 1 TB can burst up to 3,000 IOPS.
  • Provisioned IOPS SSD (io1)
    • The highest-performance EBS storage option, designed for critical, I/O-intensive database workloads.
    • Up to 50 IOPS/GB, to a maximum of 32,000 IOPS.
    • Up to 500 MB/s of throughput per volume.
    • Single-digit millisecond latencies; it is designed to deliver the provisioned performance 99.9% of the time.
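
The IOPS rules above are easy to turn into a small calculator when sizing volumes. The following is a minimal sketch that uses only the limits quoted in this post (they may have changed since); the function names are my own and are not part of any AWS API.

```python
# Limits as quoted in this post; check the current EBS documentation before relying on them.
GP2_IOPS_PER_GB = 3
GP2_MIN_IOPS = 100
GP2_MAX_IOPS = 10_000
GP2_BURST_IOPS = 3_000
IO1_IOPS_PER_GB = 50
IO1_MAX_IOPS = 32_000


def gp2_baseline_iops(size_gb):
    """Baseline gp2 IOPS: 3 IOPS/GB, clamped to the 100..10,000 range."""
    return max(GP2_MIN_IOPS, min(GP2_IOPS_PER_GB * size_gb, GP2_MAX_IOPS))


def gp2_peak_iops(size_gb):
    """Small volumes can burst to 3,000 IOPS; larger ones stay at baseline."""
    return max(gp2_baseline_iops(size_gb), GP2_BURST_IOPS)


def io1_max_provisioned_iops(size_gb):
    """Largest IOPS value that can be provisioned on an io1 volume of this size."""
    return min(IO1_IOPS_PER_GB * size_gb, IO1_MAX_IOPS)


for size in (20, 500, 4096):
    print(size, gp2_baseline_iops(size), gp2_peak_iops(size),
          io1_max_provisioned_iops(size))
```

For example, a 500 GB gp2 volume has a 1,500 IOPS baseline and can burst to 3,000, while a 4 TB gp2 volume is already capped at the 10,000 IOPS maximum.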
Why EBS-Optimized Instances?

  • Usually, network traffic and EBS traffic share the same capacity on an Amazon EC2 instance.
  • This means that consistent EBS performance depends on the amount of non-EBS network traffic, so throughput between the instance and its EBS volumes cannot be guaranteed.
  • The solution is EBS-optimized instances.
  • They have additional, dedicated capacity for I/O between EC2 and EBS.
  • This optimization minimizes contention between EBS I/O and other traffic from the instance.
  • The dedicated bandwidth to Amazon EBS depends on the instance type.
  • It ranges from a minimum of 425 Mbps to a maximum of 14,000 Mbps.
Instance Types that Support EBS Optimization
  • Current-generation instance types are EBS-optimized by default,
  • for example C5, C4, M5, M4, P3, P2, G3, and D2 instances.
  • For these, there is no need to enable EBS optimization, and disabling it has no effect.
  • We can enable EBS optimization on an instance that is not EBS-optimized by default,
  • either when launching the instance or later on an existing instance (the instance has to be stopped to change the setting).
  • There is an additional charge when EBS optimization is enabled on an instance that is not EBS-optimized by default.
Available Instance Types for Cassandra

  • Compute Optimized – C4, C5
    • High level of network performance
    • EBS-optimized by default for increased storage performance at no additional cost
  • Storage Optimized – I3
    • SSD-backed instance storage optimized for low latency
    • Very high random I/O performance
    • High sequential read throughput and high IOPS at a low cost
  • Memory Optimized – X1e, R4
    • Optimized for memory-intensive applications
    • SSD storage and EBS-optimized by default at no additional cost
    • The lowest price per GiB of RAM
Planning Instance Types Based on Storage Needs

  • Let's assume we need 600 TB of storage, and that 10% of each disk is lost to formatting overhead.
  • The workload is 95% writes and 5% reads (write heavy).
  • The most common instance type for Cassandra on AWS is “i3”. Why?
    • Designed for I/O-intensive workloads
    • Comes with SSD storage (instance store)
    • Available as On-Demand, Reserved, and Spot Instances in 15 Regions
  • Let’s pick the i3.2xlarge instance (1,900 GB of instance storage, $0.624 per hour On-Demand)
    • (600 x 1024) / (1900 x 0.9) ≈ 359.3, rounded up to 360 instances
    • 360 x $0.624 x 720 hours = $161,740.80 per month
      • Not including data transfer charges
      • Assumes the commit log is stored on the same drive
  • This is very costly.
    • Instances with more local storage could be used, but that might not help with the cost.
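
The sizing arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the same calculation: the 1,900 GB disk size, the 0.9 usable fraction, the $0.624 hourly price, and the 720-hour month are the figures used in this post, and the function names are hypothetical.

```python
import math


def instances_needed(data_tb, disk_gb_per_instance, usable_fraction):
    """Number of instances required to hold data_tb on local disks."""
    data_gb = data_tb * 1024
    usable_gb = disk_gb_per_instance * usable_fraction
    # Round up: a fraction of an instance still means one more instance.
    return math.ceil(data_gb / usable_gb)


def monthly_cost(instance_count, hourly_price, hours_per_month=720):
    """On-Demand cost per month in dollars, rounded to cents."""
    return round(instance_count * hourly_price * hours_per_month, 2)


count = instances_needed(600, 1900, 0.9)
cost = monthly_cost(count, 0.624)
print(f"{count} instances, ${cost:,.2f} per month")
```

Changing the disk size or the hourly price makes it easy to try other instance sizes in the same family before committing to one.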
Decoupling with EBS (Instance and Storage Separately)

  • Cassandra recommends separate drives for data and the commit log (for better performance).
  • We will allocate the following on each node:
    • 500 GB for the commit log
    • 4 TB for data
  • Compute Optimized instance types are popular for running Cassandra on EBS.
    • The C4 instance type does not have any local storage.
    • c4.4xlarge is a good fit for production workloads.
    • 30 GB memory, 16 vCPUs, 2,000 Mbps dedicated EBS throughput
  • Let's look at the cost estimate based on these figures.

Number of instances = 600 TB / 4 TB = 150 instances

Storage requirements = data storage + commit log storage
                     = 600 TB + 75 TB (150 x 500 GB)
                     = 675 TB

EC2 cost per month, US East (N. Virginia):
  150 x c4.4xlarge = 150 x $0.796 x 720 = $85,968

EBS volume cost per month:
  675 TB of gp2 = 675 x 1024 x $0.10 = $69,120

Total cost per month = $85,968 + $69,120 = $155,088
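
The same estimate as a small script, in case you want to plug in your own node size or prices. It is a sketch under the assumptions of this post: 720 hours per month, $0.796 per hour for c4.4xlarge, and $0.10 per GB-month for gp2. Like the figures above, it counts 150 x 500 GB as 75 TB and then converts TB to GB with a factor of 1024. The function and key names are mine.

```python
import math

HOURS_PER_MONTH = 720


def ebs_cluster_cost(total_data_tb, data_tb_per_node, commit_log_gb_per_node,
                     instance_hourly, gp2_price_per_gb_month):
    """Monthly cost of a cluster whose storage lives entirely on EBS."""
    nodes = math.ceil(total_data_tb / data_tb_per_node)
    # The post counts 150 x 500 GB as 75 TB, i.e. 1000 GB per TB here.
    commit_log_tb = nodes * commit_log_gb_per_node / 1000
    storage_tb = total_data_tb + commit_log_tb
    ec2 = round(nodes * instance_hourly * HOURS_PER_MONTH, 2)
    # The post then converts TB to GB with a factor of 1024 for pricing.
    ebs = round(storage_tb * 1024 * gp2_price_per_gb_month, 2)
    return {
        "nodes": nodes,
        "storage_tb": storage_tb,
        "ec2": ec2,
        "ebs": ebs,
        "total": round(ec2 + ebs, 2),
    }


estimate = ebs_cluster_cost(600, 4, 500, 0.796, 0.10)
for key, value in estimate.items():
    print(f"{key}: {value}")
```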

We can consider Reserved Instances for further optimization
  • Let’s say that, after a couple of months, we are happy with the cluster performance of the c4.4xlarge plus EBS configuration.
  • We can then reserve instances to reduce the cost.
  • For a 3-year term with partial upfront payment:
    • Instance cost per month = 150 x $0.33 x 720 = $35,640
    • EBS cost per month = 675 x 1024 x $0.10 = $69,120
    • Total cost per month = $35,640 + $69,120 = $104,760
  • For i3.2xlarge instances with a 3-year, partial upfront plan:
    • Cost per month = 360 x $0.28 x 720 = $72,576
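
To compare the options side by side, here is a short sketch that recomputes the On-Demand and Reserved figures from this post and prints the relative saving. I am assuming that $0.33 and $0.28 are effective hourly rates for the 3-year partial upfront plans; the helper names are hypothetical.

```python
HOURS_PER_MONTH = 720


def monthly_instance_cost(count, hourly_rate):
    """Instance cost per month in dollars."""
    return count * hourly_rate * HOURS_PER_MONTH


def savings_percent(on_demand, reserved):
    """How much cheaper the reserved option is, as a percentage."""
    return 100 * (on_demand - reserved) / on_demand


# EBS cost is the same for On-Demand and Reserved instances.
EBS_MONTHLY = 675 * 1024 * 0.10

c4_on_demand = monthly_instance_cost(150, 0.796) + EBS_MONTHLY
c4_reserved = monthly_instance_cost(150, 0.33) + EBS_MONTHLY
i3_on_demand = monthly_instance_cost(360, 0.624)
i3_reserved = monthly_instance_cost(360, 0.28)

for name, on_demand, reserved in [
    ("c4.4xlarge + EBS", c4_on_demand, c4_reserved),
    ("i3.2xlarge", i3_on_demand, i3_reserved),
]:
    saving = savings_percent(on_demand, reserved)
    print(f"{name}: ${on_demand:,.0f} -> ${reserved:,.0f} ({saving:.1f}% lower)")
```

With these numbers the reservation lowers the c4.4xlarge plus EBS bill by roughly a third, because the EBS part is not discounted, while the i3.2xlarge bill drops by more than half.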
Planning Elastic Network Interfaces (ENI)


  • An ENI is a virtual network interface that you can attach to or detach from an instance in a VPC, within a single Availability Zone.
  • It can be attached to one instance, detached from it, and attached to another instance in the same Availability Zone.
  • When you move an ENI from one instance to another, network traffic is redirected to the new instance.
  • This really helps with seed node configuration, where the addresses are hard-coded.
  • If a seed node fails, you can automate the replacement so that the new seed node takes over the ENI IP address programmatically.
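
One possible way to automate the takeover is sketched below. It assumes a boto3 EC2 client is passed in as `ec2` and uses its `describe_network_interfaces`, `detach_network_interface`, and `attach_network_interface` calls. The function name, the device index of 1, and the polling values are my own choices, and the IDs in the usage comment are placeholders. A real setup would also need error handling and the right IAM permissions.

```python
import time


def move_eni(ec2, eni_id, target_instance_id, device_index=1,
             poll_seconds=2.0, max_polls=30):
    """Detach an ENI from its current instance (if any) and attach it elsewhere.

    `ec2` is expected to behave like boto3.client("ec2").
    Returns the new attachment ID.
    """
    def describe():
        response = ec2.describe_network_interfaces(NetworkInterfaceIds=[eni_id])
        return response["NetworkInterfaces"][0]

    attachment = describe().get("Attachment")
    if attachment:
        # Force the detach because the old seed node may be unresponsive.
        ec2.detach_network_interface(
            AttachmentId=attachment["AttachmentId"], Force=True
        )

    # Wait until the ENI is free before attaching it to the new node.
    for _ in range(max_polls):
        if describe()["Status"] == "available":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"{eni_id} did not become available")

    response = ec2.attach_network_interface(
        NetworkInterfaceId=eni_id,
        InstanceId=target_instance_id,
        DeviceIndex=device_index,
    )
    return response["AttachmentId"]


# Usage (placeholder IDs, requires boto3 and AWS credentials):
#   import boto3
#   move_eni(boto3.client("ec2"), "eni-0123456789abcdef0", "i-0123456789abcdef0")
```

Because the ENI keeps its private IP address, the seed list in the Cassandra configuration of the other nodes does not have to change when the seed node is replaced.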

Monitoring by using Amazon CloudWatch

  • Amazon CloudWatch can be used as a resource monitoring service.
  • It can:
    • Collect and track metrics
    • Collect and monitor log files
    • Set alarms
  • We can write a custom metric and submit it to Amazon CloudWatch.
  • We can configure alarms that notify us when metrics exceed defined thresholds.
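
As an illustration of submitting a custom metric, the sketch below wraps the CloudWatch `put_metric_data` call. It assumes a boto3 CloudWatch client is passed in as `cloudwatch`. The namespace, the `NodeId` dimension, and the metric name in the usage comment are made up for this example; how the value is read from Cassandra (for instance via nodetool or JMX) is left out.

```python
def publish_node_metric(cloudwatch, node_id, metric_name, value,
                        unit="Count", namespace="Custom/Cassandra"):
    """Send one custom metric data point for a Cassandra node to CloudWatch.

    `cloudwatch` is expected to behave like boto3.client("cloudwatch").
    Returns the datum that was sent.
    """
    datum = {
        "MetricName": metric_name,
        # One dimension per node so that alarms can target a single node.
        "Dimensions": [{"Name": "NodeId", "Value": node_id}],
        "Value": float(value),
        "Unit": unit,
    }
    cloudwatch.put_metric_data(Namespace=namespace, MetricData=[datum])
    return datum


# Usage (hypothetical metric, requires boto3 and AWS credentials):
#   import boto3
#   publish_node_metric(boto3.client("cloudwatch"), "node-1",
#                       "PendingCompactions", 12)
```

An alarm can then be defined on the same namespace and metric name so that a notification is sent when the value crosses the chosen threshold.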
Maintenance
  • In terms of Cassandra cluster health:
    • Scaling
      • Cassandra is scaled horizontally by adding more instances to the ring.
    • Upgrades
      • A rolling upgrade pattern is used for Cassandra upgrades, operating system patching, and instance type changes.
    • Backup & Restore
      • Cassandra supports snapshots and incremental backups.
      • When using instance store, a file-based backup tool works best.
      • The backup files are copied to new instances to restore.
      • We recommend using Amazon S3 to durably store backup files for long-term storage.
Security

  • The first step is to ensure that the data is encrypted at rest and in transit.
  • The second step is to restrict access to authorized users only.
  • Encryption at rest
    • Encryption at rest can be achieved by using EBS volumes with encryption enabled.
  • Encryption in transit
    • Cassandra supports Transport Layer Security (TLS) for client and internode communications.
References 


https://d1.awsstatic.com/whitepapers/Cassandra_on_AWS.pdf
https://aws.amazon.com/ec2/instance-types/
https://aws.amazon.com/ebs/details/
https://aws.amazon.com/ec2/pricing/reserved-instances/pricing/
https://aws.amazon.com/ec2/pricing/on-demand/
https://aws.amazon.com/ec2/instance-types/x1e/
https://aws.amazon.com/blogs/aws/now-available-new-c4-instances/
https://aws.amazon.com/blogs/aws/now-available-i3-instances-for-demanding-io-intensive-applications/


