Friday, August 31, 2018

Cassandra data retrieval issue - due to a huge number of tombstones

Recently, one of the development teams complained that their Cassandra cluster wasn't returning data from one of the tables, and soon after that the cluster went down.

This incident happened over the weekend, and the on-call Cassandra DBA was informed and asked to check the cluster. Initially, the dev team reported it as a performance issue, in other words the Cassandra cluster couldn't handle the current load. (Later it was identified that a sudden, huge data load had come in directly from one of the subsystems.)
As a first step, the DBA decided to bring the cluster back up by starting the Cassandra service, but he observed an "unable to handle the load" style exception in the Cassandra log (this is not the exact error text).


As usual, the heap size on the existing servers wasn't enough, so new nodes were added with a 12 GB heap out of 16 GB of physical RAM.

In addition to adding the new nodes, the DBA ran a repair, and the problem was solved temporarily.

Later in the day, the dev team reported that their issue still existed and asked for DBA support again. This time we noticed WARN messages about tombstones, and there was a huge list of them. <-- These warnings had been appearing from the beginning, but nobody noticed them because the team just wanted to get the cluster back into a running state somehow.
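For reference, the warning Cassandra prints looks roughly like the line below; the numbers and query here are only illustrative, not copied from our logs:

WARN  ReadCommand.java - Read 0 live rows and 100001 tombstone cells for query SELECT * FROM xxxxxx_prd.gradebeforeitem LIMIT 5000 (see tombstone_warn_threshold)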

Given the high number of tombstones, we suspected some unusual behaviour at the application level, so we checked with the dev team about how the application uses the affected table.

Finally, the dev team confirmed that they use this table as a temporary location for particular data and delete the data every 15 minutes via a cron job.
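In Cassandra, such a DELETE does not free space right away; it writes a tombstone marker for every deleted row, and that marker stays until gc_grace_seconds expires and compaction removes it. The cron job was effectively doing something of this shape (the key column below is only a placeholder, not their real schema):

-- placeholder sketch of the 15-minute cleanup query
DELETE FROM xxxxxx_prd.gradebeforeitem WHERE item_id = 'processed-batch-key';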

Since the default tombstone grace period is 10 days (864000 seconds), deleted data is not completely removed from the Cassandra cluster until that grace period is over. Because the delete job kept running every 15 minutes, these tombstones were steadily piling up, and no repair jobs were running to clean up the scattered data.
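The tombstone build-up on a table is easy to confirm with nodetool; tablestats (cfstats on older releases) reports per-slice tombstone figures for the table:

nodetool tablestats xxxxxx_prd.gradebeforeitem
# check "Average tombstones per slice" and "Maximum tombstones per slice" in the output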

With all of this, a select query on this table couldn't return data, because it had to read through the huge list of tombstones to find the live data, and the heap memory wasn't enough to hold all those tombstone values.
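The same behaviour can be observed from cqlsh by turning tracing on before the select; the trace events show how many tombstone cells each read has to scan before it finds live data (or gives up):

TRACING ON;
SELECT * FROM xxxxxx_prd.gradebeforeitem LIMIT 100;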

As a solution, we changed "gc_grace_seconds" to 1200 (20 minutes), upgraded the SSTables, and then compacted the table manually on each node.

Connect to the instance and change the gc_grace_seconds value on the gradebeforeitem table:
use xxxxxx_prd;
alter table gradebeforeitem with gc_grace_seconds = 1200;
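To verify the change took effect, the value can be read back from the schema tables (system_schema is available from Cassandra 3.0 onwards; older versions expose it through system.schema_columnfamilies):

select gc_grace_seconds from system_schema.tables where keyspace_name = 'xxxxxx_prd' and table_name = 'gradebeforeitem';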


Each command was executed on each node individually (one after another) to reduce the impact.
nohup nodetool upgradesstables -a xxxxxx_prd gradebeforeitem &
nohup nodetool compact xxxxxx_prd gradebeforeitem &
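While these are running, progress can be watched from another session; compactionstats lists the SSTable upgrade and the major compaction as active tasks:

nodetool compactionstats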

Once we received confirmation from the dev team that everything was working again, we reverted the "gc_grace_seconds" value to the default (864000) and set up a repair job on each node to run weekly. We scheduled the jobs with a 24-hour gap between nodes to avoid clashes between the repair jobs.
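For completeness, reverting is the same ALTER with the default value, and the weekly repair can be driven by a per-node cron entry; the schedule below is only an illustration of the 24-hour offset between nodes, not our exact crontab:

-- in cqlsh, on the same keyspace
alter table gradebeforeitem with gc_grace_seconds = 864000;

# crontab on node 1 (node 2 runs Monday 01:00, node 3 Tuesday 01:00, and so on)
0 1 * * 0 nodetool repair -pr xxxxxx_prd gradebeforeitem >> /var/log/cassandra/repair.log 2>&1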