Wednesday, March 18, 2015

Cassandra Storage Sizing

Column Overhead and Sizing
Every column in a ColumnFamily requires overhead.  Since every row can contain different column names as well as a different number of columns, the column metadata must be stored with every row.
For every column, the following must be saved:
name : 2 bytes (len as short int) + byte[]
flags : 1 byte
if counter column : 8 bytes (timestamp of last delete)
if expiring column : 4 bytes (TTL) + 4 bytes (local deletion time)
timestamp : 8 bytes (long)
value : 4 bytes (len as int) + byte[]
For the majority of us (the ones not using TTL or Counters) the formula would be:
Ø  Column size = 15 + size of name + size of value
This tells us that if you have columns with small values, the column metadata can account for a non-trivial part of your storage requirements.  For example, let's say you are using a timestamp (long) for the column name and another long for the column value.  Using the formula:
column size = 15 + 8 + 8 = 31 bytes.  That's 23 bytes of overhead to store 8 bytes of data!
See org.apache.cassandra.db.ColumnSerializer for more details.
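To make the arithmetic concrete, here is the per-column formula as a small Python sketch (the function name is mine, not something from the Cassandra codebase), covering a regular column with no TTL and no counter:

```python
def column_size(name_bytes: int, value_bytes: int) -> int:
    """Serialized size of a regular column (no TTL, no counter):
    2 (name length, short) + name + 1 (flags) + 8 (timestamp, long)
    + 4 (value length, int) + value."""
    return 2 + name_bytes + 1 + 8 + 4 + value_bytes

# A long (8-byte) timestamp as the name and a long (8-byte) value:
print(column_size(8, 8))  # 31
```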
Row Overhead and Sizing
Just like columns, every row incurs some overhead when stored on disk in an SSTABLE.  As you can see below, I don't have exact values for the bloom filter or index overhead; calculating them is proportional to the size of the row and the number of columns.  I'll give a rough estimate for both, but I don't completely understand bloom filters, so forgive me.  For every row, the following is stored:
key : 2 bytes (len as short) + byte[]
flag : 1 byte (1 or 0)
ColumnFamily ID : 4 bytes (int)
local deletion time : 4 bytes (int)
marked for delete time : 8 bytes (long)
column count : 4 bytes (int)
data : all columns (overhead + data)
row bloom filter : ???
row index : ???
As you can see, there is a lot going on when storing a row of columns!  Minimally there are 26 bytes of overhead (including the minimum size for the bloom filter).  It is basically impossible to predict the actual row size if your key size or column data size varies, so it's best to pick some averages for key size, number of columns, etc. and get an estimate:
Ø  row overhead = 23 + avg key size + bloom filter size + index size
Bloom Filter Size
The bloom filter is complicated and the following formula is only an estimate, but should capture the majority of use cases … if I understand things correctly:
Ø  bloom filter size = 2 bytes (len as short) + (8 + ceiling((number of columns * 4 + 20) / 8))
Example:  if you have 40 columns, then the bloom filter will require ~33 bytes.
For a row with only a few columns and not much data, the bloom filter is a more significant share of the overhead than it is for a row with many columns.
See org.apache.cassandra.utils.BloomFilter and org.apache.cassandra.utils.BloomFilterSerializer for more details.
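If I've read the formula right, it can be sketched in Python like this (the function name is mine, and this is only the estimate given above):

```python
import math

def row_bloom_filter_size(num_columns: int) -> int:
    """Rough serialized size of a row-level bloom filter:
    2 bytes (length as short) + 8 bytes of header,
    plus (4 * columns + 20) bits rounded up to whole bytes."""
    return 2 + 8 + math.ceil((num_columns * 4 + 20) / 8)

print(row_bloom_filter_size(40))  # 33
print(row_bloom_filter_size(10))  # 18
```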
Index Size
The row index helps optimize locating a specific column within a row.  The size of the index will normally be zero unless you have rows with a lot of columns and/or data.  For the index size to grow larger than zero, the size of the row (overhead + column data) must exceed column_index_size_in_kb, defined in your YAML file (default = 64).  If that size is exceeded, one index entry is created for each column_index_size_in_kb bytes in the row.  So if your row is 80kb, there will be two index entries.
An index entry consists of:
first col name : 2 bytes + byte[]
last col name : 2 bytes + byte[]
offset into the row data : 8 bytes (long)
width of data : 8 bytes (long)
The formula for calculating the number of index entries:
Ø  number of entries = ceiling(row data size / column_index_size_in_kb)
The formula for calculating the size of an index entry:
Ø  entry size = 20 + first size + last size
Each entry will be a different size if your column names differ in size, so you must calculate an average column name size for sizing the row like this:
Ø  row index size = 4 bytes [len as int] + if index size > 1 => (20 + 2 * avg name size) * numEntries
So if you have 80kb worth of data in a row and the avg column name size is 10, then the index will require the following space:
number of entries = ceiling(80kb / 64kb) = 2
row index size = 4 + (20 + 2 * 10) * 2
row index size = 84 bytes
And if you have 23kb worth of data in a row and the avg column name size is 10, then the index will require the following space (notice that the length of the index is always saved which requires 4 bytes):
number of entries = ceiling(23kb / 64kb) = 1
row index size = 4 + 0
row index size = 4 bytes
As you can see the size of the index isn't that large compared to the size of the row for default settings.  Since it is only created if the size of the row exceeds column_index_size_in_kb, most of us do not incur the overhead of the index.  And you probably should not care about it unless you have a lot of rows with a large number of columns, and/or have reduced the column_index_size_in_kb setting.
See org.apache.cassandra.io.sstable.IndexHelper for more details. 
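The two worked examples above can be reproduced with a short Python sketch (the names and the hard-coded default are my own reading of the description above):

```python
import math

COLUMN_INDEX_SIZE = 64 * 1024  # column_index_size_in_kb default (64kb)

def row_index_size(row_data_bytes: int, avg_name_bytes: int) -> int:
    """4 bytes for the entry count, plus the entries themselves when
    the row spans more than one column_index_size_in_kb chunk."""
    num_entries = math.ceil(row_data_bytes / COLUMN_INDEX_SIZE)
    if num_entries <= 1:
        return 4  # only the count is written; no index entries
    # each entry: first name (2 + n) + last name (2 + n)
    #           + offset (8, long) + width (8, long) = 20 + 2n
    return 4 + (20 + 2 * avg_name_bytes) * num_entries

print(row_index_size(80 * 1024, 10))  # 84
print(row_index_size(23 * 1024, 10))  # 4
```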
SSTABLE Overhead and Sizing
Every SSTABLE has a Data file that contains the actual column data.  There is also an Index, a Bloom Filter, and Statistics data associated with it, each stored in its own file separate from the row data.  The size of each of these is determined by the number of rows and the key size.
Index Overhead
The index is for quickly finding where in an SSTABLE the key and its data are located.  For an index there is one entry per row key:
row key : 2 bytes (length) + byte[]
file offset into Data file : 8 bytes (long)
So the index size is dependent on the number of rows and size of your keys.  For sizing purposes, just pick an average key size:
Ø  index size = rows * (10 + avg key size)
There is also a summary index and segment boundary markers, but they are not worth calculating - and I didn't want to figure it out ;)  What we can surmise from this is that the index overhead is not tiny, but it will only be significant if you have a lot of narrow, skinny rows.
Bloom Filter Overhead
Bloom filters are used to quickly determine if an SSTABLE contains a key.  The bloom filter for SSTABLES is calculated similarly to the one for rows, but has a different target.  Otherwise the same code applies.  As I mentioned earlier, I am not that familiar with the bloom filter algorithm, and this is just an approximation for sizing:
Ø  bloom filter size = (numRows * 15 + 20) / 8
The same can be surmised for bloom filters as for the indices: if you have lots of narrow and skinny rows, then the overhead for the bloom filter will be more significant.
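Both per-SSTABLE formulas are simple enough to sketch together (the function names are mine, and the bloom filter figure is only the approximation given above):

```python
def sstable_index_size(num_rows: int, avg_key_bytes: int) -> int:
    # per row key: 2 bytes (length as short) + key + 8-byte offset (long)
    return num_rows * (2 + avg_key_bytes + 8)

def sstable_bloom_filter_size(num_rows: int) -> float:
    # approximation: ~15 bits per row key plus a small constant, in bytes
    return (num_rows * 15 + 20) / 8

print(sstable_index_size(1_000_000, 50))     # 60000000 (~60mb)
print(sstable_bloom_filter_size(1_000_000))  # 1875002.5 (~1.9mb)
```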
Statistics Overhead
Very very minimal and not worth calculating.
See org.apache.cassandra.io.sstable.SSTableWriter for more details.
Replication Overhead
Replication will obviously play a role in how much storage is used.  If you set ReplicationFactor (RF) = 1, then there isn't any overhead for replicas (because you don't have any).  However, if RF > 1, then your total data storage requirement will include replication overhead:
Ø  replication overhead = (base storage) * (RF-1)
"base storage" is based on all the calculations we've made in the previous sections – see the summary table below.
Snapshots
On Linux, snapshots are made by creating hard links (man ln) to all of the data files.  By definition a hard link cannot be made across logical volumes, so the snapshots will exist on the same volume as your data files.  This is very fast and does not count against storage usage until SSTABLES are removed, leaving only the snapshot reference to the files.  So you can either do nothing (consuming disk space), or move them off to another volume.
If you choose to keep them on the same volume, then make sure you take into account the size of a snapshot, which is the same size as your total storage requirement.
More Overhead!
So it seems all I can talk about is overhead, overhead, overhead.  Well, there is (at least) one more place Cassandra uses storage: during compaction.  When Cassandra performs a compaction (major or minor) it combines a subset of existing SSTABLES into one new SSTABLE.  During the process Cassandra will discard tombstoned data and merge rows.  How much extra space is required?  Up to the total required storage size we've calculated thus far.  That means you need an _additional_ "total storage" size.  If there are tombstones that can be cleared, then the extra storage required will be less, depending on how often you delete data.  Once the process is complete, the original SSTABLES are marked as compacted and will be removed during the next GC or Cassandra restart.
Ø  Extra space for compaction = an additional "total storage" size
Example:  Suppose compaction is combining 4 SSTABLES of 1gb each.  If there are no tombstones, the process will require 4gb of extra space.
Storage Sizing Summary
The following table clearly defines and summarizes the formulas from previous sections:
Variables:
Number of Rows (NR) : estimated or target number of rows in your ColumnFamily
Number of Columns Per Row (NC) : estimated/average number of columns per row
Total Number of Columns (TNC) : number of columns in the ColumnFamily
Column name size (CNS) : average size of a column name
Column value size (CVS) : average size of the data stored in a column value
Row key size (RKS) : average size of keys
ReplicationFactor (RF) : replication factor
Number of Nodes (NN) : number of nodes in the cluster

Formulas:
Column data = NR * NC * CVS
Column overhead = NR * NC * (15 + CNS)
Column overhead (counter) = NR * NC * (23 + CNS)
Column overhead (w/TTL) = NR * NC * (19 + CNS)
Row header overhead = NR * (23 + RKS)
Row bloom filter = NR * (2 bytes (len as short) + (8 + ceiling((NC * 4 + 20) / 8)))
Row index = 4 bytes (len as int) + if more than one entry => (20 + 2 * CNS) * numEntries   (the number of entries in the index is always written; see above for more details)
SSTABLE index = NR * (10 + RKS)
SSTABLE bloom filter = (NR * 15 + 20) / 8
Base Storage (BS) = sum of everything above
Additional Replicas = BS * (RF - 1)
Compaction Overhead = BS * RF
Total Storage = BS * RF * 2
Snapshots = "Total Storage" space per snapshot   (remember to move them off the data volume and you will recover the space)
Every column requires overhead that cannot be overlooked
For sizing purposes just assume your column names are 10 bytes each, so the total overhead per column is 15 + 10 = 25 bytes.   That's just the overhead, don't forget about the actual size of the values.  It's the ratio of the data size to the overhead that surprises people.  If you have a lot of columns with small data sizes you will very likely have more overhead than actual data.
Compaction overhead requires enough space for an additional copy of the data
This figure assumes no deletes.  This is a _LARGE_ requirement which is being optimized in a future Cassandra release.
Use Case where overhead far exceeds data size
Let's say we want to save the number of times per day that a URL on our website has been accessed.  We'll assume that, over time, 1 million URLs will be accessed, on average on about 10 different days per URL.
NR = 1,000,000 (estimated/target number of rows in the ColumnFamily)
NC = 10 (average over all the rows)
TNC = 10,000,000 (total number of columns)
CNS = 8 (all names are timestamps, stored as longs)
CVS = 4 (all values are counts, stored as integers)
RKS = 50 (average URL length)
RF = 3 (replication factor)
NN = 9 (number of nodes)
We will have the following space requirement:
Column data = 1,000,000 * 10 * 4 = 40mb
Column overhead = 1,000,000 * 10 * (15 + 8) = 230mb
Row header overhead = 1,000,000 * (23 + 50) = 73mb
Row bloom filter = 1,000,000 * (2 + 8 + ceiling((10 * 4 + 20) / 8)) = 18mb
Row index = 1,000,000 * 4 = 4mb   (we assume no index entries, because it would require 2,428 days of data to exceed the default column_index_size_in_kb limit)
SSTABLE index = 1,000,000 * (10 + 50) = 60mb
SSTABLE bloom filter = (1,000,000 * 15 + 20) / 8 = 1.9mb
Base Storage (BS) = 426,875,003 bytes = ~427mb
Additional Replicas = 427mb * (3 - 1) = 853,750,005 bytes = ~854mb
Compaction Overhead = 427mb * 3 = 1,280,625,008 bytes = ~1,281mb
Total Storage = 427mb * 3 * 2 = 2,561,250,015 bytes = ~2,561mb

Breaking the base storage down:
Actual column value data = 40mb (9.4%)
Column overhead = 230mb (53.9%)
Row overhead = 95mb (22.2%)
SSTABLE overhead = 61.9mb (14.5%)
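The whole use case can be checked with one short script that plugs the inputs into the formulas from the summary section (variable names are mine; "mb" here means decimal megabytes):

```python
import math

# Use-case inputs
NR, NC, CNS, CVS, RKS, RF = 1_000_000, 10, 8, 4, 50, 3

column_data     = NR * NC * CVS                                # 40mb
column_overhead = NR * NC * (15 + CNS)                         # 230mb
row_header      = NR * (23 + RKS)                              # 73mb
row_bloom       = NR * (2 + 8 + math.ceil((NC * 4 + 20) / 8))  # 18mb
row_index       = NR * 4                                       # 4mb (count only)
sstable_index   = NR * (10 + RKS)                              # 60mb
sstable_bloom   = (NR * 15 + 20) / 8                           # ~1.9mb

base = (column_data + column_overhead + row_header + row_bloom
        + row_index + sstable_index + sstable_bloom)           # ~427mb
replicas   = base * (RF - 1)                                   # ~854mb
compaction = base * RF                                         # ~1,281mb
total      = base * RF * 2                                     # ~2,561mb

print(round(base / 1_000_000))   # 427
print(round(total / 1_000_000))  # 2561
```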
Summary
So you can see that in the example Use Case the overhead is over 9 times the actual data size!  This doesn’t mean the overhead is bad or should be reduced.  In this case the column names do more than simply designate a value, they actually store the timestamp (date) when the URL was accessed.  This exercise is merely to determine your required storage footprint and allow you to plan for the future.
Enjoy
