High Level Design

Back Of Envelope Estimation

In a system design interview, sometimes you are asked to estimate system capacity or
performance requirements using a back-of-the-envelope estimation.

The following concepts should be well understood:

power of two [2],
latency numbers every programmer should know,
availability numbers.

Power of two

To obtain correct calculations, it is critical to know the data volume unit using the power of 2.

A byte is a sequence of 8 bits. An ASCII character uses one byte of memory (8 bits). Below is a table explaining the data volume unit –

power	approx val	full name	short name
10	1 Thousand	1Kilo Bye	1KB
20	1 million	1MegaByte	1MB
30	1 billion	1GIgaByte	1GB
40	1 trillion	1TeraByte	1TB
50	1 quadrillion	1PentaByte	1PB

Latency numbers every programmer should know :

Some numbers are outdated as computers become faster and more powerful. However, those
numbers should still be able to give us an idea of the fastness and slowness of different
computer operations.
Notes

Operation name	Time
L1 cache reference	0.5ns
Batch mispredict	5ns
L2 cache reference	7ns
Mutex lock/unlock	100ns
main memory refernce	100ns
compress 1KB into zippy	10k ns=10μs
send 2KB over 1gbps network	20k ns=10μs
read 1MB sequentially from memory	250k ns=10μs
round tript within same data-centre	500k ns=10μs
Disk seek	10,000 K ns = 10ms
read 1MB sequentially from network	10,000 K ns = 10ms
read 1MB sequentially from disk	30,000 K ns = 10ms
send packet CA(california)–>NETHERLAND –>CA	150,000 K ns = 10ms

Notes

ns = nanosecond, μs = microsecond, ms = millisecond
1 ns = 10^-9 seconds
1 μs= 10^-6 seconds = 1,000 ns
1 ms = 10^-3 seconds = 1,000 μs = 1,000,000 ns

By analyzing the numbers , we get the following conclusions:

Memory is fast but the disk is slow.
Avoid disk seeks if possible.
Simple compression algorithms are fast.
Compress data before sending it over the internet if possible.
Data centers are usually in different regions, and it takes time to send data between them.

Example: Estimate Twitter QPS and storage requirements

Please note the following numbers are for this exercise only as they are not real numbers
from Twitter.
Assumptions:

300 million monthly active users.
50% of users use Twitter daily.
Users post 2 tweets per day on average.
10% of tweets contain media.
Data is stored for 5 years.
Estimations:
Query per second (QPS) estimate:
Daily active users (DAU) = 300 million * 50% = 150 million
Tweets QPS = 150 million * 2 tweets / 24 hour / 3600 seconds = ~3500
Peek QPS = 2 * QPS = ~7000
We will only estimate media storage here.
Average tweet size:
tweet_id 64 bytes
text 140 bytes
media 1 MB
Media storage: 150 million * 2 * 10% * 1 MB = 30 TB per day
5-year media storage: 30 TB * 365 * 5 = ~55 PB

Tips
Back-of-the-envelope estimation is all about the process. Solving the problem is more
important than obtaining results. Interviewers may test your problem-solving skills. Here are
a few tips to follow:

Rounding and Approximation. It is difficult to perform complicated math operations
during the interview. For example, what is the result of “99987 / 9.1”? There is no need to
spend valuable time to solve complicated math problems. Precision is not expected. Use
round numbers and approximation to your advantage. The division question can be
simplified as follows: “100,000 / 10”.
Write down your assumptions. It is a good idea to write down your assumptions to be
referenced later.
Label your units. When you write down “5”, does it mean 5 KB or 5 MB? You might
confuse yourself with this. Write down the units because “5 MB” helps to remove
ambiguity.
Commonly asked back-of-the-envelope estimations: QPS, peak QPS, storage, cache,
number of servers, etc. You can practice

Pages: 1234567

High Level Design

Notes

Published by

sevanand yadav

Leave a comment Cancel reply

Notes

Share this:

Related

Published by

sevanand yadav

Leave a comment Cancel reply