Back Of Envelope Estimation
In a system design interview, sometimes you are asked to estimate system capacity or
performance requirements using a back-of-the-envelope estimation.
The following concepts should be well understood:
- power of two [2],
- latency numbers every programmer should know,
- availability numbers.
Power of two
To obtain correct calculations, it is critical to know the data volume unit using the power of 2.
A byte is a sequence of 8 bits. An ASCII character uses one byte of memory (8 bits). Below is a table explaining the data volume unit –
| power | approx val | full name | short name |
|---|---|---|---|
| 10 | 1 Thousand | 1Kilo Bye | 1KB |
| 20 | 1 million | 1MegaByte | 1MB |
| 30 | 1 billion | 1GIgaByte | 1GB |
| 40 | 1 trillion | 1TeraByte | 1TB |
| 50 | 1 quadrillion | 1PentaByte | 1PB |
Latency numbers every programmer should know :
Some numbers are outdated as computers become faster and more powerful. However, those
numbers should still be able to give us an idea of the fastness and slowness of different
computer operations.
Notes
| Operation name | Time |
|---|---|
| L1 cache reference | 0.5ns |
| Batch mispredict | 5ns |
| L2 cache reference | 7ns |
| Mutex lock/unlock | 100ns |
| main memory refernce | 100ns |
| compress 1KB into zippy | 10k ns=10μs |
| send 2KB over 1gbps network | 20k ns=10μs |
| read 1MB sequentially from memory | 250k ns=10μs |
| round tript within same data-centre | 500k ns=10μs |
| Disk seek | 10,000 K ns = 10ms |
| read 1MB sequentially from network | 10,000 K ns = 10ms |
| read 1MB sequentially from disk | 30,000 K ns = 10ms |
| send packet CA(california)–>NETHERLAND –>CA | 150,000 K ns = 10ms |
Notes
ns = nanosecond, μs = microsecond, ms = millisecond
1 ns = 10^-9 seconds
1 μs= 10^-6 seconds = 1,000 ns
1 ms = 10^-3 seconds = 1,000 μs = 1,000,000 ns
By analyzing the numbers , we get the following conclusions:
- Memory is fast but the disk is slow.
- Avoid disk seeks if possible.
- Simple compression algorithms are fast.
- Compress data before sending it over the internet if possible.
- Data centers are usually in different regions, and it takes time to send data between them.
Example: Estimate Twitter QPS and storage requirements
Please note the following numbers are for this exercise only as they are not real numbers
from Twitter.
Assumptions:
- 300 million monthly active users.
- 50% of users use Twitter daily.
- Users post 2 tweets per day on average.
- 10% of tweets contain media.
- Data is stored for 5 years.
Estimations:
Query per second (QPS) estimate: - Daily active users (DAU) = 300 million * 50% = 150 million
- Tweets QPS = 150 million * 2 tweets / 24 hour / 3600 seconds = ~3500
- Peek QPS = 2 * QPS = ~7000
We will only estimate media storage here. - Average tweet size:
- tweet_id 64 bytes
- text 140 bytes
- media 1 MB
- Media storage: 150 million * 2 * 10% * 1 MB = 30 TB per day
- 5-year media storage: 30 TB * 365 * 5 = ~55 PB
Tips
Back-of-the-envelope estimation is all about the process. Solving the problem is more
important than obtaining results. Interviewers may test your problem-solving skills. Here are
a few tips to follow:
- Rounding and Approximation. It is difficult to perform complicated math operations
during the interview. For example, what is the result of “99987 / 9.1”? There is no need to
spend valuable time to solve complicated math problems. Precision is not expected. Use
round numbers and approximation to your advantage. The division question can be
simplified as follows: “100,000 / 10”. - Write down your assumptions. It is a good idea to write down your assumptions to be
referenced later. - Label your units. When you write down “5”, does it mean 5 KB or 5 MB? You might
confuse yourself with this. Write down the units because “5 MB” helps to remove
ambiguity. - Commonly asked back-of-the-envelope estimations: QPS, peak QPS, storage, cache,
number of servers, etc. You can practice