Unit 1: Reliable, Scalable and Maintainable [R.S.M.] Applications
Index
- Objective
- Reliability
- Definition
- Problems and solutions
- Scalability
- Maintainability
Objective: Introduce the terminology used throughout the book and describe, in essence, what R.S.M. means; upcoming chapters describe how to implement them [RSM].
Statement: Many applications are DATA-intensive as opposed to COMPUTE-intensive.
Explanation: Raw CPU power is rarely a limiting factor for these applications; bigger problems are usually the amount of data, the complexity of data, and the speed at which it is changing.
Reliability
The system should continue to work correctly even in the face of adversity.
For software, typical expectations from reliability include:
- The application performs the function that the user expected.
- It can tolerate the user making mistakes or using the software in unexpected ways.
- Its performance is good enough for the required use case, under the expected load and data volume.
- Security perspective: The system prevents any unauthorized access and abuse.
Reliability Scope: Fault Tolerance
The things that can go wrong are called faults, and systems that anticipate faults and
can cope with them are called fault-tolerant or resilient.
The former term is slightly misleading: it suggests that we could make a system tolerant of every possible kind of fault, which in reality is not feasible. If the entire planet Earth (and all servers on it) were swallowed by a black hole, tolerance of that fault would require web hosting in space— good luck getting that budget item approved.
So it only makes sense to talk about tolerating certain types of faults.
Fault vs failure
Note that a fault is not the same as a failure.
A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user.
It is impossible to reduce the probability of a fault to zero; therefore it is usually best to design fault-tolerance mechanisms that prevent faults from causing failures. In this book we cover several techniques for building reliable systems from unreliable parts.
Counterintuitively, in such fault-tolerant systems, it can make sense to increase the rate of faults by triggering them deliberately. The Netflix Chaos Monkey is an example of this approach.
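As a toy illustration of deliberate fault injection (not Netflix's actual tooling; the decorator name and failure probability are assumptions), a wrapper can make a function fail at a configurable rate, so the calling code's tolerance is exercised continually:

```python
import random

def chaotic(p_fail=0.05):
    """Hypothetical fault injector: wraps a function so it fails
    randomly with probability p_fail, Chaos Monkey-style."""
    def wrap(fn):
        def inner(*args, **kwargs):
            if random.random() < p_fail:
                raise ConnectionError("injected fault (chaos test)")
            return fn(*args, **kwargs)
        return inner
    return wrap

@chaotic(p_fail=0.1)
def fetch_user(user_id):
    # Stand-in for a real network call.
    return {"id": user_id, "name": "example"}

# Calling code must now prove it tolerates the injected faults.
for attempt in range(3):
    try:
        print(fetch_user(42))
        break
    except ConnectionError:
        print("fault injected; retrying")
```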
Although we generally prefer tolerating faults over preventing faults, there are cases
where prevention is better than cure (e.g., because no cure exists). This is the case
with security matters,
for example: if an attacker has compromised a system and gained access to sensitive data, that event cannot be undone.
However, this book mostly deals with the kinds of faults that can be cured, as described in the following sections.
Hardware Faults
When we think of causes of system failure, hardware faults quickly come to mind: hard disks crash, RAM becomes faulty, the power grid has a blackout, someone unplugs the wrong network cable. Anyone who has worked with large datacenters can tell you that these things happen all the time when you have a lot of machines.
Solution to hardware faults:
Our first response is usually to add redundancy to the individual hardware components in order to reduce the failure rate of the system.
This approach cannot completely prevent hardware problems from causing failures, but it is well understood and can often keep a machine running uninterrupted for years.
On some cloud platforms, such as Amazon Web Services (AWS), it is fairly common for virtual machine instances to become unavailable without warning, as the platforms are designed to prioritize flexibility and elasticity over single-machine reliability.
Hence there is a move toward systems that can tolerate the loss of entire machines, by using software fault-tolerance techniques in preference to, or in addition to, hardware redundancy. Such systems also have operational advantages: a single-server system requires planned downtime if you need to reboot the machine (to apply operating system security patches, for example), whereas a system that can tolerate machine failure can be patched one node at a time, without downtime of the entire system (a rolling upgrade; sketched below).
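A minimal sketch of the rolling-upgrade loop. The drain/patch/health-check helpers are assumptions stubbed out here, just to show the one-node-at-a-time structure:

```python
import time

# Stubs standing in for real orchestration steps (assumed, not a real API).
def drain(node): print(f"draining {node}")             # stop routing new requests
def patch_and_reboot(node): print(f"patching {node}")  # apply OS security patches
def healthy(node): return True                         # health check
def restore(node): print(f"restoring {node}")          # resume routing traffic

def rolling_upgrade(nodes):
    """Patch one node at a time so the system as a whole stays up."""
    for node in nodes:
        drain(node)
        patch_and_reboot(node)
        while not healthy(node):   # wait until the node passes health checks
            time.sleep(1)
        restore(node)

rolling_upgrade(["node-1", "node-2", "node-3"])
```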
Software Errors
We usually think of hardware faults as being random and independent from each
other: one machine’s disk failing does not imply that another machine’s disk is going
to fail.
Another class of fault is a systematic error within the system. Such faults are harder to anticipate, and because they are correlated across nodes, they tend to cause many more system failures than uncorrelated hardware faults.
Examples include:
- A software bug that causes every instance of an application server to crash when given a particular bad input. For example, the leap second on June 30, 2012, caused many applications to hang simultaneously due to a bug in the Linux kernel.
Solution to software errors:
Lots of small things can help: carefully thinking about assumptions and interactions in the system; thorough testing; process isolation; allowing processes to crash and restart; measuring, monitoring, and analyzing system behavior in production.
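One of the listed techniques, allowing processes to crash and restart, can be sketched as a supervisor loop (a toy illustration; the bug here is simulated with a random failure):

```python
import random
import time

def worker():
    """Stand-in for a process that occasionally hits a software bug."""
    if random.random() < 0.3:
        raise RuntimeError("simulated software bug")
    print("work done")

def supervise(max_restarts=5):
    """Restart the worker when it crashes, rather than trying to
    prevent every possible bug up front."""
    for attempt in range(1, max_restarts + 1):
        try:
            worker()
            return
        except RuntimeError as e:
            print(f"worker crashed ({e}); restart attempt {attempt}")
            time.sleep(0.1)   # brief back-off before restarting

supervise()
```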
Human Errors
One study of large internet services found that CONFIGURATION ERRORS BY OPERATORS were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages.
Solutions for human errors:
- Design systems in a way that minimizes opportunities for error. For example, well-designed abstractions, APIs, and admin interfaces make it easy to do “the right thing” and discourage “the wrong thing.”
- Decouple the places where people make the most mistakes from the places where they can cause failures.
- In particular, provide fully featured non-production sandbox environments where people can explore and experiment safely, using real data, without affecting real users.
How Important Is Reliability?
Reliability is not just for nuclear power stations and air traffic control software—
Bugs in business applications cause lost productivity (and legal risks if figures are reported incorrectly), and outages of e-commerce sites can have huge costs in terms of lost revenue and damage to reputation.
Scalability
Scalability is the term we use to describe a system’s ability to cope with increased LOAD.
Describing Load:
Load can be described with a few numbers which we call load parameters.
The best choice of parameters depends on the architecture of your system:
- It may be requests per second to a web server, the ratio of reads to writes in a database,
- The number of simultaneously active users in a chat room, the hit rate on a cache, or something else.
The Twitter fan-out example (delivering each tweet to its followers' home timelines) best describes these load scenarios.
Describing Performance
Once you have described the load on your system, you can investigate what happens when the load increases. You can look at it in two ways:
- When you increase a load parameter and keep the system resources (CPU, memory, network bandwidth, etc.) unchanged, how is the performance of your system affected?
- When you increase a load parameter, how much do you need to increase the resources if you want to keep performance unchanged?
Both questions require performance numbers, so let's look briefly at describing the performance of a system.
Latency vs Response Time:
Latency and response time are often used synonymously, but they are not the same.
The response time is what the client sees: besides the actual time to process the request (the service time), it includes network delays and queuing delays.
Latency is the duration that a request is waiting to be handled—during which it is latent, awaiting service.
Measuring response time:
Even if you only make the same request over and over again, you’ll get a slightly different response time on every try.
In practice, in a system handling a variety of requests, the response time can vary a lot. We therefore need to think of response time not as a single number, but as a DISTRIBUTION OF VALUES that you can measure.
It’s common to see the average response time of a service reported (the arithmetic mean: given n values, add up all the values, and divide by n).
However, the mean is not a very good metric if you want to know your “typical”
response time, because it doesn’t tell you how many users actually experienced
that delay.
Usually it is better to use percentiles. If you take your list of response times and sort it from fastest to slowest, then the median is the halfway point:
Example: if the median response time is 200 ms, that means half your requests return in less than 200 ms, and half your requests take longer than that.
The median is also known as the 50th percentile, and sometimes abbreviated as p50. Note that the median refers to a single request; if the user makes several requests (over the course of a session, or because several resources are included in a single page), the probability that at least one of them is slower than the median is much greater than 50%.
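A minimal sketch of computing percentiles from a list of measured response times. The nearest-rank method below is one simple choice; real monitoring systems may interpolate differently:

```python
def percentile(samples, p):
    """Return the p-th percentile (0-100) of response-time samples,
    using the nearest-rank method on the sorted list."""
    ordered = sorted(samples)
    idx = round((p / 100) * (len(ordered) - 1))
    return ordered[idx]

times_ms = [30, 45, 95, 110, 200, 210, 220, 250, 300, 1200]
print("p50 (median):", percentile(times_ms, 50))  # half the requests are faster
print("p95:", percentile(times_ms, 95))
print("p99:", percentile(times_ms, 99))           # tail latency: the slowest requests
```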
Approaches for Coping with Load:
Having parameters for describing load and metrics for measuring performance, we can now discuss how to maintain good performance even when our load parameters increase by some amount.
People often talk of a dichotomy between scaling up (vertical scaling, moving to a
more powerful machine) and scaling out (horizontal scaling, distributing the load
across multiple smaller machines).
A system that can run on a single machine is often simpler, but high-end machines can become very expensive, so very intensive workloads often can’t avoid scaling out.
Some systems are elastic, meaning that they can automatically add computing resources when they detect a load increase, whereas other systems are scaled manually (a human analyzes the capacity and decides to add more machines to the system). An elastic system can be useful if load is highly unpredictable, but manually scaled systems are simpler and may have fewer operational surprises.
Question for discussion: in the last meeting the head suggested increasing the number of instances to 12. Can this be automated, i.e., made elastic? (Any operational surprises would still need to be reviewed periodically; see the sketch below.)
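A minimal sketch of an elastic scaling rule for the discussion above, with the suggested 12 instances as the ceiling. The target utilization and the proportional-sizing policy are assumptions, not a specific cloud provider's API:

```python
import math

MAX_INSTANCES = 12   # ceiling suggested in the meeting
TARGET_CPU = 0.5     # assumed target average utilization

def desired_instances(current, avg_cpu):
    """Proportional policy: size the fleet so utilization returns to target."""
    wanted = math.ceil(current * avg_cpu / TARGET_CPU)
    return max(1, min(MAX_INSTANCES, wanted))

print(desired_instances(current=8, avg_cpu=0.75))  # -> 12 (scale out, capped)
print(desired_instances(current=8, avg_cpu=0.25))  # -> 4  (scale in)
```

A rule like this should itself be reviewed periodically, since elastic systems can produce their own operational surprises.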
Maintainability
Over time, many different people will work on the system, and they should all be able to work on it productively.
We can and should design software in such a way that it will hopefully minimize pain during maintenance, and thus avoid creating legacy software ourselves.
To this end, we will pay particular attention to three design principles for software systems: Operability, Simplicity, Evolvability.
- Operability: Making Life Easy for Operations
- Operations teams are vital to keeping a software system running smoothly. A good operations team typically is responsible for the following, and more:
- Monitoring the health of the system and quickly restoring service if it goes into a bad state.
- Tracking down the cause of problems, such as system failures or degraded performance.
Good operability means making routine tasks easy, allowing the operations team to
focus their efforts on high-value activities. Data systems can do various things to
make routine tasks easy, including:
- Providing visibility into the runtime behavior and internals of the system, with good monitoring
- Providing good documentation and an easy-to-understand operational model (“If I do X, Y will happen”)
- Simplicity: Managing Complexity. As projects get larger, they often become very complex and difficult to understand, further increasing the cost of maintenance.
There are various possible symptoms of complexity:
- explosion of the state space,
- tight coupling of modules,
- tangled dependencies,
- inconsistent naming and terminology,
- hacks aimed at solving performance problems,
- special-casing to work around issues elsewhere.
Solution:
One of the best tools we have for removing accidental complexity is abstraction. A
good abstraction can hide a great deal of implementation detail behind a clean,
simple-to-understand facade.
Throughout this book, we will keep our eyes open for good abstractions that allow us
to extract parts of a large system into well-defined, reusable components.
- Evolvability: Making Change Easy
The ease with which you can modify a data system, and adapt it to changing requirements, is closely linked to its simplicity and its abstractions: simple and easy-to-understand systems are usually easier to modify than complex ones.
Unit 2: Data Models and Query Languages
Brief: Comparison of data models and query languages (databases).
RDBMS vs Document Model:
It’s not possible to say in general which data model leads to simpler application code;
it depends on the kinds of relationships that exist between data items.
Shortcomings of Document Model:
- Document databases are sometimes called schema-less, but that's misleading: the code that reads the data usually assumes some kind of structure, i.e., there is an implicit schema, but it is not enforced by the database (no rule guarantees structure on write; checks effectively happen on read).
- The document model has weak support for joins (many-to-many relationships); it leaves it up to the application developer to join and filter data in application code.
Advantages of Document Model:
- Schema flexibility, better performance due to locality, and, for some applications, a closer match to the data structures used by the application.
Shortcomings of Relational Model:
1. Object-relational impedance mismatch: an awkward translation layer is needed between application objects and database tables.
2. Highly hierarchical data (such as a résumé) has to be split across multiple tables, making it clumsier to read and write as a single unit.
Advantages of Relational Model:
1. The relational model's support for joins and for many-to-one/many-to-many relationships made it easier to add new features to an application: data can be referenced by ID and queried in new ways without restructuring it.
Schema-on-read vs Schema-on-write
Examples:
- A résumé is best represented in the document model (a one-to-many, hierarchical relation, self-contained per user); see the sketch after this list.
- Analytics event data also fits the document model, since each record captures which event occurred and at what time, and events may vary in structure.
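To make the résumé example concrete, here is a sketch of one self-contained document. The values follow the book's Bill Gates example; the exact field names are illustrative:

```python
import json

# The one-to-many tree (positions, education) nests inside the parent
# record, so the whole résumé loads in a single read, with no joins.
resume = {
    "user_id": 251,
    "first_name": "Bill",
    "last_name": "Gates",
    "positions": [
        {"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
        {"job_title": "Co-founder", "organization": "Microsoft"},
    ],
    "education": [
        {"school_name": "Harvard University", "start": 1973, "end": 1975},
    ],
}
print(json.dumps(resume, indent=2))
```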
Query Languages:
Declarative query languages vs imperative query code (sketch below).
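A minimal sketch of the contrast, along the lines of the book's "sharks" example: the imperative version spells out how to scan the data, while the declarative SQL (run here via Python's built-in sqlite3) states only what is wanted and lets the query optimizer decide how:

```python
import sqlite3

animals = [("Great White", "Sharks"), ("Lion", "Felidae"), ("Hammerhead", "Sharks")]

# Imperative: we dictate the access pattern step by step.
def sharks_imperative(rows):
    result = []
    for name, family in rows:      # explicit loop, explicit order
        if family == "Sharks":
            result.append(name)
    return result

print(sharks_imperative(animals))

# Declarative: we state the condition; the engine chooses the execution plan.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animals (name TEXT, family TEXT)")
conn.executemany("INSERT INTO animals VALUES (?, ?)", animals)
print(conn.execute("SELECT name FROM animals WHERE family = 'Sharks'").fetchall())
```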
Unit 3: Storage and Retrieval
Brief: Internals of storage engines, how data is laid out on disk, and how to choose the right storage engine.
A simple DB [operation] example (sketch below):
- Write operation: append the record to the end of the file.
- Read operation: sequentially scan through the records and return the matching one.
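The book implements this "simplest possible database" with two tiny bash functions; here is a minimal Python sketch of the same idea (the file name is an assumption):

```python
DB_FILE = "database.txt"  # assumed file name

def db_set(key, value):
    # Write: append one "key,value" line to the end of the file, O(1).
    with open(DB_FILE, "a") as f:
        f.write(f"{key},{value}\n")

def db_get(key):
    # Read: scan every record, O(n); the LAST match wins, so an update
    # is just a newer version of the key appended to the log.
    value = None
    with open(DB_FILE) as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition(",")
            if k == key:
                value = v
    return value

db_set("42", "San Francisco")
db_set("42", "San Francisco, CA")   # "update" = append a newer version
print(db_get("42"))                 # -> San Francisco, CA
```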
Why is an index needed in a DB?
To avoid a sequential O(n) scan for a record (e.g., a hash index reduces the lookup to O(1)).
Why do DBs let the application developer choose indexes instead of providing them automatically?
Because of the trade-off between writes and reads: any index adds write overhead, whereas a well-chosen index speeds up the retrieval of data.
Term: number of disk seeks (more seeks means higher access time).
Types of indexes: hash indexes, SSTables/LSM-trees, B-trees. [TODO: complete the definitions and the benefits of one over another]
Hash index
Limitation: the hash map of keys must fit in memory; an on-disk hash map incurs a performance penalty because lookups require many random-access disk reads. Range queries are also inefficient.
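A minimal sketch of a hash index over the append-only log from above (the Bitcask-style idea: an in-memory dict maps each key to the byte offset of its latest entry). The class and file names are assumptions:

```python
import os

class HashIndexedLog:
    """Append-only log plus an in-memory hash index: key -> byte offset."""

    def __init__(self, path="log.db"):   # assumed file name
        self.path = path
        self.index = {}
        open(self.path, "a").close()     # make sure the log file exists

    def set(self, key, value):
        with open(self.path, "a") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()            # remember where this entry starts
            f.write(f"{key},{value}\n")
        self.index[key] = offset         # every write also updates the index

    def get(self, key):
        offset = self.index.get(key)     # O(1) lookup instead of a full scan
        if offset is None:
            return None
        with open(self.path) as f:
            f.seek(offset)               # one seek straight to the record
            return f.readline().rstrip("\n").partition(",")[2]

db = HashIndexedLog()
db.set("42", "San Francisco")
db.set("42", "San Francisco, CA")
print(db.get("42"))                      # -> San Francisco, CA
```

The trade-off noted above is visible directly: set() does extra work to maintain the index, while get() avoids the O(n) scan; and the index itself lives in memory, which is exactly the limitation stated.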
Unit 4: Encoding and Evolution
Overview: Compare various formats of data encoding (serialization).
Choice of data encoding:
JSON/XML: suitable when data is to be exchanged between organizations, i.e., when the difficulty of getting organizations to agree on a format outweighs everything else.
Binary encodings of JSON/XML (BSON, BJSON, WBXML, etc.): faster processing and more compact, but for some reason plain XML/JSON remain far more widely used; the binary variants never gained that popularity. (Likely reason: the space saving is small, roughly 20 bytes per data unit, so this small saving is rarely preferred over the loss of human readability.)
Third-party encoding libraries achieve larger space gains:
- Thrift (Facebook): BinaryProtocol and CompactProtocol (the latter saves more space, e.g., by using variable-length integers).
- Protocol Buffers (Google): broadly similar to Thrift's CompactProtocol.
Term:
Schema Evolution
Forward and backward compatibility meaning:
Schema (Thrift IDL, the book's example):
struct Person {
  1: required string       userName,
  2: optional i64          favoriteNumber,
  3: optional list<string> interests
}
Forward compatibility:
Old code can read records that were written by new code. For example:
You can add new fields to the schema, provided that you give each field a new tag
number. If old code (which doesn’t know about the new tag numbers you added)
tries to read data written by new code, including a new field with a tag number it
doesn’t recognize, it can simply ignore that field. The datatype annotation allows the
parser to determine how many bytes it needs to skip (see the toy illustration below).
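A toy illustration of the tag-number mechanism (not real Thrift encoding; records are simplified to (tag, value) pairs): old code knows only tags 1 and 2 and simply skips the unknown tag 3 written by new code:

```python
# Field names the OLD reader knows, keyed by tag number.
OLD_SCHEMA = {1: "userName", 2: "favoriteNumber"}

def decode(record, schema):
    out = {}
    for tag, value in record:
        if tag in schema:
            out[schema[tag]] = value
        # Unknown tag: skip it. This is what makes the old reader
        # forward compatible with data written by newer code.
    return out

new_record = [(1, "Martin"), (2, 1337), (3, ["hacking", "hiking"])]
print(decode(new_record, OLD_SCHEMA))   # tag 3 is ignored
```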
Backward compatibility:
New code can read records that were written by old code. As long as each field has a unique tag number, new code can always read old data, because the tag numbers still carry the same meaning. One catch: a field added after the initial deployment cannot be required (old data will not contain it); it must be optional or have a default value.
Modes of Data Flow:
Point: each mode of data flow has its own compatibility requirements.
- Via databases
- Via service calls
- Via asynchronous message passing
Data Flow Through Databases:
Both backward and forward compatibility matter here: several processes, some running older code and some newer, may read and write the same database concurrently, and stored data outlives code.
Data Flow Through Services: REST and RPC
REST vs RPC
RPC: some RPC frameworks are tied to a single language (e.g., Java RMI is Java-only), and datatypes must be translated when the calling application is written in a different language.
REST: good for debugging, since requests can be made from a browser or the command line without installing any software; supported by all major programming languages and platforms; and there is a vast ecosystem of tools available (servers, caches, load balancers, proxies, firewalls, monitoring, debugging tools, testing tools, etc.).
The main focus of RPC frameworks is on requests between services owned by the same organization, typically within the same data-center.
Microservice architecture and SOA (service-oriented architecture) are essentially the same idea; SOA has been rebranded as microservice architecture.
Message-Passing Data Flow: