| 1 | Amazon Kendra | Text and document | handles document search queries |
| 2 | Amazon Personalize | Product recommendations | Generates real-time recommendations based on item-interaction data. Can be integrated with AWS Lambda, and through Lambda other services (e.g. Amazon Lex for personalised chat) can call it. |
| 3 | SageMaker Model Monitor | Model monitoring – accuracy | Continuously tracks the accuracy of deployed models and can trigger retraining workflows when model drift or decreased accuracy is detected. Notes: 1. Monitoring for issues like data drift, helping maintain model performance over time, is its feature. 2. Not for evaluating training performance. 3. Uses ground-truth ingestion to merge the actual outcomes (ground truth) with the model's predictions to evaluate the model's performance in production. |
| 4 | SageMaker Neo | Model optimisation | Optimizes trained models to run efficiently on specific target hardware (cloud instances and edge devices). |
| 5 | SageMaker Feature Store | | Central repository for storing, sharing, and managing ML features. |
| 6 | SageMaker Autopilot | | Automates the entire ML workflow, from data preprocessing to model training and tuning. |
| 7 | SageMaker Experiments | | Tracks each experiment; captures and organizes metadata for trials to ensure reproducibility. Features: – Offers both an API and a graphical interface in SageMaker Studio, letting users visualize and compare key performance metrics, such as accuracy and loss, across trials to identify the best-performing model. – Trial components represent the stages of a workflow within a trial, such as data preprocessing, model training, and evaluation, so the ML engineer can track each step separately and compare results across trials. |
| 8 | SageMaker Pipelines | | orchestrates complex workflows, ensuring automation and reproducibility. |
| 9 | AWS Glue | ETL | – ETL processes, not real-time stream processing. – AWS Glue FindMatches automatically detects and groups duplicate records in a dataset.
– AWS Glue jobs cannot be directly integrated as processing steps in SageMaker Pipelines, because Glue is an independent service, not a SageMaker-native processing task.
– AWS Glue crawlers infer schemas and available columns: crawlers can analyze .csv files in Amazon S3, automatically infer the schema and structure of the data, and create a table definition in the AWS Glue Data Catalog, enabling the data to be organized and understood.
– AWS Glue DataBrew for data cleaning and feature engineering: a visual data preparation tool that lets the ML engineer clean and preprocess .csv data; tasks such as filling in missing fields, transforming formats, and normalizing data can be performed easily. |
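The schema inference a Glue crawler performs can be illustrated in miniature: given rows from a .csv file, guess a Glue-style type per column. This is a conceptual pure-Python sketch, not the Glue API; the column names and sample data are invented.

```python
import csv
import io

def infer_schema(csv_text):
    """Guess a Glue-style type (bigint, double, string) for each CSV column."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    schema = {}
    for i, name in enumerate(header):
        values = [row[i] for row in data]
        if all(v.lstrip("-").isdigit() for v in values):
            schema[name] = "bigint"          # every value is an integer
        else:
            try:
                for v in values:
                    float(v)
                schema[name] = "double"      # every value parses as a float
            except ValueError:
                schema[name] = "string"      # fallback type
    return schema

sample = "user_id,amount,country\n1,9.99,DE\n2,15.50,US\n3,7.25,FR\n"
print(infer_schema(sample))
# {'user_id': 'bigint', 'amount': 'double', 'country': 'string'}
```

A real crawler samples the data, handles more types (timestamps, structs), and writes the result into the Data Catalog rather than returning a dict.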
| 10 | Analyze Lending workflow API (Amazon Textract) | | Specialised for mortgage documents. |
| 11 | AnalyzeExpense API (Amazon Textract) | | For financial documents – focuses on invoices, receipts, etc., not legal contracts. |
| 12 | AnalyzeID API (Amazon Textract) | | For analysing identity documents specifically, not legal contracts or other document types. |
| 13 | AnalyzeDocument API (Amazon Textract) | | Extracts key-value pairs and tables from documents; as an enhanced feature, custom queries can be used to categorize documents based on specific criteria. |
| 14 | Amazon Macie | PII | Automated data classification and sensitive-data discovery jobs (over data in Amazon S3) that can be scheduled to run regularly. |
| 15 | Amazon GuardDuty | Security threats | Continuously monitors AWS accounts and workloads for malicious activity and unauthorized behaviour. |
| 16 | AWS CloudTrail | Auditing | Logs access and actions performed on the model, or on any other AWS service (API-level audit trail). |
| 17 | AWS KMS | Encryption | Used to encrypt both the training data stored in Amazon S3 and the model artifacts, helping ensure compliance with healthcare data-privacy regulations – i.e. it covers training data and deployed models. |
| 18 | SageMaker Ground Truth | Labelling | Features: – Active learning uses a combination of ML models to automatically label simpler cases, while more complex cases are sent to human workers, helping reduce costs. – Private workforce refers to an internal team used for labeling; it does not involve automation. |
| 19 | Kinesis Data Streams | Collect and stream real-time data | For collecting and streaming data from IoT devices; real-time data can be forwarded to AWS Lambda for analysis. |
| | Kinesis Data Analytics | Stream processing | Allows immediate processing of high-throughput streams with SQL or Apache Flink applications. |
| 20 | AWS Shield Standard | DDoS protection | Automatic protection against common layer 3/4 DDoS attacks, included at no extra cost. |
| 21 | AWS Shield Advanced | | Shield Advanced offers enhanced protection, detailed attack diagnostics, and cost protection against scaling charges due to DDoS attacks. |
| 22 | Amazon Transcribe | Audio → text | Analyses audio recordings, e.g. from phone calls. Features: batch transcription is useful for large volumes; custom vocabulary improves transcription accuracy for industry-specific terms/jargon; automatic punctuation keeps transcripts readable; speaker identification distinguishes between speakers. |
| 23 | SageMaker Clarify | Model bias | Features: – Post-training bias detection helps analyze biases that may be present in the model's predictions after training. – Pre-training bias detection allows identification of biases in the input data before model training.
– Uses SHAP (Shapley Additive Explanations) values, a game-theory-based method, to calculate how individual features impact a model's prediction, providing both global and local explanations. This is crucial for explainability and compliance. Note: LIME is another explainability method, but SageMaker Clarify specifically uses SHAP values. |
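The SHAP idea behind Clarify's explanations can be shown on a toy model: a feature's Shapley value is its average marginal contribution over all orderings in which features are switched from a baseline to their actual values. A minimal sketch of the game-theory computation, not the Clarify API; the model and feature values are invented.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over every ordering in which it can be added to the coalition."""
    n = len(x)
    phi = [0.0] * n
    perms = list(permutations(range(n)))
    for order in perms:
        current = list(baseline)          # start from the baseline input
        for i in order:
            before = predict(current)
            current[i] = x[i]             # switch feature i to its real value
            phi[i] += predict(current) - before
    return [p / len(perms) for p in phi]

# Toy additive model: prediction = 2*feature0 + 3*feature1
predict = lambda f: 2 * f[0] + 3 * f[1]
phi = shapley_values(predict, x=[5, 4], baseline=[0, 0])
print(phi)  # [10.0, 12.0] -- contributions sum to f(x) - f(baseline) = 22
```

Exact enumeration is exponential in the number of features; SHAP libraries (and Clarify) use sampling and model-specific approximations instead.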
| 24 | QuickSight | | QuickSight is used for data visualization |
| 25 | SageMaker Debugger | | Provides built-in rules to detect vanishing gradients and overfitting, enabling real-time monitoring of the training process. |
| 26 | SageMaker RL estimator | Model Training | The SageMaker RL estimator helps you easily train models in local mode, allowing for quick iteration during development. |
| 27 | Data Wrangler | | Features: – One-hot encoding missing values: one-hot encoding converts categorical data to numerical values; it is not for handling missing values. – Scaling missing values: scaling changes the range of numerical data but does not address missing values. – Imputing missing values using pandas or PySpark: custom transformations in SageMaker Data Wrangler allow using libraries like pandas or PySpark to impute missing values based on the mean, median, or more complex methods. – Random undersampling: one of the techniques in SageMaker Data Wrangler for handling class imbalance by reducing the number of samples in the majority class. – Random oversampling: for class imbalance – adds duplicates of minority-class samples. https://aws.amazon.com/blogs/machine-learning/balance-your-data-for-machine-learning-with-amazon-sagemaker-data-wrangler/
It does not perform large-scale ETL operations or automate schema inference from on-premises databases.
The corrupt image transform is specifically designed to simulate real-world image imperfections, such as noise, blurriness, or resolution changes, during the preprocessing stage.
The outlier detection transform is useful for identifying and removing anomalous data points in numerical datasets, but it is not designed for handling image-quality variations. |
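The two balancing/cleaning techniques named above (mean imputation and random undersampling) can be sketched in plain Python to make the mechanics concrete. This is a conceptual illustration of the techniques, not Data Wrangler's own implementation; the sample values are invented.

```python
import random
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed values
    (what a mean-imputation transform does)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def random_undersample(samples, labels, majority_label, seed=0):
    """Drop majority-class samples at random until classes are balanced."""
    rng = random.Random(seed)
    majority = [s for s, l in zip(samples, labels) if l == majority_label]
    minority = [s for s, l in zip(samples, labels) if l != majority_label]
    kept = rng.sample(majority, len(minority))  # keep as many as the minority
    return kept + minority

print(impute_mean([10.0, None, 14.0]))   # [10.0, 12.0, 14.0]
balanced = random_undersample(list(range(10)), [0] * 8 + [1] * 2, majority_label=0)
print(len(balanced))                      # 4 -> two samples per class
```

Random oversampling is the mirror image: duplicate minority-class samples (with `rng.choices`) until the counts match.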
| 28 | Amazon Bedrock | Chatbot | Features: – Fine-tuning models with custom data: fine-tuning lets a company customize a foundation model with its own data, improving relevance and accuracy for domain-specific queries. – RAG for enhanced knowledge-base integration: RAG integrates a knowledge base into the model, allowing more accurate, context-aware responses based on private data. – Data encryption in transit and at rest: Amazon Bedrock encrypts data, including model prompts and responses, both in transit and at rest, ensuring secure handling of sensitive information. |
| 29 | Amazon SageMaker Studio Notebooks | | Features: – Provide persistent storage, allowing users to manage multiple notebooks, store datasets, and access them later, enabling better management of ML projects over time. – Traditional notebooks do not offer this persistent storage. |
| 30 | Amazon Rekognition | Image | Features : – Unsafe content detection, label detection, and face comparison |
| 31 | SageMaker Feature Store | | Features: – Offline store: for batch processing and storage, not for tracking the evolution of features. – Feature versioning: tracks changes to features over time, ensuring models can be reproduced accurately by maintaining a history of the features. – Feature scaling: adjusts the range of values in features. – Lineage tracking: provides transparency into feature creation. |
| 32 | Amazon Comprehend | Text analysis | Features: Amazon Comprehend for general feedback analysis; Amazon Comprehend Medical to process and extract relevant insights from medical feedback. |
| 33 | SageMaker Pipelines | | The relationships between steps in SageMaker Pipelines are defined using a directed acyclic graph (DAG); this structure outlines the dependencies and sequence of each step in the pipeline.
Callback steps are specifically designed to integrate external processes into the pipeline workflow: using a callback step, the pipeline waits until, for example, AWS Glue jobs complete. |
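The DAG idea can be illustrated with a tiny dependency resolver: each step runs only after everything it depends on. This is a conceptual sketch using the standard library, not the SageMaker Pipelines SDK; the step names (including the Glue callback step) are invented for illustration.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on, forming a DAG.
steps = {
    "preprocess": set(),
    "glue_callback": {"preprocess"},   # e.g. a callback step waiting on Glue jobs
    "train": {"glue_callback"},
    "evaluate": {"train"},
    "register_model": {"evaluate"},
}

# A valid execution order respects every dependency edge.
order = list(TopologicalSorter(steps).static_order())
print(order)
# ['preprocess', 'glue_callback', 'train', 'evaluate', 'register_model']
```

With the real SDK, the DAG is built the same way in spirit: each step object lists its input dependencies, and the service derives the execution order (running independent branches in parallel).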
| 34 | Amazon A2I -Amazon Augmented AI | | Amazon A2I can be directly integrated with Amazon Textract to route low-confidence predictions to human reviewers, simplifying the review process |
| 35 | SageMaker batch transform job | | Runs inference on large datasets with trained models, without deploying a persistent endpoint. Additional flow: S3 event → CloudWatch Events/EventBridge → SageMaker batch transform job → SageMaker pipeline (automated data prep, training, etc.). |
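The S3 → EventBridge hop in that flow hinges on an event pattern: the rule fires only for events whose fields match the pattern's allowed values. A minimal matcher for a small subset of EventBridge semantics, shown here conceptually; the bucket name and object key are invented, and real delivery is configured in EventBridge, not in application code.

```python
# EventBridge-style pattern: fire when a new object lands in the input bucket.
pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {"bucket": {"name": ["ml-inference-input"]}},
}

def matches(pattern, event):
    """Tiny subset of EventBridge matching: each pattern leaf is a list
    of allowed values; nested dicts are matched recursively."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True

event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {
        "bucket": {"name": "ml-inference-input"},
        "object": {"key": "batch/today.csv"},
    },
}
print(matches(pattern, event))  # True
```

A matching event would then invoke the rule's target, e.g. a Lambda function or Step Functions state machine that starts the batch transform job.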
| 36 | Amazon Athena | | used for querying data, not triggering inference jobs. |
| 37 | Amazon Q Business | | Features: – Integration with Jira through Amazon Q Business plugins. – RAG for accuracy. – Natural Language Understanding (NLU), related to understanding text. – Security: 1. Enable data encryption for sensitive information. 2. Access control: integrate Amazon Q Business with AWS IAM Identity Center to manage user permissions. |
| 38 | Amazon Fraud Detector | | (Deprecated 7 Nov 2025.) Features: has built-in scalability. |
| 39 | Amazon SageMaker Lineage Tracking | Artifacts | Tracks the lineage of artifacts (e.g. datasets, models, and experiments) within an ML workflow, providing visibility into the relationships between components, such as which dataset and training job produced a specific model version. |
| 40 | AWS Lake Formation | | Specifically designed for aggregating and managing large datasets from various sources, including Amazon S3, databases, and other on-premises or cloud-based sources. Supports connecting to on-premises PostgreSQL databases and Amazon S3, making it the best choice for aggregating transaction logs, customer profiles, and database tables. Additionally, the centralized data lake can be used for further analysis and ML training. |
| 41 | SageMaker Model Registry | | Key benefits of Model Registry collections: non-disruptive reorganization using collections; better model management and discoverability at scale. |
| 42 | SageMaker ML Lineage Tracking | | Automatically tracks lineage information, including input datasets, model artifacts, and inference endpoints, ensuring compliance and auditability. |