My Comprehension of AWS Services

Intro

In order to learn cloud computing, I chose Amazon Web Services (AWS) as the platform to explore. The best thing about cloud solutions is that companies can simply rent servers instead of building, maintaining and paying for their own infrastructure.


Features

The app includes the following features:

  • AWS

Demo

Cloud Computing:
  • A common opinion about cloud computing is that it concerns only System Administrators or DevOps Engineers because of the solutions it provides for server provisioning, networking etc. In fact, it also concerns Software Developers, Security Engineers, Program Managers and so on, as there are many services related to computing, storage, databases, analytics, encryption, deployment and more.
  • Cloud computing is basically a model in which computing resources are available as a service.
  • Important characteristics of cloud computing:
    - On-demand and self-service: we can launch resources at any time with no manual intervention, i.e. we can provision resources whenever needed without requiring any human action on the provider's side.
    - Elasticity: we can scale up or down at any time according to our needs. This property is closely related to scalability, which can be horizontal (adding or removing servers in a cluster) or vertical (adding or removing resources in an existing server).
    - Measured: we pay only for the resources we actually use.
    - High availability: backups are kept and the failure of a single component is accommodated. In case of a server failure there is another instance to back it up.
  • There are 3 types of cloud computing:
    - Software as a Service (SaaS) - a ready-to-use application typically accessible via a browser. The provider takes responsibility for upgrades, security, error handling etc.
    - Platform as a Service (PaaS) - e.g. deploying a Python application on a managed server along with all its dependencies.
    - Infrastructure as a Service (IaaS) - gives the highest level of flexibility and management control over the entire IT infrastructure without having to physically maintain it. It provides access to networking features, computers (virtual or on dedicated hardware), and data storage space.

    source: wp-includes.com
  • Cloud Architecture:
    1. The cloud is actually a real data center with X servers already set up.
    2. The data center provides a virtualization layer (in the case of AWS it's mainly Xen).
    3. On top of the virtualization layer there are multiple virtual servers.

  • Data centers are organized into availability zones that are grouped by geographic region. They act as backups in case one of the data centers fails.
  • Each AWS region contains at least 2 availability zones.
Server Connection:
  • To connect to a server we need:
    - an SSH client for a Linux server, or
    - an RDP client for a Windows server,
    which allows us to get connected to the server.
  • To get an SSH client for Windows on a local machine I use MobaXterm.
  • macOS and Linux have their own terminals with a built-in SSH client.
  • There is also a way of connecting to a server from a browser using a browser-based SSH connection. Then you don't need any SSH client - it can be done directly from the AWS console.
  • There is also a need for establishing key-based authentication:
    - it replaces password-based authentication, as using a password for authentication is less secure,
    - in key-based authentication there are two special keys: a public key and a private key,
    - when the public key is stored on a server, only the corresponding private key can authenticate successfully,
    - we can simply create a key pair, for example in EC2 in the AWS console. We can choose its format as pem (when using OpenSSH) or ppk (when using PuTTY). Once created, the key gets downloaded to the local machine as the private key - a minimal boto3 sketch is shown below.
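  • A minimal sketch of creating such a key pair programmatically with boto3 (the key name and file path are just example placeholders):
    import boto3

    ec2 = boto3.client('ec2')

    # AWS keeps the public key; the private key material is returned only once
    response = ec2.create_key_pair(KeyName='ec2-key')

    # Save the private key locally for SSH use
    with open('ec2-key.pem', 'w') as f:
        f.write(response['KeyMaterial'])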
Data Warehouse:
  • Used in Business Intelligence, in which we transform raw data into useful business insights. It involves the following steps:
    1. Gathering data from different sources: ERP, CRM, OS, flat files.
    2. Extracting data from all the sources, transforming and cleaning it, and loading it into a data warehouse.
    3. Using the data in the data warehouse for business analysis.

  • source: radikal-labs.com
  • Relational database vs data warehouse:

    Relational Database | Data Warehouse
    Contains the up-to-date data. | Contains the historical data.
    Useful in running the business. | Useful in analyzing the business.
    Read and write operations. | Mostly read operations.
    Accesses a limited number of records. | Accesses even millions of rows if needed.
    Usually one source that serves an application. | Typically a collection of many data sources.
Infrastructure as Code (IaC):
  • There are two ways of building the infrastructure: manually or through scripted automation.
  • Automation with IaC helps to automate infrastructure building at every stage of an app service's lifecycle:
    1. Every time a new app service comes up, its infrastructure first needs to be built in the development environment.
    2. Then the app service moves to the staging area for testing, where the same infrastructure needs to be built once again.
    3. Finally it moves to the production environment, where all the infrastructure has to be replicated.
  • It becomes even more helpful when deploying multiple app services.

  • source: udemy
  • One IaC template can be reused to build the infrastructure for different stages of different app services during deployment.
  • Tools for IaC: Terraform or AWS CloudFormation.
  • AWS CloudFormation:
    - provides a template where you describe your desired resources and their dependencies so that you can launch and configure them together as a stack,
    - workflow (a minimal boto3 sketch of launching a stack follows):
    source: AWS
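  • A minimal sketch of launching a stack from a template with boto3 (the stack and template names are hypothetical):
    import boto3

    cloudformation = boto3.client('cloudformation')

    with open('template.yaml') as f:
        template_body = f.read()

    # Create the stack and wait until all resources in it are provisioned
    cloudformation.create_stack(
        StackName='demo-stack',
        TemplateBody=template_body,
        Capabilities=['CAPABILITY_NAMED_IAM'],
    )
    cloudformation.get_waiter('stack_create_complete').wait(StackName='demo-stack')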
Cloud Development Kit (CDK):
  • We can use a programming language like Python or JS to create, configure and deploy AWS resources.
  • It gives us the usual IDE and language functionality like autocomplete, compile-time warnings, control-flow statements and object-oriented programming.
  • CDK code is synthesized into CloudFormation (or Terraform) output which can be deployed to AWS.
  • This is more convenient than raw CloudFormation, which requires YAML or JSON and makes it hard to share constructs across different projects in a scalable way.
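  • A minimal CDK v2 sketch in Python (assuming the aws-cdk-lib and constructs packages are installed; the stack and bucket names are hypothetical) - it synthesizes to a CloudFormation template that cdk deploy can roll out:
    from aws_cdk import App, Stack
    from aws_cdk import aws_s3 as s3
    from constructs import Construct

    class StorageStack(Stack):
        def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
            super().__init__(scope, construct_id, **kwargs)
            # One versioned S3 bucket defined as code
            s3.Bucket(self, 'ExampleBucket', versioned=True)

    app = App()
    StorageStack(app, 'StorageStack')
    app.synth()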
AWS Storage types:
Block Storage | Object Storage
Data (files) is split into smaller chunks of a fixed size (blocks). | Each stored file is an object.
Each block has its own address. | Each object has a unique identifier.
No metadata about blocks. | Metadata with contextual information about each object.
Supports read/write operations. | Data is mostly read (rather than written to).
Easy data modification by accessing a specific block. | Modifying a file means uploading a new revision.
Blocks are accessed on a server via the underlying file system protocol (NFS, CIFS, ext3/ext4). | Objects are accessed via the HTTP protocol.
Domain Name System:
  • Translates Domain Name to corresponding server's IP address:
    www.example.com -> 1.2.3.4
  • Workflow (a quick local illustration with Python follows below):
    1. The user enters www.example.com in a browser.
    2. The ISP's DNS resolver looks up the corresponding IP address and returns 1.2.3.4.
    3. The browser takes the server IP address and makes an HTTP request to the AWS server.
    4. The AWS EC2 server accepts the request.

  • source: AWS
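  • A quick local illustration of the resolution step using Python's standard library (www.example.com is just a placeholder):
    import socket

    # Resolve a domain name to an IP address, as the DNS resolver does for the browser
    ip_address = socket.gethostbyname('www.example.com')
    print(ip_address)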
Serverless Services:
  • In fact, it doesn't mean there is no server present. There are servers hosting your application, however they are completely managed by the provider. You only care about the app's code.
  • A very popular way of having a serverless service is the PaaS model, where the only thing we need to do is upload the application. The PaaS provider takes care of the rest: setting up the capacity, launching servers in high-availability and auto-scaling mode, installing technology-specific packages and dependencies, security, patching and monitoring.
  • There are AWS services that can be used without instantiating any server. There is no need for capacity planning or estimating how many resources we might need. We are charged only for the computing time we consume. Example serverless services:
    - AWS Lambda for computing,
    - AWS S3 for data storage,
    - DynamoDB for the database,
    - SQS and SNS for app integration.
Storage classes:
  • Depending on the storage class, the availability, durability and performance, and thus the pricing, will differ:
    - Standard S3: for general purpose; has higher availability and much higher pricing than the infrequent-access classes.
    - Standard S3 with Infrequent Access (Standard IA): when we don't need the highest availability we can go with this lower-priced option.
    - Reduced Redundancy Storage (RRS): lower durability and lower availability, so we should keep only non-critical, reproducible data there.
    - Glacier: meant for archiving and storing long-term backups. It has very high durability but low availability - it can take a few hours to get data restored.
    - Glacier Deep Archive: the lowest-cost storage class that AWS offers. It supports long-term retention of data that may be accessed once or twice a year. It has very low availability - data can be restored within 12 hours.
    - Intelligent-Tiering: it detects seldom-used data and moves it to the most cost-effective tier, like Standard IA. So we end up with a frequent-access tier and an infrequent-access tier that differ in pricing. This class is preferable when we store long-lived data with unknown or unpredictable access patterns - we cannot assess which part of the data will be accessed frequently and which not.
    - One Zone-IA: while Standard S3 or Standard IA stores data in at least 3 availability zones, S3 One Zone-IA stores data in a single availability zone, which reduces overall costs. It's a good solution for secondary backup copies of on-premises data or for data that can be easily recreated. The only risk is that the data will be lost if the availability zone is destroyed.
  • We can choose the storage class while uploading an object to S3:
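  • A minimal boto3 sketch of uploading an object with an explicit storage class (the bucket and key names are hypothetical):
    import boto3

    s3 = boto3.client('s3')

    s3.put_object(
        Bucket='example-bucket',
        Key='backups/archive-2021.zip',
        Body=b'example content',
        StorageClass='STANDARD_IA',  # e.g. GLACIER, DEEP_ARCHIVE, INTELLIGENT_TIERING, ONEZONE_IA
    )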

Key Management Service (KMS):
  • KMS is used for storing and managing the encryption keys for AWS data.
  • We apply encryption to sensitive data to prevent unauthorized users from accessing it. Even a hacker who breaks into your database server will find it hard to decrypt the data.
  • It may need to be applied due to company policy or because of external regulations like GDPR that enforce personal data security.
  • Here is the high-level encryption flow:
    confidential data >> encryption algorithm + encryption key >>
    encrypted data on AWS storage
  • Encryption can be executed in two ways:
    - Client-side encryption - the app on EC2 maintains the key and encrypts the data, sending the encrypted data to AWS S3.
    - Server-side encryption - confidential data is sent over HTTPS (security of data in transit) to AWS S3 and gets encrypted there (security of data at rest).
  • Envelope encryption:
    - encryption key for encrypting confidential data is called a Data Key which undergoes the following process:
    Data Key >> Encryption algorithm + Customer Master Key >>
    Encrypted Data Key
    - The Customer Master Key can be either AWS managed or customer managed (a minimal boto3 sketch of envelope encryption follows).
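  • A minimal boto3 sketch of envelope encryption with KMS (the key alias is a hypothetical placeholder):
    import boto3

    kms = boto3.client('kms')

    # Generate a data key under the Customer Master Key / KMS key
    resp = kms.generate_data_key(KeyId='alias/my-app-key', KeySpec='AES_256')
    plaintext_data_key = resp['Plaintext']       # used to encrypt the data locally, then discarded
    encrypted_data_key = resp['CiphertextBlob']  # stored alongside the encrypted data

    # Later, decrypt the stored data key to recover the plaintext key
    recovered_key = kms.decrypt(CiphertextBlob=encrypted_data_key)['Plaintext']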
AWS Identity and Access Management (IAM):
  • Managing/controlling access and user roles for AWS services and resources (an AWS entity like an S3 bucket or another object).
  • IAM as a feature is free of charge. You are only charged for use of AWS services by your users.
  • Things we can do with IAM:
    - creating users, assigning them individual security credentials and granting access to AWS services and resources,
    - managing user roles and permissions to control what operations can be performed by an individual or what AWS resources the individual is allowed to access.
  • IAM most important elements:
    - users - individuals with logins,
    - groups - collections of users with a common theme - one set of permissions for the entire group,
    - policies - documents granting or restricting access,
    - roles - collections of policies that can be assigned as well, but they are interchangeable - they can be shared, limited etc.
  • Each role can have one or more policies assigned:

  • A user-specific AWS Access Key and Secret Access Key will be created as soon as we create a user in IAM:


    - we can download them as a CSV file (a minimal boto3 sketch of creating a user and access keys follows).
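  • A minimal boto3 sketch of creating a user, attaching a managed policy and generating access keys (the user name is hypothetical):
    import boto3

    iam = boto3.client('iam')

    iam.create_user(UserName='demo-user')

    # Grant read-only access to S3 via an AWS managed policy
    iam.attach_user_policy(
        UserName='demo-user',
        PolicyArn='arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess',
    )

    # Programmatic credentials (Access Key ID + Secret Access Key)
    keys = iam.create_access_key(UserName='demo-user')
    print(keys['AccessKey']['AccessKeyId'])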
Virtual Private Cloud (VPC):
  • It is a private sub-section of AWS which you control in terms of who has access to which AWS resources.
  • More technically, AWS lets us provision a logically isolated section of the cloud in which we can define custom virtual network configuration like IP address ranges, subnets, route tables and network gateways.
  • When creating an AWS account, a default VPC is created for the user, so everybody has their own VPC.
  • VPC architecture and components:
    source: Linux Academy
    - An Internet Gateway (IGW) allows communication between your VPC and the internet.
    - A route table lists predefined routes to the default subnets.
    - A Network Access Control List (NACL) has pre-defined rules for access.
    - The VPC is partitioned into subnets in which AWS resources (e.g. EC2 instances) are provisioned.
  • We can set a specific VPC for an EC2 instance:

EC2:
  • It stands for Elastic Compute Cloud.
  • Essentially, it's the name for a server that we can launch in AWS.
  • Elastic means that we can resize the server's capacity at any time.
  • AWS ensures high availability, so when one EC2 server goes down, the hosted application can still be served by another EC2 server.
  • Launching a server is as easy as hitting a button and going through a configuration wizard.
  • When launching a new EC2 instance on AWS we need to configure the following things (a minimal boto3 sketch follows this list):
    - region,
    - server OS which is Amazon Machine Image (AMI),
    - CPU and memory size of EC2 instance,
    - number of instances,
    - storage capacity,
    - authentication key,
    - security (firewall).
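  • A minimal boto3 sketch covering the same configuration programmatically (the region, AMI ID, key name and security group are hypothetical placeholders):
    import boto3

    ec2 = boto3.client('ec2', region_name='eu-west-1')  # region

    ec2.run_instances(
        ImageId='ami-0123456789abcdef0',   # server OS (AMI)
        InstanceType='t2.micro',           # CPU and memory size
        MinCount=1, MaxCount=1,            # number of instances
        KeyName='ec2-key',                 # authentication key pair
        SecurityGroupIds=['sg-0123456789abcdef0'],  # firewall rules
    )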
  • Once the instance is created, we can connect to the server with SSH:
    - connecting to EC2 server:
    ssh -i ec2-key.pem ec2-user@{Public IP}
    - getting admin rights:
    sudo su -
    - installing some packages:
    yum -y install nginx
    yum -y install mysql
    - cd to location:
    cd /usr/share/nginx/html/
    - modifying a file:
    echo "Hello World" > index.html
    - running service:
    service nginx start
  • The OS for an EC2 instance is defined by the Amazon Machine Image (AMI). We can run multiple instances from a single AMI.
  • There is also persistent block storage for AWS EC2 instances, called Elastic Block Store (EBS):
    - It's visible under the Root device property of the EC2 instance:


    - Persistent means that the data will remain even when we stop the EC2 instance.
    - The volumes are replicated, backed up and connected to EC2 instances over the network:


    - We can still utilize the instance store, which gives fast performance; however, the data will be lost if your EC2 instance stops or terminates or the underlying host disk fails. The compensation is that it's quite cost-effective, but we need to make sure to back up the data, for example in S3.
    - We need to specify what kind of storage we want to use at the AMI configuration step:

  • EC2 can be equipped with an Elastic Load Balancer (ELB) so that traffic is distributed across multiple EC2 instances.
  • EC2 offers Auto Scaling, which automatically adds or removes EC2 instances according to conditions we define (a minimal boto3 sketch follows this list), e.g.:
    1. Dynamic Scaling:
        - if average CPU utilization > 60 % then add two more instances,
        - if average CPU utilization < 30 % then remove two instances.
    2. Scheduled Scaling:
       - servers are scaled based on a specific schedule.
    3. Predictive Scaling:
        - based on machine learning algorithms that automatically adjust server capacity.
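  • A minimal boto3 sketch of a CPU-based scaling rule; it uses a target-tracking policy (a simpler variant of the rules above) and a hypothetical Auto Scaling group name:
    import boto3

    autoscaling = boto3.client('autoscaling')

    # Keep average CPU utilization around 60% by adding or removing instances
    autoscaling.put_scaling_policy(
        AutoScalingGroupName='demo-asg',
        PolicyName='cpu-target-60',
        PolicyType='TargetTrackingScaling',
        TargetTrackingConfiguration={
            'PredefinedMetricSpecification': {
                'PredefinedMetricType': 'ASGAverageCPUUtilization'
            },
            'TargetValue': 60.0,
        },
    )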
AWS Relational Database Service (RDS):
  • AWS RDS supports various database engines like MySQL, PostgreSQL, Microsoft SQL Server and Oracle, hosted on managed EC2 instances.
  • AWS also offers a NoSQL database, DynamoDB, which stores key-value pairs.
  • Like in other services, AWS provides:
    - database provisioning via GUI,
    - security,
    - patching,
    - backup,
    - high-availablity.
  • We are able to pick the engine while creating the database:

  • Connectivity:
    - we can deploy the RDS database into a specific VPC (then we would need to deploy the Lambda function into the same VPC as the RDS database),
    - or we can enable the Data API, which allows interacting with the database through an HTTP endpoint, for instance from a Lambda function.
  • The database will be created with a bunch of technical details, like the endpoint and port, which can be checked in the Connectivity & security tab.
  • Amazon Aurora combines the performance of traditional enterprise databases with the simplicity and cost-effectiveness of open-source databases. When creating it we don't have to specify the storage, as it grows along with the data.
  • Query editor:
    - in order to query a database we need to create a connection when entering the query editor:

Amazon Simple Storage Service (Amazon S3):
  • S3 is a durable storage system based on object storage.
  • In S3 we have buckets that are like folders where we can store multiple objects (files). Bucket names are unique across the entire AWS namespace. Buckets can have subfolders.
  • It can be used for hosting simple static websites at lower cost. With that solution there is no need to instantiate an EC2 server.
  • When you upload files to the cloud storage, they are backed up automatically.
  • S3 has lifecycle rules for files, which means a file can be moved to cheaper storage, archived, or even deleted when it's older than x days.
  • We can also configure replication rules on S3 to copy a file into a different bucket when it is uploaded. We can also set events to be triggered once a file is uploaded.
  • Each object in the S3 bucket has its own object URL assigned. Anyone can access an object as long as they have its URL and we select 'Everyone' in the Access control list on the Permissions tab.
  • When there are strict security requirements, each S3 bucket should have an individual KMS encryption key (a one-to-one relationship). Ideally, the key alias should reference the S3 bucket name, as key IDs are not self-explanatory. A minimal sketch of enabling such default encryption is shown below.
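  • A minimal boto3 sketch of setting such a bucket-specific KMS key as the default encryption (the bucket name and key alias are hypothetical):
    import boto3

    s3 = boto3.client('s3')

    s3.put_bucket_encryption(
        Bucket='example-bucket',
        ServerSideEncryptionConfiguration={
            'Rules': [{
                'ApplyServerSideEncryptionByDefault': {
                    'SSEAlgorithm': 'aws:kms',
                    'KMSMasterKeyID': 'alias/example-bucket-key',  # alias references the bucket name
                }
            }]
        },
    )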
DynamoDB
  • A NoSQL database.
  • DynamoDB Stream:
    - It's a feature that emits events when record modifications occur in a DynamoDB table.
    - We distinguish 3 types of events on a table: insert, modify (update) and remove.
    - Events can carry the content of the rows being modified, so we can look at the record before and after the change.
    - Events arrive in the same order in which the modifications take place.
    - We can detect changes in a DynamoDB table using a Lambda function - every time an event occurs, the Lambda gets invoked (see the sketch after this list).
    - The Lambda's event argument contains the details of the change that has occurred.
    - No performance impact on source table.
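  • A minimal sketch of a Python Lambda handler consuming such stream events (assuming the stream view type includes old and new images):
    def lambda_handler(event, context):
        for record in event['Records']:
            event_name = record['eventName']                # INSERT, MODIFY or REMOVE
            old_image = record['dynamodb'].get('OldImage')  # row before the change (if present)
            new_image = record['dynamodb'].get('NewImage')  # row after the change (if present)
            print(event_name, old_image, new_image)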
Elastic Container Service (ECS)
  • Deploys Docker containers and makes sure that containers are isolated from one another.
  • Allows launching, setting up and monitoring Docker containers on an ECS cluster.
  • Serverless (with Fargate) or self-managed (with EC2) options.
  • Auto-scaling of the number of containers based on traffic, memory or CPU utilization.
  • Suitable either for ad-hoc jobs or full-scale services.
  • Cost-effective, as we can host multiple different containers on a single computing resource.
  • With Docker we only need one operating system, as opposed to virtual machines.
  • ECS elements and workflow (a minimal boto3 sketch of registering a task definition follows):
    1. Building the Dockerfile and uploading the image to Amazon Elastic Container Registry (ECR) - like S3 for Docker images.
    2. Defining a task in ECS - a task is an abstraction on top of a container that tells ECS how we want to spin up Docker containers. A task can contain more than one container.
    3. A cluster - a resource farm (EC2 instances or Fargate). We take the task and run it on the ECS cluster.
    4. We can put a service on the ECS cluster - it allows us to specify a minimum number of tasks, and therefore containers, running on the cluster at any point in time.
    5. Load balancer.
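  • A minimal boto3 sketch of step 2, registering a Fargate task definition (the family name and ECR image URI are hypothetical):
    import boto3

    ecs = boto3.client('ecs')

    ecs.register_task_definition(
        family='demo-task',
        requiresCompatibilities=['FARGATE'],
        networkMode='awsvpc',
        cpu='256',      # 0.25 vCPU
        memory='512',   # 512 MiB
        containerDefinitions=[{
            'name': 'web',
            'image': '123456789012.dkr.ecr.eu-west-1.amazonaws.com/demo:latest',
            'portMappings': [{'containerPort': 80}],
        }],
    )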
Simple Queue Service (SQS)
  • Distributed message queuing service.
  • Supports standard queues (ordering not preserved) or FIFO (First In First Out) queues.
  • Integration:
    Client > SQS > Lambda function
    - we can set lambda function as a consumer of queue's messages.
  • SQS holds messages until a consumer (e.g. a Lambda) comes along and reads a message off the queue. Once it is done processing, the Lambda deletes the message.
  • Lambda code handling the messages:
    exports.handler = async (event) => {
      // each invocation receives a batch of one or more SQS records
      for (const { messageId, body } of event.Records) {
        console.log('SQS message %s, %j', messageId, body);
      }
      return `Successfully processed ${event.Records.length} messages.`;
    };
    
  • We need to enable the SQS trigger in the Lambda function in order to integrate the two:

Simple Notification Service (SNS)
  • A one-to-many relation: one publisher (publishing messages to topics) and many consumers of the messages on a specific topic.
  • We can set up different kinds of consumers: email, an HTTP endpoint in a Node.js or Python Flask app listening on a specific port, SQS etc.
  • While SQS uses a polling mechanism (consumers pull messages from the queue), SNS uses a push mechanism (messages are pushed to the subscribers).
  • Two main elements: Topics and Subscriptions.
  • Purpose: App to Person or App to App.
  • App to App model:
    1. An external customer service publishes a message (details about an order, for instance).
    2. A serverless Lambda function takes the data, optionally applies business logic and pushes it further to a database,
    3. or SQS receives the SNS messages so that they can be consumed at a later time (no need for immediate data processing).
  • It's necessary to have SNS in the middle in either model, as we don't want the external customer service to know about each consumer. Not having SNS in the middle also causes performance and scaling problems (when adding more consumers).
  • When setting up a topic we need to decide who is going to be able to publish messages to this topic and who can subscribe to it (a minimal boto3 publish sketch is shown at the end of this section):


    - when we select everyone as publishers, anyone who has the ARN of the topic can publish to it.
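  • A minimal boto3 sketch of the publisher side (the topic ARN and message content are hypothetical):
    import boto3
    import json

    sns = boto3.client('sns')

    sns.publish(
        TopicArn='arn:aws:sns:eu-west-1:123456789012:orders-topic',
        Subject='New order',
        Message=json.dumps({'order_id': 123, 'amount': 49.99}),
    )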
AWS Lambda:
  • A fully managed compute service that runs our code when an event appears (for instance, uploading an object to S3 can trigger a Lambda function) or on a schedule.
  • AWS Lambda provides:
    - servers,
    - capacity,
    - deployment,
    - scaling,
    - high-availability,
    - OS updates,
    - security.
  • What we provide:
    - code,
    - money - we pay only for what we use.
  • All Lambda functions are stateless. To keep data, we need an integration with S3 or DynamoDB.
  • When we upload the code we receive a so-called Amazon Resource Name (ARN):
    - the ARN is a unique identifier for a particular Lambda application,
    - using the ARN we have a mechanism to invoke the Lambda function,
    - behind the invocation there is a load balancer that manages compute resources (EC2),
    - so when an invocation comes into Lambda, the load balancer deploys the code onto one or more EC2 instances,
    - multiple EC2 instances are available when concurrent invocations appear.
  • Available integrations:
    - a Lambda function behind API Gateway to create REST APIs,
    - hooking up S3 to a Lambda function for data processing - when a new file is inserted/updated/deleted, the Lambda gets triggered to respond to that change,
    - SQS with Lambda for message buffering and processing,
    - SNS with Lambda for message processing,
    - Step Functions with Lambda for workflow orchestration,
    - Snowflake or DynamoDB with Lambda for change detection in a database table.
  • There are many AWS-related events that can trigger a lambda function:


    - the trigger causes the Lambda code to be executed,
    - the event is the input passed to your code; the Lambda code gets a copy of the event data, and usually we want to inspect that data and perform an action based on it.
  • We can specify the runtime during configuration:

  • We can integrate a Lambda with an S3 bucket, getting a file from it:
    import boto3
    import csv

    # object key (with subfolder prefix) and bucket name
    key = 'sub_folder_name/file_name.csv'
    bucket = 's3_bucket_name'

    def lambda_handler(event, context):
        s3_resource = boto3.resource('s3')
        s3_object = s3_resource.Object(bucket, key)

        # read the whole object and split it into lines
        data = s3_object.get()['Body'].read().decode('utf-8').splitlines()

        # parse the lines as CSV records
        lines = list(csv.reader(data))

        print(len(lines))
        print(lines[0])

        # for line in lines:
        #     print(line)
    
    
  • The Lambda is automatically passed a context object that is essentially metadata about your Lambda function, in case you need to access its attributes within your code:
    lambda_handler(event, context)
    - event - information about the trigger/action that caused the invocation of the Lambda instance,
    - context - provides information about the invocation, the function itself and the execution environment.
  • Function logs are stored in AWS CloudWatch.
  • Handler name = lambda_function.lambda_handler, i.e. lambda_function from lambda_function.py + lambda_handler from def lambda_handler().
  • Connecting to an RDS database (via the Data API):
    import boto3

    # Data API client - no direct database connection or VPC access needed
    rds_client = boto3.client('rds-data')

    db_name = 'db_name'
    db_cluster_arn = 'db_arn'
    db_credentials_secret_arn = '...'

    def lambda_handler(event, context):
      resp = execute_sql('SELECT * FROM db_name.tbl_employees')
      return resp['records']

    def execute_sql(sql_string):
      # run the statement on the cluster using credentials stored in Secrets Manager
      resp = rds_client.execute_statement(
        secretArn = db_credentials_secret_arn,
        database = db_name,
        resourceArn = db_cluster_arn,
        sql = sql_string
      )
      return resp
    
    
    - boto3 is the Python library for interacting with AWS service endpoints.
    - We can set a timeout in the configuration that terminates the function's execution.
    - We need a role that has the following permissions:
    + AmazonRDSDataFullAccess, which also handles the permissions for the secret keys,
    + AWSLambdaBasicExecutionRole.
    If the Lambda has created its own role, we need to replace it with the newly customized one.
    - We receive the database result back in JSON format.
  • Integrating Lambda with AWS Athena:
    - we need the following permissions in the role assigned to the Lambda function:
    1. athena:StartQueryExecution,
    2. athena:GetQueryExecution,
    3. athena:GetQueryResults,
    4. glue:GetTable.
    - process flow:
    Lambda function > query > Athena > output > S3 bucket
    - example code:
    import boto3
    import json
    import time
    
    def lambda_handler(event, context):
      client = boto3.client('athena')
    
      query = client.start_query_execution(
        QueryString = 'SELECT * FROM aws_athen_example_table;',
        QueryExecutionContext = {
          'Database': 'db_name'
        },
        ResultConfiguration = {
          'OutputLocation': 's3://bucket_name/'
        }
      )
    
      queryId = query['QueryExecutionId']
      time.sleep(10)  # naive wait; ideally poll get_query_execution until the query finishes
    
      results = client.get_query_results(QueryExecutionId = queryId)
      for row in results['ResultSet']['Rows']:
        print(row)
    
API Gateway:
  • AWS service that allows us to build HTTP or REST APIs.
  • An API is client-accessible logic that dictates how (methods), where (endpoints) and to what (resources) a client's app can get access.
  • An API has so-called endpoints under which a specific resource appears. To each endpoint we can assign a method like GET, POST, DELETE and so on.
  • We can connect API Gateway endpoints to Lambda functions.
  • When configuring we need to provide:
    - integration: Lambda function,
    - routes: methods (GET, POST), a resource path (/getResource) and the integration target that will handle the request (Lambda function, database, ...),
    - the stage name the API will be deployed to,
    - whether to auto-deploy every time there is a change to the HTTP API.
  • When created, AWS gives us an invoke URL that is going to invoke the Lambda function - this is a kind of endpoint that can be replaced with a custom domain via Route 53.
  • Opening the invoke URL with a declared resource path in a browser sends a GET request, launching the Lambda and returning the Lambda's result.
  • When a POST request with values in the request body is sent to the resource path, the values are passed into the Lambda function in the event parameter. We can extract the values and some additional info as follows (a complete handler sketch is shown after this list):
    - event['rawPath'] - to get the resource path,
    - event['queryStringParameters']['param_name'] - to get the value assigned to param_name within the URL,
    - decoded = json.loads(event['body'])
      name = decoded['name']
      we need to parse the JSON body in order to get its values.
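  • A minimal sketch of a complete Python handler for such an HTTP API route (assuming the payload format shown above; the 'name' field is a hypothetical example):
    import json

    def lambda_handler(event, context):
        path = event.get('rawPath')
        params = event.get('queryStringParameters') or {}
        body = json.loads(event['body']) if event.get('body') else {}
        name = body.get('name', 'anonymous')
        return {
            'statusCode': 200,
            'body': json.dumps({'message': f'Hello {name}', 'path': path, 'params': params}),
        }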
AWS CloudFront:
  • A service for a Content Delivery Network (CDN), which acts like a proxy that receives requests and forwards them to the backend systems.
  • A CDN caches website or application files - HTML, CSS, JS, images or videos - at data centers around the world. Even when the backend server goes down, the CDN is able to serve the content of a static website back to the end user.
  • When setting up the CloudFront service we define the number of data centers - the edge locations. The more edge locations, the higher the performance and the lower the latency of getting the server's content in response to a user's request.
  • The edge locations allow users to download the app content much faster from the nearest edge location than if the request had to go all the way to the origin server.
  • A user's request may still need to go to the origin server when the content is not present at the closest edge location at that moment.
  • Content is being cached at the edge location for a specific period of time - Time To Live (TTL).
AWS Storage Gateway:
  • A service that lets an on-premises application access and use cloud storage.

  • source: AWS
  • In the Gateway Stored Volumes configuration, there is on-premises storage for the application server. Whenever a file is added to that special local storage, it is uploaded asynchronously to AWS S3 or AWS EBS in a compressed manner.
  • In the Gateway Cached Volumes configuration, there is no primary on-premises storage. Data is stored primarily on AWS S3; what we have locally on the on-premises server is a cache of recently read or written data.
Amazon Redshift
  • Columnar data storage.
  • Data is compressed into blocks.
  • Architecture:

    source: simplilearn
    - The main element of Amazon Redshift is a cluster of nodes - the data warehouse cluster.
    - There are compute nodes that process data and a leader node that gives instructions.
    - The leader node also manages client applications, like BI tools, that require data from Redshift.
    - The leader node uses JDBC (Java Database Connectivity) to manage the connections to client applications.
    - Client applications use ODBC (Open Database Connectivity) to interact with the live data of the data warehouse cluster by sending SQL queries.
    - Compute nodes are divided into slices with dedicated memory space. They run in parallel to process the data quickly.
    - Workflow (a minimal boto3 sketch of querying the cluster follows):
    1. The client app sends a query to the leader node.
    2. The leader node receives the query and develops a suitable execution plan.
    3. Once the plan is set up, the compute nodes and their slices start working on it.
    4. The compute nodes work in parallel and transfer data among themselves in order to solve the query.
    5. Once execution is done, the leader node aggregates the results and sends them back to the client app.
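  • A minimal boto3 sketch of sending a query to the cluster through the Redshift Data API (the cluster, database, user and table names are hypothetical):
    import time
    import boto3

    client = boto3.client('redshift-data')

    # The leader node receives the statement and distributes the work to the compute nodes
    resp = client.execute_statement(
        ClusterIdentifier='demo-cluster',
        Database='dev',
        DbUser='awsuser',
        Sql='SELECT COUNT(*) FROM sales;',
    )

    # Naive wait; ideally poll describe_statement until the status is FINISHED
    time.sleep(5)
    result = client.get_statement_result(Id=resp['Id'])
    print(result['Records'])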
AWS Glue:
  • ETL service to categorize, clean, enrich, and reliably move data between various data stores.
  • It connects to different data sources, identifies the data types, suggests some transformations and generates editable code (an ETL script) to execute the overall transformation and data warehouse loading process.
  • AWS Glue has 3 main components:
    - Data Catalog - a central metadata repository that always stays in sync with the underlying data thanks to so-called crawlers.
    - Job Authoring - an ETL engine that automatically generates Python or Scala code.
    - Job Execution - a flexible scheduler that handles dependency resolution, job monitoring, potential retries and alerting.
  • ETL scripts use the DynamicFrame - similar to the Apache Spark DataFrame. A DynamicFrame is a data abstraction that organizes data into rows and columns where each record is self-describing, so no schema is required initially. We can freely convert DynamicFrames into Spark DataFrames (see the sketch after this list).
  • Here are some applications:
    - AWS Glue can catalog an S3 data lake, making it available for querying with Amazon Athena and Amazon Redshift.
    - Building event-driven ETL pipelines: running ETL jobs as soon as new data arrives in Amazon S3 by invoking AWS Glue ETL jobs from an AWS Lambda function.
    - Cataloging data for quick dataset search and maintaining the relevant metadata in one central repository.
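  • A minimal sketch of a Glue ETL script using a DynamicFrame (it runs inside a Glue job, where the awsglue library is provided by the runtime; the database and table names are hypothetical):
    import sys
    from awsglue.context import GlueContext
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    glue_context = GlueContext(SparkContext.getOrCreate())

    # Read a table that a crawler registered in the Data Catalog
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database='sales_db', table_name='orders'
    )

    # Convert to a Spark DataFrame for standard transformations
    df = dyf.toDF()
    df.printSchema()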
AWS Athena:
  • A query service for analyzing Amazon S3 (or DynamoDB) data with standard SQL.
  • There is no need for any underlying compute infrastructure, and no need to load data into Amazon Athena or transform it for the analysis.
  • We can access Athena through the AWS Management Console, an application programming interface (API) or a Java Database Connectivity (JDBC) driver; then we define a schema and we can execute SQL queries.
  • We can run AWS Glue's crawler over the S3 data in order to derive a schema for a table over the stored data. The schema is then kept in the so-called Data Catalog. The schema can also be defined manually.
  • There is also QuickSight available, which can help us visualize the data.
  • Once the table is created we can query it with common SQL.
AWS Elastic MapReduce (EMR):
  • A platform for computational processing of vast amounts of data with the help of the MapReduce framework. It simplifies the entire setup and management of the cluster and the Hadoop components.
  • It uses Hadoop to distribute the data and process it across an auto-scaling cluster of compute nodes (EC2 instances).
  • EMR continuously monitors the nodes in the cluster. It retries failed tasks and replaces poorly performing instances.
  • We can choose the computation engine when setting up the cluster:

Data Lake:
  • Data Lake vs Data Warehouse

    Data Lake | Data Warehouse
    Stores raw, unstructured and unprocessed data. | Stores refined, structured and processed data.
    Stores data that may never be used, hence larger storage capacity is required. | Saves storage space by not maintaining data that may never be used.
    Poor data quality. | Data quality ensured.
    Purpose of data gathering not yet determined. | Data gathered for a specific business purpose.
    Easily accessible and quick to update because of the lack of structure. | More complicated and costly to make changes.
    Used by Data Scientists. | Used by Business Analysts.
    Requires specialized tools like Machine Learning to understand and translate data into usage. | Can be used with regular Business Intelligence tools to visualize data with charts and tables.
  • Data Warehouse based on AWS services:

    source: AWS
    - data ingestion - Amazon Kinesis Data Firehose.
    - data storage - Amazon S3.
    - data processing - AWS Lambda and AWS Glue.
    - data migration - AWS Data Migration Service (AWS DMS) and AWS Glue.
    - orchestration and metadata management.
    - querying and data visualization - Amazon Athena and Amazon QuickSight.
  • Data Lakes in AWS:
    - Offers more agility and flexibility than traditional data management systems.
    - Allows companies to store all of their data from various sources, regardless of whether they are structured or unstructured, in a centralized repository.
    - Configures the core AWS services to easily tag, search, share, and govern subsets of data across a company.
    - Stores and registers datasets of any size in the secure, durable, scalable AWS S3.
    - Allows users to upload and catalog new datasets with searchable metadata and integrate it with AWS Glue and Amazon Athena to transform and analyze.
    - Crawls the data sources, identifies data formats, and then suggests schemas and transformations without hand-coding data flows.
    - Adds user-defined tags into AWS DynamoDB to add business-relevant context to each dataset.
    - Allows users to browse available datasets or search on dataset attributes and tags to quickly find and access data relevant to their business needs.
  • Data Lake based on AWS services:

    source: AWS
    - The data lake infrastructure can be provisioned by AWS CloudFormation.
    - The data lake API leverages Amazon API Gateway to provide access to data lake microservices through AWS Lambda functions.
    - These microservices interact with Amazon S3, AWS Glue, AWS Athena, AWS DynamoDB, AWS ES, and AWS CloudWatch Logs to provide data storage, management, and audit functions.
Amazon Kinesis Firehose
  • Allows delivering streaming or event data into various destinations such as a BI database, data storage (S3, Redshift, Elasticsearch) or dashboards.
  • In other words, it creates a single data ingestion point and provides a means to deliver that data to the destination services we specify when setting up the Kinesis Firehose delivery stream.
  • Ingested data (a large number of individual events from an application) can be compressed or batched into a single output file and sent to one of the destinations.
  • Example workflow (a minimal boto3 sketch of putting records onto the stream follows):


    source: Be a better dev
    1. Data is put to an SNS topic in the form of JSON.
    2. Lambda as the subscriber to the SNS topic:
    - for every event on the topic there is a Lambda function invocation,
    - the Lambda performs put operations against the Kinesis Firehose endpoint.
    3. Kinesis Firehose buffers the data and delivers it according to the buffer interval or buffer size.
    4. S3, where the data is delivered for storage.
    5. Optionally, Kinesis Firehose can invoke another Lambda function that applies business logic to the data.
  • The buffer size dictates how many files we end up with in S3. If the data is 10 MB and the buffer size is set to 5 MB, then there will be two files stored in S3 at the end.
  • The buffer interval: if we don't reach 5 MB of data, then it is pushed to S3 after x seconds.
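  • A minimal boto3 sketch of step 2, putting a record onto the delivery stream (the stream name and payload are hypothetical):
    import boto3
    import json

    firehose = boto3.client('firehose')

    firehose.put_record(
        DeliveryStreamName='orders-stream',
        Record={'Data': (json.dumps({'order_id': 123, 'amount': 49.99}) + '\n').encode('utf-8')},
    )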
AWS CloudWatch:
  • A monitoring service for AWS servers and applications.
  • It collects and monitors log files, sets alarms and reacts to changes in AWS resources automatically, for example: when CPU utilization on an EC2 instance is greater than 70 %, you get an alarm notification by email (a minimal boto3 sketch of such an alarm is shown at the end of this section).
  • AWS CloudWatch Logs:
    - A server can keep a lot of log files, both system and application logs.
    - It is important to have log files when debugging an application. If something doesn't work as expected, we need to check the errors in the specific log file.
    - In the traditional approach, when debugging we need to grant server access to any individual who wants to check a log file. Another risk is that when the server terminates, the logs are lost.
    - The better way is to create a central log server to which we push the log files from the individual systems. Then we can do centralized log monitoring.
    - AWS CloudWatch Logs is centralized log management to monitor, store and access log files from Amazon EC2 instances, Route 53 and other sources.
    - It can capture all the traffic that runs over an SFTP server.
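  • A minimal boto3 sketch of the CPU alarm mentioned above (the instance ID and SNS topic ARN are hypothetical placeholders):
    import boto3

    cloudwatch = boto3.client('cloudwatch')

    # Alarm when the average CPU of the instance exceeds 70% over a 5-minute period
    cloudwatch.put_metric_alarm(
        AlarmName='ec2-cpu-above-70',
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'InstanceId', 'Value': 'i-0123456789abcdef0'}],
        Statistic='Average',
        Period=300,
        EvaluationPeriods=1,
        Threshold=70.0,
        ComparisonOperator='GreaterThanThreshold',
        AlarmActions=['arn:aws:sns:eu-west-1:123456789012:alarm-topic'],  # email subscribers of this topic get notified
    )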
AWS Route 53:
  • Domain Name System (DNS) web service used to route end users to Internet applications by translating names like www.example.com into the numeric IP addresses like 192.0.2.1 that computers use to connect to each other.
  • Amazon Route 53 connects user requests to infrastructure running in AWS (EC2, S3) or routes users to infrastructure outside of AWS.
  • With AWS Route 53 I can route traffic to different app endpoints independently and monitor their health.
  • It is not limited to a specific region; it operates globally.
AWS ElastiCache:
  • An in-memory cache in the cloud. It caches the responses associated with frequent queries.
  • This gives better response times and decreases the load on the database server.
  • When a user sends the same query again, the response comes from the cache engine instead of the database server (a minimal cache-aside sketch is shown below).
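  • A minimal cache-aside sketch, assuming a Redis-based ElastiCache cluster, the third-party redis Python package and a hypothetical endpoint and lookup function:
    import json
    import redis

    cache = redis.Redis(host='my-cache.abc123.euw1.cache.amazonaws.com', port=6379)

    def get_user(user_id, db_lookup):
        # Try the cache first; fall back to the database and populate the cache
        cache_key = f'user:{user_id}'
        cached = cache.get(cache_key)
        if cached is not None:
            return json.loads(cached)
        record = db_lookup(user_id)                        # hypothetical database query
        cache.set(cache_key, json.dumps(record), ex=300)   # keep for 5 minutes
        return record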
AWS Transfer Family:
  • Sets up the SFTP protocol for transferring files into S3 or Amazon EFS.
  • It provides an endpoint for file transfers directly into and out of Amazon S3 using the following protocols:
    - Secure File Transfer Protocol (SFTP),
    - File Transfer Protocol over SSL (FTPS),
    - File Transfer Protocol (FTP).
  • Here is the workflow:
    source: Amazon
    - examples of file transfer clients: WinSCP, FileZilla,
    - IAM roles are used to grant the file transfer clients secure access to the S3 bucket.

Setup

The following installation is required:
  • MobaXterm from https://mobaxterm.mobatek.net/download-home-edition.html

Source Code

You can view the source code: HERE