My Comprehension of AWS Services
Intro
In order to learn cloud computing, I chose Amazon Web Services (AWS) as the platform to explore. The best thing about cloud solutions is that companies can simply rent servers instead of building, maintaining and paying for their own infrastructure.
Navigate to AWS aspects:
- Cloud Computing
- Server Connection
- Infrastructure as Code
- AWS storage
- Domain Name System
- Serverless services
- Storage classes
Cloud Computing:
- A common opinion about cloud computing is that it only concerns System Administrators or DevOps Engineers because of the solutions it provides for server provisioning, networking, etc. In fact, it also concerns Software Developers, Security Engineers, Program Managers and so on, as there are many services related to computing, storage, databases, analytics, encryption, deployment and more.
- Cloud computing is basically a model in which computing resources are available as a service.
- Important characteristics of cloud computing:
- On-demand and self-serviced: we can launch resources at any time with no manual intervention, i.e. we can provision resources whenever needed without requiring any human action on the provider's side.
- Elasticity: we can scale up or down at any time according to our needs. This property is closely related to scalability, which can be horizontal (adding or removing servers in a cluster) or vertical (adding or removing resources in an existing server).
- Measured: we pay only for the resources we actually use.
- High availability: backups are kept and the failure of a single component is accommodated; in case of a server failure there is another instance to back it up.
- There are 3 types of cloud computing:
- Software as a Service (SaaS) - a ready-to-use application, typically accessible via a browser. The provider takes responsibility for upgrades, security, error handling, etc.
- Platform as a Service (PaaS) - e.g. deploying a Python application on a server along with all of its dependencies, while the provider manages the underlying platform.
- Infrastructure as a Service (IaaS) - gives the highest level of flexibility and management control over the IT infrastructure without having to physically maintain it. It provides access to networking features, computers (virtual or on dedicated hardware) and data storage space.
source: wp-includes.com
- Cloud Architecture:
1. The cloud is actually a real data center with many servers already set up.
2. Data Center provides Virtualization layer (in case of AWS it's mainly XEN).
3. On top of virtualization layer there are multiple Virtual servers.
- Data centers are organized into availability zones that are separated by geographic region. They serve as backups in case one of the data centers fails.
- Each AWS region contains at least 2 availability zones.
Server Connection:
- For server connection we need to get:
- SSH Client for Linux Server or
- RDP Client for Windows Server
which allows us to get connected to the server.
- To get an SSH client for Windows on my local machine I use MobaXterm.
- macOS and Linux have their own terminals with an already built-in SSH client.
- There is also a way of connecting to a server from a browser using a browser-based SSH connection. Then you don't need any SSH client at all; it can be done directly from the AWS console.
- There is also a need for establishing key-based authentication:
- it replaces password-based authentication, since using a password for authentication is less secure,
- in key-based authentication there are two special keys: a public key and a private key,
- when the public key is stored on a server, only the corresponding private key can authenticate successfully,
- we can simply create a key pair, for example in EC2 in the AWS console. We can choose its format as pem (when using OpenSSH) or ppk (when using PuTTY). Once created, the key gets downloaded to the local machine as the private key.
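- A minimal sketch of creating such a key pair programmatically with boto3; the key name demo-ec2-key and the output file name are hypothetical, and the KeyFormat parameter selects pem or ppk:

```python
import boto3

ec2 = boto3.client("ec2")

# The private key material is returned only once, so save it locally right away
# (the public key stays on AWS and is injected into instances launched with this key).
key_pair = ec2.create_key_pair(KeyName="demo-ec2-key", KeyFormat="pem")

with open("demo-ec2-key.pem", "w") as f:
    f.write(key_pair["KeyMaterial"])
```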
Data Warehouse:
- Used in Business Intelligence, where we transform raw data into useful business insights. It goes through the following steps:
1. Gathering data from different sources: ERP, CRM, OS, flat files.
2. Extracting data from all sources, transforming and cleaning it, and loading it into a data warehouse.
3. Using the data in the data warehouse for business analysis.
- Relational database vs data warehouse:
Relational Database | Data Warehouse |
---|---|
Contains the up-to-date data. | Contains the historical data. |
Useful in running the business. | Useful in analyzing the business. |
Read and write operations. | Mostly read operations. |
Accessing a limited number of records. | Accessing even millions of rows if needed. |
Usually one source that serves an application. | Typically a collection of many data sources. |
source: radikal-labs.com
Infrastructure as Code (IaC):
- There are two ways of building infrastructure: manually or with scripted automation.
- Automation with IaC helps automate infrastructure building at every stage of an app service's lifecycle:
1. Every time a new app service comes up, its infrastructure first needs to be built in the development environment.
2. Then the app service moves to the staging area for testing, where the same infrastructure needs to be built once again.
3. Finally it moves to the production environment, where all of the infrastructure has to be replicated.
- It gets even more helpful when deploying multiple app services.
- One IaC template can be reused for building infrastructure for different stages of different app services during deployment.
- Tools for IaC: Terraform or AWS CloudFormation.
- AWS CloudFormation:
- provides a template in which you describe your desired resources and their dependencies, so that you can launch and configure them together as a stack.
- workflow:
source: AWS
source: udemy
Cloud Development Kit (CDK):
- We can use a programming language like Python or JS in order to create, configure and deploy AWS resources.
- Ensures the usual IDE functionality like autocomplete, compile-time warnings, control flow statements and object-oriented programming.
- CDK code synthesizes to CloudFormation (or, with CDK for Terraform, to Terraform) output which can be deployed to AWS.
- This is more convenient than raw CloudFormation, which requires YAML or JSON and makes it hard to share creations across different projects in a scalable way.
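- A minimal CDK v2 sketch in Python, assuming aws-cdk-lib and constructs are installed; the stack and bucket names are hypothetical:

```python
from aws_cdk import App, Stack
from aws_cdk import aws_s3 as s3
from constructs import Construct

class DemoStorageStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # One versioned S3 bucket defined in plain Python instead of YAML/JSON
        s3.Bucket(self, "DemoBucket", versioned=True)

app = App()
DemoStorageStack(app, "DemoStorageStack")
app.synth()  # emits the CloudFormation template that `cdk deploy` would use
```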
AWS Storage types:
Block Storage | Object Storage |
---|---|
Data (files) is split into smaller chunks of a fixed size (blocks). | Each stored file is an object. |
Each block has its own address. | Each object has a unique identifier. |
No metadata about blocks. | Metadata with contextual information about each object. |
Supports read/write operations. | Data is mostly read (rather than written to). |
Easy data modification by accessing a specific block. | Modifying a file means uploading a new revision. |
Blocks are accessed on a server through the underlying file system protocol (NFS, CIFS, ext3/ext4). | Objects are accessed over the HTTP protocol. |
Domain Name System:
- Translates a domain name to the corresponding server's IP address: www.example.com -> 1.2.3.4
- Workflow:
1. User enters www.example.com in a browser.
2. The ISP's DNS resolver looks up the corresponding IP address and returns 1.2.3.4.
3. The browser takes the server's IP address and makes an HTTP request to the AWS server.
4. The AWS EC2 server accepts the request.
source: AWS
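- The same lookup that the ISP's DNS resolver performs can be reproduced in a few lines of Python:

```python
import socket

# Resolve a domain name to an IPv4 address, like step 2 of the workflow above
ip_address = socket.gethostbyname("www.example.com")
print(ip_address)
```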
Serverless Services:
- In fact, it doesn't mean that there is no server present. There are servers hosting your application, however they are completely managed by the provider. You only care about the app's code.
- A very popular way of having a serverless service is the PaaS model, where the only thing you need to do is upload the application. The PaaS provider takes care of the rest: setting up the capacity, launching servers in high-availability and auto-scaling mode, installing technology-specific packages and dependencies, security, patching and monitoring.
- There are AWS services that can be used without instantiating any server.
There is no need for any capacity planning or estimating how many resources we might need.
We are charged only for the computing time we consume.
Example serverless services:
- AWS Lambda for computing,
- AWS S3 for the data storage,
- DynamoDB for the Database,
- SQS, SNS for an app integration.
Storage classes:
- Depending on the storage class, the availability, durability and performance, and thus the pricing, will differ:
- Standard S3: for general purpose; has higher availability and pricing much higher than for infrequent access.
- Standard S3 with Infrequent Access (Standard IA): when we don't need high availability we can go with this option at a lower price.
- Reduced Redundancy Storage (RRS): lower durability and lower availability, so we should keep only non-critical, reproducible data there.
- Glacier: meant for archiving and storing long-term backups. It has very high durability but low availability - it can take even a few hours to get data restored.
- Glacier Deep Archive: the lowest-cost storage class that AWS offers. Supports long-term retention of data that may be accessed once or twice a year. It has very low availability - data can be restored within 12 hours.
- Intelligent Tiering: it detects seldom-used data and moves it to the most cost-effective tier, like Standard IA. So we end up with a frequent access tier and an infrequent access tier that differ in pricing. This class is preferable when we store long-lived data whose access patterns are unknown or unpredictable - we cannot assess which part of the data will be accessed frequently and which will not.
- One Zone-IA: while Standard S3 or Standard IA store data in at least 3 availability zones, S3 One Zone-IA stores data in a single availability zone, which reduces overall costs. It's a good solution for secondary backup copies of on-premises data or for data that can be easily recreated. The only risk is that the data will be lost in case the availability zone is destroyed.
- We can choose the storage class while uploading an object to S3:
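- For example, with boto3 the storage class is chosen via the StorageClass parameter at upload time; the bucket, key and file names below are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# Upload directly into the Infrequent Access storage class
with open("backup.zip", "rb") as body:
    s3.put_object(
        Bucket="my-demo-bucket",
        Key="backups/2024-01-01/backup.zip",
        Body=body,
        StorageClass="STANDARD_IA",  # other values include STANDARD, ONEZONE_IA, GLACIER, DEEP_ARCHIVE, INTELLIGENT_TIERING
    )
```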
Key Management Service (KMS):
- KMS is used for storing and managing the encryption keys for data on AWS.
- We apply encryption to sensitive data to prevent unauthorized users from accessing it. Even if attackers hack your database server, they will find it hard to decrypt the data.
- Encryption may need to be applied due to company policy or because of external regulations like GDPR that enforce personal data security.
- Here is the high-level encryption flow:
confidential data >> encryption algorithm + encryption key >> encrypted data on AWS storage
- Encryption can be executed in two ways:
- Client-side encryption - the app on EC2 maintains the key and encrypts the data, sending already encrypted data to AWS S3.
- Server-side encryption - sending confidential data over HTTPS (security of data in transit) to AWS S3, where the data gets encrypted (security of data at rest).
- Envelope encryption:
- the encryption key used for encrypting confidential data is called a Data Key, which itself undergoes the following process:
Data Key >> encryption algorithm + Customer Master Key >> Encrypted Data Key
- Customer Master Key can be either AWS managed or Customer managed.
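- A minimal envelope-encryption sketch with boto3 and KMS; the key alias is hypothetical:

```python
import boto3

kms = boto3.client("kms")

# generate_data_key returns the Data Key twice: in plaintext (use it locally to
# encrypt the confidential data, then discard it) and encrypted under the CMK
# (store it next to the encrypted data so it can be decrypted later via KMS).
response = kms.generate_data_key(KeyId="alias/demo-s3-key", KeySpec="AES_256")

plaintext_data_key = response["Plaintext"]       # feed into a local cipher such as AES-GCM
encrypted_data_key = response["CiphertextBlob"]  # persist alongside the encrypted object
```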
AWS Identity and Access Management (IAM):
- Managing/controlling access and user roles for AWS services and resources (an AWS entity like an S3 bucket or another object).
- IAM as a feature is free of charge. You are only charged for use of AWS services by your users.
- Things we can do with IAM:
- creating users, assigning them individual security credentials and providing access to AWS services and resources,
- managing user roles and permissions to control which operations can be performed by an individual and which AWS resources the individual is allowed to access.
- IAM most important elements:
- users - individuals with logins,
- groups - collections of users with a common theme - one set of permissions for the entire group,
- policies - documents granting or restricting access,
- roles - collections of policies that are assigned as well, but they are interchangeable - sharing, limiting etc.
- Each role can have one or more policies assigned:
- A user-specific AWS Access Key and Secret Access Key will be created as soon as we create a user in IAM:
- we can download them as a CSV file.
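- A boto3 sketch of creating a user together with its access key pair (the pair offered as the CSV download); the user name demo-user is hypothetical:

```python
import boto3

iam = boto3.client("iam")

iam.create_user(UserName="demo-user")

# The secret is returned only once - store it securely
access_key = iam.create_access_key(UserName="demo-user")["AccessKey"]
print(access_key["AccessKeyId"])
print(access_key["SecretAccessKey"])
```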
Virtual Private Cloud (VPC):
- It is a private sub-section of AWS which you are in control of in terms of who has access to what AWS resources.
- More technically, AWS lets us provision a logically isolated section of the cloud in which we can define custom virtual network configuration like IP addresses, subnets, route tables and network gateways.
- When creating an AWS account, a default VPC is created for the user, so everybody has their own VPC.
- VPC architecture and components:
source: Linux Academy
- An Internet Gateway (IGW) allows communication between your VPC and the internet.
- A route table lists predefined routes to the default subnets.
- A Network Access Control List has predefined rules for access.
- The VPC is partitioned into subnets in which AWS resources (e.g. EC2 instances) are provisioned.
- We can set a specific VPC for an EC2 instance:
EC2:
- It stands for Elastic Compute Cloud.
- Essentially, it's the name for a server that we can launch in AWS.
- Elastic means that we can resize the server's capacity at any time.
- AWS ensures high availability, so when one EC2 server goes down, the hosted application can still be served from another EC2 server.
- Launching server is as easy as hitting a button and going through a configuration:
- When launching a new EC2 instance on AWS we need to configure the following things:
- region,
- server OS, which is the Amazon Machine Image (AMI),
- CPU and memory size of the EC2 instance,
- number of instances,
- storage capacity,
- authentication key,
- security (firewall).
- Once the instance is created, we can connect to the server with SSH:
- connecting to EC2 server:
ssh -i ec2-key.pem ec2-user@{Public IP}
- getting admin rights:
sudo su -
- installing some packages:
yum -y install nginx
yum -y install mysql
- cd to location:
cd /usr/share/nginx/html/
- modifying a file:
echo "Hello World" > index.html
- running service:
service nginx start
- OS for EC2 instance is the Amazon Machine Image (AMI). We can run multiple instances from a single AMI.
- There are also persistent block storage volumes for AWS EC2 instances, called Elastic Block Store (EBS):
- It's available under the Root device property of the EC2 instance:
- Persistent means that the data will remain even when we stop the EC2 instance.
- The volumes are replicated, backed up and connected to EC2 instances over the network:
- We can still utilize the instance store, which gives fast performance, however the data will be lost if your EC2 instance stops or terminates or the underlying host disk fails. The compensation might be that it's quite cost-effective; however, we need to make sure to back up the data, for example in S3.
- Which kind of storage we want to use needs to be specified at the AMI configuration step:
- EC2 is equipped with Elastic Load Balancer (ELB) so that traffic can be distributed across multiple EC2 instances.
- EC2 has Auto Scaling built in, which automatically adds or removes EC2 instances according to conditions we define (a scaling-policy sketch follows this list), e.g.:
1. Dynamic Scaling:
    - if average CPU utilization > 60 %, then add two more instances,
    - if average CPU utilization < 30 %, then remove two instances.
2. Scheduled Scaling:
   - servers are scaled based on a specific schedule.
3. Predictive Scaling:
    - machine learning algorithms automatically adjust server capacity.
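- Here is the scaling-policy sketch mentioned above: a boto3 example of a dynamic (target tracking) policy, assuming an Auto Scaling group named demo-asg already exists:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Target tracking keeps the group's average CPU utilization around 60 %,
# adding or removing instances automatically.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="demo-asg",
    PolicyName="keep-cpu-around-60",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
        "TargetValue": 60.0,
    },
)
```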
AWS Relational Database Service (RDS):
- AWS RDS supports various database engines like MySQL, PostgreSQL, Microsoft SQL Server and Oracle that can be hosted on EC2.
- AWS also offers a NoSQL database, DynamoDB, that stores key-value pairs.
- Like in other services, AWS provides:
- database provisioning via GUI,
- security,
- patching,
- backup,
- high availability.
- We are able to pick the engine while creating the database:
- Connectivity:
- we can deploy the RDS db into a specific VPC (then a lambda function would need to be deployed into the same VPC as the RDS db),
- or we can enable the Data API, which allows interacting with the db through an HTTP endpoint, by a lambda for instance.
- The database will be created with a bunch of technical details, like endpoint and port, which can be checked in the Connectivity and security tab.
- Amazon Aurora is a compromise between the performance of traditional enterprise databases and the simplicity and cost-effectiveness of open-source databases. When creating it we don't have to specify the storage, as it grows along with the size of the data.
- Query editor:
- in order to query a database we need to create a connection when entering the query editor:
Amazon Simple Storage Service (Amazon S3):
- S3 is a durable storage system based on object storage.
- In S3 we have buckets, which are like folders where we can store multiple objects (files). Bucket names are unique across the entire AWS namespace. Buckets can have subfolders.
- It can be used for hosting simple static websites at lower cost. With that solution there is no need to instantiate an EC2 server.
- When you upload files to the cloud storage, they are backed up automatically.
- S3 has a lifecycle for files, which means a file can be moved to cheaper storage, archived or even deleted when it's older than x days.
- We can also configure replication rules in S3 to copy a file into a different bucket when it is uploaded. We can also set events that are triggered once a file is uploaded.
- Each object in an S3 bucket has its own object URL assigned. Everyone can access an object as long as they have its URL and we select 'Everyone' in the Access control list of the Permissions tab.
- When security requirements are high, each S3 bucket should have its own individual KMS encryption key (a one-to-one relationship). Ideally, the key alias should reference the S3 bucket name, since key ids are not self-explanatory.
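- A boto3 sketch of uploading an object and generating a time-limited (pre-signed) URL for it instead of making the object public; the bucket and key names are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# Upload a local file as an object
s3.upload_file("index.html", "my-demo-bucket", "site/index.html")

# Pre-signed URL: anyone with this link can GET the object for one hour
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-demo-bucket", "Key": "site/index.html"},
    ExpiresIn=3600,
)
print(url)
```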
DynamoDB
- A NoSQL database.
- DynamoDB Stream:
- It's a feature that emits events when record modifications occur on a DynamoDB table.
- We distinguish 3 types of events on a table: insert, update and remove.
- Events can carry the content of the rows being modified, so we can have a look at the state before and after the change.
- Events arrive in the same order in which the modifications took place.
- We can detect changes in a DynamoDB table using a lambda function - every time an event occurs, the lambda gets invoked.
- The lambda's argument is the content of the change that has occurred.
- No performance impact on the source table.
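- A minimal Python Lambda handler for DynamoDB Stream events; it only prints the standard fields carried in event['Records']:

```python
def lambda_handler(event, context):
    # Each record describes one modification on the table
    for record in event["Records"]:
        event_name = record["eventName"]                 # INSERT, MODIFY or REMOVE
        keys = record["dynamodb"].get("Keys")
        old_image = record["dynamodb"].get("OldImage")   # row content before the change (if enabled)
        new_image = record["dynamodb"].get("NewImage")   # row content after the change (if enabled)
        print(event_name, keys, old_image, new_image)
    return f"Processed {len(event['Records'])} stream records."
```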
Elastic Container Service (ECS)
- Deploys docker containers and makes sure that containers are isolated from one another.
- Allows launching, setting up and monitoring docker containers on an ECS cluster.
- Serverless (with Fargate) or managed (with EC2) options.
- Auto-scaling of the number of containers based on traffic, memory or CPU utilization.
- Suitable either for ad-hoc jobs or full-scale services.
- Cost-effective, as we can host multiple different containers on a single computing resource.
- With docker we only need one operating system, as opposed to virtual machines.
- ECS elements and workflow:
1. Building a docker image from a Dockerfile and uploading it to Amazon Elastic Container Registry (ECR) - like S3 for docker images.
2. Defining a task in ECS - a task is an abstraction on top of a container that tells ECS how we want to spin up docker containers. A task can contain more than one container.
3. A cluster - a resource farm (EC2 instances). We take a task and run it on the ECS cluster.
4. We can put a service on an ECS cluster - it allows us to specify a minimum number of tasks, and therefore containers, running on the cluster at a point in time.
5. Load balancer.
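- A boto3 sketch of running a task on a cluster (steps 2-3 above) with the serverless Fargate launch type; the cluster, task definition and subnet identifiers are hypothetical:

```python
import boto3

ecs = boto3.client("ecs")

ecs.run_task(
    cluster="demo-cluster",
    taskDefinition="demo-task:1",   # the task definition registered in step 2
    launchType="FARGATE",           # serverless option; "EC2" would use managed instances
    count=1,
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "assignPublicIp": "ENABLED",
        }
    },
)
```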
Simple Queue Service (SQS)
- Distributed message queuing service.
- Supports standard queues (ordering not preserved) or FIFO (First In, First Out) queues.
- Integration:
Client > SQS > Lambda function
- we can set a lambda function as a consumer of the queue's messages (a producer-side sketch follows at the end of this section).
- SQS holds messages until someone (e.g. a lambda) comes along and reads the message off the queue. Once done processing it, the lambda deletes the message.
- Lambda code handling the message:
exports.handler = async (event) => {
    for (const { messageId, body } of event.Records) {
        console.log('SQS message %s, %j', messageId, body);
    }
    return `Successfully processed ${event.Records.length} messages.`;
};
- We need to enable the SQS trigger in the lambda function in order to integrate both:
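- Here is the producer-side sketch mentioned above: putting a message on the queue with boto3 (the queue URL is hypothetical):

```python
import boto3

sqs = boto3.client("sqs")

# Queue URL as copied from the SQS console (hypothetical account id / queue name)
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/demo-queue"

sqs.send_message(QueueUrl=queue_url, MessageBody='{"orderId": 42}')
```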
Simple Notification Service (SNS)
- A relation of one publisher (publishing messages to topics) to many consumers of the messages in a specific topic.
- We can set up different kinds of consumers: email, an HTTP endpoint in a Node.js or Python Flask app that is listening on a specific port, SQS, etc.
- While SQS has a pulling mechanism (pulling messages from the queue), SNS has a pushing mechanism (pushing messages to the subscribers).
- Two main elements: Topics and Subscriptions.
- Purpose: App-to-Person or App-to-App messaging.
- App to App model:
1. An external customer service publishes a message (details about an order, for instance).
2. A serverless lambda function takes the data, optionally applies business logic and pushes it further to the database,
3. or to SQS for receiving SNS messages that can be consumed at a later time (no need for immediate data processing).
- It's necessary to have SNS in the middle in either model, as we don't want the external customer service to know about each consumer. Not having SNS in the middle also causes performance and scaling (adding more consumers) problems.
- When setting up a topic we need to decide who is going to be able to publish messages to it and who can subscribe to it:
- when we select everyone as publishers, then anyone who has the ARN of the topic can publish to it.
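- A boto3 sketch of the publisher side; the topic ARN is hypothetical, and every subscriber (email, HTTP endpoint, SQS, lambda) receives the message:

```python
import boto3

sns = boto3.client("sns")

topic_arn = "arn:aws:sns:us-east-1:123456789012:orders-topic"

sns.publish(
    TopicArn=topic_arn,
    Subject="New order",
    Message='{"orderId": 42, "status": "created"}',
)
```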
AWS Lambda:
- A fully managed compute service that runs our code when an event appears (for instance, uploading objects to S3 can trigger a Lambda function) or on a schedule.
- AWS Lambda provides:
- servers,
- capacity,
- deployment,
- scaling,
- high-availability,
- os updates,
- security.
- What we provide:
- code,
- money - we pay only for what we use.
- All functions in Lambda are stateless. To keep data, we need to integrate with S3 or DynamoDB.
- When we upload the code we receive a so-called Amazon Resource Name (ARN):
- the ARN is a unique identifier for a particular lambda application,
- using the ARN we have a mechanism to invoke the lambda function,
- behind the invocation there is a load balancer that manages compute resources (EC2),
- so when an invocation comes in to lambda, the load balancer deploys the code onto one or more EC2 instances,
- multiple EC2 instances are available when concurrent invocations appear.
- Available integrations:
- a lambda function behind API Gateway to create REST APIs,
- hooking up S3 to a lambda function for data processing - when a new file is inserted/updated/deleted, the lambda gets triggered to respond to that change,
- SQS with lambda for message buffering and processing,
- SNS with lambda for message processing,
- Step Functions with lambda for workflow orchestration,
- Snowflake or DynamoDB with lambda for change detection in a database table.
- There are many AWS-related events that can trigger a lambda function:
- a trigger makes the lambda code execute,
- the event is the input to your code: the lambda code gets a copy of the event, and usually we want to inspect that copy of the event data and perform an action based on it.
- We can specify the Runtime during configuration:
- We can integrate lambda with an S3 bucket, getting a file from it:
import boto3
import csv

key = 'sub_folder_name/file_name.csv'
bucket = 's3_bucket_name'

def lambda_handler(event, context):
    s3_resource = boto3.resource('s3')
    s3_object = s3_resource.Object(bucket, key)
    data = s3_object.get()['Body'].read().decode('utf-8').splitlines()
    lines = list(csv.reader(data))
    print(len(lines))
    print(lines[0])
    # for line in list(lines):
    #     print(line)
- Lambda is automatically passed a context object that's essentially metadata about your lambda function, in case you need to access its attributes within your code: lambda_handler(event, context)
- event - info about the trigger/action that caused the invocation of the lambda instance,
- context - provides information about the invocation, the function itself and the execution environment.
- Function logs are stored in AWS CloudWatch.
- Handler name = lambda_function (from the file name lambda_function.py) + lambda_handler (from def lambda_handler()).
- Getting connected to an RDS database:
import boto3

rds_client = boto3.client('rds-data')

db_name = 'db_name'
db_cluster_arn = 'db_arn'
db_credentials_secret_arn = '...'

def lambda_handler(event, context):
    resp = execute_sql('SELECT * FROM db_name.tbl_employees')
    return resp['records']

def execute_sql(sql_string):
    resp = rds_client.execute_statement(
        secretArn = db_credentials_secret_arn,
        database = db_name,
        resourceArn = db_cluster_arn,
        sql = sql_string
    )
    return resp
- boto3 is the library for interacting with AWS service endpoints.
- We can set a timeout in the configuration that terminates the runtime of a function.
- We need to have a role that has the following permissions:
+ AmazonRDSDataFullAccess, which also handles permissions for the secret keys,
+ AWSLambdaBasicExecutionRole.
If lambda has created its own role, we need to replace it with the newly customized one.
- We receive the db result back in JSON format.
- Integrating Lambda with AWS Athena:
- we need the following permissions for the role assigned to the lambda function:
1. athena:StartQueryExecution,
2. athena:GetQueryExecution,
3. athena:GetQueryResults,
4. glue:GetTable.
- process flow:
Lambda function > query > Athena > output > S3 bucket
- example code:
import boto3
import json
import time

def lambda_handler(event, context):
    client = boto3.client('athena')
    query = client.start_query_execution(
        QueryString = 'SELECT * FROM aws_athen_example_table;',
        QueryExecutionContext = { 'Database': 'db_name' },
        ResultConfiguration = { 'OutputLocation': 's3://bucket_name/' }
    )
    queryId = query['QueryExecutionId']
    time.sleep(10)
    results = client.get_query_results(QueryExecutionId = queryId)
    for row in results['ResultSet']['Rows']:
        print(row)
API Gateway:
- AWS service that allows us to build HTTP or REST APIs.
- An API is client-accessible logic that dictates how (methods), where (endpoints) and to what (resources) a client's app can get access.
- An API has so-called endpoints under which a specific resource appears. To each endpoint we can assign a method like GET, POST, DELETE and so on.
- We can connect API Gateway endpoints to Lambda functions.
- When configuring we need to provide:
- integration: a lambda function,
- routes: methods (GET, POST), a resource path (e.g. /getResource) and the integration target that will handle the request (lambda function, database, ...),
- the stage name that the API will be deployed to,
- select auto-deploy so the API is redeployed every time there is a change to the HTTP API.
- When created, AWS gives us an invoke URL that is going to invoke the lambda function - this is a kind of endpoint that can be changed by integrating with Route 53.
- Opening the invoke URL with a declared resource path in a browser sends a GET request, launching the lambda and getting the lambda's result.
- When a POST request with values in the request body is sent to the resource path, the values are passed into the lambda function in the event parameter.
We can extract values and some additional info as follows:
- event['rawPath'] - to get the resource path,
- event['queryStringParameters']['param_name'] - to get the value assigned to param_name within the URL,
- decoded = json.loads(event['body']); name = decoded['name'] - we need to parse the JSON body in order to get its values.
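- Putting the above together, a minimal Python handler behind an HTTP API route might look like this (the /getResource path and the name field are only illustrative):

```python
import json

def lambda_handler(event, context):
    path = event.get("rawPath")                        # e.g. "/getResource"
    params = event.get("queryStringParameters") or {}  # ?name=... query parameters
    body = json.loads(event["body"]) if event.get("body") else {}

    name = body.get("name") or params.get("name") or "world"
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello {name} from {path}"}),
    }
```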
AWS CloudFront:
- A Content Delivery Network (CDN) service which acts like a proxy that receives requests and forwards them to the backend systems.
- A CDN caches website or application files - HTML, CSS, JS, images or videos - at data centers around the world. Even when the backend server goes down, the CDN is able to serve the content of a static website back to the end user.
- When setting up the CloudFront service we define the number of data centers - the edge locations. The more edge locations, the higher the performance and the lower the latency of getting the server's content as a response to a user's request.
- The edge locations allow users to download the app content much faster from the nearest edge location than if the request had to go all the way to the origin server.
- A user's request may still need to go to the origin server when the content is not present at the closest edge location at the moment.
- Content is being cached at the edge location for a specific period of time - Time To Live (TTL).
AWS Storage Gateway:
- A service that lets on-premises applications access and use cloud storage.
- In the Gateway Stored Volumes configuration, there is on-premises storage for an application server. Whenever a file is added to that special local storage, it is uploaded asynchronously, in a compressed manner, to AWS S3 or AWS EBS.
- In the Gateway Cached Volumes configuration, there is no primary on-premises storage. Data is stored primarily on AWS S3; what we have locally on the on-premises server is a cache of recently read or written data.
source: AWS
Amazon Redshift
- Columnar data storage.
- Data compression within blocks.
- Architecture:
source: simplilearn
- The main element of Amazon Redshift is a cluster of nodes - the data warehouse cluster.
- There are compute nodes that process data and a leader node that gives instructions.
- The leader node also manages client applications, like BI tools, that require data from Redshift.
- The leader node uses JDBC (Java Database Connectivity) to monitor all of the connections to client applications.
- Client applications use ODBC (Open Database Connectivity) to interact with the live data of the data warehouse cluster by sending SQL queries.
- Compute nodes are divided into slices with dedicated memory space. They run in parallel to process the data quickly.
- Workflow:
1. The client app sends a query to the leader node.
2. The leader node receives the query and develops a suitable execution plan.
3. Once the plan is set up, the compute nodes and their slices start working on it.
4. Compute nodes work in parallel and transfer data among themselves in order to resolve the query.
5. Once execution is done, the leader node aggregates the results and sends them back to the client app.
AWS Glue:
- ETL service to categorize, clean, enrich, and reliably move data between various data stores.
- It connects to different data sources, identifies the data types, suggests some transformations and generates editable code (an ETL script) to execute the overall transformation and data warehouse loading process.
- AWS Glue has 3 main components:
- Data Catalog - a central metadata repository that always stays in sync with the underlying data thanks to so-called crawlers.
- Job Authoring - an ETL engine that automatically generates Python or Scala code.
- Job Execution - a flexible scheduler that handles dependency resolution, job monitoring, potential retries and alerting.
- ETL scripts use the dynamic frame - similar to the Apache Spark DataFrame. A dynamic frame is a data abstraction that organizes data into rows and columns where each record is self-describing, so no schema is required initially. We can freely convert dynamic frames into Spark DataFrames (a minimal job sketch follows the list of applications below).
- Here are some applications:
- AWS Glue can catalog an S3 data lake, making it available for querying with Amazon Athena and Amazon Redshift.
- Building event-driven ETL pipelines: running ETL jobs as soon as new data arrives in Amazon S3 by invoking your AWS Glue ETL jobs from an AWS Lambda function.
- Cataloging data for quick search of datasets and maintaining relevant metadata in one central repository.
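- Here is the Glue job sketch mentioned above: a minimal ETL script that runs inside a Glue job, assuming a crawled table demo_table in a Data Catalog database demo_db (both names hypothetical):

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# DynamicFrame: schema-on-read abstraction, convertible to a Spark DataFrame
dyf = glue_context.create_dynamic_frame.from_catalog(database="demo_db", table_name="demo_table")
df = dyf.toDF()
print(df.count())

job.commit()
```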
AWS Athena:
- A query service for querying Amazon S3 (or DynamoDB) data with standard SQL.
- There is no need for any underlying compute infrastructure, and no need to load data into Amazon Athena or transform it for analysis.
- We can access Athena through the AWS Management Console, an application programming interface (API) or a Java Database Connectivity (JDBC) driver; then we define a schema and we can start executing SQL queries.
- We can put S3 data through an AWS Glue crawler in order to derive a data schema for a table in which the data is registered. The data schema is then kept in the so-called Data Catalog. The data schema can also be defined manually.
- There is also QuickSight available, which can help us visualize the data.
- Once the table is created we can move on to querying it with common SQL.
AWS Elastic MapReduce (EMR):
- A platform for computational processing of vast amounts of data with the help of the MapReduce framework. It simplifies the entire setup and management of the cluster and the Hadoop components.
- It uses Hadoop to distribute the data and process it across an auto-scaling cluster of computing nodes (EC2 instances).
- EMR continuously monitors nodes in the cluster. It retries failed tasks and replaces poorly performing instances.
- We can choose the computation engine while establishing the cluster:
Data Lake:
- Data Lake vs Data Warehouse
Data Lake | Data Warehouse |
---|---|
Stores raw, unstructured and unprocessed data. | Stores refined, structured and processed data. |
Stores data that may never be used, hence larger storage capacity required. | Saves storage space by not maintaining data that may never be used. |
Poor data quality. | Data quality ensured. |
Purpose of data gathering not determined. | Data gathered for a specific business purpose. |
Easily accessible and quick to update because of the lack of structure. | More complicated and costly to make changes. |
Used by Data Scientists. | Used by Business Analysts. |
Requires specialized tools like Machine Learning to understand and translate data into use. | Can be used with regular Business Intelligence tools to visualize data with charts and tables. |
- Data Warehouse based on AWS services:
source: AWS
- data ingestion - Amazon Kinesis Data Firehose.
- data storage - Amazon S3.
- data processing - AWS Lambda and AWS Glue.
- data migration - AWS Database Migration Service (AWS DMS) and AWS Glue.
- orchestration and metadata management.
- querying and data visualization - Amazon Athena and Amazon QuickSight.
- Data Lakes in AWS:
- Offers more agility and flexibility than traditional data management systems.
- Allows companies to store all of their data from various sources, regardless of whether it is structured or unstructured, in a centralized repository.
- Configures the core AWS services to easily tag, search, share, and govern subsets of data across a company.
- Stores and registers datasets of any size in the secure, durable, scalable AWS S3.
- Allows users to upload and catalog new datasets with searchable metadata and integrate it with AWS Glue and Amazon Athena to transform and analyze.
- Crawls the data sources, identifies data formats, and then suggests schemas and transformations with no hand-coded data flows.
- Adds user-defined tags into AWS DynamoDB to add business-relevant context to each dataset.
- Allows users to browse available datasets or search on dataset attributes and tags to quickly find and access data relevant to their business needs.
- Data Lake based on AWS services:
source: AWS
- The data lake infrastructure can be provisioned by AWS CloudFormation.
- Data lake API leverages Amazon API Gateway to provide access to data lake microservices through AWS Lambda functions.
- These microservices interact with Amazon S3, AWS Glue, AWS Athena, AWS DynamoDB, AWS ES, and AWS CloudWatch Logs to provide data storage, management, and audit functions.
Amazon Kinesis Firehose
- Allows delivering streaming or event data into various destinations such as a BI database, data storage (S3, Redshift, Elasticsearch) or dashboards.
- In other words, it creates a single data ingestion point and provides a means to deliver that data to the destination services we specify when setting up a Kinesis Firehose delivery stream.
- Ingested data (a large number of individual events in an application) can be compressed or batched into a single output file and sent to one of the destinations.
- Example workflow:
source: Be a better dev
1. SNS - data is put to a topic in the form of JSON.
2. Lambda as the subscriber to the SNS topic:
- for every event on the specific topic there is a lambda function invocation,
- the lambda performs put operations against the Kinesis Firehose endpoint.
3. Kinesis organizes the data and delivers it according to the buffer interval or buffer size.
4. S3, where the data is delivered for storage.
5. Optionally, Kinesis Firehose can invoke another lambda function that applies business logic to the data.
- The buffer size dictates how many files we end up with in S3. If the data is 10 MB large and the buffer size is set to 5 MB, then there will be two files stored in S3 at the end.
- Buffer interval: if we don't reach 5 MB of data, we push it to S3 after x seconds.
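- A boto3 sketch of the put operation the lambda performs in step 2; the delivery stream name is hypothetical:

```python
import boto3
import json

firehose = boto3.client("firehose")

record = {"event": "order_created", "orderId": 42}

# Firehose buffers records (by size or interval) and then writes batched files to S3
firehose.put_record(
    DeliveryStreamName="demo-delivery-stream",
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
```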
AWS CloudWatch:
- A monitoring service for AWS servers and applications.
- It collects and monitors log files, sets alarms and reacts to changes in AWS resources automatically, for example: when CPU utilization on an EC2 instance is greater than 70 %, you get an alarm notification by email (a sketch of such an alarm follows at the end of this section).
- AWS CloudWatch Logs:
- A server can keep a lot of log files, both system and application logs.
- It is important to have log files during application debugging. If there is something that doesn't work as expected, we need to check the errors in the specific log file.
- Traditionally, when debugging, we need to grant server access to the individual who wants to check a log file. What also poses a risk is that when the server terminates, the logs are lost.
- The better way is to create a central log server to which we push the log files from the individual systems. Then we can do central log monitoring.
- AWS CloudWatch Logs is the centralized log management to monitor, store and access log files from Amazon EC2 instances, Route 53 and other sources.
- It can catch all the traffic that runs over an SFTP server.
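- Here is the alarm sketch mentioned above: a boto3 example that raises an alarm when the average CPU utilization of one instance exceeds 70 % and notifies an SNS topic (the instance id and topic ARN are hypothetical):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="high-cpu-demo",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,                # evaluate 5-minute averages
    EvaluationPeriods=1,
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:alarm-topic"],  # e.g. a topic with an email subscription
)
```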
AWS Route 53:
- Domain Name System (DNS) web service used to route end users to Internet applications by translating names like www.example.com into the numeric IP addresses like 192.0.2.1 that computers use to connect to each other.
- Amazon Route 53 connects user requests to infrastructure running in AWS (EC2, S3) or routes users to infrastructure outside of AWS.
- With AWS Route 53 I can route traffic to different app endpoints independently and monitor their health.
- It is not limited to a specific region. We can launch it globally.
AWS ElastiCache:
- An in-memory cache in the cloud. It caches the responses associated with frequent queries.
- This allows better response times and decreases the load on the database server.
- When a user sends the same query once again, the response comes from the cache engine instead of from the database server.
AWS Transfer Family:
- Sets up the SFTP protocol for transferring files into S3 or Amazon EFS.
- It provides an endpoint for file transfers directly into and out of Amazon S3 using the following protocols:
- Secure File Transfer Protocol (SFTP),
- File Transfer Protocol over SSL (FTPS),
- File Transfer Protocol (FTP).
- Here is the workflow:
source: Amazon
- examples of file transfer clients: WinSCP, FileZilla,
- IAM roles are used to grant secure access to the S3 bucket from the file transfer clients.
Setup
The following installation is required:
- MobaXterm from https://mobaxterm.mobatek.net/download-home-edition.html