DevOps Engineer Professional (C02)

Comments per Service

CodeStar

CodeCommit

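CodeCommit needs a CloudWatch Events (EventBridge) rule to trigger CodePipeline
Can trigger Lambda functions from CodeCommit events
AWS provides several managed policies:
- AWSCodeCommitFullAccess, AWSCodeCommitPowerUser, AWSCodeCommitReadOnly
Approval rule templates define the approvals a pull request needs before it can be merged; to run unit tests via CodeBuild on a PR, use an EventBridge rule on pull request events (an approval rule can then require approval from the role that runs the tests)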
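A minimal sketch of the EventBridge side, assuming a rule that should fire when a pull request is created or its source branch is updated. `PR_PATTERN` follows the CodeCommit pull request event format; `matches` is a hypothetical helper that implements only exact-value matching, not the full EventBridge pattern language.

```python
import json

# EventBridge pattern for CodeCommit pull request events that should start a test build.
PR_PATTERN = {
    "source": ["aws.codecommit"],
    "detail-type": ["CodeCommit Pull Request State Change"],
    "detail": {"event": ["pullRequestCreated", "pullRequestSourceBranchUpdated"]},
}


def matches(pattern: dict, event: dict) -> bool:
    """Hypothetical helper: a tiny subset of EventBridge matching.

    Every key in the pattern must exist in the event; nested dicts recurse,
    and leaf lists hold the allowed exact values.
    """
    for key, allowed in pattern.items():
        if key not in event:
            return False
        value = event[key]
        if isinstance(allowed, dict):
            if not isinstance(value, dict) or not matches(allowed, value):
                return False
        elif value not in allowed:
            return False
    return True


if __name__ == "__main__":
    # Hypothetical sample event, trimmed to the fields the pattern looks at.
    sample = {
        "source": "aws.codecommit",
        "detail-type": "CodeCommit Pull Request State Change",
        "detail": {"event": "pullRequestCreated", "pullRequestId": "42"},
    }
    print(json.dumps(PR_PATTERN))  # paste into the rule's event pattern
    print(matches(PR_PATTERN, sample))
```

The rule target would then be the CodeBuild project that runs the unit tests.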

CodePipeline

CodePipeline can execute cross-region actions
CodePipeline can deploy straight into S3
CodePipeline can have custom actions that invoke job workers

CodeBuild

CodeBuild can be triggered directly from GitHub via a webhook
CodeBuild supports build badges, which provide an embeddable, dynamically generated image (badge) that displays the status of the latest build for a project

CodeDeploy

In EC2/On-Premises deployment, a CodeDeploy deployment group is a set of individual instances targeted for a deployment. A deployment group contains individually tagged instances, Amazon EC2 instances in Amazon EC2 Auto Scaling groups, or both.
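In blue/green deployments, CodeDeploy can terminate the original instances in the deployment group after a configurable waiting period (e.g. 1 hour).
CodeDeploy waits up to 1 hour (3600 seconds) by default for each lifecycle hook script to finish; the timeout can be set per script in the AppSpec file.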
CodeDeploy failing on AllowTraffic can mean that health checks on ELB are misconfigured
Sends notifications via CloudWatch Events (EventBridge) to targets such as:
- Lambda
- SNS
- Kinesis streams
- SQS
- Built-in targets (CloudWatch Alarms actions)

CodeGuru

Amazon CodeGuru Profiler helps developers understand the runtime behaviour of their applications, improve performance, and decrease infrastructure costs.
Amazon CodeGuru Reviewer is an automated code review service that identifies critical defects and deviations from coding best practices for Java and Python code. Works on pull requests.
Reviewer can detect hardcoded secrets in code and suggest changes to mitigate them (e.g. moving them to AWS Secrets Manager)

IaC

CloudFormation

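CFN custom resources receive a pre-signed S3 URL to which they must send their result
In a StackSet, resources with global names (like S3 buckets) must be named uniquely in every target account and region
CloudFormation drift detection has to be started manually (or via the API); use AWS Config to automate detection.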

Service Catalog

By using a launch role via a launch constraint, Service Catalog provisions the product with that role, so you can limit the end users’ own permissions to the minimum they require for that product
The template constraint limits the options that are available to end users when they launch a product. It works by narrowing the allowable values for parameters that are defined in the product’s underlying AWS CloudFormation template
- Apply template constraints to ensure that the end users can use products without breaching the compliance requirements of your organization

OpsWorks

OpsWorks can create time-based instances for scaling of predictable workload, or load-based using CPU utilisation or load, or memory utilisation

Compute

EC2

EC2 memory metrics are not collected by default; they require the CloudWatch agent to be installed
EC2 can use built-in instance recovery
An instance is scheduled to be retired when AWS detects irreparable failure of the underlying hardware that hosts the instance.
- When an instance reaches its scheduled retirement date, it is stopped or terminated by AWS.
- AWS also sends an AWS Health event, which you can monitor and manage by using Amazon CloudWatch Events.

ASG

ASG lifecycle states:
- Pending (hooks Pending:Wait, Pending:Proceed)
- InService
- Terminating (hooks Terminating:Wait, Terminating:Proceed)
- Terminated
Pending:Wait lifecycle hook lets you prepare instances (e.g. install software or updates) before bringing them into service
Terminating:Wait lifecycle hook lets you collect instance data (e.g. logs) before final termination
Tags defined on the Auto Scaling group are not propagated to EBS volumes (use launch template tag specifications to tag volumes)
ASG: A warm pool gives you the ability to decrease latency for your applications that have exceptionally long boot times, for example, because instances need to write massive amounts of data to disk.
- Can keep instances in pool running or stopped
ASG can notify via SNS on failed instance launch
Can use Amazon EventBridge or Amazon CloudWatch Events to track the Auto Scaling Events
- Can trigger Lambdas from ASG by filtering on EventBridge events
CloudFormation + ASGs:
- AutoScalingReplacingUpdate: with WillReplace: true, CloudFormation creates a new ASG and waits for a complete replacement of the ASG and its instances before deleting the old ASG
- AutoScalingRollingUpdate: replaces existing instances in the ASG in batches; valid options: MaxBatchSize, MinInstancesInService, MinSuccessfulInstancesPercent, PauseTime, SuspendProcesses, WaitOnResourceSignals
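Both update policies can be sketched as CloudFormation JSON built from Python dicts. This is a minimal sketch, assuming a launch template and subnet parameter defined elsewhere in the template; the logical names `WebAsg`, `WebLaunchTemplate` and `SubnetIds` are hypothetical.

```python
import json


def rolling_update_policy(max_batch: int, min_in_service: int, pause_time: str = "PT10M") -> dict:
    """UpdatePolicy that replaces instances in batches inside the existing ASG."""
    return {
        "AutoScalingRollingUpdate": {
            "MaxBatchSize": max_batch,
            "MinInstancesInService": min_in_service,
            "MinSuccessfulInstancesPercent": 100,
            "PauseTime": pause_time,  # ISO 8601 duration; timeout while waiting for signals
            "WaitOnResourceSignals": True,  # new instances must call cfn-signal
            "SuspendProcesses": ["HealthCheck", "ReplaceUnhealthy", "AZRebalance", "AlarmNotification", "ScheduledActions"],
        }
    }


def replacing_update_policy() -> dict:
    """UpdatePolicy that builds a new ASG and deletes the old one only on success."""
    return {"AutoScalingReplacingUpdate": {"WillReplace": True}}


def asg_resource(update_policy: dict, min_size: int, max_size: int) -> dict:
    rolling = update_policy.get("AutoScalingRollingUpdate")
    if rolling and rolling["MinInstancesInService"] >= max_size:
        # CloudFormation requires MinInstancesInService to be less than MaxSize.
        raise ValueError("MinInstancesInService must be less than MaxSize")
    return {
        "Type": "AWS::AutoScaling::AutoScalingGroup",
        "UpdatePolicy": update_policy,
        "Properties": {
            "MinSize": str(min_size),
            "MaxSize": str(max_size),
            # Hypothetical launch template and subnets defined elsewhere in the template.
            "LaunchTemplate": {
                "LaunchTemplateId": {"Ref": "WebLaunchTemplate"},
                "Version": {"Fn::GetAtt": ["WebLaunchTemplate", "LatestVersionNumber"]},
            },
            "VPCZoneIdentifier": {"Ref": "SubnetIds"},
        },
    }


if __name__ == "__main__":
    template = {"Resources": {"WebAsg": asg_resource(rolling_update_policy(1, 1), 2, 4)}}
    print(json.dumps(template, indent=2))
```

The rolling variant keeps the same ASG and swaps instances batch by batch; the replacing variant trades the cost of a temporary second ASG for a clean rollback to the untouched old group.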

Storage Gateway

Storage Gateway does not automatically refresh the cache if the files were added directly to S3. RefreshCache can be used to refresh the cache periodically.
- Tape Gateway stores virtual tapes in S3 and archives them to S3 Glacier; meant for backups etc.
- File Gateway exposes S3 as NFS/SMB file shares and gets on-premises file data into the cloud
- Volume Gateway provides cloud-backed iSCSI block storage volumes

SSM

AWS-AmazonLinuxDefaultPatchBaseline is a predefined patch baseline that cannot be modified; create a custom patch baseline for custom patch rules
aws:runDocument plugin runs SSM documents stored in Systems Manager or on a local share
aws:downloadContent plugin downloads SSM documents and scripts from a remote location (e.g. GitHub or S3) to the instance
Can use SSM Automation runbooks (e.g. AWS-CreateImage) to create AMIs

ELB

ALBs can be configured for 'dual stack' mode that allows IPv4 and IPv6
ALBs can split traffic between target groups using weights (weighted target groups)

Security

IAM

iam:PassRole is the permission that lets a principal pass a role to an AWS service, e.g. a developer passing a service role to CloudFormation
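A minimal sketch of a policy that lets a developer pass exactly one role, and only to CloudFormation, using the `iam:PassedToService` condition key. The account ID and the role name `CfnDeployRole` are hypothetical.

```python
import json


def pass_role_policy(account_id: str, role_name: str, service: str) -> dict:
    """Allow passing exactly one role, and only to one AWS service."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PassDeployRoleToOneService",
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": f"arn:aws:iam::{account_id}:role/{role_name}",
                # iam:PassedToService restricts which service may receive the role.
                "Condition": {"StringEquals": {"iam:PassedToService": service}},
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical account and role name.
    policy = pass_role_policy("111122223333", "CfnDeployRole", "cloudformation.amazonaws.com")
    print(json.dumps(policy, indent=2))
```

Without the condition, the developer could hand the same role to any service that accepts roles.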

Firewall Manager

Firewall Manager can be used to configure and apply WAF ACLs to the ALBs in an AWS account. It can help centrally manage as well as apply them to new accounts added to the Organization in the future.

KMS

KMS grants are commonly used by AWS services that integrate with AWS KMS to encrypt your data at rest.
- The service creates a grant on behalf of a user in the account, uses its permissions, and retires the grant as soon as its task is complete.

Compliance

GuardDuty

Can be used for org-wide compliance
AWS recommends a separate delegated GuardDuty administrator account
Can auto-enable GuardDuty for all future Org accounts
Can configure GuardDuty Trusted IP list and Threat IP list and work with findings based on those
GuardDuty findings are published to EventBridge; use EventBridge rules to filter them (e.g. by severity) and route them to targets
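Filtering by severity uses EventBridge numeric matching. A minimal sketch, assuming "High and above" means severity >= 7; `matches` is a hypothetical local helper that supports only exact values and `numeric` conditions, for reasoning about the pattern offline.

```python
import operator

# EventBridge pattern: only GuardDuty findings with severity >= 7.
HIGH_SEVERITY_FINDINGS = {
    "source": ["aws.guardduty"],
    "detail-type": ["GuardDuty Finding"],
    "detail": {"severity": [{"numeric": [">=", 7]}]},
}

_OPS = {"<": operator.lt, "<=": operator.le, "=": operator.eq, ">=": operator.ge, ">": operator.gt}


def _leaf_matches(allowed: list, value) -> bool:
    for candidate in allowed:
        if isinstance(candidate, dict) and "numeric" in candidate:
            cond = candidate["numeric"]  # e.g. [">=", 7] or [">=", 4, "<", 7]
            if isinstance(value, (int, float)) and all(
                _OPS[cond[i]](value, cond[i + 1]) for i in range(0, len(cond), 2)
            ):
                return True
        elif candidate == value:
            return True
    return False


def matches(pattern: dict, event: dict) -> bool:
    """Hypothetical local matcher covering exact values and numeric ranges only."""
    for key, allowed in pattern.items():
        if key not in event:
            return False
        if isinstance(allowed, dict):
            if not isinstance(event[key], dict) or not matches(allowed, event[key]):
                return False
        elif not _leaf_matches(allowed, event[key]):
            return False
    return True
```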

Config

AWS Config can ensure all EC2 instances are managed by AWS Systems Manager.
AWS Config managed rule ec2-volume-inuse-check finds unattached EBS volumes, but cannot detect how long a volume was unused
cloudformation-stack-drift-detection-check checks if the actual configuration of a CloudFormation stack differs, or has drifted
s3-bucket-ssl-requests-only checks whether S3 buckets have policies that require requests to use SSL
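The usual way to satisfy this rule is a bucket policy that denies any request where `aws:SecureTransport` is false. A minimal sketch; the bucket name is hypothetical.

```python
import json


def tls_only_bucket_policy(bucket: str) -> dict:
    """Deny every S3 request to the bucket that is not made over TLS."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                # aws:SecureTransport is "false" for plain-HTTP requests.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(tls_only_bucket_policy("example-bucket"), indent=2))  # hypothetical bucket
```

Both the bucket ARN and the `/*` object ARN are listed so that bucket-level and object-level calls are covered.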
Can deploy conformance packs into org accounts (from a delegated admin account)
Config itself is per region, use Config Aggregator for centralised collection of findings across regions & accounts
- Uses aggregator account
By default, AWS Config will not automatically remediate accounts in which CloudTrail logging was disabled. You must set this up yourself, e.g. with a CloudWatch Events rule and a custom Lambda function that calls the StartLogging API to re-enable CloudTrail. Furthermore, the cloudtrail-enabled AWS Config managed rule is only available with the periodic trigger type, not with configuration change triggers.
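The remediation described above might look like this minimal sketch, assuming an EventBridge rule on CloudTrail `StopLogging` API calls (pattern below) and the boto3 CloudTrail client's `start_logging`. The extra `cloudtrail` parameter is a hypothetical test hook, not part of the Lambda interface.

```python
# EventBridge rule pattern that routes CloudTrail StopLogging calls to this function.
STOP_LOGGING_PATTERN = {
    "source": ["aws.cloudtrail"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {"eventSource": ["cloudtrail.amazonaws.com"], "eventName": ["StopLogging"]},
}


def handler(event, context, cloudtrail=None):
    """Re-enable logging on the trail that was just stopped.

    `cloudtrail` is a hypothetical injection point for tests; in Lambda it
    defaults to a real boto3 CloudTrail client.
    """
    detail = event.get("detail", {})
    if detail.get("eventName") != "StopLogging":
        return {"restarted": None}
    trail = detail["requestParameters"]["name"]  # trail name or ARN from the API call
    if cloudtrail is None:
        import boto3  # available in the Lambda Python runtime

        cloudtrail = boto3.client("cloudtrail")
    cloudtrail.start_logging(Name=trail)
    return {"restarted": trail}
```

The function's execution role needs `cloudtrail:StartLogging` on the trail.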

ControlTower

Use EventBridge to get notifications on Control Tower events like CreateManagedAccount
Customizations for AWS Control Tower (CfCT) helps you customize your AWS Control Tower landing zone and stay aligned with AWS best practices. Customizations are implemented with AWS CloudFormation templates and service control policies (SCPs).
- CfCT capability is integrated with AWS Control Tower lifecycle events so that your resource deployments remain synchronized with your landing zone

Org Policies

Policies (e.g. SCPs) are inherited down the path org root -> OU -> accounts

Trusted Advisor

AWS Trusted Advisor checks identify ways to optimize your AWS infrastructure, improve security and performance, reduce costs, and monitor service quotas
- Cost Optimization, Performance, Security, Fault Tolerance, and Service Limits
Trusted Advisor can check for low-utilization EC2 instances
Trusted Advisor's primary integration point is CloudWatch Events
With Trusted Advisor’s Service Limit Dashboard, you can view, refresh, and export utilization and limit data on a per-limit basis.
- Metrics are published on Amazon CloudWatch in which you can create custom alarms

Health

AWS scans public code repositories (e.g. GitHub) and reports compromised access keys as AWS Health events
On detection of an exposed IAM access key, AWS Health generates an AWS_RISK_CREDENTIALS_EXPOSED CloudWatch Event.
Also lists AWS Scheduled maintenance events on Health Dashboard
- Can use CloudWatch Events/EventBridge to trigger workflows based on events
Can monitor AWS Health events with Amazon EventBridge or CloudWatch Events, or query them by calling the AWS Health API (which requires a Business, Enterprise On-Ramp, or Enterprise Support plan)
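A minimal sketch of the EventBridge pattern for the exposed-credentials event, plus a helper that pulls the affected entities out of a matching event. It assumes the event's `detail.affectedEntities[].entityValue` carries the exposed access key ID; a response function could then deactivate that key.

```python
# EventBridge pattern for exposed-credential notifications from AWS Health.
CREDENTIALS_EXPOSED_PATTERN = {
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
    "detail": {
        "service": ["RISK"],
        "eventTypeCategory": ["issue"],
        "eventTypeCode": ["AWS_RISK_CREDENTIALS_EXPOSED"],
    },
}


def exposed_entities(event: dict) -> list:
    """Return affected entity values (assumed to be the exposed access key IDs)."""
    detail = event.get("detail", {})
    if detail.get("eventTypeCode") != "AWS_RISK_CREDENTIALS_EXPOSED":
        return []
    return [e["entityValue"] for e in detail.get("affectedEntities", []) if "entityValue" in e]
```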

CloudTrail

Can set up trails for
- Data events: These events provide insight into the resource operations performed on or within a resource. These are also known as data plane operations.
  - For S3 or Lambda data events
- Management events: Management events provide insight into management operations that are performed on resources in your AWS account. These are also known as control plane operations.

Networking

VPC

NAT gateway does not span multiple AZs (instead: one gateway per AZ)
Can send VPC Flow Logs to CloudWatch Logs

Storage

Aurora

Read replicas are always asynchronous
AWS Aurora Global Database uses storage-based replication with typical latency of less than 1 second, using dedicated infrastructure that leaves your database fully available to serve application workloads.
- 1 primary region (read/write), up to 5 secondary regions (read)
- In the event of a regional degradation or outage, one of the secondary regions can be promoted to full read/write capabilities in less than 1 minute.
Aurora endpoints
- single built-in cluster endpoint, connects to the primary instance of the cluster
- reader endpoint for read-only connections for your Aurora cluster
- can have custom cluster endpoints (managed by Aurora) of type READER, WRITER or ANY

RDS

RDS creates and saves automated backups of your DB instance or Multi-AZ DB cluster during the backup window of your database.
- default: a 30-minute backup window, selected at random from an 8-hour block of time per region (roughly the region's night)
Amazon RDS uses SNS to provide notification when an Amazon RDS event occurs.
- Can also use CloudWatch Events/EventBridge
Failover:
- AZ outages => RDS multi-AZ deployment
- Regional outages => RDS read replica
- Multi-region deployments are like multi-AZ deployments, but other regions can be used for reads. Read replicas can be in the same AZ, same region, or cross-region
- Read replicas have best RTO/RPO, but highest cost

DynamoDB

In DynamoDB, throttled write requests (e.g. a rising WriteThrottleEvents metric) indicate that you should increase the maximum write capacity units in the table's Auto Scaling policy.
WriteThrottleEvents are requests to DynamoDB that exceed the provisioned write capacity units for a table or a global secondary index.
Can use Kinesis Data Streams to capture changes to DynamoDB
Amazon DynamoDB global tables provide a single-digit millisecond latency and make sure the data is available across regions.
DynamoDB global tables (original 2017.11.29 version) require
- the tables to be created in each region already
- DynamoDB Streams to be enabled
Don't have many Lambdas read from the same DynamoDB stream
- AWS recommends no more than two processes reading from the same stream shard at a time
- Better to use a fan-out pattern

Glue

The AWS Glue Data Catalog is an efficient way to store metadata about objects (e.g. table schemas for data in S3). Combination: S3 - Glue - Athena - QuickSight

S3

Can include a pre-calculated checksum as part of your request. Amazon S3 compares the provided checksum to the checksum that it calculates by using your specified algorithm
Can activate S3 server access logs and use Athena for analysis/queries
S3 cross-region replication is push-based: the source bucket gets a replication rule, the source account needs an IAM role for the S3 service to assume, and (for cross-account replication) the destination bucket needs a bucket policy
- Enable versioning in both buckets.
- Configure a replication rule within the source bucket to activate the replication process.
- In the source AWS account, create an IAM role that Amazon S3 can assume to replicate objects.
- If the destination bucket is in another account, add a bucket policy to it that grants the source account's replication role permission to replicate objects into it.
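A minimal sketch of the replication rule on the source bucket, in the shape the S3 `PutBucketReplication` API takes. All ARNs, bucket names and account IDs are hypothetical; with boto3 the dict would be passed as `s3.put_bucket_replication(Bucket="source-bucket", ReplicationConfiguration=cfg)`.

```python
import json


def replication_configuration(role_arn: str, destination_bucket: str, destination_account: str = "") -> dict:
    """Replicate new objects from the source bucket to one destination bucket."""
    destination = {"Bucket": f"arn:aws:s3:::{destination_bucket}"}
    if destination_account:
        # Cross-account: make the destination account the owner of the replicas.
        destination["Account"] = destination_account
        destination["AccessControlTranslation"] = {"Owner": "Destination"}
    return {
        "Role": role_arn,  # role S3 assumes in the source account
        "Rules": [
            {
                "ID": "replicate-everything",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},  # empty filter = whole bucket
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": destination,
            }
        ],
    }


if __name__ == "__main__":
    cfg = replication_configuration(
        "arn:aws:iam::111122223333:role/s3-replication", "dest-bucket", "444455556666"
    )
    print(json.dumps(cfg, indent=2))
```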
AWS CloudTrail only logs bucket-level actions in your Amazon S3 buckets by default. If you want to record all object-level API activity in your S3 bucket, you can set up data events in CloudTrail

Serverless

API Gateway

API Gateway does not have specific metrics for individual HTTP error codes like 403, only a generic 4XXError metric
Can enable API caching in Amazon API Gateway to cache your endpoint's responses

ECS

Can set ECS tasks as a target of CloudWatch Events rules
ECS/Fargate logs
- add the required logConfiguration parameters to your task definition to turn on the awslogs log driver
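A minimal sketch of a Fargate task definition with the `awslogs` driver turned on. The family, image, log group and role ARN passed in are hypothetical, and the container name `app` is an illustrative choice.

```python
import json


def fargate_task_definition(family: str, image: str, log_group: str, region: str, execution_role_arn: str) -> dict:
    """Minimal Fargate task definition whose container logs go to CloudWatch Logs."""
    return {
        "family": family,
        "requiresCompatibilities": ["FARGATE"],
        "networkMode": "awsvpc",
        "cpu": "256",
        "memory": "512",
        # The task execution role needs logs:CreateLogStream and logs:PutLogEvents.
        "executionRoleArn": execution_role_arn,
        "containerDefinitions": [
            {
                "name": "app",
                "image": image,
                "essential": True,
                "logConfiguration": {
                    "logDriver": "awslogs",
                    "options": {
                        "awslogs-group": log_group,
                        "awslogs-region": region,
                        "awslogs-stream-prefix": family,
                    },
                },
            }
        ],
    }


if __name__ == "__main__":
    td = fargate_task_definition(
        "web", "public.ecr.aws/nginx/nginx:latest", "/ecs/web", "eu-west-1",
        "arn:aws:iam::111122223333:role/ecsTaskExecutionRole",
    )
    print(json.dumps(td, indent=2))
```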
ECS/EC2
- container instances need an attached IAM role that allows logs:CreateLogStream and logs:PutLogEvents
- to turn on the awslogs log driver, your Amazon ECS container instances require at least version 1.9.0 of the container agent

Application Auto Scaling

Is a web service for automatically scaling scalable resources for individual AWS services beyond Amazon EC2
- Lambda function provisioned concurrency
- DynamoDB tables and global secondary indexes
- Aurora replicas
- Amazon Elastic Container Service (ECS) services
- ...
Target tracking scaling – Scale a resource based on a target value for a specific CloudWatch metric.
Step scaling – Scale a resource based on a set of scaling adjustments that vary based on the size of the alarm breach.
Scheduled scaling – Scale a resource one time only or on a recurring schedule.
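Step scaling picks an adjustment based on how far the metric is past the alarm threshold. A simplified sketch of that lookup, assuming a scale-out policy with `ChangeInCapacity` adjustments. Interval bounds are relative to the threshold (lower bound inclusive, upper bound exclusive when the metric is above the threshold). The 50% CPU policy in `STEPS` is hypothetical.

```python
import math


def step_adjustment(metric: float, threshold: float, steps: list) -> int:
    """Pick the ScalingAdjustment whose interval contains the breach size.

    A missing bound means infinity; this sketch only covers scale-out steps
    (metric above the threshold).
    """
    breach = metric - threshold
    for step in steps:
        lower = step.get("MetricIntervalLowerBound", -math.inf)
        upper = step.get("MetricIntervalUpperBound", math.inf)
        if lower <= breach < upper:
            return step["ScalingAdjustment"]
    return 0


def new_capacity(current: int, adjustment: int, min_size: int, max_size: int) -> int:
    """ChangeInCapacity adjustment, clamped to the resource's min/max capacity."""
    return max(min_size, min(max_size, current + adjustment))


# Hypothetical policy: alarm at 50% CPU, add more capacity the bigger the breach.
STEPS = [
    {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 10, "ScalingAdjustment": 1},
    {"MetricIntervalLowerBound": 10, "MetricIntervalUpperBound": 20, "ScalingAdjustment": 2},
    {"MetricIntervalLowerBound": 20, "ScalingAdjustment": 3},
]
```

By contrast, target tracking computes the adjustment itself from the target value, so no step table is needed.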

Content Delivery

CloudFront

OriginGroup: an origin group includes two origins (a primary origin and a secondary origin to fail over to) and the failover criteria (HTTP status codes) that you specify.

Notifications/Events

SNS

SNS defines a delivery policy for each delivery protocol. The delivery policy defines how Amazon SNS retries the delivery of messages when server-side errors occur (when the system that hosts the subscribed endpoint becomes unavailable).
- When the delivery policy is exhausted, Amazon SNS stops retrying the delivery and discards the message
- unless a dead-letter queue (SQS) is attached to the subscription.
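A minimal sketch of attaching an SQS dead-letter queue to an SNS subscription: the subscription's `RedrivePolicy` attribute names the queue, and the queue policy must let the topic send to it. ARNs are hypothetical; with boto3, the first value would be applied via `sns.set_subscription_attributes(SubscriptionArn=..., AttributeName="RedrivePolicy", AttributeValue=...)`.

```python
import json


def redrive_policy(dlq_arn: str) -> str:
    """Subscription attribute value that sends undeliverable messages to an SQS DLQ."""
    return json.dumps({"deadLetterTargetArn": dlq_arn})


def dlq_queue_policy(dlq_arn: str, topic_arn: str) -> dict:
    """SQS queue policy letting only this SNS topic write into the DLQ."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sns.amazonaws.com"},
                "Action": "sqs:SendMessage",
                "Resource": dlq_arn,
                "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
            }
        ],
    }


if __name__ == "__main__":
    dlq = "arn:aws:sqs:eu-west-1:111122223333:orders-dlq"
    topic = "arn:aws:sns:eu-west-1:111122223333:orders"
    print(redrive_policy(dlq))
    print(json.dumps(dlq_queue_policy(dlq, topic), indent=2))
```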
For ECS notifications on an essential task being stopped, use EventBridge
For S3 event fan-out, send the events to SNS and subscribe consumers to the topic

Logging/Monitoring/Notification

CloudWatch

CloudWatch Logs are always encrypted
CloudWatch Logs metric filters extract metric data from log events (e.g. counting matching terms), turning logs into CloudWatch metrics
Can create CloudWatch Alarm for the StatusCheckFailed_System metric and select the EC2 action to recover the instance
CloudWatch Logs subscriptions provide a near-real-time feed of log events
- "Getting logs out of CloudWatch for further processing"
- from CloudWatch Logs to Kinesis Data Streams, Kinesis Data Firehose, OpenSearch Service (formerly Elasticsearch) or Lambda
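When the subscription target is Lambda, the log events arrive base64-encoded and gzip-compressed under `awslogs.data`. A minimal sketch of a handler that decodes them; the "keep lines containing ERROR" step is a hypothetical placeholder for real processing.

```python
import base64
import gzip
import json


def decode_subscription_event(event: dict) -> dict:
    """CloudWatch Logs delivers subscription data to Lambda base64-encoded and gzip-compressed."""
    return json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))


def handler(event, context):
    data = decode_subscription_event(event)
    if data.get("messageType") == "CONTROL_MESSAGE":
        return []  # connectivity check sent by CloudWatch Logs, no log events
    # Hypothetical processing step: keep only lines mentioning ERROR.
    return [e["message"] for e in data.get("logEvents", []) if "ERROR" in e["message"]]
```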
CloudWatch has a predefined dashboard for CodeBuild metrics
You can call the EC2 CreateSnapshot API directly as a target from CloudWatch Events.

KMS

KMS publishes metrics to CloudWatch, on which you can define alarms and alerts

X-Ray

Can run X-Ray daemon on AWS Elastic Beanstalk
X-Ray daemon uses UDP port 2000
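What travels over that UDP port is a JSON header line followed by a segment document. A minimal sketch using only the standard library (in practice the X-Ray SDKs do this for you); it assumes the daemon listens on the default `127.0.0.1:2000`, and the service name is hypothetical.

```python
import json
import os
import socket

DAEMON_ADDRESS = ("127.0.0.1", 2000)  # default X-Ray daemon UDP endpoint
HEADER = json.dumps({"format": "json", "version": 1})


def new_trace_id(epoch_seconds: float) -> str:
    """Trace ID format: 1-<8 hex digits of epoch seconds>-<24 random hex digits>."""
    return f"1-{int(epoch_seconds):08x}-{os.urandom(12).hex()}"


def segment_datagram(name: str, start: float, end: float) -> bytes:
    """Header line + segment document, as the daemon expects on its UDP port."""
    segment = {
        "name": name,
        "id": os.urandom(8).hex(),  # 16 hex digits
        "trace_id": new_trace_id(start),
        "start_time": start,
        "end_time": end,
    }
    return (HEADER + "\n" + json.dumps(segment)).encode()


def send_segment(datagram: bytes, address=DAEMON_ADDRESS) -> None:
    """Fire-and-forget UDP send; nothing is returned even if no daemon is listening."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram, address)
```

Usage: `send_segment(segment_datagram("checkout-service", t0, t1))` with `t0`/`t1` as epoch seconds. Because UDP is connectionless, a missing daemon silently drops the trace instead of failing the application.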

Comments per Topic

Implement CI/CD Pipelines

CodeDeploy states + lifecycle hooks
CodeCommit IAM policies
CodeCommit needs CloudWatch Events/EventBridge to detect PRs
- (EventBridge is the same service as CloudWatch Events, just with a new interface and more features exposed.)
GitHub needs a webhook to start a CodePipeline
CodeDeploy lifecycle hooks for an in-place EC2/on-premises deployment, in execution order (events reserved for CodeDeploy in parentheses; the traffic events only run when a load balancer is used):
- BeforeBlockTraffic
- (BlockTraffic)
- AfterBlockTraffic
- ApplicationStop
- (DownloadBundle)
- BeforeInstall
- (Install)
- AfterInstall
- ApplicationStart
- ValidateService
- BeforeAllowTraffic
- (AllowTraffic)
- AfterAllowTraffic
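The ordering above can be modelled to check an AppSpec `hooks` section (parsed into a dict with the AppSpec `location`/`timeout` keys) before deploying. A simplified sketch: `validate_hooks` and `run_order` are hypothetical helpers, and CodeDeploy's own Start/End events are left out.

```python
# In-place EC2/on-premises deployment order (CodeDeploy's own Start/End events omitted).
EVENT_ORDER = [
    "BeforeBlockTraffic", "BlockTraffic", "AfterBlockTraffic",
    "ApplicationStop", "DownloadBundle", "BeforeInstall", "Install",
    "AfterInstall", "ApplicationStart", "ValidateService",
    "BeforeAllowTraffic", "AllowTraffic", "AfterAllowTraffic",
]
RESERVED = {"BlockTraffic", "DownloadBundle", "Install", "AllowTraffic"}  # no scripts allowed
TRAFFIC_EVENTS = {e for e in EVENT_ORDER if "Traffic" in e}  # only run with a load balancer


def lifecycle(with_load_balancer: bool) -> list:
    return [e for e in EVENT_ORDER if with_load_balancer or e not in TRAFFIC_EVENTS]


def validate_hooks(hooks: dict) -> list:
    """Return problems found in an AppSpec 'hooks' section."""
    errors = []
    for event, scripts in hooks.items():
        if event in RESERVED:
            errors.append(f"{event} is reserved for CodeDeploy")
        elif event not in EVENT_ORDER:
            errors.append(f"{event} is not an EC2/on-premises lifecycle event")
        for script in scripts:
            if int(script.get("timeout", 3600)) > 3600:
                errors.append(f"{event}: timeout above 3600 seconds")
    return errors


def run_order(hooks: dict, with_load_balancer: bool) -> list:
    """Script locations in the order CodeDeploy would run them."""
    return [s["location"] for e in lifecycle(with_load_balancer) for s in hooks.get(e, [])]
```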
Integrate automated testing into CI/CD pipelines
- CloudWatch Logs + EventBridge to automate based on CodeBuild job results
- CodeDeploy + EventBridge to automate based on CodeDeploy job results
- EventBridge for CodePipeline scheduled events
- CodeDeploy can integrate with CloudWatch Alarms to stop deployments (and optionally roll back)
Build and manage artifacts
- CodeBuild + CodePipeline + CodeDeploy + S3 for artifacts
- S3 versioning + encryption required for CodePipeline
Implement deployment strategies for instance, container, and serverless environments
- Elastic Beanstalk policies
  - All at once - fastest, but causes downtime; all remaining options have zero downtime
  - Rolling - deploys in batches, so capacity is reduced during the deploy
  - Rolling with additional batch - to maintain full capacity during deploy
  - Immutable for when new & old versions must not be mixed and for fast rollback
  - Traffic splitting: for canary deploys
  - Blue/Green deployments: swap environment URLs; keep RDS in a separate stack; requires DNS change (all previous ones do not)
- Lambda
  - canary deployments via alias weights
- CodeDeploy predefined deployment configurations:
  - Lambda: CodeDeployDefault.LambdaLinear10PercentEvery10Minutes (10% of traffic shifted every 10 minutes), CodeDeployDefault.LambdaCanary10Percent10Minutes (10% first, the remaining 90% after 10 minutes)
  - EC2: CodeDeployDefault.AllAtOnce, CodeDeployDefault.OneAtATime, CodeDeployDefault.HalfAtATime
- ALB + EC2 + Route53 alias record swaps
- OpsWorks Stack cloning + Route53 alias swaps
  - OpsWorks lifecycle stages
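The difference between the Linear and Canary configurations is easiest to see as a schedule of (minute, cumulative % of traffic on the new version). A simplified model that ignores validation hooks, alarms and rollbacks:

```python
def linear_schedule(percent: int, interval_minutes: int) -> list:
    """(minute, cumulative % on new version) for a Linear configuration."""
    schedule, shifted, minute = [], 0, 0
    while shifted < 100:
        shifted = min(100, shifted + percent)
        schedule.append((minute, shifted))
        minute += interval_minutes
    return schedule


def canary_schedule(percent: int, wait_minutes: int) -> list:
    """Canary: one small increment, then everything after the wait."""
    return [(0, percent), (wait_minutes, 100)]


ALL_AT_ONCE = [(0, 100)]

if __name__ == "__main__":
    print(linear_schedule(10, 10))  # LambdaLinear10PercentEvery10Minutes
    print(canary_schedule(10, 10))  # LambdaCanary10Percent10Minutes
```

Linear takes 90 minutes to reach 100% but exposes problems gradually; Canary finishes after 10 minutes with a single checkpoint.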

Config Management and IaC

Define cloud infrastructure and reusable components to provision and manage systems throughout their lifecycle
- CloudFormation cross-stack references use exports + Fn::ImportValue
- Inline Lambda functions in CFN (Code.ZipFile property, for Node.js and Python)
- When a custom resource invokes a Lambda function in AWS CloudFormation, the request includes a pre-signed S3 URL (ResponseURL). The Lambda function is responsible for sending a response to that URL to indicate whether the resource operation succeeded.
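A minimal sketch of a custom resource handler that always answers the pre-signed URL with a plain HTTP PUT, modelled on what the `cfn-response` helper does. The "work" step and the `-resource` physical ID suffix are hypothetical, and the `send` parameter is a test hook.

```python
import json
import urllib.request


def build_response(event: dict, status: str, physical_id: str, data=None, reason: str = "") -> dict:
    """Response body CloudFormation expects at the pre-signed ResponseURL."""
    return {
        "Status": status,  # "SUCCESS" or "FAILED"
        "Reason": reason or "See the CloudWatch Logs of this function",
        "PhysicalResourceId": physical_id,
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data or {},
    }


def put_response(event: dict, body: dict) -> int:
    """Upload the response with a plain HTTP PUT to the pre-signed S3 URL."""
    payload = json.dumps(body).encode()
    request = urllib.request.Request(
        event["ResponseURL"], data=payload, method="PUT",
        headers={"Content-Type": "", "Content-Length": str(len(payload))},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def handler(event, context, send=put_response):
    physical_id = event.get("PhysicalResourceId", f"{event['LogicalResourceId']}-resource")
    try:
        # Hypothetical work; a real resource would create/update/delete something here.
        data = {"RequestType": event["RequestType"]}
        body = build_response(event, "SUCCESS", physical_id, data)
    except Exception as exc:  # always answer, or the stack waits until it times out
        body = build_response(event, "FAILED", physical_id, reason=str(exc))
    send(event, body)
    return body
```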
Deploy automation to create, onboard, and secure AWS accounts in a multi-account/multi-region environment
Design and build automated solutions for complex tasks and large-scale environments
- CloudFormation + ASGs:
  - AutoScalingReplacingUpdate: with WillReplace: true, CloudFormation creates a new ASG and waits for a complete replacement of the ASG and its instances before deleting the old ASG
  - AutoScalingRollingUpdate: replaces existing instances in the ASG in batches; valid options: MaxBatchSize, MinInstancesInService, MinSuccessfulInstancesPercent, PauseTime, SuspendProcesses, WaitOnResourceSignals
- OpsWorks can create time-based instances for scaling of predictable workload, or load-based using CPU utilisation or load, or memory utilisation
- Collecting on-prem info:
  - Application Discovery Agent (install on each VM) or Agentless Discovery Connector (separate VM)

Resilient Cloud Solutions

Implement highly available solutions to meet resilience and business requirements
- RDS:
  - AZ outages => RDS multi-AZ deployment
  - Regional outages => RDS read replica
  - Multi-region deployments are like multi-AZ deployments, but other regions can be used for reads. Read replicas can be in the same AZ, same region, or cross-region
  - Read replicas have best RTO/RPO, but highest cost
- Frontend traffic switching => Route53 failover
- AutoScaling with a min & max of 1 is actually sensible - it makes the instance auto-redeploy if it dies
- Route53 policies: simple, failover, geolocation, geoproximity, latency, multi-value answer, weighted
Implement solutions that are scalable to meet business requirements
- ASG lifecycle states:
  - Pending (hooks Pending:Wait, Pending:Proceed)
  - InService
  - Terminating (hooks Terminating:Wait, Terminating:Proceed)
  - Terminated
- EC2 Auto Scaling Pending:Wait lifecycle hook lets you prepare instances (e.g. install software or updates) before bringing them into service
- Terminating:Wait lifecycle hook to collect instance data (e.g. logs) before final termination
- EKS: Kubernetes Cluster Autoscaler or Karpenter
- EKS networking:
  - VPC CNI plugin
  - Load Balancer Controller
  - CoreDNS
  - kube-proxy
  - Calico
Hybrid environment patching
Implement automated recovery processes to meet RTO/RPO requirements

Monitoring and Logging

Configure the collection, aggregation, and storage of logs and metrics
- AWS Config Aggregator for centralised collection of findings across regions & accounts
- EC2 custom logging requirements => CloudWatch agent (successor of the CloudWatch Logs agent)
- ECS Fargate logs => awslogs driver on task definition
- CloudWatch has a predefined dashboard for CodeBuild metrics
Audit, monitor, and analyze logs and metrics to detect issues
- near real time dashboards => QuickSight
- near real time processing on CloudWatch logs:
  - Lambda subscription filter
  - Kinesis Data Streams subscription filter
  - OpenSearch (formerly Elasticsearch) subscription filter
- CloudTrail has log file integrity validation, which must be enabled
Automate monitoring and event management of complex environments
- Service limit alerting => Trusted Advisor + CloudWatch Alarms + ServiceLimitUsage metric

Incident and Event Response

Manage event sources to process, notify, and take action in response to events
- S3 event notifications for data notifications like file deletion
- RDS event notifications for multi-AZ failover events
- EventBridge + AWS Health for notification about IAM credentials being exposed on GitHub, and for notifications about instance outages, etc.
- CloudTrail data events for object-level activity on S3
- EC2 Auto Scaling groups => EventBridge
- CodePipeline stage => EventBridge
- CodeDeploy => CloudWatch alarms (and the deployment configuration's minimum healthy hosts setting) can trigger automatic rollbacks
- OpsWorks self-healing => EventBridge
Implement configuration changes in response to events
Troubleshoot system and application failures

Security and Compliance

Implement techniques for identity and access management at scale
- Limit CodeCommit permissions via an IAM policy whose Resource matches the repository ARN
- S3 bucket policies for requiring TLS
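A minimal sketch of the CodeCommit scoping above: an identity policy granting selected actions on one repository ARN only. Region, account ID and the repository name `payments-api` are hypothetical.

```python
import json


def repo_scoped_policy(region: str, account_id: str, repo: str, actions: list) -> dict:
    """Grant the listed CodeCommit actions on one repository only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [f"codecommit:{action}" for action in actions],
                "Resource": f"arn:aws:codecommit:{region}:{account_id}:{repo}",
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical repo: developers may clone and push, nothing else.
    policy = repo_scoped_policy("eu-west-1", "111122223333", "payments-api", ["GitPull", "GitPush"])
    print(json.dumps(policy, indent=2))
```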
Apply automation for security controls and data protection
- Lifecycle management + auto-rotation of secrets => Secrets Manager
- Cost-effective => SSM Parameter Store SecureStrings
- Patching => SSM Patch Manager
Implement security monitoring and auditing solutions
