Professional Cloud DevOps Engineers implement processes throughout the systems development lifecycle using Google-recommended methodologies and tools. They build and deploy software and infrastructure delivery pipelines, optimize and maintain production systems and services, and balance service reliability with delivery speed.

–Google

The exam is 2 hours long and costs $200.

Exam Content & Outline – What Will You Be Tested On?

The exam tests five main capabilities:

Bootstrapping a Google Cloud organization for DevOps
Building and implementing CI/CD pipelines for a service
Applying site reliability engineering practices to a service
Implementing service monitoring strategies
Optimizing service performance

Some key technologies/concepts to learn:

Terraform
CI/CD
Kubernetes
Jenkins

Let’s look at each of the five exam sections in more detail to see exactly what to study for the Google Professional Cloud DevOps Engineer certification.

Bootstrapping a Google Cloud Organization for DevOps

Designing the overall resource hierarchy for an organization
- Projects and folders
- Shared networking
- Identity and Access Management (IAM) roles and organization-level policies
- Creating and managing service accounts
Managing infrastructure as code
- Infrastructure as code tooling (e.g., Cloud Foundation Toolkit, Config Connector, Terraform, Helm)
- Making infrastructure changes using Google-recommended practices and infrastructure as code blueprints
- Immutable architecture
Designing a CI/CD architecture stack in Google Cloud, hybrid, and multi-cloud environments
- CI with Cloud Build
- CD with Google Cloud Deploy
- Widely used third-party tooling (e.g., Jenkins, Git, ArgoCD, Packer)
- Security of CI/CD tooling
Managing multiple environments (e.g., staging, production)
- Determining the number of environments and their purpose
- Creating environments dynamically for each feature branch with Google Kubernetes Engine (GKE) and Terraform
- Anthos Config Management
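
To make the "environment per feature branch" item above more concrete, here is a minimal sketch of one small piece of that workflow: turning a Git branch name into a name that is valid for a Kubernetes namespace. Kubernetes namespace names must be valid DNS labels (lowercase letters, digits, and hyphens, starting and ending with an alphanumeric character, at most 63 characters). The function name branch_to_namespace and the "env" prefix are hypothetical, chosen only for illustration; they are not part of any Google tool.

```python
import re

# Kubernetes namespace names must be valid DNS labels, which are
# limited to 63 characters.
MAX_LABEL_LENGTH = 63


def branch_to_namespace(branch: str, prefix: str = "env") -> str:
    """Convert a Git branch name into a valid Kubernetes namespace name.

    The "env" prefix is a hypothetical naming convention, not a requirement.
    """
    # Lowercase, then collapse every run of disallowed characters into "-".
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    name = f"{prefix}-{slug}" if slug else prefix
    # Truncate to the label limit and make sure we do not end on a hyphen.
    return name[:MAX_LABEL_LENGTH].rstrip("-")


if __name__ == "__main__":
    print(branch_to_namespace("feature/JIRA-123_Add-Login"))
```

A pipeline could pass the resulting name to whatever tool creates the environment, so that each branch gets its own isolated namespace and the name stays predictable when the branch is deleted and the environment is torn down.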

Building and Implementing CI/CD Pipelines for a Service

Designing and managing CI/CD pipelines
- Artifact management with Artifact Registry
- Deployment to hybrid and multi-cloud environments (e.g., Anthos, GKE)
- CI/CD pipeline triggers
- Testing a new application version in the pipeline
- Configuring deployment processes (e.g., approval flows)
- CI/CD of serverless applications
Implementing CI/CD pipelines
- Auditing and tracking deployments (e.g., Artifact Registry, Cloud Build, Google Cloud Deploy, Cloud Audit Logs)
- Deployment strategies (e.g., canary, blue/green, rolling, traffic splitting)
- Rollback strategies
- Troubleshooting deployment issues
Managing CI/CD configuration and secrets
- Secure storage methods and key rotation services (e.g., Cloud Key Management Service, Secret Manager)
- Secret management
- Build versus runtime secret injection
Securing the CI/CD deployment pipeline
- Vulnerability analysis with Artifact Registry
- Binary Authorization
- IAM policies per environment
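
The canary and traffic-splitting strategies listed above are easier to remember with a small example. The sketch below shows one common way to split traffic: hash a stable identifier so that each user consistently lands on the same version. This is only an illustration of the idea, not how Google Cloud Deploy, Cloud Run, or a service mesh implements it; the function name route_request and the "canary"/"stable" labels are assumptions.

```python
import hashlib


def route_request(user_id: str, canary_percent: int) -> str:
    """Return "canary" or "stable" for a user, given the canary traffic share.

    Hashing the user ID keeps the decision deterministic, so a user does
    not bounce between versions on every request.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto a bucket from 0 to 99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "canary" if bucket < canary_percent else "stable"


if __name__ == "__main__":
    for percent in (0, 10, 50, 100):
        hits = sum(
            route_request(f"user-{i}", percent) == "canary" for i in range(1000)
        )
        print(f"{percent}% canary setting: {hits} of 1000 users routed to canary")
```

In this model, a rollout raises canary_percent in steps, and a rollback is simply setting it back to 0.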

Applying Site Reliability Engineering Practices to a Service

Balancing change, velocity, and reliability of the service
- Discovering SLIs (e.g., availability, latency)
- Defining SLOs and understanding SLAs
- Error budgets
- Toil automation
- Opportunity cost of risk and reliability (e.g., number of “nines”)
Managing service lifecycle
- Service management (e.g., introduction of a new service by using a pre-service onboarding checklist, launch plan, or deployment plan, deployment, maintenance, and retirement)
- Capacity planning (e.g., quotas and limits management)
- Autoscaling using managed instance groups, Cloud Run, Cloud Functions, or GKE
- Implementing feedback loops to improve a service
Ensuring healthy communication and collaboration for operations
- Preventing burnout (e.g., setting up automation processes to prevent burnout)
- Fostering a culture of learning and blamelessness
- Establishing joint ownership of services to eliminate team silos
Mitigating incident impact on users
- Communicating during an incident
- Draining/redirecting traffic
- Adding capacity
Conducting a postmortem
- Documenting root causes
- Creating and prioritizing action items
- Communicating the postmortem to stakeholders
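
The error budget and "number of nines" items above come down to simple arithmetic, which is worth being able to do quickly in the exam. A 99.9% availability SLO over a 30-day window leaves 0.1% of 43,200 minutes, or about 43.2 minutes, of allowed downtime. The sketch below works this out for any SLO; the function names are hypothetical, and it assumes a time-based availability SLI rather than a request-based one.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for a time-based availability SLO."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(
    slo_percent: float, downtime_minutes: float, window_days: int = 30
) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.99, 99.999):
        print(f"SLO {slo}%: {error_budget_minutes(slo):.2f} minutes per 30 days")
```

Each extra nine cuts the budget by a factor of ten, which is why higher reliability targets cost progressively more in engineering effort and slower releases.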

Implementing Service Monitoring Strategies

Managing logs
- Collecting structured and unstructured logs from Compute Engine, GKE, and serverless platforms using Cloud Logging
- Configuring the Cloud Logging agent
- Collecting logs from outside Google Cloud
- Sending application logs directly to the Cloud Logging API
- Log levels (e.g., info, error, debug, fatal)
- Optimizing logs (e.g., multiline logging, exceptions, size, cost)
Managing metrics with Cloud Monitoring
- Collecting and analyzing application and platform metrics
- Collecting networking and service mesh metrics
- Using Metrics Explorer for ad hoc metric analysis
- Creating custom metrics from logs
Managing dashboards and alerts in Cloud Monitoring
- Creating a monitoring dashboard
- Filtering and sharing dashboards
- Configuring alerting
- Defining alerting policies based on SLOs and SLIs
- Automating alerting policy definition using Terraform
- Using Google Cloud Managed Service for Prometheus to collect metrics and set up monitoring and alerting
Managing Cloud Logging platform
- Enabling data access logs (e.g., Cloud Audit Logs)
- Enabling VPC Flow Logs
- Viewing logs in the Google Cloud console
- Using basic versus advanced log filters
- Logs exclusion versus logs export
- Project-level versus organization-level export
- Managing and viewing log exports
- Sending logs to an external logging platform
- Filtering and redacting sensitive data (e.g., personally identifiable information [PII], protected health information [PHI])
Implementing logging and monitoring access controls
- Restricting access to audit logs and VPC Flow Logs with Cloud Logging
- Restricting export configuration with Cloud Logging
- Allowing metric and log writing with Cloud Monitoring

Optimizing Service Performance

Identifying service performance issues
- Using Google Cloud’s operations suite to identify cloud resource utilization
- Interpreting service mesh telemetry
- Troubleshooting issues with compute resources
- Troubleshooting deploy time and runtime issues with applications
- Troubleshooting network issues (e.g., VPC Flow Logs, firewall logs, latency, network details)
Implementing debugging tools in Google Cloud
- Application instrumentation
- Cloud Logging
- Cloud Trace
- Error Reporting
- Cloud Profiler
- Cloud Monitoring
Optimizing resource utilization and costs
- Preemptible/Spot virtual machines (VMs)
- Committed-use discounts (e.g., flexible, resource-based)
- Sustained-use discounts
- Network tiers
- Sizing recommendations
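
For the cost items above, it helps to practice comparing options with quick arithmetic. The sketch below estimates a monthly bill for a group of VMs with and without a discount. Every number in it is a placeholder: the hourly price and the discount fractions are hypothetical and are not Google Cloud's published rates, so look up current pricing before drawing conclusions.

```python
# Common approximation: 8,760 hours per year divided by 12 months.
HOURS_PER_MONTH = 730


def monthly_cost(
    hourly_price: float,
    instance_count: int,
    discount_fraction: float = 0.0,
    hours: float = HOURS_PER_MONTH,
) -> float:
    """Estimated monthly cost for identical VMs after a fractional discount."""
    if not 0.0 <= discount_fraction < 1.0:
        raise ValueError("discount_fraction must be in the range [0, 1)")
    return hourly_price * instance_count * hours * (1.0 - discount_fraction)


if __name__ == "__main__":
    price = 0.10  # hypothetical on-demand price per VM-hour
    # Hypothetical discount fractions, for illustration only.
    options = {"on-demand": 0.0, "committed use": 0.3, "Spot": 0.6}
    for label, discount in options.items():
        print(f"{label}: ${monthly_cost(price, 4, discount):.2f} per month")
```

The cheapest line is not automatically the right answer: Spot VMs can be reclaimed at any time, and committed-use discounts only pay off if the workload actually runs for the committed period.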

Recommended Study Materials

Books
