Professional Cloud DevOps Engineers implement processes throughout the systems development lifecycle using Google-recommended methodologies and tools. They build and deploy software and infrastructure delivery pipelines, optimize and maintain production systems and services, and balance service reliability with delivery speed.
The exam is 2 hours long and costs $200.
Exam Content & Outline – What Will You Be Tested On?
The exam tests you on five main capabilities:
- Bootstrapping a Google Cloud organization for DevOps
- Building and implementing CI/CD pipelines for a service
- Applying site reliability engineering practices to a service
- Implementing service monitoring strategies
- Optimizing service performance
Some key technologies/concepts to learn:
- Terraform
- CI/CD
- Kubernetes
- Jenkins
Let’s look at each of the five exam sections in more detail to see exactly what to study for the Google Professional Cloud DevOps Engineer certification.
Bootstrapping a Google Cloud Organization for DevOps
- Designing the overall resource hierarchy for an organization
  - Projects and folders
  - Shared networking
  - Identity and Access Management (IAM) roles and organization-level policies
  - Creating and managing service accounts
- Managing infrastructure as code
  - Infrastructure as code tooling (e.g., Cloud Foundation Toolkit, Config Connector, Terraform, Helm)
  - Making infrastructure changes using Google-recommended practices and infrastructure as code blueprints
  - Immutable architecture
- Designing a CI/CD architecture stack in Google Cloud, hybrid, and multi-cloud environments
  - CI with Cloud Build
  - CD with Google Cloud Deploy
  - Widely used third-party tooling (e.g., Jenkins, Git, ArgoCD, Packer)
  - Security of CI/CD tooling
- Managing multiple environments (e.g., staging, production)
  - Determining the number of environments and their purpose
  - Creating environments dynamically for each feature branch with Google Kubernetes Engine (GKE) and Terraform
  - Anthos Config Management
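Resource hierarchy and IAM inheritance are easier to reason about with a small model. The sketch below is a toy in plain Python, not a call to any Google Cloud API. The node names, the group, and the bindings are hypothetical, and the only behavior it relies on is that roles granted on an organization or folder are inherited by the resources beneath it.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Node:
    """One resource in the hierarchy: organization, folder, or project."""
    name: str
    parent: Optional["Node"] = None
    # member -> roles granted directly on this node (hypothetical data)
    bindings: Dict[str, Set[str]] = field(default_factory=dict)


def effective_roles(node: Node, member: str) -> Set[str]:
    """Union of the roles granted on the node and on every ancestor."""
    roles: Set[str] = set()
    current: Optional[Node] = node
    while current is not None:
        roles |= current.bindings.get(member, set())
        current = current.parent
    return roles


# Hypothetical hierarchy: organization -> folder -> project.
PLATFORM = "group:platform@example.com"
org = Node("example-org", bindings={PLATFORM: {"roles/viewer"}})
prod = Node("prod-folder", parent=org)
shop = Node("shop-prod-project", parent=prod,
            bindings={PLATFORM: {"roles/container.admin"}})

print(sorted(effective_roles(shop, PLATFORM)))
```

The project ends up with both the role granted on it directly and the one inherited from the organization, which is why broad roles granted high in the hierarchy deserve extra scrutiny.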
Building and Implementing CI/CD Pipelines for a Service
- Designing and managing CI/CD pipelines
  - Artifact management with Artifact Registry
  - Deployment to hybrid and multi-cloud environments (e.g., Anthos, GKE)
  - CI/CD pipeline triggers
  - Testing a new application version in the pipeline
  - Configuring deployment processes (e.g., approval flows)
  - CI/CD of serverless applications
- Implementing CI/CD pipelines
  - Auditing and tracking deployments (e.g., Artifact Registry, Cloud Build, Google Cloud Deploy, Cloud Audit Logs)
  - Deployment strategies (e.g., canary, blue/green, rolling, traffic splitting)
  - Rollback strategies
  - Troubleshooting deployment issues
- Managing CI/CD configuration and secrets
  - Secure storage methods and key rotation services (e.g., Cloud Key Management Service, Secret Manager)
  - Secret management
  - Build versus runtime secret injection
- Securing the CI/CD deployment pipeline
  - Vulnerability analysis with Artifact Registry
  - Binary Authorization
  - IAM policies per environment
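To make the canary and rollback items above concrete, here is a minimal sketch of the decision logic in plain Python. It does not use Google Cloud Deploy or any other real tool: `run_canary`, the traffic steps, the error-rate threshold, and the two fake observers are all hypothetical names and values chosen for illustration.

```python
from typing import Callable, List, Tuple


def run_canary(
    steps: List[int],
    observe_error_rate: Callable[[int], float],
    max_error_rate: float,
) -> Tuple[str, int]:
    """Shift traffic to the new version one step at a time.

    Returns ("promoted", last_step) if every step stays at or below the
    threshold, or ("rolled_back", 0) as soon as one step exceeds it.
    """
    if not steps:
        raise ValueError("steps must contain at least one traffic percentage")
    for percent in steps:
        if observe_error_rate(percent) > max_error_rate:
            # All traffic goes back to the previous version.
            return ("rolled_back", 0)
    return ("promoted", steps[-1])


def healthy(percent: int) -> float:
    """Fake observer: the new version always has a 0.1% error rate."""
    return 0.001


def degrading(percent: int) -> float:
    """Fake observer: errors jump to 5% once half the traffic is shifted."""
    return 0.001 if percent < 50 else 0.05


print(run_canary([5, 25, 50, 100], healthy, max_error_rate=0.01))
print(run_canary([5, 25, 50, 100], degrading, max_error_rate=0.01))
```

In a real pipeline the observer would query a monitoring system, and each step would wait long enough to collect meaningful data before moving on.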
Applying Site Reliability Engineering Practices to a Service
- Balancing change, velocity, and reliability of the service
  - Discovering SLIs (e.g., availability, latency)
  - Defining SLOs and understanding SLAs
  - Error budgets
  - Toil automation
  - Opportunity cost of risk and reliability (e.g., number of “nines”)
- Managing service lifecycle
  - Service management (e.g., introducing a new service by using a pre-service onboarding checklist, launch plan, or deployment plan; deployment; maintenance; and retirement)
  - Capacity planning (e.g., quotas and limits management)
  - Autoscaling using managed instance groups, Cloud Run, Cloud Functions, or GKE
  - Implementing feedback loops to improve a service
- Ensuring healthy communication and collaboration for operations
  - Preventing burnout (e.g., setting up automation processes)
  - Fostering a culture of learning and blamelessness
  - Establishing joint ownership of services to eliminate team silos
- Mitigating incident impact on users
  - Communicating during an incident
  - Draining/redirecting traffic
  - Adding capacity
- Conducting a postmortem
  - Documenting root causes
  - Creating and prioritizing action items
  - Communicating the postmortem to stakeholders
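Error budgets and the cost of each extra “nine” come down to simple arithmetic, so a worked example helps. The sketch below assumes a time-based availability SLO measured over a rolling window. The function names and the 30-day window are illustrative choices, not something the exam guide prescribes.

```python
def error_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Minutes of full unavailability an availability SLO allows in the window."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100, exclusive")
    window_minutes = window_days * 24 * 60
    return (100 - slo_percent) / 100 * window_minutes


def budget_remaining(slo_percent: float, window_days: int,
                     downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return 1 - downtime_minutes / budget


for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% over 30 days: {error_budget_minutes(slo, 30):.2f} minutes")
```

A 99.9% SLO over 30 days allows about 43.2 minutes of downtime, and each additional nine cuts that budget by a factor of ten. That is the trade-off behind the “opportunity cost of risk and reliability” item above.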
Implementing Service Monitoring Strategies
- Managing logs
  - Collecting structured and unstructured logs from Compute Engine, GKE, and serverless platforms using Cloud Logging
  - Configuring the Cloud Logging agent
  - Collecting logs from outside Google Cloud
  - Sending application logs directly to the Cloud Logging API
  - Log levels (e.g., info, error, debug, fatal)
  - Optimizing logs (e.g., multiline logging, exceptions, size, cost)
- Managing metrics with Cloud Monitoring
  - Collecting and analyzing application and platform metrics
  - Collecting networking and service mesh metrics
  - Using Metrics Explorer for ad hoc metric analysis
  - Creating custom metrics from logs
- Managing dashboards and alerts in Cloud Monitoring
  - Creating a monitoring dashboard
  - Filtering and sharing dashboards
  - Configuring alerting
  - Defining alerting policies based on SLOs and SLIs
  - Automating alerting policy definition using Terraform
  - Using Google Cloud Managed Service for Prometheus to collect metrics and set up monitoring and alerting
- Managing Cloud Logging platform
  - Enabling data access logs (e.g., Cloud Audit Logs)
  - Enabling VPC Flow Logs
  - Viewing logs in the Google Cloud console
  - Using basic versus advanced log filters
  - Logs exclusion versus logs export
  - Project-level versus organization-level export
  - Managing and viewing log exports
  - Sending logs to an external logging platform
  - Filtering and redacting sensitive data (e.g., personally identifiable information [PII], protected health information [PHI])
- Implementing logging and monitoring access controls
  - Restricting access to audit logs and VPC Flow Logs with Cloud Logging
  - Restricting export configuration with Cloud Logging
  - Allowing metric and log writing with Cloud Monitoring
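Structured logging and redaction of sensitive data can be illustrated with the standard library alone. The sketch below builds one JSON log line with `severity` and `message` fields, on the assumption that your runtime's logging agent parses JSON written to standard output. The email pattern, the placeholder text, and the `structured_entry` helper are hypothetical and deliberately simple. Real PII detection needs far more than one regular expression.

```python
import json
import re

# Deliberately simple pattern for illustration; not a complete PII detector.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# A subset of severity names, chosen for this sketch.
SEVERITIES = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}


def redact(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_PATTERN.sub("[REDACTED_EMAIL]", text)


def structured_entry(severity: str, message: str, **fields: object) -> str:
    """Build one JSON log line with redacted message and extra fields."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    entry = {"severity": severity, "message": redact(message)}
    for key, value in fields.items():
        entry[key] = redact(str(value))
    return json.dumps(entry, sort_keys=True)


# Printing to stdout is enough when an agent collects container output.
print(structured_entry("ERROR", "payment failed for alice@example.com",
                       user="alice@example.com", order_id="A-1001"))
```

Redacting before the entry leaves the process means the sensitive value never reaches the logging backend, which is simpler than filtering it out of exports later.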
Optimizing Service Performance
- Identifying service performance issues
  - Using Google Cloud’s operations suite to identify cloud resource utilization
  - Interpreting service mesh telemetry
  - Troubleshooting issues with compute resources
  - Troubleshooting deploy time and runtime issues with applications
  - Troubleshooting network issues (e.g., VPC Flow Logs, firewall logs, latency, network details)
- Implementing debugging tools in Google Cloud
  - Application instrumentation
  - Cloud Logging
  - Cloud Trace
  - Error Reporting
  - Cloud Profiler
  - Cloud Monitoring
- Optimizing resource utilization and costs
  - Preemptible/Spot virtual machines (VMs)
  - Committed-use discounts (e.g., flexible, resource-based)
  - Sustained-use discounts
  - Network tiers
  - Sizing recommendations
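When you weigh a committed-use discount against on-demand pricing, the key question is how busy the resource will be. The sketch below works through that arithmetic under a simplified model: a commitment is billed for every hour of the month at a discounted rate, while on-demand is billed only for hours used. The hourly price, the 30% discount, and the 730-hour month are hypothetical numbers, not published Google Cloud rates.

```python
from typing import Tuple


def monthly_costs(on_demand_hourly: float, hours_used: float,
                  hours_in_month: float, commit_discount: float) -> Tuple[float, float]:
    """Return (on_demand_cost, committed_cost) for one month.

    On-demand is paid only for hours used; the commitment is paid for
    every hour in the month at the discounted rate.
    """
    on_demand = on_demand_hourly * hours_used
    committed = on_demand_hourly * (1 - commit_discount) * hours_in_month
    return on_demand, committed


def break_even_utilization(commit_discount: float) -> float:
    """Utilization above which the commitment is cheaper than on-demand."""
    return 1 - commit_discount


# Hypothetical inputs: $1.00 per hour, 30% discount, 730-hour month.
for used in (365, 730):
    on_demand, committed = monthly_costs(1.00, used, 730, 0.30)
    print(f"{used} h used: on-demand ${on_demand:.2f}, committed ${committed:.2f}")
print(f"break-even utilization: {break_even_utilization(0.30):.0%}")
```

Under these assumptions the commitment only pays off when the resource runs more than about 70% of the time, so workloads that are idle half the month are better left on demand or moved to Spot VMs.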
Recommended Study Materials
- Books
  - The DevOps Handbook: How to Create World-Class Agility, Reliability, & Security in Technology Organizations
  - Google Cloud for DevOps Engineers: A practical guide to SRE and achieving Google’s Professional Cloud DevOps Engineer certification
  - Google Professional Cloud DevOps Engineer Preparation NEW & Exclusive Version: Pass Your Exam on your first Try