April 29, 2024

Data Engineer

Professional Data Engineers enable data-driven decision making by collecting, transforming, and publishing data. A Data Engineer should be able to design, build, operationalize, secure, and monitor data processing systems with a particular emphasis on security and compliance; scalability and efficiency; reliability and fidelity; and flexibility and portability. A Data Engineer should also be able to leverage, deploy, and continuously train pre-existing machine learning models.

The exam is 2 hours long and costs $200.

Exam Content & Outline – What Will You Be Tested On?

There are FOUR main capabilities that the exam will test you on:

    1. Designing data processing systems
    1. Building and operationalizing data processing systems
    1. Operationalizing machine learning models
    1. Ensuring solution quality

Let’s look at each of these in more detail and find out exactly what to study in order to be certified as a Google Professional Data Engineer.

Designing Data Processing Systems

This section is all about the planning phase: what technologies will you choose to build your data engineering projects? GCP and the wider open-source ecosystem offer a wide range of tools you can use to store and process your data in any way you want. The key is to learn the pros and cons of each technology and when to use which one to create a winning solution.

    1. Selecting the appropriate storage technologies
        • Mapping storage systems to business requirements
        • Data modeling
        • Trade-offs involving latency, throughput, transactions
        • Distributed systems
        • Schema design
    1. Designing data pipelines
        • Data publishing and visualization (e.g. BigQuery)
        • Batch and streaming data (e.g. Dataflow, Dataproc, Apache Beam, Apache Spark/Hadoop, Pub/Sub, Apache Kafka; a minimal Apache Beam sketch follows this outline)
        • Online (interactive) vs. batch predictions
        • Job automation and orchestration (e.g. Cloud Composer)
    1. Designing a data processing solution
        • Choice of infrastructure
        • System availability and fault tolerance
        • Use of distributed systems
        • Capacity planning
        • Hybrid cloud and edge computing
        • Architecture options (e.g. message brokers, message queues, middleware, service-oriented architecture, serverless functions)
        • Event processing guarantees (at-least-once, in-order, exactly-once, etc.)
    1. Migrating data warehousing and data processing
        • Awareness of current state and how to migrate a design to a future state
        • Migrating from on-premises to cloud (Data Transfer Service, Transfer Appliance, Cloud Networking)
        • Validating a migration
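
To make the pipeline topics above more concrete, here is a minimal sketch of what a streaming pipeline might look like with the Apache Beam Python SDK: it reads events from Pub/Sub, windows them, and writes per-user counts to BigQuery. It is illustrative only; the project, topic, and table names are placeholders, and a real Dataflow job would also need runner options, error handling, and schema management.

```python
# A minimal, hypothetical Apache Beam streaming pipeline: read JSON events
# from Pub/Sub, window them, and write per-user counts to BigQuery.
# Project, topic, and table names below are placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

options = PipelineOptions(streaming=True)
# To run on Dataflow rather than locally, you would also set the runner,
# project, region, and a temp_location in these options.

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
        | "ParseJson" >> beam.Map(json.loads)
        | "KeyByUser" >> beam.Map(lambda e: (e["user_id"], 1))
        | "Window" >> beam.WindowInto(FixedWindows(60))  # 1-minute fixed windows
        | "CountPerUser" >> beam.CombinePerKey(sum)
        | "ToRow" >> beam.Map(lambda kv: {"user_id": kv[0], "event_count": kv[1]})
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-project:analytics.user_event_counts",
            schema="user_id:STRING,event_count:INTEGER",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```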

Building and Operationalizing Data Processing Systems

This section builds on the previous one: you take the design document, get your hands dirty, and turn the blueprint into a tangible, operational data processing system. Make sure you have hands-on experience building these on GCP.

    1. Building and operationalizing storage systems
        • Effective use of managed services (Cloud Bigtable, Cloud Spanner, Cloud SQL, BigQuery, Cloud Storage, Datastore, Memorystore)
        • Storage costs and performance
        • Life cycle management of data
    1. Building and operationalizing pipelines
        • Data cleansing
        • Batch and streaming
        • Transformation
        • Data acquisition and import (a short BigQuery load sketch follows this outline)
        • Integrating with new data sources
    1. Building and operationalizing processing infrastructure
        • Provisioning resources
        • Monitoring pipelines
        • Adjusting pipelines
        • Testing and quality control
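
As a small taste of the data acquisition and import topic above, here is a minimal sketch of a batch import with the google-cloud-bigquery Python client: it loads CSV exports from Cloud Storage into a BigQuery table. The bucket, dataset, and table names are placeholders, and the schema is auto-detected purely to keep the example short.

```python
# A minimal, hypothetical batch import: load CSV files from Cloud Storage
# into a BigQuery table. Bucket, dataset, and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()  # uses Application Default Credentials

table_id = "my-project.analytics.raw_orders"        # placeholder table
source_uri = "gs://my-bucket/exports/orders_*.csv"  # placeholder files

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the CSV header row
    autodetect=True,      # let BigQuery infer the schema
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(source_uri, table_id, job_config=job_config)
load_job.result()  # block until the load job finishes

table = client.get_table(table_id)
print(f"Loaded {table.num_rows} rows into {table_id}")
```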

Operationalizing Machine Learning Models

Machine learning can be applied to your data once it has been collected and processed, turning that data into models you can put to work. GCP offers a host of ML (machine learning) tools that you can run over your data. You will need to know what these tools are and how to use them to create practical value out of the data you have collected. For a quick introduction to machine learning, a good place to start is understanding the difference between AI and ML.

    1. Leveraging pre-built ML models as a service
        • ML APIs (e.g. Vision API, Speech API; a short Vision API sketch follows this outline)
        • Customizing ML APIs (e.g. AutoML Vision, AutoML Text)
        • Conversational experiences (e.g. Dialogflow)
    1. Deploying an ML pipeline
        • Ingesting appropriate data
        • Retraining of machine learning models (AI Platform Prediction and Training, BigQuery ML, Kubeflow, Spark ML)
        • Continuous evaluation
    1. Choosing the appropriate training and serving infrastructure
        • Distributed vs. single machine
        • Use of edge compute
        • Hardware accelerators (e.g. GPU, TPU)
    1. Measuring, monitoring, and troubleshooting machine learning models
        • Machine learning terminology (e.g. features, labels, models, regression, classification, recommendation, supervised vs unsupervised learning, evaluation metrics)
        • Impact of dependencies of machine learning models
        • Common sources of error (e.g. assumptions about data)
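
To show how little code a pre-built ML API needs, here is a minimal sketch of label detection with the Cloud Vision API Python client. The image path is a placeholder; a production setup would batch requests and handle errors and quotas.

```python
# A minimal, hypothetical call to a pre-built ML API: label detection with
# the Cloud Vision API. The Cloud Storage image URI is a placeholder.
from google.cloud import vision

client = vision.ImageAnnotatorClient()

# Point the API at an image stored in Cloud Storage.
image = vision.Image(source=vision.ImageSource(image_uri="gs://my-bucket/photos/sample.jpg"))

response = client.label_detection(image=image)
for label in response.label_annotations:
    print(f"{label.description}: {label.score:.2f}")
```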

Ensuring Solution Quality

Your project won’t produce exceptional results if you don’t apply techniques to fine-tune the efficiency of your systems. This section makes sure you know how to harden your systems with leading security practices while designing them so they can grow smoothly within a constantly changing environment.

    1. Designing for security and compliance
        • Identity and access management (e.g. Cloud IAM; a short IAM sketch follows this outline)
        • Data security (encryption, key management)
        • Ensuring privacy (e.g. DLP API)
        • Legal compliance (e.g. HIPAA, COPPA, FedRAMP, GDPR)
    1. Ensuring scalability and efficiency
        • Building and running test suites
        • Pipeline monitoring (e.g. Cloud Monitoring)
        • Assessing, troubleshooting, and improving data representations and data processing infrastructure
        • Resizing and autoscaling resources
    1. Ensuring reliability and fidelity
        • Performing data preparation and quality control (e.g. Dataprep)
        • Verification and monitoring
        • Planning, executing, and stress testing data recovery (fault tolerance, rerunning failed jobs, performing retrospective re-analysis)
        • Choosing between ACID, idempotent, eventually consistent requirements
    1. Ensuring flexibility and portability
        • Mapping to current and future business requirements
        • Designing for data and application portability (e.g. multicloud, data residency requirements)
        • Data staging, cataloging, and discovery
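
For the identity and access management topic above, here is a minimal sketch of granting a service account read-only access to a Cloud Storage bucket through its IAM policy, using the google-cloud-storage Python client. The bucket and service account names are placeholders; in a real project you would usually manage such bindings at the project level or with infrastructure-as-code.

```python
# A minimal, hypothetical access-management sketch: grant a service account
# read-only access to a Cloud Storage bucket via its IAM policy.
# Bucket and service account names are placeholders.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-analytics-bucket")  # placeholder bucket

policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {
        "role": "roles/storage.objectViewer",
        "members": {"serviceAccount:etl-job@my-project.iam.gserviceaccount.com"},
    }
)
bucket.set_iam_policy(policy)

# Print the resulting bindings to verify the change.
for binding in bucket.get_iam_policy(requested_policy_version=3).bindings:
    print(binding["role"], sorted(binding["members"]))
```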

Recommended Study Materials

    1. Books