Data Engineer Practice Questions

Question #1

Your infrastructure team has set up an interconnect link between Google Cloud and the on-premises network. You are designing a high-throughput streaming pipeline to ingest data from an Apache Kafka cluster hosted on-premises. You want to store the data in BigQuery with as little latency as possible. What should you do?

  1. Use Dataflow, write a pipeline that reads the data from Kafka, and writes the data to BigQuery.
  2. Use a proxy host in the VPC in Google Cloud connecting to Kafka. Write a Dataflow pipeline, read data from the proxy host, and write the data to BigQuery.
  3. Set up a Kafka Connect bridge between Kafka and Pub/Sub. Use a Google-provided Dataflow template to read the data from Pub/Sub, and write the data to BigQuery.
  4. Set up a Kafka Connect bridge between Kafka and Pub/Sub. Write a Dataflow pipeline, read the data from Pub/Sub, and write the data to BigQuery.

 

Question #2

You are designing a fault-tolerant architecture to store data in a regional BigQuery dataset. You need to ensure that your application is able to recover from a corruption event in your tables that occurred within the past seven days. You want to adopt a managed service that offers the lowest RPO and is the most cost-effective. What should you do?

  1. Create a BigQuery table snapshot on a daily basis.
  2. Migrate your data to multi-region BigQuery buckets.
  3. Access historical data by using time travel in BigQuery.
  4. Export the data from BigQuery into a new table that excludes the corrupted data.
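
For reference, options 1 and 3 correspond to two different kinds of BigQuery statements. The sketch below only builds the SQL strings and does not call BigQuery. The helper names and table names are hypothetical and are not part of any Google library, and the seven-day limit assumes the dataset uses the default time travel window.

```python
def time_travel_query(table: str, hours_ago: int) -> str:
    """Build a query that reads `table` as it existed `hours_ago` hours back.

    Assumes the default time travel window of seven days (hypothetical helper).
    """
    if not 0 <= hours_ago <= 7 * 24:
        raise ValueError("hours_ago must fall within the seven-day window")
    return (
        f"SELECT * FROM `{table}` "
        "FOR SYSTEM_TIME AS OF "
        f"TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {hours_ago} HOUR)"
    )


def snapshot_statement(source: str, snapshot: str) -> str:
    """Build the DDL that creates a table snapshot of `source` (hypothetical helper)."""
    return f"CREATE SNAPSHOT TABLE `{snapshot}` CLONE `{source}`"


# Hypothetical table names, used only for illustration.
print(time_travel_query("my_dataset.orders", 3))
print(snapshot_statement("my_dataset.orders", "my_dataset.orders_snapshot"))
```

A snapshot captures the table at the moment the statement runs, so a daily schedule leaves up to a day between recovery points, whereas a time travel query can name any timestamp inside the window.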

 

Question #3

You are migrating your on-premises data warehouse to BigQuery. One of the upstream data sources resides on a MySQL database that runs in your on-premises data center with no public IP addresses. You want to ensure that the data ingestion into BigQuery is done securely and does not go through the public internet. What should you do?

  1. Use Datastream to replicate data from your on-premises MySQL database to BigQuery. Gather Datastream public IP addresses of the Google Cloud region that will be used to set up the stream. Add those IP addresses to the firewall allowlist of your on-premises data center. Use “IP Allowlisting” as the connectivity method and “Server-only” as the encryption type when setting up the connection profile in Datastream.
  2. Use Datastream to replicate data from your on-premises MySQL database to BigQuery. Set up Cloud Interconnect between your on-premises data center and Google Cloud. Use “Private connectivity” as the connectivity method and allocate an IP address range within your VPC network to the Datastream connectivity configuration. Use “Server-only” as the encryption type when setting up the connection profile in Datastream.
  3. Update your existing on-premises ETL tool to write to BigQuery by using the BigQuery Open Database Connectivity (ODBC) driver. Set up the “proxy” parameter in the “simba.googlebigqueryodbc.ini” file to point to your data center’s NAT gateway.
  4. Use Datastream to replicate data from your on-premises MySQL database to BigQuery. Use “Forward-SSH tunnel” as the connectivity method to establish a secure tunnel between Datastream and your on-premises MySQL database through a tunnel server in your on-premises data center. Use “None” as the encryption type when setting up the connection profile in Datastream.

 

Question #4

You recently deployed several data processing jobs into your Cloud Composer 2 environment. You notice that some tasks are failing in Apache Airflow. On the monitoring dashboard, you see an increase in total worker memory usage, and there were worker pod evictions. You need to resolve these errors. What should you do? Choose 2 answers.

  1. Increase the memory available to the Airflow workers.
  2. Increase the Cloud Composer 2 environment size from medium to large.
  3. Increase the directed acyclic graph (DAG) file parsing interval.
  4. Increase the memory available to the Airflow triggerer.
  5. Increase the maximum number of workers and reduce worker concurrency.
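
The relationship between worker memory, worker concurrency, and the memory headroom each task gets can be illustrated with simple arithmetic. This is a rough sketch with hypothetical numbers. It assumes memory is split evenly across concurrent task slots, which real Airflow tasks do not guarantee, and the function is not part of Airflow or Cloud Composer.

```python
def memory_per_task_gb(worker_memory_gb: float, worker_concurrency: int) -> float:
    """Estimate the memory available to each concurrent task on one worker.

    Simplifying assumption: every task slot gets an equal share of the
    worker's memory (hypothetical helper, not an Airflow API).
    """
    if worker_concurrency < 1:
        raise ValueError("worker_concurrency must be at least 1")
    return worker_memory_gb / worker_concurrency


# Hypothetical settings: a 4 GB worker running 16 tasks at once.
print(memory_per_task_gb(4.0, 16))

# Hypothetical settings: an 8 GB worker running 8 tasks at once.
print(memory_per_task_gb(8.0, 8))
```

Under this assumption, either raising worker memory or lowering concurrency increases the share available to each task, and adding workers keeps the total number of task slots from shrinking when concurrency is reduced.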

 

Question #5

You are part of a healthcare organization where data is organized and managed by respective data owners in various storage services. As a result of this decentralized ecosystem, discovering and managing data has become difficult. You need to quickly identify and implement a cost-optimized solution to assist your organization with the following:

  • Data management and discovery
  • Data lineage tracking
  • Data quality validation

How should you build the solution?

  1. Use Dataplex to manage data, track data lineage, and perform data quality validation.
  2. Build a new data discovery tool on Google Kubernetes Engine that helps with new source onboarding and data lineage tracking.
  3. Use BigLake to convert the current solution into a data lake architecture.
  4. Use BigQuery to track data lineage, and use Dataprep to manage data and perform data quality validation.