
Google Professional-Data-Engineer - Google Professional Data Engineer Exam


You are designing a data warehouse in BigQuery to analyze sales data for a telecommunications service provider. You need to create a data model for customers, products, and subscriptions. All customers, products, and subscriptions can be updated monthly, but you must maintain a historical record of all data. You plan to use the visualization layer for current and historical reporting. You need to ensure that the data model is simple, easy to use, and cost-effective. What should you do?

A.

Create a normalized model with tables for each entity. Use snapshots before updates to track historical data.

B.

Create a normalized model with tables for each entity. Keep all input files in a Cloud Storage bucket to track historical data.

C.

Create a denormalized model with nested and repeated fields. Update the table and use snapshots to track historical data.

D.

Create a denormalized, append-only model with nested and repeated fields. Use the ingestion timestamp to track historical data.
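For illustration, a minimal sketch of the append-only, denormalized pattern described in option D, using the Python BigQuery client. The project, dataset, table, and field names are assumptions, not part of the question.

```python
from google.cloud import bigquery

client = bigquery.Client()

# One wide, denormalized customer table; subscriptions (and their products)
# are nested, repeated records rather than separately joined tables.
schema = [
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("customer_name", "STRING"),
    bigquery.SchemaField(
        "subscriptions",
        "RECORD",
        mode="REPEATED",
        fields=[
            bigquery.SchemaField("product_id", "STRING"),
            bigquery.SchemaField("plan", "STRING"),
            bigquery.SchemaField("monthly_fee", "NUMERIC"),
        ],
    ),
    # Rows are only ever appended; this column marks each monthly version.
    bigquery.SchemaField("ingestion_time", "TIMESTAMP"),
]

table = bigquery.Table("my-project.sales_dw.customers", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(field="ingestion_time")
client.create_table(table)

# "Current" reporting is the newest row per customer; historical reporting
# simply filters on ingestion_time instead.
current_sql = """
SELECT * EXCEPT(rn)
FROM (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY customer_id ORDER BY ingestion_time DESC) AS rn
  FROM `my-project.sales_dw.customers`)
WHERE rn = 1
"""
```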

You are using Cloud Bigtable to persist and serve stock market data for each of the major indices. To serve the trading application, you need to access only the most recent stock prices that are streaming in. How should you design your row key and tables to ensure that you can access the data with the simplest query?

A.

Create one unique table for all of the indices, and then use the index and timestamp as the row key design.

B.

Create one unique table for all of the indices, and then use a reverse timestamp as the row key design.

C.

For each index, have a separate table and use a timestamp as the row key design.

D.

For each index, have a separate table and use a reverse timestamp as the row key design.
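For illustration, a minimal sketch of the reverse-timestamp row key pattern referenced in options B and D, assuming a per-index table named prices_sp500 in an instance called market-data with an existing column family "quotes"; all identifiers are assumptions.

```python
import datetime
import sys

from google.cloud import bigtable

REVERSAL_BASE = sys.maxsize  # subtracting from this makes newer rows sort first


def reverse_ts_key(dt: datetime.datetime) -> bytes:
    """Row key that sorts the most recent quote to the top of the table."""
    return f"{REVERSAL_BASE - int(dt.timestamp() * 1000):019d}".encode()


client = bigtable.Client(project="my-project")
table = client.instance("market-data").table("prices_sp500")

# Write the newest quote under a key that sorts before all older quotes.
row = table.direct_row(reverse_ts_key(datetime.datetime.utcnow()))
row.set_cell("quotes", "price", b"5321.18")
row.commit()

# The "most recent price" query is then the simplest possible scan: one row.
latest = next(iter(table.read_rows(limit=1)), None)
```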

You are building an application to share financial market data with consumers, who will receive data feeds. Data is collected from the markets in real time. Consumers will receive the data in the following ways:

    Real-time event stream

    ANSI SQL access to real-time stream and historical data

    Batch historical exports

Which solution should you use?

A.

Cloud Dataflow, Cloud SQL, Cloud Spanner

B.

Cloud Pub/Sub, Cloud Storage, BigQuery

C.

Cloud Dataproc, Cloud Dataflow, BigQuery

D.

Cloud Pub/Sub, Cloud Dataproc, Cloud SQL
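For illustration, a minimal sketch of the ingest side of the Cloud Pub/Sub, Cloud Storage, and BigQuery combination in option B; the project and topic names are assumptions. Pub/Sub provides the real-time event stream, while the same messages can be landed in BigQuery for ANSI SQL over streaming and historical data and exported in batch to Cloud Storage.

```python
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "market-ticks")

# Each market tick is published once; real-time consumers subscribe to the
# topic, and separate pipelines write the same ticks to BigQuery and
# Cloud Storage for SQL access and batch exports.
tick = {"symbol": "GOOG", "price": 182.44, "ts": "2024-05-01T14:03:21Z"}
future = publisher.publish(topic_path, json.dumps(tick).encode("utf-8"))
print("published message id:", future.result())
```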

You are designing storage for very large text files for a data pipeline on Google Cloud. You want to support ANSI SQL queries. You also want to support compression and parallel load from the input locations using Google recommended practices. What should you do?

A.

Transform text files to compressed Avro using Cloud Dataflow. Use BigQuery for storage and query.

B.

Transform text files to compressed Avro using Cloud Dataflow. Use Cloud Storage and BigQuery permanent linked tables for query.

C.

Compress text files to gzip using the Grid Computing Tools. Use BigQuery for storage and query.

D.

Compress text files to gzip using the Grid Computing Tools. Use Cloud Storage, and then import into Cloud Bigtable for query.
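For illustration, a minimal sketch of the load step behind options A and B, assuming a Cloud Dataflow job has already written compressed Avro files to gs://my-bucket/avro/; the bucket, dataset, and table names are assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)

# Avro files are splittable, so BigQuery loads the input locations in
# parallel, and block-level compression (e.g. Snappy) is handled for you.
load_job = client.load_table_from_uri(
    "gs://my-bucket/avro/*.avro",
    "my-project.pipeline_ds.large_text",
    job_config=job_config,
)
load_job.result()  # wait for the load to complete
```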

You have several Spark jobs that run on a Cloud Dataproc cluster on a schedule. Some of the jobs run in sequence, and some of the jobs run concurrently. You need to automate this process. What should you do?

A.

Create a Cloud Dataproc Workflow Template.

B.

Create an initialization action to execute the jobs.

C.

Create a Directed Acyclic Graph in Cloud Composer.

D.

Create a Bash script that uses the Cloud SDK to create a cluster, execute jobs, and then tear down the cluster.
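For illustration, a minimal sketch of the approach in option A, instantiating an inline Dataproc workflow template from Python so that jobs with prerequisites run in sequence and the rest run concurrently on an ephemeral cluster. The project, region, cluster name, and jar paths are assumptions.

```python
from google.cloud import dataproc_v1

region = "us-central1"
client = dataproc_v1.WorkflowTemplateServiceClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

template = {
    "placement": {
        "managed_cluster": {
            "cluster_name": "spark-ephemeral",
            "config": {"gce_cluster_config": {"zone_uri": "us-central1-a"}},
        }
    },
    "jobs": [
        {
            "step_id": "step-1",
            "spark_job": {"main_jar_file_uri": "gs://my-bucket/jobs/prepare.jar"},
        },
        {
            "step_id": "step-2",
            "spark_job": {"main_jar_file_uri": "gs://my-bucket/jobs/aggregate.jar"},
            # Sequencing: step-2 waits for step-1; jobs without prerequisites
            # run concurrently on the same managed cluster.
            "prerequisite_step_ids": ["step-1"],
        },
    ],
}

operation = client.instantiate_inline_workflow_template(
    request={"parent": f"projects/my-project/regions/{region}", "template": template}
)
operation.result()  # cluster is created, jobs run in order, cluster is deleted
```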

Your analytics team wants to build a simple statistical model to determine which customers are most likely to work with your company again, based on a few different metrics. They want to run the model on Apache Spark, using data housed in Google Cloud Storage, and you have recommended using Google Cloud Dataproc to execute this job. Testing has shown that this workload can run in approximately 30 minutes on a 15-node cluster, outputting the results into Google BigQuery. The plan is to run this workload weekly. How should you optimize the cluster for cost?

A.

Migrate the workload to Google Cloud Dataflow.

B.

Use preemptible virtual machines (VMs) for the cluster.

C.

Use a higher-memory node so that the job runs faster.

D.

Use SSDs on the worker nodes so that the job can run faster.
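For illustration, a minimal sketch of the approach in option B, creating the weekly cluster with preemptible secondary workers; the project, region, machine types, and node counts are assumptions chosen to approximate the tested 15-node size.

```python
from google.cloud import dataproc_v1

region = "us-central1"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": "my-project",
    "cluster_name": "weekly-spark",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 3, "machine_type_uri": "n1-standard-4"},
        # Preemptible secondary workers cost a fraction of standard VMs and are
        # acceptable here because the workload is short, weekly, and restartable.
        "secondary_worker_config": {
            "num_instances": 12,
            "preemptibility": dataproc_v1.InstanceGroupConfig.Preemptibility.PREEMPTIBLE,
        },
    },
}

operation = client.create_cluster(
    request={"project_id": "my-project", "region": region, "cluster": cluster}
)
operation.result()
```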

A shipping company has live package-tracking data that is sent to an Apache Kafka stream in real time. This is then loaded into BigQuery. Analysts in your company want to query the tracking data in BigQuery to analyze geospatial trends in the lifecycle of a package. The table was originally created with ingest-date partitioning. Over time, the query processing time has increased. You need to implement a change that would improve query performance in BigQuery. What should you do?

A.

Implement clustering in BigQuery on the ingest date column.

B.

Implement clustering in BigQuery on the package-tracking ID column.

C.

Tier older data onto Cloud Storage files, and leverage extended tables.

D.

Re-create the table using data partitioning on the package delivery date.
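For illustration, a minimal sketch of the re-creation described in option D (with option B's clustering added as well), assuming the dataset is named logistics, delivery_date is a TIMESTAMP column, and tracking_id identifies a package; all names are assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Partition on the delivery date so lifecycle queries prune partitions,
# and cluster on the tracking ID to co-locate rows for the same package.
ddl = """
CREATE TABLE `my-project.logistics.package_tracking_v2`
PARTITION BY DATE(delivery_date)
CLUSTER BY tracking_id AS
SELECT * FROM `my-project.logistics.package_tracking`
"""
client.query(ddl).result()
```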

You have Cloud Functions written in Node.js that pull messages from Cloud Pub/Sub and send the data to BigQuery. You observe that the message processing rate on the Pub/Sub topic is orders of magnitude higher than anticipated, but there is no error logged in Stackdriver Log Viewer. What are the two most likely causes of this problem? Choose 2 answers.

A.

Publisher throughput quota is too small.

B.

Total outstanding messages exceed the 10-MB maximum.

C.

Error handling in the subscriber code is not handling run-time errors properly.

D.

The subscriber code cannot keep up with the messages.

E.

The subscriber code does not acknowledge the messages that it pulls.
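For illustration, a minimal sketch (in Python rather than Node.js) of the subscriber-side behavior behind options C and E: a message that is never acknowledged, or whose handler swallows a runtime error, is redelivered, which inflates the observed processing rate without producing error logs. The subscription, dataset, and table names are assumptions.

```python
import json

from google.cloud import bigquery, pubsub_v1

bq = bigquery.Client()
subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path("my-project", "ticks-to-bq")


def callback(message):
    try:
        row = json.loads(message.data)
        errors = bq.insert_rows_json("my-project.feeds.ticks", [row])
        if errors:
            raise RuntimeError(errors)
    except Exception:
        # Handle or log the failure; nacking makes redelivery explicit
        # instead of silently waiting for the ack deadline to expire.
        message.nack()
    else:
        message.ack()  # without this, Pub/Sub keeps redelivering the message


streaming_pull = subscriber.subscribe(sub_path, callback=callback)
# Keep the process alive so the stream keeps pulling, e.g. streaming_pull.result()
```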

You created an analytics environment on Google Cloud so that your data scientist team can explore data without impacting the on-premises Apache Hadoop solution. The data in the on-premises Hadoop Distributed File System (HDFS) cluster is in Optimized Row Columnar (ORC) formatted files with multiple columns of Hive partitioning. The data scientist team needs to be able to explore the data in a way similar to how they used the on-premises HDFS cluster, using SQL on the Hive query engine. You need to choose the most cost-effective storage and processing solution. What should you do?

A.

Import the ORC files to Bigtable tables for the data scientist team.

B.

Import the ORC files to BigQuery tables for the data scientist team.

C.

Copy the ORC files to Cloud Storage, then deploy a Dataproc cluster for the data scientist team.

D.

Copy the ORC files to Cloud Storage, then create external BigQuery tables for the data scientist team.
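For illustration, a minimal sketch of the approach in option D, defining a BigQuery external table over ORC files copied to Cloud Storage with their Hive partition layout preserved; the bucket, prefix, dataset, and table names are assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("ORC")
external_config.source_uris = ["gs://my-bucket/warehouse/orders/*"]

# Preserve the multi-column Hive partitioning from the on-premises cluster.
hive_opts = bigquery.external_config.HivePartitioningOptions()
hive_opts.mode = "AUTO"
hive_opts.source_uri_prefix = "gs://my-bucket/warehouse/orders"
external_config.hive_partitioning = hive_opts

table = bigquery.Table("my-project.analytics.orders_external")
table.external_data_configuration = external_config
client.create_table(table)  # SQL via BigQuery; the data stays in Cloud Storage
```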

You are designing a cloud-native historical data processing system to meet the following conditions:

    The data being analyzed is in CSV, Avro, and PDF formats and will be accessed by multiple analysis tools including Cloud Dataproc, BigQuery, and Compute Engine.

    A streaming data pipeline stores new data daily.

    Performance is not a factor in the solution.

    The solution design should maximize availability.

How should you design data storage for this solution?

A.

Create a Cloud Dataproc cluster with high availability. Store the data in HDFS, and perform analysis as needed.

B.

Store the data in BigQuery. Access the data using the BigQuery Connector or Cloud Dataproc and Compute Engine.

C.

Store the data in a regional Cloud Storage bucket. Access the bucket directly using Cloud Dataproc, BigQuery, and Compute Engine.

D.

Store the data in a multi-regional Cloud Storage bucket. Access the data directly using Cloud Dataproc, BigQuery, and Compute Engine.
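For illustration, a minimal sketch of the storage layer in option D, creating a multi-region Cloud Storage bucket that Cloud Dataproc, BigQuery (via external tables), and Compute Engine can all read directly over gs:// paths; the project and bucket names are assumptions.

```python
from google.cloud import storage

client = storage.Client(project="my-project")
bucket = storage.Bucket(client, name="historical-data-archive-example")
bucket.storage_class = "STANDARD"

# A multi-region location ("US") maximizes availability for the shared data.
client.create_bucket(bucket, location="US")
```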