Databricks Databricks-Certified-Professional-Data-Engineer - Databricks Certified Data Engineer Professional Exam

Total 120 questions

A junior data engineer has manually configured a series of jobs using the Databricks Jobs UI. Upon reviewing their work, the engineer realizes that they are listed as the "Owner" for each job. They attempt to transfer "Owner" privileges to the "DevOps" group, but cannot successfully accomplish this task.

Which statement explains what is preventing this privilege transfer?

Databricks jobs must have exactly one owner; "Owner" privileges cannot be assigned to a group.

The creator of a Databricks job will always have "Owner" privileges; this configuration cannot be changed.

Other than the default "admins" group, only individual users can be granted privileges on jobs.

A user can only transfer job ownership to a group if they are also a member of that group.

Only workspace administrators can grant "Owner" privileges to a group.

Question # 12

The DevOps team has configured a production workload as a collection of notebooks scheduled to run daily using the Jobs UI. A new data engineering hire is onboarding to the team and has requested access to one of these notebooks to review the production logic.

What are the maximum notebook permissions that can be granted to the user without allowing accidental changes to production code or data?

Can manage

Can edit

Can run

Can read

Question # 13

A data engineer is testing a collection of mathematical functions, one of which calculates the area under a curve as described by another function.

Which kind of test does the above scenario exemplify?

Integration

Unit

Manual

Functional
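As an illustration of what testing a single mathematical function in isolation can look like, here is a minimal sketch in Python. The function name area_under_curve, the trapezoidal rule, and the test case are assumptions made for this example; the question does not specify how the function is implemented.

```python
import math


def area_under_curve(f, a, b, n=1000):
    """Approximate the integral of f over [a, b] with the trapezoidal rule.

    Hypothetical implementation, used only to have something to test.
    """
    step = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * step)
    return total * step


def test_area_under_line():
    # The integral of f(x) = x from 0 to 2 is exactly 2, and the
    # trapezoidal rule is exact for straight lines (up to rounding).
    result = area_under_curve(lambda x: x, 0.0, 2.0)
    assert math.isclose(result, 2.0, rel_tol=1e-9)


test_area_under_line()
```

The test exercises one function with a known input and a known expected value, without involving any other component of the system.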

Question # 14

A junior data engineer on your team has implemented the following code block.

The view new_events contains a batch of records with the same schema as the events Delta table. The event_id field serves as a unique key for this table.

When this query is executed, what will happen with new records that have the same event_id as an existing record?

They are merged.

They are ignored.

They are updated.

They are inserted.

They are deleted.

Question # 15

Which Python variable contains a list of directories to be searched when trying to locate required modules?

importlib.resource_path

sys.path

os.path

pypi.path

pylib.source
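For reference, Python's module search path can be inspected directly from the standard library. The sketch below prints the first few entries of sys.path and appends one extra directory; the path /tmp/hypothetical_modules is an assumption made up for the example and does not need to exist.

```python
import sys

# sys.path is an ordinary list of directory strings that the import
# system walks, in order, when resolving "import some_module".
print(type(sys.path).__name__)
for entry in sys.path[:3]:
    print(entry)

# Hypothetical directory, added only to show that the list is mutable.
extra_dir = "/tmp/hypothetical_modules"
if extra_dir not in sys.path:
    sys.path.append(extra_dir)
```

Entries appended at runtime only affect the current interpreter session.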

Question # 16

A member of the data engineering team has submitted a short notebook that they wish to schedule as part of a larger data pipeline. Assume that the commands provided below produce the logically correct results when run as presented.

Which command should be removed from the notebook before scheduling it as a job?

Cmd 2

Cmd 3

Cmd 4

Cmd 5

Cmd 6

Question # 17

A production cluster has 3 executor nodes and uses the same virtual machine type for the driver and executor.

When evaluating the Ganglia Metrics for this cluster, which indicator would signal a bottleneck caused by code executing on the driver?

The Five Minute Load Average remains consistent/flat

Bytes Received never exceeds 80 million bytes per second

Total Disk Space remains constant

Network I/O never spikes

Overall cluster CPU utilization is around 25%

Explanation:

"Overall cluster CPU utilization is around 25%" is the correct answer because it indicates a bottleneck caused by code executing on the driver. A bottleneck is a situation where the performance or capacity of a system is limited by a single component or resource, and it can cause slow execution, high latency, or low throughput.

The cluster has four nodes of the same virtual machine type: the driver plus 3 executors. When evaluating the Ganglia metrics, one can look at how cluster resources such as CPU, memory, disk, and network are being utilized. If overall cluster CPU utilization is around 25%, that is consistent with one of the four nodes using its full CPU capacity while the other three are idle or underutilized. This suggests that the code executing on the driver is taking too long or consuming too much CPU, preventing the executors from receiving tasks or data to process. This can happen when the code has driver-side operations that are not parallelized or distributed, such as collecting large amounts of data to the driver, performing complex calculations on the driver, or using non-Spark libraries on the driver.

Verified References: [Databricks Certified Data Engineer Professional], under "Spark Core" section; Databricks Documentation, under "View cluster status and event logs - Ganglia metrics" section; Databricks Documentation, under "Avoid collecting large RDDs" section.

In a Spark cluster, the driver node is responsible for managing the execution of the Spark application, including scheduling tasks, managing the execution plan, and interacting with the cluster manager. If the overall cluster CPU utilization is low (e.g., around 25%), it may indicate that work is concentrated on the driver while the executors wait, making the driver the bottleneck.
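The 25% figure follows from simple averaging across equally sized nodes. A minimal sketch of that arithmetic, assuming an idealized case where the driver is fully busy and the three executors are completely idle (real Ganglia readings would be noisier):

```python
def cluster_cpu_utilization(per_node_utilization):
    """Average CPU utilization across nodes of identical size."""
    return sum(per_node_utilization) / len(per_node_utilization)


# Illustrative values, not measurements: driver at 100%, 3 executors at 0%.
nodes = [1.0, 0.0, 0.0, 0.0]
overall = cluster_cpu_utilization(nodes)
print(f"{overall:.0%}")
```

Because all four machines use the same VM type, each contributes an equal quarter of the cluster's total CPU capacity.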

Question # 18

A junior data engineer seeks to leverage Delta Lake's Change Data Feed functionality to create a Type 1 table representing all of the values that have ever been valid for all rows in a bronze table created with the property delta.enableChangeDataFeed = true. They plan to execute the following code as a daily job:

Which statement describes the execution and results of running the above query multiple times?

Each time the job is executed, newly updated records will be merged into the target table, overwriting previous values with the same primary keys.

Each time the job is executed, the entire available history of inserted or updated records will be appended to the target table, resulting in many duplicate entries.

Each time the job is executed, the target table will be overwritten using the entire history of inserted or updated records, giving the desired result.

Each time the job is executed, the differences between the original and current versions are calculated; this may result in duplicate entries for some records.

Each time the job is executed, only those records that have been inserted or updated since the last execution will be appended to the target table, giving the desired result.

Question # 19

A data pipeline uses Structured Streaming to ingest data from Kafka to Delta Lake. Data is being stored in a bronze table, and includes the Kafka-generated timestamp, key, and value. Three months after the pipeline is deployed, the data engineering team has noticed some latency issues during certain times of the day.

A senior data engineer updates the Delta table's schema and ingestion logic to include the current timestamp (as recorded by Apache Spark) as well as the Kafka topic and partition. The team plans to use the additional metadata fields to diagnose the transient processing delays.

Which limitation will the team face while diagnosing this problem?

New fields cannot be computed for historic records.

Updating the table schema will invalidate the Delta transaction log metadata.

Updating the table schema requires a default value provided for each file added.

Spark cannot capture the topic and partition fields from the Kafka source.

Question # 20

Which of the following technologies can be used to identify key areas of text when parsing Spark Driver log4j output?

Regex

Julia

pyspark.ml.feature

Scala Datasets

C++
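To illustrate how a regular expression can pick out the key parts of a log line, here is a minimal Python sketch. The line layout (two-digit date, time, level, logger name, message) and the sample line are assumptions for the example; actual Spark driver log4j output depends on the configured pattern layout, so the expression would need adjusting to match it.

```python
import re

# Assumed layout: "yy/MM/dd HH:mm:ss LEVEL Logger: message"
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<logger>[\w.$]+): "
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return the named fields of a log line, or None if it does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


# Hypothetical sample line, not taken from a real cluster.
sample = "23/05/01 12:34:56 WARN TaskSetManager: Lost task 0.0 in stage 1.0"
parsed = parse_line(sample)
print(parsed["level"], parsed["logger"])
```

Lines that do not fit the layout, such as stack trace continuation lines, return None and can be skipped or attached to the preceding entry.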
