New Year Sale Special Limited Time 70% Discount Offer - Ends in 0d 00h 00m 00s - Coupon code: xmas50

Databricks Databricks-Certified-Professional-Data-Engineer - Databricks Certified Data Engineer Professional Exam

The data architect has mandated that all tables in the Lakehouse should be configured as external Delta Lake tables.

Which approach will ensure that this requirement is met?

A.

Whenever a database is being created, make sure that the location keyword is used

B.

When configuring an external data warehouse for all table storage. leverage Databricks for all ELT.

C.

Whenever a table is being created, make sure that the location keyword is used.

D.

When tables are created, make sure that the external keyword is used in the create table statement.

E.

When the workspace is being configured, make sure that external cloud object storage has been mounted.

A junior developer complains that the code in their notebook isn't producing the correct results in the development environment. A shared screenshot reveals that while they're using a notebook versioned with Databricks Repos, they're using a personal branch that contains old logic. The desired branch named dev-2.3.9 is not available from the branch selection dropdown.

Which approach will allow this developer to review the current logic for this notebook?

A.

Use Repos to make a pull request use the Databricks REST API to update the current branch to dev-2.3.9

B.

Use Repos to pull changes from the remote Git repository and select the dev-2.3.9 branch.

C.

Use Repos to checkout the dev-2.3.9 branch and auto-resolve conflicts with the current branch

D.

Merge all changes back to the main branch in the remote Git repository and clone the repo again

E.

Use Repos to merge the current branch and the dev-2.3.9 branch, then make a pull request to sync with the remote repository

A junior data engineer has manually configured a series of jobs using the Databricks Jobs UI. Upon reviewing their work, the engineer realizes that they are listed as the "Owner" for each job. They attempt to transfer "Owner" privileges to the "DevOps" group, but cannot successfully accomplish this task.

Which statement explains what is preventing this privilege transfer?

A.

Databricks jobs must have exactly one owner; "Owner" privileges cannot be assigned to a group.

B.

The creator of a Databricks job will always have "Owner" privileges; this configuration cannot be changed.

C.

Other than the default "admins" group, only individual users can be granted privileges on jobs.

D.

A user can only transfer job ownership to a group if they are also a member of that group.

E.

Only workspace administrators can grant "Owner" privileges to a group.

Incorporating unit tests into a PySpark application requires upfront attention to the design of your jobs, or a potentially significant refactoring of existing code.

Which statement describes a main benefit that offset this additional effort?

A.

Improves the quality of your data

B.

Validates a complete use case of your application

C.

Troubleshooting is easier since all steps are isolated and tested individually

D.

Yields faster deployment and execution times

E.

Ensures that all steps interact correctly to achieve the desired end result

A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on task A.

If tasks A and B complete successfully but task C fails during a scheduled run, which statement describes the resulting state?

A.

All logic expressed in the notebook associated with tasks A and B will have been successfully completed; some operations in task C may have completed successfully.

B.

All logic expressed in the notebook associated with tasks A and B will have been successfully completed; any changes made in task C will be rolled back due to task failure.

C.

All logic expressed in the notebook associated with task A will have been successfully completed; tasks B and C will not commit any changes because of stage failure.

D.

Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until ail tasks have successfully been completed.

E.

Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task C failed, all commits will be rolled back automatically.

A Delta Lake table in the Lakehouse named customer_parsams is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.

Immediately after each update succeeds, the data engineer team would like to determine the difference between the new version and the previous of the table.

Given the current implementation, which method can be used?

A.

Parse the Delta Lake transaction log to identify all newly written data files.

B.

Execute DESCRIBE HISTORY customer_churn_params to obtain the full operation metrics for the update, including a log of all records that have been added or modified.

C.

Execute a query to calculate the difference between the new version and the previous version using Delta Lake’s built-in versioning and time travel functionality.

D.

Parse the Spark event logs to identify those rows that were updated, inserted, or deleted.

A member of the data engineering team has submitted a short notebook that they wish to schedule as part of a larger data pipeline. Assume that the commands provided below produce the logically correct results when run as presented.

Which command should be removed from the notebook before scheduling it as a job?

A.

Cmd 2

B.

Cmd 3

C.

Cmd 4

D.

Cmd 5

E.

Cmd 6

The data science team has requested assistance in accelerating queries on free form text from user reviews. The data is currently stored in Parquet with the below schema:

item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING

The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify if any of 30 key words exist in this field.

A junior data engineer suggests converting this data to Delta Lake will improve query performance.

Which response to the junior data engineer s suggestion is correct?

A.

Delta Lake statistics are not optimized for free text fields with high cardinality.

B.

Text data cannot be stored with Delta Lake.

C.

ZORDER ON review will need to be run to see performance gains.

D.

The Delta log creates a term matrix for free text fields to support selective filtering.

E.

Delta Lake statistics are only collected on the first 4 columns in a table.

A data engineer wants to create a cluster using the Databricks CLI for a big ETL pipeline. The cluster should have five workers, one driver of type i3.xlarge, and should use the '14.3.x-scala2.12' runtime.

Which command should the data engineer use?

A.

databricks clusters create 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name DataEngineer_cluster

B.

databricks clusters add 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name Data Engineer_cluster

C.

databricks compute add 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name Data Engineer_cluster

D.

databricks compute create 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name Data Engineer_cluster

Which statement describes integration testing?

A.

Validates interactions between subsystems of your application

B.

Requires an automated testing framework

C.

Requires manual intervention

D.

Validates an application use case

E.

Validates behavior of individual elements of your application