Pre-Summer Sale Special Limited Time 70% Discount Offer - Ends in 0d 00h 00m 00s - Coupon code: xmas50

Databricks Databricks-Certified-Professional-Data-Engineer - Databricks Certified Data Engineer Professional Exam

A Structured Streaming job deployed to production has been resulting in higher than expected cloud storage costs. At present, during normal execution, each micro-batch of data is processed in less than 3 seconds; at least 12 times per minute, a micro-batch is processed that contains 0 records. The streaming write was configured using the default trigger settings. The production job is currently scheduled alongside many other Databricks jobs in a workspace with instance pools provisioned to reduce start-up time for jobs with batch execution. Holding all other variables constant and assuming records need to be processed in less than 10 minutes, which adjustment will meet the requirement?

A.

Set the trigger interval to 500 milliseconds; setting a small but non-zero trigger interval ensures that the source is not queried too frequently.

B.

Set the trigger interval to 3 seconds; the default trigger interval is consuming too many records per batch, resulting in spill to disk that can increase volume costs.

C.

Set the trigger interval to 10 minutes; each batch calls APIs in the source storage account, so decreasing trigger frequency to the maximum allowable threshold should minimize this cost.

D.

Use the trigger once option and configure a Databricks job to execute the query every 10 minutes; this approach minimizes costs for both compute and storage.

The data engineer is using Spark ' s MEMORY_ONLY storage level.

Which indicators should the data engineer look for in the spark UI ' s Storage tab to signal that a cached table is not performing optimally?

A.

Size on Disk is > 0

B.

The number of Cached Partitions > the number of Spark Partitions

C.

The RDD Block Name included the ' ' annotation signaling failure to cache

D.

On Heap Memory Usage is within 75% of off Heap Memory usage

A junior data engineer has configured a workload that posts the following JSON to the Databricks REST API endpoint 2.0/jobs/create .

Assuming that all configurations and referenced resources are available, which statement describes the result of executing this workload three times?

A.

Three new jobs named " Ingest new data " will be defined in the workspace, and they will each run once daily.

B.

The logic defined in the referenced notebook will be executed three times on new clusters with the configurations of the provided cluster ID.

C.

Three new jobs named " Ingest new data " will be defined in the workspace, but no jobs will be executed.

D.

One new job named " Ingest new data " will be defined in the workspace, but it will not be executed.

E.

The logic defined in the referenced notebook will be executed three times on the referenced existing all purpose cluster.

What statement is true regarding the retention of job run history?

A.

It is retained until you export or delete job run logs

B.

It is retained for 30 days, during which time you can deliver job run logs to DBFS or S3

C.

t is retained for 60 days, during which you can export notebook run results to HTML

D.

It is retained for 60 days, after which logs are archived

E.

It is retained for 90 days or until the run-id is re-used through custom run configuration

A data engineering team is configuring access controls in Databricks Unity Catalog . They grant the SELECT privilege on the sales catalog to the analyst_group, expecting that members of this group will automatically have SELECT access to all current and future schemas, tables, and views within the catalog.

What describes the privilege inheritance behavior in Unity Catalog?

A.

Granting SELECT at the catalog level applies to existing schemas and tables but not to those created in the future.

B.

Privileges in Unity Catalog do not cascade; SELECT must be explicitly granted on each schema and table, even if granted at the catalog level.

C.

Privileges granted at the schema level override any catalog-level privileges and prevent access unless explicitly revoked.

D.

Granting SELECT on a catalog automatically applies SELECT to all current and future schemas, tables, and views within that catalog.

To reduce storage and compute costs, the data engineering team has been tasked with curating a series of aggregate tables leveraged by business intelligence dashboards, customer-facing applications, production machine learning models, and ad hoc analytical queries.

The data engineering team has been made aware of new requirements from a customer-facing application, which is the only downstream workload they manage entirely. As a result, an aggregate table used by numerous teams across the organization will need to have a number of fields renamed, and additional fields will also be added.

Which of the solutions addresses the situation while minimally interrupting other teams in the organization without increasing the number of tables that need to be managed?

A.

Send all users notice that the schema for the table will be changing; include in the communication the logic necessary to revert the new table schema to match historic queries.

B.

Configure a new table with all the requisite fields and new names and use this as the source for the customer-facing application; create a view that maintains the original data schema and table name by aliasing select fields from the new table.

C.

Create a new table with the required schema and new fields and use Delta Lake ' s deep clone functionality to sync up changes committed to one table to the corresponding table.

D.

Replace the current table definition with a logical view defined with the query logic currently writing the aggregate table; create a new table to power the customer-facing application.

E.

Add a table comment warning all users that the table schema and field names will be changing on a given date; overwrite the table in place to the specifications of the customer-facing application.

A data engineer has created a new cluster using shared access mode with default configurations. The data engineer needs to allow the development team access to view the driver logs if needed.

What are the minimal cluster permissions that allow the development team to accomplish this?

A.

CAN ATTACH TO

B.

CAN MANAGE

C.

CAN VIEW

D.

CAN RESTART

A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on task A.

If tasks A and B complete successfully but task C fails during a scheduled run, which statement describes the resulting state?

A.

All logic expressed in the notebook associated with tasks A and B will have been successfully completed; some operations in task C may have completed successfully.

B.

All logic expressed in the notebook associated with tasks A and B will have been successfully completed; any changes made in task C will be rolled back due to task failure.

C.

All logic expressed in the notebook associated with task A will have been successfully completed; tasks B and C will not commit any changes because of stage failure.

D.

Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until ail tasks have successfully been completed.

E.

Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task C failed, all commits will be rolled back automatically.

A data team ' s Structured Streaming job is configured to calculate running aggregates for item sales to update a downstream marketing dashboard. The marketing team has introduced a new field to track the number of times this promotion code is used for each item. A junior data engineer suggests updating the existing query as follows: Note that proposed changes are in bold.

Which step must also be completed to put the proposed query into production?

A.

Increase the shuffle partitions to account for additional aggregates

B.

Specify a new checkpointlocation

C.

Run REFRESH TABLE delta, /item_agg '

D.

Remove .option (mergeSchema ' , true ' ) from the streaming write

Which configuration parameter directly affects the size of a spark-partition upon ingestion of data into Spark?

A.

spark.sql.files.maxPartitionBytes

B.

spark.sql.autoBroadcastJoinThreshold

C.

spark.sql.files.openCostInBytes

D.

spark.sql.adaptive.coalescePartitions.minPartitionNum

E.

spark.sql.adaptive.advisoryPartitionSizeInBytes