Databricks Databricks-Certified-Data-Engineer-Associate - Databricks Certified Data Engineer Associate Exam

Total 176 questions

Question # 31

A data engineer and a data analyst are working together on a data pipeline. The data engineer is working on the raw, bronze, and silver layers of the pipeline using Python, and the data analyst is working on the gold layer of the pipeline using SQL. The raw source of the pipeline is a streaming input. They now want to migrate their pipeline to use Delta Live Tables.

Which of the following changes will need to be made to the pipeline when migrating to Delta Live Tables?

None of these changes will need to be made

The pipeline will need to stop using the medallion-based multi-hop architecture

The pipeline will need to be written entirely in SQL

The pipeline will need to use a batch source in place of a streaming source

The pipeline will need to be written entirely in Python

Question # 32

A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.

The code block used by the data engineer is below:

If the data engineer only wants the query to execute a micro-batch to process data every 5 seconds, which of the following lines of code should the data engineer use to fill in the blank?

trigger("5 seconds")

trigger()

trigger(once="5 seconds")

trigger(processingTime="5 seconds")

trigger(continuous="5 seconds")
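For reference, here is a minimal sketch of how a fixed-interval micro-batch trigger is usually expressed in PySpark. The code block the question refers to is not reproduced on this page, so this is not the engineer's code: the function names, table names, and checkpoint path are hypothetical, and the data manipulation step is omitted. It assumes an existing SparkSession on Spark 3.1 or later, where `readStream.table` and `toTable` are available.

```python
def fixed_interval_trigger(seconds: int) -> dict:
    """Keyword arguments for DataStreamWriter.trigger() that request
    one micro-batch every `seconds` seconds."""
    return {"processingTime": f"{seconds} seconds"}


def start_stream(spark, source_table: str, target_table: str, checkpoint_path: str):
    """Start a streaming copy from source_table to target_table.

    `spark` is an existing SparkSession; all table names and the
    checkpoint path are placeholders supplied by the caller.
    """
    return (
        spark.readStream.table(source_table)
        .writeStream
        .trigger(**fixed_interval_trigger(5))  # same as trigger(processingTime="5 seconds")
        .option("checkpointLocation", checkpoint_path)
        .outputMode("append")
        .toTable(target_table)
    )


print(fixed_interval_trigger(5))
```

Only `fixed_interval_trigger` runs without a cluster; `start_stream` needs a live Spark session and real tables.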

Question # 33

Which method should a Data Engineer apply to ensure Workflows are being triggered on schedule?

Scheduled Workflows require an always-running cluster, which is more expensive but reduces processing latency.

Scheduled Workflows process data as it arrives at configured sources.

Scheduled Workflows can reduce resource consumption and expense since the cluster runs only long enough to execute the pipeline.

Scheduled Workflows run continuously until manually stopped.

Question # 34

A data engineering project involves processing large batches of data on a daily schedule using ETL. The jobs are resource-intensive and vary in size, requiring a scalable, cost-efficient compute solution that can automatically scale based on the workload.

Which compute approach will satisfy the needs described?

Databricks SQL Serverless

Dedicated Cluster

All-Purpose Cluster

Job Cluster

Question # 35

Which of the following describes the relationship between Bronze tables and raw data?

Bronze tables contain less data than raw data files.

Bronze tables contain more truthful data than raw data.

Bronze tables contain aggregates while raw data is unaggregated.

Bronze tables contain a less refined view of data than raw data.

Bronze tables contain raw data with a schema applied.

Question # 36

A data engineer only wants to execute the final block of a Python program if the Python variable day_of_week is equal to 1 and the Python variable review_period is True.

Which of the following control flow statements should the data engineer use to begin this conditionally executed code block?

if day_of_week = 1 and review_period:

if day_of_week = 1 and review_period = "True":

if day_of_week == 1 and review_period == "True":

if day_of_week == 1 and review_period:

if day_of_week = 1 & review_period: = "True":
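The difference between these forms can be checked in plain Python. The sketch below uses hypothetical helper names; it shows that `==` compares values while `=` inside an `if` condition is a syntax error, and that the boolean `True` is not equal to the string "True".

```python
def should_run_final_block(day_of_week, review_period):
    """Return True only when day_of_week equals 1 and review_period is truthy."""
    if day_of_week == 1 and review_period:
        return True
    return False


def is_valid_python(source: str) -> bool:
    """Report whether a snippet compiles, without executing it."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


print(should_run_final_block(1, True))    # both conditions hold
print(should_run_final_block(2, True))    # wrong day
print(True == "True")                     # a boolean never equals a string
print(is_valid_python("if day_of_week = 1 and review_period:\n    pass"))
```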

Question # 37

A data engineer needs to use a Delta table as part of a data pipeline, but they do not know if they have the appropriate permissions.

In which of the following locations can the data engineer review their permissions on the table?

Databricks Filesystem

Jobs

Dashboards

Repos

Data Explorer

Question # 38

Which of the following describes a scenario in which a data engineer will want to use a single-node cluster?

When they are working interactively with a small amount of data

When they are running automated reports to be refreshed as quickly as possible

When they are working with SQL within Databricks SQL

When they are concerned about the ability to automatically scale with larger data

When they are manually running reports with a large amount of data

Explanation:

The scenario in which a data engineer will want to use a single-node cluster is when they are working interactively with a small amount of data. A single-node cluster is a cluster consisting of an Apache Spark driver and no Spark workers [1]. A single-node cluster supports Spark jobs and all Spark data sources, including Delta Lake [1]. A single-node cluster is helpful for single-node machine learning workloads that use Spark to load and save data, and for lightweight exploratory data analysis [1]. A single-node cluster can run Spark locally, spawn one executor thread per logical core in the cluster, and save all log output in the driver log [1]. A single-node cluster can be created by selecting the Single Node button when configuring a cluster [1].

The other options are not suitable for using a single-node cluster. When running automated reports to be refreshed as quickly as possible, a data engineer will want to use a multi-node cluster that can scale up and down automatically based on the workload demand [2]. When working with SQL within Databricks SQL, a data engineer will want to use a SQL Endpoint that can execute SQL queries on a serverless pool or an existing cluster [3]. When concerned about the ability to automatically scale with larger data, a data engineer will want to use a multi-node cluster that can leverage the Databricks Lakehouse Platform and the Delta Engine to handle large-scale data processing efficiently and reliably [4]. When manually running reports with a large amount of data, a data engineer will want to use a multi-node cluster that can distribute the computation across multiple workers and leverage the Spark UI to monitor the performance and troubleshoot the issues.

1: Single Node clusters | Databricks on AWS

2: Autoscaling | Databricks on AWS

3: SQL Endpoints | Databricks on AWS

4: Databricks Lakehouse Platform | Databricks on AWS

[Spark UI | Databricks on AWS]
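As an illustration of the "driver and no workers" description in the explanation above, a single-node cluster can be sketched as a cluster specification with zero workers. The helper name and argument values below are hypothetical, and the `spark_conf` and `custom_tags` entries follow the pattern commonly shown for single-node clusters in the Databricks Clusters API; treat them as assumptions and check them against the current API reference before use.

```python
def single_node_cluster_spec(cluster_name: str, spark_version: str, node_type_id: str) -> dict:
    """Build a cluster specification with a driver and no workers.

    All three arguments are placeholders chosen by the caller.
    """
    return {
        "cluster_name": cluster_name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": 0,  # no Spark workers, driver only
        "spark_conf": {
            # Assumed single-node settings; verify against the API docs.
            "spark.databricks.cluster.profile": "singleNode",
            "spark.master": "local[*]",  # run Spark locally on the driver
        },
        "custom_tags": {"ResourceClass": "SingleNode"},
    }


spec = single_node_cluster_spec("exploration", "<spark-version>", "<node-type>")
print(spec["num_workers"])
```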

Question # 39

Which of the following commands will return the number of null values in the member_id column?

SELECT count(member_id) FROM my_table;

SELECT count(member_id) - count_null(member_id) FROM my_table;

SELECT count_if(member_id IS NULL) FROM my_table;

SELECT null(member_id) FROM my_table;

SELECT count_null(member_id) FROM my_table;
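The behaviour these options depend on is that `count(column)` skips NULL values. The sketch below demonstrates it with Python's built-in SQLite so that it runs without a Spark cluster. SQLite has no `count_if`, so two standard-SQL equivalents are swapped in; in Spark SQL the same count can be written as `count_if(member_id IS NULL)`. The table contents and the helper name are made up for illustration.

```python
import sqlite3


def null_counts(member_ids):
    """Return (non-null count, nulls via subtraction, nulls via CASE)
    for the member_id column of a throwaway in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE my_table (member_id INTEGER)")
    conn.executemany(
        "INSERT INTO my_table VALUES (?)", [(value,) for value in member_ids]
    )
    row = conn.execute(
        """
        SELECT count(member_id),                  -- NULLs are not counted
               count(*) - count(member_id),       -- rows minus non-NULL rows
               sum(CASE WHEN member_id IS NULL THEN 1 ELSE 0 END)
        FROM my_table
        """
    ).fetchone()
    conn.close()
    return row


print(null_counts([1, None, 3, None, 5]))
```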

Question # 40

A data engineer has a Job with multiple tasks that runs nightly. Each of the tasks runs slowly because the clusters take a long time to start.

Which of the following actions can the data engineer perform to improve the start up time for the clusters used for the Job?

They can use endpoints available in Databricks SQL

They can use jobs clusters instead of all-purpose clusters

They can configure the clusters to be single-node

They can use clusters that are from a cluster pool

They can configure the clusters to autoscale for larger data sizes
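For context, a cluster pool keeps idle instances ready so that clusters drawing from it can start faster. A job's cluster specification references a pool through the `instance_pool_id` field of the Databricks Clusters API. The sketch below is illustrative only: the helper name, pool ID, runtime version, and worker counts are hypothetical placeholders.

```python
def pooled_job_cluster(instance_pool_id: str, spark_version: str,
                       min_workers: int, max_workers: int) -> dict:
    """Build a job cluster specification whose nodes come from a pool.

    The node type is taken from the pool, so no node_type_id is set here.
    """
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    return {
        "spark_version": spark_version,
        "instance_pool_id": instance_pool_id,  # draw instances from the pool
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
    }


cluster = pooled_job_cluster("<pool-id>", "<spark-version>", 2, 8)
print(sorted(cluster))
```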
