
Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 - Databricks Certified Associate Developer for Apache Spark 3.5-Python

Which command overwrites an existing JSON file when writing a DataFrame?

A.

df.write.mode("overwrite").json("path/to/file")

B.

df.write.overwrite.json("path/to/file")

C.

df.write.json("path/to/file", overwrite=True)

D.

df.write.format("json").save("path/to/file", mode="overwrite")

A data scientist is working with a Spark DataFrame called customerDF that contains customer information. The DataFrame has a column named email with customer email addresses. The data scientist needs to split this column into username and domain parts.

Which code snippet splits the email column into username and domain columns?

A.

customerDF.select(
    col("email").substr(0, 5).alias("username"),
    col("email").substr(-5).alias("domain")
)

B.

customerDF.withColumn("username", split(col("email"), "@").getItem(0)) \

.withColumn("domain", split(col("email"), "@").getItem(1))

C.

customerDF.withColumn("username", substring_index(col("email"), "@", 1)) \

.withColumn("domain", substring_index(col("email"), "@", -1))

D.

customerDF.select(
    regexp_replace(col("email"), "@", "").alias("username"),
    regexp_replace(col("email"), "@", "").alias("domain")
)
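For context, a runnable sketch of the split-based approach (the sample row is invented; the customerDF in the question would already exist). For well-formed addresses, substring_index with indices 1 and -1 yields the same two parts:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

spark = SparkSession.builder.getOrCreate()

# Hypothetical single-row DataFrame standing in for customerDF.
customerDF = spark.createDataFrame([("alice@example.com",)], ["email"])

result = (customerDF
    .withColumn("username", split(col("email"), "@").getItem(0))
    .withColumn("domain", split(col("email"), "@").getItem(1)))
result.show()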

An engineer has a large ORC file located at /file/test_data.orc and wants to read only specific columns to reduce memory usage.

Which code fragment will select only the specified columns, i.e., col1 and col2, during the reading process?

A.

spark.read.orc("/file/test_data.orc").filter("col1 = 'value' ").select("col2")

B.

spark.read.format("orc").select("col1", "col2").load("/file/test_data.orc")

C.

spark.read.orc("/file/test_data.orc").selected("col1", "col2")

D.

spark.read.format("orc").load("/file/test_data.orc").select("col1", "col2")

A data engineer is working with a large JSON dataset containing order information. The dataset is stored in a distributed file system and needs to be loaded into a Spark DataFrame for analysis. The data engineer wants to ensure that the schema is correctly defined and that the data is read efficiently.

Which approach should the data engineer use to efficiently load the JSON data into a Spark DataFrame with a predefined schema?

A.

Use spark.read.json() to load the data, then use DataFrame.printSchema() to view the inferred schema, and finally use DataFrame.cast() to modify column types.

B.

Use spark.read.json() with the inferSchema option set to true

C.

Use spark.read.format("json").load() and then use DataFrame.withColumn() to cast each column to the desired data type.

D.

Define a StructType schema and use spark.read.schema(predefinedSchema).json() to load the data.
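A minimal sketch of reading JSON with a predefined schema; the field names and input path are illustrative assumptions, not part of the question:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.getOrCreate()

# Hypothetical order schema for illustration only.
order_schema = StructType([
    StructField("order_id", StringType(), True),
    StructField("customer_id", StringType(), True),
    StructField("amount", DoubleType(), True),
    StructField("order_ts", TimestampType(), True),
])

# Supplying the schema up front avoids the extra pass over the files
# that schema inference would otherwise require.
orders_df = spark.read.schema(order_schema).json("/data/orders/")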

In the code block below, aggDF contains aggregations on a streaming DataFrame:

Which output mode at line 3 ensures that the entire result table is written to the console during each trigger execution?

A.

complete

B.

append

C.

replace

D.

aggregate
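The original code block is not reproduced above; the sketch below is a representative streaming aggregation written to the console, with the output mode on its own line (sink and trigger details are assumptions):

# aggDF is assumed to be a streaming DataFrame built with groupBy(...).agg(...).
query = (aggDF.writeStream
         .format("console")
         .outputMode("complete")  # rewrites the entire result table on every trigger
         .start())

# For comparison: "append" emits only rows added since the last trigger,
# and "update" emits only rows that changed; neither shows the full table.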

A Spark developer wants to improve the performance of an existing PySpark UDF that runs a hash function that is not available in the standard Spark functions library. The existing UDF code is:

import hashlib
import pyspark.sql.functions as sf
from pyspark.sql.types import StringType

def shake_256(raw):
    return hashlib.shake_256(raw.encode()).hexdigest(20)

shake_256_udf = sf.udf(shake_256, StringType())

The developer wants to replace this existing UDF with a Pandas UDF to improve performance. The developer changes the definition of shake_256_udf to this:

shake_256_udf = sf.pandas_udf(shake_256, StringType())

However, the developer receives an error.

What should the signature of the shake_256() function be changed to in order to fix this error?

A.

def shake_256(df: pd.Series) -> str:

B.

def shake_256(df: Iterator[pd.Series]) -> Iterator[pd.Series]:

C.

def shake_256(raw: str) -> str:

D.

def shake_256(df: pd.Series) -> pd.Series:
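One way to express this as a Series-to-Series pandas UDF is sketched below: the function receives a pd.Series per batch and must return a pd.Series of equal length (the element-wise map is an assumption about how the hash would be applied):

import hashlib
import pandas as pd
import pyspark.sql.functions as sf
from pyspark.sql.types import StringType

@sf.pandas_udf(StringType())
def shake_256_udf(raw: pd.Series) -> pd.Series:
    # Apply the hash to each element of the batch and return a Series.
    return raw.map(lambda s: hashlib.shake_256(s.encode()).hexdigest(20))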

Given the schema:

event_ts TIMESTAMP,
sensor_id STRING,
metric_value LONG,
ingest_ts TIMESTAMP,
source_file_path STRING

The goal is to deduplicate based on: event_ts, sensor_id, and metric_value.

Options:

A.

dropDuplicates on all columns (wrong criteria)

B.

dropDuplicates with no arguments (removes based on all columns)

C.

groupBy without aggregation (invalid use)

D.

dropDuplicates on the exact matching fields
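A minimal sketch of deduplicating on a subset of columns (events_df is a hypothetical DataFrame with the schema shown above):

# Keeps one row per (event_ts, sensor_id, metric_value), regardless of
# differences in ingest_ts or source_file_path.
deduped_df = events_df.dropDuplicates(["event_ts", "sensor_id", "metric_value"])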

What is the difference between df.cache() and df.persist() for a Spark DataFrame?

A.

Both cache() and persist() can be used to set the default storage level (MEMORY_AND_DISK_SER)

B.

Both functions perform the same operation. The persist() function provides improved performance as its default storage level is DISK_ONLY.

C.

persist() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK_SER), and cache() - Can be used to set different storage levels to persist the contents of the DataFrame.

D.

cache() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK), and persist() - Can be used to set different storage levels to persist the contents of the DataFrame
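A minimal sketch of the API difference (the exact name of the default storage level varies across Spark versions, so it is not asserted here):

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()

df1 = spark.range(1_000_000)
df2 = spark.range(1_000_000)

df1.cache()                           # cache() always uses the default storage level
df2.persist(StorageLevel.DISK_ONLY)   # persist() optionally takes an explicit storage level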

A data analyst wants to add a column date derived from a timestamp column.

Options:

A.

dates_df.withColumn("date", f.unix_timestamp("timestamp")).show()

B.

dates_df.withColumn("date", f.to_date("timestamp")).show()

C.

dates_df.withColumn("date", f.date_format("timestamp", "yyyy-MM-dd")).show()

D.

dates_df.withColumn("date", f.from_unixtime("timestamp")).show()

A data engineer is running a batch processing job on a Spark cluster with the following configuration:

10 worker nodes

16 CPU cores per worker node

64 GB RAM per node

The data engineer wants to allocate four executors per node, each executor using four cores.

What is the total number of CPU cores used by the application?

A.

160

B.

64

C.

80

D.

40
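A quick worked check of the arithmetic implied by the configuration above:

nodes = 10
executors_per_node = 4
cores_per_executor = 4

total_executors = nodes * executors_per_node         # 10 * 4 = 40 executors
total_cores = total_executors * cores_per_executor   # 40 * 4 = 160 cores
print(total_executors, total_cores)                  # 40 160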