
Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 - Databricks Certified Associate Developer for Apache Spark 3.5 – Python

Given the following code snippet in my_spark_app.py:

What is the role of the driver node?

A.

The driver node orchestrates the execution by transforming actions into tasks and distributing them to worker nodes

B.

The driver node only provides the user interface for monitoring the application

C.

The driver node holds the DataFrame data and performs all computations locally

D.

The driver node stores the final result after computations are completed by worker nodes
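
The code snippet from my_spark_app.py is not reproduced in this question. A minimal, hypothetical sketch of such an application is given below purely to illustrate the driver's role: it builds the logical plan and, once an action is called, turns the work into a job, splits it into stages and tasks, and schedules those tasks on the worker nodes.

# Hypothetical stand-in for my_spark_app.py; not the snippet from the question.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my_spark_app").getOrCreate()

# Building the DataFrame only records a logical plan on the driver; no work runs yet.
df = spark.range(1_000_000).selectExpr("id * 2 AS value")

# The action below makes the driver create a job, break it into stages and tasks,
# and distribute those tasks to the executors on the worker nodes.
print(df.count())

spark.stop()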

A developer is working with a pandas DataFrame containing user behavior data from a web application.

Which approach should be used for executing a groupBy operation in parallel across all workers in Apache Spark 3.5?

A.

Use the applyInPandas API:

df.groupby("user_id").applyInPandas(mean_func, schema="user_id long, value double").show()

B.

Use the mapInPandas API:

df.mapInPandas(mean_func, schema="user_id long, value double").show()

C.

Use a regular Spark UDF:

from pyspark.sql.functions import mean

df.groupBy("user_id").agg(mean("value")).show()

D.

Use a Pandas UDF:

@pandas_udf("double")

def mean_func(value: pd.Series) -> float:

return value.mean()

df.groupby("user_id").agg(mean_func(df["value"])).show()

What is the benefit of Adaptive Query Execution (AQE)?

A.

It allows Spark to optimize the query plan before execution but does not adapt during runtime.

B.

It enables the adjustment of the query plan during runtime, handling skewed data, optimizing join strategies, and improving overall query performance.

C.

It optimizes query execution by parallelizing tasks and does not adjust strategies based on runtime metrics like data skew.

D.

It automatically distributes tasks across nodes in the clusters and does not perform runtime adjustments to the query plan.
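
For reference, AQE is controlled through runtime configuration; the sketch below shows the commonly used switches (AQE has been enabled by default since Spark 3.2).

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# AQE re-optimizes the physical plan at runtime using shuffle statistics.
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Coalesce many small shuffle partitions after a stage completes.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Split skewed partitions so one oversized task does not stall a join.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")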

43 of 55.

An organization has been running a Spark application in production and is considering disabling the Spark History Server to reduce resource usage.

What will be the impact of disabling the Spark History Server in production?

A.

Prevention of driver log accumulation during long-running jobs

B.

Improved job execution speed due to reduced logging overhead

C.

Loss of access to past job logs and reduced debugging capability for completed jobs

D.

Enhanced executor performance due to reduced log size
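
For context, the History Server only replays event logs written by completed applications; disabling it removes the UI for inspecting finished jobs but does not speed up running applications. A sketch of the related settings in spark-defaults.conf is shown below; the log directory path is an assumption.

# spark-defaults.conf (illustrative values; the directory path is an assumption)
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-event-logs
spark.history.fs.logDirectory    hdfs:///spark-event-logs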

A data scientist has identified that some records in the user profile table contain null values in one or more fields, and such records should be removed from the dataset before processing. The schema includes fields such as user_id, username, date_of_birth, and created_ts.

The schema of the user profile table looks like this:

Which block of Spark code can be used to achieve this requirement?

Options:

A.

filtered_df = users_raw_df.na.drop(thresh=0)

B.

filtered_df = users_raw_df.na.drop(how='all')

C.

filtered_df = users_raw_df.na.drop(how='any')

D.

filtered_df = users_raw_df.na.drop(how='all', thresh=None)
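
A small sketch of the how='any' behavior from option C is shown below; the sample column values are assumptions made for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
users_raw_df = spark.createDataFrame(
    [(1, "alice", "1990-01-01"), (2, None, "1985-06-15"), (3, "carol", None)],
    ["user_id", "username", "date_of_birth"],
)

# how='any' drops every row that has a null in at least one column.
filtered_df = users_raw_df.na.drop(how="any")
filtered_df.show()   # only the fully populated row for user_id 1 remains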

21 of 55.

What is the behavior of the function date_sub(start, days) if a negative value is passed into the days parameter?

A.

The number of days specified will be added to the start date.

B.

An error message of an invalid parameter will be returned.

C.

The same start date will be returned.

D.

The number of days specified will be removed from the start date.
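
A short sketch illustrating the behavior with a negative days value; the date used is an assumption.

from pyspark.sql import SparkSession
from pyspark.sql.functions import date_sub, to_date

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2024-09-19",)], ["start"])

# Passing a negative value adds that many days instead of subtracting them.
df.select(date_sub(to_date("start"), -3).alias("result")).show()
# result: 2024-09-22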

5 of 55.

What is the relationship between jobs, stages, and tasks during execution in Apache Spark?

A.

A job contains multiple tasks, and each task contains multiple stages.

B.

A stage contains multiple jobs, and each job contains multiple tasks.

C.

A stage contains multiple tasks, and each task contains multiple jobs.

D.

A job contains multiple stages, and each stage contains multiple tasks.
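
A small sketch that makes the hierarchy visible; the column names are assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000).withColumn("bucket", col("id") % 10)

# collect() is an action, so it triggers one job; the shuffle introduced by
# groupBy splits that job into two stages, and each stage runs one task per
# partition (visible in the Spark UI under Jobs > Stages > Tasks).
df.groupBy("bucket").count().collect()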

Given:

spark.sparkContext.setLogLevel("")

Which set contains the suitable configuration settings for Spark driver LOG_LEVELs?

A.

ALL, DEBUG, FAIL, INFO

B.

ERROR, WARN, TRACE, OFF

C.

WARN, NONE, ERROR, FATAL

D.

FATAL, NONE, INFO, DEBUG
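
For reference, a sketch of setting the driver log level; the valid values come from log4j's levels.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Valid levels: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN.
spark.sparkContext.setLogLevel("ERROR")   # keep only error-level messages and above in the driver log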

A data scientist wants each record in the DataFrame to contain:

The entire contents of a file

The full file path

The first attempt at the code does read the text files, but each record contains a single line rather than the full text of a file. This code is shown below:

corpus = spark.read.text("/datasets/raw_txt/*") \

.select('*', '_metadata.file_path')

Which change will ensure one record per file?

Options:

A.

Add the option wholetext=True to the text() function

B.

Add the option lineSep='\n' to the text() function

C.

Add the option wholetext=False to the text() function

D.

Add the option lineSep=", " to the text() function

15 of 55.

A data engineer is working on a Streaming DataFrame (streaming_df) with the following streaming data:

id | name       | count | timestamp
---|------------|-------|-----------------
1  | Delhi      | 20    | 2024-09-19T10:11
1  | Delhi      | 50    | 2024-09-19T10:12
2  | London     | 50    | 2024-09-19T10:15
3  | Paris      | 30    | 2024-09-19T10:18
3  | Paris      | 20    | 2024-09-19T10:20
4  | Washington | 10    | 2024-09-19T10:22

Which operation is supported with streaming_df?

A.

streaming_df.count()

B.

streaming_df.filter("count < 30")

C.

streaming_df.select(countDistinct("name"))

D.

streaming_df.show()
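
A sketch of the supported operation from option B. The rate source and console sink are assumptions used so the snippet is self-contained; the question's streaming_df would come from a real source such as Kafka or files.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed stand-in source producing an incrementing "count" column.
streaming_df = (
    spark.readStream.format("rate").option("rowsPerSecond", 1).load()
    .withColumnRenamed("value", "count")
)

# filter() is a row-wise transformation and is supported on a streaming
# DataFrame; count(), show(), and a global countDistinct() as written are
# not, because they would require materializing the unbounded stream.
query = (
    streaming_df.filter("count < 30")
    .writeStream.format("console")
    .outputMode("append")
    .start()
)

query.awaitTermination(10)   # run briefly for demonstration
query.stop()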