
Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 - Databricks Certified Associate Developer for Apache Spark 3.5-Python

A data analyst builds a Spark application to analyze finance data and performs the following operations: filter, select, groupBy, and coalesce.

Which operation results in a shuffle?

A. groupBy

B. filter

C. select

D. coalesce
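
The shuffle-causing operation here is groupBy. As a minimal sketch (toy data standing in for the finance DataFrame): filter and select are row-local, coalesce only merges existing partitions without redistributing rows, while groupBy must move all rows with the same key to the same partition:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.range(100).withColumnRenamed("id", "amount")  # stand-in for finance data

print(df.filter(df.amount > 10).select("amount").rdd.getNumPartitions())  # narrow: partitioning preserved
print(df.coalesce(1).rdd.getNumPartitions())  # narrow: merges partitions in place, no shuffle
df.groupBy("amount").count().explain()  # wide: the plan contains an Exchange (the shuffle)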

A developer wants to test Spark Connect with an existing Spark application.

What are the two alternative ways the developer can start a local Spark Connect server without changing their existing application code? (Choose 2 answers)

A. Execute their pyspark shell with the option --remote "https://localhost"

B. Execute their pyspark shell with the option --remote "sc://localhost"

C. Set the environment variable SPARK_REMOTE="sc://localhost" before starting the pyspark shell

D. Add .remote("sc://localhost") to their SparkSession.builder calls in their Spark code

E. Ensure the Spark property spark.connect.grpc.binding.port is set to 15002 in the application code
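
A hedged sketch of the two code-free approaches (B and C), assuming Spark 3.4+ with the Connect module bundled; the server script shown is the one shipped in Spark's sbin directory:

# Start Spark's bundled Connect server once:
#   ./sbin/start-connect-server.sh
#
# Option B - pass the remote URL as a shell flag:
#   pyspark --remote "sc://localhost"
#
# Option C - set the environment variable before launching the shell:
#   SPARK_REMOTE="sc://localhost" pyspark
#
# In either shell, `spark` is already a Spark Connect client,
# so existing application code runs unchanged:
spark.range(5).show()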

A data engineer is reviewing a Spark application that applies several transformations to a DataFrame but notices that the job does not start executing immediately.

Which two characteristics of Apache Spark's execution model explain this behavior? (Choose 2 answers)

A. The Spark engine requires manual intervention to start executing transformations.

B. Only actions trigger the execution of the transformation pipeline.

C. Transformations are executed immediately to build the lineage graph.

D. The Spark engine optimizes the execution plan during the transformations, causing delays.

E. Transformations are evaluated lazily.
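
The two correct characteristics are that transformations are evaluated lazily and only actions trigger execution (B and E). A small illustration, assuming a local session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.range(1_000_000)
# Transformations: only recorded in the lineage graph; no job starts here.
evens = df.filter(df.id % 2 == 0).selectExpr("id * 2 AS doubled")
# Action: only now does Spark plan and execute the whole pipeline.
print(evens.count())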

A Spark application developer wants to identify which operations cause shuffling, leading to a new stage in the Spark execution plan.

Which operation results in a shuffle and a new stage?

A. DataFrame.groupBy().agg()

B. DataFrame.filter()

C. DataFrame.withColumn()

D. DataFrame.select()
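
One way to verify this, sketched on hypothetical columns: the narrow operations compile into a single stage, while DataFrame.groupBy().agg() inserts an Exchange operator into the physical plan, which is exactly the shuffle and stage boundary:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])

# Narrow chain: no Exchange appears in the physical plan.
df.filter(df.value > 1).withColumn("v2", df.value * 2).select("key", "v2").explain()

# Wide: the plan shows "Exchange hashpartitioning(key, ...)" - a shuffle and a new stage.
df.groupBy("key").agg(F.sum("value").alias("total")).explain()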

A developer notices that all the post-shuffle partitions in a dataset are smaller than the value set for spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold.

Which type of join will Adaptive Query Execution (AQE) choose in this case?

A. A Cartesian join

B. A shuffled hash join

C. A broadcast nested loop join

D. A sort-merge join
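
In this situation AQE rewrites the planned sort-merge join into a shuffled hash join. A configuration sketch (the threshold value and toy tables are illustrative only):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

spark.conf.set("spark.sql.adaptive.enabled", "true")
# Post-shuffle partitions smaller than this can be hash-joined locally.
spark.conf.set("spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold", "64MB")
# Disable broadcast joins so the effect is visible on toy data.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

left = spark.range(10_000).withColumnRenamed("id", "k")
right = spark.range(10_000).withColumnRenamed("id", "k")
# The final adaptive plan (visible once an action runs) may report ShuffledHashJoin.
left.join(right, "k").explain()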

A data engineer is building a Structured Streaming pipeline and wants the pipeline to recover from failures or intentional shutdowns by continuing where the pipeline left off.

How can this be achieved?

A. By configuring the option checkpointLocation during readStream

B. By configuring the option recoveryLocation during the SparkSession initialization

C. By configuring the option recoveryLocation during writeStream

D. By configuring the option checkpointLocation during writeStream
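
The checkpoint is configured on the write side (answer D). A minimal sketch with placeholder paths and a rate source standing in for the real input:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Placeholder source: emits rows continuously for illustration.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (
    stream.writeStream
    .format("parquet")
    .option("path", "/tmp/pipeline/output")             # placeholder output path
    .option("checkpointLocation", "/tmp/pipeline/chk")  # offsets + state for recovery
    .start()
)
# query.awaitTermination()  # block until the query is stopped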

An MLOps engineer is building a Pandas UDF that applies a language model that translates English strings into Spanish. The initial code is loading the model on every call to the UDF, which is hurting the performance of the data pipeline.

The initial code is:

import pandas as pd
import pyspark.sql.functions as sf
from pyspark.sql.types import StringType

def in_spanish_inner(df: pd.Series) -> pd.Series:
    # The model is loaded on every call to the UDF - the performance problem.
    model = get_translation_model(target_lang='es')
    return df.apply(model)

in_spanish = sf.pandas_udf(in_spanish_inner, StringType())

How can the MLOps engineer change this code to reduce how many times the language model is loaded?

A. Convert the Pandas UDF to a PySpark UDF

B. Convert the Pandas UDF from a Series → Series UDF to a Series → Scalar UDF

C. Run the in_spanish_inner() function in a mapInPandas() function call

D. Convert the Pandas UDF from a Series → Series UDF to an Iterator[Series] → Iterator[Series] UDF
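
The fix is answer D: an Iterator[pd.Series] → Iterator[pd.Series] UDF loads the model once per task and then reuses it across every batch that task processes. A sketch, reusing get_translation_model from the question:

from typing import Iterator

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def in_spanish(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Loaded once per task instead of once per batch.
    model = get_translation_model(target_lang='es')  # from the question; assumed available
    for batch in batches:
        yield batch.apply(model)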

A data engineer is running a Spark job to process a dataset of 1 TB stored in distributed storage. The cluster has 10 nodes, each with 16 CPUs. Spark UI shows:

Low number of Active Tasks

Many tasks complete in milliseconds

Fewer tasks than available CPUs

Which approach should be used to adjust the partitioning for optimal resource allocation?

A. Set the number of partitions equal to the total number of CPUs in the cluster

B. Set the number of partitions to a fixed value, such as 200

C. Set the number of partitions equal to the number of nodes in the cluster

D. Set the number of partitions by dividing the dataset size (1 TB) by a reasonable partition size, such as 128 MB
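
The arithmetic behind answer D, as a quick sketch: 1 TB at roughly 128 MB per partition gives 8192 partitions, comfortably above the 160 cores (10 nodes x 16 CPUs), so every core stays busy with reasonably sized tasks:

dataset_bytes = 1 * 1024**4        # 1 TB
partition_bytes = 128 * 1024**2    # ~128 MB target per partition
num_partitions = dataset_bytes // partition_bytes
print(num_partitions)              # 8192, vs. only 10 * 16 = 160 CPUs

# Applied to a DataFrame (df assumed to exist):
# df = df.repartition(num_partitions)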

A data engineer is working on the DataFrame:

(Referring to the table image: it has columns Id, Name, count, and timestamp.)

Which code fragment should the engineer use to extract the unique values in the Name column into an alphabetically ordered list?

A. df.select("Name").orderBy(df["Name"].asc())

B. df.select("Name").distinct().orderBy(df["Name"])

C. df.select("Name").distinct()

D. df.select("Name").distinct().orderBy(df["Name"].desc())
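
Answer B, sketched on a toy DataFrame with the columns from the image (values invented for illustration); distinct() removes duplicates and orderBy() sorts ascending by default:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [(1, "Bob", 3, "2024-01-01"), (2, "Alice", 5, "2024-01-02"), (3, "Bob", 2, "2024-01-03")],
    ["Id", "Name", "count", "timestamp"],
)

names = [r["Name"] for r in df.select("Name").distinct().orderBy("Name").collect()]
print(names)  # ['Alice', 'Bob']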

A developer runs:

(Referring to the code image: a DataFrame write to Parquet partitioned by the color and fruit columns.)

What is the result?

A. It stores all data in a single Parquet file.

B. It throws an error if there are null values in either partition column.

C. It appends new partitions to an existing Parquet file.

D. It creates separate directories for each unique combination of color and fruit.
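
The original code is not reproduced above; a hedged reconstruction consistent with the options (DataFrame, values, and path are placeholders) is a Parquet write partitioned by color and fruit, which yields one directory per unique value combination (answer D):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [("red", "apple"), ("red", "cherry"), ("green", "apple")],
    ["color", "fruit"],
)

df.write.partitionBy("color", "fruit").mode("overwrite").parquet("/tmp/fruit")
# Resulting layout:
#   /tmp/fruit/color=red/fruit=apple/...
#   /tmp/fruit/color=red/fruit=cherry/...
#   /tmp/fruit/color=green/fruit=apple/...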