New Year Sale Special Limited Time 70% Discount Offer - Ends in 0d 00h 00m 00s - Coupon code: xmas50

Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 - Databricks Certified Associate Developer for Apache Spark 3.5 – Python

40 of 55.

A developer wants to refactor older Spark code to take advantage of built-in functions introduced in Spark 3.5.

The original code:

from pyspark.sql import functions as F

min_price = 110.50

result_df = prices_df.filter(F.col("price") > min_price).agg(F.count("*"))

Which code block should the developer use to refactor the code?

A.

result_df = prices_df.filter(F.col("price") > F.lit(min_price)).agg(F.count("*"))

B.

result_df = prices_df.where(F.lit("price") > min_price).groupBy().count()

C.

result_df = prices_df.withColumn("valid_price", when(col("price") > F.lit(min_price), True))

D.

result_df = prices_df.filter(F.lit(min_price) > F.col("price")).count()

47 of 55.

A data engineer has written the following code to join two DataFrames df1 and df2:

df1 = spark.read.csv("sales_data.csv")

df2 = spark.read.csv("product_data.csv")

df_joined = df1.join(df2, df1.product_id == df2.product_id)

The DataFrame df1 contains ~10 GB of sales data, and df2 contains ~8 MB of product data.

Which join strategy will Spark use?

A.

Shuffle join, as the size difference between df1 and df2 is too large for a broadcast join to work efficiently.

B.

Shuffle join, because AQE is not enabled, and Spark uses a static query plan.

C.

Shuffle join because no broadcast hints were provided.

D.

Broadcast join, as df2 is smaller than the default broadcast threshold.

20 of 55.

What is the difference between df.cache() and df.persist() in Spark DataFrame?

A.

Both functions perform the same operation. The persist() function provides improved performance as its default storage level is DISK_ONLY.

B.

persist() — Persists the DataFrame with the default storage level (MEMORY_AND_DISK_DESER), and cache() — Can be used to set different storage levels.

C.

Both cache() and persist() can be used to set the default storage level (MEMORY_AND_DISK_DESER).

D.

cache() — Persists the DataFrame with the default storage level (MEMORY_AND_DISK_DESER), and persist() — Can be used to set different storage levels to persist the contents of the DataFrame.

A data engineer is streaming data from Kafka and requires:

Minimal latency

Exactly-once processing guarantees

Which trigger mode should be used?

A.

.trigger(processingTime='1 second')

B.

.trigger(continuous=True)

C.

.trigger(continuous='1 second')

D.

.trigger(availableNow=True)

You have:

DataFrame A: 128 GB of transactions

DataFrame B: 1 GB user lookup table

Which strategy is correct for broadcasting?

A.

DataFrame B should be broadcasted because it is smaller and will eliminate the need for shuffling itself

B.

DataFrame B should be broadcasted because it is smaller and will eliminate the need for shuffling DataFrame A

C.

DataFrame A should be broadcasted because it is larger and will eliminate the need for shuffling DataFrame B

D.

DataFrame A should be broadcasted because it is smaller and will eliminate the need for shuffling itself

25 of 55.

A Data Analyst is working on employees_df and needs to add a new column where a 10% tax is calculated on the salary.

Additionally, the DataFrame contains the column age, which is not needed.

Which code fragment adds the tax column and removes the age column?

A.

employees_df = employees_df.withColumn("tax", col("salary") * 0.1).drop("age")

B.

employees_df = employees_df.withColumn("tax", lit(0.1)).drop("age")

C.

employees_df = employees_df.dropField("age").withColumn("tax", col("salary") * 0.1)

D.

employees_df = employees_df.withColumn("tax", col("salary") + 0.1).drop("age")

Which feature of Spark Connect is considered when designing an application to enable remote interaction with the Spark cluster?

A.

It provides a way to run Spark applications remotely in any programming language

B.

It can be used to interact with any remote cluster using the REST API

C.

It allows for remote execution of Spark jobs

D.

It is primarily used for data ingestion into Spark from external sources

A Spark engineer must select an appropriate deployment mode for the Spark jobs.

What is the benefit of using cluster mode in Apache Sparkâ„¢?

A.

In cluster mode, resources are allocated from a resource manager on the cluster, enabling better performance and scalability for large jobs

B.

In cluster mode, the driver is responsible for executing all tasks locally without distributing them across the worker nodes.

C.

In cluster mode, the driver runs on the client machine, which can limit the application's ability to handle large datasets efficiently.

D.

In cluster mode, the driver program runs on one of the worker nodes, allowing the application to fully utilize the distributed resources of the cluster.

28 of 55.

A data analyst builds a Spark application to analyze finance data and performs the following operations:

filter, select, groupBy, and coalesce.

Which operation results in a shuffle?

A.

filter

B.

select

C.

groupBy

D.

coalesce

46 of 55.

A data engineer is implementing a streaming pipeline with watermarking to handle late-arriving records.

The engineer has written the following code:

inputStream \

.withWatermark("event_time", "10 minutes") \

.groupBy(window("event_time", "15 minutes"))

What happens to data that arrives after the watermark threshold?

A.

Any data arriving more than 10 minutes after the watermark threshold will be ignored and not included in the aggregation.

B.

Records that arrive later than the watermark threshold (10 minutes) will automatically be included in the aggregation if they fall within the 15-minute window.

C.

Data arriving more than 10 minutes after the latest watermark will still be included in the aggregation but will be placed into the next window.

D.

The watermark ensures that late data arriving within 10 minutes of the latest event time will be processed and included in the windowed aggregation.