
Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 - Databricks Certified Associate Developer for Apache Spark 3.5-Python

An engineer has two DataFrames: df1 (small) and df2 (large). A broadcast join is used:


from pyspark.sql.functions import broadcast

result = df2.join(broadcast(df1), on='id', how='inner')

What is the purpose of using broadcast() in this scenario?

Options:

A.

It filters the id values before performing the join.

B.

It increases the partition size for df1 and df2.

C.

It reduces the number of shuffle operations by replicating the smaller DataFrame to all nodes.

D.

It ensures that the join happens only when the id values are identical.
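
For reference, a minimal runnable sketch of the broadcast join above, assuming a local SparkSession and small illustrative data (the DataFrame contents and column names are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

# Hypothetical data: df1 is the small lookup table, df2 is the large fact table
df1 = spark.createDataFrame([(1, "US"), (2, "DE")], ["id", "country"])
df2 = spark.createDataFrame([(1, 100.0), (1, 250.0), (2, 75.0)], ["id", "amount"])

# broadcast(df1) hints Spark to ship the small DataFrame to every executor,
# so each node joins locally instead of shuffling df2 by 'id'
result = df2.join(broadcast(df1), on='id', how='inner')
result.show()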

A data engineer is working on a real-time analytics pipeline using Apache Spark Structured Streaming. The engineer wants to process incoming data and ensure that triggers control when the query is executed. The system needs to process data in micro-batches with a fixed interval of 5 seconds.

Which code snippet could the data engineer use to fulfil this requirement?

(Code snippets A–D appear as images in the original question and are summarized in the options below.)

Options:

A.

Uses trigger(continuous='5 seconds') – continuous processing mode.

B.

Uses trigger() – default micro-batch trigger without interval.

C.

Uses trigger(processingTime='5 seconds') – correct micro-batch trigger with interval.

D.

Uses trigger(processingTime=5000) – invalid, as processingTime expects a string.
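
A minimal sketch of the micro-batch trigger described in option C, assuming a streaming DataFrame named streaming_df (hypothetical) has already been defined:

query = streaming_df.writeStream \
    .format("console") \
    .outputMode("append") \
    .trigger(processingTime='5 seconds') \
    .start()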

A data engineer uses a broadcast variable to share a DataFrame containing millions of rows across executors for lookup purposes. What will be the outcome?

A.

The job may fail if the memory on each executor is not large enough to accommodate the DataFrame being broadcasted

B.

The job may fail if the executors do not have enough CPU cores to process the broadcasted dataset

C.

The job will hang indefinitely as Spark will struggle to distribute and serialize such a large broadcast variable to all executors

D.

The job may fail because the driver does not have enough CPU cores to serialize the large DataFrame
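
A minimal sketch of the broadcast-variable pattern under discussion, shown here with a deliberately small lookup table (broadcasting millions of rows risks exhausting executor memory, as option A describes); the table contents are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-var-sketch").getOrCreate()
sc = spark.sparkContext

# Hypothetical lookup data collected on the driver and broadcast to all executors
lookup = {1: "US", 2: "DE"}
bc_lookup = sc.broadcast(lookup)

rdd = sc.parallelize([(1, 100.0), (2, 75.0)])

# Each executor reads its local broadcast copy instead of shuffling the lookup data
labeled = rdd.map(lambda kv: (bc_lookup.value.get(kv[0]), kv[1]))
print(labeled.collect())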

A data engineer is building an Apache Spark™ Structured Streaming application to process a stream of JSON events in real time. The engineer wants the application to be fault-tolerant and resume processing from the last successfully processed record in case of a failure. To achieve this, the data engineer decides to implement checkpoints.

Which code snippet should the data engineer use?

A.

query = streaming_df.writeStream \

.format("console") \

.option("checkpoint", "/path/to/checkpoint") \

.outputMode("append") \

.start()

B.

query = streaming_df.writeStream \

.format("console") \

.outputMode("append") \

.option("checkpointLocation", "/path/to/checkpoint") \

.start()

C.

query = streaming_df.writeStream \

.format("console") \

.outputMode("complete") \

.start()

D.

query = streaming_df.writeStream \

.format("console") \

.outputMode("append") \

.start()
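
For context, a fuller sketch of option B with a hypothetical JSON source, showing where checkpointLocation fits in an end-to-end query (the schema, input path, and checkpoint path are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("checkpoint-sketch").getOrCreate()

# Hypothetical schema and input directory for the JSON event stream
schema = StructType([
    StructField("event_id", StringType()),
    StructField("ts", LongType()),
])

streaming_df = spark.readStream \
    .schema(schema) \
    .json("/path/to/events")

# checkpointLocation persists offsets and state so the query can resume
# from the last successfully processed record after a failure
query = streaming_df.writeStream \
    .format("console") \
    .outputMode("append") \
    .option("checkpointLocation", "/path/to/checkpoint") \
    .start()

query.awaitTermination()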

A data scientist is working on a project that requires processing large amounts of structured data, performing SQL queries, and applying machine learning algorithms. The data scientist is considering using Apache Spark for this task.

Which combination of Apache Spark modules should the data scientist use in this scenario?

Options:

A.

Spark DataFrames, Structured Streaming, and GraphX

B.

Spark SQL, Pandas API on Spark, and Structured Streaming

C.

Spark Streaming, GraphX, and Pandas API on Spark

D.

Spark DataFrames, Spark SQL, and MLlib
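
A minimal sketch combining the modules in option D (DataFrames, Spark SQL, and MLlib); the data and feature columns are hypothetical:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("modules-sketch").getOrCreate()

# Hypothetical structured data loaded into a DataFrame
df = spark.createDataFrame(
    [(1.0, 2.0, 5.0), (2.0, 3.0, 8.0), (3.0, 4.0, 11.0)],
    ["x1", "x2", "y"],
)

# Spark SQL query over the same data
df.createOrReplaceTempView("samples")
spark.sql("SELECT COUNT(*) AS n FROM samples").show()

# MLlib: assemble feature vectors and fit a simple linear regression
features = VectorAssembler(inputCols=["x1", "x2"], outputCol="features").transform(df)
model = LinearRegression(featuresCol="features", labelCol="y").fit(features)
print(model.coefficients)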