
Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 - Databricks Certified Associate Developer for Apache Spark 3.5-Python

Which command overwrites an existing JSON file when writing a DataFrame?

A.

df.write.mode("overwrite").json("path/to/file")

B.

df.write.overwrite.json("path/to/file")

C.

df.write.json("path/to/file", overwrite=True)

D.

df.write.format("json").save("path/to/file", mode="overwrite")

A data scientist is working with a Spark DataFrame called customerDF that contains customer information. The DataFrame has a column named email with customer email addresses. The data scientist needs to split this column into username and domain parts.

Which code snippet splits the email column into username and domain columns?

A.

customerDF.select(
    col("email").substr(0, 5).alias("username"),
    col("email").substr(-5).alias("domain")
)

B.

customerDF.withColumn("username", split(col("email"), "@").getItem(0)) \

.withColumn("domain", split(col("email"), "@").getItem(1))

C.

customerDF.withColumn("username", substring_index(col("email"), "@", 1)) \

.withColumn("domain", substring_index(col("email"), "@", -1))

D.

customerDF.select(
    regexp_replace(col("email"), "@", "").alias("username"),
    regexp_replace(col("email"), "@", "").alias("domain")
)
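For context, a runnable sketch of the split-based approach (the sample row is invented; the customerDF in the question would already exist). For well-formed addresses, substring_index with indices 1 and -1 yields the same two parts:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

spark = SparkSession.builder.getOrCreate()

# Hypothetical single-row DataFrame standing in for customerDF.
customerDF = spark.createDataFrame([("alice@example.com",)], ["email"])

result = (customerDF
    .withColumn("username", split(col("email"), "@").getItem(0))
    .withColumn("domain", split(col("email"), "@").getItem(1)))
result.show()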

An engineer has a large ORC file located at /file/test_data.orc and wants to read only specific columns to reduce memory usage.

Which code fragment will select only the specified columns, i.e., col1 and col2, during the reading process?

A.

spark.read.orc("/file/test_data.orc").filter("col1 = 'value' ").select("col2")

B.

spark.read.format("orc").select("col1", "col2").load("/file/test_data.orc")

C.

spark.read.orc("/file/test_data.orc").selected("col1", "col2")

D.

spark.read.format("orc").load("/file/test_data.orc").select("col1", "col2")

A data engineer is working with a large JSON dataset containing order information. The dataset is stored in a distributed file system and needs to be loaded into a Spark DataFrame for analysis. The data engineer wants to ensure that the schema is correctly defined and that the data is read efficiently.

Which approach should the data engineer use to efficiently load the JSON data into a Spark DataFrame with a predefined schema?

A.

Use spark.read.json() to load the data, then use DataFrame.printSchema() to view the inferred schema, and finally use DataFrame.cast() to modify column types.

B.

Use spark.read.json() with the inferSchema option set to true

C.

Use spark.read.format("json").load() and then use DataFrame.withColumn() to cast each column to the desired data type.

D.

Define a StructType schema and use spark.read.schema(predefinedSchema).json() to load the data.
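A minimal sketch of reading JSON with a predefined schema; the field names and input path are illustrative assumptions, not part of the question:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.getOrCreate()

# Hypothetical order schema for illustration only.
order_schema = StructType([
    StructField("order_id", StringType(), True),
    StructField("customer_id", StringType(), True),
    StructField("amount", DoubleType(), True),
    StructField("order_ts", TimestampType(), True),
])

# Supplying the schema up front avoids the extra pass over the files
# that schema inference would otherwise require.
orders_df = spark.read.schema(order_schema).json("/data/orders/")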

In the code block below, aggDF contains aggregations on a streaming DataFrame:

Which output mode at line 3 ensures that the entire result table is written to the console during each trigger execution?

A.

complete

B.

append

C.

replace

D.

aggregate
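The original code block is not reproduced above; the sketch below is a representative streaming aggregation written to the console, with the output mode on its own line (sink and trigger details are assumptions):

# aggDF is assumed to be a streaming DataFrame built with groupBy(...).agg(...).
query = (aggDF.writeStream
         .format("console")
         .outputMode("complete")  # rewrites the entire result table on every trigger
         .start())

# For comparison: "append" emits only rows added since the last trigger,
# and "update" emits only rows that changed; neither shows the full table.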

A Spark developer wants to improve the performance of an existing PySpark UDF that runs a hash function that is not available in the standard Spark functions library. The existing UDF code is:

import hashlib
import pyspark.sql.functions as sf
from pyspark.sql.types import StringType

def shake_256(raw):
    return hashlib.shake_256(raw.encode()).hexdigest(20)

shake_256_udf = sf.udf(shake_256, StringType())

The developer wants to replace this existing UDF with a Pandas UDF to improve performance. The developer changes the definition of shake_256_udf to this:

shake_256_udf = sf.pandas_udf(shake_256, StringType())

However, the developer receives an error.

What should the signature of the shake_256() function be changed to in order to fix this error?

A.

def shake_256(df: pd.Series) -> str:

B.

def shake_256(df: Iterator[pd.Series]) -> Iterator[pd.Series]:

C.

def shake_256(raw: str) -> str:

D.

def shake_256(df: pd.Series) -> pd.Series:
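One way to express this as a Series-to-Series pandas UDF is sketched below: the function receives a pd.Series per batch and must return a pd.Series of equal length (the element-wise map is an assumption about how the hash would be applied):

import hashlib
import pandas as pd
import pyspark.sql.functions as sf
from pyspark.sql.types import StringType

@sf.pandas_udf(StringType())
def shake_256_udf(raw: pd.Series) -> pd.Series:
    # Apply the hash to each element of the batch and return a Series.
    return raw.map(lambda s: hashlib.shake_256(s.encode()).hexdigest(20))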

Given the schema:

event_ts TIMESTAMP,
sensor_id STRING,
metric_value LONG,
ingest_ts TIMESTAMP,
source_file_path STRING

The goal is to deduplicate based on: event_ts, sensor_id, and metric_value.

Options:

A.

dropDuplicates on all columns (wrong criteria)

B.

dropDuplicates with no arguments (removes based on all columns)

C.

groupBy without aggregation (invalid use)

D.

dropDuplicates on the exact matching fields
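A minimal sketch of deduplicating on a subset of columns (events_df is a hypothetical DataFrame with the schema shown above):

# Keeps one row per (event_ts, sensor_id, metric_value), regardless of
# differences in ingest_ts or source_file_path.
deduped_df = events_df.dropDuplicates(["event_ts", "sensor_id", "metric_value"])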

What is the difference between df.cache() and df.persist() for a Spark DataFrame?

A.

Both cache() and persist() can be used to set the default storage level (MEMORY_AND_DISK_SER)

B.

Both functions perform the same operation. The persist() function provides improved performance as its default storage level is DISK_ONLY.

C.

persist() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK_SER), and cache() - Can be used to set different storage levels to persist the contents of the DataFrame.

D.

cache() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK), and persist() - Can be used to set different storage levels to persist the contents of the DataFrame
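A minimal sketch of the API difference (the exact name of the default storage level varies across Spark versions, so it is not asserted here):

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()

df1 = spark.range(1_000_000)
df2 = spark.range(1_000_000)

df1.cache()                           # cache() always uses the default storage level
df2.persist(StorageLevel.DISK_ONLY)   # persist() optionally takes an explicit storage level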

A data analyst wants to add a column date derived from a timestamp column.

Options:

A.

dates_df.withColumn("date", f.unix_timestamp("timestamp")).show()

B.

dates_df.withColumn("date", f.to_date("timestamp")).show()

C.

dates_df.withColumn("date", f.date_format("timestamp", "yyyy-MM-dd")).show()

D.

dates_df.withColumn("date", f.from_unixtime("timestamp")).show()

A data engineer is running a batch processing job on a Spark cluster with the following configuration:

10 worker nodes

16 CPU cores per worker node

64 GB RAM per node

The data engineer wants to allocate four executors per node, each executor using four cores.

What is the total number of CPU cores used by the application?

A.

160

B.

64

C.

80

D.

40
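A quick worked check of the arithmetic implied by the configuration above:

nodes = 10
executors_per_node = 4
cores_per_executor = 4

total_executors = nodes * executors_per_node         # 10 * 4 = 40 executors
total_cores = total_executors * cores_per_executor   # 40 * 4 = 160 cores
print(total_executors, total_cores)                  # 40 160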