Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 - Databricks Certified Associate Developer for Apache Spark 3.5-Python
Total 85 questions
Which command overwrites an existing JSON file when writing a DataFrame?
A data scientist is working with a Spark DataFrame called customerDF that contains customer information.The DataFrame has a column named email with customer email addresses. The data scientist needs to split this column into username and domain parts.
Which code snippet splits the email column into username and domain columns?
An engineer has a large ORC file located at/file/test_data.orcand wants to read only specific columns to reduce memory usage.
Which code fragment will select the columns, i.e.,col1,col2, during the reading process?
A data engineer is working with a large JSON dataset containing order information. The dataset is stored in a distributed file system and needs to be loaded into a Spark DataFrame for analysis. The data engineer wants to ensure that the schema is correctly defined and that the data is read efficiently.
Which approach should the data scientist use to efficiently load the JSON data into a Spark DataFrame with a predefined schema?
In the code block below,aggDFcontains aggregations on a streaming DataFrame:
Which output mode at line 3 ensures that the entire result table is written to the console during each trigger execution?
A Spark developer wants to improve the performance of an existing PySpark UDF that runs a hash function that is not available in the standard Spark functions library. The existing UDF code is:
import hashlib
import pyspark.sql.functions as sf
from pyspark.sql.types import StringType
def shake_256(raw):
return hashlib.shake_256(raw.encode()).hexdigest(20)
shake_256_udf = sf.udf(shake_256, StringType())
The developer wants to replace this existing UDF with a Pandas UDF to improve performance. The developer changes the definition ofshake_256_udfto this:CopyEdit
shake_256_udf = sf.pandas_udf(shake_256, StringType())
However, the developer receives the error:
What should the signature of theshake_256()function be changed to in order to fix this error?
Given the schema:
event_ts TIMESTAMP,
sensor_id STRING,
metric_value LONG,
ingest_ts TIMESTAMP,
source_file_path STRING
The goal is to deduplicate based on: event_ts, sensor_id, and metric_value.
Options:
What is the difference betweendf.cache()anddf.persist()in Spark DataFrame?
A data analyst wants to add a column date derived from a timestamp column.
Options:
A data engineer is running a batch processing job on a Spark cluster with the following configuration:
10 worker nodes
16 CPU cores per worker node
64 GB RAM per node
The data engineer wants to allocate four executors per node, each executor using four cores.
What is the total number of CPU cores used by the application?