Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 - Databricks Certified Associate Developer for Apache Spark 3.0 Exam

Total 180 questions

Which of the following code blocks returns a DataFrame that matches the multi-column DataFrame itemsDf, except that integer column itemId has been converted into a string column?

itemsDf.withColumn("itemId", convert("itemId", "string"))

itemsDf.withColumn("itemId", col("itemId").cast("string"))

(Correct)

itemsDf.select(cast("itemId", "string"))

itemsDf.withColumn("itemId", col("itemId").convert("string"))

spark.cast(itemsDf, "itemId", "string")

Question # 22

Which of the following statements about the differences between actions and transformations is correct?

Actions are evaluated lazily, while transformations are not evaluated lazily.

Actions generate RDDs, while transformations do not.

Actions do not send results to the driver, while transformations do.

Actions can be queued for delayed execution, while transformations can only be processed immediately.

Actions can trigger Adaptive Query Execution, while transformations cannot.

Question # 23

Which of the following is the idea behind dynamic partition pruning in Spark?

Dynamic partition pruning is intended to skip over the data you do not need in the results of a query.

Dynamic partition pruning concatenates columns of similar data types to optimize join performance.

Dynamic partition pruning performs wide transformations on disk instead of in memory.

Dynamic partition pruning reoptimizes physical plans based on data types and broadcast variables.

Dynamic partition pruning reoptimizes query plans based on runtime statistics collected during query execution.

Question # 24

Which of the following code blocks generally causes a great amount of network traffic?

DataFrame.select()

DataFrame.coalesce()

DataFrame.collect()

DataFrame.rdd.map()

DataFrame.count()

Question # 25

Which of the following code blocks returns a copy of DataFrame transactionsDf that only includes columns transactionId, storeId, productId and f?

Sample of DataFrame transactionsDf:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
+-------------+---------+-----+-------+---------+----+

transactionsDf.drop(col("value"), col("predError"))

transactionsDf.drop("predError", "value")

transactionsDf.drop(value, predError)

transactionsDf.drop(["predError", "value"])

transactionsDf.drop([col("predError"), col("value")])

Question # 26

Which of the following code blocks writes DataFrame itemsDf to disk at storage location filePath, making sure to substitute any existing data at that location?

itemsDf.write.mode("overwrite").parquet(filePath)

itemsDf.write.option("parquet").mode("overwrite").path(filePath)

itemsDf.write(filePath, mode="overwrite")

itemsDf.write.mode("overwrite").path(filePath)

itemsDf.write().parquet(filePath, mode="overwrite")

Question # 27

Which of the following code blocks prints out in how many rows the expression Inc. appears in the string-type column supplier of DataFrame itemsDf?

counter = 0
for index, row in itemsDf.iterrows():
    if 'Inc.' in row['supplier']:
        counter = counter + 1
print(counter)

counter = 0
def count(x):
    if 'Inc.' in x['supplier']:
        counter = counter + 1
itemsDf.foreach(count)
print(counter)

print(itemsDf.foreach(lambda x: 'Inc.' in x))

print(itemsDf.foreach(lambda x: 'Inc.' in x).sum())

accum = sc.accumulator(0)
def check_if_inc_in_supplier(row):
    if 'Inc.' in row['supplier']:
        accum.add(1)
itemsDf.foreach(check_if_inc_in_supplier)
print(accum.value)

Question # 28

Which of the following code blocks returns a single row from DataFrame transactionsDf?

Full DataFrame transactionsDf:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+

transactionsDf.where(col("storeId").between(3,25))

transactionsDf.filter((col("storeId")!=25) | (col("productId")==2))

transactionsDf.filter(col("storeId")==25).select("predError","storeId").distinct()

transactionsDf.select("productId", "storeId").where("storeId == 2 OR storeId != 25")

transactionsDf.where(col("value").isNull()).select("productId", "storeId").distinct()

Question # 29

Which of the following code blocks returns a new DataFrame with only columns predError and value, containing every second row of DataFrame transactionsDf?

Entire DataFrame transactionsDf:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+

transactionsDf.filter(col("transactionId").isin([3,4,6])).select([predError, value])

transactionsDf.select(col("transactionId").isin([3,4,6]), "predError", "value")

transactionsDf.filter("transactionId" % 2 == 0).select("predError", "value")

transactionsDf.filter(col("transactionId") % 2 == 0).select("predError", "value")

(Correct)

transactionsDf.createOrReplaceTempView("transactionsDf")
spark.sql("FROM transactionsDf SELECT predError, value WHERE transactionId % 2 = 2")

transactionsDf.filter(col(transactionId).isin([3,4,6]))

Question # 30

The code block shown below should return a new 2-column DataFrame that shows one attribute from column attributes per row next to the associated itemName, for all suppliers in column supplier whose name includes Sports. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Sample of DataFrame itemsDf:

+------+----------------------------------+-----------------------------+-------------------+
|itemId|itemName                          |attributes                   |supplier           |
+------+----------------------------------+-----------------------------+-------------------+
+------+----------------------------------+-----------------------------+-------------------+

Code block:

itemsDf.__1__(__2__).select(__3__, __4__)

1. filter
2. col("supplier").isin("Sports")
3. "itemName"
4. explode(col("attributes"))

1. where
2. col("supplier").contains("Sports")
3. "itemName"
4. "attributes"

1. where
2. col(supplier).contains("Sports")
3. explode(attributes)
4. itemName

1. where
2. "Sports".isin(col("Supplier"))
3. "itemName"
4. array_explode("attributes")

1. filter
2. col("supplier").contains("Sports")
3. "itemName"
4. explode("attributes")

Explanation:

Output of correct code block:

+----------------------------------+------+
|itemName                          |col   |
+----------------------------------+------+
|Thick Coat for Walking in the Snow|blue  |
|Thick Coat for Walking in the Snow|winter|
|Thick Coat for Walking in the Snow|cozy  |
|Outdoors Backpack                 |green |
|Outdoors Backpack                 |summer|
|Outdoors Backpack                 |travel|
+----------------------------------+------+

The key to solving this question is knowing about Spark's explode operator. Using this operator, you can extract values from arrays into single rows. The following guidance steps through the answers systematically from the first to the last gap. Note that there are many ways to solve gap questions and filter out wrong answers: you do not always have to start with the first gap, but can also exclude some answers based on obvious problems you see with them.

The answers to the first gap present you with two options: filter and where. These two are synonyms in PySpark (where is an alias for filter), so using either of them is fine. The answer options to this gap therefore do not help us in selecting the right answer.

The second gap is more interesting. One answer option includes "Sports".isin(col("Supplier")). This construct does not work, since Python strings do not have an isin method. Another option contains col(supplier). Here, Python will try to interpret supplier as a variable. We have not set this variable, so this is not a viable answer. You are then left with answer options that include col("supplier").contains("Sports") and col("supplier").isin("Sports"). The question states that we are looking for suppliers whose name includes Sports, so we have to go for the contains operator here.

We would use the isin operator if we wanted to keep only supplier names that exactly match an entry in a list of supplier names.

Finally, we are left with two answers that both fill the third gap with "itemName" and fill the fourth gap with either explode("attributes") or "attributes". While both are correct Spark syntax, only explode("attributes") will help us achieve our goal. Specifically, the question asks for one attribute from column attributes per row, which is what the explode() operator does.

One answer option also includes array_explode(), which is not a valid function in PySpark.

More info: pyspark.sql.functions.explode - PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3, question 39 (Databricks import instructions)
