Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 - Databricks Certified Associate Developer for Apache Spark 3.0 Exam

In which order should the code blocks shown below be run to create a DataFrame that shows the mean of column predError of DataFrame transactionsDf per column storeId and productId, where productId should be either 2 or 3 and the returned DataFrame should be sorted in ascending order by column storeId, leaving out any nulls in that column?

DataFrame transactionsDf:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+

1. .mean("predError")

2. .groupBy("storeId")

3. .orderBy("storeId")

4. transactionsDf.filter(transactionsDf.storeId.isNotNull())

5. .pivot("productId", [2, 3])

A.

4, 5, 2, 3, 1

B.

4, 2, 1

C.

4, 1, 5, 2, 3

D.

4, 2, 5, 1, 3

E.

4, 3, 2, 5, 1
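
To make the target of this question concrete, the sketch below reproduces the described result in plain Python, without Spark. It is an emulation under stated assumptions, not PySpark code: the helper names mean_or_none and pivoted_mean are hypothetical, and nulls are modelled as None. The mean skips null values and yields null when no non-null value exists, which is how SQL averages behave.

```python
from collections import defaultdict

# Rows of transactionsDf as
# (transactionId, predError, value, storeId, productId, f); None stands for null.
rows = [
    (1, 3, 4, 25, 1, None),
    (2, 6, 7, 2, 2, None),
    (3, 3, None, 25, 3, None),
    (4, None, None, 3, 2, None),
    (5, None, None, None, 2, None),
    (6, 3, 2, 25, 2, None),
]


def mean_or_none(values):
    """Average the non-null values; return None when there are none."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


def pivoted_mean(rows, product_ids=(2, 3)):
    """Hypothetical helper: mean predError per storeId, one column per productId."""
    cells = defaultdict(list)
    store_ids = set()
    for _tid, pred_error, _value, store_id, product_id, _f in rows:
        if store_id is None:
            continue  # leave out rows whose storeId is null
        store_ids.add(store_id)
        if product_id in product_ids:
            cells[(store_id, product_id)].append(pred_error)
    # One output row per storeId in ascending order.
    return [
        (s, *(mean_or_none(cells.get((s, p), [])) for p in product_ids))
        for s in sorted(store_ids)
    ]


result = pivoted_mean(rows)
for row in result:
    print(row)
```

Each printed tuple is (storeId, mean for productId 2, mean for productId 3). Store 3 keeps its row even though its only predError is null, so both of its means come out as None.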

Which of the following code blocks reads JSON file imports.json into a DataFrame?

A.

spark.read().mode("json").path("/FileStore/imports.json")

B.

spark.read.format("json").path("/FileStore/imports.json")

C.

spark.read("json", "/FileStore/imports.json")

D.

spark.read.json("/FileStore/imports.json")

E.

spark.read().json("/FileStore/imports.json")

Which of the following code blocks stores DataFrame itemsDf in executor memory and, if insufficient memory is available, serializes it and saves it to disk?

A.

itemsDf.persist(StorageLevel.MEMORY_ONLY)

B.

itemsDf.cache(StorageLevel.MEMORY_AND_DISK)

C.

itemsDf.store()

D.

itemsDf.cache()

E.

itemsDf.write.option('destination', 'memory').save()

The code block shown below should return a DataFrame with columns transactionId, predError, value, and f from DataFrame transactionsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__(__2__)

A.

1. filter

2. "transactionId", "predError", "value", "f"

B.

1. select

2. "transactionId, predError, value, f"

C.

1. select

2. ["transactionId", "predError", "value", "f"]

D.

1. where

2. col("transactionId"), col("predError"), col("value"), col("f")

E.

1. select

2. col(["transactionId", "predError", "value", "f"])

The code block displayed below contains an error. The code block should trigger Spark to cache DataFrame transactionsDf in executor memory where available, writing to disk where insufficient executor memory is available, in a fault-tolerant way. Find the error.

Code block:

transactionsDf.persist(StorageLevel.MEMORY_AND_DISK)

A.

Caching is not supported in Spark; data are always recomputed.

B.

Data caching capabilities can be accessed through the spark object, but not through the DataFrame API.

C.

The storage level is inappropriate for fault-tolerant storage.

D.

The code block uses the wrong operator for caching.

E.

The DataFrameWriter needs to be invoked.

Which of the following statements about Spark's configuration properties is incorrect?

A.

The maximum number of tasks that an executor can process at the same time is controlled by the spark.task.cpus property.

B.

The maximum number of tasks that an executor can process at the same time is controlled by the spark.executor.cores property.

C.

The default value for spark.sql.autoBroadcastJoinThreshold is 10MB.

D.

The default number of partitions to use when shuffling data for joins or aggregations is 300.

E.

The default number of partitions returned from certain transformations can be controlled by the spark.default.parallelism property.

Which of the following statements about stages is correct?

A.

Different stages in a job may be executed in parallel.

B.

Stages consist of one or more jobs.

C.

Stages ephemerally store transactions, before they are committed through actions.

D.

Tasks in a stage may be executed by multiple machines at the same time.

E.

Stages may contain multiple actions, narrow, and wide transformations.

Which of the following code blocks returns a DataFrame where columns predError and productId are removed from DataFrame transactionsDf?

Sample of DataFrame transactionsDf:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|f   |
+-------------+---------+-----+-------+---------+----+
|1            |3        |4    |25     |1        |null|
|2            |6        |7    |2      |2        |null|
|3            |3        |null |25     |3        |null|
+-------------+---------+-----+-------+---------+----+

A.

transactionsDf.withColumnRemoved("predError", "productId")

B.

transactionsDf.drop(["predError", "productId", "associateId"])

C.

transactionsDf.drop("predError", "productId", "associateId")

D.

transactionsDf.dropColumns("predError", "productId", "associateId")

E.

transactionsDf.drop(col("predError", "productId"))

Which of the following code blocks returns a one-column DataFrame for which every row contains an array of all integer numbers from 0 up to and including the number given in column predError of DataFrame transactionsDf, and null if predError is null?

Sample of DataFrame transactionsDf:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+

A.

def count_to_target(target):
    if target is None:
        return

    result = [range(target)]
    return result

count_to_target_udf = udf(count_to_target, ArrayType[IntegerType])

transactionsDf.select(count_to_target_udf(col('predError')))

B.

def count_to_target(target):
    if target is None:
        return

    result = list(range(target))
    return result

transactionsDf.select(count_to_target(col('predError')))

C.

def count_to_target(target):
    if target is None:
        return

    result = list(range(target))
    return result

count_to_target_udf = udf(count_to_target, ArrayType(IntegerType()))

transactionsDf.select(count_to_target_udf('predError'))

(Correct)

D.

def count_to_target(target):
    result = list(range(target))
    return result

count_to_target_udf = udf(count_to_target, ArrayType(IntegerType()))

df = transactionsDf.select(count_to_target_udf('predError'))

E.

def count_to_target(target):
    if target is None:
        return

    result = list(range(target))
    return result

count_to_target_udf = udf(count_to_target)

transactionsDf.select(count_to_target_udf('predError'))
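
The options above differ mainly in the Python function body and in how the function is registered as a UDF. The function body can be checked in plain Python without Spark, as in the minimal sketch below. The variant count_through_target is hypothetical and is included only for comparison: range(target) stops before target, so an array that includes the endpoint would need range(target + 1).

```python
def count_to_target(target):
    # Mirrors the function body shared by several of the options above.
    if target is None:
        return None
    return list(range(target))


def count_through_target(target):
    # Hypothetical variant that includes the endpoint itself.
    if target is None:
        return None
    return list(range(target + 1))


print(count_to_target(3))       # [0, 1, 2]
print(count_through_target(3))  # [0, 1, 2, 3]
print(count_to_target(None))    # None
print([range(3)])               # [range(0, 3)], a list holding one range object
```

Without the None check, range(None) raises a TypeError, so a null predError would fail instead of producing null. In PySpark, udf() defaults to a string return type when none is given, which is why an explicit ArrayType(IntegerType()) matters for an array result.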

The code block displayed below contains an error. The code block should return a new DataFrame that only contains rows from DataFrame transactionsDf in which the value in column predError is at least 5. Find the error.

Code block:

transactionsDf.where("col(predError) >= 5")

A.

The argument to the where method should be "predError >= 5".

B.

Instead of where(), filter() should be used.

C.

The expression returns the original DataFrame transactionsDf and not a new DataFrame. To avoid this, the code block should be transactionsDf.toNewDataFrame().where("col(predError) >= 5").

D.

The argument to the where method cannot be a string.

E.

Instead of >=, the SQL operator GEQ should be used.