
Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 - Databricks Certified Associate Developer for Apache Spark 3.0 Exam

Which of the following code blocks immediately removes the previously cached DataFrame transactionsDf from memory and disk?

A.

array_remove(transactionsDf, "*")

B.

transactionsDf.unpersist()

(Correct)

C.

del transactionsDf

D.

transactionsDf.clearCache()

E.

transactionsDf.persist()
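For reference, the cache/uncache round trip in PySpark can be sketched as below, assuming transactionsDf already exists; unpersist() is the DataFrame method that frees both memory and disk, and its blocking flag controls whether the removal happens immediately.

# cache() persists the DataFrame at the default MEMORY_AND_DISK storage level
transactionsDf.cache()
transactionsDf.count()  # an action is needed to actually materialize the cache

# unpersist() removes the cached data from memory and disk;
# blocking=True waits until all blocks are deleted instead of removing them lazily
transactionsDf.unpersist(blocking=True)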

The code block displayed below contains one or more errors. The code block should load parquet files at location filePath into a DataFrame, only loading those files that have been modified before 2029-03-20 05:44:46. Spark should enforce a schema according to the schema shown below. Find the error.

Schema:

root
 |-- itemId: integer (nullable = true)
 |-- attributes: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- supplier: string (nullable = true)

Code block:

schema = StructType([
    StructType("itemId", IntegerType(), True),
    StructType("attributes", ArrayType(StringType(), True), True),
    StructType("supplier", StringType(), True)
])

spark.read.options("modifiedBefore", "2029-03-20T05:44:46").schema(schema).load(filePath)

A.

The attributes array is specified incorrectly, Spark cannot identify the file format, and the syntax of the call to Spark's DataFrameReader is incorrect.

B.

Columns in the schema definition use the wrong object type and the syntax of the call to Spark's DataFrameReader is incorrect.

C.

The data type of the schema is incompatible with the schema() operator and the modification date threshold is specified incorrectly.

D.

Columns in the schema definition use the wrong object type, the modification date threshold is specified incorrectly, and Spark cannot identify the file format.

E.

Columns in the schema are unable to handle empty values and the modification date threshold is specified incorrectly.
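For comparison, a version of this load without the deliberate mistakes might look like the sketch below; it assumes the files at filePath are parquet and uses StructField entries, the DataFrameReader option() method, and an explicit parquet call (the modifiedBefore option is available in newer Spark releases).

from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, StringType

schema = StructType([
    StructField("itemId", IntegerType(), True),
    StructField("attributes", ArrayType(StringType(), True), True),
    StructField("supplier", StringType(), True)
])

# option() takes a single key/value pair; parquet() tells Spark the file format
spark.read.schema(schema).option("modifiedBefore", "2029-03-20T05:44:46").parquet(filePath)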

Which of the following code blocks returns a DataFrame with a single column that lists all items in column attributes of DataFrame itemsDf which contain the letter i?

Sample of DataFrame itemsDf:

+------+----------------------------------+-----------------------------+-------------------+
|itemId|itemName                          |attributes                   |supplier           |
+------+----------------------------------+-----------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|[blue, winter, cozy]         |Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |[red, summer, fresh, cooling]|YetiX              |
|3     |Outdoors Backpack                 |[green, summer, travel]      |Sports Company Inc.|
+------+----------------------------------+-----------------------------+-------------------+

A.

itemsDf.select(explode("attributes").alias("attributes_exploded")).filter(attributes_exploded.contains("i"))

B.

itemsDf.explode(attributes).alias("attributes_exploded").filter(col("attributes_exploded").contains("i"))

C.

itemsDf.select(explode("attributes")).filter("attributes_exploded".contains("i"))

D.

itemsDf.select(explode("attributes").alias("attributes_exploded")).filter(col("attributes_exploded").contains("i"))

E.

itemsDf.select(col("attributes").explode().alias("attributes_exploded")).filter(col("attributes_exploded").contains("i"))

Which of the following code blocks concatenates rows of DataFrames transactionsDf and transactionsNewDf, omitting any duplicates?

A.

transactionsDf.concat(transactionsNewDf).unique()

B.

transactionsDf.union(transactionsNewDf).distinct()

C.

spark.union(transactionsDf, transactionsNewDf).distinct()

D.

transactionsDf.join(transactionsNewDf, how="union").distinct()

E.

transactionsDf.union(transactionsNewDf).unique()
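A minimal sketch of the concatenate-and-deduplicate pattern, assuming both DataFrames have the same columns in the same order:

# union() appends rows by column position and keeps duplicates;
# distinct() then removes the duplicate rows
combinedDf = transactionsDf.union(transactionsNewDf).distinct()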

The code block shown below should add a column itemNameBetweenSeparators to DataFrame itemsDf. The column should contain arrays of maximum 4 strings. The arrays should be composed of the values in column itemName of DataFrame itemsDf, which are separated at - or whitespace characters. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Sample of DataFrame itemsDf:

+------+----------------------------------+-------------------+
|itemId|itemName                          |supplier           |
+------+----------------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |YetiX              |
|3     |Outdoors Backpack                 |Sports Company Inc.|
+------+----------------------------------+-------------------+

Code block:

itemsDf.__1__(__2__, __3__(__4__, "[\s\-]", __5__))

A.

1. withColumn

2. "itemNameBetweenSeparators"

3. split

4. "itemName"

5. 4

(Correct)

B.

1. withColumnRenamed

2. "itemNameBetweenSeparators"

3. split

4. "itemName"

5. 4

C.

1. withColumnRenamed

2. "itemName"

3. split

4. "itemNameBetweenSeparators"

5. 4

D.

1. withColumn

2. "itemNameBetweenSeparators"

3. split

4. "itemName"

5. 5

E.

1. withColumn

2. itemNameBetweenSeparators

3. str_split

4. "itemName"

5. 5
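For reference, pyspark.sql.functions.split() takes the column, a regular-expression pattern, and an optional limit that caps the number of array elements; a sketch with the column names from the question:

from pyspark.sql.functions import split

# the pattern matches "-" or any whitespace character; limit=4 caps the array at 4 strings
itemsDf.withColumn("itemNameBetweenSeparators", split("itemName", r"[\s\-]", 4))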

The code block shown below should return a copy of DataFrame transactionsDf with an added column cos. This column should contain the values in column value converted to degrees, with the cosine of those converted values taken and rounded to two decimals. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

transactionsDf.__1__(__2__, round(__3__(__4__(__5__)),2))

A.

1. withColumn

2. col("cos")

3. cos

4. degrees

5. transactionsDf.value

B.

1. withColumnRenamed

2. "cos"

3. cos

4. degrees

5. "transactionsDf.value"

C.

1. withColumn

2. "cos"

3. cos

4. degrees

5. transactionsDf.value

D.

1. withColumn

2. col("cos")

3. cos

4. degrees

5. col("value")

E.

1. withColumn

2. "cos"

3. degrees

4. cos

5. col("value")

The code block shown below should return a copy of DataFrame transactionsDf without columns value and productId and with an additional column associateId that has the value 5. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

transactionsDf.__1__(__2__, __3__).__4__(__5__, 'value')

A.

1. withColumn

2. 'associateId'

3. 5

4. remove

5. 'productId'

B.

1. withNewColumn

2. associateId

3. lit(5)

4. drop

5. productId

C.

1. withColumn

2. 'associateId'

3. lit(5)

4. drop

5. 'productId'

D.

1. withColumnRenamed

2. 'associateId'

3. 5

4. drop

5. 'productId'

E.

1. withColumn

2. col(associateId)

3. lit(5)

4. drop

5. col(productId)
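For reference, constant columns have to be wrapped in lit() to become Column expressions, and drop() accepts plain column-name strings; a sketch with the names from the question:

from pyspark.sql.functions import lit

# lit(5) turns the Python integer 5 into a Column; drop() removes columns by name
transactionsDf.withColumn('associateId', lit(5)).drop('productId', 'value')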

In which order should the code blocks shown below be run in order to assign articlesDf a DataFrame that lists all items in column attributes ordered by the number of times these items occur, from most to least often?

Sample of DataFrame articlesDf:

+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]      |Sports Company Inc.|
+------+-----------------------------+-------------------+

Code blocks:

1. articlesDf = articlesDf.groupby("col")

2. articlesDf = articlesDf.select(explode(col("attributes")))

3. articlesDf = articlesDf.orderBy("count").select("col")

4. articlesDf = articlesDf.sort("count",ascending=False).select("col")

5. articlesDf = articlesDf.groupby("col").count()

A.

4, 5

B.

2, 5, 3

C.

5, 2

D.

2, 3, 4

E.

2, 5, 4
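One way to sketch the whole pipeline, assuming articlesDf as in the sample; explode() names its output column col by default, which the later steps then group, count, and sort on:

from pyspark.sql.functions import explode, col

articlesDf = articlesDf.select(explode(col("attributes")))             # one row per attribute, column named "col"
articlesDf = articlesDf.groupby("col").count()                         # count how often each attribute occurs
articlesDf = articlesDf.sort("count", ascending=False).select("col")   # most frequent attributes first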

Which of the following code blocks applies the boolean-returning Python function evaluateTestSuccess to column storeId of DataFrame transactionsDf as a user-defined function?

A.

from pyspark.sql import types as T
evaluateTestSuccessUDF = udf(evaluateTestSuccess, T.BooleanType())
transactionsDf.withColumn("result", evaluateTestSuccessUDF(col("storeId")))

B.

evaluateTestSuccessUDF = udf(evaluateTestSuccess)
transactionsDf.withColumn("result", evaluateTestSuccessUDF(storeId))

C.

from pyspark.sql import types as T
evaluateTestSuccessUDF = udf(evaluateTestSuccess, T.IntegerType())
transactionsDf.withColumn("result", evaluateTestSuccess(col("storeId")))

D.

evaluateTestSuccessUDF = udf(evaluateTestSuccess)
transactionsDf.withColumn("result", evaluateTestSuccessUDF(col("storeId")))

E.

from pyspark.sql import types as T
evaluateTestSuccessUDF = udf(evaluateTestSuccess, T.BooleanType())
transactionsDf.withColumn("result", evaluateTestSuccess(col("storeId")))
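For reference, wrapping a Python function as a UDF with an explicit return type and applying it to a column can be sketched as below; evaluateTestSuccess is assumed to be a Python function that returns a boolean.

from pyspark.sql import types as T
from pyspark.sql.functions import udf, col

# without an explicit return type, udf() defaults to StringType
evaluateTestSuccessUDF = udf(evaluateTestSuccess, T.BooleanType())

transactionsDf.withColumn("result", evaluateTestSuccessUDF(col("storeId")))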

The code block displayed below contains an error. The code block should merge the rows of DataFrames transactionsDfMonday and transactionsDfTuesday into a new DataFrame, matching column names and inserting null values where column names do not appear in both DataFrames. Find the error.

Sample of DataFrame transactionsDfMonday:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+

Sample of DataFrame transactionsDfTuesday:

+-------+-------------+---------+-----+
|storeId|transactionId|productId|value|
+-------+-------------+---------+-----+
|     25|            1|        1|    4|
|      2|            2|        2|    7|
|      3|            4|        2| null|
|   null|            5|        2| null|
+-------+-------------+---------+-----+

Code block:

sc.union([transactionsDfMonday, transactionsDfTuesday])

A.

The DataFrames' RDDs need to be passed into the sc.union method instead of the DataFrame variable names.

B.

Instead of union, the concat method should be used, making sure to not use its default arguments.

C.

Instead of the Spark context, transactionsDfMonday should be called with the join method instead of the union method, making sure to use its default arguments.

D.

Instead of the Spark context, transactionsDfMonday should be called with the union method.

E.

Instead of the Spark context, transactionsDfMonday should be called with the unionByName method instead of the union method, making sure to not use its default arguments.
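For reference, merging by column name while tolerating columns that exist on only one side can be sketched as below; the allowMissingColumns flag is available from Spark 3.1 onwards.

# unionByName() matches columns by name rather than by position;
# allowMissingColumns=True fills columns missing from either DataFrame with nulls
transactionsDfMonday.unionByName(transactionsDfTuesday, allowMissingColumns=True)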