
Databricks Databricks-Machine-Learning-Associate - Databricks Certified Machine Learning Associate Exam

A data scientist has developed a machine learning pipeline with a static input data set using Spark ML, but the pipeline is taking too long to process. They increase the number of workers in the cluster to get the pipeline to run more efficiently. They notice that the number of rows in the training set after reconfiguring the cluster is different from the number of rows in the training set prior to reconfiguring the cluster.

Which of the following approaches will guarantee a reproducible training and test set for each model?

A.

Manually configure the cluster

B.

Write out the split data sets to persistent storage

C.

Set a seed in the data splitting operation

D.

Manually partition the input data
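To illustrate the persistence idea from the options above, here is a minimal plain-Python sketch, not Spark code. It splits a data set once, writes both parts to storage, and reloads them, so later runs read the same rows regardless of how the compute environment is configured. The function names, file names, and the JSON format are assumptions made for this illustration.

```python
import json
import random
import tempfile
from pathlib import Path


def split_and_persist(rows, train_fraction, out_dir, seed=42):
    """Split rows once and write both parts to disk (hypothetical helper)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    train, test = shuffled[:cut], shuffled[cut:]
    out_dir = Path(out_dir)
    (out_dir / "train.json").write_text(json.dumps(train))
    (out_dir / "test.json").write_text(json.dumps(test))
    return train, test


def load_split(out_dir):
    """Read the persisted split back instead of recomputing it."""
    out_dir = Path(out_dir)
    train = json.loads((out_dir / "train.json").read_text())
    test = json.loads((out_dir / "test.json").read_text())
    return train, test


rows = list(range(100))
with tempfile.TemporaryDirectory() as tmp:
    train, test = split_and_persist(rows, 0.8, tmp)
    reloaded_train, reloaded_test = load_split(tmp)
```

Every model that reads the persisted files trains and evaluates on identical rows, because the split is never recomputed.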

A data scientist wants to use Spark ML to one-hot encode the categorical features in their PySpark DataFrame features_df. A list of the names of the string columns is assigned to the input_columns variable.

They have developed this code block to accomplish this task:

The code block is returning an error.

Which of the following adjustments does the data scientist need to make to accomplish this task?

A.

They need to specify the method parameter to the OneHotEncoder.

B.

They need to remove the line with the fit operation.

C.

They need to use StringIndexer prior to one-hot encoding the features.

D.

They need to use VectorAssembler prior to one-hot encoding the features.
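The original code block for this question is not available. As background, Spark ML's OneHotEncoder operates on numeric category indices rather than raw strings, so string columns are indexed first. The plain-Python sketch below shows those two steps in miniature. It is not Spark's implementation, and the helper names `fit_string_index` and `one_hot` are hypothetical.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an integer index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: index for index, value in enumerate(ordered)}


def one_hot(index, size):
    """Return a list of zeros with a single 1 at the given index."""
    vector = [0] * size
    vector[index] = 1
    return vector


colors = ["red", "blue", "red", "green", "red", "blue"]
index_map = fit_string_index(colors)          # step 1: strings -> indices
indexed = [index_map[c] for c in colors]
encoded = [one_hot(i, len(index_map)) for i in indexed]  # step 2: indices -> vectors
```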

Which statement describes a Spark ML transformer?

A.

A transformer is an algorithm which can transform one DataFrame into another DataFrame

B.

A transformer is a hyperparameter grid that can be used to train a model

C.

A transformer chains multiple algorithms together to transform an ML workflow

D.

A transformer is a learning algorithm that can use a DataFrame to train a model

Which of the following tools can be used to distribute large-scale feature engineering without the use of a UDF or pandas Function API for machine learning pipelines?

A.

Keras

B.

pandas

C.

PyTorch

D.

Spark ML

E.

Scikit-learn

A data scientist wants to use Spark ML to impute missing values in their PySpark DataFrame features_df. They want to replace missing values in all numeric columns in features_df with each respective numeric column’s median value.

They have developed the following code block to accomplish this task:

The code block is not accomplishing the task.

Which of the following reasons describes why the code block is not accomplishing the imputation task?

A.

It does not impute both the training and test data sets.

B.

The inputCols and outputCols need to be exactly the same.

C.

The fit method needs to be called instead of transform.

D.

It does not fit the imputer on the data to create an ImputerModel.
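The original code block for this question is not available. As background on the fit-then-transform pattern the options refer to, the sketch below uses plain Python rather than Spark. An estimator learns per-column medians in `fit` and returns a model object, and that model's `transform` fills the missing values. The class names `MedianImputer` and `MedianImputerModel` are hypothetical and only loosely mirror Spark ML's Imputer and ImputerModel.

```python
from statistics import median


class MedianImputerModel:
    """Fitted model: holds learned medians and applies them."""

    def __init__(self, medians):
        self.medians = medians

    def transform(self, rows):
        imputed = []
        for row in rows:
            new_row = dict(row)
            for col, value in self.medians.items():
                if new_row[col] is None:
                    new_row[col] = value
            imputed.append(new_row)
        return imputed


class MedianImputer:
    """Estimator: learns one median per input column from the data."""

    def __init__(self, input_cols):
        self.input_cols = input_cols

    def fit(self, rows):
        medians = {}
        for col in self.input_cols:
            # Assumes each column has at least one observed value.
            observed = [row[col] for row in rows if row[col] is not None]
            medians[col] = median(observed)
        return MedianImputerModel(medians)


rows = [
    {"age": 30, "income": 50.0},
    {"age": None, "income": 70.0},
    {"age": 40, "income": None},
    {"age": 50, "income": 90.0},
]
model = MedianImputer(["age", "income"]).fit(rows)  # learn the medians
imputed = model.transform(rows)                     # apply them
```

Calling `transform` is only possible on the fitted model, since the medians do not exist until `fit` has seen the data.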

A data scientist has written a data cleaning notebook that utilizes the pandas library, but their colleague has suggested that they refactor their notebook to scale with big data.

Which of the following approaches can the data scientist take to spend the least amount of time refactoring their notebook to scale with big data?

A.

They can refactor their notebook to process the data in parallel.

B.

They can refactor their notebook to use the PySpark DataFrame API.

C.

They can refactor their notebook to use the Scala Dataset API.

D.

They can refactor their notebook to use Spark SQL.

E.

They can refactor their notebook to utilize the pandas API on Spark.

A data scientist has written a feature engineering notebook that utilizes the pandas library. As the size of the data processed by the notebook increases, the notebook's runtime is increasing drastically.

Which of the following tools can the data scientist use to spend the least amount of time refactoring their notebook to scale with big data?

A.

PySpark DataFrame API

B.

pandas API on Spark

C.

Spark SQL

D.

Feature Store

Which of the following hyperparameter optimization methods automatically makes informed selections of hyperparameter values based on previous trials for each iterative model evaluation?

A.

Random Search

B.

Halving Random Search

C.

Tree of Parzen Estimators

D.

Grid Search

A data scientist has produced two models for a single machine learning problem. One of the models performs well when one of the features has a value of less than 5, and the other model performs well when the value of that feature is greater than or equal to 5. The data scientist decides to combine the two models into a single machine learning solution.

Which of the following terms is used to describe this combination of models?

A.

Bootstrap aggregation

B.

Support vector machines

C.

Bucketing

D.

Ensemble learning

E.

Stacking
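The combination described in this question can be sketched in plain Python as a single predictor that routes each input to one of two models based on the feature value. Both models, their formulas, and the feature name `x` are invented for illustration.

```python
def model_low(features):
    """Hypothetical model that performs well when x < 5."""
    return 2.0 * features["x"]


def model_high(features):
    """Hypothetical model that performs well when x >= 5."""
    return 10.0 + 0.5 * features["x"]


def combined_predict(features, threshold=5):
    """Route each row to the model suited to its feature range."""
    if features["x"] < threshold:
        return model_low(features)
    return model_high(features)


predictions = [combined_predict({"x": x}) for x in (3, 5, 8)]
```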

A data scientist has replaced missing values in their feature set with each respective feature variable’s median value. A colleague suggests that the data scientist is throwing away valuable information by doing this.

Which of the following approaches can they take to include as much information as possible in the feature set?

A.

Impute the missing values using each respective feature variable's mean value instead of the median value

B.

Refrain from imputing the missing values in favor of letting the machine learning algorithm determine how to handle them

C.

Remove all feature variables that originally contained missing values from the feature set

D.

Create a binary feature variable for each feature that contained missing values indicating whether each row's value has been imputed

E.

Create a constant feature variable for each feature that contained missing values indicating the percentage of rows from the feature that was originally missing
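As an illustration of the indicator-variable idea mentioned in the options above, the plain-Python sketch below imputes a column with its median and adds a binary column recording which rows were originally missing. The helper name and the `_was_imputed` suffix are assumptions made for this example.

```python
from statistics import median


def impute_with_indicator(rows, column):
    """Fill missing values with the median and flag the rows that were filled."""
    # Assumes the column has at least one observed value.
    observed = [row[column] for row in rows if row[column] is not None]
    fill_value = median(observed)
    result = []
    for row in rows:
        was_missing = row[column] is None
        new_row = dict(row)
        new_row[column] = fill_value if was_missing else row[column]
        new_row[f"{column}_was_imputed"] = int(was_missing)
        result.append(new_row)
    return result


rows = [{"age": 20}, {"age": None}, {"age": 40}]
result = impute_with_indicator(rows, "age")
```

The indicator column preserves the fact that a value was missing, which the imputed value alone would hide.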