Databricks Databricks-Machine-Learning-Associate - Databricks Certified Machine Learning Associate Exam

Which of the following tools can be used to distribute large-scale feature engineering without the use of a UDF or pandas Function API for machine learning pipelines?

A.

Keras

B.

Scikit-learn

C.

PyTorch

D.

Spark ML

A machine learning engineer has grown tired of needing to install the MLflow Python library on each of their clusters. They ask a senior machine learning engineer how their notebooks can load the MLflow library without installing it each time. The senior machine learning engineer suggests that they use Databricks Runtime for Machine Learning.

Which of the following approaches describes how the machine learning engineer can begin using Databricks Runtime for Machine Learning?

A.

They can add a line enabling Databricks Runtime ML in their init script when creating their clusters.

B.

They can check the Databricks Runtime ML box when creating their clusters.

C.

They can select a Databricks Runtime ML version from the Databricks Runtime Version dropdown when creating their clusters.

D.

They can set the runtime-version variable in their Spark session to “ml”.

What is the name of the method that transforms categorical features into a series of binary indicator feature variables?

A.

Leave-one-out encoding

B.

Target encoding

C.

One-hot encoding

D.

Categorical

E.

String indexing
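
For reference, turning a categorical feature into binary indicator columns (commonly called one-hot encoding) can be sketched in plain Python. This is a minimal illustration only: the helper name one_hot_encode and the sample colour values are hypothetical and do not come from the exam, and on Databricks this would normally be done with a library encoder rather than by hand.

```python
def one_hot_encode(values):
    """Map each categorical value to a list of 0/1 indicators, one per category."""
    categories = sorted(set(values))  # fixed, reproducible column order
    encoded = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, encoded


# Hypothetical sample data, not taken from the exam.
categories, encoded = one_hot_encode(["red", "green", "blue", "green"])
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Each row contains exactly one 1, in the column of the category that row belongs to.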

A health organization is developing a classification model to determine whether or not a patient currently has a specific type of infection. The organization's leaders want to maximize the number of positive cases identified by the model.

Which of the following classification metrics should be used to evaluate the model?

A.

RMSE

B.

Precision

C.

Area under the residual operating curve

D.

Accuracy

E.

Recall
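
To make the difference between the candidate metrics concrete, the sketch below computes precision and recall from scratch. The labels and predictions are made-up values chosen for illustration (1 = infected, 0 = not infected), and precision_recall is a hypothetical helper, not a library function.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical data: the model flags every true case plus two healthy patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Recall is the share of actual positive cases that the model identifies, so a model that misses no infected patient scores 1.0 on recall even if it raises some false alarms, which lower precision instead.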

Which of the following machine learning algorithms typically uses bagging?

A.

Gradient boosted trees

B.

K-means

C.

Random forest

D.

Linear regression

E.

Decision tree
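
Bagging (bootstrap aggregating) trains each base model on a resample of the training data drawn with replacement and then combines the predictions. The sketch below shows only that mechanic, using a deliberately trivial base model (the sample mean) in place of a decision tree. The function names, the data, and the number of models are all assumptions made for illustration.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def bagged_mean(data, n_models=50, seed=0):
    """Average the predictions of n_models 'models', each fit on its own bootstrap sample."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_models):
        sample = bootstrap_sample(data, rng)
        estimates.append(sum(sample) / len(sample))  # trivial base model: the sample mean
    return sum(estimates) / len(estimates)


# Hypothetical target values, not taken from the exam.
targets = [3.0, 5.0, 8.0, 13.0, 21.0]
bagged_estimate = bagged_mean(targets)
print(bagged_estimate)
```

A random forest follows the same pattern with decision trees as the base models, while boosting fits its models sequentially rather than on independent resamples.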

A data scientist is working with a feature set with the following schema:

The customer_id column is the primary key in the feature set. Each of the columns in the feature set has missing values. They want to replace the missing values by imputing a common value for each feature.

Which of the following lists all of the columns in the feature set that need to be imputed using the most common value of the column?

A.

customer_id, loyalty_tier

B.

loyalty_tier

C.

units

D.

spend

E.

customer_id
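
Imputing with the most common value (the mode) is typically applied to categorical columns, while numeric columns are more often filled with a mean or median. The schema is not reproduced in this copy of the question, so the sketch below simply assumes a string-valued column. The sample values and the helper impute_with_mode are hypothetical.

```python
from collections import Counter


def impute_with_mode(values):
    """Replace None entries with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


# Hypothetical categorical column with missing entries.
loyalty_tier = ["gold", None, "silver", "gold", None]
imputed = impute_with_mode(loyalty_tier)
print(imputed)  # ['gold', 'gold', 'silver', 'gold', 'gold']
```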

A machine learning engineer has identified the best run from an MLflow Experiment. They have stored the run ID in the run_id variable and identified the logged model name as "model". They now want to register that model in the MLflow Model Registry with the name "best_model".

Which lines of code can they use to register the model associated with run_id to the MLflow Model Registry?

A.

mlflow.register_model(run_id, "best_model")

B.

mlflow.register_model(f"runs:/{run_id}/model", "best_model")

C.

mlflow.register_model(f"runs:/{run_id}/model")

D.

mlflow.register_model(f"runs:/{run_id}/best_model", "model")
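
The options differ mainly in how the model URI is built and which argument carries the registry name. As a sketch, a runs:/ URI combines the run ID with the artifact path under which the model was logged, and mlflow.register_model takes that URI followed by the registered model name. The helper build_runs_uri and the run ID value below are hypothetical, and the MLflow call is left commented out because it needs an MLflow installation and a reachable tracking server.

```python
def build_runs_uri(run_id: str, artifact_path: str) -> str:
    """Build a runs:/ URI pointing at an artifact logged under the given run."""
    return f"runs:/{run_id}/{artifact_path}"


run_id = "abc123"  # hypothetical run ID for illustration
model_uri = build_runs_uri(run_id, "model")  # "model" is the logged model name
print(model_uri)  # runs:/abc123/model

# With MLflow installed and a tracking server configured:
# import mlflow
# mlflow.register_model(model_uri, "best_model")
```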

A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column discount is less than or equal to 0.

Which of the following code blocks will accomplish this task?

A.

spark_df.loc[:,spark_df["discount"] <= 0]

B.

spark_df[spark_df["discount"] <= 0]

C.

spark_df.filter(col("discount") <= 0)

D.

spark_df.loc[spark_df["discount"] <= 0, :]

A data scientist uses 3-fold cross-validation when optimizing model hyperparameters for a regression problem. The following root-mean-squared-error values are calculated on each of the validation folds:

• 10.0

• 12.0

• 17.0

Which of the following values represents the overall cross-validation root-mean-squared error?

A.

13.0

B.

17.0

C.

12.0

D.

39.0

E.

10.0
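
The arithmetic can be checked directly, assuming the usual convention that the overall cross-validation score is the mean of the per-fold validation scores. The variable names are illustrative.

```python
from statistics import mean

# RMSE measured on each of the three validation folds (values from the question).
fold_rmse = [10.0, 12.0, 17.0]

# Overall cross-validation RMSE as the average across folds.
cv_rmse = mean(fold_rmse)
print(cv_rmse)  # 13.0
```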

Which of the following describes the relationship between native Spark DataFrames and pandas API on Spark DataFrames?

A.

pandas API on Spark DataFrames are single-node versions of Spark DataFrames with additional metadata

B.

pandas API on Spark DataFrames are more performant than Spark DataFrames

C.

pandas API on Spark DataFrames are made up of Spark DataFrames and additional metadata

D.

pandas API on Spark DataFrames are less mutable versions of Spark DataFrames

E.

pandas API on Spark DataFrames are unrelated to Spark DataFrames