Databricks Databricks-Certified-Professional-Data-Engineer - Databricks Certified Data Engineer Professional Exam

Total 202 questions

The marketing team is looking to share data in an aggregate table with the sales organization, but the field names used by the teams do not match, and a number of marketing-specific fields have not been approved for the sales org.

Which of the following solutions addresses the situation while emphasizing simplicity?

Create a view on the marketing table selecting only those fields approved for the sales team, and alias the names of any fields that should be standardized to the sales naming conventions.

Use a CTAS statement to create a derivative table from the marketing table, and configure a production job to propagate changes.

Add a parallel table write to the current production pipeline, updating a new sales table that varies as required from the marketing table.

Create a new table with the required schema and use Delta Lake's DEEP CLONE functionality to sync up changes committed to one table to the corresponding table.
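The view-based option can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for a Databricks SQL environment, and the table name marketing_agg, the view name sales_agg_view, and every column name are hypothetical. The point is only that a view can hide unapproved fields and rename the rest without copying any data.

```python
import sqlite3

# Hypothetical marketing aggregate table; all names and values are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE marketing_agg (cust_id INTEGER, spend_total REAL, campaign_notes TEXT)"
)
conn.executemany(
    "INSERT INTO marketing_agg VALUES (?, ?, ?)",
    [(1, 120.0, "internal only"), (2, 75.5, "internal only")],
)

# The view exposes only the approved fields, aliased to the sales naming convention.
# campaign_notes is deliberately left out.
conn.execute(
    """
    CREATE VIEW sales_agg_view AS
    SELECT cust_id     AS customer_id,
           spend_total AS total_spend
    FROM marketing_agg
    """
)

cursor = conn.execute("SELECT * FROM sales_agg_view ORDER BY customer_id")
view_columns = [col[0] for col in cursor.description]
view_rows = cursor.fetchall()
print(view_columns)
print(view_rows)
```

Because the view is evaluated at query time, changes to the underlying table are visible through it without a separate job to keep a copy in sync.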

Question # 52

A DLT pipeline includes the following streaming tables:

raw_iot ingests raw device measurement data from a heart rate tracking device.

bpm_stats incrementally computes user statistics based on BPM measurements from raw_iot.

How can the data engineer configure this pipeline to be able to retain manually deleted or updated records in the raw_iot table while recomputing the downstream table when a pipeline update is run?

Set the skipChangeCommits flag to true on bpm_stats

Set the skipChangeCommits flag to true on raw_iot

Set the pipelines.reset.allowed property to false on bpm_stats

Set the pipelines.reset.allowed property to false on raw_iot

Question # 53

A data engineer created a daily batch ingestion pipeline using a cluster with the latest DBR version to store banking transaction data, and persisted it in a MANAGED DELTA table called prod.gold.all_banking_transactions_daily. The data engineer is constantly receiving complaints from business users who query this table ad hoc through a SQL Serverless Warehouse about poor query performance. Upon analysis, the data engineer identified that these users frequently use high-cardinality columns as filters. The engineer now seeks to implement a data layout optimization technique that is incremental, easy to maintain, and can evolve over time.

Which command should the data engineer implement?

Alter the table to use Hive-Style Partitions + Z-ORDER and implement a periodic OPTIMIZE command.

Alter the table to use Liquid Clustering and implement a periodic OPTIMIZE command.

Alter the table to use Hive-Style Partitions and implement a periodic OPTIMIZE command.

Alter the table to use Z-ORDER and implement a periodic OPTIMIZE command.

Question # 54

A data team is implementing an append-only Delta Lake pipeline that processes both batch and streaming data. They want to ensure that schema changes in the source data are automatically incorporated without breaking the pipeline.

Which configuration should the team use when writing data to the Delta table?

ignoreChanges = false

mergeSchema = true

overwriteSchema = true

validateSchema = false
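As a rough illustration of what schema merging on append means, the sketch below is a toy model in plain Python, not Delta Lake itself. The function name append_with_merge_schema and the sample columns are hypothetical. On Databricks the corresponding write would typically look like df.write.format("delta").mode("append").option("mergeSchema", "true").saveAsTable(...).

```python
def append_with_merge_schema(table_schema, table_rows, batch_rows):
    """Toy model of an append with schema merging.

    Columns that appear only in the incoming batch are added to the end of
    the table schema; rows that lack a column read back as None.
    """
    merged_schema = list(table_schema)
    for row in batch_rows:
        for column in row:
            if column not in merged_schema:
                merged_schema.append(column)
    all_rows = [
        {column: row.get(column) for column in merged_schema}
        for row in table_rows + batch_rows
    ]
    return merged_schema, all_rows


# Existing table has two columns; the new batch arrives with an extra one.
schema, rows = append_with_merge_schema(
    ["id", "bpm"],
    [{"id": 1, "bpm": 60}],
    [{"id": 2, "bpm": 70, "device": "tracker-x"}],
)
print(schema)
print(rows)
```

The existing row is kept and simply gains an empty value for the new column, which is why this kind of change does not break an append-only pipeline.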

Question # 55

The data engineering team maintains the following code:

Assuming that this code produces logically correct results and the data in the source tables has been de-duplicated and validated, which statement describes what will occur when this code is executed?

A batch job will update the enriched_itemized_orders_by_account table, replacing only those rows that have different values than the current version of the table, using accountID as the primary key.

The enriched_itemized_orders_by_account table will be overwritten using the current valid version of data in each of the three tables referenced in the join logic.

An incremental job will leverage information in the state store to identify unjoined rows in the source tables and write these rows to the enriched_itemized_orders_by_account table.

An incremental job will detect if new rows have been written to any of the source tables; if new rows are detected, all results will be recalculated and used to overwrite the enriched_itemized_orders_by_account table.

No computation will occur until enriched_itemized_orders_by_account is queried; upon query materialization, results will be calculated using the current valid version of data in each of the three tables referenced in the join logic.

Explanation:

The provided PySpark code performs the following operations:

Reads Data from silver_customer_sales Table:

The code starts by accessing the silver_customer_sales table using the spark.table method.

Groups Data by customer_id:

The .groupBy("customer_id") function groups the data based on the customer_id column.

Aggregates Data:

The .agg() function computes several aggregate metrics for each customer_id:

F.min("sale_date").alias("first_transaction_date"): Determines the earliest sale date for the customer.

F.max("sale_date").alias("last_transaction_date"): Determines the latest sale date for the customer.

F.mean("sale_total").alias("average_sales"): Calculates the average sale amount for the customer.

F.countDistinct("order_id").alias("total_orders"): Counts the number of unique orders placed by the customer.

F.sum("sale_total").alias("lifetime_value"): Calculates the total sales amount (lifetime value) for the customer.

Writes Data to gold_customer_lifetime_sales_summary Table:

The .write.mode("overwrite").table("gold_customer_lifetime_sales_summary") command writes the aggregated data to the gold_customer_lifetime_sales_summary table.

The mode("overwrite") specifies that the existing data in the gold_customer_lifetime_sales_summary table will be completely replaced by the new aggregated data.

Conclusion:

When this code is executed, it reads all records from the silver_customer_sales table, performs the specified aggregations grouped by customer_id, and then overwrites the entire gold_customer_lifetime_sales_summary table with the aggregated results. Therefore, option D accurately describes this process: "The gold_customer_lifetime_sales_summary table will be overwritten by aggregated values calculated from all records in the silver_customer_sales table as a batch job."
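The aggregation and overwrite described in this explanation can be reproduced in miniature. The sketch below uses Python's sqlite3 module and plain SQL as a stand-in for the PySpark code; the sample rows and the helper name rebuild_summary are invented, and only the table and column names come from the explanation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE silver_customer_sales "
    "(customer_id INTEGER, order_id INTEGER, sale_date TEXT, sale_total REAL)"
)
# Invented sample data: two orders for customer 1, one for customer 2.
conn.executemany(
    "INSERT INTO silver_customer_sales VALUES (?, ?, ?, ?)",
    [
        (1, 10, "2024-01-05", 20.0),
        (1, 11, "2024-02-01", 40.0),
        (2, 12, "2024-01-20", 15.0),
    ],
)


def rebuild_summary(connection):
    # Overwrite semantics: drop the old summary and rebuild it from all source rows.
    connection.execute("DROP TABLE IF EXISTS gold_customer_lifetime_sales_summary")
    connection.execute(
        """
        CREATE TABLE gold_customer_lifetime_sales_summary AS
        SELECT customer_id,
               MIN(sale_date)           AS first_transaction_date,
               MAX(sale_date)           AS last_transaction_date,
               AVG(sale_total)          AS average_sales,
               COUNT(DISTINCT order_id) AS total_orders,
               SUM(sale_total)          AS lifetime_value
        FROM silver_customer_sales
        GROUP BY customer_id
        """
    )


rebuild_summary(conn)
rebuild_summary(conn)  # a second run replaces the table instead of appending to it

summary = conn.execute(
    "SELECT * FROM gold_customer_lifetime_sales_summary ORDER BY customer_id"
).fetchall()
print(summary)
```

Running the rebuild twice leaves one row per customer, which is the practical difference between an overwrite and an append.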


Question # 56

What is true for Delta Lake?

Views in the Lakehouse maintain a valid cache of the most recent versions of source tables at all times.

Delta Lake automatically collects statistics on the first 32 columns of each table, which are leveraged in data skipping based on query filters.

Z-ORDER can only be applied to numeric values stored in Delta Lake tables.

Primary and foreign key constraints can be leveraged to ensure duplicate values are never entered into a dimension table.

Question # 57

Which statement characterizes the general programming model used by Spark Structured Streaming?

Structured Streaming leverages the parallel processing of GPUs to achieve highly parallel data throughput.

Structured Streaming is implemented as a messaging bus and is derived from Apache Kafka.

Structured Streaming uses specialized hardware and I/O streams to achieve sub-second latency for data transfer.

Structured Streaming models new data arriving in a data stream as new rows appended to an unbounded table.

Structured Streaming relies on a distributed network of nodes that hold incremental state values for cached stages.
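The "unbounded table" idea mentioned among the options can be pictured with a toy model in plain Python. This is not Spark code: the names input_table, running_counts, and batch_query are hypothetical, and the full input is kept in memory only so the two results can be compared.

```python
from collections import Counter


def batch_query(all_rows):
    # The query as it would run over the whole input table at once.
    return Counter(key for key, _ in all_rows)


input_table = []            # conceptually unbounded; stored in full only for this demo
running_counts = Counter()  # incremental state updated per micro-batch

micro_batches = [[("a", 1), ("b", 2)], [("a", 3)]]
for batch in micro_batches:
    # New data arrives as new rows appended to the input table.
    input_table.extend(batch)
    # The engine only needs to process the new rows to update its result.
    running_counts.update(key for key, _ in batch)
    # The incremental result matches a batch query over everything seen so far.
    assert running_counts == batch_query(input_table)

print(dict(running_counts))
```

In this model a streaming query is written like a batch query over a static table, while the result is maintained incrementally as rows arrive.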

Question # 58

A data engineer is configuring a pipeline that will potentially see late-arriving, duplicate records.

In addition to de-duplicating records within the batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?

Set the configuration delta.deduplicate = true.

VACUUM the Delta table after each batch completes.

Perform an insert-only merge with a matching condition on a unique key.

Perform a full outer join on a unique key and overwrite existing data.

Rely on Delta Lake schema enforcement to prevent duplicate records.
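For reference, an insert-only merge on a unique key behaves roughly like the sketch below. It emulates the behavior with Python's sqlite3 module rather than Delta Lake; the events table, its columns, and the helper insert_only_merge are hypothetical. In Delta Lake SQL the same intent is usually written as MERGE INTO target t USING source s ON t.key = s.key WHEN NOT MATCHED THEN INSERT *.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id INTEGER, payload TEXT)")
conn.execute("INSERT INTO events VALUES (1, 'first')")  # previously processed record


def insert_only_merge(connection, batch):
    # Step 1: de-duplicate within the batch, keeping the first record per key.
    unique = {}
    for event_id, payload in batch:
        unique.setdefault(event_id, payload)
    # Step 2: insert a record only if its key is not already in the target.
    connection.executemany(
        """
        INSERT INTO events (event_id, payload)
        SELECT ?, ?
        WHERE NOT EXISTS (SELECT 1 FROM events WHERE event_id = ?)
        """,
        [(key, value, key) for key, value in unique.items()],
    )


late_batch = [(1, "late duplicate"), (2, "new"), (2, "duplicate within batch")]
insert_only_merge(conn, late_batch)
insert_only_merge(conn, late_batch)  # replaying the batch adds nothing

merged_rows = conn.execute(
    "SELECT event_id, payload FROM events ORDER BY event_id"
).fetchall()
print(merged_rows)
```

Existing rows are never updated or deleted, so replaying a late or duplicated batch is harmless.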

Question # 59

The data architect has decided that once data has been ingested from external sources into the Databricks Lakehouse, table access controls will be leveraged to manage permissions for all production tables and views.

The following logic was executed to grant privileges for interactive queries on a production database to the core engineering group.

GRANT USAGE ON DATABASE prod TO eng;

GRANT SELECT ON DATABASE prod TO eng;

Assuming these are the only privileges that have been granted to the eng group and that these users are not workspace administrators, which statement describes their privileges?

Group members have full permissions on the prod database and can also assign permissions to other users or groups.

Group members are able to list all tables in the prod database but are not able to see the results of any queries on those tables.

Group members are able to query and modify all tables and views in the prod database, but cannot create new tables or views.

Group members are able to query all tables and views in the prod database, but cannot create or edit anything in the database.

Group members are able to create, query, and modify all tables and views in the prod database, but cannot define custom functions.

Question # 60

The data engineering team has been tasked with configuring connections to an external database that does not have a supported native connector with Databricks. The external database already has data security configured by group membership. These groups map directly to user groups already created in Databricks that represent various teams within the company.

A new login credential has been created for each group in the external database. The Databricks Utilities Secrets module will be used to make these credentials available to Databricks users.

Assuming that all the credentials are configured correctly on the external database and group membership is properly configured on Databricks, which statement describes how teams can be granted the minimum necessary access to use these credentials?

"Read" permissions should be set on a secret key mapped to those credentials that will be used by a given team.

No additional configuration is necessary as long as all users are configured as administrators in the workspace where secrets have been added.

"Read" permissions should be set on a secret scope containing only those credentials that will be used by a given team.

"Manage" permissions should be set on a secret scope containing only those credentials that will be used by a given team.
