Databricks Databricks-Certified-Professional-Data-Engineer Practice Exam - 122 Unique Questions [Q72-Q88]




To prepare for the exam, Databricks offers a range of training resources, including online courses, workshops, and certification bootcamps. These resources cover topics such as data engineering, data science, machine learning, and data analytics on the Databricks platform. Candidates can also access the Databricks Academy, which provides self-paced learning modules and practice exams.


The DCPDE certification is a way for data professionals to demonstrate their expertise in the Databricks platform. The Databricks Certified Professional Data Engineer certification is recognized globally and is valued by employers looking for data professionals with expertise in Databricks. It gives professionals the opportunity to enhance their career prospects and increase their earning potential.

 

NEW QUESTION # 72
What steps need to be taken to set up a Delta Live Tables pipeline as a job using the workspace UI?

  • A. Select Workflows UI and Delta Live Tables tab, under task type select Delta Live Tables pipeline and select the pipeline JSON file
  • B. Use Pipeline creation UI, select a new pipeline and job cluster
  • C. Delta Live Tables do not support job clusters
  • D. Select Workflows UI and Delta Live Tables tab, under task type select Delta Live Tables pipeline and select the notebook

Answer: D

Explanation:
The answer is: select the Workflows UI and the Delta Live Tables tab; under task type, select Delta Live Tables pipeline and select the notebook.
Create a pipeline
To create a new pipeline using the Delta Live Tables notebook:
1. Click Workflows in the sidebar, click the Delta Live Tables tab, and click Create Pipeline.
2. Give the pipeline a name and click to select a notebook.
3. Optionally enter a storage location for output data from the pipeline. The system uses a default location if you leave Storage Location empty.
4. Select Triggered for Pipeline Mode.
5. Click Create.
The system displays the Pipeline Details page after you click Create. You can also access your pipeline by clicking the pipeline name in the Delta Live Tables tab.


NEW QUESTION # 73
A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on task A.
If tasks A and B complete successfully but task C fails during a scheduled run, which statement describes the resulting state?

  • A. Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until all tasks have successfully been completed.
  • B. All logic expressed in the notebook associated with tasks A and B will have been successfully completed; any changes made in task C will be rolled back due to task failure.
  • C. All logic expressed in the notebook associated with task A will have been successfully completed; tasks B and C will not commit any changes because of stage failure.
  • D. All logic expressed in the notebook associated with tasks A and B will have been successfully completed; some operations in task C may have completed successfully.
  • E. Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task C failed, all commits will be rolled back automatically.

Answer: D

Explanation:
A Databricks job orchestrates its tasks as a dependency graph, but it does not wrap them in a single transaction. Each notebook runs independently once its upstream dependencies succeed, and every Delta Lake write inside a notebook commits on its own. Because tasks A and B completed successfully, all of their logic has run and their changes persist. Task C failed partway through, so any operations that committed before the failure remain in place; there is no automatic rollback across a task or across the job.


NEW QUESTION # 74
You are investigating a production job failure caused by a data issue. The job is set up to run on a job cluster. What cluster do you need to start to investigate and analyze the data?

  • A. Existing job cluster can be used to investigate the issue
  • B. Databricks SQL Endpoint can be used to investigate the issue
  • C. A Job cluster can be used to analyze the problem
  • D. All-purpose cluster/ interactive cluster is the recommended way to run commands and view the data.

Answer: D

Explanation:
The answer is: an all-purpose cluster/interactive cluster is the recommended way to run commands and view the data.
A job cluster does not give a user a way to interact with a notebook once the job is submitted, but an interactive cluster allows you to display data, view visualizations, and write or edit queries, which makes it a good fit for investigating and analyzing the data.


NEW QUESTION # 75
The data governance team is reviewing code used for deleting records for compliance with GDPR. They note the following logic is used to delete records from the Delta Lake table named users.

Assuming that user_id is a unique identifying key and that delete_requests contains all users that have requested deletion, which statement describes whether successfully executing the above logic guarantees that the records to be deleted are no longer accessible and why?

  • A. Yes; Delta Lake ACID guarantees provide assurance that the delete command succeeded fully and permanently purged these records.
  • B. No; files containing deleted records may still be accessible with time travel until a vacuum command is used to remove invalidated data files.
  • C. No; the Delta Lake delete command only provides ACID guarantees when combined with the merge into command.
  • D. No; the Delta cache may return records from previous versions of the table until the cluster is restarted.
  • E. Yes; the Delta cache immediately updates to reflect the latest data files recorded to disk.

Answer: B

Explanation:
The code uses the DELETE FROM command to delete records from the users table that match a condition based on a join with another table called delete_requests, which contains all users that have requested deletion. The DELETE FROM command deletes records from a Delta Lake table by creating a new version of the table that does not contain the deleted records. However, this does not guarantee that the records to be deleted are no longer accessible, because Delta Lake supports time travel, which allows querying previous versions of the table using a timestamp or version number. Therefore, files containing deleted records may still be accessible with time travel until a vacuum command is used to remove invalidated data files from physical storage. Verified Reference: [Databricks Certified Data Engineer Professional], under "Delta Lake" section; Databricks Documentation, under "Delete from a table" section; Databricks Documentation, under "Remove files no longer referenced by a Delta table" section.
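
As a minimal sketch of the two-step pattern the explanation describes, the helper below builds the SQL text for a logical delete followed by a physical purge. The table and column names come from the question, but the exact shape of the DELETE statement is an assumption because the original logic is not reproduced here, and gdpr_purge_statements is a hypothetical helper name. It only builds strings; in a Databricks notebook each statement would be passed to spark.sql.

```python
def gdpr_purge_statements(table, key, requests_table, retain_hours=168):
    """Build the SQL for a logical delete followed by a physical purge.

    Assumption: the DELETE shape is illustrative, not the question's code.
    168 hours matches Delta Lake's default 7-day VACUUM retention threshold.
    """
    delete_sql = (
        f"DELETE FROM {table} "
        f"WHERE {key} IN (SELECT {key} FROM {requests_table})"
    )
    # VACUUM removes data files that the current table version no longer
    # references and that are older than the retention threshold.
    vacuum_sql = f"VACUUM {table} RETAIN {retain_hours} HOURS"
    return [delete_sql, vacuum_sql]


statements = gdpr_purge_statements("users", "user_id", "delete_requests")
for stmt in statements:
    print(stmt)  # in a notebook: spark.sql(stmt)
```

Until the VACUUM step has run and the retention window has passed, earlier table versions can still be read with time travel, which is why the DELETE alone does not satisfy the requirement.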


NEW QUESTION # 76
What is the purpose of the gold layer in a multi-hop architecture?

  • A. Preserves grain of original data, without any aggregations
  • B. Data quality checks and schema enforcement
  • C. Optimized query performance for business-critical data
  • D. Optimizes ETL throughput and analytic query performance
  • E. Eliminate duplicate records

Answer: C

Explanation:
Medallion Architecture - Databricks
Gold Layer:
1. Powers ML applications, reporting, dashboards, ad hoc analytics
2. Refined views of data, typically with aggregations
3. Reduces strain on production systems
4. Optimizes query performance for business-critical data
Exam focus: Understand the role of each layer (bronze, silver, gold) in the medallion architecture; you will see varying questions targeting each layer and its purpose.


NEW QUESTION # 77
All records from an Apache Kafka producer are being ingested into a single Delta Lake table with the following schema:
key BINARY, value BINARY, topic STRING, partition LONG, offset LONG, timestamp LONG
There are 5 unique topics being ingested. Only the "registration" topic contains Personally Identifiable Information (PII). The company wishes to restrict access to PII. The company also wishes to only retain records containing PII in this table for 14 days after initial ingestion. However, for non-PII information, it would like to retain these records indefinitely.
Which of the following solutions meets the requirements?

  • A. Data should be partitioned by the registration field, allowing ACLs and delete statements to be set for the PII directory.
  • B. Data should be partitioned by the topic field, allowing ACLs and delete statements to leverage partition boundaries.
  • C. Because the value field is stored as binary data, this information is not considered PII and no special precautions should be taken.
  • D. Separate object storage containers should be specified based on the partition field, allowing isolation at the storage level.
  • E. All data should be deleted biweekly; Delta Lake's time travel functionality should be leveraged to maintain a history of non-PII information.

Answer: B

Explanation:
Partitioning the data by the topic field allows the company to apply different access control policies and retention policies for different topics. For example, the company can use the Table Access Control feature to grant or revoke permissions to the registration topic based on user roles or groups. The company can also use the DELETE command to remove records from the registration topic that are older than 14 days, while keeping the records from other topics indefinitely. Partitioning by the topic field also improves the performance of queries that filter by the topic field, as they can skip reading irrelevant partitions. References:
Table Access Control: https://docs.databricks.com/security/access-control/table-acls/index.html
DELETE: https://docs.databricks.com/delta/delta-update.html#delete-from-a-table


NEW QUESTION # 78
The data engineering team has configured a Databricks SQL query and alert to monitor the values in a Delta Lake table. The recent_sensor_recordings table contains an identifying sensor_id alongside the timestamp and temperature for the most recent 5 minutes of recordings.
The below query is used to create the alert:

The query is set to refresh each minute and always completes in less than 10 seconds. The alert is set to trigger when mean(temperature) > 120. Notifications are triggered to be sent at most every 1 minute.
If this alert raises notifications for 3 consecutive minutes and then stops, which statement must be true?

  • A. The total average temperature across all sensors exceeded 120 on three consecutive executions of the query
  • B. The average temperature recordings for at least one sensor exceeded 120 on three consecutive executions of the query
  • C. The maximum temperature recording for at least one sensor exceeded 120 on three consecutive executions of the query
  • D. The source query failed to update properly for three consecutive minutes and then restarted
  • E. The recent_sensor_recordings table was unresponsive for three consecutive runs of the query

Answer: B

Explanation:
This is the correct answer because the query is using a GROUP BY clause on the sensor_id column, which means it will calculate the mean temperature for each sensor separately. The alert will trigger when the mean temperature for any sensor is greater than 120, which means at least one sensor had an average temperature above 120 for three consecutive minutes. The alert will stop when the mean temperature for all sensors drops below 120. Verified References: [Databricks Certified Data Engineer Professional], under "SQL Analytics" section; Databricks Documentation, under "Alerts" section.
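
To make the difference between a per-sensor mean and an overall mean concrete, here is a small plain-Python sketch of the grouping the explanation describes. It does not reproduce the alert query, which is not shown above; the readings and the sensors_over_threshold helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean

THRESHOLD = 120


def sensors_over_threshold(readings, threshold=THRESHOLD):
    """Return {sensor_id: mean_temperature} for sensors whose mean exceeds threshold."""
    by_sensor = defaultdict(list)
    for sensor_id, temperature in readings:
        by_sensor[sensor_id].append(temperature)
    return {
        sensor_id: mean(temps)
        for sensor_id, temps in by_sensor.items()
        if mean(temps) > threshold
    }


# Hypothetical readings: sensor "a" runs hot, sensor "b" does not.
readings = [("a", 130), ("a", 126), ("b", 70), ("b", 74)]
hot = sensors_over_threshold(readings)
overall = mean(t for _, t in readings)
print(hot)      # sensors whose own mean is above the threshold
print(overall)  # mean across every reading, regardless of sensor
```

In this example one sensor crosses the threshold while the overall mean stays below it, so a grouped alert can fire even though the total average never exceeds 120.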


NEW QUESTION # 79
A table named user_ltv is being used to create a view that will be used by data analysts on various teams.
Users in the workspace are configured into groups, which are used for setting up data access using ACLs.
The user_ltv table has the following schema:
email STRING, age INT, ltv INT
The following view definition is executed:

An analyst who is not a member of the marketing group executes the following query:
SELECT * FROM email_ltv
Which statement describes the results returned by this query?

  • A. Three columns will be returned, but one column will be named "redacted" and contain only null values.
  • B. The email and ltv columns will be returned with the values in user_ltv.
  • C. Only the email and ltv columns will be returned; the email column will contain all null values.
  • D. Only the email and ltv columns will be returned; the email column will contain the string "REDACTED" in each row.
  • E. The email, age, and ltv columns will be returned with the values in user_ltv.

Answer: D

Explanation:
The code creates a view called email_ltv that selects the email and ltv columns from a table called user_ltv, which has the following schema: email STRING, age INT, ltv INT. The code also uses the CASE WHEN expression to replace the email values with the string "REDACTED" if the user is not a member of the marketing group. The user who executes the query is not a member of the marketing group, so they will only see the email and ltv columns, and the email column will contain the string "REDACTED" in each row.
Verified References: [Databricks Certified Data Engineer Professional], under "Lakehouse" section; Databricks Documentation, under "CASE expression" section.


NEW QUESTION # 80
A member of the data engineering team has submitted a short notebook that they wish to schedule as part of a larger data pipeline. Assume that the commands provided below produce the logically correct results when run as presented.

Which command should be removed from the notebook before scheduling it as a job?

  • A. Cmd 5
  • B. Cmd 6
  • C. Cmd 2
  • D. Cmd 3
  • E. Cmd 4

Answer: B

Explanation:
Cmd 6 is the command that should be removed from the notebook before scheduling it as a job. This command is selecting all the columns from the finalDF dataframe and displaying them in the notebook. This is not necessary for the job, as the finalDF dataframe is already written to a table in Cmd 7. Displaying the dataframe in the notebook will only consume resources and time, and it will not affect the output of the job. Therefore, Cmd 6 is redundant and should be removed.
The other commands are essential for the job, as they perform the following tasks:
* Cmd 1: Reads the raw_data table into a Spark dataframe called rawDF.
* Cmd 2: Prints the schema of the rawDF dataframe, which is useful for debugging and understanding the data structure.
* Cmd 3: Selects all the columns from the rawDF dataframe, as well as the nested columns from the values struct column, and creates a new dataframe called flattenedDF.
* Cmd 4: Drops the values column from the flattenedDF dataframe, as it is no longer needed after flattening, and creates a new dataframe called finalDF.
* Cmd 5: Explains the physical plan of the finalDF dataframe, which is useful for optimizing and tuning the performance of the job.
* Cmd 7: Writes the finalDF dataframe to a table called flat_data, using the append mode to add new data to the existing table.


NEW QUESTION # 81
A junior data engineer seeks to leverage Delta Lake's Change Data Feed functionality to create a Type 1 table representing all of the values that have ever been valid for all rows in a bronze table created with the property delta.enableChangeDataFeed = true. They plan to execute the following code as a daily job:

Which statement describes the execution and results of running the above query multiple times?

  • A. Each time the job is executed, the entire available history of inserted or updated records will be appended to the target table, resulting in many duplicate entries.
  • B. Each time the job is executed, only those records that have been inserted or updated since the last execution will be appended to the target table giving the desired result.
  • C. Each time the job is executed, the differences between the original and current versions are calculated; this may result in duplicate entries for some records.
  • D. Each time the job is executed, newly updated records will be merged into the target table, overwriting previous values with the same primary keys.
  • E. Each time the job is executed, the target table will be overwritten using the entire history of inserted or updated records, giving the desired result.

Answer: A

Explanation:
Reading a table's changes, captured by CDF, with spark.read means that you are reading them as a static source. So each time you run the query, all of the table's changes (starting from the specified startingVersion) are read again.
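
The effect of a static read can be modeled in plain Python. This is a simplified model and not the Spark API: the job's code is not shown above, so the sketch assumes it reads from a fixed starting version and appends the result. The change feed rows and the run_batch_job helper are hypothetical.

```python
def run_batch_job(change_feed, target, starting_version=0):
    """Append every change at or after starting_version to target (batch semantics)."""
    target.extend(
        row for row in change_feed if row["version"] >= starting_version
    )


# Hypothetical change feed: one insert and one later update of the same key.
change_feed = [
    {"version": 1, "id": 1, "value": "a", "_change_type": "insert"},
    {"version": 2, "id": 1, "value": "b", "_change_type": "update_postimage"},
]

target = []
run_batch_job(change_feed, target)  # first daily run
run_batch_job(change_feed, target)  # second daily run re-reads the same history
print(len(target))
```

Because nothing records how far the previous run got, the second run appends the same rows again. A streaming read with a checkpoint is the usual way to process only new changes.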


NEW QUESTION # 82
Which of the following statements is correct about cluster pools?

  • A. Cluster pools are used to share resources among multiple teams
  • B. Cluster pools allow you to create a cluster
  • C. Cluster pools allow you to save time when starting a new cluster
  • D. Cluster pools allow you to perform load balancing
  • E. Cluster pools allow you to have all the nodes in the cluster from single physical server rack

Answer: C


NEW QUESTION # 83
A Delta Live Table pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE. The pipeline is configured to run in Development mode using the Triggered Pipeline Mode.
Assuming previously unprocessed data exists and all definitions are valid, what is the expected outcome after clicking Start to update the pipeline?

  • A. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped
  • B. All datasets will be updated continuously and the pipeline will not shut down. The compute resources will persist with the pipeline
  • C. All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing
  • D. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist after the pipeline is stopped to allow for additional testing
  • E. All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated

Answer: C


NEW QUESTION # 84
Which of the following statements can be used in Python to test that the number of rows in the table equals 10?
row_count = spark.sql("select count(*) from table").collect()[0][0]

  • A. assert (row_count = 10, "Row count did not match")
  • B. assert if row_count == 10, "Row count did not match"
  • C. assert row_count == 10, "Row count did not match"
  • D. assert if (row_count = 10, "Row count did not match")
  • E. assert row_count = 10, "Row count did not match"

Answer: C

Explanation:
The answer is assert row_count == 10, "Row count did not match"
A single = is assignment rather than comparison, and "assert if" is not valid Python syntax, so the other options fail to parse.
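
A short sketch of how the assert statement behaves when the check passes and when it fails. The check_row_count helper is a hypothetical wrapper, and the row count is hard-coded here; in a notebook it would come from the spark.sql call shown in the question.

```python
def check_row_count(row_count, expected=10):
    """Raise AssertionError with a message when the count does not match."""
    assert row_count == expected, "Row count did not match"
    return True


# Passing case: the condition is true, so nothing is raised.
print(check_row_count(10))

# Failing case: the message after the comma becomes the error text.
try:
    check_row_count(9)
except AssertionError as err:
    message = str(err)
    print(message)
```

Note that wrapping the condition and the message in parentheses, as in assert (row_count == 10, "Row count did not match"), builds a non-empty tuple that is always truthy, so such a check can never fail.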


NEW QUESTION # 85
Which configuration parameter directly affects the size of a spark-partition upon ingestion of data into Spark?

  • A. spark.sql.adaptive.advisoryPartitionSizeInBytes
  • B. spark.sql.adaptive.coalescePartitions.minPartitionNum
  • C. spark.sql.files.openCostInBytes
  • D. spark.sql.files.maxPartitionBytes
  • E. spark.sql.autoBroadcastJoinThreshold

Answer: D

Explanation:
This is the correct answer because spark.sql.files.maxPartitionBytes is a configuration parameter that directly affects the size of a spark-partition upon ingestion of data into Spark. This parameter configures the maximum number of bytes to pack into a single partition when reading files from file-based sources such as Parquet, JSON and ORC. The default value is 128 MB, which means each partition will be roughly 128 MB in size, unless there are too many small files or only one large file. Verified References: [Databricks Certified Data Engineer Professional], under "Spark Configuration" section; Databricks Documentation, under "Available Properties - spark.sql.files.maxPartitionBytes" section.
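
The note about small files and large files can be illustrated with a rough estimate of the split size. The sketch below follows how open-source Spark's file-scan planning is generally understood to derive its target split size, with defaults of 128 MB for spark.sql.files.maxPartitionBytes and 4 MB for spark.sql.files.openCostInBytes. Treat it as an approximation under those assumptions; max_split_bytes is a hypothetical helper, and the file sizes and parallelism are made up.

```python
MB = 1024 * 1024


def max_split_bytes(file_sizes, parallelism,
                    max_partition_bytes=128 * MB, open_cost_bytes=4 * MB):
    """Estimate the target split size for a file scan.

    Each file is padded with the open cost, the total is divided by the
    parallelism, and the result is capped by max_partition_bytes.
    """
    total_bytes = sum(size + open_cost_bytes for size in file_sizes)
    bytes_per_core = total_bytes // parallelism
    return min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))


# Hypothetical input: ten 1 GiB files read with a parallelism of 8.
large = max_split_bytes([1024 * MB] * 10, parallelism=8)
# Hypothetical input: four 1 MiB files with the same parallelism.
small = max_split_bytes([1 * MB] * 4, parallelism=8)
print(large // MB, small // MB)
```

With plenty of data the cap set by maxPartitionBytes decides the split size, while a handful of small files yields much smaller splits.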


NEW QUESTION # 86
Which of the following commands results in the successful creation of a view on top of the delta stream(stream on delta table)?

  • A. spark.read.format("delta").table("sales").trigger("stream").createOrReplaceTempView("streaming_vw")
  • B. spark.readStream.format("delta").table("sales").createOrReplaceTempView("streaming_vw")
  • C. spark.read.format("delta").table("sales").mode("stream").createOrReplaceTempView("streaming_vw")
  • D. You cannot create a view on a streaming data source.
  • E. spark.read.format("delta").table("sales").createOrReplaceTempView("streaming_vw")
  • F. spark.read.format("delta").stream("sales").createOrReplaceTempView("streaming_vw")

Answer: B

Explanation:
The answer is
spark.readStream.format("delta").table("sales").createOrReplaceTempView("streaming_vw")
When you load a Delta table as a stream source and use it in a streaming query, the query processes all of the data present in the table as well as any new data that arrives after the stream is started.
You can load both paths and tables as a stream; you can also choose to ignore deletes and changes (updates, merges, overwrites) on the Delta table.
For more information, see
https://docs.databricks.com/delta/delta-streaming.html#delta-table-as-a-source


NEW QUESTION # 87
You are asked to write a Python function that can read data from a Delta table and return the DataFrame. Which of the following is correct?

  • A. Python function will result in out of memory error due to data volume
  • B. Python function cannot return a DataFrame
  • C. Write SQL UDF to return a DataFrame
  • D. Write SQL UDF that can return tabular data
  • E. Python function can return a DataFrame

Answer: E

Explanation:
The answer is: a Python function can return a DataFrame.
The function would look something like this:
def get_source_dataframe(tablename):
    # spark is the session that Databricks notebooks provide
    df = spark.read.table(tablename)
    return df

df = get_source_dataframe('test_table')
Since no action is called, Spark does not run a job here; the function returns a lazily evaluated DataFrame, which is assigned to the Python variable df.


NEW QUESTION # 88
......
