Amazon Web Services Data-Engineer-Associate - AWS Certified Data Engineer - Associate (DEA-C01)
Total 231 questions
A data engineer needs to schedule a workflow that runs a set of AWS Glue jobs every day. The data engineer does not require the Glue jobs to run or finish at a specific time.
Which solution will run the Glue jobs in the MOST cost-effective way?
A retail company stores customer data in an Amazon S3 bucket. Some of the customer data contains personally identifiable information (PII) about customers. The company must not share PII data with business partners.
A data engineer must determine whether a dataset contains PII before making objects in the dataset available to business partners.
Which solution will meet this requirement with the LEAST manual intervention?
A company uses Amazon S3 to store semi-structured data in a transactional data lake. Some of the data files are small, but other data files are tens of terabytes.
A data engineer must perform a change data capture (CDC) operation to identify changed data from the data source. The data source sends a full snapshot as a JSON file every day and ingests the changed data into the data lake.
Which solution will capture the changed data MOST cost-effectively?
A company is using an AWS Transfer Family server to migrate data from an on-premises environment to AWS. Company policy mandates the use of TLS 1.2 or above to encrypt the data in transit.
Which solution will meet these requirements?
A company is designing a serverless data processing workflow in AWS Step Functions that involves multiple steps. The processing workflow ingests data from an external API, transforms the data by using multiple AWS Lambda functions, and loads the transformed data into Amazon DynamoDB.
The company needs the workflow to perform specific steps based on the content of the incoming data.
Which Step Functions state type should the company use to meet this requirement?
A retail company stores data from a product lifecycle management (PLM) application in an on-premises MySQL database. The PLM application frequently updates the database when transactions occur.
The company wants to gather insights from the PLM application in near real time. The company wants to integrate the insights with other business datasets and to analyze the combined dataset by using an Amazon Redshift data warehouse.
The company has already established an AWS Direct Connect connection between the on-premises infrastructure and AWS.
Which solution will meet these requirements with the LEAST development effort?
A company builds a new data pipeline to process data for business intelligence reports. Users have noticed that data is missing from the reports.
A data engineer needs to add a data quality check for columns that contain null values and for referential integrity at a stage before the data is added to storage.
Which solution will meet these requirements with the LEAST operational overhead?
A company needs to automate data workflows from multiple data sources to run both on schedules and in response to events from Amazon EventBridge. The data sources are Amazon RDS and Amazon S3. The company needs a single data pipeline that can be invoked both by scheduled events and near real-time EventBridge events.
Which solution will meet these requirements with the LEAST operational overhead?
A company has a data warehouse that contains a table that is named Sales. The company stores the table in Amazon Redshift The table includes a column that is named city_name. The company wants to query the table to find all rows that have a city_name that starts with "San" or "El."
Which SQL query will meet this requirement?
A company maintains a data warehouse in an on-premises Oracle database. The company wants to build a data lake on AWS. The company wants to load data warehouse tables into Amazon S3 and synchronize the tables with incremental data that arrives from the data warehouse every day.
Each table has a column that contains monotonically increasing values. The size of each table is less than 50 GB. The data warehouse tables are refreshed every night between 1 AM and 2 AM. A business intelligence team queries the tables between 10 AM and 8 PM every day.
Which solution will meet these requirements in the MOST operationally efficient way?
