Amazon Redshift Interview Questions


How do you secure your Redshift database from external security threats?

First, Redshift clusters can be encrypted using AWS KMS (Key Management Service), which encrypts the data at rest on the compute nodes.
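As a minimal sketch, encryption at rest can be requested when the cluster is created, for example with boto3; the cluster identifier, node type, credentials and KMS key ARN below are all placeholders:

```python
import boto3

redshift = boto3.client("redshift")

# Create a KMS-encrypted cluster; identifiers and the key ARN are placeholders.
redshift.create_cluster(
    ClusterIdentifier="analytics-cluster",
    NodeType="ra3.xlplus",
    NumberOfNodes=2,
    MasterUsername="admin",
    MasterUserPassword="REPLACE_ME",
    Encrypted=True,  # encrypt data at rest
    KmsKeyId="arn:aws:kms:us-east-1:111122223333:key/example-key-id",
)
```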

For in-transit encryption, you can enable SSL connections. The require_ssl parameter in the Redshift parameter group can be set to true, which allows only SSL-secured connections to the cluster. For UNLOAD and COPY commands, it is recommended not to use IAM access keys as credentials. Instead, create an IAM role, attach it to the cluster and use that role for UNLOAD and COPY commands.
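For illustration, here is a hedged sketch of both steps with boto3: flipping require_ssl on the cluster's parameter group, and a COPY that relies on the cluster-attached role instead of embedded keys. The parameter group name, table, bucket and role ARN are placeholders.

```python
import boto3

redshift = boto3.client("redshift")

# Allow only SSL-secured connections (static parameter: takes effect after a reboot).
redshift.modify_cluster_parameter_group(
    ParameterGroupName="my-cluster-params",
    Parameters=[{
        "ParameterName": "require_ssl",
        "ParameterValue": "true",
        "ApplyType": "static",
    }],
)

# A COPY that authenticates via the cluster-attached IAM role, not access keys.
copy_sql = """
    COPY sales
    FROM 's3://my-bucket/sales/'
    IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftCopyRole'
    FORMAT AS CSV;
"""
```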

You can also enable enhanced VPC routing to secure the UNLOAD and COPY command traffic such that the traffic flows only within the AWS VPC and not over the public internet.
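Enhanced VPC routing is a single cluster setting; a one-call sketch, assuming a boto3 session with permission to modify the cluster (the cluster identifier is a placeholder):

```python
import boto3

redshift = boto3.client("redshift")

# Route UNLOAD/COPY traffic through the VPC instead of the public internet.
redshift.modify_cluster(
    ClusterIdentifier="analytics-cluster",
    EnhancedVpcRouting=True,
)
```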

How do you decide the memory and concurrency values for different WLM queues?

First, you identify the different workload types on your cluster. For example, loading data into your Redshift cluster or performing ETL operations would qualify as one queue, while analytical users or BI tools connecting to your database would qualify as another.

Then, you use the Redshift admin utility to determine the number of queries that are running in one queue and analyse whether there is any waiting time in a queue. Based on this information, you can tune the WLM parameters.
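One way to see queue pressure without the admin utility's scripts is to query STL_WLM_QUERY directly; the sketch below runs such a query through the Redshift Data API. The cluster, database and user names are placeholders, and the times in STL_WLM_QUERY are recorded in microseconds.

```python
import boto3

rsd = boto3.client("redshift-data")

# Average queue wait vs. execution time per WLM service class (user queues).
sql = """
    SELECT service_class,
           COUNT(*)                          AS queries,
           AVG(total_queue_time) / 1000000.0 AS avg_queue_secs,
           AVG(total_exec_time)  / 1000000.0 AS avg_exec_secs
    FROM stl_wlm_query
    WHERE service_class > 5      -- user-defined queues
    GROUP BY service_class
    ORDER BY service_class;
"""

rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="admin",
    Sql=sql,
)
```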

The alternative is to use automatic WLM, in which Redshift automatically determines the memory and concurrency for each queue.
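Switching to automatic WLM is a parameter group change; a minimal sketch follows, where the parameter group name is a placeholder (note that switching between manual and automatic WLM requires a cluster reboot):

```python
import boto3
import json

redshift = boto3.client("redshift")

# Hand memory and concurrency decisions over to automatic WLM.
redshift.modify_cluster_parameter_group(
    ParameterGroupName="my-cluster-params",
    Parameters=[{
        "ParameterName": "wlm_json_configuration",
        "ParameterValue": json.dumps([{"auto_wlm": True}]),
    }],
)
```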

How do you optimise costs for your Redshift cluster?

Costs for your Redshift cluster can be optimised in several ways, which are as follows:

  • Node reservations
  • Rightsizing
  • Proper column encoding/compression
  • Pausing the Redshift cluster when it is not in use and resuming it when required (see the sketch after this list)
  • Ensuring that VACUUM runs regularly and reclaims disk space
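As a rough sketch of the last two items: pausing and resuming are API calls, and vacuuming is plain SQL. The cluster identifier, database, user and table names below are placeholders.

```python
import boto3

redshift = boto3.client("redshift")

# Stop on-demand compute billing while the cluster is idle.
redshift.pause_cluster(ClusterIdentifier="analytics-cluster")
# ... later, when the cluster is needed again ...
redshift.resume_cluster(ClusterIdentifier="analytics-cluster")

# Reclaim space left behind by deleted rows.
rsd = boto3.client("redshift-data")
rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="admin",
    Sql="VACUUM DELETE ONLY sales;",
)
```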

Suppose you manage a Redshift cluster for ABC Company. On Monday, one of the users informs you that they have accidentally deleted one fact table from the production database. This table is extremely critical for the overall operations. How would you approach this problem?

Ask the user for the approximate time at which they deleted the table. Then check for the latest available snapshot of the cluster taken before that time; you can restore the table from that snapshot. If no snapshot exists for the cluster, it is not possible to restore the table.
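A sketch of that flow with boto3, assuming the deletion was reported to have happened shortly before 9 am; all identifiers, names and the timestamp are placeholders:

```python
import boto3
from datetime import datetime, timezone

redshift = boto3.client("redshift")

# Find the most recent snapshot taken before the reported deletion time.
snapshots = redshift.describe_cluster_snapshots(
    ClusterIdentifier="analytics-cluster",
    EndTime=datetime(2024, 1, 15, 9, 0, tzinfo=timezone.utc),  # placeholder time
)["Snapshots"]

if not snapshots:
    raise RuntimeError("No snapshot exists before the deletion; restore is not possible.")

latest = max(snapshots, key=lambda s: s["SnapshotCreateTime"])

# Restore the dropped table from that snapshot under a new name.
redshift.restore_table_from_cluster_snapshot(
    ClusterIdentifier="analytics-cluster",
    SnapshotIdentifier=latest["SnapshotIdentifier"],
    SourceDatabaseName="prod",
    SourceSchemaName="public",
    SourceTableName="fact_sales",       # the deleted table
    NewTableName="fact_sales_restored",
)
```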

Suppose you are working at ABC Financial Company, which has a strict requirement to keep the history of user activity on its Redshift cluster for the most recent month. As a Redshift administrator, what solution would you recommend or implement? In other words, what are the different ways of tracking user activity on a Redshift cluster?

Amazon Redshift provides the following three logging options:

  • Audit logs: Stored in Amazon Simple Storage Service (Amazon S3) buckets
  • STL tables: Stored on every node in the cluster
  • AWS CloudTrail: Stored in Amazon S3 buckets

Audit logs and STL tables record database-level activities, such as which users logged in and when, along with the SQL they ran and the time at which they ran it. Redshift system tables retain data for only 3–5 days, so it is recommended to create a custom script that incrementally copies data from the important system tables into permanent user-defined tables, preserving their history. CloudTrail, by contrast, tracks the activities performed at the service/API level.
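Two of these options can be sketched concretely: turning on audit logging to S3, and the custom preservation script for system tables. In the sketch below, the bucket, cluster and the history.stl_query_archive table (assumed to have been created beforehand with the same layout as STL_QUERY) are all hypothetical.

```python
import boto3

redshift = boto3.client("redshift")

# Ship connection/user/user-activity audit logs to S3 for long-term retention.
redshift.enable_logging(
    ClusterIdentifier="analytics-cluster",
    BucketName="my-audit-log-bucket",
    S3KeyPrefix="redshift/audit/",
)

# Incrementally copy STL_QUERY rows into a permanent table before they age out.
rsd = boto3.client("redshift-data")
rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="admin",
    Sql="""
        INSERT INTO history.stl_query_archive
        SELECT *
        FROM stl_query
        WHERE starttime > (SELECT COALESCE(MAX(starttime), '1900-01-01'::timestamp)
                           FROM history.stl_query_archive);
    """,
)
```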

What is the best way to download data from a Redshift cluster to an S3 bucket in another AWS account?

You can use IAM role chaining. Create an IAM role in your account and another in the target AWS account (one that has access to the S3 bucket there). Give the role in your account permission to assume the role in the other account. You can then run the UNLOAD command with role chaining to assume the target account's role and write the data to the target account's S3 bucket.
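In the UNLOAD command, chained roles are passed as a comma-separated list in the IAM_ROLE clause: first the cluster-attached role, then the role it assumes. All ARNs, names and the bucket below are placeholders.

```python
import boto3

rsd = boto3.client("redshift-data")

# RoleA (attached to the cluster, account 1111...) assumes RoleB
# (account 2222...), which can write to the target bucket.
unload_sql = """
    UNLOAD ('SELECT * FROM sales')
    TO 's3://target-account-bucket/sales/'
    IAM_ROLE 'arn:aws:iam::111111111111:role/RedshiftRoleA,arn:aws:iam::222222222222:role/BucketWriterRoleB'
    FORMAT AS PARQUET;
"""

rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="admin",
    Sql=unload_sql,
)
```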

Suppose you are working at XYZ Company, which uses Redshift for advanced analytics. Data is loaded daily from 9 pm to 6 am so that analytics can be run the next day. Two different user groups/teams run queries on the Redshift cluster: the Advanced Data Science team, which runs complex analytical queries on the data, and the Business Users, who run a few small queries throughout the day.

How would you ensure maximum efficiency and low contention on the Redshift cluster?

You can set up Workload Management (WLM) on Redshift. Create separate queues for ETL, the Advanced Data Science (ADS) team and Business Users.

Manual WLM: Assign most resources (say, 90%) to the ETL queue from 9 pm to 6 am, then switch the WLM settings at 6 am: assign approximately 60% of resources to the ADS queue and 30% to the Business Users' queue, with short query acceleration and concurrency scaling enabled (a sample daytime configuration is sketched after these options).

Automatic WLM: Redshift will take care of resource distribution automatically.
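For the manual option, here is a hedged sketch of what the daytime configuration might look like as a wlm_json_configuration value. The queue split, user-group names and parameter group name are all assumptions, and manual WLM expects the last (default) queue to have no user group. A scheduled job (for example, EventBridge plus Lambda) could apply an ETL-heavy variant at 9 pm.

```python
import boto3
import json

redshift = boto3.client("redshift")

# Daytime queues: most memory to ADS, then Business Users, with
# concurrency scaling per queue and short query acceleration enabled.
daytime_wlm = [
    {"user_group": ["etl"], "memory_percent_to_use": 10, "query_concurrency": 2},
    {"user_group": ["ads"], "memory_percent_to_use": 55, "query_concurrency": 5,
     "concurrency_scaling": "auto"},
    {"user_group": ["biz"], "memory_percent_to_use": 30, "query_concurrency": 10,
     "concurrency_scaling": "auto"},
    {"memory_percent_to_use": 5, "query_concurrency": 2},  # default queue
    {"short_query_queue": True},                           # SQA on
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName="my-cluster-params",
    Parameters=[{
        "ParameterName": "wlm_json_configuration",
        "ParameterValue": json.dumps(daytime_wlm),
    }],
)
```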

Suppose you are managing a Redshift cluster for a telecom company. Today, you received multiple complaints from the users about the queries running slowly. How would you diagnose the issue?

  1. Check whether any changes were made to the table structure recently or whether the data volume in the underlying tables has unexpectedly increased. Either can cause queries to run more slowly.
  2. Check whether the underlying tables have been vacuumed and analysed. Stale table statistics can cause the optimiser to generate bad query plans, which leads to performance degradation (a query to spot such tables is sketched after this list).
  3. Generate an EXPLAIN plan for the query and assess whether data is being broadcast over the network. If it is, review the sort keys and distribution keys of the underlying tables.
  4. Check whether the query is frequently writing intermediate results to disk. If it is, increase the WLM memory for the queue in which these queries run: when too little memory is available, disk writes happen, which degrades performance.
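For step 2, SVV_TABLE_INFO exposes how stale each table's statistics are and how much of the table is unsorted. A sketch follows; the cluster, database and user are placeholders, and the thresholds are arbitrary.

```python
import boto3

rsd = boto3.client("redshift-data")

# Flag tables whose stale statistics or unsorted regions may be
# driving bad query plans.
sql = """
    SELECT "table",
           tbl_rows,
           unsorted,    -- percentage of rows in the unsorted region
           stats_off    -- how stale the statistics are, as a percentage
    FROM svv_table_info
    WHERE stats_off > 10 OR unsorted > 20
    ORDER BY stats_off DESC;
"""

rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="admin",
    Sql=sql,
)
```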

How does an interleaved sort key differ from a compound sort key?

A compound sort key is more efficient when query predicates use a prefix, which is a subset of the sort key columns in order.
An interleaved sort key gives equal weight to each column in the sort key, and so, query predicates can use any subset of the columns that make up the sort key in any order.
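To make the contrast concrete, here is an illustrative pair of DDL statements (table and column names are made up): the compound version favours filters on event_date, or on event_date plus region, while the interleaved version serves a filter on region alone just as well.

```python
# Compound: fastest when predicates use a prefix of (event_date, region).
compound_ddl = """
    CREATE TABLE events_compound (
        event_date DATE,
        region     VARCHAR(16),
        payload    VARCHAR(256)
    )
    COMPOUND SORTKEY (event_date, region);
"""

# Interleaved: equal weight to each column, so filtering on region alone
# benefits as much as filtering on event_date.
interleaved_ddl = """
    CREATE TABLE events_interleaved (
        event_date DATE,
        region     VARCHAR(16),
        payload    VARCHAR(256)
    )
    INTERLEAVED SORTKEY (event_date, region);
"""
```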