Snowflake Interview Questions
Snowflake is a data cloud company whose cloud-based data warehouse (DWH) is provided as software-as-a-service (SaaS). Its unique architecture handles multiple aspects of data and analytics. Snowflake sets itself apart from traditional data warehouse solutions with advanced capabilities like improved performance, simplicity, high concurrency and cost-effectiveness.
 
Snowflake is a data warehouse (DWH) platform built on top of AWS (Amazon Web Services), Microsoft Azure, and Google Cloud infrastructure to provide companies with flexible, scalable storage solutions while also hosting BI (Business Intelligence) solutions.
 
Snowflake Inc. was founded in July 2012 in San Mateo, California by three data warehousing experts: Benoit Dageville, Thierry Cruanes and Marcin Żukowski. It was publicly launched in October 2014 after two years in stealth mode.
Snowflake’s architecture is a hybrid of traditional shared-disk and shared-nothing database architectures. Similar to shared-disk architectures, Snowflake uses a central data repository for persisted data that is accessible from all compute nodes in the platform. But similar to shared-nothing architectures, Snowflake processes queries using MPP (massively parallel processing) compute clusters where each node in the cluster stores a portion of the entire data set locally. This approach offers the data management simplicity of a shared-disk architecture, but with the performance and scale-out benefits of a shared-nothing architecture.
The Snowflake architecture is divided into three key layers as shown below :   
 
Cloud Services Layer : The cloud services layer is a collection of services that coordinate activities across Snowflake. These services tie together all of the different components of Snowflake in order to process user requests, from login to query dispatch. The cloud services layer also runs on compute instances provisioned by Snowflake from the cloud provider.
 
Services managed in this layer include :
 
* Authentication & Access Control
* Infrastructure management
* Metadata management
* Query parsing and optimization
* Access control

Query Processing Layer : Query execution is performed in the processing layer. Snowflake processes queries using “virtual warehouses”. Each virtual warehouse is an MPP compute cluster composed of multiple compute nodes allocated by Snowflake from a cloud provider.
 
Each virtual warehouse is an independent compute cluster that does not share compute resources with other virtual warehouses. As a result, each virtual warehouse has no impact on the performance of other virtual warehouses.
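As a hedged illustration, a virtual warehouse is created and selected with standard SQL; the warehouse name and settings below are assumptions, not fixed values :
create warehouse if not exists demo_wh
  with warehouse_size = 'XSMALL'
  auto_suspend = 60    -- suspend after 60 seconds of inactivity
  auto_resume = true;  -- resume automatically when a query arrives

use warehouse demo_wh;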

Database Storage Layer : When data is loaded into Snowflake, Snowflake reorganizes that data into its internal optimized, compressed, columnar format. Snowflake stores this optimized data in cloud storage.
 
Snowflake manages all aspects of how this data is stored: the organization, file size, structure, compression, metadata, statistics, and other aspects of data storage are handled by Snowflake. The data objects stored by Snowflake are neither directly visible nor accessible to customers; they are only accessible through SQL query operations run using Snowflake.
Snowflake has the following key features :
 
* With Snowflake, you can interact with the data cloud through a web interface. Users can navigate the web GUI to manage their accounts, monitor resources and system usage, query data, etc.

* Users can connect to Snowflake's data cloud using a wide range of client connectors and drivers. Among these are the Python Connector (an interface for writing Python applications that connect to Snowflake), the Spark connector, the Node.js driver, the .NET driver, the JDBC driver for Java development, the ODBC driver for C or C++ programming, etc.

* The core architecture of Snowflake enables it to operate on the public cloud, where it uses virtualized compute instances and efficient storage buckets to process huge amounts of big data in a cost-effective, scalable way.

* Snowflake integrates with a number of big data tools, including business intelligence, machine learning, data integration, security, and governance tools.

* With advanced features such as simplicity, increased performance, high concurrency, and cost-effectiveness, Snowflake stands apart from traditional data warehouse solutions.

* Snowflake supports the storage of both structured and semi-structured data (such as JSON, Avro, ORC, Parquet, and XML data).

* Snowflake automates cloud data management, security, governance, availability, and data resilience, resulting in reduced costs, no downtime, and better operational efficiency.

* With it, users can rapidly query data from a database without impacting the underlying dataset, allowing them to work with data closer to real time.

* Most DDL (Data Definition Language) and DML (Data Manipulation Language) commands in SQL are supported by the Snowflake data warehouse. Additionally, advanced DML, lateral views, transactions, stored procedures, etc., are also supported.
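For example, here is a brief, hedged sketch of supported DDL and DML statements; the table and values are illustrative :
create table employees (id number, name string);    -- DDL
insert into employees values (1, 'Ada');            -- DML
update employees set name = 'Ada L.' where id = 1;  -- DML
delete from employees where id = 1;                 -- DML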
Snowflake cloud data warehouse platform provides instant, secure, and governed access to the entire data network and a core architecture to enable various types of data workloads, including a single platform for developing modern data applications. Snowflake brings together the power of data warehouses, the scalability of big data platforms, the elasticity of the cloud, and real-time data sharing at a fraction of the cost of traditional solutions.
The services layer acts as the brain of Snowflake. It authenticates user sessions, applies security functions, offers management, performs optimization, and coordinates all transactions.
Snowflake's data warehouse can be accessed in the following ways :
 
* ODBC Drivers (a driver for connecting to Snowflake).

* JDBC Drivers (a driver enabling a Java application to interact with a database).

* Python Libraries (for creating Python applications that connect to Snowflake and perform standard operations).

* Web User Interface (can be used for almost any task you can accomplish with SQL and the command line, such as: Creating and managing users and other account-level objects).

* SnowSQL Command-line Client (Python-based command-line interface to connect to Snowflake from Windows, Linux, and MacOS).
Snowflake vs. Redshift
Despite similar on-demand pricing, Snowflake and Redshift package their features differently. In Snowflake's pricing structure, compute usage is billed separately from storage usage; in Redshift, compute and storage usage are combined.
Snowflake is more robust than Redshift when it comes to JSON storage. In essence, Snowflake makes it possible to store and query JSON with built-in, native functions. On the other hand, when JSON is loaded into Redshift, it is split into strings, making it more difficult to query and work with.  
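As an illustrative sketch, Snowflake stores JSON in a VARIANT column and queries fields natively with path and cast notation; the table and field names here are assumptions :
create table raw_events (payload variant);

insert into raw_events
  select parse_json('{"name": "ada", "score": 42}');

-- Query JSON fields natively, casting them to SQL types
select payload:name::string as user_name,
       payload:score::number as score
from raw_events;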
Snowflake offers security and compliance features tailored to each of its editions, so that your data is protected to the level your data strategy requires. The Redshift platform offers a wide range of encryption solutions.
Data vacuuming and compression are automated on Snowflake, which saves time and effort by handling much of this maintenance itself. On Redshift, data vacuuming and compression cannot be automated, so the system requires more hands-on maintenance.
Data security is a top priority for every organization. Snowflake meets the industry's highest security standards for encrypting and securing the data and customer accounts it stores. It provides key management features at no additional charge.
 
The following are the security measures that Snowflake uses to protect client data :
 
* Snowflake utilizes a managed key for automatically encrypting the data it contains.
* Snowflake relies on TLS for protecting communication between customers and servers.
* It enables you to choose a geographic location for storing your data depending on your cloud region.
By default, all data in Snowflake is compressed. Snowflake selects the best compression algorithms automatically; this cannot be configured by end-users. A notable benefit is that after compressing the data, Snowflake charges the customer according to the final size of the compressed data.
 
Benefits of data compression in Snowflake :
 
* Storage costs are lower compared to native cloud storage costs due to compression.
* There is no storage cost for disk caches.
* Storage overhead is close to zero when cloning or sharing data.
Stages are locations in Snowflake where data is stored, and staging is the process of uploading data into a stage. Data that needs to be loaded into Snowflake is stored either externally in cloud storage such as AWS (Amazon Web Services) S3, GCP (Google Cloud Platform), or Azure, or internally within Snowflake. When data is stored in external cloud storage, the location is known as an external stage; when it is stored inside Snowflake, it is known as an internal stage. Internal stages can be further categorized as follows :
 
User stages : Each of these stages pertains to a specific user, so they'll be assigned to every user by default for storing files.

Table stages : Each of these stages pertains to a specific database table, so they'll be assigned to every table by default.

Internal named stages : Compared to user or table stages, these stages offer a greater degree of flexibility. Because they are Snowflake objects, all operations that can be performed on objects can also be performed on internal named stages. These stages must be created manually, and we can specify file formats when creating them, as shown below.
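A hedged sketch of creating and using an internal named stage; the names, file format options, and paths are illustrative :
create stage my_csv_stage
  file_format = (type = 'CSV' field_delimiter = ',' skip_header = 1);

-- From SnowSQL: upload a local file into the stage
put file:///tmp/mydata.csv @my_csv_stage;

-- Load the staged file into an existing table
copy into mytable from @my_csv_stage;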
Snowflake supports transformations both during loading (ETL) and after loading (ELT). Snowflake integrates with a variety of data integration solutions, including Informatica, Talend, Tableau, Matillion, and others.
 
In data engineering, new tools and self-service pipelines are displacing traditional tasks such as manual ETL coding and data cleaning. With Snowflake's simple ETL and ELT options, data engineers can spend more time on essential data strategy and pipeline improvement initiatives. Furthermore, using the Snowflake Cloud Platform as both data lake and data warehouse, extract-transform-load can often be avoided entirely, as no pre-transformations or pre-schemas are needed.
Following are some of the best ETL tools for Snowflake :
 
* Matillion
* Blendo
* Hevo Data
* StreamSets
* Etleap
What kind of SQL does Snowflake use?
Snowflake supports the most common standardized version of SQL, i.e., ANSI SQL, for powerful relational database querying.
In simple terms, Snowpipe is a continuous data ingestion service provided by Snowflake that loads files within minutes of their being added to a stage and submitted for ingestion. Data is loaded in micro-batches (small groups of files), making it available to users within minutes, rather than requiring COPY statements to be run manually on a schedule to load large batches. Loading in micro-batches also makes the data easier to analyze as it arrives. Snowpipe uses a combination of filenames and file checksums to ensure that only new data is processed.
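A minimal sketch of a pipe, assuming an external stage and a target table already exist; all names here are illustrative :
create pipe mypipe
  auto_ingest = true  -- load automatically as new files land in the external stage
  as
  copy into mytable
  from @my_external_stage
  file_format = (type = 'JSON');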
 
Advantages of Snowpipe :
 
* By eliminating roadblocks, Snowpipe facilitates real-time analytics.
* It is cost-effective.
* It is simple to use.
* There is no management required.
* It provides flexibility, resilience, and so on.
Snowflake offers four editions depending on your usage requirements.
 
Standard edition : The introductory-level offering, providing unlimited access to Snowflake's standard features.

Enterprise edition : Along with Standard edition features and services, it offers additional features required by large-scale enterprises.

Business Critical edition : Also called Enterprise for Sensitive Data (ESD), it offers higher levels of data protection for organizations with sensitive data.

Virtual Private Snowflake (VPS) : Provides the highest level of security, for organizations dealing with financial activities.
Horizontal Scaling : Horizontal scaling increases concurrency. As your customer base grows, you can use auto-scaling to increase the number of virtual warehouses, enabling you to respond instantly to additional queries.

Vertical Scaling : Vertical scaling increases the processing power (e.g. CPU, RAM) of an existing machine, which generally reduces processing time. Consider choosing a larger virtual warehouse size if you want to optimize your workload and make it run faster, as in the sketch below.
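Both styles of scaling can be sketched with ALTER WAREHOUSE; the warehouse name and limits are assumptions, and multi-cluster warehouses require Enterprise edition or higher :
-- Vertical scaling: resize for faster individual queries
alter warehouse demo_wh set warehouse_size = 'LARGE';

-- Horizontal scaling: allow extra clusters as concurrency grows
alter warehouse demo_wh set min_cluster_count = 1 max_cluster_count = 3;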
Snowflake is developed as an OLAP (Online Analytical Processing) database system, not as an OLTP (Online Transaction Processing) database system. In OLTP (Online Transaction Processing), data is collected, stored, and processed from real-time transactions, but in OLAP (Online Analytical Processing), complex queries are used to evaluate aggregated historical data from OLTP systems. Snowflake is not designed to handle much updating and inserting of small amounts of data like a transactional database would.

Snowflake, for instance, does not enforce referential integrity: although it supports integrity and other constraints, they are not enforced, with the exception of the NOT NULL constraint, which is always enforced. Constraints other than NOT NULL are created as disabled constraints. However, depending on the use case, it can also serve some online transaction processing (OLTP) needs.
In Snowflake, a virtual warehouse, often known simply as a "warehouse," is a collection of compute resources. A virtual warehouse provides the resources, such as CPU, memory, and temporary storage, that users need to perform Snowflake operations:
 
Executing SQL SELECT statements that require compute resources (e.g. retrieving rows from tables and views).

DML operations include :

* Updating table rows (DELETE, INSERT, UPDATE).
* Data Loading into tables (COPY INTO <table>).
* Data unloading from tables (COPY INTO <location>).
To reduce risk, DBAs traditionally perform complete and incremental data backups on a regular basis. This consumes extra storage, sometimes double or triple the original footprint, and recovering data is expensive, time-consuming, and can require company downtime.
The Snowflake data sharing option enables users to safely share data objects in a database in your account with other Snowflake accounts. All database objects shared between Snowflake accounts are read-only and cannot be modified. The shareable database objects in Snowflake are tables, external tables, secure views, secure materialized views, and secure UDFs.
 
There are three kinds of data sharing :
 
* Sharing data between management units
* Sharing data between functional units
* Sharing data between geographically dispersed areas
It is an implementation that enables us to create a copy of tables, schemas, and databases without actually copying the data. To perform a zero-copy clone in Snowflake, we use the CLONE keyword. This option lets you take real-time production data and carry out several actions on it, as shown below.
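A hedged sketch of zero-copy cloning; the object names are illustrative :
-- Clone a table, a schema, or an entire database without copying data
create table orders_dev clone orders;
create schema analytics_dev clone analytics;
create database prod_copy clone prod;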

Advantages :
 
* There are no additional storage costs associated with data replication.
* There is no waiting time for copying data from production to non-production contexts.
* There is no need for administrative efforts since cloning is as simple as a click of a button. 
* No copy, only clone: Data exists only in one place.
* Promote corrected/fixed data to production instantly.
In Snowflake, clustering is a type of data partitioning in which unique cluster keys are specified for a table. Cluster keys are subsets of a table's columns that are used to co-locate data within the table; they are most appropriate for very large tables. The process of managing clustered data in a table is known as re-clustering.
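As an illustrative sketch, a cluster key can be declared at creation time or added later; the table and column names are assumptions :
create table events (event_date date, user_id number, payload variant)
  cluster by (event_date, user_id);

-- Add or change the cluster key on an existing table
alter table events cluster by (event_date);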
A columnar database is the opposite of a conventional row-oriented database: it saves data in columns rather than rows, which simplifies analytical query processing and offers better performance. Columnar storage eases analytics work and is widely seen as the future of business intelligence.
A materialized view is a pre-computed dataset derived from a query specification. Since the data is pre-computed, querying a materialized view is much faster than executing the equivalent query against the view's base table, as a non-materialized view would. In plain language, materialized views are designed to improve the performance of common and repetitive query patterns. Materialized views are first-class database objects and accelerate projection, selection, and expensive aggregation operations for queries that run on large datasets.
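A minimal sketch of a materialized view that pre-computes an expensive aggregation; the names are illustrative, and materialized views require Enterprise edition or higher :
create materialized view daily_sales as
  select order_date, sum(amount) as total_amount
  from orders
  group by order_date;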
The Snowflake Time Travel tool enables you to access historical data at any point within a defined time period, including data that has been deleted or changed. Using this tool you can perform the tasks below (illustrative queries follow the list) :
 
* Restore data-related objects (schemas, tables, and databases) that might have been lost accidentally.
* Examine data usage and the changes made to data within a time period.
* Back up and duplicate data from key points in the past.
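Hedged examples of Time Travel queries follow; the table name, offset, and timestamp are illustrative :
-- Query a table as it existed one hour ago
select * from orders at (offset => -3600);

-- Query a table as of a specific point in time
select * from orders at (timestamp => '2023-01-01 12:00:00'::timestamp_tz);

-- Restore an accidentally dropped table
undrop table orders;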
Snowflake is more popular due to the following reasons :
 
* Snowflake provides a broad range of technology areas like data integration, business intelligence, security, and modern analytics.
* It offers cloud infrastructure and endorses advanced design architectures ideal for dynamic and agile usage trends.
* Snowflake endorses unique features like scalable compute, data cloning, data sharing, and the separation of compute and storage.
* Snowflake streamlines data processing.
Schemas like Star and Snowflake serve as a logical description of the entire database, or how data is organized in a database. 
 
Star Schema : A star schema typically consists of one fact table and several associated dimension tables. The star schema is so named because the structure has the appearance of a star. Dimensions of the star schema have been denormalized. When the same values are repeated within a table, it is considered denormalization.
 
Snowflake Schema : In a snowflake schema, the center holds a fact table, which is associated with a number of dimension tables; those dimension tables are in turn associated with other dimension tables. Snowflake schemas provide fully normalized data structures, with separate dimension tables for each level of a hierarchy (e.g., city, country, region), as sketched below.
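A hedged DDL sketch of a snowflake schema's normalized dimension chain; all table and column names are assumptions :
create table dim_region  (region_id number, region_name string);
create table dim_country (country_id number, country_name string, region_id number);
create table dim_city    (city_id number, city_name string, country_id number);

-- The fact table references only the lowest-level dimension
create table fact_sales (sale_id number, city_id number, amount number);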
Cloud-based data warehouse platforms like Snowflake and Amazon Redshift provide excellent performance, scalability, and business intelligence tools. In terms of core functionality, both platforms provide similar capabilities, such as relational management, security, scalability, cost efficiency, etc. There are, however, several differences between them, such as pricing, user experience and deployment options. 
 
* There is no maintenance required with Snowflake as it is a complete SaaS (Software as a Service) offering. In contrast, AWS Redshift clusters require manual maintenance.

* The Snowflake security model uses always-on encryption to enforce strict security checks, while Redshift uses a flexible, customizable approach.

* Storage and compute in Snowflake are completely independent, meaning the storage costs are approximately the same as storing the data in S3. AWS addresses the same problem with Redshift Spectrum, which lets you query data directly in S3, but this is not as seamless as Snowflake's approach.
What is the use of the Compute layer in Snowflake?
In Snowflake, virtual warehouses, which are clusters of compute resources, perform all the data handling tasks. While executing a query, a virtual warehouse extracts the minimum data needed from the storage layer to satisfy the query request.
Definitely. AWS Glue provides a comprehensive managed environment that connects easily with Snowflake as a data warehouse service. Together, the two solutions let you handle data ingestion and transformation with more ease and flexibility.
* The data retention period is an important aspect of Snowflake Time Travel.
 
* When data in a table is modified, such as deletion or discarding an object holding data, Snowflake saves the data's previous state. The data retention period determines the number of days that this historical data is kept and, as a result, Time Travel operations (SELECT, CREATE... CLONE, UNDROP) can be performed on it.
 
* The standard retention period is one day (24 hours) and is enabled by default for all Snowflake accounts; a sketch for changing it per table follows below.
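As a hedged sketch, the retention period can be changed per object; the table name is illustrative, and the 90-day maximum requires Enterprise edition or higher :
alter table orders set data_retention_time_in_days = 90;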
Views are useful for displaying certain rows and columns in one or more tables. A view makes it possible to obtain the result of a query as if it were a table. The CREATE VIEW statement defines the query. Snowflake supports two different types of views :
 
Non-materialized views (often referred to as "views") - The results of a non-materialized view are obtained by executing the query at the moment the view is referenced in a query. Compared to materialized views, performance is slower.

Materialized views - Although named as a type of view, a materialized view behaves more like a table in many respects. The results of a materialized view are stored, much as a table's results would be. This allows faster access, but it requires storage space and active maintenance, both of which incur extra costs.
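A brief sketch contrasting the two view types; the names are illustrative :
-- Non-materialized: the query runs each time the view is referenced
create view active_users as
  select * from users where status = 'ACTIVE';

-- Materialized: results are pre-computed and stored
create materialized view user_counts as
  select status, count(*) as n from users group by status;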
If you would like to reset your environment by deleting all the objects created as part of this lab, run the SQL statements in a worksheet.
 
First, ensure you are using the ACCOUNTADMIN role in the worksheet :
use role accountadmin;
Then, run the following SQL commands to drop all the objects we created in the lab :
drop share if exists zero_to_snowflake_shared_data;
-- If necessary, replace "zero_to_snowflake_shared_data" with the name you used for the share

drop database if exists citibike;

drop database if exists weather;

drop warehouse if exists analytics_wh;

drop role if exists junior_dba;
In Snowflake, stages are data storage locations. If the data to be imported into Snowflake is stored in another cloud area, such as AWS S3, Azure, or GCP, these are referred to as External stages; if the data is stored within Snowflake, they are referred to as Internal stages.
 
Internal stages are further divided as follows :
 
* Table Stage
* User Stage
* Internal Named Stage
Stored procedures allow us to create modular code containing complex business logic by combining multiple SQL statements with procedural logic. A Snowflake stored procedure can carry out steps such as the following :
 
* Run a SQL statement
* Extract the query results
* Extract the result set metadata
Yes, Snowflake supports stored procedures. A stored procedure is like a function: it is created once and can be used many times. It is created with the CREATE PROCEDURE command and executed with the CALL command. In Snowflake, stored procedures are written in JavaScript; the JavaScript API enables stored procedures to execute database operations such as SELECT, UPDATE, and CREATE, as in the sketch below.
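A hedged sketch of a JavaScript stored procedure; the procedure name and logic are illustrative :
create or replace procedure count_rows(table_name string)
returns float
language javascript
as
$$
  // Arguments are exposed to JavaScript in uppercase.
  // The table name is concatenated here for illustration only.
  var stmt = snowflake.createStatement({
    sqlText: "select count(*) from " + TABLE_NAME
  });
  var rs = stmt.execute();
  rs.next();
  return rs.getColumnValue(1);
$$;

call count_rows('MYTABLE');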
Following are the Drivers and Connectors that exist in Snowflake :
 
* Node.js Driver
* .NET Driver
* ODBC Driver
* JDBC Driver
* Snowflake Connector for Kafka
* Snowflake Connector for Python
* Snowflake Connector for Spark
Following are the different types of tables available in Snowflake :
 
Permanent : This is the typical database table. It utilizes more storage, and fail-safe and Time Travel periods can be enabled for it. Permanent tables are useful for data that requires a higher level of data recovery and protection.

Transient : In Snowflake, we can create transient tables that persist until explicitly dropped and are accessible to all users with the relevant privileges (a sketch follows this list).

Temporary : We use temporary tables to store transitory, non-permanent data. Temporary tables exist only within the session in which they were created.

External : External tables are read-only, and we cannot perform DML operations on them. We can use external tables for join and query operations.
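For illustration, a hedged sketch of creating a transient table; the table name and columns are assumptions :
create transient table staging_events (id number, payload variant);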
In the CREATE TABLE DDL, specify the TEMPORARY keyword (or the TEMP abbreviation) to create a temporary table. The following syntax must be used to create temporary tables:
create temporary table mytable (id number, creation_date date);