More and more of our customers are looking at alternative data warehouse infrastructures that are purpose built to handle rapid queries on big data, while providing a low cost of entry and minimal administration. In this blog, we’ll discuss the cloud-based data warehouse as a service from Snowflake Computing, one of the more fascinating providers in this new generation of analytic database technologies.
Snowflake started from the ground up with an optimized database platform for the cloud, currently Amazon Web Services. This enabled them to build a truly elastic data warehouse, unconstrained by the frequent internal hardware capacity and procurement/installation delays found in most of our clients.
As workloads vary, Snowflake can scale up and down with the press of a button. The architecture utilizes the virtually unlimited resources and elasticity of the cloud to separate storage from compute while provisioning required resources in real-time. If you add new nodes, they are available immediately -- no data reloads!
Our friends at Snowflake Computing provided a full-fledged environment for us to put their technology through its paces. After several weeks of work, we are quite impressed. In the following sections, we’ll explore several of Snowflake’s innovative features and capabilities (some different from anything we’ve seen before!) including the separation of storage and compute resources, time travel features, querying semi-structured data, and security.
Power Up (or Power Down): Faster and Easier Scaling with Snowflake
One of the compelling design features of Snowflake is the separation of storage and compute resources. This allows users to easily address the bottlenecks within their system and only pay for the resources they need when they need them.
Storage of data in Snowflake is managed inside a “database” construct, as well as related staging areas for the database. A database has the same general design patterns of a traditional relational database with schemas, tables, views, etc. But, all hardware configurations associated with those objects are managed by the Snowflake service. Staging areas are available for each table within a database and each user created within the system. As data is put into the staging area or inserted into a table, the storage size scales automatically.
Snowflake databases and tables are incredibly easy to set up, using ANSI standard SQL -- no special syntax required. For those who’ve spent hours devising indexing strategies to improve performance, you’ll be amazed at how quickly you get excellent performance without indexes. Even easy to administer Amazon Redshift requires more complexity to tune. With Snowflake, there are no indexes to fiddle with, no partitioning or distribution to fuss over. Frankly, it just works.
Computing resources within Snowflake are managed through the concept of a “warehouse.” As you scale up the size of your data warehouse, the more performant your environment becomes. If there are periods of heavier workloads in your system, warehouses can easily be sized up. After the workload completes, you size back down. This allows for consistent performance throughout the day whether there are one, or a hundred users in the system, while maximizing the economic benefits of elasticity.
For example, loading 50GB of csv files into an x-small warehouse took about 15 minutes. When we scaled up the warehouse to medium, the load ran in just over 7 minutes. When we scaled it to an X-Large, it cut performance in half again.
50GB Load Time
For context, we loaded extremely wide records. Snowflake employs heavy parallel processing during loads. After discussion with Snowflake engineers, we learned that load performance can be further improved by breaking the source into smaller files and running them in parallel.
This small experiment confirmed the benefits of Snowflake’s elasticity. We could reduce our data load time by 4x with the simply push of a button -- instantly provisioning an X-Large cluster for processing. Once completed, we push the button again and reduce the cluster to X-Small. (Of course, you don’t actually physically have to push a button. This whole scale-up and down process can be automated with Snowflake APIs!)
As the data is loaded, Snowflake uses internal compression algorithms to reduce storage demands, and improve query performance. For our ~50GB dataset containing ~160MM rows, Snowflake compression resulted in only 6.6GB of storage.
Total CSV File Input Size
Compression and elasticity matter. Snowflake users pay for storage and compute resources, independently, using Snowflake credits. Storage costs are based on monthly on-disk bytes in your databases and staging areas. The better the compression the less you pay. Compute resources are consumed per hour at a predetermined rate for the size of your warehouse. You only pay for what you consume. Snowflake also provides the ability to auto-pause and auto-resume the warehouse to further reduce cost.
Future Day is Here: Using Snowflake Time Travel
Another important feature of Snowflake is the ability to revert back to a snapshot of your data at a specific point in time, or easily recreate a dropped database or truncated table. With Snowflake Time Travel users can recover lost data, query historical data or create a clone of a database or table for development and testing purposes. Unlike the complex backup and time-consuming restore process for traditional relational databases, Snowflake time travel allows users to quickly reverse an undesired change in their database -- whether that is errant data content or structure changes.
Data retention periods can be specified at the database or table level and point in time recovery can be retained for up to 90 days. While this increases the storage size of your environment and thus your credit usage, it provides greater flexibility and more peace of mind knowing that you can rapidly recover lost or fouled data.
Snowflake Time Travel quickly became one of our favorite features. Amongst numerous uses, one of the best is to quickly create a development or testing environment.
With Snowflake, it’s unnecessary to have multiple, expensive environments with the same architecture. Databases, schemas and tables can easily be cloned as-is or for a specified point in time. This allows users to rapidly develop and push to test in a stable, identical architecture environment at fractions of the cost of a traditional database. These clones are created virtually instantly. No data is physically copied around, so the clone is available immediately. You no longer need to stand up separate development and staging/test environments with their own hardware, storage, software configuration….Wow!
Data without Borders: Snowflake Access to Semi-Structured Data
Increasingly data is arriving in semi-structured formats, like JSON. When semi-structured data is presented to most relational databases, you’re typically either forced to parse the JSON into a structured format, or use complex and slow expressions to access the detailed information. Neither is ideal.
With Snowflake, users can easily integrate and query across these various formats - with no need to pre-process. Currently, Snowflake supports the storage and query of JSON, Avro and XML semi-structured data formats. JSON and Avro are ready for prime time. At the time of this writing, XML support is currently in a preview state, but is available to explore.
For our investigation, we evaluated the JSON querying functionality. The ease of use and ability to join JSON queries with standard SQL queries is very compelling. Utilizing standard dot and bracket notation, users can easily access individual elements within their JSON objects. They also have a built-in “flatten” function to process repeated values.
The above example demonstrates how easy it is to create a table, load with JSON data and query. Note that you can write queries that join columns from this JSON data and normal relational data.
Access Control: Security in Snowflake
Security is one of the most important aspects of any database system. Being a hosted service, Snowflake emphasizes security to ensure customers can feel comfortable their data is safe. Within the Snowflake architecture, security is controlled at the system, account and database level.
Snowflake follows security control protocols such as the NIST 800-53 and SANS 20, and best-practices like the use of AES 256 bit encryption, with a hierarchical key model for all data stored in Snowflake.
They also take advantage of all the security controls employed by Amazon for their AWS environment. This provides robust physical and network protection that lays the foundation for Snowflake’s enterprise-class security.
On top of the standard object-level and role-based security for traditional databases, Snowflake utilizes multi-factor authentication (MFA) for access to a user’s account which prevents unauthorized access by requiring a user to provide authentication through multiple methods to ensure the user is properly verified.
To further increase security, the service utilizes monitoring and alerting tools for 24/7 data security. Users will be notified of any suspicious activity and immediate action will taken to mitigate any security risks.
Snowflake’s security protocols in combination with Time Travel, should service most organizations with what’s needed for secure data access.
An Impressive Platform For Next Generation Data Warehousing
Overall, Snowflake is a very promising new technology that overcomes many of the economic, provisioning, administration and development constraints inherent to classic data warehousing environments. As data volumes increase and new sources are acquired, additional storage is automatically, instantaneously provisioned. As workloads vary throughout the day, warehouses can be sized up or down to meet the demand. This combination provides a powerful and flexible platform that can be tailored to most data and analytic requirements.
During our examination of the Snowflake platform, we had to retrain our brains. No longer were we burdened with classic data warehouse environment constraints. Gone were the worries about managing multiple prod/test/dev environments and the uncomfortable discussions about budget and resources needed to support them. Gone was the need to deal with mundane index optimization and low-level JSON parsing.
Instead we were able to focus on the analytic tasks at hand, with system resource provisioning becoming an afterthought. For sure, a production deployment will need to consider when to light up and shut down elastic resources, but the ease and spontaneity of these changes makes this less traumatic. You can experiment and tune using the benefits of a cloud based service.
After previewing the service for several weeks, we are convinced that Snowflake is a compelling next generation of data warehouse computing. Organizations willing to break from the past and adopt an elastic, cloud-based analytic database service will find competitive advantage. We’re excited to work with Snowflake and anxiously await the next chapter in their journey.