Why Apache Iceberg will rule data in the cloud

cxounion.org

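The cloud has allowed data teams to collect vast quantities of data and store it at a reasonable cost, opening the door to new analytics use cases that leverage data lakes, data mesh, and other modern architectures. But for very large volumes of data, generic cloud storage also presents challenges and limitations in how that data can be accessed, managed, and used.

Typical blob storage systems in the cloud lack the information needed to show how files relate to one another or how they correspond to a table, which makes the job of query engines that much harder. In addition, files by themselves do not make it easy to change a table's schema or to "time travel" over it. Each query engine must have its own view of how to query the files. All of a sudden, a data architecture that seemed easy to implement becomes more difficult than expected.

This is where applying a table format to data becomes extremely useful. A table format explicitly defines a table, its metadata, and the files that compose the table. Instead of applying a schema when the data is read, clients already know the schema before the query is run. Moreover, the table metadata can be saved in a way that offers more fine-grained partitioning. Applying a table format to the data can therefore offer a number of advantages, such as: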

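1. Faster performance due to better filtering or partitioning

2. Easier evolution of the schema

3. The ability to "time travel" across the table to view data at a given point in time

4. ACID compliance for tables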
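The first and third advantages can be illustrated with a small sketch. It is a toy model, not Iceberg's actual API or metadata layout: the names `ToyTable`, `Snapshot`, and `DataFile`, the `day` column, and the file names are all hypothetical. The idea it shows is that a table format keeps an explicit list of files per commit, plus per-file column statistics, so a reader can pick an older snapshot and skip files that cannot match a filter.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataFile:
    """One immutable data file plus the column statistics kept in table metadata."""
    path: str
    min_day: int  # smallest value of the hypothetical `day` column in this file
    max_day: int  # largest value of `day` in this file


@dataclass(frozen=True)
class Snapshot:
    """The complete list of files that made up the table at one commit."""
    snapshot_id: int
    timestamp: int  # commit time; plain integers keep the example simple
    files: tuple


@dataclass
class ToyTable:
    """A toy table: a schema known up front plus an append-only snapshot log."""
    schema: tuple
    snapshots: list = field(default_factory=list)

    def commit(self, timestamp, added=(), removed=()):
        # Each commit writes a new snapshot; older snapshots are never modified.
        current = self.snapshots[-1].files if self.snapshots else ()
        removed_paths = set(removed)
        kept = tuple(f for f in current if f.path not in removed_paths)
        snap = Snapshot(len(self.snapshots) + 1, timestamp, kept + tuple(added))
        self.snapshots.append(snap)
        return snap

    def as_of(self, timestamp):
        # "Time travel": the latest snapshot committed at or before `timestamp`.
        candidates = [s for s in self.snapshots if s.timestamp <= timestamp]
        if not candidates:
            raise ValueError("no snapshot exists at or before that time")
        return candidates[-1]

    def plan_scan(self, day, timestamp=None):
        # File pruning: skip files whose [min_day, max_day] range cannot match.
        if not self.snapshots:
            return []
        snap = self.snapshots[-1] if timestamp is None else self.as_of(timestamp)
        return [f.path for f in snap.files if f.min_day <= day <= f.max_day]


table = ToyTable(schema=("day", "amount"))
table.commit(100, added=[DataFile("a.parquet", 1, 10)])
table.commit(200, added=[DataFile("b.parquet", 11, 20)])
table.commit(300, removed=["a.parquet"])

print(table.plan_scan(day=15))                # ['b.parquet']
print(table.plan_scan(day=5))                 # []
print(table.plan_scan(day=5, timestamp=250))  # ['a.parquet']
```

The last call reads the table as it stood before `a.parquet` was removed, which is only possible because old snapshots are retained rather than overwritten.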

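Why Apache Iceberg?

Choosing which table format to use is an important decision because it can enable or limit the features available. Over the past two years, we have seen significant support emerge for Apache Iceberg, a table format originally developed by Netflix that was open-sourced as an Apache incubator project in 2018 and graduated from the incubator program in 2020.

Iceberg was built to address some of the challenges Apache Hive faces when working with very large data sets, including issues of scale, usability, and performance. As a Netflix engineer noted at the time, table formats for very large-scale data sets should work as reliably and predictably as SQL, "without any unpleasant surprises."

With several options available, we believe Iceberg is superior to the other open table formats. Here are five reasons why.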

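Iceberg makes a clean break from the past

The past can have a major impact on how a table format works today. Some table formats have evolved from older technologies, while others have made a clean break. Iceberg is in the latter camp. It was built from the ground up to address shortcomings in Apache Hive, which means it has avoided some of the undesirable qualities that held data lakes back in the past. How schema changes are handled, such as renaming a column, is a good example.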
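To make the column-rename example concrete, here is a minimal sketch of one way a table format can make a rename a metadata-only change: data files store values under stable field IDs, and the table schema maps each ID to its current name. Iceberg tracks columns by ID in this spirit, but the code below is only an illustration. The dictionaries and the helpers `rename_column` and `read_rows` are hypothetical and are not Iceberg's API.

```python
# Data files store values keyed by a stable field ID, not by column name.
data_file_rows = [
    {1: "u-001", 2: 19.99},
    {1: "u-002", 2: 5.00},
]

# Table metadata maps each field ID to its current column name.
schema = {1: "user_id", 2: "amount"}


def rename_column(schema, old_name, new_name):
    """Return a new schema with one column renamed; no data file is touched."""
    if old_name not in schema.values():
        raise KeyError(old_name)
    if new_name in schema.values():
        raise ValueError(f"column {new_name!r} already exists")
    return {fid: (new_name if name == old_name else name)
            for fid, name in schema.items()}


def read_rows(rows, schema):
    """Resolve field IDs to the current column names at read time."""
    return [{schema[fid]: value for fid, value in row.items()} for row in rows]


new_schema = rename_column(schema, "amount", "total")

print(read_rows(data_file_rows, schema)[0])      # {'user_id': 'u-001', 'amount': 19.99}
print(read_rows(data_file_rows, new_schema)[0])  # {'user_id': 'u-001', 'total': 19.99}
```

Because the files never mention the name `amount`, the rename costs one metadata update and old files remain readable. A format that resolves columns by name or position would have to rewrite data or risk returning the wrong values.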

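Looking ahead, this also means Iceberg does not need to work out how to break further from related tools without causing problems for production data applications. Over time, other table formats will likely catch up, but for now Iceberg is focused on delivering the next set of new features instead of looking back to fix old problems.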

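Iceberg is agnostic to processing engine and file format

By decoupling the processing engine from the table format, Iceberg provides greater flexibility and choice. Instead of being forced to use one processing engine, engineers can pick the best tool for the job. Choice matters for at least two key reasons. First, the engines a company uses to process data can change over time. For example, many businesses moved from Hadoop to Spark or Trino. Second, large organizations commonly use several different technologies, and having a choice lets them use several tools interchangeably.

Iceberg also supports multiple file formats, including Apache Parquet, Apache Avro, and Apache ORC. This provides flexibility today, and it also allows better long-term pluggability for file formats that may emerge in the future.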

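Iceberg is a well-run open source project

The Iceberg project is managed by the Apache Software Foundation, which means it adheres to several important principles of the Apache Way, including earned authority and consensus decision making. This is not necessarily the case for every project that calls itself "open source." Apache Iceberg makes its project management public, so you know who is running the project. Other table formats do not disclose who has decision-making authority. A table format is a fundamental choice in a data architecture, so choosing a project that is truly open and collaborative can significantly reduce the risk of accidental lock-in.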

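Collaboration in Iceberg is spawning new ideas and help

There are several signs that the collaborative community around Apache Iceberg is benefiting users and setting the project up for long-term success. For users, the Slack channel and GitHub repository show high engagement, both around new ideas and around support for existing functionality. Critically, engagement is coming from across the industry, not just from one group or the original authors of Iceberg.

The high degree of collaboration is also benefiting the technology itself. The project is soliciting a growing number of proposals that are diverse in their thinking and solve many different use cases. Additionally, the project is spawning new projects and ideas, such as Project Nessie, the Puffin Spec, and the open Metadata API.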

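Iceberg includes features that are paid in other table formats

Unlike some other table projects, Iceberg has had performance-oriented features built in from the start, which benefits users in a few ways. First, users often assume a project with open code includes performance features, only to discover that they are not included or are only vaguely promised for the future. Second, if you want to move workloads around, which should be easy with a table format, you are much less likely to run into substantial differences between Iceberg implementations. Third, once you start using open source Iceberg, you are unlikely to discover that a feature you need is hidden behind a paywall. The distinction between what is open and what isn't is also not a point-in-time problem.

As an open project from the start, Iceberg exists to solve a practical problem, not a business use case. This is a small but important distinction: vendors with paid products that support Iceberg, such as Snowflake, AWS, Apple, Cloudera, Google Cloud, and more, can compete on how well they implement the Iceberg specification, but the Iceberg project itself is not intended to drive business for a specific company.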

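Snowflake and Iceberg

At Snowflake, we created our own table format early on, which enabled all sorts of new capabilities. But as businesses move to a cloud data platform, their needs and timelines vary. Some companies have regulatory requirements that restrict where data can be stored, or have existing investments they need to protect.

Supporting an external table format like Iceberg allows our customers to leverage all of their data from within Snowflake, even if some of it needs to reside in a different location. That is why we added support for Iceberg as an additional table option within Snowflake earlier this year, and more recently introduced a new type of Snowflake table called Iceberg Tables.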

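Getting started with Apache Iceberg

There are some excellent resources within the Apache Iceberg community for learning more about the project and getting involved in the open source effort.

1. The Iceberg Getting Started guide provides examples of how to get started with purely open source Iceberg and Apache Spark.

2. Iceberg has several robust communities where you can get involved, such as the public Slack channels.

3. If you want to make changes to Iceberg or propose a new idea, create a pull request based on the contribution guide. The community regularly reviews and merges community requests.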

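If you are a Snowflake user, you can get started with our Iceberg private-preview support today. Contact your Snowflake account team to learn more about these features or to sign up.

1. Iceberg Tables: Try out our new table type based entirely on Iceberg and Parquet in external storage, but with the benefits and similar performance of Snowflake tables.

2. External Tables for Iceberg: Connect easily from Snowflake to an existing Iceberg table via a Snowflake External Table.

James Malone is senior manager of product management at Snowflake.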

Original text:

The cloud has allowed data teams to collect vast quantities of data and store it at reasonable cost, opening the door to new analytics use cases that leverage data lakes, data mesh, and other modern architectures. But for very large volumes of data, generic cloud storage also presents challenges and limitations in how that data can be accessed, managed, and used.

Typical blob storage systems in the cloud lack the information required to show relationships between files or how they correspond to a table, making the job of query engines that much harder. Additionally, files by themselves do not make it easy to change schemas of a table, or to “time travel” over it. Each query engine must have its own view of how to query the files. All of a sudden, what seemed like an easy-to-implement data architecture becomes more difficult than expected.

This is where applying table formats to data becomes extremely useful. Table formats explicitly define a table, its metadata, and the files that compose the table. Instead of applying a schema when the data is read, clients already know the schema before the query is run. Moreover, the table metadata can be saved in a way that offers more fine-grained partitioning. Therefore, applying a table format to the data can offer a number of advantages, such as:

· Faster performance due to better filtering or partitioning

· Easier evolution of the schema

· Ability to “time travel” across the table to view data at a given point in time

· Table ACID compliance

Why Apache Iceberg?

Choosing which table format to use is an important decision because it can enable or limit the features available. Over the past two years, we have seen significant support emerging for Apache Iceberg, a table format originally developed by Netflix that was open-sourced as an Apache incubator project in 2018 and graduated from the incubator program in 2020.

Iceberg was built from the ground up to address some of the challenges in Apache Hive when working with very large data sets, including issues around scale, usability, and performance. As a Netflix engineer noted at the time, table formats for very large-scale data sets should work as reliably and predictably as SQL, “without any unpleasant surprises.”

With several options available, we believe Iceberg is superior to other open table formats available. Here are five reasons why.

Iceberg makes a clean break from the past

The past can have a major impact on how a table format works today. Some table formats have evolved from older technologies, while others have made a clean break. Iceberg is in the latter camp. It was built from the ground up to address shortcomings in Apache Hive, which means it has avoided some of the undesirable qualities that held data lakes back in the past. How schema changes can be handled, such as renaming a column, is a good example.

Looking ahead, this also means Iceberg does not need to rationalize how to further break from related tools without causing issues with production data applications. Over time, other table formats will likely catch up, but as of now, Iceberg is focused on delivering the next set of new features, instead of looking back to fix old problems.

Iceberg is agnostic to processing engine and file format

By decoupling the processing engine from the table format, Iceberg provides greater flexibility and choice. Instead of being forced to use one processing engine, engineers can pick the best tool for the job. Choice is important for at least two key reasons. First, the engines a company uses to process data can change over time. For example, many businesses moved from Hadoop to Spark or Trino. Second, it’s common for large organizations to use several different technologies, and having choice enables them to use several tools interchangeably.

Iceberg also supports multiple file formats, including Apache Parquet, Apache Avro, and Apache ORC. This provides flexibility today, but also enables better long-term pluggability for file formats that may emerge in the future.

Iceberg is a well-run open source project

The Iceberg project is managed by the Apache Software Foundation, which means it adheres to several important Apache Ways, including earned authority and consensus decision making. This is not necessarily the case for every project calling itself “open source.” Apache Iceberg makes its project management public, so you know who is running the project. Other table formats do not disclose who has decision-making authority. A table format is a fundamental choice in a data architecture, so choosing a project that is truly open and collaborative can significantly reduce risks of accidental lock-in.

Collaboration in Iceberg is spawning new ideas and help

There are several signs that the collaborative community around Apache Iceberg is benefiting users and setting the project up for long-term success. For users, the Slack channel and GitHub repository show high engagement, both around new ideas and support for existing functionality. Critically, engagement is coming from across the industry, not just one group or the original authors of Iceberg.

The high degree of collaboration is also benefiting the technology itself. The project is soliciting a growing number of proposals that are diverse in their thinking and solve many different use cases. Additionally, the project is spawning new projects and ideas, such as Project Nessie, the Puffin Spec, and the open Metadata API.

Iceberg includes features that are paid in other table formats

Unlike some other table projects, Iceberg has performance-oriented features built in from the start, which is beneficial for users in a few ways. First, users often assume a project with open code includes performance features, only to discover they are not included or vaguely promised in the future. Second, if you want to move workloads around, which should be easy with a table format, you’re much less likely to run into substantial differences in Iceberg implementations. Third, once you start using open source Iceberg, you’re unlikely to discover that a feature you need is hidden behind a paywall. The distinction between what is open and what isn’t is also not a point-in-time problem.

As an open project from the start, Iceberg exists to solve a practical problem, not a business use case. This is a small but important distinction: Vendors with paid products who provide support for Iceberg, such as Snowflake, AWS, Apple, Cloudera, Google Cloud, and more, can compete in how well they implement the Iceberg specification, but the Iceberg project itself is not intended to drive business for a specific company.

Snowflake and Iceberg

At Snowflake, we created our own table format early on, which enabled all sorts of new capabilities. But as businesses move to a cloud data platform, their needs and timelines vary. Some companies have regulatory requirements that restrict where data can be stored, or have existing investments they need to protect.

Supporting an external table format like Iceberg allows our customers to leverage all of their data from within Snowflake, even if some of it needs to reside in a different location. That’s why we added support for Iceberg as an additional table option within Snowflake earlier this year, and more recently introduced a new type of Snowflake table called Iceberg Tables.

Getting Started with Apache Iceberg

There are some excellent resources within the Apache Iceberg community to learn more about the project and to get involved in the open source effort.

· The Iceberg Getting Started guide provides examples of how to get started in purely open source Iceberg and Apache Spark.

· Iceberg has several robust communities where you can get involved, such as the public Slack channels.

· If you want to make changes to Iceberg or propose a new idea, create a pull request based on the contribution guide. The community regularly reviews and merges community requests.

If you’re a Snowflake user, you can get started with our Iceberg private-preview support today. Contact your Snowflake account team to learn more about these features or to sign up.

· Iceberg Tables: Try out our new table type based entirely on Iceberg and Parquet in external storage, but with the benefits and similar performance of Snowflake tables.

· External Tables for Iceberg: Enable easy connection from Snowflake with an existing Iceberg table via a Snowflake External Table.

James Malone is senior manager of product management at Snowflake.

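The main content of this article is reposted; the original author is James Malone. It is provided for readers' reference only. If it infringes your intellectual property or other rights, please contact me with evidence and I will remove it.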


Page updated: 2024-03-22
