Destination Lakehouse
Focus on the Lakehouse as a next-generation platform, merging the data lake and the data warehouse.
Simplify your stack with one platform for data and AI instead of a separate data warehouse for SQL and dashboarding.
Benchmarks cover both data warehousing and lakehouse workloads.
Beyond benchmarks and costs, the lakehouse lets companies unlock advanced use cases that a traditional data warehouse cannot support.
Data Engineering
Delta Lake 2.0
All of Delta Lake is open source as part of the Delta Lake 2.0 release, which includes the following features:
- Support for the Change Data Feed on Delta tables
- Support for Z-Order clustering
- Support for idempotent writes to Delta tables
- Support for dropping columns in a Delta table as a metadata-only operation
- Support for dynamic partition overwrite
- Experimental support for multi-part checkpoints
For more information about Delta Lake 2.0 feel free to read the Databricks Blog post or the Linux Foundation Blog post
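A minimal sketch of a few of these features in PySpark, assuming an existing Delta table named events with user_id and legacy_flag columns (all names are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-2-demo").getOrCreate()

    # Turn on the Change Data Feed so row-level changes are recorded.
    spark.sql("ALTER TABLE events SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

    # Read the changes captured since table version 1.
    changes = (spark.read.format("delta")
               .option("readChangeFeed", "true")
               .option("startingVersion", 1)
               .table("events"))

    # Z-Order clustering co-locates related rows to speed up selective queries.
    spark.sql("OPTIMIZE events ZORDER BY (user_id)")

    # Dropping a column is a metadata-only change (requires column mapping mode 'name').
    spark.sql("ALTER TABLE events DROP COLUMN legacy_flag")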
Project Lightspeed
Project Lightspeed aims to improve Structured Streaming latency and make it predictable, enhance functionality for processing data, and improve ecosystem support with new connectors.
For more information about Project Lightspeed feel free to read the Blog post
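For reference, a minimal Structured Streaming job of the kind Project Lightspeed targets; the rate source and console sink here are stand-ins for real connectors:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lightspeed-demo").getOrCreate()

    # A synthetic source that emits 10 rows per second.
    stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

    # End-to-end latency is bounded by the micro-batch trigger interval.
    query = (stream.writeStream
             .format("console")
             .trigger(processingTime="1 second")
             .start())
    query.awaitTermination()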
Spark Connect
Spark Connect brings Apache Spark™ whenever and wherever it is needed by decoupling the client and the server, so Spark can be embedded everywhere: application servers, IDEs, notebooks, and any programming language.
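A sketch of what a Spark Connect client session looks like in PySpark; the endpoint is illustrative, and the remote API shown here shipped in later Spark releases:

    from pyspark.sql import SparkSession

    # The thin client speaks gRPC to a remote Spark Connect server
    # instead of embedding a JVM in the application process.
    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

    df = spark.range(10).filter("id % 2 = 0")
    df.show()  # the plan runs on the server; only results come back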
Photon
Photon is the next-generation query engine on Databricks, written to be directly compatible with Apache Spark APIs. It will be generally available on Databricks workspaces on AWS and Azure in the coming weeks, further expanding Photon’s reach across the platform.
For more information about Photon feel free to read the Blog post
Delta Live Tables
ETL framework that uses a simple declarative approach to building reliable data pipelines and automatically managing infrastructure at scale.
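A minimal sketch of the declarative style, assuming it runs inside a DLT pipeline (where spark is provided by the runtime) and reads an illustrative JSON path:

    import dlt
    from pyspark.sql.functions import col

    @dlt.table(comment="Raw events ingested incrementally from cloud storage")
    def raw_events():
        return (spark.readStream.format("cloudFiles")   # Auto Loader
                .option("cloudFiles.format", "json")
                .load("/data/events/"))                 # illustrative path

    @dlt.table(comment="Validated events")
    @dlt.expect_or_drop("valid_id", "id IS NOT NULL")   # declarative quality rule
    def clean_events():
        return dlt.read_stream("raw_events").select(col("id"), col("ts"))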
What’s new ?
- Advanced Autoscaling (Link)
- CDC with Slowly Changing Dimensions Type 2 (Link)
- Enzyme ETL optimizer: a new optimization layer designed to speed up ETL processing
For more information about Delta Live Tables feel free to read the Blog post
Databricks workflows
Fully managed orchestration service that is deeply integrated with the Databricks Lakehouse Platform. It enables engineers to build reliable data, analytics, and ML workflows on any cloud without needing to manage complex infrastructure.
What’s new ?
- Build reliable production data and ML pipelines with Git support
- Run dbt projects in production
- Orchestration of SQL tasks
- Save time and money with repair and rerun
- Easily share context between tasks (see the sketch below)
For more information about Databricks Workflows feel free to read the Blog post
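A sketch of sharing context between tasks via task values; dbutils is available only on Databricks, and the task and key names here are illustrative:

    # In an upstream task (task key "ingest"), publish a value for later tasks.
    dbutils.jobs.taskValues.set(key="row_count", value=42)

    # In a downstream task, read the value published by the "ingest" task.
    rows = dbutils.jobs.taskValues.get(taskKey="ingest", key="row_count", default=0)
    print(f"Upstream ingested {rows} rows")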
Data Sharing & Data Governance
Unity Catalog
Unity Catalog will be generally available soon. It’s a unified governance solution for all data assets, including files, tables, dashboards, and machine learning models, in your lakehouse on any cloud.
What’s new ?
- Automated data lineage for all workloads
- Built-in data search and discovery
- Simplified access controls with privilege inheritance (see the sketch below)
- Information schema
For more information about Unity Catalog feel free to read the Blog post
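A sketch of privilege inheritance and the information schema, run where spark is predefined; the catalog, schema, and group names are illustrative, and the privilege names follow the initial Unity Catalog release:

    # A grant on a catalog is inherited by every schema and table below it.
    spark.sql("GRANT USAGE ON CATALOG main TO `data-analysts`")
    spark.sql("GRANT SELECT ON SCHEMA main.sales TO `data-analysts`")

    # The information schema exposes metadata as ordinary queryable tables.
    spark.sql("""
        SELECT table_catalog, table_schema, table_name
        FROM main.information_schema.tables
    """).show()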
Delta Sharing
Delta Sharing will be generally available soon. It’s an open protocol for secure, real-time exchange of large datasets, which enables secure data sharing across products for the first time. We’re developing Delta Sharing with partners among the top software and data providers in the world.
For more information about Delta Sharing feel free to read the Blog post
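A sketch using the open source delta-sharing Python connector (pip install delta-sharing); the profile file and the share, schema, and table names are illustrative:

    import delta_sharing

    # A profile file holds the sharing server endpoint and a bearer token
    # that the data provider hands to the recipient.
    profile = "config.share"

    # Discover what the provider has shared with us.
    client = delta_sharing.SharingClient(profile)
    print(client.list_all_tables())

    # Load a shared table into pandas: <profile>#<share>.<schema>.<table>
    df = delta_sharing.load_as_pandas(f"{profile}#retail.sales.orders")
    print(df.head())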
Databricks Marketplace
Open marketplace for exchanging data products such as datasets, notebooks, dashboards, and ML models. Providers can now commercialize new offerings and shorten sales cycles by providing value-added services on top of their data. The marketplace is powered by Delta Sharing.
For more information about Databricks Marketplace feel free to read the Blog post
Databricks Cleanrooms
Databricks Cleanrooms provides a secure, hosted environment in which organizations can join their data and perform analyses on the aggregated data. It allows organizations to meet collaborators on their preferred cloud and gives them the flexibility to run complex computations and workloads in any language: SQL, R, Scala, or Python.
For more information about Databricks Cleanrooms feel free to read the Blog post
Best Data Warehouse is a Lakehouse
Databricks SQL
Databricks SQL is a data warehouse on the Databricks Lakehouse Platform that lets you run all your SQL and BI applications at scale with up to 12x better price/performance, a unified governance model, open formats and APIs, and your tools of choice, with no lock-in.
What’s new ?
- Connect from everywhere: open source connectors for Go, Node.js, and Python to access the lakehouse from operational applications (see the sketch below)
- Python user-defined functions
- Materialized views: support for incrementally computed MVs in Databricks SQL to accelerate end-user queries and reduce infrastructure costs through efficient, incremental computation
- Query federation: the ability to query remote data sources such as PostgreSQL, MySQL, and Amazon Redshift without first extracting and loading the data from the source systems
- Preview of Serverless on AWS: instant, elastic serverless SQL compute for low latency and high concurrency
For more information about Connect from everywhere feel free to read the Blog post
For more information about Serverless on AWS feel free to read the Blog post
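A sketch using the open source Databricks SQL Connector for Python (pip install databricks-sql-connector); the hostname, HTTP path, and token are placeholders:

    from databricks import sql

    # Connection details come from the SQL warehouse's "Connection details" tab.
    with sql.connect(server_hostname="<workspace-host>",
                     http_path="<sql-warehouse-http-path>",
                     access_token="<personal-access-token>") as connection:
        with connection.cursor() as cursor:
            cursor.execute("SELECT current_date() AS today")
            print(cursor.fetchall())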
Machine Learning
MLflow 2.0
Open source platform developed by Databricks to help manage the complete machine learning lifecycle.
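A minimal MLflow tracking sketch; the parameter and metric values are illustrative:

    import mlflow

    with mlflow.start_run(run_name="demo"):
        mlflow.log_param("alpha", 0.5)   # record a hyperparameter
        mlflow.log_metric("rmse", 0.87)  # record an evaluation metric
    # The run, its parameters, and its metrics are now browsable in the MLflow UI.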
What’s new ?
- Serverless Model Endpoints: improve upon existing Databricks-hosted model serving by offering horizontal scaling to thousands of QPS, potential cost savings through auto-scaling, and operational metrics for monitoring runtime performance.
- Model Monitoring: enables users to understand whether a deployed model works well in production. The proposed solution sets up a framework for logging model inputs and predictions, then analyzing model and data quality trends over time.
- MLflow Pipelines: enable data scientists to create production-grade ML pipelines that combine modular ML code with software engineering best practices to make model development and deployment fast and scalable.
For more information about MLflow Pipelines feel free to read the Blog post
For more information about ML announcements feel free to read the Blog post
On demand content
Did you miss a session? It’s OK: recordings from the 250 sessions are available on demand on the Data+AI Summit platform through July 15.
Article written by Youssef Mrini