Dataflow: Sharing ETL Processes in Cloud

by | Jul 24, 2020 | Articles

With many companies beginning to employ self-service BI, the question of shared work comes to the forefront. Last year, Power BI introduced dataflows that aim to improve collaboration.

Power BI Desktop is a great tool when there is one BI analyst working on one PBIX file. However, multiple users cannot work on one PBIX file at the same time.

Furthermore, an analyst working independently in Power Query may bypass a company’s ETL processes. Other users cannot see ETL changes made by one analyst. As a result, users have to manually repeat these processes during which potential mistakes may happen.

Dataflows comes as solution

A great solution for these kinds of challenges is the use of Dataflows in Power BI.

Dataflow is basically a Power Query process that runs in the cloud, independently from Power BI reports and datasets. Essentially, “Dataflow is a self-service data warehouse for Power BI,” said Reza Rad, a consultant at RADACAD. Reza, who’s a Microsoft MVP, was giving a presentation at the Community Summit Europe.

Using Dataflow one can create ETL processes in a cloud and other users can see them right away. This immediately solves the problem of having multiple versions of same tables and datasets in different PBIX files.

Power Query loads queries into a Power BI model. Meanwhile, a dataflow, having no datasets attached to its processes, stores the outputs in the Microsoft Azure Data Lake. Now, this could be a problem for smaller companies that have no Azure account. However, there is still a free option available to all Power BI Pro users. They can store up to 10 GB on the Power BI website, using it as the internal data lake.

Practical use cases

Below are a couple of Dataflow use cases we’d like to share with you. The information was inspired from Reza Rad’s presentation earlier this month.

1.      Using one Power Query table in many Power BI reports.

Analysts would often use one table in multiple reports. For example, tables with dates, products or customers can be used in many Power BI reports. And often, we would replicate these tables every time when building a new report. This isn’t only a very manual process, but could also create issues in the future. What if you add a new column to the table, then you’d have to replicate that change in all files that you have created.

With dataflow all that changes. An analyst creates one table using a dataflow, makes it available in the cloud or shared workspace. Then everyone can directly access the table and use it in all future reports. All changes done in the original file will then automatically take place in other reports, because the source of the file is one.

2.      Setting different schedule of refresh for different queries

Often, there are situations when different tables in a report need different schedule refresh times. For example, you might want to refresh a sales transactions table every day, as purchases are being made constantly. Meanwhile, a product table can be refreshed once a month. Now if you have these queries in one file, then the only option is to refresh both queries at the same time, which is every day. This means the product table is being refreshed pointlessly, for no real purpose, but to take up CPU.

In this case, Dataflow can be do amazing things. Analysts can create separate dataflows and schedule their refresh times individually. This will make the queries update based on your needs.

Share This