
Dataflow: Sharing ETL Processes in the Cloud

Jul 24, 2020 | Articles

With many companies beginning to employ self-service BI, the issue of multiple users working on building reports, often at the same time, comes to the forefront.

Power BI Desktop is a great tool when there is a single BI analyst or developer who does everything he or she needs in one PBIX file; however, multiple users cannot work on one file at the same time.

Furthermore, an analyst working independently in Power Query may bypass a company’s ETL processes. Other users cannot see the ETL changes made by that analyst; at best, they have to repeat those changes manually, and at worst, mistakes are bound to happen.

Dataflow comes as a solution

A great solution for these kinds of challenges is the use of Dataflow in Power BI.

Dataflow is basically the Power Query process that runs in the cloud, independently from Power BI reports and datasets. Essentially, “Dataflow is a self-service data warehouse for Power BI,” said Reza Rad, a Microsoft MVP and consultant at RADACAD, during his recent presentation at the Community Summit Europe.
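To make that concrete, here is a minimal sketch of what a dataflow entity can look like: an ordinary Power Query (M) query, except that it runs and stores its output in the cloud. The server, database, table, and column names below are hypothetical placeholders.

    // A dataflow entity is just a Power Query (M) query executed in the cloud.
    // "sales-srv" and "SalesDB" are hypothetical placeholders.
    let
        Source = Sql.Database("sales-srv", "SalesDB"),
        Products = Source{[Schema = "dbo", Item = "Products"]}[Data],
        // Typical ETL steps: keep only the needed columns and fix data types
        Selected = Table.SelectColumns(Products, {"ProductID", "ProductName", "Category"}),
        Typed = Table.TransformColumnTypes(Selected, {{"ProductID", Int64.Type}})
    in
        Typed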

Using Dataflow, one can create ETL processes in the cloud, and other users can see them right away. This immediately solves the problem of having multiple versions of the same tables and datasets scattered across different PBIX files.

While Power Query loads queries into a Power BI model, Dataflow has no dataset attached to its processes and instead stores its outputs in Microsoft Azure Data Lake cloud storage. This could be a problem for smaller companies that have no Azure account; however, there is still an option available to all Power BI Pro users, who can store up to 10 GB in the Power BI service, using it as an internal data lake.

Practical use cases

Below are a couple of Dataflow use cases we’d like to share with you, inspired by Reza Rad’s presentation earlier this month.

1. Using one Power Query table in many Power BI reports

Analysts often use one table in multiple reports. For example, tables with dates, products, or customers can appear in many Power BI reports, and we would typically replicate these tables every time we build a new report. Not only is this a very manual process, it also creates issues in the future: if a new column is added to that table, you have to replicate that change in every file you have created.

With Dataflow, all of that changes. An analyst creates the table once in a Dataflow and makes it available in a shared workspace in the cloud; everyone can then access the table directly and use it in all future reports. Any change made to the original table automatically flows through to the other reports, because there is a single source, as sketched below.
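As a rough sketch, this is how a report author could pull that shared table into Power BI Desktop with the built-in Dataflows connector. The workspace, dataflow, and entity names here are hypothetical, and the code Power BI Desktop generates for you typically navigates by IDs rather than names.

    // Names below are placeholders; Desktop-generated code usually navigates by IDs.
    let
        Source = PowerBI.Dataflows(null),
        Workspace = Source{[workspaceName = "Shared ETL"]}[Data],
        Dataflow = Workspace{[dataflowName = "Common Tables"]}[Data],
        DateTable = Dataflow{[entity = "Date"]}[Data]
    in
        DateTable

Every report built this way points at the same cloud-hosted table, so there is nothing to keep in sync by hand.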

2. Setting different refresh schedules for different queries

Often, different tables in a report need different refresh schedules. For example, you might want to refresh a sales transactions table every day, since purchases are made constantly, while a product table only needs to be refreshed once a month. If these two queries live in one file, the only option is to refresh both at the same time, which means every day. The product table is then refreshed pointlessly, doing nothing but consuming capacity.

In this case, Dataflow can do amazing things. Analysts can create separate Dataflows and schedule their refresh times individually, so each query updates based on its own needs. A rough sketch of that split follows.
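A minimal sketch, assuming the same hypothetical source as above: the two queries simply become entities in two separate Dataflows, and each Dataflow gets its own scheduled refresh in the Power BI service.

    // Entity in a "Sales" dataflow, scheduled in the service to refresh daily:
    let
        Source = Sql.Database("sales-srv", "SalesDB")
    in
        Source{[Schema = "dbo", Item = "SalesTransactions"]}[Data]

    // Entity in a "Products" dataflow, scheduled to refresh monthly:
    let
        Source = Sql.Database("sales-srv", "SalesDB")
    in
        Source{[Schema = "dbo", Item = "Products"]}[Data]

Because the schedules belong to the Dataflows rather than to the report, the daily job no longer drags the product table along with it.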
