I use Power BI to visualise the team I work with and to see where we have been spending our time. Recently, I was trying to see if I could visualise what people would be busy with over the next few months and found it a bit challenging since the source data becomes multi-dimensional. I used a small Ruby script to bail me out of the immediate need that I had but I wanted to look for a method that worked within Power BI itself (using PowerQuery and/ or DAX) so that the solution was self-contained.
Disclaimer: I am still relatively new to Power BI though I do enjoy using it for visualisations, and presenting data for consumption in meetings, discussions, etc. So, please use the content in this post with caution. Also, if you know better, please post in the comments below so that I can learn more :-) and can update the post.
For this walkthrough, I will use Fruits and Vegetables to explain the problem and the solution.
Let’s say we have the data as below:
- A list of records, organised into rows
- For each row, our schema is:
- Name – of the individual
- Fruits – a list of fruit the individual likes
- Vegetables – a list of vegetables the individual likes
Assuming that this table is imported as a source into the Power BI project, what I want to be able to do is this:
- Show each person and the list of the of fruit and vegetables they like: this is simple – just put in a table and add the Name, Fruits and Vegetables as values to be shown. In fact, the picture above was created exactly that way.
- See how many unique fruits and vegetables were mentioned in the list
- See which people like a specific vegetable or fruit
In my real case:
- the column [Fruits] represented [2020-Jun] and the values were a ‘/’ separated list of things that they might work on in May 2020
- likewise, the column [Vegetables] represented [2020-Jul] and the values were a ‘/’ separated list of things that they might work on in July 2020
As you can imagine, all of these are very easy to do if we have a table that looks like the picture below – we can easily create graphs and cards that give us the information we need.
So, that’s our problem statement: How do we create a new table that has this structure? It means doing the following:
- Create a new table
- Create a new record for each column (Fruits and Vegetables) with the name and category attached to it
- Further, if the value in that column (Fruits/ Vegetables) has multiple entries in it, split each entry into multiple records (with name and each individual value)
- Combine all the values together into the final table
I was a bit daunted by what needed to be done and did what I do best – I wrote a Ruby script to quickly create this new table and added it as a data source to my project. This allowed me to get my meeting done! But, running something externally is not something I want to do forever – so, I wanted to come back and see how to do this in Power BI. Numerous web searches later, I am able to do this but there are a few steps, so they are listed here for future reference.
Creating a new table
Power BI has a number of ways to create calculated tables. Here is what you need to do to create a new table:
- Go to the data view
- Click on Table Tools
- Click on New Table
- Now, you need to enter the DAX for this which will something like: @My New Table Name = …. @
The challenge is to know what to enter here – so, we will come back to it.
Creating a new row for each column
I decided to approach the problem in this sequence:
- Create a new row for each column with the full list of items for that column in it
- Then, worry about splitting each column based on the list inside it
Essentially, I wanted to get this table first and worry about splitting individual columns later.
The approach we want to take is this:
- Conceptually have a 3 column structure: Name, Type and Product
- First, create records for this using the data from the input table [Name] and [Fruits] (as product) and a type field with value “Fruits”
- Second, create records for this using the data from the input table [Name] and [Vegetables] (as product) and a type field with value “Vegetables”
- Join them together to get the table
The good thing is that we have easy ways to achieve this in Power BI when creating Calculated Tables. We want:
- A function that selects rows based on data from columns of a table (
SELECTCOLUMNS
) - An operators that allows records to be stuck together (
UNION
)
The first item we use is SELECTCOLUMNS
which selects columns from an existing table with the following syntax:
- Table Name
- Column Descriptions:
- Name of the column (in the target table)
- Value to put in – this can be a column name in the source table, or a combination using other functions, or a constant value – in fact, any function that can return a value works here.
For our purpose, we use this manner for the Fruits:
SELECTCOLUMNS(Input,"Name",[Name],"Type", "Fruits","Element",[Fruits]),
which comes from:
- Source Table Name: input
- Description of first column to bring:
- Column Name in new table: “Name”
- Data comes from: Input Table → column [Name]
- Description of second column to bring:
- Column Name in new table: “Type”
- Data comes from: constant value “Fruits”
- Description of third column to bring:
- Column Name in new table: “Element”
- Data comes from: Input Table → column [Fruits]
As you can see, we will get back a number of records that have 3 columns each. We do the same for the [Vegetables] field with:
SELECTCOLUMNS(Input,"Name",[Name],"Type", "Vegetables","Element",[vegetables])
Now, we have 2 collections:
- A set of 3 column records for Fruits with type set to “Fruits”
- A set of 3 column records for Vegetables with type set to “Vegetables”
All we need to do is combine these using UNION which is quicker to understand:
- You just use
UNION(...)
- The parameters are any number of DAX functions that return one or more columns – you can have as many of these separated by comma. In our case, this is just a number of SELECTCOLUMNS statements
Putting this all together, we get:
TBL_Step1 =
UNION(
SELECTCOLUMNS(Input,"Name",[Name],"Type", "Fruits","Element",[Fruits]),
SELECTCOLUMNS(Input,"Name",[Name],"Type", "Vegetables","Element",[vegetables])
)
We are now able to get the intermediate table that we need. All we need to do is split the data in the columns and add records for them.
Splitting Data in the Columns
Unfortunately, there isn’t a single simple split columns function that can be called right now. From what I see, we need to write a piece of code that looks a bit more complicated. So, let’s put it out there first – this is what works!
TBL_Step2 =
VAR myvalues =
ADDCOLUMNS ( TBL_Step1, "ElementPaths", SUBSTITUTE ( [Element], "/", "|" ) )
RETURN
SELECTCOLUMNS (
GENERATE (
myvalues,
ADDCOLUMNS (
GENERATESERIES ( 1, PATHLENGTH ( [ElementPaths] ) ),
"MyElements",
PATHITEM ( [ElementPaths], [Value], TEXT )
)
),
"Name", [Name],
"Category", [Type],
"Product", [MyElements]
)
The easy bit – SELECTCOLUMNS
So, you see the familiar SELECTCOLUMNS function on Line 5 and you remember that it takes as parameters:
- Table Name/ Reference
- Column Descriptions in 2 values each: {New Table Column, Source Data}
So, the last part is simple to read:
- Take field Name from column [Name]
- Take field Category from column [Type]
- Take field Product from column [MyElements]
The two new things are:
- How is the source table is created based on the
TBL_Step1
? - How do we get the column
[MyElements]
which seems to hold an individual fruit/ vegetable value?
New DAX Functions
We have a bunch of new DAX functions that we need to look at. Each one is linked to the DAX document but I try to explain briefly what it does and especially what it does for us in our code:
- SUBSTITUTE – substitutes text in a string with another text. We use it to replace all occurences of ‘/’ in our data with ‘|’ in the string stored in the [Elements] column
- ADDCOLUMNS – adds a calculated column to a table and then returns a table with the new columns added to it. We use it to get back the table
TBL_Step1
with a column added to it in which the ‘/’ in the [Element] column is replaced with a ‘|’. We do this because some of the future functions rely on that. - PATHLENGTH – returns the number of parents to the specified item in a given PATH result, including self. For us, this helps to count the number of items that are separated by ‘|’
- PATHITEM – returns the value of the element at the position provided. In our case, the path is a set of ‘|’ separated values, so an item at position 1 means the first item, position 2 means second item, etc. So, if the path is “Apple|Orange”, PATHITEM at position 1 is ‘Apple’ and at position 2 is ‘Orange’.
- GENERATE – returns a table with the Cartesian product between each row in table1 and the table that results from evaluating table2 in the context of the current row from table1. This is a bit more difficult to explain, but let me try. There is an example on the linked page but for us, what we need to know is that GENERATE goes through 2 tables (table1 and table2) and evaluate the table2 based on the data in table1. We will look at this more closely below.
- GENERATESERIES – this returns a single column table (representing the values of an arithmetic series, that is, a sequence of values in which each differs from the preceding by a constant quantity). The name of the column returned is
Value
. In our case, it returns a table that has a column called [Value] and in subsequent rows, this is 1, 2, 3, … N (depending on the value ofPATHLENGTH
). In our case,PATHLENGTH
will return the number of components separated by ‘|’, soGENERATESERIES
basically gives us a table that contains values from 1, 2, … up to the number of elements in our [Fruits] or [Vegetables] column in each record that it handles.
Trying to put it together
Now, let’s try to take stock of what we know from the explanations above:
First, we create a variable call myvalues
and add to it the whole input table + an extra column that converts the [Element] column to be ‘|’ separated rather than ‘/’ separated. So, we make the following transformations:
- Name,Fruits,Orange/ Watermelon → Ally,Fruits,Orange/ Watermelon,Orange| Watermelon
- Ally,Fruits,Orange/ Watermelon → Ally,Fruits,Orange/ Watermelon,Orange| Watermelon
- David,Vegetables,Peas → David,Vegetables,Peas,Peas (no change in the new column)
So, myvalues
holds a table that looks like this:
Let’s look at this part carefully:
GENERATE (
myvalues,
ADDCOLUMNS (
GENERATESERIES ( 1, PATHLENGTH ( [ElementPaths] ) ),
"MyElements",
PATHITEM ( [ElementPaths], [Value], TEXT )
)
)
Remember that myvalues
is a table that has the extra column ElementPaths
which are ‘|’ separated. Let’s use one value of ElementPaths
for our example now – “Ally,Orange| Watermelon”. This is what happens on a single row level:
- When we do
PATHLENGTH
on this value, we get the value 2 (since there are 2 items – Orange and Watermelon) - When we do GENERATESERIES(1, 2), we get a table that has 2 rows and 1 column
- Value
- 1
- 2
- Let’s look at what
PATHITEM ([ElementPaths], some_number, TEXT)
does – it returns the TEXT at the position given by some_number. So, for our case above:- if some_number is 1, we get Orange
- if some_number is 2, we get Watermelon
- When we try to do
ADDCOLUMNS([{1},{2}], "MyElements", PATHITEM([ElementPaths], [Value], TEXT))
– we get a new column called MyElements which is created from callingPATHITEM
with the [Value] in the table created byGENERATESERIES
… in essence, we callPATHITEM
with 1, then 2 to get the values for the column [MyElements]. We get Orange, then Watermelon. This is added as an extra column called[MyElements]
to the temporary table represented by theGENERATESERIES
:- Value, MyElements
- 1,Orange
- 2, Watermelon
- Finally, we use GENERATE which takes
myvalues
as table1 and the table fromADDCOLUMNS
as table2 and executes for each row in table1, the function for table2. So, it takes the record we hadAlly,Fruits,Orange/ Watermelon,Orange| Watermelon
and runs it against the 2 row table we just created [{1,Orange}, {2, Watermelon}] to make multiple records out of it (Cartesian Product). We get:- Name,Type,Elements,ElementPaths,Value,MyElements (Items in bold come from ADDCOLUMNS)
- Ally,Fruits,Orange/ Watermelon,Orange| Watermelon,1,Orange
- Ally,Fruits,Orange/ Watermelon,Orange| Watermelon,2, Watermelon
- Then, we pick from this [Name],[Type] as Category,[MyElements] as Product:
- Name,Category,Product
- Ally,Fruits,Orange
- Ally,Fruits, Watermelon
Of course, Power BI does this for all the rows and the values and we get the table that we want!
Sorry that’s a lot of words but there is no easy way (for me) to explain it – hopefully, the details help you understand.
Plugging it into Power BI
To get this whole thing working, you now need to do this:
- Add a new table (as explained above) and paste the code for TBL_Step1 into it – that will create the first table
- Then, again add a new table and past the code for TBL_Step2 which will produce the final table that we want
Now that you have this nice new table, you can use this in your visualisations and can add items that use this.
We can see that the new table has 21 records with 9 for vegetables and 12 for fruits. Clicking on items in the donuts or on any item in the slicers will automatically filter the list. Almost perfect!
“Almost” Perfect!? One Last thing [1 of 2]
Did you notice that vegetables like Spinach or Fruits like Orange appear twice? Why does it do that, and how do we fix it?
This is down to the way that the splitting of the terms happens using PATHITEM
– when we pass it “Orange| Watermelon” we get back 2 terms – “Orange” and " Watermelon" (with a leading space). In the same way, if it was “Apple| Orange”, we would get back “Apple” and " Orange". Since “Orange” and " Orange" are not the same, we get 2 entries that look the same but group the records separately. The solution is simple – we need to remove that space using TRIM
:https://docs.microsoft.com/en-us/dax/trim-function-dax which removes leading, trailing and other extra spaces.
That only leaves where we add it – we want to add it to the point where we are bringing that term into the table. It’s probably easiest to do this as soon as we start to get the individual text items. That is done by PATHITEM ( [ElementPaths], [Value], TEXT )
– we can change this line to TRIM (PATHITEM ( [ElementPaths], [Value], TEXT ))
so that it trims the extra spaces from the text that it returns.
For reference, then, our second table is now defined as:
TBL_Step2 =
VAR myvalues =
ADDCOLUMNS ( TBL_Step1, "ElementPaths", SUBSTITUTE ( [Element], "/", "|" ) )
RETURN
SELECTCOLUMNS (
GENERATE (
myvalues,
ADDCOLUMNS (
GENERATESERIES ( 1, PATHLENGTH ( [ElementPaths] ) ),
"MyElements",
TRIM (PATHITEM ( [ElementPaths], [Value], TEXT ))
)
),
"Name", [Name],
"Category", [Type],
"Product", [MyElements]
)
With all this in place, it works fine, and you see the updated picture below.
Selecting a slicer also quickly filters everything.
You could use the same method if you had tagged your data (#weekend, #sports, etc.) and adjust the methods in this post to make it work for you.
“Almost” Perfect!? One Last thing [2 of 2]
Updated on 3 Jun 2020
After I published this post, I tried to use this on a dataset that I have, and I discovered that there is still another problem that was not apparent when using the sample data above. The creation of the second table fails if any column is blank. We didn’t see it because our fruit and vegetables data had data in every column but my other real dataset did not.
The problem is with this line when the column is blank:
GENERATESERIES ( 1, PATHLENGTH ( [ElementPaths] ) ),
The PATHLENGTH ( [ElementPaths] )
does not work if ElementPaths
is blank, and crashes out. You need to change it to the line below where the IF condition returns PATHLENGTH ( [ElementPaths] )
if it is more than 0 or it returns just 1 which lets the code run correctly:
GENERATESERIES ( 1, IF(PATHLENGTH ( [ElementPaths]) > 0,PATHLENGTH ( [ElementPaths]), 1) ),
So, with that changed, the final query for the second table becomes as below and all should be good!
TBL_Step2 =
VAR myvalues =
ADDCOLUMNS ( TBL_Step1, "ElementPaths", SUBSTITUTE ( [Element], "/", "|" ) )
RETURN
SELECTCOLUMNS (
GENERATE (
myvalues,
ADDCOLUMNS (
GENERATESERIES ( 1, IF(PATHLENGTH ( [ElementPaths]) > 0,PATHLENGTH ( [ElementPaths]), 1) ),
"MyElements",
TRIM (PATHITEM ( [ElementPaths], [Value], TEXT ))
)
),
"Name", [Name],
"Category", [Type],
"Product", [MyElements]
)
What I don’t know how to do
As illustrated above, right now, I need to know all the columns that will come up in the data since the UNION relies on that; also, the Step 1 Table Creation uses SELECTCOLUMNS which also identifies the columns by name. So, if someone added a new column for ice-creams or pasta or something, the process above would not pick it up. Admittedly, you would have to add a single extra SELECTCOLUMNS line and add it to the UNION (which is not a lot of work) but it is something that bothers me a bit. If I get the chance to find out how to make this generic, I will add to this post.
References
For Step 1 Table Creation, take a look at:
- https://docs.microsoft.com/en-us/dax/selectcolumns-function-dax
- https://community.powerbi.com/t5/Desktop/Create-a-new-table-from-columns-from-existing-tables/td-p/98338
For Step 2 Column splitting, my main reference was:
- http://www.excelnaccess.com/text-split-using-dax/
Disclaimer (reminder): I am still relatively new to Power BI though I do enjoy using it for visualisations, and presenting data for consumption in meetings, discussions, etc. So, please use the content in this post with caution. Also, if you know better, please post in the comments below so that I can learn more :-) and can update the post.