Cleaning and Transforming Data in Power BI
Once you have imported data into Power BI is it essential to clean and transform your data before you build any visualizations or reporting. This is an essential step in building quality visualizations in Power BI. Cleaning and transforming your data enables you to build visualizations from sound and clean data. When cleaning and transformations are not done dilligently your visualizations will not behave as you expect.
Best Practice Tip:
Always make sure that your data is properly cleaned and transformed before building any visualizations or queries. This will save you a lot of time and frustration in the long term.
This blog is the first of many that will follow covering a range of topics on data preparation. This blog post will cover the basics steps to follow when importing data for the first time into Power BI. The following blog posts will cover more advanced features of preparing your data for visualization.
Verifying imported data
Upon the first import of data, it is essential to verify the data and to make sure that the data is displayed as you expect. This is the time to ensure that your column headers are correct, that errors in your data are cleaned and that your data is ready to be used for visualization and reporting.
Let’s start with some practical examples. We will use the below data from Excel. This file contains data of the products that we are selling. The data is not very well structured and ready to be used for sound reporting and visualization of our data.
After the above data is imported into Power BI, it is displayed as in the below figure.
Take some time to familiarise yourself with the data and the manner in which it was imported into Power BI. As you go through the data think of steps that must be performed to clean and transform the data.
Good Practice Tip:
First, refrain from sorting your data in any manner until the data cleansing and transformation has been completed.
When importing data for the first time follow the below steps:
- Remove any leading or trailing lines of data.
- Verify column headers and promote headers if necessary
- Verify null values and errors. If null values can be completed make sure to filled downwards all null values.
What we can see from our data on the first inspection is that there are some leading and trailing lines circled in red and some null values in column one that must be removed.
The leading and trailing lines come from our Excel file where our products are broken down into categories on the same sheet and we don’t require that data. The null values come from the empty spaces in the file. We need to retain ‘ITEMCODE’ circled in blue as this is our column header. To remove these values, select the dropdown arrow from column 1 and uncheck leading and trailing rows circled in red.
When this step is completed ITEMCODE is in the first row of our table. Let’s turn the first row of our data into our column headers. From the Home tab under the Transform group tab, select Use First Row as Headers.
The result is all the items in the first row of our columns are displayed as the header.
Lastly, let’s fill the ITEMCATCODE and ITEMCATNAME null column values downward by selecting both column headers. Go to the Transform tab, under the Any Column Group, click on the small arrow next to the Fill button, and finally select the Down option to fill all null values in the data downward.
Now that our data values make sense the next step is to verify your data types.
A data type, in programming, is a classification that specifies which type of value a variable has and what type of mathematical, relational or logical operations can be applied to it without causing an error. A string, for example, is a data type that is used to classify text. An integer, on the other hand, is a data type used to classify whole numbers.
When importing data from a database or CSV file, Power BI will assign a data type to that field. While Power BI is very good at assigning data types you must always verify it especially when it is not coming directly from a database. The incorrect data type is a frequent cause of missing relationships and wrong default summarizations. This will lead to your visuals not displaying correctly. When this step is skipped you may run into endless frustrations on why your data is not displaying correctly.
Power BI was not very good in our example to assign data types. Power BI assigned the data type ‘Any’ to all of our columns.
Let’s change all of our text columns to data type ‘Text’ and our value column to ‘Whole Number’. You can simply select the dropdown arrow next to the Data Type and select the relevant data types.
Good Practice Tip: Format all numeric and date fields
The importance of formatting numeric and date fields upfront cannot be stressed enough. If not set correctly users will have to format the field every time it is used in the report. Defining units as whole numbers, or amounts as a currency is a helpful timesaver for reporting.
Verify Data Categories
It is good practice to specify data categories for data fields (columns). Power BI is quite good on setting this automatically but again make sure to check all the data fields. Especially for address-related fields where the data category must be set correctly to ensure that ensure geocoding can occur properly and date fields.
In our example, we would like to keep the ITEMLOCATIONCODE to a Data Category of Uncategorized. The Data Category can be set from the Modeling tab. Select the field you would like to set the Data Category for and select the relevant category.
Set Default Summarization for All Numeric Columns
Power BI default to sum numeric columns, however, there are instances where you would rather like a numeric column to be counted or for the minimum value to be displayed. Some example might be the speed of the wind on a given day. You would not like to see the sum of the wind speed. You would rather want the maximum or minimum speed and the time of the maximum or minimum speed. Another example might be your customer id field, where you would like to see the customer id, by selecting the don’t summarize option.
Set field name logically
The naming convention of field (column) names is extremely important. Again this might sound like a simple step, but with large datasets, this becomes critical in data preparation. Let’s give one example from our dataset. We have a field called ‘itemlocationname’ which is not very descriptive. It can point to a physical location in a warehouse, a store location a city etc. This field refers to our store location, so let’s rename the field by double-clicking on the field name and changing it to ‘Store’.
Keep field names unique across the dataset
Power BI allows for field names to be the same in different tables. However, this may create some confusion when building your reports and leave room for error. When field names are the same across tables give each column a unique name that relates to the table wherein the field is. Again, to put emphasis on naming, make it logical and self-explanatory.
When the case arises that there are two or more of the same field names in your model, ensure that only the field that is used in the report is visible and ‘hide’ the other field names. This will ensure no accidental usage of the ‘wrong’ field.
Renaming Applied Steps
The above picture indicates the Query Setting on the right-hand side where all your Applied Steps are captured. These are all the steps we performed in the above examples, but the steps are given general names. It is critical to name these steps very well. Again something that seems simple, but with multiple reports and large datasets, you will thank yourself if it is done properly.
You may rename your applied steps by simply right-clicking on the name of the Applied Step. We have renamed our Applied Steps as indicated in the figure below. Give you applied steps a logical description of the step performed.
It cannot be stressed enough how important the cleaning and transformation of your data is when imported into Power BI. It is worthwhile to spend extra time on this step to ensure that you can build visualizations that are accurate and behave as you expect. Naming convention and well documented applied steps are essential to easily find errors in your data model.
Be sure to follow our other blog posts on data preparation to see some more advanced cleaning and transforming steps in Power BI.