Power BI with R: R Script Connector Unzips, Combines and Loads the Zipped csv/xlsx/xlsb/xls files

 



In this video I shall share a method for using R in Power BI Desktop to unzip compressed Excel files based on the file name extensions of the extracted files, such as csv, xlsx, xls and xlsb, then combine them if they have the same column headers and table structure, and finally load the combined data frame into Power BI for further data modelling. The R script introduced in this video focuses only on loading zipped files from the local machine, but with slight modifications it could also fetch the archives from cloud sites, given the required HTTP URL paths and access logins.
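As one rough idea of such a modification (a minimal sketch, not part of the demo script; the URL below is a hypothetical placeholder), the zipped archive could be downloaded to a temporary file first, with the rest of the script left unchanged:


# Sketch only: fetch the archive from a hypothetical URL before unzipping

zip_url <- "https://example.com/Downloaded_Zipped_BPM_2023_2020.zip"   # placeholder URL, not a real link

source_zip_path <- tempfile(fileext = ".zip")

download.file(zip_url, destfile = source_zip_path, mode = "wb")        # mode = "wb" keeps the binary archive intact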

 

Let’s get started:

 

First, we need to prepare the received compressed sample files. We have four Excel files with the extensions csv, xls, xlsx and xlsb. All of them have the same data table structure: the same number of columns with the same column headers. After compressing them we get a zipped folder containing the four files, named “Downloaded Zipped BPM_2023_2020”. Inside the folder we can see that the files have small compressed sizes in their different format types. Assume that we have received the zipped files as an email attachment, or downloaded them automatically from a cloud site, to our local hard drive.

 

In the demo I use RStudio to edit the R code lines. Regarding the local installation of the R engine and RStudio, please refer to my previous video suggested in the top right corner.

 

Launch RStudio and open the source editor.

 

We first need to install and load four packages: readxl, zip, openxlsx and readxlsb.

 

# Step 1: Install necessary packages

if (!require("readxl")) install.packages("readxl", dependencies=TRUE)

if (!require("zip")) install.packages("zip", dependencies=TRUE)

if (!require("openxlsx")) install.packages("openxlsx", dependencies=TRUE)

if (!require("readxlsb")) install.packages("readxlsb", dependencies=TRUE)


# load necessary packages

library(readxl)

library(zip)

library(openxlsx)

library(readxlsb)



Step 2: Define the source path of the downloaded zipped file, and the destination path for the output of the extracted files:


source_zip_path <- "C:/Users/User/Documents/01_13_youtubeshorts/18_15 PBI R script unzip zipped files/source data for demo/Downloaded Zipped BPM_2023_2020.zip"


output_dir <- "C:/Users/User/Documents/01_13_youtubeshorts/18_15 PBI R script unzip zipped files/source data for demo/Unzipped and Convertd Zipped Files"


 

Step 3: Check whether output_dir exists. If it does not, create it; if it already exists, check whether any files from a previous run remain there, and if so delete them to empty the folder:


if (!file.exists(output_dir)) {

  # If it doesn't exist, create it

  dir.create(output_dir)

} else {

  # If it exists, list the files inside the folder and delete any remaining ones

  files_to_delete <- list.files(output_dir, full.names = TRUE) 

  if (length(files_to_delete) > 0) {

    file.remove(files_to_delete)

  }

}

 

Step 4: Unzip compressed files:


# Unzip files

unzip(zipfile = source_zip_path, overwrite = TRUE, exdir = output_dir)


# Get all excel files with their extensions

excel_files <- list.files(path = output_dir, pattern = "\\.[a-zA-Z]+$", full.names = TRUE)

 

The first line unzips the contents of the compressed ZIP file. Its arguments specify the path to the ZIP file we want to extract, allow overwriting of existing files in the output directory if any share the same names as the extracted files, and set the output directory where the unzipped files will be placed.

 

The second line creates a list of all Excel files, with their full paths, within the specified output directory. The pattern argument uses a regular expression to match files with any alphabetic extension (e.g., .xlsx, .xls, .csv, etc.). The full.names argument returns the full file paths instead of just the filenames. An optional, tighter variant of the pattern is sketched below.
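This variant (not used in the demo) limits the match to the four extensions the loop in the next step knows how to read:


# Optional variant: match only the extensions handled later in the script

excel_files <- list.files(path = output_dir, pattern = "\\.(csv|xls|xlsx|xlsb)$", full.names = TRUE, ignore.case = TRUE)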

 

Step 5: Read and store the data frames from the unzipped Excel files:


# Create an empty list to store the read data frames

data_list <- list()


# Loop through each excel file and read it based on its extension

for (i in seq_along(excel_files)) {

  

  # Extract the file extension

  file_extension <- trimws(tolower(tools::file_ext(excel_files[i])))

  

  # Read the excel file based on its extension

  

  if (file_extension == "csv") {

    data_list[[i]] <- read.csv(excel_files[i])

  } 


else if (file_extension == "xls") {

    data_list[[i]] <- readxl::read_excel(excel_files[i])

    # Convert file to CSV and read it into the data list

    write.csv(data_list[[i]], file.path(output_dir, paste0(tools::file_path_sans_ext(basename(excel_files[i])),".csv")), row.names = FALSE)

    data_list[[i]] <- read.csv(file.path(output_dir, paste0(tools::file_path_sans_ext(basename(excel_files[i])),".csv")))

  }

  

  else if (file_extension == "xlsx") {

    data_list[[i]] <- readxl::read_excel(excel_files[i])

    # Convert file to CSV and read it into the data list

    write.csv(data_list[[i]], file.path(output_dir, paste0(tools::file_path_sans_ext(basename(excel_files[i])), ".csv")), row.names = FALSE)

    data_list[[i]] <- read.csv(file.path(output_dir, paste0(tools::file_path_sans_ext(basename(excel_files[i])), ".csv")))

  }

  

  else if (file_extension == "xlsb") {

    data_list[[i]] <- read_xlsb(excel_files[i], sheet = 1)

    # Convert Unix timestamps in the date column to UTC datetimes

    data_list[[i]]$Date..dd.mm.yyyy.<- as.POSIXct(data_list[[i]]$Date..dd.mm.yyyy., origin ="1970-01-01",tz = "UTC") 

    # Convert file to CSV and read it into the data list

    write.csv(data_list[[i]], file.path(output_dir, paste0(tools::file_path_sans_ext(basename(excel_files[i])), ".csv")), row.names = FALSE)

    data_list[[i]] <- read.csv(file.path(output_dir, paste0(tools::file_path_sans_ext(basename(excel_files[i])), ".csv")))

  }

}

 

The first line initializes an empty list called data_list. It will be used to store the data frames that will be read from the Excel files in the subsequent loop.

 

Then follows a larger block of code that loops through each Excel file, reads it based on its extension, and converts it to a CSV file if it is not already one.

 

So for each extracted Excel file, we get its file extension.

If the extension is csv, we read the data frame straight into the storage list data_list.

If the extension is “xls”, we first read the data frame into data_list with the readxl function as the i-th element, then write that data frame out as a csv file. The paste0(...) call builds the new filename: basename takes just the filename from the full path, tools::file_path_sans_ext removes its extension, and ".csv" is appended as the new extension; file.path then places it in the output folder. The row.names = FALSE argument prevents writing row names to the CSV file.

 

If the extension is “xlsx”, the same actions as above are applied to read and convert this xlsx-type file.

 

If the extension is “xlsb”, the conversion steps are almost the same as above. The only difference is that, before converting xlsb to csv, we need to convert the date column’s values, which come through as Unix timestamps, into UTC datetimes.

 

After step 5 we have only csv-based data frames stored in data_list.
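Before the combining step, a quick optional sanity check in the console (not part of the demo script) can confirm that every stored data frame has the same column layout:


# Optional sanity check: compare column counts and column names across data_list

sapply(data_list, ncol)

lapply(data_list, names)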

 

Last step: check that all data frame structures are compatible, preventing errors due to mismatched column names, then create a single, unified data frame for further analysis or processing:


In this step we first initialize a combined_df data frame holder with predefined column header names. If the holder already exists from a previous run, we clear it by removing its data rows only.


if (!exists("combined_df")) {

  column_headers <- c("Measured.Personal.Email", "Date..dd.mm.yyyy.", "Blood.Pressure.Reading..Systolic.BP.Diastolic.BP.","BP","Remarks..free.text.","X__PowerAppsId__")  

  combined_df <- data.frame(matrix(ncol = length(column_headers), nrow = 0))

  names(combined_df) <- column_headers

} else {

  # If it exists, clear any existing data

  combined_df <- combined_df[0, ]

}



Then we loop through all the data frames in data_list and combine them if they have the same column names:

We check whether each data frame in data_list has column names identical to those of the combined_df holder. If yes, it is appended to combined_df; otherwise a note “Column name mismatch in file” is written into the first column of a newly appended row.


# Loop through the data frames

for (i in seq_along(data_list)) {

  df <- data_list[[i]]

  

  # Check only for matching column names (data type check removed)

  if (identical(names(df), names(combined_df))) {

    combined_df <- rbind(combined_df, df)  # Combine if column names match

  }

  

  else {

    # Column names do not match: build a note row with the same columns so rbind() does not fail
    note_row <- combined_df[NA_integer_, ]
    note_row[[1]] <- paste("Column name mismatch in file", i)
    combined_df <- rbind(combined_df, note_row)

  }

}


 

Run the whole R script and have a quick check of the generated result: the combined data frame combined_df. Looks good. Let’s copy the R code and move to the Power BI Desktop file.

 

In the Desktop file, under the Home tab in the top ribbon, click the “Get Data” icon, choose “R script” and connect. In the R script editor, paste the copied code and click OK. The generated combined data frame appears for selection; select it and click the “Transform” button. In the query editor, check the data in each column and do any necessary data transformations and cleaning, then close and apply the queries. We get the combined dataset in Power BI in the data view. From here we can do further work on modelling and reporting.
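One practical note on the connector: it offers every data frame created by the pasted script for selection, so if you want only combined_df to show up, you could optionally drop the leftover data frames at the very end of the script (the object names below are the ones used in this demo, including note_row from the mismatch branch):


# Optional: keep only combined_df visible to the Power BI R script connector

rm(list = intersect(c("df", "note_row", "data_list"), ls()))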

 

That is all for the demo. This video shared a method of using R in Power BI Desktop to unzip compressed Excel files based on the file name extensions of the extracted files, such as csv, xlsx, xls and xlsb, then combine them if they have the same column headers and table structure, and finally load the combined data frame into Power BI for further data modelling and analysis.

 

