Power BI with R: R Script Connector Unzips, Combines and Loads the Zipped csv/xlsx/xlsb/xls files
In this Video I shall share a method
on using R in Power BI desktop to unzip the compressed excel files based on the
identified file name extensions of extracted files, such as csv, xlsx, xls and
xlsb etc, then combine them if they have same column headers and table
structure, Finally load the combined data frame into power BI for further data
modelling. The introduced R script in this video only focus on loading zipped
files from local machine but it is possible to have slight modifications on the
R script to be able to fulfil such automatic features online at cloud sites
with the required respective http URL paths and access logins.
Let’s get start:
First, we need to prepare the
received compressed sample files, we have four excel files with extensions, csv,
xls, xlsx and xlsb. All of them have the
same data table structure, same column numbers with same column headers. After compressing
them we get a zipped folder for the zipped four files as “Downloaded Zipped
BPM_2023_2020”. Inside the folder we can see they have small compressed size in
different format types. Assume that we have
received and downloaded automatically the zipped files from email attachment or
from cloud site to our local hard drive.
In the demo approach I use the R
studio to edit R Code lines, Regrading the local installations of R engine and
R Studio please refer to my previous video suggested in right top corner.
Launch R studio , in source editor,
We first need to install and load
the four packages, readxl, zip, openxlsx and readxlsb.
# Step 1: Install necessary packages
if (!require("readxl")) install.packages("readxl", dependencies=TRUE)
if (!require("zip")) install.packages("zip", dependencies=TRUE)
if (!require('openxlsx')) install.packages("openxlsx", dependencies=TRUE)
if (!require('readxlsb')) install.packages("readxlsb", dependencies=TRUE)
# load necessary packages
library(readxl)
library(zip)
library(openxlsx)
library(readxlsb)
Step 2: Define the source path where we download the zipped files, and create destination path for the output of extracted files:
source_zip_path <- "C:/Users/User/Documents/01_13_youtubeshorts/18_15 PBI R script unzip zipped files/source data for demo/Downloaded Zipped BPM_2023_2020.zip"
output_dir <- "C:/Users/User/Documents/01_13_youtubeshorts/18_15 PBI R script unzip zipped files/source data for demo/Unzipped and Convertd Zipped Files"
Step 3: Check if output_dir exists, if not create it, if exists already, check if any previous file retained there, if there is any one then clear it to empty the folder:
if (!file.exists(output_dir)) {
# If it doesn't exist, create it
dir.create(output_dir)
} else {
# If it exists, Store file list inside the block and clear any file in the block
files_to_delete <- list.files(output_dir, full.names = TRUE)
if (length(files_to_delete) > 0) {
file.remove(files_to_delete)
}
}
Step 4: Unzip compressed files:
# Unzip files
unzip(zipfile = source_zip_path, overwrite = TRUE, exdir = output_dir)
# Get all excel files with their extensions
excel_files <- list.files(path = output_dir, pattern = "\\.[a-zA-Z]+$", full.names = TRUE)
This first line unzips the contents
of a compressed ZIP file with Arguments to specify the path to the ZIP file we
want to extract; Allows overwriting existing files in the output directory if
any has the same name as the ZIP file; Sets the output directory where the
unzipped files will be placed.
The second line creates a list of
all Excel files with their full paths within the specified output directory.
Argument pattern Uses an expression to match files with any alphabetic
extension (e.g., .xlsx, .xls, .csv, etc.). The full.names Returns the full file paths instead
of just the filenames.
Step 5: Read and store the data frame from unzipped excel files:
# Create an empty list to store the read data frames
data_list <- list()
# Loop through each excel file and read it based on its extension
for (i in seq_along(excel_files)) {
# Extract the file extension
file_extension <- trimws(tolower(tools::file_ext(excel_files[i])))
# Read the excel file based on its extension
if (file_extension == "csv") {
data_list[[i]] <- read.csv(excel_files[i])
}
else if (file_extension == "xls") {
data_list[[i]] <- readxl::read_excel(excel_files[i])
# Convert file to CVS and read it into the data list
write.csv(data_list[[i]], file.path(output_dir, paste0(tools::file_path_sans_ext(basename(excel_files[i])),".csv")), row.names = FALSE)
data_list[[i]] <- read.csv(file.path(output_dir, paste0(tools::file_path_sans_ext(basename(excel_files[i])),".csv")))
}
else if (file_extension == "xlsx") {
data_list[[i]] <- readxl::read_excel(excel_files[i])
# Convert file to CSv and read it into the data list
write.csv(data_list[[i]], file.path(output_dir, paste0(tools::file_path_sans_ext(basename(excel_files[i])), ".csv")), row.names = FALSE)
data_list[[i]] <- read.csv(file.path(output_dir, paste0(tools::file_path_sans_ext(basename(excel_files[i])), ".csv")))
}
else if (file_extension == "xlsb") {
data_list[[i]] <- read_xlsb(excel_files[i], sheet = 1)
# Convert UNIXTimeStamp to UTC
data_list[[i]]$Date..dd.mm.yyyy.<- as.POSIXct(data_list[[i]]$Date..dd.mm.yyyy., origin ="1970-01-01",tz = "UTC")
# Convert file to CVS and read it into the data list
write.csv(data_list[[i]], file.path(output_dir, paste0(tools::file_path_sans_ext(basename(excel_files[i])), ".csv")), row.names = FALSE)
data_list[[i]] <- read.csv(file.path(output_dir, paste0(tools::file_path_sans_ext(basename(excel_files[i])), ".csv")))
}
}
The first line initializes an empty list called data_list. It will be used
to store the data frames that will be readed from the Excel files in the next
steps of the loop codes.
Then we follow up a big part of code
lines which are Looping through each excel file, read it based on its extension,
and convert it to CSV file type if they are not CSV.
So for each extracted excel file,
get it’s file extension,
if the extension is csv, read the
data frame into storage Data list;
if the extension is “xls”, then
first read the data frame into storage data list by readxl function as the i-th
element, then convert this xls file in storage as csv file with arguments paste0(...)
to Combine strings for creating the full filename by basename function, which actually Removes the
extension from the original Excel filename.
and Appends the ".csv" as extension.
; the row.names to Prevent writing row names to the CSV file.
If the extension is “xlsx”, same
actions as above to read and convert
this “xlsx” type file.
If the extension is “xlsb”, the
converting step are almost same as above . only differences is that before
converting xlsb to csv, we need to ensure that all datetimestamps in xlsb are
UTC from UNIXtimestamps.
After the step 5 we have only csv
file data frames stored in data_list.
Last step :Check if all data frame structures are compatible, preventing errors due to mismatched column name. Then Creates a single, unified data frame for further analysis or processing:
In this step we first initialize a combined_df data frame holder with predefined column header names, if the frame holder has been there before, then we need to clean it’s data by removing it’s data rows only.
if (!exists("combined_df")) {
column_headers <- c("Measured.Personal.Email", "Date..dd.mm.yyyy.", "Blood.Pressure.Reading..Systolic.BP.Diastolic.BP.","BP","Remarks..free.text.","X__PowerAppsId__")
combined_df <- data.frame(matrix(ncol = length(column_headers), nrow = 0))
names(combined_df) <- column_headers
} else {
# If it exists, clear any existing data
combined_df <- combined_df[0, ]
}
Then we Loop through the all data
frames in data_list and combine them if they have same column names:
we checks if each file data frames
within data_list have identical column names with the ones in combined_df data
frame holder. If Yes combine it into the combined_df. Otherwise leave notes “Column
name mismatch in file” in an new added row.
# Loop through the data frames
for (i in 1:length(data_list)) {
df <- data_list[[i]]
# Check only for matching column names (data type check removed)
if (all(names(df) == names(combined_df))) {
combined_df <- rbind(combined_df, df) # Combine if column names match
}
else {
combined_df <- rbind(combined_df, data.frame("Note" = paste("Column name mismatch in file", i)))
}
}
Run the whole R script code lines,
have a quick check on the generated result: combined data frame combined_df. Looks
Good. Let’s copy the R codes and move to Power BI desktop file,
In the desktop file Under home tab
in top ribbon, click the “get Date” icon, choose the R Script and connect, in
the R Script editor , paste the copied codes. OK, the generated combined data
frame appears for selection, select it and click “transform” button. In the
query editor check the data in each column and do necessary data transformations
and cleanings. Then close and apply the queries. We get the combined dataset in
power BI at data view. From here we can do furth works on modelling and
reporting.
Comments
Post a Comment