Still waiting on paperwork to finish processing with the client site, so I am continuing in my work-ready limbo. On Monday I gave my Spatial Statistics presentation to one of Talent Path's cohorts destined for Dell placement. It was a refreshing break from my usual work-ready activities, and it was great to see some familiar faces in the cohort and on the instructional team.
Python ETL Pipeline
From last week, I continued debugging the problem with one of my subtasks. I was trying to break the cleaning steps down into separate function definitions, much the way they would be broken into discrete steps in data modeling software. This was a good exercise in practicing Python and in building an ETL pipeline without a GUI.
The pipeline here is fairly simple. Using Python's requests library, I pulled a list of links from the GitHub data source and read in the .csv files. I first focused on processing a single file, working off a sample from the data source and writing the functions needed to clean that particular data set. Once that was done, the pipeline was mostly constructed, and all that was left was to run it against the rest of the files.
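The extract step looked something along these lines; the repository URL and file names below are placeholders, not the actual data source:

```python
import requests
import pandas as pd
from io import StringIO

# Placeholder raw .csv links standing in for the real GitHub data source.
csv_links = [
    "https://raw.githubusercontent.com/<user>/<repo>/main/data_2020.csv",
    "https://raw.githubusercontent.com/<user>/<repo>/main/data_2021.csv",
]

def extract(url):
    """Download one .csv file and read it into a dataframe."""
    response = requests.get(url)
    response.raise_for_status()  # fail loudly on a bad link
    return pd.read_csv(StringIO(response.text))

raw_frames = [extract(url) for url in csv_links]
```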
While doing so, a few more data errors surfaced. Not all of the files had the same number of columns, so I had to modify some of the inner functions in the pipeline, and other files had different column headings, which meant adding entries to my dictionary of columns to rename. Executing the pipeline then came down to extracting the data from the GitHub links, transforming it with my data cleaning functions, and loading the cleaned data into a dataframe as the target destination.
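A sketch of the transform step under those constraints; the column names and rename map here are invented for illustration, since the real dictionary grew as I hit each new header variant:

```python
import pandas as pd

# Hypothetical map of header variants to canonical names; the real
# dictionary grew each time a file turned up with different headings.
RENAME_MAP = {
    "Date Reported": "report_date",
    "date_reported": "report_date",
    "VALUE": "value",
    "Value": "value",
}

EXPECTED_COLUMNS = ["report_date", "value", "notes"]

def standardize_columns(df):
    """Rename known header variants to one canonical set."""
    return df.rename(columns=RENAME_MAP)

def enforce_schema(df):
    """Handle files with differing column counts: add any missing
    expected columns as nulls and drop anything unexpected."""
    for col in EXPECTED_COLUMNS:
        if col not in df.columns:
            df[col] = pd.NA
    return df[EXPECTED_COLUMNS]

def transform(df):
    """Apply the cleaning functions in order."""
    return enforce_schema(standardize_columns(df))
```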
This method of insertion is similar to the "merge" method covered in my ETL Pipelining in SQL Server course, where the sample SQL query uses a sort of CTE or subquery to "update" the target source. Here, I created a target dataframe as the destination and appended each cleaned file's data onto it. Next, I intend to try the other methods of loading data into destination sources using SQL. I also decided to try turning a simple table into a data warehouse.
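Continuing the sketch above, the load step collapses to a single concatenation:

```python
# Load: gather every cleaned dataframe into one target dataframe,
# the in-memory stand-in for a destination table.
cleaned_frames = [transform(frame) for frame in raw_frames]
target = pd.concat(cleaned_frames, ignore_index=True)
```

Concatenating once at the end is also the idiomatic pandas pattern; appending to the target inside a loop copies the whole dataframe on every iteration.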
Data Warehousing in SQL Server
In my current SQL Server instance, I have a copy of AdventureWorks and a copy of a database titled Flowers. AdventureWorks is already a full database, so I intend to turn the simple table named "iris" inside Flowers into a small warehouse-style schema. I succeeded in creating a reference table for iris species, but I haven't decided what I will do from there. I will likely create more dummy data to populate the database.
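The reference table was something along these lines, run from Python with pyodbc; the table layout, column names, and connection details are my guesses at a reasonable setup rather than the exact schema:

```python
import pyodbc

# Assumes a local SQL Server instance with Windows authentication;
# the driver name and server will vary by machine.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=Flowers;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# Hypothetical layout: a small reference (dimension) table keyed by
# species, filled from the distinct species values already in iris.
cursor.execute("""
    CREATE TABLE dbo.species_ref (
        species_id   INT IDENTITY(1, 1) PRIMARY KEY,
        species_name VARCHAR(50) NOT NULL UNIQUE
    );
""")
cursor.execute("""
    INSERT INTO dbo.species_ref (species_name)
    SELECT DISTINCT species FROM dbo.iris;
""")
conn.commit()
```

With the species names normalized out into species_ref, iris itself could then reference them by key, which would be a first step toward a star-schema layout.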