
Upskilling for the Client

I'd been practicing TensorFlow on a public movie dataset when I got the news that I'd been placed with a client! Now my work-ready courses have shifted toward concretely preparing me for the role of Enterprise Data Engineer.


TensorFlow

Using a public dataset from Kaggle, I decided I wanted to try to predict a movie's income based on the budget spent on it. If there is a belief that more money spent on a film equates to higher income, then this prediction model could help determine whether or not that belief is accurate.

Predicting Income by Budget, linear regression model test set.

While running the code for the training set with a 30-70 split, I noticed the model was producing strange, fluctuating loss values, and the predicted line didn't fit the data as well as I'd have liked. Visualizing the loss made it immediately obvious that the model wasn't converging, so I significantly increased the number of epochs. That seemed to work: after running the test set, the regression line sat cleanly along the line of best fit and the mean squared error came out to 0.002, which is about as good a fit as I could get from this model.
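
For reference, here's a minimal sketch of the kind of model described above. The CSV path and the column names ("budget", "revenue") are placeholders rather than the exact Kaggle dataset I used, and the split and epoch counts are just illustrative.

import pandas as pd
import tensorflow as tf

# Load the two columns of interest (names are assumed here).
df = pd.read_csv("movies.csv")[["budget", "revenue"]].dropna()

# Standardize both variables so the loss converges instead of fluctuating.
x = ((df["budget"] - df["budget"].mean()) / df["budget"].std()).values
y = ((df["revenue"] - df["revenue"].mean()) / df["revenue"].std()).values

# Train-test split (70% for training, 30% held out for testing).
split = int(len(x) * 0.7)
x_train, x_test = x[:split], x[split:]
y_train, y_test = y[:split], y[split:]

# Single-neuron linear regression model.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss="mse")

# More epochs give the loss time to converge.
model.fit(x_train, y_train, epochs=500, verbose=0)
print("Test MSE:", model.evaluate(x_test, y_test, verbose=0))
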




Upskilling Courses

The role I am going into for the client feels different from the role I trained for at Talent Path: data engineer vs. data analyst. Looking at the job description sent by the sales team, most of the tasks seemed to align more with data architecture than data engineering, revolving around organizing and building a database rather than developing algorithms to find insightful trends. Either way, it will require some extra research and preparation on my part to meet the client's standards. The client provided the sales team with some recommended links and workshops for me to complete to get up to speed before starting with them.


ETL Pipelining in Microsoft SQL Server

One of the first tasks listed in the role description mentioned ETL pipelining, so before I'd even received any reference links from the client, I found this course on Pluralsight that seemed useful to know. ETL stands for "Extract, Transform, Load": extracting data from source tables, transforming it to match the target schema's conventions, and then loading it into the data warehouse or model.
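
To make the idea concrete, here's a hedged sketch of that extract-transform-load flow from Python using pyodbc against SQL Server. The connection string, the Staging.Orders source table, and the dbo.FactOrders target table are made up for illustration; they aren't from the course or the client's environment.

import pyodbc

# Connect to a local SQL Server instance (connection details are illustrative).
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=DemoDW;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# Extract: pull raw rows from a staging table.
cursor.execute("SELECT OrderID, OrderDate, Amount FROM Staging.Orders")
rows = cursor.fetchall()

# Transform: conform the values to the warehouse schema
# (ISO date strings, currency rounded to two decimals).
transformed = [
    (row.OrderID, row.OrderDate.date().isoformat(), round(float(row.Amount), 2))
    for row in rows
]

# Load: insert the conformed rows into the warehouse fact table.
cursor.executemany(
    "INSERT INTO dbo.FactOrders (OrderID, OrderDate, Amount) VALUES (?, ?, ?)",
    transformed,
)
conn.commit()
conn.close()
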


The course goes through how to load a database from a backup and reviews T-SQL code for modifying tables in Microsoft SQL Server. The client will be using Erwin Data Modeler, but this course helped me understand how a pipeline works and how to resolve the different errors that may come up during the process. I'm still working through this course.


Enterprise Data Modeling

This Pluralsight course was recommended to me as a substitute for one of the Udemy courses suggested by the client. It reviewed some familiar concepts about relational data modeling and normalization levels, and taught me new concepts and vocabulary around dimensional modeling. It also solidified my understanding of online transaction processing (OLTP) and online analytical processing (OLAP), and introduced me to decision support systems (DSS).


A DSS is a system that integrates OLAP into its design to support a high read-to-write transaction ratio: most of the database's workload is queries that retrieve data rather than queries that modify it, and the system is optimized accordingly.

Example structures for DSS, from Pluralsight course

The enterprise data warehouse and OLAP cubes are both forms of data warehousing, just built and structured differently. The top-down structure is a large, flattened, denormalized database, while the bottom-up structure is a collection of smaller, purpose-driven cubes that pre-aggregate the data once the tables and other relevant aspects are defined, permuting measures by dimension and granularity. The data lake, unlike the other two, is a literal pool of raw data that is stored as-is and processed on demand.


I was also introduced to the idea of "slowly changing dimensions", where a dimension or attribute of a record changes over time and the change must either be recorded or overwritten in the database. With Type 0, the table keeps one absolute, unchanging attribute from which another relevant value can be derived (ex: birthdate can be used to calculate age). Type 1 directly overwrites the old value, while Type 2 preserves history by adding a new row for the same record each time the dimension "updates", often fed by a "change data capture" (CDC) process that detects those changes.
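
Here's a small illustrative sketch, in pandas with a made-up customer/city example, of the difference between a Type 1 overwrite and a Type 2 history-preserving update (the column names are hypothetical):

import pandas as pd

# A tiny customer dimension with effective-date tracking (hypothetical columns).
dim = pd.DataFrame([
    {"customer_id": 1, "city": "Austin",
     "valid_from": "2020-01-01", "valid_to": None, "is_current": True},
])

def scd_type1(dim, customer_id, new_city):
    # Type 1: overwrite the old value in place; history is lost.
    dim.loc[dim["customer_id"] == customer_id, "city"] = new_city
    return dim

def scd_type2(dim, customer_id, new_city, change_date):
    # Type 2: close out the current row and append a new one,
    # so both the old and new values are kept.
    current = (dim["customer_id"] == customer_id) & dim["is_current"]
    dim.loc[current, ["valid_to", "is_current"]] = [change_date, False]
    new_row = {"customer_id": customer_id, "city": new_city,
               "valid_from": change_date, "valid_to": None, "is_current": True}
    return pd.concat([dim, pd.DataFrame([new_row])], ignore_index=True)

print(scd_type2(dim.copy(), 1, "Dallas", "2023-06-01"))
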


Erwin Data Modeling

A simple YouTube search for Erwin Data Modeling turned up a bunch of tutorials and product descriptions for the tool my client is working with. While getting a copy of the actual product proved to be a hassle, I could at least take screenshots and notes on the tool's capabilities. From what I gather, it should be similar to creating ERDs in MySQL Workbench.
