
Week 8: More SQL and Reviewing ML

So I worked a little more on populating the iris dataset. While reviewing what other skills I could brush up on based on the role description, I decided to review machine learning. Then on Thursday, I was notified that I was cleared to start client work the next week! So while this has been good busy-work to keep my skills sharp, I am excited to finally be able to work towards a client goal!


Iris Transactions

Last week, I created additional tables for "iris" in my Flowers database. I copied the species names over to a new table and added business keys and surrogate keys to both tables. Then, realizing I should probably practice more transactional statements in SQL, this week I worked with INSERT and UPDATE to fill in more columns in the table I created. I also played around with putting these statements inside TRANSACTIONs. Getting used to wrapping SQL statements in transactions is good practice: if I make any sort of mistake in my SQL code, I can undo it with a simple ROLLBACK, and once I'm confident the changes are correct I can make them permanent with COMMIT. Unfortunately, none of these queries were really long enough to warrant using a savepoint. Perhaps that could be my next goal: putting all of my updates and modifications in a single transaction and creating a savepoint for each part.

Example of part of a TRANSACTION
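Since the screenshot doesn't carry over here, below is a minimal sketch of the kind of transaction I've been practicing. The table and column names are just illustrative (not my actual iris schema), and the syntax assumes a SQL Server-style dialect.

```sql
BEGIN TRANSACTION;

-- Add a new row to the measurements table.
INSERT INTO iris_measurements (iris_bk, sepal_length, sepal_width, petal_length, petal_width)
VALUES (151, 5.9, 3.1, 4.8, 1.8);

-- Fill in a column I added earlier.
UPDATE iris_measurements
SET species_sk = 2
WHERE iris_bk = 151;

-- If something looks wrong, undo everything since BEGIN TRANSACTION:
-- ROLLBACK;

-- Otherwise, make the changes permanent:
COMMIT;
```

The next step would be adding SAVE TRANSACTION (or SAVEPOINT, depending on the dialect) checkpoints, so a ROLLBACK only undoes part of a longer batch instead of the whole thing.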

Reviewing Machine Learning

Using the same dataset I used for my Python ETL Pipeline, I figured I could also try some sort of classification or regression model. I was stuck on cleaning the data longer than I expected because of the way the data was stored: I had to access the GitHub repository where the data files are kept and scrape the links to the files so I could later read them into dataframes via pandas. Once each file was in its own dataframe, instead of immediately cleaning and prepping it, I appended all of the individual csv sources together into a single dataframe so I could see all of the data at once and then decide how to deal with it.
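As a rough sketch of that step, something like the following; the repository URL and the link-filtering logic are placeholders rather than the exact source and page layout I worked with.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Placeholder URL for the GitHub page that lists the CSV files.
LISTING_URL = "https://github.com/example-org/example-repo/tree/main/data"

# Scrape the page and keep only links that point at CSV files.
html = requests.get(LISTING_URL).text
soup = BeautifulSoup(html, "html.parser")
csv_links = [a["href"] for a in soup.find_all("a", href=True) if a["href"].endswith(".csv")]

# Read each file into its own dataframe, then append them all into one
# dataframe before cleaning, so I can look at the data as a whole first.
frames = []
for link in csv_links:
    raw_url = "https://raw.githubusercontent.com" + link.replace("/blob", "")
    frames.append(pd.read_csv(raw_url))

combined = pd.concat(frames, ignore_index=True)
print(combined.shape)
```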

Replacing some null values

I then spent the rest of the week trying to fix the first of the null values, the FIPS_countyCode. I found that not all of the entries with a null FIPS_countyCode were outside of the US, even though every US entry should have a FIPS code, so for the non-US entries I replaced the nulls with the impossible FIPS code -1. This still left me with 5380 null values in FIPS code, which could only mean incorrectly input data. I looked up FIPS codes and tried to figure out how to read and extract the information I needed using BeautifulSoup so I could fill in these values. I ended up using regular expressions instead, but the capture wasn't perfect, and I spent the rest of my time trying to figure out how to clean it up.
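A simplified version of that replacement might look like this; the column names here are guesses at the schema, not necessarily what the dataset actually calls them.

```python
# Hypothetical column names; the real dataset's schema may differ.
is_us = combined["Country_Region"] == "US"

# Non-US locations legitimately have no county FIPS code, so mark those
# nulls with an impossible value (-1) instead of leaving them empty.
combined.loc[~is_us & combined["FIPS_countyCode"].isna(), "FIPS_countyCode"] = -1

# Anything still null should be a US entry that was input incorrectly
# and needs to be looked up from a reference list of FIPS codes.
print(combined["FIPS_countyCode"].isna().sum())
```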

The problem with the regex web scrape was imprecise scraping
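To give an idea of what I mean by imprecise, a loose pattern along these lines pulls county name / code pairs but also drags in stray whitespace and fragments of surrounding text, which is what I'm still cleaning up. The source URL below is a placeholder.

```python
import re
import requests

# Placeholder reference page listing county names and their FIPS codes.
page_text = requests.get("https://example.gov/fips-reference.html").text

# Loosely grab "<county name> <5-digit code>" pairs from the raw page text.
pairs = re.findall(r"([A-Za-z .'-]+?)\s+(\d{5})", page_text)

# First cleanup pass: trim whitespace and drop captures that are clearly junk.
fips_lookup = {name.strip(): code for name, code in pairs if len(name.strip()) > 2}
```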

I think this was a better example of how long it actually takes to clean and prepare data before applying machine learning, and it was a great way to review some more Python and data cleaning practices.
