All of this week's activities focused on dissecting the visualizations and machine learning found in this article: https://towardsdatascience.com/predicting-hourly-divvy-bike-sharing-checkouts-per-station-65b1d217d8a4. Half of the week was spent on replicating the exploratory data analysis using the plotly library in Python, and the other half of the week was spent on machine learning.
Plotly
The objective of recreating the visualizations was to gain experience and some level of comfort using the plotly library. I found it easier to use than matplotlib because of plotly's great documentation and more intuitive syntax. The express module was a quick and easy way to create graphs on the fly, but the graph_objects module allowed for more customization of the graphs' finer details. It took more time to create graphs using graph_objects than express, but the resulting visualizations were closer to the ones in the article.
plotly allows for Mapbox integration to create maps, but that is contingent on acquiring an access token with a Mapbox account. To avoid that hassle, I used scatter_geo in place of Mapbox to create "scatter plots on maps".
Machine Learning
I'd never touched machine learning until now, having only heard about its uses and applications from friends in industry and in college. A colleague recommended a short and sweet course on Pluralsight that I used to get a clearer idea of the data processing and workflow used in ML. I wasn't able to apply all of the course's code to the activity because we were advised to use regression models and the course used classification models, but it was very helpful in breaking down the process and logic behind preparing the data and selecting appropriate algorithms.
We were provided an example workbook that broke down the code used in the Divvy Bikes article, but that wasn't enough to "improve upon the models." My approach consisted of using the Pluralsight course to understand the basic steps of machine learning, combing through the original article to find flaws or turning points for me to act on, dissecting the example code to "cut and paste" the parts I wanted to use, and then relying on Google to find and select appropriate regression models and code to use.
This activity had to be completed in two days. In that time I gained a better understanding of machine learning and its purpose. Due to time constraints, I wasn't able to improve anything from the original analysis, but I plan to attempt a multiple linear regression model to test my new knowledge.
Statistical Deep Dive
Another assignment from this week was researching a statistical measure. I chose to look at spatial measures because I enjoy using map data. Spatial data helps with navigation and directions, and is useful for finding geographic patterns in data. The data can be mapped in visualization software using coordinates, ZIP codes, shapefiles, GeoJSON files, etc.
Use cases such as finding the nearest restaurants or reporting real-time traffic data are possible thanks to spatial index algorithms, which organize the geometric data points for efficient searches. Basic query types include finding the K nearest neighbors, or using "range and radius" queries, which draw a geometric shape around a location and return the points inside it. Answering either query by checking every point becomes slow on large datasets, so the algorithms use "branch and bound" tree methods to skip regions that cannot contain a match. Two tree examples are the R-tree and the K-d tree. The K-d tree is a binary tree, while the R-tree is a balanced tree whose nodes can hold more than two entries.
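To make the two query types concrete, here is a minimal sketch of the unindexed baseline: every query looks at every point, which is the cost the tree methods try to avoid. The coordinates and the function names `k_nearest` and `within_radius` are made up for illustration, and distances are plain Euclidean rather than true geographic distance.

```python
import heapq
import math


def k_nearest(points, query, k):
    """Return the k points closest to query using a full linear scan."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))


def within_radius(points, query, radius):
    """Return every point whose distance to query is at most radius."""
    return [p for p in points if math.dist(p, query) <= radius]


# Hypothetical planar coordinates, not real restaurant locations.
restaurants = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (5.0, 5.0), (1.0, 0.0)]
here = (0.9, 0.2)

nearest_two = k_nearest(restaurants, here, 2)   # K nearest neighbors query
nearby = within_radius(restaurants, here, 1.0)  # radius query
```

Both functions touch all of the points on every call, so the work grows linearly with the size of the dataset. A spatial index trades some up-front build time for the ability to ignore most points per query.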
The R-tree approach is widely used, grouping nearby points inside bounding rectangles so that each rectangle holds a limited number of entries. The rectangles start big and nest smaller rectangles inside them, until each of the smallest boxes contains no more than a set number of points. The K-d tree splits the points in half at each level, alternating the partition between the x and y axes.
I've been asked how their runtime and memory compare, but without trying out the algorithms on a database of spatial data, I'm not sure which one runs better. Both are tree structures, so a search can skip most of the data instead of scanning every point, and both are common approaches to spatial indexing.
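The K-d tree splitting described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: 2D points, a median split at each level, nodes stored as plain dictionaries, and hypothetical function names `build_kdtree` and `nearest`. It is not how a production spatial library implements it.

```python
import math


def build_kdtree(points, depth=0):
    """Recursively split points at the median, alternating x (0) and y (1)."""
    if not points:
        return None
    axis = depth % 2
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return {
        "point": ordered[mid],
        "axis": axis,
        "left": build_kdtree(ordered[:mid], depth + 1),
        "right": build_kdtree(ordered[mid + 1:], depth + 1),
    }


def nearest(node, query, best=None):
    """Return the stored point closest to query, pruning subtrees that cannot win."""
    if node is None:
        return best
    if best is None or math.dist(node["point"], query) < math.dist(best, query):
        best = node["point"]
    axis = node["axis"]
    diff = query[axis] - node["point"][axis]
    # Search the side of the split that contains the query first.
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Only cross the splitting line if a closer point could exist on the other side.
    if abs(diff) < math.dist(best, query):
        best = nearest(far, query, best)
    return best


# Made-up sample points for illustration.
points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
closest = nearest(tree, (9, 2))
```

The pruning check in `nearest` is the "branch and bound" idea: once a good candidate is found, any half of the plane that lies farther away than that candidate is skipped entirely.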