My First Date with Quilt Data
July 21, 2020
I’ve known the good folks at Quilt Data for a long time. A company hackathon gave me a good excuse to finally use their tools “in anger” for a real demo. These are my notes on how to configure quilt3 and create my first package (and pandas DataFrame) from a CSV.
- Create a Quilt account. Actually, they created one for me, since I don’t have access to my own S3 bucket
- Log in and create a password. Make sure I save it.
- Install stuff
- $ jupyter notebook
- Weird browser. Uploaded notebook file. Opened it. Works.
- $ jupyter notebook CORD19.ipynb # Ah, much better
- Click “Run” to evaluate each cell
- quilt3.login() # how the heck do I do that in Jupyter?
- ModuleNotFoundError: No module named 'quilt3'
- $ python # try from repl
- same error
- $ which python
- /Users/nauto/opt/miniconda3/bin/python
- Ah. Maybe I need to install quilt3 from conda as well
- $ conda install -c conda-forge quilt3
- Works!
- quilt3.login()
- Launching a web browser…
- Did not see that coming. Works!
- Work with packages
- b = quilt3.Bucket("s3://quiltnauto")
- q = quilt3.Package()
- # fix .quiltignore
- q.set_dir(".", ".")
- q.push("nauto/trips", registry="s3://quiltnauto")
- quilt3.config()
- quilt3.config(default_remote_registry="s3://quiltnauto")
- qn = quilt3.Package.browse("nauto/trips", registry="s3://quiltnauto")
- trip_data = qn["trip_report_data.csv"].deserialize()
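Looking back at the ModuleNotFoundError earlier in this list: it usually means the package was installed into a different Python than the one Jupyter or the REPL is running. A minimal sketch of a check that can be pasted into a notebook cell, using only the standard library; `module_status` is a hypothetical helper name, not part of quilt3.

```python
import importlib.util
import sys

def module_status(name):
    """Report which interpreter is running and whether it can import `name`."""
    spec = importlib.util.find_spec(name)  # None when the module is not importable
    return {
        "interpreter": sys.executable,
        "module": name,
        "installed": spec is not None,
        "location": getattr(spec, "origin", None),
    }

# If "installed" is False, install quilt3 with the package manager that owns
# the interpreter shown here (conda, in the notes above).
status = module_status("quilt3")
print(status)
```

If the interpreter path points at miniconda, as `which python` showed above, installing from conda-forge is the matching fix.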
COVID-19 Data Lake
- https://open.quiltdata.com/b/covid19-lake/tree/tableau-jhu/csv/COVID-19-Cases.csv?version=OXNN19GctMD4EW4BOk8TBP4aAtx6lc8t
- c = quilt3.Bucket("s3://covid19-lake")
- c.fetch("tableau-jhu/csv/COVID-19-Cases.csv", "./COVID-19-Cases.csv")
- import pandas as pd
- covid_data = pd.read_csv("./COVID-19-Cases.csv")
Pandas
- trip_data.dtypes
- covid_data.dtypes
- len(covid_data.Province_State.unique())
- trip_data.head()
- trip_data.index
- trip_data.columns
- trip_data.describe()
- trip_loc = trip_data.loc[:, ['account', 'fleet_id', 'trip_bucket', 'trip_start_location', 'trip_end_location']]
- covid_loc = covid_data.loc[:, ['Case_Type', 'Cases', 'Date', 'Lat', 'Long']]
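- # split "(lat, long)" strings into separate columns; regex=False treats the parentheses literally
- trip_loc[['start_lat', 'start_long']] = trip_loc['trip_start_location'].str.replace('(', '', regex=False).str.replace(')', '', regex=False).str.split(', ', expand=True)
- trip_loc[['end_lat', 'end_long']] = trip_loc['trip_end_location'].str.replace('(', '', regex=False).str.replace(')', '', regex=False).str.split(', ', expand=True)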
- trip_loc.loc[0]
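The location parsing above can also be written as one reusable helper that returns floats directly. This is only a sketch on made-up rows (the real trip_report_data.csv isn't shown in these notes), and it assumes the location strings look like "(lat, long)". `split_location` and the sample values are hypothetical.

```python
import pandas as pd

def split_location(series, prefix):
    """Parse '(lat, long)' strings into float columns <prefix>_lat and <prefix>_long."""
    # Two capture groups: one per coordinate; rows that don't match become NaN.
    parts = series.str.extract(r"\(\s*(-?[\d.]+)\s*,\s*(-?[\d.]+)\s*\)")
    parts = parts.astype(float)
    parts.columns = [f"{prefix}_lat", f"{prefix}_long"]
    return parts

# Hypothetical sample data standing in for trip_loc
sample = pd.DataFrame({
    "trip_start_location": ["(37.77, -122.42)", "(34.05, -118.24)"],
    "trip_end_location": ["(37.80, -122.27)", "(33.94, -118.41)"],
})

parsed = sample.join(split_location(sample["trip_start_location"], "start"))
parsed = parsed.join(split_location(sample["trip_end_location"], "end"))
print(parsed[["start_lat", "start_long", "end_lat", "end_long"]])
```

Having numeric columns up front means later steps (describe(), plotting, building geometries) don't have to convert strings on the fly.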
GeoPandas
- $ conda install -c conda-forge geopandas
- $ conda install -c conda-forge matplotlib
- $ conda install -c conda-forge shapely
- import geopandas as gp
- import matplotlib.pyplot as plt
- from shapely.geometry import Polygon,Point
- # project lat/long points into meters
- # points_from_xy takes x (longitude) first, then y (latitude)
- trip_start = gp.GeoDataFrame(trip_loc, geometry=gp.points_from_xy(trip_loc.start_long, trip_loc.start_lat), crs='epsg:4326').to_crs('epsg:3310')
- trip_end = gp.GeoDataFrame(trip_loc, geometry=gp.points_from_xy(trip_loc.end_long, trip_loc.end_lat), crs='epsg:4326').to_crs('epsg:3310')
- covid_point = gp.GeoDataFrame(covid_loc, geometry=gp.points_from_xy(covid_loc['Long'], covid_loc['Lat']), crs='epsg:4326').to_crs('epsg:3310')
- covid_point = covid_point[covid_point.geometry.is_valid] # get rid of Inf points from the conversion to meters
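One way to catch bad coordinates before projecting is to check that latitude and longitude are in their legal ranges. Swapped columns or missing values are a plausible (though unconfirmed) source of invalid points after `to_crs`. A minimal pandas-only sketch; `valid_lat_long` and the sample rows are hypothetical, and the column names just mirror covid_loc above.

```python
import pandas as pd

def valid_lat_long(df, lat_col, long_col):
    """Boolean mask: True where latitude is in [-90, 90] and longitude in [-180, 180]."""
    lat = pd.to_numeric(df[lat_col], errors="coerce")
    lon = pd.to_numeric(df[long_col], errors="coerce")
    # between() is False for NaN, so rows with missing coordinates drop out too
    return lat.between(-90, 90) & lon.between(-180, 180)

# Hypothetical rows: the second has latitude and longitude swapped,
# the third is missing its latitude.
points = pd.DataFrame({
    "Lat": [37.77, -122.42, None],
    "Long": [-122.42, 37.77, -118.24],
})

mask = valid_lat_long(points, "Lat", "Long")
clean = points[mask]
print(clean)
```

Filtering with a mask like this before building the GeoDataFrame keeps the projection step from having to deal with impossible inputs; the is_valid filter above can stay as a second line of defense.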