SQL via sqlite#
In this page, we’ll cover some of the basics of SQL (structured querry language) by working through some examples. SQL is a set of language standards for databases, so we have to choose a specific implementation. We’ll use sqlite for this purpose. As its name implies, sqlite is a small implementation of SQL.
In my linux implementation, sqlite3 was pre-installed. Here’s a tutorial on installing for windows. Sqlite3 is a single file.
We’ll first create a database at the command line. Notice when we create a file
command prompt> sqlite3 class.db
sqlite> create table class(id int primary key, lname text,fname)
sqlite> insert into class values (1, "Wayne", "Bruce");
sqlite> insert into class values (2, "Jennifer", "Walters");
sqlite> .header on
sqlist> .mode column
sqlite> select * from class;
id lname fname
-- -------- -------
1 Wayne Bruce
2 Jennifer Walters
sqlite> .quit
The command
sqlite3 class.db
opens up the database, in this case creating a new one, and then enters into th sqlite command line.The command
create ...
creates our table within our databaseThe
insert ...
commands insert two recordsThe
.header ...
and.mode ...
commands format outputThe
select ...
command grabs all recordsThen
.quit
just quits the commmand line.
Performing an ls
in the current working directory now shows the file class.db
. Everything else we discuss below assumes working in the sqlite command prompt.
To work with sqlite, it’s nice to work with a development environment specifically created for sql. Specifically, one with nice highlighting and autocompletion. Since I’m writing these notes in jupyter, I’m just pasting code output.
It’s possible to use magic commands (like %%sql
) to use sqlite in jupyter cells. However, some of the sql commands are not allowed (like the pragma commands). Personally, I find this to be more trouble thant it’s worth. So I would suggest just using python, which has nice sql APIs if one wants to work in jupyter lab.
Sqlite has SQL commands, which must be typed with a semicolon at the end, and sqlite specific commands, which begin with a period and the pragma
commands, which are also sqlite specific. This is good to remember, since some things will be portable to other SQL implementations and others not.
]
A more reaslistic example#
Let’s create and work with a more realistic example. Consider the data Opiods in the US at Open Case Studies opencasestudies/ocs-bp-opioid-rural-urban as described here. Read over their writeup, as we’re mostly going to be showing how to duplicate a lot of their steps in sqlite.
First, you need to download the data, which you can do with a command:
wget https://raw.githubusercontent.com/opencasestudies/ocs-bp-opioid-rural-urban/master/data/simpler_import/county_pop_arcos.csv
wget https://raw.githubusercontent.com/opencasestudies/ocs-bp-opioid-rural-urban/master/data/simpler_import/land_area.csv
wget https://raw.githubusercontent.com/opencasestudies/ocs-bp-opioid-rural-urban/master/data/simpler_import/county_annual.csv
If you don’t have wget or would rather just use the direct links, you can use:
https://raw.githubusercontent.com/opencasestudies/ocs-bp-opioid-rural-urban/master/data/simpler_import/county_annual.csv and then saving the data.
Next, let’s import them into sqlite
command prompt> sqlite3 opioid.db
sqlite> .mode csv
sqlite> .import county_pop_arcos.csv population
sqlite> .import county_annual.csv annual
sqlite> .import land_area.csv land
sqlite> .tables
annual land population
What variables do the tables include? The pragma
command is unique to sqlite and contains a bunch of helper functions.
sqlite> pragma table_info(population);
cid name type notnull dflt_value pk
--- ------------ ---- ------- ---------- --
0 TEXT 0 0
1 BUYER_COUNTY TEXT 0 0
2 BUYER_STATE TEXT 0 0
3 countyfips TEXT 0 0
4 STATE TEXT 0 0
5 COUNTY TEXT 0 0
6 county_name TEXT 0 0
7 NAME TEXT 0 0
8 variable TEXT 0 0
9 year TEXT 0 0
10 population TEXT 0 0
sqlite> pragma table_info(annual);
cid name type notnull dflt_value pk
--- ------------ ---- ------- ---------- --
0 TEXT 0 0
1 BUYER_COUNTY TEXT 0 0
2 BUYER_STATE TEXT 0 0
3 year TEXT 0 0
4 count TEXT 0 0
5 DOSAGE_UNIT TEXT 0 0
6 countyfips TEXT 0 0
sqlite> pragma table_info(land)
cid name type notnull dflt_value pk
--- ----------- ---- ------- ---------- --
0 TEXT 0 0
1 Areaname TEXT 0 0
2 STCOU TEXT 0 0
3 LND010190F TEXT 0 0
4 LND010190D TEXT 0 0
5 LND010190N1 TEXT 0 0
(I truncated this latter output at 5.)
Working with data#
Let’s print out a few columns of the population data.
sqlite> select BUYER_COUNTY, BUYER_STATE, STATE, COUNTY, year, population from population limit 5;
BUYER_COUNTY BUYER_STATE STATE COUNTY year population
------------ ----------- ----- ------ ---- ----------
AUTAUGA AL 1 1 2006 51328
BALDWIN AL 1 3 2006 168121
BARBOUR AL 1 5 2006 27861
BIBB AL 1 7 2006 22099
BLOUNT AL 1 9 2006 55485
The limit 5
prints out five rows. Let’s perform some of the tasks in the write up. For example, they want to print out some of the missing data in the annual dataset.
sqlite> select * from annual where countyfips = "NA" limit 10;
BUYER_COUNTY BUYER_STATE year count DOSAGE_UNIT countyfips
--- ------------ ----------- ---- ----- ----------- ----------
188 ADJUNTAS PR 2006 147 102800 NA
189 ADJUNTAS PR 2007 153 104800 NA
190 ADJUNTAS PR 2008 153 45400 NA
191 ADJUNTAS PR 2009 184 54200 NA
192 ADJUNTAS PR 2010 190 56200 NA
193 ADJUNTAS PR 2011 186 65530 NA
194 ADJUNTAS PR 2012 138 57330 NA
195 ADJUNTAS PR 2013 138 65820 NA
196 ADJUNTAS PR 2014 90 59490 NA
197 AGUADA PR 2006 160 49200 NA
Here, we used the condition “NA” to test for missingness, since the CSV files have the string NA values for missing data. Places other than Puerto Rico (PR)? Lets check some
sqlite> select * from annual where countyfips = "NA" and BUYER_STATE != "PR" limit 10;
BUYER_COUNTY BUYER_STATE year count DOSAGE_UNIT countyfips
----- ------------ ----------- ---- ----- ----------- ----------
10072 GUAM GU 2006 319 265348 NA
10073 GUAM GU 2007 330 275600 NA
10074 GUAM GU 2008 313 286900 NA
10075 GUAM GU 2009 390 355300 NA
10076 GUAM GU 2010 510 413800 NA
10077 GUAM GU 2011 559 475600 NA
10078 GUAM GU 2012 616 564800 NA
10079 GUAM GU 2013 728 623200 NA
10080 GUAM GU 2014 712 558960 NA
17430 MONTGOMERY AR 2006 469 175390 NA
Inspect the missing data further on your own. It looks like its the unincorporated territories and a handful of Arkansas values missing countyfips
(Federal Information Processing Standard). Specifically, Montgomery county AR is missing FIPs codes. Since we want to look US states in specific, excluding territories, we will just set the Montgomery county ones to the correct value 05097 and ignore the other missing values.
sqlite> update annual set countyfips = 05097 where BUYER_STATE = "AR" and BUYER_COUNTY = "MONTGOMERY"
sqlite> select * from annual where BUYER_STATE = "AR" and BUYER_COUNTY = "MONTGOMERY"
BUYER_COUNTY BUYER_STATE year count DOSAGE_UNIT countyfips
----- ------------ ----------- ---- ----- ----------- ----------
17430 MONTGOMERY AR 2006 469 175390 5097
17431 MONTGOMERY AR 2007 597 241270 5097
17432 MONTGOMERY AR 2008 561 251760 5097
17433 MONTGOMERY AR 2009 554 244160 5097
Now lets delete rows from the annual
table that have missing county data. Check on these counties before and verify that the’ve been deleted afterwards. Also, we want to grab just three columns from the land
table, so let’s create a new one called land_area
. Also, the column there is called STCOU
, which we want to rename to coutyfips
. (I’m going to stop printing out the results of every step, so make sure you’re checking your work as you go.)
sqlite> delete from annual where BUYER_COUNTY = "NA"
sqlite> create table land_area as select Areaname, STCOU, LND110210D from land;
sqlite> alter table land_area rename column STCOU to countyfips;
Next we want to start joining the tables, so let’s left join our table and print out the counts to make sure we accounted correctly.
sqlite> create table county_info as select * from population left join land_area using(countyfips);
sqlite> select count(*) from land;
3198
sqlite> select count(*) from land_area;
3198
sqlite> select count(*) from county_info;
28265
sqlite> select count(*) from population;
Notes#
At this point, hopefully you have enough of a background to finish doing the example from Open Case Studies. I have to say, that working with SQL is pleasant, but I prefer python as a home base. In addition, after working with the data, I want to use plotting and analysis tools. In the next chapter, we’ll look at using python as a base language to interact with an sqlite database.