-
Notifications
You must be signed in to change notification settings - Fork 3
Embedded SQL
- Embedding SQL in other environments (Stata, R, etc.)
- Using a SQL query to populate a data structure
- Examples with RStudio and Stata
- HoE: Try embedded SQL with R, Stata, SAS, SPSS, Python, etc.
- Discussion
- Action Items (readings, videos and tasks)
install.packages("RSQLite")
library(RSQLite)
hist <- dbConnect(SQLite(),'places.sqlite')
> dbListTables(hist)
[1] "moz_anno_attributes" "moz_annos" "moz_bookmarks"
[4] "moz_bookmarks_roots" "moz_favicons" "moz_historyvisits"
[7] "moz_hosts" "moz_inputhistory" "moz_items_annos"
[10] "moz_keywords" "moz_places" "sqlite_sequence"
[13] "sqlite_stat1"
> dbListFields(hist,'moz_places')
[1] "id" "url" "title" "rev_host"
[5] "visit_count" "hidden" "typed" "favicon_id"
[9] "frecency" "last_visit_date" "guid"
> dbGetQuery(hist,'SELECT substr(url,0,26) as link,frecency from moz_places where link like "http%" order by frecency desc limit 3')
link frecency
1 https://www.google.com/ 2100
2 http://www.washington.edu/ 2000
3 http://www.slashdot.org/ 150
See also:
We will be creating a SQL table from an R data.frame and running this SQL query on it:
SELECT row_names AS State, Murder, Assault,
ROUND(100 * Murder / Assault, 1) AS MurderAssaultRatio
FROM arrests
ORDER BY MurderAssaultRatio DESC
LIMIT 10
Our query will SELECT
the top-ten states having the highest ratio of murders to assaults, which we will report as a percentage. This sort of analysis is may lead to further inquiries about which state may have a "murder problem" or a "reporting problem".
Here is how we will create the database and run the query in R.
> data("USArrests")
> str(USArrests)
'data.frame': 50 obs. of 4 variables:
$ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
$ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
$ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
$ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
> library("RSQLite")
Loading required package: DBI
> drv <- dbDriver("SQLite")
> sqlfile <- tempfile(tmpdir="~", fileext=".sqlite")
> sqlfile
[1] "~/file1ea87682fba.sqlite"
> con <- dbConnect(drv, dbname = sqlfile)
> data(USArrests)
> dbWriteTable(con, "arrests", USArrests)
[1] TRUE
> query <- "SELECT sql from sqlite_master WHERE type='table' and name = 'arrests';"
> cat(dbGetQuery(con, query)$sql)
CREATE TABLE arrests
( row_names TEXT,
Murder REAL,
Assault INTEGER,
UrbanPop INTEGER,
Rape REAL
);
> dbGetQuery(con, "SELECT COUNT(*) FROM arrests")[1, ]
[1] 50
> dbListTables(con)
[1] "arrests"
> dbListFields(con, "arrests")
[1] "row_names" "Murder" "Assault" "UrbanPop" "Rape"
> query <- "SELECT row_names AS State, Murder, Assault,
ROUND(100 * Murder / Assault, 1) AS MurderAssaultRatio
FROM arrests
ORDER BY MurderAssaultRatio DESC
LIMIT 10"
> dbGetQuery(con, query)
State Murder Assault MurderAssaultRatio
1 Hawaii 5.3 46 11.5
2 Kentucky 9.7 109 8.9
3 Georgia 17.4 211 8.2
4 Tennessee 13.2 188 7.0
5 West Virginia 5.7 81 7.0
6 Indiana 7.2 113 6.4
7 Texas 12.7 201 6.3
8 Louisiana 15.4 249 6.2
9 Mississippi 16.1 259 6.2
10 Ohio 7.3 120 6.1
And we can see that this same query result can be produced from the Bash shell using the sqlite3
command, using the same SQL SELECT statement. First we will set .header on
, .mode column
, and .width 15
so that the formatting of the output will be similar. We do this with an "init file" which we will create with echo
, which is a shell command that sends a text string to a file. The commands look like this:
$ echo -e ".header on\n.mode column\n.width 15" > sqlite.init
$ QUERY="SELECT row_names AS State, Murder, Assault,
ROUND(100 * Murder / Assault, 1) AS MurderAssaultRatio
FROM arrests
ORDER BY MurderAssaultRatio DESC
LIMIT 10"
$ sqlite3 -init sqlite.init ~/file1ea87682fba.sqlite "$QUERY"
-- Loading resources from sqlite.init
State Murder Assault MurderAssaultRatio
--------------- ---------- ---------- ------------------
Hawaii 5.3 46 11.5
Kentucky 9.7 109 8.9
Georgia 17.4 211 8.2
Tennessee 13.2 188 7.0
West Virginia 5.7 81 7.0
Indiana 7.2 113 6.4
Texas 12.7 201 6.3
Louisiana 15.4 249 6.2
Mississippi 16.1 259 6.2
Ohio 7.3 120 6.1
See also:
The R syntax to perform the same operations using MySQL as the database is almost the same as for SQLite. The only difference is there is not need to create the SQLite file and the dbConnect() function needs different parameters. Here we are connecting to a MySQL server on the same host (MyServer) as our R session and our username is MyUser and password is MyPassword. We are assuming that the "MyUser" user and "MyUser" database already exist. For example, they may been created with these SQL commands from the mysql
command-line utility as a user with enhanced privileges.
CREATE USER 'MyUser'@'localhost' IDENTIFIED BY 'MyPassword';
CREATE DATABASE MyUser;
GRANT ALL ON MyUser.* TO 'MyUser'@'localhost';
Here are the R commands...
> install.packages("RMySQL")
> library("RMySQL")
Loading required package: DBI
> drv <- dbDriver("MySQL")
> con <- dbConnect(drv, host="TheServer", dbname="MyUser", user="MyUser", password="MyPassword")
> data("USArrests")
> str(USArrests)
'data.frame': 50 obs. of 4 variables:
$ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
$ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
$ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
$ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
> dbListTables(con)
character(0)
> dbWriteTable(con, "arrests", USArrests)
[1] TRUE
> dbGetQuery(con, "SELECT COUNT(*) FROM arrests")[1, ]
[1] 50
> dbListTables(con)
[1] "arrests"
> dbListFields(con, "arrests")
[1] "row_names" "Murder" "Assault" "UrbanPop" "Rape"
> dbGetQuery(con, "SELECT row_names AS State, Murder, Assault, ROUND(100 * Murder / Assault, 1) AS MurderAssaultRatio FROM arrests ORDER BY MurderAssaultRatio DESC LIMIT 10")
State Murder Assault MurderAssaultRatio
1 Hawaii 5.3 46 11.5
2 Kentucky 9.7 109 8.9
3 Georgia 17.4 211 8.2
4 Tennessee 13.2 188 7.0
5 West Virginia 5.7 81 7.0
6 Indiana 7.2 113 6.4
7 Texas 12.7 201 6.3
8 Louisiana 15.4 249 6.2
9 Mississippi 16.1 259 6.2
10 Ohio 7.3 120 6.1
See also:
Here we will create the same table and run the same (or similar) query as in the last example, but using PostgreSQL from the same host the as database server runs on (TheServer). (See the comment in the code for connecting from another host.) Also, we are assuming that our account and default database already exists.
Here is the query, slightly modified to work with PostgreSQL:
SELECT "row.names" AS "State", "Murder", "Assault",
ROUND(CAST(100 * "Murder" / "Assault" AS NUMERIC), 1) AS "MurderAssaultRatio"
FROM private.arrests
ORDER BY "MurderAssaultRatio" DESC
LIMIT 10
You will notice that we had to quote the column names, the underscore (_) in "row.names" has been replaced by a period (.) and we had to CAST the percentage calculation result. Also, we have to refer to the table as private.arrests
since we are creating it in our private
schema. Using this schema is more secure, as the default schema is public
. However, it will make the commands a little more complex.
Since the private
schema may not already exist, you may wish to run these commands from the psql
utility first. This should only need to be done once.
$ psql "sslmode=require"
psql (9.1.12)
Type "help" for help.
MyUsername=> CREATE SCHEMA private AUTHORIZATION MyUsername;
ERROR: schema "private" already exists
MyUsername=> \q
You will also see, below, how the database connection is a little more complex than with SQLite, but very similar to connections with MySQL databases.
> data("USArrests")
> str(USArrests)
'data.frame': 50 obs. of 4 variables:
$ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
$ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
$ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
$ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
> install.packages("RPostgreSQL")
> library(RPostgreSQL)
Loading required package: DBI
> drv <- dbDriver("PostgreSQL")
> con <- dbConnect(drv)
# Or, if you are connecting from another host...
# con <- dbConnect(drv,host="TheServer",user="YourUsername", password="YourPassword")
# In either case, we did not specify the database name with dbname="MyDatabase" because
# we are using the default in this example. We will however, choose to specify the `private`
# schema in subsequent commands.
> dbWriteTable(con, c("private","arrests"), USArrests)
[1] TRUE
> rs <- dbSendQuery(con, 'SELECT "row.names" AS "State", "Murder", "Assault", ROUND(CAST(100 * "Murder" / "Assault" AS NUMERIC), 1) AS "MurderAssaultRatio" FROM private.arrests ORDER BY "MurderAssaultRatio" DESC LIMIT 10')
> df <- fetch(rs, n = 10)
> df
State Murder Assault MurderAssaultRatio
1 Hawaii 5.3 46 11.5
2 Kentucky 9.7 109 8.9
3 Georgia 17.4 211 8.2
4 West Virginia 5.7 81 7.0
5 Tennessee 13.2 188 7.0
6 Indiana 7.2 113 6.4
7 Texas 12.7 201 6.3
8 Louisiana 15.4 249 6.2
9 Mississippi 16.1 259 6.2
10 Ohio 7.3 120 6.1
> dbClearResult(rs)
[1] TRUE
If we connect from another host, we have to embed our database password in our code, which is not very secure. So, it is better to run the code from the same host that runs the server. That way, we do not have to authenticate within the code. If you do store the password in your code, be very careful about securing your code and any history files generated by your statistics package. Some will save a history file without prompting you, and that file will contain any command you may have typed at the prompt. Our server is configured to encrypt all network traffic (through SSL) by default. Make sure your server is configured this way before connecting to it with a password.
Interestingly, it is a lot less work to run from the PostgreSQL command-line utility.
$ psql "sslmode=require"
psql (9.1.12)
Type "help" for help.
MyUsername=> SELECT "row.names" AS "State", "Murder", "Assault", ROUND(CAST(100 * "Murder" / "Assault" AS NUMERIC), 1) AS "MurderAssaultRatio" FROM private.arrests ORDER BY "MurderAssaultRatio" DESC LIMIT 10;
State | Murder | Assault | MurderAssaultRatio
---------------+--------+---------+--------------------
Hawaii | 5.3 | 46 | 11.5
Kentucky | 9.7 | 109 | 8.9
Georgia | 17.4 | 211 | 8.2
West Virginia | 5.7 | 81 | 7.0
Tennessee | 13.2 | 188 | 7.0
Indiana | 7.2 | 113 | 6.4
Texas | 12.7 | 201 | 6.3
Louisiana | 15.4 | 249 | 6.2
Mississippi | 16.1 | 259 | 6.2
Ohio | 7.3 | 120 | 6.1
(10 rows)
MyUsername=> \q
See also: