Complete course Intro to SQL for Data Science and chapters Importing data from databases Part 1 and 2 on DataCamp. Read the RSQLite vignette.
There are essentially three ways to send SQL-queries directly from
RStudio to a database mybase
, where we assume
mybase
is an SQLite database stored in file
mybase.sqlite
.
dbGetQuery
after establishing a conection using
dbConnect
as incon <- dbConnect(SQLite(), "mybase.sqlite")
dbGetQuery(con, "SELECT * FROM table")
```{r}
con <- dbConnect(SQLite(), "mybase.sqlite")
```
```{sql, connection = con}
SELECT * FROM table
```
We recommend using the first option within an R script file.
Practise SQL joins at w3schools.
The SQLite database class_files/sthlm_metro.sqlite
contains data on the stops of the Stockholm metro at the platform level,
which was generated for the 2018 class by class_files/sthlm_metro.R
and originates from Trafiklab.se.
Connect to the database, list the tables and figure out how they relate to eachother.
Query for the LineName
from table Line
where LineNumber
is 18.
Query for the number of platforms at each station in descending order.
Query for the five most southern StationName
, where
position is measured as the average Latitude
of its
PlatformNumber
.
Query for the number of stations on LineNumber
18.
Query for all StationName
on LineNumber
18 in alphabetical order.
The SQLite database class_files/pokedex.sqlite
contains the full set of tables from https://github.com/veekun/pokedex/tree/master/pokedex/data/csv
generated by the file class_files/pokedexDB.R
.
Close any existing connections with dbDisconnect
and
connect to the database.
height
and average
weight
of pokemon (pokemon
table).weight
.ghost
-type
pokemon.pokemon_id
,
pokemon_name
and six additional columns giving the pokemons
base_stat
-value in each of categories hp
,
attack
, defense
, special-attack
,
special-defense
and speed
(see tables
stats
and pokemon_stats
). You may use SQL to
ask for the table in long format and convert it to the final wide format
using spread
in R.Ensembl hosts a
public genomic database with a MySQL server. You can connect to the
database holding the human genome using (note the use of
MySQL
rather than SQLite
)
library(RMySQL)
## Warning: package 'RMySQL' was built under R version 4.1.2
## Loading required package: DBI
con <- dbConnect(MySQL(), host = "ensembldb.ensembl.org",
user = "anonymous", password = "",
port = 3306)
we are now connected to a whole set of databases
databases <- dbGetQuery(con, "SHOW DATABASES")
head(databases)
## Database
## 1 information_schema
## 2 PERCONA_SCHEMA
## 3 acanthochromis_polyacanthus_core_100_1
## 4 acanthochromis_polyacanthus_core_101_1
## 5 acanthochromis_polyacanthus_core_102_1
## 6 acanthochromis_polyacanthus_core_103_1
nrow(databases)
## [1] 18853
Yes, plenty of databases. The latest version of the human genome is
in homo_sapiens_core_94_38
, we choose this by
dbSendQuery(con, "USE homo_sapiens_core_94_38")
## <MySQLResult:0,0,1>
dbListTables(con)
## [1] "alt_allele"
## [2] "alt_allele_attrib"
## [3] "alt_allele_group"
## [4] "analysis"
## [5] "analysis_description"
## [6] "assembly"
## [7] "assembly_exception"
## [8] "associated_group"
## [9] "associated_xref"
## [10] "attrib_type"
## [11] "biotype"
## [12] "coord_system"
## [13] "data_file"
## [14] "density_feature"
## [15] "density_type"
## [16] "dependent_xref"
## [17] "ditag"
## [18] "ditag_feature"
## [19] "dna"
## [20] "dna_align_feature"
## [21] "dna_align_feature_attrib"
## [22] "exon"
## [23] "exon_transcript"
## [24] "external_db"
## [25] "external_synonym"
## [26] "gene"
## [27] "gene_archive"
## [28] "gene_attrib"
## [29] "genome_statistics"
## [30] "identity_xref"
## [31] "interpro"
## [32] "intron_supporting_evidence"
## [33] "karyotype"
## [34] "map"
## [35] "mapping_session"
## [36] "mapping_set"
## [37] "marker"
## [38] "marker_feature"
## [39] "marker_map_location"
## [40] "marker_synonym"
## [41] "meta"
## [42] "meta_coord"
## [43] "misc_attrib"
## [44] "misc_feature"
## [45] "misc_feature_misc_set"
## [46] "misc_set"
## [47] "object_xref"
## [48] "ontology_xref"
## [49] "operon"
## [50] "operon_transcript"
## [51] "operon_transcript_gene"
## [52] "peptide_archive"
## [53] "prediction_exon"
## [54] "prediction_transcript"
## [55] "protein_align_feature"
## [56] "protein_feature"
## [57] "repeat_consensus"
## [58] "repeat_feature"
## [59] "seq_region"
## [60] "seq_region_attrib"
## [61] "seq_region_mapping"
## [62] "seq_region_synonym"
## [63] "simple_feature"
## [64] "stable_id_event"
## [65] "supporting_feature"
## [66] "transcript"
## [67] "transcript_attrib"
## [68] "transcript_intron_supporting_evidence"
## [69] "transcript_supporting_feature"
## [70] "translation"
## [71] "translation_attrib"
## [72] "unmapped_object"
## [73] "unmapped_reason"
## [74] "xref"
Check the assembly
table if you wonder how the tables relate to each other… We may take
a closer look at a table by DESCRIBE
dbGetQuery(con, "DESCRIBE gene")
## Field Type Null Key Default Extra
## 1 gene_id int(10) unsigned NO PRI <NA> auto_increment
## 2 biotype varchar(40) NO <NA>
## 3 analysis_id smallint(5) unsigned NO MUL <NA>
## 4 seq_region_id int(10) unsigned NO MUL <NA>
## 5 seq_region_start int(10) unsigned NO <NA>
## 6 seq_region_end int(10) unsigned NO <NA>
## 7 seq_region_strand tinyint(2) NO <NA>
## 8 display_xref_id int(10) unsigned YES MUL <NA>
## 9 source varchar(40) NO <NA>
## 10 description text YES <NA>
## 11 is_current tinyint(1) NO 1
## 12 canonical_transcript_id int(10) unsigned NO MUL <NA>
## 13 stable_id varchar(128) YES MUL <NA>
## 14 version smallint(5) unsigned YES <NA>
## 15 created_date datetime YES <NA>
## 16 modified_date datetime YES <NA>
gene
-table?biotype
in the gene
-table).dna
table.