Do this before class

Complete course Intro to SQL for Data Science and chapters Importing data from databases Part 1 and 2 on DataCamp. Read the RSQLite vignette.

During class

There are essentially three ways to send SQL-queries directly from RStudio to a database mybase, where we assume mybase is an SQLite database stored in file mybase.sqlite.

con <- dbConnect(SQLite(), "mybase.sqlite")
dbGetQuery(con, "SELECT * FROM table")
```{r}
con <- dbConnect(SQLite(), "mybase.sqlite")
```

```{sql, connection = con}
SELECT * FROM table
```

We recommend using the first option within an R script file.

SQL joins

Practise SQL joins at w3schools.

SQL for Storstockholms Lokaltrafik (SL)

The SQLite database class_files/sthlm_metro.sqlite contains data on the stops of the Stockholm metro at the platform level, which was generated for the 2018 class by class_files/sthlm_metro.R and originates from Trafiklab.se.

  • Connect to the database, list the tables and figure out how they relate to eachother.

  • Query for the LineName from table Line where LineNumber is 18.

  • Query for the number of platforms at each station in descending order.

  • Query for the five most southern StationName, where position is measured as the average Latitude of its PlatformNumber.

  • Query for the number of stations on LineNumber 18.

  • Query for all StationName on LineNumber 18 in alphabetical order.

Pokemon SQL

The SQLite database class_files/pokedex.sqlite contains the full set of tables from https://github.com/veekun/pokedex/tree/master/pokedex/data/csv generated by the file class_files/pokedexDB.R.

Close any existing connections with dbDisconnect and connect to the database.

  • Query for the average height and average weight of pokemon (pokemon table).
  • Query for the number of pokemon of less than average weight.
  • Query for the average weight of ghost-type pokemon.
  • Query for a table with columns pokemon_id, pokemon_name and six additional columns giving the pokemons base_stat-value in each of categories hp, attack, defense, special-attack, special-defense and speed (see tables stats and pokemon_stats). You may use SQL to ask for the table in long format and convert it to the final wide format using spread in R.

Connecting to an external database

Ensembl hosts a public genomic database with a MySQL server. You can connect to the database holding the human genome using (note the use of MySQL rather than SQLite)

library(RMySQL)
## Warning: package 'RMySQL' was built under R version 4.1.2
## Loading required package: DBI
con <- dbConnect(MySQL(), host = "ensembldb.ensembl.org",
                user = "anonymous", password = "", 
                port = 3306)

we are now connected to a whole set of databases

databases <- dbGetQuery(con, "SHOW DATABASES")
head(databases)
##                                 Database
## 1                     information_schema
## 2                         PERCONA_SCHEMA
## 3 acanthochromis_polyacanthus_core_100_1
## 4 acanthochromis_polyacanthus_core_101_1
## 5 acanthochromis_polyacanthus_core_102_1
## 6 acanthochromis_polyacanthus_core_103_1
nrow(databases)
## [1] 18853

Yes, plenty of databases. The latest version of the human genome is in homo_sapiens_core_94_38, we choose this by

dbSendQuery(con, "USE homo_sapiens_core_94_38")
## <MySQLResult:0,0,1>
dbListTables(con)
##  [1] "alt_allele"                           
##  [2] "alt_allele_attrib"                    
##  [3] "alt_allele_group"                     
##  [4] "analysis"                             
##  [5] "analysis_description"                 
##  [6] "assembly"                             
##  [7] "assembly_exception"                   
##  [8] "associated_group"                     
##  [9] "associated_xref"                      
## [10] "attrib_type"                          
## [11] "biotype"                              
## [12] "coord_system"                         
## [13] "data_file"                            
## [14] "density_feature"                      
## [15] "density_type"                         
## [16] "dependent_xref"                       
## [17] "ditag"                                
## [18] "ditag_feature"                        
## [19] "dna"                                  
## [20] "dna_align_feature"                    
## [21] "dna_align_feature_attrib"             
## [22] "exon"                                 
## [23] "exon_transcript"                      
## [24] "external_db"                          
## [25] "external_synonym"                     
## [26] "gene"                                 
## [27] "gene_archive"                         
## [28] "gene_attrib"                          
## [29] "genome_statistics"                    
## [30] "identity_xref"                        
## [31] "interpro"                             
## [32] "intron_supporting_evidence"           
## [33] "karyotype"                            
## [34] "map"                                  
## [35] "mapping_session"                      
## [36] "mapping_set"                          
## [37] "marker"                               
## [38] "marker_feature"                       
## [39] "marker_map_location"                  
## [40] "marker_synonym"                       
## [41] "meta"                                 
## [42] "meta_coord"                           
## [43] "misc_attrib"                          
## [44] "misc_feature"                         
## [45] "misc_feature_misc_set"                
## [46] "misc_set"                             
## [47] "object_xref"                          
## [48] "ontology_xref"                        
## [49] "operon"                               
## [50] "operon_transcript"                    
## [51] "operon_transcript_gene"               
## [52] "peptide_archive"                      
## [53] "prediction_exon"                      
## [54] "prediction_transcript"                
## [55] "protein_align_feature"                
## [56] "protein_feature"                      
## [57] "repeat_consensus"                     
## [58] "repeat_feature"                       
## [59] "seq_region"                           
## [60] "seq_region_attrib"                    
## [61] "seq_region_mapping"                   
## [62] "seq_region_synonym"                   
## [63] "simple_feature"                       
## [64] "stable_id_event"                      
## [65] "supporting_feature"                   
## [66] "transcript"                           
## [67] "transcript_attrib"                    
## [68] "transcript_intron_supporting_evidence"
## [69] "transcript_supporting_feature"        
## [70] "translation"                          
## [71] "translation_attrib"                   
## [72] "unmapped_object"                      
## [73] "unmapped_reason"                      
## [74] "xref"

Check the assembly table if you wonder how the tables relate to each other… We may take a closer look at a table by DESCRIBE

dbGetQuery(con, "DESCRIBE gene")
##                      Field                 Type Null Key Default          Extra
## 1                  gene_id     int(10) unsigned   NO PRI    <NA> auto_increment
## 2                  biotype          varchar(40)   NO        <NA>               
## 3              analysis_id smallint(5) unsigned   NO MUL    <NA>               
## 4            seq_region_id     int(10) unsigned   NO MUL    <NA>               
## 5         seq_region_start     int(10) unsigned   NO        <NA>               
## 6           seq_region_end     int(10) unsigned   NO        <NA>               
## 7        seq_region_strand           tinyint(2)   NO        <NA>               
## 8          display_xref_id     int(10) unsigned  YES MUL    <NA>               
## 9                   source          varchar(40)   NO        <NA>               
## 10             description                 text  YES        <NA>               
## 11              is_current           tinyint(1)   NO           1               
## 12 canonical_transcript_id     int(10) unsigned   NO MUL    <NA>               
## 13               stable_id         varchar(128)  YES MUL    <NA>               
## 14                 version smallint(5) unsigned  YES        <NA>               
## 15            created_date             datetime  YES        <NA>               
## 16           modified_date             datetime  YES        <NA>
  • What is the primary key in the gene-table?
  • Query for the total number of protein-coding genes in the human genome (see biotype in the gene-table).
  • Query for the first 10 DNA-sequences in the dna table.