Chapter 2 Prepare data
2.1 Load data
Load the original data files produced by the bioinformatic pipeline.
2.1.6 Genome annotations
Downloading the individual annotation files from ERDA, using the URLs stored in Airtable, and writing them to a single compressed table takes a while. The following chunk therefore needs to be run only once, to generate the genome_annotations table that is saved in the data directory. Note that the Airtable connection requires a personal access token.
airtable("MAGs", "appWbHBNLE6iAsMRV") %>% # get base ID from Airtable browser URL
  read_airtable(., fields = c("ID", "mag_name", "number_genes", "anno_url"), id_to_col = TRUE) %>% # get 4 fields from MAGs table
  filter(mag_name %in% paste0(genome_metadata$genome, ".fa")) %>% # keep only MAGs listed in genome_metadata
  filter(number_genes > 0) %>% # keep only MAGs with at least one gene
  pull(anno_url) %>% # list MAG annotation URLs
  read_tsv() %>% # load all tables
  rename(gene = 1, genome = 2, contig = 3) %>% # rename first 3 columns
  write_tsv(file = "data/genome_annotations.tsv.xz") # write to a single compressed file
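Because the download is slow and needs a token, later sessions can read the saved table instead of rebuilding it. The following is a minimal sketch of that idea, not part of the original workflow: the helper name load_cached_annotations is hypothetical, and it assumes the file was written to data/genome_annotations.tsv.xz as in the chunk above. readr decompresses .xz files transparently.

```r
library(readr)

# Hypothetical helper: return the cached annotation table if it exists,
# otherwise return NULL so the caller knows the download chunk must be run.
load_cached_annotations <- function(path = "data/genome_annotations.tsv.xz") {
  if (!file.exists(path)) {
    message("No cached table at ", path, "; run the download chunk first.")
    return(NULL)
  }
  # read_tsv() detects the .xz extension and decompresses on the fly
  read_tsv(path, show_col_types = FALSE)
}

genome_annotations <- load_cached_annotations()
```

Returning NULL rather than stopping keeps the document knittable on machines without the cached file; a stricter workflow could call stop() instead.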
2.2 Create working objects
Transform the original data files into working objects for downstream analyses.
2.3 Prepare color scheme
AlberdiLab projects use unified color schemes developed for the Earth Hologenome Initiative, to facilitate figure interpretation.
phylum_colors <- read_tsv("https://raw.githubusercontent.com/earthhologenome/EHI_taxonomy_colour/main/ehi_phylum_colors.tsv") %>%
  right_join(genome_metadata, by = join_by(phylum == phylum)) %>%
  arrange(match(genome, genome_tree$tip.label)) %>%
  select(phylum, colors) %>%
  unique() %>%
  arrange(phylum) %>%
  pull(colors, name = phylum)
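The result of pull(colors, name = phylum) is a named character vector, which is the form that ggplot2 manual scales expect. The sketch below illustrates this with toy data only: the object names toy_table, toy_colors, toy_counts and toy_plot are hypothetical, the genome counts are made up, and the hex codes are placeholders rather than the actual EHI colors.

```r
library(dplyr)
library(ggplot2)

# Toy lookup table standing in for the EHI phylum color file
# (placeholder hex codes, not the real EHI palette)
toy_table <- tibble(
  phylum = c("p__Bacillota", "p__Bacteroidota", "p__Pseudomonadota"),
  colors = c("#1b9e77", "#d95f02", "#7570b3")
)

# Named vector: values are the colors, names are the phyla
toy_colors <- pull(toy_table, colors, name = phylum)

# Made-up genome counts per phylum, for illustration only
toy_counts <- tibble(phylum = names(toy_colors), genomes = c(12, 8, 5))

# The named vector maps each phylum to its color regardless of row order
toy_plot <- ggplot(toy_counts, aes(x = phylum, y = genomes, fill = phylum)) +
  geom_col() +
  scale_fill_manual(values = toy_colors) +
  theme_minimal()
```

Because the mapping is by name, the same vector can be reused across figures and each phylum keeps the same color even when a plot shows only a subset of phyla.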