Chapter 7 Microbiome data preparation
7.1 Load data
7.1.1 Count table
This is the table containing the number of sequencing reads from each sample that have been mapped to each MAG. Note that these are raw data that need to be processed further before running any statistics on them.
read_counts <- read_tsv("https://sid.erda.dk/share_redirect/CfgcycDTiJ/beho_mgquant/results/stats/coverm_genome_beho_mags.count.tsv") %>%
  rename_all(~ str_remove_all(., ".lib1 Read Count")) %>% # simplify column names to plain sample identifiers
  rename(genome = 1) # first column contains the MAG identifiers
Generate a vector of genome names to be employed for filtering and sorting data in downstream steps.
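A minimal sketch, assuming the MAG identifiers in the first column of read_counts are the names to retain:
genomes <- read_counts %>%
  pull(genome) # vector of MAG names used for filtering and sorting downstream objects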
7.1.2 Base hit table
This is the table containing the number of nucleotide bases that have been covered by at least one read in each sample and MAG. This information is used to calculate MAG coverage values.
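As a sketch, this table could be loaded analogously to the count table above; the file name (covered_bases) and the ".lib1 Covered Bases" column suffix are assumptions based on the CoverM output used for the count table:
base_hits <- read_tsv("https://sid.erda.dk/share_redirect/CfgcycDTiJ/beho_mgquant/results/stats/coverm_genome_beho_mags.covered_bases.tsv") %>% # hypothetical file name
  rename_all(~ str_remove_all(., ".lib1 Covered Bases")) %>% # simplify column names (assumed suffix)
  rename(genome = 1)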
7.1.3 Sample metadata table
Sample metadata are loaded from the ERDA repository and combined with the dominance data, whose sampling times are first recoded to match the treatment labels.
dominance_data <- read_tsv(dir("data", pattern = "dominance_t\\d\\.tsv", full.names = TRUE)) %>%
  select(-cage) %>%
  mutate(time = case_when(
    time == "t1" ~ "OP",
    time == "t2" ~ "HT",
    time == "t3" ~ "CT",
    time == "t4" ~ "DT",
    time == "t5" ~ "AN",
    time == "t6" ~ "T3",
    TRUE ~ NA_character_)) # recode sampling times into treatment labels
sample_metadata <- read_csv("https://sid.erda.dk/share_redirect/CfgcycDTiJ/BeHo_sample_metadata.csv") %>%
  rename(sample = ERDA_sample_ID) %>%
  mutate(animal = substr(mouse_ID, 1, 5)) %>%
  mutate(cage = substr(mouse_ID, 1, 3)) %>%
  mutate(treatment = substr(sample, 6, 7)) %>%
  mutate(treatment = ifelse(treatment == "CD", "CT", treatment)) %>%
  mutate(group = ifelse(cage == "C04" | cage == "C13", "invariable", "variable")) %>%
  select(sample, animal, cage, group, treatment) %>%
  left_join(dominance_data, by = join_by(animal == mouse, treatment == time))
7.1.4 Genome metadata
Relevant genome metadata are fetched from separate files (taxonomy and genome lengths) and merged into a single genome metadata object for downstream analyses.
7.1.4.1 Genome taxonomy
This is the raw taxonomy table generated by GTDB-Tk, which is simplified for downstream analyses.
genome_taxonomy <- read_tsv("https://sid.erda.dk/share_redirect/CfgcycDTiJ/mag_catalogue_raphael/gtdbtk.bac120.summary.tsv") %>%
  rename(genome = user_genome) %>%
  mutate(genome = str_replace_all(genome, "\\.fa", "")) %>%
  separate(classification, c("domain", "phylum", "class", "order", "family", "genus", "species"), sep = ";") %>%
  select(genome, domain, phylum, class, order, family, genus, species) %>%
  mutate(phylum = case_when(
    phylum == "p__Firmicutes" ~ "p__Bacillota",
    phylum == "p__Firmicutes_A" ~ "p__Bacillota_A",
    phylum == "p__Firmicutes_B" ~ "p__Bacillota_B",
    phylum == "p__Proteobacteria" ~ "p__Pseudomonadota",
    phylum == "p__Actinobacteriota" ~ "p__Actinomycetota",
    TRUE ~ phylum)) %>% # Replace phylum names for GTDB r214 taxonomy to match EHI taxonomy colors
  arrange(match(genome, genomes))
7.1.4.2 Genome lengths
This information is needed to calculate coverage values and to normalise read counts by genome length.
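A sketch of how the lengths could be read and combined with the taxonomy into the genome metadata object; the file path and column layout (genome, length in bp) are assumptions:
genome_lengths <- read_tsv("data/genome_lengths.tsv") %>% # hypothetical path: table with genome and length (bp) columns
  rename(genome = 1, length = 2)
genome_metadata <- genome_taxonomy %>%
  left_join(genome_lengths, by = "genome") %>% # merge taxonomy and genome lengths
  arrange(match(genome, genomes)) # sort to match the genome name vector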
7.1.5 Genome tree
This is the raw tree generated by GTDB-Tk, which needs to be pruned to obtain the phylogenetic tree of the MAGs.
genome_tree <- read.tree(url("https://sid.erda.dk/share_redirect/CfgcycDTiJ/mag_catalogue_raphael/gtdbtk.backbone.bac120.classify.tree"))
genome_tree$tip.label <- str_replace_all(genome_tree$tip.label,"\\.fa", "") #remove .fa extension
genome_tree <- keep.tip(genome_tree, tip=genomes) # keep only MAG tips
7.2 Filter and normalise data
Raw data needs to be filtered and normalised to make it useful for downstream analyses.
7.2.1 Generate coverage table
By dividing the number of base hits by the length of each genome, coverage values can be calculated.
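A sketch of this calculation, assuming the base hit table has been loaded into base_hits (as sketched in section 7.1.2) and that it shares the same genome order as genome_metadata:
genome_coverage <- base_hits %>%
  arrange(match(genome, genomes)) %>% # align row order with genome_metadata
  mutate(across(-genome, ~ .x / genome_metadata$length)) # fraction of each genome covered by mapped reads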
7.2.2 Coverage filtering
Genomes that have less than 30% of their length covered by reads are turned into zeros to account for the random allocation of reads across genomes due to mapping heuristics.
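A sketch of this filtering, followed by the length normalisation described above, assuming read_counts, genome_coverage and genome_metadata share the same genome order and sample column names; the 30% threshold comes from the text, the object names are assumptions:
min_coverage <- 0.3
read_counts_filt <- read_counts %>%
  arrange(match(genome, genomes)) %>% # align row order with genome_coverage and genome_metadata
  mutate(across(-genome, ~ ifelse(genome_coverage[[cur_column()]] >= min_coverage, .x, 0))) # zero out genomes with <30% coverage
genome_counts <- read_counts_filt %>%
  mutate(across(-genome, ~ .x / genome_metadata$length)) # normalise filtered counts by genome length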
7.3 Wrap working objects
In the last step, the objects that are needed for downstream analyses are stored in an R object.
save(dominance_data, read_counts, genome_counts, genome_tree, genome_metadata, sample_metadata, genome_gifts, file = "data/data.Rdata")
- read_counts: Number of reads mapped to each genome in each sample. Note that this is the unfiltered and unnormalised raw community composition table.
- genome_counts: Abundance of each genome in each sample, calculated by filtering and normalising read_counts. This is the community composition table to be used in downstream analyses unless otherwise stated.
- genome_tree: Phylogenetic tree of the genomes, to be employed in downstream phylogenetic analyses.
- genome_metadata: Taxonomic and quality information of the genomes.
- genome_gifts: Genome-inferred functional traits of the genomes, to be employed in downstream functional analyses.
- sample_metadata: Treatment/population and other relevant metadata of the samples.
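In later chapters, these working objects can be restored in a single step:
load("data/data.Rdata") # restores all of the objects listed above into the environment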