Chapter 7 Microbiome data preparation

7.1 Load data

7.1.1 Count table

This is the document containing the number of sequencing reads from each sample have been mapped to each MAG. Note that this is the raw data that needs to be further processed before running any statistics on them.

read_counts <- read_tsv("https://sid.erda.dk/share_redirect/CfgcycDTiJ/beho_mgquant/results/stats/coverm_genome_beho_mags.count.tsv") %>%
  rename_all(~ str_remove_all(., ".lib1 Read Count")) %>% #simplify column names
  rename(genome = 1)

Generate a vector of genome names to be employed for filtering and sorting data in downstream steps.

genomes <- read_counts$genome # create list of genome names

7.1.2 Base hit table

This is the document containing the number of nucleotide bases have been covered by at least one read in each sample and MAG. This information is used to calculate MAG coverage values.

basehits <- read_tsv("https://sid.erda.dk/share_redirect/CfgcycDTiJ/beho_mgquant/results/stats/coverm_genome_beho_mags.covered_bases.tsv") %>%
  rename_all(~ str_remove_all(., ".lib1 Covered Bases")) %>% #simplify column names
  rename(genome = 1) %>%
  arrange(match(genome, genomes))

7.1.3 Sample metadata table

dominance_data <- read_tsv(dir("data", pattern = "dominance_t\\d\\.tsv", full.names = TRUE)) %>% select(-cage) %>%
    mutate(time = case_when(
        time == "t1" ~ "OP",
        time == "t2" ~ "HT",
        time == "t3" ~ "CT",
        time == "t4" ~ "DT",
        time == "t5" ~ "AN",
        time == "t6" ~ "T3",
        TRUE ~ NA))

sample_metadata <- read_csv("https://sid.erda.dk/share_redirect/CfgcycDTiJ/BeHo_sample_metadata.csv") %>%
   rename(sample=ERDA_sample_ID) %>%
    mutate(animal = substr(mouse_ID, 1, 5)) %>%
    mutate(cage = substr(mouse_ID, 1, 3)) %>%
    mutate(treatment = substr(sample, 6, 7)) %>%
    mutate(treatment = ifelse(treatment == "CD","CT",treatment)) %>%
    mutate(group = ifelse(cage == "C04" | cage == "C13","invariable","variable")) %>%
    select(sample,animal,cage,group,treatment) %>%
    left_join(dominance_data,by=join_by(animal==mouse, treatment==time))

7.1.4 Genome metadata

Relevant metadata of genomes is fetched from 2-3 files and merged into one genome metadata object for downstream analyses.

7.1.4.1 Genome taxonomy

This is the raw taxonomy table generated by GTDBtk, which is simplified for downstream analyses.

genome_taxonomy <- read_tsv("https://sid.erda.dk/share_redirect/CfgcycDTiJ/mag_catalogue_raphael/gtdbtk.bac120.summary.tsv") %>%
  rename(genome = user_genome) %>%
  mutate(genome = str_replace_all(genome,"\\.fa", "")) %>%
  separate(classification, c("domain","phylum","class","order","family","genus","species"),  sep =";") %>%
  select(genome,domain,phylum,class,order,family,genus,species) %>%
  mutate(phylum = case_when(
        phylum == "p__Firmicutes" ~ "p__Bacillota",
        phylum == "p__Firmicutes_A" ~ "p__Bacillota_A",
        phylum == "p__Firmicutes_B" ~ "p__Bacillota_B",
        phylum == "p__Proteobacteria" ~ "p__Pseudomonadota",
        phylum == "p__Actinobacteriota" ~ "p__Actinomycetota",
        TRUE ~ phylum)) %>% #Replace phylum names for GTDB r214 taxonomy to match EHI taxonomy colors
  arrange(match(genome, genomes))

7.1.4.2 Genome lengths

This information is needed to calculate coverage values and to normalise read counts by genome length.

genome_lengths <- read_tsv("https://sid.erda.dk/share_redirect/CfgcycDTiJ/beho_mgquant/results/stats/coverm_genome_beho_mags.length.tsv") %>%
  rename(genome=1,length=2) %>% #remove redundancy and rename
  select(genome,length)

7.1.4.3 Genome quality

Quality properties of the genomes.

genome_quality <- read_csv("https://sid.erda.dk/share_redirect/CfgcycDTiJ/mag_catalogue_raphael/checkm_all.csv") %>%
  mutate(genome = str_replace_all(genome,"\\.fa.gz", "")) %>% #remove suffix
  filter(genome %in% genomes) #retain only dereplicated genomes

7.1.4.4 Merged metadata object

Merge taxonomy, length and quality information

genome_metadata <- genome_taxonomy %>%
  left_join(genome_lengths,by=join_by(genome==genome)) %>% #join lengths - not used because Genome quality already contains this information
  left_join(genome_quality,by=join_by(genome==genome)) #join quality

7.1.5 Genome tree

This is the raw tree generated by GTDBtk, which needs to be pruned to obtain the phylogenetic tree of the MAGs.

genome_tree <- read.tree(url("https://sid.erda.dk/share_redirect/CfgcycDTiJ/mag_catalogue_raphael/gtdbtk.backbone.bac120.classify.tree"))
genome_tree$tip.label <- str_replace_all(genome_tree$tip.label,"\\.fa", "") #remove .fa extension
genome_tree <- keep.tip(genome_tree, tip=genomes) # keep only MAG tips

7.1.6 Genome functional annotations

This is the raw annotation table generated by DRAM, which is used to generate GIFT data using distillR.

genome_annotations <- paste0("https://sid.erda.dk/share_redirect/CfgcycDTiJ/mag_catalogue_raphael/DRAM_annotations/",genomes,"_annotations.tsv.gz") %>%
  read_tsv() %>% #load all annotations into a single table object
  rename(gene=1,genome=2) #rename first two columns

7.2 Filter and normalise data

Raw data needs to be filtered and normalised to make it useful for downstream analyses.

7.2.1 Generate coverage table

By dividing the number of base hits by the length of each genome, coverage values can be calculated.

genome_coverage <- basehits %>%
  mutate(across(where(is.numeric), ~ ./genome_metadata$length))

7.2.2 Coverage filtering

Genomes that have less than 30% of their length covered by reads are turned into zeros to account for the random allocation of reads across genomes due to mapping heuristics.

min_coverage=0.3
read_counts_filt <- genome_coverage %>%
  mutate(across(where(is.numeric), ~ ifelse(. > min_coverage, 1, 0))) %>%
  mutate(across(-1, ~ . * read_counts[[cur_column()]]))

7.2.3 Generate genome count table

After filtering the low-coverage reads, read counts are transformed into genome counts using genome-length and read-length information.

readlength=150 #change if sequencing read length is different
genome_counts <- read_counts_filt %>%
  mutate(across(where(is.numeric), ~ . / (genome_metadata$length / readlength) ))

7.2.4 Distil functional annotations

Raw functional annotations are distilled into genome-inferred functional traits to generate biologically more meaningful functional traits for downstream analyses.

genome_gifts <- distill(genome_annotations,GIFT_db,genomecol=2,annotcol=c(9,10,19))

7.3 Wrap working objects

In the last step, the objects that are needed for downstream analyses are stored in an R object.

save(dominance_predictions,read_counts, genome_counts, genome_tree, genome_metadata, sample_metadata, genome_gifts, file = "data/data.Rdata")

read_counts: Number of reads mapped to each genome in each sample. Note this is the unfiltered and unnormalised raw community composition table.
genome_counts: Number of genomes quantified in each sample, calculated through filtering and normalising read_counts. This is the community composition table to be used in downstream analyses unless otherwise stated.
genome_tree: Phylogenetic tree of the genomes, to be employed in downstream phylogenetic analyses.
genome_metadata: Taxonomic and quality information of the genomes.
genome_gifts: Genome-inferred functional traits of the genomes, to be employed in downstream functional analyses.
sample_metadata: Treatment/population and other relevant metadata of the samples.