This theme encompasses the hardware, software, statistical, and computational methods that underpin modern genomics data science research.
Storing, analysing, and visualising the large data sets generated by high-throughput genomics assays often requires dedicated local or cloud-based high-performance computing (HPC) infrastructure or specialised hardware such as GPUs and FPGAs, along with algorithms and pipelines that have been specifically developed to make efficient use of these resources. There is also a need (particularly for clinical genomics) to ensure reproducibility of analyses and results between different compute environments, to guarantee data security and integrity, to provide integration with other ‘omics and clinical data, and to facilitate ease of use for non-informatics users.
As well as being large, genomics data sets can be both heterogeneous and sparse. This provides excellent opportunities for the development of new statistical and computational methods, often requiring the use of penalised, non-parametric, Monte Carlo, Bayesian modelling, and causal inference approaches. Machine learning and deep learning algorithms have also become popular, based on both their predictive performance and their ability to identify non-linear features and patterns in high-dimensional genomic data. A key criticism of these approaches, however, is their ‘black-box’ nature, which has led to increasing interest in explainable AI and improved model interpretability.
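To give a flavour of the penalised approaches mentioned above, the following is a minimal, illustrative sketch (not drawn from any specific project) of an L1-penalised (lasso) regression fitted by coordinate descent on simulated sparse, high-dimensional-style data; all variable names, the penalty value, and the simulated data are assumptions chosen for the example.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for the lasso:
    minimise 0.5 * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)           # per-feature squared norms
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with feature j's contribution added back
            r = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ r
            # Soft-thresholding update drives small coefficients to exactly zero
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b

# Simulated data: only the first 3 of 20 features carry signal
rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
true_b = np.zeros(p)
true_b[:3] = [2.0, -1.5, 1.0]
y = X @ true_b + 0.1 * rng.standard_normal(n)

b_hat = lasso_cd(X, y, lam=5.0)
```

The soft-thresholding step is what sets uninformative coefficients exactly to zero, yielding the sparse, interpretable fits that make penalised methods attractive for genomic data where features (e.g. variants or genes) vastly outnumber samples.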
Students interested in the development of analytical pipelines and frameworks; algorithms and data structures; file formats, standards, and compression techniques; data security; virtualisation and containerisation; statistical methods; machine learning; and data visualisation will find a variety of suitable projects under this theme.