We propose to explore distinct approaches for creating a pan-genome graph or adding information to it. The simplest approach is to map new sequences onto the graph, recording newly discovered variants and annotating existing ones. However, when the graph becomes too complex and/or too large, it may be worthwhile to split it into two or more sub-graphs. The objective of this ESR will be to determine the best strategy to adopt depending on data size and complexity, from high-quality, trustworthy sequences (perfectly assembled genomes) to lower-quality sequences (poorly assembled data) or even unassembled sequences.
The recruited Ph.D. student will explore new ways of incrementally extending pan-genome graph(s) with novel sequences or pre-determined variants while maintaining graph features (annotations, indexing). Special attention will also be given to graph-splitting strategies, for cases where the graph becomes too complex and/or too large and is better split into two or more sub-graphs.
Sub-graphs can also be seen as hierarchical, storing distinct levels of information (variants, strains, species, metadata, ...).
Candidates must have a strong interest and expertise in algorithmics, data structures, and languages such as C/C++/Rust.
Knowledge of genomics and biology, or at least a strong motivation to learn them, is a prerequisite.
Candidates should be proficient in English (academic level).
Apply through the form on our website: jobs.inria.fr/public/classic/fr/offres/2020-03187