Job Description:
We are looking for a highly skilled bioinformatics scientist who specializes in designing, developing, deploying, and maintaining scalable bioinformatics pipelines on cloud-based infrastructure. The candidate will be responsible for the code base supporting the large-scale genomic processing and analysis pipelines at the Data Analysis Center of an NIH project, Somatic Mosaicism across Human Tissues (SMaHT), which manages multi-omic data (e.g., Illumina/PacBio/ONT Whole Genome Sequencing (WGS), RNA-Seq). The ideal candidate will have a deep understanding of next-generation sequencing (NGS) data analysis, workflow automation, cloud computing, and cloud software engineering best practices. This role will support research and production environments where reproducibility, scalability, and performance are critical.
Responsibilities:
Design, implement, and maintain bioinformatics pipelines for high-throughput sequencing data (e.g., alignment, QC, and variant calling from WGS and RNA-Seq), similar to those in existing repositories: github.com/smaht-dac/main-pipelines
Build reproducible, well-tested, and automated workflows using workflow management systems (particularly CWL).
Architect and manage AWS-based compute infrastructure to support pipeline execution, including automated deployment, scaling, and monitoring.
Containerize workflows using Docker or similar tools for managed execution and portability.
Integrate CI/CD tooling to automate testing, deployment, and versioning, ensuring data integrity and correct pipeline execution.
Develop utility tools for metadata management, file integrity checks, and format conversion (e.g., VCF, BAM to CRAM), and integrate them with the SMaHT Data Portal.
Collaborate cross-functionally with research scientists, engineers, and IT teams to refine requirements and deliver high-quality solutions.
Document code, workflows, and infrastructure configurations clearly.
Qualifications:
An ideal candidate will have a PhD in computational biology/bioinformatics/statistics/CS or another quantitative field, as well as superb programming (Python, shell scripting) and communication skills.
In addition:
Extensive experience with analysis of high-throughput sequencing data and knowledge of bioinformatics tools for sequence alignment, variant calling, sequence data QC, etc.
Proficiency in Docker for creating reproducible execution environments and in the Workflow Description Language for orchestrating complex tasks.
Strong understanding of AWS services (EC2, S3) or similar cloud platforms for compute and storage.
Experience with version control and CI/CD: Git, automated testing, and deployment workflows.
Experience with Linux systems, HPC, and distributed computing environments.
Knowledge of optimizing pipelines for large-scale genomic projects.
Please send a CV to Elizabeth Chun (elizabeth_chun@hms.harvard.edu) or apply here: jobs.smartrecruiters.com/HarvardUniversity/3743990011801881-bioinformatics-software-engineer.