Senior Data Engineer

National Cancer Institute’s (NCI)
Division of Cancer Epidemiology and Genetics (DCEG)
Rockville, Maryland, United States
dceg.cancer.gov

Description

The National Cancer Institute’s (NCI) Division of Cancer Epidemiology and Genetics (DCEG) is seeking a Senior Data Engineer for the Connect data analytics team to support a new prospective cohort study that will recruit 200,000 adults in the United States. The study is designed to further investigate the etiology of cancer and its outcomes, which may inform new approaches in precision prevention and early detection. This new cohort study will capitalize on research innovations to advance the field of cancer epidemiology and prevention.

System description
The Connect 4 Cancer system is built primarily on the Google Cloud Platform (GCP), with the goal of maximizing use of GCP managed services. The primary user interfaces and data collection systems are built on GCP Firestore, which exposes an extensive API, and a set of client-side JavaScript applications that use this API. Collaborating clinical sites and study management contractors also access this API directly from their own internal systems. Data, mostly in the form of participant-specific JSON documents, are created in GCP Firestore and regularly moved into GCP BigQuery tables. Data pipelines are developed primarily in Python and SQL, and reporting and analytics workflows are developed primarily in R and SQL.
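
For orientation, the following is a minimal sketch of the kind of Firestore-to-BigQuery step described above: it reads participant documents from a Firestore collection, flattens the nested JSON, and appends the rows to a BigQuery table using the Google Cloud Python clients. The project, collection, and table names are placeholders for illustration only, not the actual Connect resources.

    from google.cloud import bigquery, firestore

    # Placeholder identifiers (assumptions, not the real Connect resources).
    PROJECT_ID = "my-connect-project"
    SOURCE_COLLECTION = "participants"
    TARGET_TABLE = "analytics.participants_flat"


    def flatten(doc: dict, parent_key: str = "", sep: str = "_") -> dict:
        """Flatten nested Firestore JSON into a single-level dict for BigQuery."""
        items = {}
        for key, value in doc.items():
            new_key = f"{parent_key}{sep}{key}" if parent_key else key
            if isinstance(value, dict):
                items.update(flatten(value, new_key, sep))
            else:
                # Lists, timestamps, etc. are passed through unchanged in this sketch.
                items[new_key] = value
        return items


    def main() -> None:
        fs = firestore.Client(project=PROJECT_ID)
        bq = bigquery.Client(project=PROJECT_ID)

        # Read participant documents and flatten each one.
        rows = [flatten(snap.to_dict()) for snap in fs.collection(SOURCE_COLLECTION).stream()]

        # Append the flattened rows to an existing BigQuery table.
        errors = bq.insert_rows_json(TARGET_TABLE, rows)
        if errors:
            raise RuntimeError(f"BigQuery insert errors: {errors}")


    if __name__ == "__main__":
        main()

In production such a step would typically be scheduled and orchestrated (for example with Cloud Composer/Airflow, per the qualifications below) rather than run as a standalone script.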

Work description
The successful candidate for this position will assume data engineering, management, and analytical responsibilities for Connect 4 Cancer data, beginning with their deposition into GCP BigQuery. The role sits within the Connect 4 Cancer data analytics team, which is developing tools to support a large cohort over an expected project lifetime of more than 25 years. This duration imposes strong requirements for software flexibility and modularity, and for avoiding lock-in to public or private standards that may prove transient. Further, because this is a human subjects study, all developers, and the systems they create, must remain cognizant of regulatory compliance with respect to patient protections and pay strict attention to where data are and where data go.


Qualifications

Minimum Qualifications:
• Master’s degree in computer science, data science, a related field, or equivalent practical experience (approximately 5 additional years of experience).
• Experience working with biologic or epidemiologic data
• 5 years of experience with large-scale, multi-source data collection and analysis. Analytical engagements outside of coursework while in school may be included.
• Proficiency with Google Cloud Platform (BigQuery, Cloud Scheduler, Cloud Functions, Cloud Build, Cloud Run, Cloud Storage, Cloud Composer, gcloud CLI) or equivalent cloud platforms and services
• Proficiency with R, Python, and SQL
• Experience with data pipeline orchestration tools, especially Apache Airflow
• Strong foundation in data modeling and data warehousing
• Familiarity with Docker containerization
• Experience using data visualization tools (such as R Shiny, ggplot2, matplotlib, Quarto)
• Working knowledge of Continuous Integration/Continuous Delivery (CI/CD) with GitHub (repositories, Actions workflows, Pages, Issues, Projects)

Responsibilities
• Lead development of researcher-facing data warehouse, ensuring that data are appropriately cleaned, curated, de-identified, and secure
• Leverage Google Cloud Platform (GCP) resources to streamline data processing and automation
• Maintain an ETL to flatten Firestore (NoSQL) data for analysis in BigQuery (SQL)
• Develop and maintain reporting pipelines for survey, recruitment, and biospecimen datasets
• Develop R-backed APIs using plumber
• Maintain Docker containers in which reporting scripts are executed in GCP
• Enhance and optimize QAQC frameworks for the Connect database (a minimal example check appears after this list)
• Guide development of stakeholder-facing data products, e.g., shiny dashboards
• Partner with DevOps to create and implement agreed upon data structures
• Apply experience with the OMOP Common Data Model and related QAQC
• Thoroughly document all ETL processes, pipelines, and decision-making
• Mentor junior analysts, conduct code reviews, and uphold coding standards
• Respond promptly to ad hoc requests for back-end testing and database issue resolution
• Assist partner sites with access to data and code resources
• Write well-documented, modular, and reusable code according to FAIR principles
• Remain current with cloud computing and data engineering advancements, particularly in GCP, R, and Python
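
As a concrete illustration of the QAQC responsibility noted above, the sketch below runs a simple duplicate-identifier check against a BigQuery table using the Python client. The table and column names are hypothetical placeholders, not the actual Connect schema.

    from google.cloud import bigquery

    # Placeholder identifiers (assumptions, not the real Connect schema).
    TABLE = "analytics.participants_flat"
    ID_COLUMN = "participant_id"


    def count_duplicate_ids(client: bigquery.Client) -> int:
        """Return the number of participant IDs that appear more than once."""
        query = f"""
            SELECT {ID_COLUMN}, COUNT(*) AS n
            FROM `{TABLE}`
            GROUP BY {ID_COLUMN}
            HAVING n > 1
        """
        result = client.query(query).result()
        return sum(1 for _ in result)


    if __name__ == "__main__":
        client = bigquery.Client()
        dupes = count_duplicate_ids(client)
        print(f"Duplicate participant IDs: {dupes}")
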

Additional details:
Pay is commensurate with qualifications. This position is eligible to be fully remote within the US (with up to 5% travel). This is a contract position, arranged through a government contractor. Interested candidates should send a CV, a list of relevant software applications and degrees of competency, professional references, and a cover letter to Davin Johnson at Davin@cirruslakesolutions.com.


Start date

As soon as possible

How to Apply

Email Davin Johnson at Davin@cirruslakesolutions.com


Contact

Davin Johnson, Davin@cirruslakesolutions.com