Anthony Ramirez & Rob Syme
Aug 27, 2025

Nebulaworks and Nextflow: Scaling Machine-Learning in Pharma

The Pharma ML Challenge: Beyond Proof-of-Concept

Machine learning (ML) is rapidly transforming pharma R&D, with an estimated 54% of companies already incorporating ML into their R&D processes. Yet scaling these models into production is anything but straightforward. Teams face:

  • Fragile architectures: One model per property creates systems that crumble under data complexity.
  • Infrastructure gaps: Legacy pipelines can't handle modern molecular datasets spanning thousands of features.
  • Organizational silos: Disconnected workflows between bench biologists, bioinformaticians, and data scientists.
  • Cost pressures: Traditional approaches waste resources on idle compute and manual processes.

Seqera's Nextflow, combined with Nebulaworks’ implementation expertise, delivers scalable, reproducible ML pipelines that integrate MLflow experiment tracking with portable workflow orchestration, unifying pharmaceutical teams and accelerating drug discovery.

MLflow Integration with Nextflow


The Power of Flexible Orchestration

Nextflow fundamentally transforms how scientists and researchers approach complex computational workflows. At its core, it provides an elegantly simple solution to a persistent challenge: how to take diverse collections of scripts, tools, and analyses, often written in different languages and requiring different computational resources, and orchestrate them into cohesive, scalable pipelines.

What makes Nextflow particularly powerful is its dual commitment to both parallelization at scale and rigorous reproducibility. Researchers can seamlessly execute the same workflow on their laptop during development, then deploy it unchanged to HPC clusters or cloud infrastructure for production runs. Every execution is traceable, every result reproducible, and every computational resource optimally utilized through intelligent task scheduling and caching.



Embracing the Ecosystem: Nextflow + MLflow

Crucially, Nextflow's flexibility means it doesn't force teams into an either-or decision with their existing tools. Its open architecture is designed to integrate with, rather than replace, the specialized frameworks that teams already trust. This is particularly important in pharmaceutical ML, where workflows have traditionally been homogeneous: often written entirely in Python, tightly coupled to specific infrastructure stacks, and managed through integrated platforms or bespoke scripts.

MLflow has become a popular standard for ML lifecycle management, offering critical capabilities for tracking experiments, packaging models, and managing deployments. When combined with Nextflow's orchestration capabilities, teams get the best of both worlds: Nextflow handles the complex workflow orchestration, resource management, and reproducibility across heterogeneous environments, while MLflow provides the specialized ML tracking and governance layer that data scientists expect.

Both tools are open source, and any team can combine them. However, making this integration work smoothly in production at scale benefits from experience with both platforms and the architectural patterns that connect them effectively.



The Nebulaworks Advantage


This is where partnering with Nebulaworks accelerates success. While research teams can certainly integrate these tools themselves, Nebulaworks brings proven experience designing and implementing production-grade Nextflow workflows that seamlessly incorporate MLflow. Their team has already worked through the common integration patterns, from automatic experiment tracking and hierarchical run organization to performance optimization and cloud deployment strategies.

For pharmaceutical teams looking to move quickly from proof-of-concept to production, Nebulaworks offers the fastest path forward. They've developed reusable patterns and best practices that eliminate the trial-and-error phase of integration, letting scientists focus on their experiments rather than infrastructure. The result is a solution that preserves everything teams love about both tools while eliminating the friction of making them work together at scale.

Nebulaworks' Technical Implementation


MLflow Integration Architecture

Nebulaworks has architected an MLflow integration that transforms how pharma teams track and manage their ML experiments.

  • Seamless experiment tracking automatically capturing every model run with parameters and results
  • Hierarchical organization using parent/child run structures for managing complex experimental designs
  • Complete model lineage from raw molecular data to deployed predictions
  • Systematic model registry enabling versioning and controlled deployment pipelines
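
The parent/child run structure listed above can be sketched with MLflow's Python tracking API. This is a minimal illustration under stated assumptions, not Nebulaworks' implementation: `log_sweep`, `train_fn`, and the injectable `tracker` argument are hypothetical names introduced here, while `start_run(..., nested=True)`, `log_params`, and `log_metric` are calls from MLflow's fluent API. Passing the tracker in as an argument is only a convenience so the sketch can be exercised without a tracking server.

```python
from typing import Any, Callable, Dict, Iterable, List


def log_sweep(
    parent_name: str,
    configs: Iterable[Dict[str, Any]],
    train_fn: Callable[[Dict[str, Any]], float],
    tracker: Any = None,
) -> List[float]:
    """Log one parent run with one nested child run per model configuration."""
    if tracker is None:
        # Assumes the mlflow package is installed and a tracking URI is configured.
        import mlflow as tracker

    scores: List[float] = []
    # The parent run groups every configuration of a single experiment.
    with tracker.start_run(run_name=parent_name):
        for index, config in enumerate(configs):
            # nested=True attaches the child run to the active parent run.
            with tracker.start_run(run_name=f"{parent_name}-{index}", nested=True):
                tracker.log_params(config)
                score = train_fn(config)  # hypothetical training callback
                tracker.log_metric("accuracy", score)
                scores.append(score)
    return scores
```

In a Nextflow pipeline, each child run would more likely be opened inside its own task, with the parent run ID passed in as a parameter. The single-process loop here only shows the shape of the hierarchy.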


Cloud-Native Implementation

The implementation leverages cloud-native approaches to optimize performance and cost efficiency for pharmaceutical workloads.

  • AWS infrastructure optimization for pharmaceutical workloads
  • Containerized processes ensuring consistent environments across development and production
  • Cost-effective serverless compute strategies reducing operational overhead
  • DevOps best practices ensuring reliability and regulatory compliance

Combining Nextflow and MLflow


The Nextflow Solution

  • Multi-Framework Approach: ChemProp2 and scikit-learn running in parallel
  • Containerized Processes: Consistent environments across development and production
  • Parameter Sweeping: Automated exploration of model configurations
  • Custom Data Pipelines: Built when existing tools couldn't handle molecular data complexity


Model Training Strategy

  • ChemProp2 Models: Different predictors and aggregation methods
  • Scikit-learn Models: Various algorithms with feature transformations
  • Parallel Execution: Each combination runs as a separate containerized process
  • AWS Fargate: Serverless compute for cost-effective scaling
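
To make the idea of "each combination runs as a separate process" concrete, the sketch below expands a search space into one job description per model configuration. The option names and values in `SEARCH_SPACE` are hypothetical placeholders, since the actual predictors, aggregation methods, algorithms, and transformations used in the project are not listed here. `expand_grid` is likewise an illustrative helper, not part of Nextflow, ChemProp2, or scikit-learn.

```python
from itertools import product
from typing import Any, Dict, List

# Hypothetical search space; the real options used in the project are not given.
SEARCH_SPACE: Dict[str, Dict[str, List[Any]]] = {
    "chemprop": {
        "predictor": ["regression", "classification"],
        "aggregation": ["mean", "sum"],
    },
    "sklearn": {
        "algorithm": ["random_forest", "svm"],
        "transform": ["none", "standardize"],
    },
}


def expand_grid(space: Dict[str, Dict[str, List[Any]]]) -> List[Dict[str, Any]]:
    """Return one job dictionary per framework/option combination."""
    jobs: List[Dict[str, Any]] = []
    for framework, options in space.items():
        names = list(options)
        # Cartesian product over every option list of this framework.
        for values in product(*(options[name] for name in names)):
            jobs.append({"framework": framework, **dict(zip(names, values))})
    return jobs


if __name__ == "__main__":
    for job in expand_grid(SEARCH_SPACE):
        print(job)
```

Each dictionary corresponds to one unit of work that an orchestrator could schedule independently. In Nextflow, the same expansion is typically expressed by combining channels, so that each combination becomes its own containerized task.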


MLflow Integration

  • Experiment Tracking: Every model run is automatically logged with parameters and results
  • Parent/Child Runs: Organized hierarchy for 17,000+ model experiments
  • Reproducibility: Complete lineage from raw SMILES to deployed models
  • Model Registry: Systematic versioning and deployment pipeline

This technical architecture was specifically designed to address a critical pharmaceutical challenge: rapidly identifying molecular compounds with unfavorable properties before costly synthesis and testing. Traditional single-model approaches struggled with the complexity of high-dimensional molecular datasets and the relatively low number of labeled samples available. The ensemble strategy enabled the team to evaluate dozens of diverse modeling approaches simultaneously, from different ChemProp2 configurations to various scikit-learn algorithms, creating robust predictions through model diversity. The scalability challenge was solved through Nextflow's orchestration capabilities, cloud-native scaling, and MLflow's lineage tracking for comprehensive experiment management.
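
How the individual model outputs were combined is not described here, so the following is only one possible reading of "robust predictions through model diversity": a simple vote across models. `ensemble_flag` and its `threshold` parameter are hypothetical names, and majority voting is an assumption rather than the method the team used.

```python
from typing import Sequence


def ensemble_flag(votes: Sequence[bool], threshold: float = 0.5) -> bool:
    """Flag a compound as unfavorable when the share of models voting
    'unfavorable' is strictly greater than the threshold."""
    if not votes:
        raise ValueError("at least one model prediction is required")
    # True counts as 1 and False as 0, so sum() counts the 'unfavorable' votes.
    return sum(votes) / len(votes) > threshold


if __name__ == "__main__":
    # Three hypothetical models: two flag the compound, one does not.
    print(ensemble_flag([True, True, False]))
```

Raising the threshold trades recall for precision: fewer compounds are flagged, but each flag is backed by more of the models.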

Results from 10,000 Model Runs

Ensemble modeling achieved 80% accuracy in identifying unfavorable compounds, while response times improved from over two minutes to under 15 seconds per prediction. Rather than relying on a single model, the pipeline ran dozens of diverse strategies in parallel, so that predictions drew on model diversity. Deployment also accelerated, moving from initial idea to production-ready model in just one day.

The implementation also improved R&D productivity. Manual processes that previously bottlenecked experimentation were replaced with automated pipelines handling thousands of model variations simultaneously.

Key Benefits of the Combination

The integration of Nextflow and MLflow creates powerful synergies that address core challenges of ML in pharma:

  • Seamless Integration: Workflow execution automatically feeds experiment tracking
  • Scalability: From laptop development to cloud production without architecture changes
  • Reproducibility: Complete experimental lineage and version control
  • Community Ecosystem: Leveraging established bioinformatics and ML communities

This translates into faster innovation through rapid experimentation, reduced risk via systematic validation, cost efficiency through pay-per-use compute, and regulatory readiness with built-in audit trails, moving teams beyond proof-of-concept limitations to production-ready solutions.

Conclusion

Running thousands of models in parallel isn't just about computational muscle; it's about creating an environment where scientists can iterate freely without infrastructure constraints. Nextflow provides the flexible orchestration layer that makes this possible, while MLflow ensures every experiment is tracked and every model is production-ready.

For teams ready to move beyond proof-of-concept, the combination is powerful. And with Nebulaworks' experience turning this combination into production reality, pharmaceutical teams can skip the learning curve and start discovering. It's a practical example of how the right tools, combined with the right expertise, can transform how we approach drug discovery.

About Nebulaworks


Nebulaworks bridges the gap between groundbreaking scientific research and the scalable, secure, and production-grade technology needed to deliver discoveries to market. With over a decade of elite cloud and DevOps engineering experience, Nebulaworks specializes in uniting bio/cheminformatics expertise with modern cloud practices, embedding cross-functional teams directly alongside scientists to accelerate outcomes. Their expeditionary approach, prescriptive processes, and enterprise-grade execution ensure that complex biological data becomes actionable intelligence, enabling Life Sciences organizations to move faster, operate with confidence, and achieve transformative business results.