Starting your FAIR data journey in Pharmaceutical Research? … then make sure to optimize first!

December 27, 2023

John Apathy

The modern biomedical research lab is awash in data… and the problem grows larger every day.

Data generated by leading-edge experimental methods have left Biomedical Research data estates large, complex, and, all too often, poorly managed. Recent advances in laboratory automation and high-throughput approaches have enabled an explosion in techniques such as single-cell and spatial transcriptomics, driving exponential (pun intended) growth in data assets and resources.

One of XponentL’s large pharmaceutical clients claims to have generated more data in 2023 than in the whole of its 100+ year existence!

Multi-omics pursuits built on techniques such as Next-Generation Sequencing (NGS) have pushed the boundaries of what is achievable with traditional data management approaches. High-throughput genomics workloads such as variant calling across an entire human exome can yield more than 5,500,000 rows per sample. Data of this magnitude are not well suited to relational tables, and significant value is lost through partial transformations and the purpose-built tables needed to support even a single analysis.

By pairing modern, scalable data lake, data warehouse, and data lakehouse platforms with high-throughput compute such as Apache Spark, a modern Research Data Platform can handle tables that reach trillions of rows and beyond in a very cost-effective manner.
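
As a minimal sketch of the pattern, assuming a Spark session configured with access to object storage (the bucket, paths, and column names below are illustrative), variant calls can be persisted as a partitioned, columnar table that distributed queries scan efficiently:

```python
from pyspark.sql import SparkSession

# Minimal sketch: persist variant calls as a partitioned, columnar
# lakehouse table. Bucket, paths, and column names are illustrative.
spark = SparkSession.builder.appName("variant-lakehouse").getOrCreate()

# Load one sample's variant calls (e.g., a flattened VCF export).
variants = spark.read.parquet("s3://research-lake/staging/variants/sample_0001/")

# Partition by sample and chromosome so a query touching one region
# reads only a small fraction of the files.
(variants
    .write
    .mode("append")
    .partitionBy("sample_id", "chrom")
    .parquet("s3://research-lake/curated/variant_calls/"))

# Aggregates over the full table scale to billions of rows because
# Spark distributes the scan across executors.
counts = (spark.read.parquet("s3://research-lake/curated/variant_calls/")
          .groupBy("chrom")
          .count())
counts.show()
```

Partitioning by sample and chromosome means a region-level query touches only a fraction of the underlying files, which is what keeps trillion-row tables both tractable and cheap to scan.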

Managing this data deluge presents a significant challenge when making data FAIR (Findable, Accessible, Interoperable, and Reusable), particularly in terms of cost, convenience, and usability. Data strategies increasingly rely on hybrid cloud environments, a design pattern that has emerged to strike the best balance among access control, collaboration, data sharing, and economics, lowering the total cost of acquiring, storing, analyzing, and reporting results from data. Making optimal use of the powerful components of any modern research data platform requires a deep understanding of the overarching research strategies and objectives, so that the full life cycle of data and compute resources can be managed.

Why Optimize?

With many Life Sciences Research enterprises only just beginning to migrate their research data to cloud-based solutions, a significant economic problem has been, or soon will be, created: petabytes of data are placed in expensive, readily available storage, or must be moved to analytic workflows and compute, incurring additional cost. Beyond the inefficiency of the storage itself, many of the files used during analysis are merely in-process artifacts sitting in a high-cost object store (e.g., S3, ADLSv2) that may hold no sustained (re-use) value for the research objectives.

In the absence of a data strategy that accounts for the economics of full data life-cycle management, data piles up and unnecessarily consumes costly resources.

In our experience, hundreds of thousands of dollars a month can be saved by profiling and optimizing lab data, rapidly returning value to the organization to fund further discovery, innovation, and breakthroughs.
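
To make the economics concrete, here is a back-of-the-envelope calculation. The per-GB prices are illustrative round numbers in the neighborhood of published US-region list prices at the time of writing; actual pricing varies by provider, region, and retrieval requirements.

```python
# Back-of-the-envelope storage-tiering savings. Prices are illustrative
# and vary by provider, region, and retrieval needs.
PB_IN_GB = 1_000_000

hot_per_gb_month = 0.023        # standard object storage (e.g., S3 Standard)
archive_per_gb_month = 0.00099  # deep archive tier (e.g., Glacier Deep Archive)

petabytes_misplaced = 10        # cold data sitting in hot storage

hot = petabytes_misplaced * PB_IN_GB * hot_per_gb_month
cold = petabytes_misplaced * PB_IN_GB * archive_per_gb_month

print(f"hot storage:  ${hot:,.0f}/month")
print(f"deep archive: ${cold:,.0f}/month")
print(f"savings:      ${hot - cold:,.0f}/month")
# Roughly $22,000/month saved per petabyte moved; at 10 PB the savings
# land squarely in the 'hundreds of thousands of dollars' range.
```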


Optimizing Research Lab Data for Cost and Convenience in a Hybrid Cloud Environment

Traditional on-premises (on-prem) infrastructure often cannot cope with the increasing volume and demands of high-dimensional data, leading to:

  • High infrastructure costs: Purchasing and maintaining hardware and software, along with power and cooling, can be a significant burden, especially for smaller research labs.

  • Limited scalability: On-prem systems often lack the flexibility to scale quickly and efficiently to accommodate bursts in data generation or analysis.

  • Reduced data accessibility: Siloed data across different systems can hinder collaboration between researchers and slow down the pace of discovery.

  • Security vulnerabilities: On-prem systems may be more susceptible to cyberattacks than cloud-based solutions.

In addition to cost considerations, convenience and ease of use are often the most crucial factors for researchers. They need easy, rapid access to their data from anywhere, at any time, regardless of location or device. This is especially important for collaborative research projects involving teams that span labs, departments, institutions, and geographies.

Additionally, as data science and computational (in-silico) biology methods push drug discovery into the AI age, there is an ever-growing need to scale access to instrument-derived data from high-dimensional biological assays through query languages such as SQL. A common way to supplement individual data files for analysis has been to load derived data into traditional relational databases (Oracle, PostgreSQL, MySQL, etc.). Such tables provide data integrity and fast lookups; however, as volumes grow into the hundreds of millions to billions of rows, cycle times lengthen, access becomes problematic, and performance degrades.
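
One widely used alternative is to keep derived data in columnar files and expose them through a distributed SQL engine. The sketch below assumes a Spark environment; the paths, view, and column names are hypothetical:

```python
from pyspark.sql import SparkSession

# Sketch: SQL access over columnar files instead of a relational load.
# Paths, view, and column names are hypothetical.
spark = SparkSession.builder.appName("assay-sql").getOrCreate()

spark.read.parquet("s3://research-lake/curated/assay_results/") \
     .createOrReplaceTempView("assay_results")

# Familiar SQL for researchers; the engine parallelizes the scan.
top_hits = spark.sql("""
    SELECT compound_id, AVG(activity) AS mean_activity
    FROM assay_results
    WHERE assay_type = 'dose_response'
    GROUP BY compound_id
    ORDER BY mean_activity DESC
    LIMIT 20
""")
top_hits.show()
```

Researchers keep the SQL they already know, while the engine parallelizes the scan across executors, so performance holds up at row counts where a single-node relational table struggles.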

Consider this an essential first step in making Research data FAIR!


Optimizing Biomedical Research Data in a Hybrid Cloud: Strategies and Best Practices

By implementing the following strategies and best practices, research labs can optimize their data for cost and convenience in a hybrid cloud environment:

Data classification and tiering:

Classify data: Identify different types of data based on their access requirements, security needs, and frequency of use.

Tier data: Store frequently accessed data in the public cloud for easy access and performance. Archive less frequently accessed data in the private cloud for cost-efficiency.
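
One AWS-flavored way to enforce tiering automatically is an object-storage lifecycle rule that transitions data to cheaper storage classes as it ages. A minimal sketch follows, assuming a bucket named research-lake and a raw-instrument prefix (both illustrative); Azure and GCP offer equivalent policies:

```python
import boto3

# Sketch of automated tiering on AWS S3. Bucket, prefix, and day
# thresholds are illustrative assumptions.
s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="research-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-raw-instrument-data",
                "Filter": {"Prefix": "raw/instrument/"},
                "Status": "Enabled",
                "Transitions": [
                    # Infrequent access after 30 days, deep archive after 180.
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    },
)
```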

Data compression and optimization:

Compress data: Reduce data storage requirements and minimize bandwidth usage by compressing data before storing it in the cloud.

Optimize data formats: Use efficient data formats to further reduce storage costs and improve data transfer speeds.
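
As a concrete example of format optimization, the sketch below converts a delimited instrument export to compressed, columnar Parquet with pyarrow; the file names and the zstd codec choice are assumptions for illustration:

```python
import pyarrow.csv as pv
import pyarrow.parquet as pq

# Sketch: convert a delimited instrument export to compressed,
# columnar Parquet. File names and codec choice are illustrative.
table = pv.read_csv("sample_0001_counts.csv")

pq.write_table(
    table,
    "sample_0001_counts.parquet",
    compression="zstd",  # good ratio/speed trade-off; snappy is a common default
)
# Columnar layout plus compression commonly yields several-fold size
# reductions on wide numeric tables, cutting storage and transfer costs.
```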

Data lifecycle management:

Implement a data lifecycle management policy: Define clear rules for data retention, deletion, and archiving to ensure compliance with regulations and optimize storage costs.

Automate data lifecycle management tasks: Automate data lifecycle management processes to free up researchers’ time and reduce the risk of human error.
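
In the same AWS-flavored vein as the tiering sketch above, expiration rules can automate cleanup of in-process scratch outputs; the prefix and 90-day window below are assumptions to be aligned with your own retention policy:

```python
import boto3

# Sketch: automated retention for in-process/scratch outputs. The
# prefix and retention window are illustrative.
s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="research-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-scratch-outputs",
                "Filter": {"Prefix": "scratch/"},
                "Status": "Enabled",
                "Expiration": {"Days": 90},
                # Also clean up failed multipart uploads, a common
                # source of invisible storage cost.
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```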

Data security and compliance:

Encrypt data at rest and in transit: Implement strong encryption measures to protect sensitive data from unauthorized access (see the sketch after this list).

Use cloud security services: Leverage the built-in security features and compliance offerings of cloud providers to ensure data protection and regulatory compliance.
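
On AWS, for instance, encryption at rest can be enforced as a bucket default in a few lines, using the provider's built-in security services; the bucket name and KMS key alias below are hypothetical:

```python
import boto3

# Sketch: enforce encryption at rest by default for every new object.
# Bucket name and KMS key alias are hypothetical.
s3 = boto3.client("s3")

s3.put_bucket_encryption(
    Bucket="research-lake",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "alias/research-lake-key",
                },
                "BucketKeyEnabled": True,  # reduces KMS request costs
            }
        ]
    },
)
# Encryption in transit is typically handled by requiring TLS, e.g.,
# via a bucket policy that denies non-HTTPS requests.
```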

Cloud cost management:

Monitor and analyze cloud usage: Regularly monitor cloud resource utilization and identify areas where costs can be optimized (see the sketch after this list).

Optimize cloud resource allocation: Right-size cloud instances to ensure optimal performance and avoid paying for unused resources.

Negotiate with cloud providers: Pursue discounts and other cost-saving arrangements, such as committed-spend or reserved-capacity pricing.
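
As one example of usage monitoring, the sketch below pulls month-to-date spend by service from AWS Cost Explorer (assuming the account has Cost Explorer enabled); the dates and reporting threshold are illustrative:

```python
import boto3

# Sketch: month-to-date spend by service via AWS Cost Explorer.
# Dates and threshold are illustrative; Azure and GCP expose
# comparable cost APIs.
ce = boto3.client("ce")

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2023-12-01", "End": "2023-12-27"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

for group in resp["ResultsByTime"][0]["Groups"]:
    service = group["Keys"][0]
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    if amount > 100:  # surface only material line items
        print(f"{service}: ${amount:,.2f}")
```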

Data governance and access control:

Establish data governance policies: Implement clear policies and procedures for data access, sharing, and ownership to ensure proper data management and compliance with regulations.

Leverage IAM services: Utilize cloud-based identity and access management (IAM) services to control data access and ensure that only authorized users can access sensitive information.
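
As a small, AWS-flavored illustration, the sketch below grants a single analysis role read access to one sensitive prefix; the role, policy, bucket, and prefix names are hypothetical placeholders:

```python
import json
import boto3

# Sketch: scope an analysis role's read access to one sensitive prefix.
# Role, policy, bucket, and prefix names are hypothetical.
iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::research-lake/clinical/*",
        }
    ],
}

iam.put_role_policy(
    RoleName="clinical-analysis",
    PolicyName="clinical-prefix-read",
    PolicyDocument=json.dumps(policy),
)
```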

Adopting the right tools:

Utilize data management platforms: Leverage cloud-based data management platforms to simplify data organization, access, and analysis.

Use data transfer tools: Employ efficient data transfer tools to move data between cloud environments and on-prem systems (see the sketch after this list).

Explore data analytics tools: Utilize cloud-based data analytics tools to gain valuable insights from research data and accelerate discoveries.
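
As one example of an efficient transfer tool, most cloud SDKs ship managed multipart transfers out of the box. The sketch below uses boto3's transfer manager; the file names and tuning values are illustrative:

```python
import boto3
from boto3.s3.transfer import TransferConfig

# Sketch: managed multipart upload of a large instrument file.
# Thresholds and concurrency are illustrative tuning knobs.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MB
    max_concurrency=8,                     # parallel part uploads
)

boto3.client("s3").upload_file(
    Filename="run_0042.fastq.gz",
    Bucket="research-lake",
    Key="raw/instrument/run_0042.fastq.gz",
    Config=config,
)
```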

Collaboration and training:

Promote collaboration: Foster a culture of collaboration within the research lab to encourage efficient data sharing and resource utilization.

Provide training: Train researchers on best practices for data management and cloud computing to ensure optimal data usage.


Ready to Optimize?

While most of the industry has already moved its “corporate” data and workloads to the cloud, biomedical research lab data has often been “left behind” due to the complexity of legacy systems and bespoke data management approaches. A hybrid cloud environment offers the potential to address many of these challenges by combining the benefits of both public and private clouds:

  • Drive cost-effectiveness: Cloud services offer on-demand scalability and pay-as-you-go pricing, allowing research labs to pay only for the resources they need.

  • Enhance flexibility: Hybrid clouds can be easily scaled to accommodate changing data volumes and computational needs.

  • Improve accessibility: Data stored in the cloud can be accessed from anywhere with an internet connection, facilitating collaboration and remote work.

  • Increase security: Cloud providers offer a high level of security and compliance, mitigating the risk of cyberattacks and data loss.

XponentL has helped clients discover, profile, and optimize their research data and compute landscapes, bringing deep expertise and significant resources to modernizing data architectures and creating reusable data products and schemas that encompass the full data life cycle and analysis at near-limitless scale. These data can enhance data science workflows and machine learning model generation, serve as building blocks for the growing needs of generative artificial intelligence, and reduce the economic burden of the modern research lab data estate.

Starting your FAIR data journey? We encourage you to reach out and contact us at XponentL Data (john.apathy@xponentl.ai) for a conversation on how we might help to optimize your Research data and computing landscape.

About the Authors —

John Apathy, Chief Solutions Officer, Life Sciences, XponentL Data — Data and Digital Transformation enthusiast with 35+ years of industry experience in the Pharmaceutical R&D Data and Technology field. (john.apathy@xponentl.ai)

Andrew Brown, Ph.D., Managing Director, Life Sciences, XponentL Data — Computational Genomics and Research Data Platform Engineering enthusiast with 20 years of software development, machine learning, and data platform experience in academia and the biopharmaceutical industry.