Wednesday, May 6, 2026

Data collection and repositories


Summary 

Data collection isn't just routine; it's a core way we build knowledge today. How we gather, check, and use information shapes research across many fields. People collect data through surveys, sensors, online transactions, and digital interactions. Each method carries its own assumptions about reliability and validity. Meanwhile, the mix of structured, semi-structured, and unstructured data shows the complexity of modern information systems.
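To make the three data forms mentioned above concrete, here is a minimal Python sketch. The survey rows, sensor event, and free-text comment are all made-up examples, not data from any cited source, and the labels in the comments are a simplified reading of each category.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields
# (hypothetical survey export).
survey_csv = "respondent_id,age,satisfaction\n1,34,4\n2,51,5\n"
survey_rows = list(csv.DictReader(io.StringIO(survey_csv)))

# Semi-structured: self-describing keys, but fields may vary between
# records (hypothetical sensor event).
sensor_event = json.loads(
    '{"sensor": "t-01", "reading": 21.5, "tags": ["lab", "indoor"]}'
)

# Unstructured: free text with no predefined schema
# (hypothetical open-ended survey comment).
comment = "The checkout page was slow, but support resolved my issue quickly."

# Each form needs a different kind of handling before analysis.
average_age = sum(int(row["age"]) for row in survey_rows) / len(survey_rows)
has_tags = "tags" in sensor_event      # optional fields must be checked
word_count = len(comment.split())      # text needs its own processing step

print(average_age, has_tags, word_count)
```

The point of the sketch is only that each form carries different assumptions: columns can be trusted to exist in the structured case, must be checked in the semi-structured case, and have to be derived in the unstructured case.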

But there's a catch: data collection must follow ethical and legal rules. Austin et al. (2017) argue that accuracy, transparency, and GDPR compliance aren't optional; they're essential if data is to be trusted or reused. Data quality is therefore a governance issue, not just a technical one. For an overview, see SAGE Research Methods: Data Collection. Watch: Data Collection Methods Explained (YouTube).

Repositories aren't passive storage; they're active governance systems that turn raw data into lasting resources. By adding metadata, enforcing standards, and supporting collaboration, they make data citable and reusable. The shift from data warehouses to data lakes and lakehouses shows how repositories balance efficiency with flexibility. In academia, research repositories operationalize open science, making datasets discoverable and reproducible (Harvard Biomedical Data Management, n.d.). Explore OpenDOAR. Watch: What Is a Data Repository? (YouTube).

Done right, repositories offer major benefits. They centralize datasets, strengthen collaboration, and protect information through encryption and compliance controls. Cloud-native repositories scale easily and work with AI tools for pattern detection (Airbyte, 2025). But challenges remain: integrating different data formats without losing meaning, managing rising storage costs, and sustaining performance under heavy workloads. Tenopir et al. (2011) highlight a lasting tension between openness and security, one that needs policy choices, not just technical fixes. Read the study: Tenopir et al. (2011). Watch: Data Governance and Security (YouTube).

To address these tensions, the FAIR principles (Findable, Accessible, Interoperable, and Reusable) have become the standard for repository governance. These are practical imperatives guiding design, policy, and auditing. Automated governance, zero-trust security, and long-term preservation strengthen modern repositories. Wilkinson et al. (2016) show that FAIR-aligned communities produce more collaborative and reproducible research. Data alone isn't that valuable; its worth comes from infrastructure that preserves, contextualizes, and mobilizes it for future inquiry. As global data markets grow, repositories are essential for innovation, compliance, and cross-sector collaboration. Read Wilkinson et al. (2016). Watch: FAIR Data Principles Explained (YouTube).
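As a rough illustration of how the four FAIR letters could feed an automated audit, here is a minimal Python sketch. It is a deliberate simplification: the DatasetRecord fields, the OPEN_FORMATS set, and the fair_gaps function are hypothetical names chosen for this example, and the one-check-per-principle rules are assumptions, not the detailed criteria set out by Wilkinson et al. (2016).

```python
from dataclasses import dataclass

# Assumed list of open formats, for illustration only.
OPEN_FORMATS = {"csv", "json", "xml", "parquet"}


@dataclass
class DatasetRecord:
    """Hypothetical metadata record a repository might hold for one dataset."""
    identifier: str = ""    # persistent identifier such as a DOI
    title: str = ""
    access_url: str = ""    # where the data can be retrieved
    file_format: str = ""
    license: str = ""       # reuse terms


def fair_gaps(record: DatasetRecord) -> list[str]:
    """Return the FAIR principles this record appears to miss."""
    gaps = []
    if not record.identifier or not record.title:
        gaps.append("findable")
    if not record.access_url.startswith("https://"):
        gaps.append("accessible")
    if record.file_format.lower() not in OPEN_FORMATS:
        gaps.append("interoperable")
    if not record.license:
        gaps.append("reusable")
    return gaps


# Example record with made-up values and no license stated.
record = DatasetRecord(
    identifier="doi:10.1234/example",
    title="Example survey responses",
    access_url="https://repository.example.org/datasets/1",
    file_format="CSV",
)
print(fair_gaps(record))  # ['reusable']
```

A real audit would need far more than four checks, but even this small version shows why FAIR works as a governance tool: each principle can be turned into a question a repository can ask of every record it holds.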

References

Airbyte. (2025). What is a data repository? Definition and examples. Airbyte. https://airbyte.com/data-engineering-resources/data-repository

Austin, C. C., Bloom, T., Dallmeier-Tiessen, S., et al. (2017). Key components of data publishing: Using current best practices to develop a reference model for data publishing. International Journal on Digital Libraries, 18(2), 77–92. https://doi.org/10.1007/s00799-016-0178-2

Harvard Biomedical Data Management. (n.d.). Data repositories. Harvard University. https://datamanagement.hms.harvard.edu/collect-analyze/data-management-plans/data-repositories

Tenopir, C., Allard, S., Douglass, K., et al. (2011). Data sharing by scientists: Practices and perceptions. PLoS ONE, 6(6), e21101. https://doi.org/10.1371/journal.pone.0021101

Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., et al. (2016). The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3, 160018. https://doi.org/10.1038/sdata.2016.18
