Seismic Data Lakehouse: An Architecture for Curation, Governance, and Reproducibility of Seismic Data in Northeast Brazil
Seismological Data Lakehouse; FAIR principles; Data governance; Data lineage; Traceability; Seismological data management
The management of seismological data involves significant challenges related to integration, governance, traceability, and reuse within scientific workflows. In traditional environments, these processes are often fragmented and dependent on manual procedures, which limits reproducibility, scalability, and operational efficiency.
This dissertation proposes a domain-oriented Data Lakehouse architecture designed to support the complete lifecycle of seismological data. The architecture integrates data ingestion, storage, processing, and access within a unified and scalable framework, complemented by a cross-cutting governance layer responsible for metadata management and end-to-end data lineage. This approach enables traceability, interoperability, and alignment with the FAIR principles (Findable, Accessible, Interoperable, Reusable).
The solution was implemented using open-source Big Data technologies, including Apache Hadoop, Apache Spark, Apache Airflow, and Apache Atlas, combined with the domain-specific ObsPy library and the miniSEED format for seismic data processing.
The evaluation used a multi-method approach that combined experimental validation on real-world seismic data, quantitative analysis, expert-based assessment, and Fuzzy Comprehensive Evaluation. The results show that the proposed architecture improves data organization, reduces manual intervention, and enhances traceability compared with traditional workflows.
Experts reported high acceptance levels, with average scores above 4.6 on a 5-point scale, and 90\% indicated willingness to adopt the solution. These findings indicate that the proposed Data Lakehouse provides a robust, scalable, and FAIR-aligned framework for seismological data management.
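
For readers unfamiliar with Fuzzy Comprehensive Evaluation, the following minimal sketch shows one common form of the calculation: a weight vector over criteria is combined with a membership matrix (the share of experts choosing each grade per criterion) using the weighted-average operator, and the resulting vector is defuzzified into a single score. The criteria names, weights, and vote counts below are invented for illustration only; they are not the dissertation's data, and the dissertation may use a different operator or grade set.

```python
from typing import List, Sequence, Tuple


def fuzzy_comprehensive_evaluation(
    weights: Sequence[float],
    membership: Sequence[Sequence[float]],
    grade_values: Sequence[float],
) -> Tuple[List[float], float]:
    """Weighted-average FCE: B = W . R, then score = B . grade_values."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("criterion weights must sum to 1")
    if len(weights) != len(membership):
        raise ValueError("one membership row is required per criterion")
    if any(len(row) != len(grade_values) for row in membership):
        raise ValueError("each membership row needs one entry per grade")

    # Evaluation vector: weighted sum of the membership rows, grade by grade.
    vector = [
        sum(w * row[j] for w, row in zip(weights, membership))
        for j in range(len(grade_values))
    ]
    # Normalize so the vector can be read as a distribution over grades.
    total = sum(vector)
    vector = [v / total for v in vector]
    # Defuzzify into a single score on the grade scale.
    score = sum(v * g for v, g in zip(vector, grade_values))
    return vector, score


# Hypothetical example: 3 criteria, 10 experts, 5-point scale (grades 1..5).
criteria = ["traceability", "usability", "scalability"]  # assumed names
weights = [0.5, 0.3, 0.2]                                # assumed weights
votes = [                                                # invented counts
    [0, 0, 0, 4, 6],
    [0, 0, 1, 3, 6],
    [0, 0, 2, 4, 4],
]
# Membership degree = fraction of experts who chose each grade.
membership = [[count / sum(row) for count in row] for row in votes]

vector, score = fuzzy_comprehensive_evaluation(weights, membership, [1, 2, 3, 4, 5])
print([round(v, 2) for v in vector], round(score, 2))
```

With these invented inputs the evaluation vector concentrates on the two highest grades, which is how a panel's ratings translate into an overall acceptance score under this operator.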