
eSDO 1321: Data Centre Integration Plan

This document can be viewed as a PDF.
Deliverable eSDO-1321
E. Auden
23 August 2005

UK Data Centre

The UK data centre will host some data products from the Helioseismic and Magnetic Imager (HMI) and the Atmospheric Imaging Assembly (AIA). Helioseismologists will be interested in HMI data collected for long periods of time. By contrast, AIA data will cater to solar physicists analysing events such as flares and coronal mass ejections; this audience will primarily be interested in recent data. Low and high level data products from both instruments will be permanently available through the US data centre, so the UK data centre holdings will focus on providing fast access to data of the most interest to solar scientists across the UK.

System

Three architecture models have been investigated for the UK data centre: one light-footprint model and two heavy-footprint models. In the light-footprint model, we assume that the role of the UK data centre will be to provide fast access to cached SDO export data products. The two heavier models describe an active global role in the JSOC's Data Resource Management System (DRMS) and Storage Unit Management System (SUMS). At the end of the eSDO project's Phase A, further global DRMS and SUMS instances are being considered in the US and possibly in Germany.

Architecture Model 1: Light Footprint

The "light footprint" approach will provide a 30 TB online disc cache that will store HMI and AIA data products in export format (FITS, JPEG, VOTable, etc) only. Export files will be retrieved from a major SDO data centre in the US or Europe. This 30 TB will be divided between a 'true cache' for popular HMI and AIA datasets and a rolling 60 day systematic cache of AIA products. No tape archive will be required.
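The intended division of the 30 TB can be illustrated with a short sketch of the cache maintenance logic. This is only an illustration: the directory layout, the split of the budget, and the least-recently-used rule for trimming the 'true cache' are assumptions made for the example, not decisions recorded in this plan.

    # Minimal sketch of the light-footprint cache policy (illustrative only).
    # Assumed: export files carry a modification time and a last-access time;
    # the 'true cache' is trimmed least-recently-used, the AIA cache by age.
    import os, time

    CACHE_ROOT = "/cache"              # hypothetical cache location
    TRUE_CACHE_BUDGET = 20 * 10**12    # e.g. 20 TB of the 30 TB for the true cache
    AIA_WINDOW_SECONDS = 60 * 86400    # rolling 60 day window

    def trim_aia_cache(aia_dir):
        """Delete AIA export files older than the 60 day window."""
        cutoff = time.time() - AIA_WINDOW_SECONDS
        for name in os.listdir(aia_dir):
            path = os.path.join(aia_dir, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)

    def trim_true_cache(true_dir):
        """Evict least-recently-accessed files until under the size budget."""
        files = [os.path.join(true_dir, n) for n in os.listdir(true_dir)]
        files.sort(key=os.path.getatime)     # least recently accessed first
        total = sum(os.path.getsize(f) for f in files)
        for path in files:
            if total <= TRUE_CACHE_BUDGET:
                break
            total -= os.path.getsize(path)
            os.remove(path)

    # Example (hypothetical paths):
    #   trim_aia_cache("/cache/aia_rolling"); trim_true_cache("/cache/true_cache")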

The light footprint is currently favoured by the eSDO science advisory and JSOC teams. While the UK community is anticipated to have a regular interest in new AIA data, users requiring large volumes of HMI data will primarily be working with helioseismology groups at Birmingham and Sheffield that will have co-investigator accounts on JSOC processing machines at Stanford University. In addition, the HELAS project is considering plans for a European SDO data centre at the Max Planck Institute in Lindau. A disc cache of ~30 TB will provide the UK solar community with a sizeable cache for export products, but it is not large enough to warrant a heavyweight installation of a DRMS and SUMS.

Figure: UK Data Centre: light footprint (attachment eSDO_DC_light.gif)

Architecture Model 2: Heavy Footprint

The "heavy footprint" approach would provide a 300 TB disc cache, a significant percentage of the US SDO disc cache size. This 300 TB cache would be interfaced to a UK instance of the DRMS and SUMS; entire storage units would be cached. Along with the DRMS and SUMS, the UK data centre would require software to extract export formats from storage units. This system would play an active global role in storage unit management with other SDO data centres. The 300 TB would be divided between a 'true cache' for popular HMI and AIA datasets and a rolling 60 day systematic cache of AIA products. No tape archive would be required.

Figure: UK Data Centre: heavy footprint (attachment eSDO_DC_heavy.gif)

Architecture Model 3: Heavy Footprint with Tape Archive

This final "heavy footprint with tape archive" model is a fallback position for the case where no major European SDO data centre emerges but one is judged to be necessary. Like the "heavy footprint" described above, this approach would provide a 300 TB disc cache that is a significant percentage of the US SDO disc cache size. This 300 TB cache would be interfaced to a UK instance of the DRMS and SUMS; entire storage units would be cached. Along with the DRMS and SUMS, the UK data centre would require software to extract export formats from storage units. This system would play an active global role in storage unit management with other SDO data centres. The 300 TB would be divided between a 'true cache' for popular HMI and AIA datasets and a rolling 60 day systematic cache of AIA products. In addition, HMI export format data products would be written to the ATLAS tape store to provide a permanent European helioseismology archive.

Figure: UK Data Centre: heavy footprint with tape archive (attachment eSDO_DC_heavytape.gif)

AIA and HMI Datasets

A number of level 1 and level 2 HMI science products will be available to users, including magnetograms, dopplergrams and continuum maps. Assuming that the "light" footprint data centre architecture is followed, HMI products will be held in export format on a disc cache following user requests. By contrast, if the "heavy" model is followed, then following user requests HMI products will be imported to the UK as JSOC storage units of ~20 GB. These storage units will be held in the large disc cache, and the instances of DRMS and SUMS will be updated accordingly. Export formats of data products will be extracted from the storage units and returned to the user. Finally, if the "heavy with tape storage" architectural model is used, then HMI data will be systematically pulled from the JSOC archive and written to the ATLAS tape store inside JSOC storage units. Uncompressed storage for these data products is currently estimated at ~25 TB per year, culminating in 150 TB of total storage by 2014.
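The ~25 TB per year figure can be roughly reproduced from the per-product sizes and cadences given in the table below. The short calculation here is only a consistency check; the choice of products and the use of decimal units are assumptions made for the example.

    # Rough consistency check of the ~25 TB/year HMI estimate (decimal units).
    # Sizes and cadences are taken from the data product table below.
    SECONDS_PER_DAY = 86400

    products_mb_per_day = {
        "dopplergrams (20 MB / 45 s)":          20 * SECONDS_PER_DAY / 45,
        "LOS magnetograms (15 MB / 45 s)":      15 * SECONDS_PER_DAY / 45,
        "vector magnetograms (3 MB, 5/10 min)": 3 * 5 * 6 * 24,
        "continuum maps (15 MB / hour)":        15 * 24,
    }

    total_mb_per_day = sum(products_mb_per_day.values())
    tb_per_year = total_mb_per_day * 365 / 1e6               # MB -> TB (decimal)
    print("daily volume:  %.1f GB" % (total_mb_per_day / 1e3))
    print("annual volume: %.1f TB" % tb_per_year)            # roughly 25 TB
    print("six mission years: %.0f TB" % (tb_per_year * 6))  # roughly 150 TB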

AIA products will be held in a rolling 60 day cache; this will provide solar physicists with data from the two most recent solar rotations. Cached low level data will include low resolution full-disk images (8 per 10 seconds) along with high resolution images of tracked active regions (8 per 10 seconds). Several high level products generated at a much lower cadence will also be cached: thermal maps, DEM measures, irradiance estimates, and magnetic field extrapolations (1 to 10 per day). The storage requirement for a rolling 60 day cache is estimated at 11 TB.

| Instrument | Data Product | Estimated Size | Estimated Cadence | Storage |
| HMI | Line-of-sight magnetogram (full disk, full res) | 15 MB | 1 / 45 s | cached on user request |
| HMI | Vector magnetograms (tracked active region, full res) | 3 MB | 5 / 10 minutes | cached on user request |
| HMI | Averaged continuum maps (full disk, full res) | 15 MB | 1 / hour | cached on user request |
| HMI | Dopplergrams (full disk, full res) | 20 MB | 1 / 45 s | cached on user request |
| AIA | Images from 10 channels (full disk, full res) | 15 MB (?) | 8 / 10 s | rolling 7 day cache |
| AIA | Images from 10 channels (full disk, low res) | ~1 MB (?) | 8 / 10 s | rolling 60 day cache |
| AIA | Images from 10 channels (regions of interest, full res) | ~1 MB (?) | 40 / 10 s | rolling 60 day cache |
| AIA | Movies of regions of interest | ~10 MB (?) | 1 / day (?) | rolling 60 day cache |
| AIA | Other level 2 and 3 products (DEM, irradiance maps, etc) | ~10 MB (?) | 10 - 20 / day (?) | rolling 60 day cache |
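A similar back-of-the-envelope calculation can be made for the rolling 60 day AIA cache. The per-image sizes and cadences used below follow the paragraph above the table, and several of them are still marked as uncertain, so the result is indicative only; the quoted 11 TB allows headroom over the figure produced here.

    # Indicative sizing of the rolling 60 day AIA cache. Several of the sizes
    # and cadences are still uncertain, so treat the output as an
    # order-of-magnitude figure rather than a firm requirement.
    SECONDS_PER_DAY = 86400

    # (description, size in MB, images per day)
    aia_products = [
        ("full disk, low res images",     1,  8 * SECONDS_PER_DAY / 10),
        ("tracked active region images",  1,  8 * SECONDS_PER_DAY / 10),
        ("movies of regions of interest", 10, 1),
        ("other level 2/3 products",      10, 20),
    ]

    mb_per_day = sum(size * count for _, size, count in aia_products)
    print("daily AIA volume: %.1f GB" % (mb_per_day / 1e3))
    print("60 day cache:     %.1f TB" % (mb_per_day * 60 / 1e6))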

Integration Work

AstroGrid Deployment

The major tasks for integrating the UK eSDO centre with AstroGrid will be the deployment of the DataSet Access (DSA) and Common Execution Architecture (CEA) AstroGrid modules on a remote machine that can access data in the ATLAS storage facility. This development will be undertaken by MSSL early in Phase B, in conjunction with work to access Solar-B data also held at ATLAS. A relational database (MySQL) containing AIA and HMI data product metadata will reside on a remote machine. The DSA module will interface with this database, allowing a user to identify which data products are required. A request for the identified data products is then sent to a CEA application on the same machine. The CEA application will issue the ATLAS commands necessary for data to be transferred from the ATLAS facility to the user's remote AstroGrid storage area, or "MySpace".
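The division of labour between the metadata database and the DSA/CEA pair can be illustrated with a small sketch. The table layout, column names and values below are invented for the example rather than taken from any JSOC or eSDO schema, and the standard-library sqlite3 module stands in for the MySQL database described above.

    # Illustrative metadata lookup of the kind the DSA would perform before a
    # CEA application stages files out of ATLAS. Schema and values are invented
    # for the example; sqlite3 stands in for the MySQL metadata database.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE sdo_products (
                        product_id   TEXT PRIMARY KEY,
                        instrument   TEXT,   -- 'HMI' or 'AIA'
                        product_type TEXT,   -- e.g. 'magnetogram'
                        obs_time     TEXT,   -- ISO 8601 observation time
                        atlas_path   TEXT    -- location within ATLAS
                    )""")
    conn.execute("INSERT INTO sdo_products VALUES (?, ?, ?, ?, ?)",
                 ("hmi_m_20050823T120000", "HMI", "magnetogram",
                  "2005-08-23T12:00:00", "/atlas/esdo/hmi/example.fits"))

    # A user query of the kind a DSA request would be translated into:
    rows = conn.execute("""SELECT product_id, atlas_path FROM sdo_products
                           WHERE instrument = 'HMI'
                             AND product_type = 'magnetogram'
                             AND obs_time BETWEEN ? AND ?""",
                        ("2005-08-23T00:00:00", "2005-08-24T00:00:00")).fetchall()
    for product_id, atlas_path in rows:
        print("stage %s from %s" % (product_id, atlas_path))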

A number of test datasets will be placed in a disc cache at the ATLAS facility. Next, a MySQL database will be configured on the eSDO server at MSSL with sample metadata relating to the test datasets. Instances of DSA and CEA will also be deployed on the eSDO server; the DSA will interface with the MySQL database and the CEA application will interface with ATLAS. Requested test datasets will be returned to an instance of the AstroGrid filestore and filemanager on the eSDO server.

Interface with JSOC Data Centre

Assuming that the "light footprint" architecture model is followed, export-formatted SDO data products will need to be transferred from the JSOC data centre to the UK data centre. In this model, when a UK user makes an SDO data request through the AstroGrid system, the request will first be sent to the UK data centre. If the required data is not present, the request will be redirected to the JSOC data centre. The required datasets will be exported back to the UK. The dataset will be cached in the UK system before a copy is passed to the user's MySpace area. In addition to user requests, the data centre will poll the JSOC system for new AIA data approximately twice an hour, and this data will be held in the UK cache for 60 days.
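The request flow described above can be sketched as follows. Every class and function name here is a hypothetical placeholder rather than an existing AstroGrid or JSOC interface; only the order of operations is being illustrated: check the UK cache, fall back to the JSOC data centre, cache the result, then deliver it to MySpace.

    # Sketch of the light-footprint request flow. All classes and helpers are
    # hypothetical placeholders, not real AstroGrid or JSOC interfaces.

    class DictStore:
        """Stand-in for the UK cache and for the user's MySpace area."""
        def __init__(self):
            self.files = {}
        def get(self, key):
            return self.files.get(key)
        def put(self, key, data):
            self.files[key] = data

    def handle_request(dataset_id, uk_cache, fetch_from_jsoc, myspace):
        """Serve a UK user's SDO data request via the UK data centre."""
        data = uk_cache.get(dataset_id)        # 1. try the UK cache first
        if data is None:                       # 2. cache miss: redirect to JSOC
            data = fetch_from_jsoc(dataset_id)
            uk_cache.put(dataset_id, data)     # 3. cache the export file in the UK
        myspace.put(dataset_id, data)          # 4. copy into the user's MySpace area
        return data

    if __name__ == "__main__":
        cache, myspace = DictStore(), DictStore()
        handle_request("aia_example_001", cache,
                       lambda ds: b"FITS bytes for " + ds.encode(), myspace)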

Development work will require a mechanism to poll the JSOC data centre for new AIA data, as well as a CEA application to pass user requests that cannot be fulfilled by the UK data centre on to the JSOC system. This CEA application should cache the returned datasets in ATLAS, update the metadata accessible to the DSA deployed on the eSDO server, and pass the data on to the user's MySpace area.
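A minimal polling loop of the kind described above might look like the sketch below. The half-hour interval comes from the text; the two helper functions are hypothetical placeholders for whatever JSOC export interface is eventually exposed.

    # Sketch of the twice-hourly poll for new AIA data. The two helpers passed
    # in are hypothetical placeholders; only the polling structure is shown.
    import time

    POLL_INTERVAL_SECONDS = 1800   # approximately twice an hour

    def poll_jsoc_for_aia(list_new_products, cache_product):
        """Repeatedly ask JSOC for AIA products newer than the last one seen."""
        last_seen = None
        while True:
            for product in list_new_products(since=last_seen):  # hypothetical query
                cache_product(product)           # write into the rolling 60 day cache
                last_seen = product["obs_time"]
            time.sleep(POLL_INTERVAL_SECONDS)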

US Data Centre

System

Detailed plans of the JSOC data centre and pipeline can be viewed at http://hmi.stanford.edu/doc/SOC_GDS_Plan/JSOC_GDS_Plan_Overview_CDR.pdf.

Archived Data

In addition to the HMI and AIA products listed above, a full description of archived and cached SDO products can be viewed at http://hmi.stanford.edu/doc/SOC_GDS_Plan/HMI_pipeline_JSOC_dataproducts.pdf.

Integration Work

AstroGrid Deployment

The Virtual Solar Observatory (VSO) will be the user front end to the JSOC data centre in the US. However, AstroGrid users may wish to incorporate a VSO search into an AstroGrid workflow that submits data to eSDO tools. Therefore, an AstroGrid to VSO interface will be developed using the CEA module. In addition, the JSOC data centre team is reviewing three AstroGrid components for use with their backend system. First, Karen Tian at Stanford University is investigating the DSA and CEA modules to enable data searching and data retrieval through the grid. Second, Rick Bogart has expressed interest in the AstroGrid workflow engine for use in driving JSOC pipeline execution.

The eSDO project will advise the JSOC team and aid development with these three AstroGrid components. As part of the Phase A research effort, Mike Smith has installed and configured the major AstroGrid components at MSSL: DSA, CEA, the workflow engine (JES), the filemanager / filestore, the registry and the portal. Documentation of this deployment is available to the solar community at http://www.mssl.ucl.ac.uk/twiki/bin/view/AstrogridInstallationV11, and it is also included as an appendix in the eSDO Phase A Report.

Network Latency Tests

Aim: to establish baselines for the network latency involved in data transfer under various protocols. This information is required to specify both the optimal sizes of the eSDO data cache elements and the overall architecture of an interoperating system that uses the JSOC capabilities for efficient data management at the two (or more) sites.

Transfer protocols to be investigated include simple higher-level ones such as ftp, wget, and rsync, and especially lower-level ones including gridFTP, nfs, and raw TCP data sockets handled through dedicated clients. The tests will require accounts for Elizabeth at MSSL and Stanford (existing) and at UCL and/or RAL on designated Linux machines, and accounts for Rick at Stanford (existing) and at MSSL, plus UCL and/or RAL, on the same machines if possible. They will also require temporary access to approximately 100 GB of scratch disk space at each site. We will also need the cooperation of the system administrators in opening any firewalls for the designated hosts, ports and services for the tests.

I. Transfer mechanisms

  1. non-interactive scp
    • login to Stanford machine: scp files to MSSL
    • login to Stanford machine: scp files from MSSL
    • login to MSSL: scp files to Stanford
    • login to MSSL: scp files from Stanford
    • N.B.: These may require .rhosts-type authentication for non-interactive transfers
  2. ftp
    • login to Stanford machine: ftp files to MSSL
    • login to Stanford machine: ftp files from MSSL
    • login to MSSL: ftp files to Stanford
    • login to MSSL: ftp files from Stanford
    • N.B. These may require access to anonymous ftp if use of plain-text .netrc files is to be avoided for non-interactive transfers
  3. gridFTP
    • login to Stanford machine: gridftp files to MSSL
    • login to Stanford machine: gridftp files from MSSL
    • login to MSSL: gridftp files to Stanford
    • login to MSSL: gridftp files from Stanford
  4. wget
    • login to Stanford machine: wget files to MSSL
    • login to Stanford machine: wget files from MSSL
    • login to MSSL: wget files to Stanford
    • login to MSSL: wget files from Stanford
  5. rsync
    • test rsync for speed in updating files that exist at both MSSL and Stanford
    • test rsync for speed in updating directory structures that exist at both MSSL and Stanford
    • test rsync for ability to update dynamic directory structures (in context of DRMS) at MSSL and Stanford
  6. direct access
    • test comparative network latency for file seeks, reads, and writes for files mounted via a standard protocol (nfs) from remote sites
    • test latencies for selected data transfers using tcp sockets

II. Test Requirements

  1. Remote accounts
    • Accounts for Elizabeth at Stanford (done), UCL and / or RAL
    • Accounts for Rick at MSSL, UCL, RAL if possible
  2. Test data: up to 100 GB in scratch space for bulk transfers
  3. Test locations
    • Stanford
    • MSSL
    • UCL
    • RAL
  4. Test scripts to automatically (and non-interactively) initiate transfers for test data and store metrics for speed and timestamps (a minimal sketch of such a script follows this list)
  5. Required programs (with opened ports) on test machines: scp, ftp, gridftp, wget, rsync
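A minimal timing harness for the non-interactive transfers in requirement 4 is sketched below. The host names, paths and command options are placeholders for the real test accounts and scratch areas; the script simply times each command and appends the result, with a timestamp, to a metrics file. Entries for ftp and gridFTP clients could be added to the same table.

    # Minimal timing harness for the bulk transfer tests (requirement 4 above).
    # Host names, paths and the command list are placeholders; each transfer is
    # timed and logged with a timestamp and return code.
    import csv, subprocess, time

    commands = {
        "scp":   ["scp", "-B", "testdata.bin", "user@stanford-host:/scratch/"],
        "rsync": ["rsync", "-a", "testdata.bin", "user@stanford-host:/scratch/"],
        "wget":  ["wget", "-q", "-O", "testdata.bin",
                  "http://stanford-host/scratch/testdata.bin"],
    }

    with open("transfer_metrics.csv", "a", newline="") as metrics:
        writer = csv.writer(metrics)
        for name, cmd in commands.items():
            start = time.time()
            result = subprocess.run(cmd)        # non-interactive transfer
            elapsed = time.time() - start
            writer.writerow([time.strftime("%Y-%m-%dT%H:%M:%S"), name,
                             result.returncode, "%.1f" % elapsed])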

-- ElizabethAuden - 23 Aug 2005
