
Gaia Odyssey Pipeline

Overview

Before February 2013 every DU associated with MSSL used Odyssey, but each in its own separate framework or pipeline processing. Over time this created different tables and many special add-on technical modules that individuals used to gather information, leaving our pipelines quite disjointed. As we near launch, and with the OR simulations, it was decided that we need to join our pipelines and make certain that we use the same tables, with the ability to run all the DUs, including modules that are not coded at MSSL. This wiki page describes this new full pipeline.

Who can use it and requirements?

Everybody. Instead of a download distribution, it will initially be checked out from SVN and run with 'ant' commands. MSSL will ingest data, run IncorporateInformation from the PreProcessing DU, and dump the needed data so that it can be downloaded. See the 'Instructions and setup' section below on how to set up the data and run the pipeline.
Requirements:
  • Ant as normally set up in the Gaia environment.
  • Mysql -- Though Odyssey can run on any database, MSSL runs mysql. The contents that can be downloaded from MSSL will be in the form of a mysql database dump. The setup instructions below will create a new mysql user to be used on the dataset, hence a root (or admin-type) mysql user will be required. All instructions below assume a mysql database. It is also assumed you have the 'mysql' client to run on the command line or terminal. Instructions below assume mysql is on localhost, though a --host {hostname} option can be added for external databases.
    • To verify your mysql is working, try the commands below (if none of them work, then you or somebody else has set up the database with a particular user and password that you need to discover):
      • mysql -u root --password= -e"show databases"
      • or if that fails: mysql -u root -e"show databases"
      • finally one last try: mysql -u root --password=root -e"show databases"
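      • If your database is on another machine, the same checks should work with the --host option mentioned above (a minimal sketch; {yourdbhost} is just a placeholder for your database host): mysql -u root --host={yourdbhost} --password= -e"show databases"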

Support

  • All questions should be sent to 'odyssey@majordomo.mssl.ucl.ac.uk' so everybody has a chance to view and answer the question. See the 'Datasets' section below for the support dates of each dataset.
  • Advice: Be specific on subject lines, e.g. which DU and which module if it is about a particular module. Saying 'WindowedSpectrum does not look correct' is not very helpful, considering several modules make changes and produce WindowedSpectrum as output.
  • Be active.

Instructions and setup

To keep the instructions simple, it is assumed that your starting directory is '/Pipeline' and that you have a Terminal/Command Prompt currently located in this directory.

Obtaining the pipeline and setup of the environment

svn is the better way to download and keep up to date. A download of everything related to the svn checkout will be located here in case you do not have svn capability. Do note you still need the environment variables set up as the instructions below describe; it is assumed that if you have ever set up a CU6 environment and compiled a module then the variables are already set up.

Database and obtaining the Dataset

These instructions will step through setting up a database for 'pipeline_template'
Note the mysql command above to do 'show databases'
  1. Download the particular database you wish to process here. i.e. pipeline_template.zip
  2. Unzip the downloaded file. You should now have files called pipeline_template_tables.dat and pipeline_template.dat
  3. Create a new database:
    • Run this command: mysql -u root --password={yourmysqlpasswordifany} -e "create database pipeline_template";
  4. Grant a 'gaia' user access to the database:
    • Run this command: mysql -u root --password={yourmysqlpassword} -e "grant all on pipeline_template.* to 'gaia'@'localhost' identified by 'gaia';grant all on pipeline_template.* to 'gaia'@'%' identified by 'gaia';flush privileges"
  5. Import the table metadata first: mysql -u gaia --password=gaia pipeline_template < pipeline_template_tables.dat
  6. Import the data: mysql -u gaia --password=gaia pipeline_template < pipeline_template.dat
  7. At this stage all data is ready for the pipeline. To save space you may remove the .dat files.
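
As an optional sanity check (a sketch assuming the 'gaia' user created above, and that the dump includes the mdbcu3idtrawspectroobservation table), you can list the imported tables and confirm they contain data:
  • mysql -u gaia --password=gaia pipeline_template -e"show tables"
  • mysql -u gaia --password=gaia pipeline_template -e"select count(*) from mdbcu3idtrawspectroobservation"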


As new datasets are delivered, e.g. 'OR3', you will follow the same steps as above.

Datasets

Current list of datasets. NOTE the Date of Support: this is the estimated date from which you can send queries about that dataset to the support mailing list. Feel free to send information before then, but queries that would require work should wait until the date in the table. The reason is that you must give others, especially people here at MSSL, time to process the chains and understand the data.

| Name | Dataset Zip file | Date of Support | Description |
| Pipeline Template | pipeline_template.zip here | 15/04/2013 | Dataset based on OR2 that is used as a test dataset for the Odyssey Pipeline. The data is assumed to be incorrect; this dataset is to verify you can run the pipeline. |
| OR3 Dry Run Day 1 | or3_day1.zip here | 8/05/2013 | OR3 Pre dataset, a processed OR3 dataset from right before the official OR3. * Updated on May 8th 10:00 -- Missing StarNormal table data. |
| OR3 (From CNES) | Not Ready | ? | Currently being processed. |

Running the Pipeline

The pipeline is still separated into individual DUs, plus individual chains within each DU. On a terminal, cd to the framework top level directory.
  • cd MSSL_Odyssey_FullFramework

  • Double check the connection settings if your database is not local: in the 'conf' directory you will find property files that connect to databases. pipeline_template.properties should already be set up for the local database.

Tips:
  • By default the batch size in the commands below will be 5000 transits (or 15000 including strips), holding a lot of data in memory. If your machine gets very slow then you can change the batch size. See the FAQ below.
  • Start up another terminal window and do a 'tail -f {logfile}' to see the processing of your scripts.
  • See FAQ about monitoring the batch size.
  • Do you wish to run PBG? PBG is one of the slower modules because it calls so many other facades in the chain, but it is one of the modules that can be skipped if desired. You can see information about skipping modules below, or see a quick example commented out in the properties file.
  • By default 8000M (8GB) is set in the build.xml; you might wish to lower this amount if it is more than your machine can take.


Running the Pipelines:

  • Pre Processing:
    • ant -Dfilename=xml/FullChain_PreProcessing.xml -DpropertyFile=conf/pipeline_template.properties runpipeline > mylogpreprocessing.txt

  • Extraction-Preliminary Chain:
    • ant -Dfilename=xml/FullChain_PrelimExtraction.xml -DpropertyFile=conf/pipeline_template.properties runpipeline > mylogprelimextraction.txt

  • Calibration Chain:
    • The OR3_Pre dataset that is being processed has very few Ground Based Standards, so it creates very few calibration records. It might be wise to use the Short Circuit below. Real OR3 data is currently being downloaded as well.
    • ant -Dfilename=xml/FullChain_Calibration.xml -DpropertyFile=conf/pipeline_template.properties runpipeline > mylogfullcalibration.txt

  • Calibration Short Circuit (only run this if you do not wish to run the Full Calibration and deal with the first hour/cauNumber):
    • On small datasets, e.g. 'pipeline_template', not all the calibration records will be written, which means fewer records for Full Extraction and STA. This is fine, but be warned. You can short circuit the full calibration by using this run, which duplicates the initial cauNumber 0 to cauNumber 1, allowing you to go to Full Extraction and STA for the first cauNumber and process all the spectra.
    • ant -Dfilename=xml/FullChain_CalibrationShortCircuit.xml -DpropertyFile=conf/pipeline_template.properties runpipeline > mylogshortcircuitcalibration.txt

  • Extraction-Full Chain, after calibration:
    • ant -Dfilename=xml/FullChain_Extraction.xml -DpropertyFile=conf/pipeline_template.properties runpipeline > mylogfullextraction.txt

  • STA-Full Chain:
    • ant -Dfilename=xml/FullChain_STA.xml -DpropertyFile=conf/pipeline_template.properties runpipeline > mylogfullsta.txt

  • MTA-Full Chain:
    • ant -Dfilename=xml/FullChain_MTA.xml -DpropertyFile=conf/pipeline_template.properties runpipeline > mylogfullmta.txt
    • No longer needed as of the May 29th 2013 update on svn: an issue with STA meant it was not producing radVelSpe even though it was producing the other radial velocities. To short circuit this problem for the moment, run this query in mysql: 'update sta_mdbtransitcharact s, sta_dpctransitcharact t set s.radVelSpe=t.radVelSpeCCDir where s.transitId=t.transitId'
    • On May 21st a serialize_info.dat was made available here to update the serializeinformation table, which included a small change for STA. This .dat file can be loaded into any of your current databases to fix the issue on datasets that were loaded prior to May 21st (see the example command below).
    • A proper MTA dataset is currently being processed and will be available shortly.
    • The mta_mdbmeancharact table, which is the main location of the mean characteristics, will be cleaned and written again when the MTA modules run. If you need to recreate the mean characteristics quickly, see the IncorporateInformation section below on running just the CreateMDBMean.xml.
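    • Loading the serialize_info.dat mentioned above works the same way as the original dump files (a sketch, assuming the 'gaia' user and the 'pipeline_template' database from the setup above):
      • mysql -u gaia --password=gaia pipeline_template < serialize_info.dat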

Running certain tests

It is quite common that you do 'NOT' want to run every module in the chain, but only particular module(s). In the property file (see commands above) you may uncomment the property 'Odyssey.TESTONLY' and create a space-delimited list of the modules you wish to run. NOTE: Comments are placed in the property file for modules that you must keep, i.e. TransitLoop and TransitLoopBack are required. This is because the chains work on a batch of data and these two small technical modules set up the batching.
  • Example of only calling RvsTimes and FieldAngles: Odyssey.TESTONLY= TransitLoop TestCalcRvsTimes TestCalcFieldAngles TransitLoopBack

Do be careful: this property is read for any of the chain calls, hence if the example above were used on an Extraction or STA chain then nothing would be run except for TransitLoop and TransitLoopBack.


Likewise, maybe you wish to run most modules but skip one particular module. A property 'Odyssey.SKIPTESTS' can be used for this case. A good example is PointBackground: it might run once, but you may prefer to skip it on future runs because you are not working on that module and the values will not change, i.e. Odyssey.SKIPTESTS= TestPointBackgroundChain. It is not advisable to use both of these properties at the same time; Odyssey gives SKIPTESTS priority over TESTONLY if such a case exists.


A list of all the ID names, with a brief description of each, is in a further section below.

Plotting and Analysing results

If you only want to view a small subset and do basic plots then use MDBExplorer. For commissioning, or to view the data quickly with plotting to png files, see 'Data Extracting and Plotting with Odyssey' below.

Using the Official Version of MDB Explorer

As of August 2014 this now works correctly with local MySQL databases. A properties file describing the local databases is still required, as described below. This properties file can be used with the official MDB Explorer jar, available at:

http://gaia.esac.esa.int/mdbexp/

MDBExplorer and GbinInterogator

A customised MDBExplorer was created along with the GbinInterogator, and both can be downloaded. This customised version of the MDB Explorer jar was adapted to work with MySQL databases, but the official version now works correctly (see link above), so try that first.

The MDBExplorer is bundled with mysql drivers to connect to the database. Because passwords are located inside MDBExplorer, you must access it from svn (see below) or email odyssey@majordomo.mssl.ucl.ac.uk for access.

SVN is the quickest way to get MDBExplorer and property files: svn co http://gaia.esac.esa.int/dpacsvn/DPAC/CU6/integration/Tools/MDBExplorer_Mysql
To run use: java -Xmx800m -cp MDBExplorerStandalone.jar gaia.cu1.maindb.dbexplorer.progs.ExplorerApplication

In the above url you will find a dbexplorer_sample.properties file that can be edited or added to look at your database. In MDBExplorer do these tasks once you have edited the properties file:

  • File->Load Properties File
    • {Now load the properties file you edited}
  • File->Reload Data
NOTE: You only need to edit one line at the top of the properties file to make a keyname for your database, and add the information at the bottom of the property file.
  • For MSSL internal users a dbexplorer_mssldb.properties file is ready to be used and will connect to the primary mssl database. Just download it and follow the two instructions above.

You're done; it should now be populated with your database.
INFO YOU MUST READ: When connected to MySQL, MDBExplorer will show you drop down(s) as defined in your properties file. Each drop down is connected to 'ONLY' the one database you specified in the properties file. MDBExplorer will appear to show you 'ALL' databases, but it is actually only connected to the one specified. Add more sections to the properties file to connect to more databases. Example: the _mssl.properties file will show you a PIPELINETEMP and an OR3DAY1 section; the drop-down will show many more databases (i.e. gibis_jul27) in each section, but each section is only connected to the one database.

When MDBExplorer is launched it will connect to your database; from there you can look at tables and see results. Look below for Table Names with descriptions. You can then view results and do plots in MDBExplorer. You may also wish to save information as a Gbin, so you can analyse the data with the GbinInterogator.
TIPS for MDBExplorer:

  • When looking at the results of a table and you want to plot Fluxes (and, let's say, Wavelengths), do these steps.
    • Ctrl-D to toggle highlighting of a single cell (this can also be found in the Menu). You should now have a filled-in blue box on a cell.
    • Go to Fluxes, right click and choose 'Convert Selected Column to new Window'; a new window now appears with your fluxes. Clicking on a column will highlight the whole column. Right click and choose plot as line.
    • Wavelengths and Fluxes XY plot: again Ctrl-D to highlight the Wavelengths cell, then hold down 'SHIFT' until you get to Fluxes. Right click and convert to a new window.
      • WARNING: Do not use CTRL to highlight just Wavelengths and Fluxes. It looks like it works, but when you see the new window the Fluxes are all the same value. USE SHIFT.
    • Now you can hold down Ctrl, highlight Wavelengths and Fluxes in this new window and do Plot XY. Note: my computer tends to take a good 10-15 seconds before the XY dialog comes up.
  • The query used in the above example was for calibrated spectra brighter than magnitude 10:
    • SELECT * FROM full_ext_applycal_calibratedspectrum m, incinfo_rawspectrumdesc r where m.transitId=r.transitId and m.gmdbstrip=r.gmdbstrip and 21-r.rvsMag/64 < 10
  • Warning to Mac users: use Ctrl-C and Ctrl-V (not Cmd) for any copying and pasting into MDBExplorer.


GbinInterogator can be downloaded here

GbinInterogator is a fantastic tool to look at Gbins and dump them to ASCII for further processing. Look at the metadata/columns:

  • java -cp gbcat.jar gaia.cu1.tools.util.GbinInterogator -m {gbinfile}

Dump all columns:

  • java -cp gbcat.jar gaia.cu1.tools.util.GbinInterogator {gbinfile}

Dump certain columns e.g. SpectroObservation:

  • java -cp gbcat.jar gaia.cu1.tools.util.GbinInterogator -c "TransitId, Rvs1Samples" {gbinfile}
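
To keep the ASCII output for further processing, the dump can simply be redirected to a file (a sketch; the gbin filename here is only an illustrative example):

  • java -cp gbcat.jar gaia.cu1.tools.util.GbinInterogator -c "TransitId, Rvs1Samples" SpectroObservation.1.gbin > spectroobservation.txt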

Data Extracting and Plotting with Odyssey

Plotting Scripts

Scripts and tools for creating plots of spectra are described in GaiaPlottingSpectra.

Introduction to Logic Classes

Odyssey, with its xml files, already chains together modules to run each DU in the pipeline. We can exploit Odyssey and use these XML files, along with some basic Logic classes, to analyse and extract the data. Data can be extracted into ASCII files, allowing scientists to look at the data in a format they are familiar with and to parse it with various scripts. Odyssey can also generate other text files, such as gnuplot text files, allowing automatic plotting of the data. It is envisioned this will be used in commissioning to quickly gather much needed information from the datasets.

Some xml files for extracting and plotting are already set up. It is expected they will be improved over time as we learn more about what is needed for commissioning. XML files are located in 'xml/diagnostics/' and Logic classes are in 'src/gaia/cu6/framework/logic'. See 'Help and Current Diagnostics'; we need your help to really improve and analyse the data.

Each xml file performs an SQL query that passes the particular data to a Logic class. This Logic class is simply pure Java that gets the data and processes it in whatever fashion the developer wants, such as dumping to an ASCII file or printing other particular information.


Few Tips

  • Read the following subsections below including the 'Questions'
  • Simply by looking at an xml file and its Logic class you can quite quickly see how it works. Feel free to create one or more xml files and/or Logic classes.
  • MUST BE NOTED: you cannot break the framework; all of this simply extracts out the data as you want it and creates the gnuplot files you desire.
  • See the 'Questions' section and use the 'storage' hashtable in the Logic classes. It can be very useful and a good advantage in making data and/or gnuplot files. Do note: don't call storage.clear(), as the xml files also place things into this storage hash table.

Running Logic Classes Using ant

Running is similar to the above commands; simply use the appropriate filename: ant -Dfilename=xml/diagnostics/{filename} -DpropertyFile=conf/{propertyfile} runpipeline
e.g. ant -Dfilename=xml/diagnostics/DumpALLWindowedSpecAndRawSpecAfterBasicCleaning.xml -DpropertyFile=conf/pipeline_template.properties runpipeline

Output and Plot files

Output is sent to the data/output directory by default; you may specify a different location in the properties file with the 'Odyssey.PlotAnaysisDir' setting. There is far too much data to place everything in one directory on large datasets, so the current Logic classes create new directories based on a batch number, then on multiples of 200 files, then a further directory based on CLASS_{MagnitudeRange}. It is expected that write permissions are available to create directories and files.

Filenames are also given a long descriptive pattern to allow a user to quickly move files into categories or directories if desired. The current Logic classes' pattern is: {TransitId}_{Strip}_{Keyword}_{Class}_{Fov}_{Row}_{Magnitude}.dat, i.e. 49683597792404141_c-ws-calcfieldangles_1_1_0_6__7.796875.dat

  • Note the xml files 'can' create Gbins automatically; the relevant lines are commented out in the xml files and just need uncommenting.

Logic classes will tend to create a gnuplot text file; the client then goes to that directory and runs:
gnuplot gnuplot_{keyword}.txt
This command will create png files to quickly view the data. GNUPlot has many capabilities and they should be used to create the plots needed.
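For example, if the 'c-ws-calcfieldangles' keyword shown in the Keyword section below were used, the call would presumably be:
gnuplot gnuplot_c-ws-calcfieldangles.txt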

Because there are so many gnuplot files being created in each directory, a plot_datfiles.sh script has been created to call all the gnuplot files using a find command on linux.
Run 'find . -name gnuplot*.txt -exec plot_datfiles.sh {} \;' or, to do just one batch, 'find data/output/1/ -name gnuplot*.txt -exec plot_datfiles.sh {} \;' NOTE: The gnuplot files are ALWAYS appended to, so it is advisable to move the 'output' directory to another location if you wish to re-run an xml file.

Keyword

Keywords are optional but should be used to construct the dat files; these keywords are defined in the xml file when calling the Logic class. If we use an alphabetical system for the keyword, then when listing the data files alphabetically you will see the data and png files change as the spectra pass through each module. Currently 'a' is used for idt or incinfo, 'b' for calcrvstimes, and 'h' for deblendtocollapse. See the 'chain' image below to get a listing of all the current modules and their current order. Other details can be encoded in the keyword, such as ws (WindowedSpectrum) and the module name, or other info you feel is necessary to convey.

In the xml files you will see a keyword defined when calling the logic class i.e.:

      <logic type="MID" className="gaia.cu6.framework.logic.DumpWindowedSpectra" prop="c-ws-calcfieldangles"/>

Help and Current Diagnostics

Below is a table of the current xml files and their capabilities; it is expected they will improve. Help is needed though: the xml files and Logic classes provided are a good start, but other people should be adding their own xml files and Logic classes for their particular module or other scientific needs. It is expected many will copy the current xml files and Logic classes to get integrated quickly.

| XML File | Description | Contact |
| DumpALLWindowedSpecAndRawSpecAfterIncInfo.xml | Dump all WindowedSpectrum plus RawSpectrumDescription information in the dataset after IncorporateInformation. Uses the batch mechanism. Currently creates gnuplot_{keyword}.txt to plot fluxes. | Kevin |
| DumpALLWindowedSpecAndRawSpecAfterBasicCleaning.xml | Dump all WindowedSpectrum plus RawSpectrumDescription information in the dataset after BasicCleaning. Uses the batch mechanism. Currently creates gnuplot_{keyword}.txt to plot fluxes. | Kevin |
| DumpALLExtractedSpecAndRawSpecAfterCollapse.xml | Dump all ExtractedSpectrum plus RawSpectrumDescription information in the dataset after DeblendToCollapse. Uses the batch mechanism. Currently creates gnuplot_{keyword}.txt to plot fluxes. | Kevin |
| DumpExtractedSpecAndRawSpecAfterCollapse.xml | Same as the DumpALL version for ExtractedSpectrum, but the SQL query is limited to a certain amount and has a constraint on magnitude. Feel free to construct your own sql queries to quickly dump a small amount of the data. | Kevin |
| DumpWindowedSpecAndRawSpecAfterIncInfo.xml | Same as the DumpALL version for WindowedSpectrum, but the SQL query is limited to a certain amount and has a constraint on magnitude. Feel free to construct your own sql queries to quickly dump a small amount of the data. | Kevin |
| DumpWindowedSpecAndRawSpecAfterCalcFA.xml | Same as the DumpALL version for WindowedSpectrum but after 'Calc Field Angles'; the SQL query is limited to a certain amount and has a constraint on magnitude. Feel free to construct your own sql queries to quickly dump a small amount of the data. | Kevin |

Questions in this particular section

One big question that will come up: you wish to analyse all the data and make plots based on all of it, not just a particular batch of spectra, e.g. 5000. How do I do this? Answer: you will notice in the Logic classes a 'storage' Hashtable that allows you to store anything, such as objects, numbers, etc. Use this to your advantage: the Hashtable is not cleared away, meaning you can place things into it and retrieve them in each batch call, or even in another Logic class. You might need to create a final Logic class to do the final plotting and/or write a data file; this would be called at the very end, see 'TransitLoopBack' in the xml file. It will look just like the 'TransitLoopBack' test seen in many of the Extraction xml files, except it will come after the 'TransitLoopBack' test and use your Logic class to help create the gnuplot or data text file(s).

Table Names and ID

For MDBExplorer, and for analysis of the xml files if you wish to dig into the actual queries used, you need to know the Table Names and the KEY (ID) used in the processing.

As the pipelines are run they import particular table-setting xml files, i.e. PrimaryTableSettings.xml, PrelimTableSettings.xml. These xml files do nothing more than store a 'KEY' with a 'VALUE'. In this particular case a 'KEY' is a unique name for a particular table and the 'VALUE' is the actual table name in the database. The huge benefit we get from this is that the xml files reference the 'KEY', and we can reuse the xml files by simply changing a 'VALUE' in the Settings xml files. For example, the PrelimExtraction chain includes individual module xml files and the 'FULL' Extraction chain includes the same xml files. The main difference between these chains is that they reset the 'VALUE' (using these TableSettings.xml files) so the data is stored in (or read from) the correct tables in the database.

To know exactly what you might wish to dump and analyse, a table is referenced below giving you the Key, Value/Table Name, and brief description.

  • NOTE: In the xml files you will notice 'STORAGE-KEY', i.e. STORAGE-Table_DpcStellar. You must use this 'STORAGE-' part, because Odyssey uses it as a keyword to look up the mapping of previously stored Key->Value combinations. Also make sure a ' ' {space} is placed before and after it when used in the SQL statements.
  • NOTE: When constructing your xml files you have to know the data model name (the className attribute in the xml file). Many of you will know this already because you deal with it in your coding. It is not listed in the table below, but by simply doing a 'grep {KEY} {DU}/xml/*' you can easily find the className if you're not certain, i.e. 'grep Table_DPCStellar Extraction/xml/*'.
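  • As an illustrative sketch only (not taken from a real chain file), an SQL fragment inside an xml file might reference a key like this, with a space on each side of the STORAGE- token: SELECT * FROM STORAGE-Table_DpcStellar ORDER BY transitId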

| KEY | Value/TableName | Description |
| Table_Cu6RawSpecDesc | incinfo_rawspectrumdesc | RawSpectrumDescription created from IncorporateInformation. Other modules may update MaskingSamples in this table. |
| Table_Cu6WindowSpec | incinfo_windowedspectrum | WindowedSpectrum created from IncorporateInformation. |
| Table_DpcStellar | incinfo_dpctransitcharact | DpcStellarSourceTransitCharacteristics created from IncorporateInformation. Other modules may update fields in this table. STA has its own output, see below. |
| Table_PBGDPC | incinfo_dpctransitcharact_contams | DpcStellarSourceTransitCharacteristics containing contaminants for PBG. Created from IncorporateInformation. |
| Table_ObjectLogRVS | mdbcu3idtrawobjectlogrvs | ObjectLogRVS ingested from IDT and used by BiasNU. NOTE: used as the primary batching table as well. |
| Table_BiasRecordDt | mdbcu3idtintermbiasrecorddt | BiasRecordDt objects ingested from IDT. |
| Table_Cu3RawSpectro | mdbcu3idtrawspectroobservation | Primary raw spectro observation ingested from IDT. |
| Table_xmmatch | mdbcu3idtxmmatch | Cross Match table for sourceId to TransitId. Used in IncorporateInformation to assign sourceId. Not originally given and is in a dump extension file. |
| Table_CompleteSource | mdbcu1integratedcompletesource | CompleteSource information. Might be used in IncorporateInformation to assign alpha & delta along with atm params. Not originally given and is in a dump extension file. |
| Table_AcShifts | mdbcu3idtrawacshifts | AC Shift data ingested from IDT and used in BiasNU. |
| Table_StarNormal | starnormal | StarNormal used in STA. |
| Table_SmallStarNormal | small_starnormal | Smaller version of StarNormal used in PBG. |
| Table_HackedSmoothTemplates | hackedsmoothtemplates | SmoothTemplates, but modified slightly for PBG in the current version. |
| Table_ExtDetAtmTempCal | ext_detatm_templatecalibratedspectrum | CalibratedSpectrum templates used for DetermineAtm. |
| Table_ExtDetAtmTempSpecDesc | ext_detatm_templatespectrumdescriptions | TemplateSpectrumDescriptions used for DetermineAtm. |
| Table_CalcRvsTransit | preproc_transitdescription | TransitDescription output from CalcRvs. |
| Table_PostCalcFieldWindow | preproc_fieldangles_windowedspectrum | WindowedSpectrum output from CalcFieldAngles. |
| Table_SpectrumDescription | ext_selsynth_spectrumdescription | TemplateSpectrumDescription output from SelSynthSpectra. |
| Table_BGWindow | backgroundwindowedspectrum | Initialized WindowedSpectrum background for BasicCleaning. Used if no results from PBG for that particular transit. |
| Table_SpectralLibrary | spectrallibrary | SpectralLibrary used in various modules. |
| Table_SmallSpectralLibrary | small_spectrallibrary | Smaller version of SpectralLibrary used in PBG. |
| Table_ExtDetAtmSpecDesc | ext_detatm_spectrumdescription | Output of DetAtm. |
| Table_PCAAUX | pcaaux | PCAAux record. |
| Table_DpcMeanCharact | incinfo_dpcmeancharact | DPCMean Characteristics generated by IncorporateInformation. |
| Table_BaryVelCorr | mdbcu3idtintermbaryvelocorr | IDT ingested information for barycentric velocity correction. |
| Table_WaveRecord | calibrationwavelengthrecord | Wavelength Records, initialised with data for prelim and populated by calibration for full extraction and other chains. cauNumber equals 0 for initial data. |
| Table_PhotRecord | calibrationphotometricrecord | Photometric Records, initialised with data for prelim and populated by calibration for full extraction and other chains. cauNumber equals 0 for initial data. |
| Table_PCALSF | pcalsf | PCALSF records, initialized with data for prelim and populated by calibration for full extraction and other chains. cauNumber equals 0 for initial data. |
| Table_MDBStellar | incinfo_mdbtransitcharact | MDBStellarSourceTransitCharacteristics populated by IncorporateInformation. STA will have its own output, see below for the table name. |
| Table_STACrossCorrelationFunc | sta_crosscorrelationfunc | STA Cross Correlation function outputs. |
| Table_STADPCTransitCharact | sta_dpctransitcharact | STA output of DPCStellarSourceTransitCharacteristics. |
| Table_STAMDB | sta_mdbtransitcharact | STA output of MDBStellarSourceTransitCharacteristics. |
| Table_NormPolySpectra | prelim_normpoly_calibratedspectrum | Used in Preliminary for the normalisation - Continuum Poly module. |
| Table_NormPolySpectra | full_ext_normpoly_calibratedspectrum | Used in Full Extraction for the normalisation Continuum Poly module. |
| Table_NormSpectra | ext_normalized_calibratedspectrum | Used in Full Extraction for Normalisation plus merger of the full_ext_normpoly_calibratedspectrum. |
| Table_RodcDflm | rodcdflm2D | Output of Calibration. |
| Table_CleanWindow | prelim_ext_basicclean_windowedspectrum | WindowedSpectrum output of BasicCleaning. |
| Table_CalCCD | prelim_calibrationccdrecord | CalibrationCCDRecord used in preliminary extraction and, at the time of writing, also used in full extraction. |
| Table_ExtWindow | prelim_ext_collapse_extractedspectrum | ExtractedSpectrum output of Deblend to Collapse, prelim extraction. |
| Table_ExtWindow | full_ext_collapse_extractedspectrum | ExtractedSpectrum output of Deblend to Collapse, full extraction. |
| Table_PostApply | prelim_ext_applycal_calibratedspectrum | CalibratedSpectrum output from ApplyCalibrations, prelim extraction. |
| Table_PostApply | full_ext_applycal_calibratedspectrum | CalibratedSpectrum output from ApplyCalibrations, full extraction. |
| Table_BiasNU | prelim_ext_biasnu_windowedspectrum | WindowedSpectrum output from BiasNU, prelim extraction. |
| Table_BiasNU | full_ext_biasnu_windowedspectrum | WindowedSpectrum output from BiasNU, full extraction. |
| Table_BGWindow2 | prelim_ext_pbg_windowedspectrum | WindowedSpectrum output from PBG, prelim extraction. |
| Table_BGWindow2 | full_ext_pbg_windowedspectrum | WindowedSpectrum output from PBG, full extraction. |
| Table_ACPeakPosition | acpeakposition | Output from Calibration, currently a wavelength data model. |
| Table_MDBStellarMean | mta_mdbmeancharact | MTA MeanCharacteristics, but is initially created at an early stage for STA. |

Advanced Information and FAQ

Odyssey

More Odyssey information can be found here, though much of the detail can be discovered by simply viewing the xml files. This wiki page will, however, help you better understand Odyssey, as well as the Logic classes that provide various special capabilities.

Information on Batching

Extraction and STA work on a batch mechanism defined in the xml/TransitLoop.xml file. A special Logic class was developed that by default uses rawspectroobservation to construct the batches, but it has properties that can override this. Chris Dolding did exactly this, using the ObjectLogRVS table to construct a more correct batch for use with BiasNU. Other information that can be overridden can be found here. By default the batch size is roughly 5000 transits, or 15000 including strips. In fact STA, being the slower one of the group, was overridden to do 500 instead of 5000 in a batch; that way it is a little easier to see the batches being written to the database quickly.

Information on Running Odyssey in Parallel

Odyssey does not have automatic parallelisation by default, but what you can do is start Odyssey up several times, each with a slightly different batch-processing configuration in the xml file. So you can run several batches at once, and you may even run them from different machines provided your database is accessible from other machines. There are some differences between the chains, so what needs to happen is described below for each.

One key step for all of them: when starting Odyssey via 'ant', be sure to add a -Dnoclean except for the first run. So wait a few minutes after the first run, then start the other parallel Odyssey commands. Just like the property file, you need to add a '-Dnoclean' to the ant command.

This tells Odyssey not to clean or delete data in the tables at all; normally cleaning is turned on in the xml files to clear out data in the output tables during the initial run through the Facades. Again, start Odyssey the first time to let it clean out data (wait about 5 minutes), then you should be safe to run all the other commands with 'no' cleaning.
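
As a sketch, a second parallel run of the Preliminary Extraction chain would look exactly like the command given earlier with the no-clean flag added (the log filename is arbitrary); the batch range itself is customised in the xml as described in the next subsection:

  • ant -Dnoclean -Dfilename=xml/FullChain_PrelimExtraction.xml -DpropertyFile=conf/pipeline_template.properties runpipeline > mylogprelimextraction_run2.txt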

PrelimExtraction, FullExtraction, and STA

These chains are all similar and use a special DBLoopLogic class to compute the batches that need to be performed. This Logic class allows you to set properties telling it how you want the batching customised. The three main things that can be set up are:

  • How many per batch (default 5000); STA for example does 500:
    <putInStorage values="LOOPARRAYINCREMENT,500" />
  • Start of the batch (default 0). It must be an even number:
    <putInStorage values="LOOPARRAYINDEX,500" />
  • End of the batch (default none, goes to the end):
    <putInStorage values="LOOPARRAYSTOPINDEX,2000" />

  • Finally, and rarely used: by default it is assumed you will be querying based on transitId, and the spectroobservation table is used by default. But you can really base your query on any table; in fact you will notice that in TransitLoop.xml we now use the objectlog table, to make certain we pick up all transits with no gaps, as suggested by ChrisD.
    <putInStorage values="LOOPSQL,SELECT transitid FROM mdbcu3idtrawobjectlogrvs ORDER BY transitid" />

These lines MUST be contained inside a 'test' xml element. For example they could go into TransitLoop.xml or a TableSettings.xml, or into one of the chain xmls like 'FullChain_PrelimExtraction.xml', but again they would need to be inside a 'test', so place it near the top, i.e.:

<test id="MyBatchCustomization" outputType="DB" facade="mssl.ucl.odyssey.commonfacade.gaia.DummyFacade">
   <putInStorage values="LOOPARRAYINDEX,500" />
   <putInStorage values="LOOPARRAYSTOPINDEX,2000" />
</test>

  • Why does the LOOPARRAYINDEX need to be even? In short, the Logic class places the min transitId of a batch in the first index, i.e. 0, and the max transitId for that batch in the next index, i.e. 1. So every even number 0, 2, 4 corresponds to the minTransitId of a batch.
  • How do I know how many batches there could be?
    • Good question: the easy answer is to use one of the mysql statements above and run: select count(*) from mdbcu3idtrawobjectlogrvs; (or mdbcu3idtrawspectroobservation) then divide that number by LOOPARRAYINCREMENT (or 5000 if using the default). NOTE: In most queries in the logs you will see nearer 15,000 objects loaded; this is because of the 3 strips in most queries.
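    • For example (a sketch assuming the 'gaia' user and 'pipeline_template' database from the setup above): mysql -u gaia --password=gaia pipeline_template -e"select count(*) from mdbcu3idtrawobjectlogrvs" then divide the result by your LOOPARRAYINCREMENT.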

MTA

MTA is based on sourceId and uses a different Logic class called 'DBIDStackLogic', which places ids on a stack and simply pops off an id each time it is called. To run in parallel it is easiest to simply change the LOOPSQL, currently defined in SourceLoop.xml:
 <putInStorage values="LOOPSQL,SELECT distinct sourceId FROM sta_mdbtransitcharact" />
Simply change this to have a LIMIT Start,HowMany, e.g. (a 3rd process doing batches of 8000):
<putInStorage values="LOOPSQL,SELECT distinct sourceId FROM sta_mdbtransitcharact limit 16000,8000" />

  • Note again, like above, it uses sta_mdbtransitcharact as the base table, on the assumption that STA has run or that data was ingested into this table; other tables could be used.
  • Like above, to determine how many batches: select count(distinct sourceId) from sta_mdbtransitcharact
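  • For example (again a sketch assuming the 'gaia' user and 'pipeline_template' database from the setup above): mysql -u gaia --password=gaia pipeline_template -e"select count(distinct sourceId) from sta_mdbtransitcharact" then divide the result by the LIMIT batch size you chose.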

Calibration

Calibration already has several filters (i.e. temperature range, bright stars, and others), plus its batching is done differently; because of all these filters it tends to complete in quite a reasonable time. It is therefore not foreseen to bother running Calibration in parallel at this time.

Unique ID's for each call in the chain.

As mentioned above you can skip or run particular tests. To use this capability you must know the 'id' of that particular test. There is currently no table listing all the ids, but they are fairly self-explanatory, i.e. TestBasicCleaning runs the Basic Cleaning module and so forth. To find the ids in a particular module run: grep "test id" {DU}/xml/* i.e. grep "test id" Extraction/xml/*


Here is an image of all the Facades carried out by Odyssey. It must be noted these are just Facade names; use the technique above if you need to discover the 'id' that is being used.

  • Slide12.jpg: image of the full chain of Facades

Ingesting and running IncorporateInformation on your own

A few people have asked about ingesting their own datasets and running IncorporateInformation. Here are the instructions.
  • Download the ingestor located here (ingestor.zip)
  • Unzip the file
    • This zip file is really just the Ingestor found in the CU1 svn directory, with the mysql jdbc driver. The zip file is much bigger than it needs to be, but there has not been time to properly trim it down, so currently the entire directory is bundled.
  • In the jobconf directory you will find properties files. Copy one of them (pipeline_template or or3_day1_final are the good ones) to make a duplicate for your db, i.e. sta_test.properties, and edit the file.
    • Verify any jdbc url is using your db host, hence localhost if it is on your local machine, and your new db name.
    • At the end of the file you will see the DM mapping to a table name. These should map to the correct table names you are trying to ingest, e.g. sta_dpctransit... It might already be set up, so change the mappings if needed or add new ones.
  • Now you're ready to run the ingestor. Here is a sample command:
    • java -Dlog4j.configuration=file:conf/logging.properties -Xmx900M -cp dist/MDBExtIng-14.0.0.jar gaia.cu1.ingest.GaiaDataIngestorLauncher -t false -c 'gaia.cu1.mdb.cu3.idt.xm.dm.Match' -d /mydirectoryofgbins --database jobconf/sta_test.properties
  • TIPS:
    • Gbin names need to start with the name of the DM class or table name defined in the properties. i.e. Match.1.gbin Match.2.gbin
    • You can do more than one class at a time by separating them with a space in the -c option, i.e. -c 'gaia.cu1.mdb.cu3.idt.xm.dm.Match gaia.cu6....RawSpectrumDescription'
    • If you need to ingest a DPC or a different MDB Data Model then add it to the -cp option, i.e. -cp GaiaCu6DPC-14.jar:dist/MDBExtIng-14.0.0.jar. You can place a GaiaMdb jar before the 'dist' entry to pick up new MDB data models if needed.
    • The '-t false' tells the Ingestor to create the table for you. This has sometimes not quite worked. If so, do a 'jar xvf GaiaMdb-.jar' and look at the conf/mdb.properties to find the DM with the Create table command for mysql; using the mysql statements as given at the top of this wiki you can then create the table manually. You may also wish to run the command 'alter table {tablename} add index {column}' if you wish to add indexes before ingesting (an example is given in the next tip).
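    • For example, to add an index on transitId before ingesting STA transit characteristics (a sketch only; the table and column are chosen for illustration, and the 'gaia' user is the one from the setup above): mysql -u gaia --password=gaia pipeline_template -e"alter table sta_dpctransitcharact add index (transitId)"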

IncorporateInformation: Here are the instructions to run your own IncorporateInformation. This has always been more difficult than the other modules, hence the luxury you normally get from MSSL doing it first and then distributing the database. So possibly expect a few issues.

  • Go to the PreProcessing directory.
  • IncorporateInformation is always run separately from the rest of the chain. Look at and edit the xml/IncoporateInformationStandAlone.xml
  • This file has include statements for the parts of IncorporateInformation. My advice is to uncomment and run each xml file one at a time. The hardest is the first one; the rest are fairly easy and straightforward.
    • CreateDPCStellar_Odyssey.xml is the xml file to create DPCStellarTransitCharacteristics; it queries a variety of tables. It is just impossible to get the SQL to match all the various tables perfectly, hence the Facade does looping to find the correct objects when needed. Note this xml file will create everything in a contams table for PBG; the FinishIncInfo.xml will separate contams and targets.
  • It is expected you have ingested the needed external tables, such as the various IDT tables, i.e. xmmatch, odas source.
  • Since we do not use DpcBatch, we had to make our own facades for DpcStellarTransitCharacteristics. If you feel you need to edit, view, or change something, then analyse the xml file to see which facade is being called; you can find IncorporateInformation with the source code here (incinfomain.zip)
  • Sample call: ant -Dfilename=xml/IncoporateInformationStandAlone.xml -DpropertyFile=conf/pipeline_template.properties runpipeline > mylogfullincinfo.txt
    • Note the properties file is in a different conf directory located in PreProcessing. Presumably you could use ../conf/pipeline_template.properties, but this has not been tried.

FAQ

  • Does this chain mirror the SAGA/CNES Framework? Tricky answer: NO, it does not mirror it exactly; here is a list of areas where it differs.
    • The PBG and STA parts of the chain are very complicated, with special coding in many places. With the primary developers not located at MSSL it would take considerable time and effort to integrate each part into Odyssey. The easy solution: since the primary developers (Ronny and Katja) already had Full Chain type classes that call everything, i.e. similar to a Testbed or STACombine, Odyssey could simply use a Logic class to call one of these classes. So the primary developers supplied a class with one method that takes all the inputs and begins calling all the facades needed to produce an output. The best way to think of this is that PBG and STA each supply a class that is much like a Super-Facade.
    • PBG will always give a result for all the transits, including transits that were not processed for a particular reason. This was to help the Odyssey integration, so that an SQL query to the database is certain to line up all the transits for the BasicCleaning inputs; a special Logic class checks whether a particular input has no real data before BasicCleaning and sets the input to 'NULL'.
    • We don't use DpcBatch anywhere in the chain for inputs or outputs.
    • IncorporateInformation - All simulations given in the past for distribution did not include the PrimarySource or SecondarySource data models, which are used to populate fields (alpha, delta, atm params) in DpcStellarSourceTransitCharacteristics. A special facade was created at MSSL to use CompleteSource (or, failing that, Igsl or IdtSource or finally TrueValue) type catalogues that also hold this information.
    • Batching I suspect would be done differently.
  • Why did you say the batch is roughly 5000 and not exactly?
    • The batch mechanism works by obtaining 5000 transits from the ObjectLogRVS table. If for some reason SpectroObservation were missing some transits, then when it comes to querying SpectroObservation or other tables (which were originally constructed from SpectroObservation) you could get a lower number, e.g. 3000.
  • Can I run the xml files at the same time, for example PrelimExtraction and Calibration?
    • In theory 'yes', but you must let PrelimExtraction run for a while first. The other issue is that if Calibration is faster, it would eventually catch up and have no data to process. STA is slower, so running FullExtraction followed by STA is probably safe.
  • I see some text like 'KMB' in the log file. Yes, some of this will be cleaned up at a later stage. You have found an issue that was being debugged with println statements placed in various locations to find the culprit.
  • My machine is running slow, can I speed it up? Can I change the batch size?
    • The most likely culprit is that the batch size is too large and your machine is holding too much in memory. STA's default is smaller so we can see results written to the database a little quicker. To shorten the batch size for other processing (except Calibration), add this line to your xml/PrimaryTableSettings.xml:
      •    <putInStorage values="LOOPARRAYINCREMENT,500" /> 

  • How can I monitor where I am in the batch while processing? This can be improved in Odyssey, but for now do these steps.
    • grep Current {logfile}
      • This command will list an index number along with the current min and max transitId for that batch. The number will always count evenly, e.g. 0, 2, 4, 6... The reason is that an array is kept in which a min id {transitId} and a max id {transitId} are stored in pairs. So indices 0-1 hold your first multiple of your batch increment, e.g. 0-5000; indices 2-3 will be 5000-10000, and so on.
    • To get an idea of how many batches then check how many multiples (of your batch size) are in spectroobservation:
      • mysql -u gaia --password=gaia {dbname e.g. pipeline_template} -e"select count(*) from mdbcu3idtrawspectroobservation";
      • Or to see how many transits to go then look at your max transitId number and do: mysql -u gaia --password=gaia pipeline_template -e"select count(*) from mdbcu3idtrawspectroobservation where transitId > {maxtransit}";
  • I see an error/bug/investigation that needs to happen, how do I run it with new fixes in the code?
    • Build a new jar/lib and add it to the {DU}/locallib directory.
  • I wish to run IncorporateInformation because I see some issues that need fixing.
    • Ok, if you really must: go to PreProcessing and run a similar ant command as above:
      • ant -Dfilename=xml/IncorporateInformationStandAlone.xml -DpropertyFile=conf/pipeline_template.properties runpipeline
    • You must edit this IncorporateInformationStandAlone.xml and uncomment the areas you wish to run.
    • You will find the current IncorporateInformation that I run: here
    • NOTE: I am leaving it up to your technical skills to know what to run and solve.
  • Are there Exceptions I am likely to see? YES.
    • Oga3 and Oga1 exceptions might show from CalcFieldAngles. Do not worry about this, it will find the Oga2 records.
    • Infinite or NaN on a few records when running DeblendToCollapse part of PrelimExtraction and FullExtraction. Should be ignored for now.
    • FullExtraction at the same area as the above Exception might say 'Null wavelength calibration records'. Again should be ignored, this happens because you ran FullCalibration that did not produce all wavelength records because it did not have enough data to work with. Does not happen if you use the ShortCircuit routine.
    • Same issue as the above point will cause a CutEdges exception and ArrayIndexOutofBounds.
    • FullCalibration on the pipeline_template dataset will show, near the end of the log file, a Select Synth Spectra Exception saying the source transits are null. I believe this can be ignored, but it needs to be checked. It happens near the end and is suspected to be just a very tiny batch that cannot find any records. Calibration still outputs data.
    • STA outputs records, but the log file will have an exception: 'incorrect MdbCharacteristics', along with a few more lines above it. This will probably need Ronny to investigate.

-- KevinBenson - 13 Mar 2013
