mspasspy.reduce

mspasspy.reduce.mspass_dask_foldby(self, key='site_id')

This function implements a convenient foldby method for a dask bag that generates ensembles from atomic data. It assembles mspasspy.ccore.seismic.TimeSeries or mspasspy.ccore.seismic.Seismogram objects that share a common value of a single Metadata key into ensembles. The outputs are mspasspy.ccore.seismic.TimeSeriesEnsemble and mspasspy.ccore.seismic.SeismogramEnsemble objects, respectively. Note that “foldby” in this context acts a bit like a reduce operator, but the data volume is not reduced; related data are simply bundled into ensembles. The outputs are always larger than the inputs because of the overhead of the ensemble container. Be careful with this operator: it can easily create huge ensembles that could cause a memory fault in your workflow.

Note that because this is implemented as a method of the bag class, its usage differs from that of the map and reduce methods. arg0 is “self”, so the input bag must be passed as the first argument.

Example:

from mspasspy.reduce import mspass_dask_foldby

# preceded by a set of map-reduce lines that create a bag called mydata
mydata = mspass_dask_foldby(mydata, key="source_id")

Parameters:

key – The key that defines the gather. By default, it will use “site_id” to produce a common station gather.

Returns:

dask.bag.Bag of mspasspy.ccore.seismic.TimeSeriesEnsemble or mspasspy.ccore.seismic.SeismogramEnsemble.
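
The bundling described above can be illustrated without dask or MsPASS. The following is a minimal pure-Python sketch, not the library's implementation: plain dicts stand in for atomic data objects and their Metadata, a plain list stands in for an ensemble, and fold_by_key is a hypothetical helper name.

```python
from collections import defaultdict


def fold_by_key(records, key="site_id"):
    """Group records that share the same value of `key` into lists.

    Hypothetical stand-in for the bundling idea only; the real function
    operates on a dask bag and returns ensemble objects.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    # One "ensemble" (here a plain list) per distinct key value
    return list(groups.values())


# Dicts stand in for TimeSeries/Seismogram objects with Metadata
records = [
    {"site_id": "A", "source_id": 1},
    {"site_id": "B", "source_id": 1},
    {"site_id": "A", "source_id": 2},
]
ensembles = fold_by_key(records, key="site_id")
# Total number of members across all groups
member_count = sum(len(e) for e in ensembles)
```

Every input record ends up in exactly one group, so nothing is discarded; the data are only rebundled, which is why the output is never smaller than the input.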

mspasspy.reduce.mspass_spark_foldby(self, key='site_id')

This function implements a convenient foldby method for a spark RDD that generates ensembles from atomic data. It assembles mspasspy.ccore.seismic.TimeSeries or mspasspy.ccore.seismic.Seismogram objects that share a common value of a single Metadata key into ensembles. The outputs are mspasspy.ccore.seismic.TimeSeriesEnsemble and mspasspy.ccore.seismic.SeismogramEnsemble objects, respectively. Note that “foldby” in this context acts a bit like a reduce operator, but the data volume is not reduced; related data are simply bundled into ensembles. The outputs are always larger than the inputs because of the overhead of the ensemble container. Be careful with this operator: it can easily create huge ensembles that could cause a memory fault in your workflow.

Note that because this is implemented as a method of the RDD class, its usage differs from that of the map and reduce methods. arg0 is “self”, so the input RDD must be passed as the first argument.

Example:

from mspasspy.reduce import mspass_spark_foldby

# preceded by a set of map-reduce lines that create an RDD called mydata
mydata = mspass_spark_foldby(mydata, key="source_id")

Parameters:

key – The key that defines the gather. By default, it will use “site_id” to produce a common station gather.

Returns:

pyspark.RDD of mspasspy.ccore.seismic.TimeSeriesEnsemble or mspasspy.ccore.seismic.SeismogramEnsemble.
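
Because a single key value can match many records, it may help to count members per key before folding. The following is a minimal sketch in plain Python, with dicts standing in for data objects; count_members and the max_members threshold are hypothetical and not part of mspasspy. On a real RDD the same count would have to be computed with Spark operations.

```python
from collections import Counter


def count_members(records, key="site_id"):
    """Count how many records share each value of `key`."""
    return Counter(rec[key] for rec in records)


# Dicts stand in for atomic data objects with Metadata
records = [{"site_id": "A"}, {"site_id": "A"}, {"site_id": "B"}]
counts = count_members(records)

# Hypothetical limit on ensemble size, chosen only for illustration
max_members = 1
oversized = sorted(k for k, n in counts.items() if n > max_members)
```

Key values listed in oversized would produce ensembles above the chosen limit and are candidates for a finer grouping key or for filtering before the fold.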

mspasspy.reduce.stack(data1, data2, object_history=False, alg_id=None, alg_name=None, dryrun=False)

This function sums the data fields of two mspasspy objects and stores the result in data1. It is wrapped by mspass_reduce_func_wrapper, so the history and error logs can be preserved.

Parameters:
  • data1 – input data, only mspasspy data objects are accepted, i.e. TimeSeries, Seismogram, Ensemble.

  • data2 – input data, only mspasspy data objects are accepted, i.e. TimeSeries, Seismogram, Ensemble.

  • object_history – True to preserve the history. For details, refer to mspass_reduce_func_wrapper.

  • alg_id – A unique id that records the usage of this function when history is preserved. Used in mspass_reduce_func_wrapper.

  • alg_name (str) – The name of the function to record when history is preserved.

  • dryrun – True for a dry run, which returns “OK”. Used in mspass_reduce_func_wrapper.

Returns:

data1 (modified).
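
Returning the modified first argument is what lets a function of this shape serve as the binary operator of a reduce. The following is a minimal sketch of that pattern in plain Python, not the MsPASS implementation: FakeTrace and stack_sketch are hypothetical stand-ins, and the length check is an assumption of this sketch, since the description above does not say how mismatched data are handled.

```python
from functools import reduce


class FakeTrace:
    """Hypothetical stand-in for a data object holding a sample array."""

    def __init__(self, samples):
        self.data = list(samples)


def stack_sketch(data1, data2):
    """Add data2's samples to data1 in place and return data1."""
    # Assumption of this sketch: both inputs must have the same length
    if len(data1.data) != len(data2.data):
        raise ValueError("sample counts differ")
    for i, value in enumerate(data2.data):
        data1.data[i] += value
    return data1


traces = [FakeTrace([1.0, 2.0]), FakeTrace([3.0, 4.0]), FakeTrace([5.0, 6.0])]
first = traces[0]
# reduce feeds the running sum back in as data1 on each step
total = reduce(stack_sketch, traces)
```

After the reduce, total is the same object as first, so the first input has been overwritten with the running sum while the other inputs are unchanged. Copy the first object beforehand if the original samples are still needed.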