mspasspy.util
converter
Functions for converting to and from MsPASS data types.
- mspasspy.util.converter.AntelopePf2dict(pf)[source]
Converts an AntelopePf object to a Python dict by recursively decoding the tbls.
- Parameters:
pf (AntelopePf) – AntelopePf object to convert.
- Returns:
Python dict equivalent to pf.
- Return type:
dict
- mspasspy.util.converter.Metadata2dict(md)[source]
Converts a Metadata object to a Python dict.
This is the inverse of dict2Metadata. It converts a Metadata object to a Python dict. Note that Metadata behaves like a dict, so this conversion is usually not necessary.
- Parameters:
md (Metadata) – Metadata object to convert.
- Returns:
Python dict equivalent to md.
- Return type:
dict
- mspasspy.util.converter.Pf2AttributeNameTbl(pf, tag='attributes')[source]
This function will parse a pf file to extract a tbl with a specific key and return a data structure that defines the names and types of each column in the input file.
The structure returned is a tuple with three components:
- index 0: Python list of attribute names in the original tbl order. This is used to parse the text file, so the order matters a lot.
- index 1: parallel list of types for each attribute. These are actual Python type objects that can be used as the second arg of isinstance.
- index 2: Python dictionary keyed by attribute name that defines what a null value is for that attribute.
- Parameters:
pf – AntelopePf object to be parsed
tag – &Tbl tag for section of pf to be parsed.
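The three-component tuple described above could be used to parse one row of a text table, as in the following minimal sketch. The attribute names, types, and null values are made-up stand-ins for what the function would return, and `parse_row` is a hypothetical helper, not part of MsPASS.

```python
# Hypothetical stand-in for the tuple returned by Pf2AttributeNameTbl.
names = ["sta", "lat", "nsamp"]                    # index 0: names in tbl order
types = [str, float, int]                          # index 1: parallel Python types
nulls = {"sta": "-", "lat": -999.0, "nsamp": -1}   # index 2: null value per name


def parse_row(line, names, types, nulls):
    """Convert one whitespace-delimited row to a dict, skipping null values."""
    row = {}
    for name, typ, token in zip(names, types, line.split()):
        value = typ(token)          # cast the text token to the declared type
        if value != nulls[name]:    # drop attributes holding the null value
            row[name] = value
    return row


row = parse_row("AAK 42.639 -1", names, types, nulls)
```

Because the names and types are parallel lists, the column order in the text file has to match the tbl order exactly.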
- mspasspy.util.converter.Seismogram2Stream(sg, chanmap=['E', 'N', 'Z'], hang=[90.0, 0.0, 0.0], vang=[90.0, 90.0, 0.0])[source]
Convert a mspass::Seismogram object to an obspy::Stream with 3 components split apart.
mspass and obspy have completely incompatible approaches to handling three component data. obspy uses a Stream object that is a wrapper around a list of Trace objects. mspass stores 3C data bundled into a matrix container. This function takes the matrix container apart and produces the three Trace objects obspy wants to define 3C data. The caller is responsible for how they handle bundling the output.
A very dark side of this function is that any error log entries in the input mspass Seismogram object will be lost in this conversion as obspy does not implement that concept. If you need to save the error log you will need to save the input of this function to MongoDB to preserve the errorlog it may contain.
- Parameters:
sg (Seismogram) – the Seismogram object to be converted
chanmap (list) – 3 element list of channel names to be assigned components
hang (list) – 3 element list of horizontal angle attributes (azimuth in degrees) to be set in Stats array of output for each component. (default is for cardinal directions)
vang (list) – 3 element list of vertical angle (theta of spherical coordinates) to be set in Stats array of output for each component. (default is for cardinal directions)
- Returns:
obspy Stream object containing a list of 3 Trace objects in mspass component order. Presently the data are ALWAYS returned to cardinal directions (see above). It will be empty if sg was marked dead.
- Return type:
obspy.core.stream.Stream
- mspasspy.util.converter.SeismogramEnsemble2Stream(sge)[source]
Convert a seismogram ensemble to a stream.
- Parameters:
sge – seismogram ensemble input
- Returns:
stream
- mspasspy.util.converter.Stream2Seismogram(st, master=0, cardinal=False, azimuth='azimuth', dip='dip')[source]
Convert obspy Stream to a Seismogram.
Convert an obspy Stream object with 3 components to a mspass::Seismogram (three-component data) object. This implementation actually converts each component first to a TimeSeries and then calls a C++ function to assemble the complete Seismogram. This has some inefficiencies, but the assumption is this function is called early on in a processing chain to build a raw data set.
- Parameters:
st – input obspy Stream object. The object MUST have exactly 3 components or the function will throw an AssertionError exception. The program is less dogmatic about start times and number of samples as these are handled by the C++ function this python script calls. Be warned, however, that the C++ function can throw a MsPASSError exception that should be handled separately.
master – a Seismogram is an assembly of three channels created from three TimeSeries/Trace objects. Each component may have different metadata (e.g. orientation data) and common metadata (e.g. station coordinates). To assemble a Seismogram a decision has to be made on which component has the definitive common metadata. We use a simple algorithm and clone the data from one component defined by this index. Must be 0, 1, or 2 or the function will throw a RuntimeError. Default is 0.
cardinal – boolean used to define one of two algorithms used to assemble the bundle. When true the three input components are assumed to be in cardinal directions (x1=positive east, x2=positive north, and x3=positive up) AND in a fixed order of E,N,Z. Otherwise the Metadata fetched with the azimuth and dip keys are used for orientation.
azimuth – defines the Metadata key used to fetch the azimuth angle used to define the orientation of each component Trace object. Default is ‘azimuth’ used by obspy. Note azimuth=hang in css3.0. Cannot be aliased - must be present in obspy Stats unless cardinal is true
dip – defines the Metadata key used to fetch the vertical angle orientation of each data component. Vertical angle (vang in css3.0) is exactly the same as theta in spherical coordinates. Default is obspy ‘dip’ key. Cannot be aliased - must be defined in obspy Stats unless cardinal is true
- Raise:
Can throw either an AssertionError or a MsPASSError (currently defaulted to pybind11’s default RuntimeError; the error message can be obtained by calling the what method of RuntimeError).
- mspasspy.util.converter.Stream2SeismogramEnsemble(stream)[source]
Convert a stream to a seismogram ensemble.
- Parameters:
stream – stream input
- Returns:
converted seismogram ensemble
- mspasspy.util.converter.Stream2TimeSeriesEnsemble(stream)[source]
Convert a stream to a timeseries ensemble.
- Parameters:
stream – stream input
- Returns:
converted timeseries ensemble
- mspasspy.util.converter.Textfile2Dataframe(filename, separator='\\s+', type_dict=None, header_line=0, attribute_names=None, rename_attributes=None, attributes_to_use=None, one_to_one=True, parallel=False, insert_column=None)[source]
Import a text file representation of a table and store its representation as a pandas dataframe. Note that even in the parallel environment, the dask dataframe is converted back to a pandas dataframe for consistency.
- Parameters:
filename – path to text file that is to be read to create the table object that is to be processed (internally we use pandas or dask dataframes)
separator –
The delimiter used for separating fields. The default is “\s+”, which is the regular expression for “one or more whitespace characters”.
For csv file, its value should be set to ‘,’. This parameter will be passed into pandas.read_csv or dask.dataframe.read_csv. To learn more details about the usage, check the following links: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html https://docs.dask.org/en/latest/generated/dask.dataframe.read_csv.html
type_dict – pairs of each attribute and its type, used to validate the type of each input item
header_line – defines the line to be used as the attribute names for columns. If it is < 0, attribute_names is required. Please note that if attribute_names is provided, the attributes defined in header_line will always be overridden.
attribute_names – This argument must be a list of (unique) string names to define the attribute name tags for each column of the input table. The length of the list must match the number of columns in the input table or this function will throw a MsPASSError exception. This argument is None by default, which means the function will use the line specified by the “header_line” argument as column headers defining the attribute names. If header_line is less than 0 this argument is required. When header_line is >= 0 and this argument (attribute_names) is defined, all the names in this list will override those stored in the file at the specified line number.
rename_attributes – This is expected to be a python dict keyed by names matching those defined in the file or attribute_names list (i.e. the pandas/dask dataframe column index names) with values defining strings to use to override the original names. That usage, of course, is most common to override names in a file. If you want to change all the names use a custom attribute_names list as noted above. This argument is mostly to rename a small number of anomalous names.
attributes_to_use – If used this argument must define a list of attribute names that define the subset of the dataframe attributes that are to be saved. For relational db users this is effectively a “select” list of attribute names. The default is None which is taken to mean no selection is to be done.
one_to_one – an important boolean used to control whether the output is filtered by rows. The default is True, which means every tuple in the input file will create a single row in the dataframe. (Useful, for example, to construct a wf_miniseed collection from css3.0 attributes.) If False the (normally reduced) set of attributes defined by attributes_to_use will be filtered with the pandas/dask dataframe drop_duplicates method. That approach is important, for example, to filter things like Antelope “site” or “sitechan” attributes created by a join to something like wfdisc and saved as a text file to be processed by this function.
parallel – When True we use the dask dataframe operation. The default is False, meaning the simpler pandas operators, which have an identical api, are used.
insert_column – a dictionary of new columns to add, and their value(s).
If the content is a single value, it can be passed to define a constant value for the entire column of data. The content can also be a list. In that case the list should contain the values that are to be set, and it must be the same length as the number of tuples in the table.
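The interaction of separator, attributes_to_use, and one_to_one can be illustrated with a standard-library sketch. This is not the pandas/dask implementation the function uses; `read_table` and the sample table are hypothetical, and values are left as strings for brevity.

```python
import re

# Made-up table: a header line followed by three rows.
text = """sta lat lon chan
AAK 42.64 74.49 BHZ
AAK 42.64 74.49 BHN
KBK 42.66 74.95 BHZ
"""


def read_table(text, separator=r"\s+", attributes_to_use=None, one_to_one=True):
    """Parse text into a list of row dicts (hypothetical stand-in)."""
    lines = text.strip().splitlines()
    names = re.split(separator, lines[0].strip())   # header line 0 names the columns
    rows = [dict(zip(names, re.split(separator, ln.strip()))) for ln in lines[1:]]
    if attributes_to_use is not None:
        # "select" only the requested attributes
        rows = [{k: r[k] for k in attributes_to_use} for r in rows]
    if not one_to_one:
        # keep the first occurrence of each distinct row, like drop_duplicates
        seen, unique = set(), []
        for r in rows:
            key = tuple(r.items())
            if key not in seen:
                seen.add(key)
                unique.append(r)
        rows = unique
    return rows


all_rows = read_table(text)
sites = read_table(text, attributes_to_use=["sta", "lat", "lon"], one_to_one=False)
```

With one_to_one left True every input line yields a row. With the reduced attribute list and one_to_one False, the two AAK channel rows collapse into a single site row.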
- mspasspy.util.converter.TimeSeries2Trace(ts)[source]
Converts a TimeSeries object to an obspy Trace object.
MsPASS can handle scalar data either as an obspy Trace object or as a mspass TimeSeries object. They capture nearly the same concepts. The main difference is that TimeSeries supports the error logging and history features of mspass while obspy, which is a separate package, does not. Obspy has a number of useful algorithms that operate on scalar data, however, so it is frequently useful to switch between Trace and TimeSeries formats. The user is warned, however, that converting a TimeSeries to a Trace object with this function will result in the loss of any error log information. For production runs, unless the data set is huge, we recommend saving the intermediate result AFTER calling this function if there is any possibility there are errors posted on any data. We say after because some warning errors from this function may be posted in elog. Since python uses call by reference, ts may thus be altered.
- Parameters:
ts (TimeSeries) – the TimeSeries object to be converted
- Returns:
an obspy Trace object from conversion of ts. An empty Trace object will be returned if ts was marked dead.
- Return type:
obspy.core.trace.Trace
- mspasspy.util.converter.TimeSeriesEnsemble2Stream(tse)[source]
Convert a timeseries ensemble to stream. Always copies all ensemble Metadata to tse members before conversion. That is necessary to avoid loss of data in the case where the only copy is stored in the ensemble’s metadata.
- Parameters:
tse – timeseries ensemble
- Returns:
converted stream
- mspasspy.util.converter.Trace2TimeSeries(trace, history=None)[source]
Convert an obspy Trace object to a TimeSeries object.
An obspy Trace object mostly maps directly into the mspass TimeSeries object with the stats of Trace mapping (almost) directly to the TimeSeries Metadata object that is a base class to TimeSeries. A deep copy of the data vector in the original Trace is made to the result. That copy is done in C++ for speed (we found a 100+ fold speedup using that mechanism instead of a simple python loop). There is one important type collision in copying obspy starttime and endtime stats fields. obspy uses their UTCDateTime object to hold time but TimeSeries only supports an epoch time (UTCDateTime.timestamp), so the code here has to convert from the UTCDateTime to epoch time in the TimeSeries. Note in a TimeSeries starttime is the t0 attribute.
The biggest mismatch in Trace and TimeSeries is that Trace has no concept of object level history as used in mspass. That history must be maintained outside obspy. To maintain full history the user must pass the history maintained externally through the optional history parameter. The contents of history will be loaded directly into the result with no sanity checks.
- Parameters:
trace (Trace) – obspy trace object to convert
history – mspass ProcessingHistory object to post to result.
- Returns:
TimeSeries object derived from obspy input Trace object
- Return type:
TimeSeries
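The starttime conversion described above amounts to replacing a date-time object with a floating point epoch time. The sketch below uses the standard-library datetime as a stand-in for obspy's UTCDateTime so it runs without obspy; the variable names are illustrative only.

```python
from datetime import datetime, timezone

# Stand-in for the UTCDateTime held in the Trace stats starttime field.
starttime = datetime(2020, 1, 1, 0, 0, 0, tzinfo=timezone.utc)

# Epoch time (seconds since 1970-01-01 UTC), the form stored as t0 in a TimeSeries.
t0 = starttime.timestamp()

# The conversion is reversible, so no time information is lost.
roundtrip = datetime.fromtimestamp(t0, tz=timezone.utc)
```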
- mspasspy.util.converter.dict2Metadata(dic)[source]
Function to convert Python dict data to Metadata.
pymongo returns a Python dict container from find queries to any collection. Simple types in returned documents can be converted to Metadata that are used as headers in the C++ components of mspass.
- Parameters:
dic (dict) – Python dict to convert
- Returns:
Metadata object translated from dic
- Return type:
Metadata
- mspasspy.util.converter.list2Ensemble(l, keys=None)[source]
Convert a list of TimeSeries or Seismograms to a corresponding type of Ensemble. This function will make copies of all the data to create a new Ensemble. Note that the Ensemble’s Metadata will always be copied from the first member. If the keys argument is specified, only the keys specified will be copied. If a key does not exist in the first member, it will be skipped and a complaint will be left in the error log of the ensemble.
- Parameters:
l – a list of TimeSeries or Seismograms
keys – a list of keys to be copied from the first object to the Ensemble’s Metadata
- Returns:
converted TimeSeriesEnsemble or SeismogramEnsemble
- mspasspy.util.converter.post_ensemble_metadata(ens, keys=[], check_all_members=False, clean_members=False)[source]
It may be necessary to call this function after conversion from an obspy Stream to one of the mspass Ensemble classes. This function is necessary because a mspass Ensemble has a concept not part of the obspy Stream object. That is, mspass ensembles have a global Metadata container. That container is expected to contain Metadata common to all members of the ensemble. For example, for data from a single earthquake it would be sensible to post the source location information in the ensemble metadata container rather than having duplicates in each member.
Two different approaches can be used to do this copy. The faster, but less reliable, method is to simply copy the values from the first member of the ensemble. That approach is enabled by default. It is completely reliable when used after a conversion from an obspy Stream but ONLY if the data began life as a mspass ensemble with exactly the same keys set as global. The type example of that is after an obspy algorithm is applied to a mspass ensemble via the mspass decorators.
A more cautious algorithm can be enabled by setting check_all_members True. In that mode the value for each key in the list received is compared against each member with a not equal (!=) test. Note we do not do anything fancy with floating point data to allow for finite precision. The reason is Metadata float values are normally expected to be constant data. In that case an != test will yield False when the comparison is between two copies. The not equal test may fail, however, if used with computed floating point numbers. An example where that is possible would be spatial gathers like PP data assembled by midpoint coordinates. If you need to build gathers in such a context we recommend you use an integer image point tied to a specialized document collection in MongoDB that defines the geometry of that point. There may be other examples, but the point is don’t trust computed floating point values to work. It will also not work if the values of a key-value pair don’t support an != comparison. That could be common if the value requested for copy was a python object.
- Parameters:
ens – ensemble data to be processed. The function will throw a MsPASSError exception if ens is not either a TimeSeriesEnsemble or a SeismogramEnsemble.
keys – is expected to be a list of metadata keys (required to be strings) that are to be copied from member metadata to ensemble metadata.
check_all_members – switch controlling method used to extract metadata that is to be copied (see above for details). Default is False
clean_members – when true data copied to ensemble metadata will be removed from all members. This option is only allowed if check_all_members is set True. It will be silently ignored if check_all_members is False.
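The two copy modes can be sketched with plain dictionaries standing in for the ensemble and member Metadata containers. `post_common_metadata` is a hypothetical helper, and skipping a key when members disagree is an assumption of this sketch, not documented behavior of post_ensemble_metadata.

```python
def post_common_metadata(ens_md, members, keys,
                         check_all_members=False, clean_members=False):
    """ens_md is a dict for ensemble metadata; members is a list of dicts."""
    if not members:
        return ens_md
    first = members[0]
    for key in keys:
        if key not in first:
            continue
        value = first[key]
        if check_all_members:
            # Cautious mode: copy only if every member holds an equal value.
            # Assumption: a key with mismatched values is simply skipped.
            if any(key not in m or m[key] != value for m in members):
                continue
            ens_md[key] = value
            if clean_members:
                for m in members:
                    del m[key]
        else:
            # Fast mode: trust the first member without checking the rest.
            ens_md[key] = value
    return ens_md


fast = post_common_metadata(
    {}, [{"source_id": 7, "sta": "AAK"}, {"source_id": 7, "sta": "KBK"}],
    ["source_id", "sta"])

members = [{"source_id": 7, "sta": "AAK"}, {"source_id": 7, "sta": "KBK"}]
checked = post_common_metadata({}, members, ["source_id", "sta"],
                               check_all_members=True, clean_members=True)
```

In the fast mode the member-specific "sta" value of the first member leaks into the ensemble metadata, which is the kind of error the cautious mode is meant to prevent.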
decorators
- mspasspy.util.decorators.is_input_dead(*args, **kwargs)[source]
A helper method to see if any mspass objects in the input parameters are dead. If one is dead, we should keep silent, i.e. perform no further operations on that dead mspass object. Note that an ensemble object is treated as dead only if all of its members are dead; otherwise it is still alive.
- Parameters:
args – any parameters.
kwargs – any key-word parameters.
- Returns:
True if there is a dead mspass object in the parameters, False if no mspass objects in the input parameters or all of them are still alive.
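The rule described above can be sketched as follows. `Datum` and `Ensemble` are hypothetical stand-ins for MsPASS data objects, `any_input_dead` is an illustrative name, and treating an empty ensemble as alive is an assumption of this sketch.

```python
class Datum:
    """Hypothetical stand-in for an atomic object with a boolean live flag."""

    def __init__(self, live=True):
        self.live = live


class Ensemble:
    """Hypothetical stand-in for an ensemble holding a list of members."""

    def __init__(self, members):
        self.member = list(members)


def any_input_dead(*args, **kwargs):
    """Return True if any data object among the arguments is dead."""
    for arg in list(args) + list(kwargs.values()):
        if isinstance(arg, Ensemble):
            # An ensemble counts as dead only when every member is dead.
            # Assumption: an empty ensemble is not treated as dead here.
            if arg.member and all(not m.live for m in arg.member):
                return True
        elif isinstance(arg, Datum) and not arg.live:
            return True
    # Non-data arguments are ignored.
    return False


live, dead = Datum(True), Datum(False)
```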
Janitor
Created on Wed Nov 13 14:56:53 2024
@author: pavlis
- class mspasspy.util.Janitor.Janitor(keepers_file=None, TimeSeries_keepers=None, Seismogram_keepers=None, ensemble_keepers=None, process_ensemble_members=True)[source]
Bases:
object
Generic handler to clean up the Metadata namespace of a MsPASS data object.
In data processing it is common for Metadata attributes to become inconsistent with the data. An example is a “chan” attribute, which makes no sense if the data have passed through an algorithm to convert a set of TimeSeries objects into Seismogram objects. The name of the class is meant as a memory device: the object is used to clear attributes that are junk/trash that need to be removed.
There are two fundamentally different conceptual ways for the Janitor to handle the trash. The first is to discard it forever. That is the approach of the method called clean. When clean is called on a datum the inconsistent attributes (trash) are thrown away (like garbage sent to a landfill). The alternative is to bag up the garbage and put it somewhere until you are ready to deal with it. That is the idea of the two methods called collect_trash and bag_trash. collect_trash removes trash attributes from the object but returns the trash in a container (implemented as a dictionary). bag_trash takes the result from collect_trash and posts it back to the object as a subdocument (a python dictionary) with a specified key. That mode can be useful for an experimental algorithm where you may need to pull some trash from the bag later but need to get the debris out of the way for understanding.
Handling of ensembles is potentially ambiguous as both the ensemble container itself and all the members have a Metadata container. For that reason the class has a separate set of keys that define attributes to be retained for ensembles. The default for ensembles is an empty list because in most cases the ensemble metadata is loaded from a normalizing collection and does not need to be retained (e.g. source attributes for a common source gather).
This class can be thought of as an inline version of the clean and clean_collection methods of mspasspy.db.database.Database. That is, the database versions can be used to do a similar operation on data previously stored in MongoDB. The methods of this class would normally occur as the function in a map operator or in an assignment in a serial loop.
The defaults for the class are designed to be appropriate for stock use. The variable args for the constructor can be used to override the list of attribute keys that the Janitor should treat as not junk. The defaults are loaded from a yaml file. You can also change the namespace by specifying an alternate name for the yaml file. The file is expected to contain keys to retain with the dictionary keys “TimeSeries” and “Seismogram”. See the default in mspass/data/yaml/Janitor.yaml to see the format.
- Parameters:
keepers_file (string defining yaml file name. Default is None which assumes a default path of $MSPASS_HOME/data/yaml/Janitor.yaml.) – yaml file containing the keys to be retained for TimeSeries and Seismogram objects.
TimeSeries_keepers (assumed to be a list of strings of attributes to be retained for TimeSeries objects. Default is None which causes this argument to be ignored and the yaml file to be used to define the list of attributes to be retained.) – Use to override the list defined in the yaml file for TimeSeries objects. If defined, it should be a list of attributes to be retained. Use this option with caution as the list is not checked for required Metadata. Use the default yaml file for guidance.
Seismogram_keepers (assumed to be a list of strings of attributes to be retained. Default is None which causes this argument to be ignored and the yaml file to be used to define the list of attributes to be retained.) – Use to override the list defined in the yaml file for Seismogram objects. If defined, it should be a list of attributes to be retained. Use this option with caution as the list is not checked for required Metadata. Use the default yaml file for guidance.
ensemble_keepers (list of strings defining keys of attributes that should not be treated as junk and retained.) – Use to override the content of the yaml file. If defined it replaces the yaml file definition of what attribute keys should be retained.
process_ensemble_members – boolean controlling behavior with ensemble objects. With ensembles there are two possible definitions of what Metadata container is to be cleaned. That is, the ensemble itself has a Metadata container and each atomic member has a Metadata container. When True the cleaning operators are applied to the members. When False the ensemble container is handled if the datum is an ensemble. This argument is ignored when processing atomic data.
- add2keepers(key, keeper_type='atomic')[source]
Adds a new key to the namespace for a data type.
It is often useful to extend the namespace during processing without having to create a special instance of a Janitor from a yaml file. Use this method to add a key to the list of attributes defined as a keeper.
- Parameters:
key (str) – key of the attribute to add to the list of keeper names in this Janitor. Note if the name already exists in the list this method does nothing.
keeper_type (str (default "atomic")) – defines the data type to which key should be added. Normal use is one of the following keywords, whose use should be clear: “TimeSeries”, “Seismogram”, and “ensemble”. Also accepts the special keyword “atomic”, which means the key is added to the list of keepers for both TimeSeries and Seismogram objects.
- bag_trash(datum, trashbag_key='trash', ensemble_trashbag_key='ensemble_trash')[source]
This method bundles up trash in a python dictionary and posts it back to the datum with a specified key. This allows the trash data to be retained but put into the auxiliary trash bag container to simplify the datum’s Metadata namespace. The idea is to allow an algorithm to pull junk from the trashcan if necessary at a later stage.
Note for ensembles there are two different entities that are handled separately. Any attributes in the ensemble’s Metadata are processed against self.ensemble_keepers and any junk is bagged into a dictionary stored back into the ensemble’s Metadata container with the key defined by ensemble_trashbag_key. Members are only processed if self.process_ensemble_members is True. When True the members will also be passed through this method with the handling depending on the type of the members.
- Parameters:
datum (Must be a MsPASS data object (TimeSeries, Seismogram, TimeSeriesEnsemble, or SeismogramEnsemble) or this method will raise a MsPASSError exception marked Fatal.) – MsPASS data object to be processed.
trashbag_key (str (default "trash")) – dictionary key used to post the trash bag for atomic data or the members of ensembles when self.process_ensemble_members is True.
ensemble_trashbag_key (str (default "ensemble_trash")) – dictionary key used to post the trashbag dictionary constructed from an ensemble’s Metadata container. Ignored for atomic data. Note this key should be distinct from trashbag_key or the member trashbags will be overwritten by the ensemble trashbag if the ensemble is saved.
- clean(datum)[source]
Process datum to remove all Metadata with keys not defined in this instance of Janitor. Returns the datum with the attributes it treats as junk removed. For ensembles, if self.process_ensemble_members is True the operation will be applied to all ensemble members. When False the ensemble’s Metadata container will be altered.
- Parameters:
datum – data object to be processed.
- collect_trash(datum) dict [source]
Processes datum by extracting attributes that this instance of Janitor does not define as a keeper. It then clears the attributes it treats as junk before returning the attributes it cleared in a python dictionary. When run on ensembles the self.ensemble_keepers list is used to edit the ensemble Metadata container. When datum is an ensemble the ensemble members are not altered.
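The relationship between clean, collect_trash, and bag_trash can be sketched with plain dictionaries standing in for a datum's Metadata. The free functions below are hypothetical illustrations of the idea, not the Janitor methods themselves, and they ignore ensembles.

```python
def collect_trash(md, keepers):
    """Remove every key not in keepers from md and return the removed items."""
    trash = {}
    for key in list(md):            # copy the keys because md is edited in the loop
        if key not in keepers:
            trash[key] = md.pop(key)
    return trash


def clean(md, keepers):
    """Discard the trash permanently and return the edited metadata."""
    collect_trash(md, keepers)
    return md


def bag_trash(md, keepers, trashbag_key="trash"):
    """Move the trash into a subdocument stored under trashbag_key."""
    md[trashbag_key] = collect_trash(md, keepers)
    return md


md = {"sta": "AAK", "chan": "BHZ", "delta": 0.05}
keepers = ["sta", "delta"]
cleaned = clean(dict(md), keepers)
bagged = bag_trash(dict(md), keepers)
```

After clean the "chan" attribute is gone for good, while after bag_trash it can still be recovered from the "trash" subdocument.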
- class mspasspy.util.Janitor.MiniseedJanitor[source]
Bases:
Janitor
Convenience class for handling data read from wf_miniseed.
Data loaded from miniseed files in MsPASS tend to have some debris that is not needed once the data are loaded into memory for processing. This is a convenience class that loads a different yaml file to create a stock Janitor for handling data loaded from wf_miniseed.
WARNING: use this class only on data immediately after loading from wf_miniseed. It will eliminate miniseed specific debris that is a perfect example of why a Janitor is useful. It should not, however, be used after any processing that loads additional attributes or it will almost certainly delete useful attributes.
logging_helper
- mspasspy.util.logging_helper.ensemble_error(d, alg, message, err_severity=<ErrorSeverity.Invalid: 1>)[source]
This is a small helper function useful for error handlers in except blocks for ensemble objects. If a function called on an ensemble object throws an exception, this function will post the message to all ensemble members. It silently does nothing if the ensemble is empty.
- Parameters:
err_severity – severity of the error, default as ErrorSeverity.Invalid.
d – is the ensemble data to be handled. It prints an error message and returns doing nothing if d is not one of the known ensemble objects.
alg – is the algorithm name posted to elog on each member
message – is the string posted to all members
(Note due to a current flaw in the api we don’t have access to the severity attribute. For now this always sets it to Invalid.)
- mspasspy.util.logging_helper.info(data, alg_id, alg_name, target=None)[source]
This helper function is used to log operations in the processing history of a mspass object. Per best practice, every operation applied to a mspass object should be logged.
- Parameters:
data – the mspass data object
alg_id – an id designator to uniquely define an instance of algorithm.
alg_name – the name of the algorithm that is used on the mspass object.
target – if the mspass data object is an ensemble type, you may use target as index to log on one specific object in the ensemble. If target is not specified, all the objects in the ensemble will be logged using the same information.
- Returns:
None
- mspasspy.util.logging_helper.reduce(data1, data2, alg_id, alg_name)[source]
This function replicates the processing history of data2 onto data1, which is a common use case in reduce stage. If data1 is dead, it will keep silent, i.e. no history will be replicated. If data2 is dead, the processing history will still be replicated.
- Parameters:
data1 – MsPASS object
data2 – MsPASS object
alg_id – The unique id that the user gives to the algorithm.
alg_name – The name of the reduce algorithm that uses this helper function.
- Returns:
None
seismic
- mspasspy.util.seismic.ensemble_time_range(ensemble, metric='inner') TimeWindow [source]
Scans a seismic data ensemble returning a measure of the time span of members. The range returned can be the range common to all members (largest start time to smallest end time), the smallest time range containing all the data (minimum start time to maximum end time), or an average defined by either the median or the arithmetic mean of the vectors of start time and end time values.
- Parameters:
ensemble (TimeSeriesEnsemble or SeismogramEnsemble) – ensemble container to be scanned for time range.
metric – measure to use to define the time range. Accepted values are:
- “inner” - (default) return range defined by largest start time to smallest end time.
- “outer” - return range defined by minimum start time and largest end time (maximum time span of data)
- “median” - return range as the median of the extracted vectors of start time and end time values.
- “mean” - return range as arithmetic average of start and end time vectors
- Returns:
TimeWindow object with start and end times. If all members of the ensemble are dead, the default constructed TimeWindow object, which has zero length, will be returned.
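The four metrics listed above can be sketched on plain lists of start and end times. `time_range` is a hypothetical helper that returns a (start, end) tuple instead of a TimeWindow and does not handle dead members.

```python
from statistics import mean, median


def time_range(starts, ends, metric="inner"):
    """Return (start, end) for the chosen metric from parallel time lists."""
    if metric == "inner":
        return max(starts), min(ends)       # span common to all members
    if metric == "outer":
        return min(starts), max(ends)       # span containing all members
    if metric == "median":
        return median(starts), median(ends)
    if metric == "mean":
        return mean(starts), mean(ends)
    raise ValueError("unknown metric: " + str(metric))


# Made-up start and end times (seconds) for a three-member ensemble.
starts = [0.0, 1.0, 2.0]
ends = [10.0, 12.0, 11.0]
inner = time_range(starts, ends, "inner")
outer = time_range(starts, ends, "outer")
```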
- mspasspy.util.seismic.number_live(ensemble) int [source]
Scans an ensemble and returns the number of live members. If the ensemble is marked dead it immediately returns 0. Otherwise it loops through the members counting the number live.
- Parameters:
ensemble (TimeSeriesEnsemble or SeismogramEnsemble) – ensemble to be scanned. A TypeError exception is thrown if it is any other type.
- mspasspy.util.seismic.print_metadata(d, indent=2)[source]
Prints Metadata stored in the object passed as arg0 (d) as json format output with indentation defined by the optional indent argument. Indent is always on and defaults to 2 characters. The purpose of this function is to standardize printing the Metadata contents of any MsPASS data object. It has the beneficial side effect of producing the same print for documents (python dictionaries) retrieved directly from MongoDB.
It is important to realize that when applied to a TimeSeriesEnsemble or SeismogramEnsemble only the ensemble Metadata container will be printed.
- Parameters:
d (any subclass of mspasspy.ccore.Metadata or a python dictionary (the container used for documents returned by pymongo)) – datum for which the Metadata is to be printed. If d is anything else a TypeError exception will be thrown.
indent (integer (default 2)) – indentation argument for json printing
- mspasspy.util.seismic.regularize_sampling(ensemble, dt_expected, Nsamp=10000, abort_on_error=False)[source]
This is a utility function that can be used to validate that all the members of an ensemble have a sample interval that is indistinguishable from a specified constant (dt_expected) The test for constant is soft to allow handling data created by some older digitizers that skewed the recorded sample interval to force time computed from N*dt to match time stamps on successive data packets. The formula used is the datum dt is declared constant if the difference from the expected dt is less than or equal to dt_expected/2*(Nsamp-1). That means the computed endtime difference from that using dt_expected is less than or equal to dt/2.
The function by default will kill members that have mismatched sample intervals and log an informational message to the datum’s elog container. In this mode the entire ensemble can end up marked dead if all the members are killed. (That situation can easily happen if the entire data set has the wrong dt.) If the argument abort_on_error is set True a ValueError exception will be thrown if ANY member of the input ensemble has a mismatched sample interval.
An important alternative to this function is to pass a data set through the MsPASS resample function found in mspasspy.algorithms.resample. That function will guarantee all live data have the same sample interval and not kill them like this function will. Use this function for algorithms that can’t be certain the input data will have been resampled and need to be robust for what might otherwise be considered a user error.
- Parameters:
ensemble (TimeSeriesEnsemble or SeismogramEnsemble) – ensemble container of data to be scanned for irregular sampling. The function does not test for type and will abort with an undefined method error if sent anything else.
dt_expected (float) – constant data sample interval expected.
Nsamp (integer) – Nominal number of samples expected in the ensemble members. This number is used to compute the soft test to allow for slippery clocks discussed above. (Default 10000)
abort_on_error (boolean) – Controls what the function does if it encounters a datum with a sample interval different than dt_expected. When True the function aborts with a ValueError exception if ANY ensemble member does not have a matching sample interval. When False (the default) the function uses the MsPASS error logging to hold a message and kills any member datum with a problem.
A typical call is e = regularize_sampling(e, dt) followed by assert e.live to confirm the ensemble survived.
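The soft test for a constant sample interval can be sketched in a few lines. The sketch reads the tolerance as dt_expected/(2*(Nsamp-1)), the form consistent with the half-sample end-time bound stated in the description; the function name dt_is_regular is hypothetical and is not part of MsPASS.

```python
def dt_is_regular(dt, dt_expected, Nsamp=10000):
    """Soft test: True when dt is indistinguishable from dt_expected.

    The tolerance dt_expected / (2 * (Nsamp - 1)) keeps the accumulated
    end-time drift over Nsamp samples at or below half a sample.
    Hypothetical helper for illustration only.
    """
    tolerance = dt_expected / (2.0 * (Nsamp - 1))
    return abs(dt - dt_expected) <= tolerance


dt_expected = 0.01
# A skew of 1e-7 s per sample drifts about 1 ms over 10000 samples: accepted.
ok = dt_is_regular(0.0100001, dt_expected)
# A skew of 1e-6 s per sample drifts about 10 ms, more than dt/2: rejected.
bad = dt_is_regular(0.010001, dt_expected)
print(ok, bad)
```

Larger values of Nsamp tighten the tolerance, so a nominal length shorter than the real data makes the test more permissive than the half-sample bound implies.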
- mspasspy.util.seismic.sort_ensemble(ensemble, key, nullvalue=0.0, ascending=True, drop_dead=True)[source]
Sorts members of an ensemble by a single Metadata key value.
For graphical QC one often needs to sort an ensemble by a metadata attribute to appraise how the attribute relates to a graphical display of that data. This function does that with a memory intensive algorithm that makes a copy of the input, and that copy is what is returned.
- Parameters:
ensemble (TimeSeriesEnsemble or SeismogramEnsemble) – input to be sorted
key (string) – key of Metadata attribute whose value is to be used for sorting
nullvalue – value assigned for the sort to any ensemble member for which a value is not defined for the sort key.
ascending – boolean defining direction of sort. When True sort is in ascending order. False returns data sorted in descending order.
drop_dead – when True (default) any ensemble member marked dead will not appear in the output. When False dead data will get an implicit value defined by “nullvalue”. Where the dead appear will depend upon what that value is relative to the valid values.
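The interaction of nullvalue, ascending, and drop_dead can be illustrated with plain Python containers. In this sketch each member is a dict with a "live" flag and an "md" dict; these stand-ins and the name sort_members are assumptions for illustration, and the result is a new list holding the same member objects rather than deep copies.

```python
def sort_members(members, key, nullvalue=0.0, ascending=True, drop_dead=True):
    """Return a sorted list of stand-in ensemble members.

    Each member is a dict with a boolean "live" flag and an "md" dict of
    metadata; both are hypothetical stand-ins for MsPASS data objects.
    """
    def sort_value(member):
        # Dead members and members lacking the key sort at nullvalue.
        if not member["live"]:
            return nullvalue
        return member["md"].get(key, nullvalue)

    kept = [m for m in members if m["live"] or not drop_dead]
    return sorted(kept, key=sort_value, reverse=not ascending)


members = [
    {"live": True, "md": {"sta": "A", "dist": 30.0}},
    {"live": False, "md": {"sta": "B", "dist": 10.0}},
    {"live": True, "md": {"sta": "C"}},
    {"live": True, "md": {"sta": "D", "dist": 20.0}},
]
order = [m["md"]["sta"] for m in sort_members(members, "dist")]
with_dead = [m["md"]["sta"] for m in sort_members(members, "dist", drop_dead=False)]
print(order, with_dead)
```

With the default nullvalue of 0.0, the member lacking the key and any retained dead member sort ahead of all positive values in ascending order.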
seispp
- mspasspy.util.seispp.index_data(filebase, db, ext='d3C', verbose=False)[source]
Import function for data from antelope export_to_mspass.
This function is an import function for Seismogram objects created by the antelope program export_to_mspass. That program writes header data as a yaml file and the sample data as a raw binary fwrite of the data matrix (stored in Fortran order but written as a contiguous block of 3*npts double values, where npts is the number of samples). This function parses the yaml file and adds three critical metadata entries: dfile, dir, and foff. To get foff values the function reads the binary data file and records the file position returned by calls to tell. It then writes these entries into MongoDB in the wf collection of a database. Readers that want to read this raw data will need to use dir, dfile, and foff to find the right file and read point.
- Parameters:
filebase – is the base name of the dataset to be read and indexed. The function will look for filebase.yaml for the header data and filebase.ext for the sample data (ext is arg 3, defaulting to d3C).
db – is the MongoDB database handler
ext – is the file extension for the sample data (default is ‘d3C’).
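How foff values can be recovered with tell is sketched below using only the standard library. The sketch assumes the npts value of each datum is already known (in the real workflow it would come from the yaml header) and that a double occupies 8 bytes; the helper name scan_offsets and the temporary test file are illustrative and not part of MsPASS.

```python
import os
import struct
import tempfile


def scan_offsets(path, npts_list):
    """Return the byte offset (foff) of each 3*npts block of doubles in path."""
    offsets = []
    with open(path, "rb") as fh:
        for npts in npts_list:
            offsets.append(fh.tell())  # read point for this datum
            nbytes = 3 * npts * 8      # 3 components of 8-byte doubles
            block = fh.read(nbytes)
            if len(block) != nbytes:
                raise ValueError("file is shorter than the header data implies")
    return offsets


# Build a small two-datum file so the sketch can run on its own.
npts_list = [4, 6]
with tempfile.NamedTemporaryFile(suffix=".d3C", delete=False) as tmp:
    for npts in npts_list:
        tmp.write(struct.pack("%dd" % (3 * npts), *([0.0] * (3 * npts))))
    path = tmp.name

foffs = scan_offsets(path, npts_list)
# Each document carries the three entries a reader needs to find the data.
docs = [
    {"dir": os.path.dirname(path), "dfile": os.path.basename(path), "foff": foff}
    for foff in foffs
]
os.remove(path)
print(foffs)
```

A reader would then open dir/dfile, seek to foff, and read 3*npts doubles to rebuild the data matrix.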
Undertaker
- class mspasspy.util.Undertaker.Undertaker(dbin, regular_data_collection='cemetery', aborted_data_collection='abortions', data_tag=None)[source]
Bases:
object
Class to handle dead data. Results are stored to two special collections defined by default as “cemetery”, for regular dead bodies, and “abortions” for those defined as abortions.
- Parameters:
dbin (mspasspy.db.Database) – Should be an instance of mspasspy.db.Database that is used to save the remains of any bodies. The constructor for this class only tests that the handle is an instance of pymongo’s Database class, which the MsPASS version of Database extends. This particular class references only two methods of Database: (1) the private method _save_elog and (2) the private method _save_history. Technically an alternative extension of pymongo’s Database class that implements those two methods would be plug compatible. Users who might want to pull MsPASS apart and use this class separately could do so with a Database extension other than the MsPASS one.
regular_data_collection (string) – collection where we bury regular dead bodies. Default “cemetery”
aborted_data_collection (string) – collection where aborted data documents are buried. Default “abortions”
data_tag – tag to attach to each document. Normally would be the same as the data_tag used for a particular save operation for data not marked dead.
- bring_out_your_dead(d, bury=False, save_history=True, mummify_atomic_data=True)[source]
Separate an ensemble into live and dead members. Result is returned as a pair (tuple) of two ensembles. The first (component 0) is a copy of the input with the dead bodies removed. The second (component 1) has the same ensemble Metadata as the input but contains only dead members. The name is stolen from a great line in the movie “Monty Python and the Holy Grail”.
- Parameters:
d – must be either a TimeSeriesEnsemble or SeismogramEnsemble of data to be processed.
bury – if True the bury method will be called on the ensemble of dead data before returning. Note a limitation of using this method is there is no way to save the optional history data via this method. If you need to save history run this with bury=False and then run bury with save_history True on the dead ensemble. There is also no way to specify an alternative to the default collection name of “cemetery”.
- Returns:
python list with two elements. 0 is ensemble with live data and 1 is ensemble with dead data.
- Return type:
python list with two components
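The split into live and dead ensembles can be sketched with a small stand-in container. FakeEnsemble, the "live" flag on each member dict, and split_live_dead are hypothetical names used only for illustration; the sketch copies the ensemble metadata but reuses the member objects instead of copying them.

```python
from dataclasses import dataclass, field


@dataclass
class FakeEnsemble:
    """Hypothetical stand-in for an ensemble: metadata plus a member list."""
    md: dict = field(default_factory=dict)
    member: list = field(default_factory=list)


def split_live_dead(ens):
    """Return (live, dead) ensembles, each with a copy of the ensemble metadata."""
    live = FakeEnsemble(md=dict(ens.md))
    dead = FakeEnsemble(md=dict(ens.md))
    for d in ens.member:
        # Route each member by its live flag; the input is left unchanged.
        (live if d["live"] else dead).member.append(d)
    return live, dead


ens = FakeEnsemble(
    md={"source_id": 7},
    member=[
        {"sta": "A", "live": True},
        {"sta": "B", "live": False},
        {"sta": "C", "live": True},
    ],
)
live, dead = split_live_dead(ens)
print(len(live.member), len(dead.member))
```

Index 0 of the result holds the live data and index 1 the dead data, matching the order described above.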
- bury(mspass_object, save_history=False, mummify_atomic_data=True)[source]
Handles dead data by saving a subset of content to database.
MsPASS makes extensive use of the idea of “killing” data as a way to say it is bad and should not be considered further in any analysis. There is a need to record what data was killed and it is preferable to do so without saving the entire data object. (That is the norm in seismic reflection processing where data marked dead are normally carried through until removed through a process like a stack.) This method standardizes the method of how to do that and what is saved as the shell of a dead datum. That “shell” is always a minimum of two things:
All elog entries - essential to understand why datum was killed
The content of the Metadata container saved under a subdocument called “tombstone”.
If save_history is set True and the datum has history records they will also be saved.
It is important to realize this method acts like an overloaded C++ method in that it accepts multiple data types, but handles them differently.
1. Atomic data (TimeSeries or Seismogram) marked dead generate a document saved to the specified collection and an (optional) history document. If the mummify_atomic_data parameter is set True (the default) the returned copy of the data will be processed with the “mummify” method of this class. (That means the sample data are discarded and the array is set to zero length.)
2. Ensembles have to handle two different situations. If the entire ensemble is marked dead, all members are treated as dead and then processed through this method by a recursive call on each member. In that situation an empty ensemble is returned with only the ensemble metadata populated. If the ensemble is marked live the code loops over members calling this method recursively only on dead data. In that situation the ensemble returned is edited with all dead data removed (e.g. if we started with 20 members and two were marked dead, the return would have 18 members).
- Parameters:
mspass_object (TimeSeries, Seismogram, TimeSeriesEnsemble, or SeismogramEnsemble) – datum to be processed. Must be a MsPASS seismic data object or the method will throw a TypeError.
save_history – If True and a datum has the optional history data stored with it, the history data will be stored in a MongoDB collection hard wired into the _save_history method of Database. Default is False
mummify_atomic_data – When True (default) atomic data marked dead will be passed through self.mummify to reduce memory use of the remains. This parameter is ignored for ensembles.
- bury_the_dead(mspass_object, save_history=True, mummify_atomic_data=True)[source]
Deprecated method exactly equivalent to the newer, shorter name bury. As a member of Undertaker the long name was redundant. Note the call sequence is exactly the same as bury.
- cremate(mspass_object)[source]
Like bury but nothing is preserved of the dead.
For atomic data it returns a default constructed (empty) copy of the container matching the original type. That avoids downstream type collisions if this method is called in a parallel workflow to release memory. This method is most appropriate for ensembles. In that case, it returns a copy of the ensemble with all dead data removed (i.e. they are omitted from the returned copy leaving no trace). If an ensemble is marked dead the return is an empty ensemble containing only ensemble Metadata.
- Parameters:
mspass_object – Seismic data object. If not a MsPASS seismic data object a TypeError will be thrown.
- handle_abortion(doc_or_datum, type=None)[source]
Standardized method to handle what we call abortions (see class overview).
This method standardizes handling of abortions. They are always saved as a document in a collection set by the constructor (self.aborted_data_collection) that defaults to “abortions”. The documents saved have up to 3 key-value pairs:
- “tombstone” - contents are a subdocument (dict) of the wf document that was aborted during construction.
- “logdata” - any error log records left by the reader that failed.
- “type” - string describing the expected type of data object that a reader was attempting to construct. In rare situations it could be set to “unknown” if Undertaker._handle_abortion is called on a raw document and type is not set (see parameters below).
- Parameters:
doc_or_datum (TimeSeries, Seismogram, Metadata, or a python dict) – container defining the aborted fetus. For the seismic data objects any content in the ErrorLogger will be saved. For dict input an application should post a message to the dict with some appropriate (custom) key to preserve a cause for the abortion.
type – string description of the type of data object to associate with dict input. Default for this parameter is None and it is not referenced at all for normal input of TimeSeries and Seismogram objects. It is ONLY referenced if arg0 is a dict. If type is None and the input is a dict the value assigned to the “type” key in the abortions document is “unknown”. The escape for “unknown” makes the method bombproof but may make the saved documents ambiguous.
- Exception:
throws a TypeError if arg0 is not one of the types listed above.
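The layout of a saved abortions document for dict input can be sketched as follows. This is an illustration of the three key-value pairs described above, not the Undertaker code: the helper abortion_document, its logdata argument, and the parameter name datatype (used in place of type to avoid shadowing the Python builtin) are assumptions, and nothing is written to a database.

```python
def abortion_document(doc, logdata=None, datatype=None):
    """Assemble the document layout described above for dict input.

    Hypothetical helper for illustration; it does not touch a database.
    """
    if not isinstance(doc, dict):
        raise TypeError("abortion_document: this sketch only accepts a dict")
    result = {"tombstone": dict(doc)}
    if logdata:
        # Only present when a reader left error log records behind.
        result["logdata"] = list(logdata)
    # Fall back to "unknown" when no type description is supplied.
    result["type"] = "unknown" if datatype is None else datatype
    return result


wf_doc = {"sta": "AAK", "npts": 0}
saved = abortion_document(wf_doc, logdata=["reader failed: npts is zero"])
typed = abortion_document(wf_doc, datatype="TimeSeries")
print(saved)
```

Passing a type description whenever one is known avoids the ambiguous “unknown” entries mentioned in the parameter notes.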
- mummify(mspass_object, post_elog=True, post_history=False)[source]
Reduce memory use associated with dead data.
For atomic data objects that are marked dead, the data vector/matrix is set to zero length, releasing the dynamically allocated memory. For ensembles, if the entire ensemble is marked dead all members are killed and this method calls itself on each member. For normal ensembles with mixed live and dead data only the data marked dead are mummified.
Handling of
- Parameters:
mspass_object – datum to be processed.