mspasspy.db
client
- class mspasspy.db.client.DBClient(host=None, *args, **kwargs)[source]
Bases:
MongoClient
A client-side top-level handle into MongoDB.
MongoDB uses the client-server model for transactions. An instance of this class must be created in any MsPASS job using the MongoDB database to set up the communication channels between you (the client) and an instance of the MongoDB server. This class is little more than a wrapper around MongoClient created for convenience. In most cases there is functionally little difference between creating a MongoClient and creating the MsPASS DBClient (this class).
- get_database(name=None, schema=None, codec_options=None, read_preference=None, write_concern=None, read_concern=None)[source]
Get a Database with the given name and options.
Useful for creating a Database with different codec options, read preference, and/or write concern from this MongoClient.
>>> client.read_preference
Primary()
>>> db1 = client.test
>>> db1.read_preference
Primary()
>>> from pymongo import ReadPreference
>>> db2 = client.get_database(
...     'test', read_preference=ReadPreference.SECONDARY)
>>> db2.read_preference
Secondary(tag_sets=None)
- Parameters:
name – The name of the database - a string. If None (the default) the database named in the MongoDB connection URI is returned.
codec_options – An instance of CodecOptions. If None (the default) the codec_options of this MongoClient is used.
read_preference – The read preference to use. If None (the default) the read_preference of this MongoClient is used. See read_preferences for options.
write_concern – An instance of WriteConcern. If None (the default) the write_concern of this MongoClient is used.
read_concern – An instance of ReadConcern. If None (the default) the read_concern of this MongoClient is used.
Changed in version 3.5: The name parameter is now optional, defaulting to the database named in the MongoDB connection URI.
- get_default_database(default=None, schema=None, codec_options=None, read_preference=None, write_concern=None, read_concern=None)[source]
Get the database named in the MongoDB connection URI.
>>> uri = 'mongodb://host/my_database'
>>> client = MongoClient(uri)
>>> db = client.get_default_database()
>>> assert db.name == 'my_database'
>>> db = client.get_database()
>>> assert db.name == 'my_database'
Useful in scripts where you want to choose which database to use based only on the URI in a configuration file.
- Parameters:
default – the database name to use if no database name was provided in the URI.
codec_options – An instance of CodecOptions. If None (the default) the codec_options of this MongoClient is used.
read_preference – The read preference to use. If None (the default) the read_preference of this MongoClient is used. See read_preferences for options.
write_concern – An instance of WriteConcern. If None (the default) the write_concern of this MongoClient is used.
read_concern – An instance of ReadConcern. If None (the default) the read_concern of this MongoClient is used.
comment – A user-provided comment to attach to this command.
Changed in version 4.1: Added comment parameter.
Changed in version 3.8: Undeprecated. Added the default, codec_options, read_preference, write_concern and read_concern parameters.
Changed in version 3.5: Deprecated, use get_database() instead.
database
- class mspasspy.db.database.Database(*args, schema=None, db_schema=None, md_schema=None, **kwargs)[source]
Bases:
Database
MsPASS core database handle. All MongoDB database operations in MsPASS should normally utilize this object. This object is a subclass of the Database class of pymongo. It extends the class in several ways fundamental to the MsPASS framework:
It abstracts read and write operations for seismic data managed by the framework. Note reads and writes are for atomic objects. Use the distributed read and write functions for parallel handling of complete data sets.
It contains methods for managing the most common seismic Metadata (namely source and receiver geometry and receiver response data).
It adds a schema that can (optionally) be used to enforce type requirements and/or provide aliasing.
It manages error logging data.
It manages (optional) processing history data.
The class currently has only one constructor normally called with a variant of the following:
db = Database(dbclient, 'mydatabase')
where dbclient is either a MongoDB database client instance or (recommended) the MsPASS DBClient wrapper (a subclass of the pymongo client). The second argument is the database “name” passed to the MongoDB server that defines your working database.
The constructor should normally be used only with serial workflows. In cluster environments the recommended way to obtain a Database handle is via the DBClient method called get_database. The typical construct is:
dbclient = DBClient("dbhostname")
db = dbclient.get_database("mydatabase")
where dbhostname is the hostname of the node running the MongoDB server and mydatabase is the name you assign your database. Serial workflows can and should use a similar construct but can normally default construct DBClient.
Optional parameters are:
- Parameters:
schema – a str of the yaml file name that defines both the database schema and the metadata schema. If this parameter is set, it will override the following two.
db_schema – Set the name for the database schema to use with this handle. Default is the MsPASS schema. (See User’s Manual for details)
md_schema – Set the name for the Metadata schema. Default is the MsPASS definitions. (see User’s Manual for details)
As a subclass of pymongo Database the constructor accepts any parameters defined for the base class (see pymongo documentation)
- clean(document_id, collection='wf', rename_undefined=None, delete_undefined=False, required_xref_list=None, delete_missing_xref=False, delete_missing_required=False, verbose=False, verbose_keys=None)[source]
This is the atomic level operation for cleaning a single database document with one or more common fixes. The clean_collection method is mainly a loop that calls this method for each document it is asked to handle. Most users will likely not need or want to call this method directly but use the clean_collection method instead. See the docstring for clean_collection for a less cryptic description of the options, as the options are identical.
- Parameters:
document_id (bson.ObjectId.ObjectId) – the value of the _id field in the document you want to clean
collection – the name of the collection holding the document. If not specified, the default wf collection is used.
rename_undefined (dict) – Specify a dict of {original_key:new_key} to rename the undefined keys in the document.
delete_undefined – Set to True to delete undefined keys in the doc. rename_undefined will not work if this is True. Default is False.
required_xref_list (list) – a list of xref keys to be checked.
delete_missing_xref – Set to True to delete this document if any key specified in required_xref_list is missing. Default is False.
delete_missing_required – Set to True to delete this document if any required key in the database schema is missing. Default is False.
verbose – Set to True to print all the operations. Default is False.
verbose_keys (list of str) – a list of keys whose values are added to the printed messages to better identify problems when an error happens.
- Returns:
number of fixes applied to each key
- Return type:
dict
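The interaction between rename_undefined and delete_undefined can be illustrated on a plain python dict. The following is a minimal sketch, not the MsPASS implementation: the function name clean_document and the defined_keys argument (standing in for the schema lookup) are hypothetical, and the way fixes are counted per key is an assumption based on the return description above.

```python
def clean_document(doc, defined_keys, rename_undefined=None, delete_undefined=False):
    """Return (cleaned_doc, fixes) where fixes counts the changes made per key.

    defined_keys is a hypothetical stand-in for the keys the schema defines.
    """
    rename_undefined = rename_undefined or {}
    cleaned = {}
    fixes = {}
    for key, value in doc.items():
        if key == "_id" or key in defined_keys:
            # Keys known to the schema are always kept unchanged.
            cleaned[key] = value
        elif delete_undefined:
            # Deletion takes precedence, so renaming is skipped for this key.
            fixes[key] = fixes.get(key, 0) + 1
        elif key in rename_undefined:
            cleaned[rename_undefined[key]] = value
            fixes[key] = fixes.get(key, 0) + 1
        else:
            # Undefined key with no requested fix is left alone.
            cleaned[key] = value
    return cleaned, fixes
```

With delete_undefined left False only the keys listed in rename_undefined change; with it set True every undefined key is dropped and the rename map has no effect, which mirrors the parameter description above.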
- clean_collection(collection='wf', query=None, rename_undefined=None, delete_undefined=False, required_xref_list=None, delete_missing_xref=False, delete_missing_required=False, verbose=False, verbose_keys=None)[source]
This method can be used to fix a subset of common database problems created during processing that can cause problems if the user tries to read data saved with such blemishes. The subset of problems is defined as those that are identified by the “dbverify” command line tool or its Database equivalent (the verify method of this class). This method is an alternative to the command line tool dbclean. Use this method for fixes applied as part of a python script.
- Parameters:
collection (str) – the collection you would like to clean. If not specified, the default wf collection is used.
query (dict) – a python dict that is assumed to be a MongoDB query limiting the suite of documents to which the requested cleanup methods are applied. The default will process the entire database collection specified.
rename_undefined (dict) – when set, this option is assumed to contain a python dict defining how to rename a set of attributes. Each member of the dict is assumed to be of the form {original_key:new_key}. Each document handled will change the key from “original_key” to “new_key”.
delete_undefined – attributes undefined in the schema can be problematic. At a minimum they waste storage and memory if they are baggage. At worst they may cause a downstream process to abort or mark some or all data dead. On the other hand, undefined data may contain important information you need that for some reason is not defined in the schema. In that case do not run this method to clear these. When set true all documents matching the query will have undefined attributes cleared.
required_xref_list (list) – in MsPASS we use “normalization” of some common attributes for efficiency. The stock ones currently are “source”, “site”, and “channel”. This list is intimately linked to the delete_missing_xref option. When that is true this list is enforced to clean debris. Typical examples are ['site_id', 'source_id'] to require source and receiver geometry.
delete_missing_xref – Turn this option on to impose a brutal fix for data with missing (required) cross referencing data. This clean operation, for example, might be necessary to clean up a data set with some channels that defy all attempts to find valid receiver metadata (stationxml stuff for passive data, survey data for active source data). This clean method is a blunt instrument that should be used as a last resort. When true the list of xref keys defined by required_xref_list is tested and any document that lacks any of the keys will be deleted from the database.
delete_missing_required – Set to True to delete this document if any required key in the database schema is missing. Default is False.
verbose – Set to True to print all the operations. Default is False.
verbose_keys (list of str) – a list of keys whose values are added to the printed messages to better identify problems when an error happens.
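The effect of delete_missing_xref with required_xref_list can be previewed on documents held in memory before committing to a destructive run. This is a minimal sketch under stated assumptions: the function name split_by_missing_xref is hypothetical, and "missing" is taken to mean the key is absent from the document (the real method may also treat other conditions as missing).

```python
def split_by_missing_xref(docs, required_xref_list):
    """Partition docs into (kept, deleted) by presence of every required xref key."""
    kept, deleted = [], []
    for doc in docs:
        # A document is flagged if it lacks any one of the required keys.
        missing = [key for key in required_xref_list if key not in doc]
        if missing:
            deleted.append(doc)
        else:
            kept.append(doc)
    return kept, deleted
```

Running a check of this kind on the output of a find with the same query gives a count of what the blunt-instrument option would remove.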
- create_collection(name, codec_options=None, read_preference=None, write_concern=None, read_concern=None, session=None, **kwargs)[source]
Create a new mspasspy.db.collection.Collection in this database. Normally collection creation is automatic. This method should only be used to specify options on creation. CollectionInvalid will be raised if the collection already exists. Useful mainly for advanced users tuning a polished workflow.
- Parameters:
name – the name of the collection to create
codec_options – (optional) An instance of CodecOptions. If None (the default) the codec_options of this Database is used.
read_preference – (optional) The read preference to use. If None (the default) the read_preference of this Database is used.
write_concern – (optional) An instance of WriteConcern. If None (the default) the write_concern of this Database is used.
read_concern – (optional) An instance of ReadConcern. If None (the default) the read_concern of this Database is used.
collation – (optional) An instance of Collation.
session – (optional) a ClientSession.
**kwargs – (optional) additional keyword arguments will be passed as options for the create collection command
All optional create collection command parameters should be passed as keyword arguments to this method. Valid options include, but are not limited to:
size: desired initial size for the collection (in bytes). For capped collections this size is the max size of the collection.
capped: if True, this is a capped collection
max: maximum number of objects if capped (optional)
timeseries: a document specifying configuration options for timeseries collections
expireAfterSeconds: the number of seconds after which a document in a timeseries collection expires
- delete_data(object_id, object_type, remove_unreferenced_files=False, clear_history=True, clear_elog=True)[source]
Delete method for handling mspass data objects (TimeSeries and Seismograms).
Delete is one of the basic operations any database system should support (the last letter of the acronym CRUD is delete). Deletion is nontrivial with seismic data stored with the model used in MsPASS. The reason is that the content of the objects is spread between multiple collections and sometimes uses storage in files completely outside MongoDB. This method, however, is designed to handle that: when given the object id defining a document in one of the wf collections, it will delete the wf document entry and manage the waveform data. If the data are stored in gridfs the deletion of the waveform data will be immediate. If the data are stored in disk files the file will be deleted when there are no more references in the wf collection for the exact combination of dir and dfile associated with an atomic deletion. Error log and history data deletion linked to a datum is optional. Note this is an expensive operation as it involves extensive database interactions. It is best used for surgical solutions. Deletion of large components of a data set (e.g. all data with a given data_tag value) is best done with custom scripts utilizing file naming conventions and unix shell commands to delete waveform files.
- Parameters:
object_id (bson.ObjectId.ObjectId) – the wf object id you want to delete.
object_type (str) – the object type you want to delete, must be one of ['TimeSeries', 'Seismogram']
remove_unreferenced_files – if True, we will try to remove any file that no wf document references. Default is False.
clear_history – if True, we will clear the processing history of the associated wf object. Default is True.
clear_elog – if True, we will clear the elog entries of the associated wf object. Default is True.
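The rule that a disk file is removed only when no remaining wf document references the same dir and dfile can be sketched with in-memory documents. This is an illustration of the reference check only, not the MsPASS code: the function name files_safe_to_remove is hypothetical, and a document without dir and dfile is assumed to represent data stored elsewhere (e.g. gridfs).

```python
import os


def files_safe_to_remove(wf_docs, deleted_id):
    """Return file paths referenced only by the wf document being deleted."""
    target = next(d for d in wf_docs if d["_id"] == deleted_id)
    if "dir" not in target or "dfile" not in target:
        # No file is involved for this document (assumed non-file storage).
        return []
    key = (target["dir"], target["dfile"])
    # Any other document pointing at the same dir/dfile keeps the file alive.
    still_referenced = any(
        d["_id"] != deleted_id and (d.get("dir"), d.get("dfile")) == key
        for d in wf_docs
    )
    return [] if still_referenced else [os.path.join(*key)]
```

Because many waveforms commonly share one file, the check has to scan the wf collection for every deletion, which is one reason the operation is expensive.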
- get_collection(name, codec_options=None, read_preference=None, write_concern=None, read_concern=None)[source]
Get a mspasspy.db.collection.Collection with the given name and options.
This method is useful for creating a mspasspy.db.collection.Collection with different codec options, read preference, and/or write concern from this Database. Useful mainly for advanced users tuning a polished workflow.
- Parameters:
name – The name of the collection - a string.
codec_options – (optional) An instance of bson.codec_options.CodecOptions. If None (the default) the codec_options of this Database is used.
read_preference – (optional) The read preference to use. If None (the default) the read_preference of this Database is used. See pymongo.read_preferences for options.
write_concern – (optional) An instance of pymongo.write_concern.WriteConcern. If None (the default) the write_concern of this Database is used.
read_concern – (optional) An instance of pymongo.read_concern.ReadConcern. If None (the default) the read_concern of this Database is used.
- get_response(net=None, sta=None, chan=None, loc=None, time=None)[source]
Returns an obspy Response object for seed channel defined by the standard keys net, sta, chan, and loc and a time stamp. Input time can be a UTCDateTime or an epoch time stored as a float.
- Parameters:
db – mspasspy Database handle containing a channel collection to be queried
net – seed network code (required)
sta – seed station code (required)
chan – seed channel code (required)
loc – seed location code. If None the loc code will not be included in the query. If loc is anything else it is passed as a literal. Sometimes loc codes are not defined in the seed data and are literally two ascii space characters. If so MongoDB translates those to “”. Use loc=”” for that case or, provided the station doesn’t mix null and other loc codes, use None.
time – time stamp for which the response is requested. Seed metadata has a time range of validity, so this field is required. Can be passed as either a UTCDateTime object or a raw epoch time stored as a python float.
- get_seed_channel(net, sta, chan, loc=None, time=-1.0, verbose=True)[source]
The channel collection is assumed to have a one to one mapping of net:sta:loc:chan:starttime - endtime. This method uses a restricted query to match the keys given and returns the document matching the specified keys.
The optional loc code is handled specially. The reason is that it is common to have the loc code empty. In seed data that puts two ascii blank characters in the 2 byte packet header position for each miniseed blockette. With pymongo that can be handled one of three ways that we need to handle gracefully. That is, one can either set a literal two blank character string, an empty string (“”), or a MongoDB NULL. To handle that confusion this algorithm first queries for all matches without loc defined. If only one match is found that is returned immediately. If there are multiple matches we search through the list of docs returned for a match to loc, being conscious of the null string oddity.
The (optional) time arg is used for a range match to find the period between the site starttime and endtime. If not used the first occurrence will be returned (usually ill advised). Returns None if there is no match. Although the time argument is technically optional, it is usually a bad idea to not include a time stamp because most stations saved as seed data have time variable channel metadata.
Note this method may be DEPRECATED in the future as it has been largely superseded by BasicMatcher implementations in the normalize module.
- Parameters:
net – network name to match
sta – station name to match
chan – seed channel code to match
loc – optional loc code to match (empty string ok and common). Default ignores loc in query.
time – epoch time for requested metadata
verbose – When True (the default) this method will issue a print warning message when the match is ambiguous - multiple docs match the specified keys. When set False such warnings will be suppressed. Use false only if you know the duplicates are harmless and you are running on a large data set and you want to reduce the log size.
- Returns:
handle to query return
- Return type:
MongoDB Cursor object of query result.
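The two-stage loc handling described above (query without loc first, then resolve among multiple matches) can be sketched on a list of candidate documents. This is a hypothetical illustration, not the method's source: the name match_loc is invented here, and treating None, an empty string, and a blank-padded string as the same empty loc code is an assumption drawn from the description of the null string oddity.

```python
def match_loc(candidates, loc=None):
    """Pick one channel document from candidates already matched on net, sta, chan."""
    if not candidates:
        return None
    if len(candidates) == 1 or loc is None:
        # A unique match is returned immediately; with no loc requested
        # the first occurrence is used.
        return candidates[0]

    def normalize(code):
        # Treat None, "", and blank strings as the same empty loc code.
        return "" if code is None else code.strip()

    wanted = normalize(loc)
    for doc in candidates:
        if normalize(doc.get("loc")) == wanted:
            return doc
    return None
```

In this sketch a caller passing two blank characters, an empty string, or a document with no loc field all resolve to the same entry, which is the behavior the description above is aiming for.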
- get_seed_site(net, sta, loc='NONE', time=-1.0, verbose=True)[source]
The site collection is assumed to have a one to one mapping of net:sta:loc:starttime - endtime. This method uses a restricted query to match the keys given and returns the MongoDB document matching the keys. The (optional) time arg is used for a range match to find the period between the site starttime and endtime. Returns None if there is no match.
An all too common metadata problem is to have duplicate entries in site for the same data. The default behavior of this method is to print a warning whenever a match is ambiguous (i.e. more than one document matches the keys). Set verbose false to silence such warnings if you know they are harmless.
The seed modifier in the name is to emphasize this method is for data originating as the SEED format that use net:sta:loc:chan as the primary index.
Note this method may be DEPRECATED in the future as it has been largely superseded by BasicMatcher implementations in the normalize module.
- Parameters:
net – network name to match
sta – station name to match
loc – optional loc code to match (empty string ok and common). Default ignores loc in query.
time – epoch time for requested metadata. Default is undefined and will cause the function to simply return the first document matching the name keys only. (This is rarely what you want, but there is no standard default for this argument.)
verbose – When True (the default) this method will issue a print warning message when the match is ambiguous - multiple docs match the specified keys. When set False such warnings will be suppressed. Use false only if you know the duplicates are harmless and you are running on a large data set and you want to reduce the log size.
- Returns:
MongoDB doc matching query
- Return type:
python dict (document) of result. None if there is no match.
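The restricted query described for this method can be pictured as a MongoDB query dict built from the arguments. The sketch below is an assumption-laden illustration rather than the actual implementation: build_site_query is a hypothetical name, the string 'NONE' and a non-positive time are assumed to be the "not set" sentinels implied by the signature defaults, and the inclusive comparison operators are a guess.

```python
def build_site_query(net, sta, loc="NONE", time=-1.0):
    """Build a MongoDB query dict for one site document (illustrative only)."""
    query = {"net": net, "sta": sta}
    if loc != "NONE":
        # Only constrain loc when the caller supplied one (may be "").
        query["loc"] = loc
    if time > 0.0:
        # Range match: the requested time must fall inside the
        # validity interval [starttime, endtime] of the document.
        query["starttime"] = {"$lte": time}
        query["endtime"] = {"$gte": time}
    return query
```

A dict of this form could be passed to the pymongo find_one method of the site collection; without the time terms the first document matching net and sta is returned, which is why omitting time is rarely what you want.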
- index_mseed_FDSN(provider, year, day_of_year, network, station, location, channel, collection='wf_miniseed')[source]
- index_mseed_file(dfile, dir=None, collection='wf_miniseed', segment_time_tears=True, elog_collection='elog', return_ids=False, normalize_channel=False, verbose=False)[source]
This is the first stage import function for handling the import of miniseed data. This function scans a data file defined by a directory (dir arg) and dfile (file name) argument. It builds an index for the file and writes the index to mongodb in the collection defined by the collection argument (wf_miniseed by default). The index is bare bones miniseed tags (net, sta, chan, and loc) with a starttime tag. The index is appropriate ONLY if the data in the file were created by concatenating data with packets sorted by net, sta, loc, chan, time AND the data are contiguous in time. The original concept for this function came from the need to handle large files produced by concatenation of miniseed single-channel files created by obspy’s mass_downloader, i.e. the basic model is that the input files are assumed to be something comparable to running the unix cat command on a set of single-channel, contiguous time sequence files. There are other examples that do the same thing (e.g. antelope’s miniseed2days).
We emphasize this function only builds an index - it does not convert any data. It has to scan the entire file deriving the index from data retrieved from miniseed packets with libmseed so for large data sets this can take a long time.
Actual seismic data stored as miniseed are prone to time tears. That can happen at the instrument level in at least two common ways: (1) dropped packets from telemetry issues, or (2) instrument timing jumps when a clock loses external lock to gps or some other standard and the lock is restored. The behavior of this function is controlled by the input parameter segment_time_tears. When true a new index entry is created any time the start time of a packet differs from that computed from the endtime of the last packet by more than one sample AND net:sta:chan:loc are constant. Be aware that data with many dropped packets from telemetry are common and can create overwhelming numbers of index entries quickly. When false the scan only creates a new index record when net, sta, chan, or loc change between successive packets. Our reader has gap handling functions to handle time tears. The signature default is True; leave segment_time_tears true only when you are confident the data set does not contain a large number of dropped packets.
Note that to parallelize this function you can put a list of files in a Spark RDD or a Dask bag and parallelize the calls to this function. That can work because MongoDB is designed for parallel operations and we use the thread safe version of the libmseed reader.
Finally, note that cross referencing with the channel and/or source collections should be a common step after building the index with this function. The reader found elsewhere in this module will transfer linking ids (i.e. channel_id and/or source_id) to TimeSeries objects when it reads the data from the files indexed by this function.
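The segment_time_tears rule described above can be sketched on a list of packet summaries. This is a simplified model, not the libmseed-based scanner: the function name segment_packets and the packet dict layout (net, sta, chan, loc, starttime, endtime, dt with endtime taken as the time of the last sample) are assumptions made for the illustration.

```python
def segment_packets(packets, segment_time_tears=True):
    """Group time-ordered packet summaries into index entries.

    Each packet is a dict with keys net, sta, chan, loc, starttime,
    endtime (time of last sample) and dt (sample interval in seconds).
    """
    segments = []
    for p in packets:
        key = (p["net"], p["sta"], p["chan"], p["loc"])
        new_segment = True
        if segments and segments[-1]["key"] == key:
            # Start time predicted from the previous packet of this channel.
            expected = segments[-1]["endtime"] + p["dt"]
            tear = abs(p["starttime"] - expected) > p["dt"]
            # A tear only splits the index when the option is enabled.
            new_segment = segment_time_tears and tear
        if new_segment:
            segments.append(
                {"key": key, "starttime": p["starttime"], "endtime": p["endtime"]}
            )
        else:
            # Contiguous (or tolerated) packet: extend the current entry.
            segments[-1]["endtime"] = p["endtime"]
    return segments
```

With the option off, the same input yields one entry per run of constant net, sta, chan, and loc, leaving any gaps for the reader's gap handling to deal with.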
- Parameters:
dfile – file name of data to be indexed. Assumed to be the leaf node of the path, i.e. it contains no directory information but just the file name.
dir – directory name. This can be a relative path from the current directory, but be advised it will always be converted to a fully qualified path. If it is undefined (the default) the function assumes the file is in the current working directory and will use the result of the python getcwd command as the directory path stored in the database.
collection – is the mongodb collection name to write the index data to. The default is ‘wf_miniseed’. It should be rare to use anything but the default.
segment_time_tears – boolean controlling handling of data gaps defined by constant net, sta, chan, and loc but a discontinuity in time tags for successive packets. See above for a more extensive discussion of how to use this parameter. Default is True.
elog_collection – name to write any error logs messages from the miniseed reader. Default is “elog”, which is the same as for TimeSeries and Seismogram data, but the cross reference keys here are keyed by “wf_miniseed_id”.
return_ids – if set True the function will return a tuple with two id lists. The 0 entry is an array of ids from the collection (wf_miniseed by default) of index entries saved and the 1 entry will contain the ids in the elog_collection of error log entry insertions. The 1 entry will be empty if the reader found no errors and the error log was empty (the hopefully normal situation). When this argument is False (the default) it returns None. Set true if you need to build some kind of cross reference to read errors to build some custom cleaning method for specialized processing that can be done more efficiently. By default it is fast only to associate an error log entry with a particular waveform index entry. (we store the saved index MongoDB document id with each elog entry)
normalize_channel – boolean controlling normalization with the channel collection. When set True (default is false) the method will call the Database.get_seed_channel method, extract the id from the result, and set the result as “channel_id” before writing the wf_miniseed document. Set this argument true if you have a relatively complete channel collection assembled before running a workflow to index a set of miniseed files (a common raw data starting point).
verbose – boolean passed to get_seed_channel. This argument has no effect unless normalize_channel is set True. It is necessary because the get_seed_channel function has no way to log errors except calling print. A very common metadata error is duplicate and/or time overlaps in channel metadata. Those are usually harmless so the default for this parameter is False. Set this True if you are using inline normalization (normalize_channel set True) and you aren’t certain your channel collection has no serious inconsistencies.
- Exception:
This function can throw a range of error types for a long list of possible io issues. Callers should use a generic handler to avoid aborts in a large job.
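The advice to use a generic handler can be followed with a small driver loop that records failures and continues. This is a minimal sketch: index_all is a hypothetical helper, and index_one stands for any callable taking (dfile, directory), for example a lambda that forwards to index_mseed_file with dir set to the directory.

```python
def index_all(files, index_one):
    """Call index_one(dfile, directory) for each file, collecting failures.

    files is an iterable of (directory, dfile) pairs. Returns a list of
    (directory, dfile, error_text) tuples for the files that failed.
    """
    failures = []
    for directory, dfile in files:
        try:
            index_one(dfile, directory)
        except Exception as err:
            # Deliberately broad: one unreadable file should not abort
            # a job that indexes thousands of files.
            failures.append((directory, dfile, repr(err)))
    return failures
```

The returned list can be written to a log or retried later; the same pattern carries over to a function mapped over a Dask bag or Spark RDD when the indexing is parallelized.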
- index_mseed_s3_continuous(s3_client, year, day_of_year, network='', station='', channel='', location='', collection='wf_miniseed', storage_mode='s3_continuous')[source]
This is the first stage import function for handling the import of miniseed data. However, instead of scanning a data file defined by a directory (dir arg) and dfile (file name) argument, it reads the miniseed content from AWS s3. It builds an index and writes it to mongodb in the collection defined by the collection argument (wf_miniseed by default). The index is bare bones miniseed tags (net, sta, chan, and loc) with a starttime tag.
- Parameters:
s3_client – s3 Client object given by user, which contains credentials
year – year for the query of the mseed file (4 digit).
day_of_year – day of year for the query of the mseed file (3 digit [001-366])
network – network code
station – station code
channel – channel code
location – location code
collection – is the mongodb collection name to write the index data to. The default is ‘wf_miniseed’. It should be rare to use anything but the default.
- Exception:
This function will do nothing if the object does not exist. For other exceptions, it raises a MsPASSError.
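The year and day_of_year arguments are described as 4 digit and 3 digit values. A calendar date can be converted to that form with the standard library as sketched below. The helper name year_and_day_of_year is hypothetical, and whether the method expects zero-padded strings or integers is not stated here, so treat the string output as an assumption and convert with int() if needed.

```python
from datetime import date


def year_and_day_of_year(d):
    """Return (year, day_of_year) as zero-padded strings for a datetime.date."""
    # tm_yday counts from 1 and reaches 366 on December 31 of a leap year.
    return f"{d.year:04d}", f"{d.timetuple().tm_yday:03d}"
```

For example, 1 February 2020 maps to year 2020 and day 032.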
- index_mseed_s3_event(s3_client, year, day_of_year, filename, dfile, dir=None, collection='wf_miniseed')[source]
This is the first stage import function for handling the import of miniseed data. However, instead of scanning a data file defined by a directory (dir arg) and dfile (file name) argument, it reads the miniseed content from AWS s3. It builds an index and writes it to mongodb in the collection defined by the collection argument (wf_miniseed by default). The index is bare bones miniseed tags (net, sta, chan, and loc) with a starttime tag.
- Parameters:
s3_client – s3 Client object given by user, which contains credentials
year – year for the query of the mseed file (4 digit).
day_of_year – day of year for the query of the mseed file (3 digit [001-366])
filename – SCSN catalog event id for the event
collection – is the mongodb collection name to write the index data to. The default is ‘wf_miniseed’. It should be rare to use anything but the default.
- Exception:
This function will do nothing if the object does not exist. For other exceptions, it raises a MsPASSError.
- load_channel_metadata(mspass_object, exclude_keys=['serialized_channel_data'], include_undefined=False)[source]
Reads metadata from the channel collection and loads standard attributes in the channel collection to the data passed as mspass_object. The method will only work if mspass_object has the channel_id attribute set to link it to a unique document in channel.
Note the mspass_object can be either an atomic object (TimeSeries or Seismogram) with a Metadata container base class or an ensemble (TimeSeriesEnsemble or SeismogramEnsemble). Ensembles will have the site data posted to the ensemble Metadata and not the members. This should be the stock way to assemble the generalization of a common-receiver gather of TimeSeries data for a common sensor component.
This method is DEPRECATED. It is a slow alternative for normalization and is effectively an alternative to normalization with the Database driven id matcher. Each call to this function requires a query with find_one.
- Parameters:
mspass_object (mspasspy.ccore.seismic.TimeSeries, mspasspy.ccore.seismic.Seismogram, mspasspy.ccore.seismic.TimeSeriesEnsemble or mspasspy.ccore.seismic.SeismogramEnsemble) – data where the metadata is to be loaded
exclude_keys – list of attributes that should not normally be loaded. Default excludes the serialized obspy class that is used to store response data. Ignored if include_undefined is set True.
include_undefined – when True all data in the matching channel document are loaded
- Raises:
mspasspy.ccore.utility.MsPASSError – any detected errors will cause a MsPASSError to be thrown
- load_event(source_id)[source]
Return a bson record of source data matching the unique id defined by source_id. The idea is that the magic string would be extracted from another document (e.g. in an arrival collection) and used to look up the event with which it is associated in the source collection.
This function is a relic and may be deprecated. It originally had a different purpose.
- load_site_metadata(mspass_object, exclude_keys=None, include_undefined=False)[source]
Reads metadata from the site collection and loads standard attributes in the site collection to the data passed as mspass_object. The method will only work if mspass_object has the site_id attribute set to link it to a unique document in site.
Note the mspass_object can be either an atomic object (TimeSeries or Seismogram) with a Metadata container base class or an ensemble (TimeSeriesEnsemble or SeismogramEnsemble). Ensembles will have the site data posted to the ensemble Metadata and not the members. This should be the stock way to assemble the generalization of a common-receiver gather.
This method is DEPRECATED. It is a slow alternative for normalization and is effectively an alternative to normalization with the Database driven id matcher. Each call to this function requires a query with find_one.
- Parameters:
mspass_object (mspasspy.ccore.seismic.TimeSeries, mspasspy.ccore.seismic.Seismogram, mspasspy.ccore.seismic.TimeSeriesEnsemble, or mspasspy.ccore.seismic.SeismogramEnsemble) – data where the metadata is to be loaded
exclude_keys (a list of str) – list of attributes that should not normally be loaded. Default is None. Ignored if include_undefined is set True.
include_undefined – when True all data in the matching site document are loaded.
- Raises:
mspasspy.ccore.utility.MsPASSError – any detected errors will cause a MsPASSError to be thrown
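The deprecation note above says each call costs one find_one query. The sketch below illustrates why a cached lookup is cheaper. It is not MsPASS code: CountingCollection is a hypothetical in-memory stand-in for a MongoDB collection, plain dicts stand in for data objects, and the attribute names "lat" and "lon" are illustrative assumptions.

```python
class CountingCollection:
    """Hypothetical in-memory stand-in that counts queries issued."""

    def __init__(self, docs):
        self._docs = {doc["_id"]: doc for doc in docs}
        self.queries = 0

    def find_one(self, query):
        self.queries += 1
        return self._docs.get(query["_id"])

    def find(self):
        self.queries += 1
        return list(self._docs.values())


def load_by_query(collection, data, keys):
    """One find_one per datum: the pattern the deprecation note warns about."""
    for d in data:
        doc = collection.find_one({"_id": d["site_id"]})
        if doc is not None:
            d.update({k: doc[k] for k in keys if k in doc})


def load_by_cache(collection, data, keys):
    """One bulk query up front, then in-memory lookups by id."""
    cache = {doc["_id"]: doc for doc in collection.find()}
    for d in data:
        doc = cache.get(d["site_id"])
        if doc is not None:
            d.update({k: doc[k] for k in keys if k in doc})


site = CountingCollection([
    {"_id": 1, "lat": 35.0, "lon": -106.5},
    {"_id": 2, "lat": 40.1, "lon": -105.2},
])
data = [{"site_id": 1 + (i % 2)} for i in range(100)]

load_by_query(site, data, ["lat", "lon"])
queries_per_datum = site.queries  # one query for each of the 100 data

site.queries = 0
load_by_cache(site, data, ["lat", "lon"])
queries_cached = site.queries  # a single bulk query
```

With a real database server each query also carries network latency, so the gap between the two patterns is normally larger than the raw query count suggests.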
- load_source_metadata(mspass_object, exclude_keys=['serialized_event', 'magnitude_type'], include_undefined=False)[source]
Reads metadata from the source collection and loads standard attributes in source collection to the data passed as mspass_object. The method will only work if mspass_object has the source_id attribute set to link it to a unique document in source.
Note the mspass_object can be either an atomic object (TimeSeries or Seismogram) with a Metadata container base class or an ensemble (TimeSeriesEnsemble or SeismogramEnsemble). Ensembles will have the source data posted to the ensemble Metadata and not the members. This should be the stock way to assemble the generalization of a shot gather.
This method is DEPRECATED. It is a slow alternative to normalization with the Database-driven id matcher. Each call to this function requires a query with find_one.
- Parameters:
mspass_object (mspasspy.ccore.seismic.TimeSeries, mspasspy.ccore.seismic.Seismogram, mspasspy.ccore.seismic.TimeSeriesEnsemble, or mspasspy.ccore.seismic.SeismogramEnsemble) – data where the metadata is to be loaded
exclude_keys (a list of str) – list of attributes that should not normally be loaded. The defaults are attributes loaded from QuakeML that are not normally needed. Ignored if include_undefined is set True.
include_undefined – when True all data in the matching source document are loaded.
- Raises:
mspasspy.ccore.utility.MsPASSError – any detected errors will cause a MsPASSError to be thrown
- read_data(id_doc_or_cursor, mode='promiscuous', normalize=None, normalize_ensemble=None, load_history=False, exclude_keys=None, collection='wf', data_tag=None, ensemble_metadata={}, alg_name='read_data', alg_id='0', define_as_raw=False, merge_method=0, merge_fill_value=None, merge_interpolation_samples=0, aws_access_key_id=None, aws_secret_access_key=None)[source]
Top-level MsPASS reader for seismic waveform data objects.
Most MsPASS processing scripts use this method directly or indirectly via the parallel version called read_distributed_data. This function will return one of four seismic data objects defined in MsPASS: TimeSeries, Seismogram, TimeSeriesEnsemble, or SeismogramEnsemble. What is retrieved is driven by the type of arg0, which in the implementation has the symbol id_doc_or_cursor. As the symbol name implies it can be one of three things. Each of the three types has implied assumptions:
If arg0 is a MongoDB ObjectId it is presumed to be the ObjectId of a document in the collection defined by the collection argument (see below). When used in this way a query is always required to retrieve the document needed to construct the desired datum. This method always implies you want to construct one, atomic datum. This functionality is not recommended as it is the slowest reader due to the implicit database query required to implement that approach.
If arg0 is a python dictionary (dict) it is assumed to be a MongoDB “document” retrieved previously through the pymongo interface. This usage is the best use for serial jobs driven by a for loop over a MongoDB cursor returned following a find query. (See User’s Manual and tutorials for specific examples.) The reason is that a cursor is effectively a buffered interface into the MongoDB database. That is, a loop over a cursor requires communication with the MongoDB server only when the buffer drains to a low water mark. Consequently, cursor loops using this interface are much faster than atomic reads with ObjectIds.
If arg0 is a pymongo Cursor object the reader assumes you are asking it to construct an ensemble object. Whether that is a TimeSeriesEnsemble or SeismogramEnsemble is dependent upon the setting of the “collection” argument. The entire content of the implied list of documents returned by iterating over the cursor are used to construct ensembles of the atomic members. In fact, the basic algorithm is to call this method recursively by sequentially reading one document at a time, constructing the atomic datum, and posting it to the member vector of the ensemble.
As noted above arg0 interacts with the “collection” argument. The default of “wf_TimeSeries” can be used where appropriate but good practice is to be explicit and specify a value for “collection” in all calls to this method.
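A minimal sketch of the document-driven loop described above. It assumes db is a MsPASS Database handle that supports pymongo-style db[collection].find(query) access; the helper name read_live_data is hypothetical and exists only for this illustration.

```python
def read_live_data(db, query, collection="wf_TimeSeries"):
    """Read one atomic datum per document and keep only live results.

    db is assumed to be a MsPASS Database handle and query a normal
    MongoDB query dict.
    """
    live = []
    cursor = db[collection].find(query)
    for doc in cursor:
        # Passing the document (a dict) avoids the extra database query
        # that an ObjectId argument would require.
        datum = db.read_data(doc, collection=collection)
        if datum.live:
            live.append(datum)
    return live
```

The collection name is passed explicitly on every call, following the recommendation above, rather than relying on the default.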
This reader accepts data in any mix of what are defined by the database attribute tags storage_mode and format. If those attributes are not defined for a retrieved document they default to “storage_mode”==“gridfs” and “format”==“binary”. The storage_mode attribute can currently be one of the following:
- gridfs is taken to mean the data are stored in the MongoDB gridfs file system.
- ‘files’ implies the data are stored in conventional computer files stored with two attributes that are required with this storage mode: “dir” and “dfile”.
- URL implies the data can be retrieved by some form of web service request through a url defined by other attributes in the document. This mode is volatile and currently not recommended due to the very slow and unreliable response to FDSN data queries. It is likely, however, to become a major component with FDSN services moving to cloud systems.
This reader has prototype support for reading SCEC data stored on AWS s3. The two valid values for defining those “storage_mode”s are “s3_continuous” and “s3_event”, which map to two different areas of SCEC’s storage on AWS. We emphasize the readers for this storage mode are prototypes and subject to change. A similar interface may evolve for handling FDSN cloud data storage depending on how that interface is actually implemented. It was in development at the time this docstring was written.
The reader can read data in multiple formats. The actual format expected is driven by the database attribute called “format”. As noted above it defaults to “binary”, which means the data are stored in contiguous binary blocks on files that can be read with the low-level C function fread. That is the fastest reader currently available, but comes at a storage cost as the data are uncompressed doubles. If the data are in some nonnative format (seed is considered not native), the format is cracked using obspy. The list of formats accepted matches those defined for obspy’s read function. The format value stored in the database is passed as the format argument to that function. The miniseed format reader has been heavily exercised. Other formats will be an adventure. Be aware there will most definitely be namespace collisions of Metadata attributes with non-native formats other than miniseed.
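The defaulting rule for storage_mode and format can be sketched as follows. This is an illustration of the rule stated above, not the reader's actual code; the helper names and the example format value "mseed" are assumptions.

```python
def resolve_storage(doc):
    """Return (storage_mode, format) for a wf document, applying the
    documented defaults when either attribute is missing."""
    storage_mode = doc.get("storage_mode", "gridfs")
    fmt = doc.get("format", "binary")
    return storage_mode, fmt


def needs_obspy(doc):
    """True when the format is not the native binary layout, in which
    case the text above says the data are cracked with obspy."""
    _, fmt = resolve_storage(doc)
    return fmt != "binary"
```

A document with neither attribute therefore reads as native binary data stored in gridfs.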
There is a special case for working with ensemble data that can be expected to provide the highest performance with a conventional file system. That is the case of ensembles with the format of “binary” and data saved in files where all the waveforms of each ensemble are in the same file. That structure is the default for ensembles saved to files with the save_data method. This method is fast because the sample data are read with a C++ function that uses the stock fread function that for conventional file input is about as fast a reader as possible.
An additional critical parameter is “mode”. It must be one of “promiscuous”, “cautious”, or “pedantic”. Default is “promiscuous” which more-or-less ignores the schema and loads the entire content of the document(s) to Metadata containers. For ensembles that means the Metadata for the ensemble and all members. See the User Manual section on “CRUD Operations” for details.
This reader also has an option to normalize with one or more sets of data during reading. For backward compatibility with versions of MsPASS prior to 2.0 the “normalize” parameter will accept a list of strings that are assumed to be collection names containing Metadata that is to be “normalized”. That case, however, invokes the slowest algorithm possible to do that operation. Not only does it use a query-based matching algorithm that requires a MongoDB find_one query for each datum, but it requires construction of an instance of the python class used in MsPASS for each call and each collection listed. Better performance is normally possible with the v2 approach where the list passed via “normalize” is a list of instances of subclasses of the mspasspy.db.normalize.BasicMatcher class. That interface provides a generic matching functionality and is most useful for improving performance when using one of the cache algorithms. For more detail see the User Manual section on normalization.
Finally, this reader has to handle a special case of errors that cause the result to be invalid. We do NOT use the standard exception mechanism for this reader because as noted in our User Manual throwing exceptions can abort a large parallel job. Hence, we must try to minimize the cases when a single read operation will kill a larger workflow. (That is reserved for truly unrecoverable errors like a malloc error.) Throughout MsPASS we handle this issue with the live attribute of data objects. All data objects, including ensembles, can be killed (marked dead or live==False) as a signal they can be moved around but should be ignored in any processing workflows. Reading data in is a special case. A single read may fail for a variety of reasons but killing the entire job for something like one document containing an invalid attribute is problematic. For that reason, this reader defines the concept of a special type of dead datum it calls an “abortion”. This concept is discussed in greater detail in the User’s Manual section on “CRUD Operations”, but for this context there are three key points:
Any return from this function that is marked dead (live==False) is by definition an abortion.
This reader adds a boolean attribute not stored in the database with the key “is_abortion”. That value will be True if the datum is returned dead but should be False if it is set live.
An entire ensemble may be marked dead only if reading of all the members defined by a cursor input fails. Note handling of failures when constructing ensembles is different than atomic data because partial success is assumed to be acceptable. Hence, when a given datum in an ensemble fails, the body is not added to the ensemble; instead it is buried in a special collection called “abortions”. See User’s Manual for details.
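The three points above can be sketched as a small pair of helpers. The sketch assumes a datum exposes a live attribute and supports key access for “is_abortion”; the function names are hypothetical.

```python
def is_abortion(datum):
    """True if the datum is dead and was flagged by the reader.

    Useful later in a workflow to separate data that were never
    constructed from data killed by a processing algorithm.
    """
    return (not datum.live) and bool(datum["is_abortion"])


def partition_reads(db, docs, collection="wf_TimeSeries"):
    """Read each document and split results into live data and abortions."""
    live, abortions = [], []
    for doc in docs:
        datum = db.read_data(doc, collection=collection)
        if datum.live:
            live.append(datum)
        else:
            # Any dead return from the reader is an abortion by definition.
            abortions.append(datum)
    return live, abortions
```

No try/except block appears here because, as described above, the reader signals failures through the live attribute rather than by raising exceptions.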
- Parameters:
id_doc_or_cursor (As the name implies should be one of the following: (1) MongoDB ObjectId of wf document to be retrieved to define this read, (2) MongoDB document (python dict) of data defining the datum to be constructed, or (3) a pymongo Cursor object that can be iterated to load an ensemble. Note the “doc” can also be a Metadata subclass. That means you can use a seismic data object as input and it will be accepted. The read will work, however, only if the contents of the Metadata container are sufficient. Use of explicit or implicit Metadata container input is not advised even if it might work as it violates an implicit assumption of the function that the input is closely linked to MongoDB. A doc from a cursor or one retrieved through an ObjectId matches that assumption but a Metadata container does not.) – required key argument that drives what the reader will return. The type of this argument is a top-level control on what is read. See above for details.
mode (str) – read mode that controls how the function interacts with the schema definition. Must be one of [‘promiscuous’, ‘cautious’, ‘pedantic’]. See user’s manual for a detailed description of what the modes mean. Default is ‘promiscuous’ which turns off all schema checks and loads all attributes defined for each object read.
normalize (a list of BasicMatcher or str. BasicMatchers are applied sequentially with the normalize function using this matcher. When a list of strings is given each call to this function initiates construction of a database Id matcher using the collection name. e.g. if the list has “source” the wf read is expected to contain the attribute “source_id” that resolves to an id in the source collection. With string input that always happens only through database transactions.) – Use this parameter to do normalization during read. From version 2 onward the preferred input is a list of concrete instances of the base class BasicMatcher. For backward compatibility this parameter may also be defined as a list of collection names given as strings. Note all normalizers will, by default, kill any datum for which matching fails. Note for ensembles this parameter defines matching to be applied to all ensemble members. Use normalize_ensemble to normalize the ensemble’s Metadata container. Member normalization can be different for each member while ensemble Metadata is assumed to be the same for all members.
normalize_ensemble (a list of BasicMatcher or str. BasicMatchers are applied sequentially with the normalize function. When a list of strings is given each call to this function initiates construction of a database Id matcher using the collection name. e.g. if the list has “source” the wf read is expected to contain the attribute “source_id” that resolves to an id in the source collection. With string input that always happens only through database transactions.) – This parameter should be used to apply normalization to ensemble Metadata (attributes common to all ensemble “members”). It will be ignored if reading atomic data. Otherwise it behaves like normalize.
load_history – boolean switch controlling handling of the object-level history mechanism. When True, if the data were saved previously with a comparable switch to save the history, this reader will load the history data. This feature allows preserving a complete object-level history tree in a workflow with an intermediate save. If no history data are found the history tree is initialized. When this parameter is set False the history tree will be left null.
exclude_keys (a list of str defining the keys to be cleared.) – Sometimes it is helpful to remove one or more attributes stored in the database from the data’s Metadata (header) so they will not cause problems in downstream processing. Use this argument to supply a list of keys that should be deleted from the datum before it is returned. With ensembles the editing is applied to all members. Note this same functionality can be accomplished within a workflow with the trace editing module.
collection (str defining a collection that must be defined in the schema. If not, the function will abort the job.) – Specify the collection name for this read. In MsPASS the equivalent of header attributes are stored in MongoDB documents contained in a “collection”. The assumption is a given collection only contains documents for one of the two atomic types defined in MsPASS: TimeSeries or Seismogram. The data type for collections defined by the schema is an attribute that the reader tests to define the type of atomic objects to be constructed. (Note ensembles are assembled from atomic objects by recursive calls to this same read method.) The default is “wf_TimeSeries”, but we recommend always defining this argument for clarity and stability in the event the default would change.
data_tag (str used for the filter. Can be None in which case the cross-check test is disabled.) – The definition of a dataset can become ambiguous when partially processed data are saved within a workflow. A common example would be windowing long time blocks of data to shorter time windows around a particular seismic phase and saving the windowed data. The windowed data can be difficult to distinguish from the original with standard queries. For this reason we make extensive use of “tags” for save operations to avoid such ambiguities. Reads present a different problem as selection for such a tag is best done with MongoDB queries. If set, this argument provides a cross validation that the data are consistent with a particular tag. In particular, if a datum does not have the attribute “data_tag” set to the value defined by the argument a null, dead datum will be returned. For ensembles any document defined by the cursor input for which the “data_tag” attribute does not match will be silently dropped. The default is None which disables the cross-checking.
ensemble_metadata (python dict. The contents are copied verbatim to the ensemble Metadata container with a loop over the dict keys.) – Optional constant attributes to assign to ensemble Metadata. Ignored for atomic data. It is important to stress that the contents of this dict are applied last. Use with caution, particularly when using normalization, to avoid accidentally overwriting some attribute loaded from the database.
alg_name (str) – optional alternative name to assign for this algorithm as a tag for the origin node of the history tree. Default is the function’s name (“read_data”) and should rarely if ever be changed. Ignored unless load_history is set True.
alg_id (str) – The object-level history mechanism uses a string to tag a specific instance of running a particular function in a workflow. If a workflow has multiple instances of read_data with different parameters, for example, one might specify a value of this argument. Ignored unless load_history is set True.
define_as_raw – boolean to control a detail of the object-level history definition of the tree node created for this process. This function is by definition an “origin” but a special case is a “raw” origin meaning the sample data are unaltered (other than type) from the field data. Most miniseed data, for example, would want to set this argument True. Ignored unless load_history is set True.
merge_method – when reading miniseed data implied gaps can be present when packets are missing from a sequence of packets having common net:sta:chan:loc codes. We use obspy’s miniseed reader to crack miniseed data. It breaks such data into multiple “segments”. We then use the merge method (https://docs.obspy.org/packages/autogen/obspy.core.stream.Stream.merge.html) of the “Stream” object to glue any such segments together. This parameter is passed as the “method” argument to that function. See __add__ (https://docs.obspy.org/packages/autogen/obspy.core.trace.Trace.__add__.html#obspy.core.trace.Trace.__add__) for details on methods 0 and 1 and _cleanup (https://docs.obspy.org/packages/autogen/obspy.core.stream.Stream._cleanup.html#obspy.core.stream.Stream._cleanup) for details on method -1. Note this argument is ignored unless the reader is trying to read miniseed data.
merge_fill_value (int, float or None (default)) – Fill value for gap processing when obspy’s merge method is invoked (see description for “merge_method” above). The value given here is passed as the “fill_value” argument to the obspy merge method. As with merge_method this argument is relevant only when reading miniseed data.
merge_interpolation_samples – when merge_method is set to -1 the obspy merge function requires a value for an argument they call “interpolation_samples”. See their documentation for details, but this argument controls how “overlaps”, as opposed to gaps, are handled by merge. See the function documentation (https://docs.obspy.org/packages/autogen/obspy.core.stream.Stream.merge.html) for details.
- Returns:
For ObjectId or python dictionary values of arg0 the function will return either a mspasspy.ccore.seismic.TimeSeries or mspasspy.ccore.seismic.Seismogram object. If arg0 is a pymongo Cursor the function will return a mspasspy.ccore.seismic.TimeSeriesEnsemble or mspasspy.ccore.seismic.SeismogramEnsemble object. As noted above failures in reading atomic data will result in an object being marked dead and the Metadata attribute “is_abortion” being set True. When reading ensembles any problem members will be excluded from the output and the bodies buried in a special collection called “abortions”.
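A hedged sketch of an ensemble read using the cursor form of arg0 together with data_tag and ensemble_metadata. It assumes db is a MsPASS Database handle with pymongo-style collection access; the helper name and the “gather_type” key are hypothetical.

```python
def read_tagged_ensemble(db, query, data_tag, collection="wf_TimeSeries"):
    """Build one ensemble from all documents matching query.

    Returns None when the whole ensemble comes back dead, which per the
    text above happens only if every member read fails.
    """
    cursor = db[collection].find(query)
    ensemble = db.read_data(
        cursor,
        collection=collection,
        data_tag=data_tag,  # members with a different tag are dropped
        # Applied last, so these values overwrite same-named attributes
        # loaded from the database; "gather_type" is a made-up key.
        ensemble_metadata={"gather_type": "common_source"},
    )
    return ensemble if ensemble.live else None
```

Filtering on the tag in the query itself remains the primary selection mechanism; the data_tag argument only cross-checks the result.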
- read_ensemble_data(cursor, ensemble_metadata={}, mode='promiscuous', normalize=None, load_history=False, exclude_keys=None, collection='wf', data_tag=None, alg_name='read_ensemble_data', alg_id='0')[source]
DEPRECATED METHOD: do not use except for backward compatibility in the short term. Will go away in a later release.
- read_ensemble_data_group(cursor, ensemble_metadata={}, mode='promiscuous', normalize=None, load_history=False, exclude_keys=None, collection='wf', data_tag=None, alg_name='read_ensemble_data_group', alg_id='0')[source]
DEPRECATED METHOD: do not use except for backward compatibility in the short term. Will go away in a later release.
- read_inventory(net=None, sta=None, loc=None, time=None)[source]
Loads an obspy inventory object limited by one or more keys. Default is to load the entire contents of the site collection. Note the load creates an obspy inventory object that is returned. Use load_stations to return the raw data used to construct an Inventory.
- Parameters:
net – network name query string. Can be a single unique net code or use MongoDB’s expression query mechanism (e.g. {‘$gt’ : 42}). Default is all.
sta – station name query string. Can be a single station name or a MongoDB query expression.
loc – loc code to select. Can be a single unique location (e.g. ‘01’) or a MongoDB expression query.
time – limit return to stations with starttime<time<endtime. Input is assumed an epoch time NOT an obspy UTCDateTime. Use a conversion to epoch time if necessary.
- Returns:
obspy Inventory of all stations matching the query parameters
- Return type:
obspy Inventory
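The way the four selectors could combine into one MongoDB query can be sketched as below. This is an assumption about the general shape of such a query, not the method's actual implementation; the document keys “net”, “sta”, “loc”, “starttime”, and “endtime” are assumed names.

```python
def build_site_query(net=None, sta=None, loc=None, time=None):
    """Assemble a MongoDB query dict from the optional selectors.

    Each of net, sta, and loc may be a literal value or a MongoDB
    expression dict such as {"$in": ["ANMO", "CCM"]}.
    """
    query = {}
    if net is not None:
        query["net"] = net
    if sta is not None:
        query["sta"] = sta
    if loc is not None:
        query["loc"] = loc
    if time is not None:
        # Keep stations whose operating interval brackets the epoch time.
        query["starttime"] = {"$lt": time}
        query["endtime"] = {"$gt": time}
    return query
```

With no arguments the query is empty, which matches every document and corresponds to the documented default of loading the entire site collection.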
- save_catalog(cat, verbose=False)[source]
Save the contents of an obspy Catalog object to the MongoDB source collection. All contents are saved with no checking for existing sources with duplicate data. Like the comparable save method for stations, save_inventory, the assumption is pre or post cleanup will be performed if duplicates are a major issue.
- Parameters:
cat – is the Catalog object to be saved
verbose – Print informational data if true. When false (default) it does its work silently.
- Returns:
integer count of number of items saved
- save_data(mspass_object, return_data=False, return_list=['_id'], mode='promiscuous', storage_mode='gridfs', dir=None, dfile=None, format=None, overwrite=False, exclude_keys=None, collection=None, data_tag=None, cremate=False, save_history=True, normalizing_collections=['channel', 'site', 'source'], alg_name='save_data', alg_id='0')[source]
Standard method to save all seismic data objects to be managed by MongoDB.
This method provides a unified interface for saving all seismic data objects in MsPASS. That is, what we call atomic data (mspasspy.ccore.seismic.TimeSeries and mspasspy.ccore.seismic.Seismogram) and ensembles of the two atomic types (mspasspy.ccore.seismic.TimeSeriesEnsemble and mspasspy.ccore.seismic.SeismogramEnsemble). Handling multiple types while simultaneously supporting multiple abstractions of how the data are stored externally has some complexities.
1. All MsPASS data have multiple containers used internally to define different concepts. In particular atomic data have a Metadata container we map to MongoDB documents directly, sample data that is the largest data component that we handle multiple ways, error log data, and (optional) object-level history data. Ensembles are groups of atomic data but they also contain a Metadata and error log container with content common to the group.
With seismic data the size is normally dominated by the sample data that are stored internally as a vector container for TimeSeries objects and a matrix for Seismogram objects. Handling moving that data to storage as fast as possible is thus the key to optimize write performance. The write process is further complicated, however, by the fact that “write/save” is itself an abstraction that even in FORTRAN days hid a lot of complexity. A call to this function supports multiple save mechanisms we define through the storage_mode keyword. At present “storage_mode” can be only one of two options: “file” and “gridfs”. Note this class has prototype code for reading data in AWS s3 cloud storage that is not yet part of this interface. The API, however, was designed to allow adding one or more “storage_mode” options that allow other mechanisms to save sample data. The top priority for a new “storage_mode” when the details are finalized is cloud storage of FDSN data by Earthscope and other data centers. We have also experimented with parallel file containers like HDF5 for improved IO performance but have not yet produced a stable implementation. In the future look for all such improvements. What is clear is the API will define these alternatives through the “storage_mode” concept.
In MsPASS we make extensive use of the idea from seismic reflection processing of marking problem data “dead”. This method handles all dead data through a standard interface with the memorable/colorful name mspasspy.util.Undertaker. The default save calls the “bury” method of that class, but there is an optional “cremate” argument that calls the “cremate” method instead. The default of “bury” writes error log data to a special collection (“cemetery”) while the “cremate” method causes dead data to vanish without a trace on save.
There are two types of save concepts this method has to support. That is, sometimes one needs to save intermediate results of a workflow and continue on doing more work on the same data. The more common situation is to terminate a workflow with a call to this method. The method argument return_data is an implicit switch for these two concepts. When return_data is False, which is the default, only the ObjectId(s) of the wf document(s) are returned by the function. We use that as the default to avoid memory overflow in a final save when the compute method is called in dask or the collect method is called in pyspark. Set the return_data argument True if you are doing an intermediate save and want to reuse the content of what is returned.
There is a long history in seismology of debate on data formats. The fundamental reason for much of the debate has to do with the namespace and concepts captured in data “headers” for different formats seismologists have invented over several decades. A type example is that two “standard” formats are SEGY, which is the universal standard in seismic reflection data exchange, and SEED, which is now the standard earthquake data format. (Technically most data is now “miniseed” which differs from SEED by defining only minimal header attributes compared to the abomination of SEED that allows pretty much anything stored in an excessively complex data structure.) Further debate occurs regarding how data sets are stored in a particular format. These range from the atomic level file model of SAC to the opposite extreme of an entire data set stored in one file, which is common for SEGY. Because of the complexity of multiple formats seismologists have used for the external representation of data, this writer has limits on what it can and cannot do. Key points about how formatting is handled are:
- The default save is a native format that is fast and efficient.
- You can select alternative formats for data by setting a valid value (string) for the “format” argument to this method. What is “valid” for the format argument is currently simple to state: the value received for format is passed directly to obspy’s Stream write method’s format argument. That is, realize that when saving to a nonnative format the first thing this method does is convert the data to a Stream. It then calls obspy’s write method with the format specified.
- Given the above, it follows that if you are writing data to any nonnative format before calling this function you will almost certainly have to edit the Metadata of each datum to match the namespace for the format you want to write. If not, obspy’s writer will almost certainly fail. The only exception is miniseed where if the data being processed originate as miniseed they can often be saved as miniseed without edits.
- This writer only saves atomic data or single ensembles and knows nothing about assumptions a format makes about the larger scale layout of a dataset. e.g. you could theoretically write an ensemble in SAC format, but the output will be unreadable by SAC because multiple records could easily be written in a single file. Our perspective is that MsPASS is a framework and we do not have resources to sort out all the complexities of all the formats out there. We request the user community to share any development for nonstandard formats. Currently the only format options that are not “use at your own risk” are the native “binary” format and miniseed.
Given the complexity just described, users should not be surprised that there are limits to what this single method can do and that evolution of its capabilities is inevitable as the IT world evolves. We reiterate the entire point of this method (also the related read_data method) is to abstract the save process to make it as simple as possible while providing mechanisms to make it work as efficiently as possible.
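The two save styles described earlier (intermediate versus terminal) differ only in the return_data argument. A minimal sketch, assuming db is a MsPASS Database handle; the wrapper names are hypothetical and exist only to make the contrast explicit.

```python
def save_intermediate(db, datum, tag):
    """Save mid-workflow and get the datum back to keep processing it."""
    return db.save_data(datum, return_data=True, data_tag=tag)


def save_final(db, datum, tag):
    """Terminal save: keep only the small return value (the default
    behavior), so a dask compute or pyspark collect does not pull every
    waveform back into memory."""
    return db.save_data(datum, return_data=False, data_tag=tag)
```

Tagging each save with a distinct data_tag keeps the intermediate and final copies distinguishable in later queries.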
There are some important implementation details related to the points above that are important to understand if you encounter issues using this algorithm.
1. Writing is always atomic. Saving ensemble data is little more than an enhanced loop over data members. By “enhanced” we mean two things: (a) any ensemble Metadata attributes are copied to the documents saved for each member, and (b) dead data have to be handled differently but the handling is a simple conditional of “if live” with dead data handled the same as atomic bodies.
Atomic saves handle four components of the MsPASS data objects differently. The first thing saved is always the sample data. How that is done depends upon the storage mode and format discussed in more detail below. The writer then saves the content of the datum’s Metadata container returning a copy of the value of the ObjectId of the document saved. That ObjectID is used as a cross reference for saving the final two fundamentally different components: (a) any error log entries and (b) object-level history data (optional).
As noted above, how the sample data is stored is driven by the “storage_mode” argument. Currently two values are accepted. “gridfs”, which is the default, pushes the sample data to a MongoDB-managed internal file space that the documents refer to by that name. There are two important caveats about gridfs storage. First, the data saved will use the same file system as that used by the database server. That can easily cause a file-system-full error or a quota error if used naively. Second, gridfs IO is prone to a serious performance issue in a parallel workflow, as all workers can be shouting at the database server simultaneously, filling the network connection and/or overloading the server. For those reasons, “file” should be used for the storage mode for most applications. It is not the default because storing data in files always requires the user to implement some organizational scheme through the “dir” and “dfile” arguments. There is no one-size-fits-all solution for how a particular project needs to organize the files it produces.
Note, however, that one can set storage_mode to “file” and this writer will work. By default it sets “dir” to the current directory and “dfile” to a random string created with a uuid generator. (For ensembles the default is to write all member data to the file defined explicitly or implicitly by the “dir” and “dfile” arguments.) A final implementation detail is that if “dir” and “dfile” are left at the default None, the algorithm will attempt to extract the values of “dir” and “dfile” from the Metadata of an atomic datum referenced in the save (i.e. for ensembles each datum will be treated independently). That feature allows more complex data organization schemes managed externally from the writer for ensembles.
All of that was designed to make this method as bombproof as possible, but users need to be aware that naive usage can create a huge mess. For example, with “file” and null “dir” and “dfile”, saving a million atomic data will create a million files with random names in your current directory. Good luck cleaning that mess up. Finally, we reiterate that when Earthscope finishes their design work for cloud storage access there will probably be a new storage mode added to support that access. The “s3” methods in Database should be viewed only as prototypes.
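The fallback order for “dir” and “dfile” described above can be sketched in plain Python. This is only an illustration of the documented precedence (explicit argument, then the datum's Metadata, then the built-in default); `resolve_output_path` is a hypothetical helper, not part of the MsPASS API, and a plain dict stands in for the Metadata container.

```python
import os
import uuid


def resolve_output_path(dir=None, dfile=None, metadata=None):
    """Hypothetical helper mirroring the documented dir/dfile fallback."""
    metadata = metadata or {}  # a dict stands in for the Metadata container
    # Precedence: explicit argument, then Metadata value, then default.
    if dir is None:
        dir = metadata.get("dir", os.getcwd())
    if dfile is None:
        # Last resort: a random name from a uuid generator.
        dfile = metadata.get("dfile", str(uuid.uuid4()))
    # Relative directories are expanded to a full path before storage.
    return os.path.abspath(dir), dfile


# Explicit arguments win over values found in Metadata.
print(resolve_output_path("wf/2024", "day001.dat", {"dir": "/tmp", "dfile": "x"}))
# Metadata values are used when the arguments are left as None.
print(resolve_output_path(metadata={"dir": "/data/proj", "dfile": "evt42.dat"}))
# With nothing defined, each call produces a new random file name.
print(resolve_output_path())
```

The last call shows why leaving both arguments undefined for a large dataset is risky: every datum gets its own randomly named file in the current directory.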
If any datum has an error log entry, it is always saved by this method in a collection called “elog”. That save is not optional, as we view all error log entries as significant. Each elog document contains the ObjectId of the wf document with which it is associated.
Handling of the bodies of dead data has some complexity. Since v2 of Database all dead data are passed through an instance of mspasspy.util.Undertaker. By default atomic dead data are passed through the mspasspy.util.Undertaker.bury() method. If the “cremate” argument is set True the mspasspy.util.Undertaker.cremate() method will be called. For ensembles the writer calls the mspasspy.util.Undertaker.bring_out_your_dead() method. If cremate is set True the bodies are discarded. Otherwise the bury method is called in a loop over all bodies. Note the default bury method creates a document for each dead datum in one of two special collections: (a) data killed by a processing algorithm are saved to the “cemetery” collection, and (b) data killed by a reader are saved to “abortions”. (An abortion is a datum that was never successfully constructed, i.e. born.) The documents for the two collections contain the same name-value pairs, but we split them because the action required by an abortion is very different from a normal kill. See the User Manual section on “CRUD Operations” for information on this topic.
Object-level history is not saved unless the argument “save_history” is set True (default is False). When enabled, the history data (if defined) is saved to a collection called “history”. The document saved, like elog, has the ObjectId of the wf document with which it is associated. Be aware that there is a significant added cost to creating a history document entry. Because calling this method is part of the history chain one would want to preserve, an atomic database operation is required for each datum saved. That is, the algorithm does an update of the wf document with which it is associated, adding the ObjectId of the history document it saved.
As noted above, saving data with “format” set to anything but the default “binary” or “MSEED” is adventure land. We reiterate that saving other formats is a development frontier that we hope will advance through contributions from users who need writers for a particular format and are intimately familiar with that format’s idiosyncrasies.
The “mode” argument has a number of subtle idiosyncrasies that need to be recognized. In general we recommend always writing data (i.e. calling this method) with the default mode of “promiscuous”. That guarantees you won’t mysteriously lose data being saved or abort the workflow when it is finished. In general it is better to use one of the verify methods or the dbverify command line tool to look for problem attributes in any saved documents than to try to enforce rules by setting mode to “cautious” or “pedantic”. On the other hand, even if running in “promiscuous” mode certain rules are enforced:
Any aliases defined in the schema are always reset to the key defined for the schema. e.g. if you used the alias “dt” for the data sample interval this writer will always change it to “delta” (with the standard schema anyway) before saving the document created from that datum.
All modes enforce a schema constraint for “readonly” attributes. An immutable (readonly) attribute by definition should not be changed during processing. During a save all attributes with a key defined as readonly are tested with a method in the Metadata container that keeps track of any Metadata changes. If a readonly attribute is found to have been changed it will be renamed with the prefix “READONLYERROR_”, saved, and an error posted (e.g. if you try to alter site_lat (a readonly attribute) in a workflow, when you save the waveform you will find an entry with the key READONLYERROR_site_lat.) We emphasize the issue happens only if the value associated with such a key was altered after the datum was constructed. If the attribute does not change it is ERASED and will not appear in the document. The reason for that is this feature exists to handle attributes loaded in normalization. Be aware this feature can bloat the elog collection if an entire dataset shares a problem this algorithm flags as a readonly error.
This method can throw an exception, but only for errors in usage (i.e. arguments defined incorrectly).
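The two rules enforced in every mode can be illustrated with a small stand-alone sketch. It assumes a plain dict in place of the Metadata container and a set of modified keys in place of its change tracking; `apply_save_rules` and its arguments are hypothetical names, not the actual implementation.

```python
def apply_save_rules(md, modified, aliases, readonly):
    """Return (document, error_messages) for a Metadata-like dict.

    md       -- dict standing in for the datum's Metadata
    modified -- set of keys changed since the datum was constructed
    aliases  -- dict mapping alias -> schema key
    readonly -- set of schema keys marked readonly
    """
    doc, errors = {}, []
    for key, value in md.items():
        changed = key in modified
        key = aliases.get(key, key)  # rule 1: aliases revert to the schema key
        if key in readonly:
            if changed:
                # rule 2a: an altered readonly value is saved under a flagged key
                doc["READONLYERROR_" + key] = value
                errors.append(f"readonly attribute {key} was modified")
            # rule 2b: an unaltered readonly value is erased from the document
            continue
        doc[key] = value
    return doc, errors


md = {"dt": 0.01, "site_lat": 35.2, "site_lon": -118.0, "npts": 1000}
doc, errors = apply_save_rules(
    md,
    modified={"site_lat"},
    aliases={"dt": "delta"},
    readonly={"site_lat", "site_lon"},
)
print(doc)
print(errors)
```

In this example the unchanged `site_lon` disappears from the output, the altered `site_lat` is kept under the flagged name, and the alias `dt` is written as `delta`.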
- Parameters:
mspass_object (mspasspy.ccore.seismic.TimeSeries, mspasspy.ccore.seismic.Seismogram, mspasspy.ccore.seismic.TimeSeriesEnsemble, or mspasspy.ccore.seismic.SeismogramEnsemble) – the seismic object you want to save.
return_data – When True return a (usually edited) copy of the data received. When False, the default, return only a requested subset of the attributes in the wf document saved for this datum.
return_list – list of keys for attributes to extract and set in the returned python dictionary when return_data is False. Ignored if return_data is True. Be warned that for ensembles the attributes are expected to be in the ensemble Metadata container. Missing attributes will be silently ignored, so an empty dict will be returned if none of the attributes in this list are defined.
mode (str) – This parameter defines how attributes defined with key-value pairs in MongoDB documents are to be handled on reading. By “to be handled” we mean how strongly to enforce name and type specification in the schema for the type of object being constructed. Options are [‘promiscuous’, ‘cautious’, ‘pedantic’] with ‘promiscuous’ being the default. See the User’s manual for more details on the concepts and how to use this option.
storage_mode (str) – Currently must be either “gridfs” or “file”. When set to “gridfs” the waveform data are stored internally and managed by MongoDB. If set to “file” the data will be stored in a file system with the dir and dfile arguments defining a file name. The default is “gridfs”. See above for more details.
dir (str) – file directory for storage. This argument is ignored if storage_mode is set to “gridfs”. When storage_mode is “file” it sets the directory in a file system where the data should be saved. Note this can be an absolute or relative path. If the path is relative it will be expanded with the standard python library path functions to a full path name for storage in the database document with the attribute “dir”. As for any io, we remind the user that you must have write permission in this directory. Note if this argument is None (default) and storage_mode is “file” the algorithm will first attempt to extract “dir” from the Metadata of mspass_object. If that is defined it will be used as the write directory. If it is not defined it will default to the current directory.
dfile (str) – file name for storage of waveform data. As with dir, this parameter is ignored if storage_mode is “gridfs” and is required only if storage_mode is “file”. Note that this file name does not have to be unique. The writer always positions the write pointer to the end of the file referenced and sets the attribute “foff” to that position. That allows automatic appends to files without concerns about unique names. Like the dir argument, if this argument is None (default) and storage_mode is “file” the algorithm will first attempt to extract “dfile” from the Metadata of mspass_object. If that is defined it will be used as the output filename. If it is not defined a uuid generator is used to create a file name with a random string of characters. That usage is never a good idea, but is a feature added to make this method more bombproof.
format (str) – the format of the file. This can be one of the supported formats of the ObsPy writer. The default is the python None, which the method assumes means to store the data in its raw binary form. The default should normally be used for efficiency. Alternate formats are primarily a simple export mechanism. See the User’s manual for more details on data export. Used only for “file” storage mode. A discussion of format caveats can be found above.
overwrite (boolean) – If true, gridfs data linked to the original waveform will be replaced by the sample data from this save. Default is false, and should be the normal use. This option should never be used after a reduce operator, as the parents are not tracked and the space advantage is likely minimal for the confusion it would cause. This is most useful for light, stable preprocessing with a set of map operators to regularize a data set before more extensive processing. It can only be used when storage_mode is set to gridfs.
exclude_keys (a list of str) – Metadata can often become contaminated with attributes that are no longer needed or are a mismatch with the data. A typical example is the bundle algorithm, which takes three TimeSeries objects and produces a single Seismogram from them. That process can, and usually does, leave things like seed channel names and orientation attributes (hang and vang) from one of the components as extraneous baggage. Use this list of keys to prevent such attributes from being written to the output documents. Note if the data being saved lack these keys nothing happens, so it is safer, albeit slower, to have the list be as large as necessary to eliminate any potential debris.
collection – The default for this parameter is the python None. The default should be used for all but data export functions. The normal behavior is for this writer to use the object data type to determine the schema it should use for any type or name enforcement. This parameter allows an alternate collection to be used with or without some different name and type restrictions. The most common use of anything other than the default is an export to a different format.
data_tag (str) – a user specified “data_tag” key. See above and the User’s manual for guidance on the use of this option.
cremate – boolean controlling how dead data are handled. When True a datum marked dead is ignored and the body or a shell of the body (which one depends on the return_list value) is returned. If False (default) the mspasspy.util.Undertaker.bury() method is called with the body as input. That creates a document for each dead datum either in the “cemetery” or “abortions” collection. See above for more details.
normalizing_collections – list of collection names dogmatically treated as normalizing collection names. The keywords in the list are used to always (i.e. for all modes) erase any attribute with a key name of the form collection_attribute, where collection is one of the collection names in this list and attribute is any string. Attribute names with the “_” separator are saved unless the collection field matches one of the strings (e.g. “channel_vang” will be erased before saving to the wf collection while “foo_bar” will not be erased.) This list should ONLY be changed if a different schema than the default mspass schema is used and different names are used for normalizing collections. (e.g. if one added a “shot” collection to the schema the list would need to be changed to at least add “shot”.)
save_history – When True the optional history data will be saved to the database if it was actually enabled in the workflow. If the history container is empty the method will silently do nothing. Default is False, meaning history data is ignored.
- Returns:
python dict of requested attributes by default. Edited copy of input when return_data is True.
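The append behavior described for the dfile parameter (seek to the end of the file, record that position as “foff”, then write) can be sketched with the standard library alone. The sketch assumes samples are written as raw little-endian 8-byte floats; `append_samples` is a hypothetical helper and the actual on-disk layout used by the writer is not specified here.

```python
import os
import struct
import tempfile


def append_samples(path, samples):
    """Append samples as raw 8-byte floats and return the offset of the write."""
    with open(path, "ab") as fh:
        fh.seek(0, os.SEEK_END)  # position the write pointer at end of file
        foff = fh.tell()         # this is the value one would store as "foff"
        fh.write(struct.pack(f"<{len(samples)}d", *samples))
    return foff


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "shared.dat")
    first = append_samples(path, [1.0, 2.0, 3.0])
    second = append_samples(path, [4.0, 5.0])
    print(first, second)  # 0 24
```

Because each write records its own offset, many data objects can share one file name without any of them needing a unique dfile.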
- save_dataframe(df, collection, null_values=None, one_to_one=True, parallel=False)[source]
Transfer a dataframe into a set of documents, and store them in a specified collection. In one_to_one mode every row in the dataframe will be saved in the specified mongodb collection. Otherwise duplicates are discarded.
Note we first call the dropna method of DataFrame to eliminate None values to mesh with how MongoDB handles nulls. That is not consistent with DataFrame implementations that mimic relational database tables, where nulls are a hole in the table.
- Parameters:
df – Pandas.Dataframe object, the input to be transferred into mongodb documents
collection – MongoDB collection name to be used to save the (often subsetted) tuples of the dataframe as documents in this collection.
null_values – an optional dict defining null field values. When used, an == test is applied to each attribute with a key defined in the null_values python dict. If == returns True, the value will be set as None in the dataframe. If your table has a lot of null fields this option can save space, but readers must not require the null field. The default is None, which is taken to mean there are no null fields defined.
one_to_one – a boolean to control if the set should be filtered by rows. The default is True, which means every row in the dataframe will create a single MongoDB document. If False the (normally reduced) set of attributes defined by attributes_to_use will be filtered with the panda/dask dataframe drop_duplicates method before converting the dataframe to documents and saving them to MongoDB. That approach is important, for example, to filter things like Antelope “site” or “sitechan” attributes created by a join to something like wfdisc and saved as a text file to be processed by this function.
parallel – a boolean that determines if the dask api will be used for operations on the dataframe. Default is false.
- Returns:
integer count of number of documents added to collection
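The effect of null_values and one_to_one can be shown with a plain-Python stand-in that uses a list of dicts in place of a dataframe. This is a sketch of the documented behavior only; `rows_to_documents` is a hypothetical name, and the real method works through the pandas or dask dataframe API rather than this loop.

```python
def rows_to_documents(rows, null_values=None, one_to_one=True):
    """Convert table rows (dicts) to documents, dropping null fields."""
    null_values = null_values or {}
    docs, seen = [], set()
    for row in rows:
        # Drop fields that are None or equal to the declared null value.
        doc = {
            k: v
            for k, v in row.items()
            if v is not None and not (k in null_values and v == null_values[k])
        }
        if not one_to_one:
            # Stand-in for drop_duplicates: skip rows already emitted.
            fingerprint = tuple(sorted(doc.items()))
            if fingerprint in seen:
                continue
            seen.add(fingerprint)
        docs.append(doc)
    return docs


rows = [
    {"sta": "AAK", "lat": 42.64, "elev": -999.0},
    {"sta": "AAK", "lat": 42.64, "elev": -999.0},
    {"sta": "ANMO", "lat": 34.95, "elev": 1.85},
]
all_docs = rows_to_documents(rows, null_values={"elev": -999.0})
unique_docs = rows_to_documents(rows, null_values={"elev": -999.0}, one_to_one=False)
print(len(all_docs), len(unique_docs))  # 3 2
```

Note that a document produced this way simply lacks the null field, which is why readers must not require it.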
- save_ensemble_data(ensemble_object, mode='promiscuous', storage_mode='gridfs', dir_list=None, dfile_list=None, exclude_keys=None, exclude_objects=None, collection=None, data_tag=None, alg_name='save_ensemble_data', alg_id='0')[source]
Save an Ensemble container of a group of data objects to MongoDB.
DEPRECATED METHOD: use save_method instead.
Ensembles are a core concept in MsPASS that are a generalization of fixed “gather” types frozen into every seismic reflection processing system we know of. This is a writer for data stored in such a container. It is little more than a loop over each “member” of the ensemble calling the Database.save_data method for each member. For that reason most of the arguments are passed downstream directly to save_data. See the save_data method and the User’s manual for more verbose descriptions of their behavior and expected use.
The only complexity in handling an ensemble is that our implementation has a separate Metadata container associated with the overall group that is assumed to be constant for every member of the ensemble. For this reason, before entering the loop calling save_data on each member, the method calls the object’s sync_metadata method, which copies (overwrites if previously defined) the ensemble attributes to each member. That ensures atomic saves will not lose their association with a unique ensemble indexing scheme.
A final feature of note is that an ensemble can be marked dead. If the entire ensemble is set dead this function returns immediately and does nothing.
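The control flow described above (return immediately for a dead ensemble, copy ensemble attributes to every member, then save each member) can be sketched with dicts standing in for the Metadata containers. The function and argument names are hypothetical, and appending a copy to a list stands in for the call to save_data.

```python
def save_ensemble_sketch(ensemble_md, members, ensemble_is_dead=False):
    """Illustrate the loop structure; members are dicts of member Metadata."""
    if ensemble_is_dead:
        return []  # a dead ensemble is not processed at all
    saved = []
    for member in members:
        # Equivalent of sync_metadata: ensemble values overwrite member values.
        member.update(ensemble_md)
        saved.append(dict(member))  # stand-in for saving the member
    return saved


ensemble_md = {"source_id": "evt42", "gather_type": "common_source"}
members = [
    {"sta": "AAK", "source_id": "stale"},
    {"sta": "ANMO"},
]
saved = save_ensemble_sketch(ensemble_md, members)
print(saved)
print(save_ensemble_sketch(ensemble_md, members, ensemble_is_dead=True))
```

The first member shows the overwrite case: its stale `source_id` is replaced by the ensemble value before the save.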
- Parameters:
ensemble_object (either mspasspy.ccore.seismic.TimeSeriesEnsemble or mspasspy.ccore.seismic.SeismogramEnsemble) – the ensemble you want to save.
mode (str) – reading mode regarding schema checks, should be one of [‘promiscuous’, ‘cautious’, ‘pedantic’]
storage_mode (str) – “gridfs” stores the object in the mongodb grid file system (recommended). “file” stores the object in a binary file, which requires dfile and dir.
dir_list – A list of file directories if using “file” storage mode. File directory is str type.
dfile_list – A list of file names if using “file” storage mode. File name is str type.
exclude_keys (a list of str) – the metadata attributes you want to exclude from being stored.
exclude_objects (list) – A list of indexes, where each specifies an object in the ensemble you want to exclude from being saved. Starting from 0.
collection – the collection name you want to use. If not specified, use the defined collection in the metadata schema.
data_tag (str) – a user specified “data_tag” key to tag the saved wf document.
- save_ensemble_data_binary_file(ensemble_object, mode='promiscuous', dir=None, dfile=None, exclude_keys=None, exclude_objects=None, collection=None, data_tag=None, kill_on_failure=False, alg_name='save_ensemble_data_binary_file', alg_id='0')[source]
Save an Ensemble container of a group of data objects to MongoDB.
DEPRECATED METHOD: use save_method instead.
Ensembles are a core concept in MsPASS that are a generalization of fixed “gather” types frozen into every seismic reflection processing system we know of. This is a writer for data stored in such a container. It saves all the objects in the ensemble to one file.
Our implementation has a separate Metadata container associated with the overall group that is assumed to be constant for every member of the ensemble. For this reason the method calls the object’s sync_metadata method, which copies (overwrites if previously defined) the ensemble attributes to each member. That ensures atomic saves will not lose their association with a unique ensemble indexing scheme.
A final feature of note is that an ensemble can be marked dead. If the entire ensemble is set dead this function returns immediately and does nothing.
- Parameters:
ensemble_object (either mspasspy.ccore.seismic.TimeSeriesEnsemble or mspasspy.ccore.seismic.SeismogramEnsemble) – the ensemble you want to save.
mode (str) – reading mode regarding schema checks, should be one of [‘promiscuous’, ‘cautious’, ‘pedantic’]
dir (str) – file directory.
dfile (str) – file name.
exclude_keys (a list of str) – the metadata attributes you want to exclude from being stored.
exclude_objects (list) – A list of indexes, where each specifies an object in the ensemble you want to exclude from being saved. Starting from 0.
collection – the collection name you want to use. If not specified, use the defined collection in the metadata schema.
data_tag (str) – a user specified “data_tag” key to tag the saved wf document.
kill_on_failure (boolean) – When true, if an io error occurs the data object’s kill method will be invoked. When false (the default) io errors are logged and the datum is left set live. (Note data already marked dead are ignored by this function.)
- save_inventory(inv, networks_to_exclude=['SY'], verbose=False)[source]
Saves contents of all components of an obspy inventory object to documents in the site and channel collections. The site collection is sufficient for Seismogram objects but TimeSeries data will normally need to be connected to the channel collection. The algorithm used will not add duplicates based on the following keys:
- For site:
net sta chan loc starttime::endtime - this check is done cautiously with a 10 s fudge factor to avoid the issue of floating point equality tests. Probably overly paranoid since these fields are normally rounded to a time at the beginning of a UTC day, but it is a small cost to pay for stability because this function is not expected to be run millions of times on a huge collection.
- for channels:
net sta chan loc starttime::endtime - same approach as for site with the same issues - note especially the 10 s fudge factor. This is necessary because channel metadata can change more frequently than site metadata (e.g. with a sensor orientation or sensor swap)
The channel collection can contain full response data that can be obtained by extracting the data with the key “serialized_inventory” and running pickle loads on the returned string.
A final point of note is that not all Inventory objects are created equally. Inventory objects appear to us to be designed as an image of stationxml data. The problem is that stationxml, like SEED, has to support a lot of complexity faced by data centers that end users like those using this package do not need or want to know. The point is this method tries to untangle the complexity and aims to reduce the result to a set of documents in the site and channel collection that can be cross referenced to link the right metadata with all waveforms in a dataset.
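The duplicate test with a 10 s fudge factor can be illustrated as follows. This is a sketch under stated assumptions: times are epoch seconds stored as floats, documents are plain dicts, and `is_duplicate` is a hypothetical helper rather than the function the method actually uses.

```python
def is_duplicate(existing, candidate, tolerance=10.0):
    """True if the name keys match and both times agree within tolerance (s)."""
    for key in ("net", "sta", "chan", "loc"):
        if existing.get(key) != candidate.get(key):
            return False
    # Compare times with a tolerance instead of a floating point == test.
    return (
        abs(existing["starttime"] - candidate["starttime"]) <= tolerance
        and abs(existing["endtime"] - candidate["endtime"]) <= tolerance
    )


stored = {"net": "IU", "sta": "ANMO", "chan": "BHZ", "loc": "00",
          "starttime": 1_500_000_000.0, "endtime": 1_600_000_000.0}
same_epoch = dict(stored, starttime=1_500_000_003.5)
new_epoch = dict(stored, starttime=1_600_000_000.0, endtime=1_700_000_000.0)

print(is_duplicate(stored, same_epoch))  # True
print(is_duplicate(stored, new_epoch))   # False
```

A sensor swap that starts a new epoch changes the time range by far more than the tolerance, so it is stored as a separate channel document.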
- Parameters:
inv – is the obspy Inventory object of station data to save.
networks_to_exclude – should contain a list (or tuple) of SEED 2 byte network codes that are to be ignored in processing. Default is SY, which is used for synthetics. Set to None if all are to be loaded.
verbose – print informational lines if True. If False, works silently.
- Returns:
tuple with 0 - integer number of site documents saved 1 -integer number of channel documents saved 2 - number of distinct site (net,sta,loc) items processed 3 - number of distinct channel items processed
- Return type:
tuple
- save_textfile(filename, collection='textfile', separator='\\s+', type_dict=None, header_line=0, attribute_names=None, rename_attributes=None, attributes_to_use=None, null_values=None, one_to_one=True, parallel=False, insert_column=None)[source]
Import and parse a textfile into a set of documents, and store them into a mongodb collection. This function consists of two steps:
1. Textfile2Dataframe: Convert the input textfile into a Pandas dataframe
2. save_dataframe: Insert the documents in that dataframe into a mongodb collection
- Parameters:
filename – path to the text file that is to be read to create the table object that is to be processed (internally we use pandas or dask dataframes)
collection – MongoDB collection name to be used to save the (often subsetted) tuples of filename as documents in this collection.
separator – The delimiter used for separating fields. The default is “\s+”, which is the regular expression for “one or more whitespace characters”. For a csv file, its value should be set to ‘,’. This parameter will be passed into pandas.read_csv or dask.dataframe.read_csv. To learn more details about the usage, check the following links: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html https://docs.dask.org/en/latest/generated/dask.dataframe.read_csv.html
type_dict – pairs of each attribute and its type, used to validate the type of each input item
header_line – defines the line to be used as the attribute names for columns. If it is < 0, attribute_names is required. Please note that if attribute_names is provided, the attributes defined in header_line will always be overridden.
attribute_names – If used, this argument must be a list of (unique) string names that define the attribute name tag for each column of the input table. The length of the list must match the number of columns in the input table or this function will throw a MsPASSError exception. This argument is None by default, which means the function will use the line specified by the “header_line” argument as column headers defining the attribute names. If header_line is less than 0 this argument is required. When header_line is >= 0 and this argument (attribute_names) is defined, all the names in this list will override those stored in the file at the specified line number.
rename_attributes – This is expected to be a python dict keyed by names matching those defined in the file or attribute_names array (i.e. the panda/dataframe column index names) and values defining strings to use to override the original names. That usage, of course, is most common to override names in a file. If you want to change all the names, use a custom attribute_names array as noted above. This argument is mostly to rename a small number of anomalous names.
attributes_to_use – If used, this argument must define a list of attribute names that define the subset of the dataframe attributes that are to be saved. For relational db users this is effectively a “select” list of attribute names. The default is None, which is taken to mean no selection is to be done.
one_to_one – an important boolean used to control if the output is or is not filtered by rows. The default is True, which means every tuple in the input file will create a single row in the dataframe. (Useful, for example, to construct a wf_miniseed collection from css3.0 attributes.) If False the (normally reduced) set of attributes defined by attributes_to_use will be filtered with the panda/dask dataframe drop_duplicates method. That approach is important, for example, to filter things like Antelope “site” or “sitechan” attributes created by a join to something like wfdisc and saved as a text file to be processed by this function.
parallel – When true we use the dask dataframe operation. The default is false, meaning the simpler, identical api panda operators are used.
insert_column – a dictionary of new columns to add, and their value(s). If the content is a single value, it can be passed to define a constant value for the entire column of data. The content can also be a list. In that case, the list should contain the values that are to be set, and it must be the same length as the number of tuples in the table.
- Returns:
Integer count of number of documents added to collection
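How header_line, attribute_names, rename_attributes and attributes_to_use interact can be illustrated with a standard-library stand-in for the first step. This sketch uses `re.split` where the real function passes the separator to pandas or dask read_csv, leaves all values as strings (type_dict is not modeled), raises ValueError where the documentation names MsPASSError, and assumes renaming is applied before selection. `parse_text_table` is a hypothetical name.

```python
import re


def parse_text_table(text, header_line=0, attribute_names=None,
                     rename_attributes=None, attributes_to_use=None,
                     separator=r"\s+"):
    """Parse delimited text into a list of dicts (one per data row)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if header_line >= 0:
        header = re.split(separator, lines[header_line])
        body = lines[header_line + 1:]
    else:
        header, body = None, lines
    # attribute_names, when given, always overrides names read from the file.
    names = attribute_names or header
    if names is None:
        raise ValueError("attribute_names is required when header_line < 0")
    renames = rename_attributes or {}
    names = [renames.get(n, n) for n in names]
    docs = []
    for ln in body:
        fields = re.split(separator, ln)
        if len(fields) != len(names):
            raise ValueError("column count does not match attribute names")
        doc = dict(zip(names, fields))
        if attributes_to_use is not None:
            # Equivalent of a relational "select" list.
            doc = {k: doc[k] for k in attributes_to_use}
        docs.append(doc)
    return docs


text = "sta  lat   lon\nAAK  42.64 74.49\nANMO 34.95 -106.46\n"
docs = parse_text_table(
    text,
    rename_attributes={"sta": "site_sta"},
    attributes_to_use=["site_sta", "lat"],
)
print(docs)
```

For a csv file the same call would be made with `separator=","`.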
- set_database_schema(schema)[source]
Change the database schema defined for this handle.
Normal use sets the schema at the time handle is created. In rare instances it can be useful to change the schema on the fly. Use this method to do that for the database schema component. An alternative is to create a new instance of Database with the new schema, but that approach is much slower than using this method. Whether that matters is dependent on the number of times that operation is required.
- Parameters:
schema (mspasspy.db.schema.MetadataSchema or a str) – Specification of the schema to use. Can be a string defining a path to a yaml file defining the schema or an instance of mspasspy.db.schema.MetadataSchema that was previously created by reading such a file.
- set_metadata_schema(schema)[source]
Change the metadata schema defined for this handle.
Normal use sets the schema at the time handle is created. In rare instances it can be useful to change the schema on the fly. Use this method to do that for the Metadata component. An alternative is to create a new instance of Database with the new schema, but that approach is much slower than using this method. Whether that matters is dependent on the number of times that operation is required.
- Parameters:
schema (mspasspy.db.schema.MetadataSchema or a str) – Specification of the schema to use. Can be a string defining a path to a yaml file defining the schema or an instance of mspasspy.db.schema.MetadataSchema that was previously created by reading such a file.
- set_schema(schema)[source]
Use this method to change both the database and metadata schema defined for this instance of a database handle. This method sets the database schema (namespace for attributes saved in MongoDB) and the metadata schema (internal namespace).
- Parameters:
schema – a str of the yaml file name.
- update_data(mspass_object, collection=None, mode='cautious', exclude_keys=None, force_keys=None, data_tag=None, normalizing_collections=['channel', 'site', 'source'], alg_id='0', alg_name='Database.update_data')[source]
Updates both metadata and sample data corresponding to an input data object.
Since storage of data objects in MsPASS is broken into multiple collections and storage methods, doing a full data update has some complexity. This method handles the problem differently for the different pieces:
An update is performed on the parent wf collection document. That update makes use of the related Database method called update_metadata.
If the error log is not empty it is saved.
If the history container has contents it is saved.
The sample data is the thorniest problem. Currently this method will only do sample updates for data stored in the mongodb gridfs system. With files containing multiple waveforms it would be necessary to append to the files, and this could create a bloat problem with large data sets, so we do not currently support that type of update.
A VERY IMPORTANT implicit feature of this method is that if the magic key “gridfs_id” exists, the sample data in the input to this method (mspass_object.data) will overwrite any existing content of gridfs found at the matching id. This is a somewhat hidden feature, so beware.
- Parameters:
mspass_object (either mspasspy.ccore.seismic.TimeSeries or mspasspy.ccore.seismic.Seismogram) – the object you want to update.
exclude_keys (a list of str) – a list of metadata attributes you want to exclude from being updated.
force_keys (a list of str) – a list of metadata attributes you want to force to be updated. Normally this method will only update attributes that have been marked as changed since creation of the parent data object. If data with these keys is found in the mspass_object they will be added to the update record.
collection – the collection name you want to use. If not specified, use the defined collection in the metadata schema.
mode (str) – This parameter defines how strongly to enforce schema constraints. As described above, ‘promiscuous’ just updates all changed values with no schema tests. ‘cautious’, the default, enforces type constraints and tries to convert easily fixed type mismatches (e.g. int to float or vice versa). Both ‘cautious’ and ‘pedantic’ may leave one or more complaint messages in the elog of mspass_object on how the method did or did not fix mismatches with the schema. Both also will drop any key-value pairs where the value cannot be converted to the type defined in the schema.
normalizing_collections – list of collection names dogmatically treated as normalizing collection names. The keywords in the list are used to always (i.e. for all modes) erase any attribute with a key name of the form collection_attribute, where collection is one of the collection names in this list and attribute is any string. Attribute names with the “_” separator are saved unless the collection field matches one of the strings (e.g. “channel_vang” will be erased before saving to the wf collection while “foo_bar” will not be erased.) This list should ONLY be changed if a different schema than the default mspass schema is used and different names are used for normalizing collections. (e.g. if one added a “shot” collection to the schema the list would need to be changed to at least add “shot”.)
alg_name (str) – the name of the function to record when preserving the history. (Defaults to ‘Database.update_data’ and should not normally need to be changed.)
alg_id (bson.ObjectId.ObjectId) – a unique id to record the usage of the function when preserving the history.
- Returns:
mspass_object data. Normally this is an unaltered copy of the data passed through mspass_object. If there are errors, however, the elog will contain new messages. All such messages, however, should be saved in the elog collection because elog is the last collection updated.
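The key-name rule given for normalizing_collections can be sketched as a simple filter. This follows the wording of the parameter description literally and uses a dict in place of Metadata; `strip_normalization_keys` is a hypothetical name, and whether any keys with a collection prefix (id keys, for instance) are exempt in the actual implementation is not stated here and is not modeled.

```python
def strip_normalization_keys(md, normalizing_collections=("channel", "site", "source")):
    """Drop keys of the form collection_attribute for the listed collections."""
    kept = {}
    for key, value in md.items():
        # Split at the first "_" to isolate the leading collection field.
        collection, sep, _attribute = key.partition("_")
        if sep and collection in normalizing_collections:
            continue  # erased regardless of mode
        kept[key] = value
    return kept


md = {"channel_vang": 90.0, "site_lat": 35.2, "foo_bar": 1, "npts": 1000}
print(strip_normalization_keys(md))
# Adding "shot" to the list makes shot_* attributes disappear as well.
print(strip_normalization_keys({"shot_x": 10.0, "npts": 5},
                               ("channel", "site", "source", "shot")))
```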
- update_ensemble_metadata(ensemble_object, mode='promiscuous', exclude_keys=None, exclude_objects=None, collection=None, alg_name='update_ensemble_metadata', alg_id='0')[source]
Updates (or saves if it’s new) the mspasspy ensemble object, including saving the processing history, elogs and metadata attributes.
DEPRECATED METHOD: Do not use.
This method is a companion to save_ensemble_data. The relationship is comparable to that between the save_data and update_metadata methods. In particular, this method is mostly for internal use to save the contents of the Metadata container in each ensemble member. Like save_ensemble_data it is mainly a loop over ensemble members calling update_metadata on each member. Also like update_metadata it is advanced usage to use this method directly. Most users will apply it under the hood as part of calls to save_ensemble_data.
- Parameters:
ensemble_object (either mspasspy.ccore.seismic.TimeSeriesEnsemble or mspasspy.ccore.seismic.SeismogramEnsemble) – the ensemble you want to update.
mode (str) – reading mode regarding schema checks, should be one of [‘promiscuous’, ‘cautious’, ‘pedantic’]
exclude_keys (a list of str) – the metadata attributes you want to exclude from being updated.
exclude_objects (list) – a list of indexes, where each specifies an object in the ensemble you want to exclude from being saved. The index starts at 0.
collection – the collection name you want to use. If not specified, use the defined collection in the metadata schema.
ignore_metadata_changed_test – if specified as True, we do not check whether the attributes we want to update are in the Metadata.modified() set. Defaults to False.
data_tag (str) – a user specified “data_tag” key to tag the saved wf document.
- update_metadata(mspass_object, collection=None, mode='cautious', exclude_keys=None, force_keys=None, normalizing_collections=['channel', 'site', 'source'], alg_name='Database.update_metadata')[source]
Use this method if you want to save the output of a processing algorithm whose output is only posted to metadata. That can be something as simple as a little python function that does some calculations on other metadata fields, or as elaborate as a bound FORTRAN or C/C++ function that computes something, posts the results to Metadata, but doesn’t actually alter the sample data. A typical example of the latter is an amplitude calculation that posts the computed amplitude to some metadata key value.
This method will ONLY attempt to update Metadata attributes stored in the data passed (mspass_object) that have been marked as having been changed since creation of the data object. The default mode will check entries against the schema and attempt to fix any type mismatches (mode==’cautious’ for this algorithm). In cautious or pedantic mode this method can end up posting a lot of errors in elog for the data object (mspass_object) being handled. In promiscuous mode there are no safeties and any values that are defined in Metadata as having been changed will be posted as an update to the parent wf document of the data object.
A feature of the schema that is considered an unbreakable rule is that any attribute marked “readonly” in the schema cannot by definition be updated with this method. It utilizes the same method for handling this as the save_data method. That is, for all “mode” parameters, if a key is defined in the schema as readonly and it is listed as having been modified, it will be saved with a new key created by adding the prefix “READONLYERROR_”. e.g. if we had a site_sta read as ‘AAK’ but we changed it to ‘XYZ’ in a workflow, when we try to save the data you will find an entry in the document of {‘READONLYERROR_site_sta’ : ‘XYZ’}.
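The readonly rule can be sketched with plain dictionaries. This is only an illustration of the key renaming described above; the helper rename_readonly_keys and the readonly_keys set are hypothetical stand-ins for what the schema provides.

```python
READONLY_PREFIX = "READONLYERROR_"


def rename_readonly_keys(changed, readonly_keys):
    """Rename modified readonly keys instead of updating them.

    `changed` stands in for the modified Metadata entries and
    `readonly_keys` for the keys the schema marks readonly (both are
    assumptions made for this sketch).
    """
    out = {}
    for key, value in changed.items():
        if key in readonly_keys:
            out[READONLY_PREFIX + key] = value  # preserved under a new key
        else:
            out[key] = value
    return out


update = rename_readonly_keys(
    {"site_sta": "XYZ", "amplitude": 3.2}, readonly_keys={"site_sta"}
)
```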
- Parameters:
mspass_object (either
mspasspy.ccore.seismic.TimeSeries
or mspasspy.ccore.seismic.Seismogram
) – the object you want to update.
exclude_keys (a
list
of str
) – a list of metadata attributes you want to exclude from being updated.
force_keys (a
list
of str
) – a list of metadata attributes you want to force to be updated. Normally this method will only update attributes that have been marked as changed since creation of the parent data object. If data with these keys is found in the mspass_object they will be added to the update record.
collection – the collection name you want to use. If not specified, use the defined collection in the metadata schema.
mode (
str
) – This parameter defines how strongly to enforce schema constraints. As described above, ‘promiscuous’ just updates all changed values with no schema tests. ‘cautious’, the default, enforces type constraints and tries to convert easily fixed type mismatches (e.g. int to float or vice versa). Both ‘cautious’ and ‘pedantic’ may leave one or more complaint messages in the elog of mspass_object on how the method did or did not fix mismatches with the schema. Both will also drop any key-value pairs where the value cannot be converted to the type defined in the schema.
alg_name (
str
) – alg_name is the name of the function recorded when preserving the history.
alg_id (
bson.ObjectId.ObjectId
) – alg_id is a unique id to record the usage of the function while preserving the history.
- Returns:
mspass_object data. Normally this is an unaltered copy of the data passed through mspass_object. If there are errors, however, the elog will contain new messages. Note any such messages are volatile and will not be saved to the database until the save_data method is called.
- verify(document_id, collection='wf', tests=['xref', 'type', 'undefined'])[source]
This is an atomic-level operation to search for known issues in Metadata stored in a database and needed to construct a valid data set for starting a workflow. By “atomic” we mean the operation is for a single document in MongoDB linked to an atomic data object (currently that means TimeSeries or Seismogram objects). The tests are the same as those available through the command line tool dbverify. See the man page for that tool and the user’s manual for more details about the tests this method enables.
- Parameters:
document_id (
bson.ObjectId.ObjectId
of document to be tested) – the value of the _id field in the document you want to verify.
collection – the name of the collection to which document_id is expected to provide a unique match. If not specified, uses the default wf collection.
tests (
list
of str
) – this should be a python list of tests to apply by name keywords. Test names allowed are ‘xref’, ‘type’, and ‘undefined’. Default runs all tests. Specify a subset of those keywords to be more restrictive.
- Returns:
a python dict keyed by a problematic key. The value in each entry is the name of the failed test (i.e. ‘xref’, ‘type’, or ‘undefined’)
- Return type:
dict
- Exception:
This method will throw a fatal error exception if the id received does not match any document in the database. That is intentional, as the method should normally appear in a loop over ids found after a query and the ids should then always be valid.
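Because verify returns one dict per document, a loop over ids usually ends with some kind of tally. The sketch below assumes the per-document results have already been gathered into a dict keyed by id; the helper name summarize_verify_results and the string ids are hypothetical, and no database call is made here.

```python
from collections import Counter


def summarize_verify_results(results_by_id):
    """Count how often each test name appears across verify results.

    Each value is assumed to look like the dict described above:
    {problematic_key: failed_test_name}.
    """
    tally = Counter()
    for problems in results_by_id.values():
        tally.update(problems.values())
    return tally


# Hand-written stand-ins for results collected in a loop over ids.
reports = {
    "id_a": {"site_id": "xref", "npts": "type"},
    "id_b": {},
    "id_c": {"foo": "undefined", "delta": "type"},
}
summary = summarize_verify_results(reports)
```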
- mspasspy.db.database.doc2md(doc, database_schema, metadata_schema, wfcol, exclude_keys=None, mode='promiscuous')[source]
This function is more or less the inverse of md2doc. md2doc is needed by writers to convert Metadata to a python dict for saving with pymongo. This function is similarly needed for readers to translate MongoDB documents into the Metadata container used by MsPASS data objects.
This function can optionally apply schema constraints using the same schema class used by the Database class. In fact, normal use would pass the schema class from the instance of Database that was used in loading the document to be converted (arg0).
This function was built from a skeleton that was originally part of the read_data method of Database. Its behavior for different modes is inherited from that implementation for backward compatibility. The return structure is a necessary evil with that change in the implementation, but is exactly the same as md2doc for consistency.
The way the mode argument is handled is slightly different than for md2doc because of the difference in the way this function is expected to be used. This function builds the Metadata container that is used in all readers to drive the construction of atomic data objects. See below for a description of what different settings of mode do.
- Parameters:
doc (python dict assumed (there is no internal test for efficiency)) – document (dict) to be converted to Metadata. An associative array with string keys and operator [] are the main requirements, e.g. this function might work with a Metadata container to apply schema constraints.
metadata_schema (
mspasspy.db.schema.MetadataSchema
) – instance of MetadataSchema class that can optionally be used to impose schema constraints.
wfcol (string) – Collection name from which doc was retrieved. It should normally already be known by the caller so we require it to be passed with this required arg.
mode – read mode as described in detail in User’s Manual.
- Behavior for this function is as follows:
- “promiscuous” - (default) no checks are applied to any key-value pairs and the result is a one-to-one translation of the input.
- “cautious” - Type constraints in the schema are enforced and values are automatically converted if possible. If conversion is needed and fails the live/dead boolean in the return will be set to signal this datum should be killed. There will also be elog entries.
- “pedantic” - type conversions are strongly enforced. If any type mismatch of a value occurs the live/dead boolean returned will be set to signal a kill and there will be one or more error messages in the elog return.
- Return 3-component tuple:
0 - converted Metadata container
1 - boolean equivalent to “live”, i.e. if True the result is valid, while if False constructing an object from the result is ill advised
2 - ErrorLogger object containing any error messages. Callers should test if the result of the size method of the return is > 0 and handle the error messages as desired.
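The three modes and the 3-component return can be mimicked with a toy converter. This is a sketch under stated assumptions, not the doc2md code: a plain dict of Python types stands in for the schema, a list of strings stands in for the ErrorLogger, and the name convert_doc is hypothetical.

```python
def convert_doc(doc, expected_types, mode="promiscuous"):
    """Toy analogue of the mode handling described above.

    Returns (converted dict, live flag, list of messages).
    """
    if mode not in ("promiscuous", "cautious", "pedantic"):
        raise ValueError(f"unknown mode: {mode}")
    messages = []
    live = True
    if mode == "promiscuous":
        return dict(doc), live, messages  # one-to-one translation, no checks
    out = {}
    for key, value in doc.items():
        wanted = expected_types.get(key)
        if wanted is None or isinstance(value, wanted):
            out[key] = value
            continue
        if mode == "pedantic":
            live = False  # any mismatch is a kill
            messages.append(f"{key}: type mismatch")
            continue
        try:
            out[key] = wanted(value)  # cautious: try to repair
            messages.append(f"{key}: converted to {wanted.__name__}")
        except (TypeError, ValueError):
            live = False
            messages.append(f"{key}: cannot convert to {wanted.__name__}")
    return out, live, messages


schema = {"npts": int, "delta": float}
md, live, log = convert_doc({"npts": "100", "delta": 0.01}, schema, mode="cautious")
md_bad, live_bad, log_bad = convert_doc({"npts": "abc"}, schema, mode="cautious")
md_ped, live_ped, log_ped = convert_doc({"npts": "100"}, schema, mode="pedantic")
```

A caller would check the live flag before building a data object and would inspect the messages whenever the list is not empty.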
- mspasspy.db.database.doclist2mdlist(doclist, database_schema, metadata_schema, wfcol, exclude_keys, mode='promiscuous')[source]
Create a cleaned array of Metadata containers that can be used to construct a TimeSeriesEnsemble or SeismogramEnsemble.
This function is like doc2md but for an input given as a list of docs (python dict). The main difference is the return tuple is very different. The function has to robustly handle the fact that sometimes converting a document to md is problematic. The issues are defined in the related doc2md function that is used here for the atomic operation of converting a given document to a Metadata object. The issue we have to face is what to do with warning message and documents that have fatal flaws (marked dead when passed through doc2md). Warning messages are passed to the ErrorLogger component of the returned tuple. Callers should either print those messages or post them to the ensemble metadata that is expected to be constructed after calling this function. In “cautious” and “pedantic” mode doc2md may mark a datum as bad with a kill return. When a document is “killed” by doc2md it is dropped and two thi
- Parameters:
doclist (any iterable container holding an array of dict containers with rational content (i.e. expected to be a MongoDB document with attributes defined for a set of seismic data objects.)) – list of documents to be converted to Metadata with schema constraints
—other here –
- Returns:
array with four components:
0 - filtered array of Metadata containers
1 - live boolean. Set False only if conversion of all the documents in doclist failed.
2 - ErrorLogger where warning and kill messages are posted (see above)
3 - an array of documents that could not be converted (i.e. marked bad when processed with doc2md.)
- mspasspy.db.database.elog2doc(elog) dict [source]
Extract error log messages for storage in MongoDB
This function can be thought of as a formatter for an ErrorLogger object. The error log is a list of what is more or less a C struct (class). This function converts the log to a list of python dictionaries with keys being the names of the symbols in the C code. The list of dict objects is then stored in a python dictionary with the single key “logdata” that is returned. If the log is empty an empty dict is returned. That means the return is either empty or has one key-value pair with key==”logdata”.
- Parameters:
elog – ErrorLogger object to be reformatted. (Note for all mspass seismic data objects == self.elog)
- Returns:
python dictionary as described above
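The shape of the return can be shown with a stand-in that takes a list of dicts instead of an ErrorLogger. The function name log_entries_to_doc and the entry keys used here ("algorithm", "message") are hypothetical; the real keys follow the C symbol names, which are not listed in this section.

```python
def log_entries_to_doc(entries):
    """Wrap a list of log entries under the single key "logdata".

    An empty log yields an empty dict, as described above.
    """
    if not entries:
        return {}
    return {"logdata": [dict(entry) for entry in entries]}


empty_doc = log_entries_to_doc([])
doc = log_entries_to_doc(
    [{"algorithm": "demo", "message": "converted npts to int"}]  # hypothetical keys
)
```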
- mspasspy.db.database.geoJSON_doc(lat, lon, doc=None, key='epicenter') dict [source]
Convenience function to create a geoJSON format point object document from a point specified by latitude and longitude. The format for a geoJSON point isn’t that strange but how to structure it into a mongoDB document for use with geospatial queries is not as clear from current MongoDB documentation. This function makes that process easier.
The required input is latitude (lat) and longitude (lon). The values are assumed to be in degrees for compatibility with MongoDB. That means latitude must satisfy -90<=lat<=90 and longitude must satisfy -180<=lon<=180. The function will try to handle the common situation with 0<=lon<=360 by wrapping 180->360 values to -180->0. A ValueError exception is thrown for any other situation with lat or lon outside those bounds.
If you specify the optional “doc” argument it is assumed to be a python dict to which the geoJSON point data is to be added. By default a new dict is created that will contain only the geoJSON point data. The doc option is useful if you want to add the geoJSON data to the document before appending it. The default is more useful for updates to add geospatial query capabilities to a collection with lat-lon data that is not properly structured. In all cases the geoJSON data is itself a python dict, accessible from the output dict with the key defined by the “key” argument (default is ‘epicenter’, which is appropriate for earthquake source data.) with a
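A minimal sketch of the range checks and the point structure follows. It uses the standard GeoJSON Point layout, where coordinates are ordered [longitude, latitude]. The function name geojson_point_doc is hypothetical, and copying doc rather than modifying it in place is an assumption of this sketch, not a statement about the library.

```python
def geojson_point_doc(lat, lon, doc=None, key="epicenter"):
    """Toy stand-in for the convenience function described above."""
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude {lat} is outside [-90, 90]")
    if 180.0 < lon <= 360.0:
        lon -= 360.0  # wrap 180->360 onto -180->0
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude {lon} is outside [-180, 180]")
    out = {} if doc is None else dict(doc)  # assumption: copy, do not mutate
    # GeoJSON orders coordinates as [longitude, latitude].
    out[key] = {"type": "Point", "coordinates": [lon, lat]}
    return out


point = geojson_point_doc(10.0, 190.0)
source_doc = geojson_point_doc(-33.5, 151.2, doc={"magnitude": 5.1})
```

A document shaped this way is what MongoDB expects under a field that carries a 2dsphere index.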
- mspasspy.db.database.history2doc(proc_history, alg_id=None, alg_name=None, job_name=None, job_id=None) dict [source]
Extract ProcessingHistory data and package into a python dictionary.
This function can be thought of as a formatter for the ProcessingHistory container in a MsPASS data object. It returns a python dictionary that, if retrieved, can be used to reconstruct the ProcessingHistory container. We do that a fairly easy way here by using pickle.dumps of the container that is saved with the key “processing_history”.
- Parameters:
proc_history (Must be an instance of a mspasspy.ccore.util.ProcessingHistory object or a TypeError exception will be thrown.) – history container to be reformatted
alg_id (string) – algorithm id for caller. By default this is extracted from the last entry in the history tree. Use other than the default should be necessary only if called from a nonstandard writer. (See C++ doxygen page on ProcessingHistory for concept).
alg_name (string) – name of calling algorithm. By default this is extracted from the last entry in the history tree. Use other than the default should be necessary only if called from a nonstandard writer. (See C++ doxygen page on ProcessingHistory for concept ).
job_name (string (default None taken to mean do not save)) – optional job name string. If set the value will be saved in output with the “job_name”. By default there will be no value for the “job_name” key
job_id (string (default None taken to mean do not save)) – optional job id string. If set the value will be saved in output with the “job_id”. By default there will be no value for the “job_id” key
- Returns:
python dictionary
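The pickle round trip mentioned above can be sketched with the standard library alone. A plain dict stands in for the ProcessingHistory container, and the helper history_to_doc is hypothetical; only the “processing_history”, “job_name” and “job_id” keys come from the description.

```python
import pickle


def history_to_doc(history, job_name=None, job_id=None):
    """Package a picklable history object into a document-like dict."""
    doc = {"processing_history": pickle.dumps(history)}
    if job_name is not None:
        doc["job_name"] = job_name  # saved only when set
    if job_id is not None:
        doc["job_id"] = job_id
    return doc


# A plain dict stands in for the real ProcessingHistory container.
stand_in_history = {"nodes": ["read_data", "filter"]}
doc = history_to_doc(stand_in_history, job_name="demo_job")
restored = pickle.loads(doc["processing_history"])
```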
- mspasspy.db.database.index_mseed_file_parallel(db, *arg, **kwargs)[source]
A parallel wrapper for the index_mseed_file method in the Database class. We use this wrapper to handle the possible error in the original method, where the file dir and name point to a file that doesn’t exist. Users can use this wrapper when they want to run the task in parallel; the result will then be an RDD/bag of either None or error message strings. Users need to scan the RDD/bag for entries that are not None to find errors.
- Parameters:
db – The MsPASS core database handle that we want to index into
arg – All the arguments that users pass into the original index_mseed_file method
- Returns:
None or error message string
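Scanning the result for errors amounts to filtering out the None entries. The sketch below assumes the RDD/bag has already been gathered into a plain python list (for example with the collect method of a Spark RDD or the compute method of a Dask bag); the helper name and the sample messages are hypothetical.

```python
def collect_index_errors(results):
    """Return only the error message strings from a list of results."""
    return [message for message in results if message is not None]


# Stand-in for the gathered output of a parallel indexing run.
gathered = [None, "file /data/a.mseed not found", None, "file /data/b.mseed not found"]
errors = collect_index_errors(gathered)
```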
- mspasspy.db.database.md2doc(md, save_schema, exclude_keys=None, mode='promiscuous', normalizing_collections=['channel', 'site', 'source']) {} [source]
Converts a Metadata container to a python dict applying schema constraints.
This function is used in all database save operations to guarantee the Metadata container in a mspass data object is consistent with requirements for MongoDB defined by a specified schema. It dogmatically enforces readonly restrictions in the schema by changing the key for any fields marked readonly and found to have been set as changed. Such entries change to “READONLYERROR_” + k where k is the original key marked readonly. See user’s manual for a discussion of why this is done.
Other schema constraints are controlled by the setting of mode. Mode must be one of “promiscuous”, ”cautious”, or “pedantic” or the function will raise a MsPASS error marked fatal. That is the only exception this function can throw. It will never happen when used with the Database class method but is possible if a user uses this function in a different implementation.
The contents of the data associated with the md argument (arg0) are assumed to have passed through the private database method _sync_metadata_before_update before calling this function. Anyone using this function outside the Database class should assure a comparable algorithm is not required.
Note the return is a tuple. See below for details.
- Parameters:
md (For normal use in mspass md is a larger data object that inherits Metadata. That is, in most uses it is a TimeSeries or Seismogram object. It can be a raw Metadata container and the algorithm should work) – contains a Metadata container that is to be converted.
save_schema (Schema class) – The Schema class to be used for constraints in building the doc for MongoDB use. See User’s Manual for details on how a schema is defined and used in MsPASS.
exclude_keys (list of strings) – list of keys (strings) to be excluded from the output python dict. Note “_id” is always excluded to avoid messing with required MongoDB usage. Default is None which means no values are excluded. Note it is harmless to list keys that are not present in md; doing so has no effect beyond a minor cost to test for existence.
mode (str) – Must be one of “promiscuous”, “cautious”, or “pedantic”. See User’s Manual for which to use. Default is “promiscuous”. The function will throw a MsPASSError exception if mode is not one of the three keywords listed.
- Returns:
Result is returned as a tuple appropriate for the normal use of this function inside the Database class. The contents of the tuple are:
0 - python dictionary of edited result ready to save as MongoDB document
1 - boolean equivalent of the “live” attribute of TimeSeries and Seismogram, i.e. if True the result can be considered valid. If False something was very wrong with the input and the contents of 0 are invalid and should not be used. When False the error log in 2 will contain one or more error messages.
2 - An ErrorLogger object that may or may not contain any error logging messages. Callers should call the size method of this entry and handle the list of error messages it contains if size is not zero. Note the right way to do that for TimeSeries and Seismogram is to use operator += for the elog attribute of the datum.
- mspasspy.db.database.parse_normlist(input_nlist, db) list [source]
Parses a list of multiple accepted types to return a list of Matchers.
This function is more or less a translator to create a list of subclasses of the BasicMatcher class used for generic normalization. The input list (input_nlist) can be one of two things. If the list is a set of strings the strings are assumed to define collection names. It then constructs a database-driven matcher class using the ObjectId method that is the stock MongoDB indexing method. Specifically, for each collection name it creates an instance of the generic ObjectIdDBMatcher class pointing to the named collection. The other allowed type for the members of the input list is any child of the base class called BasicMatcher defined in mspasspy.db.normalize. BasicMatcher abstracts the task required for normalization and provides a generic mechanism to load normalization data including data defined outside of MongoDB (e.g. a pandas DataFrame). If the list contains anything but a string or a child of BasicMatcher the function will abort throwing a TypeError exception. On success it returns a list of children of BasicMatcher that can be used to normalize any wf document retrieved from MongoDB assuming the matching algorithm is valid.
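The type dispatch described above can be sketched with stub classes. BasicMatcherStub and ObjectIdMatcherStub are hypothetical stand-ins for BasicMatcher and ObjectIdDBMatcher, so the sketch runs without MsPASS or MongoDB and says nothing about the real constructors.

```python
class BasicMatcherStub:
    """Hypothetical stand-in for the BasicMatcher base class."""


class ObjectIdMatcherStub(BasicMatcherStub):
    """Hypothetical stand-in for ObjectIdDBMatcher."""

    def __init__(self, db, collection):
        self.db = db
        self.collection = collection


def parse_normlist_sketch(input_nlist, db):
    """Turn a mixed list of names and matchers into a list of matchers."""
    matchers = []
    for item in input_nlist:
        if isinstance(item, str):
            # A string is taken to be a collection name.
            matchers.append(ObjectIdMatcherStub(db, item))
        elif isinstance(item, BasicMatcherStub):
            matchers.append(item)  # already a matcher, pass it through
        else:
            raise TypeError(f"unsupported entry of type {type(item).__name__}")
    return matchers


custom = BasicMatcherStub()
result = parse_normlist_sketch(["channel", custom, "source"], db=None)
```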
normalize
- class mspasspy.db.normalize.ArrivalDBMatcher(db, collection='arrival', attributes_to_load=['phase', 'time'], load_if_defined=None, aliases=None, require_unique_match=False, prepend_collection_name=True, query=None)[source]
Bases:
DatabaseMatcher
This is a class for matching a table of arrival times to input waveform data objects. Use this version if the table of arrivals is huge and database query delays will not create a bottleneck in your workflow.
Phase arrival time matching is a common need when waveform segments are downloaded. When data are assembled as miniseed files or url downloads of miniseed data, the format has no way to hold arrival time data. This matcher can prove useful for matching waveform segments with an origin as miniseed.
- The algorithm it uses for matching is a logical and of two tests:
We first match all arrival times falling between the sample range of an input MsPASS data object, d. That is, first component of the query is to find all arrival times, t_a, that obey the relation: d.t0 <= t_a <= d.endtime().
Match only data for which the (fixed) name “sta” in arrival and the data match. A secondary key match using the “net” attribute is used only if “net” is defined with the data. That is done to streamline processing of css3.0 data where “net” is not defined.
Note the concept of an arrival time is also mixed, as in some contexts it means a time computed from an earth model and in others a measured time that is “picked” by a human or computer algorithm. This class does not distinguish model-based from measured times. It simply uses the time and station tag information with the algorithm noted above to attempt a match.
- Parameters:
db (normally a MsPASS Database class but with this algorithm it can be the superclass from which Database is derived.) – MongoDB database handle (positional - no default)
collection (string) – Name of MongoDB collection that is to be queried (default “arrival”, which is not currently part of the stock mspass schema. Note it isn’t required to be in the schema and illustrates flexibility).
attributes_to_load (list of string defining keys in collection documents) – list of keys of required attributes that will be returned in the output of the find method. The keys listed must ALL have defined values for all documents in the collection or some calls to find_one will fail. Default [“phase”,”time”].
load_if_defined (list of strings defining collection keys) – list of keys of optional attributes to be extracted by find method. Any data attached to these keys will only be posted in the find return if they are defined in the database document retrieved in the query. Default is None
aliases – python dictionary defining alias names to apply when fetching from a data object’s Metadata container. The key sense of the mapping is important to keep straight. The key of this dictionary should match one of the attributes in attributes_to_load or load_if_defined. The value the key defines should be the alias used to fetch the comparable attribute from the data.
prepend_collection_name (boolean) – when True attributes returned in Metadata containers by the find and find_one method will all have the collection name prepended with a (fixed) separator. For example, if the collection name is “channel” the “lat” attribute in the channel document would be returned as “channel_lat”.
require_unique_match (boolean) – boolean handling of ambiguous matches. When True find_one will throw an error if an entry it tries to match is not unique. When False find_one returns the first document found and logs a complaint message. (default is False)
query (python dictionary or None. None is equivalent to passing an empty dictionary. A TypeError will be thrown if this argument is not None or a dict.) – optional query predicate. That is, if set the interval query is appended to this query to build a more specific query. An example might be station code keys to match a specific pick for a specific station like {“sta”:”AAK”}. Another would be to limit arrivals to a specific phase name like {“phase” : “ScS”}. Default is None which reverts to no query predicate.
- query_generator(mspass_object) dict [source]
Concrete implementation of method required by superclass DatabaseMatcher.
This generator implements the switching algorithm noted in the class docstring. That is, for atomic data the time span for the interval query is determined from the range of the waveform data received through mspass_object. For ensembles the algorithm fetches fields defined by self.startime_key and self.endtime_key to define the time interval.
The interval test is overlaid on the self.query input. i.e. the query dict components derived are added to the self.query.
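For the atomic-data case, the query this method builds can be pictured as an interval test on the arrival time plus station keys, merged into the user-supplied predicate. The sketch below is an assumption about the query's shape, not the method's code. The function name build_arrival_query is hypothetical, and the “time” key is taken from the default attributes_to_load. The `$gte` and `$lte` operators are standard MongoDB comparison operators.

```python
def build_arrival_query(t0, endtime, sta, net=None, base_query=None, time_key="time"):
    """Build a MongoDB-style query dict for d.t0 <= t_a <= d.endtime()."""
    query = dict(base_query) if base_query else {}  # start from the predicate
    query[time_key] = {"$gte": t0, "$lte": endtime}
    query["sta"] = sta
    if net is not None:
        query["net"] = net  # secondary key, used only when defined
    return query


q = build_arrival_query(
    1000.0, 1600.0, sta="AAK", net="II", base_query={"phase": "ScS"}
)
q_no_net = build_arrival_query(1000.0, 1600.0, sta="AAK")
```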
- class mspasspy.db.normalize.ArrivalMatcher(db_or_df, collection='arrival', attributes_to_load=['phase', 'time'], load_if_defined=None, aliases=None, require_unique_match=False, prepend_collection_name=True, ensemble_starttime_key='starttime', ensemble_endtime_key='endtime', arrival_time_key=None, custom_null_values=None)[source]
Bases:
DataFrameCacheMatcher
This is a class for matching a table of arrival times to input waveform data objects. Use this version if the table of arrivals is not huge enough to cause a memory problem.
Phase arrival time matching is a common need when waveform segments are downloaded. When data are assembled as miniseed files or url downloads of miniseed data, the format has no way to hold arrival time data. This matcher can prove useful for matching waveform segments with an origin as miniseed.
- The algorithm it uses for matching is a logical and of two tests:
We first match all arrival times falling between the sample range of an input MsPASS data object, d. That is, first component of the query is to find all arrival times, t_a, that obey the relation: d.t0 <= t_a <= d.endtime().
Match only data for which the (fixed) name “sta” in arrival and the data match. A secondary key match using the “net” attribute is used only if “net” is defined with the data. That is done to streamline processing of css3.0 data where “net” is not defined.
Note the concept of an arrival time is also mixed, as in some contexts it means a time computed from an earth model and in others a measured time that is “picked” by a human or computer algorithm. This class does not distinguish model-based from measured times. It simply uses the time and station tag information with the algorithm noted above to attempt a match.
This implementation caches the table of attributes desired to an internal pandas DataFrame. It is thus most appropriate for arrival tables that are not huge. Note it may be possible to do appropriate preprocessing to manage the arrival table size. e.g. the table can be grouped by station or in time blocks and then processed in a loop updating waveform database records in multiple passes. The alternative for large arrival tables is to use the DB version of this matcher.
- Parameters:
db (normally a MsPASS Database class but with this algorithm it can be the superclass from which Database is derived.) – MongoDB database handle (positional - no default)
collection (string) – Name of MongoDB collection that is to be queried (default “arrival”, which is not currently part of the stock mspass schema. Note it isn’t required to be in the schema and illustrates flexibility).
attributes_to_load (list of string defining keys in collection documents) – list of keys of required attributes that will be returned in the output of the find method. The keys listed must ALL have defined values for all documents in the collection or some calls to find_one will fail. Default [“phase”,”time”].
load_if_defined (list of strings defining collection keys) – list of keys of optional attributes to be extracted by find method. Any data attached to these keys will only be posted in the find return if they are defined in the database document retrieved in the query. Default is None
aliases – python dictionary defining alias names to apply when fetching from a data object’s Metadata container. The key sense of the mapping is important to keep straight. The key of this dictionary should match one of the attributes in attributes_to_load or load_if_defined. The value the key defines should be the alias used to fetch the comparable attribute from the data.
prepend_collection_name (boolean) – when True attributes returned in Metadata containers by the find and find_one method will all have the collection name prepended with a (fixed) separator. For example, if the collection name is “channel” the “lat” attribute in the channel document would be returned as “channel_lat”.
require_unique_match (boolean) – boolean handling of ambiguous matches. When True find_one will throw an error if an entry it tries to match is not unique. When False find_one returns the first document found and logs a complaint message. (default is False)
ensemble_starttime_key (string) – defines the key used to fetch a start time for the interval test when processing with ensemble data. Default is “starttime”.
ensemble_endtime_key (string) – defines the key used to fetch a end time for the interval test when processing with ensemble data. Default is “endtime”.
query (python dictionary or None. None is equivalent to passing an empty dictionary. A TypeError will be thrown if this argument is not None or a dict.) – optional query predicate. That is, if set the interval query is appended to this query to build a more specific query. An example might be station code keys to match a specific pick for a specific station like {“sta”:”AAK”}. Default is None.
- class mspasspy.db.normalize.BasicMatcher(attributes_to_load=None, load_if_defined=None, aliases=None)[source]
Bases:
ABC
This base class defines the api for a generic matching capability for MongoDB normalization. The base class is mostly a skeleton that defines only the required abstract methods and initializes a set of universal attributes all matchers need. It cannot be instantiated directly.
Matching is defined as one of two things: (1) a one-to-one match algorithm is guaranteed to have each search yield either exactly one match or none. That is defined through a find_one method following the same concept in MongoDB. (2) some matches are not unique and yield more than one document. For that case use the find method. Unlike the MongoDB find method, however, find in this context returns a list of Metadata containers holding the set of attributes requested in lists defined on construction.
Another way of viewing this interface, in fact, is an abstraction of the find and find_one methods of MongoDB to a wider class of algorithms that may or may not utilize MongoDB directly. In particular, intermediate level classes defined below that implement different cache data structures allow input either by loading data from a MongoDB collection or from a pandas DataFrame. That can potentially provide a wide variety of applications of matching data to tabular data contained in files loaded into pandas by any of a long list of standard dataframe read methods. Examples are any SQL database or antelope raw tables or views loaded as text files.
- abstract find(mspass_object, *args, **kwargs) tuple [source]
Abstraction of the MongoDB database find method with the matching criteria defined when a concrete instance is instantiated. Like the MongoDB method implementations it should return a list of containers matching the keys found in the data passed through the mspass_object. A key difference from MongoDB, however, is that instead of a MongoDB cursor we return a python list of Metadata containers. In some instances that is a direct translation of a MongoDB cursor to a list of Metadata objects. The abstraction is useful to allow small collections to be accessed faster with a generic cache algorithm (see below) and loading of tables of data through a file-based subclass of this base. All can be treated through a common interface. WE STRESS STRONGLY that the abstraction assumes returns are always small enough to not cause a memory bloating problem. If you need the big-memory model of a cursor use it directly.
All subclasses must implement this virtual method to be concrete or they cannot be instantiated. If the matching algorithm implemented is always expected to be a unique one-to-one match, applications may want to have this method throw an exception as a usage error. That case should use the find_one interface defined below.
All implementations should return a pair (2 component tuple). Component 0 is expected to hold a list of Metadata containers and component 1 is expected to contain either None or a PyErrorLogger object. The PyErrorLogger is a convenient way to pass error messages back to the caller in a manner that is easier to handle with the MsPASS error system than an exception mechanism. Callers should handle four cases that are possible for a return (here [] means an empty list and […] a list with data):
[] None - no match found
[] ErrorLog - failure with an informational message in the ErrorLog that should be preserved. The presence of an error log implies something went wrong and the return is not simply a null result.
[…] None - all is good with no detected errors
[…] ErrorLog - valid data returned but there is a warning or informational message posted. In this case handlers may want to examine the ErrorSeverity components of the log and handle different levels differently (e.g. Fatal and Informational should always be treated differently)
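The four return cases can be handled with a small dispatch on the returned pair. The following is a minimal sketch, not MsPASS code: handle_find_result is a hypothetical helper, and the error log is treated as an opaque object so the example runs without MsPASS installed.

```python
def handle_find_result(result):
    """Classify the (matches, elog) pair returned by a find implementation.

    Hypothetical helper for illustration only; elog is treated as an
    opaque object standing in for a PyErrorLogger.
    """
    matches, elog = result
    if not matches:
        # Covers an empty list (and None) in component 0.
        return "no_match" if elog is None else "failed"
    return "ok" if elog is None else "ok_with_messages"


# Plain dicts stand in for Metadata containers, a string for the error log.
print(handle_find_result(([], None)))                  # no_match
print(handle_find_result(([], "log")))                 # failed
print(handle_find_result(([{"sta": "AAK"}], None)))    # ok
print(handle_find_result(([{"sta": "AAK"}], "log")))   # ok_with_messages
```

In a real workflow the last branch would typically inspect the severity of each log entry before deciding whether to keep the data.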
- find_doc(doc) Metadata [source]
Find a unique match using a python dictionary as input.
The bulk_normalize function requires an implementation of a method with this name. It is conceptually similar to find_one but it uses a python dictionary (the doc argument) as input instead of a mspass seismic data object. It also returns only a Metadata container on success or None if it fails to find a match.
This method is little more than a thin wrapper around an implementation of the find_one method. It checks the elog for entries marked Invalid and, if any are found, returns None. Otherwise it converts the Metadata container to a python dictionary that it returns. It is part of the base class because it depends only on the near equivalence of a python dictionary and the MsPASS Metadata containers.
When find_one returns an ErrorLogger object the contents are inspected. Errors less severe than “Invalid” are ignored and dropped. If the log contains a message tagged “Invalid” this function will silently return None. That could be problematic as it is indistinguishable from the return when there is no match, but it is useful to simplify the API. If an entry is tagged “Fatal” a MsPASSError exception will be thrown with the message posted to the MsPASSError container.
Subclasses may wish to override this method if the approach used here is inappropriate. i.e. if this were C++ this method would be declared virtual.
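The severity handling described for find_doc can be sketched as follows. This is an illustration under stated assumptions, not the MsPASS implementation: the log is a plain list of (severity, message) tuples rather than a PyErrorLogger, RuntimeError stands in for MsPASSError, and find_doc_sketch is a hypothetical name.

```python
def find_doc_sketch(find_one, doc):
    """Mimic the find_doc logic described above.

    Assumptions for illustration only: find_one is any callable returning
    (match, log); log is None or a list of (severity, message) tuples;
    RuntimeError stands in for MsPASSError.
    """
    match, log = find_one(doc)
    entries = log or []
    fatal = [message for severity, message in entries if severity == "Fatal"]
    if fatal:
        raise RuntimeError("; ".join(fatal))
    if any(severity == "Invalid" for severity, _ in entries):
        return None  # indistinguishable from "no match" by design
    # Less severe entries (e.g. "Informational") are ignored and dropped.
    return match


def informational_match(doc):
    return {"lat": 42.0}, [("Informational", "first of two matches used")]


def invalid_match(doc):
    return None, [("Invalid", "required key missing")]


print(find_doc_sketch(informational_match, {"sta": "AAK"}))  # {'lat': 42.0}
print(find_doc_sketch(invalid_match, {"sta": "AAK"}))        # None
```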
- abstract find_one(mspass_object, *args, **kwargs) tuple [source]
Abstraction of the MongoDB database find_one method with the matching criteria defined when a concrete instance is instantiated. Like the MongoDB method, implementations should return the unique document that the keys in mspass_object are expected to define with the matching criteria defined by the instance. A type example of an always unique match is ObjectIds. When a match is found the result should be returned in a Metadata container. The attributes returned are normally a subset of the document and are defined by the base class attributes “attributes_to_load” and “load_if_defined”. For database instances this is little more than copying desired attributes from the matching document returned by MongoDB, but for other implementations of the abstraction more may be involved. e.g., implemented below is a generic cached algorithm that stores a collection to be matched in memory for efficiency. That implementation allows the “collection” to be loaded from MongoDB or a pandas DataFrame.
All implementations should return a pair (2 component tuple). Component 0 is expected to hold the Metadata container yielded by the match. It should be returned as None if there is no match. Component 1 is expected to contain either None or a PyErrorLogger object. The PyErrorLogger is a convenient way to pass error messages back to the caller in a manner that is easier to handle with the MsPASS error system than an exception mechanism. Callers should handle four cases that are possible for a return:
None None - no match found
None ErrorLog - failure with an informational message in the ErrorLog that the caller may want to preserve or convert to an exception.
Metadata None - all is good with no detected errors
Metadata ErrorLog - valid data was returned but there is a warning or informational message posted. In this case handlers may want to examine the ErrorSeverity components of the log and handle different levels differently (e.g. Fatal and Informational should always be treated differently)
- class mspasspy.db.normalize.DataFrameCacheMatcher(db_or_df, collection=None, attributes_to_load=None, load_if_defined=None, aliases=None, require_unique_match=False, prepend_collection_name=False, custom_null_values=None)[source]
Bases:
BasicMatcher
Matcher implementing a caching method based on a Pandas DataFrame
This is an intermediate class for instances where the database collection to be matched is small enough that the in-memory model is appropriate. It should be used when the matching algorithm is readily cast into the subsetting api of a pandas DataFrame.
The constructor of this intermediate class first calls the BasicMatcher (base class) constructor to initialize some common attributes including the critical lists of attributes to be loaded. This constructor then creates the internal DataFrame cache by one of two methods. If arg0 is a MongoDB database handle it loads the data in the named collection to a DataFrame created during construction. If the input is already a DataFrame it is simply copied, selecting only columns defined by the attributes_to_load and load_if_defined lists. There is also an optional parameter, custom_null_values, that is a python dictionary defining values in a field that should be treated as a definition of a Null for that field. The constructor converts such values to a standard pandas null field value.
This class implements generic find and find_one methods. Subclasses of this class must implement a “subset” method to be concrete. A subset method is the abstract algorithm that defines a match for that instance expressed as a pandas subset operation. (For most algorithms there are multiple ways to skin that cat or is it a panda?) See concrete subclasses for examples.
This class cannot be instantiated because it is not concrete (has abstract - virtual - methods that must be defined by subclasses). See implementations for constructor argument definitions.
- find(mspass_object) tuple [source]
DataFrame generic implementation of find method.
This method uses content in any part of the mspass_object (data object) to subset the internal DataFrame cache and return the subset of tuples matching the condition computed through the abstract (virtual) method subset. It then copies entries in attributes_to_load and, when not null, load_if_defined into one Metadata container for each row of the returned DataFrame.
- find_one(mspass_object) tuple [source]
DataFrame implementation of the find_one method.
This method is mostly a wrapper around the find method. It calls the find method and then does one of two things depending upon the value of self.require_unique_match. When that boolean is True and the match is not unique it creates a PyErrorLogger object, posts a message to the log, and then returns a [None, elog] pair. If self.require_unique_match is False and the match is not unique, it again creates a PyErrorLogger and posts a message, but it also takes the first container in the list returned by find and returns it as component 0 of the pair.
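The reduction from a find result to a find_one result can be sketched as below. This is a minimal sketch under stated assumptions: find_one_from_find is a hypothetical free function, find is any callable returning a (list, log) pair, and a plain list of message strings stands in for the PyErrorLogger.

```python
def find_one_from_find(find, datum, require_unique_match=False):
    """Reduce a find-style result to a find_one-style result.

    Illustrative only: the log is None or a list of message strings
    standing in for a PyErrorLogger.
    """
    matches, log = find(datum)
    if not matches:
        return None, log
    if len(matches) == 1:
        return matches[0], log
    messages = list(log or [])
    if require_unique_match:
        messages.append("match is not unique")
        return None, messages
    messages.append("match is not unique; using first match found")
    return matches[0], messages


def two_matches(datum):
    # Fake find returning an ambiguous result with no log.
    return [{"lat": 1.0}, {"lat": 2.0}], None


print(find_one_from_find(two_matches, None))
print(find_one_from_find(two_matches, None, require_unique_match=True))
```

The first call returns the first container together with a posted message; the second returns None in component 0 because uniqueness is required.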
- abstract subset(mspass_object) DataFrame [source]
Required method defining how the internal DataFrame cache is to be subsetted using the contents of the data object mspass_object. Concrete implementations must implement this method. The point of this abstract method is that the way one defines how to get the information needed to define a match with the cache is application dependent. An implementation can use Metadata attributes, data object attributes (e.g. TimeSeries t0 attribute), or even sample data to compute a value to use in a DataFrame subset condition. This reduces writing a custom matcher to implementing only this method, as find and find_one use it.
Implementations should return a zero length DataFrame if the subset condition yields a null result. i.e. the test len(return_result) should work and return 0 if the subset produced no rows.
- class mspasspy.db.normalize.DatabaseMatcher(db, collection, attributes_to_load=None, load_if_defined=None, aliases=None, require_unique_match=False, prepend_collection_name=False)[source]
Bases:
BasicMatcher
Matcher using direct database queries to MongoDB. Each call to the find method of this class constructs a query, calls the MongoDB database find method with that query, and extracts desired attributes from the return in the form of a Metadata container. The query construction is abstracted by a virtual method called query_generator. This is an intermediate class that cannot be instantiated directly because it contains a virtual method. Users should consult docstrings for constructors for subclasses of this intermediate class.
- find(mspass_object)[source]
Generic database implementation of the find method for this abstraction. It returns what the base class api specifies. That is, normally it returns a tuple with component 0 being a python list of Metadata containers. Each container holds the subset of attributes defined by attributes_to_load and (if present) load_if_defined. The list is the set of all documents matching the query, which at this level of the class structure is abstract.
The method dogmatically requires data for all keys defined by attributes_to_load. It will throw a MsPASSError exception with a Fatal tag if any of the required attributes are not defined in any of the documents. The return matches the API specification for BasicMatcher.
It also handles failures of the abstract query_generator through the mechanism the base class api specified: a None return means the method could not create a valid query. Failures in the query will always post a message to elog tagging the result as “Invalid”.
It also handles the common problem of dead data or accidentally receiving invalid data like a None. The latter may cause other algorithms to abort, but we handle it here by returning [None, None]. We don’t return a PyErrorLogger in that situation as the assumption is there is no place to put it and something else has gone really wrong.
- find_one(mspass_object)[source]
Generic database implementation of the find_one method. The tacit assumption is that if you call find_one you are expecting a unique match to the algorithm implemented. The actual behavior for a nonunique match is controlled by the class attribute require_unique_match. Subclasses that want to dogmatically enforce uniqueness (appropriate for example with ObjectIds) should set require_unique_match True. In that case if a match is not unique the method will throw an exception. When False, which is the default, an informational message is posted and the method returns the first list element returned by find. This method is actually little more than a wrapper around find to handle that uniqueness issue.
- abstract query_generator(mspass_object) dict [source]
Subclasses of this intermediate class MUST implement this method. It should extract content from mspass_object and use that content to generate a MongoDB query that is passed directly to the find method of the MongoDB database handle stored within this object (self) during the class construction. Since pymongo uses a python dict for that purpose it must return a valid query dict. Implementations should return None if no query could be generated. That is common, for example, if a key required to generate the query is missing from mspass_object.
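As an illustration of that contract, the body of a query_generator for a simple equality match might look like the sketch below. The function name equality_query is hypothetical, a plain dict stands in for the Metadata of mspass_object, and the match_keys mapping (data-side key to collection-side key) is an assumption made for the example.

```python
def equality_query(metadata, match_keys):
    """Build a MongoDB equality query dict, or None if a key is missing.

    Hypothetical stand-in for a query_generator body: metadata is a plain
    dict standing in for the Metadata of mspass_object, and match_keys maps
    data-side keys to collection-side keys.
    """
    query = {}
    for data_key, db_key in match_keys.items():
        if data_key not in metadata:
            return None  # signals that no valid query could be built
        query[db_key] = metadata[data_key]
    return query


print(equality_query({"source_lat": 10.5}, {"source_lat": "lat"}))  # {'lat': 10.5}
print(equality_query({}, {"source_lat": "lat"}))                    # None
```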
- class mspasspy.db.normalize.DictionaryCacheMatcher(db_or_df, collection, query=None, attributes_to_load=None, aliases=None, load_if_defined=None, require_unique_match=False, prepend_collection_name=False, use_dataframe_index_as_cache_id=False)[source]
Bases:
BasicMatcher
Matcher implementing a caching method based on a python dictionary.
This is an intermediate class for instances where the database collection to be matched is small enough that the in-memory model is appropriate. It should also only be used if the matching algorithm can be reduced to a single string that can serve as a unique id for each tuple.
The class defines a generic dictionary cache with a string key. The way that key is defined is abstracted through two virtual methods:
(1) The cache_id method creates a match key from a mspass data object. That is normally from the Metadata container but it is not restricted to that. e.g. start time for TimeSeries or Seismogram objects can be obtained from the t0 attribute directly.
(2) The db_make_cache_id method is called by the internal method of this intermediate class (method name is _load_normalization_cache) to build the cache index from MongoDB documents scanned to construct the cache.
Two different methods to define the cache index are necessary as a generic way to implement aliases. A type example is the mspass use of names like “channel_id” to refer to the ObjectId of a specific document in the channel collection. When loading channel the name key is “_id” but data objects would normally have that same data defined with the key “channel_id”. Similarly, if data have had aliases applied a key in the data may not match the name in a collection to be matched. The dark side of this is it is very easy when running subclasses of this to get null results with all members of a dataset. As always testing with a subset of data is strongly recommended before running versions of this on a large dataset.
This class cannot be instantiated because it is not concrete (has abstract - virtual - methods that must be defined by subclasses). See implementations for constructor argument definitions.
- abstract cache_id(mspass_object) str [source]
Concrete implementations must implement this method to define how a mspass data object, mspass_object, is to be used to construct the key to the cache dict container. It is distinct from db_make_cache_id to allow differences in naming or even the algorithm used to construct the key from a datum relative to the database. This complicates the interface but makes it more generic.
- Parameters:
mspass_object – is expected to be a MsPASS object. Any type restrictions should be implemented in subclasses that implement the method.
- Returns:
should always return a valid string and never throw an exception. If the algorithm fails the implementation should return a None.
- abstract db_make_cache_id(doc) str [source]
Concrete implementations must implement this method to define how the cache index is to be created from database documents passed through the arg doc, which pymongo always returns as a python dict. It is distinct from cache_id to allow differences in naming or the algorithm for loading the cache compared to accessing it using attributes of a data object. If the id string cannot be created from doc an implementation should return None. The generic loaders in this class, db_load_normalization_cache and df_load_normalization_class, handle that situation cleanly, but if a subclass overrides the load methods it should handle such errors. “Cleanly” in this case means they throw an exception, which is appropriate since they are run during construction and any invalid key is not acceptable in that situation.
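The relationship between the two key-building methods can be sketched with plain dictionaries. This is an illustration only: the functions below are free functions that borrow the method names, load_cache is a hypothetical loader, the “_id”/“channel_id” pairing follows the aliasing example given in the class description, and ValueError stands in for whatever exception the real loaders throw.

```python
def db_make_cache_id(doc):
    """Key built from a collection document (here the "_id" field)."""
    value = doc.get("_id")
    return None if value is None else str(value)


def cache_id(metadata):
    """Key built from a datum's Metadata (here the "channel_id" field)."""
    value = metadata.get("channel_id")
    return None if value is None else str(value)


def load_cache(docs):
    """Index documents by id string; reject documents with no usable key."""
    cache = {}
    for doc in docs:
        key = db_make_cache_id(doc)
        if key is None:
            raise ValueError("cannot build cache id from document")
        cache[key] = doc
    return cache


cache = load_cache([{"_id": "abc123", "lat": 42.6}])
datum_md = {"channel_id": "abc123"}  # dict standing in for Metadata
print(cache.get(cache_id(datum_md)))  # {'_id': 'abc123', 'lat': 42.6}
```

Because the same id is stored under different names on the two sides, a mismatch between the two functions silently yields no matches, which is why testing on a small subset first is worthwhile.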
- find(mspass_object) tuple [source]
Generic implementation of find method for cached tables/collections.
This method is a generalization of the MongoDB database find method. It differs in two ways. First, it creates the “query” directly from a MsPASS data object (pymongo find requires a dict as input). Second, the result is returned as a python list of Metadata containers containing what is (usually) a subset of the data stored in the original collection (table). In contrast pymongo database find returns a database “Cursor” object which is their implementation of a large list that may exceed the size of memory. A key point is the model here makes sense only if the original table itself is small enough to not cause a memory problem. Further, find calls that yield long lists may cause efficiency problems as subclasses that build on this usually will need to do a linear search through the list if they need to find a particular instance (e.g. call to find_one).
- Parameters:
mspass_object (TimeSeries, Seismogram, TimeSeriesEnsemble, or SeismogramEnsemble) – data object to match against data in cache (i.e. query). It must be a valid MsPASS data object. If it is anything else (e.g. None) this method will return a tuple [None, elog] with elog being a PyErrorLogger with a posted message.
- Returns:
tuple with two elements. 0 is either a list of valid Metadata container(s) or None and 1 is either None or a PyErrorLogger object. There are only two possible returns from this method:
[None, elog] - find failed. See/save elog for why it failed.
[ [md1, md2, …, mdn], None] - success with 0 a list of Metadata containing attributes_to_load and load_if_defined (if appropriate) in each component.
- find_one(mspass_object) tuple [source]
Implementation of find_one for the generic cached method. It uses the cache_id method to create the indexing string from mspass_object and then returns a match to the cache stored in self. Only subclasses of this intermediate class can work because the cache_id method is defined as a pure virtual method in this intermediate class. That construct is used to simplify writing additional matcher classes. All extensions need to do is define the cache_id and db_make_cache_id algorithms to build that index.
- Parameters:
mspass_object – Any valid MsPASS data object. That means TimeSeries, Seismogram, TimeSeriesEnsemble, or SeismogramEnsemble. This datum is passed to the (abstract) cache_id method to create an index string and the result is used to fetch the Metadata container matching that key. What is required of the input is dependent on the subclass implementation of cache_id.
- Returns:
2-component tuple following API specification in BasicMatcher. Only two results are possible from this implementation:
- None ErrorLog - failure with an error message that can be passed on if desired or printed
- Metadata None - all is good with no detected errors. The Metadata container holds all attributes_to_load and any defined load_if_defined values.
- class mspasspy.db.normalize.EqualityDBMatcher(db, collection, match_keys, attributes_to_load, load_if_defined=None, aliases=None, require_unique_match=False, prepend_collection_name=False)[source]
Bases:
DatabaseMatcher
Database equivalent of EqualityMatcher.
- Parameters:
db (normally a MsPASS Database class but with this algorithm it can be the superclass from which Database is derived.) – MongoDB database handle (positional - no default)
collection (string) – Name of MongoDB collection that is to be queried. This arg is required by the constructor and has no default.
match_keys (python dictionary) – python dict of keys that are to be used for the equality match. The dict is used as an alias mechanism allowing different keys to be used for the Metadata container in data to be tested relative to the keys used in the database for the same attribute. (a typical common example would be something like “source_lat” in the data matching “lat” in the source collection). The key for each entry in this dict is taken as the key for the data side (mspass_object) and the value assigned to that key in this input is taken as the mongoDB/DataFrame key.
attributes_to_load (list of string defining keys in collection documents.) – list of keys of required attributes that will be returned in the output of the find method. The keys listed must ALL have defined values for all documents in the collection or some calls to find will fail. There is no default for this class and the list must be defined as arg3.
load_if_defined (list of strings defining collection keys) – list of keys of optional attributes to be extracted by find method. Any data attached to these keys will only be posted in the find return if they are defined in the database document retrieved in the query. Default is to load no optional data.
aliases – python dictionary defining alias names to apply when fetching from a data object’s Metadata container. The key sense of the mapping is important to keep straight. The key of this dictionary should match one of the attributes in attributes_to_load or load_if_defined. The value the key defines should be the alias used to fetch the comparable attribute from the data.
prepend_collection_name (boolean) – when True attributes returned in Metadata containers by find and find_one method will all have the collection name prepended with a (fixed) separator. For example, if the collection name is “channel” the “lat” attribute in the channel document would be returned as “channel_lat”. Default is False.
require_unique_match (boolean) – boolean handling of ambiguous matches. When True find_one will throw an error if an entry it tries to match is not unique. When False find_one returns the first document found and logs a complaint message. (default is False)
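The effect of prepend_collection_name on returned keys can be sketched as follows. This is an illustration only: apply_collection_prefix is a hypothetical helper, plain dicts stand in for documents and Metadata, and the “_” separator is inferred from the “channel_lat” example above.

```python
def apply_collection_prefix(doc, collection, keys, prepend_collection_name=False):
    """Copy selected keys from doc, optionally prefixing the collection name.

    Hypothetical helper; the "_" separator is inferred from the
    "channel_lat" example in the parameter description.
    """
    prefix = collection + "_" if prepend_collection_name else ""
    return {prefix + key: doc[key] for key in keys}


doc = {"lat": 42.6, "lon": 74.5, "elev": 1.6}
print(apply_collection_prefix(doc, "channel", ["lat", "lon"], True))
# {'channel_lat': 42.6, 'channel_lon': 74.5}
print(apply_collection_prefix(doc, "channel", ["lat", "lon"]))
# {'lat': 42.6, 'lon': 74.5}
```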
- class mspasspy.db.normalize.EqualityMatcher(db_or_df, collection, match_keys, attributes_to_load, query=None, load_if_defined=None, aliases=None, require_unique_match=True, prepend_collection_name=False, custom_null_values=None)[source]
Bases:
DataFrameCacheMatcher
Match with an equality test for the values of one or more keys with possible aliasing between data keys and database keys.
This class can be used for matching a set of keys that together provide a unique matching capability. Note the keys are applied sequentially to reduce the size of internal DataFrame cache in stages. If the DataFrame is large it may improve performance if the most unique key in a series appears first.
A special feature of the implementation is that we allow what is best thought of as reverse aliasing for the keys to be used for matching. That is, the base class of this family has an attribute self.aliases that allows mapping from collection names to data object names. The match_keys parameter here is defined in the reverse order. That is, the key of the match_keys dictionary is the data object key while the value associated with that key is the DataFrame column name to match. The constructor of the class does a sanity check to verify the two are consistent and will throw an exception if the two dictionaries are inconsistent. Note that means if you use an actual alias through match_keys (i.e. the key and value are different) you must define the aliases dictionary with the same combination reversed. (e.g. match_keys={"KSTA":"sta"} requires aliases={"sta":"KSTA"})
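The consistency rule between match_keys and aliases can be expressed as a short check. This is a sketch of the rule as stated above, not the constructor's actual code; aliases_consistent is a hypothetical name.

```python
def aliases_consistent(match_keys, aliases):
    """Return True if every true alias in match_keys is mirrored in aliases.

    match_keys maps data key -> column name; aliases maps column name ->
    data key. Entries whose key and value are equal need no alias.
    """
    aliases = aliases or {}
    for data_key, column in match_keys.items():
        if data_key != column and aliases.get(column) != data_key:
            return False
    return True


print(aliases_consistent({"KSTA": "sta"}, {"sta": "KSTA"}))  # True
print(aliases_consistent({"KSTA": "sta"}, None))             # False
print(aliases_consistent({"sta": "sta"}, None))              # True
```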
- Parameters:
db_or_df (MongoDB database handle or pandas DataFrame.) – MongoDB database handle or a pandas DataFrame. Most users will use the database handle version. In that case the collection argument is used to determine what collection is loaded into the cache. If a DataFrame is used the collection name is only a tag defined by the user. For a DataFrame a column index is required that contains at least the attributes defined in attributes_to_load.
collection (string) – When using database input this is expected to be a string defining a valid MongoDB collection with documents that are to be scanned and loaded into the internal cache. With DataFrame input this string is only a tag. It is relevant then only if the prepend_collection_name boolean is set True. There is no default for this parameter so it must be specified as arg 1.
match_keys (python dictionary) – python dict of keys that are to be used for the equality match. The dict is used as an alias mechanism allowing different keys to be used for the Metadata container in data to be tested relative to the keys used in the database for the same attribute. (a typical common example would be something like “source_lat” in the data matching “lat” in the source collection). The key for each entry in this dict is taken as the key for the data side (mspass_object) and the value assigned to that key in this input is taken as the mongoDB/DataFrame key.
attributes_to_load (list of string defining keys in collection documents) – list of keys of required attributes that will be returned in the output of the find method. The keys listed must ALL have defined values for all documents in the collection or some calls to find will fail. There is currently no default for this parameter and it must be defined as arg 3.
query (python dictionary.) – optional query to apply to collection before loading data from the database. A common use would be to reduce the size of the cache by using a time range limit on station metadata to only load records relevant to the dataset being processed. This parameter is currently ignored for DataFrame input as we assume pandas subsetting would be used for the same functionality in the workflow prior to calling the class constructor for this object.
load_if_defined (list of strings defining collection keys) – list of keys of optional attributes to be extracted by find method. Any data attached to these keys will only be posted in the find return if they are defined in the database document retrieved in the query. Default resolves to an empty list. Note this parameter is ignored for DataFrame input.
aliases – python dictionary defining alias names to apply when fetching from a data object’s Metadata container. The key sense of the mapping is important to keep straight. The key of this dictionary should match one of the attributes in attributes_to_load or load_if_defined. The value the key defines should be the alias used to fetch the comparable attribute from the data.
prepend_collection_name (boolean) – when True attributes returned in Metadata containers by the find and find_one method will all have the collection name prepended with a (fixed) separator. Default is False.
require_unique_match (boolean) – boolean handling of ambiguous matches. When True find_one will throw an error if an entry it tries to match is not unique. When False find_one returns the first document found and logs a complaint message. (default is True)
- subset(mspass_object) DataFrame [source]
Concrete implementation of this virtual method of DataFrameCacheMatcher for this class.
The subset is done sequentially, driven by the key order of the self.match_keys dictionary. i.e. the algorithm uses the row reduction operation of a dataframe one key at a time. An implementation detail is that a more clever approach may be to create a single conditional clause to pass to the DataFrame operator [] combining the key matches with “and”. That would likely improve performance, particularly on large tables. Note the alias is applied using self.match_keys. i.e. one can have different keys on the left (the mspass_data side is the match_keys dictionary key) than the right (dataframe column name).
- Parameters:
mspass_object (TimeSeries, Seismogram, TimeSeriesEnsemble or SeismogramEnsemble object) – Any valid mspass data object with a Metadata container. The container must contain all the required match keys or the function will return an error condition (see below)
- Returns:
DataFrame containing all data satisfying the series of match conditions defined on construction. Silently returns a zero length DataFrame if there is no match. Be warned two other situations can cause the return to have no data: (1) dead input, and (2) match keys missing from mspass_object.
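The one-key-at-a-time reduction can be illustrated without pandas. The sketch below is an analogue only, not the class's implementation: a list of dicts stands in for the DataFrame cache, a dict stands in for Metadata, and sequential_subset is a hypothetical name. An empty list plays the role of the zero length DataFrame.

```python
def sequential_subset(rows, metadata, match_keys):
    """Reduce rows one key at a time, in match_keys order.

    Illustrative analogue only: rows is a list of dicts standing in for
    the DataFrame cache and metadata is a dict standing in for Metadata.
    """
    for data_key, column in match_keys.items():
        if data_key not in metadata:
            return []  # missing match key yields an empty result
        rows = [row for row in rows if row.get(column) == metadata[data_key]]
        if not rows:
            break  # nothing left to reduce
    return rows


rows = [
    {"net": "IU", "sta": "ANMO", "lat": 34.9},
    {"net": "IU", "sta": "COLA", "lat": 64.9},
    {"net": "II", "sta": "AAK", "lat": 42.6},
]
keys = {"net": "net", "sta": "sta"}
print(sequential_subset(rows, {"net": "IU", "sta": "COLA"}, keys))
# [{'net': 'IU', 'sta': 'COLA', 'lat': 64.9}]
print(sequential_subset(rows, {"net": "IU"}, keys))  # []
```

Putting the most selective key first shrinks the working set earliest, which is the performance point made in the class description.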
- class mspasspy.db.normalize.MiniseedDBMatcher(db, collection='channel', attributes_to_load=['starttime', 'endtime', 'lat', 'lon', 'elev', '_id'], load_if_defined=None, aliases=None, prepend_collection_name=True)[source]
Bases:
DatabaseMatcher
Database implementation of matcher for miniseed data using SEED keys net:sta:chan(channel only):loc and a time interval test. Miniseed data uses the excessively complex key that combines four unique string names (net, sta, chan, and loc) and a time interval of operation to define a unique set of station metadata for each channel. In mspass we also create the site collection without the chan attribute. This implementation works for both channel and site under control of the collection argument.
This case is the complete opposite of something like the ObjectId matcher above as the match query we need to generate is excessively long requiring up to 6 fields.
The default collection name is channel which is the only correct use if applied to data created through readers applied to wf_miniseed. The implementation can also work on Seismogram data if and only if the collection argument is then set to “site”. The difference is that for Seismogram data “chan” is an undefined concept. In both cases the keys and content assume the mspass schema for the channel or site collections. The constructor will throw a MsPASSError exception if the collection argument is anything but “channel” or “site” to enforce this limitation. If you use a schema other than the mspass schema the methods in this class will fail if you change any of the following keys:
net, sta, chan, loc, starttime, endtime, hang, vang
Users should only call the find_one method for this application. The find_one algorithm first queries for any matches of net:sta:chan(channel only):loc(optional) and data t0 within the starttime and endtime of the channel document attributes (an interval match). That combination should yield either 1 or no matches if the channel collection is clean. However, there are known issues with station metadata that can cause multiple matches in unusual cases (most notably overlapping time intervals defined for the same channel). The find_one method will handle that case by returning the first one found and posting an error message that should be handled by the caller.
Instantiation of this class is a call to the superclass constructor with specialized defaults and wrapper code to automatically handle potential mismatches between site and channel. The arguments for the constructor follow:
- Parameters:
db (normally a MsPASS Database class but with this algorithm it can be the superclass from which Database is derived.) – MongoDB database handle (positional - no default)
collection (string) – Name of MongoDB collection that is to be queried. The default is “channel”. Use “site” for Seismogram data. Use anything else at your own risk as the algorithm depends heavily on mspass schema definition and properties guaranteed by using the converter from obspy Inventory class loaded through web services or stationml files.
attributes_to_load (list of string defining keys in collection documents) – list of keys of required attributes that will be returned in the output of the find method. The keys listed must ALL have defined values for all documents in the collection or some calls to find will fail. Default is ["_id","lat","lon","elev","hang","vang","starttime","endtime"] when collection is set as channel. A smaller list of ["_id","lat","lon","elev"] is the default when collection is set as “site”.
load_if_defined (list of strings defining collection keys) – list of keys of optional attributes to be extracted by find method. Any data attached to these keys will only be posted in the find return if they are defined in the database document retrieved in the query. Default is [“loc”]. A common addition here may be response data (see schema definition for keys)
aliases – python dictionary defining alias names to apply when fetching from a data object’s Metadata container. The key sense of the mapping is important to keep straight. The key of this dictionary should match one of the attributes in attributes_to_load or load_if_defined. The value the key defines should be the alias used to fetch the comparable attribute from the data.
prepend_collection_name (boolean) – when True attributes returned in Metadata containers by the find and find_one method will all have the collection name prepended with a (fixed) separator. For example, if the collection name is “channel” the “lat” attribute in the channel document would be returned as “channel_lat”.
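The renaming controlled by prepend_collection_name can be pictured with a small stand-alone sketch. The helper name prepend_keys is hypothetical, and the “_” separator is only inferred from the “channel_lat” example above; neither is part of the mspasspy API.

```python
def prepend_keys(doc, collection, prepend_collection_name=True, separator="_"):
    """Return a copy of doc with keys renamed to <collection><separator><key>.

    Hypothetical helper for illustration only; not an mspasspy function.
    """
    if not prepend_collection_name:
        return dict(doc)
    renamed = {}
    for key, value in doc.items():
        if key.startswith(separator):
            # "_id" becomes "channel_id" rather than "channel__id"
            renamed[collection + key] = value
        else:
            renamed[collection + separator + key] = value
    return renamed


channel_doc = {"lat": 42.64, "lon": 74.49, "elev": 1.645}
renamed_doc = prepend_keys(channel_doc, "channel")
print(renamed_doc)  # keys are channel_lat, channel_lon, channel_elev
```

With prepend_collection_name set False the keys come back unchanged, which is only safe when names such as “lat” cannot collide with attributes loaded from another collection.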
- find_one(mspass_object)[source]
We overload find_one to provide the unique match needed. Most of the work is done by the query_generator method in this case. This method is little more than a wrapper that runs the find method and translates the output into the slightly different form required by find_one. More important is the fact that the wrapper implements two safeties to make the code more robust:
It immediately tests that mspass_object is a valid TimeSeries or Seismogram object. It will raise a TypeError exception if that is not true. That is enforced because find_one in this context makes sense only for atomic objects.
It handles dead data cleanly, logging a message complaining that the data was already marked dead.
In addition, note that the find method this calls is assumed to handle failures in the query_generator function that occur when any of the required net, sta, chan keys are missing from mspass_object.
- query_generator(mspass_object)[source]
Concrete implementation of (required) abstract method defined in superclass DatabaseMatcher. It generates the complex query for matching net, sta, chan, and optional loc along with the time interval match of data start time between the channel defined “starttime” and “endtime” attributes.
This method provides one safety for a common data problem. A common current issue is that if miniseed data are saved to wf_TimeSeries and then read back in a later workflow, the default schema will alter the keys for net, sta, chan, and loc to add the prefix “READONLYERROR_” (e.g. READONLYERROR_net). The query automatically tries to recover any of the station name keys using that recipe.
- Parameters:
mspass_object – assumed to be a TimeSeries object with net, sta, chan, and (optional) loc defined. The time interval test is a translation to MongoDB syntax of:
channel[“starttime”] <= mspass_object.t0 <= channel[“endtime”]
This algorithm will abort if the expression mspass_object.t0 does not resolve, which means the caller should ensure the input is a TimeSeries object.
- Returns:
normal return is a string defining the query. If any required station name keys are not defined the method will silently return a None. Caller should handle a None condition.
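As a rough illustration of the query described above, the following sketch builds a pymongo-style query from a dict of data attributes. It is not the mspasspy implementation: the function name make_channel_query is hypothetical, the query is built as a python dict because that is what pymongo's find accepts, and the rename prefix is a parameter whose default is the “READONLYERROR_” tag quoted for cache_id elsewhere in this documentation.

```python
def make_channel_query(md, t0, readonly_prefix="READONLYERROR_"):
    """Build a query dict matching net/sta/chan (+ optional loc) and t0.

    md is any dict-like container of data attributes and t0 is the data
    start time as an epoch time. Returns None when a required key is missing.
    Hypothetical helper for illustration only.
    """
    query = {}
    for key in ("net", "sta", "chan"):
        if key in md:
            query[key] = md[key]
        elif readonly_prefix + key in md:
            # recover a key that was renamed when saved as readonly
            query[key] = md[readonly_prefix + key]
        else:
            return None
    # loc is optional: add it only when one of the two forms is present
    for key in ("loc", readonly_prefix + "loc"):
        if key in md:
            query["loc"] = md[key]
            break
    # interval match: starttime <= t0 <= endtime
    query["starttime"] = {"$lte": t0}
    query["endtime"] = {"$gte": t0}
    return query


example_query = make_channel_query(
    {"READONLYERROR_net": "IU", "sta": "AAK", "chan": "BHZ"}, 1262304000.0
)
print(example_query)
```

A caller would pass the returned dict to the collection's find method and must check for None first.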
- class mspasspy.db.normalize.MiniseedMatcher(db, collection='channel', query=None, attributes_to_load=['starttime', 'endtime', 'lat', 'lon', 'elev', '_id'], load_if_defined=None, aliases=None, prepend_collection_name=True)[source]
Bases:
DictionaryCacheMatcher
Cached version of matcher for miniseed station/channel Metadata.
Miniseed data require 6 keys to uniquely define a single channel of data (5 at the Seismogram level where the channels are merged). A further complication for using the DictionaryCacheMatcher interface is that part of the definition is a UTC time interval defining when the metadata is valid. We handle that in this implementation with a two-stage search algorithm for the find_one method. First, the cache index is defined by a unique string created from the four string keys of miniseed that the MsPASS default schema refers to with the keywords net, sta, chan, and loc. At this point in time we know of no examples of a seismic instrument where the number of distinct time intervals with different Metadata is huge, so the secondary search is a simple linear search through the python list returned by the generic find method using only the net, sta, chan, and loc keys as the index.
The default collection name is channel, which is the only correct use if applied to data created through readers applied to wf_miniseed. The implementation can also work on Seismogram data if and only if the collection argument is set to “site”. The difference is that for Seismogram data “chan” is an undefined concept. In both cases the keys and content assume the mspass schema for the channel or site collections. The constructor will throw a MsPASSError exception if the collection argument is anything but “channel” or “site” to enforce this limitation. If you use a schema other than the mspass schema the methods in this class will fail if you change any of the following keys:
net, sta, chan, loc, starttime, endtime, hang, vang
Users should only call the find_one method for this application. The find_one method here overrides the generic find_one in the superclass DictionaryCacheMatcher. It implements the linear search for a matching time interval as noted above. Note also that this class does not support Ensembles directly. Matching instrument data is by definition what we call atomic. If you are processing ensembles you will need to write a small wrapper function that runs find_one and handles the output while looping over each member of the ensemble.
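A wrapper of the kind described might look like the sketch below. It assumes find_one returns a pair of (Metadata or None, error log), as described for find_one elsewhere in this module's documentation. The names find_one_for_members and StubMatcher are hypothetical; the stub and the plain dicts stand in for a real matcher and real ensemble members so the example runs on its own.

```python
def find_one_for_members(matcher, members):
    """Apply matcher.find_one to each member of an iterable of atomic data.

    Returns (matches, n_failed) where matches holds one entry per member
    (None when nothing matched) and n_failed counts the failures.
    """
    matches = []
    n_failed = 0
    for datum in members:
        metadata, _log = matcher.find_one(datum)
        if metadata is None:
            n_failed += 1
        matches.append(metadata)
    return matches, n_failed


class StubMatcher:
    """Stand-in with the (metadata, log) return shape assumed above."""

    def __init__(self, table):
        self.table = table

    def find_one(self, datum):
        return self.table.get(datum["sta"]), None


stub = StubMatcher({"AAK": {"channel_lat": 42.64}})
matches, n_failed = find_one_for_members(stub, [{"sta": "AAK"}, {"sta": "XXX"}])
print(matches, n_failed)
```

In a real workflow the members argument would be the member container of a TimeSeriesEnsemble or SeismogramEnsemble, and the returned Metadata would be copied into each live member.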
- Parameters:
db (mspass Database handle(mspasspy.db.database.Database).) – MongoDB database handle containing collection to be loaded.
collection (string) – Name of MongoDB collection that is to be queried. The default is “channel”. Use “site” for Seismogram data. Use anything else at your own risk as the algorithm depends heavily on the mspass schema definition and properties guaranteed by using the converter from the obspy Inventory class loaded through web services or StationML files.
query (python dictionary.) – optional query to apply to collection before loading data from the database. This parameter is ignored if the input is a DataFrame. A common use would be to reduce the size of the cache by using a time range limit on station metadata to only load records relevant to the dataset being processed.
attributes_to_load (list of string defining keys in collection documents) – list of keys of required attributes that will be returned in the output of the find method. The keys listed must ALL have defined values for all documents in the collection or some calls to find will fail. Default is [“_id”, “lat”, “lon”, “elev”, “hang”, “vang”, “starttime”, “endtime”] when collection is set as “channel”. The smaller list [“_id”, “lat”, “lon”, “elev”] is the default when collection is set as “site”. In either case the list MUST contain “starttime” and “endtime” because the linear search step always uses those two fields to test for a time interval match. Take care that endtime resolves to an epoch time in the distant future and not some null database attribute. That is a possible scenario with DataFrame input but not a concern if using mspass loaders from StationML data.
Note there is also an implicit assumption that the keys “net” and “sta” are always defined. “chan” must also be defined if the collection name is “channel”. “loc” is handled as optional for database input but is required if the input is a DataFrame because we use the same cache id generator for all cases.
load_if_defined (list of strings defining collection keys) – list of keys of optional attributes to be extracted by find method. Any data attached to these keys will only be posted in the find return if they are defined in the database document retrieved in the query. Default is [“loc”]. A common addition here may be response data (see schema definition for keys)
aliases – python dictionary defining alias names to apply when fetching from a data object’s Metadata container. The key sense of the mapping is important to keep straight. The key of this dictionary should match one of the attributes in attributes_to_load or load_if_defined. The value the key defines should be the alias used to fetch the comparable attribute from the data.
prepend_collection_name (boolean) – when True attributes returned in Metadata containers by the find and find_one method will all have the collection name prepended with a (fixed) separator. For example, if the collection name is “channel” the “lat” attribute in the channel document would be returned as “channel_lat”.
- cache_id(mspass_object) str [source]
Concrete implementation of this required method. The cache_id in this algorithm is a composite key made from net, sta, chan, and loc with a fixed separator string of “_”. A typical example is IU_AAK_BHZ_00_.
An added feature to mesh with MsPASS conventions is a safety for attributes that are automatically renamed when saved that are marked readonly in the schema. Such attributes have a prepended tag, (at this time “READONLYERROR_”). If one of the required keys for the index is missing (e.g. “net”) the function tries to then fetch the modified name (e.g. “READONLYERROR_net”). If that also fails it returns a None as specified by the API.
- Parameters:
mspass_object – mspass object to be matched with cache. Must contain net and sta for site matching, and net, sta, and chan for the channel collection. If loc is not defined in either case an empty string is used and the key has two trailing separator characters (e.g. IU_AAK_BHZ__)
- Returns:
normal return is a string that can be used as an index string. If any of the required keys is missing it will return a None
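The key construction can be sketched in a few lines of plain Python. This is an illustration inferred from the two example keys given above (IU_AAK_BHZ_00_ and IU_AAK_BHZ__), not the mspasspy source; the function name make_cache_id and its keyword arguments are hypothetical.

```python
def make_cache_id(md, collection="channel", separator="_",
                  readonly_prefix="READONLYERROR_"):
    """Build a composite cache key from net, sta, chan (channel only), loc.

    md is any dict-like container. Returns None if a required key is missing
    under both its normal and its readonly-renamed form.
    Hypothetical helper for illustration only.
    """
    required = ["net", "sta", "chan"] if collection == "channel" else ["net", "sta"]
    parts = []
    for key in required:
        if key in md:
            parts.append(md[key])
        elif readonly_prefix + key in md:
            parts.append(md[readonly_prefix + key])
        else:
            return None
    # loc is optional; an empty string is used when it is undefined
    if "loc" in md:
        parts.append(md["loc"])
    else:
        parts.append(md.get(readonly_prefix + "loc", ""))
    # every part, including loc, is followed by the separator
    return "".join(part + separator for part in parts)


print(make_cache_id({"net": "IU", "sta": "AAK", "chan": "BHZ", "loc": "00"}))
# -> IU_AAK_BHZ_00_
print(make_cache_id({"net": "IU", "sta": "AAK", "chan": "BHZ"}))
# -> IU_AAK_BHZ__
```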
- db_make_cache_id(doc) str [source]
Concrete implementation of this required method. The cache_id in this algorithm is a composite key made from net, sta, chan, and loc with a fixed separator string of “_”. A typical example is IU_AAK_BHZ_00_. This method creates this string from a MongoDB document passed as a python dictionary through the argument doc. Unlike the cache_id method this function does not have a safety for readonly errors. The reason is that it is designed to be used only while loading the cache from site or channel documents.
- Parameters:
doc – python dict containing a site or channel document. Must contain net and sta for site matching, and net, sta, and chan for the channel collection. If loc is not defined in either case an empty string is used and the key has two trailing separator characters (e.g. IU_AAK_BHZ__)
- Returns:
normal return is a string that can be used as an index string. If any of the required keys is missing it will return a None
- find_doc(doc, wfdoc_starttime_key='starttime')[source]
Function to support use with bulk_normalize to set channel_id or site_id. Acts like find_one but without support for readonly recovery. The bigger difference is that this method accepts a python dict retrieved in a cursor loop for bulk_normalize. Returns the Metadata container that is matched from the cache. This uses the same algorithm as the overloaded find_one method of this class, where a linear search is used to handle the time interval matching. Here, however, the time field is extracted from doc with the key defined by wfdoc_starttime_key.
This method overrides the generic version in BasicMatcher due to some special peculiarities of miniseed.
- Parameters:
doc (python dictionary - document from MongoDB) – document (pretty much assumed to be from wf_miniseed) to be matched with channel or site.
wfdoc_starttime_key (string) – optional parameter to change the key used to fetch the start time of waveform data from doc. Default is “starttime”.
- Returns:
matching Metadata container if successful. None if matching fails for any reason.
- find_one(mspass_object)[source]
We overload find_one to provide the unique match needed. The algorithm does a linear search for the first time interval for which the start time of mspass_object is within the starttime <= t0 <= endtime range of a record stored in the cache. This works only if starttime and endtime are defined in the set of attributes loaded, so the constructor of this class enforces that restriction. The times in starttime and endtime are assumed to be epoch times because the algorithm uses a simple numeric comparison of the data start time with the two times. An issue to watch out for is endtime not being set to a valid distant time but to some null field that resolves to something that does not work in a numerical test for < endtime.
- Parameters:
mspass_object (must be one of the atomic data types of mspass (currently TimeSeries and Seismogram) with t0 defined as an epoch time computed from a UTC time stamp) – data to be used for matching against the cache. It must contain the required keys or the matching will fail. If the datum is marked dead the algorithm will return immediately with a None response and an error message that would usually be dropped by the caller in that situation.
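The second stage of the search amounts to the loop sketched below. The cache layout (a dict mapping each cache id to a list of documents) and the function name find_interval_match are assumptions made for illustration; only the starttime <= t0 <= endtime test comes from the description above.

```python
def find_interval_match(cache, cache_id, t0):
    """Return the first cached document whose time span contains t0.

    cache maps a composite id string to a list of dicts, each holding epoch
    times under "starttime" and "endtime". Returns None when nothing matches.
    Hypothetical helper for illustration only.
    """
    for doc in cache.get(cache_id, []):
        if doc["starttime"] <= t0 <= doc["endtime"]:
            return doc
    return None


# two instrument epochs for the same channel; the second is still open
cache = {
    "IU_AAK_BHZ_00_": [
        {"starttime": 0.0, "endtime": 1000.0, "lat": 42.63},
        {"starttime": 1000.5, "endtime": 4.0e9, "lat": 42.64},
    ]
}
print(find_interval_match(cache, "IU_AAK_BHZ_00_", 2000.0))
```

Note that the open epoch uses a large finite endtime. A None or NaN endtime would make the numeric comparison fail or never match, which is the hazard noted above.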
- class mspasspy.db.normalize.ObjectIdDBMatcher(db, collection='channel', attributes_to_load=['_id', 'lat', 'lon', 'elev', 'hang', 'vang'], load_if_defined=None, aliases=None, prepend_collection_name=True)[source]
Bases:
DatabaseMatcher
Implementation of DatabaseMatcher for ObjectIds. In this class the virtual method query_generator uses the mspass convention of using the collection name and the magic string “_id” as the data object key (e.g. channel_id) but runs the query using the “_id” magic string used in MongoDB for the ObjectId of each document.
Users should only use the find_one method of this class because a match on an ObjectId, by definition, returns only one record or None. The find method, in fact, is overridden and any attempt to use it will raise a MsPASSError exception.
- Parameters:
db (normally a MsPASS Database class but with this algorithm it can be the superclass from which Database is derived.) – MongoDB database handle (positional - no default)
collection (string) – Name of MongoDB collection that is to be queried (default “channel”).
attributes_to_load (list of string defining keys in collection documents) – list of keys of required attributes that will be returned in the output of the find method. The keys listed must ALL have defined values for all documents in the collection or some calls to find_one will fail. Default [“_id”,”lat”,”lon”,”elev”,”hang”,”vang”].
load_if_defined (list of strings defining collection keys) – list of keys of optional attributes to be extracted by find method. Any data attached to these keys will only be posted in the find return if they are defined in the database document retrieved in the query.
aliases – python dictionary defining alias names to apply when fetching from a data object’s Metadata container. The key sense of the mapping is important to keep straight. The key of this dictionary should match one of the attributes in attributes_to_load or load_if_defined. The value the key defines should be the alias used to fetch the comparable attribute from the data.
prepend_collection_name (boolean) – when True attributes returned in Metadata containers by the find and find_one method will all have the collection name prepended with a (fixed) separator. For example, if the collection name is “channel” the “lat” attribute in the channel document would be returned as “channel_lat”.
- query_generator(mspass_object) dict [source]
Subclasses of this intermediate class MUST implement this method. It should extract content from mspass_object and use that content to generate a MongoDB query that is passed directly to the find method of the MongoDB database handle stored within this object (self) during the class construction. Since pymongo uses a python dict for that purpose it must return a valid query dict. Implementations should return None if no query could be generated. That is common, for example, when a key required to generate the query is missing from mspass_object.
- class mspasspy.db.normalize.ObjectIdMatcher(db, collection='channel', query=None, attributes_to_load=['_id', 'lat', 'lon', 'elev', 'hang', 'vang'], load_if_defined=None, aliases=None, prepend_collection_name=True)[source]
Bases:
DictionaryCacheMatcher
Implement an ObjectId match with caching. Most of the code for this class is derived from the superclass DictionaryCacheMatcher. It adds only a concrete implementation of the cache_id method used to construct a key for the cache defined by a python dict (self.normcache). In this case the cache key is simply the string representation of the ObjectId of each document in the collection defined in construction. The cache is then created by the superclass generic method _load_normalization_cache.
- Parameters:
db (normally a MsPASS Database class but with this algorithm it can be the superclass from which Database is derived.) – MongoDB database handle (positional - no default)
collection (string) – Name of MongoDB collection that is to be loaded and cached to memory inside this object (default “channel”)
query (python dict with content that defines a valid query when passed to the MongoDB find method. If query is a type other than None or dict the constructor will throw a TypeError.) – optional query to apply to collection before loading document attributes into the cache. A typical example would be a time range limit for the channel or site collection to avoid loading instruments not operational during the time span of a data set. Default is None, which causes the entire collection to be parsed.
attributes_to_load (list of string defining keys in collection documents) – list of keys of required attributes that will be returned in the output of the find method. The keys listed must ALL have defined values for all documents in the collection or some calls to find will fail. Default [“_id”,”lat”,”lon”,”elev”,”hang”,”vang”]
load_if_defined (list of strings defining collection keys) – list of keys of optional attributes to be extracted by find method. Any data attached to these keys will only be posted in the find return if they are defined in the database document retrieved in the query.
aliases – python dictionary defining alias names to apply when fetching from a data object’s Metadata container. The key sense of the mapping is important to keep straight. The key of this dictionary should match one of the attributes in attributes_to_load or load_if_defined. The value the key defines should be the alias used to fetch the comparable attribute from the data.
prepend_collection_name (boolean) – when set True all attributes loaded from the normalizing collection will have the collection name prepended. That is essential if the collection contains generic names like “lat” or “depth” that would produce ambiguous keys if used directly (e.g. lat is used for source, channel, and site collections in the default schema).
- cache_id(mspass_object) str [source]
Implementation of the virtual method with this name for this matcher. It implements the MsPASS approach of defining the key for a normalizing collection as the collection name and the magic string “_id” (e.g. channel_id or site_id). It assumes the collection name is defined as self.collection by the constructor of the class when it is instantiated. It attempts to extract the expanded _id name (e.g. channel_id) from the input mspass_object. If successful it returns the string representation of the resulting (assumed) ObjectId. If the key is not defined it returns None as specified by the superclass API.
It is important to note that the class attribute self.prepend_collection_name is independent of the definition of the cache_id, i.e. what we attempt to extract as the id ALWAYS uses the collection name as a prefix (channel_id and not “_id”). The internal boolean controls only whether the attributes returned by find_one will have the collection name prepended.
- Parameters:
mspass_object (Normally this is expected to be a mspass data object (TimeSeries, Seismogram, or ensembles of same) but it can be as simple as a python dict or Metadata with the required key defined.) – key-value pair container containing an id that is to be extracted.
- Returns:
string representation of an ObjectId used by the find method to match the cache index stored internally.
- db_make_cache_id(doc) str [source]
Implementation of the virtual method with this name for this matcher. It does nothing more than extract the magic “_id” value from doc and return its string representation. With MongoDB that means the string representation of the ObjectId of each collection document is used as the key for the cache.
- Parameters:
doc (python dict container returned by pymongo - usually a cursor component.) – python dict defining a document returned by MongoDB. Only the “_id” value is used.
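How cache_id and db_make_cache_id work together can be sketched as follows. The function names load_cache and lookup are hypothetical, and 24-character hex strings stand in for real bson ObjectId values (whose str() form is such a string) so the example needs no MongoDB installation.

```python
def load_cache(docs):
    """Index documents by the string form of their "_id" value."""
    return {str(doc["_id"]): doc for doc in docs}


def lookup(cache, md, collection="channel"):
    """Fetch the cached document referenced by md["<collection>_id"].

    Returns None when the id key is undefined or is not in the cache.
    """
    id_key = collection + "_id"
    if id_key not in md:
        return None
    return cache.get(str(md[id_key]))


channel_docs = [
    {"_id": "64b7f0c2a1d3e4f5a6b7c8d9", "lat": 42.64, "lon": 74.49},
    {"_id": "64b7f0c2a1d3e4f5a6b7c8da", "lat": 34.95, "lon": -106.46},
]
id_cache = load_cache(channel_docs)
print(lookup(id_cache, {"channel_id": "64b7f0c2a1d3e4f5a6b7c8da"}))
```

The data side always uses the expanded name (channel_id) while the collection side uses “_id”, which is why the two key-making methods differ.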
- class mspasspy.db.normalize.OriginTimeDBMatcher(db, collection='source', t0offset=0.0, tolerance=4.0, query=None, attributes_to_load=['_id', 'lat', 'lon', 'depth', 'time'], load_if_defined=['magnitude'], aliases=None, require_unique_match=False, prepend_collection_name=True, data_time_key=None, source_time_key=None)[source]
Bases:
DatabaseMatcher
Generic class to match data by comparing a time defined in data to an origin time using a database query algorithm.
The default behavior of this matcher class is to match data to source documents based on origin time with an optional time offset. Conceptually the data model for this matching is identical to conventional multichannel shot gathers where the start time is usually the origin time. It is also a common model for downloaded source oriented waveform segments from FDSN web services with obspy. Obspy has an example in their documentation for how to download data defined exactly this way. In that mode we match each source document that matches a projected origin time within a specified tolerance. Specifically, let t0 be the start time extracted from the data. We then compute the projected test origin time as test_time = t0 - t0offset. Note the sign convention that a positive offset means the time t0 is after the event origin time. We then select all source records for which the time field satisfies:
source.time - tolerance <= test_time <= source.time + tolerance
The test_time value for matching from a datum can come through one of two methods driven by the constructor argument “data_time_key”. When data_time_key is None (default) the algorithm assumes all inputs are mspass atomic data objects that have the start time defined by the attribute “t0” (mspass_object.t0). If data_time_key is a string it is assumed to be a Metadata key used to fetch an epoch time to use for the test. The most likely use of that feature would be for ensemble processing where test_time is set as a field in the ensemble Metadata. Note that this form of associating source data with ensembles that are common source gathers can be much faster than the atomic version because only one query is needed per ensemble.
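The time test reduces to a single comparison, sketched here as a stand-alone predicate. The function name origin_time_matches is hypothetical; the arithmetic follows the formula and sign convention stated above, with all times in epoch seconds.

```python
def origin_time_matches(t0, source_time, t0offset=0.0, tolerance=4.0):
    """Return True when the projected origin time is within tolerance.

    t0 is the data start time, source_time the origin time of a candidate
    source document. A positive t0offset means t0 is after the origin time.
    """
    test_time = t0 - t0offset
    return source_time - tolerance <= test_time <= source_time + tolerance


# data start 120 s after an origin time near 1000 s
print(origin_time_matches(1120.0, 1002.0, t0offset=120.0))  # -> True
print(origin_time_matches(1120.0, 1010.0, t0offset=120.0))  # -> False
```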
- Parameters:
db (normally a MsPASS Database class but with this algorithm it can be the superclass from which Database is derived.) – MongoDB database handle (positional - no default)
collection (string) – Name of MongoDB collection that is to be queried (default “source”).
t0offset (float) – constant offset from data start time that is expected as origin time. A positive t0offset means the origin time is before the data start time. Units are always assumed to be seconds.
tolerance (float) – time tolerance to test for match of origin time (see formula above for exact use). If the source estimates are exactly the same as the ones used to define the data start time this number can be a few samples. Otherwise a few seconds is safer for teleseismic data and less for local/regional events, i.e. the choice depends upon how the source estimates relate to the data.
attributes_to_load (list of string defining keys in collection documents) – list of keys of required attributes that will be returned in the output of the find method. The keys listed must ALL have defined values for all documents in the collection or some calls to find_one will fail. Default is [“_id”, “lat”, “lon”, “depth”, “time”]
load_if_defined (list of strings defining collection keys) – list of keys of optional attributes to be extracted by find method. Any data attached to these keys will only be posted in the find return if they are defined in the database document retrieved in the query. Default is [“magnitude”]
aliases – python dictionary defining alias names to apply when fetching from a data object’s Metadata container. The key sense of the mapping is important to keep straight. The key of this dictionary should match one of the attributes in attributes_to_load or load_if_defined. The value the key defines should be the alias used to fetch the comparable attribute from the data.
prepend_collection_name (boolean) – when True attributes returned in Metadata containers by the find and find_one method will all have the collection name prepended with a (fixed) separator. For example, if the collection name is “channel” the “lat” attribute in the channel document would be returned as “channel_lat”.
require_unique_match (boolean) – controls the handling of ambiguous matches. When True find_one will throw an error if an entry it tries to match is not unique. When False find_one returns the first document found and logs a complaint message. (default is False)
- query_generator(mspass_object) dict [source]
Concrete implementation of this required method for a subclass of DatabaseMatcher.
This algorithm implements the time test described in detail in docstring for this class. Note the fundamental change in how the test time is computed that depends on the internal (self) attribute time_key. When None we use the data’s t0 attribute. Otherwise self.time_key is assumed to be a string key to fetch the test time from the object’s Metadata container.
- Parameters:
mspass_object (Any valid MsPASS data object.) – MsPASS defined data object that contains data to be used for this match (t0 attribute or content of self.time_key).
- Returns:
query python dictionary on success. Returns None if a query could not be constructed. That can happen in two ways here: (1) the input is not a valid mspass data object or is marked dead; (2) the time_key algorithm is used and time_key is not defined in the input datum.
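For illustration, the time test can be written as a MongoDB range query by rearranging the inequality so that the document's time field sits in the middle. The sketch below is not the mspasspy source; the function name make_origin_time_query and the source_time_key keyword are hypothetical, and “$gte”/“$lte” are the standard MongoDB comparison operators.

```python
def make_origin_time_query(t0, t0offset=0.0, tolerance=4.0, source_time_key="time"):
    """Build a range query selecting sources near the projected origin time.

    source.time - tolerance <= test_time <= source.time + tolerance
    is equivalent to
    test_time - tolerance <= source.time <= test_time + tolerance
    """
    test_time = t0 - t0offset
    return {
        source_time_key: {
            "$gte": test_time - tolerance,
            "$lte": test_time + tolerance,
        }
    }


origin_query = make_origin_time_query(1120.0, t0offset=120.0, tolerance=4.0)
print(origin_query)  # -> {'time': {'$gte': 996.0, '$lte': 1004.0}}
```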
- class mspasspy.db.normalize.OriginTimeMatcher(db_or_df, collection='source', t0offset=0.0, tolerance=4.0, attributes_to_load=['_id', 'lat', 'lon', 'depth', 'time'], load_if_defined=['magnitude'], aliases=None, require_unique_match=False, prepend_collection_name=True, data_time_key=None, source_time_key='time', custom_null_values=None)[source]
Bases:
DataFrameCacheMatcher
Generic class to match data by comparing a time defined in data to an origin time using a cached DataFrame.
The default behavior of this matcher class is to match data to source documents based on origin time with an optional time offset. Conceptually the data model for this matching is identical to conventional multichannel shot gathers where the start time is usually the origin time. It is also a common model for downloaded source oriented waveform segments from FDSN web services with obspy. Obspy has an example in their documentation for how to download data defined exactly this way. In that mode we match each source document that matches a projected origin time within a specified tolerance. Specifically, let t0 be the start time extracted from the data. We then compute the projected test origin time as test_time = t0 - t0offset. Note the sign convention that a positive offset means the time t0 is after the event origin time. We then select all source records for which the time field satisfies:
source.time - tolerance <= test_time <= source.time + tolerance
The test_time value for matching from a datum can come through one of two methods driven by the constructor argument “data_time_key”. When data_time_key is None (default) the algorithm assumes all inputs are mspass atomic data objects that have the start time defined by the attribute “t0” (mspass_object.t0). If data_time_key is a string it is assumed to be a Metadata key used to fetch an epoch time to use for the test. The most likely use of that feature would be for ensemble processing where test_time is set as a field in the ensemble Metadata. Note that this form of associating source data with ensembles that are common source gathers can be much faster than the atomic version because only one match is needed per ensemble.
This implementation should be used only if the catalog of events is reasonably small. If the catalog is huge the database version may be more appropriate.
- Parameters:
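db_or_df (normally a MsPASS Database class but with this algorithm it can be the superclass from which Database is derived; a DataFrame when the source table is supplied that way, as discussed under attributes_to_load) – MongoDB database handle or DataFrame holding the source data (positional - no default)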
collection (string) – Name of MongoDB collection that is to be queried (default “source”).
t0offset (float) – constant offset from data start time that is expected as origin time. A positive t0offset means the origin time is before the data start time. Units are always assumed to be seconds.
tolerance (float) – time tolerance to test for match of origin time (see formula above for exact use). If the source estimates are exactly the same as the ones used to define the data start time this number can be a few samples. Otherwise a few seconds is safer for teleseismic data and less for local/regional events, i.e. the choice depends upon how the source estimates relate to the data.
attributes_to_load (list of string defining keys in collection documents) – list of keys of required attributes that will be returned in the output of the find method. The keys listed must ALL have defined values for all documents in the collection or some calls to find_one will fail. Default is [“_id”, “lat”, “lon”, “depth”, “time”]. Note that if constructing from a DataFrame created from something like a Datascope origin table this list will need to be changed to remove _id, as in that context no ObjectId would normally be defined. Be warned, however, that if used with a normalize function the _id may be required to match a “source_id” cross reference in a seismic data object. Also note that the list must contain the key defined by the related argument “source_time_key” as that is used to match times in the source data with data start times.
load_if_defined (list of strings defining collection keys) – list of keys of optional attributes to be extracted by find method. Any data attached to these keys will only be posted in the find return if they are defined in the database document retrieved in the query. Default is [“magnitude”]
aliases – python dictionary defining alias names to apply when fetching from a data object’s Metadata container. The key sense of the mapping is important to keep straight. The key of this dictionary should match one of the attributes in attributes_to_load or load_if_defined. The value the key defines should be the alias used to fetch the comparable attribute from the data.
prepend_collection_name (boolean) – when True attributes returned in Metadata containers by the find and find_one method will all have the collection name prepended with a (fixed) separator. For example, if the collection name is “channel” the “lat” attribute in the channel document would be returned as “channel_lat”.
require_unique_match (boolean) – controls the handling of ambiguous matches. When True find_one will throw an error if an entry it tries to match is not unique. When False find_one returns the first document found and logs a complaint message. (default is False)
data_time_key (string) – data object Metadata key used to fetch time for testing as an alternative to the data start time. If set to None (default) the test will use the start time of an atomic data object for the time test. If not None it is assumed to be a string used to fetch a time from the data’s Metadata container. That is the best way to run this matcher on Ensembles.
source_time_key (string; can also be None, which causes the internal value to be set to "time") – DataFrame column name to use as the source origin time field. Default is “time”. This key must match a key in the attributes_to_load list or the constructor will throw an exception. Note this should match the key defining origin time in the collection, not the name under which the value is commonly stored with data, i.e. normal usage is “time” not “source_time”
- find_doc(doc, starttime_key='starttime') dict [source]
Override of the find_doc method of BasicMatcher. This method acts like find_one but the inputs and outputs are different. The input to this method is a python dictionary that is normally expected to be a MongoDB document. The output is also a python dictionary, normally with a reduced set of attributes defined by self.attributes_to_load and self.load_if_defined. We need to override the base class version of this method because the base class version by default requires an atomic seismic data object (TimeSeries or Seismogram). The algorithm used is a variant of that in the subset method of this class.
This method also differs from find_one in that it has no mechanism to log errors. find_one returns a Metadata container and an ErrorLogger container used to post messages. This method will return None if there are errors that cause it to fail. That can be ambiguous because a None return is also used to indicate failure to match anything. The primary use of this method is normalizing an entire data set with the ObjectIds of source documents using the bulk_normalize function. In that case additional forensic work is possible with MongoDB to uncover why a given document match failed.
The interval match relative to a waveform start time can be ambiguous for global events: although rare, two earthquakes can easily occur within + or - self.tolerance of each other. When multiple rows of the dataframe match the interval test, the row returned is the one whose origin time is closest to the time projected from the waveform start time (using the self.t0offset value).
- Parameters:
doc (python dictionary) – wf document (i.e. a document used to construct an atomic datum) to be matched with the content of this object (assumed to be the source collection or a variant that contains source origin times).
starttime_key (str (default "starttime")) – key that can be used to fetch the waveform segment start time that is to be matched against origin times loaded in the object’s cache.
- Returns:
python dictionary of the best match or None if there is no match or in nonfatal error conditions.
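The tie-breaking rule described above can be sketched in plain Python. This is an illustration under stated assumptions, not the MsPASS implementation: the function closest_origin and the sample rows are hypothetical, and the cache is represented as a list of dictionaries rather than a DataFrame. The projected time is computed as starttime minus t0offset.

```python
def closest_origin(rows, starttime, t0offset, tolerance):
    """Pick the row whose origin time best matches the projected time.

    rows      -- list of dicts, each with a "time" key (epoch seconds)
    starttime -- waveform start time (epoch seconds)
    t0offset  -- expected lag from origin time to waveform start (seconds)
    tolerance -- half-width of the acceptance interval (seconds)
    """
    projected = starttime - t0offset
    candidates = [r for r in rows if abs(r["time"] - projected) <= tolerance]
    if not candidates:
        return None  # mirrors the "no match" return described above
    # Several rows may pass the interval test; keep the closest one.
    return min(candidates, key=lambda r: abs(r["time"] - projected))


# Hypothetical source rows; "a" and "b" both fall inside the interval.
sources = [
    {"source_id": "a", "time": 1000.0},
    {"source_id": "b", "time": 1040.0},
    {"source_id": "c", "time": 5000.0},
]
best = closest_origin(sources, starttime=1345.0, t0offset=300.0, tolerance=60.0)
print(best)
```

Here the projected time is 1045.0, so row "b" (5 s away) is preferred over row "a" (45 s away).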
- find_one(mspass_object) → tuple [source]
Override of the find_one method of DataframeMatcher. The override is necessary to handle the ambiguity of a time interval match for source origin times. That is, there is a finite probability that two earthquakes can occur within the interval of this matcher, defined by the time projected from the waveform start time (starttime - self.t0offset) + or - self.tolerance. When multiple matches are found this method handles that ambiguity by finding the source whose origin time is closest to the waveform start time corrected by self.t0offset.
Note this method normally expects input to be an atomic seismic object. It also, however, accepts any object that is a subclass of Metadata. The most important examples of that are TimeSeriesEnsemble and SeismogramEnsemble objects. For that to work, however, you MUST define a key to use to fetch a reference time in the constructor to this object via the data_time_key argument. If you then load the appropriate reference time in the ensemble’s Metadata container you can normalize a common source gather’s ensemble container within a workflow. Here is a code fragment illustrating the idea:
source_matcher = OriginTimeMatcher(db, data_time_key="origin_time")
e = db.read_data(cursor, ... read args...)  # read ensemble e
# assume we got this time (otime) some other way above
e['origin_time'] = otime
e = normalize(e, source_matcher)
If the match succeeds the attributes defined in the Dataframe cache will be loaded into the Metadata container of e. That is the definition of a common source gather.
- Parameters:
mspass_object – atomic seismic data object to be matched. The match is normally made against the datum’s t0 value, so there is an implicit assumption that the datum uses UTC epoch time. If a data set is passed through this operator and the data are in relative time, all will fail. The function intentionally avoids that test for efficiency. A plain Metadata container can be passed through mspass_object if and only if it contains a value associated with the key defined by the starttime_key attribute.
- Returns:
a tuple consistent with the BasicMatcher API definition. (i.e. pair [Metadata,ErrorLogger])
- subset(mspass_object) → DataFrame [source]
Implementation of the subset method required by inheritance from DataframeCacheMatcher. Returns a subset of the cache Dataframe with source origin times matching the definition of this object, i.e. a time interval relative to the start time defined by mspass_object. Note that if a key is given the time will be extracted from the Metadata container of mspass_object. If no key is defined (self.data_time_key == None) the t0 attribute of mspass_object will be used.
- mspasspy.db.normalize.bulk_normalize(db, wfquery=None, wf_col='wf_miniseed', blocksize=1000, matcher_list=None)[source]
This function iterates through the collection specified by db and wf_col, and runs a chain of normalization functions in serial on each document defined by the cursor returned by wfquery. It speeds updates by using the bulk methods of MongoDB. The chain also speeds updates because all matchers in matcher_list append to the update string for the same wf_col document. A typical example would be to run this function on wf_miniseed data running a matcher to set channel_id, site_id, and source_id.
- Parameters:
db – should be a MsPASS database handle containing the wf_col and the collections defined by the matcher_list list.
wf_col – The collection that needs to be normalized, default is wf_miniseed
blocksize – To speed up updates this function uses the bulk writer/updater methods of MongoDB that can be orders of magnitude faster than one-at-a-time updates. A user should not normally need to alter this parameter.
wfquery – is an optional query to apply to wf_col. The output of this query defines the list of documents that the algorithm will attempt to normalize as described above. The default (None) will process the entire collection (query set to an empty dict).
matcher_list – a list of instances of one or more subclasses of BasicMatcher. In addition to being of the required class, all instances passed through this interface must contain two required attributes: (1) collection, which defines the collection name, and (2) prepend_collection_name, a boolean that determines if the attributes loaded should have the collection name prepended (e.g. channel_id). In addition, all instances must define the find_doc method, which is not required by the BasicMatcher interface. (find_doc is comparable to find_one but uses a python dictionary as the container instead of referencing a mspass data object. find_one is the core method for inline normalization.)
- Returns:
a list with a length of len(matcher_list)+1. Element 0 is the number of documents processed in the collection (output of query). The remaining elements are the numbers of successful normalizations for the corresponding matcher instances, mapped one to one (matcher_list[x] -> ret[x+1]).
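The chaining logic and the layout of the returned list can be illustrated without a database. The sketch below is not the MsPASS implementation: ConstantMatcher and chain_matchers are hypothetical stand-ins, documents are plain dictionaries, and the accumulated updates are collected in a list instead of being sent to MongoDB in blocks of blocksize.

```python
class ConstantMatcher:
    """Hypothetical stand-in exposing the attributes bulk_normalize expects."""

    def __init__(self, collection, table, prepend_collection_name=True):
        self.collection = collection
        self.prepend_collection_name = prepend_collection_name
        self._table = table  # maps a station code to a fake document id

    def find_doc(self, doc):
        ident = self._table.get(doc.get("sta"))
        if ident is None:
            return None  # no match for this document
        key = f"{self.collection}_id" if self.prepend_collection_name else "_id"
        return {key: ident}


def chain_matchers(docs, matcher_list):
    """Run every matcher on every document; return (counts, updates)."""
    counts = [0] * (len(matcher_list) + 1)
    updates = []
    for doc in docs:
        counts[0] += 1  # element 0 counts documents processed
        update = {}
        for i, matcher in enumerate(matcher_list):
            found = matcher.find_doc(doc)
            if found is not None:
                update.update(found)  # all matchers feed one update per document
                counts[i + 1] += 1
        if update:
            updates.append((doc["_id"], update))
    return counts, updates


wf_docs = [{"_id": 1, "sta": "AAK"}, {"_id": 2, "sta": "ZZZ"}]
matchers = [
    ConstantMatcher("channel", {"AAK": "c1"}),
    ConstantMatcher("site", {"AAK": "s1", "ZZZ": "s2"}),
]
counts, updates = chain_matchers(wf_docs, matchers)
print(counts, updates)
```

Because each document receives a single merged update, adding another matcher to the chain does not add another write per document.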
- mspasspy.db.normalize.normalize(mspass_object, matcher, kill_on_failure=True)[source]
Generic function to do in line normalization with dask/spark map operator.
In MsPASS we use the normalized data model for receiver and source metadata. The normalization can be done during any reads if the data have cross-referencing ids defined as described in the User’s Manual. This function provides a generic interface to link to a normalizing collection within a workflow using a map operator applied to a dask bag or spark rdd containing a dataset of MsPASS data objects. The algorithm is made generic through the matcher argument that must point a concrete implementation of the abstract base class defined in this module as BasicMatcher.
For example, suppose we create an instance of the MiniseedMatcher using all defaults from a database handle db as follows:
matcher = MiniseedMatcher(db)
Suppose we then load data from wf_miniseed with read_distributed_data into the dask bag we will call dataset. We can normalize that data within a workflow as follows:
dataset = dataset.map(normalize,matcher)
- Parameters:
mspass_object (For all mspass matchers this must be one of the mspass data types of TimeSeries, Seismogram, TimeSeriesEnsemble, or SeismogramEnsemble. Many matchers have further restrictions. e.g. the normal use of the MiniseedMatcher using the defaults like the example insists the data received are either TimeSeries or Seismogram objects. Read the docstring carefully for your matcher choice for any limitations.) – data to be normalized
matcher (must be a concrete subclass of BasicMatcher) – a generic matching function that is a subclass of BasicMatcher. This function only calls the find_one method.
kill_on_failure (boolean) – when True, if the call to the find_one method of matcher fails the datum returned will be marked dead.
- Returns:
copy of mspass_object. Dead data are returned immediately. If kill_on_failure is True the result may be killed on return.
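The control flow described for kill_on_failure can be sketched with stand-in classes. Nothing below is MsPASS code: FakeDatum, TableMatcher, and normalize_sketch are hypothetical, the built-in map replaces the dask/spark map operator, and the sketch modifies the datum in place rather than returning a copy.

```python
from functools import partial


class FakeDatum(dict):
    """Hypothetical stand-in for a MsPASS data object with a live flag."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.live = True

    def kill(self):
        self.live = False


class TableMatcher:
    """Hypothetical matcher; find_one returns (attributes or None, log)."""

    def __init__(self, table):
        self.table = table

    def find_one(self, datum):
        found = self.table.get(datum.get("sta"))
        log = [] if found is not None else ["no match for " + str(datum.get("sta"))]
        return found, log


def normalize_sketch(datum, matcher, kill_on_failure=True):
    if not datum.live:
        return datum  # dead data pass through untouched
    found, _log = matcher.find_one(datum)
    if found is None:
        if kill_on_failure:
            datum.kill()
    else:
        datum.update(found)  # load the matched attributes
    return datum


table_matcher = TableMatcher({"AAK": {"channel_lat": 42.6}})
data = [FakeDatum(sta="AAK"), FakeDatum(sta="ZZZ")]
result = list(map(partial(normalize_sketch, matcher=table_matcher), data))
print([(d.get("sta"), d.live) for d in result])
```

The second datum has no entry in the table, so it comes back marked dead while the first gains the matched attribute.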
- mspasspy.db.normalize.normalize_mseed(db, wfquery=None, blocksize=1000, normalize_channel=True, normalize_site=True)[source]
In MsPASS the standard support for station information is stored in two collections called “channel” and “site”. When normalized with channel collection data a miniseed record can be associated with station metadata downloaded by FDSN web services and stored previously with MsPASS database methods. The default behavior tries to associate each wf_miniseed document with an entry in “site”. In MsPASS site is a smaller collection intended for use only with data already assembled into three component bundles we call Seismogram objects.
For both channel and site the association algorithm used assumes the SEED convention wherein the strings stored with the keys “net”, “sta”, “chan”, and (optionally) “loc” define a unique channel of data registered globally through the FDSN. The algorithm then need only query for a match of these keys and a time interval match with the start time of the waveform defined by each wf_miniseed document. The only distinction in the algorithm between site and channel is that “chan” is not used in site since by definition site data refer to common attributes of one seismic observatory (commonly also called a “station”).
- Parameters:
db – should be a MsPASS database handle containing at least wf_miniseed and the channel and/or site collections to be used for normalization.
blocksize – To speed up updates this function uses the bulk writer/updater methods of MongoDB that can be orders of magnitude faster than one-at-a-time updates for setting channel_id and site_id. A user should not normally need to alter this parameter.
wfquery – is a query to apply to wf_miniseed. The output of this query defines the list of documents that the algorithm will attempt to normalize as described above. The default will process the entire wf_miniseed collection (query set to an empty dict).
normalize_channel – boolean for handling the channel collection. When True (default) matches will be attempted with the channel collection and when matches are found the associated channel document id will be set in the associated wf_miniseed document as channel_id.
normalize_site – boolean for handling the site collection. When True (default) matches will be attempted with the site collection and when matches are found the associated site document id will be set in the wf_miniseed document as site_id. Note at least one of the two booleans normalize_channel and normalize_site must be set True or the function will immediately abort.
- Returns:
list with three integers. Element 0 is the number of documents processed in wf_miniseed (output of query), element 1 is the number with channel ids set, and element 2 is the number with site ids set. Element 1 or 2 will be 0 if normalization for that collection was set False.
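The matching rule described above (equal SEED codes plus a time interval test on the waveform start time) can be sketched in plain Python. This is an illustration, not the query MsPASS issues: match_station_metadata and the sample documents are hypothetical, the “starttime” and “endtime” keys on the station documents are assumed, and “loc” is treated as optional by testing it only when the waveform document defines it.

```python
def match_station_metadata(wf_doc, candidates, use_chan=True):
    """Return candidate documents matching the SEED codes and start time.

    use_chan=True mimics a channel match; use_chan=False mimics a site
    match, where "chan" is ignored.
    """
    keys = ["net", "sta"]
    if use_chan:
        keys.append("chan")
    if "loc" in wf_doc:
        keys.append("loc")
    t0 = wf_doc["starttime"]
    return [
        doc
        for doc in candidates
        if all(doc.get(k) == wf_doc[k] for k in keys)
        and doc["starttime"] <= t0 <= doc["endtime"]
    ]


# Hypothetical channel documents: two epochs of BHZ and one of BHN.
channels = [
    {"_id": "old", "net": "IU", "sta": "ANMO", "chan": "BHZ", "loc": "00",
     "starttime": 0.0, "endtime": 100.0},
    {"_id": "new", "net": "IU", "sta": "ANMO", "chan": "BHZ", "loc": "00",
     "starttime": 200.0, "endtime": 1000.0},
    {"_id": "other", "net": "IU", "sta": "ANMO", "chan": "BHN", "loc": "00",
     "starttime": 200.0, "endtime": 1000.0},
]
wf = {"net": "IU", "sta": "ANMO", "chan": "BHZ", "loc": "00", "starttime": 250.0}
matches = match_station_metadata(wf, channels)
print([m["_id"] for m in matches])
```

The time interval test is what separates the two BHZ epochs; without it the same codes would match both documents.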
schema
Tools to define the schema of Metadata.
- class mspasspy.db.schema.DBSchemaDefinition(schema_dic, collection_str)[source]
Bases:
SchemaDefinitionBase
- add(name, attr)[source]
Add a new entry to the definitions. Note that because the internal container is a dict, if an attribute for name is already present it will be silently replaced.
- Parameters:
name (str) – The name of the attribute to be added
attr (dict) – A dictionary that defines the property of the added attribute. Note that the type must be defined.
- Raises:
mspasspy.ccore.utility.MsPASSError – if type is not defined in attr
- data_type()[source]
Return the data type that the collection is used to reference. If not recognized, it returns None.
- Returns:
type of data associated with the collection
- Return type:
type
- reference(key)[source]
Return the collection name that a key is referenced from.
- Parameters:
key (str) – the name of the key
- Returns:
the name of the collection
- Return type:
str
- Raises:
mspasspy.ccore.utility.MsPASSError – if the key is not defined
- class mspasspy.db.schema.DatabaseSchema(schema_file=None)[source]
Bases:
SchemaBase
- default(name)[source]
Return the schema definition of a default collection.
This method is used when multiple collections of the same concept are defined. For example, wf_TimeSeries and wf_Seismogram are both collections that are used for data objects (characterized by their common wf prefix). The Database API needs a default wf collection to operate on when no collection name is explicitly given. In this case, this method can be used. Note that if the requested name has no default collection defined and it is a defined collection, it will treat that collection itself as the default.
- Parameters:
name (str) – The requested default collection
- Returns:
the schema definition of the default collection
- Return type:
- Raises:
mspasspy.ccore.utility.MsPASSError – if the name has no default defined
- default_name(name)[source]
Return the name of a default collection.
This method behaves similarly to the default method, but it only returns the name as a string instead.
- Parameters:
name (str) – The requested default collection
- Returns:
the name of the default collection
- Return type:
str
- Raises:
mspasspy.ccore.utility.MsPASSError – if the name has no default defined
- set_default(collection: str, default: str | None = None)[source]
Set a collection as the default.
This method is used to change the default collections (e.g., switching between wf_TimeSeries and wf_Seismogram). If default is not given, it will try to infer one from collection at the first occurrence of “_” (e.g., wf_TimeSeries will become wf).
- Parameters:
collection (str) – The name of the target collection
default (str, optional) – the default name to be set to
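The inference rule for the default name can be sketched in one line of Python. This is an assumption-laden illustration, not the MsPASS source: infer_default is a hypothetical helper, and what the real method does when the collection name contains no “_” is not stated here (the sketch simply returns the whole name).

```python
def infer_default(collection):
    """Return the text before the first underscore in collection."""
    return collection.split("_", 1)[0]


inferred = infer_default("wf_TimeSeries")
print(inferred)
```

Under this rule wf_TimeSeries, wf_Seismogram, and wf_miniseed all map to the same default name, wf.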
- class mspasspy.db.schema.MDSchemaDefinition(schema_dic, collection_str, dbschema)[source]
Bases:
SchemaDefinitionBase
- add(name, attr)[source]
Add a new entry to the definitions. Note that because the internal container is a dict, if an attribute for name is already present it will be silently replaced.
- Parameters:
name (str) – The name of the attribute to be added
attr (dict) – A dictionary that defines the property of the added attribute. Note that the type must be defined.
- Raises:
mspasspy.ccore.utility.MsPASSError – if type is not defined in attr
- collection(key)[source]
Return the collection name that a key belongs to
- Parameters:
key (str) – the name of the key
- Returns:
the name of the collection
- Return type:
str
- readonly(key)[source]
Check if an attribute is marked readonly.
- Parameters:
key (str) – key to be tested
- Returns:
True if the key is readonly or its readonly attribute is not defined, else return False
- Return type:
bool
- Raises:
mspasspy.ccore.utility.MsPASSError – if the key is not defined
- set_collection(key, collection, dbschema=None)[source]
Set the collection name that a key belongs to. It optionally takes a dbschema argument and will set the attribute of that key with the corresponding one defined in the dbschema.
- Parameters:
key (str) – the name of the key
collection (str) – the name of the collection
dbschema (mspasspy.db.schema.DatabaseSchema) – the database schema used to set the attributes of the key.
- Raises:
mspasspy.ccore.utility.MsPASSError – if the key is not defined
- set_readonly(key)[source]
Lock an attribute to assure it will not be saved.
Parameters can be defined readonly. That is a standard feature of this class, but is normally expected to be set on construction of the object. There are sometimes reasons to lock out a parameter to keep it from being saved in output. This method allows this. On the other hand, use this feature only if you fully understand the downstream implications or you may experience unintended consequences.
- Parameters:
key (str) – the key for the attribute with properties to be redefined
- Raises:
mspasspy.ccore.utility.MsPASSError – if the key is not defined
- set_writeable(key)[source]
Force an attribute to be writeable.
Normally some parameters are marked readonly on construction to avoid corrupting the database with inconsistent data defined with a common key. (e.g. sta) This method overrides such definitions for any key so marked. This method should be used with caution as it could have unintended side effects.
- Parameters:
key (str) – the key for the attribute with properties to be redefined
- Raises:
mspasspy.ccore.utility.MsPASSError – if the key is not defined
- swap_collection(original_collection, new_collection, dbschema=None)[source]
Swap the collection name of all the keys of a matching collection. It optionally takes a dbschema argument and will set the attribute of that key with the corresponding one defined in the dbschema. It will silently do nothing if no matching collection is defined.
- Parameters:
original_collection (str) – the name of the collection to be swapped
new_collection (str) – the name of the collection to be changed into
dbschema (mspasspy.db.schema.DatabaseSchema) – the database schema used to set the attributes of the key.
- writeable(key)[source]
Check if an attribute is writeable. Inverted logic from the readonly method.
- Parameters:
key (str) – key to be tested
- Returns:
True if the key is not readonly; False if the key is readonly or its readonly attribute is not defined
- Return type:
bool
- Raises:
mspasspy.ccore.utility.MsPASSError – if the key is not defined
- class mspasspy.db.schema.MetadataSchema(schema_file=None)[source]
Bases:
SchemaBase
- class mspasspy.db.schema.SchemaDefinitionBase[source]
Bases:
object
- add(name, attr)[source]
Add a new entry to the definitions. Note that because the internal container is a dict, if an attribute for name is already present it will be silently replaced.
- Parameters:
name (str) – The name of the attribute to be added
attr (dict) – A dictionary that defines the property of the added attribute. Note that the type must be defined.
- Raises:
mspasspy.ccore.utility.MsPASSError – if type is not defined in attr
- add_alias(key, aliasname)[source]
Add an alias for key
- Parameters:
key (str) – key to be added
aliasname (str) – aliasname to be added
- aliases(key)[source]
Get a list of aliases for a given key.
- Parameters:
key (str) – The unique key that has aliases defined.
- Returns:
A list of aliases associated to the key.
- Return type:
list
- apply_aliases(md, alias)[source]
Apply a set of aliases to a data object.
This method will change the unique keys of a data object into aliases. The alias argument can either be a path to a valid yaml file of key:alias pairs or a dict. If the “key” is an alias itself, it will be converted to its corresponding unique name before being used to change to the alias. It will also add the applied alias to the schema’s internal alias container such that the same schema object can be used to convert the alias back.
- Parameters:
md (mspasspy.ccore.utility.Metadata) – Data object to be altered. Normally a mspasspy.ccore.seismic.Seismogram or mspasspy.ccore.seismic.TimeSeries but can be a raw mspasspy.ccore.utility.Metadata.
alias (dict/str) – a yaml file or a dict that has pairs of key:alias
- clear_aliases(md)[source]
Restore any aliases to unique names.
Aliases are needed to support legacy packages, but can cause downstream problem if left intact. This method clears any aliases and sets them to the unique_name defined by this object. Note that if the unique_name is already defined, it will silently remove the alias only.
- Parameters:
md (mspasspy.ccore.utility.Metadata) – Data object to be altered. Normally a mspasspy.ccore.seismic.Seismogram or mspasspy.ccore.seismic.TimeSeries but can be a raw mspasspy.ccore.utility.Metadata.
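The round trip performed by apply_aliases and clear_aliases can be pictured with plain dictionaries. The functions below are hypothetical sketches, not the MsPASS methods: the Metadata container is replaced by a dict, and the alias map is passed explicitly instead of being stored in the schema object.

```python
def apply_aliases_sketch(md, alias_map):
    """Rename unique keys in md to their aliases (in place)."""
    for unique, alias in alias_map.items():
        if unique in md:
            md[alias] = md.pop(unique)


def clear_aliases_sketch(md, alias_map):
    """Restore aliases in md to their unique names (in place)."""
    for unique, alias in alias_map.items():
        if alias in md:
            value = md.pop(alias)
            # If the unique name is already present, only the alias is removed.
            md.setdefault(unique, value)


md = {"sta": "AAK", "npts": 1000}
alias_map = {"sta": "site.sta"}  # hypothetical key:alias pair

apply_aliases_sketch(md, alias_map)
aliased = dict(md)  # snapshot of the aliased state

clear_aliases_sketch(md, alias_map)
print(aliased, md)
```

Keeping the key:alias pairs available after they are applied is what makes the conversion back to unique names possible.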
- concept(key)[source]
Return a description of the concept this attribute defines.
- Parameters:
key (str) – The name that defines the attribute of interest
- Returns:
A string with a terse description of the concept this attribute defines
- Return type:
str
- Raises:
mspasspy.ccore.utility.MsPASSError – if concept is not defined
- constraint(key)[source]
Return a description of the constraint this attribute defines.
- Parameters:
key (str) – The name that defines the attribute of interest
- Returns:
A string with a terse description of the constraint this attribute defines
- Return type:
str
- has_alias(key)[source]
Test if a key has registered aliases
Sometimes it is helpful to have alias keys to define a common concept. For instance, if an attribute is loaded from a relational db one might want to use alias names of the form table.attribute as an alias to attribute. has_alias should be called first to establish if a name has an alias. To get a list of aliases call the aliases method.
- Parameters:
key (str) – key to be tested
- Returns:
True if the key has aliases; False if it has none or if the key is not defined
- Return type:
bool
- is_alias(key)[source]
Test if a key is a registered alias
This asks the inverse question to has_alias. That is, it yields true if the key is registered as a valid alias. It returns false if the key is not defined at all. Note it will yield false if the key is a registered unique name and not an alias.
- Parameters:
key (str) – key to be tested
- Returns:
True if the key is an alias, else return False
- Return type:
bool
- is_defined(key)[source]
Test if a key is defined either as a unique key or an alias
- Parameters:
key (str) – key to be tested
- Returns:
True if the key is defined
- Return type:
bool
- is_normal(key)[source]
Test if the constraint of the key is normal to the schema
- Parameters:
key (str) – key to be tested
- Returns:
True if the constraint of the key is normal
- Return type:
bool
- is_optional(key)[source]
Test if the constraint of the key is optional to the schema
- Parameters:
key (str) – key to be tested
- Returns:
True if the constraint of the key is optional
- Return type:
bool
- is_required(key)[source]
Test if the constraint of the key is required to the schema
- Parameters:
key (str) – key to be tested
- Returns:
True if the constraint of the key is required
- Return type:
bool
- is_xref_key(key)[source]
Test if the constraint of the key is xref_key to the schema
- Parameters:
key (str) – key to be tested
- Returns:
True if the constraint of the key is xref_key
- Return type:
bool
- keys()[source]
Get a list of all the unique keys defined.
- Returns:
list of all the unique keys defined
- Return type:
list
- remove_alias(alias)[source]
Remove an alias. Will silently ignore if alias is not found
- Parameters:
alias (str) – alias to be removed
- required_keys()[source]
Return all the required keys in the current collection as a list
- Returns:
list of all the required keys in the current collection
- Return type:
a list of str
- type(key)[source]
Return the type of an attribute. If not recognized, it returns None.
- Parameters:
key (str) – The name that defines the attribute of interest
- Returns:
type of the attribute associated with key
- Return type:
- unique_name(aliasname)[source]
Get definitive name for an alias.
This method is used to ask the opposite question as aliases. The aliases method returns all acceptable alternatives to a definitive name defined as the key to get said list. This method asks what definitive key should be used to fetch an attribute. Note that if the input is already the unique name, it will return itself.
- Parameters:
aliasname (str) – the name of the alias for which we want the definitive key
- Returns:
the name of the definitive key
- Return type:
str
- Raises:
mspasspy.ccore.utility.MsPASSError – if aliasname is not defined
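The relationship between has_alias, is_alias, aliases, and unique_name can be summarized with a small two-way lookup. The class below is a hypothetical sketch and not the MsPASS implementation: AliasRegistry is an invented name, and it raises KeyError where the real methods are documented to raise MsPASSError.

```python
class AliasRegistry:
    """Hypothetical two-way map between unique keys and their aliases."""

    def __init__(self, keys):
        self._aliases = {key: [] for key in keys}  # unique key -> aliases
        self._unique = {}  # alias -> unique key

    def add_alias(self, key, aliasname):
        self._aliases[key].append(aliasname)
        self._unique[aliasname] = key

    def has_alias(self, key):
        # False for keys without aliases and for undefined keys.
        return bool(self._aliases.get(key))

    def is_alias(self, key):
        return key in self._unique

    def aliases(self, key):
        return list(self._aliases[key])

    def unique_name(self, aliasname):
        if aliasname in self._aliases:
            return aliasname  # already the unique name
        if aliasname in self._unique:
            return self._unique[aliasname]
        raise KeyError(aliasname + " is not defined")


registry = AliasRegistry(["sta", "lat"])
registry.add_alias("sta", "site.sta")
print(registry.aliases("sta"), registry.unique_name("site.sta"))
```

has_alias and aliases answer questions about a unique key, while is_alias and unique_name answer the inverse questions about an alias.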