Cleaning Inconsistent Metadata
Gary L. Pavlis
Concepts
One of the strengths of MsPASS as a framework for research computing is the
Metadata container, which is conceptually nearly identical to a python
dictionary. That is, a Metadata container stores attributes accessible
via a string-valued key using constructs like the following: x = d['sta'].
The name Metadata was chosen as a way to more clearly handle the kinds
of attributes that define what the
modern concept of “Metadata” means. The dark side of the flexibility
of containers like Metadata or a python dictionary is that the information
they contain can easily become stale and/or inconsistent with the data
with which they are associated. A common case in MsPASS is when
a set of three TimeSeries objects is run through the
mspasspy.algorithms.bundle.bundle()
function to create a Seismogram
object. That function leaves debris from cloning its parents:
attributes related to the single-channel inputs such as those with
keys “chan”, “loc”, “channel_id”, “channel_lat”, etc. Those attributes
are inconsistent with the concept of a Seismogram, which is defined
as an assembled bundle of three channels. The keys are inconsistent
because the data are defined by three different components, while each
of those attributes refers to only one of the three.
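To make the problem concrete, here is a minimal sketch (the symbol d3c
is a hypothetical Seismogram assumed to have been created by bundle())
that uses the is_defined method of the Metadata container to detect
such leftovers:
# d3c is assumed to be a Seismogram created by bundle()
for key in ["chan", "loc", "channel_id", "channel_lat"]:
    # is_defined tests if a key-value pair is set in the Metadata container
    if d3c.is_defined(key):
        print(key, "=", d3c[key], " <- stale single-channel attribute")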
The specific issue created by the
mspasspy.algorithms.bundle.bundle()
function
could have been solved by appropriate modifications to that function.
The problem of functions creating stale/inconsistent metadata,
however, is ubiquitous. For that reason we developed
a generic solution in MsPASS: a metadata cleaner
class with the mnemonic name mspasspy.util.Janitor.Janitor.
The class has methods to clear inconsistent Metadata from data
during processing and thereby reduce the volume of junk attributes stored in
the database. The remainder of this section uses examples to illustrate
the use of an instance of mspasspy.util.Janitor.Janitor.
Janitor class
An instance of a Janitor
can be thought of as a robotic cleaner that
keeps only the attributes it is told to keep. A physical analogy
is a janitor instructed to clean out an office and retain only the
papers, pens, and pencils on a desk, throwing everything else away. For
seismic processing an instance of a Janitor
is told what keys to
retain in any Metadata container. Any key not defined in that list is
treated as junk/trash.
mspasspy.util.Janitor.Janitor
has three processing methods
that can be used to handle junk differently
(a usage sketch follows this list):

- mspasspy.util.Janitor.Janitor.clean()
  silently discards all attributes not in the list of “keepers”.
  It returns an edited copy of the input.
- mspasspy.util.Janitor.Janitor.bag_trash()
  does not discard the attributes it treats as trash but bundles them
  up into a python dictionary (the bag), removes the originals, and
  posts the trash bag content with a user-defined key. It returns an
  edited copy of the input with the trash attributes removed and
  placed in the bag python dictionary.
- mspasspy.util.Janitor.Janitor.collect_trash()
  is best thought of as a lower-level function most users are unlikely
  to need. Unlike the other methods of Janitor,
  it returns the python dictionary that is the trash bag. It is called
  by mspasspy.util.Janitor.Janitor.bag_trash()
  to create the python dictionary it posts back to the datum it is
  handling. This method exists largely to allow alternative ways to
  handle the trash.
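As a usage sketch contrasting clean with bag_trash, the fragment below
(using the same pytest helper function as the Basic usage example later
in this section) shows how bag_trash preserves the junk attributes
instead of discarding them:
from mspasspy.util.Janitor import Janitor
# this helper resolves only when run with pytest
from helper import get_live_timeseries

janitor = Janitor()
datum = get_live_timeseries()
# add a key-value pair not in the keepers list
datum["foo"] = "bar"
# bag_trash removes "foo" from the normal Metadata but preserves it
# inside a trash-bag dictionary posted back with a user-defined key
datum = janitor.bag_trash(datum)
assert "foo" not in datum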
Note that in all cases a Janitor
must be
instantiated before it can be used. The constructor initializes
the list of “keepers” for each of the seismic data types. The default
constructor reads from a yaml format file found in the mspass source
code tree in data/yaml. The default list is easily changed by
creating a new yaml format file and passing that file name to the
class constructor via the keepers_file
argument. See the docstring for
mspasspy.util.Janitor.Janitor
for details.
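For example, the default keepers list can be replaced with a custom
one as sketched below; the file name my_keepers.yaml is hypothetical
and such a file would need to follow the format of the default file
in data/yaml:
from mspasspy.util.Janitor import Janitor

# default constructor loads the keepers lists from the yaml file
# distributed with the mspass source code
default_janitor = Janitor()
# construct from a custom yaml file (my_keepers.yaml is hypothetical)
custom_janitor = Janitor(keepers_file="my_keepers.yaml")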
Examples
Usage
The examples below illustrate a range of applications of the
Janitor
class. None of them are complete python scripts
that can be run as is because all have incomplete initializations. The
examples
are intended as starting points to aid development of workflows using
this class.
Basic usage
This is a trivial example that is a variant of
part of the pytest module used for this
class. It uses a test function to generate a TimeSeries
object with zeros in the data array and minimal Metadata.
The script adds a Metadata key-value pair not in the keepers list,
using the “foo”-“bar” construct seen in many tutorials. The assert
statements verify that the clean
method clears the debris:
from mspasspy.util.Janitor import Janitor
# this will resolve only when run with pytest
from helper import get_live_timeseries
# default constructor; assign the instance to the symbol cleaner
cleaner = Janitor()
datum = get_live_timeseries()
# assign a metadata key-value pair not in keepers list of cleaner
datum["foo"] = "bar"
assert "foo" in datum
datum = cleaner.clean(datum)
assert "foo" not in datum
Application to ensembles
The second example below demonstrates an ambiguity the Janitor
has to handle. That is, with an ensemble object there are two things
the cleaner may have to take care of: (1) the Metadata
container for
the ensemble object itself, and (2) the contents of the atomic data
that are bundled in the “ensemble”. Which of the two is cleaned is
controlled by the constructor, and the difference in usage is
shown in this example:
from mspasspy.util.Janitor import Janitor
# this will resolve only if run with pytest
from helper import get_live_timeseries_ensemble
from mspasspy.ccore.seismic import TimeSeriesEnsemble
# generate a junk ensemble with 3 members using helper function
e1 = get_live_timeseries_ensemble(3)
# add undefined key-value pair to each ensemble member
for i in range(len(e1.member)):
    e1.member[i]["foo"] = "bar"
# add an invalid key to the ensemble's metadata
e1["badkey"] = "badvalue"
# use the copy constructor for this object from C++ bindings as best practice
e_save = TimeSeriesEnsemble(e1)
# default Janitor behavior cleans members
member_cleaner = Janitor()
e1 = member_cleaner.clean(e1)
# removes foo from members
for d in e1.member:
assert "foo" not in d
# does not alter ensemble Metadata container
assert "badkey" in e1
assert e1["badkey"] == "badvalue"
# Now create an instance that does the opposite.
ensemble_cleaner = Janitor(process_ensemble_members=False)
# restore the working copy from the saved original
e1 = TimeSeriesEnsemble(e_save)
# note the asserts below all have the reverse logic of above
e1 = ensemble_cleaner.clean(e1)
# now foo is still in all members
for d in e1.member:
assert "foo" in d
# Now the ensemble has badkey cleared
assert "badkey" not in e1
Miniseed Data
The MsPASS indexing function for miniseed data loads the common content of
miniseed packet headers and several computed quantities like
start time and end time. Some of those, like the “format” attribute,
which in this case is always “miniseed”,
are examples of attributes that become inconsistent
with the data once a TimeSeries
object is constructed from
a miniseed file or file image. Because miniseed data are the
starting point for most seismology workflows, there is a special
subclass of mspasspy.util.Janitor.Janitor
called
mspasspy.util.Janitor.MiniseedJanitor
. It differs
only in the initialization where the default yaml file is specialized
for reading from raw miniseed data. This class should only be used
immediately after reading from wf_miniseed records. The following
is a sketch of a typical algorithm:
from mspasspy.util.Janitor import MiniseedJanitor
... additional initializations ...
janitor = MiniseedJanitor()
# assumes symbol db is a database handle constructed earlier
cursor = db.wf_miniseed.find({})
for doc in cursor:
    d = db.read_data(doc, collection="wf_miniseed")
    d = janitor.clean(d)
    ... additional processing functions ...
Parallel workflow
This is a sketch of a code segment illustrating the use of a
Janitor
in a parallel workflow. The example reads
a collection of TimeSeriesEnsembles
, runs the
mspasspy.algorithms.bundle.bundle()
function to
convert each to a SeismogramEnsemble
, and then
applies an instance of Janitor
before saving the results.
... Initialization code would go above this point ...
janitor = Janitor()
# generate a list of queries defining all common source gathers
# defined in the data set
srcids=db.wf_TimeSeries.distinct("source_id")
queries=list()
for sid in srcids:
    queries.append({"source_id": sid})
# parallel job using parallel reader and writer
mydata = read_distributed_data(queries,collection="wf_TimeSeries")
mydata = mydata.map(rotate_to_standard)
mydata = mydata.map(bundle)
mydata = mydata.map(janitor.clean)
saved_ids = write_distributed_data(
    mydata,
    db,
    collection="wf_Seismogram",
    data_are_atomic=False,
)