Importing Tabular Data
Prof. Gary L. Pavlis
Concepts
Few seismologists and surprisingly few "IT professionals" appreciate
the full significance of the
generic concept encapsulated by the term Tabular Data
in the title of
this section. Tabular data are ubiquitous in the modern IT world,
largely because of their intimate connection with relational databases.
In fact, it is worth stressing that a "relation" in database jargon
means "table".
MsPASS supports mechanisms to import/export the following types of tabular data:
- A large range of file-based formats. Common examples include comma-separated value (csv) files, Excel spreadsheets, most white-space-separated text files, and various legacy fixed-format tabular data files.
- Query output from relational database servers. The standard query language for such servers is "Structured Query Language" (SQL), and the servers themselves are commonly referred to by the jargon term "SQL server". SQL queries return tabular data that is readily handled in MsPASS with common python packages.
- A special form of relational database in the seismology community is what is most properly called a "Datascope database" but is commonly referred to as an "Antelope database". Technically "Datascope" is a component of the larger "Antelope" software suite. Those tables are important because a number of regional seismic networks use Antelope as their database engine. Furthermore, the Array Network Facility of the Earthscope project used Antelope and the database tables are still accessible at the AUG web site here. In any case, the key concept to remember is that Antelope/Datascope is just a different implementation of a relational database. Below we describe a special interface in MsPASS to import data from Datascope tables.
In MsPASS the central concept used to unify these diverse sources is the
concept of a DataFrame
. :code:`DataFrame`s are THE data structure that is the
focus of the commonly used
`pandas <https://pandas.pydata.org/docs/>`__ python package.
The name does not come from a fuzzy animal but is apparently
an acronym derived from "PANel DAta"; "panel" is, in fact,
an odd synonym for "table". Pandas and its extensions in
dask and pyspark provide an API to manipulate tabular data.
Because both dask and pyspark have DataFrame implementations,
it is particularly natural that the way we suggest to manipulate
tabular data in MsPASS is through the DataFrame API.
All “imports” below can be thought of as ways to convert a table
stored some other way to a DataFrame.
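To make that abstraction concrete, here is a minimal sketch, using a tiny made-up table, of how the same data appear through the pandas and dask DataFrame APIs. The column names and values are invented for illustration only and have no connection to any MsPASS schema:
import pandas as pd
import dask.dataframe as ddf

# a tiny, made-up table expressed as a pandas DataFrame
df = pd.DataFrame({'sta': ['AAK', 'ABKT', 'AAK'],
                   'phase': ['P', 'P', 'S'],
                   'time': [1546300800.0, 1546300815.2, 1546300830.7]})
# the same table as a dask DataFrame; the API is nearly identical but the
# data can be partitioned across the workers of a cluster
ddf_table = ddf.from_pandas(df, npartitions=1)
print(df.head())
print(ddf_table.head())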
Import/Export Tables Stored in Files
Importing data from any common tabular data format I know of is
essentially a solved problem through the pandas DataFrame
API.
The documentation on reading tabular data files can be found here. There are also writers for most of the same formats documented on that same page.
Dask has a more limited set of readers described here. The reason is that the large-data model of a dask DataFrame is most applicable when the table is large compared to the memory space of a single node. Something like an Excel spreadsheet can never be expected to hold gigantic tables. In contrast, a huge csv file is easy to construct from many identical pieces using the standard unix "cat" command.
Pyspark has similar functionality, but a very different API than pandas and dask. The documentation for the read/write interface can be found here. The list of formats pyspark can read or write is similar to pandas.
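As a hedged illustration of the parallel APIs, the fragment below sketches how a csv file like the one used in the next example could be read with dask and with pyspark. The file name is just a placeholder, and a working SparkSession is assumed to be available:
import dask.dataframe as ddf
from pyspark.sql import SparkSession

# dask mirrors the pandas read_csv call but returns a partitioned DataFrame
dask_df = ddf.read_csv('./data/picks.csv')
# pyspark attaches its readers to a SparkSession rather than a module function
spark = SparkSession.builder.appName('tabular_import').getOrCreate()
spark_df = spark.read.csv('./data/picks.csv', header=True, inferSchema=True)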
The most common requirement for reading tabular data is the need to import some nonstandard data from a research application. There are some common examples in seismology, like the CMT catalog, that won't work with a standard reader. Most tabular data downloadable on the internet today, however, is stored in one of the standard formats. For illustration, here is an example extracted from our jupyter notebook tutorial on MongoDB. It shows how one can import the output of PhaseNet with its output structured as a csv file. It also shows how the results can be written to MongoDB in a collection it creates called "arrival":
import pandas as pd

# read the PhaseNet csv output into a pandas DataFrame
df = pd.read_csv('./data/picks.csv')
# convert the DataFrame to a list of python dictionaries (documents)
doclist = df.to_dict('records')
# station_id is a '.' separated string; the first two fields are net and sta
for doc in doclist:
    station_id = doc['station_id']
    slist = station_id.split('.')
    net = slist[0]
    sta = slist[1]
    doc['sta'] = sta
    doc['net'] = net
# db is assumed to be the MongoDB database handle defined earlier in the tutorial
insert_many_output = db.arrival.insert_many(doclist)
Note that the somewhat awkward use of the python string split method is a necessary evil imposed by the file format. See the tutorial for a more extensive explanation of this example.
Import/Export Data from an SQL server
Importing data from an SQL server or writing a DataFrame to an
SQL server are best thought of as different methods of the
DataFrame implementation. e.g. if the data in the example above
had been stored on an SQL server you would change the line
df = pd.read_csv('./data/picks.csv')
to use the :code:`read_sql` variant
for interacting with an SQL server. See the links above to the
IO sections for pandas, dask, and pyspark for details on the
correct incantation for your needs.
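For example, a minimal, hedged sketch of the pandas form of that idea is shown below. It uses :code:`read_sql` with a SQLAlchemy engine; the connection string, query, and table names are placeholders you would need to replace with values appropriate for your server:
import pandas as pd
from sqlalchemy import create_engine

# placeholder connection string; replace with the URL for your SQL server
engine = create_engine('postgresql://user:password@dbhost:5432/mydatabase')
# read_sql accepts either a table name or an SQL query string
df = pd.read_sql('SELECT * FROM picks', engine)
# the inverse operation writes a DataFrame to a server-side table
df.to_sql('picks_copy', engine, if_exists='replace', index=False)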
Import/Export of Data from Datascope
Although the Datascope (Antelope) database is an implementation of a relational database, it does not work with the SQL API of pandas, dask, and pyspark because Datascope does not speak SQL. On the other hand, Datascope is heavily used in seismology as a means to create earthquake catalogs from local earthquake networks. It also has a set of useful tools that complement MsPASS. Therefore, there are a lot of known ways that the Antelope software can be used in combination with MsPASS to handle novel research problems. As a result, we created a special API to interact with data managed by an Antelope database. The MsPASS API aims to loosely mimic core SQL functionality using pandas :code:`DataFrame`s as the intermediary.
A python class called
DatascopeDatabase
is used as the agent for interacting with an
Antelope/Datascope database. Unlike all common SQL
databases of today, Datascope does not use a client-server model
to manage the data it contains. It is an example of what is
commonly called a "flat-file database" because the tables it manages
are stored in text files. That is relevant for this discussion
only because of how those files are named. In
Datascope the file names are special and take the form
dbname.table. Here dbname is the name of the collection
of tables that makes up the "database": the "database name". As the
word implies, table is the schema name for the particular table
that the file contains. For example, if one sees a
Datascope table with the file name "usarray.arrival" that file
is the "arrival" table in a database someone chose to call "usarray".
With that background, an instance of a
DatascopeDatabase
can be created with a variant of the following code fragment:
from mspasspy.preprocessing.css30.datascope import DatascopeDatabase
dsdb = DatascopeDatabase("usarray")
where "usarray" is the "name" used for this database instance.
Experienced Datascope users will know that Datascope has a useful,
albeit confusing, feature that allows the collection of
files that define the database to be spread through multiple directories.
That feature is nearly always exploited, in practice, by placing
more static tables in a separate directory. For that reason
DatascopeDatabase
has an optional dir argument to point the constructor to read
data files from a different directory. e.g. a variant of the
above example to access files in a “dbmaster” (common practice)
directory is the following:
dsdbm = DatascopeDatabase("usarray",dir="~/dbmaster")
Once an instance of
DatascopeDatabase
is created that points to the directory from which you want to import
one or more tables, the usage to fetch the data that table contains
is similar to that for the pandas SQL readers. Use the
get_table
method of
DatascopeDatabase
to retrieve individual tables from the Datascope database
as a pandas DataFrame. An important option described in the
docstring is a python list passed via the optional argument
with key attributes_to_load. The default loads all attributes the css3.0
schema defines for the table. Use a list to limit what attributes are retrieved.
That is frequently desirable as all CSS3.0 tables have attributes that
are often or nearly always null.
The following example shows a typical use of the
get_table
method. This example retrieves the coordinate data from
the usarray "site" table using the handle dsdbm created with
the code line above:
coordlist = ['sta','ondate','offdate','lat','lon','elev']
df = dsdbm.get_table("site",attributes_to_load=coordlist)
The result could be used for normalization to load coordinates by station name. (In reality there are some additional complications related to the time fields and seed station codes. Those, however, are a side issue that would only confuse the topic of discussion, so I ignore them here.)
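As one hedged illustration of that use, the fragment below follows the same pattern as the PhaseNet example above and dumps the site DataFrame into a MongoDB collection called "site". It assumes df is the DataFrame returned by get_table above and db is a database handle like the one used earlier; as noted, the ondate/offdate fields would need additional handling for real normalization work:
# convert the site DataFrame to a list of documents and save them to MongoDB
doclist = df.to_dict('records')
db.site.insert_many(doclist)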
The inverse of the
get_table
method is the
df2table
method. As its name implies, it attempts to write a pandas
DataFrame to a particular Datascope table, which means it will attempt to
write a properly formatted text file for the table name passed to the
method function.
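The fragment below is only a sketch of that usage. I am assuming the DataFrame and the target table name are the primary arguments; check the df2table docstring for the exact signature and options before using it:
# assumed argument list:  the DataFrame to save and the Datascope table name
dsdbm.df2table(df, 'site')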
Finally, the datascope.py
module also contains two
convenience methods that simplify two common operations with
Datascope database tables:
CSS30Catalog2df creates the standard "catalog view" of CSS3.0. In seismology a "catalog" is an image of what in ancient times was distributed as a book tabulating earthquake hypocenter estimates and the arrival time data used to create those estimates. CSS3.0 standardized a schema for creating a "catalog" as the join of four tables that this method creates: event->origin->assoc->arrival, where "->" symbolizes a right database join operator. It returns a pandas DataFrame that is the "catalog". Usage details can be gleaned from the docstring.
wfdisc2doclist can be thought of as an alternative to the MsPASS index_mseed_file method. It returns a list of python dictionaries (documents) that are roughly equivalent to documents created by index_mseed_file. The main application is to use the alternative miniseed indexer of Antelope. There are many ways that raw miniseed files from experimental data (i.e. data not sanitized for storage in the archives) can be flawed. The Antelope implementation has more robust handlers for known problems than that in MsPASS. For most uses we would encourage immediately dumping the output to the standard MsPASS collection wf_miniseed using a variant of this example.
from mspasspy.db.client import DBClient
from mspasspy.preprocessing.css30.datascope import DatascopeDatabase

client = DBClient()
db = client.get_database("myproject")
dsdb = DatascopeDatabase("usarray")
# build wf_miniseed-style documents from the wfdisc table and save them
doclist = dsdb.wfdisc2doclist()
db.wf_miniseed.insert_many(doclist)
Note
Be warned that :py:meth:`wfdisc2doclist <mspasspy.preprocessing.css30.datascope.DatascopeDatabase.wfdisc2doclist>` only works with a wfdisc that is an index to miniseed data. It does not currently support other formats defined by CSS3.0.