SMDC_perftests
This is the documentation of SMDC_perftests, a small Python module that provides
a decorator for measuring the execution time of a function. The decorator stores the
measured times in a smdc_perftests.performance_tests.test_cases.TestResults
object.
TestResults objects can be compared to each other to quickly find out whether the measured times
differ significantly, based on the 95% confidence intervals of their means.
The objects can also be stored to and restored from netCDF4 files on disk, and plotting functions for TestResults objects are available.
Requirements
This package was tested with Python 2.7 and requires the following packages:
netCDF4
pytest
matplotlib
pygeogrids
# optional
# seaborn for pretty plots
Contents
Examples
Basic Example
import smdc_perftests.performance_tests.test_cases as test_cases
import time
import numpy as np
# use the measure decorator to run the function multiple times
# and measure the execution time of each run
# the returned result gets the name given in
# the decorator but it can be changed later if necessary
@test_cases.measure('experiment', runs=50)
def experiment(sleeptime=0.01):
    time.sleep(sleeptime + np.random.rand() * sleeptime)
result1 = experiment()
result2 = experiment(0.05)
result2.name = "sleep 0.05"
result3 = experiment(0.011)
result3.name = "sleep 0.011"
# the results can be printed
print result1
print result3
Results experiment
50 runs
median 0.0158 mean 0.0157 stdev 0.0029
sum 0.7859
95%% confidence interval of the mean
upper 0.0165
|
mean 0.0157
|
lower 0.0149
Results sleep 0.011
50 runs
median 0.0158 mean 0.0163 stdev 0.0034
sum 0.8168
95%% confidence interval of the mean
upper 0.0173
|
mean 0.0163
|
lower 0.0154
# the results can also be compared based on the 95% confidence intervals.
print result1 < result2
print result2 < result1
print result1 < result3
True
False
False
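The comparison above can be pictured as a check for non-overlapping confidence intervals. The following is a minimal sketch of that idea written for Python 3, not the package's implementation: `is_significantly_faster` is a hypothetical helper, and the rule that one result only counts as smaller when its upper bound lies below the other's lower bound is an assumption. The interval bounds are copied from the printed results above.

```python
def is_significantly_faster(ci_a, ci_b):
    """
    Compare two (lower, mean, upper) confidence intervals.

    Assumption: a is considered faster than b only if the
    intervals do not overlap, i.e. upper(a) < lower(b).
    """
    upper_a = ci_a[2]
    lower_b = ci_b[0]
    return upper_a < lower_b


# bounds copied from the printed output of result1 and result3
ci_experiment = (0.0149, 0.0157, 0.0165)
ci_sleep_0011 = (0.0154, 0.0163, 0.0173)

# the two intervals overlap, so neither direction is significant
print(is_significantly_faster(ci_experiment, ci_sleep_0011))
print(is_significantly_faster(ci_sleep_0011, ci_experiment))
```

Under this assumed rule both comparisons are false when the intervals overlap, which is consistent with result1 < result3 evaluating to False above.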
# or plotted as boxplots
import smdc_perftests.visual as vis
import matplotlib.pyplot as plt
%matplotlib inline
fig, axis = vis.plot_boxplots(result1, result3)
plt.show()

Example with Dataset
import smdc_perftests.performance_tests.test_runner as test_runner
import time
import datetime as dt
import numpy as np
# define a fake Dataset class that implements the methods
# get_timeseries, get_avg_image and get_data
class FakeDataset(object):
    """
    Fake Dataset that provides routines for reading
    time series and images
    that do nothing
    """

    def __init__(self):
        # counters for the number of reads of each kind
        self.ts_read = 0
        self.img_read = 0
        self.cells_read = 0

    def get_timeseries(self, gpi, date_start=None, date_end=None):
        time.sleep(0.01 * np.random.rand())
        self.ts_read += 1
        return None

    def get_avg_image(self, date_start, date_end=None, cell_id=None):
        """
        Image readers generally return more than one
        variable. This should not matter for these tests.
        """
        assert isinstance(date_start, dt.datetime)
        self.img_read += 1
        time.sleep(0.01 * np.random.rand())
        return None, None, None, None, None

    def get_data(self, date_start, date_end, cell_id):
        """
        Image readers generally return more than one
        variable. This should not matter for these tests.
        """
        assert isinstance(date_start, dt.datetime)
        assert isinstance(date_end, dt.datetime)
        self.cells_read += 1
        time.sleep(0.01 * np.random.rand())
        return None, None, None, None, None
fd = FakeDataset()
# setup grid point index list, must come from grid object or
# sciDB
# this test dataset has 10000 gpis of which 1 percent will be read
gpi_list = range(10000)
@test_runner.measure('test_rand_gpi', runs=100)
def test_ts():
    test_runner.read_rand_ts_by_gpi_list(fd, gpi_list)
result_ts = test_ts()
print result_ts
Results test_rand_gpi
100 runs
median 0.5642 mean 0.5591 stdev 0.0334
sum 55.9069
95%% confidence interval of the mean
upper 0.5657
|
mean 0.5591
|
lower 0.5524
# setup datetime list
# this test dataset has 10000 days of dates of which 1 percent will be read
date_list = []
for days in range(10000):
    date_list.append(dt.datetime(2007, 1, 1) + dt.timedelta(days=days))
@test_runner.measure('test_rand_date', runs=100)
def test_img():
    test_runner.read_rand_img_by_date_list(fd, date_list)
result_img = test_img()
print result_img
Results test_rand_date
100 runs
median 0.5530 mean 0.5548 stdev 0.0343
sum 55.4800
95%% confidence interval of the mean
upper 0.5616
|
mean 0.5548
|
lower 0.5480
"""
Read data by cell list using fixed start and end date
1 percent of the cells are read with a minimum of 1 cell.
"""
fd = FakeDataset()
cell_list = range(10000)
@test_runner.measure('test_rand_cells', runs=100)
def test():
    test_runner.read_rand_cells_by_cell_list(fd,
        dt.datetime(2007, 1, 1), dt.datetime(2008, 1, 1), cell_list)
results_cells = test()
print results_cells
Results test_rand_cells
100 runs
median 0.5510 mean 0.5476 stdev 0.0368
sum 54.7624
95%% confidence interval of the mean
upper 0.5549
|
mean 0.5476
|
lower 0.5403
import smdc_perftests.visual as vis
import matplotlib.pyplot as plt
%matplotlib inline
fig, axis = vis.plot_boxplots(result_ts, result_img, results_cells)
plt.show()

Example of running the test suite and analyzing the results
import os
from datetime import datetime
from smdc_perftests.performance_tests import test_scripts
# the test_scripts module contains the function
# run_performance_tests which runs all the performance tests on a dataset
# in this example we will use the ESA CCI dataset class
from smdc_perftests.datasets.esa_cci import ESACCI_netcdf
from smdc_perftests import helper
#init the esa cci dataset
fname = os.path.join("/media", "sf_H", "Development", "python",
"workspace",
"SMDC", "SMDC_perftests", "tests", "test_data",
"ESACCI-2Images.nc")
# only read the sm variable for this testrun
ds = ESACCI_netcdf(fname, variables=['sm'])
# get the testname from the filename
testname = os.path.splitext(os.path.split(fname)[1])[0]
# generate a date range list using the helper function
# in this example this does not make a lot of sense
date_range_list = helper.generate_date_list(datetime(2013, 11, 30),
datetime(2013, 12, 1),
n=50)
# set a directory into which to save the results
# in this case the tests folder in the home directory
res_dir = "/home/pydev/tests/"
# run the performance tests using the grid point indices from
# the dataset grid, the generated date_range_list and gpi read percentage
# of 0.1 percent and only one repeat
test_scripts.run_performance_tests(testname, ds, res_dir,
gpi_list=ds.grid.land_ind,
date_range_list=date_range_list,
gpi_read_perc=0.1,
repeats=1)
reading 245 out of 244243 time series
reading 1 out of 50 dates
reading 1 out of 50 dates
This creates the following files named using the name given to the test and the name of the test function that was run.
!ls /home/pydev/tests
ESACCI-2Images_test-rand-avg-img.nc ESACCI-2Images_test-rand-gpi.nc
ESACCI-2Images_test-rand-daily-img.nc
Visualization of the results
%matplotlib inline
import glob
import smdc_perftests.performance_tests.analyze as analyze
# get all the files in the results folder
fs = glob.glob(os.path.join(res_dir, "*.nc"))
df = analyze.prep_results(fs)
# this returns the mean times at the moment
print df
# and makes a very simple bar plot
ax = analyze.bar_plot(df)
means
ESACCI-2Images_test-rand-avg-img 0.085946
ESACCI-2Images_test-rand-gpi 0.098265
ESACCI-2Images_test-rand-daily-img 0.059122

License
# Copyright (c) 2014,Vienna University of Technology,
# Department of Geodesy and Geoinformation
# All rights reserved.
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of the Vienna University of Technology,
# Department of Geodesy and Geoinformation nor the
# names of its contributors may be used to endorse or promote products
# derived from this software without specific prior written permission.
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL VIENNA UNIVERSITY OF TECHNOLOGY,
# DEPARTMENT OF GEODESY AND GEOINFORMATION BE LIABLE FOR ANY
# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
smdc_perftests
smdc_perftests package
Subpackages
smdc_perftests.datasets package
Dataset Reader for EQUI-7 data
Created on Mon Jun 8 17:30:19 2015
class smdc_perftests.datasets.EQUI_7.EQUI_7(fname, variables=None, avg_var=None, time_var='time', lat_var='x', lon_var='y')
Bases: smdc_perftests.datasets.esa_cci.ESACCI_netcdf
Module contains the Classes for reading ASCAT data
Created on Fri Mar 27 15:12:18 2015
@author: Christoph.Paulik@geo.tuwien.ac.at
class smdc_perftests.datasets.ascat.ASCAT_netcdf(fname, variables=None, avg_var=None, time_var='time', gpi_var='gpis_correct', cell_var='cells_correct', get_exact_time=False)
Bases: object
Class for reading ASCAT data from netCDF files
Caches the following:
- time variable
- keeps the dataset open as long as the instance exists
Methods
get_avg_image(date_start[, date_end, cellID]): Reads image from dataset, takes the average if more than one value is in the result array.
get_data(date_start, date_end[, cellID]): Reads date cube from dataset
get_timeseries(locationid[, date_start, ...])
get_avg_image(date_start, date_end=None, cellID=None)
Reads image from dataset, takes the average if more than one value is in the result array.
Parameters:
date_start: datetime
    start date of the image to get. If only one date is given then the whole day of this date is read
date_end: datetime, optional
    end date of the averaged image to get
cellID: int, optional
    cell id to which the image should be limited, for ESA CCI this is not defined at the moment.
get_data(date_start, date_end, cellID=None)
Reads date cube from dataset
Parameters:
date_start: datetime
    start date of the image to get. If only one date is given then the whole day of this date is read
date_end: datetime
    end date of the averaged image to get
cellID: int
    cell id to which the image should be limited, for ESA CCI this is not defined at the moment.
Module contains the Class for reading ESA CCI data in netCDF Format
Created on Fri Mar 27 15:12:18 2015
@author: Christoph.Paulik@geo.tuwien.ac.at
class smdc_perftests.datasets.esa_cci.ESACCI_netcdf(fname, variables=None, avg_var=None, time_var='time', lat_var='lat', lon_var='lon')
Bases: object
Class for reading ESA CCI data from netCDF files
Caches the following:
- time variable
- keeps the dataset open as long as the instance exists
Methods
get_avg_image(date_start[, date_end, cellID]): Reads image from dataset, takes the average if more than one value is in the result array.
get_data(date_start, date_end[, cellID]): Reads date cube from dataset
get_timeseries(locationid[, date_start, ...])
get_avg_image(date_start, date_end=None, cellID=None)
Reads image from dataset, takes the average if more than one value is in the result array.
Parameters:
date_start: datetime
    start date of the image to get. If only one date is given then the whole day of this date is read
date_end: datetime, optional
    end date of the averaged image to get
cellID: int, optional
    cell id to which the image should be limited, for ESA CCI this is not defined at the moment.
get_data(date_start, date_end, cellID=1)
Reads date cube from dataset
Parameters:
date_start: datetime
    start date of the image to get. If only one date is given then the whole day of this date is read
date_end: datetime
    end date of the averaged image to get
cellID: int
    cell id to which the image should be limited, for ESA CCI this is not defined at the moment.
smdc_perftests.performance_tests package
Module for analyzing the test results
Created on Thu Apr 2 14:30:51 2015
@author: christoph.paulik@geo.tuwien.ac.at
smdc_perftests.performance_tests.analyze.bar_plot(df, show=True)
Make a bar plot from the gathered results
Parameters:
df: pandas.DataFrame
    Measured data
show: boolean
    if set then the plot is shown
Returns:
ax: matplotlib.axes
    axes of the plot
smdc_perftests.performance_tests.analyze.prep_results(results_files, name_fm=None, grouping_f=None)
Takes a list of results file names and bundles the results into a pandas DataFrame
Parameters:
results_files: list
    list of filenames to load
name_fm: function, optional
    if set, a function that gets the name of the results and returns a more meaningful name. This is useful if the names of the results are very long or verbose.
grouping_f: function, optional
    can be used to assign groups according to the name of the results. Gets the name and returns a string.
Returns:
df : pandas.DataFrame
    Results named and possibly grouped
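As an illustration of the two optional callables, the sketch below shows what a name function and a grouping function could look like for result names such as those produced in the example above. Both `short_name` and `group_by_dataset` are hypothetical helpers written for this illustration, and the assumption that result names follow the pattern `<testname>_test-<case>` is inferred only from the file names listed earlier.

```python
def short_name(name):
    """
    Hypothetical name function: keep only the test case part,
    e.g. 'ESACCI-2Images_test-rand-gpi' becomes 'rand-gpi'.
    Names without the '_test-' marker are returned unchanged.
    """
    return name.split("_test-", 1)[-1]


def group_by_dataset(name):
    """
    Hypothetical grouping function: use everything before the
    first underscore as the group label.
    """
    return name.split("_", 1)[0]


example = "ESACCI-2Images_test-rand-avg-img"
print(short_name(example))        # rand-avg-img
print(group_by_dataset(example))  # ESACCI-2Images
```

Functions of this shape could then be passed as name_fm and grouping_f; how prep_results applies them beyond the description above is not shown here.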
This module contains functions that run tests according to specifications from the SMDC Performance comparison document.
Interfaces to data should be interchangeable as long as they adhere to the interface specifications from the rsdata module.
Created on Tue Oct 21 13:37:58 2014
@author: christoph.paulik@geo.tuwien.ac.at
class smdc_perftests.performance_tests.test_cases.SelfTimingDataset(ds, timefuncs=['get_timeseries', 'get_avg_image', 'get_data'])
Bases: object
Dataset class that times the functions of a dataset instance it gets in its constructor
Stores the results as TestResults instances in a dictionary with the timed function names as keys.
Methods
gentimedfunc(funcname): generate a timed function that calls
class smdc_perftests.performance_tests.test_cases.TestResults(init_obj, name=None, ddof=1)
Bases: object
Simple object that contains the test results and can be used to compare the test results to other test results.
Objects of this type can also be plotted by the plotting routines.
Parameters:
measured times or filename: list or string
    list of measured times or netCDF4 file produced by to_nc of another TestResults object
ddof: int
    difference degrees of freedom. This is used to calculate standard deviation and variance. It is the number that is subtracted from the sample number n when estimating the population standard deviation and variance. See Bessel's correction on e.g. Wikipedia for an explanation
Attributes:
median: float
    median of the measurements
n: int
    sample size
stdev: float
    standard deviation
var: float
    variance
total: float
    total time elapsed
mean: float
    mean time per test run
Methods
confidence_int([conf_level]): Calculate confidence interval of the mean
to_nc(filename): store results on disk as a netCDF4 file
confidence_int(conf_level=95)
Calculate confidence interval of the mean time measured
Parameters:
conf_level: float
    confidence level desired for the confidence interval in percent. This will be transformed into the quantile needed to get the z value for the t distribution. Default is the 95% confidence interval
Returns:
lower_mean : float
    lower confidence interval boundary
mean : float
    mean value
upper_mean : float
    upper confidence interval boundary
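A confidence interval of the mean with the return order described above can be sketched with the standard library alone. This is an assumption-laden illustration for Python 3.8 or newer, not the package's code: `confidence_int_sketch` is a hypothetical name, and it uses a normal approximation for the critical value instead of a t distribution quantile, which gives slightly narrower intervals for small samples.

```python
import math
import statistics


def confidence_int_sketch(times, conf_level=95, ddof=1):
    """
    Return (lower, mean, upper) of the confidence interval of the mean.

    Assumption: the critical value comes from the standard normal
    distribution, not from the t distribution.
    """
    n = len(times)
    mean = sum(times) / float(n)
    # variance with ddof subtracted from n (Bessel's correction for ddof=1)
    var = sum((t - mean) ** 2 for t in times) / float(n - ddof)
    stdev = math.sqrt(var)
    # two-sided interval: 95 percent needs the 0.975 quantile
    quantile = 1.0 - (1.0 - conf_level / 100.0) / 2.0
    critical = statistics.NormalDist().inv_cdf(quantile)
    half_width = critical * stdev / math.sqrt(n)
    return mean - half_width, mean, mean + half_width


lower, mean, upper = confidence_int_sketch([0.012, 0.015, 0.018, 0.016, 0.014])
print(lower < mean < upper)
```

The interval is symmetric around the mean and shrinks with the square root of the number of runs, which is why a larger runs value makes differences between two functions easier to detect.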
smdc_perftests.performance_tests.test_cases.measure(exper_name, runs=5, ddof=1)
Decorator that measures the running time of a function and calculates statistics.
Parameters:
exper_name: string
    experiment name, used for plotting and saving
runs: int
    number of test runs to perform
ddof: int
    difference degrees of freedom. This is used to calculate standard deviation and variance. It is the number that is subtracted from the sample number n when estimating the population standard deviation and variance. See Bessel's correction on e.g. Wikipedia for an explanation
Returns:
results: dict
    TestResults instance
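How a decorator with arguments can repeat and time a function is sketched below. This is a simplified Python 3 illustration under stated assumptions: `measure_sketch` is a hypothetical name, and it returns a plain dictionary with the raw times instead of a TestResults instance.

```python
import functools
import time


def measure_sketch(exper_name, runs=5):
    """Decorator factory: run the function `runs` times and time each run."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            times = []
            for _ in range(runs):
                start = time.perf_counter()
                func(*args, **kwargs)
                times.append(time.perf_counter() - start)
            # assumption: a dict stands in for the TestResults object
            return {"name": exper_name, "times": times}

        return wrapper

    return decorator


@measure_sketch("short sleep", runs=3)
def short_sleep(sleeptime=0.001):
    time.sleep(sleeptime)


result = short_sleep()
print(result["name"], len(result["times"]))
```

Note that calling the decorated function returns the measurements rather than the function's own return value, which matches how the examples above use the results.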
smdc_perftests.performance_tests.test_cases.read_rand_cells_by_cell_list(dataset, cell_date_list, cell_id, read_perc=1.0, max_runtime=None)
Reads data from the dataset using the get_data method. In this method the start and end datetimes are fixed for all cell IDs that are read.
Parameters:
dataset: instance
    instance of a class that implements a get_data(date_start, date_end, cell_id) method
date_start: datetime
    start dates which should be read.
date_end: datetime
    end dates which should be read.
cell_date_list: list of tuples
    time intervals to read for each cell
cell_id: int or iterable
    cell ids which should be read. Can also be a list of integers
read_perc : float
    percentage of cell ids to read from the
max_runtime: int, optional
    maximum runtime of test in seconds.
smdc_perftests.performance_tests.test_cases.read_rand_img_by_date_list(dataset, date_list, read_perc=1.0, max_runtime=None, **kwargs)
Reads image data for random dates on a list. Additional kwargs are given to the read_img method of the dataset.
Parameters:
dataset: instance
    instance of a class that implements a read_img(datetime) method
date_list: iterable
    list of datetime objects
read_perc: float
    percentage of datetimes out of date_list to read
max_runtime: int, optional
    maximum runtime of test in seconds.
**kwargs:
    other keywords are passed to the get_avg_image method of the dataset
smdc_perftests.performance_tests.test_cases.read_rand_img_by_date_range(dataset, date_list, read_perc=1.0, max_runtime=None, **kwargs)
Reads image data between random dates on a list. Additional kwargs are given to the read_img method of the dataset.
Parameters:
dataset: instance
    instance of a class that implements a read_img(datetime) method
date_list: iterable
    list of datetime objects. The format is a list of lists e.g.
    [[datetime(2007,1,1), datetime(2007,1,1)],    # reads one day
     [datetime(2007,1,1), datetime(2007,12,31)]]  # reads one year
read_perc: float
    percentage of datetimes out of date_list to read
max_runtime: int, optional
    maximum runtime of test in seconds.
**kwargs:
    other keywords are passed to the get_avg_image method of the dataset
smdc_perftests.performance_tests.test_cases.read_rand_ts_by_gpi_list(dataset, gpi_list, read_perc=1.0, max_runtime=None, **kwargs)
Reads time series data for random grid point indices in a list. Additional kwargs are given to the read_ts method of the dataset.
Parameters:
dataset: instance
    instance of a class that implements a read_ts(gpi) method
gpi_list: iterable
    list or numpy array of grid point indices
read_perc: float
    percentage of points from gpi_list to read
max_runtime: int, optional
    maximum runtime of test in seconds.
**kwargs:
    other keywords are passed to the get_timeseries method of the dataset
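The read functions above share one pattern: pick a random percentage of a list, read each selected item, and optionally stop after a time limit. The sketch below illustrates that pattern in Python 3. It is not the package's implementation: `read_rand_subset` is a hypothetical helper, and rounding the number of reads up with a minimum of one is an assumption inferred from the example output earlier ("reading 245 out of 244243 time series" at 0.1 percent).

```python
import math
import random
import time


def read_rand_subset(read_func, items, read_perc=1.0, max_runtime=None):
    """
    Call read_func on a random read_perc percent of items.

    Returns the number of items that were actually read, which can be
    lower than planned if max_runtime (in seconds) is exceeded.
    """
    items = list(items)
    # assumption: round up and always read at least one item
    n_read = max(1, int(math.ceil(len(items) * read_perc / 100.0)))
    chosen = random.sample(items, n_read)
    start = time.time()
    n_done = 0
    for item in chosen:
        read_func(item)
        n_done += 1
        if max_runtime is not None and time.time() - start > max_runtime:
            break
    return n_done


reads = []
print(read_rand_subset(reads.append, range(10000), read_perc=1.0))
```

Sampling without replacement means no item is read twice within one run, so caching inside the dataset has less chance to distort a single measurement.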
Module implements the test cases specified in the performance test protocol
Created on Wed Apr 1 10:59:05 2015
@author: christoph.paulik@geo.tuwien.ac.at
smdc_perftests.performance_tests.test_scripts.run_ascat_tests(dataset, testname, results_dir, n_dates=10000, date_read_perc=0.1, gpi_read_perc=0.1, repeats=3, cell_read_perc=10.0, max_runtime_per_test=None)
Runs the ASCAT tests given a dataset instance
Parameters:
dataset: Dataset instance
    Instance of a Dataset class
testname: string
    Name of the test, used for storing the results
results_dir: string
    path where to store the test results
n_dates: int, optional
    number of dates to generate
date_read_perc: float, optional
    percentage of random selection from date_range_list read for each try
gpi_read_perc: float, optional
    percentage of random selection from gpi_list read for each try
repeats: int, optional
    number of repeats of the tests
cell_list: list, optional
    list of possible cells to read from. If given then the read_data test will be run
max_runtime_per_test: float, optional
    maximum runtime per test in seconds, if given the tests will be aborted after taking more than this time
smdc_perftests.performance_tests.test_scripts.run_equi7_tests(dataset, testname, results_dir, n_dates=10000, date_read_perc=0.1, gpi_read_perc=0.1, repeats=3, cell_read_perc=100.0, max_runtime_per_test=None)
Runs the ASAR/Sentinel 1 Equi7 tests given a dataset instance
Parameters:
dataset: Dataset instance
    Instance of a Dataset class
testname: string
    Name of the test, used for storing the results
results_dir: string
    path where to store the test results
n_dates: int, optional
    number of dates to generate
date_read_perc: float, optional
    percentage of random selection from date_range_list read for each try
gpi_read_perc: float, optional
    percentage of random selection from gpi_list read for each try
repeats: int, optional
    number of repeats of the tests
cell_list: list, optional
    list of possible cells to read from. If given then the read_data test will be run
max_runtime_per_test: float, optional
    maximum runtime per test in seconds, if given the tests will be aborted after taking more than this time
smdc_perftests.performance_tests.test_scripts.run_esa_cci_netcdf_tests(test_dir, results_dir, variables=['sm'])
Function for running the ESA CCI netCDF performance tests. The tests will be run for all .nc files in the test_dir.
Parameters:
test_dir: string
    path to the test files
results_dir: string
    path in which the results should be stored
variables: list
    list of variables to read for the tests
smdc_perftests.performance_tests.test_scripts.run_esa_cci_tests(dataset, testname, results_dir, n_dates=10000, date_read_perc=0.1, gpi_read_perc=0.1, repeats=3, cell_read_perc=10.0, max_runtime_per_test=None)
Runs the ESA CCI tests given a dataset instance
Parameters:
dataset: Dataset instance
    Instance of a Dataset class
testname: string
    Name of the test, used for storing the results
results_dir: string
    path where to store the test results
n_dates: int, optional
    number of dates to generate
date_read_perc: float, optional
    percentage of random selection from date_range_list read for each try
gpi_read_perc: float, optional
    percentage of random selection from gpi_list read for each try
repeats: int, optional
    number of repeats of the tests
cell_list: list, optional
    list of possible cells to read from. If given then the read_data test will be run
max_runtime_per_test: float, optional
    maximum runtime per test in seconds, if given the tests will be aborted after taking more than this time
smdc_perftests.performance_tests.test_scripts.run_performance_tests(name, dataset, save_dir, gpi_list=None, date_range_list=None, cell_list=None, cell_date_list=None, gpi_read_perc=1.0, date_read_perc=1.0, cell_read_perc=1.0, max_runtime_per_test=None, repeats=1)
Run a complete test suite on a dataset and store the results in the specified directory
Parameters:
name: string
    name of the test run, used for file naming
dataset: dataset instance
    instance implementing the get_timeseries, get_avg_image and get_data methods.
save_dir: string
    directory to store the test results in
gpi_list: list, optional
    list of possible grid point indices, if given the time series reading tests will be run
date_range_list: list, optional
    list of possible dates, if given then the read_avg_image and read_data tests will be run. The format is a list of lists e.g.
    [[datetime(2007,1,1), datetime(2007,1,1)],    # reads one day
     [datetime(2007,1,1), datetime(2007,12,31)]]  # reads one year
cell_list: list, optional
    list of possible cells to read from. If given then the read_data test will be run
cell_date_list: list, optional
    list of time intervals to read per cell. Should be as long as the cell list or longer.
gpi_read_perc: float, optional
    percentage of random selection from gpi_list read for each try
date_read_perc: float, optional
    percentage of random selection from date_range_list read for each try
cell_read_perc: float, optional
    percentage of random selection from cell_range_list read for each try
max_runtime_per_test: float, optional
    maximum runtime per test in seconds, if given the tests will be aborted after taking more than this time
repeats: int, optional
    number of repeats for each measurement
Submodules
smdc_perftests.helper module
Helper functions
Created on Wed Apr 1 14:50:18 2015
@author: christoph.paulik@geo.tuwien.ac.at
smdc_perftests.helper.generate_date_list(minimum, maximum, n=500, max_spread=30, min_spread=None)
Parameters:
minimum: datetime
    minimum datetime
maximum: datetime
    maximum datetime
n: int
    number of dates to generate
max_spread: int, optional
    maximum spread between dates
min_spread: int, optional
    minimum spread between dates
Returns:
date_list: list
    list of start, end lists. The format is a list of lists e.g.
    [[datetime(2007,1,1), datetime(2007,1,1)],
     [datetime(2007,1,1), datetime(2007,12,31)]]
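To make the returned format concrete, the sketch below generates a list of random [start, end] pairs inside a given period. It is an illustration under several assumptions and not the helper's actual code: `generate_date_list_sketch` and its `seed` parameter are hypothetical, the spread is interpreted as a number of days, and a missing min_spread is treated as zero.

```python
import random
from datetime import datetime, timedelta


def generate_date_list_sketch(minimum, maximum, n=500, max_spread=30,
                              min_spread=None, seed=None):
    """Return n random [start, end] pairs between minimum and maximum."""
    rng = random.Random(seed)
    low = 0 if min_spread is None else min_spread
    total_days = (maximum - minimum).days
    date_list = []
    for _ in range(n):
        # assumption: spread is measured in whole days
        spread = min(rng.randint(low, max_spread), total_days)
        # choose the start so that the end never exceeds maximum
        start = minimum + timedelta(days=rng.randint(0, total_days - spread))
        date_list.append([start, start + timedelta(days=spread)])
    return date_list


pairs = generate_date_list_sketch(datetime(2007, 1, 1), datetime(2007, 12, 31),
                                  n=3, seed=0)
for start, end in pairs:
    print(start.date(), end.date())
```

A list of this shape matches the date_range_list format described for run_performance_tests, where each inner list holds a start and an end datetime.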
smdc_perftests.visual module
Module for visualizing the test results
Created on Tue Nov 25 13:44:56 2014
@author: Christoph.Paulik@geo.tuwien.ac.at
smdc_perftests.visual.plot_boxplots(*args, **kwargs)
Plots means and confidence intervals of given TestResults objects
Parameters:
*args: TestResults instances
    any number of TestResults instances that should be plotted side by side
conf_level: int, optional
    confidence level to use for the computed confidence intervals
**kwargs: varied
    all other keyword arguments will be passed on to the plt.subplots function
Returns:
fig: matplotlib.Figure
ax1: matplotlib.axes