Tutorial#
You have collected on-ground measurements of a metric of interest at multiple field sites. You now want to extract pixels from satellite or aerial images acquired over the field sites. And you want to calculate one or more statistics on the extracted pixel values for each site.
This tutorial guides you through a typical usage pattern to do so.
A complete example is provided in pixdrill.example.
The typical usage pattern#
Specify the field survey properties
Choose the images to be drilled
Define the statistics to be calculated on the drilled pixels
Extract the pixels and compute the statistics
Fetch the results
Specify the field survey properties#
Use the pixdrill.drillpoints.Point class to model the properties of
the pixel extraction you want to do for each field site. The properties are the:
- location of your field site defined by a coordinate and a coordinate reference system
- the extraction region of interest, which will resemble that of your field site
- date of the field survey
- time range, either side of the field survey date, of the imagery acquisition
Example:
from pixdrill import drill
from pixdrill import drillpoints
# Create the first point.
longitude = 140
latitude = -36.5
crs_code = 4326 # EPSG code for WGS84 coordinate reference system. See epsg.org.
# The survey date and time. If the timezone is not specified, then
# UTC is assumed.
tz = datetime.timezone(datetime.timedelta(hours=10))
survey_date = datetime.datetime(
2022, 7, 28, 14, 30, 0, tzinfo=tz) # 28 July 2022 @ 2:30 pm
acquisition_window = datetime.timedelta(days=3)
site_shape = drill_points.ROI_SHP_CIRCLE # Define the extraction shape
site_radius = 50 # The extraction size, in metres
pt_1 = drillpoints.Point(
longitude, latitude, survey_date, crs_code,
acquisition_window, site_radius, site_shape)
# Create extra points.
...
# Add the points to a list, ready for extraction.
points = [pt_1, ...]
# Extract pixels and run stats - only the points argument is shown for the time being
drill.drill(points, ...)
Choose the images#
Specify the images to extract pixels from using one or both of the following methods:
provide a list of file paths or URLs
use the Items returned from a search of a STAC catalogue
Pixel Driller uses GDAL to read the images. So the
file paths or URLs must be in a form that GDAL can parse.
STAC makes Pixel Driller powerful.
Consider this: you have hundreds of field sites and, depending on your acquisition
window and satellite revisit period, a search of a STAC catalogue returns
thousands of images to drill. This is clearly better than having to specify
these yourself in a list.
If using a STAC catalogue, we assume that you are familiar with the properties of that catalogue. In particular: - the endpoint (a URL) to the STAC catalogue - the names of the collections in the catalogue; a collection is typically a grouping of similar images, such as Sentinel-2 Level 2A - the names of the STAC Item’s raster assets that you want to drill; note that a list of STAC Items is returned from a search, and the set of asset IDs is the same for every Item in a collection
Continuing our example:
# Specify paths to the images to be drilled.
img1 = "/path/to/img1.tif"
img2 = "/path/to/img2.tif"
# Specify the collection and raster assets of the Items in a STAC catalogue
endpoint = "https://earth-search.aws.element84.com/v0"
collections = ['sentinel-s2-l2a-cogs']
assets = ['B02', 'B11']
# Extract the pixels and run stats
drill.drill(
points, images=[img1, img2],
stac_endpoint=endpoint, raster_assets=assets, collections=collections,
...)
How do I find the STAC collection and its Items’ asset names?
Hopefully they are documented by the publisher of the STAC catalogue. You may also like to inspect the catalogue using a STAC Browser.
Define the statistics#
A point will be found to intersect one or more of the images that you supply, or one or more of the STAC Items returned from a catalogue search.
Statistics are calculated from the pixels extracted around the point from each image and each STAC Item.
Use one or both of the following methods to specify the statistics to be calculated:
use the built-in, standard statistics
define your own user statistics
Standard statistics#
The standard statistics are defined in the pixdrill.drillstats module.
Use the STATS_ symbols when specifying them. For example, to calculate
the mean and standard deviation of each raster:
from pixdrill import drillstats
...
std_stats = [drillstats.STATS_MEAN, drillstats.STATS_STDEV]
drill.drill(
points, images=[img1, img2],
stac_endpoint=endpoint, raster_assets=assets, collections=collections,
std_stats=std_stats, ...)
There is a limitation on the use of standard statistics. The underlying
functions assume that each drilled image is a single-band raster.
So, in our example img1 and img2 contain only one band.
Likewise, the STAC assets B02 and B11 contain only one band.
If one of your images contains multiple bands you will have to write your own functions to calculate statistics.
User statistics#
The standard statistics are quite limited. So you may need to write your own functions to calculate the statistics (model predictors) that you require.
Your function must have the following signature:
def my_func(array_info, item, point):
Where:
array_infois a list ofpixdrill.image_reader.ArrayInfoinstancesitemis an instance ofpixdrill.drill.ImageItemfor a user-supplied image, or an instance ofpystac.Itemfor a STAC Item.point is one of the
pixdrill.drillpoints.Pointobjects that you defined
The array_info list contains:
one element if the data were extracted from a user-supplied image
an element for every asset name given if the data were extracted from a STAC Item
An ArrayInfo instance contains these properties:
data: a 3D masked array containing the pixel values read from the image or item assetasset_id: the asset nameand other attributes that define the location of the array within the image it was extracted from
In the following example we want to know the range (max-min) of all pixel
values. It returns a list with one element when the item is
img1 or img2. And a list with two elements (one each for B02 and B11)
when item is a pystac.Item:
def user_range(array_info, item, pt):
return [a_info.data.max() - a_info.data.min() for a_info in array_info]
# For user stats, supply a list of (stat_name, stat_func) tuples.
# The name is used as a reference to retrieve the data later.
user_stats = [("MY_RANGE", user_range)]
drill.drill(
points, images=[img1, img2],
stac_endpoint=endpoint, raster_assets=assets,
collections=collections,
std_stats=std_stats, user_stats=user_stats)
Extract the pixels and calculate the stats#
This is done by calling pixdrill.drill.drill(), as per the previous section’s
example.
Fetch the results#
Pixel Driller stores the statistics for each field site with the corresponding
Point object. They are accessed using the
Point's stats attribute.
stats is an instance of pixdrill.drillstats.PointStats.
Use pixdrill.drillstats.PointStats.get_stats() to access the statistics
for all items:
# The stats.
std_stats = [drillstats.STATS_MEAN, drillstats.STATS_STDEV]
user_stats = [("MY_RANGE", user_range)]
# Extract pixels and calc stats.
drill.drill(
points, images=[img1, img2],
stac_endpoint=endpoint, raster_assets=assets,
collections=collections,
std_stats=std_stats, user_stats=user_stats)
# Fetch the results.
for pt in points:
print(f"Stats for point: x={pt.x}, y={pt.y}")
for item_id, item_stats in pt.stats.get_stats().items():
print(f" Item ID={item_id}")
print(f" Mean values: {item_stats[drillstats.STATS_MEAN]}")
print(f" Std dev : {item_stats[drillstats.STATS_STDEV]}")
print(f" Ranges : {item_stats['MY_RANGE']}")
For pt_1 in our example, this gives the following output:
Stats for point: x=140, y=-36.5:
Item ID=S2A_54HVE_20220730_0_L2A
Asset IDs : ['B02', 'B11']
Mean values: [3257.65289256 2369.75]
Std dev : [25.58754564 10.98578627]
Ranges : [164, 37]
Item ID=S2B_54HVE_20220725_0_L2A
Asset IDs : ['B02', 'B11']
Mean values: [3945.52066116 3198.11111111]
Std dev : [200.69515962 167.57366171]
Ranges : [1064, 779]
Item ID=/path/to/img1.tif
Asset IDs : [None]
Mean values: [60.]
Std dev : [0.]
Ranges : [0]
Item ID=/path/to/img2.tif
Asset IDs : [None]
Mean values: [1782.]
Std dev : [0.]
Ranges : [0]
Note that:
two STAC Items were found that matched the Point’s location and imagery acquisition window
the call to
pt.stats.get_stats()(with no parameters) returns a dictionary keyed by the item_id whose values are dictionaries, keyed by the statstic namethe standard statistics are retrieved using the
STATS_symbols in thepixdrill.drillstatsmodule.the user statistcs are retrieved using the user-defined name
A note about image no-data values#
An image may have no-data values defined in its metadata, one for each band.
Pixels with this value represents locations in the image that contain
no information. By default, Pixel Driller uses the no-data value
set on every band of every image it drills as values to ignore when
calculating statistics.
A problem may arise when the image’s no-data values are not set,
or the file format lacks support for it to be specified.
In such cases, Pixel Driller considers all pixel values
to be valid data when calculating statistics. But what if they’re not?
You can set or override the no data using
the ignore_val parameter in drill()
(and other functions).
Pixel Driller uses the ignore_val differently depending on whether
the Item being drilled is an ImageItem or a
pystac.Item.
When reading the raster assets of a pystac.Item,
ignore_val can be a list of values or a single value.
The list must contain the no-data value per asset. The same value
is used for all bands in an asset.
If ignore_val is a single value, then the same value is used for all bands
of all assets.
When reading the image of an ImageItem, ignore_val
can be a single value. It is used for all bands in the image.
Clearly, further development work is needed to support specifying the no data
value per-band. Pixel Driller currently relies on the images’ creators
to do that for us.
An alternative usage pattern#
Consider this: a STAC Item has raster assets that contain continuous (e.g. surface reflectance) and categorical (e.g. a scene classification) data. You want to calculate the mean and standard deviation of the pixels in the continuous assets, and user-defined statistics for the categorical assets.
This is achieved by reading the data, calculating the statistics, and fetching the results for the continuous assets separately to the categorical assets.
The usage pattern is:
Specify the locations and acquisition windows of your field surveys
Find the STAC Items and create Driller objects for each one
Define the assets and statistics to be calculated on the drilled pixels
Extract the pixels and compute the statistics
Fetch the results
Reset the statistics
Repeat steps 3-6 for a different set of assets
Example:
# Step 1. Create your points.
points = create_points()
# Step 2. Find the STAC Items to drill.
# This is done by drill.create_stac_drillers(), which returns a list of
# drillpoints.ItemDriller objects, one for each STAC Item.
drillers = drill.create_stac_drillers(stac_endpoint, points, collections)
# Steps 3 and 4. Loop over each driller, reading the data and
# calculating statistics on the continuous assets.
for drlr in drillers:
drlr.set_asset_ids(['B02', 'B11'])
drlr.read_data()
std_stats = [drillstats.STATS_MEAN, drillstats.STATS_STD]
drlr.calc_stats(std_stats=std_stats)
# Step 5. Fetch the stats
for pt in points:
stats_dict = pt.stats.get_stats()
# do something
...
# Step 6. reset the stats, ready for the next extract
pt.stats.reset()
# Note: another method for resetting the stats is:
# for drlr in drillers:
# drlr.reset_stats()
# Repeat steps 3-6, but this time for a categorical asset.
for drlr in drillers:
drlr.set_asset_ids(['SCL'])
drlr.read_data()
std_stats = [drillstats.STATS_COUNT]
user_stats = [("MY_STAT_1", my_func_1), ("MY_STAT_2", my_func_2)]
drlr.calc_stats(std_stats=std_stats, user_stats=user_stats)
# Fetch the stats
for pt in points:
stats_dict = pt.stats.get_stats()
# do something
...
# then reset the point's stats, ready for the next extract
pt.stats.reset()
# And so on.
Pitfalls#
Failing to specify a Point’s timezone#
A Point's time zone is assumed to be UTC if its
t attribute is a timezone unaware
datetime object.
Setting the time zone correctly is important when
determining the nearest_n STAC Items to the survey.
Multiple calls to calc_stats#
All data should be read from images before calling
calc_stats().
And calc_stats()
should only be called once. This is how pixdrill.drill.drill() works.
But care should be taken when using an alternative usage pattern
to reuse the Points to calculate statistics on a new set of Items.
Always reset the statistics for every point before reading new data and
calculating a new set of statistics. If the stats are not reset, any
previously calculated stats are recalculated.
Accessing STAC Item assets#
For a STAC Item, GDAL must be able to read the assets that you want to drill. This means that:
assets have a URL (http, https, ftp etc) as their
href attributeGDAL is built so that it can read data from network-based filesystems
if authentication is required it is done in a manner supported by GDAL
GDAL’s
CPL_VSIL_CURL_ALLOWED_EXTENSIONSenvironment variable is set and contains the filename extensions of the assets, e.g.CPL_VSIL_CURL_ALLOWED_EXTENSIONS=".tif,.TIF,.tiff,.vrt,.jp2"For example, you should be able to read tif files if the following command returns information about the file:
CPL_VSIL_CURL_ALLOWED_EXTENSIONS=".tif" gdalinfo /vsicurl/https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/54/H/VE/2022/7/S2A_54HVE_20220730_0_L2A/B02.tif