Filtering your Dataset

Large datasets are oftentimes too overwhelming to deal with in their entirety, and it can be useful and faster to analyze subsets of these datasets. Furthermore, filtering the dataset based on some field condition can reveal subtle information not easily accessible by looking at the whole dataset. Filters can be generated based on spatial position, say in a sphere in the center of your dataset space, or more generally they can be defined by the properties of any field in the simulation.

Because mesh fields are internally different from particle fields, there are different ways of filtering each type as indicated below; however, filtering fields by spatial location (i.e. geometric objects) will apply to both types equally.

Filtering Mesh Fields

Mesh fields can be filtered by two methods: cut region objects (YTCutRegion) and NumPy boolean masks. Boolean masks are simpler, but they only work for examining datasets, whereas cut regions objects create wholly new data objects suitable for full analysis (data examination, image generation, etc.)

Boolean Masks

NumPy boolean masks can be used with any NumPy array simply by passing the array a conditional. As a general example of this:

Notebook
In [1]:
import numpy as np
a = np.arange(5)
bigger_than_two = (a > 2)
print("Original Array: a = \n%s" % a)
print("Boolean Mask: bigger_than_two = \n%s" % bigger_than_two)
print("Masked Array: a[bigger_than_two] = \n%s" % a[bigger_than_two])
Original Array: a = 
[0 1 2 3 4]
Boolean Mask: bigger_than_two = 
[False False False  True  True]
Masked Array: a[bigger_than_two] = 
[3 4]

Similarly, if you’ve created a yt data object (e.g. a region, a sphere), you can examine its field values as a NumPy array by simply indexing it with the field name. Thus, it too can be masked using a NumPy boolean mask. Let’s set a simple mask based on the contents of one of our fields.

Notebook
In [1]:
import yt
ds = yt.load('Enzo_64/DD0042/data0042')
ad = ds.all_data()
hot = ad["temperature"].in_units('K') > 1e6
print('Temperature of all data: ad["temperature"] = \n%s' % ad["temperature"])
print("Boolean Mask: hot = \n%s" % hot)
print('Temperature of "hot" data: ad["temperature"][hot] = \n%s' %
      ad['temperature'][hot])
/usr/lib64/python3.6/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Temperature of all data: ad["temperature"] = 
[1.00000000e+00 1.00000000e+00 1.00000000e+00 ... 1.87798863e+07
 1.77985684e+07 1.73020029e+07] K
Boolean Mask: hot = 
[False False False ...  True  True  True]
Temperature of "hot" data: ad["temperature"][hot] = 
[ 6502480.87990464  7005697.58048854  7453051.94593615 ...
 18779886.27258056 17798568.39391833 17302002.90929025] K

This was a simple example, but one can make the conditionals that define a boolean mask have multiple parts, and one can stack masks together to make very complex cuts on one’s data. Once the data is filtered, it can be used if you simply need to access the NumPy arrays:

Notebook
In [1]:
import yt
ds = yt.load('Enzo_64/DD0042/data0042')
ad = ds.all_data()
overpressure_and_fast = (ad["pressure"] > 1e-14) & (ad["velocity_magnitude"].in_units('km/s') > 1e2)
print('Density of all data: ad["density"] = \n%s' % ad['density'])
print('Density of "overpressure and fast" data: ad["density"][overpressure_and_fast] = \n%s' %
      ad['density'][overpressure_and_fast])
/usr/lib64/python3.6/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Density of all data: ad["density"] = 
[2.05609648e-31 1.87887401e-31 2.73497858e-31 ... 3.17107844e-28
 2.45682636e-28 2.13163618e-28] g/cm**3
Density of "overpressure and fast" data: ad["density"][overpressure_and_fast] = 
[9.67936877e-29 1.03145733e-28 1.17144334e-28 ... 3.17107844e-28
 2.45682636e-28 2.13163618e-28] g/cm**3

Cut Regions

Cut regions are a more general solution to filtering mesh fields. The output of a cut region is an entirely new data object, which can be treated like any other data object to generate images, examine its values, etc.

Notebook

Let us demonstrate this with an example using the same dataset as we used with the boolean masks.

In [1]:
import yt
ds = yt.load("Enzo_64/DD0042/data0042")
/usr/lib64/python3.6/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters

The only argument to a cut region is a conditional on field output from a data object. The only catch is that you must denote the data object in the conditional as "obj" regardless of the actual object's name.

Here we create three new data objects which are copies of the all_data object (a region object covering the entire spatial domain of the simulation), but we've filtered on just "hot" material, the "dense" material, and the "overpressure and fast" material.

In [2]:
ad = ds.all_data()
hot_ad = ad.cut_region(["obj['temperature'] > 1e6"])
dense_ad = ad.cut_region(['obj["density"] > 5e-30'])

# you can chain cut regions in two ways:
dense_and_cool_ad = dense_ad.cut_region(["obj['temperature'] < 1e5"])
overpressure_and_fast_ad = ad.cut_region(['(obj["pressure"] > 1e-14) & (obj["velocity_magnitude"].in_units("km/s") > 1e2)'])

Upon inspection of our "hot_ad" object, we can still get the same results as we got with the boolean masks example above:

In [3]:
print ("Temperature of all cells:\n ad['temperature'] = \n%s\n" % ad["temperature"])
print ("Temperatures of all \"hot\" cells:\n hot_ad['temperature'] = \n%s" % hot_ad['temperature'])
Temperature of all cells:
 ad['temperature'] = 
[1.00000000e+00 1.00000000e+00 1.00000000e+00 ... 1.87798863e+07
 1.77985684e+07 1.73020029e+07] K

Temperatures of all "hot" cells:
 hot_ad['temperature'] = 
[ 6502480.87990464  7005697.58048854  7453051.94593615 ...
 18779886.27258056 17798568.39391833 17302002.90929025] K
In [4]:
print ("Density of dense, cool material:\n dense_and_cool_ad['density'] = \n%s\n" % dense_and_cool_ad['density'])
print ("Temperature of dense, cool material:\n dense_and_cool_ad['temperature'] = \n%s" % dense_and_cool_ad['temperature'])
Density of dense, cool material:
 dense_and_cool_ad['density'] = 
[5.14552665e-30 7.12750300e-30 5.63890577e-30 ... 5.80246287e-30
 9.08704636e-30 5.50266378e-30] g/cm**3

Temperature of dense, cool material:
 dense_and_cool_ad['temperature'] = 
[9.58828100e+04 9.36330508e+04 9.10436886e+04 ... 1.11450104e+00
 3.58217286e+04 1.99072499e+02] K

Now that we've constructed a cut_region, we can use it as a data source for further analysis. To create a plot based on a cut_region, use the data_source keyword argument provided by yt's plotting objects.

Here's an example using projections:

In [5]:
proj1 = yt.ProjectionPlot(ds, 'x', "density", weight_field="density")
proj1.annotate_title('No Cuts')
proj1.set_figure_size(5)
proj1.show()

proj2 = yt.ProjectionPlot(ds, 'x', "density", weight_field="density", data_source=hot_ad)
proj2.annotate_title('Hot Gas')
proj2.set_zlim("density", 3e-31, 3e-27)
proj2.set_figure_size(5)
proj2.show()