API#

Opening Measurement Sets#

Use the standard xarray.backends.api.open_dataset() and xarray.backends.api.open_datatree() functions to open a Measurement Set as a Dataset or a DataTree, respectively.

>>> import xarray
>>> dataset = xarray.open_dataset(
...     "/data/data.ms",
...     partition_schema=["DATA_DESC_ID", "FIELD_ID"])
>>> datatree = xarray.backends.api.open_datatree(
...     "/data/data.ms",
...     partition_schema=["DATA_DESC_ID", "FIELD_ID"])

These calls defer to the corresponding methods on the Entrypoint Class. Consult those method signatures for the extra arguments that can be passed.

Entrypoint Class#

Entrypoint class for the MSv2 backend.

class xarray_ms.backend.msv2.entrypoint.MSv2EntryPoint#

open_dataset() creates a Dataset presenting an MSv4 view over a partition of an MSv2 CASA Measurement Set.

Parameters:

filename_or_obj – The path to the MSv2 CASA Measurement Set file.
drop_variables – Variables to drop from the dataset.
partition_schema – The columns to use for partitioning the Measurement Set. Defaults to ['OBSERVATION_ID', 'PROCESSOR_ID', 'DATA_DESC_ID', 'OBS_MODE'].
partition_key – A key corresponding to an individual partition. For example (('DATA_DESC_ID', 0), ('FIELD_ID', 0)). If None, the first partition will be opened.
preferred_chunks – The preferred chunks for each partition.
auto_corrs – Include/Exclude auto-correlations.
ninstances – The number of Measurement Set instances to open for parallel I/O.
epoch – A unique string identifying the creation of this Dataset. This should not normally need to be set by the user.
structure_factory – A factory for creating MSv2Structure objects. This should not normally need to be set by the user.

Returns:

A Dataset referring to the unique partition specified by partition_schema and partition_key.
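
As a minimal sketch of how partition_schema and partition_key fit together, the following builds the keyword arguments for opening a single partition. The path /data/data.ms and the helper name open_partition are placeholders for illustration. xarray is imported lazily inside the helper so that the arguments can be inspected without a Measurement Set on disk.

```python
# Columns used to split the Measurement Set into partitions.
partition_schema = ["DATA_DESC_ID", "FIELD_ID"]

# A partition key is a tuple of (column, value) pairs,
# following the example given for the partition_key parameter.
partition_key = (("DATA_DESC_ID", 0), ("FIELD_ID", 0))

open_kwargs = {
    "partition_schema": partition_schema,
    "partition_key": partition_key,
}


def open_partition(path, **kwargs):
    """Hypothetical helper: open one partition as a Dataset."""
    import xarray  # imported lazily; requires xarray and xarray-ms

    return xarray.open_dataset(path, **kwargs)


# Example call (needs a real Measurement Set on disk):
# dataset = open_partition("/data/data.ms", **open_kwargs)
```

If partition_key is omitted, the first partition is opened, as described in the parameter list.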

open_datatree() creates a DataTree presenting an MSv4 view over multiple partitions of an MSv2 CASA Measurement Set.

Parameters:

filename_or_obj – The path to the MSv2 CASA Measurement Set file.

preferred_chunks – Chunk sizes along each dimension, e.g. {"time": 10, "frequency": 16}. Individual partitions can be chunked differently by partially (or fully) specifying a partition key, for example:

{  # Applies to all partitions with the relevant DATA_DESC_ID
  (("DATA_DESC_ID", 0),): {"time": 10, "frequency": 16},
  (("DATA_DESC_ID", 1),): {"time": 20, "frequency": 32},
}
{  # Applies to all partitions with the relevant DATA_DESC_ID and FIELD_ID
  (("DATA_DESC_ID", 0), ("FIELD_ID", 1)): {"time": 10, "frequency": 16},
  (("DATA_DESC_ID", 1), ("FIELD_ID", 0)): {"time": 20, "frequency": 32},
}
{  # String variants
  "DATA_DESC_ID=0,FIELD_ID=0": {"time": 10, "frequency": 16},
  "D=0,F=1": {"time": 20, "frequency": 32},
}

Note

xarray’s reserved chunks argument must also be specified for preferred_chunks to take effect and for fine-grained chunking to be applied in Datasets and DataTrees. See xarray’s backend documentation on Preferred chunk sizes for more information.

drop_variables – Variables to drop from the dataset.
partition_schema – The columns to use for partitioning the Measurement Set. Defaults to ['OBSERVATION_ID', 'PROCESSOR_ID', 'DATA_DESC_ID', 'OBS_MODE'].
auto_corrs – Include/Exclude auto-correlations.
ninstances – The number of Measurement Set instances to open for parallel I/O.
epoch – A string uniquely identifying this Dataset. This should not normally be set by the user.

Returns:

An xarray DataTree
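
To illustrate what "partially specifying a partition key" means for preferred_chunks, here is a small sketch in which a partial key applies to every partition whose full key contains all of its (column, value) pairs. The helpers matches and chunks_for are hypothetical and only model the idea. They are not taken from xarray-ms, whose actual resolution rules may differ.

```python
def matches(partial_key, partition_key):
    """Return True if every (column, value) pair in partial_key
    also appears in the full partition_key."""
    return set(partial_key) <= set(partition_key)


def chunks_for(partition_key, preferred_chunks):
    """Collect the chunk sizes whose partial key matches partition_key.

    Illustrative only: later matching entries override earlier ones.
    """
    chunks = {}
    for partial_key, sizes in preferred_chunks.items():
        if matches(partial_key, partition_key):
            chunks.update(sizes)
    return chunks


preferred_chunks = {
    (("DATA_DESC_ID", 0),): {"time": 10, "frequency": 16},
    (("DATA_DESC_ID", 1),): {"time": 20, "frequency": 32},
}

# A full key for one partition under a DATA_DESC_ID/FIELD_ID schema.
full_key = (("DATA_DESC_ID", 1), ("FIELD_ID", 0))
print(chunks_for(full_key, preferred_chunks))
# {'time': 20, 'frequency': 32}
```

When passing such a mapping to open_datatree(), remember that xarray's chunks argument must also be supplied, as the note on preferred_chunks explains.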

Partitioning Schema#

The default partitioning schema contains the following columns:

xarray_ms.backend.msv2.structure.DEFAULT_PARTITION_COLUMNS: List[str] = ['OBSERVATION_ID', 'PROCESSOR_ID', 'DATA_DESC_ID', 'OBS_MODE']#

Default partitioning column schema.

Partitioning always uses these columns, but additional columns can be selected from the valid partitioning columns if finer-grained partitioning is required:

xarray_ms.backend.msv2.structure.VALID_PARTITION_COLUMNS: List[str] = ['DATA_DESC_ID', 'OBSERVATION_ID', 'PROCESSOR_ID', 'FIELD_ID', 'SCAN_NUMBER', 'STATE_ID', 'SOURCE_ID', 'OBS_MODE', 'SUB_SCAN_NUMBER']#

Valid partitioning columns.
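
The relationship between the two lists can be sketched as follows. The column lists are copied from the listings above rather than imported from xarray_ms, so the snippet runs standalone. The helper effective_schema is hypothetical, and the ordering of the combined schema is an assumption made for illustration.

```python
DEFAULT_PARTITION_COLUMNS = [
    "OBSERVATION_ID", "PROCESSOR_ID", "DATA_DESC_ID", "OBS_MODE",
]
VALID_PARTITION_COLUMNS = [
    "DATA_DESC_ID", "OBSERVATION_ID", "PROCESSOR_ID", "FIELD_ID",
    "SCAN_NUMBER", "STATE_ID", "SOURCE_ID", "OBS_MODE", "SUB_SCAN_NUMBER",
]


def effective_schema(extra_columns):
    """Hypothetical helper: the default columns followed by any
    additional valid columns that were requested."""
    invalid = [c for c in extra_columns if c not in VALID_PARTITION_COLUMNS]
    if invalid:
        raise ValueError(f"Invalid partitioning columns: {invalid}")
    schema = list(DEFAULT_PARTITION_COLUMNS)
    for column in extra_columns:
        if column not in schema:
            schema.append(column)
    return schema


print(effective_schema(["DATA_DESC_ID", "FIELD_ID"]))
# ['OBSERVATION_ID', 'PROCESSOR_ID', 'DATA_DESC_ID', 'OBS_MODE', 'FIELD_ID']
```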

Note that OBS_MODE and SUB_SCAN_NUMBER are columns in the STATE subtable, while SOURCE_ID is a column of the FIELD subtable. Partitioning on these columns is achieved by joining on the STATE_ID and FIELD_ID columns, respectively.
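
The join described above can be pictured with a toy example. This is a sketch using plain Python lists with made-up values, not the xarray-ms implementation. It assumes only that STATE_ID in the main table is a row index into the STATE subtable.

```python
from collections import defaultdict

# Hypothetical STATE subtable: the row index is the STATE_ID.
state_obs_mode = ["CALIBRATE_BANDPASS", "OBSERVE_TARGET"]

# Hypothetical main-table STATE_ID column, one entry per row.
main_state_id = [0, 1, 1, 0, 1]

# "Join": look up each row's OBS_MODE through its STATE_ID.
row_obs_mode = [state_obs_mode[state_id] for state_id in main_state_id]

# Group main-table row numbers by the joined OBS_MODE value.
partitions = defaultdict(list)
for row, obs_mode in enumerate(row_obs_mode):
    partitions[obs_mode].append(row)

print(dict(partitions))
# {'CALIBRATE_BANDPASS': [0, 3], 'OBSERVE_TARGET': [1, 2, 4]}
```

The same lookup pattern applies to SOURCE_ID, with FIELD_ID indexing the FIELD subtable.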