pytorch_pfn_extras.dataset.TabularDataset#

class pytorch_pfn_extras.dataset.TabularDataset(*args, **kwds)#

Bases: Dataset

An abstract class that represents a tabular dataset.

In a tabular dataset, all examples have the same number of elements. For example, every example of the dataset below has three elements (a[i], b[i], and c[i]).

       a       b       c
0      a[0]    b[0]    c[0]
1      a[1]    b[1]    c[1]
2      a[2]    b[2]    c[2]
3      a[3]    b[3]    c[3]

Since an example can be represented either as a tuple, (a[i], b[i], c[i]), or as a dict, {'a': a[i], 'b': b[i], 'c': c[i]}, this class uses mode to indicate which representation is used. If there is only one column, an example can also be represented by a bare value (a[i]). In this case, mode is None.
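The three representations can be illustrated with plain Python objects. This is only a sketch of the shapes involved: it does not call the library, and the values in row are hypothetical stand-ins for a[0], b[0], and c[0].

```python
# Plain-Python illustration of the three representations; no library calls.
keys = ("a", "b", "c")
row = (0, 1, 2)  # hypothetical values standing in for a[0], b[0], c[0]

as_tuple = row                  # representation when mode is tuple
as_dict = dict(zip(keys, row))  # representation when mode is dict
as_value = row[0]               # representation when mode is None (one column only)

print(as_tuple)
print(as_dict)
print(as_value)
```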

A subclass should implement __len__(), keys, mode, and get_examples().

>>> import numpy as np
>>>
>>> from pytorch_pfn_extras import dataset
>>>
>>> class MyDataset(dataset.TabularDataset):
...
...     def __len__(self):
...         return 4
...
...     @property
...     def keys(self):
...         return ('a', 'b', 'c')
...
...     @property
...     def mode(self):
...         return tuple
...
...     def get_examples(self, indices, key_indices):
...         data = np.arange(12).reshape((4, 3))
...         if indices is not None:
...             data = data[indices]
...         if key_indices is not None:
...             data = data[:, list(key_indices)]
...         return tuple(data.transpose())
...
>>> dataset = MyDataset()
>>> len(dataset)
4
>>> dataset.keys
('a', 'b', 'c')
>>> dataset.astuple()[0]
(0, 1, 2)
>>> sorted(dataset.asdict()[0].items())
[('a', 0), ('b', 1), ('c', 2)]
>>>
>>> view = dataset.slice[[3, 2], ('c', 0)]
>>> len(view)
2
>>> view.keys
('c', 'a')
>>> view.astuple()[1]
(8, 6)
>>> sorted(view.asdict()[1].items())
[('a', 6), ('c', 8)]

Methods

__init__()

asdict()

Return a view with dict mode.

astuple()

Return a view with tuple mode.

concat(*datasets)

Stack datasets along rows.

convert(data)

Convert fetched data.

fetch()

Fetch data.

get_example(i)

get_examples(indices, key_indices)

Return a part of data.

join(*datasets)

Stack datasets along columns.

transform(keys, transform)

Apply a transform to each example.

transform_batch(keys, transform_batch)

Apply a transform to examples.

with_converter(converter)

Override the behaviour of convert().

Attributes

keys

Names of columns.

mode

Mode of representation.

slice

Get a slice of dataset.

asdict()#

Return a view with dict mode.

Returns:

A view whose mode is dict.

astuple()#

Return a view with tuple mode.

Returns:

A view whose mode is tuple.

concat(*datasets)#

Stack datasets along rows.

Parameters:

datasets (iterable of TabularDataset) – Datasets to be concatenated. All datasets must have the same keys.

Returns:

A concatenated dataset.
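What stacking along rows means can be sketched on column-major tables held as dicts of lists. The helper concat_rows below is hypothetical and is not the library's implementation. Comparing keys as sets is an assumption about how "the same keys" is checked.

```python
def concat_rows(*tables):
    """Stack column-major tables (dicts of lists) along rows."""
    keys = tuple(tables[0])
    for table in tables[1:]:
        # Assumption: "the same keys" is checked without regard to order.
        if set(table) != set(keys):
            raise ValueError("all tables must have the same keys")
    # Append each table's column to the matching column of the result.
    return {key: [v for table in tables for v in table[key]] for key in keys}


first = {"a": [0, 1], "b": [10, 11]}
second = {"a": [2, 3, 4], "b": [12, 13, 14]}
stacked = concat_rows(first, second)
print(stacked)
```

The result has the same columns as the inputs and as many rows as the inputs combined.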

convert(data)#

Convert fetched data.

This method takes the data fetched by fetch() and preprocesses it before it is passed to a model. The default behaviour is to convert each column into an ndarray. This behaviour can be overridden with with_converter(). If the dataset is constructed by concat() or join(), the converter of the first dataset is used.

Parameters:

data (tuple or dict) – Data from fetch().

Returns:

A tuple or dict. Each value is an ndarray.

fetch()#

Fetch data.

This method fetches all data of the dataset/view. Note that this method returns column-major data, i.e. ([a[0], ..., a[3]], ..., [c[0], ..., c[3]]), {'a': [a[0], ..., a[3]], ..., 'c': [c[0], ..., c[3]]}, or [a[0], ..., a[3]].

Returns:

If mode is tuple, this method returns a tuple of lists/arrays. If mode is dict, this method returns a dict of lists/arrays. If mode is None, this method returns a single list/array.
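The difference between row-major examples and the column-major layout described above can be shown with plain lists. This sketch does not call fetch(). It only rearranges hypothetical rows into the three shapes.

```python
# Four hypothetical examples in row-major order: one tuple per row.
rows = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11)]
keys = ("a", "b", "c")

# Column-major layout: one list per column.
columns_tuple = tuple(list(column) for column in zip(*rows))  # shape for mode tuple
columns_dict = dict(zip(keys, columns_tuple))                 # shape for mode dict
single_column = columns_tuple[0]                              # shape for mode None

print(columns_tuple)
print(columns_dict)
print(single_column)
```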

get_example(i)#
get_examples(indices, key_indices)#

Return a part of data.

Parameters:
  • indices (list of ints or slice) – Indices of requested rows. If this argument is None, it indicates all rows.

  • key_indices (tuple of ints) – Indices of requested columns. If this argument is None, it indicates all columns.

Returns:

tuple of lists/arrays

join(*datasets)#

Stack datasets along columns.

Parameters:

datasets (iterable of TabularDataset) – Datasets to be joined. All datasets must have the same length.

Returns:

A joined dataset.
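Stacking along columns can be sketched in the same plain-Python style. The helper join_columns is hypothetical and is not the library's implementation. Rejecting duplicate column names is an assumption that the text above does not state.

```python
def join_columns(*tables):
    """Stack column-major tables (dicts of lists) along columns."""
    length = len(next(iter(tables[0].values())))
    joined = {}
    for table in tables:
        for key, column in table.items():
            if len(column) != length:
                raise ValueError("all tables must have the same length")
            if key in joined:
                # Assumption: duplicate column names are rejected.
                raise ValueError("duplicate key: {!r}".format(key))
            joined[key] = list(column)
    return joined


left = {"a": [0, 1, 2], "b": [3, 4, 5]}
right = {"c": [6, 7, 8]}
joined = join_columns(left, right)
print(joined)
```

The result has the same number of rows as each input and the columns of all inputs side by side.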

property keys#

Names of columns.

A tuple of strings that indicate the names of columns.

property mode#

Mode of representation.

This indicates the type of value returned by fetch() and __getitem__(). tuple, dict, and None are supported.

property slice#

Get a slice of dataset.

Parameters:
  • indices (list/array of ints/bools or slice) – Requested rows.

  • keys (tuple of ints/strs or int or str) – Requested columns.

Returns:

A view of specified range.
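How a mixed key specification such as ('c', 0) maps to columns can be sketched as follows. The helper resolve_keys is hypothetical and only mirrors the behaviour shown in the class example, where slice[[3, 2], ('c', 0)] yields a view with keys ('c', 'a').

```python
def resolve_keys(all_keys, requested):
    """Translate a mix of column names and positions into column indices."""
    if isinstance(requested, (int, str)):
        requested = (requested,)
    return tuple(
        key if isinstance(key, int) else all_keys.index(key)
        for key in requested
    )


all_keys = ("a", "b", "c")
rows = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11)]

key_indices = resolve_keys(all_keys, ("c", 0))
view_keys = tuple(all_keys[i] for i in key_indices)
# Select rows 3 and 2, in that order, keeping only the requested columns.
view_rows = [tuple(rows[i][j] for j in key_indices) for i in (3, 2)]

print(key_indices)
print(view_keys)
print(view_rows)
```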

transform(keys, transform)#

Apply a transform to each example.

The transformations are given as a list in which each element is a tuple holding a transformation signature and a callable that performs the transformation.

The transformation signature is a tuple of two elements: the first is the keys of the dataset that are taken as inputs, and the second is the output keys it produces, which correspond to the keys argument.

When multiple transformations are specified, their outputs must be disjoint; otherwise a ValueError is raised.

Parameters:
  • keys (tuple of strs) – The keys of transformed examples.

  • transform (list of tuples) – A list where each element is a tuple of the transformation signature and a callable that takes an example and returns a transformed example. The mode of the transformed dataset is determined by the transformed examples.

Returns:

A transformed dataset.
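The signature structure and the disjoint-output rule can be sketched without the library. The helper apply_transforms is hypothetical. It assumes that each callable receives its input values as positional arguments and returns a tuple of outputs, which the description above does not confirm.

```python
def apply_transforms(example, transforms):
    """Apply ((input_keys, output_keys), fn) pairs to one dict example."""
    seen = set()
    for (_, output_keys), _fn in transforms:
        overlap = seen.intersection(output_keys)
        if overlap:
            raise ValueError("outputs must be disjoint: {}".format(sorted(overlap)))
        seen.update(output_keys)

    result = {}
    for (input_keys, output_keys), fn in transforms:
        # Assumption: inputs are passed positionally, outputs come back as a tuple.
        outputs = fn(*(example[key] for key in input_keys))
        result.update(zip(output_keys, outputs))
    return result


example = {"a": 1, "b": 2, "c": 3}
transforms = [
    ((("a", "b"), ("s",)), lambda a, b: (a + b,)),
    ((("c",), ("c2",)), lambda c: (c * 2,)),
]
transformed = apply_transforms(example, transforms)
print(transformed)
```

Here the keys of the transformed example would be ('s', 'c2'), the union of the output keys of the two signatures.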

transform_batch(keys, transform_batch)#

Apply a transform to examples.

The transformations are given as a list in which each element is a tuple holding a transformation signature and a callable that performs the transformation.

The transformation signature is a tuple of two elements: the first is the keys of the dataset that are taken as inputs, and the second is the output keys it produces, which correspond to the keys argument.

When multiple transformations are specified, their outputs must be disjoint; otherwise a ValueError is raised.

Parameters:
  • keys (tuple of strs) – The keys of transformed examples.

  • transform_batch (list of tuples) – A list where each element is a tuple of the transformation signature and a callable that takes a batch of examples and returns a batch of transformed examples. The mode of the transformed dataset is determined by the transformed examples.

Returns:

A transformed dataset.

with_converter(converter)#

Override the behaviour of convert().

This method overrides convert().

Parameters:

converter (callable) – A new converter.

Returns:

A dataset with the new converter.