
data.h5py.H5GParser

Class · Context · Source

dset = mdnc.data.h5py.H5GParser(
    file_name, keywords, batch_size=32, shuffle=True, shuffle_seed=1000,
    preprocfunc=None, num_workers=4, num_buffer=10
)

Group-wise parsing dataset. This class allows users to feed one .h5 file and convert it to an mdnc.data.sequence.MPSequence. The workflow could be described as follows:

  1. Create an .h5 file indexer. This indexer would be initialized by sequence.MPSequence, and it would use the user-defined keywords to fetch a group of h5py.Dataset objects.
  2. Estimate the h5py.Dataset sizes; each dataset should share the same length (but could have different shapes).
  3. Use the dataset size to create a sequence.MPSequence, and allow it to randomly shuffle the indices in each epoch.
  4. Invoke the sequence.MPSequence APIs to serve the parallel dataset parsing.

Certainly, you could also use this parser to load a single h5py.Dataset. For details about the parallel parsing workflow, please check mdnc.data.sequence.MPSequence.
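For instance, a minimal end-to-end usage could look like the following sketch. The file name 'test_dataset' and the keywords 'x', 'y' are hypothetical placeholders; any .h5 file containing equally sized datasets under those keywords would work.

import mdnc

# Hypothetical file "test_dataset.h5" with two equally sized datasets "x" and "y".
dset = mdnc.data.h5py.H5GParser('test_dataset', ['x', 'y'],
                                batch_size=32, shuffle=True, num_workers=4)
with dset.start() as p:
    for x, y in p:
        ...  # each mini-batch is unpacked in the order of the given keywords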

Arguments

Requires

| Argument | Type | Description |
| :------- | :--- | :---------- |
| file_name | str | The path of the .h5 file (could be without the postfix). |
| keywords | (str, ) | A list of keywords (or a single keyword). |
| batch_size | int | Number of samples in each mini-batch. |
| shuffle | bool | If enabled, shuffle the dataset at the beginning of each epoch. |
| shuffle_seed | int | The seed for random shuffling. |
| preprocfunc | object | This function would be applied to the produced data, serving as a pre-processing tool. Note that it processes the mini-batches produced by the parser. The details of this argument are explained in the following tips. |
| num_workers | int | The number of parallel workers. |
| num_buffer | int | The buffer size of the data pool, i.e. the maximal number of mini-batches stored in the memory. |
Tip

The minimal requirement for the argument preprocfunc is to be a function, or an object implementing the __call__() method. This function accepts all input mini-batch variables formatted as np.ndarray, and returns the pre-processed results. The number of returned variables could differ from the number of input variables. In some cases, you could use the pre-processors provided in the mdnc.data.preprocs module. The processors in this module support our Broadcasting Pre- and Post- Processor Protocol. For example:

Example

import mdnc

def preprocfunc(x1, x2):
    return x1 + x2

mdnc.data.h5py.H5GParser(..., keywords=['x_1', 'x_2'],
                         preprocfunc=preprocfunc)
import mdnc

class PreprocWithArgs:
    def __init__(self, a):
        self.a = a

    def __call__(self, x1, x2):
        return x1, self.a * x2

mdnc.data.h5py.H5GParser(..., keywords=['x_1', 'x_2'],
                         preprocfunc=PreprocWithArgs(a=0.1))
import mdnc

mdnc.data.h5py.H5GParser(..., keywords=['x_1', 'x_2'],
                         preprocfunc=mdnc.data.preprocs.ProcScaler())

Warning

The argument preprocfunc is required to be a picklable object. Therefore, a lambda function, or a function defined inside the if __name__ == '__main__' block, is not allowed in this case.
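As an illustration of this requirement, the following sketch (using the hypothetical keywords from the examples above) contrasts an allowed module-level function with definitions that would fail to pickle:

import mdnc

def scale_batch(x1, x2):  # defined at module level: picklable, allowed
    return 0.5 * x1, 0.5 * x2

mdnc.data.h5py.H5GParser(..., keywords=['x_1', 'x_2'],
                         preprocfunc=scale_batch)

# Not allowed: a lambda cannot be pickled.
# mdnc.data.h5py.H5GParser(..., preprocfunc=lambda x1, x2: (0.5 * x1, 0.5 * x2))

# Not allowed: a function defined inside "if __name__ == '__main__':" cannot be
# pickled by the worker processes either.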

Methods

check_dsets

sze = dset.check_dsets(file_path, keywords)

Check the size of the h5py.Dataset objects and validate all datasets. A valid group of datasets requires each h5py.Dataset to share the same length (sample number). On success, the method returns the size of the datasets. This method is invoked during the initialization, so users do not need to call it explicitly.

Requires

| Argument | Type | Description |
| :------- | :--- | :---------- |
| file_path | str | The path of the HDF5 dataset to be validated. |
| keywords | (str, ) | The keywords to be validated. Each keyword should point or redirect to an h5py.Dataset. |

Returns

| Argument | Description |
| :------- | :---------- |
| sze | An int, the shared size (sample number) of all datasets. |
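Although this method is called automatically, it could also be invoked manually, for example to validate another file that follows the same keyword layout (the file name below is hypothetical):

sze = dset.check_dsets('another_dataset.h5', ['one', 'zero'])
print('data.h5py: all datasets share {0} samples.'.format(sze))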

get_attrs

attrs = dset.get_attrs(keyword, *args, attr_names=None)

Get the attributes by the keyword.

Requires

| Argument | Type | Description |
| :------- | :--- | :---------- |
| keyword | str | The keyword of an h5py.Dataset in the to-be-loaded file. |
| attr_names | (str, ) | A sequence of required attribute names. |
| *args | | Other attribute names; they would be appended to the argument attr_names by list.extend(). |

Returns

| Argument | Description |
| :------- | :---------- |
| attrs | A list of the required attribute values. |
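For instance, assuming the dataset 'one' carries the attributes 'mean' and 'std' (hypothetical names), the two calls below would be equivalent:

mean, std = dset.get_attrs('one', 'mean', 'std')
mean, std = dset.get_attrs('one', attr_names=['mean', 'std'])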

get_file

f = dset.get_file(enable_write=False)

Get a file object of the to-be-loaded file.

Requires

| Argument | Type | Description |
| :------- | :--- | :---------- |
| enable_write | bool | If enabled, the file would be opened in the a (append) mode. Otherwise, the r (read-only) mode is used. |

Returns

| Argument | Description |
| :------- | :---------- |
| f | The h5py.File object of the to-be-loaded file. |
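The returned h5py.File could be used directly, for example to inspect the stored shapes (the keywords follow the examples below):

f = dset.get_file()  # opened in the r mode by default
print(f['one'].shape, f['zero'].shape)
f.close()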

start

dset.start(compat=None)

Start the process pool. This method is implemented by mdnc.data.sequence.MPSequence. It supports context management.

Running start() or start_test() would interrupt the started sequence.

Requires

| Argument | Type | Description |
| :------- | :--- | :---------- |
| compat | bool | Whether to fall back to multi-threading for the sequence out-type converter. If set to None, the decision would be made by checking os.name. The compatible mode is required on Windows. |
Tip

This method supports context management. Using the context is recommended. Here we show two examples:

dset.start()
for ... in dset:
    ...
dset.finish()
with dset.start() as ds:
    for ... in ds:
        ...
Danger

A cuda.Tensor could not be put into the queue on Windows (but it could on Linux), see

https://pytorch.org/docs/stable/notes/windows.html#cuda-ipc-operations

To solve this problem, we need to fall back to multi-threading for the sequence out-type converter on Windows.
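If the automatic detection does not fit a particular setup, the fallback could also be requested explicitly; the sketch below forces it on Windows only (compat=None would already make this decision automatically):

import os

dset.start(compat=(os.name == 'nt'))  # multi-threading fallback on Windows only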

Warning

Even if you set shuffle=False, due to the mechanism of the parallelization, the sample order during the iteration may still get slightly shuffled. To ensure that your sample order is not changed, please use shuffle=False during the initialization and use start_test() instead.


start_test

dset.start_test(test_mode='default')

Start the test mode. In the test mode, the process pool would not be opened. All operations would be finished in the main thread. However, the random indices are still generated with the same seed as in the parallel dset.start() mode.

Running start() or start_test() would interrupt the started sequence.

Requires

| Argument | Type | Description |
| :------- | :--- | :---------- |
| test_mode | str | Could be 'default', 'cpu', or 'numpy' (see below). |

  • 'default': the output would be converted as in the start() mode.
  • 'cpu': even if 'cuda' is set as the output type, the testing output would not be moved to the GPU.
  • 'numpy': would ignore all out_type configurations and return the original output. This output is still pre-processed.
Tip

This method also supports context management. See start() to check how to use it.
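For instance, a deterministic single-thread pass over the data could look like this sketch, where the 'numpy' mode keeps the pre-processed np.ndarray outputs:

with dset.start_test(test_mode='numpy') as p:
    for x_one, x_zero in p:
        ...  # pre-processed np.ndarray mini-batches, produced in the main thread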


finish

dset.finish()

Finish the process pool. The compatible mode would be auto-detected according to the previous start().

Properties

len(), batch_num

len(dset)
dset.batch_num

The length of the dataset. It is the number of mini-batches, which is also the number of iterations for each epoch.
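Assuming the trailing incomplete mini-batch is kept (as suggested by the batch_size note below), the value would relate to size and batch_size as in this sketch:

import math

assert len(dset) == dset.batch_num == math.ceil(dset.size / dset.batch_size)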


iter()

for x1, x2, ... in dset:
    ...

The iterator. It is recommended to use it inside the context. The unpacked variables x1, x2, ... are ordered according to the argument keywords given during the initialization.


size

dset.size

The size of the dataset. It is the total number of samples for each epoch.


batch_size

dset.batch_size

The size of each mini-batch. This value is given by the argument batch_size during the initialization. The last mini-batch may be smaller than this value.


preproc

dset.preproc

The argument preprocfunc given during the initialization. This property allows users to invoke the pre-processor manually.
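For example, when one of the mdnc.data.preprocs processors is used, the property could be applied to arbitrary arrays and reverted afterwards (the arrays below are hypothetical; see also Example 2):

import numpy as np

x_one, x_zero = np.random.rand(3, 20), np.random.rand(3, 10)  # hypothetical mini-batch arrays
y_one, y_zero = dset.preproc(x_one, x_zero)                    # apply the pre-processing by hand
x_one_, x_zero_ = dset.preproc.postprocess(y_one, y_zero)      # revert it (preprocs processors only)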

Examples

Example 1
import os
import mdnc

root_folder = 'alpha-test'
os.makedirs(root_folder, exist_ok=True)

if __name__ == '__main__':
    # Prepare the datasets.
    set_list_file = os.path.join(root_folder, 'web-data')
    mdnc.data.webtools.DataChecker.init_set_list(set_list_file)
    dc = mdnc.data.webtools.DataChecker(root=root_folder, set_list_file=set_list_file, token='', verbose=False)
    dc.add_query_file('test_data_h5gparser.h5')
    dc.query()

    # Perform test.
    dset = mdnc.data.h5py.H5GParser(os.path.join(root_folder, 'test_data_h5gparser'), ['one', 'zero'],
                                    batch_size=3, num_workers=4, shuffle=True, preprocfunc=None)
    with dset.start() as p:
        for i, data in enumerate(p):
            print('data.h5py: Epoch 1, Batch {0}'.format(i), data[0].shape, data[1].shape)

        for i, data in enumerate(p):
            print('data.h5py: Epoch 2, Batch {0}'.format(i), data[0].shape, data[1].shape)
data.webtools: All required datasets are available.
data.h5py: Epoch 1, Batch 0 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 1, Batch 1 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 1, Batch 2 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 1, Batch 3 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 1, Batch 4 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 1, Batch 5 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 1, Batch 6 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 1, Batch 7 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 1, Batch 8 torch.Size([1, 20]) torch.Size([1, 10])
data.h5py: Epoch 2, Batch 0 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 2, Batch 1 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 2, Batch 2 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 2, Batch 3 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 2, Batch 4 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 2, Batch 5 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 2, Batch 6 torch.Size([1, 20]) torch.Size([1, 10])
data.h5py: Epoch 2, Batch 7 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 2, Batch 8 torch.Size([3, 20]) torch.Size([3, 10])
Example 2
import os
import numpy as np
import mdnc

root_folder = 'alpha-test'
os.makedirs(root_folder, exist_ok=True)

if __name__ == '__main__':
    # Prepare the datasets.
    set_list_file = os.path.join(root_folder, 'web-data')
    mdnc.data.webtools.DataChecker.init_set_list(set_list_file)
    dc = mdnc.data.webtools.DataChecker(root=root_folder, set_list_file=set_list_file, token='', verbose=False)
    dc.add_query_file('test_data_h5gparser.h5')
    dc.query()

    # Perform test.
    dset = mdnc.data.h5py.H5GParser(os.path.join(root_folder, 'test_data_h5gparser'), ['one', 'zero'],
                                    batch_size=3, num_workers=4, shuffle=True,
                                    preprocfunc=mdnc.data.preprocs.ProcScaler())
    with dset.start_test() as p:
        for i, (d_one, d_two) in enumerate(p):
            d_one, d_two = d_one.cpu().numpy(), d_two.cpu().numpy()
            std_one, std_two = np.std(d_one), np.std(d_two)
            d_one, d_two = p.preproc.postprocess(d_one, d_two)
            std_one_, std_two_ = np.std(d_one), np.std(d_two)
            print('Before: {0}, {1}; After: {2}, {3}.'.format(std_one, std_two, std_one_, std_two_))
data.webtools: All required datasets are available.
Before: 0.4213927686214447, 0.5810447931289673; After: 3.4976863861083984, 4.269893169403076.
Before: 0.47204485535621643, 0.5270004868507385; After: 3.2560627460479736, 5.232884407043457.
Before: 0.380888432264328, 0.5548458099365234; After: 2.69606876373291, 4.5017008781433105.
Before: 0.555243968963623, 0.5082056522369385; After: 3.231991767883301, 5.085717678070068.
Before: 0.39406657218933105, 0.5630286931991577; After: 2.8078441619873047, 5.10365629196167.
Before: 0.49584802985191345, 0.5255910754203796; After: 2.706739664077759, 5.646749019622803.
Before: 0.4346843361854553, 0.5725106000900269; After: 2.7871317863464355, 4.466533660888672.
Before: 0.5043540000915527, 0.5292088389396667; After: 2.373351573944092, 4.446733474731445.
Before: 0.46324262022972107, 0.6497944593429565; After: 2.350776433944702, 5.593009948730469.

Last update: March 14, 2021
