data.webtools.DataChecker
Class · Source
dchecker = mdnc.data.webtools.DataChecker(
root='./datasets', set_list_file='web-data', token='', verbose=False
)
This data checker checks the local dataset folder, finds the datasets that do not exist locally, and fetches those required datasets from online repositories or links.
The workflow is illustrated in the following figure:
flowchart LR
subgraph dchecker [DataChecker]
init(__init__)
add(add_query_file)
query(query)
end
init -->|load| webdata[(set_list_file)]
add -->|load| fnames[(file_names)]
query --> start
webdata --> |set_names| start
fnames --> |query_list| ifblock
flow:::flowstyle
subgraph flow [query work flow]
start([for each<br>dataset]) --> |set_names| eachset([for each<br>item])
eachset -->|set_name| ifblock{set_name<br>in<br>query_list?}:::ifstyle
ifblock -->|yes| ifblock2{file<br>exists?}:::ifstyle
ifblock2 --> |yes| eachset
ifblock2 --> |no| download[Download<br>the dataset]
download --> start
end
classDef ifstyle fill:#eee, stroke: #999;
classDef flowstyle fill:#FEEEF0, stroke: #b54051;
To use this class, follow three steps:
- Initialize the DataChecker with the set_list_file argument, which is a json file. This file defines where the online datasets are stored and what those datasets contain.
- Use add_query_file to add the required data file names.
- Invoke query. This method iterates over the dataset list, then finds and downloads every online dataset that satisfies both of the following conditions:
  - Has a file item that does not exist locally.
  - Has a file item that is required by the query list.
A private repository requires a token. In this case, the argument token must not be blank.
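The selection rule from the flowchart and the steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: the helper name find_sets_to_download, the asset key, and the in-memory set_list are hypothetical; only set_list and items appear in the config description on this page.

```python
import os
import tempfile


def find_sets_to_download(root, set_list, query_list):
    """Return the archives holding a queried file that is missing under root.

    Hypothetical helper that mirrors the flowchart, not the real mdnc code.
    """
    required = set(query_list)
    to_download = []
    for dataset in set_list:  # for each dataset
        for item in dataset['items']:  # for each item
            if item not in required:  # set_name in query_list?
                continue
            if os.path.isfile(os.path.join(root, item)):  # file exists?
                continue
            to_download.append(dataset['asset'])
            break  # one missing required item is enough to fetch the archive
    return to_download


# Assumed config content; 'asset' is an illustrative key name.
set_list = [
    {'asset': 'set-a.tar.xz', 'items': ['a1.txt', 'a2.txt']},
    {'asset': 'set-b.tar.xz', 'items': ['b1.txt']},
    {'asset': 'set-c.tar.xz', 'items': ['c1.txt']},
]

with tempfile.TemporaryDirectory() as root:
    # a1.txt already exists locally; b1.txt and c1.txt do not.
    open(os.path.join(root, 'a1.txt'), 'w').close()
    pending = find_sets_to_download(root, set_list, ['a1.txt', 'b1.txt'])

print(pending)
```

In this sketch, an archive is skipped when none of its items is queried, even if its files are missing locally, which matches the two conditions listed above.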
Arguments
Requires
Argument | Type | Description |
---|---|---|
root | str | The root path of all maintained local datasets. |
set_list_file | str | A json file recording the online repository paths of the required datasets. The file name extension could be omitted. |
token | str | The default GitHub OAuth token for downloading files from private repositories. If not set, downloading from public repositories is not affected. To learn how to set the token, please refer to mdnc.data.webtools.get_token . |
verbose | bool | A flag indicating whether to show the downloaded size during the web request. |
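Since the extension of set_list_file could be omitted, the name presumably gets completed before the file is opened. A minimal sketch of such a rule, assuming the extension is .json; the helper resolve_set_list_file is hypothetical and not part of the documented API:

```python
import os


def resolve_set_list_file(set_list_file):
    """Append '.json' unless the name already ends with it (hypothetical helper)."""
    ext = os.path.splitext(set_list_file)[1]
    if ext.lower() == '.json':
        return set_list_file
    return set_list_file + '.json'


print(resolve_set_list_file('web-data'))
print(resolve_set_list_file('web-data.json'))
```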
Methods
init_set_list
dchecker.init_set_list(file_name='web-data')
This method should be called manually by users. It creates an initialized .json config file for the DataChecker.
Requires
Argument | Type | Description |
---|---|---|
file_name | str | The name of the to-be-created dataset config file. |
clear
dchecker.clear()
Clear the query list. The query list is a list of required dataset names. This method does not need to be used frequently, because DataChecker may only need to be invoked once.
add_query_file
dchecker.add_query_file(file_names)
Add one or more file names to the query list, which is the list of required dataset names. For each application, the required datasets could be different. The query file list should be a subset of the whole list given by set_list_file.
Requires
Argument | Type | Description |
---|---|---|
file_names | str or (str, ) | One file name str or a list of file name strs, covering all required dataset names for the current program. |
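An argument that accepts either one str or a sequence of strs is usually normalized before it is merged into the query list. The following is a sketch of that pattern only; normalize_file_names and the file names are hypothetical and do not come from the library:

```python
def normalize_file_names(file_names):
    """Return a list of names from one str or a sequence of strs (hypothetical helper)."""
    if isinstance(file_names, str):
        # A bare str must be wrapped, otherwise it would be iterated per character.
        return [file_names]
    return list(file_names)


# Illustrative query list; a set keeps repeated names from being stored twice.
query_list = set()
query_list.update(normalize_file_names('dataset_file_name_01.txt'))
query_list.update(normalize_file_names(
    ('dataset_file_name_02.txt', 'dataset_file_name_01.txt')
))

print(sorted(query_list))
```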
query
dchecker.query()
Search the files in the query list, and download the datasets.
Properties
token
dchecker.token
Check or set the GitHub OAuth token.
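For context, a GitHub OAuth token is normally sent in the Authorization header of a web request. The sketch below shows that general mechanism with the standard library and sends nothing over the network. How DataChecker itself passes the token is an assumption here; build_request, the URL, and the token value are placeholders.

```python
import urllib.request


def build_request(url, token=''):
    """Build a request, attaching the token only when it is not blank."""
    request = urllib.request.Request(url)
    if token:
        # GitHub accepts OAuth tokens in the form "Authorization: token <TOKEN>".
        request.add_header('Authorization', 'token ' + token)
    return request


# Placeholder URL and token; no request is actually sent.
url = 'https://api.github.com/repos/OWNER/REPO/releases'
public = build_request(url)
private = build_request(url, token='PLACEHOLDER')

print(public.get_header('Authorization'))
print(private.get_header('Authorization'))
```

A blank token leaves the request unauthenticated, which is enough for public repositories.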
Examples
Here we show an example of creating and using the config file.
Example
data.webtools: There are required dataset missing. Start downloading from the online repository...
Get test-datasets-1.tar.xz: 216B [00:00, 108kB/s]
data.webtools: Successfully download all required datasets.
The config file should be formatted like the following json examples:
Example of meta-data
In this file, set_list contains a list of dictionaries. Each dictionary represents an xz file, and its items keyword lists the file names inside that xz file.
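A config with this shape can be read with the standard json module. The sketch below is a hedged illustration: set_list and items follow the description above and the archive name is taken from the log output shown earlier, while the asset key and the item names are assumptions made for the example.

```python
import json

# Assumed config content; the 'asset' key and the item names are illustrative.
config_text = '''
{
    "set_list": [
        {
            "asset": "test-datasets-1.tar.xz",
            "items": ["dataset_file_name_01.txt", "dataset_file_name_02.txt"]
        }
    ]
}
'''

config = json.loads(config_text)

# Map each file name to the xz archive that contains it.
item_to_archive = {
    item: entry['asset']
    for entry in config['set_list']
    for item in entry['items']
}

print(item_to_archive)
```

Such a mapping makes it straightforward to look up which archive has to be fetched for a queried file name.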