User Guide
grob creates groups of files that belong together. Each group has a unique key; within a group, files can be organized into tags.
Take, for example, the Rider Safety & Compliance dataset on Kaggle. For a given subset, it contains pairs of images and labels:
├── images
│   ├── new1.jpg
│   ├── new2.jpg
│   └── ...
└── labels
    ├── new1.txt
    ├── new2.txt
    └── ...
To use it, we need to join images with their corresponding label files:
{
"1": {"image": "images/new1.jpg", "labels": "labels/new1.txt"},
"2": {"image": "images/new2.jpg", "labels": "labels/new2.txt"},
...
}
Each group has a unique key: 1, 2 and so on. Within a group, files are organized into tags: image and labels in the above example.
Basics
Let's start by getting the image files:
~ grob "images/new*.jpg" train
[
"images/new1.jpg",
"images/new2.jpg",
...
]
We specified the glob pattern images/new*.jpg, and grob returned all matching files. Now we can specify which part of the path to use as the key by declaring a placeholder with braces:
~ grob "images/new{id}.jpg" train
{
"1": "images/new1.jpg",
"2": "images/new2.jpg",
...
}
images/new{id}.jpg will match the same files as images/new*.jpg, but it will tell grob to extract the matched value and use it as the id part of the key. We can now declare how to find the labels:
~ grob "image=images/new{id}.jpg, labels=labels/new{id}.txt" train
{
"1": {
"image": "images/new1.jpg",
"labels": "labels/new1.txt"
},
"2": {
"image": "images/new2.jpg",
"labels": "labels/new2.txt"
},
...
}
For each tag image and labels, grob will find the files matching the pattern, extract any named placeholder and join them into a single key.
It will then match together files that have the same key.
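The search-then-join behavior described above can be pictured in a few lines of Python. This is only an illustrative sketch, not grob's implementation: the helper `group_files` is hypothetical, the placeholders are written by hand as named regex groups, and the file list is given in memory rather than read from disk.

```python
import re
from collections import defaultdict


def group_files(paths, patterns):
    """Group paths by the `id` captured by each tag's pattern.

    `patterns` maps a tag name to a regular expression with a named
    group `id`, which plays the role of the `{id}` placeholder.
    """
    groups = defaultdict(dict)
    for tag, pattern in patterns.items():
        regex = re.compile(pattern)
        for path in paths:
            match = regex.fullmatch(path)
            if match:
                # Files whose captured id is equal end up in the same group.
                groups[match.group("id")][tag] = path
    return dict(groups)


paths = [
    "images/new1.jpg", "images/new2.jpg",
    "labels/new1.txt", "labels/new2.txt",
]
patterns = {
    "image": r"images/new(?P<id>[^/]*)\.jpg",
    "labels": r"labels/new(?P<id>[^/]*)\.txt",
}
groups = group_files(paths, patterns)
print(groups["1"])
# {'image': 'images/new1.jpg', 'labels': 'labels/new1.txt'}
```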
Working with incomplete groups
In practice, the command above would fail if one of the groups is missing any of the tags. You can change this behavior by passing:
- --optional will make all tags optional. The tag will contain null instead of an actual file.
- --remove-on-missing will remove all incomplete groups from the final output.
These two options accept a list of tags: for example, using --optional labels --remove-on-missing image would keep images without labels but remove labels without images.
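To make the two policies concrete, here is a small Python sketch of how incomplete groups could be handled once the files are grouped. It is not grob's code: `resolve_incomplete` is a hypothetical helper, and the choice to check the removal rule before the optional rule is an assumption made for this example.

```python
def resolve_incomplete(groups, tags, optional=(), remove_on_missing=()):
    """Apply an --optional / --remove-on-missing style policy to groups."""
    result = {}
    for key, files in groups.items():
        missing = [tag for tag in tags if tag not in files]
        if any(tag in remove_on_missing for tag in missing):
            continue  # drop the whole group
        not_allowed = [tag for tag in missing if tag not in optional]
        if not_allowed:
            # Default behavior: an incomplete group is an error.
            raise ValueError(f"group {key!r} is missing tags {not_allowed}")
        # Optional tags that are missing are filled with None (null in JSON).
        result[key] = {tag: files.get(tag) for tag in tags}
    return result


groups = {
    "1": {"image": "images/new1.jpg", "labels": "labels/new1.txt"},
    "2": {"image": "images/new2.jpg"},   # image without labels
    "3": {"labels": "labels/new3.txt"},  # labels without image
}
cleaned = resolve_incomplete(
    groups,
    tags=["image", "labels"],
    optional=["labels"],
    remove_on_missing=["image"],
)
print(cleaned)
```

With these settings, group 2 is kept with a null label and group 3 is removed, which matches the --optional labels --remove-on-missing image example above.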
More complex keys
You're not limited to a single placeholder per pattern. Our example dataset has two subsets, train and val. We want to get files from both subsets, but we don't want to group an image from the training set with a label from the validation set. We can use:
~ grob "image={subset}/images/new{id}.jpg, labels={subset}/labels/new{id}.txt" .
{
"train_1": {
"image": "train/images/new1.jpg",
"labels": "train/labels/new1.txt"
},
...
"val_1": {
"image": "val/images/new1.jpg",
"labels": "val/labels/new1.txt"
}
}
Here, placeholders {subset} and {id} are concatenated to build the final key. A custom formatter can be passed with the --key option, using standard Python format-string syntax (the same as in f-strings):
~ grob "image={subset}/images/new{id}.jpg, labels={subset}/labels/new{id}.txt" . --key "{subset}-{id:0>3}"
{
"train-001": {
"image": "train/images/new1.jpg",
"labels": "train/labels/new1.txt"
},
"train-002": {
"image": "train/images/new2.jpg",
"labels": "train/labels/new2.txt"
},
...
}
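The format specification used in the key above can be tried directly in Python. The snippet below only illustrates the format-string syntax with `str.format`; it assumes that placeholder values are passed to the template as strings, which is not confirmed by this guide.

```python
key_template = "{subset}-{id:0>3}"

# `0>3` right-aligns the value in a field of width 3, padded with "0".
key = key_template.format(subset="train", id="1")
print(key)  # train-001

other_key = key_template.format(subset="val", id="12")
print(other_key)  # val-012
```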
If you're not interested in the keys, but only in the files themselves, you can use --without-keys or -K:
~ grob "image={subset}/images/new{id}.jpg, labels={subset}/labels/new{id}.txt" . --without-keys
[
{
"image": "train/images/new1.jpg",
"labels": "train/labels/new1.txt"
},
...
]
More complex patterns
Patterns also support:
- ** for matching zero or more directories
- option groups (a|b|c) to match several options. Typically, use *.(jpg|png|gif) to match any of these extensions. Option groups cannot contain wildcards or placeholders.
- placeholder flags to restrict what a placeholder can match
For complete control, a regular expression can be passed instead of a glob-like pattern.
Placeholder flags
As explained above, a placeholder will match the same as * (i.e. any character but a slash). We can add placeholder flags to restrict which characters can be matched by a placeholder:
- d will only match digits
- a will only match alphanumeric characters
- >3 will match three characters or more
- <3 will match three characters or fewer
- 3 will match exactly three characters
- 3-9 will match three to nine characters
Placeholder flags must be indicated within the placeholder, after a colon :, e.g. {name:a>3}. Multiple flags can be combined.
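One way to read these flags is as a shorthand for a character class followed by a length constraint. The sketch below translates a flag string into a regular-expression fragment. `flags_to_regex` is a hypothetical helper written for this guide, not part of grob, and several details are assumptions: "alphanumeric" is taken to mean ASCII letters and digits, `<3` allows zero characters, and a placeholder without a length flag matches zero or more characters like `*`.

```python
import re


def flags_to_regex(flags):
    """Translate a flag string such as 'a>3' or 'd2' into a regex fragment."""
    match = re.fullmatch(r"(?P<cls>[da]?)(?P<len>\d+-\d+|\d+|[<>]\d+)?", flags)
    if match is None:
        raise ValueError(f"unsupported flags: {flags!r}")
    # Without a class flag, match anything but a slash, like `*`.
    char_class = {"d": r"\d", "a": "[A-Za-z0-9]", "": "[^/]"}[match.group("cls")]
    length = match.group("len")
    if length is None:
        quantifier = "*"
    elif length.startswith(">"):
        quantifier = "{%s,}" % length[1:]       # three characters or more
    elif length.startswith("<"):
        quantifier = "{0,%s}" % length[1:]      # three characters or fewer
    elif "-" in length:
        low, high = length.split("-")
        quantifier = "{%s,%s}" % (low, high)    # between low and high
    else:
        quantifier = "{%s}" % length            # exact length
    return char_class + quantifier


print(flags_to_regex("d2"))   # \d{2}
print(flags_to_regex("a>3"))  # [A-Za-z0-9]{3,}
print(bool(re.fullmatch(flags_to_regex("d2"), "01")))   # True
print(bool(re.fullmatch(flags_to_regex("d2"), "001")))  # False
```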
Regular expressions
For even more control, one can use regular expressions instead of glob-like patterns. Regular expressions must use the regular Python syntax. Placeholders must be indicated with named capturing groups (?P<placeholder_name>...).
To indicate that a pattern must be interpreted as a regular expression, it must end with a :r flag.
For example, this will only match images whose id starts or ends with a three:
~ grob "image=images/new(?P<id>\d*3|3\d*)[.]jpg:r, labels=labels/new{id}.txt:r" . --remove-on-missing
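To check which ids the image pattern above accepts, the regular expression can be tested on its own with Python's `re` module. The sample paths are made up for illustration, and matching against the whole relative path with `fullmatch` is an assumption about how such a pattern is applied.

```python
import re

image_pattern = re.compile(r"images/new(?P<id>\d*3|3\d*)[.]jpg")

samples = [
    "images/new3.jpg",
    "images/new13.jpg",
    "images/new31.jpg",
    "images/new131.jpg",
    "images/new2.jpg",
]
captured = {}
for path in samples:
    match = image_pattern.fullmatch(path)
    # Store the captured id, or None when the path is rejected.
    captured[path] = match.group("id") if match else None
    print(path, "->", captured[path])
```

The ids 3, 13 and 31 are captured, while 131 and 2 are rejected: the two alternatives only cover ids that end with a three (`\d*3`) or start with one (`3\d*`).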
Multiple files per tag
By default, grob expects exactly one file per group and per tag, and will fail if multiple files match a given pattern for a group.
This behavior can be turned off with the --multiple option:
- --multiple TAG1 TAG2 ... will make TAG1 and TAG2 contain a list of files
- --multiple will make all tags contain a list of files
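The difference between single-file and multi-file tags can be sketched as follows. This is an illustration only: `collect` is a hypothetical helper, the input is a list of already-matched (key, tag, path) triples, and the second label file name is invented for the example.

```python
from collections import defaultdict


def collect(matches, multiple=()):
    """Build groups from (key, tag, path) triples.

    Tags listed in `multiple` hold a list of files; any other tag holds
    exactly one file, and a second match for it is an error.
    """
    groups = defaultdict(dict)
    for key, tag, path in matches:
        if tag in multiple:
            groups[key].setdefault(tag, []).append(path)
        elif tag in groups[key]:
            raise ValueError(f"multiple files for tag {tag!r} in group {key!r}")
        else:
            groups[key][tag] = path
    return dict(groups)


matches = [
    ("1", "image", "images/new1.jpg"),
    ("1", "labels", "labels/new1.txt"),
    ("1", "labels", "labels/new1_v2.txt"),  # hypothetical second label file
]
groups = collect(matches, multiple=["labels"])
print(groups)
```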
Distribute files over groups
Sometimes, you'll want the same file to be included in multiple groups. For example, let's say we have tracks and covers from different albums:
├── Miles Ahead
│   ├── cover.jpg
│   ├── 01 - Springsville.mp3
│   ├── 02 - The Maids of Cadiz.mp3
│   └── ...
└── Kind of Blue
    ├── cover.jpg
    ├── 01 - So What.mp3
    ├── 02 - Freddie Freeloader.mp3
    └── ...
Each cover must be associated with all tracks from the album. Tags can be described as usual:
~ grob "track={album}/{track:d2}*.mp3, cover={album}/cover.jpg" .
grob will detect that tag cover is missing the {track} placeholder, and will automatically distribute covers over the common placeholders, achieving the desired behavior.
Deep Dive
To better understand what's going on in this case, consider the files after the search step and before the join step.
For tag track:
| {album} | {track} | File |
|---|---|---|
| Miles Ahead | 01 | Miles Ahead/01 - Springsville.mp3 |
| Miles Ahead | 02 | Miles Ahead/02 - The Maids of Cadiz.mp3 |
| ... | ... | ... |
For tag cover:
| {album} | File |
|---|---|
| Miles Ahead | Miles Ahead/cover.jpg |
| Kind of Blue | Kind of Blue/cover.jpg |
grob will perform an outer join over the common keys, in this case {album}, thus giving the following result:
| {album} | {track} | track | cover |
|---|---|---|---|
| Miles Ahead | 01 | Miles Ahead/01 - Springsville.mp3 | Miles Ahead/cover.jpg |
| Miles Ahead | 02 | Miles Ahead/02 - The Maids of Cadiz.mp3 | Miles Ahead/cover.jpg |
| ... | ... | ... | ... |
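The join over common placeholders can be sketched in plain Python. This is a simplified illustration rather than grob's implementation: `distribute` is a hypothetical helper, rows are (placeholders, path) pairs, and the join only goes from tracks to covers, so covers without any track are not kept as they would be in a full outer join.

```python
def distribute(left, right, left_tag, right_tag):
    """Attach to each `left` row the `right` file sharing its common placeholders.

    Both lists are assumed to be non-empty, and every row of a tag is
    assumed to carry the same placeholder names.
    """
    common = sorted(set(left[0][0]) & set(right[0][0]))
    # Index the right-hand files by the values of the common placeholders.
    index = {tuple(ph[name] for name in common): path for ph, path in right}
    rows = []
    for ph, path in left:
        shared = tuple(ph[name] for name in common)
        rows.append({"key": ph, left_tag: path, right_tag: index.get(shared)})
    return rows


tracks = [
    ({"album": "Miles Ahead", "track": "01"}, "Miles Ahead/01 - Springsville.mp3"),
    ({"album": "Miles Ahead", "track": "02"}, "Miles Ahead/02 - The Maids of Cadiz.mp3"),
    ({"album": "Kind of Blue", "track": "01"}, "Kind of Blue/01 - So What.mp3"),
]
covers = [
    ({"album": "Miles Ahead"}, "Miles Ahead/cover.jpg"),
    ({"album": "Kind of Blue"}, "Kind of Blue/cover.jpg"),
]

rows = distribute(tracks, covers, "track", "cover")
for row in rows:
    print(row["key"], row["track"], row["cover"])
```

Each cover appears once per track of its album, because the lookup uses only {album}, the single placeholder shared by both tags.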
Output format
grob supports the following formats, controlled by the --output-format, -f option:
| Format | Description | Shortcut |
|---|---|---|
| json | JSON mapping from keys to groups | --json, -j |
| jsonl | one JSON record per group | --jsonl, -l |
| human | human readable table (not implemented yet) | --human, -h |
| csv | comma-separated file, with one tag per column | --csv |
| tsv | tab-separated file, same format | --tsv |
--output, -o allows you to choose where to write the output (defaults to stdout).
--absolute will return absolute file paths, instead of paths relative to the root directory.
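As an example of consuming the json format from a script, the mapping from keys to groups can be loaded with Python's standard `json` module. The text below is a hand-written sample standing in for grob's output, shaped like the examples earlier in this guide; it was not produced by running the tool.

```python
import json

# Sample text standing in for output in the json format.
output = """
{
  "1": {"image": "images/new1.jpg", "labels": "labels/new1.txt"},
  "2": {"image": "images/new2.jpg", "labels": "labels/new2.txt"}
}
"""

groups = json.loads(output)
# Turn each group into an (image, labels) pair, e.g. to feed a data loader.
pairs = [(files["image"], files["labels"]) for files in groups.values()]
print(pairs)
```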