Default Settings

Bases: BaseSettings

Centralized configuration management for the DataForge toolkit.

This class defines all the parameters used by the application, including file paths, hashing settings, and execution intervals. It uses Pydantic v2 to automatically validate types, enforce value limits, and load configuration from JSON files or environment variables.

Attributes:

max_percentage (int): Constant used for percentage calculations (default: 100).
remove (bool): If True, source files will be deleted after processing.
pattern (Tuple[str, ...]): Patterns used to find specific files.
repeat (bool): If True, the operation runs in a continuous loop.
sleep (Union[int, bool]): Seconds to wait between operation cycles.
suffix (str): The file extension used for output files.
step_sec (float): Time interval in seconds for video slicing.
log_path (Path): Directory where log files are stored.
log_level (str): Verbosity level of the logger (e.g., INFO, DEBUG).
datatype (str): The category of files being processed (e.g., image).
method (str): The algorithm name for hashing or comparison.
hash_threshold (int): Distance threshold for identifying duplicates (0-100).
confirm_choice (tuple): Keywords used to confirm interactive deletion.
core_size (int): Resolution for hashing; must be a power of 2.
n_jobs (int): Number of parallel workers; capped by system CPU count.
cache_file_path (Path): Directory for storing persistent hash caches.
cache_name (Optional[Path]): Custom name for the cache file.
a_suffix (Tuple[str, ...]): File patterns specific to annotations.
a_source (Optional[Path]): Directory where annotation files are located.
destination_type (Optional[str]): Target format for annotations.
extensions (Tuple[str, ...]): Supported image file extensions.

Source code in const_utils/default_values.py
class AppSettings(BaseSettings):
    """
    Centralized configuration management for the DataForge toolkit.

    This class defines all the parameters used by the application, including
    file paths, hashing settings, and execution intervals. It uses Pydantic v2
    to automatically validate types, enforce value limits, and load
    configuration from JSON files or environment variables.

    Attributes:
        max_percentage (int): Constant used for percentage calculations (default: 100).
        remove (bool): If True, source files will be deleted after processing.
        pattern (Tuple[str, ...]): Patterns used to find specific files.
        repeat (bool): If True, the operation runs in a continuous loop.
        sleep (Union[int, bool]): Seconds to wait between operation cycles.
        suffix (str): The file extension used for output files.
        step_sec (float): Time interval in seconds for video slicing.
        log_path (Path): Directory where log files are stored.
        log_level (str): Verbosity level of the logger (e.g., INFO, DEBUG).
        datatype (str): The category of files being processed (e.g., image).
        method (str): The algorithm name for hashing or comparison.
        hash_threshold (int): Distance threshold for identifying duplicates (0-100).
        confirm_choice (tuple): Keywords used to confirm interactive deletion.
        core_size (int): Resolution for hashing; must be a power of 2.
        n_jobs (int): Number of parallel workers; capped by system CPU count.
        cache_file_path (Path): Directory for storing persistent hash caches.
        cache_name (Optional[Path]): Custom name for the cache file.
        a_suffix (Tuple[str, ...]): File patterns specific to annotations.
        a_source (Optional[Path]): Directory where annotation files are located.
        destination_type (Optional[str]): Target format for annotations.
        extensions (Tuple[str, ...]): Supported image file extensions.
    """
    max_percentage: int = 100
    model_config = SettingsConfigDict(
        env_prefix="APP_",
        extra="ignore",
        validate_assignment=True
    )

    remove: bool = Field(default=False)
    pattern: Tuple[str, ...] = Field(default_factory=tuple)
    repeat: bool = Field(default=False)
    sleep: Union[int, bool] = Field(default=60, ge=0)
    suffix: str = Field(default=".jpg")
    step_sec: float = Field(default=1.0, ge=0.1)
    log_path: Path = Field(default=Path("./log"))
    log_level: str = Field(default=LevelMapping.info)
    datatype: str = Field(default=Constants.image)
    method: str = Field(default=Constants.dhash)
    hash_threshold: int = Field(default=10, ge=0, le=100)
    confirm_choice: tuple = Field(default=("yes",))
    core_size: int = Field(default=8, ge=8)
    n_jobs: int = Field(default=2, ge=1, le=multiprocessing.cpu_count())
    cache_file_path: Path = Field(default=Path("./cache"))
    cache_name: Optional[Path] = Field(default=None)
    a_suffix: Tuple[str, ...] = Field(default_factory=tuple)
    a_source: Optional[Path] = Field(default=None)
    destination_type: Optional[str] = Field(default=None)
    extensions: Tuple[str, ...] = Field(default=(".jpg", ".jpeg", ".png"))
    margin_threshold: int = Field(default=5, ge=0, le=100)
    report_path: Path = Field(default=Path("./reports"))
    img_dataset_report_schema: List[Dict[str, Any]] = Field(default=[
        {
            "title": "GEOMETRY",
            "type": "numeric",
            "columns": [
                ImageStatsKeys.object_area,
                ImageStatsKeys.object_relative_area,
                ImageStatsKeys.object_width,
                ImageStatsKeys.object_height,
                ImageStatsKeys.object_aspect_ratio
            ]
        },
        {
            "title": "SPATIAL BIAS",
            "type": "binary",
            "columns": [
                ImageStatsKeys.object_in_center,
                ImageStatsKeys.object_in_top_side,
                ImageStatsKeys.object_in_bottom_side,
                ImageStatsKeys.object_in_left_side,
                ImageStatsKeys.object_in_right_side,
                ImageStatsKeys.object_in_left_top,
                ImageStatsKeys.object_in_right_top,
                ImageStatsKeys.object_in_left_bottom,
                ImageStatsKeys.object_in_right_bottom
            ]
        },
        {
            "title": "TRUNCATION",
            "type": "binary",
            "columns": [
                ImageStatsKeys.truncated_top,
                ImageStatsKeys.truncated_bottom,
                ImageStatsKeys.truncated_left,
                ImageStatsKeys.truncated_right
            ]
        },
        {
            "title": "IMAGE QUALITY",
            "type": "numeric",
            "columns": [
                ImageStatsKeys.im_brightness,
                ImageStatsKeys.im_contrast,
                ImageStatsKeys.im_blur_score
            ]
        }
    ])


    @field_validator('core_size')
    @classmethod
    def check_power_of_two(cls, value: int) -> int:
        """
        Validates that the core_size is a power of 2.

        Args:
            value (int): The value to check.

        Returns:
            int: The validated value.

        Raises:
            ValueError: If the value is not a power of 2 (e.g., 8, 16, 32).
        """
        if value <= 0 or (value & (value - 1) != 0):
            raise ValueError(f"core_size must be a power of 2 (e.g., 8, 16, 32, 64...), got {value}")
        return value


    @field_validator("report_path", "log_path", "cache_file_path", "a_source", mode='before')
    @classmethod
    def ensure_path(cls, value: Union[str, Path]) -> Path:
        """
        Converts string input into a Path object before type validation.

        Args:
            value (Union[str, Path]): The raw path input.

        Returns:
            Path: An initialized Path object.
        """
        if isinstance(value, str):
            return Path(value)
        return value


    @field_validator("n_jobs", mode="before")
    @classmethod
    def ensure_n_jobs(cls, value: Union[int, str]) -> int:
        """
        Ensures the number of parallel jobs is within safe system limits.

        It prevents setting n_jobs to 0 and caps it at (CPU count - 1) to
        keep the operating system responsive. The result is never below 1,
        even on a single-core machine.

        Args:
            value (Union[int, str]): Requested number of workers.

        Returns:
            int: A safe, adjusted number of workers.
        """
        # Runs before the field's ge/le checks so out-of-range requests are
        # clamped instead of rejected.
        requested = value if isinstance(value, int) else int(float(value))
        max_workers = max(1, multiprocessing.cpu_count() - 1)
        return min(max(requested, 1), max_workers)


    @field_validator("extensions", mode="before")
    @classmethod
    def ensure_extensions(cls, value: Union[str, List[str]]) -> Tuple[str, ...]:
        """
        Ensures that file extensions are stored as a tuple of strings.

        Args:
            value (Union[str, List[str]]): Input extension data.

        Returns:
            Tuple[str, ...]: A tuple of extension strings.

        Raises:
            TypeError: If the input cannot be converted to a tuple.
        """
        if isinstance(value, tuple):
            return value
        if isinstance(value, str):
            # Wrap a lone string; tuple(".jpg") would split it into characters.
            return (value,)
        try:
            return tuple(value)
        except TypeError as e:
            raise TypeError(
                f"extensions must be a string or an iterable of strings, got {type(value).__name__}"
            ) from e


    @classmethod
    def load_config(cls, config_path: Path = Constants.config_file) -> "AppSettings":
        """
        Factory method to create a settings object from a JSON file.

        It attempts to read the specified JSON file. If the file is missing
        or corrupted, it falls back to the default values defined in the class.

        Args:
            config_path (Path): Path to the config.json file.

        Returns:
            AppSettings: An initialized and validated settings instance.
        """
        data = {}

        if config_path.exists():
            try:
                with open(config_path, "r", encoding="utf-8") as file:
                    data = json.load(file)
            except json.JSONDecodeError:
                print(f"Warning: {config_path} is corrupted. Using defaults.")

        return cls(**data)

check_power_of_two(value) classmethod

Validates that the core_size is a power of 2.

Parameters:

value (int): The value to check. Required.

Returns:

int: The validated value.

Raises:

ValueError: If the value is not a power of 2 (e.g., 8, 16, 32).

Source code in const_utils/default_values.py
@field_validator('core_size')
@classmethod
def check_power_of_two(cls, value: int) -> int:
    """
    Validates that the core_size is a power of 2.

    Args:
        value (int): The value to check.

    Returns:
        int: The validated value.

    Raises:
        ValueError: If the value is not a power of 2 (e.g., 8, 16, 32).
    """
    if value <= 0 or (value & (value - 1) != 0):
        raise ValueError(f"core_size must be a power of 2 (e.g., 8, 16, 32, 64...), got {value}")
    return value
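
The check relies on a bit-level property: a power of two has exactly one set bit, so `value & (value - 1)` clears that bit and leaves zero. The following is a minimal standalone sketch of that property; the helper name `is_power_of_two` is an assumption made for illustration and is not part of the toolkit.

```python
def is_power_of_two(value: int) -> bool:
    """Return True when value is a positive power of two."""
    # A power of two has a single set bit. Subtracting 1 flips that bit and
    # sets every lower bit, so the bitwise AND is zero only in that case.
    return value > 0 and (value & (value - 1)) == 0


# Collect every value below 70 that passes the check.
accepted = [n for n in range(1, 70) if is_power_of_two(n)]
print(accepted)
```

Note that the `core_size` field also declares `ge=8`, so values such as 1, 2, and 4 pass the bit test but would still be rejected by the field constraint.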

ensure_extensions(value) classmethod

Ensures that file extensions are stored as a tuple of strings.

Parameters:

value (Union[str, List[str]]): Input extension data. Required.

Returns:

Tuple[str, ...]: A tuple of extension strings.

Raises:

TypeError: If the input cannot be converted to a tuple.

Source code in const_utils/default_values.py
@field_validator("extensions", mode="before")
@classmethod
def ensure_extensions(cls, value: Union[str, List[str]]) -> Tuple[str, ...]:
    """
    Ensures that file extensions are stored as a tuple of strings.

    Args:
        value (Union[str, List[str]]): Input extension data.

    Returns:
        Tuple[str, ...]: A tuple of extension strings.

    Raises:
        TypeError: If the input cannot be converted to a tuple.
    """
    if isinstance(value, tuple):
        return value
    if isinstance(value, str):
        # Wrap a lone string; tuple(".jpg") would split it into characters.
        return (value,)
    try:
        return tuple(value)
    except TypeError as e:
        raise TypeError(
            f"extensions must be a string or an iterable of strings, got {type(value).__name__}"
        ) from e
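
One pitfall when normalizing extension input is that calling `tuple()` on a bare string splits it into single characters. The sketch below shows that behavior next to one possible way of handling it; `normalize_extensions` is a hypothetical helper written for this illustration, not a function from the toolkit.

```python
from typing import Iterable, Tuple, Union


def normalize_extensions(value: Union[str, Iterable[str]]) -> Tuple[str, ...]:
    """Return the extensions as a tuple, treating a lone string as one item."""
    if isinstance(value, str):
        # Wrap the string so it stays a single extension.
        return (value,)
    return tuple(value)


# Naive conversion: the string is iterated character by character.
naive = tuple(".jpg")

# Normalized conversions for a single string and for a list.
single = normalize_extensions(".jpg")
several = normalize_extensions([".jpg", ".png"])
print(naive, single, several)
```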

ensure_n_jobs(value) classmethod

Ensures the number of parallel jobs is within safe system limits.

It prevents setting n_jobs to 0 and caps it at (CPU count - 1) to keep the operating system responsive.

Parameters:

value (Union[int, str]): Requested number of workers. Required.

Returns:

int: A safe, adjusted number of workers.

Source code in const_utils/default_values.py
@field_validator("n_jobs", mode="before")
@classmethod
def ensure_n_jobs(cls, value: Union[int, str]) -> int:
    """
    Ensures the number of parallel jobs is within safe system limits.

    It prevents setting n_jobs to 0 and caps it at (CPU count - 1) to
    keep the operating system responsive. The result is never below 1,
    even on a single-core machine.

    Args:
        value (Union[int, str]): Requested number of workers.

    Returns:
        int: A safe, adjusted number of workers.
    """
    # Runs before the field's ge/le checks so out-of-range requests are
    # clamped instead of rejected.
    requested = value if isinstance(value, int) else int(float(value))
    max_workers = max(1, multiprocessing.cpu_count() - 1)
    return min(max(requested, 1), max_workers)
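
The clamping rule described above (at least 1 worker, at most CPU count minus 1) can be sketched independently of Pydantic. This is an illustrative version under stated assumptions: `clamp_n_jobs` is a hypothetical name, and the CPU count is passed in as a parameter so the behavior can be checked for any machine size, including a single-core one.

```python
import multiprocessing


def clamp_n_jobs(requested: int, cpu_total: int) -> int:
    """Clamp a worker count into the range [1, max(1, cpu_total - 1)]."""
    # Leave one core free for the operating system, but never drop below
    # one worker, which matters when cpu_total is 1.
    upper = max(1, cpu_total - 1)
    return min(max(requested, 1), upper)


# Clamp an oversized request against the CPU count of the current machine.
safe_here = clamp_n_jobs(10_000, multiprocessing.cpu_count())
print(safe_here)
```

Taking the CPU count as an argument is a design choice for testability; a validator would normally read it from `multiprocessing.cpu_count()` directly.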

ensure_path(value) classmethod

Converts string input into a Path object before type validation.

Parameters:

value (Union[str, Path]): The raw path input. Required.

Returns:

Path: An initialized Path object.

Source code in const_utils/default_values.py
@field_validator("report_path", "log_path", "cache_file_path", "a_source", mode='before')
@classmethod
def ensure_path(cls, value: Union[str, Path]) -> Path:
    """
    Converts string input into a Path object before type validation.

    Args:
        value (Union[str, Path]): The raw path input.

    Returns:
        Path: An initialized Path object.
    """
    if isinstance(value, str):
        return Path(value)
    return value
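
Because this validator also covers `a_source`, which is declared as `Optional[Path]`, non-string inputs such as `None` or an existing `Path` pass through unchanged. A small sketch of that coercion rule outside of Pydantic, using a hypothetical helper named `to_path`:

```python
from pathlib import Path
from typing import Optional, Union


def to_path(value: Union[str, Path, None]) -> Optional[Path]:
    """Convert strings to Path and return any other input unchanged."""
    if isinstance(value, str):
        return Path(value)
    return value


from_string = to_path("./log")
already_path = to_path(Path("cache"))
unset = to_path(None)
print(from_string, already_path, unset)
```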

load_config(config_path=Constants.config_file) classmethod

Factory method to create a settings object from a JSON file.

It attempts to read the specified JSON file. If the file is missing or corrupted, it falls back to the default values defined in the class.

Parameters:

config_path (Path): Path to the config.json file. Default: Constants.config_file.

Returns:

AppSettings: An initialized and validated settings instance.

Source code in const_utils/default_values.py
@classmethod
def load_config(cls, config_path: Path = Constants.config_file) -> "AppSettings":
    """
    Factory method to create a settings object from a JSON file.

    It attempts to read the specified JSON file. If the file is missing
    or corrupted, it falls back to the default values defined in the class.

    Args:
        config_path (Path): Path to the config.json file.

    Returns:
        AppSettings: An initialized and validated settings instance.
    """
    data = {}

    if config_path.exists():
        try:
            with open(config_path, "r", encoding="utf-8") as file:
                data = json.load(file)
        except json.JSONDecodeError:
            print(f"Warning: {config_path} is corrupted. Using defaults.")

    return cls(**data)
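
The fallback behavior can be tried out with the standard library alone. The sketch below reproduces only the file-reading half of the method; `read_overrides` is a hypothetical helper, the file names are made up for the demonstration, and the extra check for a non-object JSON document is an assumption added here rather than something the original method does.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def read_overrides(config_path: Path) -> Dict[str, Any]:
    """Return the JSON object stored at config_path, or {} when unusable."""
    if not config_path.exists():
        return {}
    try:
        with open(config_path, "r", encoding="utf-8") as file:
            data = json.load(file)
    except json.JSONDecodeError:
        print(f"Warning: {config_path} is corrupted. Using defaults.")
        return {}
    # Assumption: only a JSON object can be unpacked as keyword arguments.
    return data if isinstance(data, dict) else {}


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "config.json"
    good.write_text('{"hash_threshold": 25, "n_jobs": 4}', encoding="utf-8")
    bad = Path(tmp) / "broken.json"
    bad.write_text("{not valid json", encoding="utf-8")

    good_data = read_overrides(good)
    bad_data = read_overrides(bad)
    missing_data = read_overrides(Path(tmp) / "absent.json")

print(good_data, bad_data, missing_data)
```

An empty dictionary means every field keeps its class default once the result is unpacked into the settings constructor.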