model_signing.hashing
High-level API for the hashing interface of the `model_signing` library.

Hashing is used both for signing and verification, and users should ensure that the same configuration is used in both cases.

The module can also be used to just hash a single model, without signing it:

```python
model_signing.hashing.hash(model_path)
```

This module allows setting up the hashing configuration in a single variable and then sharing it between signing and verification:

```python
hashing_config = model_signing.hashing.Config().set_ignored_paths(
    paths=["README.md"], ignore_git_paths=True
)

signing_config = (
    model_signing.signing.Config()
    .use_elliptic_key_signer(private_key="key")
    .set_hashing_config(hashing_config)
)

verifying_config = (
    model_signing.verifying.Config()
    .use_elliptic_key_verifier(public_key="key.pub")
    .set_hashing_config(hashing_config)
)
```

The API defined here is stable and backwards compatible.
````python
# Copyright 2024 The Sigstore Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""High level API for the hashing interface of `model_signing` library.

Hashing is used both for signing and verification and users should ensure that
the same configuration is used in both cases.

The module could also be used to just hash a single model, without signing it:

```python
model_signing.hashing.hash(model_path)
```

This module allows setting up the hashing configuration to a single variable and
then sharing it between signing and verification.

```python
hashing_config = model_signing.hashing.Config().set_ignored_paths(
    paths=["README.md"], ignore_git_paths=True
)

signing_config = (
    model_signing.signing.Config()
    .use_elliptic_key_signer(private_key="key")
    .set_hashing_config(hashing_config)
)

verifying_config = (
    model_signing.verifying.Config()
    .use_elliptic_key_verifier(public_key="key.pub")
    .set_hashing_config(hashing_config)
)
```

The API defined here is stable and backwards compatible.
"""

from collections.abc import Callable, Iterable
import os
import pathlib
import sys
from typing import Literal, Optional, Union

from model_signing import manifest
from model_signing._hashing import hashing
from model_signing._hashing import io
from model_signing._hashing import memory
from model_signing._serialization import file
from model_signing._serialization import file_shard


if sys.version_info >= (3, 11):
    from typing import Self
else:
    from typing_extensions import Self


# `TypeAlias` only exists from Python 3.10
# `TypeAlias` is deprecated in Python 3.12 in favor of `type`
if sys.version_info >= (3, 10):
    from typing import TypeAlias
else:
    from typing_extensions import TypeAlias


# Type alias to support `os.PathLike`, `str` and `bytes` objects in the API
# When Python 3.12 is the minimum supported version we can use `type`
# When Python 3.11 is the minimum supported version we can use `|`
PathLike: TypeAlias = Union[str, bytes, os.PathLike]


def hash(model_path: PathLike) -> manifest.Manifest:
    """Hashes a model using the default configuration.

    Hashing is the shared part between signing and verification and is also
    expected to be the slowest component. When serializing a model, we need to
    spend time proportional to the model size on disk.

    This method returns a "manifest" of the model. A manifest is a collection of
    every object in the model, paired with the corresponding hash. Currently, we
    consider an object in the model to be either a file or a shard of the file.
    Large models with large files will be hashed much faster when every shard is
    hashed in parallel, at the cost of generating a larger payload for the
    signature. In future releases we could support hashing individual tensors or
    tensor slices for further speed optimizations for very large models.

    Args:
        model_path: The path to the model to hash.

    Returns:
        A manifest of the hashed model.
    """
    return Config().hash(model_path)


class Config:
    """Configuration to use when hashing models.

    Hashing is the shared part between signing and verification and is also
    expected to be the slowest component. When serializing a model, we need to
    spend time proportional to the model size on disk.

    Hashing builds a "manifest" of the model. A manifest is a collection of
    every object in the model, paired with the corresponding hash. Currently, we
    consider an object in the model to be either a file or a shard of the file.
    Large models with large files will be hashed much faster when every shard is
    hashed in parallel, at the cost of generating a larger payload for the
    signature. In future releases we could support hashing individual tensors or
    tensor slices for further speed optimizations for very large models.

    This configuration class supports configuring the hashing granularity. By
    default, we hash at file level granularity.

    This configuration class also supports configuring the hash method used to
    generate the hash for every object in the model. We currently support SHA256
    and BLAKE2, with SHA256 being the default.

    This configuration class also supports configuring which paths from the
    model directory should be ignored. These are files that doesn't impact the
    behavior of the model, or files that won't be distributed with the model. By
    default, only files that are associated with a git repository (`.git`,
    `.gitattributes`, `.gitignore`, etc.) are ignored.
    """

    def __init__(self):
        """Initializes the default configuration for hashing."""
        self._ignored_paths = frozenset()
        self._ignore_git_paths = True
        self.use_file_serialization()
        self._allow_symlinks = False

    def hash(
        self,
        model_path: PathLike,
        *,
        files_to_hash: Optional[Iterable[PathLike]] = None,
    ) -> manifest.Manifest:
        """Hashes a model using the current configuration."""
        # All paths in ``_ignored_paths`` are expected to be relative to the
        # model directory. Join them to ``model_path`` and ensure they do not
        # escape it.
        model_path = pathlib.Path(model_path)
        ignored_paths = []
        for p in self._ignored_paths:
            full = model_path / p
            try:
                full.relative_to(model_path)
            except ValueError:
                continue
            ignored_paths.append(full)

        if self._ignore_git_paths:
            ignored_paths.extend(
                [
                    model_path / p
                    for p in [
                        ".git/",
                        ".gitattributes",
                        ".github/",
                        ".gitignore",
                    ]
                ]
            )

        self._serializer.set_allow_symlinks(self._allow_symlinks)

        return self._serializer.serialize(
            pathlib.Path(model_path),
            ignore_paths=ignored_paths,
            files_to_hash=files_to_hash,
        )

    def _build_stream_hasher(
        self, hashing_algorithm: Literal["sha256", "blake2"] = "sha256"
    ) -> hashing.StreamingHashEngine:
        """Builds a streaming hasher from a constant string.

        Args:
            hashing_algorithm: The hashing algorithm to use.

        Returns:
            An instance of the requested hasher.
        """
        # TODO: Once Python 3.9 support is deprecated revert to using `match`
        if hashing_algorithm == "sha256":
            return memory.SHA256()
        if hashing_algorithm == "blake2":
            return memory.BLAKE2()

        raise ValueError(f"Unsupported hashing method {hashing_algorithm}")

    def _build_file_hasher_factory(
        self,
        hashing_algorithm: Literal["sha256", "blake2"] = "sha256",
        chunk_size: int = 1048576,
    ) -> Callable[[pathlib.Path], io.SimpleFileHasher]:
        """Builds the hasher factory for a serialization by file.

        Args:
            hashing_algorithm: The hashing algorithm to use to hash a file.
            chunk_size: The amount of file to read at once. Default is 1MB. A
                special value of 0 signals to attempt to read everything in a
                single call.

        Returns:
            The hasher factory that should be used by the active serialization
            method.
        """

        def _factory(path: pathlib.Path) -> io.SimpleFileHasher:
            hasher = self._build_stream_hasher(hashing_algorithm)
            return io.SimpleFileHasher(path, hasher, chunk_size=chunk_size)

        return _factory

    def _build_sharded_file_hasher_factory(
        self,
        hashing_algorithm: Literal["sha256", "blake2"] = "sha256",
        chunk_size: int = 1048576,
        shard_size: int = 1_000_000_000,
    ) -> Callable[[pathlib.Path, int, int], io.ShardedFileHasher]:
        """Builds the hasher factory for a serialization by file shards.

        Args:
            hashing_algorithm: The hashing algorithm to use to hash a shard.
            chunk_size: The amount of file to read at once. Default is 1MB. A
                special value of 0 signals to attempt to read everything in a
                single call.
            shard_size: The size of a file shard. Default is 1 GB.

        Returns:
            The hasher factory that should be used by the active serialization
            method.
        """

        def _factory(
            path: pathlib.Path, start: int, end: int
        ) -> io.ShardedFileHasher:
            hasher = self._build_stream_hasher(hashing_algorithm)
            return io.ShardedFileHasher(
                path,
                hasher,
                start=start,
                end=end,
                chunk_size=chunk_size,
                shard_size=shard_size,
            )

        return _factory

    def use_file_serialization(
        self,
        *,
        hashing_algorithm: Literal["sha256", "blake2"] = "sha256",
        chunk_size: int = 1048576,
        max_workers: Optional[int] = None,
        allow_symlinks: bool = False,
        ignore_paths: Iterable[pathlib.Path] = frozenset(),
    ) -> Self:
        """Configures serialization to build a manifest of (file, hash) pairs.

        The serialization method in this configuration is changed to one where
        every file in the model is paired with its digest and a manifest
        containing all these pairings is being built.

        Args:
            hashing_algorithm: The hashing algorithm to use to hash a file.
            chunk_size: The amount of file to read at once. Default is 1MB. A
                special value of 0 signals to attempt to read everything in a
                single call.
            max_workers: Maximum number of workers to use in parallel. Default
                is to defer to the `concurrent.futures` library to select the
                best value for the current machine.
            allow_symlinks: Controls whether symbolic links are included. If a
                symlink is present but the flag is `False` (default) the
                serialization would raise an error.

        Returns:
            The new hashing configuration with the new serialization method.
        """
        self._serializer = file.Serializer(
            self._build_file_hasher_factory(hashing_algorithm, chunk_size),
            max_workers=max_workers,
            allow_symlinks=allow_symlinks,
            ignore_paths=ignore_paths,
        )
        return self

    def use_shard_serialization(
        self,
        *,
        hashing_algorithm: Literal["sha256", "blake2"] = "sha256",
        chunk_size: int = 1048576,
        shard_size: int = 1_000_000_000,
        max_workers: Optional[int] = None,
        allow_symlinks: bool = False,
        ignore_paths: Iterable[pathlib.Path] = frozenset(),
    ) -> Self:
        """Configures serialization to build a manifest of (shard, hash) pairs.

        The serialization method in this configuration is changed to one where
        every file in the model is sharded in equal sized shards, every shard is
        paired with its digest and a manifest containing all these pairings is
        being built.

        Args:
            hashing_algorithm: The hashing algorithm to use to hash a shard.
            chunk_size: The amount of file to read at once. Default is 1MB. A
                special value of 0 signals to attempt to read everything in a
                single call.
            shard_size: The size of a file shard. Default is 1 GB.
            max_workers: Maximum number of workers to use in parallel. Default
                is to defer to the `concurrent.futures` library to select the
                best value for the current machine.
            allow_symlinks: Controls whether symbolic links are included. If a
                symlink is present but the flag is `False` (default) the
                serialization would raise an error.
            ignore_paths: Paths of files to ignore.

        Returns:
            The new hashing configuration with the new serialization method.
        """
        self._serializer = file_shard.Serializer(
            self._build_sharded_file_hasher_factory(
                hashing_algorithm, chunk_size, shard_size
            ),
            max_workers=max_workers,
            allow_symlinks=allow_symlinks,
            ignore_paths=ignore_paths,
        )
        return self

    def set_ignored_paths(
        self, *, paths: Iterable[PathLike], ignore_git_paths: bool = True
    ) -> Self:
        """Configures the paths to be ignored during serialization of a model.

        If the model is a single file, there are no paths that are ignored. If
        the model is a directory, all paths are considered as relative to the
        model directory, since we never look at files outside of it.

        If an ignored path is a directory, serialization will ignore both the
        path and any of its children.

        Args:
            paths: The paths to ignore.
            ignore_git_paths: Whether to ignore git related paths (default) or
                include them in the signature.

        Returns:
            The new hashing configuration with a new set of ignored paths.
        """
        # Preserve the user-provided relative paths; they are resolved against
        # the model directory later when hashing.
        self._ignored_paths = frozenset(pathlib.Path(p) for p in paths)
        self._ignore_git_paths = ignore_git_paths
        return self

    def add_ignored_paths(
        self, *, model_path: PathLike, paths: Iterable[PathLike]
    ) -> None:
        """Add more paths to ignore to existing set of paths.

        Args:
            model_path: The path to the model
            paths: Additional paths to ignore. All path must be relative to
                the model directory.
        """
        newset = set(self._ignored_paths)
        model_path = pathlib.Path(model_path)
        for p in paths:
            candidate = pathlib.Path(p)
            full = model_path / candidate
            try:
                full.relative_to(model_path)
            except ValueError:
                continue
            newset.add(candidate)
        self._ignored_paths = newset

    def set_allow_symlinks(self, allow_symlinks: bool) -> Self:
        """Set whether following symlinks is allowed."""
        self._allow_symlinks = allow_symlinks
        return self
````
```python
def hash(model_path: PathLike) -> manifest.Manifest:
```
Hashes a model using the default configuration.
Hashing is the shared part between signing and verification and is also expected to be the slowest component. When serializing a model, we need to spend time proportional to the model size on disk.
This method returns a "manifest" of the model. A manifest is a collection of every object in the model, paired with the corresponding hash. Currently, we consider an object in the model to be either a file or a shard of the file. Large models with large files will be hashed much faster when every shard is hashed in parallel, at the cost of generating a larger payload for the signature. In future releases we could support hashing individual tensors or tensor slices for further speed optimizations for very large models.
Arguments:
- model_path: The path to the model to hash.
Returns:
A manifest of the hashed model.
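As a quick illustration, the call below hashes a local model directory with the defaults described above. This is a minimal sketch: the directory name is hypothetical, and the returned manifest is simply kept for later use (for example, to feed into a signing flow).

```python
import pathlib

from model_signing import hashing

# Hash a local model directory with the default configuration:
# file-level granularity, SHA256, and git-related paths ignored.
# "finbert" is a hypothetical directory name; substitute your own model path.
model_manifest = hashing.hash(pathlib.Path("finbert"))
```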
```python
class Config:
```
Configuration to use when hashing models.
Hashing is the shared part between signing and verification and is also expected to be the slowest component. When serializing a model, we need to spend time proportional to the model size on disk.
Hashing builds a "manifest" of the model. A manifest is a collection of every object in the model, paired with the corresponding hash. Currently, we consider an object in the model to be either a file or a shard of the file. Large models with large files will be hashed much faster when every shard is hashed in parallel, at the cost of generating a larger payload for the signature. In future releases we could support hashing individual tensors or tensor slices for further speed optimizations for very large models.
This configuration class supports configuring the hashing granularity. By default, we hash at file level granularity.
This configuration class also supports configuring the hash method used to generate the hash for every object in the model. We currently support SHA256 and BLAKE2, with SHA256 being the default.
This configuration class also supports configuring which paths from the model directory should be ignored. These are files that don't impact the behavior of the model, or files that won't be distributed with the model. By default, only files that are associated with a git repository (`.git`, `.gitattributes`, `.gitignore`, etc.) are ignored.
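To make the three knobs above concrete, here is a small sketch that switches to shard-level granularity with BLAKE2 and ignores a README before hashing. The model directory name is a hypothetical placeholder.

```python
from model_signing import hashing

config = (
    hashing.Config()
    # Shard-level granularity with BLAKE2 instead of the defaults
    # (file-level granularity with SHA256).
    .use_shard_serialization(hashing_algorithm="blake2")
    # Skip documentation that does not affect the model's behavior.
    .set_ignored_paths(paths=["README.md"], ignore_git_paths=True)
)

# "finbert" is a hypothetical model directory.
model_manifest = config.hash("finbert")
```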
```python
def __init__(self):
```
Initializes the default configuration for hashing.
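As a minimal sketch of what the defaults mean in practice (based on the constructor in the module source above):

```python
from model_signing import hashing

# Per the constructor shown in the module source above, the default
# configuration already selects file-level serialization, ignores git-related
# paths, and disallows symlinks, so these two configurations behave the same:
default_config = hashing.Config()
explicit_config = hashing.Config().use_file_serialization()
```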
```python
def hash(
    self,
    model_path: PathLike,
    *,
    files_to_hash: Optional[Iterable[PathLike]] = None,
) -> manifest.Manifest:
```
Hashes a model using the current configuration.
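For illustration, a single configured instance can be reused to hash several models; the directory names below are hypothetical.

```python
from model_signing import hashing

# One configured instance can be reused across models.
config = hashing.Config().set_ignored_paths(paths=["README.md"])

manifests = {
    path: config.hash(path)
    for path in ["model-a", "model-b"]  # hypothetical model directories
}
```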
```python
def use_file_serialization(
    self,
    *,
    hashing_algorithm: Literal["sha256", "blake2"] = "sha256",
    chunk_size: int = 1048576,
    max_workers: Optional[int] = None,
    allow_symlinks: bool = False,
    ignore_paths: Iterable[pathlib.Path] = frozenset(),
) -> Self:
```
Configures serialization to build a manifest of (file, hash) pairs.
The serialization method in this configuration is changed to one where every file in the model is paired with its digest and a manifest containing all these pairings is being built.
Arguments:
- hashing_algorithm: The hashing algorithm to use to hash a file.
- chunk_size: The amount of file to read at once. Default is 1MB. A special value of 0 signals to attempt to read everything in a single call.
- max_workers: Maximum number of workers to use in parallel. Default is to defer to the `concurrent.futures` library to select the best value for the current machine.
- allow_symlinks: Controls whether symbolic links are included. If a symlink is present but the flag is `False` (default) the serialization would raise an error.
- ignore_paths: Paths of files to ignore.
Returns:
The new hashing configuration with the new serialization method.
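As a sketch of the tuning knobs above (the values are illustrative, not recommendations):

```python
from model_signing import hashing

config = hashing.Config().use_file_serialization(
    hashing_algorithm="sha256",
    chunk_size=0,       # read each file in a single call instead of 1MB chunks
    max_workers=4,      # cap parallelism instead of deferring to concurrent.futures
    allow_symlinks=False,
)
```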
```python
def use_shard_serialization(
    self,
    *,
    hashing_algorithm: Literal["sha256", "blake2"] = "sha256",
    chunk_size: int = 1048576,
    shard_size: int = 1_000_000_000,
    max_workers: Optional[int] = None,
    allow_symlinks: bool = False,
    ignore_paths: Iterable[pathlib.Path] = frozenset(),
) -> Self:
```
Configures serialization to build a manifest of (shard, hash) pairs.
The serialization method in this configuration is changed to one where every file in the model is sharded in equal sized shards, every shard is paired with its digest and a manifest containing all these pairings is being built.
Arguments:
- hashing_algorithm: The hashing algorithm to use to hash a shard.
- chunk_size: The amount of file to read at once. Default is 1MB. A special value of 0 signals to attempt to read everything in a single call.
- shard_size: The size of a file shard. Default is 1 GB.
- max_workers: Maximum number of workers to use in parallel. Default is to defer to the `concurrent.futures` library to select the best value for the current machine.
- allow_symlinks: Controls whether symbolic links are included. If a symlink is present but the flag is `False` (default) the serialization would raise an error.
- ignore_paths: Paths of files to ignore.
Returns:
The new hashing configuration with the new serialization method.
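A short sketch of shard-level tuning; the shard size and worker count are illustrative choices, not recommendations. Larger shards mean fewer (shard, hash) pairs in the signed payload, at the cost of less parallelism per file, which is the trade-off described above.

```python
from model_signing import hashing

config = hashing.Config().use_shard_serialization(
    hashing_algorithm="blake2",
    shard_size=2_000_000_000,  # 2 GB shards instead of the 1 GB default
    max_workers=8,             # illustrative cap on parallel hashing
)
```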
```python
def set_ignored_paths(
    self, *, paths: Iterable[PathLike], ignore_git_paths: bool = True
) -> Self:
```
Configures the paths to be ignored during serialization of a model.
If the model is a single file, there are no paths that are ignored. If the model is a directory, all paths are considered as relative to the model directory, since we never look at files outside of it.
If an ignored path is a directory, serialization will ignore both the path and any of its children.
Arguments:
- paths: The paths to ignore.
- ignore_git_paths: Whether to ignore git related paths (default) or include them in the signature.
Returns:
The new hashing configuration with a new set of ignored paths.
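For example, the hypothetical configuration below ignores a documentation subdirectory and a log file while keeping git-related files in the signature:

```python
from model_signing import hashing

# Ignore a docs/ subdirectory (and everything under it) and a training log,
# but keep git-related files in the signature by turning off the default.
config = hashing.Config().set_ignored_paths(
    paths=["docs", "training.log"],  # hypothetical paths, relative to the model directory
    ignore_git_paths=False,
)
```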
```python
def add_ignored_paths(
    self, *, model_path: PathLike, paths: Iterable[PathLike]
) -> None:
```
Adds more paths to ignore to the existing set of ignored paths.
Arguments:
- model_path: The path to the model.
- paths: Additional paths to ignore. All paths must be relative to the model directory.
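A small sketch of adding paths after the initial configuration; the directory and file names are hypothetical placeholders.

```python
from model_signing import hashing

config = hashing.Config().set_ignored_paths(paths=["README.md"])

# Unlike the other setters, add_ignored_paths returns None, so it cannot be
# chained; call it on its own. All paths are relative to the model directory.
config.add_ignored_paths(
    model_path="finbert",         # hypothetical model directory
    paths=["notes.txt", "eval"],  # hypothetical relative paths
)
```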