model_signing.hashing

High-level API for the hashing interface of the `model_signing` library.

Hashing is used for both signing and verification, and users should ensure that the same configuration is used in both cases.

The module can also be used to hash a single model on its own, without signing it:

```python
model_signing.hashing.hash(model_path)
```

This module allows setting up the hashing configuration in a single variable and then sharing it between signing and verification:
```python
hashing_config = model_signing.hashing.Config().set_ignored_paths(
    paths=["README.md"], ignore_git_paths=True
)

signing_config = (
    model_signing.signing.Config()
    .use_elliptic_key_signer(private_key="key")
    .set_hashing_config(hashing_config)
)

verifying_config = (
    model_signing.verifying.Config()
    .use_elliptic_key_verifier(public_key="key.pub")
    .set_hashing_config(hashing_config)
)
```
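Because the configuration object is shared, it can also be used on its own to compute the manifest that signing and verification will operate on. A minimal sketch, reusing `hashing_config` from the example above with a placeholder model path:

```python
# Hash the model with exactly the configuration that signing and
# verification will use; the result is a manifest pairing every object in
# the model with its digest.
manifest = hashing_config.hash("path/to/model")
```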
The API defined here is stable and backwards compatible.
Source code:

````python
# Copyright 2024 The Sigstore Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""High level API for the hashing interface of `model_signing` library.

Hashing is used both for signing and verification and users should ensure that
the same configuration is used in both cases.

The module could also be used to just hash a single model, without signing it:

```python
model_signing.hashing.hash(model_path)
```

This module allows setting up the hashing configuration to a single variable and
then sharing it between signing and verification.

```python
hashing_config = model_signing.hashing.Config().set_ignored_paths(
    paths=["README.md"], ignore_git_paths=True
)

signing_config = (
    model_signing.signing.Config()
    .use_elliptic_key_signer(private_key="key")
    .set_hashing_config(hashing_config)
)

verifying_config = (
    model_signing.verifying.Config()
    .use_elliptic_key_verifier(public_key="key.pub")
    .set_hashing_config(hashing_config)
)
```

The API defined here is stable and backwards compatible.
"""

from collections.abc import Callable, Iterable
import os
import pathlib
import sys
from typing import Literal, Optional, Union

from model_signing import manifest
from model_signing._hashing import hashing
from model_signing._hashing import io
from model_signing._hashing import memory
from model_signing._serialization import file
from model_signing._serialization import file_shard


if sys.version_info >= (3, 11):
    from typing import Self
else:
    from typing_extensions import Self


# `TypeAlias` only exists from Python 3.10
# `TypeAlias` is deprecated in Python 3.12 in favor of `type`
if sys.version_info >= (3, 10):
    from typing import TypeAlias
else:
    from typing_extensions import TypeAlias


# Type alias to support `os.PathLike`, `str` and `bytes` objects in the API
# When Python 3.12 is the minimum supported version we can use `type`
# When Python 3.11 is the minimum supported version we can use `|`
PathLike: TypeAlias = Union[str, bytes, os.PathLike]


def hash(model_path: PathLike) -> manifest.Manifest:
    """Hashes a model using the default configuration.

    Hashing is the shared part between signing and verification and is also
    expected to be the slowest component. When serializing a model, we need to
    spend time proportional to the model size on disk.

    This method returns a "manifest" of the model. A manifest is a collection of
    every object in the model, paired with the corresponding hash. Currently, we
    consider an object in the model to be either a file or a shard of the file.
    Large models with large files will be hashed much faster when every shard is
    hashed in parallel, at the cost of generating a larger payload for the
    signature. In future releases we could support hashing individual tensors or
    tensor slices for further speed optimizations for very large models.

    Args:
        model_path: The path to the model to hash.

    Returns:
        A manifest of the hashed model.
    """
    return Config().hash(model_path)


class Config:
    """Configuration to use when hashing models.

    Hashing is the shared part between signing and verification and is also
    expected to be the slowest component. When serializing a model, we need to
    spend time proportional to the model size on disk.

    Hashing builds a "manifest" of the model. A manifest is a collection of
    every object in the model, paired with the corresponding hash. Currently, we
    consider an object in the model to be either a file or a shard of the file.
    Large models with large files will be hashed much faster when every shard is
    hashed in parallel, at the cost of generating a larger payload for the
    signature. In future releases we could support hashing individual tensors or
    tensor slices for further speed optimizations for very large models.

    This configuration class supports configuring the hashing granularity. By
    default, we hash at file level granularity.

    This configuration class also supports configuring the hash method used to
    generate the hash for every object in the model. We currently support SHA256
    and BLAKE2, with SHA256 being the default.

    This configuration class also supports configuring which paths from the
    model directory should be ignored. These are files that doesn't impact the
    behavior of the model, or files that won't be distributed with the model. By
    default, only files that are associated with a git repository (`.git`,
    `.gitattributes`, `.gitignore`, etc.) are ignored.
    """

    def __init__(self):
        """Initializes the default configuration for hashing."""
        self._ignored_paths = frozenset()
        self._ignore_git_paths = True
        self.use_file_serialization()

    def hash(self, model_path: PathLike) -> manifest.Manifest:
        """Hashes a model using the current configuration."""
        ignored_paths = [path for path in self._ignored_paths]
        if self._ignore_git_paths:
            ignored_paths.extend(
                [".git/", ".gitattributes", ".github/", ".gitignore"]
            )

        return self._serializer.serialize(
            pathlib.Path(model_path), ignore_paths=ignored_paths
        )

    def _build_stream_hasher(
        self, hashing_algorithm: Literal["sha256", "blake2"] = "sha256"
    ) -> hashing.StreamingHashEngine:
        """Builds a streaming hasher from a constant string.

        Args:
            hashing_algorithm: The hashing algorithm to use.

        Returns:
            An instance of the requested hasher.
        """
        # TODO: Once Python 3.9 support is deprecated revert to using `match`
        if hashing_algorithm == "sha256":
            return memory.SHA256()
        if hashing_algorithm == "blake2":
            return memory.BLAKE2()

        raise ValueError(f"Unsupported hashing method {hashing_algorithm}")

    def _build_file_hasher_factory(
        self,
        hashing_algorithm: Literal["sha256", "blake2"] = "sha256",
        chunk_size: int = 1048576,
    ) -> Callable[[pathlib.Path], io.SimpleFileHasher]:
        """Builds the hasher factory for a serialization by file.

        Args:
            hashing_algorithm: The hashing algorithm to use to hash a file.
            chunk_size: The amount of file to read at once. Default is 1MB. A
                special value of 0 signals to attempt to read everything in a
                single call.

        Returns:
            The hasher factory that should be used by the active serialization
            method.
        """

        def _factory(path: pathlib.Path) -> io.SimpleFileHasher:
            hasher = self._build_stream_hasher(hashing_algorithm)
            return io.SimpleFileHasher(path, hasher, chunk_size=chunk_size)

        return _factory

    def _build_sharded_file_hasher_factory(
        self,
        hashing_algorithm: Literal["sha256", "blake2"] = "sha256",
        chunk_size: int = 1048576,
        shard_size: int = 1_000_000_000,
    ) -> Callable[[pathlib.Path, int, int], io.ShardedFileHasher]:
        """Builds the hasher factory for a serialization by file shards.

        Args:
            hashing_algorithm: The hashing algorithm to use to hash a shard.
            chunk_size: The amount of file to read at once. Default is 1MB. A
                special value of 0 signals to attempt to read everything in a
                single call.
            shard_size: The size of a file shard. Default is 1 GB.

        Returns:
            The hasher factory that should be used by the active serialization
            method.
        """

        def _factory(
            path: pathlib.Path, start: int, end: int
        ) -> io.ShardedFileHasher:
            hasher = self._build_stream_hasher(hashing_algorithm)
            return io.ShardedFileHasher(
                path,
                hasher,
                start=start,
                end=end,
                chunk_size=chunk_size,
                shard_size=shard_size,
            )

        return _factory

    def use_file_serialization(
        self,
        *,
        hashing_algorithm: Literal["sha256", "blake2"] = "sha256",
        chunk_size: int = 1048576,
        max_workers: Optional[int] = None,
        allow_symlinks: bool = False,
    ) -> Self:
        """Configures serialization to build a manifest of (file, hash) pairs.

        The serialization method in this configuration is changed to one where
        every file in the model is paired with its digest and a manifest
        containing all these pairings is being built.

        Args:
            hashing_algorithm: The hashing algorithm to use to hash a file.
            chunk_size: The amount of file to read at once. Default is 1MB. A
                special value of 0 signals to attempt to read everything in a
                single call.
            max_workers: Maximum number of workers to use in parallel. Default
                is to defer to the `concurrent.futures` library to select the best
                value for the current machine.
            allow_symlinks: Controls whether symbolic links are included. If a
                symlink is present but the flag is `False` (default) the
                serialization would raise an error.

        Returns:
            The new hashing configuration with the new serialization method.
        """
        self._serializer = file.Serializer(
            self._build_file_hasher_factory(hashing_algorithm, chunk_size),
            max_workers=max_workers,
            allow_symlinks=allow_symlinks,
        )
        return self

    def use_shard_serialization(
        self,
        *,
        hashing_algorithm: Literal["sha256", "blake2"] = "sha256",
        chunk_size: int = 1048576,
        shard_size: int = 1_000_000_000,
        max_workers: Optional[int] = None,
        allow_symlinks: bool = False,
    ) -> Self:
        """Configures serialization to build a manifest of (shard, hash) pairs.

        The serialization method in this configuration is changed to one where
        every file in the model is sharded in equal sized shards, every shard is
        paired with its digest and a manifest containing all these pairings is
        being built.

        Args:
            hashing_algorithm: The hashing algorithm to use to hash a shard.
            chunk_size: The amount of file to read at once. Default is 1MB. A
                special value of 0 signals to attempt to read everything in a
                single call.
            shard_size: The size of a file shard. Default is 1 GB.
            max_workers: Maximum number of workers to use in parallel. Default
                is to defer to the `concurrent.futures` library to select the best
                value for the current machine.
            allow_symlinks: Controls whether symbolic links are included. If a
                symlink is present but the flag is `False` (default) the
                serialization would raise an error.

        Returns:
            The new hashing configuration with the new serialization method.
        """
        self._serializer = file_shard.Serializer(
            self._build_sharded_file_hasher_factory(
                hashing_algorithm, chunk_size, shard_size
            ),
            max_workers=max_workers,
            allow_symlinks=allow_symlinks,
        )
        return self

    def set_ignored_paths(
        self, *, paths: Iterable[PathLike], ignore_git_paths: bool = True
    ) -> Self:
        """Configures the paths to be ignored during serialization of a model.

        If the model is a single file, there are no paths that are ignored. If
        the model is a directory, all paths are considered as relative to the
        model directory, since we never look at files outside of it.

        If an ignored path is a directory, serialization will ignore both the
        path and any of its children.

        Args:
            paths: The paths to ignore.
            ignore_git_paths: Whether to ignore git related paths (default) or
                include them in the signature.

        Returns:
            The new hashing configuration with a new set of ignored paths.
        """
        self._ignored_paths = frozenset({pathlib.Path(p) for p in paths})
        self._ignore_git_paths = ignore_git_paths
        return self
````
`hash(model_path)`

Hashes a model using the default configuration.

Hashing is the shared part between signing and verification and is also expected to be the slowest component. When serializing a model, we need to spend time proportional to the model size on disk.

This function returns a "manifest" of the model. A manifest is a collection of every object in the model, paired with the corresponding hash. Currently, we consider an object in the model to be either a file or a shard of a file. Large models with large files will be hashed much faster when every shard is hashed in parallel, at the cost of generating a larger payload for the signature. In future releases we could support hashing individual tensors or tensor slices for further speed optimizations for very large models.

Arguments:
- model_path: The path to the model to hash.

Returns:
A manifest of the hashed model.
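A minimal usage sketch (the model path is a placeholder):

```python
import model_signing

# Hash a model directory with the default configuration: file-level
# granularity, SHA256 digests, and git metadata ignored.
manifest = model_signing.hashing.hash("path/to/model")
```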
`Config`

Configuration to use when hashing models.

Hashing is the shared part between signing and verification and is also expected to be the slowest component. When serializing a model, we need to spend time proportional to the model size on disk.

Hashing builds a "manifest" of the model. A manifest is a collection of every object in the model, paired with the corresponding hash. Currently, we consider an object in the model to be either a file or a shard of a file. Large models with large files will be hashed much faster when every shard is hashed in parallel, at the cost of generating a larger payload for the signature. In future releases we could support hashing individual tensors or tensor slices for further speed optimizations for very large models.

This configuration class supports configuring the hashing granularity. By default, we hash at file-level granularity.

This configuration class also supports configuring the hash method used to generate the hash for every object in the model. We currently support SHA256 and BLAKE2, with SHA256 being the default.

This configuration class also supports configuring which paths from the model directory should be ignored. These are files that don't impact the behavior of the model, or files that won't be distributed with the model. By default, only files that are associated with a git repository (`.git`, `.gitattributes`, `.gitignore`, etc.) are ignored.
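As an illustrative sketch (the ignored file name is hypothetical), granularity, hash method, and ignored paths can all be configured on one object by chaining the methods documented below:

```python
import model_signing

# Shard-level hashing with BLAKE2, ignoring a local notes file in addition
# to the default git paths.
config = (
    model_signing.hashing.Config()
    .use_shard_serialization(hashing_algorithm="blake2")
    .set_ignored_paths(paths=["notes.txt"], ignore_git_paths=True)
)
```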
`Config.__init__()`

Initializes the default configuration for hashing.
`Config.hash(model_path)`

Hashes a model using the current configuration.
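A short sketch, assuming a placeholder model directory:

```python
import model_signing

# Explicitly select file-level serialization (the default) and hash a model.
config = model_signing.hashing.Config().use_file_serialization()
manifest = config.hash("path/to/model")
```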
`Config.use_file_serialization(*, hashing_algorithm="sha256", chunk_size=1048576, max_workers=None, allow_symlinks=False)`

Configures serialization to build a manifest of (file, hash) pairs.

The serialization method in this configuration is changed to one where every file in the model is paired with its digest, and a manifest containing all these pairings is built.

Arguments:
- hashing_algorithm: The hashing algorithm to use to hash a file.
- chunk_size: The amount of file to read at once. Default is 1MB. A special value of 0 signals to attempt to read everything in a single call.
- max_workers: Maximum number of workers to use in parallel. Default is to defer to the `concurrent.futures` library to select the best value for the current machine.
- allow_symlinks: Controls whether symbolic links are included. If a symlink is present but the flag is `False` (default), the serialization raises an error.

Returns:
The new hashing configuration with the new serialization method.
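An illustrative sketch (the chunk size and worker count are arbitrary choices, not recommendations):

```python
import model_signing

# File-level hashing with BLAKE2, 4 MiB reads, and at most four workers.
config = model_signing.hashing.Config().use_file_serialization(
    hashing_algorithm="blake2",
    chunk_size=4 * 1024 * 1024,
    max_workers=4,
)
```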
`Config.use_shard_serialization(*, hashing_algorithm="sha256", chunk_size=1048576, shard_size=1_000_000_000, max_workers=None, allow_symlinks=False)`

Configures serialization to build a manifest of (shard, hash) pairs.

The serialization method in this configuration is changed to one where every file in the model is split into equal-sized shards, every shard is paired with its digest, and a manifest containing all these pairings is built.

Arguments:
- hashing_algorithm: The hashing algorithm to use to hash a shard.
- chunk_size: The amount of file to read at once. Default is 1MB. A special value of 0 signals to attempt to read everything in a single call.
- shard_size: The size of a file shard. Default is 1 GB.
- max_workers: Maximum number of workers to use in parallel. Default is to defer to the `concurrent.futures` library to select the best value for the current machine.
- allow_symlinks: Controls whether symbolic links are included. If a symlink is present but the flag is `False` (default), the serialization raises an error.

Returns:
The new hashing configuration with the new serialization method.
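An illustrative sketch (the shard size is an arbitrary choice; the default is 1 GB):

```python
import model_signing

# Shard-level hashing with 500 MB shards, so large files are hashed in
# parallel shard by shard.
config = model_signing.hashing.Config().use_shard_serialization(
    shard_size=500_000_000,
)
```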
`Config.set_ignored_paths(*, paths, ignore_git_paths=True)`

Configures the paths to be ignored during serialization of a model.

If the model is a single file, no paths are ignored. If the model is a directory, all paths are considered relative to the model directory, since we never look at files outside of it.

If an ignored path is a directory, serialization will ignore both the path and any of its children.

Arguments:
- paths: The paths to ignore.
- ignore_git_paths: Whether to ignore git-related paths (default) or include them in the signature.

Returns:
The new hashing configuration with a new set of ignored paths.
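An illustrative sketch (the file and directory names are hypothetical); an ignored directory also excludes its children, and git metadata stays ignored by default:

```python
import model_signing

config = model_signing.hashing.Config().set_ignored_paths(
    paths=["README.md", "docs/", "checkpoints/"],
    ignore_git_paths=True,
)
```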