model_signing.hashing

High level API for the hashing interface of model_signing library.

Hashing is used both for signing and verification, and users should ensure that the same configuration is used in both cases.

The module can also be used to just hash a single model, without signing it:

model_signing.hashing.hash(model_path)

This module allows setting up the hashing configuration in a single variable and then sharing it between signing and verification.

hashing_config = model_signing.hashing.Config().set_ignored_paths(
    paths=["README.md"], ignore_git_paths=True
)

signing_config = (
    model_signing.signing.Config()
    .use_elliptic_key_signer(private_key="key")
    .set_hashing_config(hashing_config)
)

verifying_config = (
    model_signing.verifying.Config()
    .use_elliptic_key_verifier(public_key="key.pub")
    .set_hashing_config(hashing_config)
)

The API defined here is stable and backwards compatible.

# Copyright 2024 The Sigstore Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""High level API for the hashing interface of `model_signing` library.

Hashing is used both for signing and verification and users should ensure that
the same configuration is used in both cases.

The module could also be used to just hash a single model, without signing it:

```python
model_signing.hashing.hash(model_path)
```

This module allows setting up the hashing configuration to a single variable and
then sharing it between signing and verification.

```python
hashing_config = model_signing.hashing.Config().set_ignored_paths(
    paths=["README.md"], ignore_git_paths=True
)

signing_config = (
    model_signing.signing.Config()
    .use_elliptic_key_signer(private_key="key")
    .set_hashing_config(hashing_config)
)

verifying_config = (
    model_signing.verifying.Config()
    .use_elliptic_key_verifier(public_key="key.pub")
    .set_hashing_config(hashing_config)
)
```

The API defined here is stable and backwards compatible.
"""

from collections.abc import Callable, Iterable
import os
import pathlib
import sys
from typing import Literal, Optional, Union

from model_signing import manifest
from model_signing._hashing import hashing
from model_signing._hashing import io
from model_signing._hashing import memory
from model_signing._serialization import file
from model_signing._serialization import file_shard


if sys.version_info >= (3, 11):
    from typing import Self
else:
    from typing_extensions import Self


# `TypeAlias` only exists from Python 3.10
# `TypeAlias` is deprecated in Python 3.12 in favor of `type`
if sys.version_info >= (3, 10):
    from typing import TypeAlias
else:
    from typing_extensions import TypeAlias


# Type alias to support `os.PathLike`, `str` and `bytes` objects in the API
# When Python 3.12 is the minimum supported version we can use `type`
# When Python 3.11 is the minimum supported version we can use `|`
PathLike: TypeAlias = Union[str, bytes, os.PathLike]


def hash(model_path: PathLike) -> manifest.Manifest:
    """Hashes a model using the default configuration.

    Hashing is the shared part between signing and verification and is also
    expected to be the slowest component. When serializing a model, we need to
    spend time proportional to the model size on disk.

    This method returns a "manifest" of the model. A manifest is a collection of
    every object in the model, paired with the corresponding hash. Currently, we
    consider an object in the model to be either a file or a shard of the file.
    Large models with large files will be hashed much faster when every shard is
    hashed in parallel, at the cost of generating a larger payload for the
    signature. In future releases we could support hashing individual tensors or
    tensor slices for further speed optimizations for very large models.

    Args:
        model_path: The path to the model to hash.

    Returns:
        A manifest of the hashed model.
    """
    return Config().hash(model_path)


class Config:
    """Configuration to use when hashing models.

    Hashing is the shared part between signing and verification and is also
    expected to be the slowest component. When serializing a model, we need to
    spend time proportional to the model size on disk.

    Hashing builds a "manifest" of the model. A manifest is a collection of
    every object in the model, paired with the corresponding hash. Currently, we
    consider an object in the model to be either a file or a shard of the file.
    Large models with large files will be hashed much faster when every shard is
    hashed in parallel, at the cost of generating a larger payload for the
    signature. In future releases we could support hashing individual tensors or
    tensor slices for further speed optimizations for very large models.

    This configuration class supports configuring the hashing granularity. By
    default, we hash at file level granularity.

    This configuration class also supports configuring the hash method used to
    generate the hash for every object in the model. We currently support SHA256
    and BLAKE2, with SHA256 being the default.

    This configuration class also supports configuring which paths from the
    model directory should be ignored. These are files that don't impact the
    behavior of the model, or files that won't be distributed with the model. By
    default, only files that are associated with a git repository (`.git`,
    `.gitattributes`, `.gitignore`, etc.) are ignored.
    """

    def __init__(self):
        """Initializes the default configuration for hashing."""
        self._ignored_paths = frozenset()
        self._ignore_git_paths = True
        self.use_file_serialization()

    def hash(self, model_path: PathLike) -> manifest.Manifest:
        """Hashes a model using the current configuration."""
        ignored_paths = [path for path in self._ignored_paths]
        if self._ignore_git_paths:
            ignored_paths.extend(
                [".git/", ".gitattributes", ".github/", ".gitignore"]
            )

        return self._serializer.serialize(
            pathlib.Path(model_path), ignore_paths=ignored_paths
        )

    def _build_stream_hasher(
        self, hashing_algorithm: Literal["sha256", "blake2"] = "sha256"
    ) -> hashing.StreamingHashEngine:
        """Builds a streaming hasher from a constant string.

        Args:
            hashing_algorithm: The hashing algorithm to use.

        Returns:
            An instance of the requested hasher.
        """
        # TODO: Once Python 3.9 support is deprecated revert to using `match`
        if hashing_algorithm == "sha256":
            return memory.SHA256()
        if hashing_algorithm == "blake2":
            return memory.BLAKE2()

        raise ValueError(f"Unsupported hashing method {hashing_algorithm}")

    def _build_file_hasher_factory(
        self,
        hashing_algorithm: Literal["sha256", "blake2"] = "sha256",
        chunk_size: int = 1048576,
    ) -> Callable[[pathlib.Path], io.SimpleFileHasher]:
        """Builds the hasher factory for a serialization by file.

        Args:
            hashing_algorithm: The hashing algorithm to use to hash a file.
            chunk_size: The amount of file to read at once. Default is 1MB. A
              special value of 0 signals to attempt to read everything in a
              single call.

        Returns:
            The hasher factory that should be used by the active serialization
            method.
        """

        def _factory(path: pathlib.Path) -> io.SimpleFileHasher:
            hasher = self._build_stream_hasher(hashing_algorithm)
            return io.SimpleFileHasher(path, hasher, chunk_size=chunk_size)

        return _factory

    def _build_sharded_file_hasher_factory(
        self,
        hashing_algorithm: Literal["sha256", "blake2"] = "sha256",
        chunk_size: int = 1048576,
        shard_size: int = 1_000_000_000,
    ) -> Callable[[pathlib.Path, int, int], io.ShardedFileHasher]:
        """Builds the hasher factory for a serialization by file shards.

        Args:
            hashing_algorithm: The hashing algorithm to use to hash a shard.
            chunk_size: The amount of file to read at once. Default is 1MB. A
              special value of 0 signals to attempt to read everything in a
              single call.
            shard_size: The size of a file shard. Default is 1 GB.

        Returns:
            The hasher factory that should be used by the active serialization
            method.
        """

        def _factory(
            path: pathlib.Path, start: int, end: int
        ) -> io.ShardedFileHasher:
            hasher = self._build_stream_hasher(hashing_algorithm)
            return io.ShardedFileHasher(
                path,
                hasher,
                start=start,
                end=end,
                chunk_size=chunk_size,
                shard_size=shard_size,
            )

        return _factory

    def use_file_serialization(
        self,
        *,
        hashing_algorithm: Literal["sha256", "blake2"] = "sha256",
        chunk_size: int = 1048576,
        max_workers: Optional[int] = None,
        allow_symlinks: bool = False,
    ) -> Self:
        """Configures serialization to build a manifest of (file, hash) pairs.

        The serialization method in this configuration is changed to one where
        every file in the model is paired with its digest and a manifest
        containing all these pairings is being built.

        Args:
            hashing_algorithm: The hashing algorithm to use to hash a file.
            chunk_size: The amount of file to read at once. Default is 1MB. A
              special value of 0 signals to attempt to read everything in a
              single call.
            max_workers: Maximum number of workers to use in parallel. Default
              is to defer to the `concurrent.futures` library to select the best
              value for the current machine.
            allow_symlinks: Controls whether symbolic links are included. If a
              symlink is present but the flag is `False` (default) the
              serialization would raise an error.

        Returns:
            The new hashing configuration with the new serialization method.
        """
        self._serializer = file.Serializer(
            self._build_file_hasher_factory(hashing_algorithm, chunk_size),
            max_workers=max_workers,
            allow_symlinks=allow_symlinks,
        )
        return self

    def use_shard_serialization(
        self,
        *,
        hashing_algorithm: Literal["sha256", "blake2"] = "sha256",
        chunk_size: int = 1048576,
        shard_size: int = 1_000_000_000,
        max_workers: Optional[int] = None,
        allow_symlinks: bool = False,
    ) -> Self:
        """Configures serialization to build a manifest of (shard, hash) pairs.

        The serialization method in this configuration is changed to one where
        every file in the model is sharded in equal sized shards, every shard is
        paired with its digest and a manifest containing all these pairings is
        being built.

        Args:
            hashing_algorithm: The hashing algorithm to use to hash a shard.
            chunk_size: The amount of file to read at once. Default is 1MB. A
              special value of 0 signals to attempt to read everything in a
              single call.
            shard_size: The size of a file shard. Default is 1 GB.
            max_workers: Maximum number of workers to use in parallel. Default
              is to defer to the `concurrent.futures` library to select the best
              value for the current machine.
            allow_symlinks: Controls whether symbolic links are included. If a
              symlink is present but the flag is `False` (default) the
              serialization would raise an error.

        Returns:
            The new hashing configuration with the new serialization method.
        """
        self._serializer = file_shard.Serializer(
            self._build_sharded_file_hasher_factory(
                hashing_algorithm, chunk_size, shard_size
            ),
            max_workers=max_workers,
            allow_symlinks=allow_symlinks,
        )
        return self

    def set_ignored_paths(
        self, *, paths: Iterable[PathLike], ignore_git_paths: bool = True
    ) -> Self:
        """Configures the paths to be ignored during serialization of a model.

        If the model is a single file, there are no paths that are ignored. If
        the model is a directory, all paths are considered as relative to the
        model directory, since we never look at files outside of it.

        If an ignored path is a directory, serialization will ignore both the
        path and any of its children.

        Args:
            paths: The paths to ignore.
            ignore_git_paths: Whether to ignore git related paths (default) or
              include them in the signature.

        Returns:
            The new hashing configuration with a new set of ignored paths.
        """
        self._ignored_paths = frozenset({pathlib.Path(p) for p in paths})
        self._ignore_git_paths = ignore_git_paths
        return self

PathLike: TypeAlias = Union[str, bytes, os.PathLike]
def hash(model_path: Union[str, bytes, os.PathLike]) -> model_signing.manifest.Manifest:

Hashes a model using the default configuration.

Hashing is the shared part between signing and verification and is also expected to be the slowest component. When serializing a model, we need to spend time proportional to the model size on disk.

This method returns a "manifest" of the model. A manifest is a collection of every object in the model, paired with the corresponding hash. Currently, we consider an object in the model to be either a file or a shard of the file. Large models with large files will be hashed much faster when every shard is hashed in parallel, at the cost of generating a larger payload for the signature. In future releases we could support hashing individual tensors or tensor slices for further speed optimizations for very large models.

Arguments:
  • model_path: The path to the model to hash.
Returns:
  A manifest of the hashed model.
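
For a quick, one-off hash with all defaults, a minimal sketch (the model path below is a placeholder) could be:

```python
import model_signing

# Hash a model directory with the default configuration: file-level
# granularity, SHA256 digests, and git-related paths ignored.
model_manifest = model_signing.hashing.hash("path/to/model")
```

The returned manifest pairs every object in the model with its digest.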

class Config:

Configuration to use when hashing models.

Hashing is the shared part between signing and verification and is also expected to be the slowest component. When serializing a model, we need to spend time proportional to the model size on disk.

Hashing builds a "manifest" of the model. A manifest is a collection of every object in the model, paired with the corresponding hash. Currently, we consider an object in the model to be either a file or a shard of the file. Large models with large files will be hashed much faster when every shard is hashed in parallel, at the cost of generating a larger payload for the signature. In future releases we could support hashing individual tensors or tensor slices for further speed optimizations for very large models.

This configuration class supports configuring the hashing granularity. By default, we hash at file level granularity.

This configuration class also supports configuring the hash method used to generate the hash for every object in the model. We currently support SHA256 and BLAKE2, with SHA256 being the default.

This configuration class also supports configuring which paths from the model directory should be ignored. These are files that don't impact the behavior of the model, or files that won't be distributed with the model. By default, only files that are associated with a git repository (.git, .gitattributes, .gitignore, etc.) are ignored.
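
As a hedged sketch of how these options combine (the model path is a placeholder), a configuration using BLAKE2 at shard granularity, with a README excluded, might look like:

```python
import model_signing

# Non-default hashing configuration: shard-level granularity with BLAKE2
# digests, ignoring README.md as well as git-related paths.
config = (
    model_signing.hashing.Config()
    .use_shard_serialization(hashing_algorithm="blake2")
    .set_ignored_paths(paths=["README.md"], ignore_git_paths=True)
)

model_manifest = config.hash("path/to/model")
```

Because the same configuration object can be attached to both the signing and the verifying configuration, building it once and sharing it (as in the module-level example) keeps both sides consistent.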

Config()

Initializes the default configuration for hashing.

def hash(self, model_path: Union[str, bytes, os.PathLike]) -> model_signing.manifest.Manifest:

Hashes a model using the current configuration.

def use_file_serialization(self, *, hashing_algorithm: Literal['sha256', 'blake2'] = 'sha256', chunk_size: int = 1048576, max_workers: Optional[int] = None, allow_symlinks: bool = False) -> Self:

Configures serialization to build a manifest of (file, hash) pairs.

The serialization method in this configuration is changed to one where every file in the model is paired with its digest, and a manifest containing all these pairings is built.

Arguments:
  • hashing_algorithm: The hashing algorithm to use to hash a file.
  • chunk_size: The amount of file to read at once. Default is 1MB. A special value of 0 signals to attempt to read everything in a single call.
  • max_workers: Maximum number of workers to use in parallel. Default is to defer to the concurrent.futures library to select the best value for the current machine.
  • allow_symlinks: Controls whether symbolic links are included. If a symlink is present but the flag is False (default) the serialization would raise an error.
Returns:
  The new hashing configuration with the new serialization method.
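
A sketch with the keyword arguments spelled out; the tuning values are illustrative, not recommendations:

```python
config = model_signing.hashing.Config().use_file_serialization(
    hashing_algorithm="sha256",  # or "blake2"
    chunk_size=0,                # 0 attempts to read each file in one call
    max_workers=4,               # None defers to concurrent.futures
    allow_symlinks=False,        # symlinks raise an error unless set to True
)
```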

def use_shard_serialization(self, *, hashing_algorithm: Literal['sha256', 'blake2'] = 'sha256', chunk_size: int = 1048576, shard_size: int = 1000000000, max_workers: Optional[int] = None, allow_symlinks: bool = False) -> Self:

Configures serialization to build a manifest of (shard, hash) pairs.

The serialization method in this configuration is changed to one where every file in the model is split into equal-sized shards, every shard is paired with its digest, and a manifest containing all these pairings is built.

Arguments:
  • hashing_algorithm: The hashing algorithm to use to hash a shard.
  • chunk_size: The amount of file to read at once. Default is 1MB. A special value of 0 signals to attempt to read everything in a single call.
  • shard_size: The size of a file shard. Default is 1 GB.
  • max_workers: Maximum number of workers to use in parallel. Default is to defer to the concurrent.futures library to select the best value for the current machine.
  • allow_symlinks: Controls whether symbolic links are included. If a symlink is present but the flag is False (default) the serialization would raise an error.
Returns:
  The new hashing configuration with the new serialization method.
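
A sketch with illustrative values; larger shards produce fewer digests and a smaller signature payload, while smaller shards allow more parallelism:

```python
config = model_signing.hashing.Config().use_shard_serialization(
    hashing_algorithm="blake2",
    shard_size=2_000_000_000,  # 2 GB shards instead of the 1 GB default
    max_workers=8,             # cap the number of parallel hashing workers
)
```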

def set_ignored_paths(self, *, paths: Iterable[typing.Union[str, bytes, os.PathLike]], ignore_git_paths: bool = True) -> Self:

Configures the paths to be ignored during serialization of a model.

If the model is a single file, there are no paths that are ignored. If the model is a directory, all paths are considered relative to the model directory, since we never look at files outside of it.

If an ignored path is a directory, serialization will ignore both the path and any of its children.

Arguments:
  • paths: The paths to ignore.
  • ignore_git_paths: Whether to ignore git related paths (default) or include them in the signature.
Returns:
  The new hashing configuration with a new set of ignored paths.
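
A sketch with hypothetical paths, all interpreted relative to the model directory:

```python
config = model_signing.hashing.Config().set_ignored_paths(
    paths=["README.md", "docs/"],  # a directory entry also ignores its children
    ignore_git_paths=True,         # also skip .git/, .gitattributes, .github/, .gitignore
)
```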