
model_signing.hashing

High level API for the hashing interface of model_signing library.

Hashing is used both for signing and verification, and users should ensure that the same configuration is used in both cases.

The module could also be used to just hash a single model, without signing it:
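```python
model_signing.hashing.hash(model_path)
```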

This module allows setting up the hashing configuration as a single variable and then sharing it between signing and verification:

```python
hashing_config = model_signing.hashing.Config().set_ignored_paths(
    paths=["README.md"], ignore_git_paths=True
)

signing_config = (
    model_signing.signing.Config()
    .use_elliptic_key_signer(private_key="key")
    .set_hashing_config(hashing_config)
)

verifying_config = (
    model_signing.verifying.Config()
    .use_elliptic_key_verifier(public_key="key.pub")
    .set_hashing_config(hashing_config)
)
```

The API defined here is stable and backwards compatible.

  1# Copyright 2024 The Sigstore Authors
  2#
  3# Licensed under the Apache License, Version 2.0 (the "License");
  4# you may not use this file except in compliance with the License.
  5# You may obtain a copy of the License at
  6#
  7#      http://www.apache.org/licenses/LICENSE-2.0
  8#
  9# Unless required by applicable law or agreed to in writing, software
 10# distributed under the License is distributed on an "AS IS" BASIS,
 11# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 12# See the License for the specific language governing permissions and
 13# limitations under the License.
 14
 15"""High level API for the hashing interface of `model_signing` library.
 16
 17Hashing is used both for signing and verification and users should ensure that
 18the same configuration is used in both cases.
 19
 20The module could also be used to just hash a single model, without signing it:
 21
 22```python
 23model_signing.hashing.hash(model_path)
 24```
 25
 26This module allows setting up the hashing configuration to a single variable and
 27then sharing it between signing and verification.
 28
 29```python
 30hashing_config = model_signing.hashing.Config().set_ignored_paths(
 31    paths=["README.md"], ignore_git_paths=True
 32)
 33
 34signing_config = (
 35    model_signing.signing.Config()
 36    .use_elliptic_key_signer(private_key="key")
 37    .set_hashing_config(hashing_config)
 38)
 39
 40verifying_config = (
 41    model_signing.verifying.Config()
 42    .use_elliptic_key_verifier(public_key="key.pub")
 43    .set_hashing_config(hashing_config)
 44)
 45```
 46
 47The API defined here is stable and backwards compatible.
 48"""
 49
 50from collections.abc import Callable, Iterable
 51import os
 52import pathlib
 53import sys
 54from typing import Literal, Optional, Union
 55
 56from model_signing import manifest
 57from model_signing._hashing import hashing
 58from model_signing._hashing import io
 59from model_signing._hashing import memory
 60from model_signing._serialization import file
 61from model_signing._serialization import file_shard
 62
 63
 64if sys.version_info >= (3, 11):
 65    from typing import Self
 66else:
 67    from typing_extensions import Self
 68
 69
 70# `TypeAlias` only exists from Python 3.10
 71# `TypeAlias` is deprecated in Python 3.12 in favor of `type`
 72if sys.version_info >= (3, 10):
 73    from typing import TypeAlias
 74else:
 75    from typing_extensions import TypeAlias
 76
 77
 78# Type alias to support `os.PathLike`, `str` and `bytes` objects in the API
 79# When Python 3.12 is the minimum supported version we can use `type`
 80# When Python 3.11 is the minimum supported version we can use `|`
 81PathLike: TypeAlias = Union[str, bytes, os.PathLike]
 82
 83
 84def hash(model_path: PathLike) -> manifest.Manifest:
 85    """Hashes a model using the default configuration.
 86
 87    Hashing is the shared part between signing and verification and is also
 88    expected to be the slowest component. When serializing a model, we need to
 89    spend time proportional to the model size on disk.
 90
 91    This method returns a "manifest" of the model. A manifest is a collection of
 92    every object in the model, paired with the corresponding hash. Currently, we
 93    consider an object in the model to be either a file or a shard of the file.
 94    Large models with large files will be hashed much faster when every shard is
 95    hashed in parallel, at the cost of generating a larger payload for the
 96    signature. In future releases we could support hashing individual tensors or
 97    tensor slices for further speed optimizations for very large models.
 98
 99    Args:
100        model_path: The path to the model to hash.
101
102    Returns:
103        A manifest of the hashed model.
104    """
105    return Config().hash(model_path)
106
107
108class Config:
109    """Configuration to use when hashing models.
110
111    Hashing is the shared part between signing and verification and is also
112    expected to be the slowest component. When serializing a model, we need to
113    spend time proportional to the model size on disk.
114
115    Hashing builds a "manifest" of the model. A manifest is a collection of
116    every object in the model, paired with the corresponding hash. Currently, we
117    consider an object in the model to be either a file or a shard of the file.
118    Large models with large files will be hashed much faster when every shard is
119    hashed in parallel, at the cost of generating a larger payload for the
120    signature. In future releases we could support hashing individual tensors or
121    tensor slices for further speed optimizations for very large models.
122
123    This configuration class supports configuring the hashing granularity. By
124    default, we hash at file level granularity.
125
126    This configuration class also supports configuring the hash method used to
127    generate the hash for every object in the model. We currently support SHA256
128    and BLAKE2, with SHA256 being the default.
129
130    This configuration class also supports configuring which paths from the
131    model directory should be ignored. These are files that doesn't impact the
132    behavior of the model, or files that won't be distributed with the model. By
133    default, only files that are associated with a git repository (`.git`,
134    `.gitattributes`, `.gitignore`, etc.) are ignored.
135    """
136
137    def __init__(self):
138        """Initializes the default configuration for hashing."""
139        self._ignored_paths = frozenset()
140        self._ignore_git_paths = True
141        self.use_file_serialization()
142        self._allow_symlinks = False
143
144    def hash(
145        self,
146        model_path: PathLike,
147        *,
148        files_to_hash: Optional[Iterable[PathLike]] = None,
149    ) -> manifest.Manifest:
150        """Hashes a model using the current configuration."""
151        # All paths in ``_ignored_paths`` are expected to be relative to the
152        # model directory. Join them to ``model_path`` and ensure they do not
153        # escape it.
154        model_path = pathlib.Path(model_path)
155        ignored_paths = []
156        for p in self._ignored_paths:
157            full = model_path / p
158            try:
159                full.relative_to(model_path)
160            except ValueError:
161                continue
162            ignored_paths.append(full)
163
164        if self._ignore_git_paths:
165            ignored_paths.extend(
166                [
167                    model_path / p
168                    for p in [
169                        ".git/",
170                        ".gitattributes",
171                        ".github/",
172                        ".gitignore",
173                    ]
174                ]
175            )
176
177        self._serializer.set_allow_symlinks(self._allow_symlinks)
178
179        return self._serializer.serialize(
180            pathlib.Path(model_path),
181            ignore_paths=ignored_paths,
182            files_to_hash=files_to_hash,
183        )
184
185    def _build_stream_hasher(
186        self, hashing_algorithm: Literal["sha256", "blake2"] = "sha256"
187    ) -> hashing.StreamingHashEngine:
188        """Builds a streaming hasher from a constant string.
189
190        Args:
191            hashing_algorithm: The hashing algorithm to use.
192
193        Returns:
194            An instance of the requested hasher.
195        """
196        # TODO: Once Python 3.9 support is deprecated revert to using `match`
197        if hashing_algorithm == "sha256":
198            return memory.SHA256()
199        if hashing_algorithm == "blake2":
200            return memory.BLAKE2()
201
202        raise ValueError(f"Unsupported hashing method {hashing_algorithm}")
203
204    def _build_file_hasher_factory(
205        self,
206        hashing_algorithm: Literal["sha256", "blake2"] = "sha256",
207        chunk_size: int = 1048576,
208    ) -> Callable[[pathlib.Path], io.SimpleFileHasher]:
209        """Builds the hasher factory for a serialization by file.
210
211        Args:
212            hashing_algorithm: The hashing algorithm to use to hash a file.
213            chunk_size: The amount of file to read at once. Default is 1MB. A
214              special value of 0 signals to attempt to read everything in a
215              single call.
216
217        Returns:
218            The hasher factory that should be used by the active serialization
219            method.
220        """
221
222        def _factory(path: pathlib.Path) -> io.SimpleFileHasher:
223            hasher = self._build_stream_hasher(hashing_algorithm)
224            return io.SimpleFileHasher(path, hasher, chunk_size=chunk_size)
225
226        return _factory
227
228    def _build_sharded_file_hasher_factory(
229        self,
230        hashing_algorithm: Literal["sha256", "blake2"] = "sha256",
231        chunk_size: int = 1048576,
232        shard_size: int = 1_000_000_000,
233    ) -> Callable[[pathlib.Path, int, int], io.ShardedFileHasher]:
234        """Builds the hasher factory for a serialization by file shards.
235
236        Args:
237            hashing_algorithm: The hashing algorithm to use to hash a shard.
238            chunk_size: The amount of file to read at once. Default is 1MB. A
239              special value of 0 signals to attempt to read everything in a
240              single call.
241            shard_size: The size of a file shard. Default is 1 GB.
242
243        Returns:
244            The hasher factory that should be used by the active serialization
245            method.
246        """
247
248        def _factory(
249            path: pathlib.Path, start: int, end: int
250        ) -> io.ShardedFileHasher:
251            hasher = self._build_stream_hasher(hashing_algorithm)
252            return io.ShardedFileHasher(
253                path,
254                hasher,
255                start=start,
256                end=end,
257                chunk_size=chunk_size,
258                shard_size=shard_size,
259            )
260
261        return _factory
262
263    def use_file_serialization(
264        self,
265        *,
266        hashing_algorithm: Literal["sha256", "blake2"] = "sha256",
267        chunk_size: int = 1048576,
268        max_workers: Optional[int] = None,
269        allow_symlinks: bool = False,
270        ignore_paths: Iterable[pathlib.Path] = frozenset(),
271    ) -> Self:
272        """Configures serialization to build a manifest of (file, hash) pairs.
273
274        The serialization method in this configuration is changed to one where
275        every file in the model is paired with its digest and a manifest
276        containing all these pairings is being built.
277
278        Args:
279            hashing_algorithm: The hashing algorithm to use to hash a file.
280            chunk_size: The amount of file to read at once. Default is 1MB. A
281              special value of 0 signals to attempt to read everything in a
282              single call.
283            max_workers: Maximum number of workers to use in parallel. Default
284              is to defer to the `concurrent.futures` library to select the best
285              value for the current machine.
286            allow_symlinks: Controls whether symbolic links are included. If a
287              symlink is present but the flag is `False` (default) the
288              serialization would raise an error.
289
290        Returns:
291            The new hashing configuration with the new serialization method.
292        """
293        self._serializer = file.Serializer(
294            self._build_file_hasher_factory(hashing_algorithm, chunk_size),
295            max_workers=max_workers,
296            allow_symlinks=allow_symlinks,
297            ignore_paths=ignore_paths,
298        )
299        return self
300
301    def use_shard_serialization(
302        self,
303        *,
304        hashing_algorithm: Literal["sha256", "blake2"] = "sha256",
305        chunk_size: int = 1048576,
306        shard_size: int = 1_000_000_000,
307        max_workers: Optional[int] = None,
308        allow_symlinks: bool = False,
309        ignore_paths: Iterable[pathlib.Path] = frozenset(),
310    ) -> Self:
311        """Configures serialization to build a manifest of (shard, hash) pairs.
312
313        The serialization method in this configuration is changed to one where
314        every file in the model is sharded in equal sized shards, every shard is
315        paired with its digest and a manifest containing all these pairings is
316        being built.
317
318        Args:
319            hashing_algorithm: The hashing algorithm to use to hash a shard.
320            chunk_size: The amount of file to read at once. Default is 1MB. A
321              special value of 0 signals to attempt to read everything in a
322              single call.
323            shard_size: The size of a file shard. Default is 1 GB.
324            max_workers: Maximum number of workers to use in parallel. Default
325              is to defer to the `concurrent.futures` library to select the best
326              value for the current machine.
327            allow_symlinks: Controls whether symbolic links are included. If a
328              symlink is present but the flag is `False` (default) the
329              serialization would raise an error.
330            ignore_paths: Paths of files to ignore.
331
332        Returns:
333            The new hashing configuration with the new serialization method.
334        """
335        self._serializer = file_shard.Serializer(
336            self._build_sharded_file_hasher_factory(
337                hashing_algorithm, chunk_size, shard_size
338            ),
339            max_workers=max_workers,
340            allow_symlinks=allow_symlinks,
341            ignore_paths=ignore_paths,
342        )
343        return self
344
345    def set_ignored_paths(
346        self, *, paths: Iterable[PathLike], ignore_git_paths: bool = True
347    ) -> Self:
348        """Configures the paths to be ignored during serialization of a model.
349
350        If the model is a single file, there are no paths that are ignored. If
351        the model is a directory, all paths are considered as relative to the
352        model directory, since we never look at files outside of it.
353
354        If an ignored path is a directory, serialization will ignore both the
355        path and any of its children.
356
357        Args:
358            paths: The paths to ignore.
359            ignore_git_paths: Whether to ignore git related paths (default) or
360              include them in the signature.
361
362        Returns:
363            The new hashing configuration with a new set of ignored paths.
364        """
365        # Preserve the user-provided relative paths; they are resolved against
366        # the model directory later when hashing.
367        self._ignored_paths = frozenset(pathlib.Path(p) for p in paths)
368        self._ignore_git_paths = ignore_git_paths
369        return self
370
371    def add_ignored_paths(
372        self, *, model_path: PathLike, paths: Iterable[PathLike]
373    ) -> None:
374        """Add more paths to ignore to existing set of paths.
375
376        Args:
377            model_path: The path to the model
378            paths: Additional paths to ignore. All path must be relative to
379                   the model directory.
380        """
381        newset = set(self._ignored_paths)
382        model_path = pathlib.Path(model_path)
383        for p in paths:
384            candidate = pathlib.Path(p)
385            full = model_path / candidate
386            try:
387                full.relative_to(model_path)
388            except ValueError:
389                continue
390            newset.add(candidate)
391        self._ignored_paths = newset
392
393    def set_allow_symlinks(self, allow_symlinks: bool) -> Self:
394        """Set whether following symlinks is allowed."""
395        self._allow_symlinks = allow_symlinks
396        return self
PathLike: TypeAlias = Union[str, bytes, os.PathLike]
def hash( model_path: Union[str, bytes, os.PathLike]) -> model_signing.manifest.Manifest:
 85def hash(model_path: PathLike) -> manifest.Manifest:
 86    """Hashes a model using the default configuration.
 87
 88    Hashing is the shared part between signing and verification and is also
 89    expected to be the slowest component. When serializing a model, we need to
 90    spend time proportional to the model size on disk.
 91
 92    This method returns a "manifest" of the model. A manifest is a collection of
 93    every object in the model, paired with the corresponding hash. Currently, we
 94    consider an object in the model to be either a file or a shard of the file.
 95    Large models with large files will be hashed much faster when every shard is
 96    hashed in parallel, at the cost of generating a larger payload for the
 97    signature. In future releases we could support hashing individual tensors or
 98    tensor slices for further speed optimizations for very large models.
 99
100    Args:
101        model_path: The path to the model to hash.
102
103    Returns:
104        A manifest of the hashed model.
105    """
106    return Config().hash(model_path)

Hashes a model using the default configuration.

Hashing is the shared part between signing and verification and is also expected to be the slowest component. When serializing a model, we need to spend time proportional to the model size on disk.

This method returns a "manifest" of the model. A manifest is a collection of every object in the model, paired with the corresponding hash. Currently, we consider an object in the model to be either a file or a shard of the file. Large models with large files will be hashed much faster when every shard is hashed in parallel, at the cost of generating a larger payload for the signature. In future releases we could support hashing individual tensors or tensor slices for further speed optimizations for very large models.

Arguments:
  • model_path: The path to the model to hash.
Returns:

A manifest of the hashed model.
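For example, hashing a model with the default configuration is a single call (the path below is illustrative):

```python
import model_signing

# Hash a model directory (or a single model file) using the defaults:
# file-level granularity, SHA256, and git-related paths ignored.
manifest = model_signing.hashing.hash("path/to/model")
```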

class Config:
109class Config:
110    """Configuration to use when hashing models.
111
112    Hashing is the shared part between signing and verification and is also
113    expected to be the slowest component. When serializing a model, we need to
114    spend time proportional to the model size on disk.
115
116    Hashing builds a "manifest" of the model. A manifest is a collection of
117    every object in the model, paired with the corresponding hash. Currently, we
118    consider an object in the model to be either a file or a shard of the file.
119    Large models with large files will be hashed much faster when every shard is
120    hashed in parallel, at the cost of generating a larger payload for the
121    signature. In future releases we could support hashing individual tensors or
122    tensor slices for further speed optimizations for very large models.
123
124    This configuration class supports configuring the hashing granularity. By
125    default, we hash at file level granularity.
126
127    This configuration class also supports configuring the hash method used to
128    generate the hash for every object in the model. We currently support SHA256
129    and BLAKE2, with SHA256 being the default.
130
131    This configuration class also supports configuring which paths from the
132    model directory should be ignored. These are files that doesn't impact the
133    behavior of the model, or files that won't be distributed with the model. By
134    default, only files that are associated with a git repository (`.git`,
135    `.gitattributes`, `.gitignore`, etc.) are ignored.
136    """
137
138    def __init__(self):
139        """Initializes the default configuration for hashing."""
140        self._ignored_paths = frozenset()
141        self._ignore_git_paths = True
142        self.use_file_serialization()
143        self._allow_symlinks = False
144
145    def hash(
146        self,
147        model_path: PathLike,
148        *,
149        files_to_hash: Optional[Iterable[PathLike]] = None,
150    ) -> manifest.Manifest:
151        """Hashes a model using the current configuration."""
152        # All paths in ``_ignored_paths`` are expected to be relative to the
153        # model directory. Join them to ``model_path`` and ensure they do not
154        # escape it.
155        model_path = pathlib.Path(model_path)
156        ignored_paths = []
157        for p in self._ignored_paths:
158            full = model_path / p
159            try:
160                full.relative_to(model_path)
161            except ValueError:
162                continue
163            ignored_paths.append(full)
164
165        if self._ignore_git_paths:
166            ignored_paths.extend(
167                [
168                    model_path / p
169                    for p in [
170                        ".git/",
171                        ".gitattributes",
172                        ".github/",
173                        ".gitignore",
174                    ]
175                ]
176            )
177
178        self._serializer.set_allow_symlinks(self._allow_symlinks)
179
180        return self._serializer.serialize(
181            pathlib.Path(model_path),
182            ignore_paths=ignored_paths,
183            files_to_hash=files_to_hash,
184        )
185
186    def _build_stream_hasher(
187        self, hashing_algorithm: Literal["sha256", "blake2"] = "sha256"
188    ) -> hashing.StreamingHashEngine:
189        """Builds a streaming hasher from a constant string.
190
191        Args:
192            hashing_algorithm: The hashing algorithm to use.
193
194        Returns:
195            An instance of the requested hasher.
196        """
197        # TODO: Once Python 3.9 support is deprecated revert to using `match`
198        if hashing_algorithm == "sha256":
199            return memory.SHA256()
200        if hashing_algorithm == "blake2":
201            return memory.BLAKE2()
202
203        raise ValueError(f"Unsupported hashing method {hashing_algorithm}")
204
205    def _build_file_hasher_factory(
206        self,
207        hashing_algorithm: Literal["sha256", "blake2"] = "sha256",
208        chunk_size: int = 1048576,
209    ) -> Callable[[pathlib.Path], io.SimpleFileHasher]:
210        """Builds the hasher factory for a serialization by file.
211
212        Args:
213            hashing_algorithm: The hashing algorithm to use to hash a file.
214            chunk_size: The amount of file to read at once. Default is 1MB. A
215              special value of 0 signals to attempt to read everything in a
216              single call.
217
218        Returns:
219            The hasher factory that should be used by the active serialization
220            method.
221        """
222
223        def _factory(path: pathlib.Path) -> io.SimpleFileHasher:
224            hasher = self._build_stream_hasher(hashing_algorithm)
225            return io.SimpleFileHasher(path, hasher, chunk_size=chunk_size)
226
227        return _factory
228
229    def _build_sharded_file_hasher_factory(
230        self,
231        hashing_algorithm: Literal["sha256", "blake2"] = "sha256",
232        chunk_size: int = 1048576,
233        shard_size: int = 1_000_000_000,
234    ) -> Callable[[pathlib.Path, int, int], io.ShardedFileHasher]:
235        """Builds the hasher factory for a serialization by file shards.
236
237        Args:
238            hashing_algorithm: The hashing algorithm to use to hash a shard.
239            chunk_size: The amount of file to read at once. Default is 1MB. A
240              special value of 0 signals to attempt to read everything in a
241              single call.
242            shard_size: The size of a file shard. Default is 1 GB.
243
244        Returns:
245            The hasher factory that should be used by the active serialization
246            method.
247        """
248
249        def _factory(
250            path: pathlib.Path, start: int, end: int
251        ) -> io.ShardedFileHasher:
252            hasher = self._build_stream_hasher(hashing_algorithm)
253            return io.ShardedFileHasher(
254                path,
255                hasher,
256                start=start,
257                end=end,
258                chunk_size=chunk_size,
259                shard_size=shard_size,
260            )
261
262        return _factory
263
264    def use_file_serialization(
265        self,
266        *,
267        hashing_algorithm: Literal["sha256", "blake2"] = "sha256",
268        chunk_size: int = 1048576,
269        max_workers: Optional[int] = None,
270        allow_symlinks: bool = False,
271        ignore_paths: Iterable[pathlib.Path] = frozenset(),
272    ) -> Self:
273        """Configures serialization to build a manifest of (file, hash) pairs.
274
275        The serialization method in this configuration is changed to one where
276        every file in the model is paired with its digest and a manifest
277        containing all these pairings is being built.
278
279        Args:
280            hashing_algorithm: The hashing algorithm to use to hash a file.
281            chunk_size: The amount of file to read at once. Default is 1MB. A
282              special value of 0 signals to attempt to read everything in a
283              single call.
284            max_workers: Maximum number of workers to use in parallel. Default
285              is to defer to the `concurrent.futures` library to select the best
286              value for the current machine.
287            allow_symlinks: Controls whether symbolic links are included. If a
288              symlink is present but the flag is `False` (default) the
289              serialization would raise an error.
290
291        Returns:
292            The new hashing configuration with the new serialization method.
293        """
294        self._serializer = file.Serializer(
295            self._build_file_hasher_factory(hashing_algorithm, chunk_size),
296            max_workers=max_workers,
297            allow_symlinks=allow_symlinks,
298            ignore_paths=ignore_paths,
299        )
300        return self
301
302    def use_shard_serialization(
303        self,
304        *,
305        hashing_algorithm: Literal["sha256", "blake2"] = "sha256",
306        chunk_size: int = 1048576,
307        shard_size: int = 1_000_000_000,
308        max_workers: Optional[int] = None,
309        allow_symlinks: bool = False,
310        ignore_paths: Iterable[pathlib.Path] = frozenset(),
311    ) -> Self:
312        """Configures serialization to build a manifest of (shard, hash) pairs.
313
314        The serialization method in this configuration is changed to one where
315        every file in the model is sharded in equal sized shards, every shard is
316        paired with its digest and a manifest containing all these pairings is
317        being built.
318
319        Args:
320            hashing_algorithm: The hashing algorithm to use to hash a shard.
321            chunk_size: The amount of file to read at once. Default is 1MB. A
322              special value of 0 signals to attempt to read everything in a
323              single call.
324            shard_size: The size of a file shard. Default is 1 GB.
325            max_workers: Maximum number of workers to use in parallel. Default
326              is to defer to the `concurrent.futures` library to select the best
327              value for the current machine.
328            allow_symlinks: Controls whether symbolic links are included. If a
329              symlink is present but the flag is `False` (default) the
330              serialization would raise an error.
331            ignore_paths: Paths of files to ignore.
332
333        Returns:
334            The new hashing configuration with the new serialization method.
335        """
336        self._serializer = file_shard.Serializer(
337            self._build_sharded_file_hasher_factory(
338                hashing_algorithm, chunk_size, shard_size
339            ),
340            max_workers=max_workers,
341            allow_symlinks=allow_symlinks,
342            ignore_paths=ignore_paths,
343        )
344        return self
345
346    def set_ignored_paths(
347        self, *, paths: Iterable[PathLike], ignore_git_paths: bool = True
348    ) -> Self:
349        """Configures the paths to be ignored during serialization of a model.
350
351        If the model is a single file, there are no paths that are ignored. If
352        the model is a directory, all paths are considered as relative to the
353        model directory, since we never look at files outside of it.
354
355        If an ignored path is a directory, serialization will ignore both the
356        path and any of its children.
357
358        Args:
359            paths: The paths to ignore.
360            ignore_git_paths: Whether to ignore git related paths (default) or
361              include them in the signature.
362
363        Returns:
364            The new hashing configuration with a new set of ignored paths.
365        """
366        # Preserve the user-provided relative paths; they are resolved against
367        # the model directory later when hashing.
368        self._ignored_paths = frozenset(pathlib.Path(p) for p in paths)
369        self._ignore_git_paths = ignore_git_paths
370        return self
371
372    def add_ignored_paths(
373        self, *, model_path: PathLike, paths: Iterable[PathLike]
374    ) -> None:
375        """Add more paths to ignore to existing set of paths.
376
377        Args:
378            model_path: The path to the model
379            paths: Additional paths to ignore. All path must be relative to
380                   the model directory.
381        """
382        newset = set(self._ignored_paths)
383        model_path = pathlib.Path(model_path)
384        for p in paths:
385            candidate = pathlib.Path(p)
386            full = model_path / candidate
387            try:
388                full.relative_to(model_path)
389            except ValueError:
390                continue
391            newset.add(candidate)
392        self._ignored_paths = newset
393
394    def set_allow_symlinks(self, allow_symlinks: bool) -> Self:
395        """Set whether following symlinks is allowed."""
396        self._allow_symlinks = allow_symlinks
397        return self

Configuration to use when hashing models.

Hashing is the shared part between signing and verification and is also expected to be the slowest component. When serializing a model, we need to spend time proportional to the model size on disk.

Hashing builds a "manifest" of the model. A manifest is a collection of every object in the model, paired with the corresponding hash. Currently, we consider an object in the model to be either a file or a shard of the file. Large models with large files will be hashed much faster when every shard is hashed in parallel, at the cost of generating a larger payload for the signature. In future releases we could support hashing individual tensors or tensor slices for further speed optimizations for very large models.

This configuration class supports configuring the hashing granularity. By default, we hash at file level granularity.

This configuration class also supports configuring the hash method used to generate the hash for every object in the model. We currently support SHA256 and BLAKE2, with SHA256 being the default.

This configuration class also supports configuring which paths from the model directory should be ignored. These are files that don't impact the behavior of the model, or files that won't be distributed with the model. By default, only files that are associated with a git repository (.git, .gitattributes, .gitignore, etc.) are ignored.
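As a sketch of how these options combine (the path, file names, and sizes below are illustrative), granularity, hash method, and ignored paths can all be set on one `Config` and reused:

```python
import model_signing

config = (
    model_signing.hashing.Config()
    # Shard-level granularity with BLAKE2 instead of the default SHA256.
    .use_shard_serialization(hashing_algorithm="blake2", shard_size=1_000_000_000)
    # Skip files that do not influence the model's behavior.
    .set_ignored_paths(paths=["README.md", "docs/"], ignore_git_paths=True)
)

manifest = config.hash("path/to/model")
```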

Config()
138    def __init__(self):
139        """Initializes the default configuration for hashing."""
140        self._ignored_paths = frozenset()
141        self._ignore_git_paths = True
142        self.use_file_serialization()
143        self._allow_symlinks = False

Initializes the default configuration for hashing.
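For reference, the defaults shown in `__init__` above amount to roughly the following explicit configuration (a sketch, not a required incantation):

```python
default_config = model_signing.hashing.Config()

# Roughly equivalent, spelled out: file-level granularity with SHA256,
# git-related paths ignored, and symlinks disallowed.
explicit_config = (
    model_signing.hashing.Config()
    .use_file_serialization(hashing_algorithm="sha256", allow_symlinks=False)
    .set_ignored_paths(paths=[], ignore_git_paths=True)
    .set_allow_symlinks(False)
)
```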

def hash( self, model_path: Union[str, bytes, os.PathLike], *, files_to_hash: Optional[Iterable[Union[str, bytes, os.PathLike]]] = None) -> model_signing.manifest.Manifest:
145    def hash(
146        self,
147        model_path: PathLike,
148        *,
149        files_to_hash: Optional[Iterable[PathLike]] = None,
150    ) -> manifest.Manifest:
151        """Hashes a model using the current configuration."""
152        # All paths in ``_ignored_paths`` are expected to be relative to the
153        # model directory. Join them to ``model_path`` and ensure they do not
154        # escape it.
155        model_path = pathlib.Path(model_path)
156        ignored_paths = []
157        for p in self._ignored_paths:
158            full = model_path / p
159            try:
160                full.relative_to(model_path)
161            except ValueError:
162                continue
163            ignored_paths.append(full)
164
165        if self._ignore_git_paths:
166            ignored_paths.extend(
167                [
168                    model_path / p
169                    for p in [
170                        ".git/",
171                        ".gitattributes",
172                        ".github/",
173                        ".gitignore",
174                    ]
175                ]
176            )
177
178        self._serializer.set_allow_symlinks(self._allow_symlinks)
179
180        return self._serializer.serialize(
181            pathlib.Path(model_path),
182            ignore_paths=ignored_paths,
183            files_to_hash=files_to_hash,
184        )

Hashes a model using the current configuration.
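A minimal usage sketch (paths are illustrative; `files_to_hash` is keyword-only and its exact semantics are defined by the serializer, the assumption below being that it limits hashing to the listed files):

```python
config = model_signing.hashing.Config()

# Hash the whole model with the current configuration.
full_manifest = config.hash("path/to/model")

# Hash only selected files (assumed behavior of `files_to_hash`).
partial_manifest = config.hash(
    "path/to/model",
    files_to_hash=["path/to/model/model.safetensors"],
)
```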

def use_file_serialization( self, *, hashing_algorithm: Literal['sha256', 'blake2'] = 'sha256', chunk_size: int = 1048576, max_workers: Optional[int] = None, allow_symlinks: bool = False, ignore_paths: Iterable[pathlib.Path] = frozenset()) -> Self:
264    def use_file_serialization(
265        self,
266        *,
267        hashing_algorithm: Literal["sha256", "blake2"] = "sha256",
268        chunk_size: int = 1048576,
269        max_workers: Optional[int] = None,
270        allow_symlinks: bool = False,
271        ignore_paths: Iterable[pathlib.Path] = frozenset(),
272    ) -> Self:
273        """Configures serialization to build a manifest of (file, hash) pairs.
274
275        The serialization method in this configuration is changed to one where
276        every file in the model is paired with its digest and a manifest
277        containing all these pairings is being built.
278
279        Args:
280            hashing_algorithm: The hashing algorithm to use to hash a file.
281            chunk_size: The amount of file to read at once. Default is 1MB. A
282              special value of 0 signals to attempt to read everything in a
283              single call.
284            max_workers: Maximum number of workers to use in parallel. Default
285              is to defer to the `concurrent.futures` library to select the best
286              value for the current machine.
287            allow_symlinks: Controls whether symbolic links are included. If a
288              symlink is present but the flag is `False` (default) the
289              serialization would raise an error.
290
291        Returns:
292            The new hashing configuration with the new serialization method.
293        """
294        self._serializer = file.Serializer(
295            self._build_file_hasher_factory(hashing_algorithm, chunk_size),
296            max_workers=max_workers,
297            allow_symlinks=allow_symlinks,
298            ignore_paths=ignore_paths,
299        )
300        return self

Configures serialization to build a manifest of (file, hash) pairs.

The serialization method in this configuration is changed to one where every file in the model is paired with its digest and a manifest containing all these pairings is built.

Arguments:
  • hashing_algorithm: The hashing algorithm to use to hash a file.
  • chunk_size: The amount of file to read at once. Default is 1MB. A special value of 0 signals to attempt to read everything in a single call.
  • max_workers: Maximum number of workers to use in parallel. Default is to defer to the concurrent.futures library to select the best value for the current machine.
  • allow_symlinks: Controls whether symbolic links are included. If a symlink is present but the flag is False (default), serialization will raise an error.
  • ignore_paths: Paths of files to ignore.
Returns:

The new hashing configuration with the new serialization method.
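For example (values are illustrative), a file-level configuration that reads each file in a single call and caps parallelism might look like:

```python
config = model_signing.hashing.Config().use_file_serialization(
    hashing_algorithm="blake2",
    chunk_size=0,    # 0 means attempt to read each file in one call
    max_workers=4,   # otherwise concurrent.futures picks a value
)
```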

def use_shard_serialization( self, *, hashing_algorithm: Literal['sha256', 'blake2'] = 'sha256', chunk_size: int = 1048576, shard_size: int = 1000000000, max_workers: Optional[int] = None, allow_symlinks: bool = False, ignore_paths: Iterable[pathlib.Path] = frozenset()) -> Self:
302    def use_shard_serialization(
303        self,
304        *,
305        hashing_algorithm: Literal["sha256", "blake2"] = "sha256",
306        chunk_size: int = 1048576,
307        shard_size: int = 1_000_000_000,
308        max_workers: Optional[int] = None,
309        allow_symlinks: bool = False,
310        ignore_paths: Iterable[pathlib.Path] = frozenset(),
311    ) -> Self:
312        """Configures serialization to build a manifest of (shard, hash) pairs.
313
314        The serialization method in this configuration is changed to one where
315        every file in the model is sharded in equal sized shards, every shard is
316        paired with its digest and a manifest containing all these pairings is
317        being built.
318
319        Args:
320            hashing_algorithm: The hashing algorithm to use to hash a shard.
321            chunk_size: The amount of file to read at once. Default is 1MB. A
322              special value of 0 signals to attempt to read everything in a
323              single call.
324            shard_size: The size of a file shard. Default is 1 GB.
325            max_workers: Maximum number of workers to use in parallel. Default
326              is to defer to the `concurrent.futures` library to select the best
327              value for the current machine.
328            allow_symlinks: Controls whether symbolic links are included. If a
329              symlink is present but the flag is `False` (default) the
330              serialization would raise an error.
331            ignore_paths: Paths of files to ignore.
332
333        Returns:
334            The new hashing configuration with the new serialization method.
335        """
336        self._serializer = file_shard.Serializer(
337            self._build_sharded_file_hasher_factory(
338                hashing_algorithm, chunk_size, shard_size
339            ),
340            max_workers=max_workers,
341            allow_symlinks=allow_symlinks,
342            ignore_paths=ignore_paths,
343        )
344        return self

Configures serialization to build a manifest of (shard, hash) pairs.

The serialization method in this configuration is changed to one where every file in the model is split into equal-sized shards, every shard is paired with its digest, and a manifest containing all these pairings is built.

Arguments:
  • hashing_algorithm: The hashing algorithm to use to hash a shard.
  • chunk_size: The amount of file to read at once. Default is 1MB. A special value of 0 signals to attempt to read everything in a single call.
  • shard_size: The size of a file shard. Default is 1 GB.
  • max_workers: Maximum number of workers to use in parallel. Default is to defer to the concurrent.futures library to select the best value for the current machine.
  • allow_symlinks: Controls whether symbolic links are included. If a symlink is present but the flag is False (default), serialization will raise an error.
  • ignore_paths: Paths of files to ignore.
Returns:

The new hashing configuration with the new serialization method.
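For example (sizes are illustrative), larger shards reduce the number of manifest entries at the cost of less parallelism:

```python
config = model_signing.hashing.Config().use_shard_serialization(
    shard_size=2_000_000_000,  # 2 GB shards instead of the 1 GB default
    chunk_size=1048576,        # read 1 MB of a shard at a time
    max_workers=8,
)
```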

def set_ignored_paths( self, *, paths: Iterable[typing.Union[str, bytes, os.PathLike]], ignore_git_paths: bool = True) -> Self:
346    def set_ignored_paths(
347        self, *, paths: Iterable[PathLike], ignore_git_paths: bool = True
348    ) -> Self:
349        """Configures the paths to be ignored during serialization of a model.
350
351        If the model is a single file, there are no paths that are ignored. If
352        the model is a directory, all paths are considered as relative to the
353        model directory, since we never look at files outside of it.
354
355        If an ignored path is a directory, serialization will ignore both the
356        path and any of its children.
357
358        Args:
359            paths: The paths to ignore.
360            ignore_git_paths: Whether to ignore git related paths (default) or
361              include them in the signature.
362
363        Returns:
364            The new hashing configuration with a new set of ignored paths.
365        """
366        # Preserve the user-provided relative paths; they are resolved against
367        # the model directory later when hashing.
368        self._ignored_paths = frozenset(pathlib.Path(p) for p in paths)
369        self._ignore_git_paths = ignore_git_paths
370        return self

Configures the paths to be ignored during serialization of a model.

If the model is a single file, there are no paths that are ignored. If the model is a directory, all paths are considered relative to the model directory, since we never look at files outside of it.

If an ignored path is a directory, serialization will ignore both the path and any of its children.

Arguments:
  • paths: The paths to ignore.
  • ignore_git_paths: Whether to ignore git related paths (default) or include them in the signature.
Returns:

The new hashing configuration with a new set of ignored paths.
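For example (file names are illustrative):

```python
config = model_signing.hashing.Config().set_ignored_paths(
    paths=["README.md", "training_logs/"],  # relative to the model directory
    ignore_git_paths=True,
)
```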

def add_ignored_paths( self, *, model_path: Union[str, bytes, os.PathLike], paths: Iterable[typing.Union[str, bytes, os.PathLike]]) -> None:
372    def add_ignored_paths(
373        self, *, model_path: PathLike, paths: Iterable[PathLike]
374    ) -> None:
375        """Add more paths to ignore to existing set of paths.
376
377        Args:
378            model_path: The path to the model
379            paths: Additional paths to ignore. All path must be relative to
380                   the model directory.
381        """
382        newset = set(self._ignored_paths)
383        model_path = pathlib.Path(model_path)
384        for p in paths:
385            candidate = pathlib.Path(p)
386            full = model_path / candidate
387            try:
388                full.relative_to(model_path)
389            except ValueError:
390                continue
391            newset.add(candidate)
392        self._ignored_paths = newset

Adds more paths to ignore to the existing set of ignored paths.

Arguments:
  • model_path: The path to the model.
  • paths: Additional paths to ignore. All paths must be relative to the model directory.
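For example (paths are illustrative); note that, unlike the other setters, this method returns `None` and therefore cannot be chained:

```python
config = model_signing.hashing.Config().set_ignored_paths(paths=["README.md"])

# Extend the ignore set later; paths that escape the model directory are skipped.
config.add_ignored_paths(
    model_path="path/to/model",
    paths=["eval/", "notes.txt"],
)
```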