jsonl.load_archive¶
Load multiple JSON Lines files from a ZIP or TAR archive incrementally.
Function Signature¶
jsonl.load_archive(
file,
*,
pattern="*.jsonl",
pwd=None,
opener=None,
broken=False,
json_loads=None,
chunk_size=64 * 1024,
**json_loads_kwargs,
)
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
file |
str, PathLike, URL, Request, file-like |
(required) | Archive file to load from |
pattern |
str |
"*.jsonl" |
Unix shell-style wildcard pattern to filter filenames inside the archive |
pwd |
bytes or None |
None |
Password to decrypt the archive (ZIP only) |
opener |
Callable or None |
None |
Custom function to open the file (not supported for URLs) |
broken |
bool |
False |
If True, skip malformed lines and log a warning |
json_loads |
Callable or None |
None |
Custom deserialization function. Defaults to json.loads |
chunk_size |
int |
64 * 1024 | The size (in bytes) of chunks when reading from a URL to avoid loading the entire file into memory at once. |
**json_loads_kwargs |
Additional keyword arguments passed to the deserialization function |
Returns¶
Iterator[tuple[str, Iterator[Any]]] — An iterator of (filename, items) tuples, where items is an iterator of
deserialized objects.
Supported Archive Formats¶
- ZIP archives (
.zip) - TAR archives (
.tar), including compressed variants:.tar.gz,.tar.bz2,.tar.xz
Key Features¶
- Load from local files or remote URLs
- Filter files inside the archive using Unix shell-style wildcards
- Support for compressed
.jsonlfiles inside the archive (e.g.,.jsonl.gz,.jsonl.bz2,.jsonl.xz). Check compression detection for details. - ZIP archives with password protection
- Graceful handling of malformed lines via the
brokenparameter
Examples¶
Load from a local archive¶
import jsonl
for filename, items in jsonl.load_archive("archive.zip"):
print(f"--- {filename} ---")
for item in items:
print(item)
Load from a remote archive (URL)¶
You can load archives from a URL. For custom request headers, use urllib.request.Request:
import urllib.request
import jsonl
# Load directly from a URL
for filename, items in jsonl.load_archive("https://example.com/archive.zip"):
print(f"--- {filename} ---")
for item in items:
print(item)
# Load using a custom request with headers
req = urllib.request.Request("https://example.com/archive.zip", headers={"Accept": "application/zip"})
for filename, items in jsonl.load_archive(req, chunk_size=128 * 1024): # increase chunk size for larger files if needed
print(f"--- {filename} ---")
for item in items:
print(item)
Filter files with pattern matching¶
Use Unix shell-style wildcards to select specific files within the archive:
| Pattern | Matches |
|---|---|
*.jsonl |
All .jsonl files (default) |
users*.jsonl |
Files starting with users |
data/[ab]*.jsonl |
Files in data/ starting with a or b |
* |
All files |
For more details, see the fnmatch documentation.