Load multiple JSON Lines Files from an Archive (ZIP or TAR) incrementally
Allows to load multiple JSON Lines (.jsonl) files incrementally from a ZIP or TAR archive.
- Support compressed TAR archives (e.g.,
.tar.gz
,.tar.bz2
,.tar.xz
). - Support loading the archive from a URL.
- Support ZIP archives with password protection.
- Filename filtering using Unix shell-style wildcards via
fnmatch
. Use a pattern (e.g., *.jsonl) to selectively load only matching files within the archive. - Support for both compressed and uncompressed
.jsonl
files inside the archive. (e.g.,*.jsonl.gz
or*.jsonl.bz2
or*.jsonl.xz
(Check note for more details). - Graceful handling of malformed or broken lines.
- Optional custom deserialization and opener callbacks for advanced use cases.
Load from a local archive
import jsonl
path = "path/to/archive.zip"
# Load all JSON Lines files matching the pattern "*.jsonl"
for filename, iterator in jsonl.load_archive(path):
print("Filename:", filename)
print("Data:", tuple(iterator))
Load from a remote archive (URL)
You can load the archive from a URL, if needed you can also create custom requests using urllib.request.Request
.
import urllib.request
import jsonl
# Load all JSON Lines files matching the pattern "*.jsonl" from a remote archive:
# ------------------------
# Load directly from a URL
for filename, iterator in jsonl.load_archive("https://example.com/archive.zip"):
print("Filename:", filename)
print("Data:", tuple(iterator))
# Load using a custom request
req = urllib.request.Request("https://example.com/archive.zip", headers={"Accept": "application/zip"})
for filename, iterator in jsonl.load_archive(req):
print("Filename:", filename)
print("Data:", tuple(iterator))
Pattern matching
You can use Unix shell-style wildcards to filter files in the archive. The pattern
argument supports:
*
matches everything?
matches a single character[seq]
matches any character inseq
[!seq]
matches any character not inseq
For more information, refer to the fnmatch documentation.
import jsonl
path = "path/to/archive.zip"
# Load all JSON Lines files matching the pattern "myfile*.jsonl" from the archive
for filename, iterator in jsonl.load_archive(path, pattern="myfile*.jsonl"):
print("Filename:", filename)
print("Data:", tuple(iterator))