File Reader Tool

The FileReader tool reads files in several formats, with line numbers, pagination, and safety limits. It supports plain text files, Jupyter Notebooks, PDFs, and images.

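Class overview

FileReader provides four read methods, one per file format:
- read() - text files, with line numbers and pagination
- read_notebook() - Jupyter Notebooks (.ipynb)
- read_pdf() - PDF files (requires an optional dependency)
- read_image() - image files, returned as multimodal content blocks (compression uses an optional dependency)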

Usage

Reading text files

from toolregistry_hub import FileReader

# Read a file with line numbers (up to 2000 lines by default)
content = FileReader.read("/path/to/file.py")
print(content)
# [/path/to/file.py] lines 1-200 of 200
#   1 | import os
#   2 | import sys
#   3 |
#   4 | def main():
# ...

# Paginated read: 25 lines starting at line 50
content = FileReader.read("/path/to/file.py", offset=50, limit=25)
# [/path/to/file.py] lines 50-74 of 200 (use offset=75 to read more)
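When a read stops before the end of the file, the metadata header carries a `use offset=N` hint. The following is a minimal sketch of how a caller might pick up that hint to decide where the next page starts. The helper name `next_offset` is hypothetical and not part of FileReader; the sample strings are hand-written in the header format shown above.

```python
import re
from typing import Optional

# Hypothetical helper: extracts the continuation hint from the metadata
# header that read() places on the first line of its result.
_HINT = re.compile(r"\(use offset=(\d+) to read more\)")


def next_offset(result: str) -> Optional[int]:
    """Return the offset suggested by the header, or None if no hint is present."""
    header = result.split("\n", 1)[0]
    match = _HINT.search(header)
    return int(match.group(1)) if match else None


page = "[/path/to/file.py] lines 50-74 of 200 (use offset=75 to read more)\n50 | x = 1"
last = "[/path/to/file.py] lines 175-200 of 200\n175 | y = 2"
print(next_offset(page))  # 75
print(next_offset(last))  # None
```

A paging loop could call `FileReader.read(path, offset=...)` repeatedly until this helper returns `None`.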

Reading Jupyter Notebooks

# Read notebook cells, showing type markers and outputs
content = FileReader.read_notebook("analysis.ipynb")
# [Notebook: analysis.ipynb]
#
# --- Cell 1 [markdown] ---
# # Data Analysis
#
# --- Cell 2 [code] ---
# ```python
# import pandas as pd
# df = pd.read_csv("data.csv")
# ```
# Output:
# ...

No external dependencies are needed: the standard-library json module is used.
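The standard library is enough because an `.ipynb` file is a plain JSON document with a top-level `cells` list. As an illustration, the sketch below writes a hand-made minimal notebook to a temporary file and loads it back with `json` alone. The notebook contents are invented for the example, and the code does not call FileReader.

```python
import json
import tempfile
from pathlib import Path

# A hand-written minimal notebook (nbformat 4 layout); contents are illustrative.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {"kernelspec": {"name": "python3", "language": "python"}},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Data Analysis"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 1,
            "source": ["import math\n", "print(math.pi)"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["3.141592653589793\n"]}
            ],
        },
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    nb_path = Path(tmp) / "analysis.ipynb"
    nb_path.write_text(json.dumps(notebook), encoding="utf-8")

    data = json.loads(nb_path.read_text(encoding="utf-8"))
    # Each cell stores its source as a list of line fragments.
    summary = [(cell["cell_type"], "".join(cell["source"])) for cell in data["cells"]]

print(summary)
```

A file written this way should also be a convenient fixture for trying `FileReader.read_notebook()`.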

Reading PDFs

# Read all pages (capped at 20)
content = FileReader.read_pdf("document.pdf")

# Read a page range
content = FileReader.read_pdf("document.pdf", pages="5-10")

# Read a single page
content = FileReader.read_pdf("document.pdf", pages="3")

Requires pypdf or pdfplumber:

pip install toolregistry-hub[reader]

If both are installed, pdfplumber is preferred for its better text quality.
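The `pages` argument accepts either a single page (`"3"`) or an inclusive range (`"5-10"`). The library's own parser is internal and not shown in this document, so the following is only a sketch of how such strings could be interpreted. The function `parse_pages` is hypothetical, and clamping an oversized range to 20 pages (rather than rejecting it) is an assumption.

```python
MAX_PAGES = 20  # per-call cap described in this document


def parse_pages(pages):
    """Turn "3" or "5-10" into an inclusive, 1-based (start, end) pair.

    Hypothetical stand-in for the library's internal parsing; the real
    helper may validate or clamp differently.
    """
    if pages is None:
        return 1, MAX_PAGES
    first, sep, last = pages.strip().partition("-")
    try:
        start = int(first)
        end = int(last) if sep else start
    except ValueError:
        raise ValueError(f"Invalid page range: {pages!r}") from None
    if start < 1 or end < start:
        raise ValueError(f"Invalid page range: {pages!r}")
    # Assumption: oversized ranges are clamped to the cap, not rejected.
    return start, min(end, start + MAX_PAGES - 1)


print(parse_pages("5-10"))  # (5, 10)
print(parse_pages("3"))     # (3, 3)
print(parse_pages("1-50"))  # (1, 20)
```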

Reading images

# Read an image; returns multimodal content blocks
blocks = FileReader.read_image("screenshot.png")
# [
#   {"type": "text", "text": "[Image: screenshot.png (image/png, 45321 bytes)]"},
#   {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": "iVBOR..."}}
# ]

# Custom size limit for the base64-encoded data (default 5 MB)
blocks = FileReader.read_image("large_photo.jpg", max_size=1_000_000)

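Supported formats: .png, .jpg, .jpeg, .gif, .webp.

If the base64-encoded image exceeds max_size, Pillow compresses it with adaptive quality. This requires Pillow:

pip install toolregistry-hub[reader_image]

If Pillow is not installed, the original image is returned and a warning is logged.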
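Because `max_size` applies to the base64-encoded length rather than the file size on disk, an image reaches the limit at roughly three quarters of that size. The sketch below works out the encoded length from the raw byte count. The helper names `encoded_size` and `fits` are illustrative and not part of FileReader.

```python
import base64

DEFAULT_MAX_SIZE = 5 * 1024 * 1024  # 5 MB, the documented default for max_size


def encoded_size(raw_bytes: int) -> int:
    """Length of the padded base64 encoding of raw_bytes bytes."""
    return 4 * ((raw_bytes + 2) // 3)


def fits(raw_bytes: int, max_size: int = DEFAULT_MAX_SIZE) -> bool:
    """True when the encoded data stays within max_size (no compression needed)."""
    return encoded_size(raw_bytes) <= max_size


# Cross-check the formula against the real encoder on a small sample.
sample = bytes(1000)
print(len(base64.b64encode(sample)), encoded_size(1000))  # 1336 1336

print(fits(3_932_160))  # True: encodes to exactly 5,242,880 characters
print(fits(3_932_161))  # False: one byte more pushes it over the limit
```

With the default limit, files larger than about 3.75 MB on disk are the ones that trigger compression.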

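Parameters

`read()`

Parameter	Type	Default	Description
`path`	`str`	required	Path to the text file
`offset`	`int`	`1`	Starting line number (1-indexed)
`limit`	`int | None`	`None`	Maximum number of lines to read (`None` means the default of 2000)

`read_notebook()`

Parameter	Type	Default	Description
`path`	`str`	required	Path to the `.ipynb` file

`read_pdf()`

Parameter	Type	Default	Description
`path`	`str`	required	Path to the PDF file
`pages`	`str | None`	`None`	Page range (e.g. `"1-5"`, `"3"`)

`read_image()`

Parameter	Type	Default	Description
`path`	`str`	required	Path to the image file
`max_size`	`int`	`5242880`	Maximum base64-encoded size in bytes (5 MB)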

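Safety limits

Text files: 10 MB maximum (larger files return a notice instead of content)
Text lines: 2000 lines per read by default
PDF pages: at most 20 pages per call
Notebook outputs: at most 10 KB per cell output
Images: at most 5 MB base64-encoded (compressed automatically when exceeded)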

MCP server endpoints

POST /tools/reader/read
POST /tools/reader/read_pdf
POST /tools/reader/read_notebook
POST /tools/reader/read_image
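This document lists the endpoint paths but not the request format. The sketch below only prepares a POST request with the standard library and does not send it. The base URL, the JSON body shape (keys mirroring the `read()` parameters), and the absence of authentication are all assumptions that should be checked against the running server.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # assumption: wherever the server is running


def build_read_request(path: str, offset: int = 1, limit=None) -> Request:
    """Prepare (but do not send) a POST to the text-reading endpoint."""
    # Assumption: the JSON body uses the same names as the read() parameters.
    payload = {"path": path, "offset": offset, "limit": limit}
    return Request(
        url=f"{BASE_URL}/tools/reader/read",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_read_request("/path/to/file.py", offset=51)
print(req.get_method(), req.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would send it once the assumptions above have been confirmed.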

API reference

toolregistry_hub.file_reader.FileReader

Multi-format file reader with line numbers and pagination.

read `staticmethod`

read(path: str, offset: int = 1, limit: int | None = None) -> str

Read a text file with line numbers.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to file.	required
`offset`	`int`	Starting line number (1-indexed). Defaults to 1.	`1`
`limit`	`int | None`	Maximum number of lines to read. Defaults to 2000.	`None`

Returns:

Type	Description
`str`	File content with line numbers in `"N | content"` format. Includes a metadata header with file path, total lines, and the range actually read.

Raises:

Type	Description
`FileNotFoundError`	If the file does not exist.
`IsADirectoryError`	If the path is a directory.
`ValueError`	If offset is less than 1.

Source code in toolregistry_hub/file_reader.py

@staticmethod
def read(
    path: str,
    offset: int = 1,
    limit: int | None = None,
) -> str:
    """Read a text file with line numbers.

    Args:
        path: Path to file.
        offset: Starting line number (1-indexed). Defaults to 1.
        limit: Maximum number of lines to read. Defaults to 2000.

    Returns:
        File content with line numbers in ``"N | content"`` format.
        Includes a metadata header with file path, total lines, and
        the range actually read.

    Raises:
        FileNotFoundError: If the file does not exist.
        IsADirectoryError: If the path is a directory.
        ValueError: If offset is less than 1.
    """
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(f"File not found: {path}")
    if p.is_dir():
        raise IsADirectoryError(f"Path is a directory, not a file: {path}")
    if offset < 1:
        raise ValueError("offset must be >= 1")

    effective_limit = limit if limit is not None else _MAX_LINES_DEFAULT

    # Size guard
    file_size = p.stat().st_size
    if file_size > _MAX_FILE_SIZE_BYTES:
        return (
            f"[File too large: {file_size:,} bytes "
            f"(limit {_MAX_FILE_SIZE_BYTES:,}). "
            f"Use offset/limit to read in segments.]"
        )

    text = p.read_text(encoding="utf-8", errors="replace")
    all_lines = text.splitlines()
    total_lines = len(all_lines)

    start = offset - 1  # convert to 0-indexed
    end = min(start + effective_limit, total_lines)
    selected = all_lines[start:end]

    # Build line-numbered output
    width = len(str(end))
    numbered = [
        f"{i + offset:>{width}} | {line}" for i, line in enumerate(selected)
    ]

    # Metadata header
    range_str = f"{offset}-{start + len(selected)}"
    header = f"[{path}] lines {range_str} of {total_lines}"
    if end < total_lines:
        header += f" (use offset={end + 1} to read more)"

    return header + "\n" + "\n".join(numbered)
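The numbering step in the source above right-aligns every line number to the width of the last number shown, which keeps the `|` separators in one column. A standalone sketch of that formatting follows; `number_lines` is an illustrative name, and the function works on an in-memory list instead of a file.

```python
def number_lines(lines, offset=1):
    """Prefix each line with its right-aligned 1-based number, as read() does."""
    end = offset - 1 + len(lines)
    width = len(str(end))  # width of the largest line number shown
    return [f"{i + offset:>{width}} | {line}" for i, line in enumerate(lines)]


print("\n".join(number_lines(["import os", "import sys", "x = 1"], offset=9)))
#  9 | import os
# 10 | import sys
# 11 | x = 1
```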

read_image `staticmethod`

read_image(path: str, max_size: int = _MAX_IMAGE_SIZE_BYTES) -> list

Read an image file and return as multimodal content blocks.

Returns a list of content blocks (TextBlock + ImageBlock) that the toolregistry pipeline can expand into format-specific multimodal messages via expand_content_blocks().

If the base64-encoded image exceeds max_size, Pillow is used to downsample it. If Pillow is not installed, the original image is returned with a warning.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to image file (.png, .jpg, .jpeg, .gif, .webp).	required
`max_size`	`int`	Maximum base64-encoded size in bytes. Defaults to 5 MB.	`_MAX_IMAGE_SIZE_BYTES`

Returns:

Type	Description
`list`	A list of two content blocks: a text block `{"type": "text", "text": "[Image: name (mime, size)]"}` followed by an image block `{"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": "iVBOR..."}}`.

Raises:

Type	Description
`FileNotFoundError`	If the file does not exist.
`ValueError`	If the file extension is not supported.

Source code in toolregistry_hub/file_reader.py

@staticmethod
def read_image(
    path: str,
    max_size: int = _MAX_IMAGE_SIZE_BYTES,
) -> list:
    """Read an image file and return as multimodal content blocks.

    Returns a list of content blocks (TextBlock + ImageBlock) that the
    toolregistry pipeline can expand into format-specific multimodal
    messages via ``expand_content_blocks()``.

    If the base64-encoded image exceeds ``max_size``, Pillow is used to
    downsample it. If Pillow is not installed, the original image is
    returned with a warning.

    Args:
        path: Path to image file (.png, .jpg, .jpeg, .gif, .webp).
        max_size: Maximum base64-encoded size in bytes. Defaults to 5 MB.

    Returns:
        A list of two content blocks::

            [
                {"type": "text", "text": "[Image: name (mime, size)]"},
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": "iVBOR..."
                }}
            ]

    Raises:
        FileNotFoundError: If the file does not exist.
        ValueError: If the file extension is not supported.
    """
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(f"File not found: {path}")

    ext = p.suffix.lower()
    if ext not in _SUPPORTED_IMAGE_EXTENSIONS:
        raise ValueError(
            f"Unsupported image format: '{ext}'. "
            f"Supported: {', '.join(sorted(_SUPPORTED_IMAGE_EXTENSIONS))}"
        )

    media_type = _EXTENSION_TO_MIME[ext]
    img_data = p.read_bytes()
    raw_size = len(img_data)

    b64_data = base64.b64encode(img_data).decode("ascii")

    if len(b64_data) > max_size:
        img_data, media_type = FileReader._downsample_image(
            img_data, media_type, max_size
        )
        b64_data = base64.b64encode(img_data).decode("ascii")

    return [
        {
            "type": "text",
            "text": f"[Image: {p.name} ({media_type}, {raw_size} bytes)]",
        },
        {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": media_type,
                "data": b64_data,
            },
        },
    ]
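On the consuming side, the image payload can be recovered by decoding the `data` field of the image block. The sketch below does this for hand-built blocks in the documented shape. `image_bytes` is a hypothetical helper, and the payload is just the 8-byte PNG signature, not a real image. Per the source above, the size in the text block is the original file size, even when the image was compressed before encoding.

```python
import base64


def image_bytes(blocks: list) -> tuple:
    """Return (media_type, raw bytes) from the first image block."""
    for block in blocks:
        if block.get("type") == "image":
            source = block["source"]
            return source["media_type"], base64.b64decode(source["data"])
    raise ValueError("no image block found")


# Hand-built blocks in the documented shape (not produced by FileReader).
png_signature = b"\x89PNG\r\n\x1a\n"
blocks = [
    {"type": "text", "text": "[Image: demo.png (image/png, 8 bytes)]"},
    {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": "image/png",
            "data": base64.b64encode(png_signature).decode("ascii"),
        },
    },
]

media_type, data = image_bytes(blocks)
print(media_type, len(data))  # image/png 8
```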

read_notebook `staticmethod`

read_notebook(path: str) -> str

Read a Jupyter notebook and return formatted cell contents.

Uses stdlib json only — no external dependencies.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to `.ipynb` file.	required

Returns:

Type	Description
`str`	All cells with type markers (code/markdown) and outputs.

Raises:

Type	Description
`FileNotFoundError`	If the file does not exist.
`ValueError`	If the file is not a valid notebook.

Source code in toolregistry_hub/file_reader.py

@staticmethod
def read_notebook(path: str) -> str:
    """Read a Jupyter notebook and return formatted cell contents.

    Uses stdlib ``json`` only — no external dependencies.

    Args:
        path: Path to ``.ipynb`` file.

    Returns:
        All cells with type markers (code/markdown) and outputs.

    Raises:
        FileNotFoundError: If the file does not exist.
        ValueError: If the file is not a valid notebook.
    """
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(f"File not found: {path}")

    try:
        data = json.loads(p.read_text(encoding="utf-8"))
    except json.JSONDecodeError as e:
        raise ValueError(f"Invalid notebook JSON: {e}") from e

    if "cells" not in data:
        raise ValueError(f"Not a valid Jupyter notebook (no 'cells' key): {path}")

    # Detect language from kernel info
    lang = "python"
    kernel_info = data.get("metadata", {}).get("kernelspec", {})
    if kernel_info.get("language"):
        lang = kernel_info["language"]

    lines: list[str] = []
    lines.append(f"[Notebook: {path}]")

    for i, cell in enumerate(data["cells"]):
        cell_type = cell.get("cell_type", "unknown")
        source = "".join(cell.get("source", []))

        lines.append(f"\n--- Cell {i + 1} [{cell_type}] ---")

        if cell_type == "code":
            lines.append(f"```{lang}")
            lines.append(source)
            lines.append("```")

            # Process outputs
            for output in cell.get("outputs", []):
                output_text = FileReader._extract_notebook_output(output)
                if output_text:
                    lines.append(f"Output:\n{output_text}")
        else:
            lines.append(source)

    return "\n".join(lines)

read_pdf `staticmethod`

read_pdf(path: str, pages: str | None = None) -> str

Read a PDF file and extract text.

Uses pypdf (zero-dependency, BSD) by default. If pdfplumber is installed, uses it for better text quality.

Parameters:

Name	Type	Description	Default
`path`	`str`	Path to PDF file.	required
`pages`	`str | None`	Page range string (e.g. `"1-5"`, `"3"`, `"10-20"`). Max 20 pages per call. Defaults to all pages (up to cap).	`None`

Returns:

Type	Description
`str`	Extracted text content with page markers.

Raises:

Type	Description
`FileNotFoundError`	If the file does not exist.
`ImportError`	If neither `pypdf` nor `pdfplumber` is installed.
`ValueError`	If page range is invalid.

Source code in toolregistry_hub/file_reader.py

@staticmethod
def read_pdf(
    path: str,
    pages: str | None = None,
) -> str:
    """Read a PDF file and extract text.

    Uses ``pypdf`` (zero-dependency, BSD) by default. If ``pdfplumber``
    is installed, uses it for better text quality.

    Args:
        path: Path to PDF file.
        pages: Page range string (e.g. ``"1-5"``, ``"3"``, ``"10-20"``).
            Max 20 pages per call. Defaults to all pages (up to cap).

    Returns:
        Extracted text content with page markers.

    Raises:
        FileNotFoundError: If the file does not exist.
        ImportError: If neither ``pypdf`` nor ``pdfplumber`` is installed.
        ValueError: If page range is invalid.
    """
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(f"File not found: {path}")

    start_page, end_page = FileReader._parse_page_range(pages)

    # Try pdfplumber first (better quality), fall back to pypdf
    try:
        return FileReader._read_pdf_pdfplumber(p, start_page, end_page)
    except ImportError:
        pass

    try:
        return FileReader._read_pdf_pypdf(p, start_page, end_page)
    except ImportError:
        raise ImportError(
            "PDF reading requires 'pypdf' or 'pdfplumber'. "
            "Install with: pip install toolregistry-hub[reader]"
        ) from None
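The source above tries pdfplumber first and falls back to pypdf, raising `ImportError` only when neither is importable. To find out ahead of time which backend is available in the current environment, a check along these lines may help. `available_pdf_backend` is a hypothetical helper, and it only tests whether each package can be found, not whether it works.

```python
from importlib.util import find_spec


def available_pdf_backend():
    """Name of the backend read_pdf() would be expected to pick, or None."""
    # Same preference order as the source above: pdfplumber, then pypdf.
    for name in ("pdfplumber", "pypdf"):
        if find_spec(name) is not None:
            return name
    return None


backend = available_pdf_backend()
print(backend or "no PDF backend found; try: pip install toolregistry-hub[reader]")
```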
