fix: prevent Chinese examples from being converted to Unicode encoding #1787

Merged: 1 commit, Nov 11, 2024

Conversation

coolmian
Contributor

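When data structures are defined with Pydantic, their field metadata is converted to a JSON schema that is embedded in the prompt. This PR prevents non-ASCII text in that metadata (for example, Chinese field descriptions) from showing up in the prompt as `\uXXXX` escape sequences.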
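
The escaping described above can be reproduced with Python's standard `json` module alone. The sketch below is only an illustration, not the code changed in this PR: the `schema` dict is a hypothetical, trimmed-down fragment modeled on the example that follows, and it assumes the schema is serialized with `json.dumps`, whose default `ensure_ascii=True` escapes every non-ASCII character.

```python
import json

# Hypothetical, trimmed-down schema fragment modeled on the example in this PR.
schema = {"type": "string", "description": "具体内容"}

# Default behavior (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(schema)

# With ensure_ascii=False the characters are written as-is.
readable = json.dumps(schema, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings are valid JSON and decode to the same object;
# only the textual representation differs.
assert json.loads(escaped) == json.loads(readable) == schema
```

Both forms are equivalent to a JSON parser, but the readable form keeps the descriptions legible to a language model and to anyone inspecting the prompt.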

My case:

from typing import Literal

import dspy
from pydantic import BaseModel, Field


class Narrative(BaseModel):
    type: Literal["dialogue", "narration", "voiceover"] = Field(description="表示内容类型(narration:叙述, dialogue:对话, voiceover:内心独白)")
    content: str = Field(description="具体内容(注意第一人称转换)")
    reaction: str | None = Field(default=None, description="情绪/反应(仅dialogue和voiceover需要)")
    name: str | None = Field(default=None, description="说话人名字(仅dialogue和voiceover需要)")


class StoryToJSON(dspy.Signature):
    """
    Convert story text into structured JSON format with specific fields for narration, dialogue, and voiceover.
    Make the performance more like a script or animation script style, help the performer better understand the character's emotions and reactions, and make the content more expressive and situational.
    NOTE: Convert each paragraph based on the story_text without skipping or omitting any content.
    """
    story_text = dspy.InputField(desc="小说内容")
    json_output: list[Narrative] = dspy.OutputField(desc="输出json list")

# Define the predictor.
predictor = dspy.Predict(StoryToJSON)

System prompt generated by DSPy:

before (Please scroll to the right and focus on the JSON schema section):

Your input fields are:
1. `story_text` (str): 小说内容

Your output fields are:
1. `json_output` (list[Narrative]): 输出json list

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## story_text ## ]]
{story_text}

[[ ## json_output ## ]]
{json_output}        # note: the value you produce must be pareseable according to the following JSON schema: {"type": "array", "$defs": {"Narrative": {"type": "object", "properties": {"type": {"type": "string", "description": "\u8868\u793a\u5185\u5bb9\u7c7b\u578b(narration:\u53d9\u8ff0, dialogue:\u5bf9\u8bdd, voiceover:\u5185\u5fc3\u72ec\u767d)", "enum": ["dialogue", "narration", "voiceover"], "title": "Type"}, "content": {"type": "string", "description": "\u5177\u4f53\u5185\u5bb9(\u6ce8\u610f\u7b2c\u4e00\u4eba\u79f0\u8f6c\u6362)", "title": "Content"}, "name": {"anyOf": [{"type": "string"}, {"type": "null"}], "default": null, "description": "\u8bf4\u8bdd\u4eba\u540d\u5b57(\u4ec5dialogue\u548cvoiceover\u9700\u8981)", "title": "Name"}, "reaction": {"anyOf": [{"type": "string"}, {"type": "null"}], "default": null, "description": "\u60c5\u7eea/\u53cd\u5e94(\u4ec5dialogue\u548cvoiceover\u9700\u8981)", "title": "Reaction"}}, "required": ["type", "content"], "title": "Narrative"}}, "items": {"$ref": "#/$defs/Narrative"}}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Convert story text into structured JSON format with specific fields for narration, dialogue, and voiceover.
        Make the performance more like a script or animation script style, help the performer better understand the character's emotions and reactions, and make the content more expressive and situational.
        NOTE: Convert each paragraph based on the story_text without skipping or omitting any content.

after:

Your input fields are:
1. `story_text` (str): 小说内容

Your output fields are:
1. `json_output` (list[Narrative]): 输出json list

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## story_text ## ]]
{story_text}

[[ ## json_output ## ]]
{json_output}        # note: the value you produce must be pareseable according to the following JSON schema: {"type": "array", "$defs": {"Narrative": {"type": "object", "properties": {"type": {"type": "string", "description": "表示内容类型(narration:叙述, dialogue:对话, voiceover:内心独白)", "enum": ["dialogue", "narration", "voiceover"], "title": "Type"}, "content": {"type": "string", "description": "具体内容(注意第一人称转换)", "title": "Content"}, "name": {"anyOf": [{"type": "string"}, {"type": "null"}], "default": null, "description": "说话人名字(仅dialogue和voiceover需要)", "title": "Name"}, "reaction": {"anyOf": [{"type": "string"}, {"type": "null"}], "default": null, "description": "情绪/反应(仅dialogue和voiceover需要)", "title": "Reaction"}}, "required": ["type", "content"], "title": "Narrative"}}, "items": {"$ref": "#/$defs/Narrative"}}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Convert story text into structured JSON format with specific fields for narration, dialogue, and voiceover.
        Make the performance more like a script or animation script style, help the performer better understand the character's emotions and reactions, and make the content more expressive and situational.
        NOTE: Convert each paragraph based on the story_text without skipping or omitting any content.
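
One way to guard against a regression of the difference between the two prompts above is to check rendered prompt text for leftover `\uXXXX` sequences. This is a minimal sketch under stated assumptions: `has_unicode_escapes` is a hypothetical helper, not part of DSPy, and the two sample strings are short excerpts in the style of the schemas shown above.

```python
import re

# Matches a literal backslash-u escape (for example \u5177) left in rendered text.
UNICODE_ESCAPE = re.compile(r"\\u[0-9a-fA-F]{4}")


def has_unicode_escapes(text: str) -> bool:
    """Return True if the text contains a literal \\uXXXX escape sequence."""
    return UNICODE_ESCAPE.search(text) is not None


# Raw string so the backslashes stay literal, as they appear in the "before" prompt.
before_excerpt = r'"description": "\u5177\u4f53\u5185\u5bb9"'
after_excerpt = '"description": "具体内容"'

print(has_unicode_escapes(before_excerpt))
print(has_unicode_escapes(after_excerpt))
```

A check like this would also flag escapes that a user deliberately put into a field description, so it is best suited to tests whose inputs are known.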

okhat merged commit 793530c into stanfordnlp:main on Nov 11, 2024
4 checks passed