fix: prevent Chinese examples from being converted to Unicode encoding #1787

coolmian · 2024-11-11T04:03:21Z

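When a data structure is defined with Pydantic, its metadata is converted to a JSON schema and embedded in the prompt. This change keeps non-ASCII text (for example, Chinese field descriptions) readable in that schema instead of emitting it as `\uXXXX` escape sequences.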
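For readers unfamiliar with the escaping behavior, here is a minimal sketch using only Python's standard `json` module. It assumes the schema is serialized with `json.dumps`; the `schema` dict is a hypothetical fragment modeled on the field descriptions in this PR, not the actual dspy code path.

```python
import json

# Hypothetical schema fragment (assumption: modeled on the Narrative.content field).
schema = {"type": "string", "description": "具体内容"}

# Default behavior: ensure_ascii=True replaces every non-ASCII character
# with a \uXXXX escape sequence.
escaped = json.dumps(schema)

# ensure_ascii=False writes non-ASCII characters through unchanged.
readable = json.dumps(schema, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings are valid JSON and decode to the same object; only the text that ends up in the prompt differs.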

My case:

from typing import Literal

import dspy
from pydantic import BaseModel, Field


class Narrative(BaseModel):
    type: Literal["dialogue", "narration", "voiceover"] = Field(description="表示内容类型(narration:叙述, dialogue:对话, voiceover:内心独白)")
    content: str = Field(description="具体内容(注意第一人称转换)")
    reaction: str | None = Field(default=None, description="情绪/反应(仅dialogue和voiceover需要)")
    name: str | None = Field(default=None, description="说话人名字(仅dialogue和voiceover需要)")


class StoryToJSON(dspy.Signature):
    """
    Convert story text into structured JSON format with specific fields for narration, dialogue, and voiceover.
    Make the performance more like a script or animation script style, help the performer better understand the character's emotions and reactions, and make the content more expressive and situational.
    NOTE: Convert each paragraph based on the story_text without skipping or omitting any content.
    """
    story_text = dspy.InputField(desc="小说内容")
    json_output: list[Narrative] = dspy.OutputField(desc="输出json list")

# Define the predictor.
predictor = dspy.Predict(StoryToJSON)

System prompt generated by dspy

before (Please scroll to the right and focus on the JSON schema section):

Your input fields are:
1. `story_text` (str): 小说内容

Your output fields are:
1. `json_output` (list[Narrative]): 输出json list

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## story_text ## ]]
{story_text}

[[ ## json_output ## ]]
{json_output}        # note: the value you produce must be pareseable according to the following JSON schema: {"type": "array", "$defs": {"Narrative": {"type": "object", "properties": {"type": {"type": "string", "description": "\u8868\u793a\u5185\u5bb9\u7c7b\u578b(narration:\u53d9\u8ff0, dialogue:\u5bf9\u8bdd, voiceover:\u5185\u5fc3\u72ec\u767d)", "enum": ["dialogue", "narration", "voiceover"], "title": "Type"}, "content": {"type": "string", "description": "\u5177\u4f53\u5185\u5bb9(\u6ce8\u610f\u7b2c\u4e00\u4eba\u79f0\u8f6c\u6362)", "title": "Content"}, "name": {"anyOf": [{"type": "string"}, {"type": "null"}], "default": null, "description": "\u8bf4\u8bdd\u4eba\u540d\u5b57(\u4ec5dialogue\u548cvoiceover\u9700\u8981)", "title": "Name"}, "reaction": {"anyOf": [{"type": "string"}, {"type": "null"}], "default": null, "description": "\u60c5\u7eea/\u53cd\u5e94(\u4ec5dialogue\u548cvoiceover\u9700\u8981)", "title": "Reaction"}}, "required": ["type", "content"], "title": "Narrative"}}, "items": {"$ref": "#/$defs/Narrative"}}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Convert story text into structured JSON format with specific fields for narration, dialogue, and voiceover.
        Make the performance more like a script or animation script style, help the performer better understand the character's emotions and reactions, and make the content more expressive and situational.
        NOTE: Convert each paragraph based on the story_text without skipping or omitting any content.

after:

Your input fields are:
1. `story_text` (str): 小说内容

Your output fields are:
1. `json_output` (list[Narrative]): 输出json list

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## story_text ## ]]
{story_text}

[[ ## json_output ## ]]
{json_output}        # note: the value you produce must be pareseable according to the following JSON schema: {"type": "array", "$defs": {"Narrative": {"type": "object", "properties": {"type": {"type": "string", "description": "表示内容类型(narration:叙述, dialogue:对话, voiceover:内心独白)", "enum": ["dialogue", "narration", "voiceover"], "title": "Type"}, "content": {"type": "string", "description": "具体内容(注意第一人称转换)", "title": "Content"}, "name": {"anyOf": [{"type": "string"}, {"type": "null"}], "default": null, "description": "说话人名字(仅dialogue和voiceover需要)", "title": "Name"}, "reaction": {"anyOf": [{"type": "string"}, {"type": "null"}], "default": null, "description": "情绪/反应(仅dialogue和voiceover需要)", "title": "Reaction"}}, "required": ["type", "content"], "title": "Narrative"}}, "items": {"$ref": "#/$defs/Narrative"}}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Convert story text into structured JSON format with specific fields for narration, dialogue, and voiceover.
        Make the performance more like a script or animation script style, help the performer better understand the character's emotions and reactions, and make the content more expressive and situational.
        NOTE: Convert each paragraph based on the story_text without skipping or omitting any content.
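The two schema strings above describe the same data; only their surface form differs. As a small check, assuming nothing beyond the standard `json` module, the sketch below decodes one description in both forms (copied from the `name` field in the before and after prompts) and compares the lengths of the raw text.

```python
import json

# The `name` description as it appears in the "before" prompt (escaped form).
before = r'"\u8bf4\u8bdd\u4eba\u540d\u5b57(\u4ec5dialogue\u548cvoiceover\u9700\u8981)"'

# The same description as it appears in the "after" prompt (readable form).
after = '"说话人名字(仅dialogue和voiceover需要)"'

# Both decode to the identical Python string.
same_value = json.loads(before) == json.loads(after)

# The escaped form spends six ASCII characters on each Chinese character.
print(same_value, len(before), len(after))
```

Because the decoded values match, the change does not alter the schema's meaning. It only makes the prompt text shorter and directly readable.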

fix: prevent Chinese examples from being converted to Unicode encoding

4ff06e4

okhat merged commit 793530c into stanfordnlp:main Nov 11, 2024
4 checks passed

isaacbmiller pushed a commit that referenced this pull request Dec 11, 2024

fix: prevent Chinese examples from being converted to Unicode encoding (#1787)

e8af6da
