fix: prevent Chinese examples from being converted to Unicode encoding #1774

Merged: 1 commit, Nov 8, 2024

@coolmian coolmian commented Nov 8, 2024

Passing ensure_ascii=False to json.dumps keeps Chinese characters as-is instead of escaping them to \uXXXX sequences.

before:

[[ ## json_output ## ]]
[{"type": "narration", "content": "\u5c0f\u660e\u8d70\u51fa\u5bb6\u95e8\uff0c\u8ddf\u90bb\u5c45\u6253\u62db\u547c"}, {"type": "dialogue", "name": "\u5c0f\u660e", "reaction": "\u9ad8\u5174", "content": "\u4f60\u597d\u5440"}, {"type": "narration", "content": "\u90bb\u5c45\u5fae\u7b11\u671d\u4ed6\u70b9\u5934"}, {"type": "voiceover", "name": "\u90bb\u5c45", "reaction": "\u5185\u5fc3\u5947\u602a", "content": "\u8fd9\u5c0f\u5b50\u4eca\u5929\u600e\u4e48\u5bf9\u6211\u8fd9\u4e48\u6709\u793c\u8c8c"}]

[[ ## completed ## ]]

after:

[[ ## json_output ## ]]
[{"type": "narration", "content": "小明走出家门，跟邻居打招呼"}, {"type": "dialogue", "name": "小明", "reaction": "高兴", "content": "你好呀"}, {"type": "narration", "content": "邻居微笑朝他点头"}, {"type": "voiceover", "name": "邻居", "reaction": "内心奇怪", "content": "这小子今天怎么对我这么有礼貌"}]

[[ ## completed ## ]]
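
The difference between the two outputs above comes down to a single `json.dumps` flag. A minimal sketch using only the standard library; the `demo` record is a shortened, illustrative version of the example and not the exact structure DSPy serializes internally:

```python
import json

# Illustrative record (an assumption for this sketch), shortened from the example above.
demo = [{"type": "dialogue", "name": "小明", "content": "你好呀"}]

# Default behaviour (ensure_ascii=True): every non-ASCII character
# is written as a \uXXXX escape sequence.
escaped = json.dumps(demo)

# With ensure_ascii=False the characters are emitted unchanged.
readable = json.dumps(demo, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings are valid JSON and parse back to the same object; only the text shown to the model differs.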

coolmian commented Nov 8, 2024

My case:

from typing import Literal

import dspy
from pydantic import BaseModel, Field

# Assumes a language model has already been configured, e.g. via dspy.configure(lm=...).

class Narrative(BaseModel):
    type: Literal["dialogue", "narration", "voiceover"] = Field()
    # content is present in every narrative, so it is required rather than defaulting to None.
    content: str = Field()
    name: str | None = Field(default=None)
    reaction: str | None = Field(default=None)

class StoryToJSON(dspy.Signature):
    """
    Convert story text into structured JSON format with specific fields for narration, dialogue, and voiceover.
    Make the performance more like a script or animation script style, help the performer better understand the character's emotions and reactions, and make the content more expressive and situational.
    NOTE: Convert each paragraph based on the story_text without skipping or omitting any content.
    """

    story_text = dspy.InputField()
    json_output: list[Narrative] = dspy.OutputField(desc="list of narratives")

# Define the predictor.
predictor = dspy.Predict(StoryToJSON)

# A single few-shot demo whose values contain Chinese text.
example = dspy.Example(
    story_text = "小明走出家门,跟邻居打招呼:“你好呀”。邻居微笑朝他点头,内心奇怪这小子今天怎么对他这么有礼貌?",
    json_output = [
        {"type": "narration", "content": "小明走出家门，跟邻居打招呼"},
        {"type": "dialogue", "name": "小明", "reaction": "高兴", "content": "你好呀"},
        {"type": "narration", "content": "邻居微笑朝他点头"},
        {"type": "voiceover", "name":"邻居", "reaction": "内心奇怪", "content": "这小子今天怎么对我这么有礼貌"}
    ]
)

predictor.demos = [example]

# Read the input story; the explicit encoding avoids platform-dependent decoding of Chinese text.
with open("dataset/1.txt", "r", encoding="utf-8") as f:
    story_text = f.read()

# Call the predictor on a particular input.
pred = predictor(story_text=story_text)
print(f"Question: {story_text}")
for item in pred.json_output:
    print(item.model_dump())

If examples containing Chinese strings are serialized as \uXXXX Unicode escape sequences, the LLM tends to reply with escaped strings as well, which lowers reply quality and requires extra decoding work.
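
As a rough illustration of that overhead, the sketch below compares the two serializations of one string from the example and shows the decoding step needed when a reply comes back escaped. It uses only the standard library; the `reply` variable is a hypothetical stand-in for model output, not something DSPy produces under that name:

```python
import json

text = "这小子今天怎么对我这么有礼貌"

escaped = json.dumps(text)                       # ASCII-only, \uXXXX escapes
readable = json.dumps(text, ensure_ascii=False)  # characters kept as-is

# Each of these characters becomes a six-character escape sequence,
# so the prompt text grows considerably.
print(len(readable), len(escaped))

# Hypothetical escaped reply from a model: it must be JSON-decoded
# before the original text is usable.
reply = escaped
decoded = json.loads(reply)
print(decoded == text)
```

Character counts are only a proxy here; the actual token cost depends on the tokenizer of the model being used.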

@okhat okhat merged commit 4822d47 into stanfordnlp:main Nov 8, 2024
4 checks passed
okhat commented Nov 8, 2024

Thanks a lot @coolmian !

@coolmian coolmian deleted the patch-1 branch November 9, 2024 04:09