Skip to content

Commit

Permalink
Merge pull request #27 from stanford-oval/refactored-codebase-sync
Browse files Browse the repository at this point in the history
Refactored codebase sync
  • Loading branch information
Yucheng-Jiang authored Apr 23, 2024
2 parents 72d85c9 + 85b6479 commit 5d21c39
Show file tree
Hide file tree
Showing 28 changed files with 3,571 additions and 9,505 deletions.
146 changes: 120 additions & 26 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,18 +1,28 @@
# STORM: Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking

This repository contains the code for our NAACL 2024 paper [Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models](https://arxiv.org/abs/2402.14207) by [Yijia Shao](https://cs.stanford.edu/~shaoyj), [Yucheng Jiang](https://yucheng-jiang.github.io/), Theodore A. Kanell, Peter Xu, [Omar Khattab](https://omarkhattab.com/), and [Monica S. Lam](https://suif.stanford.edu/~lam/).
<p align="center">
| <a href="http://storm.genie.stanford.edu"><b>Research preview</b></a> | <a href="https://arxiv.org/abs/2402.14207"><b>Paper</b></a> | <b>Documentation (WIP)</b> |


**Latest News** 🔥

- [2024/04] We release refactored version of STORM codebase! We define [interface](src/interface.py) for STORM pipeline and reimplement STORM-wiki (check out [`src/storm_wiki`](src/storm_wiki)) to demonstrate how to instantiate the pipeline. We provide API to support customization of different language models and retrieval/search integration.

## Overview [(Try STORM now!)](https://storm.genie.stanford.edu/)

<p align="center">
<img src="assets/overview.png" style="width: 90%; height: auto;">
</p>
STORM is a LLM system that writes Wikipedia-like articles from scratch based on Internet search.

While the system cannot produce publication-ready articles that often require a significant number of edits, experienced Wikipedia editors have found it helpful in their pre-writing stage.

**Try out our [live demo](https://storm.genie.stanford.edu/) to see how STORM can help your knowledge exploration journey and please provide feedback to help us improve the system 🙏!**
**Try out our [live research preview](https://storm.genie.stanford.edu/) to see how STORM can help your knowledge exploration journey and please provide feedback to help us improve the system 🙏!**



## How STORM works

## Research Before Writing
STORM breaks down generating long articles with citations into two steps:
1. **Pre-writing stage**: The system conducts Internet-based research to collect references and generates an outline.
2. **Writing stage**: The system uses the outline and references to generate the full-length article with citations.
Expand All @@ -24,41 +34,128 @@ STORM identifies the core of automating the research process as automatically co
1. **Perspective-Guided Question Asking**: Given the input topic, STORM discovers different perspectives by surveying existing articles from similar topics and uses them to control the question-asking process.
2. **Simulated Conversation**: STORM simulates a conversation between a Wikipedia writer and a topic expert grounded in Internet sources to enable the language model to update its understanding of the topic and ask follow-up questions.

Based on the separation of the two stages, STORM is implemented in a highly modular way (see [engine.py](src/engine.py)) using [dspy](https://github.com/stanfordnlp/dspy).
Based on the separation of the two stages, STORM is implemented in a highly modular way using [dspy](https://github.com/stanfordnlp/dspy).



## Setup
## Getting started

**We view STORM as an example of automated knowledge curation. We are working on enhancing our codebase to increase its extensibility. Stay tuned!**
### 1. Setup

Below, we provide a quick start guide to run STORM locally to reproduce our experiments.
Below, we provide a quick start guide to run STORM locally.

1. Install the required packages.
```shell
conda create -n storm python=3.11
conda activate storm
pip install -r requirements.txt
```
2. Set up OpenAI API key and [You.com search API](https://api.you.com/) key. Create a file `secrets.toml` under the root directory and add the following content:
2. Set up OpenAI API key (if you want to use OpenAI models to power STORM) and [You.com search API](https://api.you.com/) key. Create a file `secrets.toml` under the root directory and add the following content:
```shell
# Set up OpenAI API key.
OPENAI_API_KEY=<your_openai_api_key>
OPENAI_API_KEY="your_openai_api_key"
# If you are using the API service provided by OpenAI, include the following line:
OPENAI_API_TYPE="openai"
# If you are using the API service provided by Microsoft Azure, include the following lines:
OPENAI_API_TYPE="azure"
AZURE_API_BASE=<your_azure_api_base_url>
AZURE_API_VERSION=<your_azure_api_version>
AZURE_API_BASE="your_azure_api_base_url"
AZURE_API_VERSION="your_azure_api_version"
# Set up You.com search API key.
YDC_API_KEY=<your_youcom_api_key>
YDC_API_KEY="your_youcom_api_key"
```

## Paper Experiments
The FreshWiki dataset used in our experiments can be found in [./FreshWiki](FreshWiki).

### 2. Running STORM-wiki locally

Currently, we provide example scripts under [`examples`](examples) to demonstrate how you can run STORM using different models.

**To run STORM with `gpt` family models**: Make sure you have set up the OpenAI API key and run the following command.

```
python scripts/run_storm_wiki_gpt.py \
--output_dir $OUTPUT_DIR \
--do-research \
--do-generate-outline \
--do-generate-article \
--do-polish-article
```
- `--do-research`: if True, simulate conversation to research the topic; otherwise, load the results.
- `--do-generate-outline`: If True, generate an outline for the topic; otherwise, load the results.
- `--do-generate-article`: If True, generate an article for the topic; otherwise, load the results.
- `--do-polish-article`: If True, polish the article by adding a summarization section and (optionally) removing duplicate content.
**To run STORM with `mistral` family models on local VLLM server**: have a VLLM server running with the `Mistral-7B-Instruct-v0.2` model and run the following command.
```
python scripts/run_storm_wiki_mistral.py \
--url $URL \
--port $PORT \
--output_dir $OUTPUT_DIR \
--do-research \
--do-generate-outline \
--do-generate-article \
--do-polish-article
```
- `--url` URL of the VLLM server.
- `--port` Port of the VLLM server.
## Customize STORM
### Customization of the Pipeline
STORM is a knowledge curation engine consisting of 4 modules:
1. Knowledge Curation Module: Collects a broad coverage of information about the given topic.
2. Outline Generation Module: Organizes the collected information by generating a hierarchical outline for the curated knowledge.
3. Article Generation Module: Populates the generated outline with the collected information.
4. Article Polishing Module: Refines and enhances the written article for better presentation.
The interface for each module is defined in `src/interface.py`, while their implementations are instantiated in `src/storm_wiki/modules/*`. These modules can be customized according to your specific requirements (e.g., generating sections in bullet point format instead of full paragraphs).
:star2: **You can share your customization of `Engine` by making PRs to this repo!**
### Customization of Retriever Module
As a knowledge curation engine, STORM grabs information from the Retriever module. The interface for the Retriever module is defined in [`src/interface.py`]('src/interface.py'). Please consult the interface documentation if you plan to create a new instance or replace the default search engine API. By default, STORM utilizes the You.com search engine API (see `YouRM` in [`src/rm.py`](src/rm.py)).
:star2: **PRs for integrating more search engines/retrievers are highly appreciated!**
### Customization of Language Models
STORM provides the following language model implementations in [`src/lm.py`](src/lm.py):
- `OpenAIModel`
- `ClaudeModel`
- `VLLMClient`
- `TGIClient`
- `TogetherClient`
:star2: **PRs for integrating more language model clients are highly appreciated!**
:bulb: **For a good practice,**
- choose a cheaper/faster model for `conv_simulator_lm` which is used to split queries, synthesize answers in the conversation.
- if you need to conduct the actual writing step, choose a more powerful model for `article_gen_lm`. Based on our experiments, weak models are bad at generating text with citations.
- for open models, adding one-shot example can help it better follow instructions.
Please refer to the scripts in the [`examples`](examples) directory for concrete guidance on customizing the language model used in the pipeline.
## Replicate NAACL2024 result
Please switch to the branch `NAACL-2024-code-backup`
<details>
<summary>Show me instructions</summary>
### Paper Experiments
The FreshWiki dataset used in our experiments can be found in [./FreshWiki](FreshWiki).
Run the following commands under [./src](src).
### Pre-writing Stage
#### Pre-writing Stage
For batch experiment on FreshWiki dataset:
```shell
python -m scripts.run_prewriting --input-source file --input-path ../FreshWiki/topic_list.csv --engine gpt-4 --do-research --max-conv-turn 5 --max-perspective 5
Expand All @@ -79,7 +176,7 @@ python -m scripts.run_prewriting --input-source console --engine gpt-4 --max-con
The generated outline will be saved in `{output_dir}/{topic}/storm_gen_outline.txt` and the collected references will be saved in `{output_dir}/{topic}/raw_search_results.json`.


### Writing Stage
#### Writing Stage
For batch experiment on FreshWiki dataset:
```shell
python -m scripts.run_writing --input-source file --input-path ../FreshWiki/topic_list.csv --engine gpt-4 --do-polish-article --remove-duplicate
Expand All @@ -94,37 +191,35 @@ python -m scripts.run_writing --input-source console --engine gpt-4 --do-polish-

The generated article will be saved in `{output_dir}/{topic}/storm_gen_article.txt` and the references corresponding to citation index will be saved in `{output_dir}/{topic}/url_to_info.json`. If `--do-polish-article` is set, the polished article will be saved in `{output_dir}/{topic}/storm_gen_article_polished.txt`.

## Customize the STORM Configurations
### Customize the STORM Configurations
We set up the default LLM configuration in `LLMConfigs` in [src/modules/utils.py](src/modules/utils.py). You can use `set_conv_simulator_lm()`,`set_question_asker_lm()`, `set_outline_gen_lm()`, `set_article_gen_lm()`, `set_article_polish_lm()` to override the default configuration. These functions take in an instance from `dspy.dsp.LM` or `dspy.dsp.HFModel`.

:bulb: **For a good practice,**
- choose a cheaper/faster model for `conv_simulator_lm` which is used to split queries, synthesize answers in the conversation.
- if you need to conduct the actual writing step, choose a more powerful model for `article_gen_lm`. Based on our experiments, weak models are bad at generating text with citations.

## Automatic Evaluation
### Automatic Evaluation

In our paper, we break down the evaluation into two parts: outline quality and full-length article quality.

### Outline Quality
#### Outline Quality
We introduce *heading soft recall* and *heading entity recall* to evaluate the outline quality. This makes it easier to prototype methods for pre-writing.

Run the following command under [./eval](eval) to compute the metrics on FreshWiki dataset:
```shell
python eval_outline_quality.py --input-path ../FreshWiki/topic_list.csv --gt-dir ../FreshWiki --pred-dir ../results --pred-file-name storm_gen_outline.txt --result-output-path ../results/storm_outline_quality.csv
```

### Full-length Article Quality
#### Full-length Article Quality
[eval/eval_article_quality.py](eval/eval_article_quality.py) provides the entry point of evaluating full-length article quality using ROUGE, entity recall, and rubric grading. Run the following command under `eval` to compute the metrics:
```shell
python eval_article_quality.py --input-path ../FreshWiki/topic_list.csv --gt-dir ../FreshWiki --pred-dir ../results --gt-dir ../FreshWiki --output-dir ../results/storm_article_eval_results --pred-file-name storm_gen_article_polished.txt
```

### Use the Metric Yourself
#### Use the Metric Yourself
The similarity-based metrics (i.e., ROUGE, entity recall, and heading entity recall) are implemented in [eval/metrics.py](eval/metrics.py).

For rubric grading, we use the [prometheus-13b-v1.0](https://huggingface.co/kaist-ai/prometheus-13b-v1.0) introduced in [this paper](https://arxiv.org/abs/2310.08491). [eval/evaluation_prometheus.py](eval/evaluation_prometheus.py) provides the entry point of using the metric.

</details>

## Contributions
If you have any questions or suggestions, please feel free to open an issue or pull request. We welcome contributions to improve the system and the codebase!

Expand All @@ -140,4 +235,3 @@ Please cite our paper if you use this code or part of it in your work:
booktitle={Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)}
}
```
113 changes: 113 additions & 0 deletions examples/run_storm_wiki_gpt.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,113 @@
"""
STORM Wiki pipeline powered by GPT-3.5/4 and You.com search engine.
You need to set up the following environment variables to run this script:
- OPENAI_API_KEY: OpenAI API key
- OPENAI_API_TYPE: OpenAI API type (e.g., 'openai' or 'azure')
- AZURE_API_BASE: Azure API base URL if using Azure API
- AZURE_API_VERSION: Azure API version if using Azure API
- YDC_API_KEY: You.com API key
Output will be structured as below
args.output_dir/
topic_name/ # topic_name will follow convention of underscore-connected topic name w/o space and slash
conversation_log.json # Log of information-seeking conversation
raw_search_results.json # Raw search results from search engine
direct_gen_outline.txt # Outline directly generated with LLM's parametric knowledge
storm_gen_outline.txt # Outline refined with collected information
url_to_info.json # Sources that are used in the final article
storm_gen_article.txt # Final article generated
storm_gen_article_polished.txt # Polished final article (if args.do_polish_article is True)
"""

import os
import sys
from argparse import ArgumentParser

sys.path.append('./src')
from lm import OpenAIModel
from storm_wiki.engine import STORMWikiRunnerArguments, STORMWikiRunner, STORMWikiLMConfigs
from utils import load_api_key


def main(args):
load_api_key(toml_file_path='secrets.toml')
lm_configs = STORMWikiLMConfigs()
openai_kwargs = {
'api_key': os.getenv("OPENAI_API_KEY"),
'api_provider': os.getenv('OPENAI_API_TYPE'),
'temperature': 1.0,
'top_p': 0.9,
'api_base': os.getenv('AZURE_API_BASE'),
'api_version': os.getenv('AZURE_API_VERSION'),
}

# STORM is a LM system so different components can be powered by different models.
# For a good balance between cost and quality, you can choose a cheaper/faster model for conv_simulator_lm
# which is used to split queries, synthesize answers in the conversation. We recommend using stronger models
# for outline_gen_lm which is responsible for organizing the collected information, and article_gen_lm
# which is responsible for generating sections with citations.
conv_simulator_lm = OpenAIModel(model='gpt-3.5-turbo', max_tokens=500, **openai_kwargs)
question_asker_lm = OpenAIModel(model='gpt-3.5-turbo', max_tokens=500, **openai_kwargs)
outline_gen_lm = OpenAIModel(model='gpt-4-0125-preview', max_tokens=400, **openai_kwargs)
article_gen_lm = OpenAIModel(model='gpt-4-0125-preview', max_tokens=700, **openai_kwargs)
article_polish_lm = OpenAIModel(model='gpt-4-0125-preview', max_tokens=4000, **openai_kwargs)

lm_configs.set_conv_simulator_lm(conv_simulator_lm)
lm_configs.set_question_asker_lm(question_asker_lm)
lm_configs.set_outline_gen_lm(outline_gen_lm)
lm_configs.set_article_gen_lm(article_gen_lm)
lm_configs.set_article_polish_lm(article_polish_lm)

engine_args = STORMWikiRunnerArguments(
output_dir=args.output_dir,
max_conv_turn=args.max_conv_turn,
max_perspective=args.max_perspective,
search_top_k=args.search_top_k,
)
runner = STORMWikiRunner(engine_args, lm_configs)

topic = input('Topic: ')
runner.run(
topic=topic,
do_research=args.do_research,
do_generate_outline=args.do_generate_outline,
do_generate_article=args.do_generate_article,
do_polish_article=args.do_polish_article,
)
runner.post_run()
runner.summary()


if __name__ == '__main__':
parser = ArgumentParser()
# global arguments
parser.add_argument('--output-dir', type=str, default='./results/gpt',
help='Directory to store the outputs.')
parser.add_argument('--max_thread_num', type=int, default=3,
help='Maximum number of threads to use. The information seeking part and the article generation'
'part can speed up by using multiple threads. Consider reducing it if keep getting '
'"Exceed rate limit" error when calling LM API.')
# stage of the pipeline
parser.add_argument('--do-research', action='store_true',
help='If True, simulate conversation to research the topic; otherwise, load the results.')
parser.add_argument('--do-generate-outline', action='store_true',
help='If True, generate an outline for the topic; otherwise, load the results.')
parser.add_argument('--do-generate-article', action='store_true',
help='If True, generate an article for the topic; otherwise, load the results.')
parser.add_argument('--do-polish-article', action='store_true',
help='If True, polish the article by adding a summarization section and (optionally) removing '
'duplicate content.')
# hyperparameters for the pre-writing stage
parser.add_argument('--max-conv-turn', type=int, default=3,
help='Maximum number of questions in conversational question asking.')
parser.add_argument('--max-perspective', type=int, default=3,
help='Maximum number of perspectives to consider in perspective-guided question asking.')
parser.add_argument('--search-top-k', type=int, default=3,
help='Top k search results to consider for each search query.')
# hyperparameters for the writing stage
parser.add_argument('--retrieve-top-k', type=int, default=3,
help='Top k collected references for each section title.')
parser.add_argument('--remove-duplicate', action='store_true',
help='If True, remove duplicate content from the article.')

main(parser.parse_args())
Loading

0 comments on commit 5d21c39

Please sign in to comment.