Skip to content
This repository has been archived by the owner on Dec 18, 2024. It is now read-only.

HTTPArchive/data-pipeline

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Deprecated. The next version of the data pipeline: https://github.com/HTTPArchive/dataform

data-pipeline

The new HTTP Archive data pipeline built entirely on GCP

GitHub branch checks state GitHub branch checks state GitHub branch checks state Coverage badge

Table of contents generated with markdown-toc

Introduction

This repo handles the HTTP Archive data pipeline, which takes the results of the monthly HTTP Archive run and saves this to the httparchive dataset in BigQuery.

A secondary pipeline is responsible for populating the Technology Report Firestore collections.

There are currently two main pipelines:

  • The all pipeline which saves data to the new httparchive.all dataset
  • The combined pipline which saves data to the legacy tables. This processes both the summary tables (summary_pages and summary_requests) and non-summary pipeline (pages, requests, response_bodies....etc.)

The secondary tech_report pipeline saves data to a Firestore database (e.g. tech-report-apis-prod) across various collections (see TECHNOLOGY_QUERIES in constants.py)

The pipelines are run in Google Cloud Platform (GCP) and are kicked off automatically on crawl completion, based on the code in the main branch which is deployed to GCP on each merge.

The data-pipeline workflow as defined by the data-pipeline-workflows.yaml file, runs the whole process from start to finish, including generating the manifest file for each of the two runs (desktop and mobile) and then starting the four dataflow jobs (desktop all, mobile all, desktop combined, mobile combined) in sequence to upload of the HAR files to the BigQuery tables. This can be rerun in case of failure by publishing a crawl-complete message, providing no data was saved to the final BigQuery tables.

The four dataflow jobs can also be rerun individually in case of failure, but the BigQuery tables need to be cleared down first (including any lingering temp tables)

The dataflow jobs can also be run locally, whereby the local code is uploaded to GCP for that particular run.

Diagrams

GCP Workflows pipeline execution

sequenceDiagram
    participant PubSub
    participant Workflows
    participant Monitoring
    participant Cloud Storage
    participant Cloud Build
    participant BigQuery
    participant Dataflow

    PubSub->>Workflows: crawl-complete event
    loop until crawl queue is empty
        Workflows->>Monitoring: check crawl queue
    end
    rect rgb(191, 223, 255)
        Note right of Workflows: generate HAR manifest
        break when manifest already exists
            Workflows->>Cloud Storage: check if HAR manifest exists
        end
        Workflows->>Cloud Build: trigger job
        Cloud Build->>Cloud Build: list HAR files and generate manifest file
        Cloud Build->>Cloud Storage: upload HAR manifest to GCS
    end
    rect rgb(191, 223, 255)
        Note right of Workflows: check BigQuery and run Dataflow jobs
        break when BigQuery records exist for table and date
            Workflows->>BigQuery: check all/combined tables for records in the given date
        end
        loop run jobs until retry limit is reached
            Workflows->>Dataflow: run flex template
            loop until job is complete
                Workflows-->Dataflow: wait for job completion
            end
        end
    end
Loading

Development workflow

sequenceDiagram
    autonumber
    actor developer
    participant Local as Local Environment / IDE
    participant Dataflow
    participant Cloud Build
    participant Workflows

    developer->>Local: create/update Dataflow code
    developer->>Local: run Dataflow job with DirectRunner via run_*.py
    developer->>Dataflow: run Dataflow job with DataflowRunner via run_pipeline_*.sh
    developer->>Cloud Build: run build_flex_template.sh
    developer->>Workflows: update flexTemplateBuildTag
Loading

Manually running the pipeline

sequenceDiagram
    actor developer
    participant Local as Local Environment / IDE
    participant Dataflow
    participant PubSub
    participant Workflows

    alt run Dataflow job from local environment using the Dataflow runner
        developer->>Local: clone repository and execute run_pipeline_*.sh
    else run Dataflow job as a flex template
        alt from local environment
            developer->>Dataflow: clone repository and execute run_flex_template.sh
        else from Google Cloud Console
            developer->>Dataflow: use the Google Cloud Console to run a flex template as documented by GCP
        end
    else trigger a Google Workflows execution
        alt
            developer->>PubSub: create a new message containing a HAR manifest path from GCS
        else
            developer->>Workflows: rerun a previously failed Workflows execution
        end
    end
Loading

Run the pipeline

Dataflow jobs can be triggered several ways:

  • Locally using bash scripts (this can be used to test uncommited code, or code on a non-`main`` branch)
  • From the Google Cloud Console in Dataflow section by choosing to run a flex template (this can be used to run commited code for a particular dataflow pipeline only)
  • From the Google Cloud Console in Workflow section by choosing to execute a failed data-pipeline workflow again (this can be used to rerun failed parts of the workflow after reason for failure is fixed)
  • By publishing a Pub/Sub message to run the whole workflow (this kicks off the whole workflow and not just the pipeline so is good for the batch kicking off jobs when done, or to rerun the whole process manually when the manifest file was not generated)

Locally using the run_*.sh scripts

This method is best used when developing locally, as a convenience for running the pipeline's python scripts and GCP CLI commands.

# run the pipeline locally
./run_pipeline_combined.sh
./run_pipeline_all.sh

# run the pipeline using a flex template
./run_flex_template all [...]
./run_flex_template combined [...]
./run_flex_template tech_report [...]

Running a flex template from the Cloud Console

This method is useful for running individual dataflow jobs from the web console since it does not require a development environment.

Flex templates accept additional parameters as mentioned in the GCP documentation below, while custom parameters are defined in flex_template_metadata_*.json

https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates#specify-options

https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates#run-a-flex-template-pipeline

Steps:

  1. Locate the desired build tag (e.g. see flexTemplateBuildTag in the data-pipeline.workflows.yaml)
  2. From the Google Cloud Console, navigate to the Dataflow > Jobs page
  3. Click "CREATE JOB FROM TEMPLATE" at the top of the page.
  4. Provide a "Job name"
  5. Change region to us-west1 (as that's where we have most compute capacity)
  6. Choose "Custom Template" from the bottom of the "Dataflow template" drop down.
  7. Browse to the template directory by pasting httparchive/dataflow/templates/ into the "Template path", ignoring the error saying this is not a file, and then clicking Browse to choose the actual file from that directory.
  8. Choose the pipeline type (e.g. all or combined) for the chosen build tag (e.g. data-pipeline-combined-2023-02-10_03-55-04.json - choose the latest one for all or combined)
  9. Expand "Optional Parameters" and provide an input for the "GCS input file" pointing to the manifests file (e.g. gs://httparchive/crawls_manifest/chrome-Jul_1_2023.txt for Desktop Jul 2023 or gs://httparchive/crawls_manifest/android-Jul_1_2023.txt for Mobile for July 2023).
  10. (Optional) provide values for any additional parameters
  11. Click "RUN JOB"

Rerunning a failed workflow

This method is useful for running the entire workflow from the web console since it does not require a development environment. It is useful when the part of the workflow failed for known reasons that have since been resolved. Prevous steps should be skipped as the workflow checks if they have already been run.

Steps:

  1. From the Google Cloud Console, navigate to the Workflow > Workflows page
  2. Select the data-pipeline workflow
  3. In the Actions column click the three dots and select "Execute again"

Publishing a Pub/Sub message

This method is best used for serverlessly running the entire workflow, including logic to

  • block execution when the crawl is still running, by waiting for the crawl's Pub/Sub queue to drain
  • skip jobs where BigQuery tables have already been populated
  • automatically retry failed jobs

Publishing a message containing the crawl's GCS path(s) will trigger a GCP workflow, including generating the HAR zip file for that run.

# single path
gcloud pubsub topics publish projects/httparchive/topics/crawl-complete --message "gs://httparchive/crawls/android-Nov_1_2022"

# multiple paths must be comma separated, without spaces
gcloud pubsub topics publish projects/httparchive/topics/crawl-complete --message "gs://httparchive/crawls/chrome-Feb_1_2023,gs://httparchive/crawls/android-Feb_1_2023"

Note that this can be run for an individual crawl (first example), or for both crawls (second example).

Pipeline types

Running the combined pipeline will produce summary and non-summary tables by default. Summary and non-summary outputs can be controlled using the --pipeline_type argument.

# example
./run_pipeline_combined.sh --pipeline_type=summary

./run_flex_template.sh combined --parameters pipeline_type=summary

Inputs

This pipeline can read individual HAR files, or a single file containing a list of HAR file paths.

# Run the `all` pipeline on both desktop and mobile using their pre-generated manifests.
./run_flex_template.sh all --parameters input_file=gs://httparchive/crawls_manifest/*-Nov_1_2022.txt

# Run the `combined` pipeline on mobile using its manifest.
./run_flex_template.sh combined --parameters input_file=gs://httparchive/crawls_manifest/android-Nov_1_2022.txt

# Run the `combined` pipeline on desktop using its individual HAR files (much slower, not encouraged).
./run_flex_template.sh combined --parameters input=gs://httparchive/crawls/chrome-Nov_1_2022

Note the run_pipeline_combined.sh and run_pipeline_all.sh scriprts uses the parameters in the scripts and these cannot be overridden with command line parameters. These are often useful for local testing of changes (local testing still results in the processing happening in GCP but using code copied from locally).

To save to different tables for testing, temporarily edit the modules/constants.py to prefix all the tables with experimental_ (note the experimental_parsed_css is current production table so use experimental_gc_parsed_css instead for now).

Generating HAR manifest files

The pipeline can read a manifest file (text file containing GCS file paths separated by new lines for each HAR file). Follow the example to generate a manifest file:

# generate manifest files
nohup gsutil ls gs://httparchive/crawls/chrome-Nov_1_2022 > chrome-Nov_1_2022.txt 2> chrome-Nov_1_2022.err &
nohup gsutil ls gs://httparchive/crawls/android-Nov_1_2022 > android-Nov_1_2022.txt 2> android-Nov_1_2022.err &

# watch for completion (i.e. file sizes will stop changing)
#   if the err file increases in size, open and check for issues
watch ls -l ./*Nov*

# upload to GCS
gsutil -m cp ./*Nov*.txt gs://httparchive/crawls_manifest/

Outputs

  • GCP DataFlow & Monitoring metrics - TODO: runtime metrics and dashboards
  • Dataflow temporary and staging artifacts in GCS
  • BigQuery (final landing zone)

Builds and Deployments

GitHub actions are used to automate the build and deployment of Google Cloud Workflows and Dataflow Flex Templates. Actions are triggered on merges to the main branch, for specific files, and when other related GitHub actions have completed successfully.

PRs with a title of Bump dataflow flex template build tag should be merged providing they are only updating the build datetime in the flexTemplateBuildTag. Check it has not zeroed the build datetime out (this can happen if the job errors in unusual ways).

Build inputs and artifacts

GCP's documentation for creating and building Flex Templates are linked here

The following files are used for building and deploying Dataflow Flex Templates:

Cloud Build is used to create Dataflow flex templates and upload them to Artifact Registry and Google Cloud Storage

To build and deploy manually

The GitHub Actions can be triggered manually from the repository by following the documentation here for Manually running a workflow.

flowchart LR
    Start((Start))
    End((End))
    A{Updating Dataflow?}
    B[Run 'Deploy Dataflow Flex Template']
    DDFTA[['Deploy Dataflow Flex Template' executes]]
    C{Updating Cloud Workflows?}
    D[Run 'Deploy Cloud Workflow']
    DCWA[['Deploy Cloud Workflow' executes]]

    Start --> A
    Start --> C
    A --> B
    B -->DDFTA
    DDFTA -->|automatically triggers| DCWA
    C --> D
    D --> DCWA
    DCWA --> End
Loading

Alternatively, a combination of bash scripts and the Google Cloud Console can be used to manually deploy Cloud Workflows and Dataflow Flex Templates.

flowchart LR
    Start((Start))
    End((End))
    A{Updating Dataflow?}
    B[Run build_flex_template.sh]
    C{Updating Cloud Workflows?}
    D[Note the latest build tag from the script output]
    E[Update the 'data-pipeline' workflow via the Cloud Console]

    Start --> A
    Start --> C
    A -->|Yes| B
    A -->|No| C
    B --> D
    D --> E
    C -->|Yes| E
    E --> End
Loading

This can be started by makling changes locally and then running the run_pipeline_all.sh or run_pipeline_combined.sh scripts (after changing input paramters in those scripts). Local code is copied to GCP for each run so your shell needs to be authenticated to GCP and have permissions to run.

To save to different tables for testing, temporarily edit the modules/constants.py to prefix all the tables with experimental_ (note the experimental_parsed_css is current production table so use experimental_gc_parsed_css instead for now).

Logs

Known issues

Data Pipeline

Temp table cleanup

Since this pipeline uses the FILE_LOADS BigQuery insert method, failures will leave behind temporary tables. Use the saved query below and replace the dataset name as desired.

https://console.cloud.google.com/bigquery?sq=226352634162:82dad1cd1374428e8d6eaa961d286559

FOR field IN
    (SELECT table_schema, table_name
    FROM lighthouse.INFORMATION_SCHEMA.TABLES
    WHERE table_name like 'beam_bq_job_LOAD_%')
DO
    EXECUTE IMMEDIATE format("drop table %s.%s;", field.table_schema, field.table_name);
END FOR;

Streaming pipeline

Initially this pipeline was developed to stream data into tables as individual HAR files became available in GCS from a live/running crawl. This allowed for results to be viewed faster, but came with additional burdens. For example:

  • Job failures and partial recovery/cleaning of tables.
  • Partial table population mid-crawl led to consumer confusion since they were previously accustomed to full tables being available.
  • Dataflow API for streaming inserts burried some low-level configuration leading to errors which were opaque and difficult to troubleshoot.

Dataflow

Logging

The work item requesting state read is no longer valid on the backend

This log message is benign and expected when using an auto-scaling pipeline https://cloud.google.com/dataflow/docs/guides/common-errors#work-item-not-valid

Response cache-control max-age

Various parsing issues due to unhandled cases

New file formats

New file formats from responses will be noted in WARNING logs

mimetypes and file extensions

Using ported custom logic from legacy PHP rather than standard libraries produces missing values and inconsistencies

About

The new HTTP Archive data pipeline built entirely on GCP

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published