[SPARK-50712][INFRA][PS][TESTS] Add a daily build for Pandas API on Spark with old dependencies

### What changes were proposed in this pull request?
Add a daily build for Pandas API on Spark with old dependencies

### Why are the changes needed?
The Pandas API on Spark (PS) tests require a newer minimum version of Pandas than the existing `python-minimum` test image provides, so they need a dedicated image and daily build.

### Does this PR introduce _any_ user-facing change?
No, infra-only

### How was this patch tested?
Tested with the PR builder, temporarily setting:
```
default: '{"PYSPARK_IMAGE_TO_TEST": "python-ps-minimum", "PYTHON_TO_TEST": "python3.9"}'

default: '{"pyspark": "true", "pyspark-pandas": "true"}'
```
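
These overrides point the PR builder at the new `python-ps-minimum` image and restrict the run to the `pyspark` and `pyspark-pandas` job groups.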

https://github.com/zhengruifeng/spark/runs/35054863846

### Was this patch authored or co-authored using generative AI tooling?

Closes #49343 from zhengruifeng/infra_ps_mini.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
zhengruifeng committed Jan 2, 2025
1 parent 3c0824d commit c374c00
Showing 5 changed files with 150 additions and 1 deletion.
13 changes: 13 additions & 0 deletions .github/workflows/build_infra_images_cache.yml
```diff
@@ -122,6 +122,19 @@ jobs:
       - name: Image digest (PySpark with old dependencies)
         if: hashFiles('dev/spark-test-image/python-minimum/Dockerfile') != ''
         run: echo ${{ steps.docker_build_pyspark_python_minimum.outputs.digest }}
+      - name: Build and push (PySpark PS with old dependencies)
+        if: hashFiles('dev/spark-test-image/python-ps-minimum/Dockerfile') != ''
+        id: docker_build_pyspark_python_ps_minimum
+        uses: docker/build-push-action@v6
+        with:
+          context: ./dev/spark-test-image/python-ps-minimum/
+          push: true
+          tags: ghcr.io/apache/spark/apache-spark-github-action-image-pyspark-python-ps-minimum-cache:${{ github.ref_name }}-static
+          cache-from: type=registry,ref=ghcr.io/apache/spark/apache-spark-github-action-image-pyspark-python-ps-minimum-cache:${{ github.ref_name }}
+          cache-to: type=registry,ref=ghcr.io/apache/spark/apache-spark-github-action-image-pyspark-python-ps-minimum-cache:${{ github.ref_name }},mode=max
+      - name: Image digest (PySpark PS with old dependencies)
+        if: hashFiles('dev/spark-test-image/python-ps-minimum/Dockerfile') != ''
+        run: echo ${{ steps.docker_build_pyspark_python_ps_minimum.outputs.digest }}
       - name: Build and push (PySpark with PyPy 3.10)
         if: hashFiles('dev/spark-test-image/pypy-310/Dockerfile') != ''
         id: docker_build_pyspark_pypy_310
```
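
The `cache-from`/`cache-to` pair reuses a per-branch BuildKit registry cache for this image; `mode=max` exports all intermediate layers rather than only the final image, so daily rebuilds stay cheap whenever the Dockerfile is unchanged.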
47 changes: 47 additions & 0 deletions .github/workflows/build_python_ps_minimum.yml
@@ -0,0 +1,47 @@
```yaml
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#

name: "Build / Python-only (master, Python PS with old dependencies)"

on:
  schedule:
    - cron: '0 10 * * *'
  workflow_dispatch:

jobs:
  run-build:
    permissions:
      packages: write
    name: Run
    uses: ./.github/workflows/build_and_test.yml
    if: github.repository == 'apache/spark'
    with:
      java: 17
      branch: master
      hadoop: hadoop3
      envs: >-
        {
          "PYSPARK_IMAGE_TO_TEST": "python-ps-minimum",
          "PYTHON_TO_TEST": "python3.9"
        }
      jobs: >-
        {
          "pyspark": "true",
          "pyspark-pandas": "true"
        }
```
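
Both `envs` and `jobs` are passed to the reusable `build_and_test.yml` workflow as JSON strings. A minimal, illustrative sketch of how such strings decode into the image and test selection (the variable names here are ours; the real parsing happens inside `build_and_test.yml`):

```python
import json

# Hypothetical decoding of the workflow inputs above; illustrative only.
envs = json.loads('{"PYSPARK_IMAGE_TO_TEST": "python-ps-minimum", "PYTHON_TO_TEST": "python3.9"}')
jobs = json.loads('{"pyspark": "true", "pyspark-pandas": "true"}')

assert envs["PYSPARK_IMAGE_TO_TEST"] == "python-ps-minimum"  # selects the new image
assert jobs.get("pyspark-pandas") == "true"                  # enables the PS test group
```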
81 changes: 81 additions & 0 deletions dev/spark-test-image/python-ps-minimum/Dockerfile
@@ -0,0 +1,81 @@
```dockerfile
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Image for building and testing Spark branches. Based on Ubuntu 22.04.
# See also in https://hub.docker.com/_/ubuntu
FROM ubuntu:jammy-20240911.1
LABEL org.opencontainers.image.authors="Apache Spark project <[email protected]>"
LABEL org.opencontainers.image.licenses="Apache-2.0"
LABEL org.opencontainers.image.ref.name="Apache Spark Infra Image For Pandas API on Spark with old dependencies"
# Overwrite this label to avoid exposing the underlying Ubuntu OS version label
LABEL org.opencontainers.image.version=""

ENV FULL_REFRESH_DATE=20250102

ENV DEBIAN_FRONTEND=noninteractive
ENV DEBCONF_NONINTERACTIVE_SEEN=true

RUN apt-get update && apt-get install -y \
    build-essential \
    ca-certificates \
    curl \
    gfortran \
    git \
    gnupg \
    libcurl4-openssl-dev \
    libfontconfig1-dev \
    libfreetype6-dev \
    libfribidi-dev \
    libgit2-dev \
    libharfbuzz-dev \
    libjpeg-dev \
    liblapack-dev \
    libopenblas-dev \
    libpng-dev \
    libpython3-dev \
    libssl-dev \
    libtiff5-dev \
    libxml2-dev \
    openjdk-17-jdk-headless \
    pkg-config \
    qpdf \
    tzdata \
    software-properties-common \
    wget \
    zlib1g-dev


# Should keep the installation consistent with https://apache.github.io/spark/api/python/getting_started/install.html

# Install Python 3.9
RUN add-apt-repository ppa:deadsnakes/ppa
RUN apt-get update && apt-get install -y \
    python3.9 \
    python3.9-distutils \
    && apt-get autoremove --purge -y \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*


ARG BASIC_PIP_PKGS="pyarrow==11.0.0 pandas==2.2.0 six==1.16.0 numpy scipy coverage unittest-xml-reporting"
# Python deps for Spark Connect
ARG CONNECT_PIP_PKGS="grpcio==1.67.0 grpcio-status==1.67.0 googleapis-common-protos==1.65.0 graphviz==0.20 protobuf"

# Install Python 3.9 packages
RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.9
RUN python3.9 -m pip install --force $BASIC_PIP_PKGS $CONNECT_PIP_PKGS && \
    python3.9 -m pip cache purge
```
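
A quick sanity check one could run with `python3.9` inside the built image to confirm the pinned minimum versions took effect (illustrative; not part of this PR):

```python
# Verifies the pinned old-dependency versions inside the python-ps-minimum image.
import grpc
import pandas
import pyarrow

assert pandas.__version__ == "2.2.0"
assert pyarrow.__version__ == "11.0.0"
assert grpc.__version__ == "1.67.0"
print("pandas", pandas.__version__, "| pyarrow", pyarrow.__version__, "| grpcio", grpc.__version__)
```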
8 changes: 7 additions & 1 deletion python/pyspark/pandas/tests/io/test_io.py
```diff
@@ -24,7 +24,12 @@
 from pyspark import pandas as ps
 from pyspark.testing.pandasutils import PandasOnSparkTestCase
 from pyspark.testing.sqlutils import SQLTestUtils
-from pyspark.testing.utils import have_tabulate, tabulate_requirement_message
+from pyspark.testing.utils import (
+    have_jinja2,
+    jinja2_requirement_message,
+    have_tabulate,
+    tabulate_requirement_message,
+)


 # This file contains test cases for 'Serialization / IO / Conversion'
@@ -91,6 +96,7 @@ def test_from_dict(self):
         psdf = ps.DataFrame.from_dict(data, orient="index", columns=["A", "B", "C", "D"])
         self.assert_eq(pdf, psdf)

+    @unittest.skipIf(not have_jinja2, jinja2_requirement_message)
     def test_style(self):
         # Currently, the `style` function returns a pandas object `Styler` as it is,
         # processing only the number of rows declared in `compute.max_rows`.
```
2 changes: 2 additions & 0 deletions python/pyspark/pandas/tests/io/test_series_conversion.py
```diff
@@ -23,6 +23,7 @@
 from pyspark import pandas as ps
 from pyspark.testing.pandasutils import PandasOnSparkTestCase
 from pyspark.testing.sqlutils import SQLTestUtils
+from pyspark.testing.utils import have_jinja2, jinja2_requirement_message


 class SeriesConversionTestsMixin:
@@ -48,6 +49,7 @@ def test_to_clipboard(self):
             psser.to_clipboard(sep=",", index=False), pser.to_clipboard(sep=",", index=False)
         )

+    @unittest.skipIf(not have_jinja2, jinja2_requirement_message)
     def test_to_latex(self):
         pser = self.pser
         psser = self.psser
```
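
Both new skips rely on availability flags imported from `pyspark.testing.utils`. A minimal sketch of the common pattern behind such flags (an assumed reconstruction; the actual definitions live in `python/pyspark/testing/utils.py` and may differ in detail):

```python
import importlib.util

# Assumed pattern for optional-dependency test flags such as `have_jinja2`;
# the real definitions in pyspark/testing/utils.py may differ in detail.
have_jinja2 = importlib.util.find_spec("jinja2") is not None
jinja2_requirement_message = None if have_jinja2 else "Jinja2 must be installed; run `pip install jinja2`."
```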
