GH-35274: [Java][CI] Enable GCS on MacOS #35277

Closed
wants to merge 19 commits into from

davisusanibar
Contributor

@davisusanibar davisusanibar commented Apr 21, 2023

Rationale for this change

Enables GCS on MacOS when building the Arrow Dataset for Java.

What changes are included in this PR?

Updates the CI build to enable GCS.

Are these changes tested?

import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.scanner.ScanOptions;
import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.dataset.source.Dataset;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.apache.arrow.vector.types.pojo.Schema;

public class Test {
  public static void main(String[] args) {
    String uri = "gs://anonymous@voltrondata-labs-datasets/nyc-taxi-tiny/year=2022/month=2/part-0.parquet"; // Google Cloud Storage
    ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
    try (
        BufferAllocator allocator = new RootAllocator();
        DatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
        Dataset dataset = datasetFactory.finish();
        Scanner scanner = dataset.newScan(options);
        ArrowReader reader = scanner.scanBatches()
    ) {
      Schema schema = scanner.schema();
      System.out.println(schema); // Schema<vendor_name: Utf8, pickup_datetime: Timestamp(MILLISECOND, null), dropoff_datetime: Timestamp(MILLISECOND, null), passenger_count: Int(64, true), trip_distance: FloatingPoint(DOUBLE), pickup_longitude: FloatingPoint(DOUBLE), pickup_latitude: FloatingPoint(DOUBLE), rate_code: Utf8, store_and_fwd: Utf8, dropoff_longitude: FloatingPoint(DOUBLE), dropoff_latitude: FloatingPoint(DOUBLE), payment_type: Utf8, fare_amount: FloatingPoint(DOUBLE), extra: FloatingPoint(DOUBLE), mta_tax: FloatingPoint(DOUBLE), tip_amount: FloatingPoint(DOUBLE), tolls_amount: FloatingPoint(DOUBLE), total_amount: FloatingPoint(DOUBLE), improvement_surcharge: FloatingPoint(DOUBLE), congestion_surcharge: FloatingPoint(DOUBLE), pickup_location_id: Int(64, true), dropoff_location_id: Int(64, true)>
      while (reader.loadNextBatch()) {
        System.out.println("RowCount: " + reader.getVectorSchemaRoot().getRowCount()); //2979
      }
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}

Are there any user-facing changes?

No

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@davisusanibar davisusanibar changed the title from "feat: enable GCS on macosx" to "GH-35274: [Java][CI] Enable GCS on MacOS" Apr 21, 2023
@github-actions

⚠️ GitHub issue apache/arrow-java#202 has been automatically assigned in GitHub to PR creator.

@kou
Member

kou commented Apr 21, 2023

@github-actions crossbow submit java-jars

@github-actions

Revision: a05f58e

Submitted crossbow builds: ursacomputing/crossbow @ actions-24f328b4cf

Task Status
java-jars Github Actions

Member

@assignUser assignUser left a comment

A minor change

cpp/build-support/run-test.sh (outdated, resolved)
@github-actions github-actions bot added the awaiting committer review label and removed the awaiting review label Apr 22, 2023
@davisusanibar
Contributor Author

I am seeing an error on macOS aarch_64 with the message "dyld: lazy symbol binding failed: Symbol not found: _curl_multi_poll". @kou / @assignUser, do you have any idea what causes it?

Running arrow-gcsfs-test, redirecting output into /Users/voltrondata/github-actions-runner/_work/crossbow/crossbow/arrow/cpp-build/cpp/build/test-logs/arrow-gcsfs-test.txt (attempt 1/1)
/Users/voltrondata/github-actions-runner/_work/crossbow/crossbow/arrow/cpp/build-support/run-test.sh: line 91: 33793 Abort trap: 6           $TEST_EXECUTABLE "$@" > $LOGFILE.raw 2>&1
INFO:werkzeug:WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on http://localhost:63593
INFO:werkzeug:Press CTRL+C to quit
INFO:werkzeug: * Restarting with stat
Running main() from /Users/voltrondata/github-actions-runner/_work/crossbow/crossbow/arrow/cpp-build/cpp/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest_main.cc
[==========] Running 99 tests from 3 test suites.
[----------] Global test environment set-up.
[----------] 26 tests from TestGCSFSGeneric
[ RUN      ] TestGCSFSGeneric.Empty
dyld: lazy symbol binding failed: Symbol not found: _curl_multi_poll
  Referenced from: /Users/voltrondata/github-actions-runner/_work/crossbow/crossbow/arrow/cpp-build/cpp/release/arrow-gcsfs-test
  Expected in: /usr/lib/libcurl.4.dylib

dyld: Symbol not found: _curl_multi_poll
  Referenced from: /Users/voltrondata/github-actions-runner/_work/crossbow/crossbow/arrow/cpp-build/cpp/release/arrow-gcsfs-test
  Expected in: /usr/lib/libcurl.4.dylib

...
...

Total Test time (real) = 121.27 sec

The following tests FAILED:
	 61 - arrow-gcsfs-test (Failed)
Errors while running CTest

@assignUser
Member

I think the installed curl version is too old. The test binary resolves against the system library /usr/lib/libcurl.4.dylib, which does not export curl_multi_poll. That function was added in libcurl 7.66.0: https://curl.se/libcurl/c/curl_multi_poll.html
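
A minimal sketch of how this version requirement could be checked in a CI script, assuming a POSIX shell with awk available. The helper names version_ge and curl_has_multi_poll are hypothetical and are not part of the Arrow build scripts. The 7.66.0 threshold is the release in which curl_multi_poll was added to libcurl.

```shell
# Hypothetical helper: succeeds when dotted version $1 is >= dotted version $2.
# Only the first three numeric fields (major.minor.patch) are compared.
version_ge() {
  awk -v have="$1" -v want="$2" 'BEGIN {
    split(have, h, ".")
    split(want, w, ".")
    for (i = 1; i <= 3; i++) {
      # "+ 0" forces a numeric comparison and treats a missing field as 0.
      if (h[i] + 0 > w[i] + 0) exit 0
      if (h[i] + 0 < w[i] + 0) exit 1
    }
    exit 0
  }'
}

# Hypothetical helper: succeeds when the given libcurl version is new
# enough to provide curl_multi_poll (added in libcurl 7.66.0).
curl_has_multi_poll() {
  version_ge "$1" "7.66.0"
}
```

As a usage example, assuming curl-config is on the PATH and prints a line of the form "libcurl X.Y.Z", the check could be run as: curl_has_multi_poll "$(curl-config --version | awk '{print $2}')" || echo "libcurl too old for curl_multi_poll".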

davisusanibar and others added 2 commits April 24, 2023 09:47
@davisusanibar
Contributor Author

@github-actions crossbow submit java-jars

@github-actions

Revision: d6da99d

Submitted crossbow builds: ursacomputing/crossbow @ actions-fa53b98180

Task Status
java-jars Github Actions

@github-actions github-actions bot added the awaiting changes label and removed the awaiting committer review label Apr 24, 2023
@assignUser
Member

assignUser commented May 10, 2023

This should fix the M1 issues:

--- a/dev/tasks/java-jars/github.yml
+++ b/dev/tasks/java-jars/github.yml
@@ -108,6 +108,7 @@ jobs:
           set -e
           # make brew Java available to CMake
           if [ "{{ arch }}" = "aarch_64" ]; then
+            export CURL_ROOT=$(brew --prefix curl)
             export JAVA_HOME=$(brew --prefix openjdk@11)/libexec/openjdk.jdk/Contents/Home
           fi
           arrow/ci/scripts/java_jni_macos_build.sh \
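
To confirm which libcurl a test binary actually loads after a change like this, one option on macOS is to inspect the binary's linked libraries with otool -L. The sketch below only filters that output, read from standard input. The function names linked_libcurl and uses_system_libcurl are hypothetical, and the binary path in the usage note is an assumption based on the failing job's log.

```shell
# Hypothetical helper: print the libcurl path(s) found in `otool -L`
# output supplied on stdin. Each library line starts with its path.
linked_libcurl() {
  awk '/libcurl/ { print $1 }'
}

# Hypothetical helper: succeeds when the libcurl listed on stdin lives
# under /usr/lib/, i.e. it is the system copy rather than Homebrew's.
uses_system_libcurl() {
  linked_libcurl | grep -q '^/usr/lib/'
}
```

A possible invocation, assuming the build tree layout seen in the log: otool -L cpp-build/cpp/release/arrow-gcsfs-test | uses_system_libcurl && echo "still linked against the system libcurl". If CURL_ROOT takes effect, the listed path would be expected to sit under the prefix reported by brew --prefix curl instead.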

@github-actions github-actions bot added the awaiting change review label Jun 6, 2023
@davisusanibar
Contributor Author

@github-actions crossbow submit java-jars

@github-actions

github-actions bot commented Jun 6, 2023

Revision: e1db91b

Submitted crossbow builds: ursacomputing/crossbow @ actions-7a70b20d3a

Task Status
java-jars Github Actions

@davisusanibar
Contributor Author

@lidavidm do you have any insight into what I could check to resolve this error? See https://github.com/apache/arrow/actions/runs/5274483912/jobs/9538989549?pr=35277:

[70/122] Generating arrow-flight-glib/ArrowFlight-1.0.gir with a custom command (wrapped by meson to set env)
FAILED: arrow-flight-glib/ArrowFlight-1.0.gir 
env PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:/build/c_glib/meson-uninstalled /usr/bin/g-ir-scanner --quiet --no-libtool --namespace=ArrowFlight --nsversion=1.0 --warn-all --output arrow-flight-glib/ArrowFlight-1.0.gir --c-include=arrow-flight-glib/arrow-flight-glib.h --warn-all --include-uninstalled=./arrow-glib/Arrow-1.0.gir -I/arrow/c_glib/arrow-flight-glib -I/build/c_glib/arrow-flight-glib -I/arrow/c_glib/. -I/build/c_glib/. --filelist=/build/c_glib/arrow-flight-glib/libarrow-flight-glib.so.1300.0.0.p/ArrowFlight_1.0_gir_filelist --include=Arrow-1.0 --symbol-prefix=gaflight --identifier-prefix=GAFlight --pkg-export=arrow-flight-glib --cflags-begin -I/arrow/c_glib/. -I/build/c_glib/. -I/usr/local/include -I/usr/include/glib-2.0 -I/usr/lib/x86_64-linux-gnu/glib-2.0/include -I/usr/include/gobject-introspection-1.0 -DARROW_NO_DEPRECATED_API --cflags-end --add-include-path=/build/c_glib/arrow-glib --add-include-path=/usr/share/gir-1.0 -L/build/c_glib/arrow-flight-glib --library arrow-flight-glib -L/build/c_glib/arrow-glib -L/usr/local/lib -L/usr/local/lib --extra-library=arrow_flight --extra-library=arrow --extra-library=arrow_acero --extra-library=gobject-2.0 --extra-library=glib-2.0 --extra-library=girepository-1.0 --sources-top-dirs /arrow/c_glib/ --sources-top-dirs /build/c_glib/ --warn-error
/usr/bin/ld: /usr/local/lib/libarrow_flight.so: undefined reference to `absl::lts_20230125::crc_internal::CrcCordState::CrcCordState()'
/usr/bin/ld: /usr/local/lib/libarrow_flight.so: undefined reference to `absl::lts_20230125::crc_internal::CrcCordState::operator=(absl::lts_20230125::crc_internal::CrcCordState&&)'
/usr/bin/ld: /usr/local/lib/libarrow_flight.so: undefined reference to `absl::lts_20230125::crc_internal::CrcCordState::CrcCordState(absl::lts_20230125::crc_internal::CrcCordState&&)'
/usr/bin/ld: /usr/local/lib/libarrow_flight.so: undefined reference to `absl::lts_20230125::base_internal::StrError[abi:cxx11](int)'
/usr/bin/ld: /usr/local/lib/libarrow_flight.so: undefined reference to `absl::lts_20230125::crc_internal::CrcCordState::~CrcCordState()'
/usr/bin/ld: /usr/local/lib/libarrow_flight.so: undefined reference to `absl::lts_20230125::crc_internal::CrcCordState::Checksum() const'
collect2: error: ld returned 1 exit status
linking of temporary binary failed: Command '['x86_64-linux-gnu-gcc', '-pthread', '-o', '/build/c_glib/tmp-introspectci4byn1o/ArrowFlight-1.0', '-DARROW_NO_DEPRECATED_API', '/build/c_glib/tmp-introspectci4byn1o/ArrowFlight-1.0.o', '-L.', '-Wl,-rpath,.', '-Wl,--no-as-needed', '-L/build/c_glib/arrow-flight-glib', '-Wl,-rpath,/build/c_glib/arrow-flight-glib', '-L/build/c_glib/arrow-glib', '-Wl,-rpath,/build/c_glib/arrow-glib', '-L/usr/local/lib', '-Wl,-rpath,/usr/local/lib', '-L/usr/local/lib', '-Wl,-rpath,/usr/local/lib', '-larrow-flight-glib', '-larrow_flight', '-larrow', '-larrow_acero', '-lgobject-2.0', '-lglib-2.0', '-lgirepository-1.0', '-lgio-2.0', '-lgobject-2.0', '-Wl,--export-dynamic', '-lgmodule-2.0', '-pthread', '-lglib-2.0']' returned non-zero exit status 1.
[71/122] Compiling C++ object gandiva-glib/libgandiva-glib.so.1300.0.0.p/expression.cpp.o
[72/122] Compiling C++ object arrow-flight-sql-glib/libarrow-flight-sql-glib.so.1300.0.0.p/server.cpp.o
ninja: build stopped: subcommand failed.

@kou
Member

kou commented Jun 15, 2023

Hmm. I can't reproduce it locally...

The error message suggests that we need to add Abseil's CRC-related modules to https://github.com/apache/arrow/blob/main/cpp/cmake_modules/ThirdpartyToolchain.cmake#L2856-L3756 .

BTW, can we use Abseil 20230125.3 (https://github.com/abseil/abseil-cpp/releases/tag/20230125.3)?
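
To see at a glance which Abseil internals the linker could not resolve, the "undefined reference" lines from a log like the one above can be reduced to their namespaces. This is a minimal sketch, assuming GNU ld style messages and the absl::lts_<date>:: inline namespace seen in the log. The function name missing_absl_namespaces is hypothetical, and a namespace is only a hint at which Abseil module may be missing, not a definitive mapping.

```shell
# Hypothetical helper: read linker output on stdin and print the distinct
# Abseil namespaces that appear in "undefined reference" messages.
missing_absl_namespaces() {
  # Capture the namespace that directly follows absl::lts_<date>::
  sed -n 's/.*undefined reference to `absl::lts_[0-9]*::\([a-z_]*\)::.*/\1/p' |
    sort -u
}
```

Piping the failed build log through this function, for example missing_absl_namespaces < build.log (where build.log is an assumed file name), lists each affected namespace once, which may help narrow down which Abseil targets to add to the toolchain file.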

@davisusanibar
Contributor Author

@github-actions crossbow submit java-jars

@github-actions

Revision: 828d689

Submitted crossbow builds: ursacomputing/crossbow @ actions-221f6dfca1

Task Status
java-jars Github Actions

@davisusanibar
Contributor Author

@lidavidm could you please help me convert this PR to a draft? There are some CI problems that I need to test, and I will probably need to trigger the builds again and again. Thank you in advance.

@kou
Member

kou commented Jul 3, 2023

You can find the "Convert to draft" link in the right side bar.

@davisusanibar davisusanibar marked this pull request as draft July 3, 2023 21:23
@davisusanibar
Contributor Author

> You can find the "Convert to draft" link in the right side bar.

Thank you

@davisusanibar
Contributor Author

@github-actions crossbow submit java-jars

@github-actions

github-actions bot commented Jul 5, 2023

Revision: 1deb06b

Submitted crossbow builds: ursacomputing/crossbow @ actions-abbebdf72b

Task Status
java-jars Github Actions

@davisusanibar
Contributor Author

@github-actions crossbow submit java-jars

@github-actions

Revision: 480a896

Submitted crossbow builds: ursacomputing/crossbow @ actions-54eb93bef5

Task Status
java-jars Github Actions

@davisusanibar
Contributor Author

@github-actions crossbow submit java-jars

@github-actions

github-actions bot commented Sep 5, 2023

Revision: 9acf7dc

Submitted crossbow builds: ursacomputing/crossbow @ actions-8d83452f61

Task Status
java-jars Github Actions

@davisusanibar
Contributor Author

I have closed this pull request in order to work out a better alternative solution.


⚠️ GitHub issue #35274 has no components, please add labels for components.

Development

Successfully merging this pull request may close these issues.

[Java][Dataset][MacOS] Enable GCS
5 participants