Arkouda nightly performance charts
Bill Reus' March 2021 talk at the NJIT Data Science Seminar
Bill Reus' CHIUW 2020 Keynote video and slides
Mike Merrill's CHIUW 2019 talk
Exploratory data analysis (EDA) is a prerequisite for all data science, as illustrated by the ubiquity of Jupyter notebooks, the preferred interface for EDA among data scientists. The operations involved in exploring and transforming the data are often at least as computationally intensive as downstream applications (e.g. machine learning algorithms), and as datasets grow, so does the need for HPC-enabled EDA. However, the inherently interactive and open-ended nature of EDA does not mesh well with current HPC usage models. Meanwhile, several existing projects from outside the traditional HPC space attempt to combine interactivity and distributed computation using programming paradigms and tools from cloud computing, but none of these projects have come close to meeting our needs for high-performance EDA.
To fill this gap, we have developed a software package, called Arkouda, which allows a user to interactively issue massively parallel computations on distributed data using functions and syntax that mimic NumPy, the underlying computational library used in the vast majority of Python data science workflows. The computational heart of Arkouda is a Chapel interpreter that accepts a pre-defined set of commands from a client (currently implemented in Python) and uses Chapel's built-in machinery for multi-locale and multithreaded execution. Arkouda has benefited greatly from Chapel's distinctive features and has also helped guide the development of the language.
In early applications, users of Arkouda have tended to iterate rapidly between multi-node execution with Arkouda and single-node analysis in Python, relying on Arkouda to filter a large dataset down to a smaller collection suitable for analysis in Python, and then feeding the results back into Arkouda computations on the full dataset. This paradigm has already proved very fruitful for EDA. Our goal is to enable users to progress seamlessly from EDA to specialized algorithms by making Arkouda an integration point for HPC implementations of expensive kernels like FFTs, sparse linear algebra, and graph traversal. With Arkouda serving the role of a shell, a data scientist could explore, prepare, and call optimized HPC libraries on massive datasets, all within the same interactive session.
Arkouda is not trying to replace Pandas but to allow for some Pandas-style operation at a much larger scale. In our experience Pandas can handle dataframes up to about 500 million rows before performance becomes a real issue, this is provided that you run on a sufficently capable compute server. Arkouda breaks the shared memory paradigm and scales its operations to dataframes with over 200 billion rows, maybe even a trillion. In practice we have run Arkouda server operations on columns of one trillion elements running on 512 compute nodes. This yielded a >20TB dataframe in Arkouda.
- Prerequisites
- Building Arkouda
- Testing Arkouda
- Installing Arkouda Python libs and deps
- Running arkouda_server
- Logging
- Type Checking in Arkouda
- Environment Variables
- Versioning
- Contributing
Prerequisites toc
Requirements: toc
- requires chapel 1.24.1
- requires zeromq version >= 4.2.5, tested with 4.2.5 and 4.3.1
- requires hdf5
- requires python 3.7 or greater
- requires numpy
- requires typeguard for runtime type checking
- requires pandas for testing and conversion utils
- requires pytest, pytest-env, and h5py to execute the Python test harness
- requires sphinx, sphinx-argparse, and sphinx-autoapi to generate docs
- requires versioneer for versioning
MacOS Environment toc
Installing Chapel toc
Option 1: Setup using brew
(click to see more)
brew install zeromq
brew install hdf5
brew install chapel
Option 2: Build Chapel from source
(click to see more)
# build chapel in the user home directory with these settings...
export CHPL_HOME=~/chapel/chapel-1.24.1
source $CHPL_HOME/util/setchplenv.bash
export CHPL_COMM=gasnet
export CHPL_COMM_SUBSTRATE=smp
export CHPL_TARGET_CPU=native
export GASNET_QUIET=Y
export CHPL_RT_OVERSUBSCRIBED=yes
cd $CHPL_HOME
make
# Build chpldoc to enable generation of Arkouda docs
make chpldoc
# Add the Chapel and Chapel Doc executables (chpl and chpldoc, respectiveley) to
# PATH either in ~/.bashrc (single user) or /etc/environment (all users):
export PATH=$CHPL_HOME/bin/linux64-x86_64/:$PATH
Mac - Python / Anaconda toc
While not required, it is highly recommended to install Anaconda to provide a Python3 environment and manage Python dependencies. Otherwise, python can be installed via brew.
# The recommended Python install is via Anaconda:
wget https://repo.anaconda.com/archive/Anaconda3-2020.07-MacOSX-x86_64.sh
sh Anaconda3-2020.07-MacOSX-x86_64.sh
source ~/.bashrc
# Otherwise, Python 3 can be installed with brew
brew install python3
# versioneer is required, use either conda or pip
pip install versioneer
or
conda install versioneer
# these packages are nice but not a requirement (manual install required if Python installed with brew)
pip3 install pandas
pip3 install jupyter
Linux Environment toc
Installing Chapel on Linux toc
There is no Linux Chapel install, so the first two steps in the Linux Arkouda install are to install the Chapel dependencies followed by downloading and building Chapel.
(click to see more)
# Update Linux kernel and install Chapel dependencies
sudo apt-get update
sudo apt-get install gcc g++ m4 perl python python-dev python-setuptools bash make mawk git pkg-config
# Download latest Chapel release, explode archive, and navigate to source root directory
wget https://github.com/chapel-lang/chapel/releases/download/1.24.1/chapel-1.24.1.tar.gz
tar xvf chapel-1.24.1.tar.gz
cd chapel-1.24.1/
# Set CHPL_HOME
export CHPL_HOME=$PWD
# Add chpl to PATH
source $CHPL_HOME/util/setchplenv.bash
# Set remaining env variables and execute make
export CHPL_COMM=gasnet
export CHPL_COMM_SUBSTRATE=smp
export CHPL_TARGET_CPU=native
export GASNET_QUIET=Y
export CHPL_RT_OVERSUBSCRIBED=yes
cd $CHPL_HOME
make
# Build chpldoc to enable generation of Arkouda docs
make chpldoc
# Optionally add the Chapel executable (chpl) to the PATH for all users: /etc/environment
export PATH=$CHPL_HOME/bin/linux64-x86_64/:$PATH
Python environment setup - Anaconda toc
As is the case with the MacOS install, it is highly recommended to install Anaconda to provide a Python environment and manage Python dependencies:
(click to see more)
wget https://repo.anaconda.com/archive/Anaconda3-2020.07-Linux-x86_64.sh
sh Anaconda3-2020.07-Linux-x86_64.sh
source ~/.bashrc
# Install versioneer and other required python packages if they are not included in yoru anaconda install
conda install versioneer
or
pip install versioneer
# Repeat for any missing pacakges using your package manager of choice (conda or pip)
Windows Environment (WSL2) toc
It is possible to set up a basic arkouda installation on MS Windows using the Windows Subsystem for Linux (WSL2). The general strategy here is to use Linux terminals on WSL to launch the server If you are going to try this route we suggest using WSL-2 with Ubuntu 20.04 LTS. There are a number of tutorials available online such has MicroSoft's
Key installation points:
- Make sure to use WSL2
- Ubuntu 20.04 LTS from the MS app store
- Don't forget to create a user account and password as part of the Linux install
Once configured you can follow the basic Linux installation instructions for installing Chapel & Arkouda. We also recommend installing Anaconda for windows.
The general plan is to compile & run the arkouda-server
process from a Linux terminal on WSL and then either connect
to it with the python client using another Linux terminal running on WSL or using the Windows Anaconda-Powershell.
If running an IDE you can use either the Windows or Linux version, however, you may need to install an X-window system on Windows such as VcXsrv, X410, or an alternative. Follow the setup instructions for whichever one you choose, but keep in mind you may need to update your Windows firewall to allow the Xserver to connect. Also, on the Linux side of the house we found it necessary to add
export DISPLAY=$(cat /etc/resolv.conf | grep nameserver | awk '{print $2; exit;}'):0.0
to our ~/.bashrc
file to get the display correctly forwarded.
Building Arkouda toc
Download, clone, or fork the arkouda repo. Further instructions assume that the current directory is the top-level directory of the repo.
Build the source toc
If your environment requires non-system paths to find dependencies (e.g., if using the ZMQ and HDF5 bundled
with [Anaconda]), append each path to a new file Makefile.paths
like so:
# Makefile.paths
# Custom Anaconda environment for Arkouda
$(eval $(call add-path,/home/user/anaconda3/envs/arkouda))
# ^ Note: No space after comma.
The chpl
compiler will be executed with -I
, -L
and an -rpath
to each path.
# If zmq and hdf5 have not been installed previously, execute make install-deps
make install-deps
# Run make to build the arkouda_server executable
make
Building the Arkouda documentation toc
The Arkouda documentation is hosted on Read-the-Docs.
(click to see more)
First ensure that all Python doc dependencies including sphinx and sphinx extensions have been installed as detailed
above. Important: if Chapel was built locally, make chpldoc
must be executed as detailed above to enable
generation of the Chapel docs via the chpldoc executable.
Now that all doc generation dependencies for both Python and Chapel have been installed, there are three make targets for generating docs:
# make doc-python generates the Python docs only
make doc-python
# make doc-server generates the Chapel docs only
make doc-server
# make doc generates both Python and Chapel documentation
make doc
The Python docs are written out to the arkouda/docs directory while the Chapel docs are exported to the arkouda/docs/server directory.
arkouda/docs/ # Python frontend documentation
arkouda/docs/server # Chapel backend server documentation
To view the Arkouda documentation locally, type the following url into the browser of choice:
file:///path/to/arkouda/docs/index.html
, substituting the appropriate path for the Arkouda directory configuration.
The make doc
target detailed above prepares the Arkouda Python and Chapel docs for hosting both locally and on Read-the-Docs.
There are three easy steps to hosting Arkouda docs on Github Pages. First, the Arkouda docs generated via make doc
are pushed to the Arkouda or Arkouda fork master branch. Next, navigate to the Github project home and click the
"Settings" tab. Finally, scroll down to the Github Pages section and select the "master branch docs/ folder" source
option. The Github Pages docs url will be displayed once the source option is selected. Click on the link and the
Arkouda documentation homepage will be displayed.
Testing Arkouda toc
(click to see more)
There are two unit test suites for Arkouda, one for Python and one for Chapel. As mentioned above, the Arkouda
Python test harness leverages multiple libraries such as pytest and
pytest-env that must be installed via pip3 install -e .[dev]
,
whereas the Chapel test harness does not require any external librares.
The default Arkouda test executes the Python test harness and is invoked as follows:
make test
The Chapel unit tests can be executed as follows:
make test-chapel
Both the Python and Chapel unit tests are executed as follows:
make test-all
For more details regarding Arkouda testing, please consult the Python test README and Chapel test README, respectively.
Installing the Arkouda Python Library and Dependencies toc
Now that the arkouda_server is built and tested, install the Python library.
The Arkouda Python library along with its dependent libraries are installed with pip. There are four types of Python dependencies for the Arkouda developer to install: requires, dev, test, and doc. The required libraries, which are the runtime dependencies of the Arkouda python library, are installed as follows:
pip3 install -e .
Arkouda and the Python libraries required for development, test, and doc generation activities are installed as follows:
pip3 install -e .[dev]
Alternatively you can build a distributable package via
# We'll use a virtual environment to build
python -m venv build-client-env
source build-client-env/bin/activate
python -m pip install --upgrade pip build wheel versioneer
python setup.py clean --all
python -m build
# Clean up our virtual env
deactivate
rm -rf build-client-env
# You should now have 2 files in the dist/ directory which can be installed via pip
pip install dist/arkouda*.whl
# or
pip install dist/arkouda*.tar.gz
Running arkouda_server toc
The command-line invocation depends on whether you built a single-locale version (with CHPL_COMM=none
) or
multi-locale version (with CHPL_COMM
set to the desired number of locales).
Single-locale startup:
./arkouda_server
Multi-locale startup (user selects the number of locales):
./arkouda_server -nl 2
Memory tracking is turned on by default now, you can run server with memory tracking turned off by
./arkouda_server --memTrack=false
By default, the server listens on port 5555
. This value can be overridden with the command-line flag
--ServerPort=1234
Memory tracking is turned on by default and turned off by using the --memTrack=false
flag
Trace logging messages are turned on by default and turned off by using the --trace=false
flag
Other command line options are available and can be viewed by using the --help
flag
./arkouda-server --help
Sanity check arkouda_server toc
To sanity check the arkouda server, you can run
make check
This will start the server, run a few computations, and shut the server down. In addition, the check script can be executed against a running server by running the following Python command:
python3 tests/check.py localhost 5555
Token-Based Authentication in Arkouda toc
Arkouda features a token-based authentication mechanism analogous to Jupyter, where a randomized alphanumeric string is generated or loaded at arkouda_server startup. The command to start arkouda_server with token authentication is as follows:
./arkouda_server --authenticate
The generated token is saved to the tokens.txt file which is contained in the .arkouda directory located in the same working directory the arkouda_server is launched from. The arkouda_server will re-use the same token until the .arkouda/tokens.txt file is removed, which forces arkouda_server to generate a new token and corresponding tokens.txt file.
Connecting to Arkouda toc
The client connects to the arkouda_server either by supplying a host and port or by providing a connect_url connect string:
arkouda.connect(server='localhost', port=5555)
arkouda.connect(connect_url='tcp://localhost:5555')
When arkouda_server is launched in authentication-enabled mode, clients connect by either specifying the access_token parameter or by adding the token to the end of the connect_url connect string:
arkouda.connect(server='localhost', port=5555, access_token='dcxCQntDQllquOsBNjBp99Pu7r3wDJn')
arkouda.connect(connect_url='tcp://localhost:5555?token=dcxCQntDQllquOsBNjBp99Pu7r3wDJn')
Note: once a client has successfully connected to an authentication-enabled arkouda_server, the token is cached in the user's $ARKOUDA_HOME .arkouda/tokens.txt file. As long as the arkouda_server token remains the same, the user can connect without specifying the token via the access_token parameter or token url argument.
Logging toc
The Arkouda server features a Chapel logging framework that prints out the module name, function name and line number for all logged messages. An example is shown below:
2021-04-15:06:22:59 [ConcatenateMsg] concatenateMsg Line 193 DEBUG [Chapel] creating pdarray id_4 of type Int64
2021-04-15:06:22:59 [ServerConfig] overMemLimit Line 175 INFO [Chapel] memory high watermark = 44720 memory limit = 30923764531
2021-04-15:06:22:59 [MultiTypeSymbolTable] addEntry Line 127 DEBUG [Chapel] adding symbol: id_4
Available logging levels are ERROR, CRITICAL, WARN, INFO, and DEBUG. The default logging level is INFO where all messages at the ERROR, CRITICAL, WARN, and INFO levels are printed. The log level can be set globally by passing in the --logLevel parameter upon arkouda_server startup. For example, passing the --logLevel=LogLevel.DEBUG parameter as shown below sets the global log level to DEBUG:
./arkouda_server --logLevel=LogLevel.DEBUG
In addition to setting the global logging level, the logging level for individual Arkouda modules can also be configured. For example, to set MsgProcessing to DEBUG for the purposes of debugging Arkouda array creation, pass the MsgProcessing.logLevel=LogLevel.DEBUG parameter upon arkouda_server startup as shown below:
./arkouda_server --MsgProcessing.logLevel=LogLevel.DEBUG --logLevel=LogLevel.WARN
In this example, the logging level for all other Arkouda modules will be set to the global value WARN.
Type Checking in Arkouda toc
Both static and runtime type checking are becoming increasingly popular in Python, especially for large Python code bases such as those found at dropbox. Arkouda uses mypy for static type checking and typeguard for runtime type checking.
(click to see more)
Enabling runtime as well as static type checking in Python starts with adding type hints, as shown below to a method signature:
def connect(server : str="localhost", port : int=5555, timeout : int=0,
access_token : str=None, connect_url=None) -> None:
mypy static type checking can be invoked either directly via the mypy command or via make:
$ mypy arkouda
Success: no issues found in 16 source files
$ make mypy
python3 -m mypy arkouda
Success: no issues found in 16 source files
Runtime type checking is enabled at the Python method level by annotating the method if interest with the @typechecked decorator, an example of which is shown below:
@typechecked
def save(self, prefix_path : str, dataset : str='array', mode : str='truncate') -> str:
Type checking in Arkouda is implemented on an "opt-in" basis. Accordingly, Arkouda continues to support duck typing for parts of the Arkouda API where type checking is too confining to be useful. As detailed above, both runtime and static type checking require type hints. Consequently, to opt-out of type checking, simply leave type hints out of any method declarations where duck typing is desired.
Environment Variables toc
The various Arkouda aspects (compilation, run-time, client, tests, etc.) can be configured using a number of environment variables (env vars). See the ENVIRONMENT documentation for more details.
Versioning toc
Beginning after tag v2019.12.10
versioning is now performed using Versioneer
which determines the version based on the location in git
.
An example using a hypothetical tag 1.2.3.4
git checkout 1.2.3.4
python -m arkouda |tail -n 2
>> Client Version: 1.2.3.4
>> 1.2.3.4
# If you were to make uncommitted changes and repeat the command you might see something like:
python -m arkouda|tail -n 2
>> Client Version: 1.2.3.4+0.g9dca4c8.dirty
>> 1.2.3.4+0.g9dca4c8.dirty
# If you commit those changes you would see something like
python -m arkouda|tail -n 2
>> Client Version: 1.2.3.4+1.g9dca4c8
>> 1.2.3.4+1.g9dca4c8
In the hypothetical cases above Versioneer tells you the version and how far / how many commits beyond the tag your repo is.
When building the server-side code the same versioning information is included in the build. If the server and client do not match you will receive a warning. For developers this is a useful reminder when you switch branches and forget to rebuild.
# Starting the arkouda when built from tag 1.2.3.4 shows the following in the startup banner
arkouda server version = 1.2.3.4
# If you built from an arbitrary branch the version string is based on the derived coordinates from the "closest" tag
arkouda server version = v2019.12.10+1679.abc2f48a
# The .dirty extension denotes a build from uncommitted changes, or a "dirty branch" in git vernacular
arkouda server version = v2019.12.10+1679.abc2f48a.dirty
For maintainers, creating a new version is as simple as creating a tag in the repository; i.e.
git checkout master
git tag 1.2.3.4
python -m arkouda |tail -n 2
>> Client Version: 1.2.3.4
>> 1.2.3.4
git push --tags
Contributing to Arkouda toc
If you'd like to contribute, please see CONTRIBUTING.md.