
exec /opt/.sagemakerinternal/conda/kgw_variant: exec format error #28

Open
tomasborrella opened this issue Dec 29, 2022 · 6 comments

Comments

@tomasborrella

tomasborrella commented Dec 29, 2022

I've been trying to use my own custom image in SageMaker Studio, and I always get the same error when I try to attach a notebook to the kernel (as shown in SageMaker Studio):

Failed to start kernel
Failed to launch app [xxxxxxx-ml-t3-medium-176855f5e1df73edeeb33d0c81f0]. InternalFailure

And, the only log event seen in CloudWatch for the image is:

exec /opt/.sagemakerinternal/conda/kgw_variant: exec format error.

At first I thought it might be a bug in my container, so I carefully checked all the steps described in DEVELOPMENT.md.

Finally, to check whether the problem was in the platform itself, I tried the echo-kernel-image and python-poetry-image examples from the repository without any modifications, and the same error occurred with both.

I also tried the tf23-image example, and it works. I compared the images, and I suspect the difference is the base image (the operating system used in the container): echo-kernel-image and python-poetry-image are based on Debian, while tf23-image is based on Ubuntu. Could that be the issue?

Could you please confirm that the echo-kernel-image and python-poetry-image images work as-is in the current SageMaker Studio version?

@tomasborrella
Author

In the end, the problem was that I was building the image on a Mac with an Apple Silicon (M1) chip, so the image was built for arm64, an architecture that is not compatible with the instance SageMaker Studio runs it on.

The solution is simply to specify the architecture in the Dockerfile or in the build command:
--platform=linux/amd64
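To make the fix harder to forget, a build script could always pin the target platform and warn when the host is not x86_64 (on an Apple Silicon Mac, Python's `platform.machine()` reports `arm64`). The sketch below is a hypothetical helper, not part of this repository, and it assumes the Studio instance is x86_64 (linux/amd64), as the ml.t3.medium instance in the error above is.

```python
from __future__ import annotations

import platform
import shlex

# Assumption: the Studio instance (e.g. ml.t3.medium) is x86_64, i.e. linux/amd64.
TARGET_PLATFORM = "linux/amd64"

# Names platform.machine() may report on an x86_64 host.
X86_64_NAMES = {"x86_64", "amd64"}


def is_cross_build(machine: str | None = None) -> bool:
    """Return True if the host CPU differs from the target (e.g. 'arm64' on an M1 Mac)."""
    machine = (machine or platform.machine()).lower()
    return machine not in X86_64_NAMES


def docker_build_cmd(tag: str, context: str = ".") -> list[str]:
    """Hypothetical helper: docker build arguments that always pin the target platform."""
    return ["docker", "build", f"--platform={TARGET_PLATFORM}", "-t", tag, context]


if __name__ == "__main__":
    if is_cross_build():
        print(f"Host is {platform.machine()}; building for {TARGET_PLATFORM} anyway.")
    # Print the command; pass the list to subprocess.run() to actually build.
    print(shlex.join(docker_build_cmd("sm")))
```

Pinning the platform unconditionally, rather than only on ARM hosts, keeps the build output identical on every developer machine.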

I'm closing the issue. I hope the error message in the title helps someone in the future. As a suggestion, it would be useful to specify the architecture in the Dockerfile or to add a note about it in the README.

@manas86

manas86 commented Jan 18, 2023

@tomasborrella thanks for your prompt reply. I should have mentioned earlier that I already specify the platform (in the build command below), but I still get the same error. It's worth sharing my Dockerfile. The only other thing I can think of is the FileSystemConfig mount path, /home/sagemaker-user; maybe changing it to /root would help, but I have not tried it yet.
The build command was:
docker build -t sm . --platform=linux/amd64
Dockerfile:

FROM python:3.7

ARG NB_USER="sagemaker-user"
ARG NB_UID="1000"
ARG NB_GID="100"

RUN apt-get update && \
    apt-get install -y sudo && \
    useradd -m -s /bin/bash -N -u $NB_UID $NB_USER && \
    chmod g+w /etc/passwd && \
    echo "${NB_USER}    ALL=(ALL)    NOPASSWD:    ALL" >> /etc/sudoers && \
    # Prevent apt-get cache from being persisted to this layer.
    rm -rf /var/lib/apt/lists/*

RUN apt-get update && \
    apt-get install -y openjdk-11-jdk-headless && \
    apt-get clean

RUN pip install packaging jupyter ipykernel awscli sagemaker boto3
RUN python -m ipykernel install --sys-prefix

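# The JDK directory name includes the Debian architecture: for a linux/amd64 build it is -amd64, not -arm64.
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/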
ENV SHELL=/bin/bash

USER $NB_UID

@tomasborrella
Author

tomasborrella commented Jan 18, 2023

I specify the platform in the Dockerfile, in the FROM instruction (first line):
FROM --platform=linux/amd64 python:3.7
I'm not sure whether it behaves the same when passed to the build command, so give it a try.
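Whichever way the platform is set, one way to confirm it took effect is to check the built image before pushing it: `docker image inspect` prints a JSON array whose entries include `Os` and `Architecture` fields. The check below is a hypothetical sketch, and it assumes the target is linux/amd64.

```python
from __future__ import annotations

import json
import subprocess

# Assumption: SageMaker Studio runs the image on x86_64 hardware (linux/amd64).
EXPECTED = ("linux", "amd64")


def image_platform(inspect_json: str) -> tuple[str, str]:
    """Extract (Os, Architecture) from `docker image inspect` JSON output."""
    info = json.loads(inspect_json)[0]  # one array entry per inspected image
    return info["Os"], info["Architecture"]


def check_image(tag: str) -> None:
    """Hypothetical pre-push check: fail locally instead of with 'exec format error' in Studio."""
    out = subprocess.run(
        ["docker", "image", "inspect", tag],
        check=True, capture_output=True, text=True,
    ).stdout
    os_name, arch = image_platform(out)
    if (os_name, arch) != EXPECTED:
        raise SystemExit(f"{tag} is {os_name}/{arch}; rebuild with --platform=linux/amd64")
    print(f"{tag} is {os_name}/{arch}, OK")
```

For example, `check_image("sm")` would stop with an error for an image built natively on an M1 Mac, which reports `arm64`.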

Regarding the user's home directory and IDs, make sure you are adding the image settings correctly; they should look like this:

{
  "AppImageConfigName": "custom-poetry-kernel-image-config",
  "KernelGatewayImageConfig": {
    "KernelSpecs": [
      {
        "Name": "python3",
        "DisplayName": "Python 3 (poetry)"
      }
    ],
    "FileSystemConfig": {
      "MountPath": "/home/sagemaker-user",
      "DefaultUid": 1000,
      "DefaultGid": 100
    }
  }
}
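The `DefaultUid` and `DefaultGid` values should match the user created in the Dockerfile (`NB_UID=1000`, `NB_GID=100`), and in this config the mount path is that user's home directory. A hypothetical check, not part of this repository or of the SageMaker API, could compare the two before the config is registered. It assumes the mount path is meant to be the user's home directory, as above.

```python
from __future__ import annotations

# Assumed values, copied from the Dockerfile ARGs (NB_USER, NB_UID, NB_GID).
DOCKERFILE_USER = {"home": "/home/sagemaker-user", "uid": 1000, "gid": 100}


def fs_config_mismatches(app_image_config: dict, user: dict = DOCKERFILE_USER) -> list[str]:
    """Hypothetical check: list differences between FileSystemConfig and the image's user."""
    fs = app_image_config["KernelGatewayImageConfig"].get("FileSystemConfig", {})
    expected = {
        "MountPath": user["home"],
        "DefaultUid": user["uid"],
        "DefaultGid": user["gid"],
    }
    # One message per field that does not match the image's user.
    return [
        f"{key}: config has {fs.get(key)!r}, image expects {value!r}"
        for key, value in expected.items()
        if fs.get(key) != value
    ]
```

If the returned list is empty, the same dictionary can be passed as keyword arguments to boto3's SageMaker `create_app_image_config` call, since its top-level keys are that call's parameter names.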

@manas86

manas86 commented Jan 18, 2023

Thanks, I think it's working. The only thing I changed was the instance type. Be aware that even though it works fine, it still logs the same error in CloudWatch Logs. I don't know why, but it's working.

@ystoneman

This doesn't seem to have been added to the README or the Dockerfile. Maybe we should reopen this until it has been done, @tomasborrella?

@tomasborrella
Author

tomasborrella commented Feb 23, 2023

I'm reopening the issue (as suggested) because it would be nice to have this information added to the README or the Dockerfile.

tomasborrella reopened this on Feb 23, 2023

3 participants