Object stores offered by cloud service providers (CSPs), such as AWS S3, are a common place for Gluten users to store their data. This document covers the configuration options and use cases for using Gluten with object stores. To use an S3 endpoint as your data source, make sure the following S3 configurations are set in your spark-defaults.conf. If you run into issues authenticating to S3 with other authentication mechanisms, please reach out to us through the ‘Issues’ tab.

Working with S3

Configuring S3 endpoint

S3 supports endpoint-based access to files. The following is an example configuration; adjust the values to match your setup.

spark.hadoop.fs.s3a.impl                        org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.aws.credentials.provider    org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
spark.hadoop.fs.s3a.access.key                  XXXXXXXXX
spark.hadoop.fs.s3a.secret.key                  XXXXXXXXX
spark.hadoop.fs.s3a.endpoint                    https://s3.us-west-1.amazonaws.com
spark.hadoop.fs.s3a.connection.ssl.enabled      true
spark.hadoop.fs.s3a.path.style.access           false
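If you prefer not to keep the access key and secret key in spark-defaults.conf, the same settings can be passed on the command line. The following is a minimal sketch, not part of Gluten: the helper names s3a_endpoint_conf and to_submit_args are hypothetical, and reading the keys from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables is an assumption of this example.

```python
import os


def s3a_endpoint_conf(access_key, secret_key,
                      endpoint="https://s3.us-west-1.amazonaws.com"):
    """Return the endpoint-based S3A settings from the example as a dict."""
    return {
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.connection.ssl.enabled": "true",
        "spark.hadoop.fs.s3a.path.style.access": "false",
    }


def to_submit_args(conf):
    """Flatten a settings dict into spark-submit style '--conf key=value' arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Assumption: credentials come from the environment; placeholders otherwise.
    conf = s3a_endpoint_conf(
        os.environ.get("AWS_ACCESS_KEY_ID", "XXXXXXXXX"),
        os.environ.get("AWS_SECRET_ACCESS_KEY", "XXXXXXXXX"),
    )
    print(" ".join(to_submit_args(conf)))
```

The printed arguments can be appended to a spark-submit invocation. The endpoint default mirrors the example value and should be changed to match your region or S3-compatible service.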

Configuring S3 instance credentials

S3 can also be accessed with instance credentials. To use them, set the following configuration:

spark.hadoop.fs.s3a.use.instance.credentials true

Note that in this case, spark.hadoop.fs.s3a.endpoint won’t take effect, as Gluten will use the endpoint set during instance creation.
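Because an endpoint left over from an earlier setup is silently ignored in this mode, it can help to flag it when assembling settings. The following is a small sketch under that assumption; with_instance_credentials is a hypothetical helper that works on a plain dict of settings and is not a Gluten or Spark API.

```python
import warnings

ENDPOINT_KEY = "spark.hadoop.fs.s3a.endpoint"
INSTANCE_KEY = "spark.hadoop.fs.s3a.use.instance.credentials"


def with_instance_credentials(conf):
    """Return a copy of conf with instance credentials enabled.

    Emits a warning when an endpoint is also present, since that setting
    does not take effect when instance credentials are used.
    """
    result = dict(conf)
    result[INSTANCE_KEY] = "true"
    if ENDPOINT_KEY in result:
        warnings.warn(
            f"{ENDPOINT_KEY} is set but will not take effect "
            "with instance credentials"
        )
    return result
```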

Configuring S3 IAM roles

You can also use IAM role credentials by setting the following configurations. Instance credentials take priority over IAM role credentials.

spark.hadoop.fs.s3a.iam.role                xxxx
spark.hadoop.fs.s3a.iam.role.session.name   xxxx

Note that spark.hadoop.fs.s3a.iam.role.session.name is optional.
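As an illustration of the optional session name, the sketch below builds the IAM role settings and renders them as spark-defaults.conf lines. The helper names iam_role_conf and to_defaults_lines and the values my-role and my-session are hypothetical placeholders, not part of Gluten.

```python
def iam_role_conf(role, session_name=None):
    """Return the IAM role settings; the session name is added only if given."""
    conf = {"spark.hadoop.fs.s3a.iam.role": role}
    if session_name is not None:
        conf["spark.hadoop.fs.s3a.iam.role.session.name"] = session_name
    return conf


def to_defaults_lines(conf):
    """Render settings as whitespace-separated 'key value' lines with aligned values."""
    width = max(len(key) for key in conf) + 2
    return [f"{key.ljust(width)}{value}" for key, value in conf.items()]


if __name__ == "__main__":
    # Placeholder role and session name for illustration only.
    for line in to_defaults_lines(iam_role_conf("my-role", "my-session")):
        print(line)
```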

Other authentication methods are not supported yet.

Log granularity of AWS C++ SDK in Velox

You can change the log granularity of the AWS C++ SDK by setting the spark.gluten.velox.awsSdkLogLevel configuration. The allowed values are:

  • OFF
  • FATAL
  • ERROR
  • WARN
  • INFO
  • DEBUG
  • TRACE
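A typo in the level name is easy to miss, so a job launcher might validate the value before submitting. The following is a minimal sketch of such a check; aws_sdk_log_level_conf is a hypothetical helper, and normalizing the input to upper case is an assumption of this example that simply matches the spelling in the list above.

```python
ALLOWED_LEVELS = ("OFF", "FATAL", "ERROR", "WARN", "INFO", "DEBUG", "TRACE")


def aws_sdk_log_level_conf(level):
    """Return the log level setting, rejecting values outside the allowed list."""
    # Assumption: upper-case the input so it matches the documented spelling.
    normalized = level.strip().upper()
    if normalized not in ALLOWED_LEVELS:
        raise ValueError(
            f"unsupported log level {level!r}; "
            f"expected one of {', '.join(ALLOWED_LEVELS)}"
        )
    return {"spark.gluten.velox.awsSdkLogLevel": normalized}
```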

Local Caching support

Velox supports a local cache when reading data from S3. Please refer to the Velox Local Cache section for detailed configuration.



Copyright © 2024 The Apache Software Foundation, Licensed under the Apache License, Version 2.0. Apache Gluten, Gluten, Apache, the Apache feather logo, and the Apache Gluten project logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

Apache Gluten is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.
