WARN Utils: An error occurred while trying to read the S3 bucket lifecycle configuration java.lang.NullPointerException #346

@pedromb

Hello, I am getting this warning:

WARN Utils$: An error occurred while trying to read the S3 bucket lifecycle configuration
java.lang.NullPointerException
        at java.lang.String.startsWith(String.java:1385)
        at java.lang.String.startsWith(String.java:1414)
        at com.databricks.spark.redshift.Utils$$anonfun$3.apply(Utils.scala:102)
        at com.databricks.spark.redshift.Utils$$anonfun$3.apply(Utils.scala:98)
        at scala.collection.Iterator$class.exists(Iterator.scala:753)
        at scala.collection.AbstractIterator.exists(Iterator.scala:1157)
        at scala.collection.IterableLike$class.exists(IterableLike.scala:77)
        at scala.collection.AbstractIterable.exists(Iterable.scala:54)
        at com.databricks.spark.redshift.Utils$.checkThatBucketHasObjectLifecycleConfiguration(Utils.scala:98)
        at com.databricks.spark.redshift.RedshiftWriter.saveToRedshift(RedshiftWriter.scala:361)
        at com.databricks.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:106)
        at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:222)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)

I have seen this issue here before, but it still occurs for me.

I do have a lifecycle configuration for my bucket. I've traced this warning to this piece of code:

def checkThatBucketHasObjectLifecycleConfiguration(
      tempDir: String,
      s3Client: AmazonS3Client): Unit = {
    try {
      val s3URI = createS3URI(Utils.fixS3Url(tempDir))
      val bucket = s3URI.getBucket
      assert(bucket != null, "Could not get bucket from S3 URI")
      val key = Option(s3URI.getKey).getOrElse("")
      val hasMatchingBucketLifecycleRule: Boolean = {
        val rules = Option(s3Client.getBucketLifecycleConfiguration(bucket))
          .map(_.getRules.asScala)
          .getOrElse(Seq.empty)
        rules.exists { rule =>
          // Note: this only checks that there is an active rule which matches the temp directory;
          // it does not actually check that the rule will delete the files. This check is still
          // better than nothing, though, and we can always improve it later.
          rule.getStatus == BucketLifecycleConfiguration.ENABLED && key.startsWith(rule.getPrefix)
        }
      }
      if (!hasMatchingBucketLifecycleRule) {
        log.warn(s"The S3 bucket $bucket does not have an object lifecycle configuration to " +
          "ensure cleanup of temporary files. Consider configuring `tempdir` to point to a " +
          "bucket with an object lifecycle policy that automatically deletes files after an " +
          "expiration period. For more information, see " +
          "https://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html")
      }
    } catch {
      case NonFatal(e) =>
        log.warn("An error occurred while trying to read the S3 bucket lifecycle configuration", e)
    }
  }

I believe the exception is thrown by this call:
key.startsWith(rule.getPrefix)

I checked the AWS SDK documentation: the getPrefix method returns null if the prefix wasn't set using the setPrefix method, so it will return null in this case. Calling startsWith with a null argument then throws the NullPointerException shown in the stack trace.

I have very limited knowledge of the AWS SDK and Scala, so I'm not really sure about this.
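
For illustration, here is a minimal sketch of what a null-safe version of that check could look like. It is not the library's actual fix. Rule is a hypothetical stand-in for the SDK's lifecycle rule class so that the snippet runs without the AWS SDK, and treating a rule with no prefix as matching every key is an assumption on my part.

```scala
// Hypothetical stand-in for the SDK's lifecycle rule class; it models only
// the two values the check reads. prefix may be null, mirroring getPrefix.
final case class Rule(status: String, prefix: String)

object LifecycleCheck {
  // Stand-in for BucketLifecycleConfiguration.ENABLED (assumed value).
  val Enabled = "Enabled"

  // Null-safe predicate: Option(null) is None, and None.forall(...) is true,
  // so a rule without a prefix is treated as matching every key (assumption).
  def matches(key: String, rule: Rule): Boolean =
    rule.status == Enabled && Option(rule.prefix).forall(p => key.startsWith(p))

  // True if at least one enabled rule covers the given key.
  def hasMatchingRule(key: String, rules: Seq[Rule]): Boolean =
    rules.exists(rule => matches(key, rule))
}
```

With this shape, a call such as LifecycleCheck.hasMatchingRule("tmp/part-0000", Seq(Rule("Enabled", null))) returns true instead of throwing, while a rule whose prefix does not match the key, or whose status is not enabled, still returns false.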
