Error Reading Sharded Zarrs on S3 #79

Description

@kkoz

Python zarr version: 3.2.1
zarr-java version: 0.1.3

Sharded Zarr arrays that can be read locally without issue fail when the same data is read from S3, with the error "The checksum of the sharding index is invalid".

Example

I created a simple 2D Zarr array with the following Python code:

  import numpy as np
  import zarr

  size = 1024
  # 1024x1024 array, 256x256 shards, 64x64 inner chunks (16 chunks per shard)
  zarrcube = zarr.create_array(
      store='/path/to/testsquare.zarr',
      shape=(size, size),
      shards=(size//4, size//4),
      chunks=(size//16, size//16),
      dtype='uint8')
  zarrcube[:, :] = np.random.randint(0, 256, (size, size))

I then uploaded the data to a publicly-readable S3 bucket in us-east-1 using the following commands:

cd testsquare.zarr
aws s3 sync . 's3://my-public-zarr-bucket/shardtest/testsquare.zarr'
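
To rule out corruption during upload, the local files can be compared against the uploaded objects. The following is a minimal sketch, not something that was part of the original test: `local_md5s` is a hypothetical helper name, and the comparison assumes each object was uploaded in a single part without SSE-KMS or SSE-C encryption, in which case the S3 ETag is the MD5 hex digest of the object.

```python
import hashlib
from pathlib import Path


def local_md5s(root: str) -> dict[str, str]:
    """Map each file under root (relative POSIX path) to its MD5 hex digest."""
    base = Path(root)
    return {
        p.relative_to(base).as_posix(): hashlib.md5(p.read_bytes()).hexdigest()
        for p in sorted(base.rglob("*"))
        if p.is_file()
    }
```

Each digest returned by `local_md5s('/path/to/testsquare.zarr')` can then be compared by hand with the ETag that S3 reports for the object under the same relative key. A mismatch would point at the upload rather than at the reader.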

I then attempted to read both the local and remote arrays using the following Java code:

        String localPath = "/path/to/testsquare.zarr";

        URI endpoint = new URI("https://s3.us-east-1.amazonaws.com");

        S3ClientBuilder clientBuilder = S3Client.builder()
                .httpClientBuilder(UrlConnectionHttpClient.builder()
                        .socketTimeout(Duration.ofMinutes(5)));
        clientBuilder.endpointOverride(endpoint);
        clientBuilder.region(Region.US_EAST_1);

        S3Configuration s3Config = S3Configuration.builder().pathStyleAccessEnabled(true)
            .build();
        clientBuilder.serviceConfiguration(s3Config);

        clientBuilder.credentialsProvider(AnonymousCredentialsProvider.create());

        S3Client client = clientBuilder.build();

        S3Store store = new S3Store(client, "gs-public-zarr-dev", "shardtest");
        Array s3Array = Array.open(store.resolve("testsquare.zarr"));

        Array localArray = Array.open(localPath);

        for (int i = 0; i < 10; i++) {
            long[] offset = new long[] {100L * i, 100L * i};
            long[] shape = new long[] {100L, 100L};
            localArray.read(offset, shape);
            s3Array.read(offset, shape);
        }

The following exception was thrown:

java.lang.RuntimeException: dev.zarr.zarrjava.ZarrException: The checksum of the sharding index is invalid. Stored: 1384733839 Computed: -246033701
	at dev.zarr.zarrjava.core.Array.lambda$read$2(Array.java:437)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
	at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:992)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
	at java.base/java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:290)
	at java.base/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:754)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
	at java.base/java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:686)
	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:159)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:173)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
	at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
	at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:765)
	at dev.zarr.zarrjava.core.Array.read(Array.java:407)
	at dev.zarr.zarrjava.core.Array.read(Array.java:344)
	at com.glencoesoftware.omero.zarr.ZarrPixelBufferTest.testSharding(ZarrPixelBufferTest.java:336)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:569)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
	at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
	at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
	at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
	at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:93)
	at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:40)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:520)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:748)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:443)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:211)
Caused by: dev.zarr.zarrjava.ZarrException: The checksum of the sharding index is invalid. Stored: 1384733839 Computed: -246033701
	at dev.zarr.zarrjava.v3.codec.core.Crc32cCodec.decode(Crc32cCodec.java:40)
	at dev.zarr.zarrjava.core.codec.CodecPipeline.decode(CodecPipeline.java:114)
	at dev.zarr.zarrjava.v3.codec.core.ShardingIndexedCodec.decodeInternal(ShardingIndexedCodec.java:205)
	at dev.zarr.zarrjava.v3.codec.core.ShardingIndexedCodec.decodePartial(ShardingIndexedCodec.java:254)
	at dev.zarr.zarrjava.core.codec.CodecPipeline.decodePartial(CodecPipeline.java:95)
	at dev.zarr.zarrjava.core.Array.lambda$read$2(Array.java:422)
	... 42 more
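
For anyone trying to narrow this down, the stored and computed checksums can be checked independently of zarr-java. The following is a minimal sketch under assumptions I have not verified against this dataset: the shard index sits at the end of the shard, it is encoded as little-endian uint64 (offset, nbytes) pairs, and it is followed by a 4-byte little-endian CRC-32C. The function names `crc32c` and `check_shard_index` are made up for illustration, and the CRC is a slow pure-Python implementation to avoid extra dependencies.

```python
import struct


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def check_shard_index(shard: bytes, chunks_per_shard: int) -> tuple[int, int]:
    """Return (stored, computed) CRC-32C for an index assumed to be at the end of the shard."""
    index_len = chunks_per_shard * 16  # two uint64 values (offset, nbytes) per inner chunk
    tail = shard[-(index_len + 4):]    # index followed by its 4-byte checksum
    index_bytes, checksum_bytes = tail[:index_len], tail[index_len:]
    stored = struct.unpack("<I", checksum_bytes)[0]
    return stored, crc32c(index_bytes)
```

For the array above there are 16 inner chunks per shard (256/64 = 4 along each axis), so the call would be `check_shard_index(shard_bytes, 16)`, where `shard_bytes` is the full content of one shard file or S3 object. Note that the values in the exception are printed as signed 32-bit integers, so the computed value -246033701 corresponds to 4048933595 when read as unsigned.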

Please let me know if there's any other information you need from me.
