Discussion:
[MarkLogic Dev General] AWS S3 Object Count
Prasanth N V R
2014-07-25 14:23:22 UTC
Hi,

I am trying to get the number of objects in an AWS S3 bucket using XQuery via MarkLogic.

One option I tried was listing the objects in the bucket.

But it returns at most 1000 keys in a single call.

Is there a way to get the total number of objects in an AWS S3 bucket
using XQuery?
Or are there any MarkLogic APIs available for this?


Thanks,
Prasanth
Kash Badami
2014-07-25 14:28:51 UTC
Prasanth,

This is a limitation of the underlying S3 API that the XQuery modules wrap.

http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html

As the S3 documentation notes, this API returns at most
1000 keys per request. Use the NextMarker returned in the response XML to
get the subsequent list of objects.
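
A minimal sketch of the marker-following loop described above, written in Python against a simulated listing so that it runs without AWS credentials. The names `count_objects` and `make_fake_bucket` are hypothetical and are not part of any MarkLogic or AWS library. One detail from the S3 documentation is modeled here: NextMarker is only returned when a delimiter is specified, so the loop falls back to the last key of the page as the next marker.

```python
from typing import Callable, List, Optional, Tuple

# A page is (keys, is_truncated, next_marker). next_marker may be None,
# mirroring S3 omitting NextMarker when no delimiter is given.
Page = Tuple[List[str], bool, Optional[str]]


def count_objects(list_page: Callable[[Optional[str]], Page]) -> int:
    """Count all keys by following the marker until the listing is not truncated."""
    total = 0
    marker: Optional[str] = None
    while True:
        keys, is_truncated, next_marker = list_page(marker)
        total += len(keys)
        if not is_truncated or not keys:
            return total
        # Fall back to the last key when the response carries no NextMarker.
        marker = next_marker or keys[-1]


def make_fake_bucket(all_keys: List[str],
                     page_size: int = 1000) -> Callable[[Optional[str]], Page]:
    """Hypothetical stand-in for a list call: sorted keys, page_size per page."""
    ordered = sorted(all_keys)

    def list_page(marker: Optional[str]) -> Page:
        # Only keys strictly after the marker are returned, as in S3.
        remaining = [k for k in ordered if marker is None or k > marker]
        page = remaining[:page_size]
        return page, len(remaining) > page_size, None

    return list_page


bucket = make_fake_bucket([f"doc-{i:05d}.xml" for i in range(2500)])
print(count_objects(bucket))  # 2500, gathered over three pages
```

In a real client, `list_page` would issue the signed list request and parse the keys and the truncation flag out of the response; the counting loop itself would stay the same.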

Hope this helps.

Regards,
kash.
_______________________________________________
General mailing list
http://developer.marklogic.com/mailman/listinfo/general
--
Kash Badami
571-220-6208
David Lee
2014-07-25 23:52:12 UTC
Which API are you using? Which ML version?
In V7 and later you can use
http://docs.marklogic.com/xdmp:filesystem-directory for S3:

xdmp:filesystem-directory(
   $pathname as xs:string
) as element(dir:directory)

using an S3 URL for the path, like "s3://mybucket/dir".
I believe that this will return the complete list, even if it's over 1000 entries, and if it doesn't I would like to know.
Note that this is a flat, not recursive, listing, although "recursive" is inaccurate for S3.
People tend to use S3 as if it were a directory structure, and it maps to that well, but under the hood
it's a flat key/value store. The filesystem-directory call uses S3 request options to simulate the equivalent
of a directory listing, i.e. it uses "/" as a key delimiter (a feature of S3 bucket listing). It is useful to realize
that S3 isn't really a hierarchical store. There are no "directories".
Interestingly, this is a near-perfect match to MarkLogic's use of the term "directory" in URLs. Unless you create a directory node, the term "directory" in the APIs is not literal. It's a convention that is often useful, but it's not actually how the system works. Like S3, ML is a flat key/value store as well.
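
To make the "directories are simulated" point concrete, here is a small Python sketch of how a prefix plus a "/" delimiter can turn a flat key list into something that looks like a one-level directory listing. It is an illustration of the idea only, not MarkLogic's or S3's implementation; `list_directory` and the sample keys are hypothetical.

```python
from typing import List, Set, Tuple


def list_directory(keys: List[str], prefix: str,
                   delimiter: str = "/") -> Tuple[List[str], List[str]]:
    """Split flat keys under `prefix` into direct entries and rolled-up prefixes."""
    files: List[str] = []
    subdirs: Set[str] = set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Everything past the first delimiter collapses into one prefix.
            subdirs.add(prefix + head + delimiter)
        elif rest:
            files.append(key)
    return sorted(files), sorted(subdirs)


keys = ["dir/a.xml", "dir/b.xml", "dir/sub/c.xml",
        "dir/sub/deep/d.xml", "other/e.xml"]
files, subdirs = list_directory(keys, "dir/")
print(files)    # ['dir/a.xml', 'dir/b.xml']
print(subdirs)  # ['dir/sub/']
```

Both keys under "dir/sub/" collapse into a single entry, which is why a listing of this kind is flat rather than recursive: counting every object below a prefix means either listing without a delimiter or descending into each rolled-up prefix in turn.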

Also realize there are good reasons for the underlying limits, for both ML and S3. S3 buckets have *no limit* on the number of keys/objects. By this I mean literally no limit until you run out of time or money.
In a practical sense that means at least millions, probably billions, but at some point IO to a single bucket becomes a bottleneck.
Realizing this helps in understanding AWS's seemingly arbitrary limits.
They want to guarantee that the requests work. What if you request a bucket list with a billion keys?
It's improbable that either AWS or you could actually handle that without things breaking. So nearly all AWS requests
that list things where the list can be unbounded use a pagination model.
Again, this is a near-exact match to MarkLogic APIs: most result sets (such as cts:search) can be (and usually should be) paginated, for the same reason. If the result has a billion items, it's impractical to return them all at once.




-----------------------------------------------------------------------------
David Lee
Lead Engineer
MarkLogic Corporation
***@marklogic.com
Phone: +1 812-482-5224
Cell: +1 812-630-7622
www.marklogic.com
