How to sort RDP tick history files by size

The following is a sample response from pulling a file-set via the RDP CFS APIs. Can you advise how to use 'jq' to sort the response by the value of 'size'? Thanks.

{
  "value": [
    {
      "id": "4a8e-1abb-6fd8c40c-94f4-352733f33ebc",
      "name": "salesforce-1-XXX_normalised_89568BBAF27C4E2EAC292CC28D80E875",
      "bucketName": "TICKHISTORY_CUSTOM_ARCHIVE",
      "packageId": "4a13-95e9-c37783d6-bba0-4851594758d0",
      "attributes": [
        {
          "name": "fileName",
          "value": "hmh tas 2013"
        },
        {
          "name": "size",
          "value": "1858351473"
        },
        {
          "name": "view",
          "value": "normalised"
        },
        {
          "name": "message_id",
          "value": "e9d7e3cb-e0ff-4f4c-8150-d327a0f9e51b"
        },
        {
          "name": "md5",
          "value": "1de22361eebb6993217e5da10e9a4798"
        },
        {
          "name": "releaseTime",
          "value": "2022-08-17T10:45:28Z"
        }
      ],
      "files": [
        "4478-d443-7f611230-bdf4-86c64a1deb05",
        "45d2-4891-60c79c56-a3bd-71ddb0ba2425",
        "4761-13f8-a64f6b32-ae82-250cc77c6b99"
      ],
      "numFiles": 3,
      "availableFrom": "2022-08-17T12:07:02Z",
      "availableTo": "2023-01-13T18:26:30Z",
      "status": "READY",
      "created": "2022-08-17T12:07:03Z",
      "modified": "2022-08-17T12:07:04Z"
    },
    {
      "id": "4aa5-3b50-d2924fcb-8633-032e977e0041",
      "name": "salesforce-1-XXX_normalised_37316F8A011E432AAA540E5868132330",
      "bucketName": "TICKHISTORY_CUSTOM_ARCHIVE",
      "packageId": "4a13-95e9-c37783d6-bba0-4851594758d0",
      "attributes": [
        {
          "name": "md5",
          "value": "e55e6731f5b008d28e1c070436bc5ebb"
        },
        {
          "name": "view",
          "value": "normalised"
        },
        {
          "name": "message_id",
          "value": "b6a715b3-e1f5-4b1f-bdb4-3aded9173939"
        },
        {
          "name": "releaseTime",
          "value": "2022-08-17T10:45:26Z"
        },
        {
          "name": "fileName",
          "value": "XXX index tas 2007"
        },
        {
          "name": "size",
          "value": "1197364640"
        }
      ],
      "files": [
        "41c3-80e1-f008c970-b117-72dfa0d856c3",
        "4565-eed3-1caf7c98-9872-5dd0fe328f9f",
        "45be-0607-b8b66964-8f94-2369ce109094"
      ],
      "numFiles": 3,
      "availableFrom": "2022-08-17T11:03:06Z",
      "availableTo": "2023-01-13T16:34:39Z",
      "status": "READY",
      "created": "2022-08-17T11:03:08Z",
      "modified": "2022-08-17T11:03:10Z"
    },
    {
      "id": "4ab0-7d5e-fefa8fd5-98e2-6a34b035cfff",
      "name": "salesforce-1-XXX_normalised_BB8B299D6C634AF580CE9DAC7C5F28A8",
      "bucketName": "TICKHISTORY_CUSTOM_ARCHIVE",
      "packageId": "4a13-95e9-c37783d6-bba0-4851594758d0",
      "attributes": [
        {
          "name": "md5",
          "value": "9552ea275e7eb1987ee1627f845a3a4b"
        },
        {
          "name": "releaseTime",
          "value": "2022-08-17T10:45:56Z"
        },
        {
          "name": "message_id",
          "value": "9e719be7-0dde-4a01-b954-7f4ab9348d16"
        },
        {
          "name": "size",
          "value": "2850978035"
        },
        {
          "name": "view",
          "value": "normalised"
        },
        {
          "name": "fileName",
          "value": "hcus tas 2021"
        }
      ],
      "files": [
        "4708-70f8-41d7d449-b522-2c3694b62e79",
        "4c5c-d187-3e002a0b-a6a5-6c957685a0a8",
        "4f18-c3cb-7d8daf29-acff-1d1b01b18030"
      ],
      "numFiles": 3,
      "availableFrom": "2022-08-17T12:09:12Z",
      "availableTo": "2023-01-18T09:48:47Z",
      "status": "READY",
      "created": "2022-08-17T12:09:13Z",
      "modified": "2022-08-17T12:09:14Z"
    }
  ]
}


Tags: tick-history-rest-api, refinitiv-data-platform, trth-rest-api, client-file-store

Hello @Tony.Zhu_R ,

Thank you for your participation in the forum.

Is the reply below satisfactory in resolving your query?

If yes please click the 'Accept' text next to the most appropriate reply. This will guide all community members who have a similar question.

Otherwise please post again offering further insight into your question.


Thanks,

-AHS

Please be informed that a reply has been verified as correct in answering the question, and has been marked as such.

Thanks,
AHS

Accepted

@Tony.Zhu_R

Please try this query:

jq-win64.exe ".value |= sort_by(.attributes | from_entries | .size | tonumber)"

To reverse the order, please use this query.

jq-win64.exe ".value |= sort_by(.attributes | from_entries | .size | tonumber) | .value |= reverse"
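As an additional, untested sketch along the same lines: if you also want to keep only the filesets above a size threshold (for example the 100 GBytes mentioned elsewhere in this thread, taken here as 100000000000 bytes), a select can be combined with the same from_entries lookup:

jq-win64.exe ".value |= map(select((.attributes | from_entries | .size | tonumber) > 100000000000))"

The from_entries step turns the attributes name/value pairs into a plain object, so .size can be read directly and compared as a number.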

Hello @Tony.Zhu_R ,

I think you are requesting FileSets:

https://{{RDP_HOST}}/file-store/{{RDP_VERSION}}/file-sets?bucket={{RDP_TH_Bucket}}&attributes=venue:{{CFS_VBD_VENUE}},view:normalised

This makes sorting on file size challenging.

Try requesting File details by Fileset id:

https://{{RDP_HOST}}/file-store/{{RDP_VERSION}}/files?filesetId={{fileSetId}}

The output should look like:

{
    "value": [
        {
            "id": "4709-44ae-ae48bed4-8651-6ca30681bec8",
            "filename": "LSE-2018-06-01-NORMALIZEDMP-Report-1-of-1.csv.gz",
            "filesetId": "4000-ee57-9e402927-8e3e-a967e20d26f9",
            "fileType": "File",
            "description": "Merged data file",
            "storageLocation": {
                "url": "https://s3-us-east-1.amazonaws.com/a205143-use1-prod-results-vbd/normalised/LSE/2018-06-01/data/merged/LSE-2018-06-01-NORMALIZEDMP-Report-1-of-1.csv.gz",
                "rolearn": "arn:aws:iam::259431915815:role/a205143-prod-push-mechanism-EdsCfsS3Access",
                "@type": "s3"
            },
            "created": "2022-06-25T02:33:09Z",
            "modified": "2022-06-25T02:33:09Z",
            "href": "https://api.refinitiv.com/file-store/v1/files/4709-44ae-ae48bed4-8651-6ca30681bec8/stream",
            "fileSizeInBytes": 904974,
            "md5": "1d9d418dd96e7659c055b857c2bfc0a9"
        },
        {
            "id": "497f-2e71-5796a24e-8162-917fef100c75",
            "filename": "LSE-2018-06-01-NORMALIZEDMP-Data-1-of-1.csv.gz",
            "filesetId": "4000-ee57-9e402927-8e3e-a967e20d26f9",
            "fileType": "File",
            "description": "Merged data file",
            "storageLocation": {
                "url": "https://s3-us-east-1.amazonaws.com/a205143-use1-prod-results-vbd/normalised/LSE/2018-06-01/data/merged/LSE-2018-06-01-NORMALIZEDMP-Data-1-of-1.csv.gz",
                "rolearn": "arn:aws:iam::259431915815:role/a205143-prod-push-mechanism-EdsCfsS3Access",
                "@type": "s3"
            },
            "created": "2022-06-25T02:33:09Z",
            "modified": "2022-06-25T02:33:09Z",
            "href": "https://api.refinitiv.com/file-store/v1/files/497f-2e71-5796a24e-8162-917fef100c75/stream",
            "fileSizeInBytes": 1067179413,
            "md5": "591c5c30619f26e046cfc8ea1857fa5a"
        }
    ]
}

This enables you to sort on fileSizeInBytes.
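For example, staying with jq, that output could be sorted directly on this numeric field with a filter like the following (untested sketch), reversing with an extra | .value |= reverse if needed:

jq-win64.exe ".value |= sort_by(.fileSizeInBytes)"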

Will this work for your use case, or are you looking for something different?


@zoya faberov

The JSON I posted above was obtained from

https://api.refinitiv.com/file-store/v1/file-sets?bucket=TICKHISTORY_CUSTOM_ARCHIVE

Normally it returns 25 filesetIds, but up to 100 when pageSize=100 is specified.

What I asked is how to use jq to sort those filesetIDs by their 'size' attribute.

{
          "name": "size",
          "value": "2850978035"
        },

For instance, if I need to filter out those filesetIDs with sizes of more than 100 GBytes, what filter should I use with jq when the total number of filesetIDs runs into the hundreds?


Hello @Tony.Zhu_R ,

Let me try to explain what I meant, hoping that the explanation will help clarify. I believe that the size you obtain from the call is not the size of a file, but rather the size of a fileset. For example:

https://{{RDP_HOST}}/file-store/{{RDP_VERSION}}/file-sets?bucket={{RDP_TH_Bucket}}&attributes=venue:{{CFS_VBD_VENUE}},view:normalised

Results in:

{
    "value": [
        {
            "id": "4000-0fef-8c14bb0e-ac4a-9d9356699d2f",
            "name": "LSE_normalised_2007-05-28",
            "bucketName": "TICKHISTORY_VBD_UNLIMITED",
            "packageId": "40eb-1240-f887f89f-b6e2-ff08ec2e2603",
            "attributes": [
                {
                    "name": "message_id",
                    "value": "f546fa27-7d1f-477a-89a6-ba2f3ca325e2"
                },
                {
                    "name": "view",
                    "value": "normalised"
                },
                {
                    "name": "depth",
                    "value": "UNLIMITED"
                },
                {
                    "name": "size",
                    "value": "23001"
                },
                {
                    "name": "releaseTime",
                    "value": "2007-05-29T04:00:00.000000000Z"
                },
                {
                    "name": "venue",
                    "value": "LSE"
                },
                {
                    "name": "md5",
                    "value": "565390e25c255a712e3d42ef790e07f0"
                }
            ],
            "files": [
                "4414-fdfd-35d0e262-be9a-664b4ea09787",
                "44d8-267c-5a5af2c9-adbe-9abe81c1c060"
            ],
            "numFiles": 2,

Whereas:

https://{{RDP_HOST}}/file-store/{{RDP_VERSION}}/files?filesetId=4000-0fef-8c14bb0e-ac4a-9d9356699d2f

Results in:

{
    "value": [
        {
            "id": "4414-fdfd-35d0e262-be9a-664b4ea09787",
            "filename": "LSE-2007-05-28-NORMALIZEDMP-Report-1-of-1.csv.gz",
            "filesetId": "4000-0fef-8c14bb0e-ac4a-9d9356699d2f",
            "fileType": "File",
            "description": "Merged data file",
            "storageLocation": {
                "url": "https://s3-us-east-1.amazonaws.com/a205143-use1-prod-results-vbd/normalised/LSE/2007-05-28/data/merged/LSE-2007-05-28-NORMALIZEDMP-Report-1-of-1.csv.gz",
                "rolearn": "arn:aws:iam::259431915815:role/a205143-prod-push-mechanism-EdsCfsS3Access",
                "@type": "s3"
            },
            "created": "2021-04-22T15:33:18Z",
            "modified": "2021-04-22T15:33:18Z",
            "href": "https://api.refinitiv.com/file-store/v1/files/4414-fdfd-35d0e262-be9a-664b4ea09787/stream",
            "fileSizeInBytes": 22560,
            "md5": "fda504b3a6efa56a34db998e2128b7f0"
        },
        {
            "id": "44d8-267c-5a5af2c9-adbe-9abe81c1c060",
            "filename": "LSE-2007-05-28-NORMALIZEDMP-Data-1-of-1.csv.gz",
            "filesetId": "4000-0fef-8c14bb0e-ac4a-9d9356699d2f",
            "fileType": "File",
            "description": "Merged data file",
            "storageLocation": {
                "url": "https://s3-us-east-1.amazonaws.com/a205143-use1-prod-results-vbd/normalised/LSE/2007-05-28/data/merged/LSE-2007-05-28-NORMALIZEDMP-Data-1-of-1.csv.gz",
                "rolearn": "arn:aws:iam::259431915815:role/a205143-prod-push-mechanism-EdsCfsS3Access",
                "@type": "s3"
            },
            "created": "2021-04-22T15:33:18Z",
            "modified": "2021-04-22T15:33:18Z",
            "href": "https://api.refinitiv.com/file-store/v1/files/44d8-267c-5a5af2c9-adbe-9abe81c1c060/stream",
            "fileSizeInBytes": 441,
            "md5": "4da4c501efd0f042bb7d1c1b5130306c"
        }
    ]
}

23001 = 22560+441
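That sum can also be cross-checked with jq against the /files response; this sketch should print 23001:

jq-win64.exe "[.value[].fileSizeInBytes] | add"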

Before starting to parse this: are you looking for the sizes of files, or the sizes of FileSets, which can potentially contain multiple files?

If you are looking for the sizes of FileSets, you can paginate over your results, concatenate them into a dataframe, and derive the fileset sizes from it.

If you are looking for the sizes of files, you will need an extra step: paginate through the FileSets results from the bucket, then request either File details by FilesetID or File details by FileID. Both will contain the sizes of the discrete files. You could concatenate these results into a dataframe and obtain all the file sizes from it.
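Alternatively, if you prefer to stay with jq for that step, one untested sketch (assuming each per-fileset /files response has been saved to its own file; files1.json and files2.json are hypothetical names) is to slurp the responses together and list the per-file sizes in order:

jq-win64.exe -s "[.[].value[] | {filename, fileSizeInBytes}] | sort_by(.fileSizeInBytes)" files1.json files2.json

The -s option reads all the input files into a single array, so the combined file list can be sorted in one pass.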

At this time, I am not seeing any files in TICKHISTORY_CUSTOM_ARCHIVE, but this is likely because I lack the permissions to access this bucket.

Does this explanation make sense, and is this what you observe on your side?



Hi @Tony.Zhu_R ,

I would like to mention another approach that you may find of interest: RD Library Python.

Please see example:

https://github.com/Refinitiv-API-Samples/Example.DataLibrary.Python/blob/main/Examples/3-Delivery/3.03-Bulk/EX-3.03.02-Bulk-ClientFileStore-TickHistory.ipynb

Section: Browse Tick History buckets.

I believe this will fulfill your requirement, except for pagination.

Additionally see example:

https://github.com/Refinitiv-API-Samples/Example.DataLibrary.Python/blob/main/Examples/3-Delivery/3.03-Bulk/EX-3.03.01-Bulk-ClientFileStore.ipynb

Section: Get FileSets with pagination



Thanks for sharing the above, @zoya faberov.

It's not what I asked for, though.

Endpoint: https://api.refinitiv.com/file-store/v1/file-sets?bucket=TICKHISTORY_CUSTOM_ARCHIVE&pageSize=3&attributes=view%3Anormalised

Given that the number of filesets in the RTH HMC bucket varies, I take 3 as an example, with the following response.

{
  "value": [
    {
      "id": "4030-fd1e-db11637d-b277-19bbd95c73ac",
      "name": "salesforce-1-XXX_normalised_3C4C0EC0D2A643188BB5E9C1F8C329C0",
      "bucketName": "TICKHISTORY_CUSTOM_ARCHIVE",
      "packageId": "4a13-95e9-c37783d6-bba0-4851594758d0",
      "attributes": [
        {
          "name": "size",
          "value": "3634152545"
        },
        {
          "name": "md5",
          "value": "612d8ee7b77605bc342f148c16db2929"
        },
        {
          "name": "message_id",
          "value": "a5c136ba-4821-4898-a511-18bfa607c86f"
        },
        {
          "name": "releaseTime",
          "value": "2022-08-17T10:45:56Z"
        },
        {
          "name": "fileName",
          "value": "hmh tas 2015"
        },
        {
          "name": "view",
          "value": "normalised"
        }
      ],
      "files": [
        "4745-b059-944935e3-a2c1-1f47875d39a1",
        "4b21-037a-58a832d2-b486-ff831a2e223e",
        "4bb1-dc27-81b6d485-8cbc-9854df8c651b"
      ],
      "numFiles": 3,
      "availableFrom": "2022-08-17T12:07:46Z",
      "availableTo": "2023-01-18T08:13:10Z",
      "status": "READY",
      "created": "2022-08-17T12:07:46Z",
      "modified": "2022-08-17T12:07:47Z"
    },
    {
      "id": "4039-98c8-91ff43cc-95c8-646c7f51f25c",
      "name": "salesforce-1-XXX_normalised_1D80CBA8BE694370B21F08080015732C",
      "bucketName": "TICKHISTORY_CUSTOM_ARCHIVE",
      "packageId": "4a13-95e9-c37783d6-bba0-4851594758d0",
      "attributes": [
        {
          "name": "message_id",
          "value": "94344e73-7d97-4c8a-a773-5246edfeaac3"
        },
        {
          "name": "md5",
          "value": "51c1b3242c42307a6b51383e7d9e2f65"
        },
        {
          "name": "releaseTime",
          "value": "2022-08-17T10:45:27Z"
        },
        {
          "name": "fileName",
          "value": "hsi tas 2017"
        },
        {
          "name": "view",
          "value": "normalised"
        },
        {
          "name": "size",
          "value": "11812238421"
        }
      ],
      "files": [
        "42b7-420b-b2e9cb81-9340-15f12d2e1b52",
        "43d9-8883-8c8649e2-a96c-eaab9b70bed3",
        "4e8e-c1a1-60ce42ad-bcf2-9cfa55b8578a"
      ],
      "numFiles": 3,
      "availableFrom": "2022-08-17T12:05:46Z",
      "availableTo": "2023-01-13T17:37:03Z",
      "status": "READY",
      "created": "2022-08-17T12:05:46Z",
      "modified": "2022-08-17T12:05:47Z"
    },
    {
      "id": "4061-5f65-2a3a0753-aab7-9f2abaec0016",
      "name": "salesforce-1-XXX_normalised_88D37A2541B44CEEBE715A1300C4F175",
      "bucketName": "TICKHISTORY_CUSTOM_ARCHIVE",
      "packageId": "4a13-95e9-c37783d6-bba0-4851594758d0",
      "attributes": [
        {
          "name": "releaseTime",
          "value": "2022-08-17T10:45:56Z"
        },
        {
          "name": "message_id",
          "value": "b94f669b-a925-4cca-ba7c-911d2811223a"
        },
        {
          "name": "size",
          "value": "17227956"
        },
        {
          "name": "view",
          "value": "normalised"
        },
        {
          "name": "md5",
          "value": "c8db431e72af8ae0599bf9eead6a2a48"
        },
        {
          "name": "fileName",
          "value": "hmce tas 2005"
        }
      ],
      "files": [
        "4842-db68-deb53105-ad1b-48c4126d38d4",
        "4cd5-e0e2-543e2fd3-b8c8-0738f180eb24",
        "4ed5-1054-3b985648-8301-682f0c61e3ba"
      ],
      "numFiles": 3,
      "availableFrom": "2022-08-17T12:08:20Z",
      "availableTo": "2023-01-18T08:33:00Z",
      "status": "READY",
      "created": "2022-08-17T12:08:21Z",
      "modified": "2022-08-17T12:08:21Z"
    }
  ]
}

It has nothing to do with the exact file size, whether of the fileset or of its constituent files. The question is what kind of filter to use with 'jq' to sort the above response by the attribute below.

 {
          "name": "size",
          "value": "17227956"
        },

For instance, I would like to see all the filesets in order of size, largest to smallest, and vice versa.

So, any idea what kind of 'jq' filter can achieve it?


Hi @Tony.Zhu_R ,

A quick disclaimer: I am not permissioned for the same content in the custom archive bucket, so I am testing with TICKHISTORY_VBD_UNLIMITED, which I believe is structured the same as the custom bucket. If you see any discrepancies in the format on your side, please let me know.

What I would like to discuss is not the only approach to sorting this result, but possibly the simplest: putting the result into a dataframe. An added benefit is that if you require additional transformations afterwards, you can do them on the dataframe as well, which makes them simpler too.

import pandas as pd

df = pd.json_normalize(jsonFullResp['value'])

Next, I create a new column, size, containing the size value converted to an integer, so that the column sorts on numeric values rather than on strings.

for index, row in df.iterrows():
    # Pick the 'size' entry out of the attributes name/value list
    size_value = [sub['value'] for sub in row['attributes'] if sub['name'] == 'size'][0]
    print(index, size_value)
    # Store it as an integer so the column sorts numerically rather than as strings
    df.loc[index, 'size'] = int(size_value)

My result looks thus:

size.gif

And now I can sort it:

df2 = df.sort_values(by=['size'])
df2

Getting:

size2.gif

Please let us know if this is what you are looking to do?




Hello @Tony.Zhu_R ,

Was this answer helpful to you?

Did you have a chance to try the suggestion since, and how did it work on your side?


Hi @Tony.Zhu_R ,

I cannot test with your exact result at the moment, but I would expect the larger result sets not to be available in a single request, and to require iterative pagination via a skip token.

If that is the case, we can concatenate each new page of results onto the total results dataframe, and sort by integer size only once the complete result has been obtained.
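The same idea also works with jq alone: save each page of the /file-sets response to its own file (page1.json and page2.json are hypothetical names here), slurp them together, and sort the combined list once, for example with this untested sketch:

jq-win64.exe -s "[.[].value[]] | sort_by(.attributes | from_entries | .size | tonumber)" page1.json page2.json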

Let us know if this is something that you find to be required and would like to discuss further.


@zoya faberov

As raised initially, I'm looking for a filter to run with jq to sort the filesets by their 'size'.

Can you use the above sample of 3 filesets and sort them by size with jq instead of pandas?
