FMRD: possible ways to extract data from the jsonl files downloaded through s3 buckets?

I downloaded the FRMD file from S3 buckets like RFT-FMRD-Quote_Symbology-EQUFND-Delta, RFT-FMRD-MKT_EVNT-REF-Init. 
When unzipped these files were in .jsonl and each row was a json object.
What would be the possible ways to extract the data from this jsonl file to a more readable format. I did try extracting speicific (key, values) from each json obj row and save them to a csv file but that would take > 24 hrs and not sure if each row could have its own schema and i fear that i might miss some data if i try to choose specific (key,values)

please find this sample json row from both the files for reference.sample1.txtsample2.txt


Best Answer

  • raksina.samasiri
    Answer ✓

    Hi @akhilesh.naredla ,

    Have you tried using json.loads() in Python

    First, let's say the string is stored in the variable x

    1681106460045.png

    we're loading them into variable y

    y = json.loads(x)

    1681106525608.png

    Then with the variable y, we can access the value with the keys

    y['Reference']['Names']

    the output of the code is

    [{'Name': 'SmartPool Trading Sessions'}]


    Hope this helps and please let me know in case you have any further questions

Answers

  • To make this data more readable and break down the complex structure to simpler csv:

    import json
    import json
    import pandas as pd

    pd.set_option("display.max_columns", None)
    pd.set_option("display.max_rows", None)


    def flatten_dict(d, prefix=''):
    flat_dict = {}
    for key, value in d.items():
    new_key = f"{prefix}.{key}" if prefix else key
    if isinstance(value, dict):
    flat_dict.update(flatten_dict(value, prefix=new_key))
    elif isinstance(value, list) and all(isinstance(x, dict) for x in value):
    for i, v in enumerate(value):
    flat_dict.update(flatten_dict(v, prefix=f"{new_key}.{i}"))
    else:
    flat_dict[new_key] = value
    return flat_dict



    count = 1

    rows = []
    with open(jsonl_file_path, encoding="utf-8") as f:
    for line in f:
    json_obj = json.loads(line)
    print(count)
    if json_obj is not None:
    flat_data = flatten_dict(json_obj)
    rows.append(flat_data)
    count += 1

    df = pd.DataFrame(rows)


    df.to_csv(csv_file.csv)