question

Upvotes
Accepted
51 7 17 13

How to import json files of metadata of Research reports?

My customer who contracted for both Research API and Research bulk feed(= historical data of real-time research reports) is now trying to import a metadata file of the Research bulk feed to python using the following scripts, but can't import it successfully.

Can you please advise on how to import the metadata file to python successfully?


Scripts:

import pandas as pd

df = pd.read_json("ctb4614_050_metadata.json", lines=True)


Returen:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-87067cc29fae> in <module>
      1 # df = pd.read_json("ctb4614_050_metadata.json", orient="split")
----> 2 df = pd.read_json("ctb4614_050_metadata.json", lines=True)
      3 # df = pd.read_json("ctb4614_050_metadata.json", lines=True, orient="values")
      4 # df = pd.read_json("ctb4614_050_metadata.zip", compression='infer')

/opt/anaconda3/lib/python3.7/site-packages/pandas/io/json/_json.py in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines, chunksize, compression)
    590         return json_reader
    591 
--> 592     result = json_reader.read()
    593     if should_close:
    594         try:

/opt/anaconda3/lib/python3.7/site-packages/pandas/io/json/_json.py in read(self)
    713         elif self.lines:
    714             data = ensure_str(self.data)
--> 715             obj = self._get_object_parser(self._combine_lines(data.split("\n")))
    716         else:
    717             obj = self._get_object_parser(self.data)

/opt/anaconda3/lib/python3.7/site-packages/pandas/io/json/_json.py in _get_object_parser(self, json)
    737         obj = None
    738         if typ == "frame":
--> 739             obj = FrameParser(json, **kwargs).parse()
    740 
    741         if typ == "series" or obj is None:

/opt/anaconda3/lib/python3.7/site-packages/pandas/io/json/_json.py in parse(self)
    847 
    848         else:
--> 849             self._parse_no_numpy()
    850 
    851         if self.obj is None:

/opt/anaconda3/lib/python3.7/site-packages/pandas/io/json/_json.py in _parse_no_numpy(self)
   1091         if orient == "columns":
   1092             self.obj = DataFrame(
-> 1093                 loads(json, precise_float=self.precise_float), dtype=None
   1094             )
   1095         elif orient == "split":

ValueError: Expected object or value
rdp-apirefinitiv-data-platformresearch
icon clock
10 |1500

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

1 Answer

Upvotes
Accepted
15.1k 28 8 12

Hello @hiroko.yamaguchi

The ctb4614_050_metadata.json file is not a valid JSON message format file. It contains a lot of "docID" nodes without array element (example {"docID": 1}, {"docID": 2}, {"docID":3}).

The application needs to add "[" and "]" to set all "docID" nodes in the same array element

 [{"docID": 1}, {"docID": 2}, {"docID":3}]

I have tried to add "[" at the first of the file and "]" at the end of file, then load with pd.read_json("ctb4614_050_metadata.json") and it works fine.


ahs-json-work.png (64.1 KiB)
icon clock
10 |1500

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

I could replicate it! Thank you!

On the other hand, they're provided with lots of such a file of metadata for all brokers they're entitled to, therefore, is there any way to add the "[" and "]" programmatically?

Hi @hiroko.yamaguchi
I think the first step is to verify if those metadata json files are the valid JSON message or not , then fix the file/content based on how errors or invalid format.


The Python application can use json.load() function to validate JSON file, but the file needs to be manual fix if it is invalid format.