I am trying to download depth data from Tick History Market Depth / Legacy Market Depth using the REST API, following this post.
As you already know, depth data is quite large, so I would like to save the files by date, i.e. instead of having one giant file, I would have one file per day containing depth data for all stocks in the list. I would also love to separate by stock and then by date, but by date is fine for now.
How would I go about this? The following is my code. I use Python 3.6 with PyCharm, by the way, and I am not really good with Python; I normally use SAS.
1. The JSON file that specifies the fields to download
{ "ExtractionRequest": { "@odata.type": "#ThomsonReuters.Dss.Api.Extractions.ExtractionRequests.TickHistoryMarketDepthExtractionRequest", "ContentFieldNames": [ "Ask Price", "Ask Size", "Bid Price", "Bid Size" ], "IdentifierList": { "@odata.type": "#ThomsonReuters.Dss.Api.Extractions.ExtractionRequests.InstrumentListIdentifierList", "InstrumentListId":"0x06698c5d00301db4" }, "Condition": { "View": "NormalizedLL2", "NumberOfLevels": 5, "MessageTimeStampIn": "GmtUtc", "ReportDateRangeType": "Range", "QueryStartDate": "1996-01-01T00:00:00.000Z", "QueryEndDate": "2018-06-06T23:59:59.999Z", "DisplaySourceRIC": "True" } } }
2. And here is the code to run the request and get the data:
#!/usr/bin/python
# -*- coding: UTF-8 -*-

from json import dumps, loads, load
from requests import post
from requests import get
from time import sleep
from getpass import _raw_input as input
from getpass import getpass
from getpass import GetPassWarning
from collections import OrderedDict
import os
import gzip
import pandas as pd

_outputFilePath = "./"
_outputFileName = "TestOutput"
_retryInterval = int(30)  # value in seconds used by the polling loop to check the request status on the server
_jsonFileName = "TickHistoricalRequest.json"


def RequestNewToken(username="", password=""):
    _AuthenURL = "https://hosted.datascopeapi.reuters.com/RestApi/v1/Authentication/RequestToken"
    _header = {}
    _header['Prefer'] = 'respond-async'
    _header['Content-Type'] = 'application/json; odata.metadata=minimal'
    _data = {'Credentials': {
        'Password': password,
        'Username': username
    }}

    print("Send Login request")
    resp = post(_AuthenURL, json=_data, headers=_header)

    if resp.status_code != 200:
        message = "Authentication Error Status Code: " + str(resp.status_code) + " Message:" + dumps(loads(resp.text), indent=4)
        raise Exception(str(message))

    return loads(resp.text)['value']


def ExtractRaw(token, json_payload):
    try:
        _extractRawURL = "https://hosted.datascopeapi.reuters.com/RestApi/v1/Extractions/ExtractRaw"
        # Setup request header
        _header = {}
        _header['Prefer'] = 'respond-async'
        _header['Content-Type'] = 'application/json; odata.metadata=minimal'
        _header['Accept-Charset'] = 'UTF-8'
        _header['Authorization'] = 'Token ' + token

        # Post the HTTP request to the DSS server using the ExtractRaw URL
        resp = post(_extractRawURL, data=None, json=json_payload, headers=_header)

        # Print the status code returned in the HTTP response
        print("Status Code=" + str(resp.status_code))

        # Raise an exception with the error message if the returned status is not 202 (Accepted) or 200 (OK)
        if resp.status_code != 200:
            if resp.status_code != 202:
                message = "Error: Status Code:" + str(resp.status_code) + " Message:" + resp.text
                raise Exception(message)

            # Get the location from the header; the URL must be https, so change it using string replace
            _location = str.replace(resp.headers['Location'], "http://", "https://")
            print("Get Status from " + str(_location))

            _jobID = ""
            # Polling loop to check the request status every _retryInterval seconds
            while True:
                resp = get(_location, headers=_header)
                _pollstatus = int(resp.status_code)
                if _pollstatus == 200:
                    break
                else:
                    print("Status:" + str(resp.headers['Status']))
                sleep(_retryInterval)  # wait for _retryInterval seconds and re-request the status to check if it has completed

        # Get the JobId from the HTTP response
        json_resp = loads(resp.text)
        _jobID = json_resp.get('JobId')
        print("Status is completed, the JobID is " + str(_jobID) + "\n")

        # Check if the response contains Notes. If notes exist, print them to the console.
        if len(json_resp.get('Notes')) > 0:
            print("Notes:\n======================================")
            for var in json_resp.get('Notes'):
                print(var)
            print("======================================\n")

        # The request is completed, so get the result by passing the JobID to the RawExtractionResults URL
        _getResultURL = str("https://hosted.datascopeapi.reuters.com/RestApi/v1/Extractions/RawExtractionResults('" + _jobID + "')/$value")
        print("Retrieve result from " + _getResultURL)
        resp = get(_getResultURL, headers=_header, stream=True)

        # Write the output to a file
        outputfilepath = str(_outputFilePath + _outputFileName + str(os.getpid()) + '.csv.gz')
        if resp.status_code == 200:
            with open(outputfilepath, 'wb') as f:
                f.write(resp.raw.read())

        print("Write output to " + outputfilepath + " completed\n\n")
        print("Below is sample data from " + outputfilepath)
        # Read data from the csv.gz file and show output from the dataframe head() and tail()
        df = pd.read_csv(outputfilepath, compression='gzip')
        print(df.head())
        print("....")
        print(df.tail())

    except Exception as ex:
        print("Exception occurs:", ex)

    return


def main():
    try:
        # Request a new token
        print("Login to DSS Server")
        _DSSUsername = input('Enter DSS Username:')
        try:
            _DSSPassword = getpass(prompt='Enter DSS Password:')
            _token = RequestNewToken(_DSSUsername, _DSSPassword)
        except GetPassWarning as e:
            print(e)
        print("Token=" + _token + "\n")

        # Read the HTTP request body from the JSON file, so you can change the request in the JSON file instead
        queryString = {}
        with open(_jsonFileName, "r") as filehandle:
            queryString = load(filehandle, object_pairs_hook=OrderedDict)
        # print(queryString)

        ExtractRaw(_token, queryString)

    except Exception as e:
        print(e)


print(__name__)

if __name__ == "__main__":
    main()
Sorry for the long post!
TRTH delivers data for the instrument list and date range specified in the request, as a compressed CSV file.
To save this in separate files by date (and/or instrument) there are several possibilities, most of them based on decompressing the data, identifying when the date or instrument changes, and saving to files accordingly. How to do that?
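For example, here is a minimal sketch of the decompress-and-split approach using pandas. It assumes the extracted file contains '#RIC' and 'Date-Time' columns, which the normalized legacy market depth output should include; check the actual header of your file and adjust the names if they differ:

import os
import pandas as pd

# Split one big Tick History csv.gz into one CSV per day.
# Column names are assumptions: adjust '#RIC' and 'Date-Time' to match
# the actual header of your extracted file.
def split_by_date(input_gz, output_dir="./by_date"):
    os.makedirs(output_dir, exist_ok=True)
    # Read in chunks so the whole file never needs to fit in memory
    for chunk in pd.read_csv(input_gz, compression='gzip', chunksize=500000):
        chunk['Date'] = pd.to_datetime(chunk['Date-Time']).dt.date
        # To split by stock and then by date instead, group by ['#RIC', 'Date']
        for day, group in chunk.groupby('Date'):
            outfile = os.path.join(output_dir, str(day) + '.csv')
            # Append; write the header only when the file is first created
            group.drop(columns=['Date']).to_csv(
                outfile, mode='a', index=False,
                header=not os.path.isfile(outfile))

# split_by_date("TestOutput1234.csv.gz")  # use the actual file name written by your script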
Conclusion: if you really want to split the data into smaller chunks, I'd recommend the 4th (or 3rd) option.
Final comment: all that said, I do not understand why having a large file is a concern. Market depth is voluminous data; that's the way it is. Storing it in one big file is no more complex than using multiple files. Could you tell us why you want to do this?
Thank you very much for the reply. I have built code that loops over each day and requests data, i.e. it changes the date in the JSON file and runs the request again (roughly as outlined in the sketch below). I will try option 4, which I think is the best option.
The reason I want to separate them is that I will do some manipulation/calculation using SAS, and a big file takes a long time to run. If the computer freezes or the run is disrupted, I have to rerun everything, which is not efficient.
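A rough outline of that per-day loop (not my exact code; the date range and the username/password variables are placeholders, it reuses RequestNewToken and ExtractRaw from the script above, and ExtractRaw would need a small change so each day is written to its own file):

from copy import deepcopy
from datetime import date, timedelta
from json import load
from collections import OrderedDict

# Load the request template once, then rewrite the date range for each day
with open("TickHistoricalRequest.json", "r") as fh:
    template = load(fh, object_pairs_hook=OrderedDict)

token = RequestNewToken(username, password)        # placeholders for DSS credentials
start, end = date(2018, 6, 1), date(2018, 6, 6)    # example date range

d = start
while d <= end:
    payload = deepcopy(template)
    cond = payload["ExtractionRequest"]["Condition"]
    cond["QueryStartDate"] = d.isoformat() + "T00:00:00.000Z"
    cond["QueryEndDate"] = d.isoformat() + "T23:59:59.999Z"
    ExtractRaw(token, payload)    # would need to name the output file after d
    d += timedelta(days=1)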
@ducman.nguyen, ok, makes sense; thank you for your feedback. And I hope your computer does not freeze too often :-)
While we are on this topic, is it possible to request hourly snapshots of the depth data, instead of getting all of it? One potential way is to change the query time fields so that only data between, say, 11:00 and 11:01 is downloaded:
"QueryStartDate": "2018-06-06T11:00:00.000Z", "QueryEndDate": "2018-06-06T11:01:00.000Z",
However, this only gives me one snapshot, at 11:00, whereas I would like hourly snapshots at 10:00, 11:00, 12:00, etc.
What would be the best way?
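To make the idea concrete, what I picture is extending the per-day loop above to one request per hourly window, along these lines (just a sketch, reusing template, token and ExtractRaw from the earlier code):

from copy import deepcopy

# One extraction request per hourly snapshot window (10:00-10:01, 11:00-11:01, ...)
day = "2018-06-06"
for hour in range(10, 16):                           # example: 10:00 through 15:00
    payload = deepcopy(template)                     # 'template' loaded from the JSON file, as in the per-day loop
    cond = payload["ExtractionRequest"]["Condition"]
    cond["QueryStartDate"] = "{}T{:02d}:00:00.000Z".format(day, hour)
    cond["QueryEndDate"] = "{}T{:02d}:01:00.000Z".format(day, hour)
    ExtractRaw(token, payload)                       # 'token' obtained as before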
I'm sorry to say that is not possible for market depth. You can request Intraday Summaries to get hourly data, but that API call does not include market depth fields.