How to Optimize Large-Scale Daily Time Series Extraction

I have a script that extracts daily time series data and static data for the individual constituents of the ETF IEAC.L over a 10-year period. Currently, I’m fetching the data day by day for each of roughly 4000 constituents, which makes the process very slow — a test for a single day already took about 1 hour.

My current approach:

 

import time
import pandas as pd
import refinitiv.data as rd

# Configuration
etf_ric = "IEAC.L"
snapshot_date = "2025-04-30"  
start_ts = "2015-04-30"       
end_ts = "2025-04-30"

 

# Holding and static fields
fields_holdings = ["TR.FundHoldingRIC", "TR.FundHoldingName", "TR.FundLatestFilingDate"]

 

static_fields = [
    "TR.CommonName", "TR.ISIN", "TR.FIIssuerName", "TR.TRBCEconomicSector", 
    "TR.TRBCBusinessSector", "TR.TRBCIndustryGroup", "TR.FiCountryName", 
    "TR.FIMaturityDate", "TR.FiFaceIssuedTotal", "TR.FiIssueDate",
    "TR.FiIssuerCountry", "TR.FILiquidationRank", "TR.FI.Group", "TR.FI.Rank", 
    "TR.CouponRate", "TR.FiCouponFrequency", "TR.FiCouponFrequencyDescription", 
    "TR.FiFirstCouponDate", "TR.FiLastCouponDate", "TR.ADF_COUPON", 
    "TR.FiMaturityStandardYield", "TR.MACAULAYDURATION"
]

 

# --- 1. Extract constituents from a single date  ---
parameters = {
    "SDate": snapshot_date,
    "EDate": snapshot_date,
    "Frq": "D",
    "EndNum": 4000
}

 

holdings = rd.get_data([etf_ric], fields=fields_holdings, parameters=parameters)

 

df_holdings = holdings.dropna(subset=["Holding RIC"])
df_holdings = df_holdings[df_holdings["Holding RIC"].str.strip() != ""]
print(f"Total unique holdings found: {df_holdings['Holding RIC'].nunique()}")

 

# --- 2. Static data ---
unique_rics = df_holdings["Holding RIC"].astype(str).unique().tolist()

 

def fetch_static_data(rics, fields, chunk_size=500):
    dfs = []
    for i in range(0, len(rics), chunk_size):
        chunk = rics[i:i + chunk_size]
        try:
            df_chunk = rd.get_data(chunk, fields)
            dfs.append(df_chunk)
        except Exception as e:
            print(f"Error static data: {e}")
        time.sleep(1)
    return pd.concat(dfs, ignore_index=True)

 

df_static = fetch_static_data(unique_rics, static_fields)

 

# --- 3. Daily time series ---
ts_data_list = []

 

for ric in unique_rics:
    try:
        df_ts = rd.get_history(universe=ric, start=start_ts, end=end_ts)
        if df_ts is not None and not df_ts.empty:
            df_ts = df_ts.reset_index()
            df_ts["RIC"] = ric
            ts_data_list.append(df_ts)
            print(f"OK: {ric}")
    except Exception as e:
        print(f"Error time series {ric}: {e}")
    time.sleep(1)

 

df_ts_all = pd.concat(ts_data_list, ignore_index=True)

 

# --- 4. Merge and Export ---
df_final = df_ts_all.merge(df_static, how="left", left_on="RIC", right_on="Instrument")
df_final_sorted = df_final.sort_values(by=["Date", "RIC"]).reset_index(drop=True)

 

df_final_sorted.to_csv("IEAC_static_and_timeseries_data.csv", index=False)
print("File saved: IEAC_static_and_timeseries_data.csv")

To summarize, the workflow is:

  • Get the ETF holdings snapshot for a given date (~4000 instruments)
  • Retrieve static data for all constituents in chunks
  • Retrieve daily historical data for each constituent individually over 10 years
  • Merge and export the data

 

I’m wondering if there are more efficient ways to speed up the extraction, for example by using batch requests, multithreading or multiprocessing, or any built-in API features I might be missing.
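
For illustration, the kind of batching I have in mind for the time-series step is sketched below. This is only a rough idea: I haven't verified how many instruments a single get_history() call accepts on a desktop session, and the chunk size of 50 is an arbitrary guess on my part.

def fetch_history_in_chunks(rics, start, end, chunk_size=50):
    # Request history for several RICs per call instead of one RIC at a time
    frames = []
    for i in range(0, len(rics), chunk_size):
        chunk = rics[i:i + chunk_size]
        try:
            df_chunk = rd.get_history(universe=chunk, start=start, end=end)
            if df_chunk is not None and not df_chunk.empty:
                frames.append(df_chunk)
        except Exception as e:
            print(f"Error for chunk starting at index {i}: {e}")
        time.sleep(1)  # crude pacing between requests
    return frames

# ts_chunks = fetch_history_in_chunks(unique_rics, start_ts, end_ts)

One thing I'm unsure about is the shape of the result when multiple instruments are requested (the columns seem to become a MultiIndex keyed by RIC), so the merge with the static data in step 4 would need to be adapted.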

Would appreciate any advice on optimizing this workflow or examples of best practices with the Refinitiv Data API for this kind of large-scale data extraction.

Thanks!

Answers

  • Hello @BeatrizRogerio

    You didn't specify which session is being used here. The scope of this data is beyond what should be extracted through the Workspace desktop. You will probably need an enterprise-grade product such as Tick History or RDP bulk.

    If you are using a Workspace desktop session, I would advise you to stay within the data extraction limits and guidelines described in this document.

  • Hi Gurpreet, thank you for your reply.

    To clarify, I am currently using a Workspace desktop session — the session is opened in my code with:

    import refinitiv.data as rd 
    rd.open_session()
    

    Given that, I understand I’m working within the limits of the Workspace-based LSEG Data Library (LDL). I also reviewed the limits document you referenced.

    About the workload, I'm trying to retrieve:

    • ~4000 ETF constituents
    • Static data for each constituent (batched in chunks of 500)
    • 10 years of daily time series per constituent, using rd.get_history() (so potentially >10 million rows total)

    My questions:

    Is there any way to make this workflow more efficient within the Workspace context, such as:

    1. Batch time series requests for multiple RICs in a single get_history() call?
    2. Use of ThreadPoolExecutor or parallel requests to better utilize the 5 req/sec / 50MB/minute allowance?
    3. Any built-in throttling strategy or retry handler you recommend? (A rough sketch of what I'm considering for points 2 and 3 follows below.)
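
    Here is the rough sketch I mentioned for points 2 and 3, reusing rd, pd, unique_rics, start_ts and end_ts from the script above. It is purely illustrative: max_workers=4, the 0.25 s pacing between submissions, and the retry/backoff numbers are guesses on my part rather than recommended values, and I don't know whether the library is meant to be called from multiple threads at all.

    import time
    import pandas as pd
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def fetch_one(ric, start, end, retries=3, backoff=2.0):
        # Retry a single get_history() call with simple exponential backoff
        for attempt in range(retries):
            try:
                df = rd.get_history(universe=ric, start=start, end=end)
                if df is not None and not df.empty:
                    df = df.reset_index()
                    df["RIC"] = ric
                    return df
                return None
            except Exception as e:
                wait = backoff * (2 ** attempt)
                print(f"{ric}: attempt {attempt + 1} failed ({e}), retrying in {wait:.0f}s")
                time.sleep(wait)
        return None

    results = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = []
        for ric in unique_rics:
            futures.append(pool.submit(fetch_one, ric, start_ts, end_ts))
            time.sleep(0.25)  # ~4 submissions per second, below the 5 req/sec allowance
        for fut in as_completed(futures):
            df = fut.result()
            if df is not None:
                results.append(df)

    df_ts_all = pd.concat(results, ignore_index=True) if results else pd.DataFrame()

    Does this kind of pattern stay within the guidelines, or would you recommend a different throttling/retry approach?
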
  • Jirapongse

    @BeatrizRogerio

    You may refer to the answer on this discussion.