Message:
Hello,
I am currently working on filtering a large dataset containing Reuters news from the past 20 years, which I've loaded into a Pandas DataFrame. The dataset has over 2 million news records, sourced monthly from an SFTP server. My objective is to isolate news items related to the European Central Bank (ECB) and its interest rate decisions.
Details:
- Hostname: archive.news.refinitiv.com
- Username: GE-A-01103867-3-15059
- Filepath: /News/RTRS/Monthly/
- Data Format: JSON
I am specifically looking to select articles tagged with "ECB" or "M:I" in the data.subjects column, but exclude any tagged with "ECB/INT".
Current Method: My current approach uses the following Pandas code snippet:
df_clean = df[
(df['data.subjects'].str.contains('M:I|ECB', na=False)) &
(~df['data.subjects'].str.contains('ECB/INT', na=False))
]
Issue: Despite these tags being standard and correctly formatted according to the official guide, the filter returns an empty DataFrame, and there are no error messages that indicate what might be wrong.
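One possible explanation, offered only as a guess: if the JSON was flattened so that each data.subjects cell holds a Python list of codes rather than a single string, then .str.contains cannot match those cells, and na=False silently turns every row into False. The sketch below uses a small made-up frame (the rows are hypothetical, not taken from the real feed) to show how the stored types could be checked:

```python
import pandas as pd

# Hypothetical toy frame; assumes subjects were loaded as lists of codes
df = pd.DataFrame({
    "data.subjects": [["M:I", "ECB"], ["ECB/INT"], None],
})

# Count which Python types are actually stored in the column
type_counts = df["data.subjects"].map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# On list cells the string match yields a missing value, which na=False maps to False
mask = df["data.subjects"].str.contains("M:I|ECB", na=False)
print(int(mask.sum()))
```

If the type count on the real DataFrame shows list instead of str, that would be consistent with an empty result and no error message.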
Questions:
- Is there an error in how I'm applying the filter conditions?
- Could there be an unseen issue with how the DataFrame is structured or how the data is being read into Pandas?
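To make the second question concrete, here is a minimal sketch of a variant of the filter that tolerates list-valued cells. It assumes, without confirmation, that data.subjects may contain lists, plain strings, or missing values; the helper name to_text and the sample rows (including the placeholder code "OTHER") are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows mixing lists, a missing value, and a plain string
df = pd.DataFrame({
    "data.subjects": [
        ["M:I", "ECB"],   # wanted
        ["ECB/INT"],      # excluded tag
        None,             # missing
        ["OTHER"],        # unrelated placeholder code
        "M:I ECB",        # already a string
    ],
})

def to_text(value):
    """Flatten a cell to one string so that .str.contains can be applied."""
    if isinstance(value, (list, tuple, set)):
        return " ".join(str(v) for v in value)
    if isinstance(value, str):
        return value
    return ""  # None, NaN, or any other type is treated as "no subjects"

subjects_text = df["data.subjects"].map(to_text)

# Same conditions as the original filter, applied to the flattened text
df_clean = df[
    subjects_text.str.contains("M:I|ECB", na=False)
    & ~subjects_text.str.contains("ECB/INT", na=False, regex=False)
]
print(df_clean.index.tolist())
```

This keeps the substring semantics of the original snippet, so any code that merely contains "ECB" still matches; an exact-match version would compare individual list elements instead.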
Any assistance in adjusting the code or troubleshooting this issue would be greatly appreciated.
Thank you!