How to get the full news story in a CSV as plain text

I have fetched the news stories into a dataframe and saved them to a CSV. The data contains HTML tags. Is there a way I can get the stories in plain text?

import eikon as ek
import pandas as pd

syntax1 = "(Hospital OR Health Center OR Medical center OR health system OR university hospital OR Emergency Department OR Inpatient OR Rehabilitat OR ICU ) AND ( build OR reopen OR construct OR expansion OR upgrade OR develop OR repurpose OR modern )"
df = ek.get_news_headlines(syntax1, 100, date_from="2021-03-25T00:00:00", date_to="2021-04-10T00:00:00")

# Fetch the full story (returned as HTML) for each headline.
# Rows are collected in a list because DataFrame.append was removed in pandas 2.0.
rows = []
for index, headline_row in df.iterrows():
    story = ek.get_news_story(headline_row['storyId'])
    rows.append({'DATE': index, 'STORY': story})
stories = pd.DataFrame(rows, columns=['DATE', 'STORY']).set_index('DATE')
result = pd.concat([df, stories], axis=1)
result.to_csv("news.csv")

The result dataframe looks like this. I want to get rid of the html tags.

[Screenshot: 1617811496770.png]

Accepted answers

jason.ramchandani01

@alankar.gupta Our news stories are delivered as HTML, so you can use a package like Beautiful Soup (BS4) to strip out the HTML markup, hyperlinks, etc. and keep only the text. Please see this article on how to do it. I hope this helps.
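As a rough illustration of the Beautiful Soup approach, here is a minimal sketch. It assumes the `beautifulsoup4` package is installed; the helper name `html_to_text` and the sample HTML are made up for the example and are not taken from the linked article.

```python
import pandas as pd
from bs4 import BeautifulSoup


def html_to_text(html):
    """Return the text of an HTML fragment with the tags removed."""
    if not isinstance(html, str):
        return html  # leave missing values (None / NaN) untouched
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements whose content is not meant to be read as story text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # separator=" " stops text from adjacent elements running together;
    # strip=True trims whitespace around each piece of text.
    return soup.get_text(separator=" ", strip=True)


# Made-up sample standing in for the 'STORY' column filled by ek.get_news_story
sample = pd.DataFrame({
    "STORY": ['<div class="storyContent"><p>New <b>hospital</b> wing opens.</p>'
              '<p>See <a href="https://example.com">details</a>.</p></div>']
})
sample["STORY"] = sample["STORY"].apply(html_to_text)
print(sample.loc[0, "STORY"])
```

In the original script the same helper would be applied with result['STORY'] = result['STORY'].apply(html_to_text) just before result.to_csv("news.csv").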

All comments

jason.ramchandani01

A simple solution is to use the html2text library to extract the text from the HTML:

import html2text
...

result = pd.concat([df, stories], axis=1)
result['STORY'] = result['STORY'].apply(html2text.html2text)
result.to_csv("news.csv")
