How to get the full news story in a CSV as plain text

I have fetched the news stories into a dataframe and saved it to a CSV. The story text contains HTML tags. Is there a way I can get the story in plain text?


import eikon as ek
import pandas as pd

# Assumes the Eikon session is already configured (ek.set_app_key has been called).
syntax1 = "(Hospital OR Health Center OR Medical center OR health system OR university hospital OR Emergency Department OR Inpatient OR Rehabilitat OR ICU ) AND ( build OR reopen OR construct OR expansion OR upgrade OR develop OR repurpose OR modern )"
df = ek.get_news_headlines(syntax1, 100, date_from="2021-03-25T00:00:00", date_to="2021-04-10T00:00:00")
# Fetch the HTML body of every headline. Rows are collected in a list because
# DataFrame.append was removed in pandas 2.0.
rows = []
for index, headline_row in df.iterrows():
    story = ek.get_news_story(headline_row['storyId'])
    rows.append({'DATE': index, 'STORY': story})
stories = pd.DataFrame(rows, columns=['DATE', 'STORY']).set_index('DATE')
result = pd.concat([df, stories], axis=1)
result.to_csv("news.csv")

The result dataframe looks like this. I want to get rid of the html tags.

Tags: eikon, eikon-data-api, python, refinitiv-dataplatform-eikon, workspace, workspace-data-api
[Attached screenshot: 1617811496770.png (40.7 KiB)]

@alankar.gupta Our news stories are delivered as HTML, so you can use a package like Beautiful Soup (bs4) to strip the HTML tags, hyperlinks, etc. and keep only the text. Please see this article on how to do it. I hope this helps.
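As a rough illustration of the Beautiful Soup approach, here is a minimal sketch. The helper name html_to_text, the list of block-level tags, and the sample HTML are assumptions made for this example; they do not come from the Eikon API or from the linked article, and real story markup may need a different tag list.

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Return the readable text of an HTML fragment, one line per block element."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements whose content is not meant to be read as text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Turn explicit line breaks into newlines.
    for br in soup.find_all("br"):
        br.replace_with("\n")
    # End each block-level element with a newline so paragraphs do not run together.
    for block in soup.find_all(["p", "div", "li", "tr", "h1", "h2", "h3"]):
        block.append("\n")
    # Trim every line and drop the empty ones left behind by the markup.
    lines = (line.strip() for line in soup.get_text().splitlines())
    return "\n".join(line for line in lines if line)


# Hypothetical story fragment, only for demonstration.
sample = (
    '<div class="storyContent"><p>Hospital to <b>expand</b> its ICU.</p>'
    '<p>See <a href="https://example.com">details</a>.<br/>Contact: press office</p></div>'
)
print(html_to_text(sample))
```

With the result dataframe from the question, the same function could be applied column-wise before saving, for example result['STORY'] = result['STORY'].apply(html_to_text), followed by result.to_csv("news.csv").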

A simple solution is to use the html2text library to extract the text from the HTML:

import html2text
...

result = pd.concat([df, stories], axis=1)

result['STORY'] = result['STORY'].apply(html2text.html2text)

result.to_csv("news.csv")
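If installing a third-party package is not an option, a rough equivalent can be built on the standard library's html.parser. This is a swapped-in technique, not what html2text does internally, and the names _TextExtractor and strip_html, the tag sets, and the sample input are assumptions made for this sketch.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skip <script>/<style>, and break lines at block tags."""

    BLOCK_TAGS = {"p", "div", "li", "tr", "h1", "h2", "h3"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Entities such as &amp; arrive already decoded (convert_charrefs=True).
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(html):
    """Return the text content of an HTML string with tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


# Hypothetical input, only for demonstration.
sample = (
    "<p>Clinic &amp; lab <i>reopen</i> in May.</p><script>var x = 1;</script>"
    "<p>More at <a href='https://example.com'>the site</a>.</p>"
)
print(strip_html(sample))
```

It plugs into the snippet above the same way: result['STORY'] = result['STORY'].apply(strip_html). Unlike html2text, it keeps no link targets or emphasis markers, only the text.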

