Bug 1109277

Summary: Problem with RSLP Stemmer in application which uses nltk
Product: OpenShift Online
Reporter: Junior <juniorcaemj>
Component: Image
Assignee: Jakub Hadvig <jhadvig>
Status: CLOSED NOTABUG
QA Contact: libra bugs <libra-bugs>
Severity: unspecified
Priority: unspecified
Version: 1.x
CC: jokerman, mmccomas
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Doc Type: Bug Fix
Last Closed: 2014-06-17 22:40:18 UTC
Type: Bug
Attachments:
Picture of problem (flags: none)

Description Junior 2014-06-13 14:41:38 UTC
Created attachment 908598: Picture of problem

- Description of problem:

I hosted a Python web service application on OpenShift that uses the RSLP Stemmer module of NLTK, but the service log reported:

"[...] Resource 'stemmers/rslp/step0.pt' not found. Please use the NLTK Downloader to obtain the resource: >>> nltk.download()

Searched in:
 - '/var/lib/openshift/539a61ab5973caa2410000bf/nltk_data'
 - '/usr/share/nltk_data'
 - '/usr/local/share/nltk_data'
 - '/usr/lib/nltk_data'
 - '/usr/local/lib/nltk_data'  [...]  "

I concluded that the module is not installed properly and so I'm reporting the bug.
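
One way to narrow this down is to check whether the resource file exists in any of the searched directories. The following is a minimal sketch using only the standard library; the helper names `find_resource` and `candidate_dirs` are hypothetical, and the system directories are copied from the log above. It assumes NLTK_DATA may hold several directories separated by the platform path separator.

```python
import os

# Resource path taken from the error message in this report.
RSLP_RESOURCE = os.path.join("stemmers", "rslp", "step0.pt")

# System-wide directories listed in the "Searched in" section of the log.
SYSTEM_DIRS = [
    "/usr/share/nltk_data",
    "/usr/local/share/nltk_data",
    "/usr/lib/nltk_data",
    "/usr/local/lib/nltk_data",
]


def candidate_dirs(environ=None):
    """Return directories to check: NLTK_DATA entries first, then system paths."""
    environ = os.environ if environ is None else environ
    # Assumption: NLTK_DATA can list several directories joined by os.pathsep.
    from_env = [p for p in environ.get("NLTK_DATA", "").split(os.pathsep) if p]
    return from_env + SYSTEM_DIRS


def find_resource(search_dirs, resource=RSLP_RESOURCE):
    """Return the subset of search_dirs that actually contain the resource file."""
    return [d for d in search_dirs if os.path.isfile(os.path.join(d, resource))]
```

Calling `find_resource(candidate_dirs())` on the gear returns an empty list when the stemmer rules are missing everywhere, which points to missing data files rather than a broken nltk installation.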

- How reproducible:
Use the following code snippet:

import nltk
from nltk.stem import RSLPStemmer
stemmer = RSLPStemmer()


- Actual results:
The application does not work.

- Expected results:
The application should be working.

Comment 1 Jakub Hadvig 2014-06-17 22:40:18 UTC
Junior, the problem is that NLTK by default expects its corpus data in the user's home directory. Unfortunately, you cannot write to the home directory on OpenShift; you have to use $OPENSHIFT_DATA_DIR for storing data. To solve this problem, do the following:

1. Create an environment variable called NLTK_DATA with the value of $OPENSHIFT_DATA_DIR. After creating the environment variable, restart the app using the rhc app-restart command.
2. SSH into your application gear using the rhc ssh command.
3. Activate the virtual environment and download the corpus using the commands shown below.

# . $VIRTUAL_ENV/bin/activate
# curl https://raw.githubusercontent.com/sloria/TextBlob/dev/textblob/download_corpora.py | python
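
If only the RSLP stemmer data is needed, the download step can probably also be done from Python instead of the TextBlob script. This is a rough sketch, not a tested recipe for OpenShift: the helper names `resolve_data_dir` and `download_rslp` are hypothetical, and it assumes the NLTK data package identifier for the stemmer rules is `rslp`. It relies on `nltk.download` accepting a `download_dir` argument.

```python
import os


def resolve_data_dir(environ=None):
    """Pick the writable data directory: NLTK_DATA if set, else OPENSHIFT_DATA_DIR."""
    environ = os.environ if environ is None else environ
    data_dir = environ.get("NLTK_DATA") or environ.get("OPENSHIFT_DATA_DIR")
    if not data_dir:
        raise RuntimeError("Set NLTK_DATA or OPENSHIFT_DATA_DIR first")
    return data_dir


def download_rslp(data_dir):
    """Fetch the RSLP stemmer rules into data_dir (needs nltk and network access)."""
    import nltk  # imported lazily so the path helper works without nltk installed

    # Assumption: "rslp" is the package identifier for stemmers/rslp.
    return nltk.download("rslp", download_dir=data_dir)
```

Running `download_rslp(resolve_data_dir())` inside the activated virtual environment should place the files where the NLTK_DATA variable from step 1 points.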



There is also a blog post that covers this problem:
https://www.openshift.com/blogs/day-9-textblob-finding-sentiments-in-text

Comment 2 Junior 2014-06-18 00:16:52 UTC
Thanks for the help, Shekhar. I followed your instructions, but the URL was broken. However, the ability to create environment variables was useful: I created a folder containing the NLTK content I needed and set the NLTK_DATA environment variable to point to that folder. Again, thanks for the help.
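
A possible alternative to the environment variable is to register such a folder in code when the application starts. The sketch below is only an illustration: `register_data_dir` is a hypothetical helper, and the folder name `nltk_data` in the usage comment is an assumption. It works on any list of paths, so it can be passed `nltk.data.path`, the list NLTK consults when looking up resources.

```python
import os


def register_data_dir(search_path, data_dir):
    """Insert data_dir at the front of search_path (in place) unless already listed."""
    data_dir = os.path.abspath(data_dir)
    if data_dir not in search_path:
        search_path.insert(0, data_dir)
    return search_path


# Usage at application startup (requires nltk; folder name is hypothetical):
#   import nltk
#   bundled = os.path.join(os.path.dirname(__file__), "nltk_data")
#   register_data_dir(nltk.data.path, bundled)
```

The folder still has to follow the layout NLTK expects, for example `stemmers/rslp/step0.pt` for the resource named in the error message.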

Comment 3 Jakub Hadvig 2014-06-18 07:54:34 UTC
Junior, the correct URL is:
https://raw.githubusercontent.com/sloria/TextBlob/dev/textblob/download_corpora.py

This one is working and is the right one.

-Jakub