Fix cleaned data #11

Closed
wants to merge 8 commits into from

Gupta-Anubhav12
Member

Removed leading and trailing \n, verse markers such as ||1.1||, and extra whitespace, and replaced the data with the data from the XML to solve most of the -- issue.

@samanyougarg
Member

samanyougarg commented Dec 31, 2021

@Gupta-Anubhav12 @Amritpal2001 Thank you very much for taking on this task. It looks pretty clean now.

Just highlighting some potential minor issues.

  • In the English translations, most entries are missing a full stop (or other punctuation) at the end.


  • Same for Hindi translations and commentaries.

Example:

What it looks like now:

हे द्विजोत्तम! हमारे पक्षमें भी जो मुख्य हैं, उनपर भी आप ध्यान दीजिये। आपको याद दिलानेके लिये मेरी सेनाके जो नायक हैं, उनको मैं कहता हूँ

What it should be like:

हे द्विजोत्तम! हमारे पक्ष में भी जो मुख्य हैं, उनपर भी आप ध्यान दीजिये। आपको याद दिलाने के लिये मेरी सेना के जो नायक हैं, उनको मैं कहता हूँ।

  • For all the Sanskrit text, we have removed the verse number from the end along with the newline characters. Here's what a sample text looks like right now:

धृतराष्ट्र उवाच\n\nधर्मक्षेत्रे कुरुक्षेत्रे समवेता युयुत्सवः।\n\nमामकाः पाण्डवाश्चैव किमकुर्वत सञ्जय

Here's what it should look like:

धृतराष्ट्र उवाच |\n\nधर्मक्षेत्रे कुरुक्षेत्रे समवेता युयुत्सवः |\n\nमामकाः पाण्डवाश्चैव किमकुर्वत सञ्जय ||1||


I am not sure whether this comes directly from the data that we extracted, or whether the cleaning script is stripping the punctuation when it removes the newline characters.
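To illustrate the second possibility, here is a minimal sketch of a cleaning step that trims only whitespace and newlines and leaves dandas, full stops, and verse numbers alone. It is not the project's actual cleaning script; the names `clean_text`, `missing_terminal_punctuation`, and `TERMINAL_PUNCTUATION` are hypothetical and exist only for this example.

```python
# Hypothetical helper names; a sketch, not the repository's cleaning script.

# Characters treated as valid sentence/verse endings (an assumption for this sketch).
TERMINAL_PUNCTUATION = ("।", "॥", "|", ".", "!", "?")


def clean_text(raw: str) -> str:
    """Trim whitespace around the text and around each line, keeping punctuation."""
    # str.strip() with no argument removes only whitespace (including newlines),
    # so trailing "।" or "||1||" survives.
    lines = [line.strip(" \t\r") for line in raw.strip().split("\n")]
    return "\n".join(lines)


def missing_terminal_punctuation(text: str) -> bool:
    """Return True when the text does not end with a known terminal mark."""
    return not text.rstrip().endswith(TERMINAL_PUNCTUATION)
```

A check like `missing_terminal_punctuation` could be run over both the extracted data and the cleaned data: if the raw entries already lack the final mark, the problem is in the extraction rather than in the cleaning step.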

@Amritpal2001

Yes sir, we found the issue and fixed it. The script was removing only the full stop, not any other punctuation. If any issue comes up again, please tell us and we will fix it.
Thanks

@Gupta-Anubhav12
Member Author

I think we should not remove the ।।1.38।। marker.
It helps us find the verses in full-text search.
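
As a rough illustration of why the marker is useful, the sketch below pulls the chapter.verse identifier out of the text and builds a small lookup from it. This is only an assumption about how a search could use the marker, not how the project's search actually works; `VERSE_MARKER`, `verse_ids`, and `build_index` are hypothetical names.

```python
import re

# Matches markers such as ।।1.38।।, ||1.1||, ||1|| or ॥1.38॥ (hypothetical pattern).
VERSE_MARKER = re.compile(r"(?:[|।]{2}|॥)\s*(\d+(?:\.\d+)?)\s*(?:[|।]{2}|॥)")


def verse_ids(text: str) -> list[str]:
    """Return every verse identifier found in the text, in order of appearance."""
    return VERSE_MARKER.findall(text)


def build_index(texts: list[str]) -> dict[str, list[int]]:
    """Map each verse identifier to the positions of the texts that contain it."""
    index: dict[str, list[int]] = {}
    for position, text in enumerate(texts):
        for verse_id in verse_ids(text):
            index.setdefault(verse_id, []).append(position)
    return index
```

If the marker is stripped during cleaning, `verse_ids` returns an empty list and the identifier would have to come from a separate field instead.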

3 participants