[unified_multi_sum.jsonl] Excessive "\n" in this file #12

Open
Mythos-Rudy opened this issue Mar 27, 2023 · 1 comment

Comments

@Mythos-Rudy

I noticed that the unified_multi_sum.jsonl file contains excessive line break characters ('\n') and runs of blank spaces inside individual sentences. I suspect that this data was collected from books or PDFs, so the hard line breaks at the end of every line in the original sources were carried over into the dataset. Line 1 shows the problem, for instance:

{"text": "\n: Summarize the following proposed legislation (bill): SECTION 1. SHORT TITLE.\n\n This Act may be cited as the Patients and Public Health \nPartnership Act of 2008''.\n\nSEC. 2. DEMONSTRATION PROJECT FOR INTEGRATED HEALTH SYSTEMS TO EXPAND \n ACCESS TO PRIMARY AND PREVENTIVE SERVICES FOR THE \n MEDICALLY UNDERSERVED.\n\n Part D of title III of the Public Health Service Act (42 U.S.C. \n259b et seq.) is amended by adding at the end the following new \nsubpart:\n\n Subpart XI--Demonstration Project for Integrated Health Systems to \n Expand Access to Primary and Preventive Services for the Medically \n Underserved\n\nSEC. 340H. DEMONSTRATION PROJECT FOR INTEGRATED HEALTH SYSTEMS TO \n EXPAND ACCESS TO PRIMARY AND PREVENTIVE CARE FOR THE \n MEDICALLY UNDERSERVED.\n\n (a) Establishment of Demonstration.--\n (1) In general.--Not later than January 1, 2009, the \n Secretary shall establish a demonstration project (hereafter in \n this section referred to as the `demonstration') under which up \n to 30 qualifying integrated health systems receive grants for \n the costs of their operations to expand access to primary and \n preventive services for the medically underserved.\n (2) Rule of construction.--Nothing in this section shall \n be construed as authorizing grants to be made or used for the \n costs of specialty care or hospital care furnished by an \n integrated health system.\n (b) Application.--Any integrated health system desiring to \nparticipate in the demonstration shall submit an application in such \nmanner, at such time, and containing such information as the Secretary \nmay require.\n (c) Criteria for Selection.--In selecting integrated health \nsystems to participate in the demonstration (hereafter referred to as \n`participating integrated health systems'), the Secretary shall ensure \nrepresentation of integrated health systems that are located in a \nvariety of States (including the District of Columbia and the \nterritories and 
possessions of the United States) and locations within \nStates, including rural areas, inner-city areas, and frontier areas.\n (d) Duration.--Subject to the availability of appropriations, the \ndemonstration shall be conducted (and operating grants be made to each \nparticipating integrated health system) for a period of 3 years.\n (e) Reports.--\n >

@huu4ontocord
Contributor

Yes, I see. These are legislative texts, and this comes from the original dataset. Let's see if we can remove some of the excessive \n. But note that the spaces are needed because they are meant to format the legislation. If you don't want to wait, maybe you can just filter them out yourself by doing a replace("\n\n", "\n") on the text.
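
For anyone who wants to apply that workaround now, here is a minimal sketch, not an official fix from the maintainers. It assumes each line of the file is a JSON object with a string "text" field, as in the sample above. The function names and the output path are hypothetical. It uses a regular expression instead of a single replace so that runs of three or more newlines also collapse to one, and it leaves spaces untouched since they carry the legislation's indentation.

```python
import json
import re

# Matches any run of two or more consecutive newlines.
_BLANK_RUN = re.compile(r"\n{2,}")


def collapse_blank_lines(text: str) -> str:
    """Collapse every run of 2+ newlines into one newline; spaces are kept."""
    return _BLANK_RUN.sub("\n", text)


def clean_jsonl(src_path: str, dst_path: str, field: str = "text") -> int:
    """Copy a JSONL file, collapsing blank lines in `field`.

    Assumes every non-empty line is a JSON object. Returns the number of
    records written.
    """
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip empty lines between records
            record = json.loads(line)
            value = record.get(field)
            if isinstance(value, str):
                record[field] = collapse_blank_lines(value)
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    # The output file name is a placeholder; adjust both paths as needed.
    n = clean_jsonl("unified_multi_sum.jsonl", "unified_multi_sum.cleaned.jsonl")
    print(f"wrote {n} records")
```

A single pass of replace("\n\n", "\n") only halves each run (four newlines become two), which is why the sketch uses a pattern instead. Single hard-wrap newlines inside sentences are left in place; removing those safely would need a separate rule, because some of them separate list items in the bill text.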
