Assignment Task
Task
Dataset (Rnews_v1 document collection).
You will be working with a sample dataset, a small subset of XML documents from the TREC RCV1 data collection, provided in a pre-tokenized version (for convenience, and for copyright reasons). The dataset can be downloaded from Blackboard.
You are asked to design Python code for three questions (9 tasks). You can add new variables, functions, methods, or update function parameters. However, you should provide comments to clearly describe why you are doing this.
Question 1. Document & query parsing.
The motivation for Question 1 is to design your own document and query parsers, so please don’t use Python packages that we didn’t use in the workshop.
Task 1.1:
Define a document parsing function parse_rcv_coll(inputpath, stop_words) to parse a data collection (e.g., Rnews_v1 dataset), where parameter inputpath is the folder that stores a set of XML files, and parameter stop_words is a list of common English words (you may use the file ‘common-english-words.txt’ to find all stop words). The following are the major steps in the document parsing function:
Task 1.2:
Define a query parsing function parse_query(query0, stop_words), where we assume the original query is a simple sentence or a title in a String format (query0), and stop_words is a list of stop words that you can get from ‘common-english-words.txt’.
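A minimal sketch of such a query parser is given below. It assumes that a query is parsed by lower-casing it, splitting on non-alphanumeric characters, and dropping stop words, and that the result is a {term: frequency} dictionary. The task text does not list the parsing steps, so these choices (and the absence of stemming) are assumptions and should be aligned with whatever parse_rcv_coll() does.

```python
import re
from collections import Counter


def parse_query(query0, stop_words):
    """Return a {term: frequency} dictionary for a plain-text query."""
    stop_set = set(stop_words)  # set membership tests are faster than list scans
    # Assumption: terms are runs of letters/digits; everything else is a separator.
    tokens = re.split(r"[^a-z0-9]+", query0.lower())
    # Drop empty strings produced by the split, and drop stop words.
    terms = [t for t in tokens if t and t not in stop_set]
    return dict(Counter(terms))


if __name__ == "__main__":
    print(parse_query("The British-Fashion Awards", ["the", "all", "this"]))
```

Splitting on non-alphanumeric characters means a hyphenated query such as "British-Fashion" yields two separate terms, which only matches documents if they were tokenized the same way.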
Task 1.3:
Define a main function to test function parse_rcv_coll(). The main function uses the provided dataset and calls function parse_rcv_coll() to get a collection of BowDoc objects. For each document in the collection, it first prints out its docID, the number of index terms and the total number of words in the document (doc_len). It then sorts the index terms (by frequency) and prints out a term:freq list. Finally, it saves the output into a text file (file name is “your full name_Q1.txt”).
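The per-document output described above might be assembled as in the following sketch. The helper name summarize_doc and the exact wording of the header line are hypothetical. The function takes the docID, the {term: freq} dictionary and doc_len as plain arguments, so it does not depend on how BowDoc is implemented.

```python
def summarize_doc(doc_id, terms, doc_len):
    """Build the text block for one document: a header line, then term : freq lines."""
    lines = [
        f"Document {doc_id} contains {len(terms)} terms "
        f"and has a total of {doc_len} words"
    ]
    # Sort index terms by frequency, highest first (ties keep insertion order).
    for term, freq in sorted(terms.items(), key=lambda kv: kv[1], reverse=True):
        lines.append(f"{term} : {freq}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(summarize_doc("doc1", {"fashion": 3, "award": 1, "british": 2}, 10))
```

Returning a string rather than printing directly lets the main function both print the block and write the same text to the output file.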
Question 2. Tf*idf based IR model.
Tf*idf is a popular term weighting method, which uses the following Eq. (1) to calculate a weight for term k in a document i, where the base of log is 10. You may review lecture notes to get the meaning of each variable in the equation.
Task 2.1:
Define a function calc_df(coll) to calculate document-frequency (df) for a given BowDoc collection coll and return a {term:df, …} dictionary.
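One possible shape for calc_df is sketched below. It assumes that coll is a dictionary mapping docID to BowDoc objects and that each BowDoc keeps its index terms in a {term: freq} dictionary named terms. Both are assumptions, since the BowDoc class is not specified in this text, and the stand-in class here contains only what the example needs.

```python
class BowDoc:
    """Minimal stand-in for the BowDoc class (attribute names are assumptions)."""

    def __init__(self, doc_id, terms=None):
        self.doc_id = doc_id
        self.terms = dict(terms or {})  # {term: frequency in this document}


def calc_df(coll):
    """Return {term: df}, where df is the number of documents containing the term."""
    df = {}
    for doc in coll.values():
        # Iterating over the dictionary keys counts each term once per document,
        # regardless of how often it occurs inside that document.
        for term in doc.terms:
            df[term] = df.get(term, 0) + 1
    return df


if __name__ == "__main__":
    coll = {
        "1": BowDoc("1", {"fashion": 2, "award": 1}),
        "2": BowDoc("2", {"fashion": 1, "stock": 4}),
    }
    print(calc_df(coll))
```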
Task 2.3:
Define a main function to print out the top 12 terms (with their tf*idf weights) for each document in Rnews_v1 that has more than 12 terms, and save the output into a text file (file name is “your full name_Q2.txt”).
Question 3. BM25-based IR model.
The BM25 IR model is a popular and effective ranking algorithm, which uses the following Eq. (3) to calculate a document score or ranking for a given query Q and a document D, where the base of log is 2. You may review lecture notes to get the meaning of each variable in the equation.
Task 3.1:
Define a Python function avg_doc_len(coll) to calculate and return the average document length of all documents in the collection coll.
- In the BowDoc class, for the variable doc_len (the document length), add accessor (get) and mutator (set) methods for it.
- You may modify your code defined in Question 1 by calling the mutator method of doc_len to save the document length in a BowDoc object when creating the BowDoc object. At the same time, sum up every BowDoc’s doc_len as totalDocLength, then at the end, calculate the average document length and return it.
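The accessor, mutator and averaging steps above can be sketched as follows. The method names get_doc_len and set_doc_len, the terms attribute, and the representation of coll as a {docID: BowDoc} dictionary are assumptions. The variable total_doc_length plays the role of totalDocLength in the task text.

```python
class BowDoc:
    """Minimal stand-in for BowDoc with an accessor and mutator for doc_len."""

    def __init__(self, doc_id):
        self.doc_id = doc_id
        self.terms = {}
        self.doc_len = 0

    def get_doc_len(self):
        return self.doc_len

    def set_doc_len(self, doc_len):
        self.doc_len = doc_len


def avg_doc_len(coll):
    """Return the average document length over all documents in coll."""
    total_doc_length = 0
    for doc in coll.values():
        total_doc_length += doc.get_doc_len()
    # Guard against an empty collection to avoid dividing by zero.
    return total_doc_length / len(coll) if coll else 0.0


if __name__ == "__main__":
    coll = {"1": BowDoc("1"), "2": BowDoc("2")}
    coll["1"].set_doc_len(10)
    coll["2"].set_doc_len(20)
    print(avg_doc_len(coll))
```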
Task 3.2:
Use Eq. (3) to define a Python function bm25(coll, q, df) that calculates each document’s BM25 score for a given original query q, where df is a {term:df, …} dictionary. Note that you should parse the query using the same method as for parsing documents (you can call the function parse_query() that you defined for Question 1). For the given query q, the function returns a dictionary of {docID: bm25_score, …} for all documents in the collection coll.
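Eq. (3) is not reproduced in this text, so the sketch below assumes a commonly used form of BM25 with no relevance information, in which each query term contributes log2((N - n + 0.5) / (n + 0.5)) multiplied by a document term-frequency factor and a query term-frequency factor. The parameter values k1 = 1.2, k2 = 100 and b = 0.75 are conventional defaults, not values taken from the assignment, and the equation in the lecture notes may differ. A whitespace split stands in for parse_query(), and coll is assumed to map docID to objects with terms and doc_len attributes and to contain at least one non-empty document.

```python
import math
from types import SimpleNamespace


def bm25(coll, q, df, k1=1.2, k2=100, b=0.75):
    """Return {docID: score} under the assumed BM25 form described above."""
    # Stand-in for parse_query(): lower-cased whitespace tokens with counts.
    query_terms = {}
    for token in q.lower().split():
        query_terms[token] = query_terms.get(token, 0) + 1

    n_docs = len(coll)
    avg_len = sum(doc.doc_len for doc in coll.values()) / n_docs

    scores = {}
    for doc_id, doc in coll.items():
        # Length normalization: longer-than-average documents get a larger K.
        K = k1 * ((1 - b) + b * doc.doc_len / avg_len)
        score = 0.0
        for term, qf in query_terms.items():
            n = df.get(term, 0)          # documents containing the term
            f = doc.terms.get(term, 0)   # term frequency in this document
            idf = math.log2((n_docs - n + 0.5) / (n + 0.5))
            score += idf * ((k1 + 1) * f / (K + f)) * ((k2 + 1) * qf / (k2 + qf))
        scores[doc_id] = score
    return scores


if __name__ == "__main__":
    coll = {
        "1": SimpleNamespace(terms={"fashion": 2}, doc_len=2),
        "2": SimpleNamespace(terms={"stock": 1}, doc_len=1),
        "3": SimpleNamespace(terms={"market": 3}, doc_len=3),
    }
    df = {"fashion": 1, "stock": 1, "market": 1}
    print(bm25(coll, "fashion", df))
```

With this form, a term that occurs in more than half of the documents gets a negative log factor, so scores can be negative in a small collection.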
Task 3.3:
Define a main function to implement a BM25-based IR model to rank documents in the given document collection Rnews_v1 using your functions.
• You are required to test all the following queries:
- This British fashion
- All fashion awards
- The stock markets
- The British-Fashion Awards
• The BM25-based IR model needs to print out the ranking result (in descending order) of the top-5 possibly relevant documents for each query and append the output to the text file (“your full name_Q3.txt”).
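The ranking and appending step might look like the sketch below. The helper names top_k and append_ranking and the output layout are hypothetical. The input is assumed to be the {docID: bm25_score} dictionary returned by bm25().

```python
def top_k(scores, k=5):
    """Return the k highest-scoring (docID, score) pairs in descending order."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


def append_ranking(path, query, scores, k=5):
    """Append the top-k ranking for one query to a text file."""
    # Mode "a" appends, so results for successive queries accumulate in one file.
    with open(path, "a", encoding="utf-8") as out:
        out.write(f"Query: {query}\n")
        for doc_id, score in top_k(scores, k):
            out.write(f"{doc_id}: {score}\n")
        out.write("\n")


if __name__ == "__main__":
    demo_scores = {"1": 0.4, "2": 2.1, "3": 1.3}
    for doc_id, score in top_k(demo_scores):
        print(f"{doc_id}: {score}")
```

Because the file is opened in append mode, rerunning the program adds to the existing results, so the file may need to be cleared before a fresh run.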