联系方式

  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-23:00
  • 微信:codinghelp

您当前位置:首页 >> javajava

日期:2019-08-07 10:17

2019/8/4 COMP9319 2019T2 Assignment 2

COMP9319 2019T2 Assignment 2: RLFM Index

(Run-Length Encoded BWT)

Your task inthis assignment is to create asearch program that implements BWT backward search, which

canefficiently search aRLFM encoded record file. The original file (before RLFM) format is:

[<offset1>]<text1>[<offset2>]<text2>[<offset3>]<text3>... ...

where <offset1>, <offset2>, <offset3>, etc. are integer values that are used as unique record identifiers;

and <text1>, <text2>, <text3>, etc. are record values (text), which include any ASCII alphabets with ASCII

values from 32 to 126, tab (ASCII 9) and newline (ASCII 10 and 13). For simplicity, there will be no openor

close square bracket inthe record values.

Your C/C++ program, called rlebwt , accepts:

1. acommand argument of either:

-m for the number of matching substrings (count duplicates),

-r for the number of unique matching records,

-a for listing the identifiers of all the matching records (no duplicates and in ascending order), or

-n for displaying the record value of agivenrecord identifier;

2. the path to aRLFM encoded file (without its file extension);

3. the path to aindex folder; and

4. aquoted query string (i.e., the search term) for option-m, -r, -a, or aquoted record identifier for

option-n

as commandline input arguments. The search term canbe up to 512 characters. To make the assignment

easier, we assume that the search is case sensitive.

If -a is specified, using the givenquery string, rlebwt will perform backward search onthe givenRLFM

encoded file, and output the sorted and unique identifiers (no duplicates) of all the records that contain the

input query string to the standard output. Each identifier is enclosed in a pair of square brackets, one line

(ending with a'\n') for each match.

If -m is specified, givenaquery string, rlebwt will output the total number of matching substrings (count

duplicates) to the standard output. The output is the total number, with an ending newline character.

Similarly, rlebwt will output the total number of unique matching records (do not count duplicates) if -r is

specified.

If -n is specified, using the givenrecord identifier, rlebwt will output the original record value (text) to the

standard output with a'\n' at the end.

For any of the above options, if amatch cannot be found, simply output nothing.

Although you do not need to submit aBWT encoder, it is apart of this assignment that you will implement

asimple BWT encoding program based onRLFM (this will help you in understanding the lecture materials

and assist intesting your assignment).

File Extensions and Formats

Sample files are provided in~cs9319/a2/.

wagner % pwd

/import/kamen/1/cs9319/a2

2019/8/4 COMP9319 2019T2 Assignment 2

https://www.cse.unsw.edu.au/~wong/cs9319-2019a2.html 2/6

wagner % ls -l

total 12088

-rw-r--r-- 1 cs9319 cs9319 911184 Jun 27 23:12 dblp.b

-rw-r--r-- 1 cs9319 cs9319 911184 Jun 27 23:12 dblp.bb

-rw-r--r-- 1 cs9319 cs9319 3132682 Jun 27 23:12 dblp.s

-r--r--r-- 1 cs9319 cs9319 7289468 Jun 27 23:12 dblp.txt

-rw-r--r-- 1 cs9319 cs9319 3191 Jun 27 22:40 shopping.b

-rw-r--r-- 1 cs9319 cs9319 3191 Jun 27 22:40 shopping.bb

-rw-r--r-- 1 cs9319 cs9319 13700 Jun 27 22:40 shopping.s

-r--r--r-- 1 cs9319 cs9319 25525 Jun 27 23:15 shopping.txt

-rw-r--r-- 1 cs9319 cs9319 2 Jun 27 23:11 simple1.b

-rw-r--r-- 1 cs9319 cs9319 2 Jun 27 23:11 simple1.bb

-rw-r--r-- 1 cs9319 cs9319 11 Jun 27 23:11 simple1.s

-r--r--r-- 1 cs9319 cs9319 15 Jun 27 23:11 simple1.txt

-rw-r--r-- 1 cs9319 cs9319 8 Jun 27 23:11 simple2.b

-rw-r--r-- 1 cs9319 cs9319 8 Jun 27 23:11 simple2.bb

-rw-r--r-- 1 cs9319 cs9319 35 Jun 27 23:11 simple2.s

-r--r--r-- 1 cs9319 cs9319 58 Jun 27 23:11 simple2.txt

-rw-r--r-- 1 cs9319 cs9319 70 Jun 27 23:11 simple3.b

-rw-r--r-- 1 cs9319 cs9319 70 Jun 27 23:11 simple3.bb

-rw-r--r-- 1 cs9319 cs9319 378 Jun 27 23:11 simple3.s

-r--r--r-- 1 cs9319 cs9319 553 Jun 27 23:11 simple3.txt

wagner %

The file extensions represent their corresponding types:

FILENAME.txt - the original text file. It is provided for your reference only. It will not be available

during auto marking.

FILENAME.s - corresponds to S in the RLFM lecture slides and its original paper. It is the BWT text

with the consecutive duplicates removed.

FILENAME.b - corresponds to the bit array B inthe RLFM lecture slides and its original paper. It is in

binary format, which canbe inspected using xxd as shownlater.

FILENAME.bb - corresponds to the bit array B' inthe RLFM lecture slides and its original paper. It is

inbinary format, which canbe inspected using xxd as shownlater. This file is not provided during

auto marking. Your rlebwt will need to generate it.

For the B and B' arrays, after the last bit is writtento the file, fill inthe gap (if any) of the last byte with bit

1. Check the xxd examples below for details.

Initializationand External Files

Whenever rlebwt is executed using agivenfile FILENAME, for example:

rlebwt -X FILENAME INDEX_FOLDER QUERY_STRING

where X canbe any one of the options (-m, -r, -a, -n), it will take FILENAME.s and FILENAME.b as input;

and also check if FILENAME.bb exists. If FILENAME.bb does not exist, it will generate one.

After that, it will check if INDEX_FOLDER exists. If not, it will create it as an index folder. Index files will

thenbe generated inside this index folder accordingly.

Inadditionto the B' array, your solutionis allowed to write out up to 6 external index files that are intotal

no larger thanthe total size of the given, input FILENAME.s file plus 2 x the size of the given FILENAME.b.

If your index files are larger thanthis limit, you will receive zero points for the tests that involve that given

FILENAME. You may assume that the index folder (and its index files inside) will not be deleted during all

the tests for agivenFILENAME, and all the INDEX_FOLDER are uniquely and correspondingly named.

Therefore, to save time, you only need to generate the index files whentheir folder does not exist yet.

Example

Suppose the original file (say dummy.txt) before RLFM is:

2019/8/4 COMP9319 2019T2 Assignment 2

https://www.cse.unsw.edu.au/~wong/cs9319-2019a2.html 3/6

[3]Computers in industry[25]Data compression[33]Integration[40]Big data indexing[90]1990-02-19[190]20.55

Some examples:

%wagner> rlebwt -m ~/a2/dummy ~/a2/dummyIndex "in"

4

%wagner> rlebwt -r ~/a2/dummy ~/a2/dummyIndex "in"

2

%wagner> rlebwt -a ~/a2/dummy ~/a2/dummyIndex "in"

[3]

[40]

%wagner> rlebwt -n ~/a2/dummy ~/a2/dummyIndex "3"

Computers in industry

%wagner>

Inthe above example, we assume dummy.s and dummy.b exist inthe a2 folder of our home directory.

rlebwt will generate dummy.bb inside a2, and will thencreate anindex folder called dummyIndex (with the

index files inside dummyIndex) inside a2 as well.

Inthe following example, we assume dummy.s and dummy.b exist inthe XYZ folder of the account

MyAccount. You will check if dummy.bb exists in~MyAccount/XYZ/. If not, your submitted rlebwt will

generate dummy.bb in~MyAccount/XYZ/(assume you have write permission in that folder). You will create

anindex folder called dummy (with the index files inside dummy) at your current directory.

%wagner> rlebwt -m ~MyAccount/XYZ/dummy dummy "in "

1

%wagner> rlebwt -r ~MyAccount/XYZ/dummy dummy "in "

1

%wagner>

%wagner> rlebwt -a ~MyAccount/XYZ/dummy dummy "In"

[33]

%wagner>

%wagner> rlebwt -m ~MyAccount/XYZ/dummy dummy "9"

3

%wagner> rlebwt -r ~MyAccount/XYZ/dummy dummy "9"

1

%wagner> rlebwt -a ~MyAccount/XYZ/dummy dummy "9"

[90]

%wagner> rlebwt -n ~MyAccount/XYZ/dummy dummy "9"

%wagner>

%wagner> rlebwt -n ~MyAccount/XYZ/dummy dummy "90"

1990-02-19

%wagner> rlebwt -n ~MyAccount/XYZ/dummy dummy "25"

Data compression

%wagner>

Note that it is possible that your submissionmay be tested with the B' files provided. For example, the

RLFM encoded file path could be ~cs9319/a2/simple1 and path to index folder could be ~/a2/myIndex.

Since simple1.bb is already there, you do not need to generate the B' file again and just read and use it

from ~cs9319/a2/. You will thengenerate the index folder called myIndex at your own a2 folder.

Inspecting the Binary Files

You may find the tool xxd useful to inspect the binary files correspond to the B and B' arrays. For example,

you may use xxd to inspect the provided sample files:

wagner % pwd

/import/kamen/1/cs9319/a2

wagner %

wagner % xxd -b simple1.b

0000000: 10111111 11101001 ..

wagner % xxd -b simple1.bb

0000000: 11101011 00111111 .?

wagner % cat simple1.s

[an12nbnb]awagner %

wagner % xxd -b simple2.b

0000000: 10000110 10111111 11111111 11111001 00000010 00111100 .....<

0000006: 11100110 10111111 ..

wagner % xxd -b simple2.bb

2019/8/4 COMP9319 2019T2 Assignment 2

https://www.cse.unsw.edu.au/~wong/cs9319-2019a2.html 4/6

0000000: 11011111 11100001 01000000 11100101 10010011 11111111 ..@...

0000006: 01111100 01111111 |.

wagner % cat simple2.s

[1[1endgnad1234245ndbnb]ngnabdiaiaiwagner %

wagner %

Inparticular, simple1.s has 11 characters. Therefore, there will be 11 ones in the array B that correspond to

the 11 characters insimple1.s. Since all the zeros representing the duplicates of these 11 characters, you

canobserve that the last bit "1" is just agap filler. This also means the original text file will contain 15

characters.

Compiling Your Submission

We will use the make command below to compile your solution. Please provide amakefile and ensure that

the code you submit canbe compiled onaCSE Linux machine, e.g., wagner. Solutions that have

compilationerrors will receive zero points for the entire assignment.

make

Your solutionshould not write out any external files other thanthe B' file and the index folder with

maximum six files inside. Any solutionthat writes out external files other thanthese files will receive zero

points for the entire assignment.

Performance

Your solutionwill be marked based onspace and runtime performance. Your soluton will not be tested

against any RLFM encoded files with their original files that are larger than 160MB.

Runtime memory is assumed to be always less than16MB. Runtime memory consumptionwill be measured

by valgrind massif with the option--pages-as-heap=yes, i.e., all the memory used by your program will be

measured. Any solutionthat violates this memory requirement will receive zero points for that query test.

To help you started, your tutor will provide anoverview onvalgrind and makefile during the tutorial in week

5 or week 6.

Any solutionthat runs for more than90 seconds onamachine with similar specificationas wagner for the

first query onagivenRLFM file will be killed, and will receive zero points for the queries for that RLFM file.

After that any solutionthat runs for more than20 seconds for any one of the subsequent queries onthat

RLFM file will be killed, and will receive zero points for that query test. We will use the time command and

count both the user and system time as runtime measurement.

Documentationand Code Inspection

Your source code will be inspected. Marks may be deducted if your code is very poor on readability and

ease of understanding; your code does not follow the requirements of this specification document; or your

code is writtendeliberately to avoid being accurately measured by valgrind.

Assumptions/clarifications

1. To avoid large runtime memory for sorting, none of the testcases for marking will result in more than

5,000 record matches.

2. The input filename is apath to the givenRLFM encoded file (without its extension .s and .b). Please

openthese files as read-only incase you do not have the write permission for these files.

3. Marks will be deducted for output of any extratext, other thanthe required, correct answers (in the

right order). This extrainformationincludes (but not limited to) debugging messages, line numbers

and so on.

2019/8/4 COMP9319 2019T2 Assignment 2

https://www.cse.unsw.edu.au/~wong/cs9319-2019a2.html 5/6

4. You canassume that the input query string will not be anempty string (i.e., ""). Furthermore, except

with the command argument -n, search terms containing only numbers shall not match any record

identifiers. Finally, search terms containing any square bracket will not be tested.

5. You may assume that offset >= 0 and will fit inanunsigned int.

6. Whencounting the number of substring matches (i.e., with -m option), to make it easier for backward

search matching, all combinations of matches should be counted. E.g., There are 2 matches of "aa"

onthe record value "aaa"; 2 matches of "ana"on"banana".

7. You are allowed to use up to 6 external index files to enhance the performance of your solution.

However, if you believe that your solutionis fast enough without using index files, you do not have to

generate these files. Eveninsuch case, your solutionshould still accept a path to index folder as one

of the input argument as specified.

8. A record will not be unreasonably long, e.g., you will not see aline that is 5,000+ chars long.

9. Empty records may exist inthe original files (before RLFM). However, these records will never be

matched during searching because the empty string will not be used as a search term when testing

your program.

Marking

This assignment is worth 100 points. Below is anindicative marking proportionfor large vs small files. You

may want to ensure your assignment working properly for small files before enhance it for larger files.

Test Files Points

TXT size <= 8MB 60

TXT size > 8MB 40

Bonus

Bonus marks (up to 10 points) will be awarded for the solutionthat achieves 100 points and runs the

fastest overall (i.e., the shortest total time to finish all the tests). Note: regardless of the bonus marks you

receive inthis assignment, the maximum final mark for the subject is capped at 100.

Submission

Deadline: Monday5th August 12:00 (noon). Late submissions will have marks deducted from the maximum

achievable mark at the rate of roughly 1% of the total mark per hour that they are late (i.e., 24% per day),

and no submissions will be accepted after 3 days late.

Use the give command below to submit the assignment or submit viaWebCMS3:

give cs9319 a2 makefile *.h *.c *.cpp

Please use "classrun" to check your submissionto make sure that you have submitted all the necessary

files.

Plagiarism

The work you submit must be your ownwork. Submissionof work partially or completely derived from any

other personor jointly writtenwith any other personis not permitted. The penalties for such anoffence

may include negative marks, automatic failure of the course and possibly other academic discipline.

Assignment submissions will be examined both automatically and manually for such submissions.

Relevant scholarship authorities will be informed if students holding scholarships are involved inanincident

of plagiarism or other misconduct.

2019/8/4 COMP9319 2019T2 Assignment 2

https://www.cse.unsw.edu.au/~wong/cs9319-2019a2.html 6/6

Do not provide or show your assignment work to any other person- apart from the teaching staff of this

subject. If you knowingly provide or show your assignment work to another personfor any reason, and

work derived from it is submitted you may be penalized, evenif the work was submitted without your

knowledge or consent. This may apply evenif your work is submitted by athird party unknownto you.


版权所有:留学生程序网 2020 All Rights Reserved 联系方式:QQ:99515681 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。