librsync 2.3.1
# librsync TODO

* We have a few functions to do with reading a netint, stashing
  it somewhere, then moving into a different state. Is it worth
  writing generic functions for that, or would it be too confusing?

* Optimisations and code cleanups:

  scoop.c: Scoop needs a major refactor. Perhaps the API needs
  tweaking?

  rsync.h: Should rs_buffers_s and rs_buffers_t be a single typedef?

  mdfour.c: This code has a different API to the RSA code in libmd
  and is coupled with librsync in unhealthy ways (trace?). Recommend
  changing to the RSA API?

* Just how useful is rs_job_drive anyway?

* Don't use the rs_buffers_t structure.

  There's something confusing about the existence of this structure.
  In part it may be the name. I think people expect it to behave
  like a FILE* or C++ stream, and it really does not. Also, the
  structure does not behave as an object: it's really just a
  shorthand for passing values in to the encoding routines, and so
  does not have much identity of its own.

  An alternative might be

      result = rs_job_iter(job,
                           in_buf, &in_len, in_is_ending,
                           out_buf, &out_len);

  where we update the length parameters on return to show how much
  we really consumed.

  One technicality here will be to restructure the code so that the
  input buffers are passed down to the scoop/tube functions that
  need them, which are relatively deeply embedded. I guess we could
  just stick them into the job structure, which is becoming a kind
  of catch-all "environment" for poor C programmers.
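To make the alternative concrete, here is a toy sketch of how an
rs_job_iter with in-place length updates might behave from the
caller's side. The toy_job type, the result enum, and the
pass-through behaviour are invented purely for illustration; only
the shape of the signature comes from the note above.

```c
#include <string.h>

/* Stand-ins for the real librsync job machinery: the point is only
 * the calling convention, where *in_len and *out_len are updated in
 * place to report how much was actually consumed and produced. */
typedef struct { int unused; } toy_job;

typedef enum { TOY_BLOCKED, TOY_DONE } toy_result;

/* A pass-through "job": copy as much input as the output buffer can
 * hold, then shrink the length parameters to what really happened. */
static toy_result toy_job_iter(toy_job *job,
                               const char *in_buf, size_t *in_len,
                               int in_is_ending,
                               char *out_buf, size_t *out_len)
{
    size_t n = (*in_len < *out_len) ? *in_len : *out_len;
    memcpy(out_buf, in_buf, n);
    *in_len = n;   /* bytes consumed */
    *out_len = n;  /* bytes produced */
    (void)job;
    return in_is_ending ? TOY_DONE : TOY_BLOCKED;
}
```

The caller then loops, refilling in_buf and draining out_buf until
the job reports it is done, with no rs_buffers_t in sight.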
* Meta-programming

  * Plot lengths of each function

  * Some kind of statistics on the delta each day

* Encoding format

  * Include a version in the signature and difference fields

  * Remember to update them if we ever ship a buggy version (nah!) so
    that other parties can know not to trust the encoded data.

  * Abstract encoding

    In fact, we can vary several different things:

    * what signature format are we using?

    * what command protocol are we using?

    * what search algorithm are we using?

    * what implementation version are we?

    Some are more likely to change than others. We need a chart
    showing which source files depend on which variable.

* Encoding algorithm

  * Self-referential copy commands

    Suppose we have a file with repeating blocks. The gdiff format
    allows COPY commands to extend into the *output* file so that
    they can easily point this out. By doing this, they get
    compression as well as differencing.

    It'd be pretty simple to implement, I think: as we produce
    output, we'd also generate checksums (using the search block
    size) and add them to the sum set. Matches would then fall out
    automatically, although we might have to make special allowance
    for short blocks.

    However, I don't see many files which have repeated 1kB chunks,
    so I don't know whether it would be worthwhile.

  * Support compression of the difference stream. Does this belong
    here, or should it be in the client, with librsync just providing
    an interface that lets it cleanly plug in?

    I think if we're going to do just plain gzip, rather than
    rsync-gzip, then it might as well be external.

    rsync-gzip: preload with the omitted text so as to get better
    compression.
    Abo thinks this gets significantly better compression. On the
    other hand, we would have to import and maintain our own zlib
    fork, at least until we can persuade the upstream to take the
    necessary patch. Can that be done?

    abo says:

      It does get better compression, but at a price. I actually
      think that getting the code to a point where a feature like
      this can be easily added or removed is more important than the
      feature itself. Having generic pre- and post-processing layers
      for hit/miss data would be useful. I would not like to see it
      added at all if it tangled and complicated the code.

      It also doesn't require a modified zlib... pysync uses the
      standard zlib to do it by compressing the data, then throwing
      it away. I don't know how much benefit the rsync modifications
      to zlib actually bring, but if I were implementing it I would
      stick to a stock zlib until the fork proved significantly
      better.

* Licensing

  Will the GNU Lesser GPL work? Specifically, will it be a problem
  in distributing this with Mozilla or Apache?

* Testing

  * Just more testing in general.

  * Test broken pipes and that IO errors are handled properly.

  * Test files >2GB and >4GB. Presumably these must be done as
    streams so that the disk requirements to run the test suite are
    not too ridiculous. I wonder if it will take too long to run
    these tests? Probably, but perhaps we can afford to run just one
    carefully chosen test.

  * Fuzz instruction streams. <https://code.google.com/p/american-fuzzy-lop/>?

  * Generate random data; do random mutations.

  * Tests should fail if they can't find their inputs, or have zero
    inputs: at present they tend to succeed by default.
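The "random mutations" bullet could start as small as a helper like
the one below, which XORs a few randomly chosen bytes of an otherwise
valid stream with nonzero values. All names and the LCG constants are
invented for the sketch (a fixed-seed generator is used instead of
rand() so failures reproduce); a real harness would then feed the
mutated stream into the signature/delta readers.

```c
#include <stddef.h>

/* Tiny deterministic generator so a failing mutation can be replayed
 * from its seed.  Constants are the classic LCG ones, chosen here
 * only for the sketch. */
static unsigned next_rand(unsigned *state)
{
    *state = *state * 1103515245u + 12345u;
    return (*state >> 16) & 0x7fffu;
}

/* Flip `nflips` randomly chosen bytes in buf.  XORing with a nonzero
 * value guarantees each chosen byte actually changes. */
static void mutate_bytes(unsigned char *buf, size_t len,
                         unsigned nflips, unsigned seed)
{
    unsigned state = seed;
    while (len > 0 && nflips--) {
        size_t pos = next_rand(&state) % len;
        buf[pos] ^= (unsigned char)(1u + next_rand(&state) % 255u);
    }
}
```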
* Security audit

  * If this code were to read differences or sums from random
    machines on the network, then it's a security boundary. Make
    sure that corrupt input data can't make the program crash or
    misbehave.
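As a flavour of what the audit has to guarantee, here is a sketch of
bounds-checked reading of a big-endian integer from an untrusted
buffer, refusing to run past the end rather than trusting declared
lengths. read_netint and its signature are made up for illustration;
it is not the actual netint.c interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a `width`-byte big-endian integer from an untrusted buffer of
 * `avail` bytes.  Returns 0 on success, -1 if the input is truncated
 * or the width is nonsensical, instead of reading out of bounds. */
static int read_netint(const unsigned char *buf, size_t avail,
                       size_t width, uint64_t *out)
{
    uint64_t v = 0;
    size_t i;
    if (width == 0 || width > 8 || width > avail)
        return -1;              /* corrupt or truncated input */
    for (i = 0; i < width; i++)
        v = (v << 8) | buf[i];
    *out = v;
    return 0;
}
```

Every length and offset decoded this way would then be range-checked
against the actual input before use.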