Private Maintainer -- TODO      -*- markdown -*-

A much smaller _public TODO_ is part of the file `./README`.
Things done are moved to `./DONE-MM`.

### `clusGap()`

* currently fails [in `svd()`] when `x` has any NAs

* `clusGap()`: now has `"original"` in addition to the PCA rotation, via the `spaceH0` string.
  In addition, should provide `H0gen = a RNG *function*`:
  Chris Fields (in March 2012) had proposed to alternatively allow uniform on an
  n-simplex instead of an n-cube.
  -> `~/R/MM/Pkg-ex/cluster/clusGap-ChrisField-thoughts`
  ==> Master Thesis of Emmanuel Profumo (`~/Betreute-Arbeiten/Profumo_Emmanuel/`)
  has a __generalized__ `clusGap()`
  --> added but not yet {exported+documented} >>> `R/clusGapGen.R`
  NB: See more in `~/R/MM/Pkg-ex/cluster/clusGap/clusGapGen/`,
  notably `~/R/MM/Pkg-ex/cluster/clusGap/clusGapGen/README.md`

--------------------------------------------------

* `clusGap()` `print()` method: note that Tibshirani et al. proposed (see their large
  data example!) a different method than the one we currently implement
  --> provide both(!)

### `clara()`

* `clara()` to work with *daisy*-like distances, not just L2 & L1.
  -- Possibility: using the stats package C code;
  `~/R/D/r-devel/R/src/library/stats/src/distance.c` does not have Gower,
  but a few more kinds of distance methods.

* `clara()` - bug?: `tests/clara.Rout.save.~13~` (and `~12~`, `~11~` ...) gives a clearly
  better result for `clara(ru4, k=3, met="manhattan", sampsize=4)`
  than the current (later) `clara.Rout` files...
  -> bug since 'Mar 11 2004' (= cluster-1.[89].1 for R 1.[89]) ?? FIXME ?!?!?!

* It should be possible to _re_start `clara()` with a **given** "best sample"

* Have introduced `as.data.frame()` via `S3method()` in `NAMESPACE` + copy (from base) of
  `as.data.frame.matrix`.
  Nicer (but less back compatible ==> need rev.dep checks!) would have been to change the
  *class* to `c("silhouette", class(matrix()))` such that all matrix (and array) methods
  would work with silhouette.

* `silhouette.clara()` - `silhouette(*, full=TRUE)`:
  Allow the option of *NOT* passing the full-size length n(n-1)/2 dist object, but
  computing `daisy()`-like distances *on the fly* inside `src/sildist.c`
  ==> still needs O(n^2) CPU effort, but not O(n^2) memory
  { which is not even *possible*, e.g., for n=70'000 }

  - Consider `silhouette(*, full = 0.50)` to compute the silhouette for a (random?)
    subset of {0.50 * n} of the observations

* `plot.silhouette()`: if an observation's width is == 0., draw a small stripe
  instead of nothing at all

### `pam()` (and also `clara()`)

* `clara()` got `metric="jaccard"` (donated in 2018) ==> copy to `pam()`, too !!
  well: actually that is *BUGGY*

* `pam()` and `clara()`: Should be possible to "re"start with GIVEN medoids.
  Now possible for `pam()`, not for `clara()`;
  --> synchronize `src/pam.c` and `src/clara.c`, in particular `bswap()` vs `bswap2()` !!

* `pam()` and `clara()`: With NAs, medoids "often" contain NAs even when there are
  only few NAs.
  ==> use a modified d(.,.) which makes NAs "bad" somehow.

* `pamila()`: Major smart idea: Do save the d(i,j) i=1,..k j=1,..n
  only between *medoids* and everything else -- speedup(?) -> optional

* `R/agnes.q`, `R/diana.q` and `R/pam.q` have an almost identical clause

      if(data.class(x) != "dissimilarity") {
          if(!is.numeric(x) || is.na(sizeDiss(x)))
              stop("x is not of class dissimilarity and can not be converted to this class." )
          ## convert input vector to class "dissimilarity"
          class(x) <- ..dClass
          attr(x, "Size") <- sizeDiss(x)
          attr(x, "Metric") <- "unspecified"
      }

  which can be modularized out into a NAMESPACE-local `fixupDiss()` function
  [ `agnes()` and `diana()` have even more in common --> namespace-local functions!
  see also "8b)" below! ]

### diana() {divisive hierarchical}:

* Should allow __early stopping__ (for speed and size) -- simultaneously, could think of
  _`diss()` on the fly_ instead of a `diss()` matrix,
  but see `./src/NOTES-MM` (and `pamila()`) !

Dec. 2002:

 o clara(ruspini, 4) BUG in clara.c (see below)
   -- worked fine in cluster-1.5.2 (with clara.f!)
   -- gives error in cluster-1.6.1 [and later]  ==== AARGH
   (the problem is *not* an integer/double one, here!)

   Status 28.Dec.2002:
   - The August-2002 Fortran code doesn't seem to have a problem
     ==> ~/R/Pkgs/TMP/cluster/
   - The F2C code (called via .Fortran()) seems the same
     ==> ~/R/Pkgs/T_F2C/cluster/
   - A very slight change of the F2C code (using .C()) has one problem,
     but not all of those of the "modern" C version
     ==> ~/R/Pkgs/T_F2C-2/cluster/

   Fixed most of the above 2002-12-28 _late_ -- still one small problem!
   but it seems clear this was even in early clara.f
   (at least, the final result is the same for that example)

   src/clara.c << needs more

 o diana(ruspini) --> ok (again)

 o bannerplot() is now `standalone' and has a help page, man/bannerplot.Rd .
   HOWEVER its "details" are found in man/plot.agnes.Rd (and ???) instead
   --> centralize this info (and keep short ref.s in the man/plot.* )

 o agnes() and hclust() should be merged {and based on C, not Fortran}

 o agnes() for large objects needs TWICE the time of hclust();
   both need MUCH MORE time than hcluster() in pkg 'amap', which is said to be
   the same as 'hclust' but only malloc()ating the "huge" dissimilarities inside C.
   --> translate agnes, i.e. src/twins.f to C

July 2002:

 o Idea for new functionality:
   e.g., pamila() := PAM In Large Application
   should not *save* dissimilarities but rather re-compute them on the fly
   --> save huge storage
   ==> should give identical results but be faster for larger n,
   or at least feasible for n = 10'000 or so where it currently ain't.

June 2002:

   mona() : I think it should be possible to write an
   as.hclust.mona() or as.twins.mona() method
   and hence also draw a dendrogram of a mona object.

Jan.
2002:

   clusellipses() ``like part of clusplot'' for *adding* ellipses to a plot;
   maybe do this with "add = TRUE, plotchar = !add, labels = 0"

May 23, 2001 / Jan. 2002:
-------------------------

I found problems with missing values / NAs treatment:

 o Also, I'm not sure if the NAs are dealt with sensibly in clara():
   The result changes too much with very few NAs

 o --> look at all the dysta*() subroutines in src/*.f
   Clean these up and merge them into a single one!
   Aug.02: partly done -- fanny() is different from the others.
   In the future: When "mva" has a C API, use dist()'s C function!

7) Get rid of the many \section{GENERATION}, {METHODS} and {INHERITANCE} sections in
   man/*.Rd -- make sure that info is available, at least partially, otherwise.

6a) The \references{} mostly contain the same things.
    man/plot.agnes.Rd has some of them nicely.
    Collect in a few places (*.Rd files), and refer to these {partly ok}

 b) Similarly for the \section{BACKGROUND} which appears in quite a few *.Rd files.
    --> done partly ( ./ChangeLog 2002-01-24 )

8b) Think about "merging" the plot.agnes and plot.diana methods.

------------------------------------------------

Older TODO
==========
(were in `README_MM` which is now eliminated)

3) daisy() for the case of mixed variables should allow
   a weight vector (of length p = #vars) for up- or downweighting variables.
   daisy() really should accept the other methods mva's dist() does _and_
   it should use dist's C API -- but we have no C API for package code, ARRGH!

4) Eliminate the many Fortran (g77 -Wall) warnings of the form
   >> mona.f:101: warning: `jma' might be used uninitialized in this function

---------

9) man/daisy.Rd should mention 'Gower (1971)'; mention that Kaufman & Rousseeuw
   *generalize* this; and probably show the full formula from Kauf+Rouss p.35

10) Implement the plot for "fuzzy cluster membership" of section 5.4, from Kauf+Rouss p. 195 ff :
    I.e.
    PCA of the membership matrix for the points + "the pure clusters"
    <---> Export as.membership() and toCrisp() in ./R/fanny.q
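
A rough sketch of what item 10) might look like. It uses only the public `fanny()` API, not the unexported as.membership() / toCrisp() helpers. Assumptions, not checked against Kauf+Rouss: that "the pure clusters" can be represented as unit membership vectors (the rows of an identity matrix), and that a plain `prcomp()` of the membership matrix is the intended PCA. The data set, `k = 4`, and all variable names are chosen only for illustration.

```r
library(cluster)  # fanny() and the ruspini data

## Fuzzy clustering; 'membership' is an n x k matrix whose rows sum to 1
fz <- fanny(ruspini, k = 4)
M  <- fz$membership
k  <- ncol(M)

## ASSUMPTION: a "pure cluster" is a point with membership 1 in one cluster
pure <- diag(k)

## PCA of the memberships (centered, unscaled), first two components
pc     <- prcomp(M)
scores <- pc$x[, 1:2]

## Project the pure clusters into the same PC space by hand
corners <- scale(pure, center = pc$center, scale = FALSE) %*% pc$rotation[, 1:2]

## Points coloured by their crisp (nearest) cluster, pure clusters as triangles
plot(rbind(scores, corners), type = "n", xlab = "PC 1", ylab = "PC 2")
points(scores,  col = fz$clustering)
points(corners, col = seq_len(k), pch = 17, cex = 2)
```

The projection is done by hand rather than via `predict()` so that the sketch does not depend on whether the membership matrix carries column names.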