Guys I have an interesting coding problem for you that I'm really stuck with

Say I have two lists of names. Both are in random order, with a few names appearing in both. I need to find out which names are in group B but not in group A. How would I do this in Excel or MATLAB?

To complicate matters further, some entries aren't 100% identical but are very similar. For example, in list A a name might be Andrew B. Cosby and in list B just Andrew Cosby, but obviously this is a match and should not be in my answer list.

Thanks guys!

Do you know any other coding languages? I don't think Excel (and I don't know about MATLAB) is the best way to handle lists of strings.

>To complicate matters further some entries aren't 100% similar

I don't know what you mean by that, but you can use the Levenshtein distance with a tolerance level to find similar but not identical strings.
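A rough sketch of the Levenshtein-with-tolerance idea in Python, in case it helps. The function names and the default tolerance of 3 are just assumptions for illustration, nothing standard; pick the tolerance by looking at your own data.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character inserts,
    deletes and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def is_close(a: str, b: str, tolerance: int = 3) -> bool:
    """Hypothetical helper: treat two names as the same person when
    their case-insensitive edit distance is within the tolerance."""
    return levenshtein(a.lower(), b.lower()) <= tolerance
```

With these definitions `is_close("Andrew B. Cosby", "Andrew Cosby")` is true, since the only difference is the three characters "B. ". Keep in mind a loose tolerance will also merge short names that are genuinely different.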

about the other thing,

repmat(A, N_b,1) - repmat(B,N_a,1)'

the zeros are your duplicates.
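For anyone without MATLAB, the same pairwise-difference trick can be sketched in plain Python. This assumes the entries are numbers (say, ID codes), since you can't subtract names; the sample values are made up.

```python
# Made-up numeric IDs standing in for the two lists
A = [3, 7, 1]
B = [7, 9, 3, 4]

# One row per element of B, one column per element of A,
# like the N_b-by-N_a matrix the repmat expression builds
diff = [[a - b for a in A] for b in B]

# A zero at (i, j) means B[i] equals A[j]
matches = [(i, j) for i, row in enumerate(diff)
           for j, value in enumerate(row) if value == 0]

# Rows with no zero at all are the entries of B that are missing from A
only_in_b = [B[i] for i, row in enumerate(diff) if 0 not in row]
```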

If you have data like that you need to standardize it first

I don't really, unfortunately. I'm a maths student, so my coding is limited to calculations in MATLAB. Do you not think MATLAB or Excel could handle something like this?

Would make for slow going, but I think you're right. Any ideas how to do it with standardised data?

What language is this?

I mean if you've standardized the data you can just use sets. Or a terrible ugly for loop.

matlab

What command would you use to compare an element of A with an element of B?
What are the inputs in this case?

A, your first list, B, your second list, N_a, length of A, N_b, length of B

you're a math student and you've never used repmat?

Nope, I'll give it a browse

Thanks all for your help, if this works I'll share some of the £65k with you!

>£65k

pfff yea right

uninteresting programming problem in a shit language

First of all you should sort your fucking data.
After that it's pretty simple:
Compare the first letter of A[0] with the first letter of B[n].
If it's a match, compare the names (just first and last); if the names match, record the name / remove it from the list.
Else break the loop and move on to A[1].
This is probably the simplest, but it won't be terribly fast.

Sounds like a job for setdiff.

It depends how your data is formatted. If you are using VBA for Excel you can use the Left() function and compare the first n characters.

Use sets in Python
Set B - (Set A ∩ Set B)
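Spelled out, that would look something like this. The names are made-up sample data, and it only catches exact matches.

```python
# Made-up sample data
list_a = ["Andrew Cosby", "Maria Lopez", "John Smith"]
list_b = ["John Smith", "Wei Chen", "Andrew Cosby", "Priya Patel"]

set_a, set_b = set(list_a), set(list_b)

# Names in B but not in A
only_in_b = set_b - set_a

# The formula from the post, B - (A ∩ B), gives the same set
same_thing = set_b - (set_a & set_b)
```

Sets are unordered, so use `sorted(only_in_b)` if you want a stable listing.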

He needs to normalize the data first so that equivalent names are equal

sha4096

Matlab is bad at this because it's a shit language (with shit string support), but you can do something like this:
First go through both lists of names and convert them to upper (or lower) case while also removing things like the B. in your Andrew Cosby example (a good way to do this would probably be to take the first and last word).
After that, use the appropriate set operations on the lists.
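The two steps above could be sketched like this in Python (not MATLAB). `normalize` and `names_only_in_b` are hypothetical helper names, and keeping only the first and last word is the assumption from the post; it will wrongly merge people who share first and last names.

```python
def normalize(name: str) -> str:
    """Upper-case the name and keep only its first and last word."""
    words = name.upper().split()
    if not words:
        return ""
    if len(words) == 1:
        return words[0]
    return f"{words[0]} {words[-1]}"


def names_only_in_b(list_a, list_b):
    """Original spellings from list_b whose normalized form
    does not occur anywhere in list_a."""
    keys_a = {normalize(name) for name in list_a}
    return [name for name in list_b if normalize(name) not in keys_a]
```

So `normalize("Andrew B. Cosby")` and `normalize("Andrew Cosby")` both come out as `"ANDREW COSBY"`, and that name drops out of the answer list.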

post it to mechanical turk for peanuts, your time clearly is more valuable

alternatively if your sets are really big make it into a captcha and let people do it for free

FOR EACH X NOT LISTC()[] IN LISTA {
LISTD [] = X
}

Listc() {
For each X in LISTB[] {
LISTC [] = "*" & X & "*"
}
}

hisssss :^)

[code]
void listd {
FOR EACH X NOT LISTC()[] IN LISTA {
LISTD[] = X
}
}

static array listc()[] {
For each Y in LISTB[] {
LISTC[] = "*" & X & "*"
}
}
[/code]

There's some Python for you.

Don't do this, it makes mustard gas

But really this will infinitely loop and segfault Windows. 9/10

Not sure I can think of a non-O(n^2) way to do it.

Just go one by one through list B, checking each value of list A. You should also write an isSimilar() method that takes two names, splits them on whitespace and compares the first and last values (names).
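Something like this, as a sketch. Lower-casing before comparing is my own assumption (the post doesn't say), and `missing_from_a` is a made-up name for the outer loop.

```python
def is_similar(name_1: str, name_2: str) -> bool:
    """True when both names share the same first and last word,
    ignoring case and anything in between (middle names, initials)."""
    parts_1 = name_1.lower().split()
    parts_2 = name_2.lower().split()
    if not parts_1 or not parts_2:
        return False
    return parts_1[0] == parts_2[0] and parts_1[-1] == parts_2[-1]


def missing_from_a(list_a, list_b):
    """Every name in list_b that is not similar to any name in list_a.
    Checks each pair, so this is O(len(list_a) * len(list_b))."""
    return [b for b in list_b
            if not any(is_similar(a, b) for a in list_a)]
```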

>O(n^2) way to do it.

Concatenate the lists in 1
Sort the list in N log N
Run through the list and check neighbors in N.

There you go, N log N solution

If the lists are already sorted it's an N solution.
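A sketch of the concatenate, sort and scan idea in Python. Tagging each name with the list it came from is my own addition so the scan can tell the lists apart, and it only handles exact matches, so normalize the names first.

```python
def only_in_b_sorted(list_a, list_b):
    """Names that occur in list_b but not in list_a, found by sorting
    the combined list and looking only at neighbouring entries."""
    tagged = ([(name, "A") for name in list_a] +
              [(name, "B") for name in list_b])
    tagged.sort()  # O(n log n); for equal names, "A" entries come first

    result = []
    last_a_name = None  # most recent name seen that came from list A
    for name, source in tagged:  # single O(n) pass
        if source == "A":
            last_a_name = name
        elif name != last_a_name and (not result or result[-1] != name):
            # a B name with no A twin right before it, not already recorded
            result.append(name)
    return result
```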

Fucking noobs

Perl has some lovely regular expressions and this amazing data structure known as a hash for just this sort of thing. I encourage you to look it up, even if it's the legacy of legacy.

Python has similar stuff going on, but regexp in Python is a little bit less intuitive for me (please don't ask me how /// is easier than regexp). And a hash is just a dict in Python.

Matlab has very poor regexp support from what I understand, even though I like it.

You have yourself there a week 1 day 5 regexp problem in Perl
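Not Perl, but the regex-plus-hash idea might look roughly like this in Python, with a dict playing the role of the hash. The pattern is an assumption: it only strips single-letter initials followed by a dot, like "B.", and the function names are made up.

```python
import re

# A single letter followed by a dot, plus any spaces after it, e.g. "B. "
INITIAL = re.compile(r"\b[A-Za-z]\.\s*")
SPACES = re.compile(r"\s+")


def key(name: str) -> str:
    """Lookup key: initials removed, whitespace collapsed, lower case."""
    without_initials = INITIAL.sub("", name)
    return SPACES.sub(" ", without_initials).strip().lower()


def only_in_b(list_a, list_b):
    """Names from list_b whose key is not among the keys of list_a."""
    seen_in_a = {key(name): name for name in list_a}  # the "hash"
    return [name for name in list_b if key(name) not in seen_in_a]
```

Each dict lookup is constant time on average, so the whole thing is roughly linear in the size of the two lists.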

Seeing how you are thinking about excel or matlab you probably don't give a shit about time complexity.

Store both lists as simple arrays.

Take a name from list B and compare it to literally every member of list A. If there is no match (track this with a boolean) then you output this name.

Repeat this for every member in list B and there you have it.

Assuming lists of the same size this is just O(n squared), so it is not absolutely shit, but it is literally as bad as you can do.
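In Python the loop described above would be roughly the following. The function name is made up, and it does exact comparison only.

```python
def only_in_b_brute_force(list_a, list_b):
    """Compare every name in list_b against every name in list_a."""
    missing = []
    for name_b in list_b:
        found = False  # the boolean that tracks whether a match turned up
        for name_a in list_a:
            if name_a == name_b:
                found = True
                break  # no need to look further in list_a
        if not found:
            missing.append(name_b)
    return missing
```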

in R, only considering exact matches:

unique(B[! B %in% A])
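A rough Python counterpart of that one-liner, as far as I can tell: it keeps the first occurrence of each name in B's original order, which is what R's unique() does. `unique_not_in` is a made-up name.

```python
def unique_not_in(b, a):
    """Elements of b that are not in a, duplicates dropped,
    first-occurrence order preserved. Exact matches only."""
    a_set = set(a)  # constant-time membership tests
    # dict keys keep insertion order, so this de-duplicates in order
    return list(dict.fromkeys(x for x in b if x not in a_set))
```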

>please dont ask me how /// is easier than regexp
it's not the syntax that's shit in python's regex, but the implementation.

they recommend you pre-compile your patterns, but they have it set up so you can just pass a pattern string instead of a pattern object, and it's caching behind the scenes, so there's sometimes no difference in behavior no matter how you set up the search

it's a great example of horribly planned premature optimization
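For reference, these are the two ways of calling it. The pattern is just an illustrative guess at a "first name, optional initial, last name" shape, not something from the thread.

```python
import re

# Illustrative pattern: first word, optional single-letter initial with a dot, last word
pattern_text = r"^(\w+)\s+(?:\w\.\s+)?(\w+)$"

# Option 1: compile once, then reuse the pattern object
compiled = re.compile(pattern_text)
m1 = compiled.match("Andrew B. Cosby")

# Option 2: pass the pattern string directly; the re module keeps a cache
# of recently compiled patterns, so repeated calls avoid recompiling
m2 = re.match(pattern_text, "Andrew B. Cosby")

first_last = m1.groups()  # first and last name, initial skipped
```

Both calls return the same groups here; the practical difference is that the module-level call still pays for a cache lookup each time.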