liv: ribbon diagram of a p53 monomer (p53)
[personal profile] liv
I have a spreadsheet with approx forty thousand rows. Around 6000 of them are irrelevant - they're mixed in with the rest, but are identifiable based on data in one of the columns. The data covers three years. The years are not recorded as proper dates, but as plain text saying things like 2015/16.

My task is that for each year, I need to count the unique values in one of the columns. This column contains only text, no blanks. It's not made out of meaningful English words, but serial numbers containing letters and digits. The values in each column are repeated anywhere from 1 to 60 times; I just want to know how many different serial numbers there are, not overall, but separately for each year.

Year Serial Number Flag
2015/16 AAA111 Relevant
2015/16AAA128 Irrelevant
2016/17 AAA111 Irrelevant
2016/17 AAB139 Relevant
2016/17 AAA111 Relevant


My own knowledge of Excel, supplemented by searching for things like Excel count unique values, isn't quite sufficient. Anything that involves doing this semi-manually (eg sorting the columns then counting by hand) is unfeasible over tens of thousands of rows. Anything automated needs to not make Excel choke with a large-ish spreadsheet.

I tried following the instructions in this official Microsoft article, and it's not quite working for my case. As soon as I try to filter a column by unique values only, it overrides the filter I started with for taking out the 6000 rows marked 'irrelevant' in a different column. Even worse than that, instead of copying the roughly 10,000 cells in the same column that I tried to select, it copies a whole chunk of the spreadsheet with several columns and I'm not sure exactly how it's related to the area I selected. The second method, count using functions, I don't understand well enough to try, and since the method I thought I understood behaved very unexpectedly, I don't want to start blindly pasting in a formula I really won't be able to debug.

Does anyone have any suggestions for how to approach this?

(The reason why I'm trying to wrangle this myself rather than delegating it to someone who has relevant expertise is, well, annoying work politics. But the fact remains that I need to do it.)

(no subject)

Date: 2018-02-15 10:38 am (UTC)
aldabra: (Default)
From: [personal profile] aldabra
First, add a column on the far left, call it Incoming, and number it in the order your lines are in now. That means you can sort them and if it breaks you can put them back to how they were when you started. You're probably not going to need to sort them, but always better to be safe.

Then you want COUNTIFS. That lets you count on more than one condition.

There's a page here:

https://exceljet.net/formula/count-unique-text-values-with-criteria

which does counting of unique values using FREQUENCY and MATCH. I'm slightly too much at work myself to put these together, but they're the things I'd look at. I'd be aiming for a column which was TRUE if my serial number didn't occur in any of the rows above it, but FALSE if it did (that's the FREQUENCY and MATCH thing), and then I'd COUNTIFS the year was 2015-16 and the FREQUENCY-MATCH thing was TRUE and the relevance was RELEVANT.

(no subject)

Date: 2018-02-15 10:49 am (UTC)
aldabra: (Default)
From: [personal profile] aldabra
Actually, it's more complicated than that 8-) Looking further...

(no subject)

Date: 2018-02-15 11:00 am (UTC)
aldabra: (Default)
From: [personal profile] aldabra
Right, you've got three years; the quickest way to do this is going to be to do each year separately. I've added a column with

=IF(AND(A2="2015/16",C2="Relevant"),B2,"")

which gives me the serial number only if the year is 2015/16 and the flag is Relevant. (You can type that in the first row and then drag it down, or copy and paste it down, and it will do the right thing in all further rows.)

Now you can filter on just that column, so you only have one filter in play. Does that work for filtering on unique values only?

(no subject)

Date: 2018-02-15 02:00 pm (UTC)
ceb: (Default)
From: [personal profile] ceb
First, add a column on the far left, call it Incoming, and number it in the order your lines are in now. That means you can sort them and if it breaks you can put them back to how they were when you started. You're probably not going to need to sort them, but always better to be safe.

Just repeating this, as it is the single most useful piece of Excel advice ever and I've had to dig someone else out of not doing it today...

(no subject)

Date: 2018-02-15 10:40 am (UTC)
jack: (Default)
From: [personal profile] jack
I do not know this off the top of my head so hopefully someone with the specific experience will come along and just know the easiest way, but off the top of my head, the two immediate things I'd consider are:

Do you have Access or another database program you're familiar with? If so it can probably import the .xls, and databases are made for this sort of query. IIRC you want an aggregate query with three columns, 'group by' on the year and on the serial number, and 'count' on another column (maybe one that's unique for each row, like a primary id?)

Alternatively, looking at the knowledge base article, I don't quite know how the filter works, but can you specify both conditions (unique and year) in one filter, instead of two successive filters? I don't know if you can, but that seems to be what's needed.

(no subject)

Date: 2018-02-15 11:03 am (UTC)
jack: (Default)
From: [personal profile] jack
Another couple of thoughts, now my brain has warmed up a bit, a hybrid of "use formulas" and "use filters".

Can you create a new row which is the concatenation of the serial number and year? And then count all the unique entries in that? So you get "AAA111 2015/16" "30" "AAA112 2015/16" "25" "AAA111 2016/2017" "19" etc

Or, can you filter by year, then copy all the cells to separate sheet, and then do the unique counting thing? I'm not sure that's how copying when you filter works, but I think it might be (and of course, it might be too slow, I'm not sure).

(no subject)

Date: 2018-02-15 11:18 am (UTC)
aldabra: (Default)
From: [personal profile] aldabra
NB: The way you copy the cells to another sheet with the filter on is to select the sheet, and then go Alt-E Alt-G Alt-S Y, which is Edit, Go To, Special, Visible cells only (in some backwards-compatible previous invisible version of the menus; note that "Visible" is Y not V), and then copy and paste that. Otherwise it selects all the cells you haven't filtered as well as the ones you have.

(no subject)

Date: 2018-02-15 11:20 am (UTC)
jack: (Default)
From: [personal profile] jack
Ah, thank you! I could tell it must be something from the way people talked about filter and copy, but I couldn't tell which. But it sounds like your approach probably solves the problem now anyway.

(no subject)

Date: 2018-02-15 11:28 am (UTC)
jjhunter: a watercolor 'teal deer' (tl;dr)
From: [personal profile] jjhunter
Why not custom sort 1st by the Year column and 2nd by the Serial Number column, and then just run a simple test for uniqueness on the serial number column in a helper column? E.g. =if( [Flag cell current row]='Relevant', if ( [serial# cell current row] = [serial# cell row above ], 0, 1), 0).

Then you can have a little summary set of your choice of sumifs or countifs for each year: =countifs( [Year column: Year column], "[Year]", [helper column: helper column ], 1)

If you need the spreadsheet to be resortable after, just copy the helper column and save as values first.

(no subject)

Date: 2018-02-15 11:51 am (UTC)
jjhunter: Drawing of human J.J. in red and brown inks with steampunk goggle glasses (red J.J. inked)
From: [personal profile] jjhunter
If your 'Year' column is populated with inconsistently formatted dates, you may need to first start with creating a helper column specifically for the Year to convert them all into a standard format - the date() formula is fantastic for this (it's how I routinely get from, say, yyyy.mm.dd to mm/dd/yyyy).

You can see how many variations you have to convert by selecting the entire Year column, copying it, opening a new sheet, pasting as values there, and while you have that copied column still selected using the extremely handy option under the Data tab to 'Remove Duplicates'.

(no subject)

Date: 2018-02-15 11:40 am (UTC)
wychwood: chess queen against a runestone (Default)
From: [personal profile] wychwood
Have you tried pivot tables? They are excellent for this sort of thing. Using your example data there, I would go to Insert -> PivotTable, and then put "Serial Number" in the "rows" box; if you put "Flag" into the filters box, you can specify that you only want to see "relevant" results. That would give you a list of all the serial numbers marked as relevant; add "Year" to the "rows" box and that will also break down by year. If you want a list for each year (rather than the years for each code) drag "year" in the rows box above serial number, and that will then give you:

- 2015/16
AAA111
- 2016/17
AAA111
AAB139

Which you can easily then count from to get your answers (if necessary, you can select all the rows in a given category, and the "count" will show in the toolbar on the bottom right of the Excel window).

If you want any more help with pivot tables, feel free to give me a shout! I love them a lot.

(no subject)

Date: 2018-02-15 07:17 pm (UTC)
syderia: cyber wolf (geek)
From: [personal profile] syderia
That was going to be my suggestion also.

(no subject)

Date: 2018-02-15 07:58 pm (UTC)
mathcathy: number ball (Default)
From: [personal profile] mathcathy
And mine

(no subject)

Date: 2018-02-15 08:44 pm (UTC)
siderea: (Default)
From: [personal profile] siderea
Likewise.

Who are you? :)

(no subject)

Date: 2018-02-15 08:50 pm (UTC)
wolby: Medieval illustration of a canine holding a duck by the neck; the duck says "queck." (Default)
From: [personal profile] wolby
Fourthing the suggestion of PivotTables! It's a lot like the grouping you can do in Access or other database programs, but it lives entirely in Excel.

(no subject)

Date: 2018-02-15 08:51 pm (UTC)
siderea: (Default)
From: [personal profile] siderea
I'd do "Flag" into the filter box, "Serial number" in the left column, "Year" in the top column and "Count of Year" in the body.

If I did that right, it should return:

Filter: Relevant
2015/162016/17
AAA11111
AAB13901

(no subject)

Date: 2018-02-17 09:10 am (UTC)
damerell: (computers)
From: [personal profile] damerell
I would export it as something-separated data then use Perl. (This is my hammer such that every problem looks like a nail.) This would be a really, really easy job in Perl.

Soundbite

Miscellaneous. Eclectic. Random. Perhaps markedly literate, or at least suffering from the compulsion to read any text that presents itself, including cereal boxes.

Page Summary

Top topics

December 2025

S M T W T F S
 123456
78910111213
14151617181920
21222324252627
282930 31   

Expand Cut Tags

No cut tags

Subscription Filters