Metrics "Number of Datasets per org" not updating #5027
SELECT COUNT(*)
FROM package
JOIN harvest_object
ON package.id = harvest_object.package_id
WHERE harvest_object.harvest_source_id in (select id from harvest_source where url like '%noaa%')
AND harvest_object.current = TRUE
AND package.state = 'active'
AND package.private = FALSE;
count
-------
92284
(1 row)
My hunch is that the extra records come from the join between packages and harvest objects (i.e. harvest objects are the culprit). Comparing the above result to:
select count(*)
from package
where owner_org = '5f4f1195-e770-4a2a-8f75-195cd98860ce' -- noaa
and private = FALSE
and state = 'active';
count
-------
76688
(1 row)
This still isn't the same as what's on staging metrics (off by 71), but it's closer. Interestingly, when I count the number of unique package IDs using the following query, I get 76614 packages, which is 3 fewer than what's "expected":
SELECT COUNT( DISTINCT package.id )
FROM package
JOIN harvest_object
ON package.id = harvest_object.package_id
WHERE harvest_object.harvest_source_id in (select id from harvest_source where url like '%noaa%')
AND harvest_object.current = TRUE
AND package.state = 'active'
AND package.private = FALSE;
count
-------
76614
(1 row)
running ^ this same command for |
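Since `COUNT(*)` (92284) exceeds `COUNT(DISTINCT package.id)` (76614) over the same join, some packages must match more than one harvest object with `current = TRUE`. The sketch below reproduces that fan-out on toy data and lists the packages responsible. It uses an in-memory SQLite database so it can run anywhere; the table contents and IDs (`pkg-a`, `ho-1`, ...) are made up, the schema is reduced to the columns used in the queries above, and SQLite stores booleans as 0/1 rather than `TRUE`/`FALSE`.

```python
import sqlite3

# Toy schema: only the columns referenced by the queries in this thread.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE package (id TEXT PRIMARY KEY, state TEXT, private INTEGER);
CREATE TABLE harvest_object (id TEXT PRIMARY KEY, package_id TEXT, current INTEGER);
INSERT INTO package VALUES ('pkg-a', 'active', 0), ('pkg-b', 'active', 0);
-- pkg-a has two current harvest objects, pkg-b has one current and one stale.
INSERT INTO harvest_object VALUES
  ('ho-1', 'pkg-a', 1),
  ('ho-2', 'pkg-a', 1),
  ('ho-3', 'pkg-b', 1),
  ('ho-4', 'pkg-b', 0);
""")

# Same join and filters as above, minus the harvest_source subquery.
JOIN = """
FROM package
JOIN harvest_object ON package.id = harvest_object.package_id
WHERE harvest_object.current = 1
  AND package.state = 'active'
  AND package.private = 0
"""

joined_rows = conn.execute("SELECT COUNT(*) " + JOIN).fetchone()[0]
distinct_packages = conn.execute(
    "SELECT COUNT(DISTINCT package.id) " + JOIN
).fetchone()[0]

# Packages that contribute more than one row to the join.
duplicates = conn.execute(
    "SELECT package.id, COUNT(*) " + JOIN
    + " GROUP BY package.id HAVING COUNT(*) > 1"
).fetchall()

print(joined_rows, distinct_packages, duplicates)
```

Running the `GROUP BY ... HAVING COUNT(*) > 1` query against the real database (with the `harvest_source` filter added back) would show which NOAA packages carry multiple current harvest objects.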
Ah super fascinating, thanks @rshewitt! Ok, I am comparing the November & December downloads with some I have from July and October:
December: global__datasets_per_org.2024-12-31.csv
November: global__datasets_per_org.2024-11-30.csv
October: Oct_2024_global__datasets_per_org.csv
July: global__datasets_per_org.csv
It is WILD to me that they all have 113 orgs with datasets. The numbers do vary, but in that time we've had movement on active orgs from our org audit in Oct/Nov. Can it be coincidence that there are 113 rows of data for every month?
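One way to check whether the identical row count is a coincidence is to compare which orgs appear in each export, not just how many. This is a rough sketch only: the column names `organization` and `count` are assumptions (the real CSV headers are not shown in this thread), and the sample rows are invented to show that two files can have the same number of rows while listing different orgs.

```python
import csv
import io


def load_counts(csv_text, org_col="organization", count_col="count"):
    """Parse one export into {org: dataset_count}. Column names are assumed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[org_col]: int(row[count_col]) for row in reader}


def compare_months(older, newer):
    """Report orgs added, removed, or with a changed count between two months."""
    shared = set(older) & set(newer)
    return {
        "added": sorted(set(newer) - set(older)),
        "removed": sorted(set(older) - set(newer)),
        "changed": sorted(org for org in shared if older[org] != newer[org]),
    }


# Invented sample data: both months have three rows, but not the same orgs.
november = "organization,count\norg-a,10\norg-b,5\norg-c,1\n"
december = "organization,count\norg-a,12\norg-b,5\norg-d,3\n"

nov_counts = load_counts(november)
dec_counts = load_counts(december)
diff = compare_months(nov_counts, dec_counts)
print(len(nov_counts), len(dec_counts), diff)
```

If the real files give empty `added` and `removed` lists for every pair of months, the org list itself is not changing, which would point at the report's input rather than coincidence.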
One big factor in the count difference is collection data. Take GSA for example: we have 338 datasets in total, and 167 when collection datasets are excluded. 338 is what you see on the Harvest report on the catalog.data.gov metrics dashboard; 167 is what you see on data.gov metrics. For NOAA data, however, there is more to it. We can open another ticket for it based on the query result, but the count difference between the catalog.data.gov metrics dashboard and data.gov metrics is expected. We can modify one end to make the numbers consistent if that is important.
@FuhuXia thanks for the info!
The "Number of Datasets per org" table on /metrics is not up to date with the actual CKAN records.
Example, on 1/6 on /metrics:
But on Harvest info section in CKAN:
How to reproduce
Expected behavior
These numbers should match and the /metrics table should update monthly with the rest of the reports
Actual behavior
The table is stale
Sketch