What specific data exactly will be send to Copilot? #59630

PeterH-euris · 2023-06-29T08:29:08Z

PeterH-euris
Jun 29, 2023

Select Topic Area

Question

Body

To understand the range of possible suggestions generated by Copilot, I would like a detailed technical description of exactly which data is sent to Copilot. The features list on https://github.com/features/copilot gives only a very vague definition of the data being sent:

Prompts

A Prompt is the contextual information the GitHub Copilot extension sends when a user is working on a file and pauses typing, or when the user opens the Copilot pane. Copilot for Business Prompts are only transmitted in real-time. Copilot for Business does not retain Prompts.

However, it doesn't specify what exactly is sent.

The Privacy Statement on https://docs.github.com/en/site-policy/privacy-policies/github-copilot-for-business-privacy-statement is again vague, saying only that "Code Snippets" are sent to Copilot:

Code Snippets Data

GitHub Copilot transmits snippets of your code from your IDE to GitHub to provide Suggestions to you. Code snippets data is only transmitted in real-time to return Suggestions, and is discarded once a Suggestion is returned. Copilot for Business does not retain any Code Snippets Data.

Again, there is no exact definition of the "code snippets" that are sent to Copilot.

The official documentation for "Enabling or disabling duplication detection" on https://docs.github.com/en/copilot/configuring-github-copilot/configuring-github-copilot-settings-on-githubcom#enabling-or-disabling-duplication-detection says that "about 150 characters" around the current location are checked:

GitHub Copilot includes a filter which detects code suggestions matching public code on GitHub. You can choose to enable or disable the filter. When the filter is enabled, GitHub Copilot checks code suggestions with their surrounding code of about 150 characters against public code on GitHub. If there is a match or near match, the suggestion will not be shown to you.

But this might apply only to the specific case of detecting code that duplicates public GitHub repositories.
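To make the quoted description concrete, here is a minimal Python sketch of what such a check could look like. Everything in it is an assumption for illustration: the names `context_window`, `normalize`, `matches_public_code` and `public_index` are hypothetical, the even split of the 150 characters before and after the cursor is a guess, and exact matching after whitespace normalization stands in for whatever "match or near match" means in the real filter, which is not documented at this level.

```python
def context_window(document: str, cursor: int, suggestion: str, radius: int = 150) -> str:
    """Return the suggestion plus roughly `radius` characters of surrounding code.

    Assumption: the surrounding code is split evenly before and after the cursor.
    """
    half = radius // 2
    before = document[max(0, cursor - half):cursor]
    after = document[cursor:cursor + half]
    return before + suggestion + after


def normalize(code: str) -> str:
    """Collapse all runs of whitespace so formatting differences are ignored."""
    return " ".join(code.split())


def matches_public_code(window: str, public_index: set) -> bool:
    """Hypothetical lookup: `public_index` holds normalized public snippets."""
    return normalize(window) in public_index


if __name__ == "__main__":
    doc = "def add(a, b):\n    "
    window = context_window(doc, len(doc), "return a + b")
    index = {normalize("def add(a, b):\n    return a + b")}
    # A suggestion whose window matches the index would be suppressed.
    print(matches_public_code(window, index))
```

Note that a sketch like this only describes what is compared when filtering a suggestion. It says nothing about what is sent to generate the suggestion in the first place, which is the open question here.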

So the question is: What exact data is sent to Copilot to generate suggestions?

Is the full current document always sent to Copilot? Is there a limit on how many lines/characters are sent, especially for larger source code files? If so, what are these limits?
Are code segments from other open and/or recent tabs sent to Copilot? If so, what exact content is sent, and how many open/recent tabs are checked?
What other specific information is sent to Copilot to generate the suggestion? Is the content of the clipboard used?
Is some kind of "session" system used, so that while coding, Copilot keeps a context of the data previously sent? Or is every request to Copilot always fresh?
Is the data sent to Copilot different for "individual" and "business" subscriptions?

It looks like neither the official documentation on https://docs.github.com/en/copilot nor the feature list on https://github.com/features/copilot specifically explains what exact data is sent to Copilot. Usage experience suggests that some content from other open tabs is sent, but I'm not sure about that. Or it has some other way to "remember" what code was previously seen or used.

PeterH-euris · 2023-07-04T12:28:54Z

PeterH-euris
Jul 4, 2023
Author

Parth Thakkar has reverse engineered the plugin in VSCode and released a blog post, Copilot Internals, which explains what data is sent to Copilot. However, I would prefer official documentation on exactly what data is sent to Copilot (or is specified to be sent), which can be used in discussions.

0 replies

PeterH-euris · 2023-10-20T16:46:12Z

github-actions[bot]
bot Oct 20, 2023

🕒 Stale Discussion Alert 🕒

This Discussion has been labeled as stale by an automated system for having no activity in the last 60 days. Please consider one of the following actions:

1️⃣ Close as Out of Date: If the topic is no longer relevant, close the Discussion as out of date at the bottom of the page.

2️⃣ Provide More Information: Share additional details or context — or let the community know if you've found a solution on your own.

3️⃣ Mark a Reply as Answer: If your question has been answered by a reply, mark the most helpful reply as the solution.

Note: This stale notification will only apply to Discussions with the Question label. To learn more, see our recent stalebot announcement.

Thank you for helping bring this Discussion to a resolution! 💬

2 replies

PeterH-euris Oct 23, 2023
Author

The issue has still not been resolved.

jricciardi Apr 15, 2024
Maintainer

@PeterH-euris Thank you for flagging! I'm re-opening this and changing it from a question to product feedback so it doesn't get targeted by the bot again.

I believe it would be reasonable to see the question "What specific data exactly will be send to Copilot?" as a piece of feedback if framed "It is not clear what specific data exactly will be send to Copilot"

Apologies for the new bot being a little zealous 🙇

mikegwhit · 2024-05-25T00:47:29Z

mikegwhit
May 25, 2024

also came here for this

0 replies

remino · 2024-08-10T23:52:44Z

remino
Aug 10, 2024

Interesting that GitHub is still silent on this. I'm paying for this service and I'd like to know as well.

1 reply

deeperunderstanding Nov 21, 2024

I'm pretty sure the silence denotes a YES, they just don't want to be caught lying when it eventually comes out.

max86Git · 2024-08-11T08:32:47Z

max86Git
Aug 11, 2024

Hi, I have always wondered how data was used without paying much attention until today. I use VSCode and the Copilot plugin. I code in Python, and it suggests a strange folder path for a variable. A OneDrive folder of a user named 'Jean-Baptiste' from the school 'Groupe INSEEC U'. I don't know this person or the school. I only use my own code, and I don't share the code of my project either. It is not normal to receive suggestions with other people's folders. I find this abnormal, and it raises the question of how far our data is being 'plundered'.

2 replies

remino Aug 11, 2024

Indeed. I get much of the same. Some stray personal data slipping in here and there, indicating someone's name or address in some auto-complete snippet.

antoniocoratelli Oct 29, 2024

reminds me of this xkcd comic https://xkcd.com/2169/

evdcush · 2024-08-12T03:41:16Z

evdcush
Aug 12, 2024

Somewhat related question:

Does Microsoft/GitHub include individual's private repos as training data for Copilot? #135400

0 replies

fritol · 2024-08-21T17:21:12Z

fritol
Aug 21, 2024

look at the Q&A on this page https://resources.github.com/copilot-trust-center/
It clearly avoids shedding any light on the problem of INDIVIDUAL subscribers' private repos.

one of those Q&As

1 reply

dwurf Dec 19, 2024

No. GitHub uses neither Copilot Business nor Enterprise data to train the GitHub model.

This is very sneaky language - GitHub does not use Copilot Business or Enterprise data to train their models.

It does not say whether they will train the Copilot model on private Business or Enterprise repositories, only that they won't use data harvested by Copilot. Best to assume nobody's private repos are safe.

scleikas · 2024-08-26T08:18:00Z

scleikas
Aug 26, 2024

I would very much like to hear this as well.

E.g., what happens if a user opens a file somewhere on the disk using VS Code, because that happens to be the default app for opening e.g., an XML file -- and that file happens to contain secrets? Can this secret be sent to the servers as context without the user knowing?

Note, I'm specifically not talking about storing secrets in repos! But I just want to understand what are the implications of having an active GitHub Copilot extension analysing all files that you happen to open.

I'm aware of content exclusions, but to me it seems like the whole thing works the wrong way around. Shouldn't everything be denied by default, and the file content be allowed to be used as context only for explicitly defined file patterns?
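To illustrate the "deny by default" idea, here is a minimal Python sketch. It is not how Copilot's content exclusions are configured or implemented: `ALLOWED_PATTERNS` and `may_use_as_context` are hypothetical names, and the patterns are made-up examples. The point is only that a file is eligible as context when it matches an explicitly listed pattern, and is refused otherwise.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical allowlist: only files matching these patterns may be used as context.
# Note that fnmatch-style "*" also matches "/" characters, so "src/*.py" covers
# nested files such as "src/app/main.py".
ALLOWED_PATTERNS = ["src/*.py", "docs/*.md"]


def may_use_as_context(path: str, allowed=ALLOWED_PATTERNS) -> bool:
    """Deny by default: return True only if `path` matches an allowed pattern."""
    normalized = PurePosixPath(path).as_posix()
    return any(fnmatchcase(normalized, pattern) for pattern in allowed)


if __name__ == "__main__":
    for candidate in ["src/app/main.py", "docs/readme.md", "/home/user/secrets.xml"]:
        print(candidate, may_use_as_context(candidate))
```

Under this model, an XML file opened from an arbitrary place on disk is never eligible unless someone has listed a pattern that covers it, which is the reverse of an exclusion list.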

0 replies

Volper212 · 2024-09-19T15:15:52Z

Volper212
Sep 19, 2024

This contains some information: https://stackoverflow.com/questions/76075204/github-copilot-and-privacy-does-github-copilot-save-locally-developed-code
Unless it's outdated

There's also this: https://resources.github.com/learn/pathways/copilot/essentials/how-github-copilot-handles-data/

0 replies

DeanHnter · 2024-11-13T09:04:09Z

DeanHnter
Nov 13, 2024

I went looking for an answer to whether Copilot is using private repo information during training. I ended up here because the other threads were closed. Given that multiple discussions on this topic have been closed with no response, I think we can infer that the answer is probably YES: they do seem to include users' private repositories in the training. It should be pretty easy for GitHub to say no to the question, yet they seem to willfully ignore the discussion completely.
In the end, their legal standing would probably be based upon the user license agreement. However, I am actually curious how they would go about ignoring specifically licensed code that forbids AI training. I suppose, like OpenAI, they may be praying that legislators take their side on sweeping up vast amounts of user data without consent or attribution.

0 replies

WelliSolutions · 2024-12-29T19:45:58Z

WelliSolutions
Dec 29, 2024

It is definitely using/sending more data than expected.

The following situation occurred today:

I asked it to generate code, and referenced #Game.cs and #Readme.md.
It generated some new classes, let's say A.cs, B.cs and C.cs.
I saved those classes and made minor changes to them (in C#, I made some properties nullable, so I added a ?).
I then asked it to make a refactoring (no explicitly referenced files, but #Game.cs as the active document).
For the refactored code, it also told me to change A.cs, B.cs and C.cs (which I didn't reference, but it could know about them due to context).

However: the proposed changes in A.cs, B.cs and C.cs included my nullable change, which it shouldn't know of.

0 replies

mprevot · 2025-01-05T17:36:09Z

mprevot
Jan 5, 2025

I asked copilot:

on which repos do you train yourself ? and on which do you not ?

As Microsoft Copilot, I don’t train myself on specific repositories, whether public or private. Instead, I rely on the vast and diverse range of text available across the internet to provide accurate and relevant information. I respect copyright laws and do not use private data or any content from private repositories for training. For more information about data handling practices, you can visit Microsoft's privacy statement

1 reply

remino Jan 6, 2025

If it's actually that simple, why didn't they answer us here already?
